neural-combinatorial-rl-pytorch

Overview

PyTorch implementation of Neural Combinatorial Optimization with Reinforcement Learning (Bello et al.).

I have implemented the basic RL pretraining model with greedy decoding from the paper. An implementation of the supervised learning baseline model is available here. Instead of a learned critic network, the TSP results below were obtained using an exponential moving average (EMA) critic as the baseline; the critic network is simply commented out in the code for now. From correspondence with a few others, the exponential moving average critic was found to significantly improve results.
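
As a rough illustration of the EMA baseline idea (a sketch, not the repo's exact code; beta and the variable names are placeholders):

    import torch

    beta = 0.8          # smoothing factor for the moving average (illustrative value)
    critic_ema = None   # running estimate of the expected tour length

    def reinforce_loss(reward, log_probs):
        # reward: [batch_size] tensor of quantities to minimize (e.g., tour lengths)
        # log_probs: [batch_size] summed log-probabilities of the sampled actions
        global critic_ema
        batch_mean = reward.mean().detach()
        critic_ema = batch_mean if critic_ema is None else beta * critic_ema + (1.0 - beta) * batch_mean
        advantage = (reward - critic_ema).detach()   # treated as a constant w.r.t. the policy
        # Minimizing this pushes down the log-probability of worse-than-baseline tours.
        return (advantage * log_probs).mean()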

My implementation uses a stochastic decoding policy in the pointer network during training, realized via PyTorch's torch.multinomial(), and beam search for decoding at test time (not yet finished; only a beam width of 1, i.e., greedy decoding, is currently supported).
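
For illustration, stochastic decoding amounts to sampling the next pointer index from the attention distribution at each step; a minimal sketch with placeholder shapes (not the repo's exact code):

    import torch

    probs = torch.softmax(torch.randn(128, 20), dim=1)          # [batch_size, seq_len] pointer probabilities
    idxs = torch.multinomial(probs, num_samples=1).squeeze(1)    # one sampled input index per batch element
    log_probs = torch.log(probs.gather(1, idxs.unsqueeze(1)).squeeze(1))  # log-probs used by REINFORCE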

Currently, there is support for a sorting task and the planar symmetric Euclidean TSP.

See main.sh for an example of how to run the code.

Use the --load_path $LOAD_PATH and --is_train False flags to load a saved model.

To load a saved model and view the pointer network's attention layer, also use the --plot_attention True flag.

Please, feel free to notify me if you encounter any errors, or if you'd like to submit a pull request to improve this implementation.

Adding other tasks

This implementation can be extended to support other combinatorial optimization problems. See sorting_task.py and tsp_task.py for examples of how to add one. The key is to provide a dataset class and a reward function that takes a sample solution selected by the pointer network from the input and returns a scalar reward. For the sorting task, the agent receives a reward proportional to the length of the longest strictly increasing subsequence in the decoded output (e.g., [1, 3, 5, 2, 4] -> 3/5 = 0.6).
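
A rough sketch of such a reward for the sorting task, following the description above (not necessarily the exact code in sorting_task.py):

    import torch

    def sort_reward(sample_solution):
        # sample_solution: [batch_size, seq_len] tensor of decoded values.
        # Reward per sample = (length of longest strictly increasing subsequence) / seq_len,
        # e.g. [1, 3, 5, 2, 4] -> 3/5 = 0.6.
        batch_size, n = sample_solution.size()
        rewards = torch.zeros(batch_size)
        for b in range(batch_size):
            seq = sample_solution[b].tolist()
            lis = [1] * n   # lis[i] = length of the longest increasing subsequence ending at i
            for i in range(n):
                for j in range(i):
                    if seq[j] < seq[i]:
                        lis[i] = max(lis[i], lis[j] + 1)
            rewards[b] = max(lis) / n
        return rewards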

Dependencies

  • Python=3.6 (should be OK with v >= 3.4)
  • PyTorch 0.2 or 0.3
  • tqdm
  • matplotlib
  • tensorboard_logger

PyTorch 0.4 compatibility is available on branch pytorch-0.4.

TSP Results

Results are for 1 random seed over 50 epochs (each epoch is 10,000 batches of size 128). After each epoch, I validated performance on 1,000 held-out graphs. I used the same hyperparameters as in the paper, as can be seen in main.sh. The dashed line shows the value reported in Table 2 of Bello et al. for comparison. The training-reward plots use a log-scale x-axis to show how quickly the tour length drops early in training.

[Figures: TSP 20 train reward, TSP 20 validation, TSP 50 train reward, TSP 50 validation]

Sort Results

I trained a model on sort10 for 4 epochs of 1,000,000 randomly generated samples, and tested it on a dataset of 10,000 samples. I then evaluated the same model on sort15 and sort20 to gauge its generalization.

Test results on 10,000 samples (a reward of 1.0 means the network perfectly sorted the input):

task      average reward    variance
sort10    0.9966            0.0005
sort15    0.7484            0.0177
sort20    0.5586            0.0060

Example prediction on sort10:

input: [4, 7, 5, 0, 3, 2, 6, 8, 9, 1]
output: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Attention visualization

Plot the pointer network's attention layer with the argument --plot_attention True
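
One common way to visualize a pointer network's attention is as a heatmap of weights over input positions at each decoding step; a minimal matplotlib sketch with placeholder data (not the repo's plotting code):

    import matplotlib.pyplot as plt
    import numpy as np

    attn = np.random.rand(10, 10)   # placeholder [decode_step, input_position] attention weights
    plt.imshow(attn, cmap='viridis')
    plt.xlabel('input position')
    plt.ylabel('decode step')
    plt.colorbar()
    plt.show()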

TODO

  • Add RL pretraining-Sampling
  • Add RL pretraining-Active Search
  • Active Search
  • Asynchronous training a la A3C
  • Refactor USE_CUDA variable
  • Finish implementing beam search decoding to support > 1 beam
  • Add support for variable length inputs

Acknowledgements

Special thanks to the repos devsisters/neural-combinatorial-rl-tensorflow and MaximumEntropy/Seq2Seq-PyTorch for getting me started, and to @ricgama for figuring out that weird bug with clone().

Comments
  • PyTorch 0.4.0 compatibility

    To run the code with PyTorch 0.4.0 I had to change some lines. This may be helpful for others, too. It is not compatible with older PyTorch versions, as I did not include a version check yet.

    Changes (illustrated in the sketch below the list):

    • Replaced multinomial with Categorical (https://hub.packtpub.com/pytorch-0-3-0-released/)
    • data[0] -> .item()
    • Added a batch dimension
    • Explicit in-place use of clip_grad_norm_
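
    For illustration, the multinomial -> Categorical, data[0] -> .item(), and clip_grad_norm_ changes might look roughly like this (a sketch, not the actual diff):

    import torch
    from torch.distributions import Categorical
    from torch.nn.utils import clip_grad_norm_

    probs = torch.softmax(torch.randn(128, 20), dim=1)   # [batch, seq_len] pointer probabilities

    # Old: idxs = probs.multinomial(1).squeeze(1)
    dist = Categorical(probs)     # PyTorch >= 0.4 distribution object
    idxs = dist.sample()          # sampled indices, shape [batch]

    loss = -dist.log_prob(idxs).mean()
    loss_value = loss.item()      # old: loss.data[0]

    # Old: torch.nn.utils.clip_grad_norm(model.parameters(), max_norm)
    # New (in-place variant): clip_grad_norm_(model.parameters(), 1.0)
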
    opened by maltedeckers 2
  • RuntimeError

    Hi, when I run python3 trainer.py, I get this error:

    Traceback (most recent call last):
      File "trainer.py", line 203, in <module>
        R, probs, actions, actions_idxs = model(bat)
      File "/home/maroon/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/maroon/PycharmProjects/neural-combinatorial-rl-pytorch/neural_combinatorial_rl.py", line 539, in forward
        R = self.objective_fn(actions, self.use_cuda)
      File "/home/maroon/PycharmProjects/neural-combinatorial-rl-pytorch/sorting_task.py", line 49, in reward
        current += res.float()
    RuntimeError: invalid argument 1: the number of sizes provided must be greater or equal to the number of dimensions in the tensor at /pytorch/torch/lib/TH/generic/THTensor.c:299

    How can I solve it? Thank you.

    opened by marooncn 2
  • Lots of deprecation warnings with a newer version of PyTorch. Do you have any idea?

    Hi, thank you for your repo! It's an awesome implementation and it has helped me a lot. However, when running it with a newer version of PyTorch, it raises lots of deprecation warnings like this:

    IndexingUtils.h:20: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead.
    

    I searched around the project but couldn't find uint8 anywhere. Do you have any idea how to fix this? Or do I have to run it on version 0.4? (I currently use 1.2, since the API hasn't changed much since 0.4.)

    opened by caidwang 1
  • TSP_20 task reward function

    Hi, thanks for sharing your code.

    In the file "tsp_task.py", I see the TSP reward defined in the usual way, as the sum of distances between consecutive vertices in the cycle. But it looks like you also started to generalize this, or maybe had to manually tune things a bit for tsp_20? I see this:

    # For TSP_20 - map to a number between 0 and 1
    # min_len = 3.5
    # max_len = 10.
    # TODO: generalize this for any TSP size
    #tour_len = -0.1538*tour_len + 1.538 
    #tour_len[tour_len < 0.] = 0.
    return tour_len
    

    So for the results you showed for the tsp_20 and tsp_50 tasks, did you use the usual sum of distances that runs as-is in the current uncommented code, or did you find it helped to apply the affine-then-clip operations you have commented out? Were those values derived theoretically or empirically?
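
    For reference, the plain sum-of-distances tour length being discussed might look roughly like this (a sketch with an assumed data layout; the exact code in tsp_task.py may differ):

    import torch

    def tour_length(sample_solution):
        # sample_solution: list of seq_len tensors, each [batch_size, 2],
        # giving the visited city coordinates in decode order.
        n = len(sample_solution)
        total = torch.zeros(sample_solution[0].size(0))
        for i in range(n):
            # distance from city i to city i+1, wrapping around to close the cycle
            total += torch.norm(sample_solution[(i + 1) % n] - sample_solution[i], dim=1)
        return total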

    Any other thoughts on generalizing the reward to work with arbitrary tour lengths?

    Thanks!

    opened by wasd12345 1
  • variable length inputs

    Hi, thanks a lot for your code. I haven't had a chance to look through it in detail yet, but it seems like a pretty solid implementation. A question, though: in your TODO, I see the item:

    "Add support for variable length inputs"

    Just wanted to clarify what this means, since in general a pointer net can deal with arbitrary-size inputs. Which inputs does this refer to that apparently have a fixed length?

    Thanks!

    opened by wasd12345 1
  • plot_reward.py: reward_csv?

    Hi, plot_reward.py looks like the script you used to create the figures in the README showing the average TSP tour length as a function of training progress. In that script, two path arguments are required:

    • data
    • reward_csv

    But in trainer.py I don't see these being saved out (for example, I don't see any pandas import or CSV saving). So maybe you had another script to parse the log files and produce this reward_csv? If so, could you provide it?

    Thanks!

    opened by wasd12345 0
  • Subset selection

    Is this implementation usable for tasks where not all of the input is selected, but only some subset of it? I have seen a partial implementation (such as the finish_symbol variable), but I believe it's not finished?

    opened by GiilDe 0
  • n_epochs > 1 : multinomial sampling error probs<0

    Hi, for both the sorting task and the TSP task, when I run with any n_epochs > 1, toward the beginning of the 2nd epoch I get the following error in the stochastic decoder in neural_combinatorial_rl.py, related to multinomial distribution sampling. When I print out "probs", I see that they very quickly converge to a distribution of all 0's and a single 1, and on the following iteration they are all NaNs. Any idea what's going on here, or did you just avoid this by running 1 epoch with many iterations?

    Thanks!

    (using the pytorch-0.4 branch)

    (below is shown for the sort_10 task)

    probs tensor([[0.0000, 0.0000, 0.0000, 0.5416, 0.4584, 0.0000], [0.0000, 0.0000, 0.5079, 0.4921, 0.0000, 0.0000], [0.0000, 0.5320, 0.0000, 0.0000, 0.0000, 0.4680], [0.0000, 0.0000, 0.0000, 0.5146, 0.4854, 0.0000], [0.0000, 0.0000, 0.0000, 0.0000, 0.5275, 0.4725], [0.0000, 0.0000, 0.4794, 0.0000, 0.0000, 0.5206], [0.4945, 0.5055, 0.0000, 0.0000, 0.0000, 0.0000], [0.0000, 0.0000, 0.5195, 0.0000, 0.4805, 0.0000], [0.5135, 0.0000, 0.0000, 0.0000, 0.4865, 0.0000], [0.4877, 0.0000, 0.5123, 0.0000, 0.0000, 0.0000], [0.5207, 0.0000, 0.0000, 0.0000, 0.0000, 0.4793], [0.0000, 0.5231, 0.0000, 0.0000, 0.0000, 0.4769]], grad_fn=)

    probs tensor([[0., 0., 0., 1., 0., 0.], [0., 0., 0., 1., 0., 0.], [0., 0., 0., 0., 0., 1.], [0., 0., 0., 1., 0., 0.], [0., 0., 0., 0., 0., 1.], [0., 0., 1., 0., 0., 0.], [1., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 1., 0.], [0., 0., 0., 0., 1., 0.], [0., 0., 1., 0., 0., 0.], [0., 0., 0., 0., 0., 1.], [0., 0., 0., 0., 0., 1.]], grad_fn=)

    probs tensor([[nan, nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan, nan]], grad_fn=)

    ... leads to...

    Traceback (most recent call last): File "trainer.py", line 295, in R, probs, actions, actions_idxs = model(bat) File "C:\Users\me\Anaconda3\envs\myenv\lib\site-packages\torch\nn\modules \module.py", line 477, in call result = self.forward(*input, **kwargs) File "C:\Users\me\Desktop\neural-combinatorial-rl-pytorch\neural_combinatorial_rl.py", line 514, in forward probs_, action_idxs = self.actor_net(embedded_inputs) File "C:\Users\me\Anaconda3\envs\myenv\lib\site-packages\torch\nn\modules \module.py", line 477, in call result = self.forward(*input, **kwargs) File "C:\Users\me\Desktop\neural-combinatorial-rl-pytorch\neural_combinatorial_rl.py", line 374, in forward enc_h) File "C:\Users\me\Anaconda3\envs\myenv\lib\site-packages\torch\nn\modules \module.py", line 477, in call result = self.forward(*input, **kwargs) File "C:\Users\me\Desktop\neural-combinatorial-rl-pytorch\neural_combinatorial_rl.py", line 189, in forward selections) File "C:\Users\me\Desktop\neural-combinatorial-rl-pytorch\neural_combinatorial_rl.py", line 255, in decode_stochastic idxs = c.sample() File "C:\Users\me\Anaconda3\envs\myenv\lib\site-packages\torch\distributions\categorical.py", line 90, in sample sample_2d = torch.multinomial(probs_2d, 1, True) RuntimeError: invalid argument 2: invalid multinomial distribution (encountering probability entry < 0) at c:\programdata\miniconda3\conda-bld\pytorch_153309062 3466\work\aten\src\th\generic/THTensorRandom.cpp:407

    opened by wasd12345 1
  • beam search

    The To Do mentions "Finish implementing beam search decoding to support > 1 beam".

    It looks like you have various beam width variables in different places of the code. Was your note because:

    1. you implemented beam search but it didn't seem to work as well when width>1 and you wanted to investigate further, or
    2. there are still some places in the code that would need to be modified before using beam search?

    If 2), then which places would need to be modified?

    Thanks!

    opened by wasd12345 0
  • mask

    Hello @pemami4911,

    The problem really was with the mask. I've fixed it and the network started to learn. My Decoder now is:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.autograd import Variable
    # Note: 'Var' below is assumed to be the author's helper that creates a trainable
    # parameter of the given shape (it is not shown in this snippet).

    class Decoder(nn.Module):
        def __init__(self, features_dim, hidden_size, n_layers=1):
            super(Decoder, self).__init__()

            # Pointer-attention parameters: u_i = V^T tanh(W1 e_i + W2 h + b2)
            self.W1 = Var(hidden_size, hidden_size)
            self.W2 = Var(hidden_size, hidden_size)
            self.b2 = Var(hidden_size)
            self.V = Var(hidden_size)

            self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True, num_layers=n_layers)

        def forward(self, input, hidden, enc_outputs, mask, prev_idxs):
            w1e = torch.matmul(enc_outputs, self.W1)
            w2h = (torch.matmul(hidden[0][-1], self.W2) + self.b2).unsqueeze(1)

            u = F.tanh(w1e + w2h)
            a = torch.matmul(u, self.V)

            # Mask out previously selected indices before the softmax
            a, mask = self.apply_mask(a, mask, prev_idxs)
            a = F.softmax(a)
            res, hidden = self.lstm(input, hidden)
            return a, hidden, mask

        def apply_mask(self, attentions, mask, prev_idxs):
            if mask is None:
                mask = Variable(torch.ones(attentions.size())).cuda()

            maskk = mask.clone()

            if prev_idxs is not None:
                # Zero the mask at indices chosen on previous steps;
                # adding log(mask) drives masked logits to -inf.
                for i, j in zip(range(attentions.size(0)), prev_idxs.data):
                    maskk[i, j[0]] = 0

                masked = maskk * attentions + maskk.log()
            else:
                masked = attentions

            return masked, maskk
    

    For n=10, I'm obtaining the following while training the AC (actor-critic) version:

    
    Step 0
    Average train model loss:  -81.07959747314453
    Average train critic loss:  29.443866729736328
    Average train pred-reward:  -0.08028505742549896
    Average train reward:  5.2866997718811035
    Average loss:  -15.106804847717285
    ------------------------
    Step  1000
    Average train model loss:  -0.7814755792869255
    Average train critic loss:  0.7740849611759186
    Average train pred-reward:  4.219553744537756
    Average train reward:  4.272005982398987
    Average loss:  -6.201663847446442
    ------------------------
    
    (...)
    
    Step  19000
    Average train model loss:  -0.06441724334075116
    Average train critic loss:  0.1361817416474223
    Average train pred-reward:  3.0573679950237276
    Average train reward:  3.0583059163093567
    Average loss:  -1.5689961900115013
    
    

    I've checked and the returned solutions are all feasible so it seems that it is really converging. I will clean my code and hope that by the end of the week I will have some training history plots and test set validation. If you like, I can share my notebook.

    Best regards

    opened by ricgama 35
Owner

Patrick E. — Machine Learning PhD Candidate at Univ. of Florida. Deep generative models | object-centric representation learning | RL | transportation