Evolution Strategies in PyTorch

Overview

This is a PyTorch implementation of Evolution Strategies.

Requirements

Python 3.5, PyTorch >= 0.2.0, numpy, gym, universe, cv2

What is this? (For non-ML people)

A large class of problems in AI can be described as "Markov Decision Processes," in which an agent takes actions in an environment and receives reward, with the goal of maximizing reward. This is a very general framework that can be applied to many tasks, from learning how to play video games to robotic control. For the past few decades, most people have used Reinforcement Learning -- that is, learning from trial and error -- to solve these problems. In particular, an extension of the backpropagation algorithm from Supervised Learning, called the Policy Gradient, can train neural networks to solve these problems. Recently, OpenAI showed that black-box optimization of neural network parameters (that is, not using the Policy Gradient or even Reinforcement Learning) can achieve results similar to state-of-the-art Reinforcement Learning algorithms, and can be parallelized much more efficiently. This repo is an implementation of that black-box optimization algorithm.

Usage

There are two neural networks provided in model.py: a small neural network meant for simple tasks with discrete observations and actions, and a larger Convnet-LSTM meant for Atari games.
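
For orientation, here is a minimal sketch of roughly what the small network's forward pass looks like (its structure matches the forward-pass snippet quoted in the "it stuck in selu?" comment below); the hidden layer size is an illustrative assumption, not the repository's actual default:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def selu(x):
        # self-normalizing nonlinearity used in place of virtual batch norm
        # (see "Deviations from the paper" below)
        alpha = 1.6732632423543772848170429916717
        scale = 1.0507009873554804934193349852946
        return scale * F.elu(x, alpha)

    class SmallNet(nn.Module):
        # two fully connected layers followed by a linear head over discrete actions
        def __init__(self, num_inputs, num_actions, hidden=64):  # hidden size is an assumption
            super().__init__()
            self.linear1 = nn.Linear(num_inputs, hidden)
            self.linear2 = nn.Linear(hidden, hidden)
            self.actor_linear = nn.Linear(hidden, num_actions)

        def forward(self, inputs):
            x = selu(self.linear1(inputs))
            x = selu(self.linear2(x))
            return self.actor_linear(x)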

Run python3 main.py --help to see all of the options and hyperparameters available to you.

Typical usage would be:

python3 main.py --small-net --env-name CartPole-v1

which will run the small network on CartPole, printing performance on every training batch. Default hyperparameters should be able to solve CartPole fairly quickly.

python3 main.py --small-net --env-name CartPole-v1 --test --restore path_to_checkpoint

which will render the environment and show the performance of the agent saved in the checkpoint. Checkpoints are saved once per gradient update during training, always overwriting the old file.

python3 main.py --env-name PongDeterministic-v4 --n 10 --lr 0.01 --useAdam

which will train on Pong and produce a learning curve similar to this one:

Learning curve

This graph was produced after approximately 24 hours of training on a 12-core computer. I would expect that a more thorough hyperparameter search, and more importantly a larger batch size, would allow the network to solve the environment.

Deviations from the paper

  • I have not yet tried virtual batch normalization; instead I use the selu nonlinearity, which serves the same purpose at significantly reduced computational overhead. ES appears to train on Pong quite well even with relatively small batch sizes and selu.

  • I did not pass rewards between workers, but rather sent them all to one master worker, which took a gradient step and sent the new models back to the workers (a sketch of this centralized update follows this list). If you have more cores than your batch size, OpenAI's method is probably more efficient, but if your batch size is larger than the number of cores, I think my method is better.

  • I do not adaptively change the max episode length as recommended in the paper, although it is provided as an option. The reasoning is that doing so is most helpful when you are running many cores in parallel, whereas I was using at most 12. Moreover, capping the episode length can severely cripple the performance of the algorithm if reward is correlated with episode length, since we cannot learn from highly-performing perturbations until most of the workers catch up (and they might not for a long time).
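
To make the second bullet concrete, here is a minimal sketch of the centralized update it describes: each worker evaluates a perturbed copy of the parameters and returns its episode reward to the master, which forms the ES gradient estimate and takes a step. This is an illustration, not the repository's actual training loop; the evaluate callable, the reward normalization, and plain gradient ascent are assumptions.

    import torch

    def es_step(params, evaluate, n=10, sigma=0.05, lr=0.01):
        # One master-side update: sample n perturbations, have them scored,
        # and combine the returns into a single gradient estimate.
        # In the real code the scoring is done by the worker processes;
        # here `evaluate` stands in for one worker rollout.
        epsilons = [torch.randn_like(params) for _ in range(n)]
        rewards = torch.tensor([evaluate(params + sigma * eps) for eps in epsilons])
        # normalize returns so the step size does not depend on the reward scale
        # (the paper uses rank-based fitness shaping instead)
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        grad = sum(r * eps for r, eps in zip(rewards, epsilons)) / (n * sigma)
        return params + lr * grad  # ascend the estimated gradient of expected reward

In the repository the same kind of estimate can instead be fed to an Adam optimizer via the --useAdam flag.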

Tips

  • If you increase the batch size, n, you should increase the learning rate as well.

  • Feel free to stop training when you see that the unperturbed model is consistently solving the environment, even if the perturbed models are not.

  • During training you probably want to look at the rank of the unperturbed model within the population of perturbed models (a small sketch of this rank computation follows this list). Ideally some perturbation performs better than your unperturbed model; if this doesn't happen, you probably won't learn anything useful. This requires one extra rollout per gradient step, but since this rollout can be computed in parallel with the training rollouts, it does not add to training time. It does, however, leave one fewer CPU core available.

  • Sigma is a tricky hyperparameter to get right -- higher values of sigma correspond to less variance in the gradient estimate, but more bias. At the same time, sigma controls the variance of our perturbations, so if we need a more varied population, it should be increased. It might be possible to adaptively change sigma based on the rank of the unperturbed model mentioned in the tip above. I tried a few simple heuristics based on this and found no significant performance increase, but it might be possible to do this more intelligently.

  • I found, as OpenAI did in their paper, that performance on Atari increased as I increased the size of the neural net.
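
As a concrete illustration of the rank diagnostic mentioned in the third tip, a minimal sketch (the function name and arguments are illustrative, not the repository's API):

    def unperturbed_rank(unperturbed_reward, perturbed_rewards):
        # Rank 1 means no perturbation beat the unperturbed model -- a bad sign,
        # since the gradient estimate then has nothing better to move towards.
        better = sum(r > unperturbed_reward for r in perturbed_rewards)
        return better + 1

For example, unperturbed_rank(120.0, [90.0, 150.0, 110.0]) returns 2: one perturbation did better than the unperturbed model, which is what we want to see.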

Your code is making my computer slow, help

Short answer: decrease the batch size to the number of cores in your computer, and decrease the learning rate as well. This will most likely hurt the performance of the algorithm.

Long answer: If you want large batch sizes while also keeping the number of spawned threads down, I have provided an old version in the slow_version branch which allows you to do multiple rollouts per thread, per gradient step. This code is not supported, however, and it is not recommended that you use it.

Contributions

Please feel free to open GitHub issues or send pull requests.

License

MIT

Comments
  • Performance on MountainCar

    Has anyone trained the model on MountainCar-v0? I can only obtain the minimum reward of -200. I tried both smaller and larger sigma, but neither worked.

    opened by wenyeming333 3
  • Question about action selection

    In the case of policy gradients, we approximate a softmax policy from which actions are sampled stochastically according to their probabilities.

    What about ES in the case of a discrete action space? Does the method follow a greedy policy or a softmax policy? From the code, it looks like a greedy policy -- is that the intended behavior?

    opened by rajcscw 2
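
    For readers unfamiliar with the distinction asked about above, a minimal sketch of the two selection schemes over a network's output logits (illustrative only, not a quote of the repository's code):

    import torch
    import torch.nn.functional as F

    def greedy_action(logits):
        # deterministic: always pick the highest-scoring action
        return torch.argmax(logits, dim=-1)

    def softmax_action(logits):
        # stochastic: sample an action in proportion to its softmax probability
        probs = F.softmax(logits, dim=-1)
        return torch.multinomial(probs, num_samples=1)
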
  • IndexError: too many indices for array

    Hi,

    I am trying to run CartPole using this command

    python3 main.py --small-net --env-name CartPole-v1
    

    and I get a screen full of errors like this

    IndexError: too many indices for array
    Process Process-41:
    Traceback (most recent call last):
      File "/home/ajay/anaconda3/envs/py35_pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
        self.run()
      File "/home/ajay/anaconda3/envs/py35_pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
        self._target(*self._args, **self._kwargs)
      File "/home/ajay/PythonProjects/pytorch-es-master/train.py", line 49, in do_rollouts
        state, reward, done, _ = env.step(action[0, 0])
    IndexError: too many indices for array
    
    opened by AjayTalati 2
  • Support Capability to Use GPUs

    Hey Andrew, thanks immensely for putting this together. Very useful example of evolution strategies.

    I was wondering what your thoughts are on including GPU support. My thought is that the actions computed by each model can be run on the GPU while the environment runs on the CPU. Given the nature of GPU batching, one idea is that you would batch actions, let the environment respond, and continue this process.

    I feel that the biggest bottleneck at this point would be PCIe lanes, depending on how much bandwidth you have to the GPU. The bottom line is that models would be stored and executed on the GPU while the env runs on the CPU. Does gym allow the actual env to run on the GPU?

    opened by NickShahML 2
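
    A rough sketch of the batching pattern proposed above, purely for illustration (the model, the list of environments, and the greedy action choice are assumptions): observations from several CPU-side environments are stacked into one batch, a single forward pass runs on the GPU, and the resulting actions are sent back to the environments.

    import torch

    @torch.no_grad()
    def gpu_batched_step(model, envs, states, device="cuda"):
        # model: policy network already moved to `device`
        # envs: list of gym environments, stepped on the CPU
        # states: list of the corresponding current observations
        batch = torch.as_tensor(states, dtype=torch.float32, device=device)
        logits = model(batch)                           # one batched forward pass on the GPU
        actions = logits.argmax(dim=-1).cpu().tolist()  # bring the actions back to the CPU
        return [env.step(a) for env, a in zip(envs, actions)]
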
  • Made compatible with python 2.7, added option for ADAM optimizer, added enhanced progress plotting via matplotlib

    After this PR a good improvement in CartPole can be seen with: python main.py --small-net --env-name CartPole-v1 --useAdam --lr 0.15

    opened by lolz0r 1
  • it stuck in selu?

    Hi, I ran CartPole-v1 and it works. But when I run other env names, they all get stuck in the same place:

    Here in model.py, I added some prints to help check where it gets stuck:

    def forward(self, inputs):
        if self.small_net:
            x = selu(self.linear1(inputs))
            x = selu(self.linear2(x))
            return self.actor_linear(x)
        else:
            print('model: !!!forward!!! big-net(4conv+1lstm)')
            inputs, (hx, cx) = inputs
            print('model: !!!after update: input, (hx,cx) = inputs')
            x = selu(self.conv1(inputs))
            x = selu(self.conv2(x))
            x = selu(self.conv3(x))
            x = selu(self.conv4(x))
            print('model: !!!after 4conv end selu process')
            x = x.view(-1, 3233)
            print('model: !!!after x reshape: x.view(-1,3233)')
            ......

    And here below is the output of the "python3 main.py --env-name PongDeterministic-v4 --n 10 --lr 0.01 --useAdam" command:

    (venv_openai-es) l00221575@F0817-S05:~/venv_openai-es/pytorch-es$ python3 main.py --env-name PongDeterministic-v4 --n 10 --lr 0.01 --useAdam
    [2018-10-23 22:23:10,929] Making new env: PongDeterministic-v4
    Preprocessing env
    Num params in network 588710
    /home/l00221575/venv_openai-es/pytorch-es/train.py:50: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
      (Variable(state.unsqueeze(0), volatile=True),
    model: !!!forward!!! big-net(4conv+1lstm)
    model: !!!after update: input, (hx,cx) = inputs
    /home/l00221575/venv_openai-es/pytorch-es/train.py:50: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
      (Variable(state.unsqueeze(0), volatile=True),
    model: !!!forward!!! big-net(4conv+1lstm)
    model: !!!after update: input, (hx,cx) = inputs
    /home/l00221575/venv_openai-es/pytorch-es/train.py:50: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
      (Variable(state.unsqueeze(0), volatile=True),

     
    I guess they get stuck in selu, so I added some prints in selu and ran PongDeterministic-v4 again, but the output stayed the same as above. Other env names like Kangaroo-ram-v0, Skiing-v0, Freeway-v0 and Gravitar-v0 all get stuck in the same place as PongDeterministic-v4.
    
    Please help~~~
    
    def selu(x):
        print('selu begin')
        alpha = 1.6732632423543772848170429916717
        scale = 1.0507009873554804934193349852946
        print('selu ends')
        return scale * F.elu(x, alpha)
    
    opened by ouyangzhuzhu 0
  • Adding Prioritized Experience Replay

    opened by AjayTalati 0
  • Tic-tac-toe environment for the ES training process

    Hello, @atgambardella.

    Using your code as a base, I have developed a new Tic-tac-toe environment for the ES training process. As this game can be fully analyzed by a classical min-max tree, I used this classical AI to play against our neural network model in the "step" phase and to return the reward.

    The end result is a model (a simple "Linear" one) that, thanks to evolutionary computation, can play a perfect game against the classical brute-force AI strategy.

    My code is here: https://github.com/Zeta36/pytorch-es-tic-tac-toe

    I also simplified your code a little and removed things I knew I was not going to need.

    Thanks for your work, friend.

    opened by Zeta36 0
  • Spawn processes outside of the training loop

    As the code is now, n processes are spawned per gradient step. Since Python startup takes a while (~30 ms per process), this causes non-negligible overhead.

    enhancement 
    opened by atgambardella 0
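
    For reference, a minimal sketch of the persistent-worker pattern this enhancement points at (the queues and the do_rollout placeholder are illustrative assumptions, not the repository's code): workers are spawned once and then fed tasks over a queue, instead of being re-created for every gradient step.

    import multiprocessing as mp

    def do_rollout(seed, params):
        # placeholder for a real rollout (train.do_rollouts in the repository)
        return 0.0

    def worker(task_queue, result_queue):
        # spawned once, then reused across gradient steps
        while True:
            task = task_queue.get()
            if task is None:            # sentinel tells the worker to shut down
                break
            seed, params = task
            result_queue.put(do_rollout(seed, params))

    def start_workers(n_workers):
        tasks, results = mp.Queue(), mp.Queue()
        procs = [mp.Process(target=worker, args=(tasks, results)) for _ in range(n_workers)]
        for p in procs:
            p.start()
        return tasks, results, procs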