Evolution Strategies in PyTorch

Overview

This is a PyTorch implementation of Evolution Strategies.

Requirements

Python 3.5, PyTorch >= 0.2.0, numpy, gym, universe, cv2

What is this? (For non-ML people)

A large class of problems in AI can be described as "Markov Decision Processes," in which an agent takes actions in an environment and receives reward, with the goal of maximizing reward. This is a very general framework that can be applied to many tasks, from learning how to play video games to robotic control. For the past few decades, most people have used Reinforcement Learning -- that is, learning from trial and error -- to solve these problems. In particular, an extension of the backpropagation algorithm from Supervised Learning, called the Policy Gradient, can train neural networks to solve these problems. Recently, OpenAI showed that black-box optimization of neural network parameters (that is, not using the Policy Gradient or even Reinforcement Learning) can achieve results similar to state-of-the-art Reinforcement Learning algorithms, and can be parallelized much more efficiently. This repo is an implementation of that black-box optimization algorithm.

Usage

There are two neural networks provided in model.py: a small neural network meant for simple tasks with discrete observations and actions, and a larger Convnet-LSTM meant for Atari games.
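
For orientation, here is a minimal sketch of roughly what the small network's forward pass looks like (its structure matches the forward-pass snippet quoted in the "it stuck in selu?" comment below); the hidden layer size is an illustrative assumption, not the repository's actual default:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def selu(x):
        # self-normalizing nonlinearity used in place of virtual batch norm
        # (see "Deviations from the paper" below)
        alpha = 1.6732632423543772848170429916717
        scale = 1.0507009873554804934193349852946
        return scale * F.elu(x, alpha)

    class SmallNet(nn.Module):
        # two fully connected layers followed by a linear head over discrete actions
        def __init__(self, num_inputs, num_actions, hidden=64):  # hidden size is an assumption
            super().__init__()
            self.linear1 = nn.Linear(num_inputs, hidden)
            self.linear2 = nn.Linear(hidden, hidden)
            self.actor_linear = nn.Linear(hidden, num_actions)

        def forward(self, inputs):
            x = selu(self.linear1(inputs))
            x = selu(self.linear2(x))
            return self.actor_linear(x)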

Run python3 main.py --help to see all of the options and hyperparameters available to you.

Typical usage would be:

python3 main.py --small-net --env-name CartPole-v1

which will run the small network on CartPole, printing performance on every training batch. Default hyperparameters should be able to solve CartPole fairly quickly.

python3 main.py --small-net --env-name CartPole-v1 --test --restore path_to_checkpoint

which will render the environment and show the performance of the agent saved in the checkpoint. Checkpoints are saved once per gradient update during training, always overwriting the old file.

python3 main.py --env-name PongDeterministic-v4 --n 10 --lr 0.01 --useAdam

which will train on Pong and produce a learning curve similar to this one:

Learning curve

This graph was produced after approximately 24 hours of training on a 12-core computer. I would expect that a more thorough hyperparameter search, and more importantly a larger batch size, would allow the network to solve the environment.

Deviations from the paper

  • I have not yet tried virtual batch normalization; instead I use the selu nonlinearity, which serves the same purpose at significantly reduced computational overhead. ES appears to train on Pong quite well even with relatively small batch sizes and selu.

  • I did not pass rewards between workers, but rather sent them all to one master worker, which took a gradient step and sent the new models back to the workers (a sketch of this centralized update follows this list). If you have more cores than your batch size, OpenAI's method is probably more efficient, but if your batch size is larger than the number of cores, I think my method is better.

  • I do not adaptively change the max episode length as recommended in the paper, although it is provided as an option. The reasoning is that doing so is most helpful when you are running many cores in parallel, whereas I was using at most 12. Moreover, capping the episode length can severely cripple the performance of the algorithm if reward is correlated with episode length, since we cannot learn from highly-performing perturbations until most of the workers catch up (and they might not for a long time).
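
To make the second bullet concrete, here is a minimal sketch of the centralized update it describes: each worker evaluates a perturbed copy of the parameters and returns its episode reward to the master, which forms the ES gradient estimate and takes a step. This is an illustration, not the repository's actual training loop; the evaluate callable, the reward normalization, and plain gradient ascent are assumptions.

    import torch

    def es_step(params, evaluate, n=10, sigma=0.05, lr=0.01):
        # One master-side update: sample n perturbations, have them scored,
        # and combine the returns into a single gradient estimate.
        # In the real code the scoring is done by the worker processes;
        # here `evaluate` stands in for one worker rollout.
        epsilons = [torch.randn_like(params) for _ in range(n)]
        rewards = torch.tensor([evaluate(params + sigma * eps) for eps in epsilons])
        # normalize returns so the step size does not depend on the reward scale
        # (the paper uses rank-based fitness shaping instead)
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        grad = sum(r * eps for r, eps in zip(rewards, epsilons)) / (n * sigma)
        return params + lr * grad  # ascend the estimated gradient of expected reward

In the repository the same kind of estimate can instead be fed to an Adam optimizer via the --useAdam flag.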

Tips

  • If you increase the batch size, n, you should increase the learning rate as well.

  • Feel free to stop training when you see that the unperturbed model is consistently solving the environment, even if the perturbed models are not.

  • During training you probably want to look at the rank of the unperturbed model within the population of perturbed models (a small sketch of this rank computation follows this list). Ideally some perturbation performs better than your unperturbed model; if this doesn't happen, you probably won't learn anything useful. This requires one extra rollout per gradient step, but since this rollout can be computed in parallel with the training rollouts, it does not add to training time. It does, however, leave one fewer CPU core available.

  • Sigma is a tricky hyperparameter to get right -- higher values of sigma correspond to less variance in the gradient estimate, but more bias. At the same time, sigma controls the variance of our perturbations, so if we need a more varied population, it should be increased. It might be possible to adaptively change sigma based on the rank of the unperturbed model mentioned in the tip above. I tried a few simple heuristics based on this and found no significant performance increase, but it might be possible to do this more intelligently.

  • I found, as OpenAI did in their paper, that performance on Atari increased as I increased the size of the neural net.
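
As a concrete illustration of the rank diagnostic mentioned in the third tip, a minimal sketch (the function name and arguments are illustrative, not the repository's API):

    def unperturbed_rank(unperturbed_reward, perturbed_rewards):
        # Rank 1 means no perturbation beat the unperturbed model -- a bad sign,
        # since the gradient estimate then has nothing better to move towards.
        better = sum(r > unperturbed_reward for r in perturbed_rewards)
        return better + 1

For example, unperturbed_rank(120.0, [90.0, 150.0, 110.0]) returns 2: one perturbation did better than the unperturbed model, which is what we want to see.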

Your code is making my computer slow, help

Short answer: decrease the batch size to the number of cores in your computer, and decrease the learning rate as well. This will most likely hurt the performance of the algorithm.

Long answer: If you want large batch sizes while also keeping the number of spawned threads down, I have provided an old version in the slow_version branch which allows you to do multiple rollouts per thread, per gradient step. This code is not supported, however, and it is not recommended that you use it.

Contributions

Please feel free to open GitHub issues or send pull requests.

License

MIT

Comments
  • Performance on MountainCar

    Has anyone trained the model on MountainCar-v0? I can only obtain the minimum reward of -200. I tried both smaller and larger sigma, but neither worked.

    opened by wenyeming333 3
  • Question about action selection

    In the case of policy gradients, we approximate a softmax policy from which actions are sampled stochastically according to their probabilities.

    What about ES in the case of a discrete action space? Does the method follow a greedy policy or a softmax policy? From the code, it looks like a greedy policy -- is that the intended behavior?

    opened by rajcscw 2
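
    For readers unfamiliar with the distinction asked about above, a minimal sketch of the two selection schemes over a network's output logits (illustrative only, not a quote of the repository's code):

    import torch
    import torch.nn.functional as F

    def greedy_action(logits):
        # deterministic: always pick the highest-scoring action
        return torch.argmax(logits, dim=-1)

    def softmax_action(logits):
        # stochastic: sample an action in proportion to its softmax probability
        probs = F.softmax(logits, dim=-1)
        return torch.multinomial(probs, num_samples=1)
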
  • IndexError: too many indices for array

    Hi,

    I am trying to run CartPole using this command

    python3 main.py --small-net --env-name CartPole-v1
    

    and I get a screen full of errors like this

    IndexError: too many indices for array
    Process Process-41:
    Traceback (most recent call last):
      File "/home/ajay/anaconda3/envs/py35_pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
        self.run()
      File "/home/ajay/anaconda3/envs/py35_pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
        self._target(*self._args, **self._kwargs)
      File "/home/ajay/PythonProjects/pytorch-es-master/train.py", line 49, in do_rollouts
        state, reward, done, _ = env.step(action[0, 0])
    IndexError: too many indices for array
    
    opened by AjayTalati 2
  • Support Capability to Use GPUs

    Hey Andrew, thanks immensely for putting this together. Very useful example of evolution strategies.

    I was wondering what your thoughts are on including GPU support. My thought is that the actions computed by each model can be run on the GPU while the environment runs on the CPU. Given the nature of GPU batching, one idea is that you would batch actions, let the environment respond, and continue this process.

    I feel that the biggest bottleneck at this point would be PCIe lanes, depending on how much bandwidth you have to the GPU. The bottom line is that models would be stored and executed on the GPU while the env runs on the CPU. Does gym allow the actual env to run on the GPU?

    opened by NickShahML 2
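
    A rough sketch of the batching pattern proposed above, purely for illustration (the model, the list of environments, and the greedy action choice are assumptions): observations from several CPU-side environments are stacked into one batch, a single forward pass runs on the GPU, and the resulting actions are sent back to the environments.

    import torch

    @torch.no_grad()
    def gpu_batched_step(model, envs, states, device="cuda"):
        # model: policy network already moved to `device`
        # envs: list of gym environments, stepped on the CPU
        # states: list of the corresponding current observations
        batch = torch.as_tensor(states, dtype=torch.float32, device=device)
        logits = model(batch)                           # one batched forward pass on the GPU
        actions = logits.argmax(dim=-1).cpu().tolist()  # bring the actions back to the CPU
        return [env.step(a) for env, a in zip(envs, actions)]
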
  • Made compatible with python 2.7, added option for ADAM optimizer, added enhanced progress plotting via matplotlib

    After this PR a good improvement in CartPole can be seen with: python main.py --small-net --env-name CartPole-v1 --useAdam --lr 0.15

    opened by lolz0r 1
  • it stuck in selu?

    Hi, I ran CartPole-v1 and it works. But when I run other env names, they all get stuck in the same place:

    Here in model.py, I added some prints to help check where it gets stuck:

    def forward(self, inputs):
        if self.small_net:
            x = selu(self.linear1(inputs))
            x = selu(self.linear2(x))
            return self.actor_linear(x)
        else:
            print('model: !!!forward!!! big-net(4conv+1lstm)')
            inputs, (hx, cx) = inputs
            print('model: !!!after update: input, (hx,cx) = inputs')
            x = selu(self.conv1(inputs))
            x = selu(self.conv2(x))
            x = selu(self.conv3(x))
            x = selu(self.conv4(x))
            print('model: !!!after 4conv end selu process')
            x = x.view(-1, 3233)
            print('model: !!!after x reshape: x.view(-1,3233)')
            ......

    And here below is the output of the "python3 main.py --env-name PongDeterministic-v4 --n 10 --lr 0.01 --useAdam" command:

    (venv_openai-es) l00221575@F0817-S05:~/venv_openai-es/pytorch-es$ python3 main.py --env-name PongDeterministic-v4 --n 10 --lr 0.01 --useAdam
    [2018-10-23 22:23:10,929] Making new env: PongDeterministic-v4
    Preprocessing env
    Num params in network 588710
    /home/l00221575/venv_openai-es/pytorch-es/train.py:50: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
      (Variable(state.unsqueeze(0), volatile=True),
    model: !!!forward!!! big-net(4conv+1lstm)
    model: !!!after update: input, (hx,cx) = inputs
    /home/l00221575/venv_openai-es/pytorch-es/train.py:50: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
      (Variable(state.unsqueeze(0), volatile=True),
    model: !!!forward!!! big-net(4conv+1lstm)
    model: !!!after update: input, (hx,cx) = inputs
    /home/l00221575/venv_openai-es/pytorch-es/train.py:50: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
      (Variable(state.unsqueeze(0), volatile=True),

     
    I guess they get stuck in selu, so I added some prints in selu and ran PongDeterministic-v4 again, but the output stayed the same as above. Other env names like Kangaroo-ram-v0, Skiing-v0, Freeway-v0 and Gravitar-v0 all get stuck in the same place as PongDeterministic-v4.
    
    Please help~~~
    
    def selu(x):
        print('selu begin')
        alpha = 1.6732632423543772848170429916717
        scale = 1.0507009873554804934193349852946
        print('selu ends')
        return scale * F.elu(x, alpha)
    
    opened by ouyangzhuzhu 0
  • Adding Prioritized Experience Replay

    opened by AjayTalati 0
  • Tic-tac-toe environment for the ES training process

    Hello, @atgambardella.

    Using your code as a base, I have developed a new Tic-tac-toe environment for the ES training process. As this game can be fully analyzed by a classical min-max tree, I used this classical AI to play against our neural network model in the "step" phase and to return the reward.

    The end result is a model (a simple "Linear" one) that, thanks to evolutionary computation, can play a perfect game against the classical brute-force AI strategy.

    My code is here: https://github.com/Zeta36/pytorch-es-tic-tac-toe

    I also simplified your code a little and removed things I knew I was not going to need.

    Thanks for your work, friend.

    opened by Zeta36 0
  • Spawn processes outside of the training loop

    As the code is now, n processes are spawned per gradient step. Since Python startup takes a while (~30 ms per process), this causes non-negligible overhead.

    enhancement 
    opened by atgambardella 0
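
    For reference, a minimal sketch of the persistent-worker pattern this enhancement points at (the queues and the do_rollout placeholder are illustrative assumptions, not the repository's code): workers are spawned once and then fed tasks over a queue, instead of being re-created for every gradient step.

    import multiprocessing as mp

    def do_rollout(seed, params):
        # placeholder for a real rollout (train.do_rollouts in the repository)
        return 0.0

    def worker(task_queue, result_queue):
        # spawned once, then reused across gradient steps
        while True:
            task = task_queue.get()
            if task is None:            # sentinel tells the worker to shut down
                break
            seed, params = task
            result_queue.put(do_rollout(seed, params))

    def start_workers(n_workers):
        tasks, results = mp.Queue(), mp.Queue()
        procs = [mp.Process(target=worker, args=(tasks, results)) for _ in range(n_workers)]
        for p in procs:
            p.start()
        return tasks, results, procs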