PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

Overview

pytorch-a2c-ppo-acktr

Update (April 12th, 2021)

PPO is great, but Soft Actor-Critic can be better for many continuous control tasks. Please check out my new RL repository in JAX.

Please use the hyperparameters from this README. With other hyperparameters, things might not work (it's RL, after all)!

This is a PyTorch implementation of

  • Advantage Actor Critic (A2C), a synchronous deterministic version of A3C
  • Proximal Policy Optimization (PPO)
  • Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR)
  • Generative Adversarial Imitation Learning (GAIL)

Also see the OpenAI posts: A2C/ACKTR and PPO for more information.

This implementation is inspired by the OpenAI baselines for A2C, ACKTR and PPO. It uses the same hyperparameters and model architecture, since they were well tuned for Atari games.

Please use this bibtex if you want to cite this repository in your publications:

@misc{pytorchrl,
  author = {Kostrikov, Ilya},
  title = {PyTorch Implementations of Reinforcement Learning Algorithms},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail}},
}

Supported (and tested) environments (via OpenAI Gym)

I highly recommend PyBullet as a free open source alternative to MuJoCo for continuous control tasks.
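
For example, assuming pybullet is installed so that the Bullet Gym environments (e.g. HalfCheetahBulletEnv-v0) are registered, a PyBullet task can be trained with the same style of command as the MuJoCo examples in the Training section below (the exact hyperparameters may need tuning):

python main.py --env-name "HalfCheetahBulletEnv-v0" --algo ppo --use-gae --num-steps 2048 --num-processes 1 --lr 3e-4 --num-env-steps 1000000 --use-linear-lr-decay --use-proper-time-limits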

All environments are operated through exactly the same Gym interface. See the Gym documentation for a comprehensive list of environments.

To use the DeepMind Control Suite environments, pass --env-name dm.<domain_name>.<task_name>, where domain_name is the name of a domain (e.g. hopper) and task_name is a task within that domain (e.g. stand) from the DeepMind Control Suite. Refer to their repo and their tech report for a full list of available domains and tasks. Other than setting the task, the API for interacting with the environment is exactly the same as for all the Gym environments, thanks to dm_control2gym.
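
For example, an illustrative command for the hopper stand task (reusing the PPO settings shown later for continuous control; hyperparameters may need tuning) is:

python main.py --env-name "dm.hopper.stand" --algo ppo --use-gae --num-env-steps 1000000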

Requirements

To install the requirements, run:

# PyTorch
conda install pytorch torchvision -c soumith

# Other requirements
pip install -r requirements.txt
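
As an optional sanity check (not part of the original instructions), you can verify that the core dependencies import and that a GPU is visible if you plan to use one:

# Optional sanity check: confirm PyTorch and Gym import correctly
# and report whether CUDA is available for training.
import torch
import gym

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print(gym.make("CartPole-v1").observation_space)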

Contributions

Contributions are very welcome. If you know how to make this code better, please open an issue. If you want to submit a pull request, please open an issue first. Also see the TODO list below.

I'm also looking for volunteers to run all experiments on Atari and MuJoCo (with multiple random seeds).

Disclaimer

It's extremely difficult to reproduce results for reinforcement learning methods. See "Deep Reinforcement Learning that Matters" for more information. I tried to reproduce the OpenAI results as closely as possible. However, major differences in performance can be caused even by minor differences between the TensorFlow and PyTorch libraries.

TODO

  • Improve this README file. Rearrange images.
  • Improve performance of KFAC, see kfac.py for more information
  • Run evaluation for all games and algorithms

Visualization

To visualize the results, use visualize.ipynb.
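
If you prefer a quick plot outside the notebook, a minimal sketch along the following lines works, assuming training wrote OpenAI-baselines-style *.monitor.csv files (adjust the path to wherever your --log-dir points; the /tmp/gym location is an assumption):

# Minimal sketch: plot smoothed episode rewards from baselines-style monitor files.
# Each *.monitor.csv has a '#' JSON header line followed by columns r, l, t.
import glob
import pandas as pd
import matplotlib.pyplot as plt

files = glob.glob("/tmp/gym/*.monitor.csv")
episodes = pd.concat(pd.read_csv(f, skiprows=1) for f in files).sort_values("t")

episodes["r"].rolling(100, min_periods=1).mean().reset_index(drop=True).plot()
plt.xlabel("episode")
plt.ylabel("reward (100-episode rolling mean)")
plt.show()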

Training

Atari

A2C

python main.py --env-name "PongNoFrameskip-v4"

PPO

python main.py --env-name "PongNoFrameskip-v4" --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 0.5 --num-processes 8 --num-steps 128 --num-mini-batch 4 --log-interval 1 --use-linear-lr-decay --entropy-coef 0.01

ACKTR

python main.py --env-name "PongNoFrameskip-v4" --algo acktr --num-processes 32 --num-steps 20

MuJoCo

Please always try to use the --use-proper-time-limits flag. It properly handles partial trajectories caused by time limits (see https://github.com/sfujim/TD3/blob/master/main.py#L123).
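
For intuition, here is a minimal, self-contained sketch of GAE-style returns with time-limit-aware bootstrapping. It is not the repo's storage.py code, and the mask conventions are illustrative: masks[t] = 0 when the episode ended after step t, and bad_masks[t] = 0 when that ending was caused only by a time limit.

# Illustrative sketch of GAE returns that do not treat time-limit
# truncations as true terminal states.
import numpy as np

def gae_returns(rewards, values, masks, bad_masks, gamma=0.99, lam=0.95):
    # rewards, masks, bad_masks have length T; values has length T + 1,
    # where values[T] is the bootstrap value of the final observation.
    T = len(rewards)
    returns = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] * masks[t] - values[t]
        gae = delta + gamma * lam * masks[t] * gae
        # For a pure time-limit truncation, zero the advantage so the
        # return target falls back to the critic's own value estimate.
        gae = gae * bad_masks[t]
        returns[t] = gae + values[t]
    return returns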

A2C

python main.py --env-name "Reacher-v2" --num-env-steps 1000000

PPO

python main.py --env-name "Reacher-v2" --algo ppo --use-gae --log-interval 1 --num-steps 2048 --num-processes 1 --lr 3e-4 --entropy-coef 0 --value-loss-coef 0.5 --ppo-epoch 10 --num-mini-batch 32 --gamma 0.99 --gae-lambda 0.95 --num-env-steps 1000000 --use-linear-lr-decay --use-proper-time-limits

ACKTR

ACKTR requires some modifications specifically for MuJoCo, but at the moment I want to keep this code as unified as possible, so I'm still looking for a better way to integrate those changes into the codebase.

Enjoy

Load a pretrained model from my Google Drive.

Pretrained models for other games are also available on request: send me an email or open an issue, and I will upload them.

Disclaimer: I might have used different hyperparameters to train these models.

Atari

python enjoy.py --load-dir trained_models/a2c --env-name "PongNoFrameskip-v4"

MuJoCo

python enjoy.py --load-dir trained_models/ppo --env-name "Reacher-v2"

Results

A2C

BreakoutNoFrameskip-v4

SeaquestNoFrameskip-v4

QbertNoFrameskip-v4

BeamRiderNoFrameskip-v4

PPO

BreakoutNoFrameskip-v4

SeaquestNoFrameskip-v4

QbertNoFrameskip-v4

BeamRiderNoFrameskip-v4

ACKTR

BreakoutNoFrameskip-v4

SeaquestNoFrameskip-v4

QbertNoFrameskip-v4

BeamRiderNoFrameskip-v4

Issues
  • LSTM policy

    Great work! A very comprehensible and straightforward implementation.

    It seems you're performing two forward passes: 1) to choose an action (main.py, line 113), 2) to evaluate the actions (main.py, line 146). Why not save the values, log_probs and entropy while selecting actions (as you did in a3c)? Are there computational benefits to performing these for all processes at once?

    opened by nadavbh12 22
  • Continuous action space: range not taken into account

    Hello there!

    I'm trying to train a CNN to control a robot with a differential drive. My gym environment has this action space:

            self.action_space = spaces.Box(
                low=-1,
                high=1,
                shape=(2,)
            )
    

    That is, I need the CNN to output two motor velocities in the range [-1, 1]. Unfortunately, at the moment, the low and high range of my action space isn't taken into account. I get outputs as high as 540, which makes the robot spin out of control.

    This seems like it should be an easy problem to fix, but I'm still very new to PyTorch. Could you make the change, or advise me as to how to fix this?

    opened by maximecb 18
  • Rewards suddenly drop to 0 during training with A2C

    I've been using this repository for a while and it works great, but I noticed that in some cases after some time of training the rewards suddenly drop to 0 and go up again afterward. Does anyone have any idea how to solve this issue? (the learning rate is already very small so I don't think that decreasing it is the solution).

    opened by ShaniGam 15
  • '>' not supported between instances of 'float' and 'NoneType'

    Hello,

    When running PPO on various environments, I get the following error message: '>' not supported between instances of 'float' and 'NoneType'. I narrowed the error down to the point where the Visdom server is initialized, but I can't figure out why this happens.

    Also, I used the following command to launch the training: python main.py --env-name "mreacher-v0" --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --num-steps 128 --num-mini-batch 4 --vis-interval 1 --log-interval 1

    Thanks !

    opened by MoMe36 12
  • About the training results

    I have tried PPO on Hopper-v2 and A2C on BreakoutNoFrameskip-v4. However, when training for about 1M steps, the mean reward is far less than your results. For Hopper-v2, it is still around 6.2 at 0.1M steps; for BreakoutNoFrameskip-v4, it is still around 50.4 at 2M steps.

    So I don't know whether there are other settings that should be mentioned?

    Regards 9/14

    opened by Nara0731 10
  • Multi-processor performance

    I am finding that trying to use multiple processors is slowing things down.

    If I try,

    python main.py --env-name "myenv-v0" --num-processes 1 --num-steps 20000 --recurrent-policy

    On my iMac (4 cores) I get around 1000 fps. If I try --num-processes 4, then I get about 200 fps, and none of the cores is at full utilisation.

    I also tried the same on a Linux GCE instance with 32 cores and got similar results. When running with --num-processes 32 each process is only using about 7% of its CPU.

    Any ideas?

    opened by hammertoe 10
  • Atari games not learning?

    Tried training on different Atari games. It seems that with the default parameters, neither Breakout nor Boxing was able to learn. Tried both acktr and a2c and waited for 10M steps. Pong, though, managed to learn no problem. Are you still able to reproduce the results in the graphs in the readme?

    opened by nadavbh12 10
  • MLP recurrent

    This PR consists of two commits related to recurrent policy models.

    The first commit implements the recurrent option for the MLP policy. This results in multiple methods shared by the MLP and CNN models, which I extracted into an NNBase class.

    The second commit doesn't add any new functionality but suggests a new naming convention for observations, current observations and states. From my understanding, a general naming convention in RL is that states is an umbrella term for observations. They can be the same, but if, for example, num_stack=4, one state is made up of 4 observations. With the term states, the current code refers to the hidden state of the recurrent layer of a policy. As this might be confusing, I suggest a new naming convention: current_obs to states, states to recurrent_hidden_states, and obs remains obs. Inside the policy class the recurrent hidden states are abbreviated to rnn_hxs.

    opened by timmeinhardt 9
  • Potential bug in storage.py > compute_returns?

    On the first line, we see that self.value_preds[step] is subtracted, but on the third line it is added back in. So it's actually not doing anything?

    delta = self.rewards[step] + gamma * self.value_preds[step + 1] * self.masks[step + 1] - self.value_preds[step]
    gae = delta + gamma * tau * self.masks[step + 1] * gae
    self.returns[step] = gae + self.value_preds[step]
    
    opened by 0xsamgreen 9
  • Setting random seed

    Hi --

    I'm trying to set this up so that it gets the exact same results every time (e.g., for regression tests). Even when I set the seeds, I get (slightly) different results on each run. Any ideas what might be going on there?

    Thanks

    opened by bkj 8
  • why PPO needs to store action_log_probs instead of using stop_gradient for better efficiency?

    Hi, I am looking at the PPO implementation, and I am curious about this part (actually, many other implementations use this workflow as well, so I am also curious to see if I missed anything).

    So action_log_probs is created, has its gradient removed (by setting requires_grad=False), and is inserted into the storage buffer. This action_log_probs is generated by the following function and will later be referred to as old_action_log_probs_batch in PPO:

    def act(self, inputs, rnn_hxs, masks, deterministic=False):
            ...
            action_log_probs = dist.log_probs(action)
    
            return value, action, action_log_probs, rnn_hxs
    

    In the PPO algorithm, the ratio is calculated by the following, where action_log_probs comes from evaluate_actions():

    values, action_log_probs, dist_entropy, _ = self.actor_critic.evaluate_actions(
                        obs_batch, recurrent_hidden_states_batch, masks_batch,
                        actions_batch)
    ratio = torch.exp(action_log_probs - old_action_log_probs_batch)
    

    If I am not misunderstanding, evaluate_actions() and act() will output the same action_log_probs, because they use the same actor_critic and call log_probs(action); the only difference is that old_action_log_probs_batch has the gradient removed, so backpropagation will not go through it.

    So my question is: why do we bother to save old_action_log_probs_batch in the storage when something like this could be created on the fly?

    values, action_log_probs, dist_entropy, _ = self.actor_critic.evaluate_actions(
                        obs_batch, recurrent_hidden_states_batch, masks_batch,
                        actions_batch)
    old_action_log_probs_batch = action_log_probs.detach()
    ratio = torch.exp(action_log_probs - old_action_log_probs_batch)
    

    Thank you for your attention. Look forward to the discussion.

    Regards, Tian

    opened by Emerald01 0
  • object has no attribute 'steps' in acktr

    If you create an A2C model and then try to apply KFAC to it, it says that there is no attribute called steps. After checking, it seems that not all attributes are defined. How should I solve this?

    object has no attribute 'steps'
    
    opened by sungreong 0
  • No softmax before categorical loss?

    Hi, thanks so much for sharing this, what a great repo. I've noticed that the final actor layer is not really activated; rather, a distribution object (say, Categorical) is used, and later the log probabilities are taken to compute the actor's loss. Don't we lose the desired mesh that the softmax function gives us in this case? I.e., we encourage good actions and discourage bad actions less than if we'd used softmax, right? Just wanted to ask: is this on purpose, or did I misunderstand the code?

    Thanks!

    opened by hopl1t 0
  • Operations that have no effect

    Hi, the two lines referenced below seem to have canceling effects (the second quoted line is the inverse of the sigmoid). I was wondering what the purpose of including them was. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/1120cdfe94e79294a52486590d9c2bcc5c01730d/a2c_ppo_acktr/algo/gail.py#L101-L102

    I think if the purpose was to make this a Wasserstein GAIL, it would be nice to do something like if args.wasserstein ... else ...

    opened by ArashVahabpour 0
  • CNN Architecture

    Hello,

    I should have written this issue when we noticed it a while ago, but the architecture of the CNN does not match the Nature CNN (I assume that was the goal); the last layer should have 64 channels too. This repo: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/f60ac80147d7fcd3aa7e9210e37d5734d9b6f4cd/a2c_ppo_acktr/model.py#L176-L180

    SB2 repo (following OpenAI Baselines repo): https://github.com/hill-a/stable-baselines/blob/a4efff01ca678bcceee3eb21801c410612df209f/stable_baselines/common/policies.py#L16-L29

    or in the SB3 repo: https://github.com/DLR-RM/stable-baselines3/blob/88e1be9ff5e04b7688efa44951f845b7daf5717f/stable_baselines3/common/torch_layers.py#L76-L84

    opened by araffin 0
  • Possible bug on the sign of policy log prob. in Fisher computation

    Dear @ikostrikov ,

    while reading your code I noticed that you use the log prob. of a normal distribution for the Fisher matrix calculation in the value loss, but the negative log prob. of the policy. Comparing your code in [1] with the equivalent lines of the stable baselines [2], one can see that the policy part of the Fisher matrix calculation is a log prob. (minus negative log prob. in pg_fisher_loss) and the value function contribution is also a log prob. (minus mean squared error).

    The original paper mentions the construction of the Fisher matrix using the gradient of the log prob. of the policy and the log prob. of a Gaussian around the value function (section 3.1 of [3]). I would therefore expect the sign of the two terms used for the Fisher matrix to follow the same convention, as is done in the stable baselines repository. The actual loss function minimisation is done with a negative log prob. for both (as you currently do, and as is done in the stable baselines repo), but in both cases the sign of the two terms should be consistent.

    Therefore, I could not fully understand the reason for that sign in the Fisher matrix calculation. Is this a bug, or is there some deeper reason behind it?

    Best regards, Danilo

    [1] https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/master/a2c_ppo_acktr/algo/a2c_acktr.py#L53 [2] https://stable-baselines.readthedocs.io/en/master/_modules/stable_baselines/acktr/acktr.html#ACKTR [3] https://arxiv.org/pdf/1708.05144.pdf

    opened by daniloefl 0
  • Stale hidden states

    Hi!

    I was taking a look at your code and wondering whether you tackle stale hidden states after each rollout. As far as I can see, the code is stateful at the episode level, and when a done is encountered, the hidden states are reset. However, from one rollout to the next, the output hidden state of the last rollout is copied to be the input hidden state of the current rollout, even though the actor-critic network parameters (including the GRU) have already been updated.

    Is there any reason why you do not recalculate the last rollout's hidden state taking the new network weights into account? Thank you in advance!

    opened by aklein1995 0
  • Can not run enjoy.py

    I cannot run enjoy.py due to "Can't get attribute 'CNNPolicy' on <module 'model' from 'a2c_ppo_acktr\model.py'>", and I cannot find CNNPolicy in model.py.

    opened by juanjuan2 0
  • Can I train in my own game

    I wonder if I can use this code to train an agent on my own game, such as FPS or action games.

    opened by hhhcwb38712 0