A3C LSTM Atari with PyTorch plus A3G design

Overview

NEWLY ADDED A3G: A NEW GPU/CPU ARCHITECTURE OF A3C FOR SUBSTANTIALLY ACCELERATED TRAINING!!

RL A3C PyTorch

[GIFs: A3C LSTM playing Breakout-v0, SpaceInvadersDeterministic-v3, MsPacman-v0, BeamRider-v0, and Seaquest-v0]

NEWLY ADDED A3G!!

A new implementation of A3C that utilizes the GPU for a large speed increase in training, which we call A3G. Unlike other attempts to pair the A3C algorithm with a GPU, in A3G each agent maintains its own network on the GPU while the shared model stays on the CPU; agent gradients are quickly moved to the CPU to update the shared model, which keeps updates frequent and fast by using Hogwild training, applying updates asynchronously and without locks. This new method greatly increases training speed: models that used to take days to train can now be trained in as little as 10 minutes for some Atari games. Breakout starts to score over 400 in 10-15 minutes, and Pong is solved in about 10 minutes!
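The core of the A3G update can be sketched as follows. This is an illustrative snippet only, with a hypothetical helper name; the repo's training code handles the full details.

    def push_grads_to_shared(local_model, shared_model):
        # Copy gradients computed on the (possibly GPU-resident) worker model
        # onto the shared CPU model held in shared memory; a shared optimizer
        # then applies them Hogwild-style, asynchronously and without locks.
        # Illustrative sketch, not the repo's exact code.
        for local_p, shared_p in zip(local_model.parameters(),
                                     shared_model.parameters()):
            if local_p.grad is not None:
                shared_p.grad = local_p.grad.detach().cpu()

Each worker runs its rollout and backward pass on its own GPU copy of the model, moves the gradients to the CPU model roughly as above, steps a shared optimizer, and then re-syncs its local weights from the shared model.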

This repository contains my PyTorch implementation of reinforcement learning with Asynchronous Advantage Actor-Critic (A3C), the algorithm from Google DeepMind's paper "Asynchronous Methods for Deep Reinforcement Learning."

See a3c_continuous, a newly added repo with my A3C LSTM implementation for continuous action spaces, which was able to solve the BipedalWalkerHardcore-v2 environment (average 300+ over 100 consecutive episodes).

A3C LSTM

I implemented an A3C LSTM model and trained it in the Atari 2600 environments provided in the OpenAI Gym. So far this model has shown the best performance I have seen for Atari game environments. Included in the repo are trained models for SpaceInvaders-v0, MsPacman-v0, Breakout-v0, BeamRider-v0, Pong-v0, Seaquest-v0, and Asteroids-v0, which performed very well and held the best scores on the OpenAI Gym leaderboard for each of those games (no plans to train the model on any more Atari games right now...). Saved models are in the trained_models folder. *Trained models have since been removed to reduce the size of the repo.

Optimizers with shared statistics are available for both RMSProp and Adam, as well as the option to use a non-shared optimizer.
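For illustration, a shared-statistics Adam might look like the sketch below; the repo's shared_optim.py is the real implementation, and this minimal variant only shows where share_memory_() comes in.

    import math
    import torch

    class SharedAdam(torch.optim.Optimizer):
        # Adam whose per-parameter statistics live in shared memory so all
        # Hogwild worker processes read and update the same state.
        # Simplified sketch, not the repo's exact code.
        def __init__(self, params, lr=1e-4, betas=(0.9, 0.999), eps=1e-8):
            super(SharedAdam, self).__init__(params, dict(lr=lr, betas=betas, eps=eps))
            for group in self.param_groups:
                for p in group["params"]:
                    state = self.state[p]
                    state["step"] = torch.zeros(1)
                    state["exp_avg"] = torch.zeros_like(p.data)
                    state["exp_avg_sq"] = torch.zeros_like(p.data)
                    for t in (state["step"], state["exp_avg"], state["exp_avg_sq"]):
                        t.share_memory_()  # place optimizer statistics in shared memory

        def step(self):
            for group in self.param_groups:
                beta1, beta2 = group["betas"]
                for p in group["params"]:
                    if p.grad is None:
                        continue
                    state = self.state[p]
                    state["step"] += 1
                    step = state["step"].item()
                    exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                    # Standard Adam moment updates, operating on shared tensors.
                    exp_avg.mul_(beta1).add_(p.grad.data, alpha=1 - beta1)
                    exp_avg_sq.mul_(beta2).addcmul_(p.grad.data, p.grad.data, value=1 - beta2)
                    step_size = group["lr"] * math.sqrt(1 - beta2 ** step) / (1 - beta1 ** step)
                    denom = exp_avg_sq.sqrt().add_(group["eps"])
                    p.data.addcdiv_(exp_avg, denom, value=-step_size)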

Gym Atari settings are more difficult to train on than traditional ALE Atari settings, as Gym uses stochastic frame skipping and a larger number of discrete actions. For example, Breakout-v0 has 6 discrete actions in Gym, while ALE is set to only 4 discrete actions. Gym Atari also randomly repeats the previous action with probability 0.25, and there is a time/step limit that limits performance.
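If you want to inspect what a given Gym variant exposes, a quick check like this works (assuming a classic gym + atari-py install):

    import gym

    # The v0 variants use stochastic frame skipping and repeat the previous
    # action with probability 0.25; the Deterministic-v4 variants use a fixed
    # 4-frame skip and no sticky actions.
    for env_id in ("Breakout-v0", "BreakoutDeterministic-v4"):
        env = gym.make(env_id)
        print(env_id, env.action_space)
        env.close()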

Link to the Gym environment evaluations below:

| Environment | Best 100-episode average | Best score |
| --- | --- | --- |
| SpaceInvaders-v0 | 5808.45 ± 337.28 | 13380.0 |
| SpaceInvaders-v3 | 6944.85 ± 409.60 | 20440.0 |
| SpaceInvadersDeterministic-v3 | 79060.10 ± 5826.59 | 167330.0 |
| Breakout-v0 | 739.30 ± 18.43 | 864.0 |
| Breakout-v3 | 859.57 ± 1.97 | 864.0 |
| Pong-v0 | 20.96 ± 0.02 | 21.0 |
| PongDeterministic-v3 | 21.00 ± 0.00 | 21.0 |
| BeamRider-v0 | 8441.22 ± 221.24 | 13130.0 |
| MsPacman-v0 | 6323.01 ± 116.91 | 10181.0 |
| Seaquest-v0 | 54203.50 ± 1509.85 | 88840.0 |

The 167,330 Space Invaders score is a world-record Space Invaders score, and the game ended only due to the Gym timestep limit, not from loss of life. When I increased the Gym timestep limit to a million, it reached a Space Invaders score of approximately 2,300,000 and still ended due to the timestep limit, most likely because the game becomes fairly redundant after a while.

Due to the Gym version's timestep limit, the agent scores lower on Seaquest-v0, but on Seaquest-v4 with a higher timestep limit the agent beats the game (see gif above) with the maximum possible score of 999,999!!

Requirements

  • Python 2.7+
  • OpenAI Gym and Universe
  • PyTorch

Training

When training the model it is important to limit the number of worker processes to the number of CPU cores available, as too many processes (e.g., more than one process per CPU core) will actually be detrimental to training speed and effectiveness.
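A quick way to stay within that limit (an illustrative snippet, not part of the repo):

    import multiprocessing as mp

    # Use at most one worker per available CPU core; oversubscribing cores
    # tends to slow training down rather than speed it up.
    workers = min(32, mp.cpu_count())
    print("python main.py --env Pong-v0 --workers {}".format(workers))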

To train an agent in the Pong-v0 environment with 32 different worker processes:

python main.py --env Pong-v0 --workers 32

# A3C-GPU (A3G) training on a machine with 4 V100 GPUs and a 20-core CPU took 10 minutes to converge on PongDeterministic-v4

To train an agent in the PongDeterministic-v4 environment with 32 different worker processes on 4 GPUs with the new A3G:

python main.py --env PongDeterministic-v4 --workers 32 --gpu-ids 0 1 2 3 --amsgrad True

Hit Ctrl+C to end the training session properly.

[GIF: A3C LSTM playing Pong-v0]

Evaluation

To run a 100-episode Gym evaluation with a trained model:

python gym_eval.py --env Pong-v0 --num-episodes 100

Notice that BeamRiderNoFrameskip-v4 reaches scores over 50,000 in less than 2 hours of training. Compared to the Gym v0 version, this shows the difficulty of those versions, but also that the time limit is a major factor in the score level.

These training charts were produced on a DGX Station using 4 GPUs and a 20-core CPU. I used 36 worker agents and a tau of 0.92, which is the lambda in the Generalized Advantage Estimation equation, to introduce more variance given the more deterministic nature of using just a 4-frame-skip environment and a 0-30 NoOp start.

[Training charts: BeamRider, Boxing, Pong, SpaceInvaders, Qbert]
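For reference, the quantity that tau (the GAE lambda) controls can be computed as in the sketch below; this is an illustrative implementation of Generalized Advantage Estimation, not the repo's exact training loop.

    def generalized_advantage(rewards, values, gamma=0.99, tau=0.92):
        # Compute GAE advantages for one rollout. `values` holds one more
        # entry than `rewards` (the bootstrap value for the final state).
        # tau is the lambda from the GAE paper; it controls how far future
        # TD errors are credited back to earlier steps.
        advantages = [0.0] * len(rewards)
        gae = 0.0
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            gae = delta + gamma * tau * gae
            advantages[t] = gae
        return advantages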

Project Reference

Comments
  • I just want to say your trained model has no effect

    I tried to evaluate your trained model, however the result shows no effect:

    2017-08-01 21:08:13,757 : reward sum: -21.0, reward mean: -21.0000
    [2017-08-01 21:08:13,757] reward sum: -21.0, reward mean: -21.0000
    [2017-08-01 21:08:13,787] Starting new video recorder writing to /Volumes/xs/CodeSpace/AISpace/rl_space/rl_a3c_pytorch/Pong-v0_monitor/openaigym.video.0.33472.video000001.mp4
    2017-08-01 21:08:24,947 : reward sum: -21.0, reward mean: -21.0000
    [2017-08-01 21:08:24,947] reward sum: -21.0, reward mean: -21.0000
    2017-08-01 21:08:35,054 : reward sum: -21.0, reward mean: -21.0000
    [2017-08-01 21:08:35,054] reward sum: -21.0, reward mean: -21.0000
    2017-08-01 21:08:44,732 : reward sum: -21.0, reward mean: -21.0000
    [2017-08-01 21:08:44,732] reward sum: -21.0, reward mean: -21.0000
    

    Also, the recordings are black-and-white videos and don't display on screen.

    opened by jinfagang 20
  • Normalize

    In NormalizedEnv, I am puzzled why you chose alpha equal to 0.9999; if I want an unbiased mean, should alpha be (num_step - 1)/num_step?

    I am also puzzled why normalizing the environment observation has such a huge influence on performance. Can you explain? Thanks!

    opened by yhcao6 10
  • Reward is always 0 when training Breakout-v0

    I have trained the model overnight on Breakout-v0, however the reward is always 0. What could cause this situation? Or could you tell me what parameters you used when training to play Breakout-v0? Thank you. Here is the log file: log.txt

    opened by NeymarL 8
  • Pretrained models

    Hello, is it possible to get access to some of the pre-trained models? (specifically looking for sea quest, pong and space invaders but any or all would be brilliant)

    opened by lweitkamp 6
  • multi gpu support

    When I run the program on multiple GPUs, that is, I set gpu_id to [0, 1, 2, 3], it reports the error "Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one".

    opened by yhcao6 6
  • Performance of Breakout

    Could I ask how long it takes to train Breakout from scratch to get the desired score (859.57 for Breakout-v3)?

    Have you tried BreakoutNoFrameskip? This is a version without action repetition and randomness.

    Thanks!

    opened by yhcao6 6
  • Solving time

    Thank you for the nice implementation. I'm curious about the running time on your machine. In https://github.com/ikostrikov/pytorch-a3c, it is reported that PongDeterministic-v3 is solved around 15min, did you reproduce similar results in any version of Pong?

    Thank you

    opened by hugemicrobe 6
  • Question on training function

    I noticed that in your player_util.py action_train function:

    if self.done:
        if self.gpu_id >= 0:
            with torch.cuda.device(self.gpu_id):
                self.cx = Variable(torch.zeros(1, 512).cuda())
                self.hx = Variable(torch.zeros(1, 512).cuda())
        else:
            self.cx = Variable(torch.zeros(1, 512))
            self.hx = Variable(torch.zeros(1, 512))
    else:
        self.cx = Variable(self.cx.data)
        self.hx = Variable(self.hx.data)
    

    But how can you backpropagate gradients through time, to the past 20 steps, if you set:

    self.cx = Variable(self.cx.data)
    self.hx = Variable(self.hx.data)
    
    opened by yiwan-rl 5
  • Seaquest-v0 not training as well as announced

    I have tried twice to train the agent on Seaquest-v0 with 32 workers on a server, but after 13 hours of training, the score seems to be stuck at 2700/2800 maximum.

    Here's the log file: log.txt

    I'm using gym 0.8.1 and atari-py 0.0.21 and left all the hyperparameters at their default values. Any idea why the score obtained is much lower than the one you obtained (>50,000)? Would you have the trained model for Seaquest-v0? Thanks!

    opened by ThomasLecat 5
  • Quick question on batch processing

    Thanks for the implementation! Great code.

    I read/ran your code and realized it processes training examples with just batch_size=1 (instead of a large batch size, am I correct on this?). I am just wondering if this is designed on purpose due to your G-A3C. With a larger batch size things run faster on GPU, so why batch_size=1?

    Is there anything we can do to run it on large batches?

    Thank you.

    opened by hohoCode 4
  • Why one process run on 2 gpus?

    First, thank you for your great work on this A3C implementation.

    I ran the code with python main.py --workers 1 --gpu-ids 5 and found that one process runs on 2 GPUs. Similar things happened when I ran with --workers 50. All the processes should run on GPU 5. However, I find that all of these processes (same PID) also run on GPU 0 with Type C and smaller GPU memory usage compared with those on GPU 5. How can I assign all the processes to GPU 5? Thank you very much!


    opened by Jiankai-Sun 4
  • UserWarning: This overload of add_ is deprecated

    Is it normal to get this traceback in the console? It spams a few dozen times and then stops abruptly. Then it starts logging the training session as intended. Sorry if this question is incredibly ignorant lol. I'm new to Python and the world of AI. Figured I'd post my question here before searching Google. Thanks in advance.

    C:\Users\joshu\Documents\0AIFolder\00A3CA3Gatari\rl_a3c_pytorch-master\shared_optim.py:167: UserWarning: This overload of add_ is deprecated:
        add_(Number alpha, Tensor other)
    Consider using one of the following signatures instead:
        add_(Tensor other, *, Number alpha) (Triggered internally at ..\torch\csrc\utils\python_arg_parser.cpp:882.)
    exp_avg.mul_(beta1).add_(1 - beta1, grad)

    opened by josh-tegus 0
  • question about trained models

    Just want to clarify that there is only one saved model per environment and it will be overwritten each training epoch, right? For example, MsPacman will only have one saved model trained_models/MsPacman-v0.dat

    opened by enochkan 1
  • Question about Test function

    Hi, I'd like to ask a question about the following block:

    https://github.com/dgriff777/rl_a3c_pytorch/blob/eb5c9b909abc02911b45e325f7a7c619d3b0fa46/test.py#L60

        if player.done and not player.info:
            state = player.env.reset()
            player.eps_len += 2
            player.state = torch.from_numpy(state).float()
            if gpu_id >= 0:
                with torch.cuda.device(gpu_id):
                    player.state = player.state.cuda()
        elif player.info:
    

    I don't quite understand when info equals True or False; what is the meaning of having info=True vs info=False?

    I can't seem to find documentation about this info flag on the Gym website :(

    Thanks

    opened by AlexTo 1
Owner
David Griffis