A3C LSTM Atari with Pytorch plus A3G design

David Griffis

Last update: Jan 2, 2023

Related tags

Deep Learning python reinforcement-learning deep-reinforcement-learning openai-gym pytorch a3c atari actor-critic pytorch-a3c asynchronous-advantage-actor-critic a3c-gpu a3g

Overview

NEWLY ADDED: A3G, A NEW GPU/CPU ARCHITECTURE OF A3C FOR SUBSTANTIALLY ACCELERATED TRAINING!!

RL A3C Pytorch

NEWLY ADDED A3G!!

A new implementation of A3C, which we can call A3G, uses the GPU to speed up training. Other versions try to use the GPU with the A3C algorithm in different ways. In A3G, each agent has its own network maintained on the GPU, while the shared model stays on the CPU. Agent models are quickly converted to CPU to update the shared model, which keeps updates frequent and fast. Using Hogwild training, updates to the shared model are made asynchronously and without locks. This new method greatly increases training speed: models that used to take days to train can be trained in as little as 10 minutes for some Atari games! Breakout starts scoring over 400 after 10-15 minutes! And Pong is solved in 10 minutes!
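The toy, pure-Python sketch below makes the lock-free shared-update idea concrete. It is not the repository's code. The "shared model" is a plain list of two floats standing in for the CPU-resident network. Each thread stands in for one agent that computes gradients on its own local copy, which would be the GPU model in A3G. The quadratic loss, `TARGET`, and all function names are hypothetical. Updates are written into the shared parameters without any lock, in the spirit of Hogwild training.

```python
import random
import threading

# Toy stand-in for the CPU-resident shared model: just two floats.
shared_params = [0.0, 0.0]
TARGET = [3.0, -2.0]  # hypothetical optimum of a toy quadratic loss


def local_gradient(params):
    # Gradient of 0.5 * sum((p - t)^2); stands in for backprop on an agent's own model.
    return [p - t for p, t in zip(params, TARGET)]


def worker(steps, lr, seed):
    rng = random.Random(seed)
    for _ in range(steps):
        local = list(shared_params)                     # 1. sync local copy from the shared model
        grad = local_gradient(local)                    # 2. compute gradient on the local copy
        grad = [g + rng.gauss(0.0, 0.01) for g in grad]  # noise mimicking stochastic rollouts
        for i, g in enumerate(grad):                    # 3. write into shared params, no lock (Hogwild-style)
            shared_params[i] -= lr * g


threads = [threading.Thread(target=worker, args=(200, 0.05, s)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print([round(p, 2) for p in shared_params])
```

Occasional stale reads or lost writes between threads only slow convergence a little on a problem like this. That tolerance is the property Hogwild-style training relies on.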

This repository includes my implementation of reinforcement learning using Asynchronous Advantage Actor-Critic (A3C) in PyTorch, an algorithm from Google DeepMind's paper "Asynchronous Methods for Deep Reinforcement Learning."
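For readers new to A3C, each worker's update fits in a few lines. The sketch below is a plain-Python illustration of the standard n-step advantage actor-critic loss from the paper. It uses plain floats with no autograd, and the function name `a3c_losses` is hypothetical. It is not copied from this repository. The 0.5 value-loss factor and the 0.01 entropy coefficient are illustrative defaults.

```python
import math


def a3c_losses(rewards, values, log_probs, entropies, bootstrap_value,
               gamma=0.99, entropy_coef=0.01):
    """Toy A3C loss terms for one n-step rollout.

    rewards[t]: reward r_t;  values[t]: V(s_t);  log_probs[t]: log pi(a_t | s_t);
    entropies[t]: entropy of pi(. | s_t);  bootstrap_value: V(s_{t+n}), or 0.0 if the episode ended.
    """
    R = bootstrap_value
    policy_loss = 0.0
    value_loss = 0.0
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R          # discounted n-step return
        advantage = R - values[t]           # advantage estimate A(s_t, a_t)
        value_loss += 0.5 * advantage ** 2  # critic regresses V(s_t) toward R
        # Actor: raise the log-prob of actions with positive advantage
        # (with autograd, the advantage is treated as a constant here);
        # the entropy bonus encourages exploration.
        policy_loss -= log_probs[t] * advantage + entropy_coef * entropies[t]
    return policy_loss, value_loss


p_loss, v_loss = a3c_losses([0.0, 0.0, 1.0], [0.0, 0.0, 0.0],
                            [math.log(0.5)] * 3, [0.0] * 3, 0.0, gamma=0.5)
```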

See a3c_continuous, a newly added repo of my A3C LSTM implementation for continuous action spaces. It was able to solve the BipedalWalkerHardcore-v2 environment (average of 300+ over 100 consecutive episodes).

A3C LSTM

I implemented an A3C LSTM model and trained it in the Atari 2600 environments provided in OpenAI Gym. So far, the model has shown the best performance I have seen for Atari game environments. Included in the repo are trained models for SpaceInvaders-v0, MsPacman-v0, Breakout-v0, BeamRider-v0, Pong-v0, Seaquest-v0 and Asteroids-v0. These have performed very well and currently hold the best scores on the OpenAI Gym leaderboard for each of those games (no plans to train the model on any more Atari games right now...). Saved models are in the trained_models folder. *Trained models were removed to reduce the size of the repo.

Optimizers using shared statistics for RMSProp and Adam are available for training, as well as an option to use a non-shared optimizer.
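A "shared statistics" optimizer means every worker reads and updates the same moment estimates and step counter, rather than keeping private copies. The minimal Adam sketch below shows the idea with plain lists. The class `SharedAdamState` and function `shared_adam_step` are hypothetical names. In a PyTorch version the statistics would be tensors moved into shared memory, which this toy does not attempt.

```python
import math


class SharedAdamState:
    """Hypothetical container for Adam statistics that all workers share."""

    def __init__(self, n_params):
        self.step = 0
        self.exp_avg = [0.0] * n_params     # first-moment estimate m
        self.exp_avg_sq = [0.0] * n_params  # second-moment estimate v


def shared_adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to params in place, updating the shared statistics."""
    state.step += 1
    bias1 = 1 - beta1 ** state.step
    bias2 = 1 - beta2 ** state.step
    for i, g in enumerate(grads):
        state.exp_avg[i] = beta1 * state.exp_avg[i] + (1 - beta1) * g
        state.exp_avg_sq[i] = beta2 * state.exp_avg_sq[i] + (1 - beta2) * g * g
        m_hat = state.exp_avg[i] / bias1     # bias-corrected first moment
        v_hat = state.exp_avg_sq[i] / bias2  # bias-corrected second moment
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
```

With the shared option, all workers would pass the same `SharedAdamState` instance. A non-shared optimizer would instead give each worker its own instance.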

Gym Atari settings are more difficult to train on than traditional ALE Atari settings, because Gym uses stochastic frame skipping and has a higher number of discrete actions. For example, Breakout-v0 has 6 discrete actions in Gym, but ALE is set to only 4. Also, Gym Atari randomly repeats the previous action with probability 0.25, and a time/step limit caps performance.
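The two sources of randomness described above can be simulated in a short sketch. This is a hypothetical wrapper, not Gym's code. It assumes the frame-skip count is drawn uniformly from 2 to 4 frames, matching Gym's default `(2, 5)` range for v0 Atari environments, and that the 0.25 repeat probability is applied per emulator frame. `env_step` is a stand-in for advancing the emulator by one frame.

```python
import random


def stochastic_step(env_step, action, prev_action, rng,
                    frameskip_range=(2, 5), repeat_action_probability=0.25):
    """Hypothetical Gym v0-style agent step: random frame skip plus sticky actions.

    env_step(a) -> reward advances one emulator frame with action a.
    Returns (total_reward, last_executed_action).
    """
    total_reward = 0.0
    executed = prev_action
    for _ in range(rng.randrange(*frameskip_range)):  # 2, 3 or 4 frames
        if rng.random() >= repeat_action_probability:
            executed = action  # the agent's chosen action takes effect on this frame
        # otherwise the previously executed action is repeated on this frame
        total_reward += env_step(executed)
    return total_reward, executed
```

The agent therefore cannot know exactly how many frames its action lasts, or whether the action took effect immediately. That uncertainty is part of why the v0 environments are harder.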

Gym environment evaluation results are below:

Environment	Best 100-episode avg	Best score
SpaceInvaders-v0	5808.45 ± 337.28	13380.0
SpaceInvaders-v3	6944.85 ± 409.60	20440.0
SpaceInvadersDeterministic-v3	79060.10 ± 5826.59	167330.0
Breakout-v0	739.30 ± 18.43	864.0
Breakout-v3	859.57 ± 1.97	864.0
Pong-v0	20.96 ± 0.02	21.0
PongDeterministic-v3	21.00 ± 0.00	21.0
BeamRider-v0	8441.22 ± 221.24	13130.0
MsPacman-v0	6323.01 ± 116.91	10181.0
Seaquest-v0	54203.50 ± 1509.85	88840.0

The 167,330 Space Invaders score is a world-record Space Invaders score. The game ended only because of the Gym timestep limit, not from loss of life. When I increased the Gym timestep limit to a million, the agent reached a Space Invaders score of approximately 2,300,000 and still ended because of the timestep limit, most likely because the game gets fairly repetitive after a while.

Because of the timestep limit in the Gym version of Seaquest-v0, the agent scores lower there. On Seaquest-v4, which has a higher timestep limit, the agent beats the game (see gif above) with the maximum possible score of 999,999!!

Requirements

Python 2.7+
OpenAI Gym and Universe
PyTorch

Training

When training the model, limit the number of worker processes to the number of available CPU cores. Too many processes (e.g., more than one process per available CPU core) will actually hurt training speed and effectiveness.
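A small helper can apply this rule before choosing the `--workers` value. `choose_num_workers` is a hypothetical name and is not taken from main.py. The sketch only illustrates capping the worker count at one process per core.

```python
import os


def choose_num_workers(requested, cpu_count=None):
    """Cap the number of worker processes at one per available CPU core."""
    available = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(requested, available))


print(choose_num_workers(32))  # result depends on the machine's core count
```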

To train an agent in the Pong-v0 environment with 32 worker processes:

python main.py --env Pong-v0 --workers 32

A3C-GPU training of PongDeterministic-v4 on a machine with 4 V100 GPUs and a 20-core CPU took 10 minutes to converge.

To train an agent in the PongDeterministic-v4 environment with 32 worker processes on 4 GPUs using the new A3G:

python main.py --env PongDeterministic-v4 --workers 32 --gpu-ids 0 1 2 3 --amsgrad True
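With 32 workers and 4 GPU ids, one natural split is round-robin: worker `rank` uses `gpu_ids[rank % len(gpu_ids)]`. The sketch below assumes that scheme; it has not been checked against main.py. It also shows how a space-separated `--gpu-ids 0 1 2 3` flag parses into a list with argparse. The `-1` default meaning "CPU only" is an assumption.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--workers", type=int, default=32)
parser.add_argument("--gpu-ids", type=int, nargs="+", default=[-1])  # -1: CPU only (assumed)
args = parser.parse_args(["--workers", "32", "--gpu-ids", "0", "1", "2", "3"])


def assign_gpus(num_workers, gpu_ids):
    """Spread worker ranks across the given GPU ids round-robin."""
    return [gpu_ids[rank % len(gpu_ids)] for rank in range(num_workers)]


assignment = assign_gpus(args.workers, args.gpu_ids)  # 8 workers per GPU
```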

Press Ctrl+C to end the training session properly.

Evaluation

To run a 100-episode Gym evaluation with a trained model:

python gym_eval.py --env Pong-v0 --num-episodes 100
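The "Best 100-episode avg" figures in the table above can be computed from a list of per-episode rewards. This sketch uses a hypothetical helper, `best_window_average`. It assumes the "±" value is the standard error of the mean over the best 100 consecutive episodes; the page does not say how that value was computed.

```python
import math


def best_window_average(episode_rewards, window=100):
    """Return (mean, standard error) of the best `window` consecutive episodes."""
    if len(episode_rewards) < window:
        raise ValueError("need at least `window` episodes")
    best = None
    for start in range(len(episode_rewards) - window + 1):
        chunk = episode_rewards[start:start + window]
        mean = sum(chunk) / window
        if best is None or mean > best[0]:
            var = sum((r - mean) ** 2 for r in chunk) / (window - 1)  # sample variance
            best = (mean, math.sqrt(var / window))                    # standard error of the mean
    return best


# e.g. 4 episodes scoring 20 and 96 scoring 21 -> about 20.96 ± 0.02
mean, stderr = best_window_average([20.0] * 4 + [21.0] * 96)
```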

Notice that BeamRiderNoFrameskip-v4 reaches scores over 50,000 in less than 2 hours of training. Compared with the Gym v0 version, this shows how difficult those versions are, and also that the time limit is a major factor in the score level.

These training charts were produced on a DGX Station using 4 GPUs and a 20-core CPU. I used 36 worker agents and a tau of 0.92, which is the lambda in the Generalized Advantage Estimation equation. I chose it to introduce more variance, given the more deterministic nature of an environment with just a 4-frame skip and a 0-30 NoOp start.
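Generalized Advantage Estimation computes the advantage as an exponentially weighted sum of one-step TD residuals, A_t = sum_l (gamma * lambda)^l * delta_{t+l}, where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t). Here tau plays the role of lambda. The sketch below is a plain-Python illustration with a hypothetical function name. It is not the repository's training loop.

```python
def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, tau=0.92):
    """Generalized Advantage Estimation over one rollout.

    values[t] = V(s_t); bootstrap_value = V(s_T), or 0.0 if the episode ended.
    tau = 1 recovers the n-step return minus V(s_t); tau = 0 gives one-step TD residuals.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = bootstrap_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD residual
        gae = delta + gamma * tau * gae                      # exponentially weighted sum of residuals
        advantages[t] = gae
        next_value = values[t]
    return advantages
```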

Project Reference

https://github.com/ikostrikov/pytorch-a3c

Comments

I just want to say your trained model has no effect

I tried to evaluate your trained model; however, it has no effect:

2017-08-01 21:08:13,757 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:13,757] reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:13,787] Starting new video recorder writing to /Volumes/xs/CodeSpace/AISpace/rl_space/rl_a3c_pytorch/Pong-v0_monitor/openaigym.video.0.33472.video000001.mp4
2017-08-01 21:08:24,947 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:24,947] reward sum: -21.0, reward mean: -21.0000
2017-08-01 21:08:35,054 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:35,054] reward sum: -21.0, reward mean: -21.0000
2017-08-01 21:08:44,732 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:44,732] reward sum: -21.0, reward mean: -21.0000

Also, the recorded videos are black and white and cannot be shown on screen.

opened by jinfagang 20

Normalize

In the NormalizedEnv, I am puzzled about why you chose alpha = 0.9999. If I want an unbiased mean, shouldn't alpha be (num_step - 1)/num_step?

I am also puzzled about why normalizing the environment observations has such a huge influence on performance. Can you explain? Thanks!
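The two update rules in the question can be compared directly. The sketch below uses hypothetical helpers and does not reproduce the repository's NormalizedEnv, which this page does not show. With a fixed alpha, the estimate is an exponential moving average. Dividing by (1 - alpha^n) removes the bias from starting at zero, which is presumably the "unbiased mean" the question refers to. Recent observations still get more weight. With alpha = (n - 1)/n at step n, the same update gives the exact cumulative mean, with every observation weighted equally.

```python
def running_mean_ema(xs, alpha=0.9999):
    """Exponential moving average with fixed decay alpha, with bias correction."""
    mean, steps = 0.0, 0
    out = []
    for x in xs:
        steps += 1
        mean = alpha * mean + (1 - alpha) * x
        out.append(mean / (1 - alpha ** steps))  # correct the bias from initializing at 0
    return out


def running_mean_cumulative(xs):
    """alpha = (n - 1) / n at step n gives the exact mean of the first n values."""
    mean = 0.0
    out = []
    for n, x in enumerate(xs, start=1):
        alpha = (n - 1) / n
        mean = alpha * mean + (1 - alpha) * x
        out.append(mean)
    return out
```

A fixed alpha close to 1 behaves almost like the cumulative mean early on, but it keeps adapting if the observation statistics drift during training.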

opened by yhcao6 10
Reward is always 0 when training Breakout-v0

I trained the model overnight on Breakout-v0, but the reward is always 0. What could cause this? Or could you tell me which parameters you used when training on Breakout-v0? Thank you. Here is the log file: log.txt

opened by NeymarL 8
Pretrained models

Hello, is it possible to get access to some of the pretrained models? (I am specifically looking for Seaquest, Pong and Space Invaders, but any or all would be brilliant.)

opened by lweitkamp 6
multi gpu support

When I run the program on multiple GPUs, that is, with gpu_id set to [0, 1, 2, 3], it reports the error "Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one."

opened by yhcao6 6
Performance of Breakout

Could I ask how long it takes to train Breakout from scratch to reach the desired score (859.57 for Breakout-v3)?

Have you tried BreakoutNoFrameskip? It is a version without action repetition and randomness.

Thanks!

opened by yhcao6 6
Solving time

Thank you for the nice implementation. I'm curious about the running time on your machine. https://github.com/ikostrikov/pytorch-a3c reports that PongDeterministic-v3 is solved in around 15 minutes. Did you reproduce similar results in any version of Pong?

Thank you

opened by hugemicrobe 6

Question on training function

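I noticed this in the action_train function in your player_util.py:

if self.done:
    # episode ended: reset the LSTM cell (cx) and hidden (hx) states to zeros
    if self.gpu_id >= 0:
        with torch.cuda.device(self.gpu_id):
            self.cx = Variable(torch.zeros(1, 512).cuda())
            self.hx = Variable(torch.zeros(1, 512).cuda())
    else:
        self.cx = Variable(torch.zeros(1, 512))
        self.hx = Variable(torch.zeros(1, 512))
else:
    # episode continues: rewrapping .data keeps the state values but drops
    # their autograd history, so no gradient flows back past this point
    self.cx = Variable(self.cx.data)
    self.hx = Variable(self.hx.data)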

But how can you backpropagate gradients through time over the past 20 steps if you set:

self.cx = Variable(self.cx.data)
self.hx = Variable(self.hx.data)

opened by yiwan-rl 5

Seaquest-v0 not training as well as announced

I have tried twice to train the agent on Seaquest-v0 with 32 workers on a server, but after 13 hours of training the score seems to be stuck at a maximum of 2700-2800.

Here's the log file: log.txt

I'm using gym 0.8.1 and atari-py 0.0.21 and left all the hyperparameters at their default values. Any idea why my score is much lower than yours (>50000)? Do you have the trained model for Seaquest-v0? Thanks!

opened by ThomasLecat 5
Quick question on batch processing

Thanks for the implementation! Great code.

I read and ran your code and noticed that it processes training examples with a batch_size of 1 rather than a large batch size (am I correct on this?). Is this by design because of your G-A3C? Larger batch sizes run faster on a GPU, so why batch_size=1?

Is there anything we can do to run it on large batches?

Thank you.

opened by hohoCode 4
Why one process run on 2 gpus?

First, thank you for your great work of a3c implementation.

I ran the code with python main.py --workers 1 --gpu-ids 5 and found that one process runs on 2 GPUs. The same thing happened when I ran with --workers 50. All the processes should run on GPU 5. However, all of these processes (same PID) also appear on GPU 0, with type C and smaller GPU memory usage than on GPU 5. How can I assign all the processes to GPU 5? Thank you very much!

opened by Jiankai-Sun 4
UserWarning: This overload of add_ is deprecated

Is it normal to get this traceback in the console? It spams a few dozen times and then stops abruptly. Then it starts logging the training session as intended. Sorry if this question is incredibly ignorant lol. I'm new to Python and the world of AI. Figured I'd post my question here before searching Google. Thanks in advance.

C:\Users\joshu\Documents\0AIFolder\00A3CA3Gatari\rl_a3c_pytorch-master\shared_optim.py:167: UserWarning: This overload of add_ is deprecated:
        add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
        add_(Tensor other, *, Number alpha) (Triggered internally at ..\torch\csrc\utils\python_arg_parser.cpp:882.)
  exp_avg.mul_(beta1).add_(1 - beta1, grad)

opened by josh-tegus 0
question about trained models

Just want to clarify: there is only one saved model per environment, and it is overwritten each training epoch, right? For example, MsPacman will only have one saved model, trained_models/MsPacman-v0.dat.

opened by enochkan 1
Question about Test function
Hi, I have a question about the following block:

https://github.com/dgriff777/rl_a3c_pytorch/blob/eb5c9b909abc02911b45e325f7a7c619d3b0fa46/test.py#L60

if player.done and not player.info:
    state = player.env.reset()
    player.eps_len += 2
    player.state = torch.from_numpy(state).float()
    if gpu_id >= 0:
        with torch.cuda.device(gpu_id):
            player.state = player.state.cuda()
elif player.info:

I don't quite understand when info is True or False. What does it mean to have info=True versus info=False?

I can't seem to find any documentation about this info flag on the Gym website :(

Thanks
opened by AlexTo 1


PyTorch implementation of Advantage async actor-critic Algorithms (A3C) in PyTorch

Advantage async actor-critic Algorithms (A3C) in PyTorch @inproceedings{mnih2016asynchronous, title={Asynchronous methods for deep reinforcement lea

111 Dec 8, 2022

Reinforcement learning library(framework) designed for PyTorch, implements DQN, DDPG, A2C, PPO, SAC, MADDPG, A3C, APEX, IMPALA ...

Automatic, Readable, Reusable, Extendable Machin is a reinforcement library designed for pytorch. Build status Platform Status Linux Windows Supported

348 Dec 24, 2022

Independent and minimal implementations of some reinforcement learning algorithms using PyTorch (including PPO, A3C, A2C, ...).

PyTorch RL Minimal Implementations There are implementations of some reinforcement learning algorithms, whose characteristics are as follow: Less pack

4 Dec 31, 2022

Implement A3C for Mujoco gym envs

pytorch-a3c-mujoco Disclaimer: my implementation right now is unstable (you can refer to the learning curve below), I'm not sure if it's my problems. A

70 Dec 12, 2022

Moon-patrol - A faithful recreation of the 1983 hit classic Moon Patrol for the Atari 2600 created using the Pygame library for Python

Moon Patrol A recreation of the hit Atari 2600 game, Moon Patrol Moon Patrol is

3 Apr 20, 2022

This is the source code of deeplabv3-plus-pytorch, which can be used to train your own model.

DeepLabv3+: Encoder-Decoder with Atrous Separable Convolution, an implementation of the semantic segmentation model in PyTorch. Contents: Performance, Environment, Attention, Download, Training steps

350 Dec 28, 2022

Face Recognition plus identification simply and fast | Python

PyFaceDetection Face Recognition plus identification simply and fast Ubuntu Setup sudo pip3 install numpy sudo pip3 install cmake sudo pip3 install dl

16 Sep 22, 2022

Enigma-Plus - Python based Enigma machine simulator with some extra features

Enigma-Plus Python based Enigma machine simulator with some extra features Examp

1 Jan 5, 2022

Multi-layer convolutional LSTM with Pytorch

Convolution_LSTM_pytorch Thanks for your attention. I haven't got time to maintain this repo for a long time. I recommend this repo which provides an

734 Jan 3, 2023

PyTorch implementation of the Quasi-Recurrent Neural Network - up to 16 times faster than NVIDIA's cuDNN LSTM

Quasi-Recurrent Neural Network (QRNN) for PyTorch Updated to support multi-GPU environments via DataParallel - see the the multigpu_dataparallel.py ex

1.3k Dec 28, 2022

Tree LSTM implementation in PyTorch

Tree-Structured Long Short-Term Memory Networks This is a PyTorch implementation of Tree-LSTM as described in the paper Improved Semantic Representati

529 Dec 10, 2022

LSTM and QRNN Language Model Toolkit for PyTorch

LSTM and QRNN Language Model Toolkit This repository contains the code used for two Salesforce Research papers: Regularizing and Optimizing LSTM Langu

1.9k Jan 8, 2023

LSTM model trained on a small dataset of 3000 names written in PyTorch

LSTM model trained on a small dataset of 3000 names. The model generates names by selecting one of the top 3 letters it suggests at each step, until an EOS (End Of Sentence) character is encountered.

1 Dec 20, 2021

Using LSTM write Tang poetry

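This tutorial introduces LSTMs through an example. By building and training an LSTM network, we will train a model to generate Tang poetry. The article explains the implementation in detail and clarifies how and why the model works. Not much specialized knowledge is required, but newcomers may need some time to understand what actually happens during model training. To save time, please use a GPU for training where possible.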

56 Dec 15, 2022

OHLC Average Prediction of Apple Inc. Using LSTM Recurrent Neural Network

Stock Price Prediction of Apple Inc. Using Recurrent Neural Network OHLC Average Prediction of Apple Inc. Using LSTM Recurrent Neural Network Dataset:

410 Jan 5, 2023

A resource for learning about deep learning techniques from regression to LSTM and Reinforcement Learning using financial data and the fitness functions of algorithmic trading

A tour through tensorflow with financial data I present several models ranging in complexity from simple regression to LSTM and policy networks. The s

195 Dec 7, 2022

Using multidimensional LSTM neural networks to create a forecast for Bitcoin price

Multidimensional LSTM BitCoin Time Series Using multidimensional LSTM neural networks to create a forecast for Bitcoin price. For notes around this co

318 Dec 14, 2022

Incorporating Transformer and LSTM to Kalman Filter with EM algorithm

Deep learning based state estimation: incorporating Transformer and LSTM to Kalman Filter with EM algorithm Overview Kalman Filter requires the true p

57 Dec 27, 2022