Pytorch implementation of Distributed Proximal Policy Optimization

Overview

Pytorch-DPPO

PyTorch implementation of Distributed Proximal Policy Optimization (https://arxiv.org/abs/1707.02286), using PPO with the clipped surrogate loss (from https://arxiv.org/pdf/1707.06347.pdf).

I finally fixed what was wrong with the gradient descent step: the update now uses the previous log-probabilities stored from the rollout batches. At least ppo.py is fixed; the rest will be corrected very soon as well.
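
As a rough illustration of what using the stored log-probabilities means, here is a minimal, generic sketch of the PPO clipped surrogate loss (this is not the exact code in ppo.py; the function and argument names are illustrative):

    import torch

    def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
        # ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed from
        # log-probabilities that were stored during the rollout (old_log_probs)
        ratio = torch.exp(new_log_probs - old_log_probs)
        surrogate = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # PPO maximizes the clipped surrogate, so the minimized loss is its negation
        return -torch.min(surrogate, clipped).mean()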

In the following examples I was not patient enough to wait for a million iterations; I just wanted to check that the model is learning properly:

Progress of single PPO:

InvertedPendulum
InvertedDoublePendulum
HalfCheetah
hopper (PyBullet)
halfcheetah (PyBullet)

Progress of DPPO (4 agents) [TODO]

Acknowledgments

The structure of this code is based on https://github.com/ikostrikov/pytorch-a3c.

Hyperparameters and loss computation have been taken from https://github.com/openai/baselines.

Comments
  • Old policy?

    Great work! I'm also working on a PPO implementation, but I don't quite see where π and π_old come from. Here you store the policy output when actually acting on the environment - if you stored this and retrieved it from memory, wouldn't it be the same as calculating it again in a batch, like you do here?

    You then construct a new policy, and calculate the new policy output here. I see that it is different because you load weights that have been updated by other processes, but in a synchronous setting, the weights wouldn't have been updated, and hence the policy outputs wouldn't be any different?
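
    For what it's worth, a hypothetical sketch of the distinction (the policy returning a torch.distributions object is an assumption here, not this repo's API): the rollout stores detached log-probabilities, which stay frozen as π_old even if the shared weights are updated by other workers before the batch update, whereas recomputing in a batch uses the current weights.

    import torch

    # hypothetical sketch, not this repo's code
    def collect_step(policy, state):
        dist = policy(state)                           # assumed to return a torch.distributions object
        action = dist.sample()
        old_log_prob = dist.log_prob(action).detach()  # frozen snapshot of pi_old
        return action, old_log_prob

    def prob_ratio(policy, states, actions, old_log_probs):
        # recomputed with the *current* weights, which may have changed since the rollout
        new_log_probs = policy(states).log_prob(actions)
        return torch.exp(new_log_probs - old_log_probs)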

    opened by Kaixhin 5
  • Loss questions

    I just went through your code and the PPO paper and have a few questions; perhaps if you have time you could comment.

    • First off, nice work. The code is easy to read with lots of comments and full variable names, it made it easy for me to read through and (partially) understand.
    • ~~Should you subtract loss_value or add it? We want each individual loss term to make the overall loss larger as it gets larger. I can see you copied baselines exactly, but maybe they have it wrong; in the PPO paper they have a minus on one term (eq. 9). If you stop the training and inspect loss_clip and loss_value, the first is negative and the second is positive. So it seems like we need loss = loss_value - loss_clip. Thoughts?~~ (See the loss-composition sketch after the model code below.)
    • What's log_std? Is that an exploration parameter set by the model?
    • Do we need loss_value? In the PPO paper they say that if we don't have shared parameters between the policy and value function then it's not needed (first paragraph of section 5), and your example model doesn't share parameters. An example of one that does share them is in baselines, and it could halve your model parameters, e.g.:
    import torch
    import torch.nn as nn

    class Model(nn.Module):
        def __init__(self, num_inputs, num_outputs):
            super(Model, self).__init__()
            h_size_1 = 100
            h_size_2 = 100
            self.fc1 = nn.Linear(num_inputs, h_size_1)
            self.fc2 = nn.Linear(h_size_1, h_size_2)
            self.mu = nn.Linear(h_size_2, num_outputs)
            # learned, state-independent log standard deviation of the Gaussian policy
            self.log_std = nn.Parameter(torch.zeros(num_outputs))
            self.v = nn.Linear(h_size_2, 1)
            for name, p in self.named_parameters():
                # init parameters
                if 'bias' in name:
                    p.data.fill_(0)
                # optional: normalized init for the policy head
                # if 'mu.weight' in name:
                #     p.data.normal_()
                #     p.data /= torch.sum(p.data**2, 0).expand_as(p.data)
            # mode
            self.train()

        def forward(self, inputs):
            # actor: shared trunk feeds the action mean
            x = torch.tanh(self.fc1(inputs))
            h = torch.tanh(self.fc2(x))
            mu = self.mu(h)
            # std = exp(log_std), broadcast to the batch shape of mu
            std = torch.exp(self.log_std).unsqueeze(0).expand_as(mu)
            # critic: value head shares the trunk with the actor
            v = self.v(h)
            return mu, std, v
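
    For the sign and value-loss questions above, a hedged sketch of how the three terms are usually combined, following eq. (9) of the PPO paper (the coefficients c1 and c2 are illustrative defaults, not values taken from this repo):

    import torch.nn.functional as F

    def ppo_total_loss(loss_clip, values, returns, entropy, c1=0.5, c2=0.01):
        # loss_clip here denotes the surrogate objective L_clip (to be maximized);
        # eq. (9) maximizes L_clip - c1 * L_vf + c2 * S, so minimizing its negation gives:
        loss_value = F.mse_loss(values, returns)  # L_vf, only needed with a shared trunk
        return -loss_clip + c1 * loss_value - c2 * entropy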
    
    opened by wassname 3
  • Failed in more complex environment

    Thanks for sharing the code. I tested that the code works in InvertedPendulum-v1.

    But when I changed the environment to Ant-v1 without changing any other parameters, the agent seems to fail to learn, as shown below. Do I need to change some parameters?

    Time 00h 01m 01s, episode reward -3032.25671304008, episode length 1000
    Time 00h 02m 01s, episode reward -99.15254692012928, episode length 25
    Time 00h 03m 01s, episode reward -41.27665909454931, episode length 14
    Time 00h 04m 01s, episode reward -39.077425184658665, episode length 17
    Time 00h 05m 02s, episode reward -136.60746428384076, episode length 45
    Time 00h 06m 02s, episode reward -111.40062667574634, episode length 40
    Time 00h 07m 02s, episode reward -516.1070385678166, episode length 169
    Time 00h 08m 02s, episode reward -129.64627338344073, episode length 42
    Time 00h 09m 02s, episode reward -146.55425861577797, episode length 45
    Time 00h 10m 03s, episode reward -253.41361049200614, episode length 86
    Time 00h 11m 03s, episode reward -108.6953450777496, episode length 38
    Time 00h 12m 03s, episode reward -64.66194807902957, episode length 16
    Time 00h 13m 03s, episode reward -33.51695185844647, episode length 11
    Time 00h 14m 03s, episode reward -86.88904449639067, episode length 35
    Time 00h 15m 03s, episode reward -78.48049851223362, episode length 23
    Time 00h 16m 03s, episode reward -165.73681903021165, episode length 61
    Time 00h 17m 04s, episode reward -155.3555664457943, episode length 60
    Time 00h 18m 04s, episode reward -57.65249942070945, episode length 20
    Time 00h 19m 04s, episode reward -392.10161323743887, episode length 109
    Time 00h 20m 04s, episode reward -55.63287075930159, episode length 12
    Time 00h 21m 04s, episode reward -81.0448173961397, episode length 29
    Time 00h 22m 04s, episode reward -149.84827826419726, episode length 52
    Time 00h 23m 04s, episode reward -398.0365800924663, episode length 22
    Time 00h 24m 05s, episode reward -1948.6136580594682, episode length 17
    Time 00h 25m 05s, episode reward -18719.08471382285, episode length 51
    Time 00h 26m 06s, episode reward -805145.8854457787, episode length 1000
    Time 00h 27m 06s, episode reward -17008.04843510176, episode length 17
    Time 00h 28m 07s, episode reward -168769.34038655, episode length 129
    Time 00h 29m 07s, episode reward -104933.08883886453, episode length 79
    Time 00h 30m 07s, episode reward -22809.687035617088, episode length 17
    Time 00h 31m 07s, episode reward -46398.71530676861, episode length 37
    Time 00h 32m 07s, episode reward -18513.064083079746, episode length 15
    Time 00h 33m 07s, episode reward -21329.411481710402, episode length 15
    Time 00h 34m 09s, episode reward -1393903.341478124, episode length 1000
    Time 00h 35m 10s, episode reward -1374988.6133415946, episode length 1000
    Time 00h 36m 10s, episode reward -33792.40522011441, episode length 28
    Time 00h 37m 10s, episode reward -20629.94697013807, episode length 16
    Time 00h 38m 10s, episode reward -39780.93399623488, episode length 29
    Time 00h 39m 10s, episode reward -61722.81635309537, episode length 47
    Time 00h 40m 10s, episode reward -46780.12455378964, episode length 36
    Time 00h 41m 10s, episode reward -91640.36757206521, episode length 73
    Time 00h 42m 11s, episode reward -77137.71004513587, episode length 63
    Time 00h 43m 11s, episode reward -15184.611248485926, episode length 10
    Time 00h 44m 11s, episode reward -26995.023495691694, episode length 20
    Time 00h 45m 11s, episode reward -110371.66228435331, episode length 81
    Time 00h 46m 11s, episode reward -55639.738879114084, episode length 41
    Time 00h 47m 11s, episode reward -53735.2616539847, episode length 39
    Time 00h 48m 11s, episode reward -60755.49631228513, episode length 43
    Time 00h 49m 11s, episode reward -29466.664499076247, episode length 23
    Time 00h 50m 12s, episode reward -48580.31395829051, episode length 37
    Time 00h 51m 12s, episode reward -128957.8903571858, episode length 99
    Time 00h 52m 12s, episode reward -70144.76359014906, episode length 51
    Time 00h 53m 12s, episode reward -29271.097255889938, episode length 21
    Time 00h 54m 12s, episode reward -21737.6644599086, episode length 17
    Time 00h 55m 12s, episode reward -27549.40889570978, episode length 20
    Time 00h 56m 12s, episode reward -97097.66966694668, episode length 77
    Time 00h 57m 13s, episode reward -18384.51761876518, episode length 14
    Time 00h 58m 13s, episode reward -28424.585660954337, episode length 22
    Time 00h 59m 13s, episode reward -96267.24448946006, episode length 72
    Time 01h 00m 13s, episode reward -79794.54738721657, episode length 60
    Time 01h 01m 13s, episode reward -88486.88046448736, episode length 64
    Time 01h 02m 13s, episode reward -31071.50782185118, episode length 24
    Time 01h 03m 13s, episode reward -53608.97197643964, episode length 38
    Time 01h 04m 14s, episode reward -38451.031800392186, episode length 27
    Time 01h 05m 14s, episode reward -27645.787896926682, episode length 20
    
    opened by kkjh0723 1
  • one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 1]], which is output 0 of TBackward, is at version 3; expected version 2 instead.

    env: torch 1.8.1+cu111

    Error:

    UserWarning: Error detected in AddmmBackward. Traceback of forward call that caused the error:
      File "<string>", line 1, in <module>
      File "E:\A\envs\gym\lib\multiprocessing\spawn.py", line 105, in spawn_main
        exitcode = _main(fd)
      File "E:\A\envs\gym\lib\multiprocessing\spawn.py", line 118, in _main
        return self._bootstrap()
      File "E:\A\envs\gym\lib\multiprocessing\process.py", line 297, in _bootstrap
        self.run()
      File "E:\A\envs\gym\lib\multiprocessing\process.py", line 99, in run
        self._target(*self._args, **self._kwargs)
      File "Pytorch-RL\Pytorch-DPPO-master\train.py", line 155, in train
        mu_old, sigma_sq_old, v_pred_old = model_old(batch_states)
      File "E:\A\envs\gym\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "Pytorch-DPPO-master\model.py", line 53, in forward
        v1 = self.v(x3)
      File "E:\A\envs\gym\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "E:\A\envs\gym\lib\site-packages\torch\nn\modules\linear.py", line 94, in forward
        return F.linear(input, self.weight, self.bias)
      File "E:\A\envs\gym\lib\site-packages\torch\nn\functional.py", line 1753, in linear
        return torch._C._nn.linear(input, weight, bias)
      (Triggered internally at ..\torch\csrc\autograd\python_anomaly_mode.cpp:104.)
      allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    Process Process-4:
    Traceback (most recent call last):
      File "E:\A\envs\gym\lib\multiprocessing\process.py", line 297, in _bootstrap
        self.run()
      File "E:\A\envs\gym\lib\multiprocessing\process.py", line 99, in run
        self._target(*self._args, **self._kwargs)
      File "Pytorch-DPPO-master\train.py", line 197, in train
        total_loss.backward(retain_graph=True)
      File "E:\A\envs\gym\lib\site-packages\torch\tensor.py", line 245, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "E:\A\envs\gym\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
        allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 1]], which is output 0 of TBackward, is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

    I googled it and some say it's caused by an in-place op, but I can't seem to find any. I haven't tried downgrading the torch version, but is there a solution that doesn't require downgrading?
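
    For context, this error is raised when a tensor that the autograd graph saved for backward is modified in place (for example by an optimizer step or a weight copy) before backward runs. A minimal, generic reproduction of this class of error (not this repo's code) is:

    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(3, 100), nn.Tanh(), nn.Linear(100, 1))
    opt = torch.optim.SGD(net.parameters(), lr=0.1)

    out = net(torch.randn(8, 3))       # the last Linear saves its (transposed) weight for backward
    out.sum().backward(retain_graph=True)
    opt.step()                         # in-place weight update bumps the saved tensor's version counter
    out.mean().backward()              # RuntimeError: ... modified by an inplace operation

    So one direction to look, without downgrading torch, is whether the model's weights are updated in place between building the graph and calling total_loss.backward(retain_graph=True); recomputing the forward pass after each update avoids backpropagating through a stale graph.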

    opened by TJ2333 2
  • Question on algorithm itself

    Usually PPO is used for continuous actions, but for OpenAI Five, shouldn't the actions be discrete? What's the technique that makes PPO applicable to Dota 2 actions?
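
    For what it's worth, PPO itself only needs the log-probability (and entropy) of the sampled action, so a discrete action space just replaces the Gaussian policy head with a categorical one; the clipped ratio is computed exactly the same way. A hypothetical sketch, not this repo's code:

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    class DiscretePolicy(nn.Module):
        def __init__(self, num_inputs, num_actions, hidden=64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(num_inputs, hidden), nn.Tanh())
            self.logits = nn.Linear(hidden, num_actions)

        def forward(self, states):
            # Categorical plays the role that the Gaussian (mu, std) plays for continuous control;
            # dist.log_prob(actions) feeds the same probability ratio as in the continuous case
            return Categorical(logits=self.logits(self.body(states)))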

    opened by QiXuanWang 2
  • average gradients to update global theta?

    Thanks for the nice implementation in PyTorch, which made it easier for me to learn.

    Regarding the chief.py implementation, I have a question about the updates to the global weights. The algorithm pseudocode in the paper seems to use the averaged gradients from the workers to update the global weights, but chief.py looks like it uses the sum of the workers' gradients? Thanks.
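
    For illustration, a hypothetical sketch (not the actual chief.py code) of the difference: with gradients gathered from N workers, dividing the sum by N gives the averaged update from the pseudocode, while keeping the plain sum differs only by a constant factor that plain SGD absorbs into the learning rate.

    # hypothetical sketch, not chief.py: worker_grads is a list (one entry per worker)
    # of per-parameter gradient lists gathered from the workers
    def apply_worker_gradients(global_model, worker_grads, optimizer, average=True):
        n = len(worker_grads)
        for param, grads in zip(global_model.parameters(), zip(*worker_grads)):
            total = sum(grads)
            param.grad = total / n if average else total
        optimizer.step()
        optimizer.zero_grad()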

    Cheng

    opened by weicheng113 8
  • on advantages

    After testing your PPO and comparing it with another implementation, I think your advantages need to be normalized: (advantages - advantages.mean()) / advantages.std(). For your reference.
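
    For reference, a minimal sketch of the suggested normalization (adding a small epsilon to guard against a zero standard deviation is an extra detail, not part of the original suggestion):

    import torch

    def normalize_advantages(advantages, eps=1e-8):
        # standardize the advantage batch to zero mean and unit variance
        return (advantages - advantages.mean()) / (advantages.std() + eps)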

    opened by cn3c3p 1
Owner

Alexis David Jacq