Pytorch-DPPO

Overview

PyTorch implementation of Distributed Proximal Policy Optimization (DPPO): https://arxiv.org/abs/1707.02286, using PPO with the clipped surrogate loss (from https://arxiv.org/pdf/1707.06347.pdf).

I finally fixed what was wrong with the gradient descent step: it now uses the previous log-probabilities stored from the rollout batches. For now only ppo.py is fixed; the rest will be corrected very soon.
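
A minimal sketch of the clipped-surrogate update this refers to, where the old log-probabilities come from the rollout batch and are treated as constants (variable names such as log_probs_new, log_probs_old, advantages and clip_eps are illustrative, not the repository's actual ones):

    import torch

    def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
        # Probability ratio between the current policy and the rollout policy;
        # the stored log-probs are detached so no gradient flows through them.
        ratio = torch.exp(log_probs_new - log_probs_old.detach())
        surrogate = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # The negative sign turns the clipped objective (to maximize) into a loss (to minimize).
        return -torch.min(surrogate, clipped).mean()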

In the following examples I was not patient enough to wait for millions of iterations; I just wanted to check whether the model learns properly:

Progress of single PPO:

InvertedPendulum

InvertedDoublePendulum

HalfCheetah

hopper (PyBullet)

halfcheetah (PyBullet)

Progress of DPPO (4 agents) [TODO]

Acknowledgments

The structure of this code is based on https://github.com/ikostrikov/pytorch-a3c.

Hyperparameters and the loss computation have been taken from https://github.com/openai/baselines.

Comments
  • Old policy?

    Great work! I'm also working on a PPO implementation, but I don't quite see where π and π_old come from. Here you store the policy output when actually acting on the environment; if you stored this and retrieved it from memory, wouldn't it be the same as calculating it again in a batch like you do here?

    You then construct a new policy, and calculate the new policy output here. I see that it is different because you load weights that have been updated by other processes, but in a synchronous setting, the weights wouldn't have been updated, and hence the policy outputs wouldn't be any different?
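
    For what it's worth, a minimal sketch of the "store it at rollout time" variant described above, assuming a model that returns (mu, std, value) and a plain list as memory (both assumptions, not this repository's actual interfaces):

    import torch
    from torch.distributions import Normal

    def select_action(model, state, memory):
        # Sample an action under the current policy and store its log-prob,
        # so it can later be reused as the "old" log-prob without recomputation.
        mu, std, _value = model(state)
        dist = Normal(mu, std)
        action = dist.sample()
        memory.append((state, action, dist.log_prob(action).sum(-1).detach()))
        return action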

    opened by Kaixhin 5
  • Loss questions

    I just went through your code and the PPO paper and have a few questions; perhaps if you have time you could comment.

    • First off, nice work. The code is easy to read, with lots of comments and full variable names, which made it easy for me to read through and (partially) understand.
    • ~~Should you subtract loss_value or add it? We want each individual loss to make the overall loss larger as it gets larger. I can see you copied baselines exactly, but maybe they have it wrong; in the PPO paper they have a minus on one term (eq. 9). If you stop the training and inspect loss_clip and loss_value, the first is negative and the second is positive, so it seems like we need loss = loss_value - loss_clip. Thoughts?~~
    • What's log_std? Is that an exploration parameter set by the model?
    • Do we need loss_value? In the PPO paper they say that if we don't share parameters between the policy and value function then it's not needed (first paragraph of Section 5), and your example model doesn't share parameters. An example of one that does is in baselines, and it could halve your model parameters, e.g.:
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Model(nn.Module):
        def __init__(self, num_inputs, num_outputs):
            super(Model, self).__init__()
            h_size_1 = 100
            h_size_2 = 100
            # shared body for actor and critic
            self.fc1 = nn.Linear(num_inputs, h_size_1)
            self.fc2 = nn.Linear(h_size_1, h_size_2)
            # actor head (mean) and state-independent log std
            self.mu = nn.Linear(h_size_2, num_outputs)
            self.log_std = nn.Parameter(torch.zeros(num_outputs))
            # critic head
            self.v = nn.Linear(h_size_2, 1)
            for name, p in self.named_parameters():
                # init parameters
                if 'bias' in name:
                    p.data.fill_(0)
                # if 'mu.weight' in name:
                #     p.data.normal_()
                #     p.data /= torch.sum(p.data**2, 0).expand_as(p.data)
            # mode
            self.train()

        def forward(self, inputs):
            # actor
            x = F.tanh(self.fc1(inputs))
            h = F.tanh(self.fc2(x))
            mu = self.mu(h)
            # note: this is std = exp(log_std), broadcast to the batch
            std = torch.exp(self.log_std).unsqueeze(0).expand_as(mu)
            # critic
            v = self.v(h)
            return mu, std, v
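
    Regarding the sign question in the first bullet above, a hedged sketch of the combined objective from eq. (9) of the PPO paper, written as a quantity to minimize (the coefficients c1 and c2 are illustrative defaults, not necessarily what this repository or baselines uses):

    def total_ppo_loss(policy_loss, value_loss, entropy, c1=0.5, c2=0.01):
        # The paper maximizes  L_clip - c1 * L_value + c2 * S[pi];
        # with policy_loss = -L_clip and value_loss = L_value (e.g. MSE),
        # the quantity handed to a minimizing optimizer is:
        return policy_loss + c1 * value_loss - c2 * entropy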
    
    opened by wassname 3
  • Failed in more complex environment

    Thanks for sharing the code. I tested that the code works in InvertedPendulum-v1.

    But when I changed the environment to Ant-v1 without changing any other parameters, it seems the agent fails to learn, as shown below. Do I need to change some parameters?

    Time 00h 01m 01s, episode reward -3032.25671304008, episode length 1000
    Time 00h 02m 01s, episode reward -99.15254692012928, episode length 25
    Time 00h 03m 01s, episode reward -41.27665909454931, episode length 14
    Time 00h 04m 01s, episode reward -39.077425184658665, episode length 17
    Time 00h 05m 02s, episode reward -136.60746428384076, episode length 45
    Time 00h 06m 02s, episode reward -111.40062667574634, episode length 40
    Time 00h 07m 02s, episode reward -516.1070385678166, episode length 169
    Time 00h 08m 02s, episode reward -129.64627338344073, episode length 42
    Time 00h 09m 02s, episode reward -146.55425861577797, episode length 45
    Time 00h 10m 03s, episode reward -253.41361049200614, episode length 86
    Time 00h 11m 03s, episode reward -108.6953450777496, episode length 38
    Time 00h 12m 03s, episode reward -64.66194807902957, episode length 16
    Time 00h 13m 03s, episode reward -33.51695185844647, episode length 11
    Time 00h 14m 03s, episode reward -86.88904449639067, episode length 35
    Time 00h 15m 03s, episode reward -78.48049851223362, episode length 23
    Time 00h 16m 03s, episode reward -165.73681903021165, episode length 61
    Time 00h 17m 04s, episode reward -155.3555664457943, episode length 60
    Time 00h 18m 04s, episode reward -57.65249942070945, episode length 20
    Time 00h 19m 04s, episode reward -392.10161323743887, episode length 109
    Time 00h 20m 04s, episode reward -55.63287075930159, episode length 12
    Time 00h 21m 04s, episode reward -81.0448173961397, episode length 29
    Time 00h 22m 04s, episode reward -149.84827826419726, episode length 52
    Time 00h 23m 04s, episode reward -398.0365800924663, episode length 22
    Time 00h 24m 05s, episode reward -1948.6136580594682, episode length 17
    Time 00h 25m 05s, episode reward -18719.08471382285, episode length 51
    Time 00h 26m 06s, episode reward -805145.8854457787, episode length 1000
    Time 00h 27m 06s, episode reward -17008.04843510176, episode length 17
    Time 00h 28m 07s, episode reward -168769.34038655, episode length 129
    Time 00h 29m 07s, episode reward -104933.08883886453, episode length 79
    Time 00h 30m 07s, episode reward -22809.687035617088, episode length 17
    Time 00h 31m 07s, episode reward -46398.71530676861, episode length 37
    Time 00h 32m 07s, episode reward -18513.064083079746, episode length 15
    Time 00h 33m 07s, episode reward -21329.411481710402, episode length 15
    Time 00h 34m 09s, episode reward -1393903.341478124, episode length 1000
    Time 00h 35m 10s, episode reward -1374988.6133415946, episode length 1000
    Time 00h 36m 10s, episode reward -33792.40522011441, episode length 28
    Time 00h 37m 10s, episode reward -20629.94697013807, episode length 16
    Time 00h 38m 10s, episode reward -39780.93399623488, episode length 29
    Time 00h 39m 10s, episode reward -61722.81635309537, episode length 47
    Time 00h 40m 10s, episode reward -46780.12455378964, episode length 36
    Time 00h 41m 10s, episode reward -91640.36757206521, episode length 73
    Time 00h 42m 11s, episode reward -77137.71004513587, episode length 63
    Time 00h 43m 11s, episode reward -15184.611248485926, episode length 10
    Time 00h 44m 11s, episode reward -26995.023495691694, episode length 20
    Time 00h 45m 11s, episode reward -110371.66228435331, episode length 81
    Time 00h 46m 11s, episode reward -55639.738879114084, episode length 41
    Time 00h 47m 11s, episode reward -53735.2616539847, episode length 39
    Time 00h 48m 11s, episode reward -60755.49631228513, episode length 43
    Time 00h 49m 11s, episode reward -29466.664499076247, episode length 23
    Time 00h 50m 12s, episode reward -48580.31395829051, episode length 37
    Time 00h 51m 12s, episode reward -128957.8903571858, episode length 99
    Time 00h 52m 12s, episode reward -70144.76359014906, episode length 51
    Time 00h 53m 12s, episode reward -29271.097255889938, episode length 21
    Time 00h 54m 12s, episode reward -21737.6644599086, episode length 17
    Time 00h 55m 12s, episode reward -27549.40889570978, episode length 20
    Time 00h 56m 12s, episode reward -97097.66966694668, episode length 77
    Time 00h 57m 13s, episode reward -18384.51761876518, episode length 14
    Time 00h 58m 13s, episode reward -28424.585660954337, episode length 22
    Time 00h 59m 13s, episode reward -96267.24448946006, episode length 72
    Time 01h 00m 13s, episode reward -79794.54738721657, episode length 60
    Time 01h 01m 13s, episode reward -88486.88046448736, episode length 64
    Time 01h 02m 13s, episode reward -31071.50782185118, episode length 24
    Time 01h 03m 13s, episode reward -53608.97197643964, episode length 38
    Time 01h 04m 14s, episode reward -38451.031800392186, episode length 27
    Time 01h 05m 14s, episode reward -27645.787896926682, episode length 20
    
    opened by kkjh0723 1
  • one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 1]], which is output 0 of TBackward, is at version 3; expected version 2 instead.

    env: torch 1.8.1+cu111

    Error:

    UserWarning: Error detected in AddmmBackward. Traceback of forward call that caused the error:
      File "<string>", line 1, in <module>
      File "E:\A\envs\gym\lib\multiprocessing\spawn.py", line 105, in spawn_main
        exitcode = _main(fd)
      File "E:\A\envs\gym\lib\multiprocessing\spawn.py", line 118, in _main
        return self._bootstrap()
      File "E:\A\envs\gym\lib\multiprocessing\process.py", line 297, in _bootstrap
        self.run()
      File "E:\A\envs\gym\lib\multiprocessing\process.py", line 99, in run
        self._target(*self._args, **self._kwargs)
      File "Pytorch-RL\Pytorch-DPPO-master\train.py", line 155, in train
        mu_old, sigma_sq_old, v_pred_old = model_old(batch_states)
      File "E:\A\envs\gym\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "Pytorch-DPPO-master\model.py", line 53, in forward
        v1 = self.v(x3)
      File "E:\A\envs\gym\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "E:\A\envs\gym\lib\site-packages\torch\nn\modules\linear.py", line 94, in forward
        return F.linear(input, self.weight, self.bias)
      File "E:\A\envs\gym\lib\site-packages\torch\nn\functional.py", line 1753, in linear
        return torch._C._nn.linear(input, weight, bias)
    (Triggered internally at ..\torch\csrc\autograd\python_anomaly_mode.cpp:104.)
      allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag

    Process Process-4:
    Traceback (most recent call last):
      File "E:\A\envs\gym\lib\multiprocessing\process.py", line 297, in _bootstrap
        self.run()
      File "E:\A\envs\gym\lib\multiprocessing\process.py", line 99, in run
        self._target(*self._args, **self._kwargs)
      File "Pytorch-DPPO-master\train.py", line 197, in train
        total_loss.backward(retain_graph=True)
      File "E:\A\envs\gym\lib\site-packages\torch\tensor.py", line 245, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "E:\A\envs\gym\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
        allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 1]], which is output 0 of TBackward, is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

    I googled and some say it's caused by an in-place op, but I can't seem to find any in the code. I haven't tried downgrading the torch version; is there a solution that doesn't require downgrading?
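
    For context, a generic illustration of this error class (not this repository's code): the autograd engine raises it whenever a tensor saved for the backward pass is modified in place before backward() runs.

    import torch

    x = torch.randn(3, requires_grad=True)
    y = torch.sigmoid(x)   # sigmoid saves its output for the backward pass
    y.add_(1.0)            # in-place modification of that saved tensor
    y.sum().backward()     # raises: "... modified by an inplace operation"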

    opened by TJ2333 2
  • Question on algorithm itself

    Usually PPO is used for continuous actions, but for OpenAI Five, shouldn't the actions be discrete? What's the technique that makes PPO applicable to Dota 2 actions?
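
    For reference, PPO itself is not limited to continuous actions; the usual technique is to replace the Gaussian policy head with a categorical one. A hedged sketch (class and parameter names are illustrative, not OpenAI Five's or this repository's code):

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    class DiscretePolicy(nn.Module):
        def __init__(self, num_inputs, num_actions, hidden=64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(num_inputs, hidden), nn.Tanh())
            self.logits = nn.Linear(hidden, num_actions)

        def act(self, state):
            # Sample a discrete action; its log-prob feeds the same PPO ratio
            # as in the continuous case.
            dist = Categorical(logits=self.logits(self.body(state)))
            action = dist.sample()
            return action, dist.log_prob(action)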

    opened by QiXuanWang 2
  • average gradients to update global theta?

    Thanks for the nice implementation in PyTorch, which made it easier for me to learn.

    Regarding the chief.py implementation, I have a question about the updates to the global weights. From the algorithm pseudocode in the paper, it seems to use the averaged gradients from the workers to update the global weights, but chief.py looks like it uses the sum of the workers' gradients (see the sketch below). Thanks.

    Cheng
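
    A minimal sketch of the averaged-gradient variant the question refers to (illustrative names, not chief.py's actual code); note that for a fixed number of workers, the sum and the average differ only by a constant factor that could be absorbed into the learning rate:

    def apply_averaged_gradients(global_model, worker_grads, optimizer):
        # worker_grads: one list of gradient tensors per worker, in parameter order.
        num_workers = len(worker_grads)
        optimizer.zero_grad()
        for param, grads in zip(global_model.parameters(), zip(*worker_grads)):
            param.grad = sum(grads) / num_workers
        optimizer.step()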

    opened by weicheng113 8
  • on advantages

    After testing your PPO and comparing it with another implementation, I think your advantages need to be normalized: (advantages - advantages.mean()) / advantages.std(). For your reference.
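
    In code, the suggested per-batch normalization would look like the line below (the small epsilon is an added assumption for numerical stability, not part of the original suggestion):

    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)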

    opened by cn3c3p 1
Owner

Alexis David Jacq