PyTorch implementation of Trust Region Policy Optimization

Overview

PyTorch implementation of TRPO

Try my implementation of PPO (a newer, better variant of TRPO), unless you need to use TRPO for some specific reason.

This is a PyTorch implementation of "Trust Region Policy Optimization (TRPO)".

This code is mostly ported from the original implementation by John Schulman. In contrast to another PyTorch implementation of TRPO, this implementation uses an exact Hessian-vector product instead of a finite-difference approximation.
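The exact product is obtained by differentiating the mean KL divergence twice (double backprop) rather than by perturbing the parameters. Below is a minimal sketch of the idea, not the repository's exact code; it assumes a policy module that returns the mean and log-std of a diagonal Gaussian, and a flat vector v with as many entries as there are policy parameters. It also shows why a get_kl()-style function compares the policy with a detached copy of itself: the KL value is always 0 at the current parameters, but its Hessian is the Fisher information matrix that conjugate gradient needs.

    import torch

    def gaussian_kl(mean_new, log_std_new, mean_old, log_std_old):
        # KL(old || new) for diagonal Gaussians, summed over action
        # dimensions and averaged over states.
        std_old = log_std_old.exp()
        std_new = log_std_new.exp()
        kl = (log_std_new - log_std_old
              + (std_old.pow(2) + (mean_old - mean_new).pow(2)) / (2.0 * std_new.pow(2))
              - 0.5)
        return kl.sum(dim=1).mean()

    def fisher_vector_product(policy, states, v, damping=0.1):
        # KL of the current policy against a detached copy of itself:
        # its value is 0, but its Hessian w.r.t. the parameters is the
        # Fisher matrix.
        mean, log_std = policy(states)
        kl = gaussian_kl(mean, log_std, mean.detach(), log_std.detach())

        params = list(policy.parameters())
        grads = torch.autograd.grad(kl, params, create_graph=True)
        flat_grad = torch.cat([g.reshape(-1) for g in grads])

        # Backpropagating through (grad_kl . v) yields H v exactly
        # (Pearlmutter's trick) instead of a finite-difference estimate.
        grad_v = (flat_grad * v).sum()
        hvp = torch.autograd.grad(grad_v, params)
        flat_hvp = torch.cat([h.reshape(-1) for h in hvp])

        # Damping keeps the product positive definite and well conditioned,
        # which conjugate gradient relies on.
        return flat_hvp + damping * v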

Contributions

Contributions are very welcome. If you know how to make this code better, don't hesitate to send a pull request.

Usage

python main.py --env-name "Reacher-v1"

Recommended hyperparameters

InvertedPendulum-v1: 5000

Reacher-v1, InvertedDoublePendulum-v1: 15000

HalfCheetah-v1, Hopper-v1, Swimmer-v1, Walker2d-v1: 25000

Ant-v1, Humanoid-v1: 50000
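These numbers appear to be the per-update batch size, i.e. how many environment steps are collected before each TRPO update. Assuming main.py exposes this as a --batch-size argument (check its argument parser if your copy differs), a run on HalfCheetah would look like:

    python main.py --env-name "HalfCheetah-v1" --batch-size 25000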

Results

More or less similar to the original code. Coming soon.

Todo

  • Plots.
  • Collect data in multiple threads.
Comments
  • What is get_kl() doing in main.py?

    Hi. Thanks for publishing this implementation of TRPO.

    I have a question about get_kl().

    I thought get_kl() was supposed to calculate the KL divergence between the old policy and the new policy, but this get_kl() seems to always return 0.

    Also, I do not see the KL-constraining part in the parameter-update process.

    Is this code a modification of TRPO, or do I have some misunderstanding?

    Thanks,

    opened by jtoyama4 11
  • doc?

    For example, imagine I have my own policy that takes in a state and outputs an action (or perhaps a distribution over actions), and I have a world that takes an action and returns a reward and a new state. How would I plug these into this TRPO implementation?

    opened by hughperkins 4
  • How to modify the code for discrete actions?

    Hi, thanks once again for implementing a really interesting algorithm in PyTorch :+1: ,

    I was wondering how to modify the code so it can be used for environments with discrete actions (say, CartPole as in the other PyTorch TRPO implementation, or maybe even Atari games)?

    opened by AjayTalati 3
  • What and When to send on the GPU?

    I'm new to PyTorch and am having a hard time getting used to handling variables properly on the CPU and GPU. As we are calculating our own losses here, I am having trouble understanding what to send to the device (GPU) and when. I would really appreciate an explanation of how to go about this. The code is quite well written and easy to understand, by the way.

    opened by prathamesh0 1
  • What is volatile=True for?

    Dear author: I found your code very helpful. However, I have trouble reading the following code:

    https://github.com/ikostrikov/pytorch-trpo/blob/eb26e29ed75b7c7b46b0c717331cc7488ab16b8d/main.py#L111

    I wonder about the usage of the volatile flag. I want to know when you set volatile to True/False.

    opened by dragen1860 1
  • compute the Fisher-Vector Product

    Hello, I want to ask about line 67 in your trpo.py: you get two terms, and in the TRPO paper it is said that the second term vanishes? You also add v * damping; I guess its function is to ensure positive definiteness? Could you explain it in detail? Also, for line 117 in your main.py, could you explain in detail why this approximates the average KL? Thank you very much!

    opened by ghost 1
  • other env

    Hello, I notice your code is about MuJoCo, and I wonder how to modify it to fit other environments; I have tried but failed. Thanks a lot! ikostrikov, thanks very much: I have since tried one classic-control continuous game, "MountainCarContinuous-v0", and it succeeds.

    opened by ghost 1
  • Is the get_kl() function correct?

    Thanks for your great code! I notice that in the function get_kl(), you use the policy net to generate the mean, log_std and std, then copy these three parameters and calculate the KL divergence between the original parameters and the copied parameters, which is obviously zero all the time. Is this a bug or an intended behavior?

    opened by zzzxxxttt 0
  • I don't know what "neggdotstepdir" is for. Thanks!!!

    Thank you very much for the code you provided! I learned a lot from it. I would like to ask what the function of these lines of code is; is there any mathematical proof or the like? They seem different from the original paper. Thanks!!!

    # directional derivative of the surrogate loss along the search direction
    neggdotstepdir = (-loss_grad * stepdir).sum(0, keepdim=True)
    # first-order predicted improvement for this fraction of the full step
    expected_improve = expected_improve_rate * stepfrac
    # the line search accepts the step only if the actual improvement is
    # close enough to the predicted one and is positive
    ratio = actual_improve / expected_improve
    if ratio.item() > accept_ratio and actual_improve.item() > 0:
    
    opened by baywc568 0
  • Object oriented

    It would be nice if the agent was an object (with methods "get_action" and "remember" or similar) so that it could be more easily reused.

    opened by GittiHab 0
  • Bootstrapping the value function?

    Currently, the target for the value function is the discounted sum of all future rewards. This gives an unbiased estimate but results in higher variance. An alternative is to use a bootstrapped estimate, i.e. something like target[i] = rewards[i] + gamma * prev_values * masks[i]

    Bootstrapping is often preferred due to its lower variance, even though it results in a biased gradient estimate. (A small sketch contrasting the two targets appears after this comment list.)

    opened by XuchanBao 1
  • Does the line search method conflict with a "trust region" policy gradient algorithm?

    Hi, I am a newcomer to DRL. When I tried to read trpo_step in trpo.py, I noticed that you use a line search instead of a trust region for the numerical optimization. So I want to know why you chose that method, and does it conflict with a "trust region" policy gradient algorithm?

    opened by nuomizai 1
  • The step of t is not necessary in main.py

    In your main.py, line 147: for t in range(10000): # Don't infinite loop while learning. But actually, t ends at 50, because the env is done in 50 steps, so range(10000) is much bigger than necessary.

    opened by LeonardPatrick 0
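On the value-target question above ("Bootstrapping the value function?"), here is a minimal sketch contrasting the two targets. The names are illustrative rather than the repository's API; rewards, values and masks are assumed to be flat tensors over one batch of steps, with masks equal to 0 at terminal steps and 1 otherwise.

    import torch

    def monte_carlo_targets(rewards, masks, gamma=0.995):
        # Discounted sum of all future rewards: unbiased, but higher variance.
        targets = torch.zeros_like(rewards)
        running = 0.0
        for i in reversed(range(rewards.size(0))):
            running = rewards[i] + gamma * running * masks[i]
            targets[i] = running
        return targets

    def bootstrapped_targets(rewards, values, masks, gamma=0.995):
        # One-step TD target r_t + gamma * V(s_{t+1}): lower variance, but
        # biased by the current value estimate.
        next_values = torch.cat([values[1:], values.new_zeros(1)])
        return rewards + gamma * next_values * masks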
Owner
Ilya Kostrikov
Post doc
PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

Ilya Kostrikov 3k Dec 31, 2022
MBPO (paper: When to trust your model: Model-based policy optimization) in offline RL settings

offline-MBPO This repository contains the code of a version of model-based RL algorithm MBPO, which is modified to perform in offline RL settings Pape

LxzGordon 1 Oct 24, 2021
Prototypical python implementation of the trust-region algorithm presented in Sequential Linearization Method for Bound-Constrained Mathematical Programs with Complementarity Constraints by Larson, Leyffer, Kirches, and Manns.

null 3 Dec 2, 2022
Deep Reinforcement Learning by using an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO)

V-MPO Simple code to demonstrate Deep Reinforcement Learning by using an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) in Pyt

Nugroho Dewantoro 9 Jun 6, 2022
Cross-Image Region Mining with Region Prototypical Network for Weakly Supervised Segmentation

Cross-Image Region Mining with Region Prototypical Network for Weakly Supervised Segmentation The code of: Cross-Image Region Mining with Region Proto

LiuWeide 16 Nov 26, 2022
Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286

Pytorch-DPPO Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286 Using PPO with clip loss (from https

Alexis David Jacq 163 Dec 26, 2022
ppo_pytorch_cpp - an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch

PPO Pytorch C++ This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch. It uses a simple TestEnvironment t

Martin Huber 59 Dec 9, 2022
PyTorch implementation of Constrained Policy Optimization

PyTorch implementation of Constrained Policy Optimization (CPO) This repository has a simple to understand and use implementation of CPO in PyTorch. A

Sapana Chaudhary 25 Dec 8, 2022
'A C2C E-COMMERCE TRUST MODEL BASED ON REPUTATION' Python implementation

Project description A library providing functionalities to calculate reputation and degree of trust on C2C ecommerce platforms. The work is fully base

Davide Bigotti 2 Dec 14, 2022
PyTorch Implementation of Region Similarity Representation Learning (ReSim)

ReSim This repository provides the PyTorch implementation of Region Similarity Representation Learning (ReSim) described in this paper: @Article{xiao2

Tete Xiao 74 Jan 3, 2023
[ICCV 2021] Official Pytorch implementation for Discriminative Region-based Multi-Label Zero-Shot Learning SOTA results on NUS-WIDE and OpenImages

Discriminative Region-based Multi-Label Zero-Shot Learning (ICCV 2021) [arXiv][Project page >> coming soon] Sanath Narayan*, Akshita Gupta*, Salman Kh

Akshita Gupta 54 Nov 21, 2022
A Pytorch implementation of the multi agent deep deterministic policy gradients (MADDPG) algorithm

Multi-Agent-Deep-Deterministic-Policy-Gradients A Pytorch implementation of the multi agent deep deterministic policy gradients(MADDPG) algorithm This

Phil Tabor 159 Dec 28, 2022
Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing, Ant Colony Optimization Algorithm,Immune Algorithm, Artificial Fish Swarm Algorithm, Differential Evolution and TSP(Traveling salesman)

scikit-opt Swarm Intelligence in Python (Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing, Ant Colony Algorithm, Immune Algorithm,A

郭飞 3.7k Jan 3, 2023
library for nonlinear optimization, wrapping many algorithms for global and local, constrained or unconstrained, optimization

NLopt is a library for nonlinear local and global optimization, for functions with and without gradient information. It is designed as a simple, unifi

Steven G. Johnson 1.4k Dec 25, 2022
Racing line optimization algorithm in python that uses Particle Swarm Optimization.

Racing Line Optimization with PSO This repository contains a racing line optimization algorithm in python that uses Particle Swarm Optimization. Requi

Parsa Dahesh 6 Dec 14, 2022
Official Implementation and Dataset of "PPR10K: A Large-Scale Portrait Photo Retouching Dataset with Human-Region Mask and Group-Level Consistency", CVPR 2021

Portrait Photo Retouching with PPR10K Paper | Supplementary Material PPR10K: A Large-Scale Portrait Photo Retouching Dataset with Human-Region Mask an

null 184 Dec 11, 2022
Pytorch implementations of popular off-policy multi-agent reinforcement learning algorithms, including QMix, VDN, MADDPG, and MATD3.

Off-Policy Multi-Agent Reinforcement Learning (MARL) Algorithms This repository contains implementations of various off-policy multi-agent reinforceme

null 183 Dec 28, 2022