PyTorch implementation of Decoupling Value and Policy for Generalization in Reinforcement Learning

Overview

IDAAC: Invariant Decoupled Advantage Actor-Critic

This is a PyTorch implementation of the methods proposed in

Decoupling Value and Policy for Generalization in Reinforcement Learning by

Roberta Raileanu and Rob Fergus.

Citation

If you use this code in your own work, please cite our paper:

@article{Raileanu2021DecouplingVA,
  title={Decoupling Value and Policy for Generalization in Reinforcement Learning},
  author={Roberta Raileanu and R. Fergus},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.10330}
}

Requirements

To install all the required dependencies:

conda create -n idaac python=3.7
conda activate idaac

cd idaac
pip install -r requirements.txt

pip install procgen

git clone https://github.com/openai/baselines.git
cd baselines 
python setup.py install 
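
A quick way to sanity-check the installation (a minimal check, assuming the idaac conda environment created above is active):

python -c "import torch; import procgen; print('ok')"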

Instructions

This repo provides instructions for training IDAAC, DAAC, and PPO on the Procgen benchmark.

Train IDAAC on CoinRun

python train.py --env_name coinrun --algo idaac

Train DAAC on CoinRun

python train.py --env_name coinrun --algo daac

Train PPO on CoinRun

python train.py --env_name coinrun --algo ppo --ppo_epoch 3

Note: By default, the code uses the same set of hyperparameters (HPs) for all environments, which are the best ones overall. In our studies, we found that some games benefit further from slightly different HPs, so we provide those as well. To use the best hyperparameters for each environment, pass the flag --use_best_hps.
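
For example, to train IDAAC on CoinRun with the per-environment HPs, combine the flag with the command above:

python train.py --env_name coinrun --algo idaac --use_best_hps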

Overview of DAAC and IDAAC

(Figure: overview of the DAAC and IDAAC architectures.)
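
As a rough illustration of the decoupling idea (a minimal sketch with made-up module names and MLP encoders, not the repository's actual ResNet-based code): DAAC trains the policy network with the usual PPO objective plus an auxiliary advantage head that regresses the GAE estimates, while the value function is estimated by a completely separate network that shares no parameters with the policy.

import torch
import torch.nn as nn

class PolicyWithAdvantageHead(nn.Module):
    # Policy network with an auxiliary advantage head (DAAC-style sketch).
    def __init__(self, obs_dim, num_actions, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, num_actions)
        # The advantage head is conditioned on the chosen action.
        self.advantage_head = nn.Linear(hidden + num_actions, 1)

    def forward(self, obs, action_onehot):
        h = self.encoder(obs)
        logits = self.policy_head(h)
        adv = self.advantage_head(torch.cat([h, action_onehot], dim=-1))
        return logits, adv

class ValueNet(nn.Module):
    # Separate value network: its gradients never reach the policy encoder.
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.net(obs)

# Decoupled updates (adv_target = GAE estimate, ret = discounted return):
#   policy_loss = ppo_clip_loss + adv_coef * (adv_pred - adv_target).pow(2).mean()
#   value_loss  = 0.5 * (value_pred - ret).pow(2).mean()   # optimized with a separate optimizer

IDAAC additionally regularizes the policy encoder so that its features do not reveal instance-specific information, using an adversarially trained discriminator; that part is omitted from this sketch.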

Procgen Results

IDAAC achieves state-of-the-art performance on the Procgen benchmark (easy mode), significantly improving the agent's generalization ability over standard RL methods such as PPO.

Test Results on Procgen

(Figure: IDAAC test results on the Procgen benchmark.)

Acknowledgements

This code was based on an open-source PyTorch implementation of PPO.

Comments
  • theoretical question

    Do you think the improvement observed here has more to do with the fact that this is essentially two IMPALA networks compared to one, i.e., that bigger networks simply tend to do better?

    thank you

    opened by hlsfin 7
  • Reproducing DAAC results on Procgen Miner

    Hi @rraileanu

    thanks for your scientific contributions!

    I'm currently trying to reproduce your Miner results. After installing the dependencies, I simply run python train.py --env_name miner --algo daac to launch the training. My first run ended up pretty bad, not even near your reported score.

    Here is the log of that training run. daac.log

    To verify that this is not just an extreme outlier, I started two more training runs, which are at 12.5 million steps right now but behave similarly so far.

    Did I miss something, or how can I reproduce the results from your paper? I also added DAAC to my custom PPO implementation and struggled to get good results there as well.

    opened by MarcoMeter 7
  • Reproducing Procgen results

    Hi. @rraileanu

    Thank you for your interesting work on RL generalization.

    I am trying to reproduce your Procgen results using the same hyperparameter setting across environments, following Appendix C in your paper, but I get scores lower than those reported in some environments (Plunder, ...).

    Should I use environment-specific hyperparameter settings to reproduce the results?

    opened by symoon11 3
  • Implementation questions

    Hello,

    Thank you for sharing the code! I am trying to adapt the code for stable-baselines and I have some questions:

    1. In the IDAACnet act method you call self.base two times: (a) gae, actor_features = self.base(inputs) and (b) gae, _ = self.base(inputs, action). Is this necessary, or could I do the following instead (see the sketch after this comment):

      • move critic_linear from PolicyResNetBase to IDAACnet and modify act() like this: actor_features = self.base(inputs) .... gae = self.critic_linear(torch.cat([actor_features, actions])), so that PolicyResNetBase.forward is called only once.
    2. Where you create the optimizers, you use two separate parameter lists for the policy and the value networks; is this a performance optimization, or do you want to avoid possible interference?

    3. Do you think that adding the auxiliary phase of PPG in DAAC will improve results?

    Thank you, Flaviu

    opened by flav27 2
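
    For reference, a minimal sketch of the single-forward-pass variant described in question 1 above (class and attribute names follow the question; the distribution head and num_actions are assumed helpers, and this is the asker's proposal rather than the repository's implementation):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class IDAACnet(nn.Module):
        def __init__(self, base, hidden_size, num_actions):
            super().__init__()
            self.base = base                     # PolicyResNetBase with critic_linear removed
            self.num_actions = num_actions
            self.dist = nn.Linear(hidden_size, num_actions)  # stand-in for the action-distribution head
            # GAE/advantage head moved up from the base, conditioned on the sampled action
            self.critic_linear = nn.Linear(hidden_size + num_actions, 1)

        def act(self, inputs):
            actor_features = self.base(inputs)   # single forward pass through the encoder
            logits = self.dist(actor_features)
            action = torch.distributions.Categorical(logits=logits).sample()
            onehot = F.one_hot(action, self.num_actions).float()
            gae = self.critic_linear(torch.cat([actor_features, onehot], dim=-1))
            return gae, action
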
  • About adversarial training

    Hi, in Section 4.3 the paper says "... only the encoder’s parameters are updated by minimizing the loss in eq. 3", but eq. 3 is exactly the entropy (of the discriminator regarding which observation came first in the episode), which should be maximized according to the paper. Did I misunderstand something?

    opened by Asuka20 1
  • The measure of the experiment

    Hello, I have a question about Table 4: what does 'The mean and standard deviation are computed using 10 runs with different seeds.' mean? Do the seeds refer to ten different level seeds, or to the random seeds that determine network initialization? Thank you.

    opened by TheGreatLy 1