PyTorch implementation of Decoupling Value and Policy for Generalization in Reinforcement Learning

Overview

IDAAC: Invariant Decoupled Advantage Actor-Critic

This is a PyTorch implementation of the methods proposed in

Decoupling Value and Policy for Generalization in Reinforcement Learning by

Roberta Raileanu and Rob Fergus.

Citation

If you use this code in your own work, please cite our paper:

@article{Raileanu2021DecouplingVA,
  title={Decoupling Value and Policy for Generalization in Reinforcement Learning},
  author={Roberta Raileanu and R. Fergus},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.10330}
}

Requirements

To install all the required dependencies:

conda create -n idaac python=3.7
conda activate idaac

cd idaac
pip install -r requirements.txt

pip install procgen

git clone https://github.com/openai/baselines.git
cd baselines 
python setup.py install 
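
A quick way to sanity-check the installation (a minimal check, assuming the idaac conda environment created above is active):

python -c "import torch; import procgen; print('ok')"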

Instructions

This repo provides instructions for training IDAAC, DAAC, and PPO on the Procgen benchmark.

Train IDAAC on CoinRun

python train.py --env_name coinrun --algo idaac

Train DAAC on CoinRun

python train.py --env_name coinrun --algo daac

Train PPO on CoinRun

python train.py --env_name coinrun --algo ppo --ppo_epoch 3

Note: By default, the code uses the same set of hyperparameters (HPs) for all environments, which are the best ones overall. In our studies, we found that some games benefit further from slightly different HPs, so we provide those as well. To use the best hyperparameters for each environment, pass the flag --use_best_hps.
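
For example, to train IDAAC on CoinRun with the per-environment HPs, combine the flag with the command above:

python train.py --env_name coinrun --algo idaac --use_best_hps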

Overview of DAAC and IDAAC

(Figure: overview of the DAAC and IDAAC architectures.)
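
As a rough illustration of the decoupling idea (a minimal sketch with made-up module names and MLP encoders, not the repository's actual ResNet-based code): DAAC trains the policy network with the usual PPO objective plus an auxiliary advantage head that regresses the GAE estimates, while the value function is estimated by a completely separate network that shares no parameters with the policy.

import torch
import torch.nn as nn

class PolicyWithAdvantageHead(nn.Module):
    # Policy network with an auxiliary advantage head (DAAC-style sketch).
    def __init__(self, obs_dim, num_actions, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, num_actions)
        # The advantage head is conditioned on the chosen action.
        self.advantage_head = nn.Linear(hidden + num_actions, 1)

    def forward(self, obs, action_onehot):
        h = self.encoder(obs)
        logits = self.policy_head(h)
        adv = self.advantage_head(torch.cat([h, action_onehot], dim=-1))
        return logits, adv

class ValueNet(nn.Module):
    # Separate value network: its gradients never reach the policy encoder.
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.net(obs)

# Decoupled updates (adv_target = GAE estimate, ret = discounted return):
#   policy_loss = ppo_clip_loss + adv_coef * (adv_pred - adv_target).pow(2).mean()
#   value_loss  = 0.5 * (value_pred - ret).pow(2).mean()   # optimized with a separate optimizer

IDAAC additionally regularizes the policy encoder so that its features do not reveal instance-specific information, using an adversarially trained discriminator; that part is omitted from this sketch.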

Procgen Results

IDAAC achieves state-of-the-art performance on the Procgen benchmark (easy mode), significantly improving the agent's generalization ability over standard RL methods such as PPO.

Test Results on Procgen

(Figure: IDAAC test results on the Procgen benchmark.)

Acknowledgements

This code was based on an open-source PyTorch implementation of PPO.

Comments
  • theoretical question

    Do you think the improvement observed here has more to do with the fact that this is essentially two IMPALA networks compared to one, i.e., that bigger networks simply tend to do better?

    thank you

    opened by hlsfin 7
  • Reproducing DAAC results on Procgen Miner

    Hi @rraileanu

    thanks for your scientific contributions!

    I'm currently trying to reproduce your Miner results. After installing the dependencies, I simply run python train.py --env_name miner --algo daac to launch the training. My first run ended up pretty bad, not even near your reported score.

    Here is the log of that training run. daac.log

    To verify that this is not just an extreme outlier, I started two more training runs, which are at 12.5 million steps right now but behave similarly so far.

    Did I miss something, or how can I reproduce the results from your paper? I also added DAAC to my custom PPO implementation and struggled to get good results there as well.

    opened by MarcoMeter 7
  • Reproducing Procgen results

    Hi. @rraileanu

    Thank you for your interesting work on RL generalization.

    I am trying to reproduce your Procgen results using the same hyperparameter setting across environments, following Appendix C in your paper, but I get scores lower than those reported in some environments (Plunder, ...).

    Should I use environment-specific hyperparameter settings to reproduce the results?

    opened by symoon11 3
  • Implementation questions

    Hello,

    Thank you for sharing the code! I am trying to adapt the code for stable-baselines and I have some questions:

    1. In the IDAACnet act method you call self.base two times: (a) gae, actor_features = self.base(inputs) and (b) gae, _ = self.base(inputs, action). Is this necessary, or could I do the following instead (see the sketch after this comment):

      • move critic_linear from PolicyResNetBase to IDAACnet and modify act() like this: actor_features = self.base(inputs) .... gae = self.critic_linear(torch.cat([actor_features, actions])), so that PolicyResNetBase.forward is called only once.
    2. Where you create the optimizers, you use two separate parameter lists for the policy and the value networks; is this a performance optimization, or do you want to avoid possible interference?

    3. Do you think that adding the auxiliary phase of PPG in DAAC will improve results?

    Thank you, Flaviu

    opened by flav27 2
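
    For reference, a minimal sketch of the single-forward-pass variant described in question 1 above (class and attribute names follow the question; the distribution head and num_actions are assumed helpers, and this is the asker's proposal rather than the repository's implementation):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class IDAACnet(nn.Module):
        def __init__(self, base, hidden_size, num_actions):
            super().__init__()
            self.base = base                     # PolicyResNetBase with critic_linear removed
            self.num_actions = num_actions
            self.dist = nn.Linear(hidden_size, num_actions)  # stand-in for the action-distribution head
            # GAE/advantage head moved up from the base, conditioned on the sampled action
            self.critic_linear = nn.Linear(hidden_size + num_actions, 1)

        def act(self, inputs):
            actor_features = self.base(inputs)   # single forward pass through the encoder
            logits = self.dist(actor_features)
            action = torch.distributions.Categorical(logits=logits).sample()
            onehot = F.one_hot(action, self.num_actions).float()
            gae = self.critic_linear(torch.cat([actor_features, onehot], dim=-1))
            return gae, action
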
  • About adversarial training

    Hi, in Section 4.3 the paper says "... only the encoder’s parameters are updated by minimizing the loss in eq. 3", but eq. 3 is exactly the entropy (of the discriminator regarding which observation came first in the episode), which should be maximized according to the paper. Did I misunderstand something?

    opened by Asuka20 1
  • The measure of the experiment

    Hello, I have a question about Table 4: what does 'The mean and standard deviation are computed using 10 runs with different seeds.' mean? Do the seeds refer to ten different level seeds, or to the random seeds that determine network initialization? Thank you.

    opened by TheGreatLy 1