Simple (but Strong) Baselines for POMDPs

Overview

Recurrent Model-Free RL is a Strong Baseline for Many POMDPs

Welcome to the POMDP world! This repo provides simple baselines for POMDPs, specifically recurrent model-free RL, for the following paper:

Paper: arXiv | Numeric results: Google Drive

by Tianwei Ni, Benjamin Eysenbach and Ruslan Salakhutdinov.

Installation

First, download this repo to a local directory (preferably on a cluster or a server), referred to as <local_path>. We recommend using a virtual environment to install all the dependencies. For example, with miniconda:

conda env create -f install.yml
conda activate pomdp

The YAML file includes all the dependencies (e.g. PyTorch, PyBullet) used in our experiments (including the compared methods), with two exceptions:

  • To run Cheetah-Vel in meta RL, you have to install MuJoCo with a license.
  • To run the robust RL and generalization in RL experiments, you have to install roboschool.
    • We found it hard to install roboschool from scratch, so we provide a Singularity container image roboschool.sif on Google Drive that contains roboschool and the other necessary libraries, adapted from the SunBlaze repo.
    • To download and activate the container image with Singularity on a cluster (a single server should be similar):
    # download roboschool.sif from the google drive to envs/rl-generalization/roboschool.sif
    # then run singularity shell
    singularity shell --nv -H <local_path>:/home envs/rl-generalization/roboschool.sif
    • Then you can test it by running import roboschool in a python3 shell.

General Form to Run Our Implementation of Recurrent Model-Free RL and Compared Methods

We use a .yml file in the configs/ folder for each subarea of POMDPs. To run our implementation, from <local_path> simply run

export PYTHONPATH=${PWD}:$PYTHONPATH
python3 policies/main.py configs/<subarea>/<env_name>/<algo_name>.yml

where <algo_name> specifies the algorithm:

  • sac_rnn and td3_rnn correspond to our implementation of recurrent model-free RL
  • ppo_rnn and a2c_rnn correspond to the (Kostrikov, 2018) implementation of recurrent model-free RL
  • vrm corresponds to VRM, compared in "standard" POMDPs
  • varibad corresponds to the off-policy version of the original VariBAD, compared in meta RL
  • MRPO corresponds to MRPO, compared in robust RL

We have merged the prior methods above into our repository (there is no need to install other repositories), so that future work can use this single repository to run a number of baselines besides ours: A2C-GRU, PPO-GRU, VRM, VariBAD, and MRPO. Since our code draws heavily on those prior works, we encourage authors of future work to cite those prior papers or implementations. For the compared methods, we use their open-sourced implementations with their default hyperparameters.

Specific Running Commands for Each Subarea

Please see run_commands.md for details on running our implementation of recurrent model-free RL and also all the compared methods.

A Minimal Example to Run Our Implementation

Here we provide a stand-alone minimal example, with the fewest dependencies, to run our implementation of recurrent model-free RL!

It only requires PyTorch and PyBullet; there is no need to install MuJoCo or roboschool, and no external configuration file is needed.

Simply open the Jupyter Notebook example.ipynb, which contains the training and evaluation procedure on a toy POMDP environment (Pendulum-V). The whole process takes less than 20 minutes to run.
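For intuition about what "Pendulum-V" means: it is a partially observable variant of Pendulum in which the agent observes only the angular velocity, not the angle. Below is a minimal sketch of such a velocity-only wrapper (illustrative only; the class name and environment id are placeholders, not the repo's actual code):

import gym
import numpy as np

class PendulumV(gym.ObservationWrapper):
    """Hide the angle of Pendulum and expose only the angular velocity (a POMDP)."""

    def __init__(self, env):
        super().__init__(env)
        # Full observation is [cos(theta), sin(theta), theta_dot]; keep theta_dot only.
        self.observation_space = gym.spaces.Box(low=-8.0, high=8.0, shape=(1,), dtype=np.float32)

    def observation(self, obs):
        return obs[2:3].astype(np.float32)

env = PendulumV(gym.make("Pendulum-v1"))  # environment id depends on your gym version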

Details of Our Implementation of Recurrent Model-Free RL: Decision Factors, Best Variants, Code Features

Please see our_details.md for more information on:

  • How to tune the decision factors discussed in the paper in the configuration files
  • How to tune the other hyperparameters that are also important to training
  • Where the core class of our recurrent model-free RL and the RAM-efficient replay buffer are located
  • Our best variants in each subarea and the numeric results for all the bar charts and learning curves

Acknowledgement

Please see acknowledge.md for details.

Citation

If you find our code useful to your work, please consider citing our paper:

@article{ni2021recurrentrl,
  title={Recurrent Model-Free RL is a Strong Baseline for Many POMDPs},
  author={Ni, Tianwei and Eysenbach, Benjamin and Salakhutdinov, Ruslan},
  year={2021}
}

Contact

If you have any questions, please create an issue in this repo or contact Tianwei Ni ([email protected]).

Comments
  • Introduce recurrent SAC-discrete

    This PR introduces a recurrent SAC-discrete algorithm for POMDPs with discrete action spaces. The code is heavily based on the open-sourced SAC-discrete implementation https://github.com/ku2482/sac-discrete.pytorch/blob/master/sacd/agent/sacd.py and the SAC-discrete paper https://arxiv.org/abs/1910.07207.

    We provide two sanity checks on classic gym discrete control environments: CartPole-v0 and LunarLander-v2. The commands for running Markovian and recurrent SAC-discrete algorithms are:

    # CartPole
    python3 policies/main.py --cfg configs/pomdp/cartpole/f/mlp.yml --target_entropy 0.7 --cuda -1
    # CartPole-V
    python3 policies/main.py --cfg configs/pomdp/cartpole/v/rnn.yml --target_entropy 0.7 --cuda 0
    # LunarLander
    python3 policies/main.py --cfg configs/pomdp/lunalander/f/mlp.yml --target_entropy 0.7 --cuda -1
    # LunarLander-V
    python3 policies/main.py --cfg configs/pomdp/lunalander/v/rnn.yml --target_entropy 0.5 --cuda 0
    

    where target_entropy sets the ratio used for the target entropy: ratio * log(|A|).
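    As a quick numeric illustration of that formula (a sketch, not the repo's exact code; the helper name below is made up), with |A| discrete actions the target entropy is ratio * log(|A|), e.g. about 0.49 nats for CartPole (|A| = 2) with --target_entropy 0.7:

    import numpy as np

    def compute_target_entropy(ratio, num_actions):
        # Target entropy as a fraction of log|A|, the entropy of a uniform policy
        # over the discrete action space.
        return ratio * np.log(num_actions)

    print(compute_target_entropy(0.7, 2))  # CartPole has 2 actions -> ~0.485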

    • CartPole: Markovian SAC-discrete is quite sensitive to target_entropy but can solve the task with max return 200.
    • CartPole-V: recurrent SAC-discrete is robust to target_entropy and can solve the task with max return 200 within 10 episodes.
    • LunarLander: Markovian SAC-discrete is sensitive to target_entropy but can solve the task with return over 200.
    • LunarLander-V: recurrent SAC-discrete is very sensitive to target_entropy but can nearly solve the task with one target_entropy value.

    (Learning-curve screenshots omitted.)
    opened by twni2016
  • Handle cases when the episode length < 2

    Hello,

    I wonder whether you have any thoughts on how to handle that case? Currently, you assume it never happens, and the code raises an error in such cases. However, in some domains, like MiniGrid Lava Crossing, that might happen quite often. Thanks!

    opened by hai-h-nguyen
  • Potential bug?

    For SACD, can you explain why you do this https://github.com/twni2016/pomdp-baselines/blob/main/policies/models/policy_rnn.py#L235 instead of this (which I think is correct)? https://github.com/twni2016/pomdp-baselines/blob/main/policies/models/policy_rnn.py#L230

    opened by hai-h-nguyen
  • Pixel observation with recurrent SAC-Discrete

    This PR is not intended to be merged, but serves as a showcase of supporting pixel observations with discrete action spaces, e.g. Atari games.

    We take the delayed-catch environment, introduced by IMPALA+SR (https://arxiv.org/abs/2102.12425), as a sanity check. The environment has only a terminal reward and requires long-term memory. It has an image size of 1x7x7, 3 discrete actions, and a horizon of ~runs*7. We use a simple image encoder for the image observations in place of the MLP encoder used for vector observations.
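    To make the encoder swap concrete, here is a minimal PyTorch sketch of a tiny CNN encoder for 1x7x7 observations (an illustration only; the layer sizes, names, and embedding dimension are assumptions, not the PR's actual architecture):

    import torch
    import torch.nn as nn

    class TinyImageEncoder(nn.Module):
        """Encode a 1x7x7 image observation into a flat embedding fed to the RNN."""

        def __init__(self, embed_dim=64):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1x7x7 -> 16x7x7
                nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 16x7x7 -> 32x7x7
                nn.ReLU(),
                nn.Flatten(),
            )
            self.fc = nn.Linear(32 * 7 * 7, embed_dim)

        def forward(self, obs):  # obs: (batch, 1, 7, 7)
            return self.fc(self.conv(obs))

    print(TinyImageEncoder()(torch.zeros(8, 1, 7, 7)).shape)  # torch.Size([8, 64])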

    We try delayed-catch with 5, 10, 20, and 40 runs; the more runs, the harder the problem. The IMPALA+SR paper reports learning curves for 10, 20, and 40 runs for IMPALA and IMPALA+SR (their Fig. 7b); the screenshot of those curves is omitted here.

    Our running command:

    # We sweep over the following range
    python3 policies/main.py --cfg configs/pomdp/catch/rnn.yml --noautomatic_entropy_tuning --entropy_alpha [0.1,0.01,0.001]
    

    where we found that a fixed temperature works much better than auto-tuning it with a target entropy in this task. (It is still a bit strange why the former works but the latter does not; auto-tuning eventually leads to zero actor gradient.)
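    For context on the fixed-vs-auto temperature observation above, below is a generic sketch of SAC-style temperature auto-tuning (the standard formulation, not necessarily the repo's exact code); with --noautomatic_entropy_tuning the temperature simply stays fixed at --entropy_alpha:

    import torch

    log_alpha = torch.zeros(1, requires_grad=True)
    alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)
    target_entropy = 0.7 * torch.log(torch.tensor(3.0))  # e.g. ratio * log|A| with 3 actions

    def update_alpha(policy_entropy):
        # Standard SAC temperature loss: gradient descent raises alpha when the
        # policy entropy falls below the target, and lowers it otherwise.
        alpha_loss = (log_alpha * (policy_entropy - target_entropy).detach()).mean()
        alpha_optim.zero_grad()
        alpha_loss.backward()
        alpha_optim.step()
        return log_alpha.exp().item()

    print(update_alpha(policy_entropy=torch.tensor([0.2])))  # entropy below target -> alpha grows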

    • Delayed-catch with 5 runs: solved with 100k samples.
    • Delayed-catch with 10 runs: solved with 400k samples (vs 50M for IMPALA+SR).
    • Delayed-catch with 20 runs: solved with 700k samples (vs 100M for IMPALA+SR).
    • Delayed-catch with 40 runs: after hyperparameter tuning, solved with 2M samples (vs 200M for IMPALA+SR).

    (Learning-curve screenshots for the different fixed alpha values and for alpha=0.1 are omitted.)

    opened by twni2016
  • Refactor the code on RL

    This PR is a major refactor of the policy and RL code. The idea is to remove RL-algorithm-specific details (e.g. TD3, SAC, SAC-Discrete) from policy*.py; these details are moved to the rl directory.

    With this decoupling, one can create a new RL algorithm by inheriting from the RLAlgorithmBase class, without changing the policy* classes.
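    A purely hypothetical sketch of what that decoupling looks like (the class and method names below are illustrative, not the actual RLAlgorithmBase interface in the rl directory):

    from abc import ABC, abstractmethod

    # Illustrative stand-in for the base class; the real RLAlgorithmBase may
    # expose different method names and signatures.
    class RLAlgorithmBaseSketch(ABC):
        @abstractmethod
        def actor_loss(self, batch): ...

        @abstractmethod
        def critic_loss(self, batch): ...

    class MyNewRL(RLAlgorithmBaseSketch):
        def actor_loss(self, batch):
            return 0.0  # algorithm-specific policy loss goes here

        def critic_loss(self, batch):
            return 0.0  # algorithm-specific Q loss goes here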

    opened by twni2016
  • Introduce RAM-efficient seq buffer

    This PR introduces a RAM-efficient sequence buffer in buffers/seq_replay_buffer_efficient.py, which stores each observation only once, unlike the vanilla sequence buffer in buffers/seq_replay_buffer_vanilla.py, which stores it twice. This reduces RAM usage roughly by 2x, which is especially useful for large observation spaces such as pixel inputs.
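    The idea can be shown with a tiny conceptual sketch (an illustration of storing observations once, not the actual buffer in buffers/seq_replay_buffer_efficient.py; it also ignores episode boundaries):

    import numpy as np

    class SingleStorageBufferSketch:
        """Store each observation once; next_obs is recovered by shifting the index."""

        def __init__(self, capacity, obs_dim):
            self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
            self.size = 0

        def add(self, ob):
            self.obs[self.size] = ob
            self.size += 1

        def transition(self, t):
            # (obs_t, next_obs_t) without storing the observation stream twice.
            return self.obs[t], self.obs[t + 1]

    buf = SingleStorageBufferSketch(capacity=10, obs_dim=3)
    for step in range(5):
        buf.add(np.full(3, step, dtype=np.float32))
    print(buf.transition(2))  # (array of 2.0s, array of 3.0s)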

    The correctness of this buffer is tested in __main__; if you are interested, please run

    python buffers/seq_replay_buffer_efficient.py
    

    to double check.

    To use it, set train: buffer_type: "seq_efficient" in the config file. The default buffer type is still the vanilla one, to stay consistent with the paper's results. However, we recommend the efficient one, especially for Atari games.

    opened by twni2016