Public implementation of "Learning from Suboptimal Demonstration via Self-Supervised Reward Regression" from CoRL'21


Self-Supervised Reward Regression (SSRR)

Codebase for the CoRL 2021 paper "Learning from Suboptimal Demonstration via Self-Supervised Reward Regression".

Authors: Letian "Zac" Chen, Rohan Paleja, Matthew Gombolay

Usage

Quick overview

The pipeline of SSRR includes

  1. Initial IRL: Noisy-AIRL or AIRL.
  2. Noisy Dataset Generation: use the initial policy learned in step 1 to generate trajectories at different noise levels, and criticize (score) those trajectories with the initial reward.
  3. Sigmoid Fitting: fit a sigmoid function for the noise-performance relationship using the data obtained in step 2.
  4. Reward Learning: learn a reward function by regressing to the sigmoid relationship obtained in step 3.
  5. Policy Learning: learn a policy by optimizing the reward learned in step 4.
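Steps 3 and 4 can be pictured with a small sketch. It assumes a four-parameter sigmoid with parameters c, k, x0, and y0, and a squared-error loss between a trajectory's summed predicted reward and the sigmoid value at that trajectory's noise level. The function names and parameter values are hypothetical and are not taken from this codebase.

```python
import math

def sigmoid_return(noise, c, k, x0, y0):
    """Four-parameter sigmoid (assumed form) mapping a noise level to a return."""
    return c / (1.0 + math.exp(-k * (noise - x0))) + y0

def regression_loss(predicted_rewards, noise, params):
    """Squared error between the summed predicted reward and the sigmoid target."""
    target = sigmoid_return(noise, *params)
    return (sum(predicted_rewards) - target) ** 2

# Hypothetical parameters (c, k, x0, y0); a negative k makes return fall as noise grows.
params = (100.0, -10.0, 0.5, 0.0)
print(sigmoid_return(0.0, *params), sigmoid_return(1.0, *params))
print(regression_loss([10.0, 20.0, 20.0], 0.5, params))
```

The reward network is trained so that this loss shrinks across trajectories of all noise levels; the sketch only evaluates the loss for one trajectory.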

I know this is a long README, but please make sure you read all of it before trying out our code. Trust me, it will save you time!

Dependencies and Environment Preparations

Code is tested with Python 3.6 with Anaconda.

Required packages:

pip install scipy path.py joblib==0.12.3 flask h5py matplotlib scikit-learn pandas pillow pyprind tqdm nose2 mujoco-py cached_property cloudpickle git+https://github.com/Theano/Theano.git@adfe319ce6b781083d8dc3200fb4481b00853791#egg=Theano git+https://github.com/neocxi/Lasagne.git@484866cf8b38d878e92d521be445968531646bb8#egg=Lasagne plotly==2.0.0 gym[all]==0.14.0 progressbar2 tensorflow-gpu==1.15 imgcat

Test sets of trajectories can be downloaded from Google Drive, because GitHub cannot host files larger than 100MB. After downloading, please put full_demos/ under demos/.

If you are directly running python scripts, you will need to add the project root and the rllab_archive folder into your PYTHONPATH:

export PYTHONPATH=/path/to/this/repo/:/path/to/this/repo/rllab_archive/

If you are using the provided bash scripts (for example, noisy_airl_ssrr_drex_comparison_halfcheetah.sh), make sure to replace the first line with

export PYTHONPATH=/path/to/this/repo/:/path/to/this/repo/rllab_archive/
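If imports fail, it can help to check that both directories really are on PYTHONPATH. The helper below is a hypothetical sketch and not part of this repo; the repo path is the same placeholder used in the export line above.

```python
import os

def missing_path_entries(pythonpath, required):
    """Return the required directories that are absent from a PYTHONPATH-style string."""
    present = {os.path.normpath(p) for p in pythonpath.split(os.pathsep) if p}
    return [r for r in required if os.path.normpath(r) not in present]

repo = "/path/to/this/repo"  # placeholder path, replace with your checkout
required = [repo, os.path.join(repo, "rllab_archive")]
# An empty list means both entries are present in the current environment.
print(missing_path_entries(os.environ.get("PYTHONPATH", ""), required))
```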

Initial IRL

We provide code for AIRL and Noisy-AIRL implementation.

Running

Example commands:

python script_experiment/halfcheetah_airl.py --output_dir=./data/halfcheetah_airl_test_1
python script_experiment/hopper_noisy_airl.py --output_dir=./data/hopper_noisy_airl_test_1 --noisy

Please note that for Noisy-AIRL you must include the --noisy flag so that trajectories are actually sampled with noise; otherwise, only the loss function changes, according to Equation 6 in the paper.

Results

The result will be available in the output dir specified, and we recommend using rllab viskit to visualize it.

Our run results are also available in data/{halfcheetah/hopper/ant}_{airl/noisy_airl}_test_1 if you want to skip this step!

Code Structure

The AIRL and Noisy-AIRL code resides in inverse_rl/, with rllab dependencies in rllab_archive. The AIRL code is adapted from the original AIRL codebase https://github.com/justinjfu/inverse_rl. The rllab archive is adapted from the original rllab codebase https://github.com/rll/rllab.

Noisy Dataset Generation & Sigmoid Fitting

We implemented noisy dataset generation and sigmoid fitting together in code.

Running

An example command:

python script_experiment/noisy_dataset.py \
   --log_dir=./results/halfcheetah/temp/noisy_dataset/ \
   --env_id=HalfCheetah-v3 \
   --bc_agent=./results/halfcheetah/temp/bc/model.ckpt \
   --demo_trajs=./demos/suboptimal_demos/halfcheetah/dataset.pkl \
   --airl_path=./data/halfcheetah_airl_test_1/itr_999.pkl \
   --airl \
   --seed="${loop}"

Note that the --airl flag determines whether the --airl_path policy or the --bc_agent policy generates the trajectories. Therefore, --bc_agent is optional when --airl is present. For the behavior cloning policy, please refer to https://github.com/dsbrown1331/CoRL2019-DREX.

The --airl_path model always provides the initial reward used to criticize the generated trajectories, regardless of whether --airl is present.
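Generating trajectories at a given noise level can be pictured as epsilon-greedy action selection: at each step, with probability equal to the noise level, a random action replaces the policy's action. This is an assumption about the noise model (in the style of D-REX), and the function below is a hypothetical sketch, not the code in noisy_dataset.py.

```python
import random

def noisy_actions(policy_actions, random_action, noise_level, seed=0):
    """Replace each policy action with a random one with probability noise_level."""
    rng = random.Random(seed)
    chosen = []
    for action in policy_actions:
        if rng.random() < noise_level:
            chosen.append(random_action(rng))  # exploratory action
        else:
            chosen.append(action)  # action proposed by the policy
    return chosen

# Hypothetical one-dimensional actions and a uniform random-action sampler.
policy_actions = [0.1, 0.2, 0.3, 0.4]
uniform = lambda rng: rng.uniform(-1.0, 1.0)
for level in (0.0, 0.5, 1.0):
    print(level, noisy_actions(policy_actions, uniform, level))
```

A noise level of 0 reproduces the policy, and a noise level of 1 yields a fully random rollout.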

Results

The result will be available in the log dir specified.

Our run results are also available in results/{halfcheetah/hopper/ant}/{airl/noisy_airl}_data_ssrr_{1/2/3/4/5}/noisy_dataset/ if you want to skip this step!

Code Structure

Noisy dataset generation and Sigmoid fitting are implemented in script_experiment/noisy_dataset.py.
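To make the fitting step concrete, here is a dependency-free sketch. It is not the routine used in noisy_dataset.py: it assumes a four-parameter sigmoid (c, k, x0, y0), grid-searches k and x0, and solves c and y0 in closed form by linear least squares. The data is synthetic and all names and values are hypothetical.

```python
import math

def sigmoid(noise, c, k, x0, y0):
    """Four-parameter sigmoid (assumed form) from noise level to return."""
    return c / (1.0 + math.exp(-k * (noise - x0))) + y0

def fit_sigmoid(noises, returns, k_grid, x0_grid):
    """Grid-search k and x0; for each pair, solve c and y0 by linear least squares."""
    n = len(noises)
    r_mean = sum(returns) / n
    best_err, best_params = math.inf, None
    for k in k_grid:
        for x0 in x0_grid:
            # With k and x0 fixed, the model is linear in c and y0: return = c * s + y0.
            s = [1.0 / (1.0 + math.exp(-k * (x - x0))) for x in noises]
            s_mean = sum(s) / n
            s_var = sum((si - s_mean) ** 2 for si in s)
            if s_var < 1e-12:
                continue  # curve is flat over the data, so c is not identifiable
            c = sum((si - s_mean) * (ri - r_mean) for si, ri in zip(s, returns)) / s_var
            y0 = r_mean - c * s_mean
            err = sum((c * si + y0 - ri) ** 2 for si, ri in zip(s, returns))
            if err < best_err:
                best_err, best_params = err, (c, k, x0, y0)
    return best_params

# Synthetic, noise-free data from hypothetical parameters, only to exercise the fit.
true_params = (5000.0, -10.0, 0.4, -200.0)
noises = [i / 20 for i in range(21)]
returns = [sigmoid(x, *true_params) for x in noises]
k_grid = [-20.0, -15.0, -10.0, -5.0, -1.0]  # negative k: return falls as noise grows
x0_grid = [i / 10 for i in range(11)]
c, k, x0, y0 = fit_sigmoid(noises, returns, k_grid, x0_grid)
print(c, k, x0, y0)
```

Restricting k_grid to one sign avoids the equivalent mirrored solution (-c, -k, x0, y0 + c), which describes the same curve.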

Reward Learning

We provide SSRR and D-REX implementation.

Running

Example commands:

  python script_experiment/drex.py \
   --log_dir=./results/halfcheetah/temp/drex \
   --env_id=HalfCheetah-v3 \
   --bc_trajs=./demos/suboptimal_demos/halfcheetah/dataset.pkl \
   --unseen_trajs=./demos/full_demos/halfcheetah/unseen_trajs.pkl \
   --noise_injected_trajs=./results/halfcheetah/temp/noisy_dataset/prebuilt.pkl \
   --seed="${loop}"
  python script_experiment/ssrr.py \
   --log_dir=./results/halfcheetah/temp/ssrr \
   --env_id=HalfCheetah-v3 \
   --mode=train_reward \
   --noise_injected_trajs=./results/halfcheetah/temp/noisy_dataset/prebuilt.pkl \
   --bc_trajs=demos/suboptimal_demos/halfcheetah/dataset.pkl \
   --unseen_trajs=demos/full_demos/halfcheetah/unseen_trajs.pkl \
   --min_steps=50 --max_steps=500 --l2_reg=0.1 \
   --sigmoid_params_path=./results/halfcheetah/temp/noisy_dataset/fitted_sigmoid_param.pkl \
   --seed="${loop}"

The bash script combines noisy dataset generation, sigmoid fitting, and reward learning into a single run, and repeats it several times:

./airl_ssrr_drex_comparison_halfcheetah.sh

Results

The result will be available in the log dir specified.

The correlation between the predicted reward and the ground-truth reward, tested on the unseen_trajs, is reported on the console at the end of the run or, if you are using the bash script, at the end of d_rex.log or ssrr.log.
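For readers who want to recompute such a number from their own per-trajectory returns, the sketch below computes a Pearson correlation in plain Python. Whether the scripts report Pearson or another coefficient is an assumption here, and the input values are hypothetical.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length sequences of returns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)

# Hypothetical predicted and ground-truth returns for four unseen trajectories.
predicted = [1.0, 2.0, 3.0, 4.0]
ground_truth = [10.0, 20.0, 30.0, 40.0]
print(pearson_correlation(predicted, ground_truth))
```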

Our run results are also available in results/{halfcheetah/hopper/ant}/{airl/noisy_airl}_data_ssrr_{1/2/3/4/5}/{drex/ssrr}/.

Code Structure

SSRR is implemented in script_experiment/ssrr.py, Agents/SSRRAgent.py, Datasets/NoiseDataset.py.

D-REX is implemented in script_experiment/drex.py, script_experiment/drex_utils.py, and script_experiment/tf_commons/ops.

Both implementations are adapted from https://github.com/dsbrown1331/CoRL2019-DREX.

Policy Learning

We utilize stable-baselines to optimize policy over the reward we learned.

Running

Before running, you should edit script_experiment/rl_utils/sac.yml to change the learned reward model directory, for example:

  env_wrapper: {"script_experiment.rl_utils.wrappers.CustomNormalizedReward": {"model_dir": "/home/zac/Programming/Zac-SSRR/results/halfcheetah/noisy_airl_data_ssrr_4/ssrr/", "ctrl_coeff": 0.1, "alive_bonus": 0.0}}
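When switching between several learned reward models, the line above can be generated rather than edited by hand. The helper below is a hypothetical convenience, not part of this repo. It relies on the inline mapping being valid JSON, which YAML accepts as a flow mapping; the wrapper path and keyword names are copied from the example line, and the model directory is a placeholder.

```python
import json

def env_wrapper_line(model_dir, ctrl_coeff=0.1, alive_bonus=0.0):
    """Build the env_wrapper entry for sac.yml for a given reward model directory."""
    wrapper = "script_experiment.rl_utils.wrappers.CustomNormalizedReward"
    kwargs = {"model_dir": model_dir, "ctrl_coeff": ctrl_coeff, "alive_bonus": alive_bonus}
    return "env_wrapper: " + json.dumps({wrapper: kwargs})

# Placeholder directory; point this at your own reward-learning output.
line = env_wrapper_line("./results/halfcheetah/temp/ssrr/")
print(line)
```

Paste the printed line over the existing env_wrapper entry in script_experiment/rl_utils/sac.yml.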

An example command:

python script_experiment/train_rl_with_learned_reward.py \
 --algo=sac \
 --env=HalfCheetah-v3 \
 --tensorboard-log=./results/HalfCheetah_custom_reward/ \
 --log-folder=./results/HalfCheetah_custom_reward/ \
 --save-freq=10000

Please note that the flag --env-kwargs=terminate_when_unhealthy:False is necessary for Hopper and Ant, as discussed in Supplementary D.1 of our paper.

An example command for evaluating the learned policy's ground-truth reward:

python script_experiment/test_rl_with_ground_truth_reward.py \
 --algo=sac \
 --env=HalfCheetah-v3 \
 -f=./results/HalfCheetah_custom_reward/ \
 --exp-id=1 \
 -e=5 \
 --no-render \
 --env-kwargs=terminate_when_unhealthy:False

Results

The result will be available in the log folder specified.

We also provide our run results in results/.

Code Structure

The script script_experiment/train_rl_with_learned_reward.py and utils/ call the stable-baselines library to learn a policy with the learned reward function. Note that utils/ cannot be renamed because of an rl-baselines-zoo constraint.

The code is adapted from https://github.com/araffin/rl-baselines-zoo.

Random Seeds

Because of the inherent stochasticity of GPU reduction operations such as mean and sum (https://github.com/tensorflow/tensorflow/issues/3103), we cannot reproduce the exact result every time, even with a fixed random seed. Therefore, we encourage you to run multiple times to reduce the effect of randomness.
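One simple way to report results from repeated runs is the mean and sample standard deviation across seeds. The sketch below uses only the standard library; the numbers are placeholders for illustration and are not results from the paper or this repo.

```python
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation) of a metric across repeated runs."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

# Placeholder metric values from five hypothetical runs with different seeds.
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, std = summarize_runs(scores)
print(f"{mean:.3f} +/- {std:.3f} over {len(scores)} runs")
```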

If you have a nice way to get the same result each time, please let us know!

Ending Thoughts

We welcome discussions or extensions of our paper and code in Issues!

Feel free to leave a star if you like this repo!

For more exciting work from our lab (the CORE Robotics Lab at the Georgia Institute of Technology, led by Professor Matthew Gombolay), check out our website!
