Off-policy continuous control in PyTorch, with RDPG, RTD3 & RSAC

Overview

offpcc_logo

arXiv technical report soon available.

we are updating the readme to be as comprehensive as possible

Please ask any questions in Issues, thanks.

Introduction

This PyTorch repo implements off-policy RL algorithms for continuous control, including:

  • Standard algorithms: DDPG, TD3, SAC
  • Image-based algorithm: ConvolutionalSAC
  • Recurrent algorithms: RecurrentDPG, RecurrentTD3, RecurrentSAC, RecurrentSACSharing (see report)

where recurrent algorithms are generally not available in other repos.

Structure of codebase

Here, we talk about the organization of this code. In particular, we will talk about

  • Folder: where are certain files located?
  • Classes: how are classes designed to interact with each other?
  • Training/evaluation loop: how environment interaction, learning and evaluation alternate?

A basic understanding of these will make other details easy to understand from code itself.

Folders

  • file
    • containing plots reproducing stable-baselines3; you don’t need to touch this
  • offpcc (the good stuff; you will be using this)
    • algorithms (where DDPG, TD3 and SAC are implemented)
    • algorithms_recurrent (where RDPG, RTD3 and RSAC are implemented)
    • basics (abstract classes, stuff shared by algorithms or algorithms_recurrent, code for training)
    • basics_sb3 (you don’t need to touch this)
    • configs (gin configs)
    • domains (all custom domains are stored within and registered properly)
  • pics_for_readme
    • random pics; you don’t need to touch this
  • temp
    • potentially outdated stuff; you don’t need to touch this

Relationships between classes

There are three core classes in this repo:

  • Any environment written using OpenAI’s API would have:
    • reset method outputs the current state
    • step method takes in an action, outputs (reward, next state, done, info)
  • OffPolicyRLAlgorithm and RecurrentOffPolicyRLAlgorithm are the base class for all algorithms listed in introduction. You should think about them as neural network (e.g., actors, critics, CNNs, RNNs) wrappers that are augmented with methods to help these networks interact with other stuff:
    • act method takes in state from env, outputs action back to env
    • update_networks method takes in batch from buffer
  • The replay buffers ReplayBuffer and RecurrentReplayBuffer are built to interact with the environment and the algorithm classes
    • push method takes in a transition from env
    • sample method outputs a batch for algorithm’s update_networks method

Their relationships are best illustrated by a diagram:

offpcc_steps

Structure of training/evaluation loop

In this repo, we follow the training/evaluation loop style in spinning-up (this is essentially the script: basics/run_fns and the function train). It follows this basic structure, with details added for tracking stats and etc:

state = env.reset()
for t range(total_steps):  # e.g., 1 million
    # environment interaction
    if t >= update_after:
        action = algorithm.act(state)
    else:
        action = env.action_space.sample()
    next_state, reward, done, info = env.step(action)
   	# learning
    if t >= update_after and (t + 1) % update_every == 0:
        for j in range(update_every):
            batch = buffer.sample()
            algorithm.update_networks(batch)
    # evaluation
    if (t + 1) % num_steps_per_epoch == 0:
        ep_len, ep_ret = test_for_one_episode(test_env, algorithm)

Dependencies

Dependencies are available in requirements.txt; although there might be one or two missing dependencies that you need to install yourself.

Train an agent

Setup (wandb & GPU)

Add this to your bashrc or bash_profile and source it.

You should replace “account_name” with whatever wandb account that you want to use.

export OFFPCC_WANDB_ENTITY="account_name"

From the command line:

cd offpcc
CUDA_VISIBLE_DEVICES=3 OFFPCC_WANDB_PROJECT=project123 python launch.py --env <env-name> --algo <algo-name> --config <config-path> --run_id <id>

For DDPG, TD3, SAC

On pendulum-v0:

python launch.py --env pendulum-v0 --algo sac --config configs/test/template_short.gin --run_id 1

Commands and plots for benchmarking on Pybullet domains are in a Issue called “Performance check against SB3”.

For RecurrentDDPG, RecurrentTD3, RecurrentSAC

On pendulum-p-v0:

python launch.py --env pendulum-p-v0 --algo rsac --config configs/test/template_recurrent_100k.gin --run_id 1

Reproducing paper results

To reproduce paper results, simply run the commands in the previous section with the appropriate env name (listed below) and config files (their file names are highly readable). Mapping between env names used in code and env names used in paper:

  • pendulum-v0: pendulum
  • pendulum-p-v0: pendulum-p
  • pendulum-va-v0: pendulum-v
  • dmc-cartpole-balance-v0: cartpole-balance
  • dmc-cartpole-balance-p-v0: cartpole-balance-p
  • dmc-cartpole-balance-va-v0: cartpole-balance-v
  • dmc-cartpole-swingup-v0: cartpole-swingup
  • dmc-cartpole-swingup-p-v0: cartpole-swingup-p
  • dmc-cartpole-swingup-va-v0: cartpole-swingup-v
  • reacher-pomdp-v0: reacher-pomdp
  • water-maze-simple-pomdp-v0: watermaze
  • bumps-normal-test-v0: push-r-bump

Render learned policy

Create a folder in the same directory as offpcc, called results. In there, create a folder with the name of the environment, e.g., pendulum-p-v0. Within that env folder, create a folder with the name of the algorithm, e.g., rsac. You can get an idea of the algorithms available from the algo_name2class diectionary defined in offpcc/launch.py. Within that algorithm folder, create a folder with the run_id, e.g., 1. Simply put the saved actor (also actor summarizer for recurrent algorithms) into that inner most foler - they can be downloaded from the wandb website after your run finishes. Finally, go back into offpcc, and call

python launch.py --env pendulum-v0 --algo sac --config configs/test/template_short.gin --run_id 1 --render

For bumps-normal-test-v0, you need to modify the test_for_one_episode function within offpcc/basics/run_fns.py because, for Pybullet environments, the env.step must only appear once before the env.reset() call.

You might also like...
PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL). Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286
Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286

Pytorch-DPPO Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286 Using PPO with clip loss (from https

ppo_pytorch_cpp - an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch
ppo_pytorch_cpp - an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch

PPO Pytorch C++ This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch. It uses a simple TestEnvironment t

PyTorch implementation of Constrained Policy Optimization
PyTorch implementation of Constrained Policy Optimization

PyTorch implementation of Constrained Policy Optimization (CPO) This repository has a simple to understand and use implementation of CPO in PyTorch. A

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

PyTorch implementation for  MINE: Continuous-Depth MPI with Neural Radiance Fields
PyTorch implementation for MINE: Continuous-Depth MPI with Neural Radiance Fields

MINE: Continuous-Depth MPI with Neural Radiance Fields Project Page | Video PyTorch implementation for our ICCV 2021 paper. MINE: Towards Continuous D

PyTorch implementation for Stochastic Fine-grained Labeling of Multi-state Sign Glosses for Continuous Sign Language Recognition.

Stochastic CSLR This is the PyTorch implementation for the ECCV 2020 paper: Stochastic Fine-grained Labeling of Multi-state Sign Glosses for Continuou

A clean and robust Pytorch implementation of PPO on continuous action space.
A clean and robust Pytorch implementation of PPO on continuous action space.

PPO-Continuous-Pytorch I found the current implementation of PPO on continuous action space is whether somewhat complicated or not stable. And this is

A pure PyTorch batched computation implementation of "CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition"

A pure PyTorch batched computation implementation of "CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition"

Comments
  • Performance check against SB3

    Performance check against SB3

    Following the style here, we will run

    • DDPG
    • TD3
    • SAC

    on 4 PyBullet environments

    • HalfCheetahBulletEnv-v0
    • AntBulletEnv-v0
    • HopperBulletEnv-v0
    • Walker2DBulletEnv-v0

    for 3 seeds, and report the mean and std of final performance.

    In total, this would be 3 * 4 * 3 = 36 runs.

    performance-check 
    opened by zhihanyang2022 3
Owner
Zhihan
Zhihan
Implementation of Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

advantage-weighted-regression Implementation of Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning, by Peng et al. (

Omar D. Domingues 1 Dec 2, 2021
Deep Reinforcement Learning by using an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO)

V-MPO Simple code to demonstrate Deep Reinforcement Learning by using an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) in Pyt

Nugroho Dewantoro 9 Jun 6, 2022
MLOps will help you to understand how to build a Continuous Integration and Continuous Delivery pipeline for an ML/AI project.

page_type languages products description sample python azure azure-machine-learning-service azure-devops Code which demonstrates how to set up and ope

null 1 Nov 1, 2021
[ICML 2020] Prediction-Guided Multi-Objective Reinforcement Learning for Continuous Robot Control

PG-MORL This repository contains the implementation for the paper Prediction-Guided Multi-Objective Reinforcement Learning for Continuous Robot Contro

MIT Graphics Group 65 Jan 7, 2023
Implementation of algorithms for continuous control (DDPG and NAF).

DEPRECATION This repository is deprecated and is no longer maintaned. Please see a more recent implementation of RL for continuous control at jax-sac.

Ilya Kostrikov 288 Dec 31, 2022
ROS-UGV-Control-Interface - Control interface which can be used in any UGV

ROS-UGV-Control-Interface Cam Closed: Cam Opened:

Ahmet Fatih Akcan 1 Nov 4, 2022
Hand Gesture Volume Control is AIML based project which uses image processing to control the volume of your Computer.

Hand Gesture Volume Control Modules There are basically three modules Handtracking Program Handtracking Module Volume Control Program Handtracking Pro

VITTAL 1 Jan 12, 2022
PyTorch implementation of Memory-based semantic segmentation for off-road unstructured natural environments.

MemSeg: Memory-based semantic segmentation for off-road unstructured natural environments Introduction This repository is a PyTorch implementation of

null 11 Nov 28, 2022
A Pytorch implementation of the multi agent deep deterministic policy gradients (MADDPG) algorithm

Multi-Agent-Deep-Deterministic-Policy-Gradients A Pytorch implementation of the multi agent deep deterministic policy gradients(MADDPG) algorithm This

Phil Tabor 159 Dec 28, 2022
PyTorch implementation of Trust Region Policy Optimization

PyTorch implementation of TRPO Try my implementation of PPO (aka newer better variant of TRPO), unless you need to you TRPO for some specific reasons.

Ilya Kostrikov 366 Nov 15, 2022