The Unsupervised Reinforcement Learning Benchmark (URLB)

Overview

URLB provides a set of leading algorithms for unsupervised reinforcement learning, where agents are first pre-trained without access to extrinsic rewards and then fine-tuned on downstream tasks.

Requirements

We assume you have access to a GPU that can run CUDA 10.2 and cuDNN 8. The simplest way to install all required dependencies is to create an Anaconda environment by running:

conda env create -f conda_env.yml

After the installation finishes, you can activate your environment with:

conda activate urlb
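
Before launching any training, you may want to check that PyTorch can see your GPU from inside the environment. A minimal sanity check (assuming only that PyTorch is installed by conda_env.yml) is:

    import torch

    # Quick check of the CUDA setup from inside the urlb environment.
    print(torch.__version__)
    print('CUDA available:', torch.cuda.is_available())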

Implemented Agents

Agent          Command              Implementation Author(s)   Paper
ICM            agent=icm            Denis                      paper
ProtoRL        agent=proto          Denis                      paper
DIAYN          agent=diayn          Misha                      paper
APT(ICM)       agent=icm_apt        Hao, Kimin                 paper
APT(Ind)       agent=ind_apt        Hao, Kimin                 paper
APS            agent=aps            Hao, Kimin                 paper
SMM            agent=smm            Albert                     paper
RND            agent=rnd            Kevin                      paper
Disagreement   agent=disagreement   Catherine                  paper

Available Domains

We support the following domains.

Domain      Tasks
walker      stand, walk, run, flip
quadruped   walk, run, stand, jump
jaco        reach_top_left, reach_top_right, reach_bottom_left, reach_bottom_right

Domain observation mode

Each domain supports two observation modes: states and pixels.

Mode     Command
states   obs_type=states
pixels   obs_type=pixels

Instructions

Pre-training

To run pre-training, use the pretrain.py script:

python pretrain.py agent=icm domain=walker

or, if you want to train a skill-based agent such as DIAYN, run:

python pretrain.py agent=diayn domain=walker
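
Both commands accept the observation-mode flag from the table above; for example, a pixel-based pre-training run should look like:

python pretrain.py agent=icm domain=walker obs_type=pixels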

This script will produce several agent snapshots after training for 100k, 500k, 1M, and 2M frames. The snapshots will be stored under the following directory:

./pretrained_models/<obs_type>/<domain>/<agent>/

For example:

./pretrained_models/states/walker/icm/
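
If you want to confirm that pre-training wrote the snapshots, you can simply list that directory. The short sketch below only assumes the snapshot_<frames>.pt naming used in the fine-tuning example further down:

    from pathlib import Path

    # List the snapshots produced by pretrain.py (path taken from the example above).
    snapshot_dir = Path('./pretrained_models/states/walker/icm')
    for ckpt in sorted(snapshot_dir.glob('snapshot_*.pt')):
        print(ckpt.name)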

Fine-tuning

Once you have pre-trained your method, you can use the saved snapshots to initialize the DDPG agent and fine-tune it on a downstream task. For example, if you have pre-trained ICM, you can fine-tune it on walker_run by running the following command:

python finetune.py pretrained_agent=icm task=walker_run snapshot_ts=1000000 obs_type=states

This will load a snapshot stored in ./pretrained_models/states/walker/icm/snapshot_1000000.pt, initialize DDPG with it (both the actor and critic), and start training on walker_run using the extrinsic reward of the task.
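
finetune.py handles this snapshot loading for you; the sketch below only illustrates the weight-transfer idea in isolation. The checkpoint layout and the pretrained/ddpg objects are hypothetical stand-ins, not the repository's actual API:

    import torch

    def init_ddpg_from_snapshot(ddpg, snapshot_path):
        # Hypothetical helper: load a pre-trained agent saved with torch.save and
        # copy its actor and critic weights into a fresh DDPG agent. finetune.py
        # performs an equivalent initialization internally.
        pretrained = torch.load(snapshot_path, map_location='cpu')
        ddpg.actor.load_state_dict(pretrained.actor.state_dict())
        ddpg.critic.load_state_dict(pretrained.critic.state_dict())
        return ddpg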

For methods that use skills, also include the agent override and set the reward_free flag to false:

python finetune.py pretrained_agent=smm task=walker_run snapshot_ts=1000000 obs_type=states agent=smm reward_free=false

Monitoring

Logs are stored in the exp_local folder. To launch TensorBoard, run:

tensorboard --logdir exp_local

The console output is also available in the following form:

| train | F: 6000 | S: 3000 | E: 6 | L: 1000 | R: 5.5177 | FPS: 96.7586 | T: 0:00:42

A training entry decodes as follows:

F  : total number of environment frames
S  : total number of agent steps
E  : total number of episodes
L  : episode length
R  : episode return
FPS: training throughput (frames per second)
T  : total training time
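
If you need these values programmatically (e.g. for custom plots), a console line in the format above can be split into a small dictionary. The snippet below is a throwaway parsing sketch, not part of the repository:

    # Parse one console line in the format shown above into a dict.
    line = "| train | F: 6000 | S: 3000 | E: 6 | L: 1000 | R: 5.5177 | FPS: 96.7586 | T: 0:00:42"
    fields = [f.strip() for f in line.strip().strip('|').split('|')]
    entry = {'phase': fields[0]}
    for field in fields[1:]:
        key, value = field.split(':', 1)
        entry[key.strip()] = value.strip()
    print(entry)  # {'phase': 'train', 'F': '6000', ..., 'T': '0:00:42'}
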
Comments
  • Bug in replay buffer

    In line 49 of replay_buffer.py, time_step is an ExtendedTimeStepWrapper object, not a list.

    Why do you use indices to extract information?

    value = time_step[spec.name]

    The error is

    TypeError: tuple indices must be integers or slices, not str

    opened by Baichenjia 7
  • Task identification mechanism in APS

    Dear Misha Laskin,

    I am very grateful for open-sourcing the well-written code.

    It is really helpful for my research!

    However, I have one question about the implementation of fine-tuning APS.

    https://github.com/rll-research/url_benchmark/blob/710c3eb04e60ef559525bc90136ee4e1acae4c97/finetune.py#L196-L197

    As shown in the code block in finetune.py, the task vector (named meta) is updated periodically "after" the initial seed frames.

    However, the original APS paper says that the task vector is searched using the initial seed frames and is "fixed" during the fine-tuning phase.

    Therefore, I understand that the code should be revised as follows (the inequality sign is reversed): if self.global_step < ( init_step // repeat) and self.global_step % every == 0:

    I wonder whether I am missing something,

    and I hope you can provide some explanation about my question.

    Best,

    Junsu Kim

    opened by junsu-kim97 5
  • Values used for normalized score calculation

    I couldn't find the values used for the normalized score calculation either in the paper or in the repo. It would be convenient if we were able to compare new methods based on the same metric (mean normalized return). Also, the values themselves do not appear anywhere in the paper, only in figures, which is a bit confusing.

    opened by Randl 5
  • Buffer Empty

    Hi all,

    I'm trying to get some data to work with ExORL, but the buffer directory appears to be empty when saving the dataset. Were there any edits to urlb to generate the datasets used with ExORL? Thanks for any help :smiley:

    I am running pretrain.py with: python pretrain.py agent=aps domain=walker

    I believe this function should be saving the .npz buffer file:

    https://github.com/rll-research/url_benchmark/blob/bb98f0c6d78b3c467fb5a9fa5bbba3b7c0250397/replay_buffer.py#L18

    opened by AOS55 3
  • How to use finetuning code?

    I would like to try out fine-tuning after pre-training. I followed the instructions and used:

    python finetune.py pretrained_agent=icm task=walker_run snapshot_ts=1000000 obs_type=states
    

    Unfortunately, this gives me the following error:

    Could not override 'pretrained_agent'.
    To append to your config use +pretrained_agent=icm
    Key 'pretrained_agent' is not in struct
        full_key: pretrained_agent
        object_type=dict
    Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
    

    What should I do here? Thanks!

    opened by VivianXue123 3
  • ICM implementation

    Hi,

    I had a few questions/comments regarding the ICM implementation.

    Unlike the original "Curiosity-driven exploration by self-supervised prediction" paper, this implementation doesn't use the inverse dynamics model to learn a feature space in which forward predictions are made. In fact, it seems the inverse model is not being used for anything?

    Also, is it problematic that the same encoder is being used (i) in the process of producing intrinsic rewards, and (ii) in the process of predicting intrinsic rewards (i.e. when predicting DDPG Q-values during pre-training)? I believe in the original paper the ICM module and the RL agent use a separate encoder.

    I'm just wondering if you had any useful insights regarding these design choices. Apologies if I misinterpreted anything from the code or paper.

    opened by RobertMcCarthy97 3
  • should the encoder parameters be updated twice in each iteration?

    Hi, thank you very much for such wonderful work and implementation. In the implementation, the encoder has its own separate optimiser and does a separate update on top of the agent optimiser step. Doesn't the ICM (or DDPG/AC) update the encoder parameters?

    I'm wondering if there is any advantage to using a separate optimiser/update for the encoder, and whether it's necessary for the model?

    Thank you

        def update(self, replay_iter, step):
            metrics = dict()
    
            if step % self.update_every_steps != 0:
                return metrics
    
            batch = next(replay_iter)
            obs, action, extr_reward, discount, next_obs = utils.to_torch(
                batch, self.device)
    
            # augment and encode
            obs = self.aug_and_encode(obs)
            with torch.no_grad():
                next_obs = self.aug_and_encode(next_obs)
    
            if self.reward_free:
                metrics.update(self.update_icm(obs, action, next_obs, step))
    
        def update_icm(self, obs, action, next_obs, step):
            metrics = dict()
    
            forward_error, backward_error = self.icm(obs, action, next_obs)
    
            loss = forward_error.mean() + backward_error.mean()
    
            self.icm_opt.zero_grad(set_to_none=True)
            if self.encoder_opt is not None:
                self.encoder_opt.zero_grad(set_to_none=True)
            loss.backward()
            self.icm_opt.step()
            if self.encoder_opt is not None:
                self.encoder_opt.step()
    
            if self.use_tb or self.use_wandb:
                metrics['icm_loss'] = loss.item()
    
            return metrics
    
    opened by kevinNejad 1
  • Could not override 'pretrained_agent'. To append to your config use +pretrained_agent=icm

    When I try to fine-tune the agent after pre-training, using the command below as mentioned in README.md,

    python finetune.py pretrained_agent=icm task=walker_run snapshot_ts=1000000 obs_type=states
    

    This error comes.

    Could not override 'pretrained_agent'.
    To append to your config use +pretrained_agent=icm
    Key 'pretrained_agent' is not in struct
        full_key: pretrained_agent
        object_type=dict
    

    Should I modify finetune.yaml?

    opened by jsrimr 0
  • SMM intrinsic motivation signs

    Hey,

    Not sure if anyone can clarify; I just wanted to check the signs of the intrinsic reward for SMM.

    intr_reward = pred_log_ratios + self.latent_ent_coef * h_z + self.latent_cond_ent_coef * h_z_s.detach()

    The original paper in equation 3 has:

    r_z(s) = log(p*(s)) - log(rho_pi(s|z)) + log(p(z|s)) - log(p(z))

    Why do we add log(rho_pi(s|z)) == pred_log_ratios and log(p(z)) == self.latent_ent_coef and not subtract them as in equation 3? Sorry if this is obvious 😄

    opened by AOS55 0
  • The representation dimension in the code is inconsistent with the paper

    Hello! I noticed that the representation dim of some models (such as ICM) is 512 in Table 3. However, in the code, the representation dimension of these models is 39200. When I use these models, should I add a linear layer after the conv layers to project the representation to 512 dimensions?

    opened by zhang1999 0
  • How to identify whether the unsupervised RL algorithm actually learns something?

    Nice work on this benchmark; I am working on transferring it to my custom environment. I want to ask how I can verify that the unsupervised RL algorithm truly learns something rather than just producing random trajectories. Are there any metrics that can help me identify that?

    opened by waterhorse1 0
  • Use URL as a Package?

    I was wondering if there was ever discussion about using the URL agents as a package. For example, I'm working in an environment with discrete action spaces, so I need a different training script, but would like an easy way to port over the reward models.

    Or, is there another exploration agent library that is better suited for that?

    cc @aliciafmachado

    opened by natolambert 1
  • Questions on numerical results

    Hi, when reading the paper and checking issue #1, I found that the numerical results in Appendices C and F are inconsistent with the provided expert scores. For example, in Figure 7, ICM with 10^5 pretraining on walker_walk has about a 50% normalized score, while the numerical result in Table 5 is 302±45. As the expert score mentioned in #1 is 971, this is equivalent to about a 31% normalized score. Did I miss something here? Also, it seems that the scores for pretraining methods, in general, cannot compete with SOTA methods like CURL and DrQ, which do not require any pretraining. Is there any explanation for this? Thanks!

    opened by MouseHu 0
  • Why pass 'env' when recording eval video while passing 'obs' when recording train video?

    When I run python pretrain.py agent=icm domain=walker save_train_video=true, I encounter an error

      File "/root/url_benchmark/video.py", line 83, in record
        frame = cv2.resize(obs[-3:].transpose(1, 2, 0),
    

    But recording the evaluation video is okay. Then I found that we call the .record method differently during training and evaluation.

    When we record eval video,

    self.video_recorder.record(self.eval_env)
    

    we pass 'env'.

    When we record train video,

    self.train_video_recorder.record(time_step.observation)
    

    we pass observation.

    Is this intended?

    opened by jsrimr 0