# The Unsupervised Reinforcement Learning Benchmark (URLB)
URLB provides a set of leading algorithms for unsupervised reinforcement learning, where agents are first pre-trained without access to extrinsic rewards and then fine-tuned on downstream tasks.
## Requirements
We assume you have access to a GPU that can run CUDA 10.2 and CUDNN 8. The simplest way to install all required dependencies is to create an Anaconda environment by running

```sh
conda env create -f conda_env.yml
```

After the installation ends, you can activate your environment with

```sh
conda activate urlb
```
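Optionally, you can verify that PyTorch sees your GPU (assuming PyTorch is installed by `conda_env.yml`; this uses only PyTorch's standard CUDA query):

```sh
python -c "import torch; print(torch.cuda.is_available())"
```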
## Implemented Agents
Agent | Command | Implementation Author(s) | Paper |
---|---|---|---|
ICM | `agent=icm` | Denis | paper |
ProtoRL | `agent=proto` | Denis | paper |
DIAYN | `agent=diayn` | Misha | paper |
APT(ICM) | `agent=icm_apt` | Hao, Kimin | paper |
APT(Ind) | `agent=ind_apt` | Hao, Kimin | paper |
APS | `agent=aps` | Hao, Kimin | paper |
SMM | `agent=smm` | Albert | paper |
RND | `agent=rnd` | Kevin | paper |
Disagreement | `agent=disagreement` | Catherine | paper |
## Available Domains
We support the following domains.
Domain | Tasks |
---|---|
walker | `stand`, `walk`, `run`, `flip` |
quadruped | `walk`, `run`, `stand`, `jump` |
jaco | `reach_top_left`, `reach_top_right`, `reach_bottom_left`, `reach_bottom_right` |
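During fine-tuning, a downstream task is referred to as `<domain>_<task>`; for example, `walker_run` or `quadruped_jump` (see the fine-tuning instructions below).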
### Domain observation mode
Each domain supports two observation modes: states and pixels.
Mode | Command |
---|---|
states | `obs_type=states` |
pixels | `obs_type=pixels` |
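For example, combining the flags documented above with the `pretrain.py` script described below, you can pre-train ICM from pixel observations on walker:

```sh
python pretrain.py agent=icm domain=walker obs_type=pixels
```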
## Instructions
### Pre-training
To run pre-training, use the `pretrain.py` script:

```sh
python pretrain.py agent=icm domain=walker
```
or, if you want to train a skill-based agent, like DIAYN, run:

```sh
python pretrain.py agent=diayn domain=walker
```
This script will produce several agent snapshots after training for 100k, 500k, 1M, and 2M frames. The snapshots will be stored under the following directory:

```
./pretrained_models/<obs_type>/<domain>/<agent>/
```

For example:

```
./pretrained_models/states/walker/icm/
```
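Since snapshots use the `.pt` extension, they can presumably be inspected with `torch.load`. A minimal sketch (an assumption, not documented behavior; the exact payload layout depends on what `pretrain.py` saves):

```python
import torch

# Hypothetical inspection of a pre-trained snapshot; assumes the .pt file is
# a standard torch.save payload. Loading full agent objects may additionally
# require the URLB source tree on PYTHONPATH so pickle can resolve classes.
payload = torch.load(
    './pretrained_models/states/walker/icm/snapshot_1000000.pt',
    map_location='cpu')
print(type(payload))
if isinstance(payload, dict):
    print(list(payload.keys()))
```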
### Fine-tuning
Once you have pre-trained your method, you can use the saved snapshots to initialize the `DDPG` agent and fine-tune it on a downstream task. For example, let's say you have pre-trained `ICM`; you can fine-tune it on `walker_run` by running the following command:

```sh
python finetune.py pretrained_agent=icm task=walker_run snapshot_ts=1000000 obs_type=states
```

This will load a snapshot stored in `./pretrained_models/states/walker/icm/snapshot_1000000.pt`, initialize `DDPG` with it (both the actor and critic), and start training on `walker_run` using the extrinsic reward of the task.
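Given the snapshot naming above, `snapshot_ts` appears to correspond to the frame count of the snapshot to restore. For instance, assuming that naming, fine-tuning from the 100k-frame snapshot instead would look like:

```sh
python finetune.py pretrained_agent=icm task=walker_run snapshot_ts=100000 obs_type=states
```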
For methods that use skills, you also need to pass the agent and set the `reward_free` tag to false:

```sh
python finetune.py pretrained_agent=smm task=walker_run snapshot_ts=1000000 obs_type=states agent=smm reward_free=false
```
## Monitoring
Logs are stored in the `exp_local` folder. To launch tensorboard run:

```sh
tensorboard --logdir exp_local
```
The console output is also available in the following form:

```
| train | F: 6000 | S: 3000 | E: 6 | L: 1000 | R: 5.5177 | FPS: 96.7586 | T: 0:00:42
```
A training entry decodes as:

```
F  : total number of environment frames
S  : total number of agent steps
E  : total number of episodes
L  : episode length
R  : episode return
FPS: training throughput (frames per second)
T  : total training time
```
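As an illustration of the format, a console entry can be parsed back into a dictionary. This is a hypothetical helper, not part of URLB:

```python
def parse_log_line(line: str) -> dict:
    """Parse a console entry like '| train | F: 6000 | ... | T: 0:00:42'."""
    # Drop the leading pipe, then split the remaining fields on '|'.
    fields = [field.strip() for field in line.strip().strip('|').split('|')]
    mode, *pairs = fields
    entry = {'mode': mode}
    for pair in pairs:
        # Split only on the first ':' so times like '0:00:42' stay intact.
        key, value = pair.split(':', 1)
        entry[key.strip()] = value.strip()
    return entry

print(parse_log_line(
    '| train | F: 6000 | S: 3000 | E: 6 | L: 1000 | R: 5.5177 '
    '| FPS: 96.7586 | T: 0:00:42'))
```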