CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

Overview

Oier Mees, Lukas Hermann, Erick Rosete-Beas, Wolfram Burgard

We present CALVIN (Composing Actions from Language and Vision), an open-source simulated benchmark for learning long-horizon language-conditioned tasks. Our aim is to make it possible to develop agents that can solve many robotic manipulation tasks over a long horizon, from onboard sensors, and specified only via human language. CALVIN tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets, and the benchmark supports flexible specification of sensor suites.

💻 Quick Start

To begin, clone this repository locally:

$ git clone --recurse-submodules https://github.com/mees/calvin.git
$ export CALVIN_ROOT=$(pwd)/calvin

Install requirements:

$ cd $CALVIN_ROOT
$ virtualenv -p $(which python3) --system-site-packages calvin_env # or use conda
$ source calvin_env/bin/activate
$ sh install.sh

Download the dataset (choose which split to download via the argument: D, ABC, or ABCD):

$ cd $CALVIN_ROOT/dataset
$ sh download_data.sh D | ABC | ABCD

🏋️‍♂️ Train Baseline Agent

Train baseline models:

$ cd $CALVIN_ROOT/calvin_models/calvin_agent
$ python training.py

Want to scale your training to a multi-GPU setup? Just specify the number of GPUs, and PyTorch Lightning will automatically use DDP for training. To train on all available GPUs:

$ python training.py trainer.gpus=-1

If you have access to a Slurm cluster, we also provide training scripts here.

You can use Hydra's flexible overriding system to change hyperparameters. For example, to train a model with RGB images from both the static camera and the gripper camera:

$ python training.py datamodule/observation_space=lang_rgb_static_gripper model/perceptual_encoder=gripper_cam

To train a model with RGB-D from both cameras:

$ python training.py datamodule/observation_space=lang_rgbd_both model/perceptual_encoder=RGBD_both

To train a model with RGB images from the static camera and visual-tactile observations:

$ python training.py datamodule/observation_space=lang_rgb_static_tactile model/perceptual_encoder=static_RGB_tactile

To see all available hyperparameters:

$ python training.py --help

To resume a training run, just override the Hydra working directory:

$ python training.py hydra.run.dir=runs/my_dir

🖼️ Sensory Observations

CALVIN supports a range of sensors commonly used for visuomotor control; a sample frame layout is sketched after the list:

  1. Static camera RGB images - with shape 200x200x3.
  2. Static camera depth maps - with shape 200x200x1.
  3. Gripper camera RGB images - with shape 200x200x3.
  4. Gripper camera depth maps - with shape 200x200x1.
  5. Tactile images - with shape 120x160x2x3.
  6. Proprioceptive state - EE position (3), EE orientation in Euler angles (3), gripper width (1), joint positions (7), gripper action (1).
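
For illustration, a single frame of data bundles these modalities roughly as follows. This is a sketch only: the key names and dtypes are assumptions, not the exact dataset schema.

import numpy as np

frame = {
    "rgb_static": np.zeros((200, 200, 3), dtype=np.uint8),      # static camera RGB
    "depth_static": np.zeros((200, 200, 1), dtype=np.float32),  # static camera depth
    "rgb_gripper": np.zeros((200, 200, 3), dtype=np.uint8),     # gripper camera RGB
    "depth_gripper": np.zeros((200, 200, 1), dtype=np.float32), # gripper camera depth
    "rgb_tactile": np.zeros((120, 160, 2, 3), dtype=np.uint8),  # two 120x160 RGB tactile images
    "robot_obs": np.zeros(15, dtype=np.float32),                # 3 + 3 + 1 + 7 + 1 proprioceptive values
}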

🕹️ Action Space

In CALVIN, the agent must perform closed-loop continuous control to follow unconstrained language instructions characterizing complex robot manipulation tasks, sending continuous actions to the robot at 30 Hz. To give researchers and practitioners the freedom to experiment with different action spaces, CALVIN supports the following action spaces (an example action vector follows the list):

  1. Absolute Cartesian pose - EE position (3), EE orientation in Euler angles (3), gripper action (1).
  2. Relative Cartesian displacement - EE position (3), EE orientation in Euler angles (3), gripper action (1).
  3. Joint action - Joint positions (7), gripper action (1).
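
For example, a relative Cartesian displacement action is a 7-dimensional vector. The ordering below follows item 2 above; the gripper sign convention (1 = open, -1 = close) is an assumption of this sketch.

import numpy as np

# [dx, dy, dz, d_roll, d_pitch, d_yaw, gripper]
action = np.array([0.0, 0.0, 0.05, 0.0, 0.0, 0.0, -1.0])  # small upward displacement while closing the gripper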

💪 Evaluation: The CALVIN Challenge

Long-horizon Multi-task Language Control (LH-MTLC)

The aim of the CALVIN benchmark is to evaluate the learning of long-horizon language-conditioned continuous control policies. In this setting, a single agent must solve complex manipulation tasks by understanding a series of unconstrained language expressions in a row, e.g., "open the drawer... pick up the blue block... now push the block into the drawer... now open the sliding door". We provide an evaluation protocol with modes of varying difficulty, defined by different combinations of sensor suites and numbers of training environments. To avoid a biased initial position, the robot is reset to a neutral position before every multi-step sequence.
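
For intuition, the long-horizon score reports how far the agent gets through each chain of instructions. A toy computation of per-chain-length success rates, assuming chains of 5 instructions:

# results: for each evaluated sequence, the number of consecutively solved subtasks
results = [5, 0, 2, 1, 3, 0, 5, 4]

for k in range(1, 6):
    rate = sum(r >= k for r in results) / len(results)
    print(f"{k} tasks in a row: {rate:.2%}")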

To evaluate a trained CALVIN baseline agent, run the following commands:

$ cd $CALVIN_ROOT/calvin_models/calvin_agent
$ python evaluation/evaluate_policy.py --dataset_path <PATH/TO/DATASET> --train_folder <PATH/TO/TRAINING/FOLDER>

Optional arguments:

  • --checkpoint <PATH/TO/CHECKPOINT>: by default, the evaluation loads the last checkpoint in the training log directory. You can instead specify the path to a different checkpoint by adding this argument to the evaluation command.
  • --debug: print debug information and visualize the environment.

If you want to evaluate your own model architecture on the CALVIN challenge, you can implement the CustomModel class in evaluate_policy.py as an interface to your agent (see the sketch after this list). You need to implement the following methods:

  • __init__(): gets called once at the beginning of the evaluation.
  • reset(): gets called at the beginning of each evaluation sequence.
  • step(obs, goal): gets called every step and returns the predicted action.
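
A minimal skeleton of such a class, assuming only the three methods listed above; the zero action returned by step() is a placeholder, not a meaningful policy:

import numpy as np

class CustomModel:
    def __init__(self):
        # Called once at the beginning of the evaluation:
        # load your checkpoint and build your network here.
        pass

    def reset(self):
        # Called at the beginning of each evaluation sequence:
        # clear any recurrent state or action buffers.
        pass

    def step(self, obs, goal):
        # Called every step; must return the predicted action.
        # Placeholder: a 7-dim zero action (EE displacement + gripper).
        return np.zeros(7)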

Then evaluate the model by running:

$ python evaluation/evaluate_policy.py --dataset_path <PATH/TO/DATASET> --custom_model

You are also free to use your own language model instead of the precomputed language embeddings provided by CALVIN. For this, implement CustomLangEmbeddings in evaluate_policy.py and add --custom_lang_embeddings to the evaluation command.
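
A rough sketch of what such a class could look like; the method name get_lang_goal is hypothetical (check the interface evaluate_policy.py expects), and the SBert model choice is arbitrary:

from sentence_transformers import SentenceTransformer

class CustomLangEmbeddings:
    def __init__(self):
        # Hypothetical example: embed instructions with an off-the-shelf SBert model.
        self.model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

    def get_lang_goal(self, instruction):
        # Return the embedding for a single instruction string.
        return self.model.encode([instruction])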

Multi-task Language Control (MTLC)

Alternatively, you can evaluate the policy on single tasks, without resetting the robot to a neutral position. Note that this evaluation is currently only available for our baseline agent.

$ python evaluation/evaluate_policy_singlestep.py --dataset_path <PATH/TO/DATASET> --train_folder <PATH/TO/TRAINING/FOLDER> [--checkpoint <PATH/TO/CHECKPOINT>] [--debug]

Pre-trained Model

Download the MCIL model checkpoint trained on static camera RGB images of environment D:

$ wget http://calvin.cs.uni-freiburg.de/model_weights/D_D_static_rgb_baseline.zip
$ unzip D_D_static_rgb_baseline.zip

💬 Relabeling Raw Language Annotations

Want to try learning language-conditioned policies in CALVIN with a new, awesome language model?

We provide an example script to relabel the annotations with different language models provided by SBert, such as the larger MPNet (paraphrase-mpnet-base-v2) or its corresponding multilingual model (paraphrase-multilingual-mpnet-base-v2). The supported options are "mini", "mpnet" and "multi". If you want to try other SBert models, just change the model name here.

$ cd $CALVIN_ROOT/calvin_models/calvin_agent
$ python utils/relabel_with_new_lang_model.py +path=$CALVIN_ROOT/dataset/task_D_D/ +name_folder=new_lang_model_folder model.nlp_model=mpnet

If you additionally want to sample different language annotations for each sequence (from the same task annotations) in the training split, run the same command with the parameter reannotate=true.
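
Under the hood, relabeling amounts to re-encoding each annotation string with the chosen SBert model. A standalone sketch of that step (the model name matches the "mpnet" option above):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-mpnet-base-v2")
embeddings = model.encode(["open the drawer", "pick up the blue block"])
print(embeddings.shape)  # (2, 768) -- MPNet-based SBert models produce 768-dim embeddings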

📈 SOTA Models

Open-source models that outperform the MCIL baselines from CALVIN:

Contact Oier to add your model here.

Reinforcement Learning with CALVIN

Interested in trying reinforcement learning agents on the different manipulation tasks in the CALVIN environment? We provide a Google Colab showcasing how to leverage the CALVIN task indicators to train RL agents with a sparse reward.
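
A minimal sketch of the idea, assuming a gym-style env and a task oracle with a get_task_info(start_info, end_info) check (as referenced in the issues below); the wrapper details are assumptions, so see the Colab for the actual code:

class SparseRewardWrapper:
    """Wrap a CALVIN env: reward 1.0 once the target task is detected as solved, else 0.0."""

    def __init__(self, env, tasks, task_name):
        self.env, self.tasks, self.task_name = env, tasks, task_name

    def reset(self):
        obs = self.env.reset()
        self.start_info = self.env.get_info()  # assumption: env exposes its state info
        return obs

    def step(self, action):
        obs, _, done, info = self.env.step(action)
        solved = self.task_name in self.tasks.get_task_info(self.start_info, info)
        return obs, float(solved), done or solved, info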

Citation

If you find the dataset or code useful, please cite:

@article{calvin21,
author = {Oier Mees and Lukas Hermann and Erick Rosete-Beas and Wolfram Burgard},
title = {CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks},
journal={arXiv preprint arXiv:2112.03227},
year = 2021,
}

License

MIT License

Comments
  • I am trying to follow the instructions to run the code; after resolving some errors, there are some errors I cannot fix

    The commands I executed:
    $ cd $CALVIN_ROOT/calvin_models/calvin_agent
    $ python training.py

    The error I got after fixing some others:

    File "/home/nikepupu/anaconda3/envs/calvin/lib/python3.7/site-packages/hydra/_internal/utils.py", line 573, in _locate
        raise ImportError(f"Error loading module '{path}'") from e
    ImportError: Error loading module 'lfp.utils.transforms.NormalizeVector'

    I investigated the GitHub repository and found that this class comes from a module in the original learning_from_play codebase, which is missing here:

    python /home/hermannl/repos/learning_from_play/lfp/training.py

    In addition, I needed to modify utils/transforms.py to get to this point.

    I modified ScaleImageTensor into this NormalizeVector class (the float-to-list handling in __init__ is my change):

    class NormalizeVector(object):
        """Normalize a tensor vector with mean and standard deviation."""

        def __init__(self, mean=0.0, std=1.0):
            if isinstance(mean, float):
                mean = [mean]
            if isinstance(std, float):
                std = [std]
            print("success")
            self.std = torch.Tensor(std)
            self.std[self.std == 0.0] = 1.0
            self.mean = torch.Tensor(mean)

        def __call__(self, tensor: torch.Tensor) -> torch.Tensor:
            assert isinstance(tensor, torch.Tensor)
            return (tensor - self.mean) / self.std

        def __repr__(self):
            return self.__class__.__name__ + "(mean={0}, std={1})".format(self.mean, self.std)
    

    I also modified AddDepthNoise to this (the rewritten __repr__ is my change):

    class AddDepthNoise(object):
        """Add multiplicative gamma noise to depth image.
        Adapted from the DexNet 2.0 code:
        https://github.com/BerkeleyAutomation/gqcnn/blob/master/gqcnn/training/tf/trainer_tf.py"""

        def __init__(self, shape=1000.0, rate=1000.0):
            self.shape = torch.tensor(shape)
            self.rate = torch.tensor(rate)
            self.dist = torch.distributions.gamma.Gamma(torch.tensor(shape), torch.tensor(rate))

        def __call__(self, tensor: torch.Tensor) -> torch.Tensor:
            assert isinstance(tensor, torch.Tensor)
            multiplicative_noise = self.dist.sample()
            return multiplicative_noise * tensor

        def __repr__(self):
            # return self.__class__.__name__ + f"{self.shape=},{self.rate=},{self.dist=}"
            return self.__class__.__name__ + "(shape={0}, rate={1}, dist={2})".format(self.shape, self.rate, self.dist)
    
    opened by nikepupu 15
  • Stuck at beginning of training

    Hi!

    Running the baseline training gets stuck at the very beginning. Do you have any clue why that might be? Is it normal for iterations to take 23.91s/it? There is no error.

    The only difference from your requirements is the PyTorch version, as only the nightly release seems to work with CUDA and PyTorch Lightning 1.4.9 on our machine.

    root_data_dir: /media/dennisushi/DREVO-P1/DATA/calvin/task_D_D/task_A_A
    ...
    slurm: false
    ...
    [2022-03-17 11:16:03,707][__main__][INFO] - * CUDA:
    	- GPU:
    		- GeForce RTX 3080
    		- GeForce RTX 3080
    	- available:         True
    	- version:           11.1
    * Packages:
    	- numpy:             1.21.2
    	- pyTorch_debug:     False
    	- pyTorch_version:   1.12.0.dev20220224+cu111
    	- pytorch-lightning: 1.4.9
    	- tqdm:              4.63.0
    ...
    ...
    [2022-03-17 11:16:22,988][calvin_agent.models.play_lmp][INFO] - Finished validation epoch 0
    Global seed set to 42                                                                                                
    Epoch 0:   0%|                                                                   | 0/19063 [00:00<00:04, 3979.42it/s][2022-03-17 11:16:23,004][calvin_agent.models.play_lmp][INFO] - Start training epoch 0
    Epoch 0:   0%|                                          | 3/19063 [01:35<126:35:39, 23.91s/it, loss=42.1, v_num=6-02]
    
    opened by dennisushi 10
  • Feature request: single task datasets

    Hi there! I think it would be really nice if there were a script and dataset for a selection of individual tasks in CALVIN, so that one could test a method on just a single task. I've started working on this already; does it sound like a useful feature?

    opened by ezhang7423 8
  • Resetting env to state from dataset

    Hi,

    I'm trying to generate skill ID / language annotations for the unlabeled frames in the dataset. I was thinking of using the reset_from_storage method in the environment class to reset to a state from the dataset and then using the task checker to check for task success. However, the reset function requires a serialized version of the env/robot state, which is not provided. Is there a way I could reset the env from offline data, or is there another way to get skill annotations for the entire dataset?

    Thanks!

    opened by aliang8 7
  • Major concern about evaluation

    Hi there! I've found that rolling out ground-truth trajectories (labelled by the language annotator) from the dataset is not always evaluated as successful by Tasks.get_task_info. This seems quite concerning. Perhaps I've done something wrong on my end?

    To reproduce, I have forked the repo with minimal changes here: https://github.com/mees/calvin/pull/33. The only difference is on line 47 of calvin_models/calvin_agent/evaluation/evaluate_policy_singlestep.py, where instead of rolling out the model I roll out the dataset actions.

    The exact commands I ran from beginning to end:

    # set up environment
    git clone git@github.com:ezhang7423/calvin.git --recursive
    cd calvin
    conda create --name calvin python=3.8
    conda activate calvin
    pip install setuptools==57.5.0 torchmetrics==0.6.0
    ./install.sh
    
    # get pretrained weights and fix the config.yaml
    cp ./D_D_static_rgb_baseline/.hydra/config.yaml ./tmp.yaml
    wget http://calvin.cs.uni-freiburg.de/model_weights/D_D_static_rgb_baseline.zip
    unzip D_D_static_rgb_baseline.zip
    mv ./tmp.yaml ./D_D_static_rgb_baseline/.hydra/config.yaml
    
    # get data
    cd dataset
    ./download_data.sh D
    cd ../
    
    # run the evaluation
    python calvin_models/calvin_agent/evaluation/evaluate_policy_singlestep.py --dataset_path $DATA_GRAND_CENTRAL/task_D_D/ --train_folder ./D_D_static_rgb_baseline/ --checkpoint D_D_static_rgb_baseline/mcil_baseline.ckpt
    
    opened by ezhang7423 5
  • The proportion of the recorded robot interaction data with language instructions

    Hi,

    Thanks for your excellent benchmark!

    I have a question regarding the proportion of the recorded robot interaction data with language instructions.

    The CALVIN paper says that "we annotate only 1% of the recorded robot interaction data with language instructions."

    After I downloaded the dataset "task_D_D", "ep_start_end_ids.npy" under the training folder records 512046 unique episodes (saved as .npz files). Under the "training/lang_annotations" folder, "auto_lang_ann.npy" records 303794 episodes, of which 192607 are unique. It thus seems that 192607 episodes out of all 512046 episodes in the training set are annotated with language instructions. The proportion is 192607/512046, which differs from the 1% in the paper.

    If my analysis is correct, why is the proportion of recorded robot interaction data with language instructions in the dataset different from that in the paper?

    Looking forward to your reply.

    opened by geyuying 5
  • "InvalidGitRepositoryError" while running jupyter notebook

    Hi, I'd like to run the RL_with_CALVIN.ipynb file in my locally established environment, but I hit the following issue when running the line env = hydra.utils.instantiate(cfg.env): InvalidGitRepositoryError: Error instantiating 'calvin_env.envs.play_table_env.PlayTableSimEnv'. I don't know how to solve it, so please give me some pointers.

    P.S. There also seem to be some script issues when opening RL_with_CALVIN.ipynb.

    opened by 2000222 5
  • Language annotations from the automatic annotation tool

    Hi,

    I visualized episodes with language instructions by sampling 4 images ("rgb_static") within each episode in the ABC training set (task_ABC_D), and some language instructions ("task" and "ann") seem to be wrong, as shown below:

    35_744905_744969_grasp the blue block and rotate it right

    The language instruction of the above episode is "grasp the blue block and rotate it right". You can check the example whose ['info']['indx'] is (744905, 744969). Such cases are not rare in the ABC training set, but not in the D training and validation sets.

    I further used the automatic annotation tool to re-annotate the episodes, as shown in https://github.com/mees/calvin/issues/24, and got the same task information ("rotate_blue_block_right" in the above example) as in the downloaded "auto_lang_ann.npy".

    Is there any way to make the language annotations more accurate?

    Looking forward to your reply.

    Best regards, Yuying

    opened by geyuying 4
  • `scripts/automatic_lang_annotator_mp.py` appears to be broken

    The configuration file that this script uses, lang_ann.yml, appears to be missing two required keys: rollout_sentences and annotations. To replicate this, simply try running the script.

    opened by ezhang7423 4
  • Simple script to visualize and run through data from a CALVIN dataset.

    In reference to #20, I submit this very rudimentary visualization script. It simply finds all the episode files in a given folder and then lets the user scrub through the data with the arrow keys. It could benefit from a tqdm indicator showing where one is in the dataset and from keys to skip larger parts.

    opened by ARoefer 4
  • Errors with EGL

    Thanks for this work. When I followed the README and ran python training.py datamodule.root_data_dir=/path/to/dataset/, it reported Segmentation fault (core dumped) when loading the EGL plugin in calvin_env.

    I use an Ubuntu 16.04 server with an NVIDIA 2080 Ti card; the driver is nvidia-container-runtime 3.5.0-1 and CUDA is 11.2. I have searched the Internet for a while, e.g., installing Mesa via sudo apt-get install libglfw3-dev libgles2-mesa-dev, but it still did not work.

    Could you advise how to enable EGL with my hardware setup, and what EGL is used for? Is it for rendering the robot?

    By the way, how long does training the baseline take on each of the three provided datasets, i.e., task_D.zip, task_ABC_D.zip, task_ABCD_D.zip?

    Thanks very much.

    opened by zhaozj89 4