CLIPort
CLIPort: What and Where Pathways for Robotic Manipulation
Mohit Shridhar, Lucas Manuelli, Dieter Fox
CoRL 2021
CLIPort is an end-to-end imitation-learning agent that can learn a single language-conditioned policy for various tabletop tasks. The framework combines the broad semantic understanding (what) of CLIP with the spatial precision (where) of TransporterNets to learn generalizable skills from limited training demonstrations.
For the latest updates, see: cliport.github.io
Guides
- Getting Started: Installation, Quick Tutorial, Checkpoints & Objects
- Data Generation: Dataset, Tasks
- Training & Evaluation: Single Task, Multi Task
- Miscellaneous: Notebooks, Docker Guide, Disclaimers, Real-Robot Training FAQ
- References: Citations, Acknowledgements
Installation
Clone Repo:
git clone https://github.com/cliport/cliport.git
Setup virtualenv and install requirements:
# setup virtualenv with whichever package manager you prefer
virtualenv -p $(which python3.8) --system-site-packages cliport_env
source cliport_env/bin/activate
pip install --upgrade pip
cd cliport
pip install -r requirements.txt
export CLIPORT_ROOT=$(pwd)
python setup.py develop
Note: You might need versions of torch==1.7.1 and torchvision==0.8.2 that are compatible with your CUDA and hardware.
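One way to confirm what actually got installed is a small version check. This is a minimal sketch, not part of the repo: the helper names (base_version, check_pins, installed_versions) are made up for illustration, and the only facts taken from the note above are the two pinned versions. It strips local build suffixes such as +cu110 before comparing, since CUDA-specific wheels usually carry one.

```python
from importlib import metadata

# Versions mentioned in the note above.
PINNED = {"torch": "1.7.1", "torchvision": "0.8.2"}


def base_version(version):
    """Drop a local build suffix such as '+cu110' from a version string."""
    return version.split("+", 1)[0]


def check_pins(installed, pinned=PINNED):
    """Return {package: (installed_or_None, pinned)} for every mismatch."""
    mismatches = {}
    for name, want in pinned.items():
        have = installed.get(name)
        if have is None or base_version(have) != want:
            mismatches[name] = (have, want)
    return mismatches


def installed_versions(names):
    """Look up installed versions without importing the packages."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # not installed; check_pins reports it as None
    return found


if __name__ == "__main__":
    for name, (have, want) in check_pins(installed_versions(PINNED)).items():
        print(f"{name}: installed {have}, README suggests {want}")
```

A mismatch is not necessarily a problem; it only tells you which versions to look at first if something fails to import or run.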
Quickstart
A quick tutorial on evaluating a pre-trained multi-task model.
Download a pre-trained multi-language-conditioned checkpoint trained with 1000 demos:
python scripts/quickstart_download.py
Generate a small test set of 10 instances for stack-block-pyramid-seq-seen-colors inside $CLIPORT_ROOT/data:
python cliport/demos.py n=10 \
task=stack-block-pyramid-seq-seen-colors \
mode=test
This will take a few minutes to finish.
Evaluate the best validation checkpoint for stack-block-pyramid-seq-seen-colors on the test set:
python cliport/eval.py model_task=multi-language-conditioned \
eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=test \
n_demos=10 \
train_demos=1000 \
exp_folder=cliport_quickstart \
checkpoint_type=test_best \
update_results=True \
disp=True
If you are on a headless machine, turn off the visualization with disp=False.
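If the same launch script runs on both desktops and servers, the disp value can be guessed from the environment. The sketch below is an assumption-laden heuristic, not something the repo provides: it treats a Linux machine with neither DISPLAY nor WAYLAND_DISPLAY set as headless and assumes every other platform has a display. The function names are hypothetical.

```python
import os
import sys


def default_disp(environ=None, platform=None):
    """Guess whether a display is available.

    Assumption: on Linux, an unset DISPLAY and WAYLAND_DISPLAY means headless.
    Other platforms are assumed to have a display.
    """
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform
    if platform.startswith("linux"):
        return bool(environ.get("DISPLAY") or environ.get("WAYLAND_DISPLAY"))
    return True


def disp_override(environ=None, platform=None):
    """Format the guess as a disp=True / disp=False command-line override."""
    return f"disp={default_disp(environ, platform)}"


if __name__ == "__main__":
    print(disp_override())
```

The printed string can be appended to the eval.py command in place of a hard-coded disp value.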
You can evaluate the same multi-language-conditioned model on other tasks. First generate a val set for the task and then specify eval_task=<task_name> with mode=val and checkpoint_type=val_missing (the quickstart doesn't include validation results for all tasks; download all task results from here).
Download
Google Scanned Objects
Download center-of-mass (COM) corrected Google Scanned Objects:
python scripts/google_objects_download.py
Credit: Google.
Pre-trained Checkpoints and Result JSONs
This Google Drive Folder contains pre-trained multi-language-conditioned checkpoints for n=1,10,100,1000 and validation/test result JSONs for all tasks. The *val-results.json files contain the name of the best checkpoint (from validation) to be evaluated on the test set.
Note: Google Drive might complain about bandwidth restrictions. I recommend using rclone with API access enabled.
Evaluate the best validation checkpoint on the test set:
python cliport/eval.py model_task=multi-language-conditioned \
eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=test \
n_demos=10 \
train_demos=100 \
exp_folder=cliport_exps \
checkpoint_type=test_best \
update_results=True \
disp=True
Training and Evaluation
The following is a guide for training everything from scratch. All tasks follow a 4-phase workflow:
- Generate train, val, test datasets with demos.py
- Train agents with train.py
- Run validation with eval.py to find the best checkpoint on val tasks and save *val-results.json
- Evaluate the best checkpoint in *val-results.json on test tasks with eval.py
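The four phases above can be sketched as an ordered list of commands for one task. This is only an illustration assembled from the commands shown elsewhere in this README; the helper names (overrides, workflow) are hypothetical, and the sketch assumes 100 demos for both the val and test sets.

```python
import shlex


def overrides(**params):
    """Format keyword arguments as key=value command-line overrides."""
    return [f"{key}={value}" for key, value in params.items()]


def workflow(task, n_train=1000, n_eval=100, exp_folder="exps"):
    """Return the workflow as a list of argv lists, in execution order."""
    # Phase 1: one demos.py call per split.
    demos = [
        ["python", "cliport/demos.py", *overrides(n=n, task=task, mode=mode)]
        for mode, n in (("train", n_train), ("val", n_eval), ("test", n_eval))
    ]
    # Phase 2: training, with the same overrides as the single-task example.
    train = [
        "python", "cliport/train.py",
        f"train.task={task}",
        "train.agent=cliport",
        "train.attn_stream_fusion_type=add",
        "train.trans_stream_fusion_type=conv",
        "train.lang_fusion_type=mult",
        f"train.n_demos={n_train}",
        "train.n_steps=201000",
        f"train.exp_folder={exp_folder}",
        "dataset.cache=False",
    ]

    def evaluate(mode, checkpoint_type):
        return [
            "python", "cliport/eval.py",
            *overrides(eval_task=task, agent="cliport", mode=mode,
                       n_demos=n_eval, train_demos=n_train,
                       checkpoint_type=checkpoint_type, exp_folder=exp_folder),
        ]

    # Phase 3: validate all checkpoints. Phase 4: test the best one.
    return [*demos, train, evaluate("val", "val_missing"), evaluate("test", "test_best")]


if __name__ == "__main__":
    for cmd in workflow("stack-block-pyramid-seq-seen-colors"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to inspect them; to execute them in order, each argv list could be passed to subprocess.run(cmd, check=True) from $CLIPORT_ROOT.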
Dataset Generation
Single Task
Generate a train set of 1000 demonstrations for stack-block-pyramid-seq-seen-colors inside $CLIPORT_ROOT/data:
python cliport/demos.py n=1000 \
task=stack-block-pyramid-seq-seen-colors \
mode=train
You can also do a sequential sweep with -m and comma-separated params task=towers-of-hanoi-seq-seen-colors,stack-block-pyramid-seq-seen-colors. Use disp=True to visualize the data generation.
Full Dataset
Run generate_dataset.sh to generate the full dataset and save it to $CLIPORT_ROOT/data:
sh scripts/generate_dataset.sh data
Note: This script is not parallelized and will take a long time (maybe days) to finish. The full dataset requires ~1.6TB of storage, which includes both language-conditioned and demo-conditioned (original TransporterNets) tasks. It's recommended that you start with single-task training if you don't have enough storage space.
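Before starting a multi-day generation run, it may be worth checking free space on the target disk. A minimal sketch, assuming the ~1.6TB figure above means decimal terabytes; has_room is a hypothetical helper, and the usage parameter exists only so the check can be exercised without touching a real filesystem.

```python
import shutil

# ~1.6TB from the note above, read here as decimal terabytes (an assumption).
FULL_DATASET_BYTES = int(1.6e12)


def has_room(path, required_bytes=FULL_DATASET_BYTES, usage=shutil.disk_usage):
    """Return (ok, free_bytes) for the filesystem that contains path."""
    free = usage(path).free
    return free >= required_bytes, free


if __name__ == "__main__":
    ok, free = has_room(".")
    print(f"free: {free / 1e12:.2f} TB, enough for the full dataset: {ok}")
```

Run it from the directory (or pass the path) where the data folder will live, since the answer depends on which filesystem that path is on.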
Single-Task Training & Evaluation
Make sure you have a train (n demos) and val (100 demos) set for the task you want to train on.
Training
Train a cliport agent with 1000 demonstrations on the stack-block-pyramid-seq-seen-colors task for 200K iterations:
python cliport/train.py train.task=stack-block-pyramid-seq-seen-colors \
train.agent=cliport \
train.attn_stream_fusion_type=add \
train.trans_stream_fusion_type=conv \
train.lang_fusion_type=mult \
train.n_demos=1000 \
train.n_steps=201000 \
train.exp_folder=exps \
dataset.cache=False
Validation
Iteratively evaluate all the checkpoints on val and save the results in exps/<task>-train/checkpoints/<task>-val-results.json:
python cliport/eval.py eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=val \
n_demos=100 \
train_demos=1000 \
checkpoint_type=val_missing \
exp_folder=exps
Test
Choose the best checkpoint from validation to run on the test set and save the results in exps/<task>-train/checkpoints/<task>-test-results.json:
python cliport/eval.py eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=test \
n_demos=100 \
train_demos=1000 \
checkpoint_type=test_best \
exp_folder=exps
Multi-Task Training & Evaluation
Training
Train multi-task models by specifying task=multi-language-conditioned, task=multi-loo-packing-box-pairs-unseen-colors (loo stands for leave-one-out or multi-attr tasks) etc.
python cliport/train.py train.task=multi-language-conditioned \
train.agent=cliport \
train.attn_stream_fusion_type=add \
train.trans_stream_fusion_type=conv \
train.lang_fusion_type=mult \
train.n_demos=1000 \
train.n_steps=601000 \
dataset.cache=False \
train.exp_folder=exps \
dataset.type=multi
Important: You need to generate the full dataset of tasks specified in dataset.py before multi-task training or modify the list of tasks here.
Validation
Run validation with a trained multi-language-conditioned multi-task model on stack-block-pyramid-seq-seen-colors:
python cliport/eval.py model_task=multi-language-conditioned \
eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=val \
n_demos=100 \
train_demos=1000 \
checkpoint_type=val_missing \
type=single \
exp_folder=exps
Test
Evaluate the best checkpoint on the test set:
python cliport/eval.py model_task=multi-language-conditioned \
eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=test \
n_demos=100 \
train_demos=1000 \
checkpoint_type=test_best \
type=single \
exp_folder=exps
Disclaimers
- Code Quality Level: Tired grad student.
- Scaling: The code only works for batch size 1. See #issue1 for reference. In theory, there is nothing preventing larger batch sizes other than GPU memory constraints.
- Memory and Storage: There are lots of places where memory usage can be reduced. You don't need 3 copies of the same CLIP ResNet50 and you don't need to save its weights in checkpoints since it's frozen anyway. Dataset sizes could be dramatically reduced with better storage formats and compression.
- Frameworks: There are lots of leftover NumPy bits from when I was trying to reproduce the TransporterNets results. I'll try to clean up when I get some time.
- Rotation Augmentation: All tasks use the same distribution for sampling SE(2) rotation perturbations. This obviously leads to issues with tasks that involve spatial relationships like 'left' or 'forward'.
- Evaluation Runs: In an ideal setting, the evaluation metrics should be averaged over 3 or more repetitions with different seeds. This might be feasible if you are working just with multi-task models.
- Duplicate Training Sets: The train sets of some *seen and *unseen tasks are identical, and only the val and test sets differ for purposes of evaluating generalization performance. So you might not need two duplicate train sets or train two separate models.
- Other Limitations: Check out Appendix I in the paper.
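For the Evaluation Runs point above, averaging over seeds can be as simple as the sketch below. It assumes you have already collected one scalar score per seed yourself; the summarize helper and the numbers in the usage lines are placeholders for illustration, not results from the paper or the repo.

```python
from statistics import mean, stdev


def summarize(scores_by_seed):
    """Return (mean, sample standard deviation) of one metric across seeds.

    scores_by_seed maps a seed to the score obtained with that seed.
    With a single seed the spread is reported as 0.0.
    """
    scores = list(scores_by_seed.values())
    if not scores:
        raise ValueError("need at least one score")
    if len(scores) < 2:
        return mean(scores), 0.0
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    # Placeholder scores for three seeds; replace with your own runs.
    avg, spread = summarize({0: 0.8, 1: 0.9, 2: 1.0})
    print(f"{avg:.3f} +/- {spread:.3f}")
```

Reporting the spread alongside the mean makes it easier to tell whether a gap between two agents is larger than the seed-to-seed variation.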
Notebooks
Check out Kevin Zakka's Colab for zero-shot detection with CLIP. This notebook might be a good way of gauging what sort of visual attributes CLIP can ground with language. But note that CLIPort does NOT do "object detection", but instead directly "detects actions".
Other Todos
- Dataset Visualizer
- Affordance Heatmap Visualizer
- Evaluation Results Plot
Docker Guide
Install Docker and NVIDIA Docker.
Modify docker_build.py and docker_run.py to your needs.
Build
Build the image:
python scripts/docker_build.py
Run
Start container:
python scripts/docker_run.py --nvidia_docker
cd ~/cliport
Use scripts/docker_run.py --headless if you are on a headless machine like a remote server or cloud instance.
Real-Robot Training FAQ
How much training data do I need?
It depends on the complexity of the task. With 5-10 demonstrations the agent should start to do something useful, but it will often make mistakes by picking the wrong object. For robustness you probably need 50-100 demonstrations. A good way to gauge how much data you might need is to set up a simulated version of the problem and evaluate agents trained with 1, 10, 100, and 1000 demonstrations.
Why doesn't the agent follow my language instruction?
This means either there is some sort of bias in the dataset that the agent is exploiting, or you don't have enough training data. Also make sure that the task is doable: if a referred attribute is barely legible in the input, then it's going to be hard for the agent to figure out what you mean.
Does CLIPort predict height (z-values) of the end-effector?
CLIPort does not predict height values. You can either: (1) come up with a heuristic based on the heightmap to determine the height position, or (2) train a simple MLP like in TransporterNets-6DOF to predict z-values.
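As an illustration of option (1), one possible heuristic is to read the heightmap around the predicted pixel and take the tallest value in a small window. This is a sketch under stated assumptions, not the method used in the paper: pick_height, the window size, and the clearance offset are all hypothetical, and the heightmap is assumed to be a 2D grid of heights in a single consistent unit.

```python
def pick_height(heightmap, row, col, window=1, clearance=0.0):
    """Heuristic z-value for a predicted pixel.

    Takes the maximum height in a (2 * window + 1) square patch centred on
    (row, col), clipped at the map borders, then adds a clearance offset.
    heightmap is a 2D sequence (list of rows) of heights.
    """
    rows, cols = len(heightmap), len(heightmap[0])
    r0, r1 = max(0, row - window), min(rows, row + window + 1)
    c0, c1 = max(0, col - window), min(cols, col + window + 1)
    patch = [heightmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return max(patch) + clearance


if __name__ == "__main__":
    # Toy 3x3 heightmap with one raised cell in the middle.
    toy = [[0.0, 0.0, 0.0],
           [0.0, 0.05, 0.0],
           [0.0, 0.0, 0.02]]
    print(pick_height(toy, 0, 0))
```

Using the maximum rather than the single pixel value makes the estimate less sensitive to a prediction that lands one pixel off the object; whether that is the right choice depends on the gripper and the task.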
Shouldn't CLIP help in zero-shot detection of things? Why do I need to collect more data?
Note that CLIPort is not doing "object detection". CLIPort fine-tunes CLIP's representations to "detect actions" in SE(2). CLIP by itself has no understanding of actions or affordances; recognizing and localizing objects (e.g. detecting hammer) does not tell you anything about how to manipulate them (e.g. grasping the hammer by the handle).
What are the best hyperparams for real-robot training?
The default settings should work well. Recently, though, I have been playing around with using FiLM (Perez et al., 2017) to fuse language features, inspired by BC-Z (Jang et al., 2021). Qualitatively, it seems like FiLM is better for reading text etc., but I haven't conducted a full quantitative analysis. Try it out yourself with train.agent=two_stream_clip_film_lingunet_lat_transporter (non-residual FiLM).
How to pick the best checkpoint for real-robot tasks?
Ideally, you should create a validation set with heldout instances and then choose the checkpoint with the lowest translation and rotation errors. You can also reuse the training instances but swap the language instructions with unseen goals.
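The selection described above can be sketched as follows, assuming each checkpoint's predictions on the held-out set have already been collected as (x, y, theta) poses with theta in radians. The function names and the rot_weight parameter are hypothetical; how to trade translation error against rotation error is a choice the text above leaves open, so a simple weighted sum is used here.

```python
import math


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in radians."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def pose_errors(pred, target):
    """Return (translation_error, rotation_error) for two (x, y, theta) poses."""
    translation = math.hypot(pred[0] - target[0], pred[1] - target[1])
    return translation, angle_diff(pred[2], target[2])


def mean_errors(preds, targets):
    """Average translation and rotation errors over paired poses."""
    errs = [pose_errors(p, t) for p, t in zip(preds, targets)]
    n = len(errs)
    return sum(e[0] for e in errs) / n, sum(e[1] for e in errs) / n


def pick_checkpoint(predictions, targets, rot_weight=0.1):
    """Return the checkpoint name with the lowest combined error.

    predictions maps a checkpoint name to its list of predicted poses.
    rot_weight (an assumed knob) converts radians into translation units.
    """
    def score(name):
        translation, rotation = mean_errors(predictions[name], targets)
        return translation + rot_weight * rotation

    return min(predictions, key=score)
```

If the two errors disagree about which checkpoint is best, it may be more informative to look at them separately than to rely on any single weighting.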
Why is the agent confusing directions like 'forward' and 'left'?
By default, training samples are augmented with SE(2) rotations sampled from N(0, 60 deg). For tasks with rotational symmetries (like moving pieces on a chessboard) you need to be careful with this rotation augmentation parameter.
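To get a feel for how often this augmentation could scramble a directional word, the sketch below samples rotation angles and counts how many exceed 45 degrees, the point at which a rotated "left" is closer to a different cardinal direction. It assumes the 60 deg in N(0, 60 deg) is a standard deviation and that the instruction text stays fixed while the scene rotates; it does not use the repo's augmentation code, and all names are hypothetical.

```python
import math
import random


def rotate(vec, degrees):
    """Rotate a 2D vector counter-clockwise by the given angle in degrees."""
    t = math.radians(degrees)
    x, y = vec
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))


def fraction_beyond(n=10000, sigma_deg=60.0, threshold_deg=45.0, seed=0):
    """Estimate how often |angle| > threshold_deg for angle ~ N(0, sigma_deg)."""
    rng = random.Random(seed)
    hits = sum(abs(rng.gauss(0.0, sigma_deg)) > threshold_deg for _ in range(n))
    return hits / n


if __name__ == "__main__":
    print(rotate((1.0, 0.0), 90.0))  # a unit x vector ends up along y
    print(fraction_beyond())
```

Under these assumptions a large share of augmented samples is rotated past 45 degrees, which is consistent with the confusion described above; narrowing the rotation distribution for direction-sensitive tasks is one thing to try.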
Acknowledgements
This work uses code from the following open-source projects and datasets:
Google Ravens (TransporterNets)
Original: https://github.com/google-research/ravens
License: Apache 2.0
Changes: All PyBullet tasks are directly adapted from the Ravens codebase. The original TransporterNets models were reimplemented in PyTorch.
OpenAI CLIP
Original: https://github.com/openai/CLIP
License: MIT
Changes: Minor modifications to CLIP-ResNet50 to save intermediate features for skip connections.
Google Scanned Objects
Original: Dataset
License: Creative Commons BY 4.0
Changes: Fixed center-of-mass (COM) to be geometric-center for selected objects.
U-Net
Original: https://github.com/milesial/Pytorch-UNet/
License: GPL 3.0
Changes: Used as is in unet.py. Note: This part of the code is GPL 3.0.
Citations
CLIPort
@inproceedings{shridhar2021cliport,
title = {CLIPort: What and Where Pathways for Robotic Manipulation},
author = {Shridhar, Mohit and Manuelli, Lucas and Fox, Dieter},
booktitle = {Proceedings of the 5th Conference on Robot Learning (CoRL)},
year = {2021},
}
CLIP
@article{radford2021learning,
title={Learning transferable visual models from natural language supervision},
author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
journal={arXiv preprint arXiv:2103.00020},
year={2021}
}
TransporterNets
@inproceedings{zeng2020transporter,
title={Transporter networks: Rearranging the visual world for robotic manipulation},
author={Zeng, Andy and Florence, Pete and Tompson, Jonathan and Welker, Stefan and Chien, Jonathan and Attarian, Maria and Armstrong, Travis and Krasin, Ivan and Duong, Dan and Sindhwani, Vikas and others},
booktitle={Proceedings of the 4th Conference on Robot Learning (CoRL)},
year= {2020},
}
Questions or Issues?
Please file an issue with the issue tracker.