Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation
Official PyTorch implementation of the paper Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation, by Rishabh Jangir*, Nicklas Hansen*, Sambaran Ghosal, Mohit Jain, and Xiaolong Wang.
Installation
GPU access with CUDA >=11.1 support is required. Install MuJoCo if you do not have it installed already:
- Obtain a license on the MuJoCo website.
- Download MuJoCo binaries here.
- Unzip the downloaded archive into ~/.mujoco/mujoco200 and place your license key file mjkey.txt at ~/.mujoco.
- Use the env variables MUJOCO_PY_MJKEY_PATH and MUJOCO_PY_MUJOCO_PATH to specify the MuJoCo license key path and the MuJoCo directory path (the optional check after the conda setup below verifies these).
- Append the MuJoCo bin subdirectory path to the env variable LD_LIBRARY_PATH.
Then, the remainder of the dependencies can be installed with the following commands:
conda env create -f setup/conda.yml
conda activate lookcloser
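Optionally, once the environment is activated, the minimal check below verifies that MuJoCo and the license paths are picked up. It is a sketch that assumes the default ~/.mujoco layout described above and that the conda environment installs mujoco-py; adjust the paths if you unpacked MuJoCo elsewhere.

```python
# Optional sanity check for the MuJoCo paths described above.
# Assumes the default ~/.mujoco layout; adjust if you unpacked MuJoCo elsewhere.
import os

os.environ.setdefault("MUJOCO_PY_MUJOCO_PATH", os.path.expanduser("~/.mujoco/mujoco200"))
os.environ.setdefault("MUJOCO_PY_MJKEY_PATH", os.path.expanduser("~/.mujoco/mjkey.txt"))
# LD_LIBRARY_PATH must be exported in your shell *before* starting Python, e.g.:
#   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco200/bin

import mujoco_py  # raises a descriptive error here if any of the paths are wrong

print(mujoco_py.__version__)
```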
Training
We provide training scripts for solving each of the four tasks with our method; they can be found in the scripts directory. Training takes approximately 16 hours on a single GPU for 500k timesteps.
Command: bash scripts/multiview.sh
With the default arguments, this trains on the reach environment from image observations using our cross-view method. See src/arguments.py for a detailed description of the arguments and their usage; the baselines considered in the paper can be run with small modifications to the input arguments.
Results
We find that while using multiple views alone improves the sim-to-real performance of SAC, our Transformer-based view fusion is far more robust across all tasks.
See our paper for more results.
Method
Our method improves vision-based robotic manipulation by fusing information from multiple cameras using transformers. The learned RL policy transfers from simulation to a real robot, and solves precision-based manipulation tasks directly from uncalibrated cameras, without access to state information, and with a high degree of variability in task configurations.
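To make the fusion step concrete, below is a rough PyTorch sketch of cross-attention between the two views. The module name, token and embedding sizes, and the residual/pooling choices are illustrative assumptions, not the exact architecture used in the paper or this repository.

```python
# Rough, hypothetical sketch of cross-view fusion with cross-attention (PyTorch).
# The feature extractors, token/embedding sizes, and exact fusion layout in this
# codebase may differ; this only illustrates the idea of attending across views.
import torch
import torch.nn as nn


class CrossViewFusion(nn.Module):
    """Fuse egocentric and third-person feature tokens with cross-attention."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.ego_to_third = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.third_to_ego = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ego_tokens, third_tokens):
        # ego_tokens, third_tokens: (batch, num_tokens, dim) spatial features per camera.
        ego_fused, _ = self.ego_to_third(ego_tokens, third_tokens, third_tokens)
        third_fused, _ = self.third_to_ego(third_tokens, ego_tokens, ego_tokens)
        # Residual connection, then pool over tokens to get one embedding per view.
        ego_out = self.norm(ego_tokens + ego_fused).mean(dim=1)
        third_out = self.norm(third_tokens + third_fused).mean(dim=1)
        return torch.cat([ego_out, third_out], dim=-1)  # joint embedding for the policy


fusion = CrossViewFusion()
ego = torch.randn(8, 49, 128)    # e.g. a 7x7 feature map flattened to 49 tokens
third = torch.randn(8, 49, 128)
print(fusion(ego, third).shape)  # torch.Size([8, 256])
```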
Attention Maps
We visualize attention maps learned by our method, and find that it learns to relate concepts shared between the two views, e.g. when querying a point on an object shown in the egocentric view, our method attends strongly to the same object in the third-person view, and vice versa.
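As a hedged illustration only (not the repository's actual visualization code), attention weights like these can be read off any cross-attention layer and reshaped into a spatial map over the other view's feature grid:

```python
# Illustrative only: pull attention weights out of a cross-attention layer and
# reshape them into a spatial map over the other view's feature grid.
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
ego = torch.randn(1, 49, 128)    # 7x7 egocentric feature grid, flattened to 49 tokens
third = torch.randn(1, 49, 128)  # 7x7 third-person feature grid, flattened to 49 tokens

# Weights are averaged over heads by default -> shape (batch, query_tokens, key_tokens).
_, weights = attn(ego, third, third, need_weights=True)

query_idx = 24  # one egocentric location to query (center of the 7x7 grid)
attn_map = weights[0, query_idx].reshape(7, 7).detach().numpy()
plt.imshow(attn_map, cmap="viridis")
plt.title("Third-person attention for one egocentric query token")
plt.savefig("attention_map.png")
```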
Tasks
Together with our method, we also release a set of four image-based robotic manipulation tasks used in our research. Each task is goal-conditioned with the goal specified directly in the image observations, the agent has no access to state information, and task configurations are randomly initialized at the start of each episode. The provided tasks are:
- Reach: Reach a randomly positioned mark on the table with the robot's end-effector.
- Push: Push a box to a goal position indicated by a mark on the table.
- Pegbox: Place a peg, attached to the robot's end-effector by a string, into a box.
- Hammerall: Hammer in an out-of-position peg; in each episode, only one of the four pegs is randomly initialized out of position.
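For readers unfamiliar with the setup, the snippet below is a hypothetical, self-contained sketch of the interaction loop implied above (two RGB views per step, no state vector). DummyMultiViewEnv, its observation keys, and the 4-dimensional action are placeholders, not this repository's actual environment API.

```python
# Hypothetical stand-in for a goal-conditioned, image-only task with two camera views.
# The real environments, their constructors, and observation formats are defined in
# this repository and may differ from this sketch.
import numpy as np


class DummyMultiViewEnv:
    def reset(self):
        # One egocentric and one third-person RGB frame; the goal is visible in the
        # images, and no proprioceptive or object state is exposed.
        return {"ego": np.zeros((84, 84, 3), np.uint8),
                "third": np.zeros((84, 84, 3), np.uint8)}

    def step(self, action):
        obs = self.reset()  # placeholder dynamics
        reward, done = 0.0, False
        return obs, reward, done, {}


env = DummyMultiViewEnv()
obs = env.reset()
for _ in range(10):
    action = np.random.uniform(-1.0, 1.0, size=4)  # e.g. end-effector deltas + gripper
    obs, reward, done, info = env.step(action)
```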
Citation
If you find our work useful in your research, please consider citing the paper as follows:
@article{Jangir2022Look,
title={Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation},
author={Rishabh Jangir and Nicklas Hansen and Sambaran Ghosal and Mohit Jain and Xiaolong Wang},
journal={arXiv},
primaryclass={cs.LG},
year={2022}
}
License
This repository is licensed under the MIT license; see LICENSE for more information.