Episodic Transformers (E.T.)
Episodic Transformer for Vision-and-Language Navigation
Alexander Pashevich, Cordelia Schmid, Chen Sun
Episodic Transformer (E.T.) is a novel attention-based architecture for vision-and-language navigation. E.T. is based on a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. This code reproduces the results obtained with E.T. on the ALFRED benchmark. To learn more about the benchmark and the original code, please refer to the ALFRED repository.
Quickstart
Clone repo:
$ git clone https://github.com/alexpashevich/E.T..git ET
$ export ET_ROOT=$(pwd)/ET
$ export ET_LOGS=$ET_ROOT/logs
$ export ET_DATA=$ET_ROOT/data
$ export PYTHONPATH=$PYTHONPATH:$ET_ROOT
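To keep these variables across shell sessions, you can append the exports to your shell profile. A minimal sketch, assuming bash and with /path/to/ET as a placeholder for the location of your clone:

$ cat >> ~/.bashrc << 'EOF'
export ET_ROOT=/path/to/ET
export ET_LOGS=$ET_ROOT/logs
export ET_DATA=$ET_ROOT/data
export PYTHONPATH=$PYTHONPATH:$ET_ROOT
EOF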
Install requirements:
$ virtualenv -p $(which python3.7) et_env
$ source et_env/bin/activate
$ cd $ET_ROOT
$ pip install --upgrade pip
$ pip install -r requirements.txt
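As a quick sanity check that the environment is usable (assuming PyTorch is among the pinned requirements, which the E.T. codebase builds on):

$ python -c "import torch; print(torch.__version__)"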
Downloading data and checkpoints
Download ALFRED dataset:
$ cd $ET_DATA
$ sh download_data.sh json_feat
Copy pretrained checkpoints:
$ wget http://pascal.inrialpes.fr/data2/apashevi/et_checkpoints.zip
$ unzip et_checkpoints.zip
$ mv pretrained $ET_LOGS/
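You can verify that the checkpoints landed in the expected place; based on the commands below, $ET_LOGS/pretrained should contain at least fasterrcnn_model.pth, maskrcnn_model.pth, et_human_pretrained.pth, and et_human_synth_pretrained.pth:

$ ls $ET_LOGS/pretrained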
Render PNG images and create an LMDB dataset with natural language annotations:
$ python -m alfred.gen.render_trajs
$ python -m alfred.data.create_lmdb with args.visual_checkpoint=$ET_LOGS/pretrained/fasterrcnn_model.pth args.data_output=lmdb_human args.vocab_path=$ET_ROOT/files/human.vocab
Note #1: For rendering, you may need to configure args.x_display to match the number of an X server running on your machine.
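If no X server is available (e.g., on a headless machine), one common option is to start a virtual display with Xvfb and point args.x_display at it. A minimal sketch, assuming Xvfb is installed (the display number 1 is an arbitrary choice):

$ Xvfb :1 -screen 0 1024x768x24 &

Then set args.x_display=1 for the rendering command.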
Note #2: We do not use the JPG images from the full dataset, as they would differ from the images rendered during evaluation due to JPG compression.
Pretrained models evaluation
Evaluate an E.T. agent trained on human data only:
$ python -m alfred.eval.eval_agent with eval.exp=pretrained eval.checkpoint=et_human_pretrained.pth eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5 eval.eval_range=None exp.data.valid=lmdb_human
Note: Make sure that your LMDB database is called exactly lmdb_human, as the word embedding won't be loaded otherwise.
Evaluate an E.T. agent trained on human and synthetic data:
$ python -m alfred.eval.eval_agent with eval.exp=pretrained eval.checkpoint=et_human_synth_pretrained.pth eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5 eval.eval_range=None exp.data.valid=lmdb_human
Note: For evaluation, you may need to configure eval.x_display to match the number of an X server running on your machine (the Xvfb sketch above applies here as well).
E.T. with human data only
Train an E.T. agent:
$ python -m alfred.model.train with exp.model=transformer exp.name=et_s1 exp.data.train=lmdb_human train.seed=1
Evaluate the trained E.T. agent:
$ python -m alfred.eval.eval_agent with eval.exp=et_s1 eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5
Note: You may need to train up to 5 agents using different random seeds to reproduce the results of the paper.
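To reproduce the multi-seed results, the training command can simply be looped over seeds. A minimal sketch (the per-seed experiment names et_s$seed are a naming choice, not something the code requires):

$ for seed in 1 2 3 4 5; do
      python -m alfred.model.train with exp.model=transformer \
          exp.name=et_s$seed exp.data.train=lmdb_human train.seed=$seed
  done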
E.T. with language pretraining
Language encoder pretraining with the translation objective:
$ python -m alfred.model.train with exp.model=speaker exp.name=translator exp.data.train=lmdb_human
Train an E.T. agent with the language pretraining:
$ python -m alfred.model.train with exp.model=transformer exp.name=et_synth_s1 exp.data.train=lmdb_human train.seed=1 exp.pretrained_path=translator
Evaluate the trained E.T. agent:
$ python -m alfred.eval.eval_agent with eval.exp=et_synth_s1 eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5
Note: You may need to train up to 5 agents using different random seeds to reproduce the results of the paper.
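The same seed loop from the previous section applies here as well; append exp.pretrained_path=translator to each training command and adjust the experiment names accordingly.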
E.T. with joint training
You can also generate more synthetic trajectories using generate_trajs.py, create an LMDB dataset from them, and jointly train a model on human and synthetic data. Please refer to the original ALFRED code to learn more about the data generation. The steps to reproduce the results are the following (a combined command sketch follows the list):
- Generate 45K trajectories with alfred.gen.generate_trajs.
- Create a synthetic LMDB dataset called lmdb_synth_45K using args.visual_checkpoint=$ET_LOGS/pretrained/fasterrcnn_model.pth and args.vocab_path=$ET_ROOT/files/synth.vocab.
- Train an E.T. agent using exp.data.train=lmdb_human,lmdb_synth_45K.
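Put together, the pipeline might look as follows. This is a sketch: the create_lmdb and train invocations mirror the commands used earlier in this README, while generate_trajs is shown with default options and may need additional arguments from the original ALFRED code (the experiment name et_joint_s1 is a placeholder):

$ python -m alfred.gen.generate_trajs
$ python -m alfred.data.create_lmdb with args.visual_checkpoint=$ET_LOGS/pretrained/fasterrcnn_model.pth args.data_output=lmdb_synth_45K args.vocab_path=$ET_ROOT/files/synth.vocab
$ python -m alfred.model.train with exp.model=transformer exp.name=et_joint_s1 exp.data.train=lmdb_human,lmdb_synth_45K train.seed=1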
Citation
If you find this repository useful, please cite our work:
@misc{pashevich2021episodic,
  title={{Episodic Transformer for Vision-and-Language Navigation}},
  author={Alexander Pashevich and Cordelia Schmid and Chen Sun},
  year={2021},
  eprint={2105.06453},
  archivePrefix={arXiv},
}