Episodic Transformer (E.T.) is a novel attention-based architecture for vision-and-language navigation. E.T. is based on a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.

Related tags

Deep Learning E.T.
Overview

Episodic Transformers (E.T.)

Episodic Transformer for Vision-and-Language Navigation
Alexander Pashevich, Cordelia Schmid, Chen Sun

Episodic Transformer (E.T.) is a novel attention-based architecture for vision-and-language navigation. E.T. is based on a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. This code reproduces the results obtained with E.T. on ALFRED benchmark. To learn more about the benchmark and the original code, please refer to ALFRED repository.

Quickstart

Clone repo:

$ git clone https://github.com/alexpashevich/E.T..git ET
$ export ET_ROOT=$(pwd)/ET
$ export ET_LOGS=$ET_ROOT/logs
$ export ET_DATA=$ET_ROOT/data
$ export PYTHONPATH=$PYTHONPATH:$ET_ROOT

Install requirements:

$ virtualenv -p $(which python3.7) et_env
$ source et_env/bin/activate

$ cd $ET_ROOT
$ pip install --upgrade pip
$ pip install -r requirements.txt

Downloading data and checkpoints

Download ALFRED dataset:

$ cd $ET_DATA
$ sh download_data.sh json_feat

Copy pretrained checkpoints:

$ wget http://pascal.inrialpes.fr/data2/apashevi/et_checkpoints.zip
$ unzip et_checkpoints.zip
$ mv pretrained $ET_LOGS/

Render PNG images and create an LMDB dataset with natural language annotations:

$ python -m alfred.gen.render_trajs
$ python -m alfred.data.create_lmdb with args.visual_checkpoint=$ET_LOGS/pretrained/fasterrcnn_model.pth args.data_output=lmdb_human args.vocab_path=$ET_ROOT/files/human.vocab

Note #1: For rendering, you may need to configure args.x_display to correspond to an X server number running on your machine.
Note #2: We do not use JPG images from the full dataset as they would differ from the images rendered during evaluation due to the JPG compression.

Pretrained models evaluation

Evaluate an E.T. agent trained on human data only:

$ python -m alfred.eval.eval_agent with eval.exp=pretrained eval.checkpoint=et_human_pretrained.pth eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5 eval.eval_range=None exp.data.valid=lmdb_human

Note: make sure that your LMDB database is called exactly lmdb_human as the word embedding won't be loaded otherwise.

Evaluate an E.T. agent trained on human and synthetic data:

$ python -m alfred.eval.eval_agent with eval.exp=pretrained eval.checkpoint=et_human_synth_pretrained.pth eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5 eval.eval_range=None exp.data.valid=lmdb_human

Note: For evaluation, you may need to configure eval.x_display to correspond to an X server number running on your machine.

E.T. with human data only

Train an E.T. agent:

$ python -m alfred.model.train with exp.model=transformer exp.name=et_s1 exp.data.train=lmdb_human train.seed=1

Evaluate the trained E.T. agent:

$ python -m alfred.eval.eval_agent with eval.exp=et_s1 eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5

Note: you may need to train up to 5 agents using different random seeds to reproduce the results of the paper.

E.T. with language pretraining

Language encoder pretraining with the translation objective:

$ python -m alfred.model.train with exp.model=speaker exp.name=translator exp.data.train=lmdb_human

Train an E.T. agent with the language pretraining:

$ python -m alfred.model.train with exp.model=transformer exp.name=et_synth_s1 exp.data.train=lmdb_human train.seed=1 exp.pretrained_path=translator

Evaluate the trained E.T. agent:

$ python -m alfred.eval.eval_agent with eval.exp=et_synth_s1 eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5

Note: you may need to train up to 5 agents using different random seeds to reproduce the results of the paper.

E.T. with joint training

You can also generate more synthetic trajectories using generate_trajs.py, create an LMDB and jointly train a model on it. Please refer to the original ALFRED code to know more the data generation. The steps to reproduce the results are the following:

  1. Generate 45K trajectories with alfred.gen.generate_trajs.
  2. Create a synthetic LMDB dataset called lmdb_synth_45K using args.visual_checkpoint=$ET_LOGS/pretrained/fasterrcnn_model.pth and args.vocab_path=$ET_ROOT/files/synth.vocab.
  3. Train an E.T. agent using exp.data.train=lmdb_human,lmdb_synth_45K.

Citation

If you find this repository useful, please cite our work:

@misc{pashevich2021episodic,
  title ={{Episodic Transformer for Vision-and-Language Navigation}},
  author={Alexander Pashevich and Cordelia Schmid and Chen Sun},
  year={2021},
  eprint={2105.06453},
  archivePrefix={arXiv},
}
Comments
  • Stuck while rendering trajectory

    Stuck while rendering trajectory

    Hi, I'm now suffering from the code stucking and would ask some help to deal with it.

    The problem command is as below: python -m alfred.gen.render_trajs

    From debugging, I found that the lines below makes it get stuck https://github.com/alexpashevich/E.T./blob/92ee2378d596b55f05e5c1949726577a64215f04/alfred/gen/render_trajs.py#L272-L288

    More precisely, the code stuck while performing the lines below. https://github.com/alexpashevich/E.T./blob/92ee2378d596b55f05e5c1949726577a64215f04/alfred/env/thor_env.py#L65-L76

    It stuck due to thorEnv so I searched some issue seems related. https://github.com/askforalfred/alfred/issues/120#issue-1330610682

    There is no error but the simulator just stop on some timestamp and don't move more than a day.. It would be very appreciate if you help

    opened by jeje910 9
  • Error in trying evaluation task

    Error in trying evaluation task

    Thank you for your great work! @alexpashevich

    Your paper was very interesting for me and i'm now trying to run your code in my local. But I have an issue when I try to run alfred.eval.eval_agent.

    When I try to run the pretrained models evaluation code python3 -m alfred.eval.eval_agent with eval.exp=pretrained eval.checkpoint=et_human_pretrained.pth eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5 eval.eval_range=None exp.data.valid=lmdb_human FileNotFoundError came out and I cannot find which one is wrong. The codes before this line were correctly run and worked well.

    The picture below is the error I am currently suffer from. ET Question

    And the pretrained directory looks as follow. ET ERROR2

    Should I copy the directory from somewhere or did I miss something before evaluate task?

    Sorry for the poor question..

    opened by jeje910 6
  • Share 45k synthetic trajectories

    Share 45k synthetic trajectories

    Thank you @alexpashevich for your great work. When I followed your guide to Generate 45K trajectories with alfred.gen.generate_trajs, it mostly failed to generate trajectories in order to reproduce the result. Rather than sharing the entire raw images, would you just share the traj_data.json with us? I think the total size of such files for 45k trajectories is around 3 GB. From these files, we can render the images easily.

    Thank you so much!

    opened by davidnvq 4
  • Trajectories were skipped

    Trajectories were skipped

    Thank you @alexpashevich for your great work When I followed your guide to create an LMDB dataset with natural language annotations. It always report the error "string indices must b integers" in line 145 of the thor_env.py. The action "LookUp" seems not be an integer. This error would cause 7080 trajectories ere skipped. Would you like to tell us how to fixed it?

    opened by ptwaixingren 3
  • Questions about the pretrained model.

    Questions about the pretrained model.

    Thank you @alexpashevich for sharing this great work. I am here for asking whether you would release your different versions of pretrained models you published in your paper and on Alfred leaderboard.

    opened by yizhouzhao 2
  • Question about the number of synthetic language demonstrations

    Question about the number of synthetic language demonstrations

    Thanks for your great work!

    I wanna ask why the number of synthetic language demonstrations is not equal to the number of expert trajectories in train split? Since each trajectory corresponds to a single PDDL, and your synthetic language demonstration is just generated from PDDL, I supposed that the number of synthetic language demonstrations would be just the same as the number of expert trajectories in train split. However, the number of synthetic language instructions is about five times larger than the expert trajectories. Is there anything I misunderstood or are you using other methods of data augmentation? Thanks for your help!

    opened by Gasoonjia 1
  • Not able to render on Colab

    Not able to render on Colab

    The below command:

    python -m alfred.gen.render_trajs

    works when i run it on my personal laptop, but does not seem to work on colab.

    Are there any changes I can do so that I can get it working on colab.

    opened by Vibha111094 1
  • Stuck while rendering trajectory and Error in trying evaluation task

    Stuck while rendering trajectory and Error in trying evaluation task

    These issues are present in the current repo.

    How to fix them: Issue 1: Stuck while rendering trajectory: https://github.com/alexpashevich/E.T./issues/8 One of the modules in requirements.txt need to be a older version. I don't remember which.

    Issue 2: Error in trying evaluation task: https://github.com/alexpashevich/E.T./issues/6 Happens when using cuda You can either comment it out in model.util, I think you can add .cpu() to the cuda tensor like so: feat_extracted = feat_extracted.cpu()

    opened by Samuel-Fipps 5
Owner
Alex Pashevich
PhD student at Thoth (Inria Alpes, France)
Alex Pashevich
This repository contains various models targetting multimodal representation learning, multimodal fusion for downstream tasks such as multimodal sentiment analysis.

Multimodal Deep Learning ?? ?? ?? Announcing the multimodal deep learning repository that contains implementation of various deep learning-based model

Deep Cognition and Language Research (DeCLaRe) Lab 398 Dec 30, 2022
improvement of CLIP features over the traditional resnet features on the visual question answering, image captioning, navigation and visual entailment tasks.

CLIP-ViL In our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?", we show the improvement of CLIP features over the traditional resnet fea

null 310 Dec 28, 2022
GEP (GDB Enhanced Prompt) - a GDB plug-in for GDB command prompt with fzf history search, fish-like autosuggestions, auto-completion with floating window, partial string matching in history, and more!

GEP (GDB Enhanced Prompt) GEP (GDB Enhanced Prompt) is a GDB plug-in which make your GDB command prompt more convenient and flexibility. Why I need th

Alan Li 23 Dec 21, 2022
Unofficial implementation of Perceiver IO: A General Architecture for Structured Inputs & Outputs

Perceiver IO Unofficial implementation of Perceiver IO: A General Architecture for Structured Inputs & Outputs Usage import torch from src.perceiver.

Timur Ganiev 111 Nov 15, 2022
[CVPR'21] FedDG: Federated Domain Generalization on Medical Image Segmentation via Episodic Learning in Continuous Frequency Space

FedDG: Federated Domain Generalization on Medical Image Segmentation via Episodic Learning in Continuous Frequency Space by Quande Liu, Cheng Chen, Ji

Quande Liu 178 Jan 6, 2023
Continuum Learning with GEM: Gradient Episodic Memory

Gradient Episodic Memory for Continual Learning Source code for the paper: @inproceedings{GradientEpisodicMemory, title={Gradient Episodic Memory

Facebook Research 360 Dec 27, 2022
A task-agnostic vision-language architecture as a step towards General Purpose Vision

Towards General Purpose Vision Systems By Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, and Derek Hoiem Overview Welcome to the official code base f

AI2 79 Dec 23, 2022
This repository contains the official implementation code of the paper Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis, accepted at EMNLP 2021.

MultiModal-InfoMax This repository contains the official implementation code of the paper Improving Multimodal Fusion with Hierarchical Mutual Informa

Deep Cognition and Language Research (DeCLaRe) Lab 89 Dec 26, 2022
Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Including examples for DETR, VQA.

PyTorch Implementation of Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers 1 Using Colab Please notic

Hila Chefer 489 Jan 7, 2023
Code and data of the Fine-Grained R2R Dataset proposed in paper Sub-Instruction Aware Vision-and-Language Navigation

Fine-Grained R2R Code and data of the Fine-Grained R2R Dataset proposed in the EMNLP2020 paper Sub-Instruction Aware Vision-and-Language Navigation. C

YicongHong 34 Nov 15, 2022
Code for the paper "Improving Vision-and-Language Navigation with Image-Text Pairs from the Web" (ECCV 2020)

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh

Arjun Majumdar 44 Dec 14, 2022
Pytorch code for ICRA'21 paper: "Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation"

Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation This repository is the pytorch implementation of our paper: Hierarchical Cr

null 43 Nov 21, 2022
Implementation of Deformable Attention in Pytorch from the paper "Vision Transformer with Deformable Attention"

Deformable Attention Implementation of Deformable Attention from this paper in Pytorch, which appears to be an improvement to what was proposed in DET

Phil Wang 128 Dec 24, 2022
The Surprising Effectiveness of Visual Odometry Techniques for Embodied PointGoal Navigation

PointNav-VO The Surprising Effectiveness of Visual Odometry Techniques for Embodied PointGoal Navigation Project Page | Paper Table of Contents Setup

Xiaoming Zhao 41 Dec 15, 2022
SAAVN - Sound Adversarial Audio-Visual Navigation,ICLR2022 (In PyTorch)

SAAVN SAAVN Code release for paper "Sound Adversarial Audio-Visual Navigation,IC

YinfengYu 10 Aug 30, 2022
Deep Multimodal Neural Architecture Search

MMNas: Deep Multimodal Neural Architecture Search This repository corresponds to the PyTorch implementation of the MMnas for visual question answering

Vision and Language Group@ MIL 23 Dec 21, 2022
Rethinking the U-Net architecture for multimodal biomedical image segmentation

MultiResUNet Rethinking the U-Net architecture for multimodal biomedical image segmentation This repository contains the original implementation of "M

Nabil Ibtehaz 308 Jan 5, 2023
Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision. ICCV 2021.

Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision Download links and PyTorch implementation of "Towers of Ba

Blakey Wu 40 Dec 14, 2022
The code repository for EMNLP 2021 paper "Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization".

Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization [Paper] accepted at the EMNLP 2021: Vision Guided Genera

CAiRE 42 Jan 7, 2023