History Aware Multimodal Transformer for Vision-and-Language Navigation

Overview

This repository is the official implementation of History Aware Multimodal Transformer for Vision-and-Language Navigation. Project webpage: https://cshizhe.github.io/projects/vln_hamt.html

Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. In this work, we introduce the History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT efficiently encodes all past panoramic observations via a hierarchical vision transformer, then jointly combines the text, the history and the current observation to predict the next action. We first train HAMT end-to-end using several proxy tasks, including single-step action prediction and spatial relation prediction, and then use reinforcement learning to further improve the navigation policy. HAMT achieves a new state of the art on a broad range of VLN tasks, including VLN with fine-grained instructions (R2R, RxR), high-level instructions (R2R-Last, REVERIE), dialogs (CVDN), as well as long-horizon VLN (R4R, R2R-Back).

(Figure: overview of the HAMT framework)
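
To make the overview concrete, below is a minimal, illustrative PyTorch sketch of the two ideas described above: a hierarchical encoder that first pools the 36 views of each past panorama into a single history token and then encodes the token sequence, followed by joint fusion of text, history and current-observation tokens to score candidate actions. This is not the actual HAMT model: the layer counts, the mean-pooling readout and the tensor layout are assumptions chosen for brevity.

import torch
import torch.nn as nn

class HistoryAwarePolicySketch(nn.Module):
    """Illustrative two-level encoder: panorama -> history token -> cross-modal fusion."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        # Level 1: encode the 36 views of one past panorama into a single history token.
        self.pano_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads), num_layers=2)
        # Level 2: temporal encoder over the per-step history tokens.
        self.hist_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads), num_layers=2)
        # Joint fusion over [text; history; current observation] tokens.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads), num_layers=4)
        self.action_head = nn.Linear(d_model, 1)

    def forward(self, text_emb, past_panos, cur_obs):
        # text_emb:   [B, L, D] embedded instruction tokens
        # past_panos: [B, T, 36, D] ViT features of the T past panoramic observations
        # cur_obs:    [B, K, D] features of the K navigable candidates at this step
        B, T, V, D = past_panos.shape
        # nn.Transformer* modules expect [sequence, batch, dim], hence the transposes.
        views = past_panos.view(B * T, V, D).transpose(0, 1)        # [V, B*T, D]
        hist = self.pano_encoder(views).mean(dim=0).view(B, T, D)   # one token per past step
        hist = self.hist_encoder(hist.transpose(0, 1))              # [T, B, D]
        tokens = torch.cat([text_emb.transpose(0, 1), hist, cur_obs.transpose(0, 1)], dim=0)
        fused = self.fusion(tokens)                                 # [L+T+K, B, D]
        cand = fused[-cur_obs.size(1):].transpose(0, 1)             # [B, K, D]
        return self.action_head(cand).squeeze(-1)                   # [B, K] action logits

# Toy usage with random tensors (batch of 2, 20 instruction tokens, 5 past steps, 8 candidates)
policy = HistoryAwarePolicySketch()
logits = policy(torch.randn(2, 20, 768), torch.randn(2, 5, 36, 768), torch.randn(2, 8, 768))
print(logits.shape)  # torch.Size([2, 8])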

Installation

  1. Install the Matterport3D simulator: follow the instructions here. We use the latest version, in which all inputs and outputs are batched. A quick sanity check for the installed simulator is sketched after this list.
export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
  2. Install requirements:
conda create --name vlnhamt python=3.8.5
conda activate vlnhamt
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
  3. Download data from Dropbox, including processed annotations, features and pretrained models. Put the data in the `datasets` directory.

  4. (Optional) If you want to train HAMT end-to-end, you should download the original Matterport3D data.
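
If the simulator built successfully, a quick sanity check such as the sketch below should run without errors. It assumes the batched API of the latest Matterport3DSimulator (setBatchSize, list-valued newEpisode) and uses placeholder scan/viewpoint IDs and repository-relative paths; adjust these to your setup.

import math
import MatterSim  # requires Matterport3DSimulator/build on PYTHONPATH (see step 1)

sim = MatterSim.Simulator()
sim.setNavGraphPath("datasets/R2R/connectivity")  # connectivity files from the downloaded data
sim.setRenderingEnabled(False)                    # no Matterport3D images needed for this check
sim.setDiscretizedViewingAngles(True)
sim.setCameraResolution(640, 480)
sim.setCameraVFOV(math.radians(60))
sim.setBatchSize(1)
sim.initialize()

# Placeholder IDs: replace with a real scan/viewpoint pair from the connectivity files.
sim.newEpisode(["<scan_id>"], ["<viewpoint_id>"], [0.0], [0.0])
state = sim.getState()[0]
print(state.scanId, state.location.viewpointId, state.heading)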

Extracting features (optional)

Scripts to extract visual features are in the `preprocess` directory:

CUDA_VISIBLE_DEVICES=0 python preprocess/precompute_img_features_vit.py \
    --model_name vit_base_patch16_224 --out_image_logits \
    --connectivity_dir datasets/R2R/connectivity \
    --scan_dir datasets/Matterport3D/v1_unzip_scans \
    --num_workers 4 \
    --output_file datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5
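
The script writes one HDF5 dataset per viewpoint. A small inspection sketch is shown below; it assumes keys of the form `scanId_viewpointId` and per-view feature arrays, which may differ depending on the extraction options, so verify against your own output.

import h5py

# Output path from the extraction command above.
feat_file = "datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5"

with h5py.File(feat_file, "r") as f:
    keys = list(f.keys())
    print(len(keys), "viewpoints stored")
    key = keys[0]                          # assumed format: "<scanId>_<viewpointId>"
    feats = f[key][...]
    print(key, feats.shape, feats.dtype)   # typically (36 views, feature_dim)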

Training with proxy tasks

Stage 1: Pretrain with fixed ViT features

NODE_RANK=0
NUM_GPUS=4
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch \
    --nproc_per_node=${NUM_GPUS} --node_rank $NODE_RANK \
    pretrain_src/main_r2r.py --world_size ${NUM_GPUS} \
    --model_config pretrain_src/config/r2r_model_config.json \
    --config pretrain_src/config/pretrain_r2r.json \
    --output_dir datasets/R2R/exprs/pretrain/cmt-vitbase-6tasks

Stage 2: Train ViT in an end-to-end manner

Change the config file to `pretrain_r2r_e2e.json`.

Fine-tuning for sequential action prediction

cd finetune_src
bash scripts/run_r2r.bash
bash scripts/run_r2r_back.bash
bash scripts/run_r2r_last.bash
bash scripts/run_r4r.bash
bash scripts/run_reverie.bash
bash scripts/run_cvdn.bash

Citation

If you find this work useful, please consider citing:

@InProceedings{chen2021hamt,
author       = {Chen, Shizhe and Guhur, Pierre-Louis and Schmid, Cordelia and Laptev, Ivan},
title        = {History Aware Multimodal Transformer for Vision-and-Language Navigation},
booktitle    = {NeurIPS},
year         = {2021},
}

Acknowledgement

Parts of the code are built upon pytorch-image-models, UNITER and Recurrent-VLN-BERT. Thanks to the authors for their great work!

Comments
  • Questions about ViT end2end training

    Hi, when we only change the config to pretrain_r2r_e2e.json and still run main_r2r.py, it seems that the image files are not used. Besides, as described in the paper, ViT e2e training is done in the second stage. So I trained a checkpoint through the first-stage pretraining, then directly used pretrain_r2r_e2e.json (with the checkpoint path modified) and main_r2r_image.py to run the ViT e2e training. However, a dimension mismatch error occurs in image_pretrain.py line 80 (which points to vilmodel.py line 550). Could you help check if there is anything I need to change? Thanks so much!

    opened by jialuli-luka 4
  • provided features differs from extracted features

    Hi,

    Thanks for the great work!

    I noticed that if we use preprocess/precompute_img_features_vit.py to extract visual features, the extracted features are slightly different from the features in the provided HDF5 files. For the pretrained ViT-B/16 from TIMM, there is about a 0.01 average difference between the extracted features and the features in datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5. For the e2e-pretrained ViT-B/16 from datasets/R2R/trained_models/vitbase-6tasks-pretrain-e2e/model_step_22000.pt, there is about a 0.007 average difference between the extracted features and the features in datasets/R2R/features/pth_vit_base_patch16_224_imagenet_r2r.e2e.ft.22k.hdf5.

    I also made sure this is not caused by a potentially different batch size in the forward step: feature differences caused by batch size alone should only be on the order of 1e-6.

    Could you provide any suggestions on the reason causing these differences?

    opened by wz0919 2
  • Question about the version of CLIP used in RxR.

    Hi, thanks for your great work!

    I noticed that you used CLIP to extract features for training on the RxR dataset. May I ask which version you used to extract the features: ViT-B/16 or ViT-B/32? And is it the version from OpenAI or HuggingFace?

    Thanks for your attention to this matter! Best regards,

    opened by MarSaKi 1
  • Could you specify the models that can reproduce the reported results?

    Hi Shizhe,

    Thank you for releasing the code! I am wondering if you can specify or release the models that let us reproduce the reported results. I would appreciate it greatly if you could provide additional instructions!

    opened by Xin-Ye-1 1
  • Questions about pre-trained model

    Hi, thanks for the great work! I'm trying to directly load the pretrained model you provide, "../datasets/R2R/trained_models/vitbase-6tasks-pretrain/model_step_130000.pt", and then fine-tune on the R2R task. However, I encounter the following problem:

     RuntimeError: Error(s) in loading state_dict for NavCMT:
    size mismatch for bert.hist_embeddings.position_embeddings.weight: copying a param with shape torch.Size([50, 768]) from checkpoint, the shape in current model is torch.Size([100, 768]).
    

    I also followed your provided scripts to pre-train the model (w/o e2e ViT) and then fine-tune on R2R, which seems to work fine.
    I just want to check whether the provided pre-trained model (w/o e2e ViT) is the correct one, and whether I need to modify any code/hyperparameters to load it. (See the resizing sketch after this comments list for one possible workaround.)

    Thanks so much!

    opened by jialuli-luka 1
  • Suggest to loosen the dependency on networkx

    Dear developers,

    Your project VLN-HAMT pins "networkx==2.5.1" in its dependencies. After analyzing the source code, we found that networkx 2.5 is also suitable and does not affect your project. We therefore suggest loosening the dependency from "networkx==2.5.1" to "networkx>=2.5,<=2.5.1" to avoid possible conflicts when importing more packages or for downstream projects that may use ddos_script.

    May I open a pull request to further loosen the dependency on networkx?

    By the way, could you please tell us whether such dependency analysis could be helpful for maintaining dependencies more easily during your development?



    Details:

    Your project (commit id: 08918ddcea7b7822831a5b535038732f8dfeab23) directly uses 5 APIs from package networkx.

    networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path, networkx.classes.function.set_node_attributes, networkx.classes.graph.Graph.__init__, networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path_length, networkx.classes.graph.Graph.add_edge
    

    Starting from these, 1 function is then indirectly called, including 0 of networkx's internal APIs and 1 outside API, as follows:

    [/cshizhe/VLN-HAMT]
    +--networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path
    +--networkx.classes.function.set_node_attributes
    +--networkx.classes.graph.Graph.__init__
    |      +--networkx.convert.to_networkx_graph
    |      |      +--networkx.convert.from_dict_of_dicts
    |      |      +--networkx.convert.from_dict_of_lists
    |      |      +--warnings.warn
    |      |      +--networkx.convert.from_edgelist
    +--networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path_length
    +--networkx.classes.graph.Graph.add_edge
    

    Since none of these functions changed in networkx between versions 2.5 and 2.5.1, we believe it is safe to loosen the corresponding dependency.

    opened by Agnes-U 0
  • Could you please share the running scripts for IL+RL training from scratch?

    Hi, Shizhe. Thanks very much for the great HAMT work! I was recently using your code to run the VLN experiments myself. I noticed that you provide the running scripts for pretraining and fine-tuning, but not for training the model on R2R from scratch. I guess the model configuration for training from scratch should differ from fine-tuning after pretraining; e.g., --fix_lang_embedding and --fix_hist_embedding should not be set, since the two embeddings are randomly initialized, right? So I hope you could share the scripts for training HAMT from scratch.

    opened by Jackie-Chou 3
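
Regarding the position-embedding size mismatch reported in the comments above (checkpoint shape [50, 768] vs. model shape [100, 768]), one generic workaround is to resize the offending tensor before loading the state dict. This is only an illustrative sketch, not an official fix: it assumes the checkpoint is a plain state dict and that zero-padding (or truncating) the extra history positions is acceptable for your use case; matching the fine-tuning history length to the pretraining config is the cleaner solution.

import torch

ckpt_path = "datasets/R2R/trained_models/vitbase-6tasks-pretrain/model_step_130000.pt"
state_dict = torch.load(ckpt_path, map_location="cpu")  # adjust if the state dict is wrapped

key = "bert.hist_embeddings.position_embeddings.weight"
old = state_dict[key]                 # e.g. [50, 768] in the checkpoint
target_len = 100                      # history length expected by the current model config
if old.size(0) < target_len:
    pad = old.new_zeros(target_len - old.size(0), old.size(1))
    state_dict[key] = torch.cat([old, pad], dim=0)   # zero-pad the extra history positions
else:
    state_dict[key] = old[:target_len]               # or truncate when the model expects fewer
# model.load_state_dict(state_dict, strict=False)    # `model` is the instantiated NavCMT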