History Aware Multimodal Transformer for Vision-and-Language Navigation

Overview

This repository is the official implementation of History Aware Multimodal Transformer for Vision-and-Language Navigation. Project webpage: https://cshizhe.github.io/projects/vln_hamt.html

Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. In this work, we introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT efficiently encodes all past panoramic observations via a hierarchical vision transformer. It then jointly combines text, history and the current observation to predict the next action. We first train HAMT end-to-end using several proxy tasks, including single-step action prediction and spatial relation prediction, and then use reinforcement learning to further improve the navigation policy. HAMT achieves new state of the art on a broad range of VLN tasks, including VLN with fine-grained instructions (R2R, RxR), high-level instructions (R2R-Last, REVERIE), dialogs (CVDN), as well as long-horizon VLN (R4R, R2R-Back).

[Figure: HAMT framework overview]
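To make the data flow concrete, below is a minimal, schematic sketch of a single HAMT decision step in PyTorch. The module layout, the simple mean pooling of panorama views and all tensor names are assumptions for illustration only; the actual hierarchical history encoder, pretraining heads and hyperparameters live in pretrain_src/ and finetune_src/.

import torch
import torch.nn as nn

class HAMTSketch(nn.Module):
    """Illustrative only: jointly encodes text, history and the current observation."""
    def __init__(self, d_model=768, n_heads=12, n_layers=2):
        super().__init__()
        def encoder():
            layer = nn.TransformerEncoderLayer(d_model, n_heads)
            return nn.TransformerEncoder(layer, n_layers)
        self.text_encoder = encoder()       # language transformer
        self.temporal_encoder = encoder()   # history transformer over past steps
        self.cross_encoder = encoder()      # cross-modal transformer
        self.action_head = nn.Linear(d_model, 1)

    def forward(self, text_emb, hist_view_feats, obs_feats):
        # text_emb:        (B, L, D) instruction token embeddings
        # hist_view_feats: (B, T, V, D) per-view ViT features of T past panoramas
        # obs_feats:       (B, K, D) features of the K current candidate views
        def enc(module, x):                 # nn.Transformer* expects (seq, batch, dim)
            return module(x.transpose(0, 1)).transpose(0, 1)
        txt = enc(self.text_encoder, text_emb)
        hist_tokens = hist_view_feats.mean(dim=2)            # pool views per panorama -> (B, T, D)
        hist = enc(self.temporal_encoder, hist_tokens)
        joint = enc(self.cross_encoder, torch.cat([txt, hist, obs_feats], dim=1))
        obs_out = joint[:, -obs_feats.size(1):]              # (B, K, D)
        return self.action_head(obs_out).squeeze(-1)         # (B, K) next-action logits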

Installation

  1. Install the Matterport3D simulator: follow the instructions here. We use the latest version, in which all inputs and outputs are batched. A small import sanity check is sketched after this list.
export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
  2. Install requirements:
conda create --name vlnhamt python=3.8.5
conda activate vlnhamt
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
  3. Download the data from Dropbox, including processed annotations, features and pretrained models. Put the data into the `datasets` directory.

  4. (Optional) If you want to train HAMT end-to-end, you should also download the original Matterport3D data.
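After installing the simulator, a quick check like the following can confirm that the batched MatterSim API is importable from the PYTHONPATH. This is only a sketch under assumptions: SCAN_ID and VIEWPOINT_ID are placeholders to be taken from the R2R connectivity graphs, and rendering is disabled so no Matterport3D images are required.

import math
import MatterSim

sim = MatterSim.Simulator()
sim.setNavGraphPath('datasets/R2R/connectivity')  # connectivity graphs from the downloaded data
sim.setRenderingEnabled(False)                    # graph-only check, no RGB rendering
sim.setDiscretizedViewingAngles(True)
sim.setCameraResolution(640, 480)
sim.setCameraVFOV(math.radians(60))
sim.setBatchSize(1)                               # the latest simulator batches all episodes
sim.initialize()
sim.newEpisode(['SCAN_ID'], ['VIEWPOINT_ID'], [0.0], [0.0])  # placeholder ids
state = sim.getState()[0]
print(state.scanId, state.location.viewpointId, len(state.navigableLocations))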

Extracting features (optional)

Scripts to extract visual features are in the preprocess directory:

CUDA_VISIBLE_DEVICES=0 python preprocess/precompute_img_features_vit.py \
    --model_name vit_base_patch16_224 --out_image_logits \
    --connectivity_dir datasets/R2R/connectivity \
    --scan_dir datasets/Matterport3D/v1_unzip_scans \
    --num_workers 4 \
    --output_file datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5
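The resulting HDF5 file can be inspected with h5py. A minimal sketch, assuming each key has the form <scanId>_<viewpointId> and stores one feature row per discretized view of the panorama:

import h5py

ft_file = 'datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5'
with h5py.File(ft_file, 'r') as f:
    key = next(iter(f.keys()))            # e.g. "<scanId>_<viewpointId>"
    feats = f[key][...]
    print(key, feats.shape, feats.dtype)  # roughly (36, feature_dim) per viewpoint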

Training with proxy tasks

Stage 1: Pretrain with fixed ViT features

NODE_RANK=0
NUM_GPUS=4
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch \
    --nproc_per_node=${NUM_GPUS} --node_rank $NODE_RANK \
    pretrain_src/main_r2r.py --world_size ${NUM_GPUS} \
    --model_config pretrain_src/config/r2r_model_config.json \
    --config pretrain_src/config/pretrain_r2r.json \
    --output_dir datasets/R2R/exprs/pretrain/cmt-vitbase-6tasks
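Before moving on to fine-tuning, it can help to check the parameter shapes stored in a pretraining checkpoint (see, for example, the hist_embeddings size-mismatch question in the comments below). A minimal inspection sketch, assuming the checkpoints are torch-serialized state dicts named model_step_*.pt; the exact nesting of the saved dict may differ:

import torch

ckpt_path = 'datasets/R2R/trained_models/vitbase-6tasks-pretrain/model_step_130000.pt'
state = torch.load(ckpt_path, map_location='cpu')
state_dict = state.get('model', state) if isinstance(state, dict) else state
for name, value in list(state_dict.items())[:10]:
    if hasattr(value, 'shape'):           # skip any non-tensor metadata
        print(name, tuple(value.shape))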

Stage 2: Train ViT in an end-to-end manner

Change the config file to `pretrain_r2r_e2e.json`.

Fine-tuning for sequential action prediction

cd finetune_src
bash scripts/run_r2r.bash
bash scripts/run_r2r_back.bash
bash scripts/run_r2r_last.bash
bash scripts/run_r4r.bash
bash scripts/run_reverie.bash
bash scripts/run_cvdn.bash

Citation

If you find this work useful, please consider citing:

@InProceedings{chen2021hamt,
author       = {Chen, Shizhe and Guhur, Pierre-Louis and Schmid, Cordelia and Laptev, Ivan},
title        = {History Aware Multimodal Transformer for Vision-and-Language Navigation},
booktitle    = {NeurIPS},
year         = {2021},
}

Acknowledgement

Some of the code is built upon pytorch-image-models, UNITER and Recurrent-VLN-BERT. Thanks to them for their great work!

Comments
  • Questions about ViT end2end training

    Hi, when we only change the config to pretrain_r2r_e2e.json and still run main_r2r.py, it seems that the image files are not used. Besides, as described in the paper, ViT e2e training is done in the second stage. Thus I trained a model checkpoint through the first-stage pretraining, and I directly used pretrain_r2r_e2e.json (also modifying the ckpt path) and main_r2r_image.py to run the ViT e2e training. However, a dimension mismatch error occurs in image_pretrain.py line 80 (which points to vilmodel.py line 550). Could you help check if there's anything I need to change? Thanks so much!

    opened by jialuli-luka 4
  • provided features differ from extracted features

    Hi,

    Thanks for the great work!

    I noticed that if we use preprocess/precompute_img_features_vit.py to extract visual features, the output features are slightly different from the features in the provided HDF5 files. For the pretrained ViT-B/16 from TIMM, there is about a 0.01 average difference between the output features and the features in datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5. For the e2e-pretrained ViT-B/16 from datasets/R2R/trained_models/vitbase-6tasks-pretrain-e2e/model_step_22000.pt, there is about a 0.007 average difference between the output features and the features in datasets/R2R/features/pth_vit_base_patch16_224_imagenet_r2r.e2e.ft.22k.hdf5.

    I also made sure this isn't caused by a potentially different batch size in the forward step; feature differences caused by a different batch size should only be on the order of 1e-6.

    Could you provide any suggestions on what might cause these differences?

    opened by wz0919 2
  • Question about the version of CLIP used in RxR.

    Hi, thanks for your great work!

    I noted that you used CLIP to extract features for training on the RxR dataset. May I ask which version you used to extract the features? ViT-B/16 or ViT-B/32? The version from OpenAI or Hugging Face?

    Thanks for your attention to this matter! Best regards,

    opened by MarSaKi 1
  • Could you specify the models that can reproduce the reported results?

    Hi Shizhe,

    Thank you for releasing the code! I am wondering if you can specify or release the models for us to reproduce the reported results. I would appreciate it greatly if you can provide additional instructions!

    opened by Xin-Ye-1 1
  • Questions about pre-trained model

    Hi, thanks for the great work! I'm trying to directly load the pretrained model you provide, "../datasets/R2R/trained_models/vitbase-6tasks-pretrain/model_step_130000.pt", and then fine-tune on the R2R task. However, I encounter this problem:

     RuntimeError: Error(s) in loading state_dict for NavCMT:
    size mismatch for bert.hist_embeddings.position_embeddings.weight: copying a param with shape torch.Size([50, 768]) from checkpoint, the shape in current model is torch.Size([100, 768]).
    

    I also followed your provided scripts to pre-train the model (w/o e2e ViT) and then fine-tune on R2R, which seems to work fine.
    I just want to check whether the provided pre-trained model (w/o e2e ViT) is the correct one, and whether I need to modify any code/hyperparameters to load it?

    Thanks so much!

    opened by jialuli-luka 1
  • Suggestion to loosen the dependency on networkx

    Dear developers,

    Your project VLN-HAMT requires "networkx==2.5.1" in its dependencies. After analyzing the source code, we found that the following version of networkx can also be suitable without affecting your project: networkx 2.5. Therefore, we suggest loosening the dependency on networkx from "networkx==2.5.1" to "networkx>=2.5,<=2.5.1" to avoid possible conflicts when importing more packages or for downstream projects.

    May I open a pull request to further loosen the dependency on networkx?

    By the way, could you please tell us whether such dependency analysis could be helpful for making dependency maintenance easier during your development?



    Details:

    Your project (commit id: 08918ddcea7b7822831a5b535038732f8dfeab23) directly uses 5 APIs from package networkx.

    networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path, networkx.classes.function.set_node_attributes, networkx.classes.graph.Graph.__init__, networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path_length, networkx.classes.graph.Graph.add_edge
    

    Starting from these, 1 function is then indirectly called, including 0 of networkx's internal APIs and 1 outside API, as follows:

    [/cshizhe/VLN-HAMT]
    +--networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path
    +--networkx.classes.function.set_node_attributes
    +--networkx.classes.graph.Graph.__init__
    |      +--networkx.convert.to_networkx_graph
    |      |      +--networkx.convert.from_dict_of_dicts
    |      |      +--networkx.convert.from_dict_of_lists
    |      |      +--warnings.warn
    |      |      +--networkx.convert.from_edgelist
    +--networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path_length
    +--networkx.classes.graph.Graph.add_edge
    

    Since none of these functions changed in networkx between versions 2.5 and 2.5.1, we believe it is safe to loosen the corresponding dependency.

    opened by Agnes-U 0
  • Could you please share the running scripts for IL+RL training from scratch?

    Hi, Shizhe. Thanks very much for the great HAMT work! I have recently been using your code to run the VLN experiments myself. I noticed that you provide running scripts for pretraining and fine-tuning, but not for training the model on R2R from scratch. I guess the model configuration for training from scratch should be different from fine-tuning after pretraining, e.g., --fix_lang_embedding and --fix_hist_embedding should not be set since the two embeddings are completely random, right? So I hope you can share the scripts for training HAMT from scratch.

    opened by Jackie-Chou 3