History Aware Multimodal Transformer for Vision-and-Language Navigation

Shizhe Chen

Last update: Nov 23, 2022

Overview

This repository is the official implementation of History Aware Multimodal Transformer for Vision-and-Language Navigation. Project webpage: https://cshizhe.github.io/projects/vln_hamt.html

Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. In this work, we introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT efficiently encodes all past panoramic observations via a hierarchical vision transformer. It then jointly combines text, history and the current observation to predict the next action. We first train HAMT end-to-end using several proxy tasks, including single-step action prediction and spatial relation prediction, and then use reinforcement learning to further improve the navigation policy. HAMT achieves a new state of the art on a broad range of VLN tasks, including VLN with fine-grained instructions (R2R, RxR), high-level instructions (R2R-Last, REVERIE), dialogs (CVDN), and long-horizon VLN (R4R, R2R-Back).

Installation

Install the Matterport3D simulator by following the instructions in the Matterport3DSimulator repository. We use the latest version, in which all inputs and outputs are batched.

export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
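To confirm that the export above took effect, a quick check from Python can help. This is a minimal sketch, assuming the simulator build exposes its Python bindings as a module named `MatterSim`; the helper name `simulator_available` is hypothetical and not part of this repository.

```python
import importlib.util
import sys


def simulator_available(module_name: str = "MatterSim") -> bool:
    """Return True if `module_name` can be located on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if simulator_available():
        print("MatterSim found on PYTHONPATH")
    else:
        # Only reports the problem; it does not modify the environment.
        print(
            "MatterSim not found; check that PYTHONPATH includes "
            "Matterport3DSimulator/build",
            file=sys.stderr,
        )
```

Run it in the same shell session in which `PYTHONPATH` was exported, since the variable does not carry over to new terminals unless it is added to a shell profile.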

Install requirements:

conda create --name vlnhamt python=3.8.5
conda activate vlnhamt
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt

Download the data from Dropbox, including processed annotations, features and pretrained models. Put the data in the `datasets` directory.
(Optional) If you want to train HAMT end-to-end, you should also download the original Matterport3D data.

Extracting features (optional)

Scripts to extract visual features are in the `preprocess` directory:

CUDA_VISIBLE_DEVICES=0 python preprocess/precompute_img_features_vit.py \
    --model_name vit_base_patch16_224 --out_image_logits \
    --connectivity_dir datasets/R2R/connectivity \
    --scan_dir datasets/Matterport3D/v1_unzip_scans \
    --num_workers 4 \
    --output_file datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5
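Before starting a long extraction run, it can be useful to check that `--connectivity_dir` points at a complete set of connectivity graphs. The sketch below lists the (scan, viewpoint) pairs found there. It assumes the Matterport3DSimulator connectivity layout (a `scans.txt` index plus one `<scan>_connectivity.json` per scan, with `image_id` and `included` fields on each node). The function is an illustration and may differ from what the extraction script does internally.

```python
import json
import os


def load_viewpoint_ids(connectivity_dir):
    """Return (scan, viewpoint) pairs listed in the connectivity graphs.

    Assumed layout: `scans.txt` with one scan id per line, and one
    `<scan>_connectivity.json` per scan holding a list of node dicts.
    """
    with open(os.path.join(connectivity_dir, "scans.txt")) as f:
        scans = [line.strip() for line in f if line.strip()]

    viewpoint_ids = []
    for scan in scans:
        path = os.path.join(connectivity_dir, f"{scan}_connectivity.json")
        with open(path) as f:
            nodes = json.load(f)
        # Keep only nodes flagged as part of the navigation graph.
        viewpoint_ids.extend(
            (scan, node["image_id"]) for node in nodes if node["included"]
        )
    return viewpoint_ids


if __name__ == "__main__":
    connectivity_dir = "datasets/R2R/connectivity"
    if os.path.isfile(os.path.join(connectivity_dir, "scans.txt")):
        ids = load_viewpoint_ids(connectivity_dir)
        print(f"{len(ids)} viewpoints listed in {connectivity_dir}")
    else:
        print(f"No scans.txt under {connectivity_dir}; check the path")
```

A missing `<scan>_connectivity.json` raises `FileNotFoundError` here, which surfaces an incomplete download before any GPU time is spent.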

Training with proxy tasks

Stage 1: Pretrain with fixed ViT features

NODE_RANK=0
NUM_GPUS=4
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch \
    --nproc_per_node=${NUM_GPUS} --node_rank $NODE_RANK \
    pretrain_src/main_r2r.py --world_size ${NUM_GPUS} \
    --model_config pretrain_src/config/r2r_model_config.json \
    --config pretrain_src/config/pretrain_r2r.json \
    --output_dir datasets/R2R/exprs/pretrain/cmt-vitbase-6tasks

Stage 2: Train ViT in an end-to-end manner

Change the config file to `pretrain_r2r_e2e.json`.
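As an illustration of that swap, the sketch below assembles the Stage 1 launch command as an argument list with only the `--config` file made configurable. The helper `build_pretrain_cmd` and the output directory `path/to/stage2_output` are hypothetical placeholders. The sketch only mirrors the command shown for Stage 1 and makes no claim about other changes that end-to-end training may require.

```python
import shlex


def build_pretrain_cmd(config_name, output_dir, num_gpus=4, node_rank=0):
    """Build the pretraining launch command as a list of arguments.

    Mirrors the Stage 1 command; only the config file name and the
    output directory vary.
    """
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}", "--node_rank", str(node_rank),
        "pretrain_src/main_r2r.py", "--world_size", str(num_gpus),
        "--model_config", "pretrain_src/config/r2r_model_config.json",
        "--config", f"pretrain_src/config/{config_name}",
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    # "path/to/stage2_output" is a placeholder, not a path from this repository.
    cmd = build_pretrain_cmd("pretrain_r2r_e2e.json", "path/to/stage2_output")
    # Print for inspection; pass `cmd` to subprocess.run to actually launch it.
    print(shlex.join(cmd))
```

The printed line can be pasted into a shell after setting `CUDA_VISIBLE_DEVICES`, which the sketch leaves to the caller.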

Fine-tuning for sequential action prediction

cd finetune_src
bash scripts/run_r2r.bash
bash scripts/run_r2r_back.bash
bash scripts/run_r2r_last.bash
bash scripts/run_r4r.bash
bash scripts/run_reverie.bash
bash scripts/run_cvdn.bash

Citation

If you find this work useful, please consider citing:

@InProceedings{chen2021hamt,
author       = {Chen, Shizhe and Guhur, Pierre-Louis and Schmid, Cordelia and Laptev, Ivan},
title        = {History Aware Multimodal Transformer for Vision-and-Language Navigation},
booktitle    = {NeurIPS},
year         = {2021},
}

Acknowledgement

Parts of the code are built upon pytorch-image-models, UNITER and Recurrent-VLN-BERT. We thank the authors for their great work!

Comments

Questions about ViT end2end training

Hi, when we only change the config to pretrain_r2r_e2e.json and still run main_r2r.py, it seems that the image files are not used. Besides, as described in the paper, ViT e2e training is done in the second stage. So I trained a model checkpoint through the first pretraining stage, and then used pretrain_r2r_e2e.json (with the ckpt path modified) together with main_r2r_image.py to run the ViT e2e training. However, a dimension mismatch error occurs at image_pretrain.py line 80 (which points to vilmodel.py line 550). Could you help check whether there is anything I need to change? Thanks so much!

opened by jialuli-luka 4

Provided features differ from extracted features

Hi,

Thanks for the great work!

I noticed that if we use preprocess/precompute_img_features_vit.py to extract visual features, the output features are slightly different from the features in the provided HDF5 files. For the pretrained ViT-B/16 from TIMM, there is an average difference of about 0.01 between the output features and the features in datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5. For the e2e pretrained ViT-B/16 from datasets/R2R/trained_models/vitbase-6tasks-pretrain-e2e/model_step_22000.pt, there is an average difference of about 0.007 between the output features and the features in datasets/R2R/features/pth_vit_base_patch16_224_imagenet_r2r.e2e.ft.22k.hdf5.

I also made sure this isn't caused by a different batch size in the forward pass: feature differences caused by batch size should only be around 1e-6.

Could you suggest what might be causing these differences?

opened by wz0919 2

Question about the version of CLIP used in RxR

Hi, thanks for your great work!

I noticed that you used CLIP to extract the features for training on the RxR dataset. May I ask which version you used, ViT-B/16 or ViT-B/32? And was it the release from OpenAI or from Hugging Face?

Thanks for your attention to this matter! Best regards,

opened by MarSaKi 1

Could you specify the models that can reproduce the reported results?

Hi Shizhe,

Thank you for releasing the code! I am wondering if you can specify or release the models for us to reproduce the reported results. I would appreciate it greatly if you can provide additional instructions!

opened by Xin-Ye-1 1

Questions about pre-trained model

Hi, thanks for the great work! I'm trying to directly load the pretrained model you provide ("../datasets/R2R/trained_models/vitbase-6tasks-pretrain/model_step_130000.pt") and then fine-tune it on the R2R task. However, I encounter the following problem:

RuntimeError: Error(s) in loading state_dict for NavCMT: size mismatch for bert.hist_embeddings.position_embeddings.weight: copying a param with shape torch.Size([50, 768]) from checkpoint, the shape in current model is torch.Size([100, 768]).

I also followed your provided scripts to pre-train the model (w/o e2e ViT) and then fine-tune on R2R, which seems to work fine. I would just like to check whether the provided pre-trained model (w/o e2e ViT) is the correct one, and whether I need to modify any code or hyperparameter to load it.

Thanks so much!

opened by jialuli-luka 1

Suggest to loosen the dependency on networkx

Dear developers,

Your project VLN-HAMT requires "networkx==2.5.1" in its dependencies. After analyzing the source code, we found that networkx 2.5 is also suitable and does not affect your project. Therefore, we suggest loosening the dependency on networkx from "networkx==2.5.1" to "networkx>=2.5,<=2.5.1" to avoid possible conflicts when importing more packages or for downstream projects that may use ddos_script.

May I open a pull request to loosen the dependency on networkx?

By the way, could you please tell us whether this kind of dependency analysis would help make dependencies easier to maintain during your development?

Details:

Your project (commit id: 08918ddcea7b7822831a5b535038732f8dfeab23) directly uses 5 APIs from package networkx.

networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path, networkx.classes.function.set_node_attributes, networkx.classes.graph.Graph.__init__, networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path_length, networkx.classes.graph.Graph.add_edge

Starting from these, 1 function is then called indirectly, including 0 of networkx's internal APIs and 1 outside API, as follows:

[/cshizhe/VLN-HAMT]
+--networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path
+--networkx.classes.function.set_node_attributes
+--networkx.classes.graph.Graph.__init__
|      +--networkx.convert.to_networkx_graph
|      |      +--networkx.convert.from_dict_of_dicts
|      |      +--networkx.convert.from_dict_of_lists
|      |      +--warnings.warn
|      |      +--networkx.convert.from_edgelist
+--networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path_length
+--networkx.classes.graph.Graph.add_edge

Since none of these functions changed between networkx versions 2.5 and 2.5.1, we believe it is safe to loosen the corresponding dependency.

opened by Agnes-U 0

Could you please share the running scripts for IL+RL training from scratch?

Hi, Shizhe. Thanks very much for the great HAMT work! I have recently been using your code to run the VLN experiments myself. I noticed that you provide the running scripts for pretraining and fine-tuning, but not for training the model on R2R from scratch. I guess the model configuration for training from scratch should differ from that for fine-tuning after pretraining, e.g., --fix_lang_embedding and --fix_hist_embedding should not be set, since the two embeddings are totally random, right? So I hope you could share the scripts for training HAMT from scratch.

opened by Jackie-Chou 3
