History Aware Multimodal Transformer for Vision-and-Language Navigation

Overview

This repository is the official implementation of History Aware Multimodal Transformer for Vision-and-Language Navigation. Project webpage: https://cshizhe.github.io/projects/vln_hamt.html

Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. In this work, we introduce the History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT efficiently encodes all past panoramic observations via a hierarchical vision transformer, then jointly combines the text, the history and the current observation to predict the next action. We first train HAMT end-to-end using several proxy tasks, including single-step action prediction and spatial relation prediction, and then use reinforcement learning to further improve the navigation policy. HAMT achieves a new state of the art on a broad range of VLN tasks, including VLN with fine-grained instructions (R2R, RxR), high-level instructions (R2R-Last, REVERIE), dialogs (CVDN), as well as long-horizon VLN (R4R, R2R-Back).

(Figure: overview of the HAMT framework)
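
To make the overview concrete, below is a minimal, illustrative PyTorch sketch of the two ideas described above: a hierarchical encoder that first pools the 36 views of each past panorama into a single history token and then encodes the token sequence, followed by joint fusion of text, history and current-observation tokens to score candidate actions. This is not the actual HAMT model: the layer counts, the mean-pooling readout and the tensor layout are assumptions chosen for brevity.

import torch
import torch.nn as nn

class HistoryAwarePolicySketch(nn.Module):
    """Illustrative two-level encoder: panorama -> history token -> cross-modal fusion."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        # Level 1: encode the 36 views of one past panorama into a single history token.
        self.pano_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads), num_layers=2)
        # Level 2: temporal encoder over the per-step history tokens.
        self.hist_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads), num_layers=2)
        # Joint fusion over [text; history; current observation] tokens.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads), num_layers=4)
        self.action_head = nn.Linear(d_model, 1)

    def forward(self, text_emb, past_panos, cur_obs):
        # text_emb:   [B, L, D] embedded instruction tokens
        # past_panos: [B, T, 36, D] ViT features of the T past panoramic observations
        # cur_obs:    [B, K, D] features of the K navigable candidates at this step
        B, T, V, D = past_panos.shape
        # nn.Transformer* modules expect [sequence, batch, dim], hence the transposes.
        views = past_panos.view(B * T, V, D).transpose(0, 1)        # [V, B*T, D]
        hist = self.pano_encoder(views).mean(dim=0).view(B, T, D)   # one token per past step
        hist = self.hist_encoder(hist.transpose(0, 1))              # [T, B, D]
        tokens = torch.cat([text_emb.transpose(0, 1), hist, cur_obs.transpose(0, 1)], dim=0)
        fused = self.fusion(tokens)                                 # [L+T+K, B, D]
        cand = fused[-cur_obs.size(1):].transpose(0, 1)             # [B, K, D]
        return self.action_head(cand).squeeze(-1)                   # [B, K] action logits

# Toy usage with random tensors (batch of 2, 20 instruction tokens, 5 past steps, 8 candidates)
policy = HistoryAwarePolicySketch()
logits = policy(torch.randn(2, 20, 768), torch.randn(2, 5, 36, 768), torch.randn(2, 8, 768))
print(logits.shape)  # torch.Size([2, 8])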

Installation

  1. Install the Matterport3D simulator: follow the instructions here. We use the latest version, in which all inputs and outputs are batched. A quick sanity check for the installed simulator is sketched after this list.
export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
  2. Install requirements:
conda create --name vlnhamt python=3.8.5
conda activate vlnhamt
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
  3. Download data from Dropbox, including processed annotations, features and pretrained models. Put the data in the `datasets` directory.

  4. (Optional) If you want to train HAMT end-to-end, you should download the original Matterport3D data.
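
If the simulator built successfully, a quick sanity check such as the sketch below should run without errors. It assumes the batched API of the latest Matterport3DSimulator (setBatchSize, list-valued newEpisode) and uses placeholder scan/viewpoint IDs and repository-relative paths; adjust these to your setup.

import math
import MatterSim  # requires Matterport3DSimulator/build on PYTHONPATH (see step 1)

sim = MatterSim.Simulator()
sim.setNavGraphPath("datasets/R2R/connectivity")  # connectivity files from the downloaded data
sim.setRenderingEnabled(False)                    # no Matterport3D images needed for this check
sim.setDiscretizedViewingAngles(True)
sim.setCameraResolution(640, 480)
sim.setCameraVFOV(math.radians(60))
sim.setBatchSize(1)
sim.initialize()

# Placeholder IDs: replace with a real scan/viewpoint pair from the connectivity files.
sim.newEpisode(["<scan_id>"], ["<viewpoint_id>"], [0.0], [0.0])
state = sim.getState()[0]
print(state.scanId, state.location.viewpointId, state.heading)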

Extracting features (optional)

Scripts to extract visual features are in the `preprocess` directory:

CUDA_VISIBLE_DEVICES=0 python preprocess/precompute_img_features_vit.py \
    --model_name vit_base_patch16_224 --out_image_logits \
    --connectivity_dir datasets/R2R/connectivity \
    --scan_dir datasets/Matterport3D/v1_unzip_scans \
    --num_workers 4 \
    --output_file datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5
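
The script writes one HDF5 dataset per viewpoint. A small inspection sketch is shown below; it assumes keys of the form `scanId_viewpointId` and per-view feature arrays, which may differ depending on the extraction options, so verify against your own output.

import h5py

# Output path from the extraction command above.
feat_file = "datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5"

with h5py.File(feat_file, "r") as f:
    keys = list(f.keys())
    print(len(keys), "viewpoints stored")
    key = keys[0]                          # assumed format: "<scanId>_<viewpointId>"
    feats = f[key][...]
    print(key, feats.shape, feats.dtype)   # typically (36 views, feature_dim)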

Training with proxy tasks

Stage 1: Pretrain with fixed ViT features

NODE_RANK=0
NUM_GPUS=4
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch \
    --nproc_per_node=${NUM_GPUS} --node_rank $NODE_RANK \
    pretrain_src/main_r2r.py --world_size ${NUM_GPUS} \
    --model_config pretrain_src/config/r2r_model_config.json \
    --config pretrain_src/config/pretrain_r2r.json \
    --output_dir datasets/R2R/exprs/pretrain/cmt-vitbase-6tasks

Stage 2: Train ViT in an end-to-end manner

Change the config file to `pretrain_r2r_e2e.json`.

Fine-tuning for sequential action prediction

cd finetune_src
bash scripts/run_r2r.bash
bash scripts/run_r2r_back.bash
bash scripts/run_r2r_last.bash
bash scripts/run_r4r.bash
bash scripts/run_reverie.bash
bash scripts/run_cvdn.bash

Citation

If you find this work useful, please consider citing:

@InProceedings{chen2021hamt,
author       = {Chen, Shizhe and Guhur, Pierre-Louis and Schmid, Cordelia and Laptev, Ivan},
title        = {History Aware Multimodal Transformer for Vision-and-Language Navigation},
booktitle    = {NeurIPS},
year         = {2021},
}

Acknowledgement

Parts of the code are built upon pytorch-image-models, UNITER and Recurrent-VLN-BERT. Thanks to the authors for their great work!

Comments
  • Questions about ViT end2end training

    Hi, when we only change the config to pretrain_r2r_e2e.json and still run main_r2r.py, it seems that the image files are not used. Besides, as described in the paper, ViT e2e training is done in the second stage. So I trained a checkpoint through the first-stage pretraining, then directly used pretrain_r2r_e2e.json (with the checkpoint path modified) and main_r2r_image.py to run the ViT e2e training. However, a dimension mismatch error occurs in image_pretrain.py line 80 (which points to vilmodel.py line 550). Could you help check if there is anything I need to change? Thanks so much!

    opened by jialuli-luka 4
  • provided features differs from extracted features

    Hi,

    Thanks for the great work!

    I noticed that if we use preprocess/precompute_img_features_vit.py to extract visual features, the extracted features are slightly different from the features in the provided HDF5 files. For the pretrained ViT-B/16 from TIMM, there is about a 0.01 average difference between the extracted features and the features in datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5. For the e2e-pretrained ViT-B/16 from datasets/R2R/trained_models/vitbase-6tasks-pretrain-e2e/model_step_22000.pt, there is about a 0.007 average difference between the extracted features and the features in datasets/R2R/features/pth_vit_base_patch16_224_imagenet_r2r.e2e.ft.22k.hdf5.

    I also made sure this is not caused by a potentially different batch size in the forward step: feature differences caused by batch size alone should only be on the order of 1e-6.

    Could you provide any suggestions on the reason causing these differences?

    opened by wz0919 2
  • Question about the version of CLIP used in RxR.

    Hi, thanks for your great work!

    I noticed that you used CLIP to extract features for training on the RxR dataset. May I ask which version you used to extract the features: ViT-B/16 or ViT-B/32? And is it the version from OpenAI or HuggingFace?

    Thanks for your attention to this matter! Best regards,

    opened by MarSaKi 1
  • Could you specify the models that can reproduce the reported results?

    Hi Shizhe,

    Thank you for releasing the code! I am wondering if you can specify or release the models that let us reproduce the reported results. I would appreciate it greatly if you could provide additional instructions!

    opened by Xin-Ye-1 1
  • Questions about pre-trained model

    Hi, thanks for the great work! I'm trying to directly load the pretrained model you provide, "../datasets/R2R/trained_models/vitbase-6tasks-pretrain/model_step_130000.pt", and then fine-tune on the R2R task. However, I encounter the following problem:

     RuntimeError: Error(s) in loading state_dict for NavCMT:
    size mismatch for bert.hist_embeddings.position_embeddings.weight: copying a param with shape torch.Size([50, 768]) from checkpoint, the shape in current model is torch.Size([100, 768]).
    

    I also followed your provided scripts to pre-train the model (w/o e2e ViT) and then fine-tune on R2R, which seems to work fine.
    I just want to check whether the provided pre-trained model (w/o e2e ViT) is the correct one, and whether I need to modify any code/hyperparameters to load it. (See the resizing sketch after this comments list for one possible workaround.)

    Thanks so much!

    opened by jialuli-luka 1
  • Suggest to loosen the dependency on networkx

    Dear developers,

    Your project VLN-HAMT pins "networkx==2.5.1" in its dependencies. After analyzing the source code, we found that networkx 2.5 is also suitable and does not affect your project. We therefore suggest loosening the dependency from "networkx==2.5.1" to "networkx>=2.5,<=2.5.1" to avoid possible conflicts when importing more packages or for downstream projects that may use ddos_script.

    May I open a pull request to further loosen the dependency on networkx?

    By the way, could you please tell us whether such dependency analysis could be helpful for maintaining dependencies more easily during your development?



    Details:

    Your project (commit id: 08918ddcea7b7822831a5b535038732f8dfeab23) directly uses 5 APIs from package networkx.

    networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path, networkx.classes.function.set_node_attributes, networkx.classes.graph.Graph.__init__, networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path_length, networkx.classes.graph.Graph.add_edge
    

    Starting from these, 1 function is then indirectly called, including 0 of networkx's internal APIs and 1 outside API, as follows:

    [/cshizhe/VLN-HAMT]
    +--networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path
    +--networkx.classes.function.set_node_attributes
    +--networkx.classes.graph.Graph.__init__
    |      +--networkx.convert.to_networkx_graph
    |      |      +--networkx.convert.from_dict_of_dicts
    |      |      +--networkx.convert.from_dict_of_lists
    |      |      +--warnings.warn
    |      |      +--networkx.convert.from_edgelist
    +--networkx.algorithms.shortest_paths.weighted.all_pairs_dijkstra_path_length
    +--networkx.classes.graph.Graph.add_edge
    

    Since none of these functions changed in networkx between versions 2.5 and 2.5.1, we believe it is safe to loosen the corresponding dependency.

    opened by Agnes-U 0
  • Could you please share the running scripts for IL+RL training from scratch?

    Hi, Shizhe. Thanks very much for the great HAMT work! I was recently using your code to run the VLN experiments myself. I noticed that you provide the running scripts for pretraining and fine-tuning, but not for training the model on R2R from scratch. I guess the model configuration for training from scratch should differ from fine-tuning after pretraining; e.g., --fix_lang_embedding and --fix_hist_embedding should not be set, since the two embeddings are randomly initialized, right? So I hope you could share the scripts for training HAMT from scratch.

    opened by Jackie-Chou 3
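
Regarding the position-embedding size mismatch reported in the comments above (checkpoint shape [50, 768] vs. model shape [100, 768]), one generic workaround is to resize the offending tensor before loading the state dict. This is only an illustrative sketch, not an official fix: it assumes the checkpoint is a plain state dict and that zero-padding (or truncating) the extra history positions is acceptable for your use case; matching the fine-tuning history length to the pretraining config is the cleaner solution.

import torch

ckpt_path = "datasets/R2R/trained_models/vitbase-6tasks-pretrain/model_step_130000.pt"
state_dict = torch.load(ckpt_path, map_location="cpu")  # adjust if the state dict is wrapped

key = "bert.hist_embeddings.position_embeddings.weight"
old = state_dict[key]                 # e.g. [50, 768] in the checkpoint
target_len = 100                      # history length expected by the current model config
if old.size(0) < target_len:
    pad = old.new_zeros(target_len - old.size(0), old.size(1))
    state_dict[key] = torch.cat([old, pad], dim=0)   # zero-pad the extra history positions
else:
    state_dict[key] = old[:target_len]               # or truncate when the model expects fewer
# model.load_state_dict(state_dict, strict=False)    # `model` is the instantiated NavCMT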