Recurrent VLN-BERT

Code for the paper: A Recurrent Vision-and-Language BERT for Navigation
Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, Stephen Gould

[Paper & Appendices | GitHub]

Prerequisites

Installation

Install the Matterport3D Simulator. Please find the versions of packages in our environment here.
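
A quick way to confirm the simulator built correctly is to import its Python bindings. A minimal sketch (MatterSim is the module name the simulator's build is expected to expose):

    # Sanity check (sketch): succeeds only if the Matterport3D Simulator
    # bindings are on the current Python path.
    import MatterSim
    print('MatterSim imported from', MatterSim.__file__)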

Install Pytorch-Transformers. In particular, we use this version (the same as used by OSCAR) in our experiments.
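
If the package is installed correctly, the import used by the code should succeed. A minimal sanity check, assuming the OSCAR-style package layout (a transformers package with a bundled pytorch_transformers subpackage, as imported in r2r_src/vlnbert/vlnbert_init.py):

    # Sketch: verify the BERT classes the code relies on are importable.
    # If this raises ModuleNotFoundError, a different 'transformers' package
    # is likely shadowing the OSCAR version on your path.
    from transformers.pytorch_transformers import BertConfig, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    print(tokenizer.tokenize('walk past the sofa and stop at the stairs'))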

Data Preparation

Please follow the instructions below to prepare the data directories:

Initial OSCAR and PREVALENT weights

Please refer to vlnbert_init.py to set up the directories; a quick check of the downloaded weights is sketched after the list below.

  • Pre-trained OSCAR weights
    • Download the base-no-labels following this guide.
  • Pre-trained PREVALENT weights
    • Download the pytorch_model.bin from here.
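
As a quick check that the downloaded weights are usable, you can load them with PyTorch and inspect a few tensors. A minimal sketch, where the path is a placeholder for wherever you saved the PREVALENT weights (see vlnbert_init.py for the directories the code expects):

    # Sketch of a weights sanity check; the path below is an assumed placeholder.
    import torch

    WEIGHTS = 'Prevalent/pretrained_model/pytorch_model.bin'  # adjust to your layout

    state_dict = torch.load(WEIGHTS, map_location='cpu')
    print('%d tensors loaded' % len(state_dict))
    for name, tensor in list(state_dict.items())[:5]:
        print(name, tuple(tensor.shape))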

Trained Network Weights

R2R Navigation

Please read Peter Anderson's VLN paper for details of the R2R navigation task.

Reproduce Testing Results

To replicate the performance reported in our paper, load the trained network weights and run validation:

bash run/test_agent.bash

You can switch between the OSCAR-based and the PREVALENT-based VLN models by changing the arguments vlnbert (oscar or prevalent) and load (the path to the trained model, e.g. snap/VLNBERT-PREVALENT-final/state_dict/best_val_unseen).

Training

Navigator

To train the network from scratch, simply run:

bash run/train_agent.bash

The trained Navigator will be saved under snap/.
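
Checkpoints follow the snap/<name>/state_dict/ layout (the released weights, e.g. snap/VLNBERT-PREVALENT-final/state_dict/best_val_unseen, use the same structure). A small sketch, assuming that layout, to list what training has written so far:

    # Sketch: list saved checkpoints, assuming the snap/<name>/state_dict/ layout.
    from pathlib import Path

    for ckpt in sorted(Path('snap').glob('*/state_dict/*')):
        print(ckpt)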

Citation

If you use or discuss our Recurrent VLN-BERT, please cite our paper:

@article{hong2020recurrent,
  title={A Recurrent Vision-and-Language BERT for Navigation},
  author={Hong, Yicong and Wu, Qi and Qi, Yuankai and Rodriguez-Opazo, Cristian and Gould, Stephen},
  journal={arXiv preprint arXiv:2011.13922},
  year={2020}
}
Comments
  • Unable to test code

    Hello Yicong,

    Can you please add a section to the README about using the Matterport3DSimulator Docker image with your code? The documentation is missing details on where to put the ResNet zip, the PREVALENT JSON, and the PyTorch model, and it is unclear how Matterport3DSimulator works with your code.

    Thanks

    opened by gmuraleekrishna 7
  • the data file R2R_test.json wasn't used when testing?

    Hi, Yicong! I have reproduced this codebase. While running run/test_agent.bash, I noticed that the data file R2R_test.json wasn't used by the test. So I set the parameter 'submit' to 1 and rewrote the file 'id_paths.json' to test without any other changes, and I got the following results:

    Optimizer: Using AdamW
    Namespace(IMAGENET_FEATURES='img_features/ResNet-152-imagenet.tsv', angle_feat_size=128, aug=None, batchSize=16, description='VLNBERT-test-Prevalent', dropout=0.5, epsilon=0.1, featdropout=0.4, feature_size=2048, features='places365', feedback='sample', gamma=0.9, ignoreid=-100, iters=300000, load='snap/VLNBERT-PREVALENT-final/state_dict/best_val_unseen', loadOptim=False, log_dir='snap/VLNBERT-test-Prevalent', lr=1e-05, maxAction=15, maxInput=80, ml_weight=0.2, name='VLNBERT-test-Prevalent', normalize_loss='total', optim='adamW', optimizer=<class 'torch.optim.adamw.AdamW'>, submit=1, teacher='final', teacher_weight=1.0, test_only=0, train='validlistener', vlnbert='prevalent', weight_decay=0.0, zero_init=False)

    Start loading the image feature ... (~50 seconds)
    Finish Loading the image feature from img_features/ResNet-152-places365.tsv in 54.7334 seconds
    The feature size is 2048
    Loading navigation graphs for 61 scans
    R2RBatch loaded with 14039 instructions, using splits: train
    The feature size is 2048
    Loading navigation graphs for 59 scans
    R2RBatch loaded with 1501 instructions, using splits: val_train_seen
    The feature size is 2048
    Loading navigation graphs for 56 scans
    R2RBatch loaded with 1021 instructions, using splits: val_seen
    The feature size is 2048
    Loading navigation graphs for 11 scans
    R2RBatch loaded with 2349 instructions, using splits: val_unseen
    The feature size is 2048
    Loading navigation graphs for 18 scans
    R2RBatch loaded with 4173 instructions, using splits: test

    Initalizing the VLN-BERT model ...
    Loaded the listener model at iter 114000 from snap/VLNBERT-PREVALENT-final/state_dict/best_val_unseen
    result length 1501
    Env name: val_train_seen, nav_error: 0.8354, oracle_error: 0.6634, steps: 5.1845, lengths: 10.0276, success_rate: 0.9394, oracle_rate: 0.9520, spl: 0.9124
    result length 1021
    Env name: val_seen, nav_error: 2.8968, oracle_error: 1.9405, steps: 5.5436, lengths: 11.1379, success_rate: 0.7228, oracle_rate: 0.7826, spl: 0.6775
    result length 2349
    Env name: val_unseen, nav_error: 3.9255, oracle_error: 2.5431, steps: 6.1243, lengths: 12.0028, success_rate: 0.6279, oracle_rate: 0.7024, spl: 0.5688
    result length 4173
    Env name: test, nav_error: 9.0420, oracle_error: 0.0000, steps: 6.1107, lengths: 12.3490, success_rate: 0.0357, oracle_rate: 1.0000, spl: 0.0000

    I am really shocked by the results on the test data. Did I make a mistake somewhere?

    opened by LiHui1116 6
  • ModuleNotFoundError: No module named 'transformers.pytorch_transformers'

    Hello, I was trying to run the model with bash run/test_agent.bash as instructed in your README, but I get the following error:

    Optimizer: Using AdamW
    To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html
    Traceback (most recent call last):
      File "r2r_src/train.py", line 13, in <module>
        from agent import Seq2SeqAgent
      File "/Recurrent-VLN-BERT/r2r_src/agent.py", line 21, in <module>
        import model_OSCAR, model_PREVALENT
      File "/Recurrent-VLN-BERT/r2r_src/model_OSCAR.py", line 7, in <module>
        from vlnbert.vlnbert_init import get_vlnbert_models
      File "/Recurrent-VLN-BERT/r2r_src/vlnbert/vlnbert_init.py", line 3, in <module>
        from transformers.pytorch_transformers import (BertConfig, BertTokenizer)
    ModuleNotFoundError: No module named 'transformers.pytorch_transformers'

    I have transformers and pytorch-transformers installed, as well as the old pytorch-pretrained-bert, and I am unsure of what is causing this. Any help? Thanks in advance.

    opened by Jaluco 3
  • Mismatch between weights?

    Hi there,

    Congratulations on your CVPR paper and on releasing your code. I was wondering whether you could clarify the structure of the checkpoints you released. I'm interested in the OSCAR version of your model and tried to load it. However, it looks like the following parameters cannot be found:

    'img_projection.weight', 'img_projection.bias'
    

    I inspected the VLNBert class in the file vlnbert_OSCAR.py, and it looks like there is no module called img_projection; instead, there seems to be one in the vlnbert_PREVALENT.py file. In addition, even in the original OSCAR codebase I cannot find any mention of an img_projection layer (https://github.com/microsoft/Oscar/blob/master/oscar/modeling/modeling_bert.py). Could you please verify that the released model checkpoints are correct and correspond to the right models?

    Thanks, Alessandro

    opened by aleSuglia 3
  • Is it possible to have a branch for REVERIE?

    Hello,

    Thanks a lot for maintaining your open-source code!

    As mentioned in #9, is it possible to have the models and code available for REVERIE? I would like to make a fair comparison with your approach.

    opened by volkancirik 2
  • Details about the no init. OSCAR model

    Hi Yicong, I wonder how you initialized the "no init." OSCAR model to get the results reported in the paper. Did you initialize all the parameters randomly, or did you use some pretrained weights, e.g., initializing the language part with BERT pretrained weights?

    opened by Jackie-Chou 2
  • Why don't you use ‘speaker’ during training?

    Hi! I don't see any code for a 'speaker', a useful way to perform data augmentation for R2R. I am wondering why you removed the speaker part from your code. Or have you run experiments showing that using a speaker doesn't work well with your method? Thanks a lot!

    opened by CrystalSixone 2
  • The vocab size

    Hi, yicong,

    Thanks for your great work! I found that the vocab size of R2R is 991, but the vocab size of the PREVALENT augmented data is 1101. Additionally, the PREVALENT instructions are generated by a speaker model trained on the R2R dataset. Do you have any idea about this?

    Thanks,

    opened by MarSaKi 2
  • Failed to build Matterport3D Simulator

    Hi Yicong,

    This is not directly related to your code, but I've spent hours trying to follow the Matterport3DSimulator repo to build it, and I have encountered issues both with and without Docker.

    With Docker, MatterSim can be built, but it is only available to the system Python; since I use Anaconda on the lab server, importing MatterSim fails in my Anaconda environment.

    Without Docker, the build fails on line 59 of src/lib/NavGraph.cpp: CV_LOAD_IMAGE_ANYDEPTH is not defined in the scope. I only downloaded matterport_skybox_images, and this might be the problem (however, the README in Matterport3DSimulator says matterport_skybox_images is all you need to get the simulator to build and work). I wonder what data you downloaded from the Matterport3D dataset?

    Best, Jason

    opened by jasonppy 2
  • Why split instructions?

    Hi Yicong,

    Thanks for open-sourcing your code!

    I wonder why you split instructions in r2r_src/env.py, lines 129 to 142:

    # Split multiple instructions into separate entries
    for j, instr in enumerate(item['instructions']):
        try:
            new_item = dict(item)
            new_item['instr_id'] = '%s_%d' % (item['path_id'], j)
            new_item['instructions'] = instr
    
            ''' BERT tokenizer '''
            instr_tokens = tokenizer.tokenize(instr)
            padded_instr_tokens, num_words = pad_instr_tokens(instr_tokens, args.maxInput)
            new_item['instr_encoding'] = tokenizer.convert_tokens_to_ids(padded_instr_tokens)
    
            if new_item['instr_encoding'] is not None:  # Filter the wrong data
                self.data.append(new_item)
                scans.append(item['scan'])
        except:
            continue
    

    This is done for the original path-instruction data but not for prevalent_aug.json, and I wonder why. I understand that the instructions in the original data are a bit long, but if you split them into separate VLN jobs while the desired path is always the complete path, how can an agent (or human) possibly follow them?

    Best, Jason

    opened by jasonppy 2
  • Specify license for the code

    Hello,

    Thanks again for your codebase. It was very useful indeed, and congratulations on your accepted paper. I was wondering whether you could please add a license to the codebase so that it is clear how this code can be used by third parties.

    Thanks, Alessandro

    opened by aleSuglia 2