Code for EmBERT, a transformer model for embodied, language-guided visual task completion.

Last update: Jan 3, 2023

Overview

EmBERT: A Transformer Model for Embodied, Language-guided Visual Task Completion

We present Embodied BERT (EmBERT), a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion. Additionally, we bridge the gap between successful object-centric navigation models used for non-interactive agents and the language-guided visual task completion benchmark, ALFRED, by introducing object navigation targets for EmBERT training. We achieve competitive performance on the ALFRED benchmark, and EmBERT marks the first transformer-based model to successfully handle the long-horizon, dense, multi-modal histories of ALFRED, and the first ALFRED model to utilize object-centric navigation targets.

In this repository, we provide the entire codebase used for training and evaluating EmBERT on the ALFRED dataset. It is mostly based on AllenNLP and PyTorch-Lightning, so it is easy to extend.

Setup

We used Anaconda for our experiments. Please create an anaconda environment and then install the project dependencies with the following command:

pip install -r requirements.txt

As a next step, we will download the ALFRED data using the script scripts/download_alfred_data.sh as follows:

sh scripts/download_alfred_data.sh json_feat

Before doing so, make sure that you have installed p7zip because it is used to extract the trajectory files.

MaskRCNN fine-tuning

We provide the code to fine-tune a MaskRCNN model on the ALFRED dataset. To create the vision dataset, use the script scripts/generate_vision_dataset.sh. This will create the dataset splits required by the training process. After this, it's possible to run the model fine-tuning using:

PYTHONPATH=. python vision/finetune.py --batch_size 8 --gradient_clip_val 5 --lr 3e-4 --gpus 1 --accumulate_grad_batches 2 --num_workers 4 --save_dir storage/models/vision/maskrcnn_bs_16_lr_3e-4_epochs_46_7k_batches --max_epochs 46 --limit_train_batches 7000
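The directory name in the command above mentions a batch size of 16, which is consistent with a per-step batch of 8 combined with 2 gradient-accumulation steps. The sketch below works through that arithmetic. It assumes that --limit_train_batches counts per-step batches before accumulation (the usual PyTorch-Lightning behaviour) and that one GPU is used; the function name training_budget is hypothetical and not part of this repository.

```python
def training_budget(batch_size, accumulate_grad_batches,
                    limit_train_batches, max_epochs, gpus=1):
    """Derive effective batch size and optimizer-step counts from CLI flags."""
    # Samples contributing to each optimizer update.
    effective_batch_size = batch_size * accumulate_grad_batches * gpus
    # One optimizer update happens every `accumulate_grad_batches` batches.
    steps_per_epoch = limit_train_batches // accumulate_grad_batches
    return {
        "effective_batch_size": effective_batch_size,
        "optimizer_steps_per_epoch": steps_per_epoch,
        "optimizer_steps_total": steps_per_epoch * max_epochs,
        "samples_per_epoch": batch_size * limit_train_batches * gpus,
    }


# Values taken from the fine-tuning command above.
budget = training_budget(batch_size=8, accumulate_grad_batches=2,
                         limit_train_batches=7000, max_epochs=46)
print(budget)
```

If you change --batch_size to fit a smaller GPU, adjusting --accumulate_grad_batches so that the product stays at 16 should keep the setup close to the one described here.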

We provide this code for reference. In our experiments, however, we used the MaskRCNN model from MOCA, which applies more sophisticated data augmentation techniques to improve performance on the ALFRED dataset.

ALFRED Visual Features extraction

MaskRCNN

The visual feature extraction script is responsible for generating the MaskRCNN features as well as orientation information for every bounding box. For the MaskRCNN model, we use the pretrained model from MOCA. You can download it from their GitHub page. First, we create the directory structure and then download the model weights:

mkdir -p storage/models/vision/moca_maskrcnn;
wget https://alfred-colorswap.s3.us-east-2.amazonaws.com/weight_maskrcnn.pt -O storage/models/vision/moca_maskrcnn/weight_maskrcnn.pt;

We extract visual features for training trajectories using the following command:

sh scripts/generate_moca_maskrcnn.sh

You can refer to the actual extraction script scripts/generate_maskrcnn_horizon0.py for additional parameters. We executed this command on a p3.2xlarge instance with an NVIDIA V100 GPU. This command will populate the directory storage/data/alfred/json_feat_2.1.0/ with the visual features for each trajectory step. In particular, the parameter --features_folder specifies the subdirectory (for each trajectory) that will contain the compressed NumPy files holding the features. Each NumPy file has the following structure:

dict(
    box_features=np.array,
    roi_angles=np.array,
    boxes=np.array,
    masks=np.array,
    class_probs=np.array,
    class_labels=np.array,
    num_objects=int,
    pano_id=int
)
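As a rough illustration of how such a file could be read back, the sketch below writes a synthetic file with the same keys and then loads it. It assumes the files are .npz archives written with np.savez_compressed. The file name step_0.npz, the array shapes, the dtypes, and the integer class labels are placeholders chosen for the example, not values taken from the extraction script.

```python
import os
import tempfile

import numpy as np

# Hypothetical sizes; the real shapes depend on the extraction script.
num_objects, feat_dim, num_classes = 3, 2048, 10

example = dict(
    box_features=np.zeros((num_objects, feat_dim), dtype=np.float32),
    roi_angles=np.zeros((num_objects, 2), dtype=np.float32),
    boxes=np.zeros((num_objects, 4), dtype=np.float32),
    masks=np.zeros((num_objects, 8, 8), dtype=bool),
    class_probs=np.zeros((num_objects, num_classes), dtype=np.float32),
    class_labels=np.zeros(num_objects, dtype=np.int64),
    num_objects=num_objects,
    pano_id=0,
)

# Write a synthetic feature file so the example runs without the dataset.
path = os.path.join(tempfile.mkdtemp(), "step_0.npz")  # hypothetical name
np.savez_compressed(path, **example)

# Load every stored array into a plain dictionary.
with np.load(path) as features:
    loaded = {key: features[key] for key in features.files}

# Scalars are stored as 0-d arrays, so convert them explicitly.
n = int(loaded["num_objects"])
print(sorted(loaded), n, loaded["boxes"].shape)
```

To inspect a real file, replace path with one of the files found under the folder passed as --features_folder.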

Data-augmentation procedure

In our paper, we describe a procedure to augment the ALFRED trajectories with object and corresponding receptacle information. In particular, we replay the trajectories and track each object and its receptacle during a subgoal. The data augmentation script will create a new trajectory file called ref_traj_data.json that mirrors the data structure of the original ALFRED dataset but adds a few fields for each action.

To start generating the refined data, use the following script:

PYTHONPATH=. python scripts/generate_landmarks.py

EmBERT Training

Vocabulary creation

We use AllenNLP for training our models. Before starting the training we will generate the vocabulary for the model using the following command:

allennlp build-vocab training_configs/embert/embert_oscar.jsonnet storage/models/embert/vocab.tar.gz --include-package grolp

Training

First, we need to download the OSCAR checkpoint before starting the training process. We used a version of OSCAR that does not use object labels; it can be freely downloaded by following the instructions on GitHub. Make sure to download this file into the folder storage/models/pretrained using the following commands:

mkdir -p storage/models/pretrained/;
wget https://biglmdiag.blob.core.windows.net/oscar/pretrained_models/base-no-labels.zip -O storage/models/pretrained/oscar.zip;
unzip storage/models/pretrained/oscar.zip -d storage/models/pretrained/;
mv storage/models/pretrained/base-no-labels/ep_67_588997/pytorch_model.bin storage/models/pretrained/oscar-base-no-labels.bin;
rm storage/models/pretrained/oscar.zip;

A new model can be trained using the following command:

allennlp train training_configs/embert/embert_widest.jsonnet -s storage/models/alfred/embert --include-package grolp

When training for the first time, make sure to add the following parameters to the previous command: --preprocess --num_workers 4. This ensures that the dataset is preprocessed and cached in order to speed up training. We ran training on AWS EC2 p3.8xlarge instances with 16 workers on a single GPU per configuration.
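Before launching a long training run, it may help to confirm that the files produced by the earlier steps are in place. The following is a minimal sketch of such a check. The list of paths is collected from the commands in this README, and the assumption that all of them must exist before allennlp train starts is ours; the helper missing_training_files is hypothetical and not part of the codebase.

```python
from pathlib import Path

# Paths mentioned in this README; treating all of them as prerequisites
# for training is an assumption.
REQUIRED_FILES = [
    "storage/models/pretrained/oscar-base-no-labels.bin",
    "storage/models/embert/vocab.tar.gz",
    "training_configs/embert/embert_widest.jsonnet",
]


def missing_training_files(root="."):
    """Return the required files that do not exist under `root`."""
    root = Path(root)
    return [path for path in REQUIRED_FILES if not (root / path).is_file()]


# Run from the repository root.
missing = missing_training_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected files are in place.")
```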

The configuration file training_configs/embert/embert_widest.jsonnet contains all the parameters you might want to change to alter the way the model works, as well as the references to the actual feature files. If you're interested in changing the model itself, please refer to the model definition. The parameters in the constructor of the class reflect the ones reported in the configuration file. In general, this project has been developed using AllenNLP as a reference framework. We refer the reader to the official AllenNLP documentation for more details about how to structure a project.

EmBERT evaluation

We modified the original ALFRED evaluation script to make sure that the results are completely reproducible. Refer to the original repository for more information.

To evaluate your model on the valid_seen and valid_unseen splits, use the provided script scripts/run_eval.sh. The EmBERT trainer has different ways of saving checkpoints. At the end of training, it automatically saves the best model in an archive named model.tar.gz in the destination folder (the one specified with -s). To evaluate it, run the following command:

sh scripts/run_eval.sh <your_model_path>/model.tar.gz

It's also possible to run the evaluation of a specific checkpoint. This can be done by running the previous command as follows:

sh scripts/run_eval.sh <your_model_path>/model-epoch=6.ckpt

In this way, the evaluation script will load the checkpoint at epoch 6 from the path given on the command line. When specifying a checkpoint directly, make sure that its folder contains both the config.json file and the vocabulary directory, because the script needs them to load the correct model parameters.
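The requirement on the checkpoint folder can be verified with a few lines of Python before calling the evaluation script. This is only a sketch: the helper missing_checkpoint_companions is hypothetical, and the directory name vocabulary is an assumption based on the usual AllenNLP serialization layout, so adjust it if your folder differs.

```python
import tempfile
from pathlib import Path


def missing_checkpoint_companions(checkpoint_path, vocabulary_dirname="vocabulary"):
    """List the companion entries missing next to a checkpoint file."""
    folder = Path(checkpoint_path).parent
    missing = []
    if not (folder / "config.json").is_file():
        missing.append("config.json")
    if not (folder / vocabulary_dirname).is_dir():
        missing.append(vocabulary_dirname)
    return missing


# Demonstration on a throwaway folder instead of a real model directory.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "model-epoch=6.ckpt"
    checkpoint.touch()
    before = missing_checkpoint_companions(checkpoint)

    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "vocabulary").mkdir()
    after = missing_checkpoint_companions(checkpoint)

print(before, after)
```

An empty list means both companions were found next to the checkpoint.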

Citation

If you're using this codebase please cite our work:

@article{suglia:embert,
  title={Embodied {BERT}: A Transformer Model for Embodied, Language-guided Visual Task Completion},
  author={Alessandro Suglia and Qiaozi Gao and Jesse Thomason and Govind Thattai and Gaurav Sukhatme},
  journal={arXiv},
  year={2021},
  url={https://arxiv.org/abs/2108.04927}
}

Comments

`vocab.tar.gz` not found
Hi, thanks a lot for sharing the code for EmBert! I am trying to generate the vocabulary for the model by the following command on the README:

allennlp build-vocab training_configs/embert/embert_oscar.jsonnet storage/models/embert/vocab.tar.gz --include-package grolp

But I receive the following error.

FileNotFoundError: file storage/models/embert/vocab.tar.gz not found

vocab.tar.gz seems important to train the model. Kindly make this file available or advise on where to find it.
opened by vidhiJain 1
Spelling error in Setup command in the README.md

The command given in dataset download in the README.md is sh scripts/donwload_alfred_data.sh json_feat

It should be : sh scripts/download_alfred_data.sh json_feat

It's a spelling error in the download_alfred_data.sh

opened by varun0308 0

allennlp.common.checks.ConfigurationError: key "dataset" is required at location "data_loader."

Hello, I'm trying to run the training procedure allennlp build-vocab ... and allennlp train ..., but got an error:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/common/params.py", line 238, in pop
    value = self.params.pop(key)
KeyError: 'dataset'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/thor/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/commands/__init__.py", line 119, in main
    args.func(args)
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/commands/build_vocab.py", line 75, in build_vocab_from_args
    make_vocab_from_params(params, temp_dir)
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/training/util.py", line 468, in make_vocab_from_params
    data_loaders = data_loaders_from_params(params, serialization_dir=serialization_dir)
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/training/util.py", line 118, in data_loaders_from_params
    data_loaders["train"] = DataLoader.from_params(
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/common/from_params.py", line 589, in from_params
    return retyped_subclass.from_params(
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/common/from_params.py", line 621, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/common/from_params.py", line 199, in create_kwargs
    constructed_arg = pop_and_construct_arg(
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/common/from_params.py", line 303, in pop_and_construct_arg
    popped_params = params.pop(name, default) if default != _NO_DEFAULT else params.pop(name)
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/common/params.py", line 243, in pop
    raise ConfigurationError(msg)
allennlp.common.checks.ConfigurationError: key "dataset" is required at location "data_loader."

This error occurs at both build-vocab and train phase. I'm not familiar with allennlp. If I add "dataset": "alfred" into the "data_loader" field, a more confusing error occurs:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/thor/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/commands/__init__.py", line 119, in main
    args.func(args)
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/commands/build_vocab.py", line 75, in build_vocab_from_args
    make_vocab_from_params(params, temp_dir)
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/training/util.py", line 491, in make_vocab_from_params
    vocab = Vocabulary.from_params(vocab_params, instances=instances)
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/common/from_params.py", line 589, in from_params
    return retyped_subclass.from_params(
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/common/from_params.py", line 623, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/data/vocabulary.py", line 309, in from_instances
    for instance in Tqdm.tqdm(instances, desc="building vocab"):
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/ubuntu/miniconda3/envs/thor/lib/python3.8/site-packages/allennlp/training/util.py", line 485, in <genexpr>
    for instance in data_loader.iter_instances()
TypeError: 'NoneType' object is not iterable

Is there any solution for this error?

opened by RavenKiller 0
