The Most Important Thing.

Our code is developed based on:

LXMERT: Learning Cross-Modality Encoder Representations from Transformers (https://github.com/airsplay/lxmert)

Here is their readme. I will update ours after a few deadlines.

Introduction

PyTorch code for the EMNLP 2019 paper "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". Slides of our EMNLP 2019 talk are available here.

To analyze the output of the pre-trained model (instead of fine-tuning on downstream tasks), please load the weights at https://nlp1.cs.unc.edu/data/github_pretrain/lxmert20/Epoch20_LXRT.pth, which are trained as described in the Pre-training section. The default weights here were trained with a slightly different protocol from this code.

Results (with this Github version)

Split	VQA	GQA	NLVR2
Local Validation	69.90%	59.80%	74.95%
Test-Dev	72.42%	60.00%	74.45% (Test-P)
Test-Standard	72.54%	60.33%	76.18% (Test-U)

All the results in the table are produced exactly with this code base. Since the VQA and GQA test servers allow only a limited number of 'Test-Standard' submissions, we used our remaining submission entries from the VQA/GQA challenges 2019 to get these results. For NLVR2, we tested only once on the unpublished test set (test-U).

We used this code (with model ensemble) to participate in the VQA 2019 and GQA 2019 challenges in May 2019. We were the only team ranking in the top 3 of both challenges.

Pre-trained models

The pre-trained model (870 MB) is available at http://nlp1.cs.unc.edu/data/model_LXRT.pth, and can be downloaded with:

mkdir -p snap/pretrained 
wget --no-check-certificate https://nlp1.cs.unc.edu/data/model_LXRT.pth -P snap/pretrained

If download speed is slower than expected, the pre-trained model could also be downloaded from other sources. Please help put the downloaded file at snap/pretrained/model_LXRT.pth.

We also provide data and commands to pre-train the model in pre-training. The default setup needs 4 GPUs and takes around a week to finish. The pre-trained weights from this code base can be downloaded from https://nlp1.cs.unc.edu/data/github_pretrain/lxmert/EpochXX_LXRT.pth, with XX from 01 to 12. The model is pre-trained for 12 epochs (instead of 20 as in the EMNLP paper), so the fine-tuned results are about 0.3% lower on each dataset.
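
The EpochXX naming above can be expanded programmatically. The following is a minimal sketch that assumes only the URL pattern quoted above; the function name checkpoint_urls is hypothetical and not part of the repository.

```python
BASE_URL = "https://nlp1.cs.unc.edu/data/github_pretrain/lxmert"


def checkpoint_urls(first=1, last=12):
    """Return the EpochXX_LXRT.pth URLs, with XX zero-padded to two digits."""
    return [
        f"{BASE_URL}/Epoch{epoch:02d}_LXRT.pth"
        for epoch in range(first, last + 1)
    ]


if __name__ == "__main__":
    # Print one URL per line so the list can be fed to a downloader.
    for url in checkpoint_urls():
        print(url)
```

The printed list can be saved to a file and passed to wget with its -i option.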

Fine-tune on Vision-and-Language Tasks

We fine-tune our LXMERT pre-trained model on each task with following hyper-parameters:

Dataset	Batch Size	Learning Rate	Epochs	Load Answers
VQA	32	5e-5	4	Yes
GQA	32	1e-5	4	Yes
NLVR2	32	5e-5	4	No

Although the fine-tuning processes are almost the same except for different hyper-parameters, we provide descriptions for each dataset to take care of all details.

General

The code requires Python 3. Please install the Python dependencies with the command:

pip install -r requirements.txt

By the way, a Python 3 virtual environment could be set up and run with:

virtualenv name_of_environment -p python3
source name_of_environment/bin/activate

VQA

Fine-tuning

Please make sure the LXMERT pre-trained model is either downloaded or pre-trained.

Download the re-distributed json files for VQA 2.0 dataset. The raw VQA 2.0 dataset could be downloaded from the official website.

mkdir -p data/vqa
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/vqa/train.json -P data/vqa/
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/vqa/nominival.json -P  data/vqa/
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/vqa/minival.json -P data/vqa/

Download faster-rcnn features for MS COCO train2014 (17 GB) and val2014 (8 GB) images (VQA 2.0 is collected on MS COCO dataset). The image features are also available on Google Drive and Baidu Drive (see Alternative Download for details).

mkdir -p data/mscoco_imgfeat
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/train2014_obj36.zip -P data/mscoco_imgfeat
unzip data/mscoco_imgfeat/train2014_obj36.zip -d data/mscoco_imgfeat && rm data/mscoco_imgfeat/train2014_obj36.zip
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/val2014_obj36.zip -P data/mscoco_imgfeat
unzip data/mscoco_imgfeat/val2014_obj36.zip -d data && rm data/mscoco_imgfeat/val2014_obj36.zip

Before fine-tuning on whole VQA 2.0 training set, verifying the script and model on a small training set (512 images) is recommended. The first argument 0 is GPU id. The second argument vqa_lxr955_tiny is the name of this experiment.
```
bash run/vqa_finetune.bash 0 vqa_lxr955_tiny --tiny
```
If no errors occur, the model is ready to be trained on the whole VQA corpus:
```
bash run/vqa_finetune.bash 0 vqa_lxr955
```

It takes around 8 hours (2 hours per epoch * 4 epochs) to converge. The logs and model snapshots will be saved under folder snap/vqa/vqa_lxr955. The validation result after training will be around 69.7% to 70.2%.

Local Validation

The results on the validation set (our minival set) are printed while training. The validation result is also saved to snap/vqa/[experiment-name]/log.log. If the log file was accidentally deleted, the validation result in training is also reproducible from the model snapshot:

bash run/vqa_test.bash 0 vqa_lxr955_results --test minival --load snap/vqa/vqa_lxr955/BEST

Submitted to VQA test server

Download our re-distributed json file containing VQA 2.0 test data.

wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/vqa/test.json -P data/vqa/

Download the faster rcnn features for MS COCO test2015 split (16 GB).

wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/test2015_obj36.zip -P data/mscoco_imgfeat
unzip data/mscoco_imgfeat/test2015_obj36.zip -d data && rm data/mscoco_imgfeat/test2015_obj36.zip

Since VQA submission system requires submitting whole test data, we need to run inference over all test splits (i.e., test dev, test standard, test challenge, and test held-out). It takes around 10~15 mins to run test inference (448K instances to run).
```
bash run/vqa_test.bash 0 vqa_lxr955_results --test test --load snap/vqa/vqa_lxr955/BEST
```

The test results will be saved in snap/vqa/vqa_lxr955_results/test_predict.json. The VQA 2.0 challenge for this year is hosted on EvalAI at https://evalai.cloudcv.org/web/challenges/challenge-page/163/overview. It still allows submissions after the challenge has ended. Please check the official website of the VQA Challenge for detailed information and follow the instructions on EvalAI to submit. In general, after registration, the only remaining steps are to upload the test_predict.json file and wait for the result.
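
Before uploading, it may be worth checking the prediction file locally. The sketch below assumes the file is a JSON list of objects with the keys "question_id" and "answer" (the usual VQA results format); the helper name check_predictions is hypothetical and not part of this code base.

```python
import json


def check_predictions(path):
    """Load a prediction file and verify its basic shape.

    Assumption: the file is a JSON list of objects that each carry a
    "question_id" and an "answer" key. Returns the number of predictions.
    """
    with open(path) as f:
        predictions = json.load(f)
    if not isinstance(predictions, list):
        raise ValueError("expected a JSON list of predictions")
    seen = set()
    for entry in predictions:
        if "question_id" not in entry or "answer" not in entry:
            raise ValueError(f"malformed entry: {entry!r}")
        if entry["question_id"] in seen:
            raise ValueError(f"duplicate question_id: {entry['question_id']}")
        seen.add(entry["question_id"])
    return len(predictions)
```

For the full test set, the returned count should be close to the 448K instances mentioned above.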

The testing accuracy with exactly this code is 72.42% for test-dev and 72.54% for test-standard. The results with this code base are also publicly shown on the VQA 2.0 leaderboard under the entry LXMERT github version.

GQA

Fine-tuning

Please make sure the LXMERT pre-trained model is either downloaded or pre-trained.

Download the re-distributed json files for GQA balanced version dataset. The original GQA dataset is available in the Download section of its website and the script to preprocess these datasets is under data/gqa/process_raw_data_scripts.

mkdir -p data/gqa
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/gqa/train.json -P data/gqa/
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/gqa/valid.json -P data/gqa/
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/gqa/testdev.json -P data/gqa/

Download Faster R-CNN features for Visual Genome and GQA testing images (30 GB). GQA's training and validation data are collected from Visual Genome. Its testing images come from the MS COCO test set (I have verified this with one of the GQA authors, Drew A. Hudson). The image features are also available on Google Drive and Baidu Drive (see Alternative Download for details).

mkdir -p data/vg_gqa_imgfeat
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/vg_gqa_imgfeat/vg_gqa_obj36.zip -P data/vg_gqa_imgfeat
unzip data/vg_gqa_imgfeat/vg_gqa_obj36.zip -d data && rm data/vg_gqa_imgfeat/vg_gqa_obj36.zip
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/vg_gqa_imgfeat/gqa_testdev_obj36.zip -P data/vg_gqa_imgfeat
unzip data/vg_gqa_imgfeat/gqa_testdev_obj36.zip -d data && rm data/vg_gqa_imgfeat/gqa_testdev_obj36.zip

Before fine-tuning on whole GQA training+validation set, verifying the script and model on a small training set (512 images) is recommended. The first argument 0 is GPU id. The second argument gqa_lxr955_tiny is the name of this experiment.
```
bash run/gqa_finetune.bash 0 gqa_lxr955_tiny --tiny
```
If no errors occur, the model is ready to be trained on the whole GQA corpus (train + validation) and validated on the testdev set:
```
bash run/gqa_finetune.bash 0 gqa_lxr955
```

It takes around 16 hours (4 hours per epoch * 4 epochs) to converge. The logs and model snapshots will be saved under folder snap/gqa/gqa_lxr955. The validation result after training will be around 59.8% to 60.1%.

Local Validation

The results on testdev are printed while training and saved in snap/gqa/gqa_lxr955/log.log. They can also be re-calculated with the command:

bash run/gqa_test.bash 0 gqa_lxr955_results --load snap/gqa/gqa_lxr955/BEST --test testdev --batchSize 1024

Note: Our local testdev result is usually 0.1% to 0.5% lower than the submitted testdev result. The reason is that the test server uses a more advanced evaluation system, while our local evaluator only computes exact-match accuracy. Please use this official evaluator (784 MB) if you want the exact number without submitting.
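
The exact-match scoring mentioned in the note can be illustrated with a short sketch. This is not the repository's evaluator; the function name and the dict-based inputs are assumptions made for illustration.

```python
def exact_match_accuracy(predictions, answers):
    """Fraction of questions whose predicted string equals the reference.

    Both arguments are dicts mapping a question id to an answer string.
    Questions without a prediction count as wrong.
    """
    if not answers:
        raise ValueError("no reference answers given")
    correct = sum(
        1 for qid, answer in answers.items() if predictions.get(qid) == answer
    )
    return correct / len(answers)
```

A prediction such as "to the left" scores zero against the reference "left" under this metric, which is one way a stricter local number can fall below the server's.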

Submitted to GQA test server

Download our re-distributed json file containing GQA test data.

wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/gqa/submit.json -P data/gqa/

Since GQA submission system requires submitting the whole test data, we need to run inference over all test splits. It takes around 30~60 mins to run test inference (4.2M instances to run).
```
bash run/gqa_test.bash 0 gqa_lxr955_results --load snap/gqa/gqa_lxr955/BEST --test submit --batchSize 1024
```
After running the test script, a json file submit_predict.json under snap/gqa/gqa_lxr955_results will contain all the prediction results and is ready to be submitted. The GQA challenge 2019 is hosted by EvalAI at https://evalai.cloudcv.org/web/challenges/challenge-page/225/overview. After registering an account, the only remaining steps are uploading submit_predict.json and waiting for the results. Please also check the GQA official website in case the test server has changed.

The testing accuracy with exactly this code is 60.00% for test-dev and 60.33% for test-standard. The results with the code base are also publicly shown on the GQA leaderboard with entry LXMERT github version.

NLVR2

Fine-tuning

Download the NLVR2 data from the official GitHub repo.
```
git submodule update --init
```

Process the NLVR2 data to json files.

bash -c "cd data/nlvr2/process_raw_data_scripts && python process_dataset.py"

Download the NLVR2 image features for the train (21 GB) and valid (1.6 GB) splits. The image features are also available on Google Drive and Baidu Drive (see Alternative Download for details). To access the original images, please follow the instructions on the NLVR2 official GitHub. The images can be downloaded either with the URLs or by signing an agreement form for data usage. The features can then be extracted as described in feature extraction.

mkdir -p data/nlvr2_imgfeat
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/nlvr2_imgfeat/train_obj36.zip -P data/nlvr2_imgfeat
unzip data/nlvr2_imgfeat/train_obj36.zip -d data && rm data/nlvr2_imgfeat/train_obj36.zip
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/nlvr2_imgfeat/valid_obj36.zip -P data/nlvr2_imgfeat
unzip data/nlvr2_imgfeat/valid_obj36.zip -d data && rm data/nlvr2_imgfeat/valid_obj36.zip

Before fine-tuning on whole NLVR2 training set, verifying the script and model on a small training set (512 images) is recommended. The first argument 0 is GPU id. The second argument nlvr2_lxr955_tiny is the name of this experiment. Do not worry if the result is low (50~55) on this tiny split, the whole training data would bring the performance back.
```
bash run/nlvr2_finetune.bash 0 nlvr2_lxr955_tiny --tiny
```
If no bugs are popping up from the previous step, it means that the code, the data, and image features are ready. Please use this command to train on the full training set. The result on NLVR2 validation (dev) set would be around 74.0 to 74.5.
```
bash run/nlvr2_finetune.bash 0 nlvr2_lxr955
```

Inference on Public Test Split

Download NLVR2 image features for the public test split (1.6 GB).

wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/nlvr2_imgfeat/test_obj36.zip -P data/nlvr2_imgfeat
unzip data/nlvr2_imgfeat/test_obj36.zip -d data/nlvr2_imgfeat && rm data/nlvr2_imgfeat/test_obj36.zip

Test on the public test set (corresponding to 'test-P' on NLVR2 leaderboard) with:

bash run/nlvr2_test.bash 0 nlvr2_lxr955_results --load snap/nlvr2/nlvr2_lxr955/BEST --test test --batchSize 1024

The test accuracy will be shown on the screen after around 5~10 minutes. The predictions are also saved in the file test_predict.csv under snap/nlvr2/nlvr2_lxr955_results, which is compatible with the NLVR2 official evaluation script. The official eval script also calculates consistency ('Cons') besides accuracy. We can use this official script to verify the results by running:
```
python data/nlvr2/nlvr/nlvr2/eval/metrics.py snap/nlvr2/nlvr2_lxr955_results/test_predict.csv data/nlvr2/nlvr/nlvr2/data/test1.json
```

The accuracy on the public test ('test-P') set should be almost the same as on the validation set ('dev'), which is around 74.0% to 74.5%.

Unreleased Test Sets

To be tested on the unreleased held-out test set (test-U on the leaderboard), the code needs to be sent. Please check the NLVR2 official github and NLVR project website for details.

General Debugging Options

Since it takes a few minutes to load the features, the code has an option to prototype with a small amount of training data.

# Training with 512 images (the second argument is the experiment name):
bash run/vqa_finetune.bash 0 vqa_lxr955_tiny --tiny
# Training with 4096 images:
bash run/vqa_finetune.bash 0 vqa_lxr955_fast --fast
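
How such flags might cap the amount of loaded data can be sketched as follows. This is a hypothetical illustration rather than the repository's implementation; only the sizes 512 and 4096 come from the comments above, and both function names are made up.

```python
def debug_image_limit(tiny=False, fast=False):
    """Return the maximum number of images to load, or None for no limit."""
    if tiny:
        return 512
    if fast:
        return 4096
    return None


def limit_images(image_ids, tiny=False, fast=False):
    """Keep only the first N image ids according to the debugging flags."""
    limit = debug_image_limit(tiny, fast)
    ids = list(image_ids)
    return ids if limit is None else ids[:limit]
```

Applying the cap before the features are read is what would save the loading time in such a design.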

Pre-training

Download our aggregated LXMERT dataset from MS COCO, Visual Genome, VQA, and GQA (around 700MB in total). The joint answer labels are saved in data/lxmert/all_ans.json.

mkdir -p data/lxmert
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/lxmert/mscoco_train.json -P data/lxmert/
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/lxmert/mscoco_nominival.json -P data/lxmert/
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/lxmert/vgnococo.json -P data/lxmert/
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/lxmert/mscoco_minival.json -P data/lxmert/

[Skip this if you have run VQA fine-tuning.] Download the detection features for MS COCO images.

mkdir -p data/mscoco_imgfeat
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/train2014_obj36.zip -P data/mscoco_imgfeat
unzip data/mscoco_imgfeat/train2014_obj36.zip -d data/mscoco_imgfeat && rm data/mscoco_imgfeat/train2014_obj36.zip
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/val2014_obj36.zip -P data/mscoco_imgfeat
unzip data/mscoco_imgfeat/val2014_obj36.zip -d data && rm data/mscoco_imgfeat/val2014_obj36.zip

[Skip this if you have run GQA fine-tuning.] Download the detection features for Visual Genome images.

mkdir -p data/vg_gqa_imgfeat
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/vg_gqa_imgfeat/vg_gqa_obj36.zip -P data/vg_gqa_imgfeat
unzip data/vg_gqa_imgfeat/vg_gqa_obj36.zip -d data && rm data/vg_gqa_imgfeat/vg_gqa_obj36.zip

Test on a small split of the MS COCO + Visual Genome datasets:
```
bash run/lxmert_pretrain.bash 0,1,2,3 --multiGPU --tiny
```
Run on the whole MS COCO and Visual Genome related datasets (i.e., VQA, GQA, COCO caption, VG Caption, VG QA). Here, we take a simple single-stage pre-training strategy (20 epochs with all pre-training tasks) rather than the two-stage strategy in our paper (10 epochs without image QA and 10 epochs with image QA). The pre-training finishes in 8.5 days on 4 GPUs. By the way, I hope that my experience in this project would help anyone with limited computational resources.
```
bash run/lxmert_pretrain.bash 0,1,2,3 --multiGPU
```
Multiple GPUs: The argument 0,1,2,3 indicates using 4 GPUs to pre-train LXMERT. If the server does not have 4 GPUs (I am sorry to hear that), please consider halving the batch size or using the NVIDIA/apex library to support half-precision computation. The code uses the default data parallelism in PyTorch and is thus extensible to fewer or more GPUs. The Python main thread takes charge of the data loading. On 4 GPUs, we do not find that data loading becomes a bottleneck (around 5% overhead).

GPU Types: We find that Titan XP, GTX 2080, and Titan V can all support this pre-training. However, GTX 1080, with its 11G memory, is a little bit small, so please change the batch_size to 224 (instead of 256).
I have verified these pre-training commands with 12 epochs. The pre-trained weights from previous process could be downloaded from https://nlp1.cs.unc.edu/data/github_pretrain/lxmert/EpochXX_LXRT.pth, XX from 01 to 12. The results are roughly the same (around 0.3% lower in downstream tasks because of fewer epochs).

Explanation of arguments in the pre-training script run/lxmert_pretrain.bash:

python src/pretrain/lxmert_pretrain_new.py \
    # The pre-training tasks
    --taskMaskLM --taskObjPredict --taskMatched --taskQA \  
    
    # Vision subtasks
    # obj / attr: detected object/attribute label prediction.
    # feat: RoI feature regression.
    --visualLosses obj,attr,feat \
    
    # Mask rate for words and objects
    --wordMaskRate 0.15 --objMaskRate 0.15 \
    
    # Training and validation sets
    # mscoco_nominival + mscoco_minival = mscoco_val2014
    # visual genome - mscoco = vgnococo
    --train mscoco_train,mscoco_nominival,vgnococo --valid mscoco_minival \
    
    # Number of layers in each encoder
    --llayers 9 --xlayers 5 --rlayers 5 \
    
    # Train from scratch (using initialized weights) instead of loading BERT weights.
    --fromScratch \

    # Hyper parameters
    --batchSize 256 --optim bert --lr 1e-4 --epochs 20 \
    --tqdm --output $output ${@:2}
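
To make the flag semantics concrete, here is a hypothetical argparse sketch that mirrors the flags listed above. It is not the repository's actual parser: the types and defaults are assumptions taken from the values shown in the script, and it only illustrates how the comma-separated values would be split.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown above."""
    parser = argparse.ArgumentParser()
    # Pre-training tasks are plain on/off switches.
    for flag in ("--taskMaskLM", "--taskObjPredict", "--taskMatched", "--taskQA"):
        parser.add_argument(flag, action="store_true")
    # Comma-separated lists are kept as strings and split after parsing.
    parser.add_argument("--visualLosses", default="obj,attr,feat")
    parser.add_argument("--train", default="")
    parser.add_argument("--valid", default="")
    parser.add_argument("--wordMaskRate", type=float, default=0.15)
    parser.add_argument("--objMaskRate", type=float, default=0.15)
    parser.add_argument("--llayers", type=int, default=9)
    parser.add_argument("--xlayers", type=int, default=5)
    parser.add_argument("--rlayers", type=int, default=5)
    parser.add_argument("--fromScratch", action="store_true")
    parser.add_argument("--batchSize", type=int, default=256)
    parser.add_argument("--optim", default="bert")
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--epochs", type=int, default=20)
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    "--taskMaskLM --taskQA --visualLosses obj,attr,feat "
    "--train mscoco_train,mscoco_nominival,vgnococo --valid mscoco_minival".split()
)
visual_losses = args.visualLosses.split(",")
train_splits = args.train.split(",")
```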

Alternative Dataset and Features Download Links

All default download links are provided by our servers in UNC CS department and under our NLP group website but the network bandwidth might be limited. We thus provide a few other options with Google Drive and Baidu Drive.

The files in online drives are almost structured in the same way as our repo but have a few differences due to specific policies. After downloading the data and features from the drives, please re-organize them under data/ folder according to the following example:

REPO ROOT
 |
 |-- data                  
 |    |-- vqa
 |    |    |-- train.json
 |    |    |-- minival.json
 |    |    |-- nominival.json
 |    |    |-- test.json
 |    |
 |    |-- mscoco_imgfeat
 |    |    |-- train2014_obj36.tsv
 |    |    |-- val2014_obj36.tsv
 |    |    |-- test2015_obj36.tsv
 |    |
 |    |-- vg_gqa_imgfeat -- *.tsv
 |    |-- gqa -- *.json
 |    |-- nlvr2_imgfeat -- *.tsv
 |    |-- nlvr2 -- *.json
 |    |-- lxmert -- *.json          # Pre-training data
 | 
 |-- snap
 |-- src

Please also kindly contact us if anything is missing!
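
After re-organizing the downloads, the layout can be checked with a short script. The sketch below covers only the files named explicitly in the example tree (the wildcard folders are not checked); the helper name missing_files is hypothetical.

```python
from pathlib import Path

# Files named explicitly in the example tree, relative to the data/ folder.
EXPECTED_FILES = [
    "vqa/train.json",
    "vqa/minival.json",
    "vqa/nominival.json",
    "vqa/test.json",
    "mscoco_imgfeat/train2014_obj36.tsv",
    "mscoco_imgfeat/val2014_obj36.tsv",
    "mscoco_imgfeat/test2015_obj36.tsv",
]


def missing_files(data_root="data"):
    """Return the expected files that do not exist under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    for name in missing_files():
        print("missing:", name)
```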

Google Drive

As an alternative way to download feature from our UNC server, you could also download the feature from google drive with link https://drive.google.com/drive/folders/1Gq1uLUk6NdD0CcJOptXjxE6ssY5XAuat?usp=sharing. The structure of the folders on drive is:

Google Drive Root
 |-- data                  # The raw data and image features without compression
 |    |-- vqa
 |    |-- gqa
 |    |-- mscoco_imgfeat
 |    |-- ......
 |
 |-- image_feature_zips    # The image-feature zip files (Around 45% compressed)
 |    |-- mscoco_imgfeat.zip
 |    |-- nlvr2_imgfeat.zip
 |    |-- vg_gqa_imgfeat.zip
 |
 |-- snap -- pretrained -- model_LXRT.pth # The pytorch pre-trained model weights.

Note: image features in the zip files (e.g., mscoco_imgfeat.zip) are the same as those in data/ (i.e., data/mscoco_imgfeat). If you want to save network bandwidth, please download the feature zips and skip downloading the *_imgfeat folders under data/.

Baidu Drive

Since Google Drive is not officially available across the world, we also create a mirror on Baidu drive (i.e., Baidu PAN). The dataset and features could be downloaded with shared link https://pan.baidu.com/s/1m0mUVsq30rO6F1slxPZNHA and access code wwma.

Baidu Drive Root
 |
 |-- vqa
 |    |-- train.json
 |    |-- minival.json
 |    |-- nominival.json
 |    |-- test.json
 |
 |-- mscoco_imgfeat
 |    |-- train2014_obj36.zip
 |    |-- val2014_obj36.zip
 |    |-- test2015_obj36.zip
 |
 |-- vg_gqa_imgfeat -- *.zip.*  # Please read README.txt under this folder
 |-- gqa -- *.json
 |-- nlvr2_imgfeat -- *.zip.*   # Please read README.txt under this folder
 |-- nlvr2 -- *.json
 |-- lxmert -- *.json
 | 
 |-- pretrained -- model_LXRT.pth

Since Baidu Drive does not support extremely large files, we split a few feature zips into multiple small files. Please follow the README.txt under baidu_drive/vg_gqa_imgfeat and baidu_drive/nlvr2_imgfeat to concatenate them back into the feature zips with the cat command.
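
On systems without cat, the same concatenation can be done in Python. This is a minimal sketch; the README.txt in each folder remains the authoritative instruction, and the function name concatenate_parts is hypothetical.

```python
import shutil


def concatenate_parts(part_paths, output_path):
    """Append each part, in the given order, to output_path (like `cat`)."""
    with open(output_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                # Stream in chunks so large parts never sit fully in memory.
                shutil.copyfileobj(src, out)
```

The parts must be passed in their original order. Sorting the file names works only if the suffixes sort lexicographically in that order, which is an assumption to verify against the README.txt.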

Code and Project Explanation

All code is in the folder src. The basics are in lxrt. The Python files related to pre-training and fine-tuning are saved in src/pretrain and src/tasks respectively.
I kept folders containing image features (e.g., mscoco_imgfeat) separated from vision-and-language dataset (e.g., vqa, lxmert) because multiple vision-and-language datasets would share common images.
We use the name lxmert for our framework and use the name lxrt (Language, Cross-Modality, and object-Relationship Transformers) to refer to our models.
To be consistent with the name lxrt (Language, Cross-Modality, and object-Relationship Transformers), we use lxrXXX to denote the number of layers. E.g., lxr955 (used in current pre-trained model) indicates a model with 9 Language layers, 5 cross-modality layers, and 5 object-Relationship layers. If we consider a single-modality layer as a half of cross-modality layer, the total number of layers is (9 + 5) / 2 + 5 = 12, which is the same as BERT_BASE.
We share the weight between the two cross-modality attention sub-layers. Please check the visual_attention variable, which is used to compute both lang->visn attention and visn->lang attention. (I am sorry that the name visual_attention is misleading because I deleted the lang_attention there.) Sharing weights is mostly used for saving computational resources and it also (intuitively) helps forcing the features from visn/lang into a joint subspace.
The box coordinates are not normalized from [0, 1] to [-1, 1], which looks like a typo but actually is not ;). Normalizing the coordinates would not affect the output of the box encoder (mathematically and almost numerically). ~~(Hint: consider the LayerNorm in positional encoding)~~
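
The lxrXXX naming and the layer arithmetic described above can be written out as a small sketch. The function names are hypothetical, and the parser assumes exactly one digit per encoder.

```python
def parse_lxr_name(name):
    """Split a name such as 'lxr955' into (language, cross, relation) counts.

    Assumes exactly one digit per encoder, as in the lxrXXX convention.
    """
    if not (name.startswith("lxr") and len(name) == 6 and name[3:].isdigit()):
        raise ValueError(f"not an lxrXXX name: {name!r}")
    l_layers, x_layers, r_layers = (int(ch) for ch in name[3:])
    return l_layers, x_layers, r_layers


def bert_equivalent_depth(l_layers, x_layers, r_layers):
    """Count each single-modality layer as half a cross-modality layer."""
    return (l_layers + r_layers) / 2 + x_layers
```

For lxr955 this gives (9 + 5) / 2 + 5 = 12, matching the count stated above.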

Faster R-CNN Feature Extraction

We use the Faster R-CNN feature extractor demonstrated in "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering", CVPR 2018 and its released code at Bottom-Up-Attention github repo. It was trained on Visual Genome dataset and implemented based on a specific Caffe version.

To extract features with this Caffe Faster R-CNN, we publicly release a docker image airsplay/bottom-up-attention on docker hub that takes care of all the dependencies and library installation. Instructions and examples are demonstrated below. You could also follow the installation instructions in the bottom-up attention github to set up the tool: https://github.com/peteanderson80/bottom-up-attention.

The BUTD feature extractor is widely used in many other projects. If you want to reproduce the results from their paper, feel free to use our docker as a tool.

Feature Extraction with Docker

Docker is an easy-to-use virtualization tool which allows you to plug and play without installing libraries.

The built docker file for bottom-up-attention is released on docker hub and could be downloaded with command:

sudo docker pull airsplay/bottom-up-attention

The Dockerfile can be downloaded here, which allows using other CUDA versions.

After pulling the docker, you could test running the docker container with command:

docker run --gpus all --rm -it airsplay/bottom-up-attention bash

If errors about --gpus all popped up, please read the next section.

Docker GPU Access

Note that the purpose of the argument --gpus all is to expose GPU devices to the docker container, and it requires Docker >= 19.03 along with nvidia-container-toolkit.

For running Docker with an older version, either update it to 19.03 or use the flag --runtime=nvidia instead of --gpus all.
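
The version rule above can be expressed as a small helper for scripts that wrap docker run. This is a sketch with a hypothetical function name; it only compares the major and minor parts of a version string that is supplied by the caller.

```python
def docker_gpu_flags(docker_version):
    """Pick the GPU flag for `docker run` from a version like '19.03.5'."""
    major, minor = (int(part) for part in docker_version.split(".")[:2])
    if (major, minor) >= (19, 3):
        # Docker 19.03 and later expose GPUs with --gpus.
        return ["--gpus", "all"]
    # Older versions fall back to the nvidia runtime flag.
    return ["--runtime=nvidia"]
```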

An Example: Feature Extraction for NLVR2

We demonstrate how to extract Faster R-CNN features of NLVR2 images.

Please first follow the instructions on the NLVR2 official repo to get the images.

Download the pre-trained Faster R-CNN model. Instead of using the default pre-trained model (trained with 10 to 100 boxes), we use the 'alternative pretrained model' which was trained with 36 boxes.

wget --no-check-certificate 'https://www.dropbox.com/s/2h4hmgcvpaewizu/resnet101_faster_rcnn_final_iter_320000.caffemodel?dl=1' -O data/nlvr2_imgfeat/resnet101_faster_rcnn_final_iter_320000.caffemodel

Run docker container with command:
```
docker run --gpus all -v /path/to/nlvr2/images:/workspace/images:ro -v /path/to/lxrt_public/data/nlvr2_imgfeat:/workspace/features --rm -it airsplay/bottom-up-attention bash
```
-v mounts the folders on host os to the docker image container.

Note0: If it says something about 'privilege', add sudo before the command.

Note1: If it says something about '--gpus all', it means that the GPU options are not correctly set. Please read Docker GPU Access for the instructions to allow GPU access.

Note2: /path/to/nlvr2/images would contain subfolders train, dev, test1 and test2.

Note3: Both paths '/path/to/nlvr2/images/' and '/path/to/lxrt_public' must be absolute paths.

Extract the features inside the docker container. The extraction script is copied from butd/tools/generate_tsv.py and modified by Jie Lei and me.

cd /workspace/features
CUDA_VISIBLE_DEVICES=0 python extract_nlvr2_image.py --split train 
CUDA_VISIBLE_DEVICES=0 python extract_nlvr2_image.py --split valid
CUDA_VISIBLE_DEVICES=0 python extract_nlvr2_image.py --split test

It takes around 5 to 6 hours for the training split and 1 to 2 hours for the valid and test splits. Since it is slow, I recommend running them in parallel if there are multiple GPUs. This can be achieved by changing the gpu_id in CUDA_VISIBLE_DEVICES=$gpu_id.
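
One way to spread the splits over several GPUs is to generate one command per split with a different CUDA_VISIBLE_DEVICES value. The sketch below only builds the command strings and does not run them; the function name and the round-robin assignment are assumptions.

```python
def extraction_commands(splits, gpu_ids, script="extract_nlvr2_image.py"):
    """Assign splits to GPUs round-robin and return one command per split."""
    if not gpu_ids:
        raise ValueError("need at least one GPU id")
    commands = []
    for index, split in enumerate(splits):
        gpu = gpu_ids[index % len(gpu_ids)]
        commands.append(
            f"CUDA_VISIBLE_DEVICES={gpu} python {script} --split {split}"
        )
    return commands


if __name__ == "__main__":
    for command in extraction_commands(["train", "valid", "test"], [0, 1, 2]):
        print(command)
```

Each printed command can then be started in its own shell inside the container.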

The features will be saved in train.tsv, valid.tsv, and test.tsv under the directory data/nlvr2_imgfeat, outside the docker container. I have verified that the extracted image features are the same as the ones I provided in NLVR2 fine-tuning.

Yet Another Example: Feature Extraction for MS COCO Images

Download the MS COCO train2014, val2014, and test2015 images from MS COCO official website.

Download the pre-trained Faster R-CNN model.

mkdir -p data/mscoco_imgfeat
wget --no-check-certificate 'https://www.dropbox.com/s/2h4hmgcvpaewizu/resnet101_faster_rcnn_final_iter_320000.caffemodel?dl=1' -O data/mscoco_imgfeat/resnet101_faster_rcnn_final_iter_320000.caffemodel

Run the docker container with the command:
```
docker run --gpus all -v /path/to/mscoco/images:/workspace/images:ro -v $(pwd)/data/mscoco_imgfeat:/workspace/features --rm -it airsplay/bottom-up-attention bash
```
Note: Option -v mounts the folders outside container to the paths inside the container.

Note1: Please use the absolute path to the MS COCO images folder images. The images folder should contain the train2014, val2014, and test2015 sub-folders. (It's the standard way to save MS COCO images.)

Extract the features inside the docker container.

cd /workspace/features
CUDA_VISIBLE_DEVICES=0 python extract_coco_image.py --split train 
CUDA_VISIBLE_DEVICES=0 python extract_coco_image.py --split valid
CUDA_VISIBLE_DEVICES=0 python extract_coco_image.py --split test

Exit from the docker container (by executing exit command in bash). The extracted features would be saved under folder data/mscoco_imgfeat.

Reference

If you find this project helps, please cite our paper :)

@inproceedings{tan2019lxmert,
  title={LXMERT: Learning Cross-Modality Encoder Representations from Transformers},
  author={Tan, Hao and Bansal, Mohit},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  year={2019}
}

Acknowledgement

We thank the funding support from ARO-YIP Award #W911NF-18-1-0336, & awards from Google, Facebook, Salesforce, and Adobe.

We thank Peter Anderson for providing the Faster R-CNN code and pre-trained models under the Bottom-Up-Attention Github Repo. We thank Hengyuan Hu for his PyTorch VQA implementation; our VQA implementation borrows its pre-processed answers. We thank Hugging Face for releasing the excellent PyTorch code PyTorch Transformers.

We thank Drew A. Hudson for answering all our questions about the GQA specification. We thank Alane Suhr for helping test LXMERT on the NLVR2 unreleased test split and providing a detailed analysis.

We thank all the authors and annotators of vision-and-language datasets (i.e., MS COCO, Visual Genome, VQA, GQA, NLVR2), which allow us to develop a pre-trained model for vision-and-language tasks.

We thank Jie Lei and Licheng Yu for their helpful discussions. I also want to thank Shaoqing Ren for teaching me vision knowledge when I was in MSRA. We also thank you for helping look into our code. Please kindly contact us if you find any issue. Comments are always welcome.

LXRThanks.

Hi, I followed the data preparation steps and ran the following script for pretraining: bash run/fsb2.bash 0,1,2,3 --multiGPU

However I could not replicate results. I got:

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 45750/45750 [13:10:30<00:00,  1.04s/it]
The training loss for Epoch 19 is 2.9797

The losses are Mask_LM: 1.2318 Matched: 0.1805 Obj: 0.1795 Attr: 0.1946 Feat: 0.1154 QA: 1.0779 
Overall Accu 0.3837, gqa Accu 0.4904, visual7w Accu 0.2781, vqa Accu 0.4422, 
The valid loss is 4.2465

The losses are Mask_LM: 1.2290 Matched: 0.2033 Obj: 0.3964 Attr: 0.3826 Feat: 0.1148 QA: 1.9204 
Overall Accu 0.3205, gqa Accu 0.3857, visual7w Accu 0.2134, vqa Accu 0.3945

Logs:

The training loss for Epoch 0 is 8.2564
The losses are 
 Mask_LM: 2.6516 
 Matched: 0.4026 
 Obj: 1.0078 
 Attr: 0.7238 
 Feat: 0.1612 
 QA: 3.3093 
The valid loss is 5.6922
The losses are 
Mask_LM: 1.8312 
Matched: 0.2772 
Obj: 0.5260 
Attr: 0.4617 
Feat: 0.1323 
QA: 2.4638 
The training loss for Epoch 1 is 5.2532
The losses are 
 Mask_LM: 1.7080 
 Matched: 0.2697 
 Obj: 0.4669 
 Attr: 0.4206 
 Feat: 0.1281 
 QA: 2.2600 
The valid loss is 4.9769
The losses are 
Mask_LM: 1.5969 
Matched: 0.2467 
Obj: 0.4433 
Attr: 0.4062 
Feat: 0.1245 
QA: 2.1593 
The training loss for Epoch 2 is 4.7584
The losses are 
 Mask_LM: 1.5700 
 Matched: 0.2473 
 Obj: 0.4005 
 Attr: 0.3749 
 Feat: 0.1235 
 QA: 2.0424 
The valid loss is 4.7286
The losses are 
Mask_LM: 1.5294 
Matched: 0.2359 
Obj: 0.4148 
Attr: 0.3820 
Feat: 0.1209 
QA: 2.0455 
The training loss for Epoch 3 is 4.5075
The losses are 
 Mask_LM: 1.5085 
 Matched: 0.2360 
 Obj: 0.3651 
 Attr: 0.3498 
 Feat: 0.1212 
 QA: 1.9270 
The valid loss is 4.6296
The losses are 
Mask_LM: 1.4914 
Matched: 0.2276 
Obj: 0.4026 
Attr: 0.3739 
Feat: 0.1189 
QA: 2.0153 
The training loss for Epoch 4 is 4.3400
The losses are 
 Mask_LM: 1.4701 
 Matched: 0.2287 
 Obj: 0.3395 
 Attr: 0.3309 
 Feat: 0.1198 
 QA: 1.8509 
The valid loss is 4.5121
The losses are 
Mask_LM: 1.4305 
Matched: 0.2238 
Obj: 0.3985 
Attr: 0.3721 
Feat: 0.1183 
QA: 1.9690 
The training loss for Epoch 5 is 4.2073
The losses are 
 Mask_LM: 1.4387 
 Matched: 0.2230 
 Obj: 0.3199 
 Attr: 0.3160 
 Feat: 0.1190 
 QA: 1.7907 
The valid loss is 4.4802
The losses are 
Mask_LM: 1.4268 
Matched: 0.2227 
Obj: 0.3922 
Attr: 0.3676 
Feat: 0.1174 
QA: 1.9535 
The training loss for Epoch 6 is 4.0967
The losses are 
 Mask_LM: 1.4124 
 Matched: 0.2177 
 Obj: 0.3026 
 Attr: 0.3029 
 Feat: 0.1182 
 QA: 1.7429 
The valid loss is 4.4530
The losses are 
Mask_LM: 1.3876 
Matched: 0.2185 
Obj: 0.3933 
Attr: 0.3662 
Feat: 0.1172 
QA: 1.9703 
The training loss for Epoch 7 is 4.0009
The losses are 
 Mask_LM: 1.3924 
 Matched: 0.2133 
 Obj: 0.2879 
 Attr: 0.2910 
 Feat: 0.1177 
 QA: 1.6985 
The valid loss is 4.4338
The losses are 
Mask_LM: 1.3653 
Matched: 0.2194 
Obj: 0.3882 
Attr: 0.3647 
Feat: 0.1162 
QA: 1.9800 
The training loss for Epoch 8 is 3.9061
The losses are 
 Mask_LM: 1.3725 
 Matched: 0.2096 
 Obj: 0.2742 
 Attr: 0.2801 
 Feat: 0.1173 
 QA: 1.6525 
The valid loss is 4.3689
The losses are 
Mask_LM: 1.3534 
Matched: 0.2108 
Obj: 0.3913 
Attr: 0.3659 
Feat: 0.1167 
QA: 1.9309 
The training loss for Epoch 9 is 3.8103
The losses are 
 Mask_LM: 1.3548 
 Matched: 0.2061 
 Obj: 0.2619 
 Attr: 0.2699 
 Feat: 0.1170 
 QA: 1.6006 
The valid loss is 4.3516
The losses are 
Mask_LM: 1.3365 
Matched: 0.2110 
Obj: 0.3924 
Attr: 0.3672 
Feat: 0.1163 
QA: 1.9283 
The training loss for Epoch 10 is 3.7179
The losses are 
 Mask_LM: 1.3403 
 Matched: 0.2026 
 Obj: 0.2503 
 Attr: 0.2599 
 Feat: 0.1167 
 QA: 1.5481 
The valid loss is 4.3291
The losses are 
Mask_LM: 1.3158 
Matched: 0.2123 
Obj: 0.3907 
Attr: 0.3690 
Feat: 0.1158 
QA: 1.9256 
The training loss for Epoch 11 is 3.6245
The losses are 
 Mask_LM: 1.3225 
 Matched: 0.1996 
 Obj: 0.2393 
 Attr: 0.2502 
 Feat: 0.1165 
 QA: 1.4964 
The valid loss is 4.3434
The losses are 
Mask_LM: 1.3161 
Matched: 0.2071 
Obj: 0.3919 
Attr: 0.3704 
Feat: 0.1158 
QA: 1.9421 
The training loss for Epoch 12 is 3.5338
The losses are 
 Mask_LM: 1.3094 
 Matched: 0.1966 
 Obj: 0.2292 
 Attr: 0.2414 
 Feat: 0.1163 
 QA: 1.4410 
The valid loss is 4.3054
The losses are 
Mask_LM: 1.3118 
Matched: 0.2107 
Obj: 0.3947 
Attr: 0.3711 
Feat: 0.1152 
QA: 1.9019 
The training loss for Epoch 13 is 3.4445
The losses are 
 Mask_LM: 1.2954 
 Matched: 0.1933 
 Obj: 0.2196 
 Attr: 0.2327 
 Feat: 0.1161 
 QA: 1.3873 
The valid loss is 4.2761
The losses are 
Mask_LM: 1.2812 
Matched: 0.2085 
Obj: 0.3934 
Attr: 0.3748 
Feat: 0.1149 
QA: 1.9035 
The training loss for Epoch 14 is 3.3505
The losses are 
 Mask_LM: 1.2793 
 Matched: 0.1912 
 Obj: 0.2109 
 Attr: 0.2248 
 Feat: 0.1160 
 QA: 1.3284 
The valid loss is 4.2573
The losses are 
Mask_LM: 1.2694 
Matched: 0.2081 
Obj: 0.3945 
Attr: 0.3774 
Feat: 0.1153 
QA: 1.8926 
The training loss for Epoch 15 is 3.2673
The losses are 
 Mask_LM: 1.2700 
 Matched: 0.1884 
 Obj: 0.2027 
 Attr: 0.2169 
 Feat: 0.1158 
 QA: 1.2736 
The valid loss is 4.2739
The losses are 
Mask_LM: 1.2708 
Matched: 0.2076 
Obj: 0.3966 
Attr: 0.3795 
Feat: 0.1152 
QA: 1.9043 
The training loss for Epoch 16 is 3.1854
The losses are 
 Mask_LM: 1.2578 
 Matched: 0.1861 
 Obj: 0.1954 
 Attr: 0.2101 
 Feat: 0.1157 
 QA: 1.2203 
The valid loss is 4.2779
The losses are 
Mask_LM: 1.2621 
Matched: 0.2054 
Obj: 0.3989 
Attr: 0.3832 
Feat: 0.1152 
QA: 1.9131 
The training loss for Epoch 17 is 3.1064
The losses are 
 Mask_LM: 1.2475 
 Matched: 0.1840 
 Obj: 0.1888 
 Attr: 0.2037 
 Feat: 0.1156 
 QA: 1.1667 
The valid loss is 4.2353
The losses are 
Mask_LM: 1.2372 
Matched: 0.2080 
Obj: 0.3985 
Attr: 0.3835 
Feat: 0.1150 
QA: 1.8931 
The training loss for Epoch 18 is 3.0343
The losses are 
 Mask_LM: 1.2373 
 Matched: 0.1819 
 Obj: 0.1833 
 Attr: 0.1982 
 Feat: 0.1154 
 QA: 1.1182 
The valid loss is 4.2217
The losses are 
Mask_LM: 1.2327 
Matched: 0.2059 
Obj: 0.3965 
Attr: 0.3824 
Feat: 0.1149 
QA: 1.8894 
The training loss for Epoch 19 is 2.9797
The losses are 
 Mask_LM: 1.2318 
 Matched: 0.1805 
 Obj: 0.1795 
 Attr: 0.1946 
 Feat: 0.1154 
 QA: 1.0779 
The valid loss is 4.2465
The losses are 
Mask_LM: 1.2290 
Matched: 0.2033 
Obj: 0.3964 
Attr: 0.3826 
Feat: 0.1148

Do you happen to have any suggestions for improving this?
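One way to read the log above: each reported total appears to be the sum of the six per-task losses, so the training/validation gap can be broken down by task. The snippet below is only a minimal sketch of that check, using the Epoch 19 numbers copied from this issue; the names `LOSS_PATTERN` and `parse_losses` are hypothetical helpers and not part of the LXMERT code base.

```python
import re

# Matches "Name: 1.2345" pairs such as "Mask_LM: 1.2318" in the log lines above.
LOSS_PATTERN = re.compile(r"([A-Za-z_]+):\s*([0-9]*\.[0-9]+)")


def parse_losses(text):
    """Return a dict mapping each loss name to its float value."""
    return {name: float(value) for name, value in LOSS_PATTERN.findall(text)}


# Epoch 19 values copied from the log in this issue.
train = parse_losses(
    "Mask_LM: 1.2318 Matched: 0.1805 Obj: 0.1795 Attr: 0.1946 Feat: 0.1154 QA: 1.0779"
)
valid = parse_losses(
    "Mask_LM: 1.2290 Matched: 0.2033 Obj: 0.3964 Attr: 0.3826 Feat: 0.1148 QA: 1.9204"
)

# The per-task losses add up to the reported totals (2.9797 and 4.2465).
train_total = sum(train.values())
valid_total = sum(valid.values())

# Validation minus training loss per task, largest gap first.
gaps = sorted(((valid[k] - train[k], k) for k in train), reverse=True)

print(f"train total: {train_total:.4f}, valid total: {valid_total:.4f}")
for gap, name in gaps:
    print(f"{name:8s} valid - train = {gap:+.4f}")
```

On these numbers the largest gap is in the QA loss, followed by Obj and Attr, while Mask_LM and Feat are nearly identical between training and validation. This is only a reading of the posted log, not a diagnosis of why the results differ.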

Question about Pre-training

Thanks for sharing the code. I have a question about running your pre-training command, as the figure shows:

I have set up Python 3.6, Torch 1.1, and UTF-8 according to your preset environment. Is there something wrong with my environment configuration?

opened by ht374 0
CVE-2007-4559 Patch

Patching CVE-2007-4559

Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.
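The check described above can be sketched roughly as follows. This is an illustrative example, not the actual patch from the pull request: the helper names `is_within_directory` and `safe_extract` are assumptions, and the sketch only validates member names, so it does not cover symlink or hard-link members.

```python
import os
import tarfile


def is_within_directory(directory, target):
    """Return True if `target` resolves to a path inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extract(tar, path="."):
    """Extract all members of `tar` into `path`, refusing unsafe member names."""
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        # A name such as "../evil.txt" would land outside `path`; reject it
        # before anything is written to disk.
        if not is_within_directory(path, member_path):
            raise ValueError(f"Attempted path traversal in tar file: {member.name}")
    tar.extractall(path)
```

A call site would then use `safe_extract(tar, target_dir)` in place of `tar.extractall(target_dir)`. On recent Python versions, the extraction filters in the standard `tarfile` module (for example `filter="data"`) are another option worth considering.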

opened by TrellixVulnTeam 0
the file named "data/mscoco_imgfeat/test2015_obj64.tsv"

Hello Doctor Yang, could you help me find the file named "data/mscoco_imgfeat/test2015_obj64.tsv"? I can't download it from https://nlp1.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/train2014_obj36.zip. Also, have you tried handling classification tasks with "catt"? I would like to do some personal research with this method. Thank you very much! Best wishes.

opened by hefei1019 0

LXMERT: Learning Cross-Modality Encoder Representations from Transformers (EMNLP 2019)
