Research code for the NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": the UNITER adversarial training part

Overview

VILLA: Vision-and-Language Adversarial Training

This is the official repository of VILLA (NeurIPS 2020 Spotlight). This repository currently supports adversarial finetuning of UNITER on VQA, VCR, NLVR2, and SNLI-VE. Adversarial pre-training with in-domain data will be available soon. Both VILLA-base and VILLA-large pre-trained checkpoints are released.

Overview of VILLA

Most of the code in this repo is copied or modified from UNITER.

Requirements

We provide a Docker image for easier reproduction. Please install Docker with NVIDIA GPU support (a recent NVIDIA driver and the nvidia-container-toolkit).

Our scripts require the user to be a member of the docker group so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards. We use mixed-precision training, hence GPUs with Tensor Cores are recommended.
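
If you want to confirm that GPUs are usable from Python before training, a minimal sanity check such as the one below can help (this snippet is only an illustration and not part of the repo; run it inside the container):

    # Illustrative sanity check (not part of this repo): verify that PyTorch sees a GPU
    # and report its compute capability; mixed-precision training benefits from
    # Tensor Cores (compute capability >= 7.0, e.g. V100).
    import torch

    assert torch.cuda.is_available(), "no CUDA device visible to PyTorch"
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_capability(0))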

Quick Start

NOTE: Please run bash scripts/download_pretrained.sh $PATH_TO_STORAGE to get our latest pretrained VILLA checkpoints. This will download both the base and large models.

We use VQA as an end-to-end example for using this code base.

  1. Download processed data and pretrained models with the following command.

    bash scripts/download_vqa.sh $PATH_TO_STORAGE

    After downloading you should see the following folder structure:

    ├── finetune 
    ├── img_db
    │   ├── coco_test2015
    │   ├── coco_test2015.tar
    │   ├── coco_train2014
    │   ├── coco_train2014.tar
    │   ├── coco_val2014
    │   ├── coco_val2014.tar
    │   ├── vg
    │   └── vg.tar
    ├── pretrained
    │   ├── uniter-base.pt
    │   └── villa-base.pt
    └── txt_db
        ├── vqa_devval.db
        ├── vqa_devval.db.tar
        ├── vqa_test.db
        ├── vqa_test.db.tar
        ├── vqa_train.db
        ├── vqa_train.db.tar
        ├── vqa_trainval.db
        ├── vqa_trainval.db.tar
        ├── vqa_vg.db
        └── vqa_vg.db.tar
    
    

    You can put different pre-trained checkpoints inside the /pretrained folder based on your needs.

  2. Launch the Docker container for running the experiments.

    # docker image should be automatically pulled
    source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \
        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained

    The launch script respects the $CUDA_VISIBLE_DEVICES environment variable. Note that the source code is mounted into the container under /src instead of being built into the image, so user modifications are reflected without re-building the image. (Data folders are mounted into the container separately for flexibility with folder structures.)

  3. Run finetuning for the VQA task.

    # inside the container
    horovodrun -np $N_GPU python train_vqa_adv.py --config $YOUR_CONFIG_JSON
    
    # specific example
    horovodrun -np 4 python train_vqa_adv.py --config config/train-vqa-base-4gpu-adv.json
  4. Run inference for the VQA task and then evaluate.

    # inference
    python inf_vqa.py --txt_db /txt/vqa_test.db --img_db /img/coco_test2015 \
    --output_dir $VQA_EXP --checkpoint 6000 --pin_mem --fp16

    The result file will be written at $VQA_EXP/results_test/results_6000_all.json, which can be submitted to the evaluation server.
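
    If you want a quick look at the predictions before uploading, a small sketch like the one below can help; the exact schema of the file is an assumption here (the VQA evaluation server expects a list of per-question answers), so inspect your own output:

        # Sketch: load the prediction file and peek at one entry.
        # The field names inside each entry are assumptions; check your own file.
        import json

        with open("results_6000_all.json") as f:
            preds = json.load(f)

        print(len(preds), "predictions")
        print(preds[0])  # expected: a dict with a question id and the predicted answer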

  5. Customization

    # training options
    python train_vqa_adv.py --help
    • command-line arguments override values in the JSON config file
    • the JSON config file overrides argparse defaults
    • use horovodrun for multi-GPU training
    • --gradient_accumulation_steps emulates a larger effective batch size on fewer GPUs
    • --checkpoint selects a UNITER or VILLA pre-trained checkpoint
    • --adv_training turns adversarial training on or off (see the sketch after this list)
    • --adv_modality takes values from ['text'], ['image'], ['text','image'], and ['text','image','alter']; the last two add perturbations to both modalities simultaneously or alternately
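
    Conceptually, --adv_training perturbs the text and/or image embeddings and trains the model to stay correct under those perturbations. The following is a minimal, illustrative PGD-style sketch of that idea, not this repo's actual implementation (train_vqa_adv.py additionally uses a KL-based smoothness term, as described in the paper); every name in it (model, txt_emb, img_emb, labels, adv_lr, adv_steps, adv_eps) is a hypothetical placeholder:

        # Minimal, illustrative sketch of adversarial training in embedding space.
        # All names (model, txt_emb, img_emb, labels, adv_*) are placeholders, not this repo's API.
        import torch
        import torch.nn.functional as F

        def adv_training_step(model, txt_emb, img_emb, labels,
                              adv_lr=1e-3, adv_steps=3, adv_eps=1e-2,
                              modality=("text", "image")):
            # start from zero perturbations on the selected modalities
            txt_delta = torch.zeros_like(txt_emb, requires_grad=("text" in modality))
            img_delta = torch.zeros_like(img_emb, requires_grad=("image" in modality))

            for _ in range(adv_steps):
                logits = model(txt_emb + txt_delta, img_emb + img_delta)
                loss = F.cross_entropy(logits, labels)
                # gradients w.r.t. model parameters accumulate over inner steps ("free" training)
                loss.backward()

                # ascend on the perturbations, then clamp element-wise as a crude projection
                for delta in (txt_delta, img_delta):
                    if delta.grad is None:
                        continue
                    g = delta.grad
                    delta.data = delta.data + adv_lr * g / (g.norm() + 1e-8)
                    delta.data = delta.data.clamp(-adv_eps, adv_eps)
                    delta.grad = None
            # the optimizer step on the model parameters happens outside this sketch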

Downstream Tasks Finetuning

VCR

NOTE: train and inference should be run inside the docker container

  1. download data
    bash scripts/download_vcr.sh $PATH_TO_STORAGE
    
  2. train
    horovodrun -np 4 python train_vcr_adv.py --config config/train-vcr-base-4gpu-adv.json \
        --output_dir $VCR_EXP
    
  3. inference
    horovodrun -np 4 python inf_vcr.py --txt_db /txt/vcr_test.db \
        --img_db "/img/vcr_gt_test/;/img/vcr_test/" \
        --split test --output_dir $VCR_EXP --checkpoint 8000 \
        --pin_mem --fp16
    
    The result file will be written at $VCR_EXP/results_test/results_8000_all.csv, which can be submitted to the VCR leaderboard for evaluation.

NLVR2

NOTE: train and inference should be run inside the docker container

  1. download data
    bash scripts/download_nlvr2.sh $PATH_TO_STORAGE
    
  2. train
    horovodrun -np 4 python train_nlvr2_adv.py --config config/train-nlvr2-base-1gpu-adv.json \
        --output_dir $NLVR2_EXP
    
  3. inference
    python inf_nlvr2.py --txt_db /txt/nlvr2_test1.db/ --img_db /img/nlvr2_test/ \
    --train_dir /storage/nlvr-base/ --ckpt 6500 --output_dir . --fp16
    

Visual Entailment (SNLI-VE)

NOTE: train should be run inside the docker container

  1. download data
    bash scripts/download_ve.sh $PATH_TO_STORAGE
    
  2. train
    horovodrun -np 2 python train_ve_adv.py --config config/train-ve-base-2gpu-adv.json \
        --output_dir $VE_EXP
    

Adversarial Training of LXMERT

To keep things simple, we provide a separate repo that can be used to reproduce our results on adversarial finetuning of LXMERT on VQA, GQA, and NLVR2.

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{gan2020large,
  title={Large-Scale Adversarial Training for Vision-and-Language Representation Learning},
  author={Gan, Zhe and Chen, Yen-Chun and Li, Linjie and Zhu, Chen and Cheng, Yu and Liu, Jingjing},
  booktitle={NeurIPS},
  year={2020}
}

@inproceedings{chen2020uniter,
  title={Uniter: Universal image-text representation learning},
  author={Chen, Yen-Chun and Li, Linjie and Yu, Licheng and Kholy, Ahmed El and Ahmed, Faisal and Gan, Zhe and Cheng, Yu and Liu, Jingjing},
  booktitle={ECCV},
  year={2020}
}

License

MIT

Comments
  • training setup

    Hi, thanks for your excellent work. I am not sure whether the batch size in your paper is the same as the one in the code. In the code, 3072 refers to the total number of tokens, corresponding to roughly 32 real examples per iteration.

    a) Maybe 32 (real batch size) * 8 (gradient accumulation) is the dominant factor? b) Our V100 machines (16G) cannot process 3072 tokens, so maybe 1024 tokens (about 8 real examples), 8 GPUs, and 4 gradient-accumulation steps is another workable plan? c) Besides, can the released train-vqa-large-8gpu-adv.json reproduce the paper result? Some parameters seem to be set differently from the paper (e.g., the adversarial learning rate).

    We very much hope to reproduce your best results in our limited-resource scenario. Thanks a lot.

    opened by yixuan-qiao 9
  • About the reproduction of VCR experiment results

    Hi, thanks for your great work! When I use the following command to train a model, it does not seem to reach the results reported in the paper:

        horovodrun -np 1 python train_vcr_adv.py --config config/train-vcr-base-4gpu-adv.json \
            --output_dir vcr/output_base

    Using only one GPU, I got these results:

        100%|##########| 8000/8000 [4:58:12<00:00, 1.98s/it]
        09/10/2021 08:48:59 - INFO - __main__ - ============Step 8000=============
        09/10/2021 08:48:59 - INFO - __main__ - 1280000 examples trained at 71 ex/s
        09/10/2021 08:48:59 - INFO - __main__ - start running validation...
        09/10/2021 08:54:06 - INFO - __main__ - validation finished in 307 seconds, score_qa: 72.28 score_qar: 75.06 score: 54.35

    I am confused that this result is a few percentage points off from the one reported in the paper. What should I do? Thanks in advance!

    opened by Tclz 3
  • When will the adversarial training code of pretraining in indomain dataset be released?

    Hi Zhe,

    Thanks for your excellent work. I recently wanted to reproduce some results from VILLA and conduct pre-training on in-domain datasets. I am curious whether it is possible to simply adapt the adversarial training code in train_vqa_adv.py to the pre-training stage. Is there any specific configuration for adversarial training in the pre-training stage?

    opened by youngfly11 3
  • As the epoch increased, so did the GPU memory

    Hi, thanks for your great work! When I fine-tune on VQA, I run into the following problem: as the epochs increase, so does the GPU memory usage; eventually it exceeds the GPU's memory capacity, which stops training.

    Also, when using multiple GPUs for training, GPU 0 uses more memory than the others.

    This problem has been bothering me for a long time; do you know what the reason might be?

    Thanks for your reply~:)

    opened by clytze0216 1
  • Features of img_pos_feat

    Hello,

    I noticed that img_pos_feat has 7 features. I assume that 4 of them are the box coordinates. What are the other 3? Is there code where I can see how the 7 features were derived?

    opened by JurijsNazarovs 0
  • Checkpoints of Villa models to run on validation set

    Hello,

    Thanks for your work and available code. I have downloaded your checkpoints using download_pretrained.sh

    It downloaded several VILLA models, one of which is villa-base.pt. I would then like to run validation on the checkpoint as follows:

    python train_vqa_adv.py --config config/train-vqa-base-1gpu-adv.json --checkpoint saved_data/pretrained/villa-base.pt  --valid_steps 1
    

    However, I noticed that when the model is loaded from this checkpoint, the weights of self.vqa_output are not updated. What would you suggest if I want to take your best model and run it on a validation set?

    opened by JurijsNazarovs 0
  • VQA pre-processing

    I'd like to apply this model to my own VQA-like dataset. However, the dataset is in JSON format (like the original VQA dataset), so I need to convert it to the LMDB file format. If you have code for converting the original VQA data to LMDB format, could you please share it? Specifically, how did you calculate the "target" values in the text LMDB?

    opened by uehara-mech 0
  • How to extract features to do image retrieval

    Thank you for this amazing piece of work.

    I'm interested in using VILLA or UNITER to do image retrieval.

    I'd like to pre-extract features from VILLA for a folder of images and then retrieve them at inference time by using a text query.

    I note that in your paper you publish image retrieval and text retrieval metrics.

    I've run the code as noted in the UNITER repo:

    # text annotation preprocessing
    bash scripts/create_txtdb.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/ann
    
    # image feature extraction (Tested on Titan-Xp; may not run on latest GPUs)
    bash scripts/extract_imgfeat.sh $PATH_TO_IMG_FOLDER $PATH_TO_IMG_NPY
    
    # image preprocessing
    bash scripts/create_imgdb.sh $PATH_TO_IMG_NPY $PATH_TO_STORAGE/img_db
    

    Most of the scripts and examples I can see in the repo require both images and text to be presented to the model.

    Do you have any examples or advice on how to get text-only representations/features that could be used to then retrieve images by their pre-encoded features?

    Thanks for any help or guidance you can provide.

    opened by eugeneware 4