Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part


VILLA: Vision-and-Language Adversarial Training

This is the official repository of VILLA (NeurIPS 2020 Spotlight). This repository currently supports adversarial finetuning of UNITER on VQA, VCR, NLVR2, and SNLI-VE. Adversarial pre-training with in-domain data will be available soon. Both VILLA-base and VILLA-large pre-trained checkpoints are released.

Overview of VILLA

Most of the code in this repo is copied or modified from UNITER.


We provide a Docker image for easier reproduction. Please install the following:

Our scripts require the user to be a member of the docker group so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards. We use mixed-precision training, so GPUs with Tensor Cores are recommended.

Quick Start

NOTE: Please run bash scripts/ $PATH_TO_STORAGE to get our latest pretrained VILLA checkpoints. This will download both the base and large models.

We use VQA as an end-to-end example for using this code base.

  1. Download processed data and pretrained models with the following command.

    bash scripts/ $PATH_TO_STORAGE

    After downloading you should see the following folder structure:

    ├── finetune 
    ├── img_db
    │   ├── coco_test2015
    │   ├── coco_test2015.tar
    │   ├── coco_train2014
    │   ├── coco_train2014.tar
    │   ├── coco_val2014
    │   ├── coco_val2014.tar
    │   ├── vg
    │   └── vg.tar
    ├── pretrained
    │   └──
    └── txt_db
        ├── vqa_devval.db
        ├── vqa_devval.db.tar
        ├── vqa_test.db
        ├── vqa_test.db.tar
        ├── vqa_train.db
        ├── vqa_train.db.tar
        ├── vqa_trainval.db
        ├── vqa_trainval.db.tar
        ├── vqa_vg.db
        └── vqa_vg.db.tar

    You can put different pre-trained checkpoints inside the /pretrained folder based on your needs.

  2. Launch the Docker container for running the experiments.

    # docker image should be automatically pulled
    source $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \
        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained

    The launch script respects the $CUDA_VISIBLE_DEVICES environment variable. Note that the source code is mounted into the container under /src instead of being built into the image, so that user modifications are reflected without rebuilding the image. (Data folders are mounted into the container separately for flexibility in folder structure.)

  3. Run finetuning for the VQA task.

    # inside the container
    horovodrun -np $N_GPU python --config $YOUR_CONFIG_JSON
    # specific example
    horovodrun -np 4 python --config config/train-vqa-base-4gpu-adv.json

  4. Run inference for the VQA task and then evaluate.

    # inference
    python --txt_db /txt/vqa_test.db --img_db /img/coco_test2015 \
    --output_dir $VQA_EXP --checkpoint 6000 --pin_mem --fp16

    The result file will be written to $VQA_EXP/results_test/results_6000_all.json, which can be submitted to the evaluation server.

  5. Customization

    # training options
    python --help
    • command-line arguments override the JSON config file
    • the JSON config overrides argparse default values
    • use horovodrun to run multi-GPU training
    • --gradient_accumulation_steps emulates multi-GPU training
    • --checkpoint selects a UNITER or VILLA pre-trained checkpoint
    • --adv_training turns adversarial training on or off
    • --adv_modality takes one of ['text'], ['image'], ['text','image'], or ['text','image','alter']; the last two add perturbations to both modalities simultaneously or alternately, respectively
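
The precedence described in the list above (command-line flags over the JSON config, and the JSON config over argparse defaults) can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the names `build_parser` and `merge_options` and the three sample options are hypothetical, and the sketch assumes flags are typed in full (argparse prefix abbreviations are not detected).

```python
import argparse
import json


def build_parser():
    # Hypothetical options for illustration; the real training scripts define many more.
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=1e-3)
    parser.add_argument("--num_train_steps", type=int, default=100)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
    return parser


def merge_options(parser, argv, config):
    """Return options where CLI flags beat `config`, which beats argparse defaults."""
    # Start from argparse defaults plus whatever was typed on the command line.
    args = parser.parse_args(argv)
    # Names of flags typed explicitly, e.g. "--learning_rate=5e-5" -> "learning_rate".
    explicit = {a[2:].split("=")[0] for a in argv if a.startswith("--")}
    # Apply config values only where the user did not type the flag.
    for key, value in config.items():
        if key not in explicit:
            setattr(args, key, value)
    return args


if __name__ == "__main__":
    config = json.loads('{"learning_rate": 8e-5, "num_train_steps": 6000}')
    opts = merge_options(build_parser(), ["--learning_rate", "5e-5"], config)
    print(opts.learning_rate, opts.num_train_steps, opts.gradient_accumulation_steps)
```

In this sketch the learning rate comes from the command line, the number of training steps from the JSON config, and the gradient accumulation steps from the argparse default.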

Downstream Tasks Finetuning

VCR


NOTE: train and inference should be run inside the docker container

  1. download data
    bash scripts/ $PATH_TO_STORAGE
  2. train
    horovodrun -np 4 python --config config/train-vcr-base-4gpu-adv.json \
        --output_dir $VCR_EXP
  3. inference
    horovodrun -np 4 python --txt_db /txt/vcr_test.db \
        --img_db "/img/vcr_gt_test/;/img/vcr_test/" \
        --split test --output_dir $VCR_EXP --checkpoint 8000 \
        --pin_mem --fp16
    The result file will be written to $VCR_EXP/results_test/results_8000_all.csv, which can be submitted to the VCR leaderboard for evaluation.

NLVR2


NOTE: train and inference should be run inside the docker container

  1. download data
    bash scripts/ $PATH_TO_STORAGE
  2. train
    horovodrun -np 4 python --config config/train-nlvr2-base-1gpu-adv.json \
        --output_dir $NLVR2_EXP
  3. inference
    python --txt_db /txt/nlvr2_test1.db/ --img_db /img/nlvr2_test/ \
    --train_dir /storage/nlvr-base/ --ckpt 6500 --output_dir . --fp16

Visual Entailment (SNLI-VE)

NOTE: train should be run inside the docker container

  1. download data
    bash scripts/ $PATH_TO_STORAGE
  2. train
    horovodrun -np 2 python --config config/train-ve-base-2gpu-adv.json \
        --output_dir $VE_EXP

Adversarial Training of LXMERT

To keep things simple, we provide another separate repo that can be used to reproduce our results on adversarial finetuning of LXMERT on VQA, GQA, and NLVR2.

Citation


If you find this code useful for your research, please consider citing:

  title={Large-Scale Adversarial Training for Vision-and-Language Representation Learning},
  author={Gan, Zhe and Chen, Yen-Chun and Li, Linjie and Zhu, Chen and Cheng, Yu and Liu, Jingjing},

  title={Uniter: Universal image-text representation learning},
  author={Chen, Yen-Chun and Li, Linjie and Yu, Licheng and Kholy, Ahmed El and Ahmed, Faisal and Gan, Zhe and Cheng, Yu and Liu, Jingjing},



  • training setup

    Hi, thanks for your excellent work. I am not sure whether the batch size in your paper is the same as the one in the code. In the code, 3072 refers to the total number of tokens, which corresponds to about 32 real examples per iteration.

    a) Maybe 32 (real batch size) * 8 (gradient accumulation) is the dominant factor? b) Our V100 machine (16 GB) cannot process 3072 tokens, so maybe 1024 tokens (about 8 real examples), 8 GPUs, and 4 gradient accumulation steps is another workable plan? c) Also, can the train-vqa-large-8gpu-adv.json you released reproduce the paper's result? Some parameters seem to be set differently from the paper (e.g. the adversarial learning rate).

    We sincerely hope to reproduce your best results with our limited resources. Thanks a lot.

    opened by yixuan-qiao 9
  • About the reproduction of VCR experiment results

    Hi, thanks for your great work! When I use the following command to train a model, it does not seem to reach the results reported in the paper.

        horovodrun -np 1 python --config config/train-vcr-base-4gpu-adv.json \
            --output_dir vcr/output_base

    Using only one GPU, I got these results:

        100%|##########| 8000/8000 [4:58:12<00:00, 1.98s/it]
        [1,0]<stderr>:09/10/2021 08:48:59 - INFO - __main__ - ============Step 8000=============
        [1,0]<stderr>:09/10/2021 08:48:59 - INFO - __main__ - 1280000 examples trained at 71 ex/s
        [1,0]<stderr>:09/10/2021 08:48:59 - INFO - __main__ - ===========================================
        [1,0]<stderr>:
        [1,0]<stderr>:09/10/2021 08:48:59 - INFO - __main__ - start running validation...
        [1,0]<stderr>: [[[[1,0]<stderr>:09/10/2021 08:54:06 - INFO - __main__ - validation finished in 307 seconds, score_qa: 72.28 score_qar: 75.06 score: 54.35

    I am confused because this result is a few percentage points different from the one reported in the paper. What should I do? Thanks in advance!

    opened by Tclz 3
  • When will the adversarial pre-training code for in-domain datasets be released?

    Hi Zhe,

    Thanks for your excellent work. Recently I have wanted to reproduce some results from VILLA and conduct pre-training on in-domain datasets. Is it possible to simply carry the adversarial training code over to the pre-training stage? Is there any specific configuration for adversarial training in the pre-training stage?

    opened by youngfly11 3
  • As the epoch increased, so did the GPU memory

    Hi, thanks for your great work! When fine-tuning on VQA, I ran into a problem: GPU memory usage grows as the epochs increase, and eventually it exceeds the GPU's capacity, which stops the training.

    Also, when training with multiple GPUs, GPU 0 uses more memory than any of the others.

    This problem has been bothering me for a long time. Do you know what causes it?

    Thanks for your reply~:)

    opened by clytze0216 1
  • Features of img_pos_feat

    I noticed that img_pos_feat has 7 features. I assume that 4 of them are the coordinates of the boxes. What are the other 3? Is there code where I can see how the 7 features are derived?

    opened by JurijsNazarovs 0
  • Checkpoints of Villa models to run on validation set

    Thanks for your work and available code. I have downloaded your checkpoints using

    It downloaded several VILLA models, where one of them is Then I would like to run the validation on the checkpoint model as

    python --config config/train-vqa-base-1gpu-adv.json --checkpoint saved_data/pretrained/  --valid_steps 1

    However, I noticed that when the model is loaded from the checkpoint, the weights of self.vqa_output are not updated. What would you suggest if I want to take your best model and run it on a validation set?

    opened by JurijsNazarovs 0
  • VQA pre-processing

    I'd like to apply this model to my own VQA-like dataset. However, the dataset is in json format (like the original VQA dataset), so I need to convert it to lmdb file format. So, if you have the code to convert the original VQA data to lmdb format, could you please provide the code? Specifically, how did you calculate the "target" values in the text lmdb?

    opened by uehara-mech 0
  • How to extract features to do image retrieval

    Thank you for this amazing piece of work.

    I'm interested in using VILLA or UNITER to do image retrieval.

    I'd like to pre-extract features from VILLA for a folder of images and then retrieve them at inference time by using a text query.

    I note that in your paper you publish image retrieval and text retrieval metrics.

    I've run the code as noted in the UNITER repo:

    # text annotation preprocessing
    bash scripts/ $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/ann
    # image feature extraction (Tested on Titan-Xp; may not run on latest GPUs)
    bash scripts/ $PATH_TO_IMG_FOLDER $PATH_TO_IMG_NPY
    # image preprocessing
    bash scripts/ $PATH_TO_IMG_NPY $PATH_TO_STORAGE/img_db

    Most of the scripts and examples I can see in the repo require both images and text to be presented to the model.

    Do you have any examples or advice on how to get text-only representations/features that could be used to then retrieve images by their pre-encoded features?

    Thanks for any help or guidance you can provide.

    opened by eugeneware 4