Overview

Bottom-Up and Top-Down Attention for Visual Question Answering

An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.

The implementation follows the VQA system described in "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering" (https://arxiv.org/abs/1707.07998) and "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge" (https://arxiv.org/abs/1708.02711).

Results

Model               Validation Accuracy   Training Time
Reported Model      63.15                 12-18 hours (Tesla K40)
Implemented Model   63.58                 40-50 minutes (Titan Xp)

The accuracy was calculated using the VQA evaluation metric.
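
For reference, the VQA metric gives a predicted answer a score of min(n/3, 1), where n is the number of the ten human annotators who gave that same answer (the official evaluation additionally averages this over leave-one-out subsets of the annotations). Below is a minimal sketch of the simpler form, not the evaluation code used for the numbers above:

    # Minimal sketch of the VQA accuracy metric (not this repo's evaluation code):
    # an answer gets full credit if at least 3 of the 10 human annotators gave it.
    def vqa_accuracy(predicted_answer, human_answers):
        matches = sum(1 for a in human_answers if a == predicted_answer)
        return min(matches / 3.0, 1.0)

    # "2" was given by 4 of the 10 annotators, so it scores min(4/3, 1) = 1.0.
    print(vqa_accuracy('2', ['2', '2', '2', '2', 'two', 'two', '3', '4', 'two', '2 dogs']))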

About

This is part of a project done at CMU for the course 11-777 Advanced Multimodal Machine Learning, and is joint work by Hengyuan Hu, Alex Xiao, and Henry Huang.

As part of our project, we implemented bottom-up attention as a strong VQA baseline. We were planning to integrate object detection with VQA and were very glad to see that Peter Anderson, Damien Teney, et al. had already done that beautifully. We hope this clean and efficient implementation can serve as a useful baseline for future VQA explorations.

Implementation Details

Our implementation follows the overall structure of the papers but with the following simplifications:

  1. We don't use extra data from Visual Genome.
  2. We use a fixed number of objects per image (K=36).
  3. We use a simple, single-stream classifier without pre-training.
  4. We use the simple ReLU activation instead of gated tanh.

The first two points greatly reduce the training time. Our implementation takes around 200 seconds per epoch on a single Titan Xp while the one described in the paper takes 1 hour per epoch.

The third point is simply because we feel the two-stream classifier and pre-training in the original paper are over-complicated and unnecessary.

For the non-linear activation unit, we tried gated tanh but couldn't make it work. We also tried the gated linear unit (GLU), which works better than ReLU. In the end we chose ReLU for its simplicity: the gain from GLU is too small to justify doubling the number of parameters.
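
For illustration, the three activation blocks compared above might look roughly like the following sketch; the class names and dimensions are ours, not the layers defined in this repo:

    import torch
    import torch.nn as nn

    class GatedTanh(nn.Module):
        # Gated tanh from the original papers: y = tanh(Wx) * sigmoid(W'x).
        def __init__(self, in_dim, out_dim):
            super(GatedTanh, self).__init__()
            self.fc = nn.Linear(in_dim, out_dim)
            self.gate = nn.Linear(in_dim, out_dim)

        def forward(self, x):
            return torch.tanh(self.fc(x)) * torch.sigmoid(self.gate(x))

    class GLU(nn.Module):
        # Gated linear unit: y = (Wx) * sigmoid(W'x). Note the two linear layers,
        # which is why GLU roughly doubles the parameter count of a plain ReLU block.
        def __init__(self, in_dim, out_dim):
            super(GLU, self).__init__()
            self.fc = nn.Linear(in_dim, out_dim)
            self.gate = nn.Linear(in_dim, out_dim)

        def forward(self, x):
            return self.fc(x) * torch.sigmoid(self.gate(x))

    # The variant used here: a plain linear layer followed by ReLU.
    relu_block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())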

With these simplifications we would expect the performance to drop. For reference, the best result on the validation set reported in the paper is 63.15. The reported result without extra data from Visual Genome is 62.48, the result using only 36 objects per image is 62.82, the result using a two-stream classifier without pre-training is 62.28, and the result using ReLU is 61.63. These numbers are cited from Table 1 of the paper "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge". With all of the above simplifications combined, our first implementation scored around 59-60 on the validation set.

To shrink the gap, we added some simple but powerful modifications (sketched in code after the list):

  1. Add dropout to alleviate overfitting
  2. Double the number of neurons
  3. Add weight normalization (batch normalization does not seem to work well here)
  4. Switch to the Adamax optimizer
  5. Clip gradients
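
Taken together, these tweaks might look roughly like the sketch below; the layer sizes, dropout rate, and clipping threshold are illustrative placeholders rather than the exact values used in this repo:

    import torch
    import torch.nn as nn
    from torch.nn.utils import weight_norm, clip_grad_norm

    def fc_block(in_dim, out_dim, dropout=0.5):
        # Weight-normalized linear layer + ReLU + dropout (batch norm did not work
        # as well here, so weight norm is applied to the linear layers instead).
        return nn.Sequential(
            weight_norm(nn.Linear(in_dim, out_dim), dim=None),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    # Doubled hidden width and a placeholder answer vocabulary size (~3000).
    classifier = nn.Sequential(fc_block(1024, 2 * 1024), nn.Linear(2 * 1024, 3000))
    optimizer = torch.optim.Adamax(classifier.parameters())

    # Inside the training loop, gradients are clipped before each optimizer step
    # (clip_grad_norm_ in PyTorch >= 0.4):
    #   loss.backward()
    #   clip_grad_norm(classifier.parameters(), 0.25)
    #   optimizer.step()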

These small modifications bring the number back to ~62.80. We further changed the concatenation-based attention module in the original paper to a projection-based module. This new attention module is inspired by the paper "Modeling Relationships in Referential Expressions with Compositional Modular Networks" (https://arxiv.org/pdf/1611.09978.pdf), but with some modifications (implemented in attention.NewAttention). With the help of this new attention, we boost the performance to ~63.58, surpassing the reported best result with no extra data and at a lower computational cost.
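
To make the change concrete: the concatenation-based attention in the paper scores each region with a non-linear layer over the concatenated [v_i; q], while a projection-based module first projects the region feature and the question into a common space and fuses them with an element-wise product before scoring. Below is a rough sketch of the latter with assumed names and dimensions, not the exact attention.NewAttention code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ProjectionAttention(nn.Module):
        # Sketch of projection-based attention: project region features and the
        # question embedding into a joint space, fuse multiplicatively, then score.
        def __init__(self, v_dim, q_dim, hid_dim):
            super(ProjectionAttention, self).__init__()
            self.v_proj = nn.Linear(v_dim, hid_dim)
            self.q_proj = nn.Linear(q_dim, hid_dim)
            self.score = nn.Linear(hid_dim, 1)

        def forward(self, v, q):
            # v: [batch, k, v_dim] region features; q: [batch, q_dim] question embedding
            v_h = F.relu(self.v_proj(v))                # [batch, k, hid_dim]
            q_h = F.relu(self.q_proj(q)).unsqueeze(1)   # [batch, 1, hid_dim]
            logits = self.score(v_h * q_h)              # [batch, k, 1]
            weights = F.softmax(logits, dim=1)          # attention over the k regions
            return (weights * v).sum(1)                 # attended feature [batch, v_dim]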

Usage

Prerequisites

Make sure you are on a machine with an NVIDIA GPU and Python 2, with about 70 GB of free disk space.

  1. Install PyTorch v0.3 with CUDA and Python 2.7.
  2. Install h5py.

Data Setup

All data should be downloaded to a 'data/' directory in the root directory of this repository.

The easiest way to download the data is to run the provided script tools/download.sh from the repository root. The features are provided by and downloaded from the original authors' repo. If the script does not work, it should be easy to examine it and adapt the steps it outlines to your needs. Then run tools/process.sh from the repository root to process the data into the correct format.
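
To sanity-check the processed features, something like the snippet below should work. The file name is an assumption about what the processing step produces; the 'image_features' key is the one dataset.py reads:

    import h5py

    # The file name below is an assumption; point it at whatever HDF5 file
    # tools/process.sh produced for the training split.
    with h5py.File('data/train36.hdf5', 'r') as hf:
        features = hf['image_features']   # roughly (num_images, 36, 2048)
        print(features.shape)
        print(features[0].shape)          # h5py slices lazily, so this does not
                                          # load the whole array into RAM

Slicing the h5py dataset lazily like this avoids loading the full feature array into memory, which is handy for inspecting the file on smaller machines.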

Training

Simply run python main.py to start training. The training and validation scores will be printed every epoch, and the best model will be saved under the "saved_models" directory. The default flags should give you the results reported in the table above.
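
Once training finishes, the best weights can be loaded back for evaluation along these lines; the checkpoint path is an assumption (check train.py for the exact file name), and the model must be constructed the same way main.py builds it before loading the state dict:

    import torch

    def load_best_model(model, path='saved_models/model.pth'):
        # 'model' must be built exactly as in main.py; the path is an assumed default.
        model.load_state_dict(torch.load(path))
        model.eval()
        return model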

Comments
  • code error!!

    code error!!

    Hello, I have a problem; please give me a solution, thank you very much! I upgraded from PyTorch 0.3 to PyTorch 0.4.

      Traceback (most recent call last):
      File "main.py", line 54, in train(model, train_loader, eval_loader, args.epochs, args.output)
      File "/root/YXJ/QA/Model/bottom-up-attention-vqa-master/train.py", line 43, in train pred = model(v, b, q, a)
      File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__ result = self.forward(*input, **kwargs)
      File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 113, in forward replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
      File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 118, in replicate return replicate(module, device_ids)
      File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate param_copies = Broadcast.apply(devices, *params)
      RuntimeError: slice() cannot be applied to a 0-dim tensor

    opened by y11203090135 6
  • getting memory error on Tesla K80

    getting memory error on Tesla K80

    I am getting this error on loading:

      Traceback (most recent call last):
      File "main.py", line 33, in train_dset = VQAFeatureDataset('train', dictionary)
      File "/home/sujitmishra/bottom-up-attention-vqa/dataset.py", line 120, in __init__ self.features = np.array(hf.get('image_features'))
      File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
      File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
      File "/home/sujitmishra/py2/local/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 690, in __array__ arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
      MemoryError

    How much GPU memory does it need for training?

    opened by sujit420 6
  • Massive RAM requirement to load .tsv

    Massive RAM requirement to load .tsv

    Thanks for the great implementation. Unfortunately, after downloading and processing the data successfully, I run main.py and find that even with 32 GB of RAM, the run fails while loading features from the .hdf5 files because it runs out of RAM. Is there a workaround for this, or am I doing something wrong here?

    Thank you!

    opened by brandonjabr 5
  • Dimension of spatial features

    Dimension of spatial features

    Hi @hengyuan-hu

    Thank you for your fantastic work. I am trying to adapt your code to image captioning. I followed your code to read out the tsv file and found that spatials_features.shape = (82783, 36, 6). I do not know where this feature comes from; could you please explain it to me? I only understand that image_features.shape = (82783, 36, 2048) because the image feature is 2048-d.

    Thanks!

    opened by coldmanck 5
  • Code for calculating accuracy metric

    Code for calculating accuracy metric

    The README file says that accuracy was calculated with the VQA metric. Is the code for that calculation in the repo? I am unsure how to use the upper bound score alongside the actual score to get to this metric.

    opened by erobic 4
  • I can't find the "dictionary.pkl" in data.

    I can't find the "dictionary.pkl" in data.

    The original "data" folder only contains the "train_ids.pkl" and the "val_ids.pkl", instead of the "dictionary.pkl". I have downloaded all the files in the "tools/download.sh", but the "dictionary.pkl" is not in the list, too. My “data” folder is shown in the figure. 1537445867

    opened by jingchenchen 3
  • evaluating on test?

    evaluating on test?

    Hi there! Thanks so much for this implementation!

    I was wondering how I could use the trained model to score the official test set? I noticed that @jnhwkim had a similar question here but wasn't sure of the detailed step-by-step instructions...

    https://github.com/hengyuan-hu/bottom-up-attention-vqa/issues/3

    (sorry if this is "standard" VQA stuff---I'm new to this task so any help would be appreciated!)

    Thanks!

    opened by yoonkim 3
  • Run trained model on a single image

    Run trained model on a single image

    I've successfully trained the model and can load the state dict from the .pth model into a new instance. Is there any way I can now test it on a new image/question, and see the response?

    Thank you!

    opened by brandonjabr 3
  • Reported accuracy of Implemented Model

    Reported accuracy of Implemented Model

    Hi!

    The README says that running the code with the default parameters leads to the results in the table. However, I ran the code on an AWS p2.16xlarge machine and got:

    
    	train_loss: 3.83, score: 48.52
    	eval score: 45.38 (92.66)
    epoch 22, time: 325.48
    	train_loss: 3.80, score: 48.76
    	eval score: 45.37 (92.66)
    epoch 23, time: 329.43
    	train_loss: 3.78, score: 49.02
    	eval score: 45.27 (92.66)
    epoch 24, time: 329.51
    	train_loss: 3.75, score: 49.34
    	eval score: 45.09 (92.66)
    epoch 25, time: 326.05
    	train_loss: 3.73, score: 49.50
    	eval score: 45.40 (92.66)
    **epoch 26, time: 327.16
    	train_loss: 3.71, score: 49.75
    	eval score: 45.43 (92.66)**
    epoch 27, time: 326.92
    	train_loss: 3.69, score: 49.97
    	eval score: 45.03 (92.66)
    epoch 28, time: 328.04
    	train_loss: 3.67, score: 50.22
    	eval score: 45.35 (92.66)
    epoch 29, time: 332.07
    	train_loss: 3.66, score: 50.44
    	eval score: 45.17 (92.66)
    

    Any thought about why is this happening?

    Thanks, Belen

    opened by bcsaldias 2
  • How l2-normalization over feature is implemented ?

    How l2-normalization over feature is implemented ?

    This paper states that L2 normalization of the image features is crucial for good performance. However, you just use pool5 data, which is average-pooled into a 2048-d vector in generate_tsv.py.

    However, neither your repository bottom-up-attention-vqa nor the feature extractor repository bottom-up-attention implements the L2 normalization. I implemented it at the very beginning of the forward procedure as v = v / torch.norm(v, 2), but the validation score decreased by 0.5.

    Can anybody explain it ? Thanks~

    opened by ZhuFengdaaa 2
  • Add visual genome as extra data

    Add visual genome as extra data

    Hi, I tried to add this dataset for training, following this paper's guide to use 'questions whose correct answers overlap the output vocabulary determined on the VQA v2 dataset'; the answers are all processed with your processed_answer function. But I got roughly 970,000 questions, which is much larger than the 485,000 questions reported in the paper. Any ideas?

    opened by greathope 2
  • problem with questions and answers file length

    problem with questions and answers file length

    Hi, thank you for your good implementation. I have a problem when starting training: in utils.py, assert_eq checks whether the lengths of the questions and answers are the same, while in the compute_softscore.py script, compute_target builds a pkl file containing filtered answers. The length of the original MSCOCO training questions is 443,757, but the length of train_target.pkl is 388,158; it seems the answers are not augmented after filtering, so they do not have the same size as the questions and assert_eq raises an assertion error for me. Could you please help me fix this?

    opened by marzieh-Mlkzdh 0
  • Problem in inference

    Problem in inference

    Thanks for the code in the first place.

    We want to run inference with the pre-trained models and obtain the region coordinates, features, and predicted label of each region in an image. You say that you don't use extra data from Visual Genome; does this mean that the pre-trained model you provide can only predict 1,600 classes of objects but cannot predict the 400 classes of fine-grained attributes coming from Visual Genome?

    Thanks in advance and look forward to your reply.

    opened by Reply1999 0
  • Reported accuracy not achieved

    Reported accuracy not achieved

    The validation accuracy achieved is only 61.5 with the same setup.

    Please let me know how to get at least 63% accuracy, close to the reported one.

    Thanking you in advance.

    Thanks

    opened by ak-kat 0
  • download.sh should be fixed.

    download.sh should be fixed.

    This is not a question. I tried "download.sh" from this GitHub repo, but it didn't work for the Questions and Annotations download links. I think "http://visualqa.org/data/mscoco/vqa/" should be changed to "https://vision.ece.vt.edu/vqa/release_data/mscoco/vqa/". Thanks.

    opened by jaeseo-park 0
  • Features downloading not working

    Features downloading not working

    Hello! I have not been able to download the features from the link https://imagecaption.blob.core.windows.net/imagecaption/trainval_36.zip

    Can any other options be provided to get those features?

    opened by ManasiPat 3