Bilinear attention networks for visual question answering

Overview

Bilinear Attention Networks

This repository is the implementation of Bilinear Attention Networks for the visual question answering and Flickr30k Entities tasks.

For the visual question answering task, our single model achieved 70.35 and an ensemble of 15 models achieved 71.84 on the Test-standard split of VQA 2.0. For the Flickr30k Entities task, our single model achieved 69.88 / 84.39 / 86.40 for Recall@1, 5, and 10, respectively (slightly better than the original paper). For details, please refer to our technical report.

This repository is based on and inspired by @hengyuan-hu's work. We sincerely thank them for sharing their code.

Overview of bilinear attention networks

Updates

  • Bilinear attention networks now use torch.einsum and remain backward-compatible (see the sketch below). (12 Mar 2019)
  • Now compatible with PyTorch v1.0.1. (12 Mar 2019)
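
For reference, the bilinear attention logits are computed with a single torch.einsum call using the subscripts 'xhyk,bvk,bqk->bhvq' (an expression also asked about in the comments below). Here is a minimal, self-contained sketch of that computation; the tensor names, shapes, and sizes are illustrative assumptions rather than the exact code of this repository.

    import torch

    # Minimal sketch of the low-rank bilinear attention logits via torch.einsum.
    # Sizes are illustrative assumptions: b = batch, v = number of visual objects,
    # q = number of question tokens, k = joint embedding (rank) size, h = glimpses.
    b, v, q, k, h = 2, 36, 14, 512, 8

    v_ = torch.randn(b, v, k)         # projected visual features
    q_ = torch.randn(b, q, k)         # projected question features
    h_mat = torch.randn(1, h, 1, k)   # per-glimpse weight vectors (broadcast over the batch)
    h_bias = torch.randn(1, h, 1, 1)  # per-glimpse bias

    # 'xhyk,bvk,bqk->bhvq': logits[b, h, i, j] = sum_k h_mat[0, h, 0, k] * v_[b, i, k] * q_[b, j, k];
    # the singleton subscripts x and y are summed out, and k is the contracted rank dimension.
    logits = torch.einsum('xhyk,bvk,bqk->bhvq', h_mat, v_, q_) + h_bias
    print(logits.shape)  # torch.Size([2, 8, 36, 14])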

Prerequisites

You may need a machine with 4 GPUs, 64GB memory, and PyTorch v1.0.1 for Python 3.

  1. Install PyTorch with CUDA and Python 3.6.
  2. Install h5py.

WARNING: do not use PyTorch v1.0.0 because of a bug that degrades performance.
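
As a quick sanity check before training, you can verify the environment; this snippet is only an illustration and is not part of the repository.

    import sys
    import torch

    # Quick environment sanity check (illustrative, not part of this repository):
    # the prerequisites above recommend PyTorch v1.0.1 on Python 3 and warn against v1.0.0.
    print(sys.version.split()[0], torch.__version__, torch.cuda.is_available())
    assert torch.__version__ != '1.0.0', 'PyTorch v1.0.0 has a bug that degrades performance; use v1.0.1.'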

VQA

Preprocessing

Our implementation uses the pretrained features from bottom-up-attention (the adaptive 10-100 features per image) together with the GloVe vectors. For simplicity, the script below helps you avoid the hassle of downloading them manually.

All data should be downloaded to a data/ directory in the root directory of this repository.

The easiest way to download the data is to run the provided script tools/download.sh from the repository root. If the script does not work, it should be easy to examine the script and modify the steps outlined in it according to your needs. Then run tools/process.sh from the repository root to process the data to the correct format.

For now, you have to manually download the data for the options below (used in our best single model).

We use a part of the Visual Genome dataset for data augmentation. The image metadata and the question answers of Version 1.2 need to be placed in data/.

We use MS COCO captions to extract semantically connected words for the extended word embeddings, along with the questions of VQA 2.0 and Visual Genome. You can download them here. Since the contribution of these captions is minor, you can skip the processing of MS COCO captions by removing the cap elements from the target option in this line.

The counting module (Zhang et al., 2018) is integrated into this repository as counting.py for your convenience. The source repository can be found at @Cyanogenoid's vqa-counting.
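
Below is a minimal sketch of how the counting module can be called, assuming the Counter interface of @Cyanogenoid's vqa-counting (a Counter(objects) module whose forward pass takes bounding boxes of shape (batch, 4, num_objects) and attention scores of shape (batch, num_objects)); the shapes and values are illustrative, and this is not the exact integration used in this repository.

    import torch
    from counting import Counter  # counting.py in this repository

    # Illustrative shapes (assumptions): a batch of 2 images with 10 candidate boxes each.
    batch, num_objects = 2, 10
    boxes = torch.rand(batch, 4, num_objects)   # (x1, y1, x2, y2) coordinates per object
    attention = torch.rand(batch, num_objects)  # attention scores over the objects

    # Counter(objects) is expected to return a soft count feature of size objects + 1.
    counter = Counter(num_objects)
    count_features = counter(boxes, attention)
    print(count_features.shape)  # expected: torch.Size([2, 11])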

Training

$ python3 main.py --use_both True --use_vg True

to start training (the options enable training on the train+val splits and on Visual Genome, respectively). The training and validation scores will be printed at every epoch, and the best model will be saved under the directory "saved_models". The default hyperparameters should give you the best single-model result, which is around 70.04 for the test-dev split.

Validation

If you trained a model with the training split using

$ python3 main.py

then you can run evaluate.py with appropriate options to evaluate its score for the validation split.

Pretrained model

We provide the pretrained model reported as the best single model in the paper (70.04 for test-dev, 70.35 for test-standard).

Please download the model from the link and move it to saved_models/ban/model_epoch12.pth (you may encounter a redirection page asking you to confirm). The training log can be found here. Then run

$ python3 test.py --label mytest

The resulting JSON file can be found in the results/ directory.
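
The file typically follows the standard VQA submission format, a list of {"question_id", "answer"} records. Here is a small sketch for inspecting it; the exact file name under results/ is an assumption derived from the --label option.

    import json

    # Inspect a generated result file; the file name below is a guess based on --label mytest.
    with open('results/mytest.json') as f:
        results = json.load(f)

    # Standard VQA submission format: a list of {"question_id": int, "answer": str} records.
    print(len(results), results[0])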

Without Visual Genome augmentation

Without the Visual Genome augmentation, we get 69.50 (the average of 8 models, with a standard deviation of 0.096) on the test-dev split. We use the 8-glimpse model, a learning rate starting at 0.001 (please see this change for better results), 13 epochs, and a batch size of 256.
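
For illustration, a gradual-warmup-then-step-decay learning-rate rule consistent with the numbers above (0.001 starting learning rate, 13 epochs) and with the adamax / decay_step=2 / decay_rate=0.25 / gradual-warmup settings that appear in the training log quoted in the comments below can be sketched as follows; the warmup length and the decay start epoch are assumptions, not the repository's exact values.

    import torch

    # Sketch of a gradual-warmup + step-decay schedule (assumed values, for illustration only).
    def learning_rate(epoch, base_lr=0.001, warmup_epochs=4,
                      decay_start=10, decay_step=2, decay_rate=0.25):
        if epoch < warmup_epochs:                 # linear warmup
            return base_lr * (epoch + 1) / warmup_epochs
        if epoch < decay_start:                   # hold the base learning rate
            return base_lr
        n_decays = (epoch - decay_start) // decay_step + 1
        return base_lr * decay_rate ** n_decays   # step decay every decay_step epochs

    model = torch.nn.Linear(8, 8)  # placeholder model
    optim = torch.optim.Adamax(model.parameters(), lr=learning_rate(0))
    for epoch in range(13):
        for group in optim.param_groups:
            group['lr'] = learning_rate(epoch)
        # ... run one training epoch here ...
        print(epoch, optim.param_groups[0]['lr'])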

Flickr30k Entities

Preprocessing

You have to manually download the Annotation and Sentence files to data/flickr30k/Flickr30kEntities.tar.gz. Then run the provided scripts tools/download_flickr.sh and tools/process_flickr.sh from the root of this repository, similarly to the VQA case. Note that the image features for Flickr30k were generated using the bottom-up-attention pretrained model.

Training

$ python3 main.py --task flickr --out saved_models/flickr

to start training. The --gamma option is not applied here. The default hyperparameters should give you approximately 69.6 Recall@1 on the test split.

Validation

Please download the model from the link and move it to saved_models/flickr/model_epoch5.pth (you may encounter a redirection page asking you to confirm). Then run

$ python3 evaluate.py --task flickr --input saved_models/flickr --epoch 5

to evaluate the scores for the test split.

Troubleshooting

Please check the troubleshooting wiki and the previous issue history.

Citation

If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:

@inproceedings{Kim2018,
author = {Kim, Jin-Hwa and Jun, Jaehyun and Zhang, Byoung-Tak},
booktitle = {Advances in Neural Information Processing Systems 31},
title = {{Bilinear Attention Networks}},
pages = {1571--1581},
year = {2018}
}

License

MIT License

Comments
  • cannot reproduce the best result of single model

    I followed all the instructions and used the default hyperparameters, which should give me the best results. However, with the default random seed of 1204, I can only get 69.84 on the test-dev split, which is 0.2 lower than the reported result. I also notice that the standard deviation reported on the val split is around 0.11. Can you give me some advice on how to close the gap? Thanks!

    opened by cengzy14 7
  • What does the  'xhyk,bvk,bqk->bhvq'  mean???

    What does this mean in the code: logits = torch.einsum('xhyk,bvk,bqk->bhvq', (self.h_mat, v_, q_)) + self.h_bias? What does the subscript string 'xhyk,bvk,bqk->bhvq' mean?

    opened by jasscia18 6
  • Out of memory while executing loss.backward()

    Hello, thanks for your great code! I have some trouble while running

    python3 main.py --use_both True --use_vg True

    I have 4 TITAN Xps, which have 12.2 GB of memory per GPU, and I set the batch size to 256. Then I get the following error:

    nParams= 90618566
    optim: adamax lr=0.0007, decay_step=2, decay_rate=0.25, grad_clip=0.25
    gradual warmup lr: 0.0003
    THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
    Traceback (most recent call last):
      File "main.py", line 97, in <module>
        train(model, train_loader, eval_loader, args.epochs, args.output, optim, epoch)
      File "/home/Project/ban-vqa/train.py", line 74, in train
        loss.backward()
      File "/home/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/autograd/variable.py", line 167, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
      File "/home/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/autograd/__init__.py", line 99, in backward
        variables, grad_variables, retain_graph)
    RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58

    If I set the batch size to 128, it occupies ~12 GB of GPU memory during the early stage and then goes down to ~6 GB per GPU. Is there something wrong with my execution? Thanks!

    opened by cengzy14 6
  • Question

    Hello guys,

    Very nice piece of work. I was wondering why you didn't use an einsum implementation of the bilinear attention in order to speed up training; this equation is perfect for it. You should see a significant gain, and it would be nice for once to have highly optimized code available on GitHub.

    Best, T.C

    opened by tchaton 5
  • flickr30k upperbound

    Hello,

    I used Bottom-up Attention to get boxes for the Flickr30k data. Unfortunately, I could not reproduce the upperbound you reported in the paper: I get 0.6507, while you reported 0.8745. Would you mind providing details of how you used the bottom-up model to induce the boxes? I listed my settings below:

    model_name: resnet101_faster_rcnn_final.caffemodel
    conf_thresh=0.2
    min_boxes=10
    max_boxes=100
    

    UPDATE:

    When I increase the number of boxes I get a better upperbound, but it is still not as good as yours; the setup below gives me an upperbound of 0.8530:

    model_name: resnet101_faster_rcnn_final.caffemodel
    conf_thresh=0.01
    min_boxes=200
    max_boxes=200
    
    opened by volkancirik 5
  • bug in bc.py

    Line 39 in bc.py is self.h_net = weight_norm(nn.Linear(h_dim, h_out), dim=None). Should this be self.h_net = weight_norm(nn.Linear(h_dim*self.k, h_out), dim=None)?

    opened by zhangweifeng1218 5
  • flickr 30k features download

    Are the hdf5 files in the downloaded flickr30k_features.zip used to reproduce the results? I don't see tsv files in flickr30k_features.zip, but I do need the features and bounding boxes for the Flickr30k validation/test sets. The files in flickr30k_features.zip are confusing: for example, the val.hdf5 file contains (30722, 2048) features, but in adaptive_detection_features_converter.py the known_num_boxes for the validation set is 29906, so what are these 30722 features?

    opened by ziyanyang 4
  • How to use the pretrained model

    Hello, this is my first time trying to use a VQA network, and I wonder how I can use the pretrained model to ask a question about an image and get a response. Thank you.

    opened by MILOTIDE 4
  • Download Flickr30k features

    opened by drigoni 3
  • Run test get KeyError: 1 error

    When I run python test.py --label mytest, I get this error:

    Traceback (most recent call last):
      File "test.py", line 91, in <module>
        eval_dset = VQAFeatureDataset(args.split, dictionary, adaptive=True)
      File "/home/gwh/Downloads/ban-vqa-master/dataset.py", line 244, in __init__
        self.entries = _load_dataset(dataroot, name, self.img_id2idx, self.label2ans)
      File "/home/gwh/Downloads/ban-vqa-master/dataset.py", line 142, in _load_dataset
        entries.append(_create_entry(img_id2val[img_id], question, None))
    KeyError: 1
    

    I find that data/test2015_imgid2idx.pkl is {}; the file was generated with python3 tools/adaptive_detection_features_converter.py.

    Can you help me? @jnhwkim Thanks in advance for any suggestions.

    opened by Ailln 3
  • error when using adaptive_detection_features_converter.py

    While running adaptive_detection_features_converter.py for the TSV files, I am getting this error and can't resolve it. Any leads here would be helpful. This error occurs when trying to decode the features/boxes from the tsv file.

    File "tools/adaptive_detection_features_converter.py", line 156, in extract bboxes = np.frombuffer(base64.decodestring(item['boxes']), dtype=np.float32).reshape((item['num_boxes'], -1)) File "/home/reddy/myvenv/lib/python3.6/base64.py", line 554, in decodestring return decodebytes(s) File "/home/reddy/myvenv/lib/python3.6/base64.py", line 546, in decodebytes return binascii.a2b_base64(s) binascii.Error: Incorrect padding

    opened by dhruvsharma15 2
  • Error in Flickr30k features

    Dear authors,

    I saw your previous answer, but I didn't have time to answer before the issue was closed. I have tried two different Linux systems and have also tried on Windows. I have tried Chrome and Firefox. I can download the package but not unzip it because it gives me an error with the train.hdf5 file. It says the file is corrupted. I also tried two different internet connections. I can't unzip without errors. I have tried to download the file several times, but the result is always the same.

    Could you please check the train.hdf5 file? Davide

    Originally posted by @drigoni in https://github.com/jnhwkim/ban-vqa/issues/46#issuecomment-1137563064

    opened by drigoni 2
  • Pretrained model for Flickr30k

    Dear Authors,

    the link to download the pre-trained model for Flickr30k no longer works. Could you please update it again? Link not working: https://drive.google.com/uc?export=download&id=1xiVVRPsbabipyHes25iE0uj2YkdKWv3K

    Davide

    opened by drigoni 1