CLIP-ViL
In our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?", we show that CLIP features improve over traditional ResNet features on visual question answering, image captioning, vision-and-language navigation, and visual entailment tasks.
We release the extracted features and reproducible code here.
Specifically, we develop our methods in two scenarios: (1) direct task-specific fine-tuning; and (2) Vision and Language pre-training.
CLIP-ViL-Direct/VLN
We directly plug CLIP into task-specific models and fine-tune on three representative tasks: Visual Question Answering, Image Captioning, and Vision-and-Language Navigation.
Please see the corresponding code directory for full details.
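As a rough illustration of this recipe, the sketch below swaps CLIP in as the visual backbone and trains a task-specific head on top of its features. It assumes the openai/CLIP package; the VQAHead class, the image path, the answer-vocabulary size, and the use of the pooled image feature (the actual models consume spatial grid features from the backbone) are illustrative placeholders, not the training code in this repository.

import torch
import clip  # openai/CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a CLIP ResNet backbone in place of the usual ResNet feature extractor.
clip_model, preprocess = clip.load("RN50x4", device=device)

class VQAHead(torch.nn.Module):
    # Hypothetical stand-in for a task-specific model such as Pythia or MCAN.
    def __init__(self, visual_dim, num_answers):
        super().__init__()
        self.classifier = torch.nn.Linear(visual_dim, num_answers)

    def forward(self, visual_feat):
        return self.classifier(visual_feat)

# Placeholder image; in practice this runs over the task's training images.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
visual_feat = clip_model.encode_image(image).float()  # pooled CLIP image feature

head = VQAHead(visual_feat.shape[-1], num_answers=3129).to(device)
logits = head(visual_feat)  # head (and optionally the backbone) is fine-tuned end-to-end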
Note that with direct fine-tuning, for Visual Question Answering on VQA 2.0 test-dev, we achieve up to 68.37% accuracy with Pythia and 74.01% accuracy with MCAN, and generally more than 4.0% improvement in accuracy. For Image Captioning on the Karpathy test split of MS COCO, we obtain a 2.1% improvement in the CIDEr metric over ResNet alternatives. For Navigation, on RxR we obtain a 5% improvement in nDTW (the main metric for RxR), and on R2R we obtain about a 6% improvement in accuracy over our strong baselines.
CLIP-ViL-Pretrain
To test the potential of combining CLIP pre-training with vision-and-language pre-training, we introduce CLIP-ViL-Pretrain, a vision-and-language model pre-trained on aligned image-text data with the CLIP visual encoder as its visual backbone. CLIP-ViL-Pretrain is pre-trained with a reconstructive objective and an image-text matching objective, and is further fine-tuned on the VQA, SNLI-VE, and GQA tasks.
Please see the corresponding code directory for full details.
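The sketch below shows one way the two pre-training objectives described above can be combined into a single loss, interpreting the reconstructive objective as masked-token reconstruction. The model interface, batch fields, and helper names are hypothetical placeholders, not the actual training loop of CLIP-ViL-Pretrain.

import torch
import torch.nn.functional as F

def pretraining_step(model, batch):
    # Reconstructive objective: predict masked text tokens conditioned on the
    # image (masked-language-modeling style reconstruction).
    mlm_logits = model(batch["image"], batch["masked_text"])["mlm_logits"]
    mlm_loss = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        batch["mlm_labels"].view(-1),
        ignore_index=-100,  # only masked positions contribute to the loss
    )

    # Image-text matching objective: binary classification over aligned vs.
    # mismatched (image, text) pairs sampled within the batch.
    itm_logits = model(batch["image"], batch["text"])["itm_logits"]
    itm_loss = F.cross_entropy(itm_logits, batch["itm_labels"])

    return mlm_loss + itm_loss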
Note that CLIP-ViL-Pretrain achieves 76.48% accuracy on VQA 2.0 test-dev and 76.70% on test-std; 80.61% accuracy on SNLI-VE dev and 80.20% on test-P; and 61.42% accuracy on GQA test-dev and 62.93% on test-std.
Related Links
Reference
If you use CLIP-ViL in your research or wish to refer to the baseline results published here, please use the following BibTeX entry.
@misc{shen2021clip,
  title={How Much Can CLIP Benefit Vision-and-Language Tasks?},
  author={Sheng Shen and Liunian Harold Li and Hao Tan and Mohit Bansal and Anna Rohrbach and Kai-Wei Chang and Zhewei Yao and Kurt Keutzer},
  year={2021},
  eprint={2107.06383},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}