
Overview

CLIP-ViL

In our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?", we show the improvement of CLIP features over the traditional resnet features on the visual question answering, image captioning, navigation and visual entailment tasks.

We release the extracted features and reproducible code here.

Specifically, we develop our methods in two scenarios: (1) direct task-specific fine-tuning; and (2) Vision-and-Language pre-training.

CLIP-ViL-Direct/VLN

We directly plug CLIP into task-specific models and fine-tune on three representative tasks: Visual Question Answering, Image Captioning, and Vision-and-Language Navigation.
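
As a rough sketch of this setup (not the exact extraction pipeline in this repository), CLIP's pooled and grid features can be read out with the openai clip package as follows; the forward hook on model.visual.layer4 assumes the RN50 backbone exposes its last residual stage under that name, as in CLIP's ModifiedResNet, and example.jpg is a placeholder path.

    # Minimal sketch: load a CLIP visual backbone and read out pooled + grid features.
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("RN50", device=device)

    # Grab the spatial feature map from the last residual stage (before attention pooling).
    grid = {}
    model.visual.layer4.register_forward_hook(
        lambda module, inputs, output: grid.update(feat=output)
    )

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image
    with torch.no_grad():
        pooled = model.encode_image(image)  # global feature after attention pooling
    grid_feat = grid["feat"]                # e.g. (1, 2048, 7, 7) for a 224x224 input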

Please see the corresponding code directory for full details.

Note that in direct fine-tuning, for Visual Question Answering on VQA 2.0 test-dev, we are able to achieve up to 68.37% accuracy with Pythia and 74.01% accuracy with MCAN, with generally more than 4.0% improvement in accuracy. For Image Captioning on the Karpathy test split of MS COCO, we obtain a 2.1% improvement in CIDEr over ResNet alternatives. For Navigation, on RxR we obtain a 5% improvement in nDTW (the main metric for RxR), and on R2R we obtain about a 6% improvement in accuracy over our strong baselines.

CLIP-ViL-Pretrain

To test the potential of combining CLIP pre-training and Vision-and-Language pre-training, we introduce CLIP-ViL-Pretrain, a vision-and-language model pre-trained on image-text data with the CLIP visual encoder as its visual backbone. CLIP-ViL-Pretrain is pre-trained on aligned image-text data with a reconstructive objective and an image-text matching objective, and is further fine-tuned on the VQA, SNLI-VE, and GQA tasks.
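
As a rough illustration of the image-text matching objective (a sketch under generic assumptions, not this repository's implementation; ITMHead and fused_cls are hypothetical names), such a head is a binary classifier over the fused representation of an image-text pair, trained with mismatched pairs as negatives:

    import torch
    import torch.nn as nn

    class ITMHead(nn.Module):
        """Hypothetical image-text matching head: predicts whether an image-text
        pair is aligned, given a fused [CLS]-style vector from a cross-modal encoder."""
        def __init__(self, hidden_dim: int = 768):
            super().__init__()
            self.classifier = nn.Linear(hidden_dim, 2)  # 1 = matched, 0 = mismatched

        def forward(self, fused_cls: torch.Tensor) -> torch.Tensor:
            return self.classifier(fused_cls)

    # Usage sketch: half the sampled pairs keep their caption, half get a randomly swapped one.
    head = ITMHead()
    fused_cls = torch.randn(8, 768)       # stand-in for cross-modal encoder output
    labels = torch.randint(0, 2, (8,))    # alignment labels for the sampled pairs
    loss = nn.CrossEntropyLoss()(head(fused_cls), labels)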

Please see the corresponding code directory for full details.

Note that CLIP-ViL-Pretrain achieves 76.48% accuracy on VQA 2.0 test-dev and 76.70% on test-std; 80.61% accuracy on SNLI-VE dev and 80.20% on test-P; and 61.42% accuracy on GQA test-dev and 62.93% on test-std.


Reference

If you use CLIP-ViL in your research or wish to refer to the baseline results published here, please use the following BibTeX entry.

@misc{shen2021clip,
    title={How Much Can CLIP Benefit Vision-and-Language Tasks?}, 
    author={Sheng Shen and Liunian Harold Li and Hao Tan and Mohit Bansal and Anna Rohrbach and Kai-Wei Chang and Zhewei Yao and Kurt Keutzer},
    year={2021},
    eprint={2107.06383},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
Comments
  • Captioning model training script fails


    Hi, I followed the data preparation and ran the training script for the default CLIP-RN50 model in the readme. However, the training job crashes with the log below. Could you please check if the current example training script is runnable?

    $ /scratch-space/CLIP-ViL/CLIP-ViL-Direct/caption 
    > python tools/train.py --cfg configs/phrase1/clip_rn50_transformer_scl.yml
    Warning: key N_enc not in args
    Warning: key N_dec not in args
    Warning: key d_model not in args
    Warning: key d_ff not in args
    Warning: key num_att_heads not in args
    Warning: key dropout not in args
    Warning: key REFORWARD not in args
    DataLoader loading json file:  data/cocotalk.json
    vocab size is  9487
    DataLoader loading h5 file:  data/cocotalk_clip_RN50_fc data/cocotalk_clip_RN50_att data/cocotalk_box data/cocotalk_label.h5
    max sequence length in data is 16
    read 123287 image features
    assigned 113287 images to split train
    assigned 5000 images to split val
    assigned 5000 images to split test
    Read data: 0.0007147789001464844
    /opt/conda/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
      warnings.warn('Was asked to gather along dimension 0, but all '
    iter 0 (epoch 0), train_loss = 9.158, time/batch = 25.627
    Read data: 0.0002460479736328125
    iter 1000 (epoch 0), train_loss = 4.920, time/batch = 0.183
    Read data: 0.00023293495178222656
    iter 2000 (epoch 0), train_loss = 3.784, time/batch = 0.194
    Traceback (most recent call last):
      File "tools/train.py", line 293, in <module>
        train(opt)
      File "tools/train.py", line 246, in train
        val_loss, predictions, lang_stats = eval_utils.eval_split(
      File "/scratch-space/CLIP-ViL/CLIP-ViL-Direct/caption/captioning/utils/eval_utils.py", line 171, in eval_split
        seq, seq_logprobs = model(fc_feats, att_feats, att_masks, opt=tmp_eval_kwargs, mode='sample')
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
        output.reraise()
      File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
        raise self.exc_type(msg)
    TypeError: Caught TypeError in replica 5 on device 5.
    Original Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
        output = module(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/scratch-space/CLIP-ViL/CLIP-ViL-Direct/caption/captioning/models/CaptionModel.py", line 33, in forward
        return getattr(self, '_'+mode)(*args, **kwargs)
    TypeError: _sample() missing 2 required positional arguments: 'fc_feats' and 'att_feats'
    
    opened by j-min 11
  • The extracted feature of the COCO dataset for caption


    In CLIP-ViL/CLIP-ViL-Direct/caption/clip.py, "x = x + self.positional_embedding[0, :, None, :].to(x.dtype) # (HW+1)NC" should be "x = x + self.positional_embedding[:, None, :].to(x.dtype) # (HW+1)NC".

    Could you explain why you changed the following code from the official CLIP? In the official CLIP, out_proj_weight=self.c_proj.weight and out_proj_bias=self.c_proj.bias; in your code they are set differently (screenshot omitted).
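
    For reference, a minimal shape check of the broadcast that the official AttentionPool2d forward relies on (the batch size and 7x7 grid below are only illustrative):

    import torch

    N, C, HW = 4, 2048, 7 * 7                       # batch, channels, spatial positions (illustrative)
    x = torch.randn(HW + 1, N, C)                   # tokens in (HW+1, N, C) layout, mean token prepended
    positional_embedding = torch.randn(HW + 1, C)   # official CLIP stores a 2-D (HW+1, C) parameter

    # (HW+1, 1, C) broadcasts over the batch dimension; the extra leading index in
    # positional_embedding[0, :, None, :] only works if the parameter was stored
    # with an added batch dimension of size 1.
    x = x + positional_embedding[:, None, :].to(x.dtype)
    print(x.shape)  # torch.Size([50, 4, 2048])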

    opened by liujiaheng 4
  • About precompute


    Hi @airsplay! I have used precomute_imagenet_views.py and found that the result is not consistent with the tsv file provided at https://nlp.cs.unc.edu/data/vln_clip/features/CLIP-ViT-B-32-views.tsv. How did you generate the tsv file provided on that website? I ran the script with the default settings (arch=vit, LABEL=False).

    opened by HeyMercer 2
  • About the training time of Pythia


    Thanks for your open-source code! I want to ask about the training time of Pythia when using your CLIP features.

    It seems to take me more than a week if I use mmf with your config for Pythia (even when setting the batch size to 32 with the original number of iterations). However, when I use the code in https://github.com/KaihuaTang/VQA2.0-Recent-Approachs-2018.pytorch and run BUTD, I can finish training in 5-6 hours with 4 Nvidia 2080Ti GPUs.

    To be honest, I am not familiar with mmf, and I just want to make sure I did not do something wrong.

    opened by tingxueronghua 2
  • bug in positional_embedding's weights when resizing.


    Hi @airsplay, @sIncerass, @liunian-harold-li, @clip-vil,

    Thanks for sharing your code! This is very interesting work :)

    However, I find that there may be an inappropriate operation in ./CLIP-ViL-Direct/caption/scripts/clip_prepro_feats.py when extracting the visual CLIP features. Specifically, when you resize the official CLIP models' positional_embedding to support a larger image input resolution than the official models, you assign pos_embed.weight the resized weights, while pos_embed.data, which is actually used in forward(), remains zeros. I guess this missing positional_embedding weight when extracting the CLIP grid features is the true reason for the large performance degradation in the experiments using CLIP-ViT-B grid features. The performance of the CLIP ResNet models is relatively normal because the model.visual.attnpool.positional_embedding layer is only used on top of the CLIP ResNet models to aggregate the extracted grid features into the global visual feature.

    To verify my guess, I re-implemented a new feature-extraction pipeline and ran the COCO Captioning task using the X-modaler codebase with the extracted CLIP grid features. I get a 127.8 CIDEr score using ViT-B/32_448 CLIP grid features. I have provided my detailed experiment results on COCO Captioning here.

    So I wonder whether the missing positional_embedding weights when extracting CLIP-ViT-B grid features also hurt the performance on the VQA tasks, and whether the visualization and conclusion in Figure 3 would change after providing the positional_embedding weights when extracting CLIP-ViT-B grid features.
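
    For illustration, a minimal sketch of this kind of positional-embedding resizing, assuming a (1, 1+HW, C) ViT-style layout and bicubic interpolation (not the exact code in clip_prepro_feats.py):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def resize_pos_embed(pos_embed: nn.Parameter, new_grid: int) -> nn.Parameter:
        """Interpolate a (1, 1+HW, C) positional embedding to a new_grid x new_grid layout."""
        cls_tok, grid_tok = pos_embed[:, :1], pos_embed[:, 1:]          # split off the CLS token
        old_grid = int(grid_tok.shape[1] ** 0.5)
        c = grid_tok.shape[-1]
        grid_tok = grid_tok.reshape(1, old_grid, old_grid, c).permute(0, 3, 1, 2)
        grid_tok = F.interpolate(grid_tok, size=(new_grid, new_grid),
                                 mode="bicubic", align_corners=False)
        grid_tok = grid_tok.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, c)
        # Returning a fresh Parameter (or copying into .data) ensures the resized
        # weights are what forward() actually sees, which is the point above.
        return nn.Parameter(torch.cat([cls_tok, grid_tok], dim=1))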

    Best, Jianjie

    opened by jianjieluo 2
  • No 'tfm_gen' when trying to run feature extraction for vqa mcan


    Hi! I'm trying to run mcan feature extraction.

    This is the command I was trying to run: python mcan_clip_grid_feature.py --config-file configs/R-50-grid.yaml --dataset coco_2014_train --model_type RN50, and I got the following error. I searched throughout the codebase and didn't find where self.tfm_gens is defined.

    Traceback (most recent call last):
      File "mcan_clip_grid_feature.py", line 142, in <module>
        main(args)
      File "mcan_clip_grid_feature.py", line 136, in main
        do_feature_extraction(cfg, model, args.dataset, args)
      File "mcan_clip_grid_feature.py", line 78, in do_feature_extraction
        extract_clip_feature_on_dataset(model, data_loader, dump_folder, args)
      File "mcan_clip_grid_feature.py", line 113, in extract_clip_feature_on_dataset
        for idx, inputs in enumerate(tqdm.tqdm(data_loader)):
      File "/nlp/data/weiqiuy/miniconda3/envs/mmf/lib/python3.7/site-packages/tqdm/std.py", line 1133, in __iter__
        for obj in iterable:
      File "/nlp/data/weiqiuy/miniconda3/envs/mmf/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
        data = self._next_data()
      File "/nlp/data/weiqiuy/miniconda3/envs/mmf/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
        return self._process_data(data)
      File "/nlp/data/weiqiuy/miniconda3/envs/mmf/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
        data.reraise()
      File "/nlp/data/weiqiuy/miniconda3/envs/mmf/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
        raise self.exc_type(msg)
    AttributeError: Caught AttributeError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File "/nlp/data/weiqiuy/miniconda3/envs/mmf/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
        data = fetcher.fetch(index)
      File "/nlp/data/weiqiuy/miniconda3/envs/mmf/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/nlp/data/weiqiuy/miniconda3/envs/mmf/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/nlp/data/weiqiuy/miniconda3/envs/mmf/lib/python3.7/site-packages/detectron2/data/common.py", line 90, in __getitem__
        data = self._map_func(self._dataset[cur_idx])
      File "/nlp/data/weiqiuy/miniconda3/envs/mmf/lib/python3.7/site-packages/detectron2/utils/serialize.py", line 26, in __call__
        return self._obj(*args, **kwargs)
      File "/mnt/nlpgridio3/data/weiqiuy/CLIP-ViL/CLIP-ViL-Direct/vqa/grid_feats/dataset_mapper.py", line 107, in __call__
        self.tfm_gens = [self.tfm_gens[0]] + [T.Resize((600, 1000))]
    AttributeError: 'AttributeDatasetMapper' object has no attribute 'tfm_gens'
    
    opened by fallcat 1
  • Pythia Feature Extraction


    I'm trying to extract image features for VQA with Pythia using python pythia_clip_grid_feature.py --config-file configs/R-50-grid.yaml --dataset coco_2015_train --model_type RN50. Isn't this supposed to output 100 object features of dimension 2048? I'm getting outputs of varying dimensions, such as (1, 13, 20, 2048) and (1, 15, 20, 2048). Could anyone point out where I'm wrong and what I need to change to get a (100, 2048) output for an image? Also, what should the format of the annotation file be if I want to use the VQA dataset, since it doesn't have attributes like area, segmentation, categories, etc.? Thanks

    opened by shamanthak-hegde 1
  • Data dir for mcan_clip_grid_feature.py


    Hello, thanks for sharing such a great codebase. If I want to extract VQAv2's COCO image features (or those of some other dataset) using mcan_clip_grid_feature.py, how should I prepare the data format and specify the dataset directory that the code needs?

    opened by Fly2flies 1
  • Errors occurred when extracting clip features using Resnet


    Thank you for your answer to the previous question. The following error occurred when I used the ResNet model to extract CLIP features: x has 3 dimensions and self.positional_embedding has 2, but the slice operation expects 4. I don't know what went wrong and I hope to get your help. Thank you very much. (error screenshot omitted)

    opened by tianjunyu0871 1
  • Errors occurred when extracting clip features using ViT-B/32


    Thanks for sharing your code! When I run clip_prepro_feats.py, the following error occurs. I tried many times, but it still fails. I hope to get your help.

    (error screenshot omitted)
    opened by tianjunyu0871 1
  • CLIP-VIT-B-Transformer captioning results


    Hi @sIncerass, thanks for your interesting work. I have a question about the image captioning results in Table 2. I find that the Transformer model with CLIP-ViT-B features can still achieve good performance, rather than the dramatically worse performance reported in Table 2. Maybe there is a bug in the CLIP-ViT-B feature extraction.

    opened by YuanEZhou 1
  • How to reproduce the results of experiments that are shown in Table 7


    Thanks for your open-source code! I'm trying to reproduce the results of the experiments shown in Table 7 of the ICLR paper: 'Zero-Shot Performance of CLIP in VQA'.

    Could you provide code or detailed information for this?

    opened by raven38 0
  • Where can I find annotations for SNLI-VE?


    It seems that the code uses annotation files that differ from the original SNLI-VE ones. What's the difference between these files and the original SNLI-VE jsonl files, and where can I find them? Can you share them with us? Thank you in advance!

    # CLIP-ViL-Pretrain/src/tasks/snli_data.py
    text_db_paths = {
        "valid": "/local/harold/ubert/clip_vlp/lxmert/data/snli_ve/txt_db/ve_dev.db",
        "train": "/local/harold/ubert/clip_vlp/lxmert/data/snli_ve/txt_db/ve_train.db",
        "test": "/local/harold/ubert/clip_vlp/lxmert/data/snli_ve/txt_db/ve_test.db",
    }
    
    opened by 1219521375 1
  • CVE-2007-4559 Patch


    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • Checkpoint for SNLI-VE


    Hi, I just wanted to ask if there is a checkpoint for the fine-tuned model on SNLI-VE and if it would be possible to share it? Thank you in advance! @airsplay

    opened by sramshetty 0