Overview

CLIP-ViL

In our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?", we show the improvement of CLIP features over traditional ResNet features on visual question answering, image captioning, navigation, and visual entailment tasks.

We release the extracted features and reproducible code here.

Specifically, we develop our methods in two scenarios: (1) direct task-specific fine-tuning and (2) vision-and-language pre-training.

CLIP-ViL-Direct/VLN

We directly plug CLIP into task-specific models and fine-tune on three representative tasks: Visual Question Answering, Image Captioning, and Vision-and-Language Navigation.
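
As a rough illustration of this plug-in setup (a sketch only, not the extraction script in this repository), the snippet below pulls both the pooled CLIP-RN50 embedding and the spatial grid features that a task-specific head could consume in place of a traditional ResNet grid. It assumes the clip package from openai/CLIP and Pillow are installed; the forward hook on layer4, the file name example.jpg, and the variable names are illustrative choices, not the repository's code.

# Illustrative sketch: grab CLIP-RN50 grid features (the layer4 activations
# before CLIP's attention pooling) for a downstream task-specific model.
# Assumes the `clip` package from openai/CLIP and Pillow; "example.jpg" is a
# placeholder path.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

captured = {}
def save_grid(module, inputs, output):
    captured["grid"] = output            # (batch, 2048, H/32, W/32) feature map

hook = model.visual.layer4.register_forward_hook(save_grid)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    pooled = model.encode_image(image)   # (1, 1024) global CLIP embedding
grid_features = captured["grid"]         # (1, 2048, 7, 7) for a 224x224 input
hook.remove()

print(pooled.shape, grid_features.shape)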

Please see the corresponding code directory for full details.

Note that in direct fine-tuning, for Visual Question Answering on VQA 2.0 test-dev, we achieve up to 68.37% accuracy with Pythia and 74.01% accuracy with MCAN, generally an improvement of more than 4.0% in accuracy. For Image Captioning on Karpathy's test split of MS COCO, we obtain a 2.1% improvement in CIDEr over ResNet alternatives. For Navigation, on RxR we obtain a 5% improvement in nDTW (the main metric for RxR), and on R2R we obtain about a 6% improvement in accuracy over our strong baselines.

CLIP-ViL-Pretrain

To test the potential of combining CLIP pre-training with vision-and-language pre-training, we introduce CLIP-ViL-Pretrain, a vision-and-language model pre-trained on image-text data with the CLIP visual encoder as its visual backbone. CLIP-ViL-Pretrain is pre-trained on aligned image-text data with a reconstructive objective and an image-text matching objective, and is further fine-tuned on the VQA, SNLI-VE, and GQA tasks.
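
For intuition, here is a hedged sketch of an image-text matching (ITM) objective of the kind used in such pre-training. It is an illustration under simplifying assumptions, not the actual CLIP-ViL-Pretrain module: a small MLP (ToyITM) stands in for the cross-modal encoder, the feature dimensions are placeholders, and negatives are built by shuffling images within the batch.

# Hedged sketch of an image-text matching (ITM) objective; illustrative only,
# not CLIP-ViL-Pretrain's actual module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyITM(nn.Module):
    def __init__(self, img_dim=1024, txt_dim=768, hidden=768):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.GELU(), nn.Linear(hidden, hidden))
        self.cls = nn.Linear(hidden, 2)   # 1 = matched pair, 0 = mismatched pair

    def forward(self, img_feats, txt_feats):
        return self.cls(self.fuse(torch.cat([img_feats, txt_feats], dim=-1)))

def itm_loss(model, img_feats, txt_feats):
    # Score each caption against its own image (positive) and a shuffled image
    # (negative); a real implementation would avoid drawing an image as its own negative.
    batch = img_feats.size(0)
    neg_img = img_feats[torch.randperm(batch)]
    logits = torch.cat([model(img_feats, txt_feats), model(neg_img, txt_feats)])
    labels = torch.cat([torch.ones(batch, dtype=torch.long),
                        torch.zeros(batch, dtype=torch.long)])
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for pooled CLIP image and text features.
img, txt = torch.randn(8, 1024), torch.randn(8, 768)
print(itm_loss(ToyITM(), img, txt).item())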

Please see the corresponding code directory for full details.

Note that CLIP-ViL-Pretrain achieves 76.48% accuracy on VQA 2.0 test-dev and 76.70% on test-std; 80.61% accuracy on SNLI-VE dev and 80.20% on test-P; and 61.42% accuracy on GQA test-dev and 62.93% on test-std.

Reference

If you use CLIP-ViL in your research or wish to refer to the baseline results published here, please use the following BibTeX entry.

@misc{shen2021clip,
    title={How Much Can CLIP Benefit Vision-and-Language Tasks?}, 
    author={Sheng Shen and Liunian Harold Li and Hao Tan and Mohit Bansal and Anna Rohrbach and Kai-Wei Chang and Zhewei Yao and Kurt Keutzer},
    year={2021},
    eprint={2107.06383},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
Issues
  • Captioning model training script fails

    Hi, I followed the data preparation and ran the training script for the default CLIP-RN50 model in the readme. However, the training job crashes with the log below. Could you please check if the current example training script is runnable?

    $ /scratch-space/CLIP-ViL/CLIP-ViL-Direct/caption 
    > python tools/train.py --cfg configs/phrase1/clip_rn50_transformer_scl.yml
    Warning: key N_enc not in args
    Warning: key N_dec not in args
    Warning: key d_model not in args
    Warning: key d_ff not in args
    Warning: key num_att_heads not in args
    Warning: key dropout not in args
    Warning: key REFORWARD not in args
    DataLoader loading json file:  data/cocotalk.json
    vocab size is  9487
    DataLoader loading h5 file:  data/cocotalk_clip_RN50_fc data/cocotalk_clip_RN50_att data/cocotalk_box data/cocotalk_label.h5
    max sequence length in data is 16
    read 123287 image features
    assigned 113287 images to split train
    assigned 5000 images to split val
    assigned 5000 images to split test
    Read data: 0.0007147789001464844
    /opt/conda/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
      warnings.warn('Was asked to gather along dimension 0, but all '
    iter 0 (epoch 0), train_loss = 9.158, time/batch = 25.627
    Read data: 0.0002460479736328125
    iter 1000 (epoch 0), train_loss = 4.920, time/batch = 0.183
    Read data: 0.00023293495178222656
    iter 2000 (epoch 0), train_loss = 3.784, time/batch = 0.194
    Traceback (most recent call last):
      File "tools/train.py", line 293, in <module>
        train(opt)
      File "tools/train.py", line 246, in train
        val_loss, predictions, lang_stats = eval_utils.eval_split(
      File "/scratch-space/CLIP-ViL/CLIP-ViL-Direct/caption/captioning/utils/eval_utils.py", line 171, in eval_split
        seq, seq_logprobs = model(fc_feats, att_feats, att_masks, opt=tmp_eval_kwargs, mode='sample')
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
        output.reraise()
      File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
        raise self.exc_type(msg)
    TypeError: Caught TypeError in replica 5 on device 5.
    Original Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
        output = module(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/scratch-space/CLIP-ViL/CLIP-ViL-Direct/caption/captioning/models/CaptionModel.py", line 33, in forward
        return getattr(self, '_'+mode)(*args, **kwargs)
    TypeError: _sample() missing 2 required positional arguments: 'fc_feats' and 'att_feats'
    
    opened by j-min 6
  • MS COCO Caption scores with MLE objective

    Hi, thanks for sharing your code! This is a very interesting work :) Could you please let me know the MS COCO scores on the Karpathy test split with MLE objectives only (before self-critical learning)?

    opened by j-min 1
  • How to combine CLIP with Oscar (or VinVL)?

    Thanks for sharing the code! This is interesting work! I wonder how to use CLIP as the visual encoder in pre-training models that use object detection tags (such as Oscar and VinVL)?

    opened by 594422814 1
  • About clip feature extraction

    I cannot reproduce the cocotalk att and fc feature maps that you provided. I also tried extracting the features with the original CLIP code, but the results are still different. How can I solve this problem?

    opened by LittleDonkey1203 1
  • Pretrained weights for image captioning

    Could you provide the pretrained weights for image captioning work?

    opened by zhuang93 1
  • Grad-CAM visualization code

    Hi, I am interested in your excellent work. Is there any way that I can visualize Grad-CAM like Fig. 3 in your paper?

    opened by yangbang18 1
  • Why is the weights entry in R-50-grid.yaml commented out?

    https://github.com/clip-vil/CLIP-ViL/blob/master/CLIP-ViL-Direct/vqa/configs/R-50-grid.yaml#L3

    opened by tshu-w 0
  • evaluating vqa using pythia

    Hi, thanks for publishing the code. It seems that I'm doing something wrong with the evaluation.

    1. I generated Pythia features using the following command: python pythia_clip_grid_feature.py --config-file configs/R-50-grid.yaml --dataset coco_2014_val --model_type RN50, and a new folder (./clip/RN50/val2014) with *.pth files (e.g. 42.pth) was created.

    Next, I modified local_clip_r50.yaml to point to the above-generated folder, and then I tried to evaluate using MMF with the following command: mmf_predict config=./mmf_configs/pythia/local_clip_r50.yaml datasets=vqa2 model=pythia run_type=val

    but I got the following error: FileNotFoundError: [Errno 2] No such file or directory: '/media/yoavs/9955ec44-7fd3-499d-8f20-40a37f20674c/CLIP-ViL/CLIP-ViL-Direct/vqa/clip/RN50/val2014/COCO_val2014_000000428399.npy'. It seems that the feature file names are formatted differently, and simply renaming the files yielded other errors. What am I doing wrong?

    2. In addition, how do I fine-tune (VQA, Pythia) using MMF? Is it mmf_run config=./mmf_configs/pythia/local_clip_r50.yaml datasets=vqa2 model=pythia run_type=train_val?

    Thanks.

    opened by itsyoavshalev 0