Code and data for ImageCoDe, a contextual vison-and-language benchmark

Overview

ImageCoDe

arxiv

This repository contains code and data for ImageCoDe: Image Retrieval from Contextual Descriptions.

Example

Data

All collected descriptions for the training and validation set are under data/train_data.json and data/valid_data.json.

Image sets can be downloaded on Zenodo or GoogleDrive and should be unzipped in data/.

You can download from the commandline via:

wget https://zenodo.org/record/6518944/files/image-sets.zip

For ViLBERT experiments, you need to download a pretrained ViLBERT checkpoint from volta here, simply by clicking on ViLBERT in the table. Save the downloaded file as baselines/vilbert/vilbert-pretrained.bin. Since ViLBERT uses image features from Faster R-CNN, you also have to downloaded these for all ImageCoDe images here: Google Drive link. Save the file as data/rcnn-features36-36.lmdb. The same procedure applies for UNITER.

The format for data/train_data.json looks like this:

{
  "MSR-VTT-videoTrainValVideo_video2044-shot1_0": {
    "6": "a mom holding her babies in the middle of the picture, no other image intervenes with the image.",
    "7": "The image is fading between a woman holding a baby and a woman sitting with a red background. The hands of the woman sitting aren't visible."
  },
  "video-storytelling-videochristmas_56Nm66j-i5Q-shot14_2": {
  "..."
  }
}

And the images under data/ have the following structure. Each folder contains 10 images. If the images are video frames, the number X in imgX.jpg indicates the frame number:

  .
  ├── MSR-VTT-videoTrainValVideo_video2044-shot1_0
      │   ├── img0.jpg
      │   ├── img7.jpg
      │   ├── ...
  ├── video-storytelling-videochristmas_56Nm66j-i5Q-shot14_2
      │   ├── ...

Leaderboard

Based on this you can train your model and test on the unlabeled test set:

{
  "MSR-VTT-videoTestVideo_video7763-shot2_1": [
    "The team name on shirt is visible without a number, but all letters can be seen for team name.",
    "the player can be seen with him on the left close to the logo on the pitch on the right and can be clearly seen"
  ],
  "...":
  ["..."]
}

In order to appear on the leaderboard, please format your results in the following format:

{
  "MSR-VTT-videoTestVideo_video7763-shot2_1": [
    1,
    2
  ],
  "...":
  ["..."]
}

Where the example here with "1" and "2" represent image indices ranging from 0 to 9. You can submit to the leaderboard by sending your test set file (or a download link) to [email protected] and we will update the leaderboard quickly (max. 1-2 days). The leaderboard is maintained on the project website and might change its submission procedure at some point.

Installations

Run install.sh for running CLIP experiments. For VilBERT follow the instructions for volta.

Code

Code for CLIP is under baselines/clip and and code for ViLBERT/UNITER is under baselines/crossencoders.

For details commands to run each model variant shown in the paper, have a look at the README in baselines.

For example to train the best performing model CLIP+TemporalEmbeddings, run:

python3 contextual.py --lr 2e-6 --lr_head 1e-4 -b 36 -m ViT-B/16 --fusion mult -a gelu --logit_scale 1000 --finetuned_checkpoint_path checkpoints/CONTRA_clip_best__36_4e-06_30_1395526.pt --add_input --frozen_clip --positional

Data Analysis

Our manual annotation of various phenomena (negation, nuances, ...) in our validation set can be found under data/manual_annotation_valid.yaml

License

This work is licensed under the MIT license. See LICENSE for details. Third-party software and data sets are subject to their respective licenses.
If you want to cite our paper, please use:

@inproceedings{krojer_contextual_2022,
  address = {Online},
  title = {Image Retrieval from Contextual Descriptions},
  booktitle = {Proceedings of the 60th {Annual} {Meeting} of the {Association} for {Computational} {Linguistics},
  publisher = {Association for Computational Linguistics},
  author = {Krojer, Benno and Adlakha, Vaibhav and Vineet, Vibhav and Goyal, Yash and Ponti, Edoardo and Reddy, Siva},
  month = may,
  year = {2022},
}

Acknowledgement

Our data (specifically the image sets) are built upon 3 video dataset and Open Images:

We also the volta repository for ViLBERT and UNITER baseline variants

For questions or feedback, don't hesitate to contact the author: [email protected]

You might also like...
[IJCAI-2021] A benchmark of data-free knowledge distillation from paper
[IJCAI-2021] A benchmark of data-free knowledge distillation from paper "Contrastive Model Inversion for Data-Free Knowledge Distillation"

DataFree A benchmark of data-free knowledge distillation from paper "Contrastive Model Inversion for Data-Free Knowledge Distillation" Authors: Gongfa

LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
LexGLUE: A Benchmark Dataset for Legal Language Understanding in English

LexGLUE: A Benchmark Dataset for Legal Language Understanding in English ⚖️ 🏆 🧑‍🎓 👩‍⚖️ Dataset Summary Inspired by the recent widespread use of th

The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding.
The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding.

SuperGen The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding. Requirements Before running, you

A machine learning benchmark of in-the-wild distribution shifts, with data loaders, evaluators, and default models.
A machine learning benchmark of in-the-wild distribution shifts, with data loaders, evaluators, and default models.

WILDS is a benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, from tumor identification to wildlife monitoring to poverty mapping.

The source code for the Cutoff data augmentation approach proposed in this paper: "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation".

Cutoff: A Simple Data Augmentation Approach for Natural Language This repository contains source code necessary to reproduce the results presented in

A code repository associated with the paper A Benchmark for Rough Sketch Cleanup by Chuan Yan, David Vanderhaeghe, and Yotam Gingold from SIGGRAPH Asia 2020.

A Benchmark for Rough Sketch Cleanup This is the code repository associated with the paper A Benchmark for Rough Sketch Cleanup by Chuan Yan, David Va

The dataset and source code for our paper: "Did You Ask a Good Question? A Cross-Domain Question IntentionClassification Benchmark for Text-to-SQL"

TriageSQL The dataset and source code for our paper: "Did You Ask a Good Question? A Cross-Domain Question Intention Classification Benchmark for Text

Official code for 'Robust Siamese Object Tracking for Unmanned Aerial Manipulator' and offical introduction to UAMT100 benchmark
Official code for 'Robust Siamese Object Tracking for Unmanned Aerial Manipulator' and offical introduction to UAMT100 benchmark

SiamSA: Robust Siamese Object Tracking for Unmanned Aerial Manipulator Demo video 📹 Our video on Youtube and bilibili demonstrates the evaluation of

This is the code for the paper "Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, Tao Mei: Gait Recognition in the Wild with Dense 3D Representations and A Benchmark. (CVPR 2022)"

Gait3D-Benchmark This is the code for the paper "Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, Tao Mei: Gait Recognition in the Wild

Comments
  • explore the dataset

    explore the dataset

    Hey thank you for sharing the code and dataset! I am wondering if there is a plan to build an online demo for the dataset (e.g., COCO Explorer), so we can explore image sets easily.

    opened by zhaoyanpeng 2
  • Could you provide checkpoint for the best performing model

    Could you provide checkpoint for the best performing model

    Thank you for the great dataset first, but checkpoints/CONTRA_clip_best__36_4e-06_30_1395526.pt is missing in README.md

    python3 contextual.py --lr 2e-6 --lr_head 1e-4 -b 36 -m ViT-B/16 --fusion mult -a gelu --logit_scale 1000 --finetuned_checkpoint_path checkpoints/CONTRA_clip_best__36_4e-06_30_1395526.pt --add_input --frozen_clip --positional
    

    Could you provide this checkpoint? Thank you in advance.

    opened by Soonhwan-Kwon 1
Owner
McGill NLP
Research group within McGill University and Mila focusing on various topics in natural language processing.
McGill NLP
UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus

UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus General info This is

null 71 Oct 25, 2022
Code for "Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search"

Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search This is an implementation for our paper Contextual Non-Loca

Tencent YouTu Research 50 Dec 3, 2022
Repo for CVPR2021 paper "QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information"

QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information by Masato Tamura, Hiroki Ohashi, and Tomoaki Yosh

null 105 Dec 23, 2022
banditml is a lightweight contextual bandit & reinforcement learning library designed to be used in production Python services.

banditml is a lightweight contextual bandit & reinforcement learning library designed to be used in production Python services. This library is developed by Bandit ML and ex-authors of Facebook's applied reinforcement learning platform, Reagent.

Bandit ML 51 Dec 22, 2022
Generate Contextual Directory Wordlist For Target Org

PathPermutor Generate Contextual Directory Wordlist For Target Org This script generates contextual wordlist for any target org based on the set of UR

null 8 Jun 23, 2021
ICCV2021 - Mining Contextual Information Beyond Image for Semantic Segmentation

Introduction The official repository for "Mining Contextual Information Beyond Image for Semantic Segmentation". Our full code has been merged into ss

null 55 Nov 9, 2022
[2021 MultiMedia] CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval

CONQUER: Contexutal Query-aware Ranking for Video Corpus Moment Retreival PyTorch implementation of CONQUER: Contexutal Query-aware Ranking for Video

Hou zhijian 23 Dec 26, 2022
Official implementation of NeurIPS 2021 paper "Contextual Similarity Aggregation with Self-attention for Visual Re-ranking"

CSA: Contextual Similarity Aggregation with Self-attention for Visual Re-ranking PyTorch training code for CSA (Contextual Similarity Aggregation). We

Hui Wu 19 Oct 21, 2022
Official Pytorch implementation for Deep Contextual Video Compression, NeurIPS 2021

Introduction Official Pytorch implementation for Deep Contextual Video Compression, NeurIPS 2021 Prerequisites Python 3.8 and conda, get Conda CUDA 11

null 51 Dec 3, 2022