Official code for "Bridging Video-text Retrieval with Multiple Choice Questions", CVPR 2022 (Oral).

Overview

Bridging Video-text Retrieval with Multiple Choice Questions, CVPR 2022 (Oral)

Paper | Project Page | Pre-trained Model | CLIP-Initialized Pre-trained Model

News

2022-04-17 We release the pre-trained model initialized from CLIP (ViT-B/32) and its usage (text-to-video retrieval and video feature extraction).

2022-04-08 We release the pre-training and downstream evaluation code, and the pre-trained model.

Main Results on Downstream Tasks

Text-to-video Retrieval on MSR-VTT

Text-to-video Retrieval on MSVD, LSMDC and DiDeMo

Visualization

Answer Noun Questions

We visualize the cross-modality attention between the text tokens of noun questions and the video tokens from BridgeFormer. In the second and fifth columns, the noun phrase marked in blue (Q1) is erased to form the question; in the third and sixth columns, the noun phrase marked in green (Q2) is erased. BridgeFormer attends to the video patches that carry the relevant object information to answer noun questions.

Answer Verb Questions

We visualize the cross-modality attention between the text tokens of verb questions and the video tokens from BridgeFormer. Three frames sampled from a video are shown, and the verb phrase marked in blue (Q) is erased to form the question. BridgeFormer focuses on the video tokens that capture object motion to answer verb questions.
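
As a rough illustration of what "erasing a phrase as the question" means, the sketch below builds a question and answer pair from a caption. The function name `make_question`, the `[MASK]` placeholder and the example caption are assumptions made for illustration; the repository's own question construction (including how phrases are extracted) may differ.

```python
def make_question(caption, phrase, mask_token="[MASK]"):
    """Erase the first occurrence of `phrase` from `caption`.

    Returns (question, answer). The placeholder token is an assumption;
    the README only says that the phrase is erased.
    """
    if phrase not in caption:
        raise ValueError(f"{phrase!r} does not occur in {caption!r}")
    return caption.replace(phrase, mask_token, 1), phrase


# Hypothetical caption, not taken from any dataset.
caption = "an old woman is cutting a cake"
noun_question, noun_answer = make_question(caption, "an old woman")
verb_question, verb_answer = make_question(caption, "is cutting")
print(noun_question)  # [MASK] is cutting a cake
print(verb_question)  # an old woman [MASK] a cake
```

The same helper covers noun and verb questions; only the erased phrase changes.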

Dependencies and Installation

Python >= 3.8 (Anaconda is recommended)
PyTorch >= 1.7
NVIDIA GPU + CUDA
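
A quick way to check these requirements before installing anything else is sketched below. The helper `check_environment` is not part of the repo; it only reports what it finds and assumes PyTorch version strings begin with `major.minor`.

```python
import importlib.util
import sys


def check_environment():
    """Report whether Python and PyTorch meet the minimums listed above."""
    report = {"python_ok": sys.version_info >= (3, 8)}
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed; nothing more to inspect.
        report["torch_installed"] = False
        return report

    import torch

    # Drop any local build suffix such as "+cu118" before comparing.
    major_minor = torch.__version__.split("+")[0].split(".")[:2]
    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["torch_ok"] = tuple(int(part) for part in major_minor) >= (1, 7)
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```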

Installation

Clone repo

git clone https://github.com/TencentARC/MCQ.git
cd MCQ

Install dependent packages
```
pip install -r requirements.txt
```
Download the DistilBERT base model "distilbert-base-uncased" from Hugging Face, and put the "distilbert-base-uncased" directory under the root directory of this repo.
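
To confirm the model directory landed in the right place, a small check along these lines may help. It is a sketch, not a script from the repo: the helper name `missing_distilbert_files` is made up, and the file list assumes the usual layout of a Hugging Face BERT-style checkpoint.

```python
from pathlib import Path

# Files a Hugging Face BERT-style checkpoint directory normally contains
# (assumed layout; adjust if your download differs).
REQUIRED_FILES = ("config.json", "vocab.txt")
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")


def missing_distilbert_files(repo_root="."):
    """Return the expected files that are absent from distilbert-base-uncased/."""
    model_dir = Path(repo_root) / "distilbert-base-uncased"
    missing = [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]
    # Either weight format is enough.
    if not any((model_dir / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing


if __name__ == "__main__":
    problems = missing_distilbert_files(".")
    print("distilbert-base-uncased looks complete" if not problems else f"missing: {problems}")
```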

Data Preparation

Please refer to DATA.md for pre-training and downstream evaluation datasets.

Pre-training

We adopt curriculum learning to train the model: we first pre-train it on the image dataset CC3M and the video dataset WebVid-2M using 1 frame, and then on the video dataset WebVid-2M using 4 frames.

For 1-frame pre-training, a single frame contains no temporal dynamics that could correspond to verb phrases, so we train the model to answer only noun questions.
```
bash sctripts/train_1frame_mask_noun.sh
```
When the training loss converges, we get the model "MCQ_1frame.pth".

For 4-frame pre-training, to save computation and allow a comparatively large batch size for contrastive learning, we train the model to answer noun and verb questions sequentially. We first train the model to answer noun questions with "MCQ_1frame.pth" loaded in "configs/dist-4frame-mask-noun.json".
```
bash sctripts/train_4frame_mask_noun.sh
```
When the training loss converges, we get the model "MCQ_4frame_noun.pth". We then train the model to answer verb questions with "MCQ_4frame_noun.pth" loaded in "configs/dist-4frame-mask-verb.json".
```
bash sctripts/train_4frame_mask_verb.sh
```
When the training loss converges, we get the final model.
Our repo adopts multi-machine, multi-GPU training, with 32 A100 GPUs for 1-frame pre-training and 40 A100 GPUs for 4-frame pre-training.
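
The three stages form a chain in which each stage loads the checkpoint produced by the previous one. The sketch below only records that order using the script, config and checkpoint names given above and checks that the chain is consistent; it does not launch training. The `Stage` class and the helper functions are illustrative, not part of the repo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    script: str
    config: Optional[str]    # None where the text above names no config
    loads: Optional[str]     # checkpoint loaded before training
    produces: Optional[str]  # None for the final, unnamed model


STAGES = (
    Stage("sctripts/train_1frame_mask_noun.sh", None, None, "MCQ_1frame.pth"),
    Stage("sctripts/train_4frame_mask_noun.sh", "configs/dist-4frame-mask-noun.json",
          "MCQ_1frame.pth", "MCQ_4frame_noun.pth"),
    Stage("sctripts/train_4frame_mask_verb.sh", "configs/dist-4frame-mask-verb.json",
          "MCQ_4frame_noun.pth", None),
)


def chain_is_consistent(stages):
    """True if every stage loads exactly what the previous stage produced."""
    return all(nxt.loads == prev.produces for prev, nxt in zip(stages, stages[1:]))


def describe(stages):
    """Return one human-readable line per stage, in execution order."""
    lines = []
    for index, stage in enumerate(stages, 1):
        if stage.loads:
            source = f"load {stage.loads} in {stage.config}"
        else:
            source = "no earlier MCQ checkpoint"
        result = stage.produces or "the final model"
        lines.append(f"{index}. bash {stage.script} ({source}; yields {result})")
    return lines


if __name__ == "__main__":
    assert chain_is_consistent(STAGES)
    print("\n".join(describe(STAGES)))
```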

Pre-trained Model

Our pre-trained model can be downloaded from Pre-trained Model. It contains the weights of VideoFormer, TextFormer and BridgeFormer. For downstream evaluation, you only need to load the weights of VideoFormer and TextFormer, with BridgeFormer removed.
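
If you load the checkpoint by hand, one way to drop the BridgeFormer weights is to filter the state dict by key prefix. This is a minimal sketch: the prefix `bridge_model.` and the toy keys are hypothetical, so inspect the key names in the actual checkpoint before relying on it.

```python
def keep_encoder_weights(state_dict, drop_prefix="bridge_model."):
    """Return a copy of `state_dict` without the entries under `drop_prefix`.

    `drop_prefix` is a hypothetical name for the BridgeFormer parameters.
    """
    return {key: value for key, value in state_dict.items()
            if not key.startswith(drop_prefix)}


# Toy stand-in for a checkpoint; real keys and tensors will differ.
toy_checkpoint = {"video_model.w": 1, "text_model.w": 2, "bridge_model.w": 3}
kept = keep_encoder_weights(toy_checkpoint)
print(sorted(kept))  # ['text_model.w', 'video_model.w']

# With PyTorch, the filtered dict can then be passed to
# model.load_state_dict(kept, strict=False) on a model without BridgeFormer.
```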

Downstream Retrieval (Zero-shot on MSR-VTT)

Download our pre-trained model from Pre-trained Model (or use your own pre-trained model).
Load the pre-trained model in "configs/zero_msrvtt_4f_i21k.json".
```
bash sctripts/test_retrieval.sh
```
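
Text-to-video retrieval is commonly scored with recall at rank K (R@K): the fraction of text queries whose matching video appears among the top K results. The sketch below shows that computation on a toy similarity matrix. It is not the repo's evaluation code, and it assumes one caption per video with text i paired to video i.

```python
def recall_at_k(similarity, k):
    """similarity[i][j] is the score of text i against video j.

    Assumes the correct video for text i is video i. Ties are resolved in
    favour of the correct video.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the correct video = number of videos scored strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy scores: texts 0 and 2 rank their own video first, text 1 ranks it second.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.5, 0.8],
    [0.1, 0.4, 0.7],
]
print(recall_at_k(toy_similarity, 1))
print(recall_at_k(toy_similarity, 2))
```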

CLIP-initialized Pre-trained Model

We also initialize our model from CLIP weights to pre-train a model with MCQ. Specifically, we use the pre-trained CLIP (ViT-B/32) as the backbone of VideoFormer and TextFormer, and randomly initialize BridgeFormer. Our VideoFormer does not incur any additional parameters compared to the ViT of CLIP, with a parameter-free modification to allow for the input of video frames with variable length.

To evaluate the performance of the CLIP-initialized pre-trained model on text-to-video retrieval:

Download the model from CLIP-Initialized Pre-trained Model.
Load the pre-trained model in "configs/zero_msrvtt_4f_i21k_clip.json".
```
bash sctripts/test_retrieval_CLIP.sh
```

We also provide a script to extract video features for any given videos with the CLIP-initialized pre-trained model:

python extract_video_features_clip.py

To Do

Release pre-training code
Release pre-trained model
Release downstream evaluation code
Release CLIP-initialized model
Release video representation extraction code

License

MCQ is released under the BSD 3-Clause License.

Acknowledgement

Our code is based on the implementation of "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval" https://github.com/m-bain/frozen-in-time.git.

Citation

If our code is helpful to your work, please cite:

@article{ge2022bridgeformer,
  title={BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions},
  author={Ge, Yuying and Ge, Yixiao and Liu, Xihui and Li, Dian and Shan, Ying and Qie, Xiaohu and Luo, Ping},
  journal={arXiv preprint arXiv:2201.04850},
  year={2022}
}


Owner

Applied Research Center (ARC), Tencent PCG
