CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

Overview

CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

The implementation of paper CLIP2Video: Mastering Video-Text Retrieval via Image CLIP.

CLIP2Video is a video-text retrieval model based on CLIP (ViT-B/32), which transfers the image-language pre-training model to video-text retrieval in an end-to-end manner. Our model involves a Temporal Difference Block to capture motions at fine temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies, and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.

Pipeline Blocks

Introduction

This is the source code of CLIP2Video, a method for Video-Text Retrieval based on temporal correlations. It is built on top of the CLIP4Clip by ( Huaishao Luo et al.) in PyTorch.

Requirement

pip install -r requirements.txt 

Download data and Pre-trained Model

Supported public training sets:

  • MSR-VTT(9k)
  • MSR-VTT(full)
  • MSVD
  • VATEX-English Version

Supported public testing protocols:

  • MSR-VTT 1k-A protocol (SOTA)
  • MSR-VTT full protocol (SOTA)
  • MSVD(SOTA
  • VATEX-English version(SOTA

Download official video: Official videos of different data can be found as follows:

Pre-process

To train and test the above datasets: you should use sample_frame.py to transform video into frames.

python sample_frame.py --input_path [raw video path] --output_path [frame path]

(Optional) The splits and captions can be found in the links of used dataset. For the convenience, you can also use the split in data/ directly.

Download CLIP model

To train and test the above datasets based on pre-trained CLIP model, you should visit CLIP and download ViT-B/32.

Test Model

We provide three models trained on MSVD, MSR-VTT and VATEX-English.

Model Name checkpoint
CLIP2Video_MSVD link
CLIP2Video_MSRVTT9k link
CLIP2Video_VATEX link

To test the trained model, please refer test/.

(Optional) If the path of trained model(--checkpoint) doesn't exist, the parameters of basic CLIP (--clip_path) will be loaded.

Main Article Results of CLIP2Video

T2V:

Protocol R@1 R@5 R@10 Median Rank Mean Rank
MSVD 47.0 76.8 85.9 2 9.6
MSRVTT-9k 45.6 72.6 81.7 2 14.6
MSRVTT-Full 29.8 55.5 66.2 4 45.5
Vatex (English) random 1k5 split 57.3 90.0 95.5 1 3.6
Vatex (English) HGR split 61.2 90.9 95.6 1 3.4

V2T:

Protocol R@1 R@5 R@10 Median Rank Mean Rank
MSVD 58.7 85.6 91.6 1 4.3
MSRVTT-9k 43.5 72.3 82.1 2 10.2
MSRVTT-Full 54.6 82.1 90.8 1 5.3
Vatex (English) random 1k5 split 76.0 97.7 99.9 1 1.5
Vatex (English) HGR split 77.9 98.1 99.1 1 1.6

(Optional:) Clarification of different results in VATEX:

  1. In our paper, we do not strictly follow HGR's split, but randomly split the test set by ourselves, which is the split in

    • data/vatex_data/test1k5_sec_list.txt
  2. In HGR split, we adopt the totally same split following HGR, and the split can be seen as:

    • data/vatex_data/test_list.txt
    • data/vatex_data/val_list.txt

We will revise the results strictly following HGR split for fair comparison in the paper later!


Citation

If you find CLIP2Video useful in your work, you can cite the following paper:

@article{fang2021clip2video,
  title={CLIP2Video: Mastering Video-Text Retrieval via Image CLIP},
  author={Fang, Han and Xiong, Pengfei and Xu, Luhui and Chen, Yu},
  journal={arXiv preprint arXiv:2106.11097},
  year={2021}
}

Acknowledgments

Some components of this code implementation are adopted from CLIP and CLIP4Clip. We sincerely appreciate for their contributions.

You might also like...
Activity image-based video retrieval

Cross-modal-retrieval Our approach is focus on Activity Image-to-Video Retrieval (AIVR) task. The compared methods are state-of-the-art single modalit

A Joint Video and Image Encoder for End-to-End Retrieval
A Joint Video and Image Encoder for End-to-End Retrieval

Frozen️ in Time ❄️ ️️️️ ⏳ A Joint Video and Image Encoder for End-to-End Retrieval project page | arXiv | webvid-data Repository containing the code,

Code for 'Single Image 3D Shape Retrieval via Cross-Modal Instance and Category Contrastive Learning', ICCV 2021
Code for 'Single Image 3D Shape Retrieval via Cross-Modal Instance and Category Contrastive Learning', ICCV 2021

CMIC-Retrieval Code for Single Image 3D Shape Retrieval via Cross-Modal Instance and Category Contrastive Learning. ICCV 2021. Introduction In this wo

Official Implementation of CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback
Official Implementation of CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback

CoSMo.pytorch Official Implementation of CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback, Seungmin Lee*, Dongwan Kim*, Bohyung

Pytorch re-implementation of Paper: SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition (CVPR 2022)
Pytorch re-implementation of Paper: SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition (CVPR 2022)

SwinTextSpotter This is the pytorch implementation of Paper: SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text R

Apply AnimeGAN-v2 across frames of a video clip

title emoji colorFrom colorTo sdk app_file pinned AnimeGAN-v2 For Videos 🔥 blue red gradio app.py false AnimeGAN-v2 For Videos Apply AnimeGAN-v2 acro

Making a music video with Wav2CLIP and VQGAN-CLIP

music2video Overview A repo for making a music video with Wav2CLIP and VQGAN-CLIP. The base code was derived from VQGAN-CLIP The CLIP embedding for au

Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search

CLIP-GLaSS Repository for the paper Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search An in-browser demo is

[2021 MultiMedia] CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval
[2021 MultiMedia] CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval

CONQUER: Contexutal Query-aware Ranking for Video Corpus Moment Retreival PyTorch implementation of CONQUER: Contexutal Query-aware Ranking for Video

Comments
  • CVE-2007-4559 Patch

    CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have began a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15 year old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsantized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions you may contact us through this projects lead researcher Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • When will you publish the train script?

    When will you publish the train script?

    Hello! Thank you for the code, and I want to train the model by myself to prove the effectiveness of your method, when will you publish the train script?.

    @CryhanFang

    opened by starmemda 0
Owner
null
Image-retrieval-baseline - MUGE Multimodal Retrieval Baseline

MUGE Multimodal Retrieval Baseline This repo is implemented based on the open_cl

null 47 Dec 16, 2022
[ICML 2021] DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning | 斗地主AI

[ICML 2021] DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning DouZero is a reinforcement learning framework for DouDizhu (斗地主), t

Kwai Inc. 3.1k Jan 4, 2023
[arXiv22] Disentangled Representation Learning for Text-Video Retrieval

Disentangled Representation Learning for Text-Video Retrieval This is a PyTorch implementation of the paper Disentangled Representation Learning for T

Qiang Wang 49 Dec 18, 2022
Simple command line tool for text to image generation using OpenAI's CLIP and Siren (Implicit neural representation network)

Deep Daze mist over green hills shattered plates on the grass cosmic love and attention a time traveler in the crowd life during the plague meditative

Phil Wang 4.4k Jan 3, 2023
CLIP+FFT text-to-image

Aphantasia This is a text-to-image tool, part of the artwork of the same name. Based on CLIP model, with FFT parameterizer from Lucent library as a ge

vadim epstein 690 Jan 2, 2023
A simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN.

Ryan Murdock has done it again, combining OpenAI's CLIP and the generator from a BigGAN! This repository wraps up his work so it is easily accessible to anyone who owns a GPU.

Phil Wang 2.3k Jan 9, 2023
CLIP: Connecting Text and Image (Learning Transferable Visual Models From Natural Language Supervision)

CLIP (Contrastive Language–Image Pre-training) Experiments (Evaluation) Model Dataset Acc (%) ViT-B/32 (Paper) CIFAR100 65.1 ViT-B/32 (Our) CIFAR100 6

Myeongjun Kim 52 Jan 7, 2023
Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

AudioCLIP Extending CLIP to Image, Text and Audio This repository contains implementation of the models described in the paper arXiv:2106.13043. This

null 458 Jan 2, 2023
Zero-Shot Text-to-Image Generation VQGAN+CLIP Dockerized

VQGAN-CLIP-Docker About Zero-Shot Text-to-Image Generation VQGAN+CLIP Dockerized This is a stripped and minimal dependency repository for running loca

Kevin Costa 73 Sep 11, 2022
A Jupyter notebook to play with NVIDIA's StyleGAN3 and OpenAI's CLIP for a text-based guided image generation.

A Jupyter notebook to play with NVIDIA's StyleGAN3 and OpenAI's CLIP for a text-based guided image generation.

Eugenio Herrera 175 Dec 29, 2022