[arXiv22] Disentangled Representation Learning for Text-Video Retrieval

Qiang Wang

Last update: Dec 18, 2022

This is a PyTorch implementation of the paper Disentangled Representation Learning for Text-Video Retrieval:

@Article{DRLTVR2022,
  author  = {Qiang Wang and Yanhao Zhang and Yun Zheng and Pan Pan and Xian-Sheng Hua},
  journal = {arXiv:2203.07111},
  title   = {Disentangled Representation Learning for Text-Video Retrieval},
  year    = {2022},
}

Catalog

Setup
Fine-tuning code
Visualization demo

Setup

Set up the code environment

git clone https://github.com/foolwood/DRL.git
cd DRL
conda create -n drl python=3.9
conda activate drl
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html

Download the CLIP model (pretrained weights)

cd tvr/models
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt
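
A large checkpoint download can be silently truncated, so it may be worth verifying the file before training. The sketch below assumes that the long hexadecimal path segment in each CLIP URL is the SHA-256 digest of the file that follows it. The helper names (`expected_sha256`, `file_sha256`, `checkpoint_is_intact`) are illustrative and not part of this repository.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def expected_sha256(url: str) -> str:
    """Return the URL path segment just before the file name.

    Assumption: for the CLIP URLs above, this segment is the SHA-256 digest.
    """
    return urlparse(url).path.rstrip("/").split("/")[-2]


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a large checkpoint need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_is_intact(url: str, path: Path) -> bool:
    """True when the file's digest matches the digest embedded in the URL."""
    return file_sha256(path) == expected_sha256(url)


# Example call (run from the repository root after the download):
# checkpoint_is_intact(
#     "https://openaipublic.azureedge.net/clip/models/"
#     "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt",
#     Path("tvr/models/ViT-B-32.pt"),
# )
```

If the check returns False, re-running the corresponding `wget` command is the simplest fix.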

Download Datasets

# Run from the repository root (the previous step leaves you in tvr/models)
cd data/MSR-VTT
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip ; unzip MSRVTT.zip
mv MSRVTT/videos/all ./videos ; mv MSRVTT/annotation/MSR_VTT.json ./anns/MSRVTT_data.json
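
After the two `mv` commands, the training command expects the videos under `data/MSR-VTT/videos` and the annotation file under `data/MSR-VTT/anns`. The following sketch checks that layout before a long training run is launched. The function names are hypothetical, and the `.mp4` extension is an assumption about the unpacked archive.

```python
from pathlib import Path


def missing_paths(root: Path) -> list[str]:
    """List the expected dataset paths (relative to root) that do not exist."""
    expected = [root / "videos", root / "anns" / "MSRVTT_data.json"]
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]


def count_videos(root: Path, pattern: str = "*.mp4") -> int:
    """Count video files; the .mp4 pattern is an assumption, adjust if needed."""
    video_dir = root / "videos"
    return sum(1 for _ in video_dir.glob(pattern)) if video_dir.is_dir() else 0


# Example call from the repository root:
# print(missing_paths(Path("data/MSR-VTT")), count_videos(Path("data/MSR-VTT")))
```

An empty list from `missing_paths` means both locations used by `--anno_path` and `--video_path` are in place.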

Fine-tuning code

Train on MSR-VTT 1k.

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
main.py --do_train 1 --workers 8 --n_display 50 \
--epochs 5 --lr 1e-4 --coef_lr 1e-3 --batch_size 128 --batch_size_val 128 \
--anno_path data/MSR-VTT/anns --video_path data/MSR-VTT/videos --datatype msrvtt \
--max_words 32 --max_frames 12 --video_framerate 1 \
--base_encoder ViT-B/32 --agg_module seqTransf \
--interaction wti --wti_arch 2 --cdcr 3 --cdcr_alpha1 0.11 --cdcr_alpha2 0.0 --cdcr_lambda 0.001 \
--output_dir ckpts/ckpt_msrvtt_wti_cdcr
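
The `--interaction wti` flag selects the weighted token-wise interaction (WTI) from the paper. As a rough illustration of the idea, and not the repository's implementation, the sketch below scores one caption against one video in plain Python. It assumes the token and frame features are already L2-normalised and that the per-token and per-frame weights are given (in the paper they come from a small learned network). All names here are illustrative.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def wti_similarity(
    text_tokens: Sequence[Vector],
    video_frames: Sequence[Vector],
    text_weights: Sequence[float],
    frame_weights: Sequence[float],
) -> float:
    """Weighted token-wise similarity between one caption and one video."""
    # Token-by-frame similarity matrix: sim[i][j] = <text token i, frame j>.
    sim = [[dot(t, v) for v in video_frames] for t in text_tokens]

    # Text-to-video: each token keeps its best-matching frame,
    # then the per-token maxima are combined with the token weights.
    t2v = sum(w * max(row) for w, row in zip(text_weights, sim))

    # Video-to-text: each frame keeps its best-matching token,
    # then the per-frame maxima are combined with the frame weights.
    v2t = sum(w * max(col) for w, col in zip(frame_weights, zip(*sim)))

    # The two directions are averaged into a single score.
    return (t2v + v2t) / 2.0
```

With uniform weights this reduces to the unweighted token-wise interaction (TI in the ablation table), so the only difference between the two is how the per-token maxima are pooled.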

Reproducing the ablation experiments with the provided scripts

configs	feature	gpus	Text-Video R@1	Text-Video R@5	Text-Video R@10	Text-Video MdR	Text-Video MnR	Video-Text R@1	Video-Text R@5	Video-Text R@10	Video-Text MdR	Video-Text MnR	train time (h)
CLIP4Clip	ViT/B-32	4	42.8	72.1	81.4	2.0	16.3	44.1	70.5	80.5	2.0	11.8	10.5
zero-shot	ViT/B-32	4	31.1	53.7	63.4	4.0	41.6	26.5	50.1	61.7	5.0	39.9	-
Interaction
DP+None	ViT/B-32	4	42.9	70.6	81.4	2.0	15.4	43.0	71.1	81.1	2.0	11.8	2.5
DP+seqTransf	ViT/B-32	4	42.8	71.1	81.1	2.0	15.6	44.1	70.9	80.9	2.0	11.7	2.6
XTI+None	ViT/B-32	4	40.5	71.1	82.6	2.0	13.6	42.7	70.8	80.2	2.0	12.5	14.3
XTI+seqTransf	ViT/B-32	4	42.4	71.3	80.9	2.0	15.2	40.1	69.2	79.6	2.0	15.8	16.8
TI+seqTransf	ViT/B-32	4	44.8	73.0	82.2	2.0	13.4	42.6	72.7	82.8	2.0	9.1	2.6
WTI+seqTransf	ViT/B-32	4	46.6	73.4	83.5	2.0	13.0	45.4	73.4	81.9	2.0	9.2	2.6
Channel DeCorrelation Regularization
DP+seqTransf+CDCR	ViT/B-32	4	43.9	71.1	81.2	2.0	15.3	42.3	70.3	81.1	2.0	11.4	2.6
TI+seqTransf+CDCR	ViT/B-32	4	45.8	73.0	81.9	2.0	12.8	43.3	71.8	82.7	2.0	8.9	2.6
WTI+seqTransf+CDCR	ViT/B-32	4	47.6	73.4	83.3	2.0	12.8	45.1	72.9	83.5	2.0	9.2	2.6

Note: the results above are slightly improved owing to new hyperparameters.
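
The column abbreviations in the table are standard retrieval metrics: R@K is the percentage of queries whose correct match ranks in the top K (higher is better), while MdR and MnR are the median and mean rank of the correct match (lower is better). A minimal sketch of how they can be computed, assuming a square similarity matrix in which the correct candidate for query i sits at index i. The function name is illustrative and not taken from the repository.

```python
from statistics import mean, median
from typing import Dict, Sequence


def retrieval_metrics(sim: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Compute R@1/5/10, MdR and MnR from sim[query][candidate].

    Assumption: the ground-truth candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # 1-based rank: candidates scoring strictly higher than the truth, plus one.
        ranks.append(1 + sum(1 for score in row if score > row[i]))

    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in (1, 5, 10)}
    metrics["MdR"] = median(ranks)
    metrics["MnR"] = mean(ranks)
    return metrics
```

Passing the text-by-video matrix gives the Text-Video columns. Transposing it so that videos are the queries gives the Video-Text columns.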

Visualization demo

The visualization demo uses matplotlib and does not need a GPU.

License

See LICENSE for details.

Acknowledgments

Our code is partly based on CLIP4Clip.
