A PyTorch implementation of VIOLET

Tsu-Jui Fu

Last update: Dec 30, 2022

Overview

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

VIOLET is an implementation of
"VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling"
Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu

VIOLET contains three components: a Video Swin Transformer (VT) that computes video features, a Language Embedder (LE) that extracts word embeddings, and a Cross-modal Transformer (CT) that performs cross-modal fusion. To benefit from large-scale data, we incorporate three pretraining tasks: Masked Language Modeling (MLM) predicts the masked word tokens, Masked Visual-token Modeling (MVM) recovers the masked video patches, and Visual-Text Matching (VTM) learns the alignment between the video and text modalities.
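
MLM and MVM follow the same masking recipe: hide a random subset of tokens and train the model to predict the originals at those positions. The sketch below illustrates that recipe in plain Python. The mask ids, the IGNORE label, the 15% ratio, and the token values are illustrative assumptions, not values taken from this repository.

```python
import random

IGNORE = -1  # hypothetical label: position was not masked, so no loss there


def mask_tokens(tokens, mask_id, ratio, rng):
    """Return (inputs, targets) for masked-token modeling.

    inputs:  copy of tokens with a random subset replaced by mask_id.
    targets: the original token at masked positions, IGNORE elsewhere.
    Assumes tokens is non-empty; at least one position is always masked.
    """
    n_mask = max(1, round(len(tokens) * ratio))
    masked = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [mask_id if i in masked else t for i, t in enumerate(tokens)]
    targets = [t if i in masked else IGNORE for i, t in enumerate(tokens)]
    return inputs, targets


rng = random.Random(0)
word_ids = [101, 7, 42, 9, 13, 55, 102]            # made-up word-token ids (MLM)
vq_ids = [rng.randrange(8192) for _ in range(16)]  # made-up VQ-token ids (MVM)

txt_in, txt_tgt = mask_tokens(word_ids, mask_id=103, ratio=0.15, rng=rng)
vid_in, vid_tgt = mask_tokens(vq_ids, mask_id=8192, ratio=0.15, rng=rng)
```

The same function serves both tasks; only the token vocabulary differs (word tokens for MLM, discrete visual tokens for MVM).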

Requirements

This code is implemented under Python 3.8, PyTorch 1.7, and Torchvision 0.8.

Usage

Data preprocessing

Because we use external datasets that we cannot redistribute, we provide preprocessing tools that extract sparsely sampled video frames into our compressed format.
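
Sparse sampling keeps only a handful of frames per video. One common way to pick them is to split the video into equal segments and take the middle frame of each. The sketch below shows that idea only; it is an assumption for illustration and may differ from what extract_video-frame.py actually does.

```python
def sparse_sample_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly over a video.

    The video is split into num_samples equal segments and the middle
    frame of each segment is taken. Videos shorter than num_samples
    repeat some frames.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    step = num_frames / num_samples
    return [min(num_frames - 1, int(step * (i + 0.5))) for i in range(num_samples)]


print(sparse_sample_indices(100, 5))  # [10, 30, 50, 70, 90]
print(sparse_sample_indices(3, 5))    # [0, 0, 1, 2, 2]
```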

cd _tools

# We use 4 frames during pretraining and 5 frames for downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl

# We use DALL-E to extract VQ tokens for MVM pretraining
wget https://cdn.openai.com/dall-e/encoder.pkl # download trained dall-e encoder
python extract_vq.py --path=msrvtt --frame=224 # output: msrvtt_vq.pkl

# We adopt file.seek() instead of loading entire data to reduce the memory cost during distributed pretraining
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx
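
The file.seek() approach mentioned above can be sketched as follows. This assumes the .lineidx file stores one byte offset per TSV row, which is a common convention but is not confirmed here for extract_tsv.py; the function names, file names, and row contents are hypothetical.

```python
import os
import tempfile


def build_lineidx(tsv_path, idx_path):
    """Write the byte offset of every TSV row, one offset per line."""
    with open(tsv_path, "rb") as tsv, open(idx_path, "w") as idx:
        offset = 0
        for line in tsv:
            idx.write(f"{offset}\n")
            offset += len(line)


def read_row(tsv_path, offsets, i):
    """Jump straight to row i with seek() and read only that row."""
    with open(tsv_path, "rb") as tsv:
        tsv.seek(offsets[i])
        return tsv.readline().decode("utf-8").rstrip("\n").split("\t")


# Tiny demo with made-up rows (hypothetical columns: id, payload).
tmp = tempfile.mkdtemp()
tsv_path = os.path.join(tmp, "demo.tsv")
idx_path = os.path.join(tmp, "demo.lineidx")
with open(tsv_path, "w", encoding="utf-8", newline="\n") as f:
    f.write("video0\tAAAA\nvideo1\tBBBBBB\nvideo2\tCC\n")

build_lineidx(tsv_path, idx_path)
with open(idx_path) as f:
    offsets = [int(x) for x in f]

row = read_row(tsv_path, offsets, 2)  # only the third row is read from disk
```

Each worker then holds just the list of offsets in memory and fetches rows on demand, instead of every process loading the whole dataset.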

There are partial examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) to help you format the input data.

Pretraining

Put the pretrained VT in ./_snapshot. This script pretrains on both video (WebVid2.5M) and image (CC3M) data via single-node multi-GPU distributed training.

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=7122 main_pretrain.py
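
With --nproc_per_node=4, torch.distributed.launch starts four processes and, in PyTorch 1.7, passes each one a --local_rank argument identifying its GPU. The sketch below shows how a launched script can read that argument with argparse. It is a generic illustration with a hypothetical function name, not code taken from main_pretrain.py.

```python
import argparse


def parse_launch_args(argv=None):
    """Read the --local_rank value that the launcher appends per process."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    # parse_known_args ignores any other arguments the script may define.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_launch_args(["--local_rank=2"])
print(args.local_rank)  # 2
```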

Here is our best pretrained checkpoint (YT180M+WebVid2.5M+CC3M).

Downstream

Multiple-Choice Question Answering (TGIF-Action, TGIF-Transition, MSRVTT-MC, and LSMDC-MC)

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qamc.py _data/args_tgif-action.json

Open-Ended Question Answering (TGIF-Frame, MSRVTT-QA, LSMDC-FiB, and MSVD-QA)

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qaoe.py _data/args_msvd-qa.json

Text-to-Video Retrieval (MSRVTT, DiDeMo, YouCook2, and LSMDC)

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_retrieval.py _data/args_msrvtt-retrieval.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval.py _data/args_msrvtt-retrieval.json
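
Text-to-video retrieval is usually reported as Recall@K: the fraction of text queries whose correct video is ranked in the top K. The sketch below computes it from a similarity matrix, assuming text i is paired with video i. The matrix values are made up, and eval_retrieval.py may compute the metric differently.

```python
def recall_at_k(sim, k):
    """Text-to-video Recall@K.

    sim[i][j] is the similarity between text i and video j; the correct
    video for text i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the correct video = number of videos scored strictly higher.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


sim = [
    [0.9, 0.1, 0.3],  # text 0: correct video ranked first
    [0.2, 0.4, 0.8],  # text 1: correct video ranked second
    [0.5, 0.6, 0.7],  # text 2: correct video ranked first
]
r1 = recall_at_k(sim, 1)  # 2 of 3 queries hit at K=1
r2 = recall_at_k(sim, 2)  # all queries hit at K=2
```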

We also provide all trained downstream checkpoints.

Citation

@inproceedings{fu2021violet,
  author = {Tsu-Jui Fu and Linjie Li and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu},
  title = {VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling},
  booktitle = {arXiv:2111.1268},
  year = {2021}
}

Comments

Swin Base or Small

Hi, I noticed that the output dimension of the original Swin-Base is 1024, but according to your code, the output is 768. Did you use Swin-Small for the experiments?

opened by vateye 5
MVM for CLIP feature

Hi, I would like to know how the loss between the VideoSwin and CLIP features is computed in the latest paper. The Swin family of models uses a patch size of 4x4, whereas ViT uses a patch size of 16. How do you compute the loss (e.g., an L1 loss) between the two?

Thanks.

opened by vateye 3
I found some wrong places, can you see if it's right?

In main_pretrain.py, line 165: p = (1+_h*_w)*i_t + i_h*_w + i_w

I think it should add 1: p = (1+_h*_w)*i_t + i_h*_w + i_w + 1

Because your first position is for the separator.
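
To see what the two formulas index, the sketch below enumerates every patch position for a tiny grid and lists the slots no patch maps to. It assumes each frame occupies 1 + _h*_w consecutive slots (the patches plus one extra token); the helper names are hypothetical and the actual layout in main_pretrain.py is not confirmed here.

```python
def patch_position(i_t, i_h, i_w, _h, _w, offset=0):
    """Flat index of patch (i_h, i_w) in frame i_t.

    Each frame is assumed to occupy 1 + _h*_w consecutive slots:
    _h*_w patch slots plus one extra (special-token) slot.
    """
    return (1 + _h * _w) * i_t + i_h * _w + i_w + offset


def unused_slots(_t, _h, _w, offset):
    """Slots that no patch maps to, i.e. where the extra token would sit."""
    used = {patch_position(t, h, w, _h, _w, offset)
            for t in range(_t) for h in range(_h) for w in range(_w)}
    return sorted(set(range(_t * (1 + _h * _w))) - used)


print(unused_slots(2, 2, 2, offset=0))  # [4, 9]: last slot of each frame
print(unused_slots(2, 2, 2, offset=1))  # [0, 5]: first slot of each frame
```

Under this assumption, the original formula leaves the last slot of each frame free, while the +1 variant leaves the first slot free, so the right choice depends on where the extra token sits within each frame.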

opened by bubbliiiing 3
msvd-qa test

Thanks for your great work! I have some questions about the msvd-qa test split. You use an answer set for msvd-qa. How do you deal with questions whose answers are not in the answer set? Do you just throw them away? When I throw them away, I get 11983 QA pairs from the original test file (13157 QA pairs). Using the finetuned checkpoint you provided, I get a lower accuracy of 0.4554.

opened by fangz-cs 2

Error step_pretrain on Rank

Hello, I ran the pre-training model in an environment with 4 GPUs, and "Error step_pretrain on Rank 1, 3, 2 0" is displayed; the pre-training does not succeed.

opened by lileiooo 1

Treat FFOE Video QA as a classification task on answers candidate set

VIOLET uses the most common answers as candidates, but there are other answers in the test set. How do you deal with them? Are they discarded according to txt_msvd.json?

opened by jiyt17 1
Performance check

Hi, thank you for sharing the code and models.

I have used ckpt_violet_pretrain.pt and ckpt_violet_msrvtt-retrieval with our own data processing (5 frames with interval num_frames // 5) for MSRVTT t2v retrieval evaluation. I got rank@1 of 22.6/32.9, which is lower than the numbers (25.9/34.7) in the paper. I also tested the CLIP model and got a similar result. Do the released models achieve the reported results? If yes, could you provide the processing pipeline or describe how to get the reported performance? Thank you!

opened by Flowerfan 7