VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

Related tags

Deep Learning vimpac
Overview

VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

This is a release of our VIMPAC paper to illustrate the implementations. The pretrained checkpoints and scripts will be soon open-sourced in HuggingFace transformers.

Authors: Hao Tan, Jie Lei, Thomas Wolf, Mohit Bansal

Data Preprocessing

Please refer to video2token folder for the detailed README file.

For pre-training, the dataset is usually large, and we suggest to use FPS=2 during extraction. For downstream tasks, we suggest using FPS=16 that enables a higher frame rate for short videos.

We recommend to store the data locally at data/video_tokens. If different paths are used, please specify the path of VIDEO_CODE_PATHS and VIDEO_ANNO_PATHS in vimpac/data.py.

Pre-Trained Weights

We provide the pre-trained weights with their links. Please download the pre-trained weight and extract them under snap/.

Pre-Training

The default pre-training uses the HowTo100M dataset. The pre-training data could be switched to Kinetics-700 and other datasets by specifying the --dataset-name argument. We have validated that the mask-then-predict task works reasonablely well on Kinetics-700 datasets. However, the average length of video clips inside K-700 is 10 seconds thus not sure supporting the long-range contrastive learning.

Small Model

We first provide the script to pre-train a small model (6 layers, 512 dimensions, 256 frame-size, and 5 clip length):

bash scripts/pretrain/small.sh 0,1,2,3

We here annotate some essential arguments inside the pre-training scripts. For a full descriptions for all the arguments, please check param.py

We also provide two debugging options:

# bash scripts/pretrain/small.sh 0,1,2,3 --tqdm        # Show progress bar.
# bash scripts/pretrain/small.sh 0,1,2,3 --debug       # Only run a few steps per epoch.

Large Model

We follow BERT to pre-train our large model in two stages. The first stage pretrains for 90 epochs using frame-size 128 and clip-length 5. The second stage pretrains for 10 epochs using frame-size 256 and clip-length 5.

Scripts for the first stage:

bash scripts/pretrain/large.sh 0,1,2,3

Then we could directly run the script for the second stage without any further changes. It will load the last snapshot from the first stage, do interpolation for larger spatial size, and continue pre-training.

bash scripts/pretrain/large_frame256cont.sh 0,1,2,3

Fine-Tuning

After run the pre-training in pre-training or download the pre-trained weights from pre-trained-weights, we fine-tune the models on several downstream tasks. The arguments in these scripts are consistent with the hyperparameters in the paper. Please refer to Table 11 and Table 12 of our paper for a detailed list of all these hyperparameters.

SSV2

bash scripts/finetune/small_ssv2.sh 0,1,2,3

Diving48

bash scripts/finetune/small_diving48.sh 0,1,2,3

UCF101

bash scripts/finetune/small_ucf101.sh 0,1,2,3

HMDB51

bash scripts/finetune/small_hmdb51.sh 0,1,2,3

Change the Input Shape

Following ViT, we support the use of different input sizes from pre-training by interpolating the positional embedding. This is done by passing the --different-shape option. Otherwise, an error will pop up if the fine-tuning input shape is different from the pre-training. A larger input shape generally improves the results. We here take SSV2 as an example.

Longer clip length (10; default 5):

bash scripts/finetune/small_ssv2.sh 0,1,2,3 --different-shape --clip-len 10 --bs-per-gpu 4

Long clip length (10; default 5) + higher frame rate (4; default 2)

bash scripts/finetune/small_ssv2.sh 0,1,2,3 --different-shape --clip-len 10 --frame-rate 4 --bs-per-gpu 4

Long clip length (10; default 5) + higher frame rate (4; default 2) + larger input size (256; default 128). Please also make sure that VQ-VAE code with input-size 256 has been extracted as in Pre-processing.

bash scripts/finetune/small_ssv2.sh 0,1,2,3 --different-shape --clip-len 10 --frame-rate 4 --frame-size 256 --bs-per-gpu 2

Large Models

We provide scripts to run large models. Frame 128:

bash scripts/finetune/large_frame128_ucf101.sh 0,1,2,3

Frame 256:

bash scripts/finetune/large_frame256_ucf101.sh 0,1,2,3

The input shape could be changed as in change input shape. Our final model use the scripts of:

bash scripts/finetune/large_frame256_ucf101.sh 0,1,2,3 --different-shape --clip-len 10 --frame-rate 4 --frame-size 256 --bs-per-gpu 2

Acknowledgement

This work was granted access to the HPC resources of IDRIS under the allocation 20XX-AD011011621R1 made by GENCI. We thank Teven Le Scao and Victor Sanh for their help on the way.

You might also like...
PyTorch implementation for COMPLETER: Incomplete Multi-view Clustering via Contrastive Prediction (CVPR 2021)
PyTorch implementation for COMPLETER: Incomplete Multi-view Clustering via Contrastive Prediction (CVPR 2021)

Completer: Incomplete Multi-view Clustering via Contrastive Prediction This repo contains the code and data of the following paper accepted by CVPR 20

The code for our paper
The code for our paper "NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task —— Next Sentence Prediction"

The code for our paper "NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task —— Next Sentence Prediction"

TAPEX: Table Pre-training via Learning a Neural SQL Executor
TAPEX: Table Pre-training via Learning a Neural SQL Executor

TAPEX: Table Pre-training via Learning a Neural SQL Executor The official repository which contains the code and pre-trained models for our paper TAPE

Code for the paper
Code for the paper "Training GANs with Stronger Augmentations via Contrastive Discriminator" (ICLR 2021)

Training GANs with Stronger Augmentations via Contrastive Discriminator (ICLR 2021) This repository contains the code for reproducing the paper: Train

We present a framework for training multi-modal deep learning models on unlabelled video data by forcing the network to learn invariances to transformations applied to both the audio and video streams.

Multi-Modal Self-Supervision using GDT and StiCa This is an official pytorch implementation of papers: Multi-modal Self-Supervision from Generalized D

pytorch implementation of
pytorch implementation of "Contrastive Multiview Coding", "Momentum Contrast for Unsupervised Visual Representation Learning", and "Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination"

Unofficial implementation: MoCo: Momentum Contrast for Unsupervised Visual Representation Learning (Paper) InsDis: Unsupervised Feature Learning via N

Video-Captioning - A machine Learning project to generate captions for video frames indicating the relationship between the objects in the video
Video-Captioning - A machine Learning project to generate captions for video frames indicating the relationship between the objects in the video

Video-Captioning - A machine Learning project to generate captions for video frames indicating the relationship between the objects in the video

Re-implementation of the Noise Contrastive Estimation algorithm for pyTorch, following "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models." (Gutmann and Hyvarinen, AISTATS 2010)

Noise Contrastive Estimation for pyTorch Overview This repository contains a re-implementation of the Noise Contrastive Estimation algorithm, implemen

Price-Prediction-For-a-Dream-Home - A machine learning based linear regression trained model for house price prediction.
Price-Prediction-For-a-Dream-Home - A machine learning based linear regression trained model for house price prediction.

Price-Prediction-For-a-Dream-Home ROADMAP TO THIS LINEAR REGRESSION BASED HOUSE PRICE PREDICTION PREDICTION MODEL Import all the dependencies of the p

Comments
  • HTM resize-shorter side size?

    HTM resize-shorter side size?

    Hi,May I ask what size the shorter side was resized to in the pre-training data processing before center crop(HTM dataset)? I did not find relevant information in the article and code.

    opened by dongzhiwu 0
  • Visual Token of HowTo100M

    Visual Token of HowTo100M

    Hi, do you transform the raw videos of HTM datasets into visual tokens during the pre-training? And how large of the total size of its visual tokens? Since HTM takes 12T space, I'm curious about the size of its visual tokens.

    opened by zhengsipeng 3
Owner
Hao Tan
NLP @ UNC Chapel Hill
Hao Tan
ConvMAE: Masked Convolution Meets Masked Autoencoders

ConvMAE ConvMAE: Masked Convolution Meets Masked Autoencoders Peng Gao1, Teli Ma1, Hongsheng Li2, Jifeng Dai3, Yu Qiao1, 1 Shanghai AI Laboratory, 2 M

Alpha VL Team of Shanghai AI Lab 345 Jan 8, 2023
The official code for PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization

PRIMER The official code for PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization. PRIMER is a pre-trained model for mu

AI2 114 Jan 6, 2023
The official PyTorch implementation of recent paper - SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training

This repository is the official PyTorch implementation of SAINT. Find the paper on arxiv SAINT: Improved Neural Networks for Tabular Data via Row Atte

Gowthami Somepalli 284 Dec 21, 2022
FocusFace: Multi-task Contrastive Learning for Masked Face Recognition

FocusFace This is the official repository of "FocusFace: Multi-task Contrastive Learning for Masked Face Recognition" accepted at IEEE International C

Pedro Neto 21 Nov 17, 2022
Saeed Lotfi 28 Dec 12, 2022
Code and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders

MultiMAE: Multi-modal Multi-task Masked Autoencoders Roman Bachmann*, David Mizrahi*, Andrei Atanov, Amir Zamir Website | arXiv | BibTeX Official PyTo

Visual Intelligence & Learning Lab, Swiss Federal Institute of Technology (EPFL) 385 Jan 6, 2023
Source code and dataset for ACL2021 paper: "ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning".

ERICA Source code and dataset for ACL2021 paper: "ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive L

THUNLP 75 Nov 2, 2022
Code of our paper "Contrastive Object-level Pre-training with Spatial Noise Curriculum Learning"

CCOP Code of our paper Contrastive Object-level Pre-training with Spatial Noise Curriculum Learning Requirement Install OpenSelfSup Install Detectron2

Chenhongyi Yang 21 Dec 13, 2022
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

DeCLIP Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm. Our paper is available in arxiv Updates ** Ou

Sense-GVT 470 Dec 30, 2022
CLIP (Contrastive Language–Image Pre-training) trained on Indonesian data

CLIP-Indonesian CLIP (Radford et al., 2021) is a multimodal model that can connect images and text by training a vision encoder and a text encoder joi

Galuh 17 Mar 10, 2022