VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning


This repository releases the code for our VIMPAC paper to illustrate the implementation. The pre-trained checkpoints and scripts will soon be open-sourced in HuggingFace transformers.

Authors: Hao Tan, Jie Lei, Thomas Wolf, Mohit Bansal

Data Preprocessing

Please refer to the README file in the video2token folder for details.

For pre-training, the dataset is usually large, so we suggest using FPS=2 during extraction. For downstream tasks, we suggest using FPS=16, which gives a higher frame rate for short videos.
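To give a feel for the difference between the two suggested frame rates, here is a rough sketch of how many frames each setting keeps per video. It assumes frames are sampled uniformly at the given FPS. The helper name extracted_frames is hypothetical, and the actual extraction script in video2token may round differently.

```python
def extracted_frames(duration_sec: float, fps: int) -> int:
    """Approximate number of frames kept when sampling a video at a fixed FPS.

    Assumption: uniform sampling, truncating any partial frame at the end.
    """
    return int(duration_sec * fps)


# Compare the two suggested settings on a 10-second video.
for fps in (2, 16):
    print(f"FPS={fps}: {extracted_frames(10.0, fps)} frames")
```

Under this assumption, FPS=16 stores eight times as many frames per video as FPS=2, which is why the lower rate is the practical choice for a large pre-training set.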

We recommend storing the data locally at data/video_tokens. If different paths are used, please set VIDEO_CODE_PATHS and VIDEO_ANNO_PATHS in vimpac/data.py accordingly.

Pre-Trained Weights

We provide the pre-trained weights with their links. Please download the pre-trained weights and extract them under snap/.

Pre-Training

The default pre-training uses the HowTo100M dataset. The pre-training data can be switched to Kinetics-700 or other datasets by specifying the --dataset-name argument. We have validated that the mask-then-predict task works reasonably well on Kinetics-700. However, the average length of video clips in K-700 is 10 seconds, so we are not sure whether it supports long-range contrastive learning.

Small Model

We first provide the script to pre-train a small model (6 layers, 512 dimensions, 256 frame-size, and 5 clip length):

bash scripts/pretrain/small.sh 0,1,2,3

We annotate some essential arguments inside the pre-training scripts. For a full description of all the arguments, please check param.py.

We also provide two debugging options:

# bash scripts/pretrain/small.sh 0,1,2,3 --tqdm        # Show progress bar.
# bash scripts/pretrain/small.sh 0,1,2,3 --debug       # Only run a few steps per epoch.

Large Model

We follow BERT and pre-train our large model in two stages. The first stage pre-trains for 90 epochs using frame-size 128 and clip-length 5. The second stage pre-trains for 10 epochs using frame-size 256 and clip-length 5.

Scripts for the first stage:

bash scripts/pretrain/large.sh 0,1,2,3

The script for the second stage can then be run directly without any further changes. It loads the last snapshot from the first stage, performs interpolation for the larger spatial size, and continues pre-training.

bash scripts/pretrain/large_frame256cont.sh 0,1,2,3
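To see roughly why most of the epochs are spent at frame-size 128, the sketch below counts the visual tokens in one clip for each stage. It assumes that the VQ-VAE downsamples each frame by a factor of 8 per side; this factor is not stated in this README, so treat it as an assumption. The function name tokens_per_clip is hypothetical.

```python
def tokens_per_clip(frame_size: int, clip_len: int, downsample: int = 8) -> int:
    """Number of discrete tokens in one clip.

    Assumption: each frame of frame_size x frame_size pixels becomes a
    (frame_size // downsample) x (frame_size // downsample) grid of tokens.
    """
    side = frame_size // downsample
    return clip_len * side * side


stage1 = tokens_per_clip(frame_size=128, clip_len=5)
stage2 = tokens_per_clip(frame_size=256, clip_len=5)
print(f"stage 1: {stage1} tokens, stage 2: {stage2} tokens, "
      f"ratio: {stage2 // stage1}x")
```

Under this assumption, doubling the frame size quadruples the sequence length, so running the longer first stage at the smaller size keeps pre-training cheaper.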

Fine-Tuning

After running the pre-training described in Pre-Training, or downloading the pre-trained weights listed in Pre-Trained Weights, we fine-tune the models on several downstream tasks. The arguments in these scripts are consistent with the hyperparameters in the paper. Please refer to Table 11 and Table 12 of our paper for a detailed list of all these hyperparameters.

SSV2

bash scripts/finetune/small_ssv2.sh 0,1,2,3

Diving48

bash scripts/finetune/small_diving48.sh 0,1,2,3

UCF101

bash scripts/finetune/small_ucf101.sh 0,1,2,3

HMDB51

bash scripts/finetune/small_hmdb51.sh 0,1,2,3

Change the Input Shape

Following ViT, we support fine-tuning with an input shape that differs from pre-training by interpolating the positional embedding. This is enabled by passing the --different-shape option; without it, an error is raised if the fine-tuning input shape differs from the pre-training one. A larger input shape generally improves the results. We take SSV2 as an example.

Longer clip length (10; default 5):

bash scripts/finetune/small_ssv2.sh 0,1,2,3 --different-shape --clip-len 10 --bs-per-gpu 4

Longer clip length (10; default 5) + higher frame rate (4; default 2):

bash scripts/finetune/small_ssv2.sh 0,1,2,3 --different-shape --clip-len 10 --frame-rate 4 --bs-per-gpu 4

Longer clip length (10; default 5) + higher frame rate (4; default 2) + larger input size (256; default 128). Please also make sure that the VQ-VAE codes with input size 256 have been extracted as described in Data Preprocessing.

bash scripts/finetune/small_ssv2.sh 0,1,2,3 --different-shape --clip-len 10 --frame-rate 4 --frame-size 256 --bs-per-gpu 2
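The positional-embedding interpolation behind --different-shape can be illustrated with a minimal sketch along the temporal axis only. This is not the repository's actual code, which may use a different interpolation mode and also handles the spatial axes. The function name interpolate_positions and the toy table are hypothetical. The sketch linearly resamples a table of per-frame position vectors from clip length 5 to clip length 10.

```python
from typing import List


def interpolate_positions(table: List[List[float]], new_len: int) -> List[List[float]]:
    """Linearly resample an (old_len, dim) position table to (new_len, dim).

    The first and last rows of the old table are mapped onto the first and
    last rows of the new one; rows in between are linear blends of neighbors.
    """
    old_len = len(table)
    if old_len == 0 or new_len < 1:
        raise ValueError("table must be non-empty and new_len must be >= 1")
    if old_len == 1 or new_len == 1:
        # Nothing to blend: repeat the first row.
        return [list(table[0]) for _ in range(new_len)]

    resampled = []
    for i in range(new_len):
        # Fractional location of new index i on the old axis.
        pos = i * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        w = pos - lo
        resampled.append(
            [(1.0 - w) * a + w * b for a, b in zip(table[lo], table[hi])]
        )
    return resampled


# Toy "pre-trained" temporal embedding for clip length 5, with 2 dimensions.
pretrained = [[float(t), 10.0 * t] for t in range(5)]
finetune = interpolate_positions(pretrained, 10)
print(len(finetune), finetune[0], finetune[-1])
```

Because the new table is derived from the old one, the pre-trained weights remain usable at the new clip length instead of having to be re-learned from scratch.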

Large Models

We provide scripts to run large models. Frame 128:

bash scripts/finetune/large_frame128_ucf101.sh 0,1,2,3

Frame 256:

bash scripts/finetune/large_frame256_ucf101.sh 0,1,2,3

The input shape can be changed as described in Change the Input Shape. Our final model uses the following script:

bash scripts/finetune/large_frame256_ucf101.sh 0,1,2,3 --different-shape --clip-len 10 --frame-rate 4 --frame-size 256 --bs-per-gpu 2

Acknowledgement

This work was granted access to the HPC resources of IDRIS under the allocation 20XX-AD011011621R1 made by GENCI. We thank Teven Le Scao and Victor Sanh for their help on the way.
