VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Overview

VideoMAE Framework

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training [arXiv]
Zhan Tong, Yibing Song, Jue Wang, Limin Wang
Nanjing University, Tencent AI Lab

📰 News

[2022.4.24] Code and pre-trained models are available now! Please give us a star ⭐️ if you find this project helpful. 😆
[2022.4.15] The LICENSE of this project has been changed to CC-BY-NC 4.0.
[2022.3.24] Code and pre-trained models will be released here. Watch this repository for the latest updates.

Highlights

🔥 Masked Video Modeling for Video Pre-Training

VideoMAE performs masked video modeling for video pre-training. We propose an extremely high masking ratio (90%-95%) and a tube masking strategy to create a challenging task for self-supervised video pre-training.
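
A minimal PyTorch sketch of the tube masking idea is shown below: one spatial mask is sampled per video and shared across all temporal positions, so a masked patch stays invisible in every frame. The function name, shapes, and token-grid sizes are illustrative assumptions, not the repository's implementation.

    import torch

    def tube_masking(batch_size, num_frames, patches_per_frame, mask_ratio=0.9):
        """Sample one spatial mask per video and repeat it along time ("tube" masking)."""
        num_masked = int(mask_ratio * patches_per_frame)
        masks = []
        for _ in range(batch_size):
            # Randomly pick which spatial positions are masked for this video.
            ids = torch.rand(patches_per_frame).argsort()
            spatial_mask = torch.zeros(patches_per_frame, dtype=torch.bool)
            spatial_mask[ids[:num_masked]] = True                 # True = masked
            # The same spatial mask is shared by every temporal position.
            masks.append(spatial_mask.unsqueeze(0).expand(num_frames, -1))
        return torch.stack(masks)                                 # (B, T, N) boolean mask

    # Example token grid (assumed sizes): 8 temporal positions x 14x14 spatial patches.
    mask = tube_masking(batch_size=2, num_frames=8, patches_per_frame=196)
    print(mask.shape, mask.float().mean().item())                 # torch.Size([2, 8, 196]), ~0.9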

⚡️ A Simple, Efficient and Strong Baseline in SSVP

VideoMAE uses a simple masked autoencoder with a plain ViT backbone to perform video self-supervised learning. Due to the extremely high masking ratio, the pre-training time of VideoMAE is much shorter than that of contrastive learning methods (3.2x speedup). VideoMAE can serve as a simple but strong baseline for future research in self-supervised video pre-training.
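
The speedup comes from the asymmetric masked-autoencoder design: the ViT encoder runs only on the small set of visible tokens, while a lightweight decoder reconstructs the masked ones. The sketch below is a deliberately tiny illustration of that forward pass under our own simplifications (patch embedding and positional embeddings are omitted; layer counts and names are placeholders, not the repository's API).

    import torch
    import torch.nn as nn

    class TinyVideoMAE(nn.Module):
        """Toy asymmetric masked autoencoder: the encoder sees only visible tokens."""
        def __init__(self, dim=768, dec_dim=384, patch_dim=1536):
            super().__init__()
            enc_layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
            dec_layer = nn.TransformerEncoderLayer(dec_dim, nhead=6, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
            self.enc_to_dec = nn.Linear(dim, dec_dim)
            self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
            self.decoder = nn.TransformerEncoder(dec_layer, num_layers=1)
            self.head = nn.Linear(dec_dim, patch_dim)   # predict the raw pixels of one tubelet

        def forward(self, tokens, mask):
            # tokens: (B, N, dim) patch embeddings; mask: (B, N) bool, True = masked.
            # Assumes every sample has the same number of visible tokens (true for tube masking).
            B, N, D = tokens.shape
            visible = tokens[~mask].reshape(B, -1, D)   # encoder input: only ~10% of the tokens
            latent = self.enc_to_dec(self.encoder(visible))
            full = self.mask_token.expand(B, N, -1).clone()
            full[~mask] = latent.reshape(-1, latent.shape[-1])
            return self.head(self.decoder(full))        # reconstructions for all N positions

In MAE-style training, the reconstruction loss (e.g., mean squared error on normalized pixels) is computed only at the masked positions.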

😮 High performance, but NO extra data required

VideoMAE works well on video datasets of different scales and achieves 84.7% top-1 accuracy on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51. To the best of our knowledge, VideoMAE is the first to achieve state-of-the-art performance on these four popular benchmarks with a vanilla ViT backbone, without requiring any extra data or pre-trained models.

🚀 Main Results

Something-Something V2

Method    Extra Data  Backbone  Frames x Clips x Crops  Top-1  Top-5
VideoMAE  no          ViT-B     16x2x3                  70.6   92.6
VideoMAE  no          ViT-L     16x2x3                  74.2   94.7
VideoMAE  no          ViT-L     32x1x3                  75.3   95.2

Kinetics-400

Method    Extra Data    Backbone  Frames x Clips x Crops  Top-1  Top-5
VideoMAE  no            ViT-B     16x5x3                  80.9   94.7
VideoMAE  no            ViT-L     16x5x3                  84.7   96.5
VideoMAE  Kinetics-700  ViT-L     16x5x3                  85.8   96.8

UCF101 & HMDB51

Method    Extra Data    Backbone  UCF101  HMDB51
VideoMAE  no            ViT-B     90.8    61.1
VideoMAE  Kinetics-400  ViT-B     96.1    73.3

🔨 Installation

Please follow the instructions in INSTALL.md.

➡️ Data Preparation

Please follow the instructions in DATASET.md for data preparation.

🔄 Pre-training

The pre-training instruction is in PRETRAIN.md.

⤴️ Fine-tuning with pre-trained models

The fine-tuning instruction is in FINETUNE.md.

📍 Model Zoo

We provide pre-trained and fine-tuned models in MODEL_ZOO.md.

👀 Visualization

We provide a script for visualization in vis.sh. A Colab notebook for better visualization is coming soon.

☎️ Contact

Zhan Tong: [email protected]

👍 Acknowledgements

Thanks to Ziteng Gao, Lei Chen and Chongjian Ge for their kind support.
This project is built upon MAE-pytorch and BEiT. Thanks to the contributors of these great codebases.

🔒 License

The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file. Portions of the project are available under separate license terms: SlowFast and pytorch-image-models are licensed under the Apache 2.0 license. BEiT is licensed under the MIT license.

✏️ Citation

If you find this project helpful, please feel free to give it a star ⭐️ and cite our paper:

@article{videomae,
  title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  journal={arXiv preprint arXiv:2203.12602},
  year={2022}
}
Issues
  • release csv label files for ssv2

    Hi, congratulations on your great work! Could you release the label files (train.csv, val.csv, test.csv) for SSv2? I tried to generate these files following the guidance in DATASET.md and used them for evaluation on SSv2 by running run_class_finetuning.py with --eval, but the accuracy is abnormally low, so I suspect the csv files I generated are wrong. Could you release the csv files for K400 and SSv2 for everyone to download? Thanks!

    opened by nqx12348 9
  • The learning rate for ssv2 dataset

    Hi, I have tried to reproduce the VideoMAE performance on the SSv2 dataset. I ran the experiments on four A100 machines (each with eight 80GB GPUs) and set --nnodes=4 and batch_size 64, so that the total batch size is the same as yours. However, the performance is not consistent with the reported one. I checked the log you provided and noticed that the learning rate is different: after the warm-up stage it does not seem to be 1.2e-3 (1.5e-4*2048/256). Thanks a lot, and I am looking forward to your reply.

    opened by vaesl 9
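
    For reference, the linear scaling rule referenced in this issue (lr = base_lr * total_batch_size / 256) works out as follows; the numbers simply reproduce the arithmetic in the question above and are not an official configuration.

        # Linear learning-rate scaling rule: lr = base_lr * total_batch_size / 256
        base_lr = 1.5e-4
        batch_size_per_gpu = 64
        gpus_per_node = 8
        num_nodes = 4
        total_batch_size = batch_size_per_gpu * gpus_per_node * num_nodes   # 2048
        lr = base_lr * total_batch_size / 256
        print(total_batch_size, lr)                                         # 2048 0.0012
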
  • Encountered missing keys error while fine-tuning

    I encountered a missing keys error while fine-tuning: I resumed from the checkpoint produced by pre-training and used it for fine-tuning. How can I solve this problem? (screenshot attached)

    opened by Expert68 8
  • About inference speed

    Thanks for the great work! I have one question that was not discussed in the paper: the inference speed.

    I understand that training is fast, since you mask out 90% of the patches. However, at test time I assume you retain all the patches. Since attention in ViT has quadratic cost, does that mean you need 10*10 = 100 times the FLOPs compared to the pre-training phase?

    If that is the case, how much time does a single forward pass take?

    opened by gy20073 7
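
    A back-of-the-envelope answer to the question above: only the attention matmuls are quadratic in the number of tokens, while the qkv/projection and MLP layers are linear, so the increase is far less than 100x. The token counts and the standard ViT FLOPs approximation below are our own assumptions, not measured numbers, and at test time the decoder is not used at all.

        # Rough per-layer FLOPs of a ViT block.
        def vit_layer_flops(n_tokens, dim):
            linear_terms = 12 * n_tokens * dim ** 2   # qkv, attention projection, 4x MLP
            attn_terms = 2 * n_tokens ** 2 * dim      # QK^T and attn @ V
            return linear_terms + attn_terms

        dim = 768                                     # ViT-B
        n_full = 8 * 14 * 14                          # assumed grid: 16 frames / tubelet 2, 14x14 patches
        n_visible = round(0.1 * n_full)               # 90% tube masking during pre-training

        ratio = vit_layer_flops(n_full, dim) / vit_layer_flops(n_visible, dim)
        print(f"~{ratio:.0f}x encoder FLOPs at inference vs. pre-training")   # ~13x, not 100x
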
  • Request for the best pre-trained checkpoint for downstream tasks

    Thanks for your excellent work. With only self-supervised learning, VideoMAE performs surprisingly well. Beyond that, I think (self-supervised + fully supervised) models are also very meaningful, especially for downstream tasks/datasets.

    @yztongzhan Could you please provide the best pre-trained checkpoint you can offer for learning video representations for arbitrary, unseen downstream video datasets?

    For example, a model pre-trained on Kinetics-700 with VideoMAE self-supervision, and then trained on Kinetics-700 with full supervision.

    Such a model would be very useful: one could easily fine-tune it on any downstream dataset for many purposes, just like a ResNet-50 pre-trained on ImageNet.

    opened by makecent 4
  • loss_scale_value has no key "scale"

    Hi, thanks for your impressive work. I ran run_mae_pretraining.py and it raised an error at engine_for_pretraining.py, line 79, in train_one_epoch: loss_scale_value = loss_scaler.state_dict()["scale"] (traceback attached). P.S. I used the pre-trained SSv2 checkpoint.

    opened by fanbooo 4
  • How to do the preprocessing?

    Hello. First, thank you for this repository.

    I'm trying to preprocess the data.

    I have already followed INSTALL.md and installed everything. Now I am looking at DATASET.md and trying to follow it. However, I already downloaded Kinetics-400 earlier, so I only need to preprocess the data:

    i) Download the dataset from the official website.

    ii) Preprocess the dataset by resizing the short edge of the videos to 320px. You can refer to the MMAction2 data benchmark for TSN and SlowOnly.

    iii) Create the annotations required by the dataloader ("<path_to_video> <video_class>" in the annotations). The annotations usually include train.csv, val.csv, and test.csv (here val.csv is equivalent to test.csv). The *.csv file format is: ...

    I am stuck on step (ii) and I'm not sure how to do it. How do I preprocess the videos?

    Thank you for reading!

    opened by hic9507 3
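
    One possible way to do step (ii) above, resizing the short side of each video to 320px while keeping the aspect ratio, is to drive ffmpeg from Python with its scale filter. This is only an illustrative sketch (the directory names are placeholders and ffmpeg must be installed); the MMAction2 data-preparation tools mentioned in DATASET.md are an alternative.

        import subprocess
        from pathlib import Path

        def resize_short_side(src, dst, short_side=320):
            """Resize a video so its shorter edge becomes `short_side` px, keeping aspect ratio."""
            # -2 lets ffmpeg derive the other dimension, rounded to an even number.
            vf = (f"scale='if(gt(iw,ih),-2,{short_side})':"
                  f"'if(gt(iw,ih),{short_side},-2)'")
            subprocess.run(["ffmpeg", "-y", "-i", str(src), "-vf", vf,
                            "-c:a", "copy", str(dst)], check=True)

        src_dir = Path("kinetics400/videos_train")        # placeholder paths
        dst_dir = Path("kinetics400/videos_train_320p")
        dst_dir.mkdir(parents=True, exist_ok=True)
        for video in src_dir.glob("*.mp4"):
            resize_short_side(video, dst_dir / video.name)
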
  • novograd optimizer removed from timm

    ModuleNotFoundError: No module named 'timm.optim.novograd'

    It seems that recent timm versions just give you NvNovoGrad if you ask for novograd:

    https://github.com/rwightman/pytorch-image-models/blob/master/timm/optim/optim_factory.py

    elif opt_lower == 'novograd' or opt_lower == 'nvnovograd':
        optimizer = NvNovoGrad(parameters, **opt_args)
    opened by DalasNoin 2
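
    One possible workaround (an assumption about where the failing import lives, not an official fix) is to fall back to timm's NvNovoGrad when the old module is missing, or to pin an older timm release that still provides timm.optim.novograd.

        # In the file that imports NovoGrad (assumed to be optim_factory.py):
        try:
            from timm.optim.novograd import NovoGrad      # older timm releases
        except ModuleNotFoundError:
            # Newer timm only ships NvNovoGrad and maps 'novograd' to it upstream.
            from timm.optim.nvnovograd import NvNovoGrad as NovoGrad
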
  • Add the length of the video when preparing datasets

    According to the data-loading code, it seems we should include the length of each video when preparing the datasets. See https://github.com/MCG-NJU/VideoMAE/blob/main/DATASET.md for details. The generated annotation for a video dataset should look like: dataset_root/video_1.mp4 100 label_1, where 100 is the length (number of frames) of video_1.mp4.

    opened by JiayuZou2020 2
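
    For reference, a sketch of generating annotation lines in the "<path> <num_frames> <label>" form described above. The directory layout, label mapping, and the use of OpenCV to count frames are all assumptions for illustration.

        import csv
        from pathlib import Path

        import cv2   # used only to count frames; decord would work as well

        def make_annotation(video_dir, label_map, out_csv):
            """Write space-separated '<path> <num_frames> <label>' lines."""
            with open(out_csv, "w", newline="") as f:
                writer = csv.writer(f, delimiter=" ")
                for video in sorted(Path(video_dir).glob("*/*.mp4")):   # assumes <class_name>/<video>.mp4
                    cap = cv2.VideoCapture(str(video))
                    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
                    cap.release()
                    writer.writerow([str(video), num_frames, label_map[video.parent.name]])

        # Hypothetical usage:
        # make_annotation("dataset_root/train", {"class_a": 0, "class_b": 1}, "train.csv")
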
  • The accuracy I get is low

    I trained the model on UCF101 (pre-training for 100 epochs, fine-tuning for 100 epochs), and the accuracy is 67-68%, but your reported accuracy is above 90%. Is this related to the parameter settings? Could you share the parameter settings you used?

    opened by Winnie202 2
  • Downstream training loss and accuracy are nearly constant

    Hello, thanks for your effort on VideoMAE. I followed the installation guide completely, then selected two categories from UCF-101 for downstream training (binary video classification). The data is preprocessed, including resizing the short edge to 240px or 320px, and saved in .mp4 format. The train.csv, val.csv, and test.csv files are also prepared. However, the training loss/accuracy stay nearly constant at 0.69/0.5 throughout training. Why could that be?

    The dataset I specified is UCF-101 and nb_classes is set to 2. The only other thing I modified from the official script is the learning rate, since I am working in a single-GPU, non-distributed environment. I have tried multiple learning-rate settings, from 64x to 0.015x, but all of them lead to constant losses. Could you please help me? Thank you a lot.

    (screenshot attached)

    opened by bnbsking 2
  • Linear probe experiment

    Is there a script, or are there details available, for linear probing? The original MAE paper suggests there is a substantial difference between the setups for fine-tuning and linear probing of MAE models, and I would like to be able to reproduce the results.

    Thanks!

    opened by alexnwang 0
  • About preparing SthV2

    Hi, thank you for your work!

    I read your DATASET.md.

    Are there two key points in processing the SthV2 dataset: first, change the file suffix to .mp4, and second, resize the short side to 320p? (And are only videos with an original height of 240p selected and then resized?)

    opened by klinic 0
  • Pretraining VideoMAE on HMDB51

    Hi Zhan,

    Thank you for your excellent work! We are impressed by VideoMAE's data efficiency (paper Section 4.3), especially on small datasets like HMDB51. We are trying to use your code to reproduce your experiment on HMDB51, but we cannot reach the fine-tuning accuracy reported in Table 2 (61.1%):

    (screenshot attached)

    Our experiments show that the model has converged after 2400 pre-training epochs. We are using eight Tesla V100 32GB GPUs, and we changed the batch size, learning rate, and temporal stride as described in the appendix.

    Could you also share your complete experiment configurations for UCF101 and HMDB51? Settings such as the number of warm-up epochs may also be critical for reproduction.

    opened by chenliz1 6
  • Preprocess for ssv2 dataset

    Hello! I read your DATASET.md for SSv2, but I have no idea how to run data_clean.py on the original SSv2 videos. Why are .txt files used in data_clean.py? Could you kindly give one or two instructions for running data_clean.py on SSv2?

    opened by aries-young 2
Owner
Multimedia Computing Group, Nanjing University