[arXiv](https://arxiv.org/abs/2203.12602)
# VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong, Yibing Song, Jue Wang, Limin Wang
Nanjing University, Tencent AI Lab
## 📰 News
[2022.4.24] Code and pre-trained models are available now! Please give the repository a star if you find it useful.
[2022.4.15] The LICENSE of this project has been upgraded to CC-BY-NC 4.0.
[2022.3.24] Code and pre-trained models will be released here. Welcome to watch this repository for the latest updates.
## ✨ Highlights
### 🔥 Masked Video Modeling for Video Pre-Training
VideoMAE performs masked video modeling for video pre-training. We propose an extremely high masking ratio (90%-95%) and a tube masking strategy to create a challenging task for self-supervised video pre-training.
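As a rough illustration of tube masking, the sketch below samples one random spatial mask and repeats it along the temporal axis, so the same patch positions are masked in every frame. The function name, shapes, and the exact ratio shown here are illustrative assumptions, not the repository's implementation.

```python
import torch

def tube_mask(num_frames: int, patches_per_frame: int, mask_ratio: float = 0.9) -> torch.Tensor:
    """Illustrative tube masking: one spatial mask shared across all frames.

    Returns a boolean mask of shape (num_frames * patches_per_frame,),
    where True marks a masked token.
    """
    num_masked = int(mask_ratio * patches_per_frame)
    # Randomly pick which spatial patches to mask (same choice for every frame).
    order = torch.randperm(patches_per_frame)
    spatial_mask = torch.zeros(patches_per_frame, dtype=torch.bool)
    spatial_mask[order[:num_masked]] = True
    # Repeating the spatial mask over time forms "tubes" along the temporal axis.
    return spatial_mask.repeat(num_frames)

# e.g. 8 temporal token groups, a 14x14 patch grid, 90% of tokens masked
mask = tube_mask(num_frames=8, patches_per_frame=14 * 14, mask_ratio=0.9)
print(mask.shape, mask.float().mean().item())  # roughly 0.9 of tokens are masked
```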
### ⚡️ A Simple, Efficient and Strong Baseline in SSVP
VideoMAE uses a simple masked autoencoder with a plain ViT backbone for video self-supervised learning. Because of the extremely high masking ratio, the pre-training time of VideoMAE is much shorter than that of contrastive learning methods (3.2x speedup). VideoMAE can serve as a simple but strong baseline for future research in self-supervised video pre-training.
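The speedup comes from the asymmetric design of masked autoencoders: the encoder runs only on the small set of visible tokens, so a 90% masking ratio shrinks its input by roughly 10x. The snippet below is a minimal sketch of that idea with a stand-in transformer layer; names and shapes are assumptions for illustration, not the repository's code.

```python
import torch

def encode_visible_tokens(tokens: torch.Tensor, mask: torch.Tensor,
                          encoder: torch.nn.Module) -> torch.Tensor:
    """tokens: (B, N, C) patch embeddings; mask: (N,) bool with True = masked.

    Only the visible (unmasked) tokens are passed through the encoder,
    which is where most of the pre-training speedup comes from.
    """
    visible = tokens[:, ~mask, :]   # (B, ~0.1 * N, C) at a 90% masking ratio
    return encoder(visible)

# Toy usage: a single transformer layer stands in for the ViT encoder.
B, N, C = 2, 8 * 14 * 14, 768
encoder = torch.nn.TransformerEncoderLayer(d_model=C, nhead=12, batch_first=True)
tokens = torch.randn(B, N, C)
mask = torch.rand(N) < 0.9
print(encode_visible_tokens(tokens, mask, encoder).shape)
```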
### 😮 High performance, but NO extra data required
VideoMAE works well on video datasets of different scales, achieving 84.7% top-1 accuracy on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51. To the best of our knowledge, VideoMAE is the first method to achieve state-of-the-art performance on these four popular benchmarks with vanilla ViT backbones, without requiring any extra data or pre-trained models.
## 🚀 Main Results
### ✨ Something-Something V2
| Method | Extra Data | Backbone | Frames x Clips x Crops | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-B | 16x2x3 | 70.6 | 92.6 |
| VideoMAE | no | ViT-L | 16x2x3 | 74.2 | 94.7 |
| VideoMAE | no | ViT-L | 32x1x3 | 75.3 | 95.2 |
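For reference, the Frames x Clips x Crops column describes the multi-view inference protocol: each video is sampled into several temporal clips and spatial crops, and the per-view class scores are averaged before taking the prediction. The sketch below shows that averaging with a hypothetical toy classifier; it is not the repository's evaluation code.

```python
import torch

@torch.no_grad()
def multi_view_top1_correct(model: torch.nn.Module, views: torch.Tensor, label: int) -> bool:
    """views: (num_clips * num_crops, C, T, H, W) views of one video.

    Averages softmax scores over all views and checks the top-1 prediction.
    """
    probs = model(views).softmax(dim=-1).mean(dim=0)   # average over clips x crops
    return probs.argmax().item() == label

# Toy stand-in classifier: global average pooling followed by a linear head.
class ToyVideoNet(torch.nn.Module):
    def __init__(self, num_classes: int = 174):        # Something-Something V2 has 174 classes
        super().__init__()
        self.head = torch.nn.Linear(3, num_classes)

    def forward(self, x):                               # x: (B, C, T, H, W)
        return self.head(x.mean(dim=(2, 3, 4)))

# e.g. the 16x2x3 setting: 2 clips x 3 crops = 6 views, 16 frames each
views = torch.randn(6, 3, 16, 224, 224)
print(multi_view_top1_correct(ToyVideoNet(), views, label=0))
```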
### ✨ Kinetics-400
| Method | Extra Data | Backbone | Frames x Clips x Crops | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-B | 16x5x3 | 80.9 | 94.7 |
| VideoMAE | no | ViT-L | 16x5x3 | 84.7 | 96.5 |
| VideoMAE | Kinetics-700 | ViT-L | 16x5x3 | 85.8 | 96.8 |
### ✨ UCF101 & HMDB51
| Method | Extra Data | Backbone | UCF101 Top-1 (%) | HMDB51 Top-1 (%) |
| --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-B | 90.8 | 61.1 |
| VideoMAE | Kinetics-400 | ViT-B | 96.1 | 73.3 |
## 🔨 Installation
Please follow the instructions in INSTALL.md.
## ➡️ Data Preparation
Please follow the instructions in DATASET.md for data preparation.
## 🔄 Pre-training
The pre-training instruction is in PRETRAIN.md.
## ⤴️ Fine-tuning with pre-trained models
The fine-tuning instruction is in FINETUNE.md.
## 📍 Model Zoo
We provide pre-trained and fine-tuned models in MODEL_ZOO.md.
## 👀 Visualization
We provide the script for visualization in `vis.sh`. A Colab notebook for better visualization is coming soon.
## ☎️ Contact
Zhan Tong: [email protected]
## 👍 Acknowledgements
Thanks to Ziteng Gao, Lei Chen and Chongjian Ge for their kind support.
This project is built upon MAE-pytorch and BEiT. Thanks to the contributors of these great codebases.
## 🔒 License
The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file. Portions of the project are available under separate license terms: SlowFast and pytorch-image-models are licensed under the Apache 2.0 license. BEiT is licensed under the MIT license.
## ✏️ Citation
If you find this project helpful, please feel free to give it a star and cite our paper:
@article{videomae,
title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
journal={arXiv preprint arXiv:2203.12602},
year={2022}
}