[arXiv](https://arxiv.org/abs/2203.12602)
# VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong, Yibing Song, Jue Wang, Limin Wang
Nanjing University, Tencent AI Lab
## 📰 News
[2022.4.24] Code and pre-trained models are available now! Please give the repository a star if you find it useful.
[2022.4.15] The LICENSE of this project has been upgraded to CC-BY-NC 4.0.
[2022.3.24] Code and pre-trained models will be released here. Welcome to watch this repository for the latest updates.
## ✨ Highlights
### 🔥 Masked Video Modeling for Video Pre-Training
VideoMAE performs masked video modeling for video pre-training. We propose an extremely high masking ratio (90%-95%) and a tube masking strategy to create a challenging task for self-supervised video pre-training.
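As a rough illustration of tube masking, the sketch below samples one random spatial mask and repeats it along the temporal axis, so the same patch positions are masked in every frame. The function name, shapes, and the exact ratio shown here are illustrative assumptions, not the repository's implementation.

```python
import torch

def tube_mask(num_frames: int, patches_per_frame: int, mask_ratio: float = 0.9) -> torch.Tensor:
    """Illustrative tube masking: one spatial mask shared across all frames.

    Returns a boolean mask of shape (num_frames * patches_per_frame,),
    where True marks a masked token.
    """
    num_masked = int(mask_ratio * patches_per_frame)
    # Randomly pick which spatial patches to mask (same choice for every frame).
    order = torch.randperm(patches_per_frame)
    spatial_mask = torch.zeros(patches_per_frame, dtype=torch.bool)
    spatial_mask[order[:num_masked]] = True
    # Repeating the spatial mask over time forms "tubes" along the temporal axis.
    return spatial_mask.repeat(num_frames)

# e.g. 8 temporal token groups, a 14x14 patch grid, 90% of tokens masked
mask = tube_mask(num_frames=8, patches_per_frame=14 * 14, mask_ratio=0.9)
print(mask.shape, mask.float().mean().item())  # roughly 0.9 of tokens are masked
```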
### ⚡️ A Simple, Efficient and Strong Baseline in SSVP
VideoMAE uses a simple masked autoencoder with a plain ViT backbone for video self-supervised learning. Because of the extremely high masking ratio, the pre-training time of VideoMAE is much shorter than that of contrastive learning methods (3.2x speedup). VideoMAE can serve as a simple but strong baseline for future research in self-supervised video pre-training.
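The speedup comes from the asymmetric design of masked autoencoders: the encoder runs only on the small set of visible tokens, so a 90% masking ratio shrinks its input by roughly 10x. The snippet below is a minimal sketch of that idea with a stand-in transformer layer; names and shapes are assumptions for illustration, not the repository's code.

```python
import torch

def encode_visible_tokens(tokens: torch.Tensor, mask: torch.Tensor,
                          encoder: torch.nn.Module) -> torch.Tensor:
    """tokens: (B, N, C) patch embeddings; mask: (N,) bool with True = masked.

    Only the visible (unmasked) tokens are passed through the encoder,
    which is where most of the pre-training speedup comes from.
    """
    visible = tokens[:, ~mask, :]   # (B, ~0.1 * N, C) at a 90% masking ratio
    return encoder(visible)

# Toy usage: a single transformer layer stands in for the ViT encoder.
B, N, C = 2, 8 * 14 * 14, 768
encoder = torch.nn.TransformerEncoderLayer(d_model=C, nhead=12, batch_first=True)
tokens = torch.randn(B, N, C)
mask = torch.rand(N) < 0.9
print(encode_visible_tokens(tokens, mask, encoder).shape)
```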
### 😮 High performance, but NO extra data required
VideoMAE works well on video datasets of different scales, achieving 84.7% top-1 accuracy on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51. To the best of our knowledge, VideoMAE is the first method to achieve state-of-the-art performance on these four popular benchmarks with vanilla ViT backbones, without requiring any extra data or pre-trained models.
## 🚀 Main Results
### ✨ Something-Something V2
| Method | Extra Data | Backbone | Frames x Clips x Crops | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-B | 16x2x3 | 70.6 | 92.6 |
| VideoMAE | no | ViT-L | 16x2x3 | 74.2 | 94.7 |
| VideoMAE | no | ViT-L | 32x1x3 | 75.3 | 95.2 |
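For reference, the Frames x Clips x Crops column describes the multi-view inference protocol: each video is sampled into several temporal clips and spatial crops, and the per-view class scores are averaged before taking the prediction. The sketch below shows that averaging with a hypothetical toy classifier; it is not the repository's evaluation code.

```python
import torch

@torch.no_grad()
def multi_view_top1_correct(model: torch.nn.Module, views: torch.Tensor, label: int) -> bool:
    """views: (num_clips * num_crops, C, T, H, W) views of one video.

    Averages softmax scores over all views and checks the top-1 prediction.
    """
    probs = model(views).softmax(dim=-1).mean(dim=0)   # average over clips x crops
    return probs.argmax().item() == label

# Toy stand-in classifier: global average pooling followed by a linear head.
class ToyVideoNet(torch.nn.Module):
    def __init__(self, num_classes: int = 174):        # Something-Something V2 has 174 classes
        super().__init__()
        self.head = torch.nn.Linear(3, num_classes)

    def forward(self, x):                               # x: (B, C, T, H, W)
        return self.head(x.mean(dim=(2, 3, 4)))

# e.g. the 16x2x3 setting: 2 clips x 3 crops = 6 views, 16 frames each
views = torch.randn(6, 3, 16, 224, 224)
print(multi_view_top1_correct(ToyVideoNet(), views, label=0))
```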
### ✨ Kinetics-400
| Method | Extra Data | Backbone | Frames x Clips x Crops | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-B | 16x5x3 | 80.9 | 94.7 |
| VideoMAE | no | ViT-L | 16x5x3 | 84.7 | 96.5 |
| VideoMAE | Kinetics-700 | ViT-L | 16x5x3 | 85.8 | 96.8 |
### ✨ UCF101 & HMDB51
| Method | Extra Data | Backbone | UCF101 Top-1 (%) | HMDB51 Top-1 (%) |
| --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-B | 90.8 | 61.1 |
| VideoMAE | Kinetics-400 | ViT-B | 96.1 | 73.3 |
## 🔨 Installation
Please follow the instructions in INSTALL.md.
## ➡️ Data Preparation
Please follow the instructions in DATASET.md for data preparation.
## 🔄 Pre-training
The pre-training instruction is in PRETRAIN.md.
## ⤴️ Fine-tuning with pre-trained models
The fine-tuning instruction is in FINETUNE.md.
## 📍 Model Zoo
We provide pre-trained and fine-tuned models in MODEL_ZOO.md.
## 👀 Visualization
We provide the script for visualization in `vis.sh`. A Colab notebook for better visualization is coming soon.
## ☎️ Contact
Zhan Tong: [email protected]
## 👍 Acknowledgements
Thanks to Ziteng Gao, Lei Chen and Chongjian Ge for their kind support.
This project is built upon MAE-pytorch and BEiT. Thanks to the contributors of these great codebases.
## 🔒 License
The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file. Portions of the project are available under separate license terms: SlowFast and pytorch-image-models are licensed under the Apache 2.0 license. BEiT is licensed under the MIT license.
## ✏️ Citation
If you find this project helpful, please feel free to give it a star and cite our paper:
@article{videomae,
title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
journal={arXiv preprint arXiv:2203.12602},
year={2022}
}