ConvMAE: Masked Convolution Meets Masked Autoencoders




Peng Gao1, Teli Ma1, Hongsheng Li2, Jifeng Dai3, Yu Qiao1

1 Shanghai AI Laboratory, 2 MMLab, CUHK, 3 SenseTime Research

This repo is the official implementation of ConvMAE: Masked Convolution Meets Masked Autoencoders. It currently includes code and models for the following tasks:

ImageNet Pretrain: See
ImageNet Finetune: See
Object Detection: See
Semantic Segmentation: See



Code and models for COCO object detection and instance segmentation are available.


  1. Pretrained ConvMAE models on ImageNet-1K are available.
  2. Code and models for ImageNet-1K finetuning and linear probing are provided.


The preprint is publicly available on arXiv.


The ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme.

  • We present ConvMAE, a strong and efficient self-supervised framework that is easy to implement yet shows outstanding performance on downstream tasks.
  • ConvMAE naturally generates hierarchical representations and exhibits promising performance on object detection and segmentation.
  • ConvMAE-Base improves the ImageNet finetuning accuracy by 1.4% compared with MAE-Base. On object detection with Mask R-CNN, ConvMAE-Base achieves 53.2 box AP and 47.1 mask AP with a 25-epoch training schedule, while MAE-Base attains 50.3 box AP and 44.9 mask AP with 100 training epochs. On ADE20K with UperNet, ConvMAE-Base surpasses MAE-Base by 3.6 mIoU (51.7 vs. 48.1).
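
The margins quoted in the last bullet can be recomputed from the reported figures. The snippet below is only a sanity check, not part of the released code: the metric names are labels made up for this illustration, and the numbers are copied from the bullet and from the ImageNet-1K results table rather than re-measured.

```python
# Reported figures as (ConvMAE-Base, MAE-Base) pairs.
# The keys are hypothetical labels chosen for this sketch.
reported = {
    "imagenet_ft_acc1": (85.0, 83.6),
    "coco_box_ap": (53.2, 50.3),
    "coco_mask_ap": (47.1, 44.9),
    "ade20k_miou": (51.7, 48.1),
}


def margins(results):
    """Return ConvMAE-Base minus MAE-Base per metric, rounded to one decimal."""
    return {name: round(ours - base, 1) for name, (ours, base) in results.items()}


for name, delta in margins(reported).items():
    print(f"{name}: +{delta}")
```

Note that the COCO comparison is not like-for-like in training budget: the ConvMAE-Base numbers use a 25-epoch schedule and the MAE-Base numbers use 100 epochs.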


Pretrain on ImageNet-1K

The following table provides pretrained checkpoints and logs used in the paper.

pretrained checkpoints download
logs download

Main Results on ImageNet-1K

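| Models | #Params (M) | Supervision | Encoder Ratio | Pretrain Epochs | FT acc@1 (%) | LIN acc@1 (%) | FT logs/weights | LIN logs/weights |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BEiT | 88 | DALLE | 100% | 300 | 83.0 | 37.6 | - | - |
| MAE | 88 | RGB | 25% | 1600 | 83.6 | 67.8 | - | - |
| SimMIM | 88 | RGB | 100% | 800 | 84.0 | 56.7 | - | - |
| MaskFeat | 88 | HOG | 100% | 300 | 83.6 | N/A | - | - |
| data2vec | 88 | RGB | 100% | 800 | 84.2 | N/A | - | - |
| ConvMAE-B | 88 | RGB | 25% | 1600 | 85.0 | 70.9 | log/weight | |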

Main Results on COCO

Mask R-CNN

| Models | Pretrain | Pretrain Epochs | Finetune Epochs | #Params (M) | FLOPs (T) | box AP | mask AP | logs/weights |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-B | IN21K w/ labels | 300 | 36 | 109 | 0.7 | 51.4 | 45.4 | - |
| Swin-L | IN21K w/ labels | 300 | 36 | 218 | 1.1 | 52.4 | 46.2 | - |
| MViTv2-B | IN21K w/ labels | 300 | 36 | 73 | 0.6 | 53.1 | 47.4 | - |
| MViTv2-L | IN21K w/ labels | 300 | 36 | 239 | 1.3 | 53.6 | 47.5 | - |
| Benchmarking-ViT-B | IN1K w/o labels | 1600 | 100 | 118 | 0.9 | 50.4 | 44.9 | - |
| Benchmarking-ViT-L | IN1K w/o labels | 1600 | 100 | 340 | 1.9 | 53.3 | 47.2 | - |
| ViTDet | IN1K w/o labels | 1600 | 100 | 111 | 0.8 | 51.2 | 45.5 | - |
| MIMDet-ViT-B | IN1K w/o labels | 1600 | 36 | 127 | 1.1 | 51.5 | 46.0 | - |
| MIMDet-ViT-L | IN1K w/o labels | 1600 | 36 | 345 | 2.6 | 53.3 | 47.5 | - |
| ConvMAE-B | IN1K w/o labels | 1600 | 25 | 104 | 0.9 | 53.2 | 47.1 | log/weight |

Main Results on ADE20K


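| Models | Pretrain | Pretrain Epochs | Finetune Iters | #Params (M) | FLOPs (T) | mIoU | logs/weights |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeiT-B | IN1K w/ labels | 300 | 16K | 163 | 0.6 | 45.6 | - |
| Swin-B | IN1K w/ labels | 300 | 16K | 121 | 0.3 | 48.1 | - |
| MoCo V3 | IN1K | 300 | 16K | 163 | 0.6 | 47.3 | - |
| DINO | IN1K | 400 | 16K | 163 | 0.6 | 47.2 | - |
| BEiT | IN1K+DALLE | 1600 | 16K | 163 | 0.6 | 47.1 | - |
| PeCo | IN1K | 300 | 16K | 163 | 0.6 | 46.7 | - |
| CAE | IN1K+DALLE | 800 | 16K | 163 | 0.6 | 48.8 | - |
| MAE | IN1K | 1600 | 16K | 163 | 0.6 | 48.1 | - |
| ConvMAE-B | IN1K | 1600 | 16K | 153 | 0.6 | 51.7 | soon |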

Main Results on Kinetics-400

| Models | Pretrain Epochs | Finetune Epochs | #Params (M) | Top1 | Top5 | logs/weights |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMAE-B | 200 | 100 | 87 | 77.8 | | |
| VideoMAE-B | 800 | 100 | 87 | 79.4 | | |
| VideoMAE-B | 1600 | 100 | 87 | 79.8 | | |
| VideoMAE-B | 1600 | 100 (w/ Repeated Aug) | 87 | 80.7 | 94.7 | |
| SpatioTemporalLearner-B | 800 | 150 (w/ Repeated Aug) | 87 | 81.3 | 94.9 | |
| VideoConvMAE-B | 200 | 100 | 86 | 80.1 | 94.3 | Soon |
| VideoConvMAE-B | 800 | 100 | 86 | 81.7 | 95.1 | Soon |
| VideoConvMAE-B-MSD | 800 | 100 | 86 | 82.7 | 95.5 | Soon |

Main Results on Something-Something V2

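| Models | Pretrain Epochs | Finetune Epochs | #Params (M) | Top1 | Top5 | logs/weights |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMAE-B | 200 | 40 | 87 | 66.1 | | |
| VideoMAE-B | 800 | 40 | 87 | 69.3 | | |
| VideoMAE-B | 2400 | 40 | 87 | 70.3 | | |
| VideoConvMAE-B | 200 | 40 | 86 | 67.7 | 91.2 | Soon |
| VideoConvMAE-B | 800 | 40 | 86 | 69.9 | 92.4 | Soon |
| VideoConvMAE-B-MSD | 800 | 40 | 86 | 70.7 | 93.0 | Soon |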

Getting Started


  • Linux
  • Python 3.7+
  • CUDA 10.2+
  • GCC 5+
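
Before installing, it can help to confirm that the machine meets the listed requirements. The following is a minimal sketch and not part of the repository: `check_prerequisites` is a hypothetical helper written for this illustration, and it covers only the operating system and Python version. The CUDA and GCC versions depend on external tools and are not checked here.

```python
import platform
import sys


def check_prerequisites(system, python_version, min_python=(3, 7)):
    """Return a list of problems; an empty list means the basic checks pass.

    Hypothetical helper: only the OS name and Python version are checked.
    """
    problems = []
    if system != "Linux":
        problems.append(f"expected Linux, found {system}")
    if tuple(python_version[:2]) < tuple(min_python):
        found = ".".join(str(part) for part in python_version[:2])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"expected Python {wanted}+, found {found}")
    return problems


# Check the interpreter and OS that are running this script.
issues = check_prerequisites(platform.system(), sys.version_info)
print("basic checks passed" if not issues else "\n".join(issues))
```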

Training and evaluation


The pretraining and finetuning of our project are based on DeiT and MAE. The object detection and semantic segmentation parts are based on MIMDet and MMSegmentation respectively. Thanks for their wonderful work.


ConvMAE is released under the MIT License.


  title={ConvMAE: Masked Convolution Meets Masked Autoencoders},
  author={Gao, Peng and Ma, Teli and Li, Hongsheng and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.03892},
  • Pretraining implementation

    I have implemented the pretraining code based on the MAE repo, but I wonder about one thing: in the decoder phase, do you (1) sum the features of all 3 stages and then normalize the result, or (2) normalize the feature of the last stage and then sum it with the 2 previous ones? I got a NaN loss after 270 epochs with approach (1). By the way, have you ever encountered a NaN loss during training?

    opened by hao-pt 9
  • ImageNet Evaluation

    Thanks for sharing the great work. I had difficulty reproducing the evaluation results. My evaluation results are: * Acc@1 1.090 Acc@5 2.188 loss 8.955 Accuracy of the network on the 50000 test images: 1.1%. That's obviously too big a gap.

    I downloaded ImageNet-1K following your guidance and prepared it following Jasonlee1995. Are there any details I haven't noticed, or any specific requirements for preparing the dataset?

    opened by SheldonHS 6
  • Time required to train one epoch.


    Dear author: Thank you for sharing the excellent work! May I ask how the time overhead of ConvMAE pre-training compares to MAE? Can you provide the time required to train an epoch for these two methods on the same type of GPU?

    opened by charlesCXK 6
  • Question about ConvMAE-v2


    Thank you for your excellent work!

    When I load ConvMAE-v2-Base pretrained checkpoints [], it has cls_token parameter, which not in

    Does ConvMAE-v2 model different from in some details, thanks!

    opened by z-jiaming 4
  • mask convolution

    Hi! Thanks for the open-source code. I noticed that the mask convolution in the code only masks the residual branch, while the skip connection is not masked, as shown in line 119 of "ConvMAE/". The corresponding code is as follows: "x = x + self.drop_path(self.conv2(self.attn(mask * self.conv1(self.norm1(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)))))" Will this lead to information leakage in the convolution stages?

    opened by cathylao 3
  • Train on

    Could you provide a tutorial on how to train and finetune with a custom dataset? Also, how can the input image size be modified for detection? The current code does not seem to support custom image sizes.

    opened by zqyJason 2
  • How can I train 200 epochs for DET?

    Hi, I want to train the pretrained model in the detectron2 framework for object detection, but the code trains for only 1 epoch and then stops. Is this a bug?

    opened by ross-Hr 8
  • refactor hard coded numbers for more control over parameters (MaskedAutoencoderConvViT)

    Hi - I'd like to use patches of size 32x32 and a smaller model in general, but anything I change breaks the entire code. It would be really helpful if you refactored out all of the places that hard-code 4, 2, 16, etc. throughout the code for MaskedAutoencoderConvViT.

    Thanks, Dan

    opened by DanTaranis 1
  • How long will the pretraining stage take on V100?

    Thank you for your excellent work! We would like to know how long ImageNet-1K pretraining takes on a machine with 8 V100 GPUs. Also, will you release the manuscript about Faster ConvMAE soon? We can't wait to learn more details about it.

    opened by guoxih 0
  • Total memory consumption for training with a batch size of 32.

    I have tried training the ConvMAE detector (as provided in this repository) on 2 V100 GPUs with 32GB each. It looks like I can only train with a batch size of 2; going beyond that raises a CUDA out-of-memory error. With such a small batch size, training does not seem to produce a well-trained model. Could you tell me the recommended memory size for training the model with a batch size of 32?

    Thank you so much.

    opened by IamYourAlpha 6
  • Doubts about masking strategy

    Hi! Thanks for the open-source code. I have a question about the masking strategy. The paper says: "Uniformly masking stage-1 input tokens from the H/4 × W/4 feature maps would cause all tokens of stage-3 to have partially visible information and requires keeping all stage-3 tokens." Why would visible information pass to stage-3 if the image was masked at the start? Thanks very much!

    opened by aichifandefan 0
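
The question in the last thread can be explored with a small simulation. The sketch below is not taken from the ConvMAE code. It assumes a 224×224 input, so that stage 1 is a 56×56 token grid (H/4) and stage 3 is a 14×14 grid (H/16, an assumption based on the usual three-stage layout), and it assumes a 25% visible ratio, matching the encoder ratio in the ImageNet-1K table. All function names are hypothetical. Under these assumptions each stage-3 token pools a 4×4 window of stage-1 tokens, so a uniform random mask at stage 1 leaves almost every stage-3 token with at least one visible input, whereas sampling the mask at stage-3 resolution and upsampling it keeps each stage-3 token either fully visible or fully masked.

```python
import random


def uniform_stage1_mask(grid1, visible_ratio, seed=0):
    """Sample visible tokens uniformly on the stage-1 grid (1 = visible, 0 = masked)."""
    rng = random.Random(seed)
    total = grid1 * grid1
    keep = set(rng.sample(range(total), int(total * visible_ratio)))
    return [[1 if r * grid1 + c in keep else 0 for c in range(grid1)]
            for r in range(grid1)]


def blockwise_mask(grid3, visible_ratio, scale, seed=0):
    """Sample the mask on the stage-3 grid, then upsample it by `scale` to stage 1."""
    rng = random.Random(seed)
    total = grid3 * grid3
    keep = set(rng.sample(range(total), int(total * visible_ratio)))
    stage3 = [[1 if r * grid3 + c in keep else 0 for c in range(grid3)]
              for r in range(grid3)]
    size1 = grid3 * scale
    stage1 = [[stage3[r // scale][c // scale] for c in range(size1)]
              for r in range(size1)]
    return stage3, stage1


def partially_visible_fraction(stage1, scale):
    """Fraction of stage-3 tokens whose scale x scale window holds any visible stage-1 token."""
    grid3 = len(stage1) // scale
    hits = 0
    for row in range(grid3):
        for col in range(grid3):
            window = [stage1[row * scale + i][col * scale + j]
                      for i in range(scale) for j in range(scale)]
            hits += 1 if any(window) else 0
    return hits / (grid3 * grid3)


uniform = uniform_stage1_mask(56, 0.25)
_, blockwise = blockwise_mask(14, 0.25, 4)
print("uniform stage-1 mask:", partially_visible_fraction(uniform, 4))
print("block-wise mask:     ", partially_visible_fraction(blockwise, 4))
```

With independent masking at a 25% visible ratio, the chance that a 4×4 window contains at least one visible token is 1 - 0.75^16, roughly 99%, which is why a uniform stage-1 mask would force the encoder to keep nearly all stage-3 tokens. The block-wise variant keeps that fraction at the chosen visible ratio.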
Alpha VL Team of Shanghai AI Lab