ConvMAE
ConvMAE: Masked Convolution Meets Masked Autoencoders
Peng Gao1, Teli Ma1, Hongsheng Li2, Jifeng Dai3, Yu Qiao1
1 Shanghai AI Laboratory, 2 MMLab, CUHK, 3 Sensetime Research.
This repo is the official implementation of ConvMAE: Masked Convolution Meets Masked Autoencoders. It currently includes code and models for the following tasks:
ImageNet Pretrain: See PRETRAIN.md.
ImageNet Finetune: See FINETUNE.md.
Object Detection: See DETECTION.md.
Semantic Segmentation: See SEGMENTATION.md.
Updates
16/May/2022
- Code and models for COCO object detection and instance segmentation are available.
11/May/2022
- Pretrained models on ImageNet-1K for ConvMAE are released.
- Code and models for ImageNet-1K finetuning and linear probing are provided.
08/May/2022
- The preprint is available on arXiv.
Introduction
The ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme; a minimal sketch of the masking idea is shown after the highlights below.
- We present ConvMAE, a strong and efficient self-supervised framework that is easy to implement yet shows outstanding performance on downstream tasks.
- ConvMAE naturally generates hierarchical representations and exhibits promising performance on object detection and segmentation.
- ConvMAE-Base improves the ImageNet finetuning accuracy by 1.4% compared with MAE-Base. On COCO object detection with Mask R-CNN, ConvMAE-Base achieves 53.2 box AP and 47.1 mask AP with a 25-epoch training schedule, while MAE-Base attains 50.3 box AP and 44.9 mask AP with 100 training epochs. On ADE20K with UperNet, ConvMAE-Base surpasses MAE-Base by 3.6 mIoU (51.7 vs. 48.1).
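The core idea, in code form: a random mask is generated once on the coarsest (transformer-stage) grid and upsampled to the finer convolutional stages, so the convolutions only ever see visible regions. Below is a minimal, hedged sketch of this block-wise masking in PyTorch; the stage sizes, the 25% visible ratio, and all class/function names are illustrative assumptions rather than the official implementation (see PRETRAIN.md and the code for the real model).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def random_mask(batch, grid=14, keep_ratio=0.25):
    """Keep `keep_ratio` of the positions on a `grid` x `grid` map (1 = visible, 0 = masked)."""
    num_tokens = grid * grid
    num_keep = int(num_tokens * keep_ratio)
    noise = torch.rand(batch, num_tokens)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]
    mask = torch.zeros(batch, num_tokens)
    mask.scatter_(1, keep_idx, 1.0)
    return mask.view(batch, 1, grid, grid)


class MaskedConvBlock(nn.Module):
    """Convolution applied so that masked regions neither receive nor leak information."""

    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        self.pwconv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x, mask):
        # The mask lives on the coarsest (1/16) grid; upsampling it to this stage's
        # resolution turns every masked token into a masked block of pixels.
        mask = F.interpolate(mask, size=x.shape[-2:], mode="nearest")
        x = x * mask                                  # zero out masked regions first
        x = (x + self.pwconv(self.dwconv(x))) * mask  # conv + residual, re-masked
        return x


if __name__ == "__main__":
    feats = torch.randn(2, 64, 56, 56)       # hypothetical stage-1 features at 1/4 resolution
    mask = random_mask(batch=2, grid=14)     # 25% visible tokens on the 1/16 grid
    out = MaskedConvBlock(64)(feats, mask)
    print(out.shape)                         # torch.Size([2, 64, 56, 56])
```

Because the same coarse mask is reused at every resolution, the visible patches stay aligned across stages and the transformer stage can still drop the masked tokens entirely, keeping the encoder cost close to that of MAE.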
Pretrain on ImageNet-1K
The following table provides pretrained checkpoints and logs used in the paper.
ConvMAE-Base | |
---|---|
pretrained checkpoints | download |
logs | download |
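Once downloaded, the checkpoint can be inspected with plain PyTorch before handing it to the finetuning scripts. The file name and the 'model' key below are assumptions based on the usual MAE-style checkpoint layout, not guarantees about this release; FINETUNE.md documents the supported way to consume these weights.

```python
import torch

# Hypothetical file name; the 'model' key follows the common MAE-style layout.
ckpt = torch.load("convmae_base_pretrained.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)
print(f"{len(state_dict)} tensors, first keys: {list(state_dict)[:3]}")
```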
Main Results on ImageNet-1K
Models | #Params(M) | Supervision | Encoder Ratio | Pretrain Epochs | FT acc@1(%) | LIN acc@1(%) | FT logs/weights | LIN logs/weights |
---|---|---|---|---|---|---|---|---|
BEiT | 88 | DALLE | 100% | 300 | 83.0 | 37.6 | - | - |
MAE | 88 | RGB | 25% | 1600 | 83.6 | 67.8 | - | - |
SimMIM | 88 | RGB | 100% | 800 | 84.0 | 56.7 | - | - |
MaskFeat | 88 | HOG | 100% | 300 | 83.6 | N/A | - | - |
data2vec | 88 | RGB | 100% | 800 | 84.2 | N/A | - | - |
ConvMAE-B | 88 | RGB | 25% | 1600 | 85.0 | 70.9 | log/weight | log/weight |
Main Results on COCO
Mask R-CNN
Models | Pretrain | Pretrain Epochs | Finetune Epochs | #Params(M) | FLOPs(T) | box AP | mask AP | logs/weights |
---|---|---|---|---|---|---|---|---|
Swin-B | IN21K w/ labels | 300 | 36 | 109 | 0.7 | 51.4 | 45.4 | - |
Swin-L | IN21K w/ labels | 300 | 36 | 218 | 1.1 | 52.4 | 46.2 | - |
MViTv2-B | IN21K w/ labels | 300 | 36 | 73 | 0.6 | 53.1 | 47.4 | - |
MViTv2-L | IN21K w/ labels | 300 | 36 | 239 | 1.3 | 53.6 | 47.5 | - |
Benchmarking-ViT-B | IN1K w/o labels | 1600 | 100 | 118 | 0.9 | 50.4 | 44.9 | - |
Benchmarking-ViT-L | IN1K w/o labels | 1600 | 100 | 340 | 1.9 | 53.3 | 47.2 | - |
ViTDet | IN1K w/o labels | 1600 | 100 | 111 | 0.8 | 51.2 | 45.5 | - |
MIMDet-ViT-B | IN1K w/o labels | 1600 | 36 | 127 | 1.1 | 51.5 | 46.0 | - |
MIMDet-ViT-L | IN1K w/o labels | 1600 | 36 | 345 | 2.6 | 53.3 | 47.5 | - |
ConvMAE-B | IN1K w/o labels | 1600 | 25 | 104 | 0.9 | 53.2 | 47.1 | log/weight |
Main Results on ADE20K
UperNet
Models | Pretrain | Pretrain Epochs | Finetune Iters | #Params(M) | FLOPs(T) | mIoU | logs/weights |
---|---|---|---|---|---|---|---|
DeiT-B | IN1K w/ labels | 300 | 16K | 163 | 0.6 | 45.6 | - |
Swin-B | IN1K w/ labels | 300 | 16K | 121 | 0.3 | 48.1 | - |
MoCo V3 | IN1K | 300 | 16K | 163 | 0.6 | 47.3 | - |
DINO | IN1K | 400 | 16K | 163 | 0.6 | 47.2 | - |
BEiT | IN1K+DALLE | 1600 | 16K | 163 | 0.6 | 47.1 | - |
PeCo | IN1K | 300 | 16K | 163 | 0.6 | 46.7 | - |
CAE | IN1K+DALLE | 800 | 16K | 163 | 0.6 | 48.8 | - |
MAE | IN1K | 1600 | 16K | 163 | 0.6 | 48.1 | - |
ConvMAE-B | IN1K | 1600 | 16K | 153 | 0.6 | 51.7 | soon |
Main Results on Kinetics-400
Models | Pretrain Epochs | Finetune Epochs | #Params(M) | Top1 | Top5 | logs/weights |
---|---|---|---|---|---|---|
VideoMAE-B | 200 | 100 | 87 | 77.8 | - | - |
VideoMAE-B | 800 | 100 | 87 | 79.4 | - | - |
VideoMAE-B | 1600 | 100 | 87 | 79.8 | - | - |
VideoMAE-B | 1600 | 100 (w/ Repeated Aug) | 87 | 80.7 | 94.7 | - |
SpatioTemporalLearner-B | 800 | 150 (w/ Repeated Aug) | 87 | 81.3 | 94.9 | - |
VideoConvMAE-B | 200 | 100 | 86 | 80.1 | 94.3 | Soon |
VideoConvMAE-B | 800 | 100 | 86 | 81.7 | 95.1 | Soon |
VideoConvMAE-B-MSD | 800 | 100 | 86 | 82.7 | 95.5 | Soon |
Main Results on Something-Something V2
Models | Pretrain Epochs | Finetune Epochs | #Params(M) | Top1 | Top5 | logs/weights |
---|---|---|---|---|---|---|
VideoMAE-B | 200 | 40 | 87 | 66.1 | - | - |
VideoMAE-B | 800 | 40 | 87 | 69.3 | - | - |
VideoMAE-B | 2400 | 40 | 87 | 70.3 | - | - |
VideoConvMAE-B | 200 | 40 | 86 | 67.7 | 91.2 | Soon |
VideoConvMAE-B | 800 | 40 | 86 | 69.9 | 92.4 | Soon |
VideoConvMAE-B-MSD | 800 | 40 | 86 | 70.7 | 93.0 | Soon |
Getting Started
Prerequisites
- Linux
- Python 3.7+
- CUDA 10.2+
- GCC 5+
Training and evaluation
- See PRETRAIN.md for pretraining.
- See FINETUNE.md for pretrained model finetuning and linear probing.
- See DETECTION.md for using pretrained backbone on Mask RCNN.
- See SEGMENTATION.md for using pretrained backbone on UperNet.
Acknowledgement
The pretraining and finetuning code of this project is based on DeiT and MAE. The object detection and semantic segmentation parts are based on MIMDet and MMSegmentation, respectively. Thanks for their wonderful work.
License
ConvMAE is released under the MIT License.
Citation
@article{gao2022convmae,
title={ConvMAE: Masked Convolution Meets Masked Autoencoders},
author={Gao, Peng and Ma, Teli and Li, Hongsheng and Dai, Jifeng and Qiao, Yu},
journal={arXiv preprint arXiv:2205.03892},
year={2022}
}