Implementation of Momentum^2 Teacher


Momentum^2 Teacher: Momentum Teacher with Momentum Statistics for Self-Supervised Learning


  1. All experiments were run with Python 3.6, torch==1.5.0, and torchvision==0.6.0.


Data Preparation

Prepare the ImageNet data in ${root_of_clone}/data/imagenet_train and ${root_of_clone}/data/imagenet_val. Because we read ImageNet from an internal storage platform, the local mode is untested. You may need to modify the code in momentum_teacher/data/ to support it.


Before training, make sure the repository root (namely ${root_of_clone}) is on your PYTHONPATH, e.g.:

export PYTHONPATH=$PYTHONPATH:${root_of_clone}

To run unsupervised pre-training of a ResNet-50 model on ImageNet on an 8-GPU machine, run the command below with these options:

  1. -d specifies the GPU ids used for training, e.g., -d 0-7
  2. -b specifies the batch size, e.g., -b 256
  3. --experiment-name specifies the output folder; the training log and models are dumped to './outputs/${experiment-name}'
  4. -f specifies the description file of your experiment.


python3 momentum_teacher/tools/ -b 256 -d 0-7 --experiment-name your_exp -f momentum_teacher/exps/arxiv/exp_8_v100/
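
The `-d 0-7` and `-b 256` values in the command above can be read as a device range and a batch size. The following is a minimal sketch of how such values might be interpreted, assuming that `-d` accepts an inclusive range and that `-b` is the total batch size split evenly across the selected GPUs. Both points are assumptions, and `parse_device_range` and `per_gpu_batch` are hypothetical helpers, not functions from this repository.

```python
def parse_device_range(spec):
    """Turn a string such as "0-7" or "3" into a list of GPU ids (hypothetical helper)."""
    if "-" in spec:
        start, end = spec.split("-")
        return list(range(int(start), int(end) + 1))  # inclusive on both ends
    return [int(spec)]


def per_gpu_batch(total_batch, num_gpus):
    """Split a total batch size evenly across GPUs (assumes -b is a total, not per-GPU)."""
    if total_batch % num_gpus != 0:
        raise ValueError("batch size must be divisible by the number of GPUs")
    return total_batch // num_gpus


devices = parse_device_range("0-7")
print(devices, per_gpu_batch(256, len(devices)))
```

Under these assumptions, `-b 256 -d 0-7` would give each of the 8 GPUs 32 samples per step.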

Linear Evaluation:

With a pre-trained model, to train a supervised linear classifier on frozen features/weights on an 8-GPU machine, run the command below with these options:

  1. -d specifies the GPU ids used for training, e.g., -d 0-7
  2. -b specifies the batch size, e.g., -b 256
  3. --experiment-name specifies the folder in which the pre-training models were saved.

python3 momentum_teacher/tools/ -b 256 --experiment-name your_exp -f momentum_teacher/exps/arxiv/


Results of Pretraining on a Single Machine

After pre-training on 8 NVIDIA V100 GPUs with a batch size of 1024, the linear-evaluation results are:

| pre-train code | pre-train epochs | pre-train time | accuracy (%) | weights |
|---|---|---|---|---|
| path | 100 | ~1.8 days | 70.7 | - |
| path | 200 | ~3.6 days | 72.7 | - |
| path | 300 | ~5.5 days | 73.8 | - |

After pre-training on 8 NVIDIA 2080 GPUs with a batch size of 256, the linear-evaluation results are:

| pre-train code | pre-train epochs | pre-train time | accuracy (%) | weights |
|---|---|---|---|---|
| path | 100 | ~2.5 days | 70.4 | - |
| path | 200 | ~5 days | 72.3 | - |
| path | 300 | ~7.5 days | 72.9 | - |

Results of Pretraining on Multiple Machines

For example, to run unsupervised pre-training with a batch size of 4096 on 32 V100 GPUs, assuming 4 machines with 8 V100 GPUs each, run:

# machine 1:
export MACHINE=0; export MACHINE_TOTAL=4; python3 momentum_teacher/tools/ -b 4096 -f xxx
# machine 2:
export MACHINE=1; export MACHINE_TOTAL=4; python3 momentum_teacher/tools/ -b 4096 -f xxx
# machine 3:
export MACHINE=2; export MACHINE_TOTAL=4; python3 momentum_teacher/tools/ -b 4096 -f xxx
# machine 4:
export MACHINE=3; export MACHINE_TOTAL=4; python3 momentum_teacher/tools/ -b 4096 -f xxx
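
The commands above identify each machine through the `MACHINE` and `MACHINE_TOTAL` environment variables. How the repository consumes them is not shown here. As a sketch under the common convention that global ranks are assigned machine by machine, the world size and the ranks owned by one machine could be derived as follows. `distributed_layout` and its `gpus_per_machine` parameter are hypothetical names used only for illustration.

```python
import os


def distributed_layout(gpus_per_machine, env=None):
    """Derive the world size and this machine's global ranks (illustrative convention)."""
    env = os.environ if env is None else env
    machine = int(env.get("MACHINE", 0))            # index of this machine
    machine_total = int(env.get("MACHINE_TOTAL", 1))  # number of machines
    if not 0 <= machine < machine_total:
        raise ValueError("MACHINE must be in [0, MACHINE_TOTAL)")
    world_size = machine_total * gpus_per_machine
    first_rank = machine * gpus_per_machine
    ranks = list(range(first_rank, first_rank + gpus_per_machine))
    return world_size, ranks


# Third machine (MACHINE=2) out of four, with 8 GPUs each
world_size, ranks = distributed_layout(8, {"MACHINE": "2", "MACHINE_TOTAL": "4"})
print(world_size, ranks)
```

With 4 machines of 8 GPUs each, this convention yields a world size of 32, which matches the 32 V100 GPUs in the example above.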

Linear-evaluation results:

| pre-train code | pre-train epochs | pre-train time | accuracy (%) | weights |
|---|---|---|---|---|
| path | 100 | ~11 hours | 70.3 | - |
| path | 200 | ~22 hours | 72.5 | - |
| path | 300 | ~33 hours | 73.7 | - |

To run unsupervised pre-training with a batch size of 4096 on 128 2080 GPUs, follow the same steps as above. Linear-evaluation results:

| pre-train code | pre-train epochs | pre-train time | accuracy (%) | weights |
|---|---|---|---|---|
| path | 100 | ~5 hours | 69.0 | - |
| path | 200 | ~10 hours | 71.5 | - |
| path | 300 | ~15 hours | 72.3 | - |
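
The four result tables report wall-clock time only. To compare total compute, the first row of each table (the `100` setting) can be converted to GPU-hours. This is a rough back-of-the-envelope sketch: the reported times are approximate, so the products are approximate as well, and the labels in `runs` are just descriptive strings.

```python
# (setup, number of GPUs, wall-clock hours for the "100" row), taken from the tables above
runs = [
    ("8 x V100, batch 1024", 8, 1.8 * 24),
    ("8 x 2080, batch 256", 8, 2.5 * 24),
    ("32 x V100, batch 4096", 32, 11),
    ("128 x 2080, batch 4096", 128, 5),
]

# GPU-hours = number of GPUs * wall-clock hours
gpu_hours = {name: gpus * hours for name, gpus, hours in runs}
for name, value in gpu_hours.items():
    print(f"{name}: ~{value:.0f} GPU-hours")
```

This gives roughly 346, 480, 352, and 640 GPU-hours respectively, so going from 8 to 32 V100 GPUs shortens wall-clock time at nearly constant total compute, while the 128-GPU 2080 run spends more total compute to finish sooner.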


This is an implementation of Momentum^2 Teacher. Note that:

  • The original implementation is based on our internal platform.
  • This released version performs slightly better than the results in the tech report.

  • Teacher network weights used for linear evaluation

    Hi, and thanks a lot for sharing your code! I noticed that both MoCo and BYOL use the frozen weights from the student network for linear evaluation, whereas your evaluation script uses the weights from the teacher network.

    In my tests (on another dataset) the linear evaluation score is much better when using the student network weights. Could you confirm which weights were used in the paper?

    opened by oliviermoliner 2
  • Question about the moco implementation

    Hello, I'm a bit confused about the MoCo implementation in this paper. Since MoCo has only one forward pass for the teacher network, I guess the lazy update is not required for MoCo, right? In that case, did you include the BN statistics of the current batch during the forward pass?

    To be more specific, do you update the running_mean and running_var before calculating x?

    with torch.no_grad():
        self.running_mean = self.momentum * mean + (1 - self.momentum) * self.running_mean
        self.running_var = self.momentum * var * n / (n - 1) + (1 - self.momentum) * self.running_var
    x = (x - self.running_mean[None, :, None, None].detach()) / (
        torch.sqrt(self.running_var[None, :, None, None].detach() + self.eps))

    or you calculate x first

    x = (x - self.running_mean[None, :, None, None].detach()) / (
        torch.sqrt(self.running_var[None, :, None, None].detach() + self.eps))
    with torch.no_grad():
        self.running_mean = self.momentum * mean + (1 - self.momentum) * self.running_mean
        self.running_var = self.momentum * var * n / (n - 1) + (1 - self.momentum) * self.running_var
    opened by kyle-1997 0
  • Can you provide MomentumBatchNorm3d?

    Thanks for your excellent work.

    I tried to implement a 3D version of momentum BN for self-supervised video representation learning. However, the performance of the pre-trained model is as good as pre-training with SyncBn. Can you provide the official implementation?

    Here is my implementation:

    opened by SunDoge 0
  • NCCL issue while reproducing the results

    Thank you for sharing this great work!

    When running your code, I hit an NCCL error, which is shown in the attached screenshot.

    I was wondering if others met this error as well while running this code?

    opened by danielchyeh 0
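
The two orderings asked about in the MoCo question above differ only in whether the current batch contributes to the statistics used to normalize it. The following single-channel sketch in plain Python makes that difference concrete. It is not the repository's implementation and does not say which ordering the authors use. `MomentumStats`, its `update_first` flag, and the default `momentum` and `eps` values are hypothetical.

```python
from statistics import mean, pvariance


class MomentumStats:
    """Single-channel running mean/variance with momentum (illustrative only)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def _update(self, batch):
        n = len(batch)
        batch_mean = mean(batch)
        batch_var = pvariance(batch, batch_mean)  # biased batch variance
        m = self.momentum
        self.running_mean = m * batch_mean + (1 - m) * self.running_mean
        # n / (n - 1) converts the biased variance to the unbiased estimate
        self.running_var = m * batch_var * n / (n - 1) + (1 - m) * self.running_var

    def _normalize(self, batch):
        scale = (self.running_var + self.eps) ** 0.5
        return [(x - self.running_mean) / scale for x in batch]

    def forward(self, batch, update_first):
        if update_first:
            self._update(batch)       # current batch influences its own normalization
            return self._normalize(batch)
        out = self._normalize(batch)  # only earlier batches influence normalization
        self._update(batch)
        return out


batch = [1.0, 2.0, 3.0, 4.0]
a, b = MomentumStats(), MomentumStats()
out_a = a.forward(batch, update_first=True)
out_b = b.forward(batch, update_first=False)
print(out_a[0], out_b[0])
```

After one step both variants hold the same running statistics. Only the output for the current batch differs, because the second variant normalizes with statistics that lag one step behind.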
jemmy li