MoCo: Momentum Contrast for Unsupervised Visual Representation Learning

This is a PyTorch implementation of the MoCo paper:

@Article{he2019moco,
  author  = {Kaiming He and Haoqi Fan and Yuxin Wu and Saining Xie and Ross Girshick},
  title   = {Momentum Contrast for Unsupervised Visual Representation Learning},
  journal = {arXiv preprint arXiv:1911.05722},
  year    = {2019},
}

It also includes the implementation of the MoCo v2 paper:

@Article{chen2020mocov2,
  author  = {Xinlei Chen and Haoqi Fan and Ross Girshick and Kaiming He},
  title   = {Improved Baselines with Momentum Contrastive Learning},
  journal = {arXiv preprint arXiv:2003.04297},
  year    = {2020},
}

Preparation

Install PyTorch and download the ImageNet dataset following the official PyTorch ImageNet training code.

This repo aims to make minimal modifications to that code. Check the modifications by:

diff main_moco.py <(curl https://raw.githubusercontent.com/pytorch/examples/master/imagenet/main.py)
diff main_lincls.py <(curl https://raw.githubusercontent.com/pytorch/examples/master/imagenet/main.py)

Unsupervised Training

This implementation only supports multi-GPU DistributedDataParallel training, which is faster and simpler; single-GPU and DataParallel training are not supported.

To do unsupervised pre-training of a ResNet-50 model on ImageNet on an 8-GPU machine, run:

python main_moco.py \
  -a resnet50 \
  --lr 0.03 \
  --batch-size 256 \
  --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

This script uses all the default hyper-parameters described in the MoCo v1 paper. To run MoCo v2, set --mlp --moco-t 0.2 --aug-plus --cos, as shown below.
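
For example, a full MoCo v2 pre-training command with the same setup is:

python main_moco.py \
  -a resnet50 \
  --lr 0.03 \
  --batch-size 256 \
  --mlp --moco-t 0.2 --aug-plus --cos \
  --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]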

Note: for 4-GPU training, we recommend following the linear lr scaling recipe: --lr 0.015 --batch-size 128 on 4 GPUs. We got similar results with this setting.

Linear Classification

With a pre-trained model, to train a supervised linear classifier on frozen features/weights on an 8-GPU machine, run:

python main_lincls.py \
  -a resnet50 \
  --lr 30.0 \
  --batch-size 256 \
  --pretrained [your checkpoint path]/checkpoint_0199.pth.tar \
  --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]
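
If you only need the pre-trained backbone outside of main_lincls.py, below is a minimal sketch of loading it into a torchvision ResNet-50, assuming (as with the released checkpoints) that the query encoder is stored under module.encoder_q.*:

import torch
import torchvision.models as models

model = models.resnet50()
checkpoint = torch.load('checkpoint_0199.pth.tar', map_location='cpu')
state_dict = checkpoint['state_dict']
for k in list(state_dict.keys()):
    # keep the query-encoder backbone; drop its MoCo projection head
    if k.startswith('module.encoder_q.') and not k.startswith('module.encoder_q.fc'):
        state_dict[k[len('module.encoder_q.'):]] = state_dict[k]
    del state_dict[k]
msg = model.load_state_dict(state_dict, strict=False)
print(msg.missing_keys)  # expect only ['fc.weight', 'fc.bias']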

Linear classification results on ImageNet using this repo with 8 NVIDIA V100 GPUs:

|           | pre-train epochs | pre-train time | MoCo v1 top-1 acc. | MoCo v2 top-1 acc. |
|-----------|------------------|----------------|--------------------|--------------------|
| ResNet-50 | 200              | 53 hours       | 60.8±0.2           | 67.5±0.1           |

Here we run 5 trials (of pre-training and linear classification) and report mean±std: the 5 results of MoCo v1 are {60.6, 60.6, 60.7, 60.9, 61.1}, and of MoCo v2 are {67.7, 67.6, 67.4, 67.6, 67.3}.

Models

Our pre-trained ResNet-50 models can be downloaded as follows:

|         | epochs | mlp | aug+ | cos | top-1 acc. | model    | md5      |
|---------|--------|-----|------|-----|------------|----------|----------|
| MoCo v1 | 200    |     |      |     | 60.6       | download | b251726a |
| MoCo v2 | 200    | ✓   | ✓    | ✓   | 67.7       | download | 59fd9945 |
| MoCo v2 | 800    | ✓   | ✓    | ✓   | 71.1       | download | a04e12f8 |

Transferring to Object Detection

See ./detection.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Comments
  • ImageNet linear classifier weights?

    Hi, would you mind also uploading the weights (or the whole checkpoint) for a model with the linear classifier on ImageNet? I'm running main_lincls.py myself at the moment, but it looks like it will take quite some time to get through the 100 epochs needed, and I guess it would be generally useful to others to have these weights readily downloadable.

    opened by lucasb-eyer 18
  • Is shuffle batch norm tied with momentum contrast training?

    Hi, first, thank you for the work. I am wondering whether, if one wants to use shuffle BN, the training has to follow the momentum-update fashion. It seems to me that in shuffle BN the encoder that encodes the shuffled batch cannot be updated by backprop, even if one wanted to. Is this a correct understanding, or am I getting something wrong? Thank you.
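
    A toy, single-process sketch of the key side may help (the real code shuffles across GPUs; the function name here is illustrative). The key encoder runs under torch.no_grad(), so the shuffled batch is indeed never updated by backprop, only by the momentum rule:

    import torch

    @torch.no_grad()  # keys carry no gradients; encoder_k is updated by momentum only
    def encode_keys_with_shuffle_bn(encoder_k, im_k):
        idx_shuffle = torch.randperm(im_k.shape[0])  # shuffle so BN statistics mix sub-batches
        idx_unshuffle = torch.argsort(idx_shuffle)   # inverse permutation
        k = encoder_k(im_k[idx_shuffle])
        return k[idx_unshuffle]                      # restore order so keys align with queries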

    opened by dongyaoli10x 6
  • Are there results with other normalizations?

    Hello, thanks for the awesome project and paper.

    Are there some results with other normalizations (instance norm, layer norm...) instead of shuffle BN?

    I found that shuffle BN takes 20% of the time in def forward(self, im_q, im_k) in both V100x4 and V100x2 settings.

    In addition, the shuffle time is 6x longer than the inference of the key features in https://github.com/facebookresearch/moco/blob/master/moco/builder.py#L133-L135

    • shuffle time (line 133): 0.06 s
    • inference (line 135): 0.01 s

    I think that if replacing batch norm with other normalizations does not hurt the results, we could make model training faster.
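
    As an untested sketch of such an experiment (whether accuracy survives the swap is exactly the open question), torchvision's ResNet accepts a custom norm_layer, and per-sample normalizations like GroupNorm would need no shuffling at all:

    import torch.nn as nn
    import torchvision.models as models

    def group_norm(num_channels):
        # 32 groups divides every channel count in ResNet-50
        return nn.GroupNorm(num_groups=32, num_channels=num_channels)

    encoder = models.resnet50(norm_layer=group_norm)  # every BN replaced by GroupNorm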

    opened by LeeDoYup 6
  • strange top-1

    Epoch: [34][3590/4999] Time 0.426 ( 1.635) Data 0.000 ( 0.227) Loss 6.8926e+00 (6.9147e+00) Acc@1 73.44 ( 76.76) Acc@5 87.50 ( 87.55)
    Epoch: [34][3600/4999] Time 0.437 ( 1.638) Data 0.000 ( 0.227) Loss 7.0694e+00 (6.9147e+00) Acc@1 59.38 ( 76.76) Acc@5 76.56 ( 87.55)
    Epoch: [34][3610/4999] Time 0.432 ( 1.638) Data 0.000 ( 0.226) Loss 6.9074e+00 (6.9146e+00) Acc@1 78.12 ( 76.76) Acc@5 90.62 ( 87.55)
    Epoch: [34][3620/4999] Time 0.423 ( 1.639) Data 0.000 ( 0.225) Loss 6.9464e+00 (6.9146e+00) Acc@1 71.88 ( 76.76) Acc@5 85.94 ( 87.55)
    Epoch: [34][3630/4999] Time 0.436 ( 1.644) Data 0.000 ( 0.225) Loss 6.8364e+00 (6.9145e+00) Acc@1 81.25 ( 76.77) Acc@5 89.06 ( 87.56)
    Epoch: [34][3640/4999] Time 0.425 ( 1.646) Data 0.000 ( 0.224) Loss 6.9520e+00 (6.9145e+00) Acc@1 71.88 ( 76.76) Acc@5 85.94 ( 87.56)
    Epoch: [34][3650/4999] Time 0.426 ( 1.646) Data 0.000 ( 0.224) Loss 6.8319e+00 (6.9145e+00) Acc@1 84.38 ( 76.77) Acc@5 87.50 ( 87.56)
    Epoch: [34][3660/4999] Time 0.428 ( 1.646) Data 0.000 ( 0.223) Loss 6.8066e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 90.62 ( 87.57)
    Epoch: [34][3670/4999] Time 0.471 ( 1.651) Data 0.000 ( 0.222) Loss 6.9694e+00 (6.9144e+00) Acc@1 78.12 ( 76.77) Acc@5 89.06 ( 87.57)
    Epoch: [34][3680/4999] Time 0.431 ( 1.650) Data 0.000 ( 0.222) Loss 6.8628e+00 (6.9144e+00) Acc@1 81.25 ( 76.77) Acc@5 87.50 ( 87.57)
    Epoch: [34][3690/4999] Time 0.428 ( 1.650) Data 0.000 ( 0.221) Loss 6.8666e+00 (6.9145e+00) Acc@1 81.25 ( 76.77) Acc@5 92.19 ( 87.56)
    Epoch: [34][3700/4999] Time 0.434 ( 1.650) Data 0.000 ( 0.221) Loss 6.9402e+00 (6.9144e+00) Acc@1 71.88 ( 76.78) Acc@5 87.50 ( 87.57)
    Epoch: [34][3710/4999] Time 0.434 ( 1.654) Data 0.000 ( 0.220) Loss 6.8522e+00 (6.9144e+00) Acc@1 81.25 ( 76.78) Acc@5 92.19 ( 87.57)
    Epoch: [34][3720/4999] Time 0.421 ( 1.655) Data 0.000 ( 0.219) Loss 6.8393e+00 (6.9145e+00) Acc@1 79.69 ( 76.78) Acc@5 90.62 ( 87.57)
    Epoch: [34][3730/4999] Time 0.426 ( 1.658) Data 0.000 ( 0.219) Loss 6.9804e+00 (6.9145e+00) Acc@1 68.75 ( 76.78) Acc@5 81.25 ( 87.57)
    Epoch: [34][3740/4999] Time 0.424 ( 1.658) Data 0.000 ( 0.218) Loss 7.0028e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 82.81 ( 87.57)
    Epoch: [34][3750/4999] Time 0.438 ( 1.662) Data 0.000 ( 0.218) Loss 6.9528e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 82.81 ( 87.57)
    Epoch: [34][3760/4999] Time 0.423 ( 1.664) Data 0.000 ( 0.217) Loss 6.8455e+00 (6.9143e+00) Acc@1 76.56 ( 76.79) Acc@5 93.75 ( 87.57)
    Epoch: [34][3770/4999] Time 0.430 ( 1.666) Data 0.000 ( 0.217) Loss 6.9374e+00 (6.9143e+00) Acc@1 81.25 ( 76.79) Acc@5 90.62 ( 87.57)

    I used the following command to train on ImageNet with four 2080 Ti GPUs:

    python main_moco.py -a resnet50 --mlp --moco-t 0.2 --aug-plus --cos --lr 0.015 --batch-size 256 --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 /job/large_dataset/open_datasets/ImageNet/

    I suspect it is training in a supervised manner. Is there anything wrong with my experiments?

    opened by passion3394 5
  • Error in distributed training

    I frequently get an error when distributed training is enabled; it occurs roughly every 50~100 epochs. Here is the error message:

    terminate called after throwing an instance of 'std::system_error'
      what():  Transport endpoint is not connected
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
        exitcode = _main(fd)
      File "/usr/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
        self = reduction.pickle.load(from_parent)
    _pickle.UnpicklingError: pickle data was truncated
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
        exitcode = _main(fd)
      File "/usr/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
        self = reduction.pickle.load(from_parent)
    EOFError: Ran out of input
    

    Could you help me to resolve the issue?

    opened by kibok90 5
  • Code and settings of semantic segmentation task

    Hi, I'd like to reproduce the results of the semantic segmentation tasks (VOC and LVIS), but I couldn't find the code and settings files. Do you have any plans to provide them in this repository? Thanks.

    opened by yshinya6 4
  • Accuracy is 100% for the first 10 batches of epoch 1, then decreases to 0 or a very small value

    I am trying to run the code on my own data (not one of the datasets mentioned in the paper). During training, the accuracy is 100% for the first 10 batches of the first epoch, but it then decreases to 0 or a comparatively very small value throughout further training. The loss also increases the whole time. [snapshot attached]

    opened by angy50 4
  • Can BN be applied within DistributedDataParallel (DDP)?

    Typically, we use SyncBN in DDP to ensure that the computed gradients are identical across different GPUs. It keeps the models on different GPUs with exactly the same parameters during training.

    However, in MoCo training (IN-1M), the encoders contain several vanilla BNs. How can we ensure that the models across GPUs have the same parameters? Thanks.
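
    For what it's worth: in DDP the learned parameters stay identical across GPUs because gradients are all-reduced before each optimizer step; vanilla BN only computes its batch statistics locally. If one did want the statistics synchronized as well, a common sketch (not what this repo does; MoCo keeps vanilla BN and uses shuffling instead) is to convert the model before wrapping it in DDP:

    import torch.nn as nn
    import torchvision.models as models

    model = models.resnet50()
    # Replace every BatchNorm with SyncBatchNorm before wrapping in DDP,
    # so statistics are computed over the global batch at every step.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)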

    opened by Ze-Yang 4
  • Question about code in moco/builder.py

    Hi,

    Thanks for your impressive work.

    In moco/builder.py, line 63:

    self.queue[:, ptr:ptr + batch_size] = keys.T

    I suppose that keys is a Tensor with a batch_size dimension, and that T is a float scalar attribute of self (self.T).

    So AttributeError: 'Tensor' object has no attribute 'T' is raised when I directly run the training code.

    Should it be self.T (I guess)? Or is there any specific setting I missed?

    Regards,
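
    As a side note, Tensor.T only exists in newer PyTorch releases; on older versions an equivalent fix is the 2-D transpose .t(). A standalone sketch with illustrative sizes:

    import torch

    dim, K, batch_size, ptr = 128, 65536, 256, 0  # illustrative sizes
    queue = torch.randn(dim, K)
    keys = torch.randn(batch_size, dim)
    # keys.T needs a recent PyTorch; keys.t() does the same 2-D transpose everywhere
    queue[:, ptr:ptr + batch_size] = keys.t()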

    opened by AutumnZ-94 4
  • Loss curves on ImageNet

    Hello -- I'm trying to reproduce some of these results on a different dataset, and the loss slowly bounces up and down, without converging (see below). Is that expected behavior? I don't think the paper shows what the loss/pretext accuracy look like in the ImageNet training -- might it be possible to share those plots here?

    [loss curve screenshot attached]

    Thanks!

    Edit: Note, my dataset has ~250K images, so ~25% the size of ImageNet -- I'm wondering whether the difference in dataset sizes could be causing problems? E.g., perhaps because the length of the momentum buffer is 4x larger relative to the size of the dataset.
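
    One thing that might be worth trying on a smaller dataset (an assumption, not an official recipe) is shrinking the queue with the --moco-k flag so it stays well below the dataset size; note that the code requires the queue size to be divisible by the total batch size:

    python main_moco.py \
      -a resnet50 \
      --lr 0.03 --batch-size 256 \
      --moco-k 16384 \
      --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
      [your dataset folder]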

    opened by bkj 4
  • Issue about dequeue_and_enqueue

    Hi, I am a little confused about the code of _dequeue_and_enqueue

    https://github.com/facebookresearch/moco/blob/main/moco/builder.py#L53-L66

            # replace the keys at ptr (dequeue and enqueue)
            self.queue[:, ptr:ptr + batch_size] = keys.T
            ptr = (ptr + batch_size) % self.K  # move pointer
    

    This implementation updates the queue gradually, as illustrated below: [diagram attached]

    But as described in the paper, the new tensors should be pushed into the queue [left side] and the oldest removed [right side], first-in-first-out; the code above just replaces the red rectangle from left to right, so I'm confused about it.

    Paper:

    The samples in the dictionary are progressively replaced. The current mini-batch is enqueued to the dictionary, and the oldest mini-batch in the queue is removed.
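
    A toy sketch with made-up sizes may resolve the confusion: the pointer always stops at the slot written longest ago, so overwriting at ptr is exactly "enqueue the new batch, dequeue the oldest". The queue is a ring buffer; only the drawing's orientation differs from the paper's description:

    import torch

    dim, K, batch_size = 2, 8, 2  # toy sizes
    queue = torch.zeros(dim, K)
    ptr = 0
    for step in range(1, 7):
        keys = torch.full((batch_size, dim), float(step))  # stand-in for a batch of keys
        queue[:, ptr:ptr + batch_size] = keys.t()  # the slot at ptr holds the oldest keys
        ptr = (ptr + batch_size) % K
    print(queue[0])  # tensor([5., 5., 6., 6., 3., 3., 4., 4.]): steps 1 and 2 are gone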

    opened by qishibo 3
  • How to load the hyperparameters without the command-line argument parser?

    Hi,

    Has anyone tried loading the hyperparameters, or modifying the code, without using the command-line argument parser? Please share here if you have.
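
    One sketch (untested; the field names below are the argparse dests I would expect from main_moco.py, so double-check them against the script) is to build the namespace yourself instead of parsing the command line:

    from argparse import Namespace

    # Hypothetical subset of main_moco.py's defaults; add any other fields the script reads.
    args = Namespace(
        data='/path/to/imagenet', arch='resnet50', workers=32, epochs=200,
        batch_size=256, lr=0.03, momentum=0.9, weight_decay=1e-4,
        moco_dim=128, moco_k=65536, moco_m=0.999, moco_t=0.07,
        mlp=False, aug_plus=False, cos=False,
    )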

    opened by arohi2bujji 0
  • Low Accuracy

    I ran this code with 2 V100 GPUs:

    python main_lincls.py \
      -a resnet50 \
      --lr 30.0 \
      --batch-size 2048 \
      --pretrained moco_v1_200ep_pretrain.pth.tar \
      /workspace/imagenet \
      -j 14

    After 98 epochs, my top-1 accuracy is only 57.90.

    Why is the performance so low?

    opened by dbsdmlgus50 0
  • Issue with batch size

    Hello @KaimingHe, I am running the MoCo pre-training with the following configurations:

    Number of GPUs: 8
    GPU type: NVIDIA Quadro RTX 8000

    But I am unable to run this with a batch size greater than 8. It throws a PyTorch spawn error when I set the batch size above 8 (e.g., 16 or 32). I confirmed that all GPUs run at full utilization when I run the pre-training with the above-mentioned batch size. Do you have any suggestions or thoughts on how I might proceed? With a batch size this small, would it produce any meaningful results?

    opened by kkannan8291 1
  • Question about the queue for key encoder

    Hey, First of all, thank you for the code.

    I have a question about updating the queue using the mini-batch. In the paper, the size of the queue is 65,536, which is basically the number of negative keys. I am confused by the term "keys" here: do 65,536 keys mean 65,536 images, or 65,536 features generated from the key encoder?

    opened by Rohit8y 2
  • Question about transferring to COCO with MoCo v1 and MoCo v2 checkpoints

    Hi, I have some questions about reproducing the results on COCO.

    When I use the MoCo v1 pre-training checkpoint released in the repo, I get the same results reported in the paper, as below.

    Evaluation results for bbox:

    |   AP   |  AP50  |  AP75  |  APs   |  APm   |  APl   |
    |:------:|:------:|:------:|:------:|:------:|:------:|
    | 38.658 | 58.372 | 41.765 | 21.351 | 43.372 | 51.629 |

    Evaluation results for segm:

    |   AP   |  AP50  |  AP75  |  APs   |  APm   |  APl   |
    |:------:|:------:|:------:|:------:|:------:|:------:|
    | 34.014 | 55.207 | 36.171 | 14.915 | 37.526 | 50.783 |

    However, when I change the pre-training checkpoint to the MoCo v2 one, I get worse results, as below.

    Evaluation results for bbox:

    |   AP   |  AP50  |  AP75  |  APs   |  APm   |  APl   |
    |:------:|:------:|:------:|:------:|:------:|:------:|
    | 33.921 | 52.401 | 36.497 | 19.247 | 37.598 | 45.513 |

    Evaluation results for segm:

    |   AP   |  AP50  |  AP75  |  APs   |  APm   |  APl   |
    |:------:|:------:|:------:|:------:|:------:|:------:|
    | 30.113 | 49.341 | 31.897 | 13.713 | 32.557 | 45.618 |

    There is a large gap of about 4% between the MoCo v1 and MoCo v2 results. The code I used is from this repo with the official settings. Is this normal, or did I make a mistake? I would appreciate any advice or explanations. Thanks a lot!

    opened by mZhenz 0
See Also
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

This is the official PyTorch implementation of the ALBEF paper [Blog]. This repository supports pre-training on custom datasets, as well as finetuning on VQA, SNLI-VE, NLVR2, Image-Text Retrieval on MSCOCO and Flickr30k, and visual grounding on RefCOCO+. Pre-trained and finetuned checkpoints are released.

Salesforce 805 Jan 9, 2023
custom pytorch implementation of MoCo v3

MoCov3-pytorch custom implementation of MoCov3 [arxiv]. I made minor modifications based on the official MoCo repository [github]. No ViT part code an

null 39 Nov 14, 2022
PyTorch implementation of MoCo v3 for self-supervised ResNet and ViT.

MoCo v3 for Self-supervised ResNet and ViT Introduction This is a PyTorch implementation of MoCo v3 for self-supervised ResNet and ViT. The original M

Facebook Research 887 Jan 8, 2023
This repository is the official implementation of Unleashing the Power of Contrastive Self-Supervised Visual Models via Contrast-Regularized Fine-Tuning (NeurIPS21).

Core-tuning This repository is the official implementation of ``Unleashing the Power of Contrastive Self-Supervised Visual Models via Contrast-Regular

vanint 18 Dec 17, 2022
The code for MM2021 paper "Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning"

The Code for MM2021 paper "Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning" Setting up and using the repo Get the dataset. Follow

null 4 Apr 20, 2022
Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning, CVPR 2021

Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning By Zhenda Xie*, Yutong Lin*, Zheng Zhang, Yue Ca

Zhenda Xie 293 Dec 20, 2022
UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning

UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning This is the official PyTorch implementation for UniMoCo pape

dddzg 49 Jan 2, 2023
This is the code for CVPR 2021 oral paper: Jigsaw Clustering for Unsupervised Visual Representation Learning

JigsawClustering Jigsaw Clustering for Unsupervised Visual Representation Learning Pengguang Chen, Shu Liu, Jiaya Jia Introduction This project provid

DV Lab 73 Sep 18, 2022
PyTorch implementation of "Contrast to Divide: self-supervised pre-training for learning with noisy labels"

Contrast to Divide: self-supervised pre-training for learning with noisy labels This is an official implementation of "Contrast to Divide: self-superv

null 55 Nov 23, 2022
Implementation of momentum^2 teacher

Momentum^2 Teacher: Momentum Teacher with Momentum Statistics for Self-Supervised Learning Requirements All experiments are done with python3.6, torch

jemmy li 121 Sep 26, 2022
Deep learning algorithms for muon momentum estimation in the CMS Trigger System

Deep learning algorithms for muon momentum estimation in the CMS Trigger System The Compact Muon Solenoid (CMS) is a general-purpose detector at the L

anuragB 2 Oct 6, 2021
This is the official pytorch implementation for the paper: Instance Similarity Learning for Unsupervised Feature Representation.

ISL This is the official pytorch implementation for the paper: Instance Similarity Learning for Unsupervised Feature Representation, which is accepted

null 19 May 4, 2022
PyTorch implementation code for the paper MixCo: Mix-up Contrastive Learning for Visual Representation

How to Reproduce our Results This repository contains PyTorch implementation code for the paper MixCo: Mix-up Contrastive Learning for Visual Represen

opcrisis 46 Dec 15, 2022
auto-tuning momentum SGD optimizer

YellowFin YellowFin is an auto-tuning optimizer based on momentum SGD which requires no manual specification of learning rate and momentum. It measure

Jian Zhang 288 Nov 19, 2022
Boosting Adversarial Attacks with Enhanced Momentum (BMVC 2021)

EMI-FGSM This repository contains code to reproduce results from the paper: Boosting Adversarial Attacks with Enhanced Momentum (BMVC 2021) Xiaosen Wa

John Hopcroft Lab at HUST 10 Sep 26, 2022
The implementation of "Bootstrapping Semantic Segmentation with Regional Contrast".

ReCo - Regional Contrast This repository contains the source code of ReCo and baselines from the paper, Bootstrapping Semantic Segmentation with Regio

Shikun Liu 128 Dec 30, 2022
This is an implementation for the CVPR2020 paper "Learning Invariant Representation for Unsupervised Image Restoration"

Learning Invariant Representation for Unsupervised Image Restoration (CVPR 2020) Introduction This is an implementation for the paper "Learning Invari

GarField 88 Nov 7, 2022
Viewmaker Networks: Learning Views for Unsupervised Representation Learning

Viewmaker Networks: Learning Views for Unsupervised Representation Learning Alex Tamkin, Mike Wu, and Noah Goodman Paper link: https://arxiv.org/abs/2

Alex Tamkin 31 Dec 1, 2022
CRLT: A Unified Contrastive Learning Toolkit for Unsupervised Text Representation Learning

CRLT: A Unified Contrastive Learning Toolkit for Unsupervised Text Representation Learning This repository contains the code and relevant instructions

XiaoMing 5 Aug 19, 2022