PyTorch implementation of MoCo: Momentum Contrast for Unsupervised Visual Representation Learning

Meta Research

Last update: Jan 2, 2023

Overview

MoCo: Momentum Contrast for Unsupervised Visual Representation Learning

This is a PyTorch implementation of the MoCo paper:

@Article{he2019moco,
  author  = {Kaiming He and Haoqi Fan and Yuxin Wu and Saining Xie and Ross Girshick},
  title   = {Momentum Contrast for Unsupervised Visual Representation Learning},
  journal = {arXiv preprint arXiv:1911.05722},
  year    = {2019},
}

It also includes the implementation of the MoCo v2 paper:

@Article{chen2020mocov2,
  author  = {Xinlei Chen and Haoqi Fan and Ross Girshick and Kaiming He},
  title   = {Improved Baselines with Momentum Contrastive Learning},
  journal = {arXiv preprint arXiv:2003.04297},
  year    = {2020},
}

Preparation

Install PyTorch and prepare the ImageNet dataset following the official PyTorch ImageNet training code.

This repo aims to make only minimal modifications to that code. Check the modifications by:

diff main_moco.py <(curl https://raw.githubusercontent.com/pytorch/examples/master/imagenet/main.py)
diff main_lincls.py <(curl https://raw.githubusercontent.com/pytorch/examples/master/imagenet/main.py)

Unsupervised Training

This implementation only supports multi-gpu, DistributedDataParallel training, which is faster and simpler; single-gpu or DataParallel training is not supported.

To do unsupervised pre-training of a ResNet-50 model on ImageNet in an 8-gpu machine, run:

python main_moco.py \
  -a resnet50 \
  --lr 0.03 \
  --batch-size 256 \
  --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

This script uses all the default hyper-parameters as described in the MoCo v1 paper. To run MoCo v2, set --mlp --moco-t 0.2 --aug-plus --cos.

Note: for 4-gpu training, we recommend following the linear lr scaling recipe: --lr 0.015 --batch-size 128 with 4 gpus. We got similar results using this setting.
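
The linear lr scaling recipe in the note above keeps the ratio of learning rate to total batch size fixed. A minimal sketch, assuming the 8-gpu settings (--lr 0.03, --batch-size 256) as the reference point; the helper name scaled_lr is hypothetical and not part of this repo:

```python
def scaled_lr(batch_size, base_lr=0.03, base_batch_size=256):
    """Scale the learning rate linearly with the total batch size.

    base_lr and base_batch_size are the 8-gpu values from the command above.
    """
    return base_lr * batch_size / base_batch_size


# Halving the total batch size (256 -> 128) halves the learning rate.
print(scaled_lr(128))
```

Under this rule, the recommended 4-gpu setting (--lr 0.015 --batch-size 128) is the 8-gpu setting scaled by one half.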

Linear Classification

With a pre-trained model, to train a supervised linear classifier on frozen features/weights in an 8-gpu machine, run:

python main_lincls.py \
  -a resnet50 \
  --lr 30.0 \
  --batch-size 256 \
  --pretrained [your checkpoint path]/checkpoint_0199.pth.tar \
  --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

Linear classification results on ImageNet using this repo with 8 NVIDIA V100 GPUs:

	pre-train epochs	pre-train time	MoCo v1 top-1 acc.	MoCo v2 top-1 acc.
ResNet-50	200	53 hours	60.8±0.2	67.5±0.1

Here we run 5 trials (of pre-training and linear classification) and report mean±std: the 5 results of MoCo v1 are {60.6, 60.6, 60.7, 60.9, 61.1}, and of MoCo v2 are {67.7, 67.6, 67.4, 67.6, 67.3}.
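
The mean±std figures can be checked from the five listed results. The README does not say which standard-deviation estimator was used; the sketch below assumes the population form (statistics.pstdev), which reproduces both reported values, whereas the sample form (statistics.stdev) would round to 0.2 for MoCo v2. The helper name summarize is hypothetical:

```python
from statistics import mean, pstdev


def summarize(top1_results):
    """Return (mean, population std) rounded to one decimal place."""
    return round(mean(top1_results), 1), round(pstdev(top1_results), 1)


moco_v1 = [60.6, 60.6, 60.7, 60.9, 61.1]
moco_v2 = [67.7, 67.6, 67.4, 67.6, 67.3]

print("MoCo v1:", summarize(moco_v1))
print("MoCo v2:", summarize(moco_v2))
```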

Models

Our pre-trained ResNet-50 models can be downloaded as follows:

	epochs	mlp	aug+	cos	top-1 acc.	model	md5
MoCo v1	200				60.6	download	`b251726a`
MoCo v2	200	✓	✓	✓	67.7	download	`59fd9945`
MoCo v2	800	✓	✓	✓	71.1	download	`a04e12f8`

Transferring to Object Detection

See ./detection.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

ImageNet linear classifier weights?

Hi, would you mind also uploading the weights (or whole checkpoint) for a model with the linear classifier on ImageNet? I'm running main_lincls.py myself currently, but it looks like it will take quite some time to get through the 100ep needed and I guess it can be generally useful to others to have these weights readily downloadable.

opened by lucasb-eyer 18

Is shuffle batch norm tied with momentum contrast training?

Hi, first thank you for the work. I am wondering if one wants to use shuffle BN, the training has to be in the momentum update fashion. Because it seems to me that in shuffle BN, the encoder which encodes the shuffled batch cannot be updated by backprop even if one wants to. Is this a correct understanding or am I getting something wrong? Thank you.

opened by dongyaoli10x 6

Are there results with other normalizations?

Hello, thanks for the awesome project and paper.

Are there some results with other normalizations (instance norm, layer norm...) instead of shuffle BN?

I found that shuffle BN takes 20% of the time in def forward(self, im_q, im_k) with both the V100x4 and V100x2 settings.

In addition, the shuffle takes 6x longer than the inference of key features in https://github.com/facebookresearch/moco/blob/master/moco/builder.py#L133-L135

shuffle time (line 133): 0.06 s

inference (line 135): 0.01 s

I think that if replacing batch norm with other normalizations does not hurt the results, we can make model training faster.

opened by LeeDoYup 6

strange top-1

Epoch: [34][3590/4999] Time 0.426 ( 1.635) Data 0.000 ( 0.227) Loss 6.8926e+00 (6.9147e+00) Acc@1 73.44 ( 76.76) Acc@5 87.50 ( 87.55)
Epoch: [34][3600/4999] Time 0.437 ( 1.638) Data 0.000 ( 0.227) Loss 7.0694e+00 (6.9147e+00) Acc@1 59.38 ( 76.76) Acc@5 76.56 ( 87.55)
Epoch: [34][3610/4999] Time 0.432 ( 1.638) Data 0.000 ( 0.226) Loss 6.9074e+00 (6.9146e+00) Acc@1 78.12 ( 76.76) Acc@5 90.62 ( 87.55)
Epoch: [34][3620/4999] Time 0.423 ( 1.639) Data 0.000 ( 0.225) Loss 6.9464e+00 (6.9146e+00) Acc@1 71.88 ( 76.76) Acc@5 85.94 ( 87.55)
Epoch: [34][3630/4999] Time 0.436 ( 1.644) Data 0.000 ( 0.225) Loss 6.8364e+00 (6.9145e+00) Acc@1 81.25 ( 76.77) Acc@5 89.06 ( 87.56)
Epoch: [34][3640/4999] Time 0.425 ( 1.646) Data 0.000 ( 0.224) Loss 6.9520e+00 (6.9145e+00) Acc@1 71.88 ( 76.76) Acc@5 85.94 ( 87.56)
Epoch: [34][3650/4999] Time 0.426 ( 1.646) Data 0.000 ( 0.224) Loss 6.8319e+00 (6.9145e+00) Acc@1 84.38 ( 76.77) Acc@5 87.50 ( 87.56)
Epoch: [34][3660/4999] Time 0.428 ( 1.646) Data 0.000 ( 0.223) Loss 6.8066e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 90.62 ( 87.57)
Epoch: [34][3670/4999] Time 0.471 ( 1.651) Data 0.000 ( 0.222) Loss 6.9694e+00 (6.9144e+00) Acc@1 78.12 ( 76.77) Acc@5 89.06 ( 87.57)
Epoch: [34][3680/4999] Time 0.431 ( 1.650) Data 0.000 ( 0.222) Loss 6.8628e+00 (6.9144e+00) Acc@1 81.25 ( 76.77) Acc@5 87.50 ( 87.57)
Epoch: [34][3690/4999] Time 0.428 ( 1.650) Data 0.000 ( 0.221) Loss 6.8666e+00 (6.9145e+00) Acc@1 81.25 ( 76.77) Acc@5 92.19 ( 87.56)
Epoch: [34][3700/4999] Time 0.434 ( 1.650) Data 0.000 ( 0.221) Loss 6.9402e+00 (6.9144e+00) Acc@1 71.88 ( 76.78) Acc@5 87.50 ( 87.57)
Epoch: [34][3710/4999] Time 0.434 ( 1.654) Data 0.000 ( 0.220) Loss 6.8522e+00 (6.9144e+00) Acc@1 81.25 ( 76.78) Acc@5 92.19 ( 87.57)
Epoch: [34][3720/4999] Time 0.421 ( 1.655) Data 0.000 ( 0.219) Loss 6.8393e+00 (6.9145e+00) Acc@1 79.69 ( 76.78) Acc@5 90.62 ( 87.57)
Epoch: [34][3730/4999] Time 0.426 ( 1.658) Data 0.000 ( 0.219) Loss 6.9804e+00 (6.9145e+00) Acc@1 68.75 ( 76.78) Acc@5 81.25 ( 87.57)
Epoch: [34][3740/4999] Time 0.424 ( 1.658) Data 0.000 ( 0.218) Loss 7.0028e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 82.81 ( 87.57)
Epoch: [34][3750/4999] Time 0.438 ( 1.662) Data 0.000 ( 0.218) Loss 6.9528e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 82.81 ( 87.57)
Epoch: [34][3760/4999] Time 0.423 ( 1.664) Data 0.000 ( 0.217) Loss 6.8455e+00 (6.9143e+00) Acc@1 76.56 ( 76.79) Acc@5 93.75 ( 87.57)
Epoch: [34][3770/4999] Time 0.430 ( 1.666) Data 0.000 ( 0.217) Loss 6.9374e+00 (6.9143e+00) Acc@1 81.25 ( 76.79) Acc@5 90.62 ( 87.57)

I use the following command to train on ImageNet with 4 2080ti:

python main_moco.py -a resnet50 --mlp --moco-t 0.2 --aug-plus --cos --lr 0.015 --batch-size 256 --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 /job/large_dataset/open_datasets/ImageNet/

I suspect it is training in a supervised manner. Is there anything wrong with my experiments?

opened by passion3394 5
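
A likely reading of the log in this issue, offered as an assumption rather than a maintainer answer: during pre-training, Acc@1 and Acc@5 appear to measure the contrastive task itself (picking the positive key out of the positive plus the queued negatives), not ImageNet classification, so high values do not imply supervised training. The sketch below illustrates that (K+1)-way task in plain Python; all function names and the toy vectors are hypothetical, and 0.07 is used as an assumed default temperature:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrast_logits(q, k_pos, queue, temperature=0.07):
    """(K+1)-way logits: the positive key first, then the K queued negatives."""
    return [dot(q, k_pos) / temperature] + [dot(q, k) / temperature for k in queue]


def info_nce(logits, target=0):
    """Cross-entropy loss with the positive key at index `target`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]


def top1_correct(logits, target=0):
    """True when the positive key receives the highest logit."""
    return max(range(len(logits)), key=logits.__getitem__) == target


# Toy unit-length features: one query, its positive key, and K = 3 negatives.
q = [1.0, 0.0]
k_pos = [0.8, 0.6]
queue = [[0.0, 1.0], [-1.0, 0.0], [0.6, 0.8]]

logits = contrast_logits(q, k_pos, queue)
print("loss:", info_nce(logits), "top-1 correct:", top1_correct(logits))
```

No class labels enter this computation; the target is always index 0, the position of the positive key.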

Error in distributed training

I frequently get an error when distributed training is enabled. It occurs roughly every 50 to 100 epochs. Here is the error message:

terminate called after throwing an instance of 'std::system_error'
  what():  Transport endpoint is not connected
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

Could you help me to resolve the issue?

opened by kibok90 5

Code and settings of semantic segmentation task

Hi, I'd like to reproduce the results of semantic segmentation task (VOC and LVIS), but I couldn't find the code and setting files. Do you have any plans to provide them with this repository? Thanks.

opened by yshinya6 4

Accuracy is 100% for the first 10 batches of the 1st epoch, then drops to 0 or a very small value

I am trying to run the code on my own data (not one of the datasets mentioned in the paper). During training, the accuracy is 100% for the first 10 batches of the 1st epoch, but it then drops to 0 or a comparatively very small value for the rest of training. The loss also keeps increasing. Snapshot for reference.

opened by angy50 4

Can BN be applied within DistributedDataParallel (DDP)?

Typically, we use SyncBN in DDP to ensure that the computed gradients are identical across different GPUs. This keeps the models on different GPUs at exactly the same parameters during training.

However, in MoCo training (IN-1M), the encoders contain several vanilla BN layers. How do you ensure that the models across GPUs have the same parameters? Thanks.

opened by Ze-Yang 4

Question about code in moco/builder.py

Hi,

Thanks for your impressive work.

In moco/builder.py, line 63:

self.queue[:, ptr:ptr + batch_size] = keys.T

I suppose that keys is a Tensor with a batch_size dimension, and T is a float scalar attribute of self (self.T).

So an AttributeError ('Tensor' object has no attribute 'T') is raised if I run the training code directly.

Should it be self.T (I guess)? Or is there a specific setting I missed?

Regards,

opened by AutumnZ-94 4

Loss curves on ImageNet

Hello -- I'm trying to reproduce some of these results on a different dataset, and the loss slowly bounces up and down, without converging (see below). Is that expected behavior? I don't think the paper shows what the loss/pretext accuracy look like in the ImageNet training -- might it be possible to share those plots here?

Thanks!

Edit: Note, my dataset has ~250K images, so ~25% the size of ImageNet -- I'm wondering whether the difference in dataset sizes could be causing problems, e.g. because the length of the momentum buffer is 4x larger relative to the size of the dataset.

opened by bkj 4
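
The relative queue size raised in this issue can be made concrete with a little arithmetic. A rough sketch, assuming a queue of 65,536 keys (the size quoted from the paper elsewhere on this page) and 1,281,167 training images for ImageNet-1k; the helper name queue_fraction is hypothetical:

```python
QUEUE_SIZE = 65536          # assumed number of keys held in the queue
IMAGENET_TRAIN = 1281167    # ImageNet-1k training images


def queue_fraction(dataset_size, queue_size=QUEUE_SIZE):
    """Fraction of the dataset represented in the queue at any one time."""
    return queue_size / dataset_size


print(f"ImageNet:    {queue_fraction(IMAGENET_TRAIN):.1%}")
print(f"250K images: {queue_fraction(250_000):.1%}")
```

With these assumed numbers the queue covers roughly 5% of ImageNet but about a quarter of a 250K-image dataset, so keys from the same image as the query are more likely to sit in the queue as negatives on the smaller dataset.
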
Issue about dequeue_and_enqueue

Hi, I am a little confused about the code of _dequeue_and_enqueue:

https://github.com/facebookresearch/moco/blob/main/moco/builder.py#L53-L66

# replace the keys at ptr (dequeue and enqueue)
self.queue[:, ptr:ptr + batch_size] = keys.T
ptr = (ptr + batch_size) % self.K  # move pointer

This implementation updates the queue gradually, as shown below:

But as described in your paper, the new keys should be pushed into the queue on the left side and the oldest ones removed from the right side, i.e. first-in-first-out. The code above instead just overwrites the red rectangle from left to right, so I'm confused about your code.

Paper:

The samples in the dictionary are progressively replaced. The current mini-batch is enqueued to the dictionary, and the oldest mini-batch in the queue is removed.
opened by qishibo 3
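
Overwriting at a moving pointer is a circular-buffer way of getting first-in-first-out behaviour: the slots about to be overwritten always hold the oldest keys. The toy sketch below mirrors the quoted update with a plain Python list instead of a tensor; the function name enqueue and the string keys are hypothetical, and it assumes the queue length is divisible by the batch size:

```python
def enqueue(queue, ptr, keys):
    """Overwrite len(keys) slots starting at ptr, then advance the pointer.

    Assumes len(queue) is divisible by len(keys), so a batch never wraps
    around the end of the buffer.
    """
    batch_size = len(keys)
    queue[ptr:ptr + batch_size] = keys
    return (ptr + batch_size) % len(queue)


queue = [None] * 6  # toy queue with K = 6
ptr = 0
for step in range(5):  # enqueue batches 0..4, two keys each
    ptr = enqueue(queue, ptr, [f"b{step}k0", f"b{step}k1"])

# Batches 0 and 1 (the oldest) have been overwritten; 2, 3 and 4 remain.
print(queue, ptr)
```

The physical order of the slots differs from a shifted queue, but the set of keys kept is the same: the most recent K of them.
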
How to load the Hyperparameters without command line code Argument Parser?

Hi,

Has anyone tried loading the hyperparameters, or modifying the code, without using the command line argument parser? Please share here if you have.

opened by arohi2bujji 0

Low Accuracy

I am running this code with 2 V100 GPUs:

python main_lincls.py \
  -a resnet50 \
  --lr 30.0 \
  --batch-size 2048 \
  --pretrained moco_v1_200ep_pretrain.pth.tar \
  /workspace/imagenet \
  -j 14

After 98 epochs, my top-1 accuracy is 57.90.

Why is the performance so low?

opened by dbsdmlgus50 0

Issue with batch size

Hello @KaimingHe, I am running the MoCo pre-training with the following configurations:

Num of GPUs = 8
GPU type: NVIDIA Quadro RTX 8000

But I am unable to run this with a batch size greater than 8. It throws a PyTorch Spawn error when I set the batch size greater than 8 (ex:16, 32). I confirmed that all GPUs are running at full utilization when I run the pre training with the above mentioned batch size. Do you have any suggestions or thoughts on how I might proceed? With a batch size this small, would it provide any meaningful results?

opened by kkannan8291 1

Question about the queue for key encoder

Hey, First of all, thank you for the code.

I have a question about updating the queue using the mini-batch. In the paper, the size of the queue is 65,536, which is basically the number of negative keys. I am confused about the term "keys" here: does 65,536 keys mean 65,536 images, or does it mean 65,536 features generated by the key encoder?

opened by Rohit8y 2

Question about transferring to COCO with MoCo v1 and MoCo v2 checkpoints

Hi, I have a question about reproducing the results on COCO.

When I use the MoCo v1 pre-training checkpoint released in the repo, I get the same results as reported in the paper:

Evaluation results for bbox:

| AP | AP50 | AP75 | APs | APm | APl |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 38.658 | 58.372 | 41.765 | 21.351 | 43.372 | 51.629 |

Evaluation results for segm:

| AP | AP50 | AP75 | APs | APm | APl |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 34.014 | 55.207 | 36.171 | 14.915 | 37.526 | 50.783 |

However, when I change the pre-training checkpoint to the MoCo v2 one, I get worse results:

Evaluation results for bbox:

| AP | AP50 | AP75 | APs | APm | APl |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 33.921 | 52.401 | 36.497 | 19.247 | 37.598 | 45.513 |

Evaluation results for segm:

| AP | AP50 | AP75 | APs | APm | APl |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 30.113 | 49.341 | 31.897 | 13.713 | 32.557 | 45.618 |

There is a large gap of about 4% between the results of MoCo v1 and MoCo v2. The code I used is from this repo with the official settings. Is this normal, or did I make a mistake? I would appreciate any advice or explanation. Thanks a lot!

opened by mZhenz 0