MoCo v3 for Self-supervised ResNet and ViT

Introduction

This is a PyTorch implementation of MoCo v3 for self-supervised ResNet and ViT.

The original MoCo v3 was implemented in TensorFlow and run on TPUs. This repo re-implements it in PyTorch on GPUs. Despite the library and numerical differences, this repo reproduces the results and observations in the paper.

Main Results

The following results are based on ImageNet-1k self-supervised pre-training, followed by ImageNet-1k supervised training for linear evaluation or end-to-end fine-tuning. All results in these tables are based on a batch size of 4096.

ResNet-50, linear classification

pretrain epochs    pretrain crops    linear acc (%)
100                2x224             68.9
300                2x224             72.8
1000               2x224             74.6

ViT, linear classification

model        pretrain epochs    pretrain crops    linear acc (%)
ViT-Small    300                2x224             73.2
ViT-Base     300                2x224             76.7

ViT, end-to-end fine-tuning

model        pretrain epochs    pretrain crops    e2e acc (%)
ViT-Small    300                2x224             81.4
ViT-Base     300                2x224             83.2

The end-to-end fine-tuning results are obtained with the DeiT repo, using all the default DeiT configs. ViT-B is fine-tuned for 150 epochs (vs. DeiT-B's 300 epochs, which reaches 81.8% accuracy).

Usage: Preparation

Install PyTorch and download the ImageNet dataset following the official PyTorch ImageNet training code. Similar to MoCo v1/2, this repo contains minimal modifications of the official PyTorch ImageNet code. We assume the user can successfully run the official PyTorch ImageNet code. For ViT models, install timm (timm==0.4.9).

The code has been tested with CUDA 10.2 / cuDNN 7.6.5, PyTorch 1.9.0, and timm 0.4.9.

Usage: Self-supervised Pre-Training

Below are three examples for MoCo v3 pre-training.

ResNet-50 with 2-node (16-GPU) training, batch 4096

On the first node, run:

python main_moco.py \
  --moco-m-cos --crop-min=.2 \
  --dist-url 'tcp://[your first node address]:[specified port]' \
  --multiprocessing-distributed --world-size 2 --rank 0 \
  [your imagenet-folder with train and val folders]

On the second node, run the same command with --rank 1. With a batch size of 4096, the training fits on 2 nodes with a total of 16 Volta 32G GPUs.

ViT-Small with 1-node (8-GPU) training, batch 1024

python main_moco.py \
  -a vit_small -b 1024 \
  --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
  --epochs=300 --warmup-epochs=40 \
  --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

ViT-Base with 8-node training, batch 4096

With a batch size of 4096, ViT-Base is trained with 8 nodes:

python main_moco.py \
  -a vit_base \
  --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
  --epochs=300 --warmup-epochs=40 \
  --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
  --dist-url 'tcp://[your first node address]:[specified port]' \
  --multiprocessing-distributed --world-size 8 --rank 0 \
  [your imagenet-folder with train and val folders]

On other nodes, run the same command with --rank 1, ..., --rank 7 respectively.
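
The --moco-m-cos flag used above makes the momentum coefficient of the momentum (key) encoder follow a half-cycle cosine schedule that increases towards 1 over training. Below is a minimal sketch of such a schedule; the function and argument names are illustrative rather than the exact ones used in main_moco.py, and base_m=0.99 is only an example starting value.

import math

def adjust_moco_momentum(epoch, total_epochs, base_m=0.99):
    """Half-cycle cosine schedule for the MoCo momentum coefficient.

    Starts at base_m and approaches 1.0 by the end of training, so the
    momentum encoder is updated more and more slowly.
    """
    return 1.0 - 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs)) * (1.0 - base_m)

# Example: momentum at the start, middle, and end of a 300-epoch run.
for e in [0, 150, 300]:
    print(e, adjust_moco_momentum(e, total_epochs=300))  # 0.99, 0.995, 1.0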

Notes:

  1. The batch size specified by -b is the total batch size across all GPUs.
  2. The learning rate specified by --lr is the base lr; it is adjusted by the linear lr scaling rule, lr = base_lr * batch_size / 256, in main_moco.py (see the sketch after this list).
  3. Using a smaller batch size gives more stable results (see paper), but lower speed. Using a large batch size is critical for good speed on TPUs (as we did in the paper).
  4. In this repo, only multi-GPU DistributedDataParallel training is supported; single-GPU or DataParallel training is not supported. The code is improved to better suit the multi-node setting, and by default uses automatic mixed precision for pre-training.
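
Concretely, the linear lr scaling rule from note 2 amounts to the following (a minimal sketch with illustrative variable names):

base_lr = 1.5e-4         # value passed via --lr (ViT-Base recipe above)
total_batch_size = 4096  # value passed via -b, total over all GPUs and nodes

# Linear scaling rule: scale the base lr by the total batch size over 256.
effective_lr = base_lr * total_batch_size / 256
print(effective_lr)      # 0.0024 for the ViT-Base recipe above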

Usage: Linear Classification

By default, we use momentum-SGD and a batch size of 1024 for linear classification on frozen features/weights. This can be done with a single 8-GPU node.

python main_lincls.py \
  -a [architecture] --lr [learning rate] \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  --pretrained [your checkpoint path]/[your checkpoint file].pth.tar \
  [your imagenet-folder with train and val folders]
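
For reference, a rough sketch of what the linear probing step amounts to: freeze the pre-trained backbone and train only a linear classifier with momentum SGD. This is illustrative only (it assumes a torchvision ResNet-50 and placeholder hyperparameters); main_lincls.py itself handles checkpoint loading, distributed training, and the exact schedule.

import torch
import torchvision.models as models

# Illustrative sketch: freeze everything except the final linear layer.
model = models.resnet50()
for name, param in model.named_parameters():
    if name not in ("fc.weight", "fc.bias"):
        param.requires_grad = False

# Fresh linear head for the 1000 ImageNet classes.
model.fc = torch.nn.Linear(model.fc.in_features, 1000)

# Momentum SGD over the trainable (linear) parameters only.
parameters = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(parameters, lr=0.1, momentum=0.9, weight_decay=0.0)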

Usage: End-to-End Fine-tuning ViT

To perform end-to-end fine-tuning for ViT, use our script to convert the pre-trained ViT checkpoint to DeiT format:

python convert_to_deit.py \
  --input [your checkpoint path]/[your checkpoint file].pth.tar \
  --output [target checkpoint file].pth

Then run the training (in the DeiT repo) with the converted checkpoint:

python $DEIT_DIR/main.py \
  --resume [target checkpoint file].pth \
  --epochs 150

This gives us 83.2% accuracy for ViT-Base with 150-epoch fine-tuning.

Note:

  1. We use --resume rather than --finetune in the DeiT repo, as its --finetune option trains under eval mode. When loading the pre-trained model, revise model_without_ddp.load_state_dict(checkpoint['model']) to pass strict=False (a sketch follows this list).
  2. Our ViT-Small uses heads=12 in the Transformer blocks, while the DeiT default is heads=6. Please modify the DeiT code accordingly when fine-tuning our ViT-Small model.
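
For note 1, the change amounts to the following (a minimal sketch; model_without_ddp and args are names from the DeiT main.py, and this snippet only illustrates the non-strict load):

import torch

# Inside the DeiT repo's resume logic: load the converted MoCo v3 checkpoint
# non-strictly, so keys present on only one side (e.g. the pre-training head)
# are ignored instead of raising an error.
checkpoint = torch.load(args.resume, map_location='cpu')
model_without_ddp.load_state_dict(checkpoint['model'], strict=False)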

Model Configs

See the commands listed in CONFIG.md.

Transfer Learning

See the instructions in the transfer dir.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

@Article{chen2021mocov3,
  author  = {Xinlei Chen* and Saining Xie* and Kaiming He},
  title   = {An Empirical Study of Training Self-Supervised Vision Transformers},
  journal = {arXiv preprint arXiv:2104.02057},
  year    = {2021},
}
Comments
  • Tensorflow version

    Thank you for open source the Pytorch implementation. I wonder if the original tensorflow implementation have been released for the purpose of training on TPUs.

    opened by wildphoton 7
  • How about the loss converges during training?

    During training, I find that the training loss is not monotonically decreasing; is that right? Does the loss value indicate how training is going? If not, what should the sample-matching accuracy be when pre-training has finished?

    opened by Vickeyhw 5
  • Link to file with pre-trained weights to README

    Hi, thanks for releasing the code and weights! Given that there are two issues asking about weights (#6 and #10), and I also went to the issues to ask after not seeing them mentioned anywhere in the README, I think this is confusing and worth linking from the README. So here we go.

    opened by lucasb-eyer 5
  • Hi, an error occurred in torch.multiprocessing.spawn

    [libprotobuf FATAL google/protobuf/stubs/common.cc:87] This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.17.3). Contact the program author for an update. If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library. (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
    terminate called after throwing an instance of 'google::protobuf::FatalException'
      what(): This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.17.3). Contact the program author for an update. If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library. (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
    Traceback (most recent call last):
      File "train.py", line 413, in <module>
        main()
      File "train.py", line 140, in main
        mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
      File "/home/wxq/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/home/wxq/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
        while not context.join():
      File "/home/wxq/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
        raise ProcessExitedException(
    torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT

    I don't know what error has occurred and would appreciate help, thank you. My device: 4 NVIDIA 1080 Ti GPUs, CUDA version: 11.0.

    opened by XiaoqiWang 4
  • Any hyper parameter suggestions for other model architectures?

    I noticed that this repository only provides the results and experiment settings for ResNet-50 and the ViT series of models.

    When I try to reproduce the results, I find that the final linear probing accuracy is very sensitive to the hyperparameters, such as learning rate, optimizer, augmentations, etc.

    Are there any suggestions for training MoCo v3 on other models, such as EfficientNet, ResNet-101, etc.? And how should the hyperparameters be adjusted for different model architectures?

    opened by Harick1 3
  • Links for pre-trained models are broken

    Hi, I'm trying to download your pre-trained models from CONFIG.md, but it seems the tar files don't contain any weights after all.

    This is what I see for every pre-trained model I tried to download:

    [screenshot]
    opened by toliz 1
  • How many TPUs ?

    Hi, in the MoCo v3 paper there is a section about the computation time. It says that for ViT-B, 100 epochs of ImageNet take 2.1 hours. It is not clear whether this means 512 TPU devices or 512 TPU cores. To be precise, there are two types of TPUs available on Google Cloud: v2-[32,512] and v3-[32-2048]. Which one of them was used in the experiments, and how many for each instance?

    opened by jrabary 1
  • Fine-tuning vs Linear probing

    Hi,

    I am wondering why there is a significant performance gap between fine-tuning and linear probing. Additionally, why is fine-tuning not used for the ResNet model?

    Thank you in advance!

    opened by zhangdan8962 1
  • how about the training results for vit_conv_?

    I want to know how to train vit_conv_small and what its performance is. According to the paper, is stop-gradient necessary for the four-convolution embedding?

    opened by Dongshengjiang 1
  • API: Change main_lincls --batch-size argument to match main_moco

    The --batch-size argument in main_moco.py is the total batch size across all nodes (as specified in the README). However, the argparser documentation mistakenly described it as the batch size for one node. This argparser description is fixed in this PR to bring it in line with the actual behaviour and the README.

    The handling of --batch-size in main_lincls.py was inconsistent with that of main_moco.py, and was still handled as if it were the total batch size per node, not over all nodes (as per the previous behaviour from the moco/simsiam repos). In this PR, we updated it to be consistent with main_moco.py, specifying the total batch size over all GPUs on all nodes. The default behaviour remains consistent with the instructions from the README, which say to use a total batch size of 1024. (The README specifies 8 GPUs on a single node, but now the behaviour will also be correct with 2 nodes with 4 GPUs each, etc.)

    CLA Signed 
    opened by scottclowe 1
  • Does this implementation support non-distributed training?

    I found that if I don't use distributed training, i.e. set --multiprocessing-distributed=False and use a single GPU, there seem to be no problems in main_moco.py with

       torch.cuda.set_device(args.gpu)
       model = model.cuda(args.gpu)
    

    However, this error occurred when training started

    AssertionError: Default process group is not initialized

    This error can be traced back to

    File "~/moco-v3/moco/builder.py", line 68, in contrastive_loss k = concat_all_gather(k)

    and

    File "~/moco-v3/moco/builder.py", line 178, in concat_all_gather for _ in range(torch.distributed.get_world_size())]

    This error is caused by the computation of contrastive_loss, which still relies on distributed training. So I wonder whether non-distributed training is unsupported even if we set multiprocessing-distributed=False.
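
    One possible workaround for single-GPU experiments (a minimal sketch, not necessarily how the authors would handle it) is to make the all-gather helper in moco/builder.py fall back to a no-op when torch.distributed has not been initialized:

    import torch
    import torch.distributed as dist

    @torch.no_grad()
    def concat_all_gather(tensor):
        # Without a distributed process group there is nothing to gather
        # from other GPUs, so return the tensor unchanged.
        if not (dist.is_available() and dist.is_initialized()):
            return tensor
        tensors_gather = [torch.ones_like(tensor) for _ in range(dist.get_world_size())]
        dist.all_gather(tensors_gather, tensor, async_op=False)
        return torch.cat(tensors_gather, dim=0)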

    opened by Euphoria16 1
  • About Linear Probe Accuracy of ResNet-50

    I ran the code using the parameters specified in CONFIG.md for ResNet-50 1000-epoch pre-training, and then fine-tuned using the linear probe method. All parameters were kept the same as mentioned in CONFIG.md. However, after linear probing, my accuracy reaches only 74.36% and not 74.6%. I am not sure what I might be missing here.

    Could you help me out?

    Additionally, I ran the official checkpoint provided and that is able to achieve 74.6%.

    opened by avinabsaha 0
  • Most of the learning time is spent loading data. This makes it impossible to use GPU resources efficiently.

    Most of the learning time is spent loading data, which makes it impossible to use GPU resources efficiently. I wonder if this is the expected training state. I experiment with 224-sized images, and the command is as follows.

    python main_moco.py \
      -a resnet18 -b 1024 \
      --moco-m-cos --crop-min=.2 \
      --dist-url 'tcp://localhost:10001' \
      --multiprocessing-distributed --world-size 1 --rank 0 \
      ../data/

    opened by JINSUBY 0
  • The linear-probe acc1 of ViT-Tiny on ImageNet is bad

    Hi, thanks for your great work. I found a problem in our experiment: first, I train a ViT-Tiny on ImageNet with MoCo v3; second, I fine-tune the ViT-Tiny on ImageNet by training only a classifier (linear probe).

    I found the top-1 accuracy is only 32%. Is this right? Does anyone have MoCo v3 results for ViT-Tiny on ImageNet?

    opened by PeiqinSun 0
  • MoCo v3 vit_small error: object has no attribute "num_tokens"

    When I attempt to pre-train moco v3's vit_small model, I run into the following bug:

    raise AttributeError("'{}' object has no attribute '{}'".format(
    AttributeError: 'VisionTransformerMoCo' object has no attribute 'num_tokens'

    After changing the line in vits.py (line 66) to assert self.num_prefix_tokens == 1, 'Assuming one and only one token, [cls]', I don't see the bug anymore. It seems that the base class timm.models.vision_transformer has an attribute named num_prefix_tokens but not num_tokens, and hence vit_small errors out at the above-mentioned line.

    The command I used to run the code is:

    python main_moco.py \
      -a vit_small -b 1024 \
      --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
      --epochs=400 --warmup-epochs=40 \
      --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
      --dist-url 'tcp://localhost:8080' \
      --multiprocessing-distributed --world-size 1 --rank 0 \
      /data/

    Please let me know if this is an accurate fix, or if I missed something. Thanks in advance!

    opened by shree-lily 1
  • How to fine-tune?

    Hello, I would like to ask how to fine-tune this on my own dataset; it seems that the provided pre-trained weights are missing some content. Thanks!

    main_moco.py, line 247, in main_worker
        optimizer.load_state_dict(checkpoint['optimizer'])
    optimizer.py, line 137, in load_state_dict
        saved_groups = state_dict['param_groups']
    TypeError: 'NoneType' object is not subscriptable

    opened by DWCTOD 0
  • About the learning rate for resnet-50

    I met an issue training ResNet-50 with MoCo v3. Under the distributed training setting with 16 V100 GPUs (each process has only one GPU, batch size 4096), the training loss is about 27.2 at the 100th epoch. When I lower the learning rate to 1.5e-4 (the default is 0.6), the loss decreases more reasonably and reaches 27.0 at the 100th epoch. Could you please verify whether this is reasonable?

    opened by cswaynecool 1