PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO

Facebook Research

Last update: Jan 3, 2023

Related tags

Deep Learning dino

Overview

Self-Supervised Vision Transformers with DINO

PyTorch implementation and pretrained models for DINO. For details, see Emerging Properties in Self-Supervised Vision Transformers.
[blogpost] [arXiv]

Pretrained models

You can choose to download only the weights of the pretrained backbone used for downstream tasks, or the full checkpoint which contains backbone and projection head weights for both student and teacher networks. We also provide the training and evaluation logs.

arch	params	k-nn	linear	download
DeiT-S/16	21M	74.5%	77.0%	backbone only	full checkpoint	args	logs	eval logs
DeiT-S/8	21M	78.3%	79.7%	backbone only	full checkpoint	args	logs	eval logs
ViT-B/16	85M	76.1%	78.2%	backbone only	full checkpoint	args	logs	eval logs
ViT-B/8	85M	77.4%	80.1%	backbone only	full checkpoint	args	logs	eval logs
ResNet-50	23M	67.5%	75.3%	backbone only	full checkpoint	args	logs	eval logs

The pretrained models are available on PyTorch Hub.

import torch
deits16 = torch.hub.load('facebookresearch/dino', 'dino_deits16')
deits8 = torch.hub.load('facebookresearch/dino', 'dino_deits8')
vitb16 = torch.hub.load('facebookresearch/dino', 'dino_vitb16')
vitb8 = torch.hub.load('facebookresearch/dino', 'dino_vitb8')
resnet50 = torch.hub.load('facebookresearch/dino', 'dino_resnet50')

Training

Documentation

Please install PyTorch and download the ImageNet dataset. This codebase has been developed with Python 3.6, PyTorch 1.7.1, CUDA 11.0 and torchvision 0.8.2. The exact arguments to reproduce the models presented in our paper can be found in the args column of the pretrained models section. For a glimpse at the full documentation of DINO training, please run:

python main_dino.py --help

Vanilla DINO training 🦕

Run DINO with a DeiT-small network on a single node with 8 GPUs for 100 epochs with the following command. Training time is 1.75 days and the resulting checkpoint should reach ~69.3% on k-NN eval and ~73.8% on linear eval. We will shortly provide training and linear evaluation logs for this run to help reproducibility.

python -m torch.distributed.launch --nproc_per_node=8 main_dino.py --arch deit_small --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir

Multi-node training

We use Slurm and submitit (pip install submitit). To train on 2 nodes with 8 GPUs each (total 16 GPUs):

python run_with_submitit.py --nodes 2 --ngpus 8 --arch deit_small --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir

DINO with ViT-base network.

python run_with_submitit.py --nodes 2 --ngpus 8 --use_volta32 --arch vit_base  --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
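As a rough worked example of what moving to several nodes changes, the effective batch size is the product of the number of nodes, the GPUs per node and the per-GPU batch size. The sketch below assumes the per-GPU batch size of 64 that appears in the training logs further down this page, a base learning rate of 0.0005, and a linear learning-rate scaling rule with a reference batch size of 256. The scaling rule and the helper names `effective_batch_size` and `scaled_lr` are assumptions for illustration, not a description of what main_dino.py does internally.

```python
def effective_batch_size(nodes, gpus_per_node, batch_size_per_gpu):
    """Total number of images processed per optimization step."""
    return nodes * gpus_per_node * batch_size_per_gpu


def scaled_lr(base_lr, total_batch_size, reference_batch_size=256):
    """Linear scaling rule (assumed): lr grows proportionally with batch size."""
    return base_lr * total_batch_size / reference_batch_size


# 2 nodes x 8 GPUs x 64 images per GPU
total = effective_batch_size(nodes=2, gpus_per_node=8, batch_size_per_gpu=64)
lr = scaled_lr(base_lr=0.0005, total_batch_size=total)
print(total, lr)
```

If the rule holds for your setup, doubling the number of nodes doubles both values, which is worth keeping in mind when comparing runs launched on different hardware.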

Boosting DINO performance 🦖

You can improve the performance of the vanilla run by:

training for more epochs: --epochs 300,
increasing the teacher temperature: --teacher_temp 0.07 --warmup_teacher_temp_epochs 30,
removing last layer normalization (only safe with --arch deit_small): --norm_last_layer false.

Full command.

python run_with_submitit.py --arch deit_small --epochs 300 --teacher_temp 0.07 --warmup_teacher_temp_epochs 30 --norm_last_layer false --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir

The resulting pretrained model should reach ~73.4% on k-NN eval and ~76.1% on linear eval. Training time is 2.6 days with 16 GPUs. We will shortly provide training and linear evaluation logs for this run to help reproducibility.
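To make the two teacher-temperature flags more concrete, here is a minimal sketch of a warmup schedule: the temperature rises from a starting value to --teacher_temp over --warmup_teacher_temp_epochs epochs and then stays constant. The linear shape of the warmup and the starting value of 0.04 (the default quoted in a user configuration further down this page) are assumptions, and `teacher_temp_schedule` is a hypothetical helper rather than a function of this codebase.

```python
def teacher_temp_schedule(warmup_temp, final_temp, warmup_epochs, total_epochs):
    """Return one teacher temperature per epoch (assumed linear warmup)."""
    schedule = []
    for epoch in range(total_epochs):
        if warmup_epochs > 1 and epoch < warmup_epochs:
            # Fraction of the warmup completed, 0.0 at epoch 0 and 1.0 at the last warmup epoch.
            frac = epoch / (warmup_epochs - 1)
            schedule.append(warmup_temp + frac * (final_temp - warmup_temp))
        else:
            schedule.append(final_temp)
    return schedule


# Values from the boosted run above: --teacher_temp 0.07 --warmup_teacher_temp_epochs 30 --epochs 300
schedule = teacher_temp_schedule(0.04, 0.07, warmup_epochs=30, total_epochs=300)
print(schedule[0], schedule[15], schedule[-1])
```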

ResNet-50 and other convnets trainings

This code also works for training DINO on convolutional networks, such as ResNet-50. We highly recommend adapting some optimization arguments in this case. For example, here is a command to train DINO on ResNet-50 on a single node with 8 GPUs for 100 epochs:

python -m torch.distributed.launch --nproc_per_node=8 main_dino.py --arch resnet50 --optimizer sgd --weight_decay 1e-4 --weight_decay_end 1e-4 --global_crops_scale 0.14 1 --local_crops_scale 0.05 0.14 --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir

Evaluation: k-NN classification on ImageNet

To evaluate a simple k-NN classifier with a single GPU on a pre-trained model, run:

python -m torch.distributed.launch --nproc_per_node=1 eval_knn.py --data_path /path/to/imagenet

If you choose not to specify --pretrained_weights, then DINO reference weights are used by default. If you want instead to evaluate checkpoints from a run of your own, you can run for example:

python -m torch.distributed.launch --nproc_per_node=1 eval_knn.py --pretrained_weights /path/to/checkpoint.pth --checkpoint_key teacher --data_path /path/to/imagenet
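For readers who want intuition for what a k-NN evaluation on frozen features involves, the following is a small pure-Python sketch of a weighted k-NN classifier. It assumes cosine similarity on L2-normalized features and votes weighted by exp(similarity / temperature); the default temperature of 0.07 mirrors the value printed in the eval_knn.py log further down this page. The function names are hypothetical and this is not the code of eval_knn.py.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def knn_predict(train_feats, train_labels, query, k=20, temperature=0.07):
    """Predict a label for `query` from its k most similar training features."""
    q = l2_normalize(query)
    sims = []
    for feat, label in zip(train_feats, train_labels):
        f = l2_normalize(feat)
        sims.append((sum(a * b for a, b in zip(q, f)), label))
    # Keep the k neighbours with the highest cosine similarity.
    sims.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for sim, label in sims[:k]:
        # Closer neighbours get exponentially larger votes.
        votes[label] += math.exp(sim / temperature)
    return max(votes, key=votes.get)


# Toy 2-D "features" with two classes.
toy_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
print(knn_predict(toy_feats, toy_labels, [0.8, 0.2], k=3))
```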

Evaluation: Linear classification on ImageNet

To train a supervised linear classifier on frozen weights on a single node with 8 GPUs, run:

python -m torch.distributed.launch --nproc_per_node=8 eval_linear.py --data_path /path/to/imagenet

Self-attention visualization

You can look at the self-attention of the [CLS] token on the different heads of the last layer by running:

python visualize_attention.py

Self-attention from a Vision Transformer with 8x8 patches trained with DINO
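As an illustration of what "the self-attention of the [CLS] token on the different heads" amounts to, the sketch below takes one attention matrix per head, keeps the row belonging to the [CLS] token, drops its attention to itself, and reshapes the rest into the patch grid. It uses nested Python lists instead of tensors, assumes the [CLS] token sits at index 0, and the helper name `cls_attention_maps` is hypothetical; visualize_attention.py may organize this differently.

```python
def cls_attention_maps(attn, grid_h, grid_w):
    """Turn per-head attention matrices into per-head [CLS] attention maps.

    attn: list of heads, each a (1 + N) x (1 + N) nested list, token 0 = [CLS].
    Returns a list of heads, each a grid_h x grid_w nested list.
    """
    maps = []
    for head in attn:
        cls_to_patches = head[0][1:]  # attention from [CLS] to every patch token
        if len(cls_to_patches) != grid_h * grid_w:
            raise ValueError("number of patch tokens does not match the grid")
        rows = [cls_to_patches[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]
        maps.append(rows)
    return maps


# One head, four patch tokens arranged as a 2 x 2 grid.
toy_attn = [[
    [0.2, 0.1, 0.2, 0.3, 0.2],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2, 0.2],
]]
print(cls_attention_maps(toy_attn, 2, 2))
```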

License

See the LICENSE file for more details.

Citation

If you find this repository useful, please consider giving a star ⭐ and citation 🦖 :

@article{caron2021emerging,
  title={Emerging Properties in Self-Supervised Vision Transformers},
  author={Caron, Mathilde and Touvron, Hugo and Misra, Ishan and J\'egou, Herv\'e  and Mairal, Julien and Bojanowski, Piotr and Joulin, Armand},
  journal={arXiv preprint arXiv:2104.14294},
  year={2021}
}

Comments

Error using visualize_attention.py. The size of tensor a (3234) must match the size of tensor b (3181) at non-singleton dimension 1

Hi all, I am trying to execute visualize_attention.py with default pretrained weights on my own image as below

!python visualize_attention.py --image_path 'test/finalImg_249.png'

I get a size mismatch error. Could you please let me know what changes need to be made here?

Error stack trace:

Please use the --pretrained_weights argument to indicate the path of the checkpoint to evaluate.
Since no pretrained weights have been provided, we load the reference pretrained DINO weights.
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3458: UserWarning: Default upsampling behavior when mode=bicubic is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  "See the documentation of nn.Upsample for details.".format(mode)
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3503: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
  "The default behavior for interpolate/upsample with float scale_factor changed "
Traceback (most recent call last):
  File "visualize_attention.py", line 162, in <module>
    attentions = model.forward_selfattention(img.to(device))
  File "~/dino/vision_transformer.py", line 246, in forward_selfattention
    x = x + pos_embed
RuntimeError: The size of tensor a (3234) must match the size of tensor b (3181) at non-singleton dimension 1

Image details:

import cv2
img = cv2.imread('finalImg_249.png')
print(img.shape)  # output: (427, 488, 3)

opened by cishwarya 20
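The two numbers in this error message can be reproduced with simple arithmetic, which helps narrow down where the mismatch comes from. The sketch below assumes a patch size of 8 and that the image reaches the model at its native 427 x 488 resolution; neither is stated in the thread, and `num_tokens` is a hypothetical helper.

```python
def num_tokens(height, width, patch_size):
    """Number of tokens a ViT sees: one [CLS] token plus one token per full patch."""
    return 1 + (height // patch_size) * (width // patch_size)


# Image shape reported above: (427, 488, 3) -> a 53 x 61 patch grid with 8 x 8 patches.
tokens_from_image = num_tokens(427, 488, 8)

# The other size in the error matches a 53 x 60 grid, i.e. one column short.
tokens_one_column_short = 1 + 53 * 60

print(tokens_from_image, tokens_one_column_short)
```

Under these assumptions, tensor a (3234) is the token sequence built from the image and tensor b (3181) is a positional embedding interpolated to a grid that is one patch narrower. One plausible cause, not confirmed in this thread, is that the interpolated output size is rounded down along one axis.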

Error finetuning from pretrained checkpoint

Hi all, I'm running into an error when trying to fine-tune from one of the pretrained checkpoints.

Code

!mkdir "$output"
!wget -q -O "$output/checkpoint.pth" https://dl.fbaipublicfiles.com/dino/dino_deitsmall16_pretrain/dino_deitsmall16_pretrain.pth

!python -m torch.distributed.launch \
  --nproc_per_node=1 ./dino/main_dino.py \
  --arch deit_small \
  --data_path "$input" \
  --output_dir "$output"

Error

| distributed init (rank 0): env://
git:
  sha: 8aa93fdc90eae4b183c4e3c005174a9f634ecfbf, status: clean, branch: main

arch: deit_small
batch_size_per_gpu: 64
...
...
Student and Teacher are built: they are both deit_small network.
Loss, optimizer and schedulers ready.
Found checkpoint at ./drive/MyDrive/DINO/checkpoint.pth
=> failed to load student from checkpoint './drive/MyDrive/DINO/checkpoint.pth'
=> failed to load teacher from checkpoint './drive/MyDrive/DINO/checkpoint.pth'
=> failed to load optimizer from checkpoint './drive/MyDrive/DINO/checkpoint.pth'
=> failed to load fp16_scaler from checkpoint './drive/MyDrive/DINO/checkpoint.pth'
=> failed to load dino_loss from checkpoint './drive/MyDrive/DINO/checkpoint.pth'

Any suggestions would be very much appreciated.

opened by yadamonk 12
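The log above lists five entries the training script tried to restore: student, teacher, optimizer, fp16_scaler and dino_loss. One plausible reading, not confirmed in this thread, is that the downloaded file is a backbone-only checkpoint (see the Pretrained models section) and therefore has none of these entries. The sketch below shows a quick way to check; it operates on a plain dictionary such as the one you would typically get from `torch.load(path, map_location="cpu")`, and the helper name `checkpoint_kind` is hypothetical.

```python
# Entry names taken from the "failed to load ..." lines in the log above.
FULL_CHECKPOINT_KEYS = {"student", "teacher", "optimizer", "fp16_scaler", "dino_loss"}


def checkpoint_kind(checkpoint):
    """Classify a loaded checkpoint dictionary by its top-level keys."""
    present = FULL_CHECKPOINT_KEYS & set(checkpoint)
    if present == FULL_CHECKPOINT_KEYS:
        return "full training checkpoint"
    if present:
        return "partial training checkpoint"
    return "no training state (possibly backbone weights only)"


# Stand-in dictionaries; real checkpoints would hold tensors and state dicts.
full_like = {key: {} for key in FULL_CHECKPOINT_KEYS}
backbone_like = {"cls_token": None, "pos_embed": None}
print(checkpoint_kind(full_like))
print(checkpoint_kind(backbone_like))
```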

Loss not dropping on custom dataset :(

Hi, thanks for the wonderful work, @mathildecaron31! The reported video is inspiring :D

I am experimenting with a custom dataset. Training a vision transformer (deit_small) in a supervised manner works fine and the loss drops as expected. I even managed to apply visualize_attention.py to see heatmaps for a separately trained ViT. But when I switch to the self-supervised DINO setup, there is almost no change in the loss during training. Do you have any idea why this could happen, or any possible solutions?

Thanks!

I am attaching a screenshot from training and the arguments I used for the training script.

loss-stop

arch ='deit_small'
patch_size = 16
out_dim = 10000 # default 65536
norm_last_layer = False
momentum_teacher = 0.996 # check this according to batch_size
bsize = 256 #####
use_bn_in_head = False
warmup_teacher_temp = 0.0005 # less if does not decrease, default 0.04
teacher_temp = 0.3 # increase if needed, default: 0.04
warmup_teacher_temp_epochs = 0 # default 30 to warmup
use_fp16 = False # disable if loss is unstable, default: True
weight_decay = 0.04 # a smaller value works well
weight_decay_end = 0.4 # final value of weight decay
clip_grad = 3.0 # max parameter gradient norm, 0 for disabling # default, 3.0
batch_size_per_gpu = 256 # reduce this if not fit, default 64
epochs = 100
freeze_last_layer = 5 # default 1, Try increasing this value if the loss does not decrease.
lr = 0.005 #linear with batch size scaled, for ref of 256, def 0.0005
warmup_epochs = 0 #linear warmup def 10
min_lr = 1e-6 # target lr at the optimization
optimizer = 'sgd' # def: adamw
global_crops_scale = (0.4, 1.)
local_crops_number = 8 # local small views
local_crops_scale = (0.05, 0.4) # def (0.05, 0.4)
data_path = train_dataset_dir #
output_dir = "./dirlog"
saveckp_freq = 20
seed = 0 # random seed
num_workers = 40 #def:10
dist_url = "env://"
local_rank = 0
device_ids = [0, 1, 2, 3, 4, 5] # use 6 gpus

opened by tuttelikz 11

`interpolate_pos_encoding(x, pos_embed)` doesn't return the correct dimensions for images that are not square (w != h)

I noticed that the positional embedding generated in the interpolate_pos_encoding method is slightly different from the one in the forward_selfattention method. The following simple modification brings the two in line, in case it is of interest.

    def interpolate_pos_encoding(self, x, pos_embed, w, h):  # passing w and h as arguments
        npatch = x.shape[1] - 1
        N = pos_embed.shape[1] - 1
        if npatch == N:
            return pos_embed
        class_emb = pos_embed[:, 0]
        pos_embed = pos_embed[:, 1:]
        dim = x.shape[-1]
        w0 = w // self.patch_embed.patch_size  # just copy paste from forward_selfattention
        h0 = h // self.patch_embed.patch_size
        pos_embed = nn.functional.interpolate(
            pos_embed.reshape(1, int(math.sqrt(N)), int(math.sqrt(N)), dim).permute(0, 3, 1, 2),
            scale_factor=(w0 / math.sqrt(N), h0 / math.sqrt(N)),  # replace math.sqrt(npatch / N) with one from forward_selfattention
            mode='bicubic',
        )
        pos_embed = pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
        return torch.cat((class_emb.unsqueeze(0), pos_embed), dim=1)

opened by enverfakhan 11
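To see why a single isotropic scale factor such as math.sqrt(npatch / N) cannot work for non-square inputs, it helps to compare the grid it produces with the grid the image actually needs. The sketch below assumes a 28 x 28 pretrained positional grid (for example 224 x 224 inputs with 8 x 8 patches) and that the interpolation output size is the floor of input size times scale factor; both are assumptions, and the helper names are hypothetical.

```python
import math


def isotropic_grid(src_side, n_target_patches):
    """Grid obtained by scaling both axes with sqrt(npatch / N)."""
    scale = math.sqrt(n_target_patches / (src_side * src_side))
    side = math.floor(src_side * scale)  # assumed rounding of the output size
    return side, side


def per_axis_grid(height, width, patch_size):
    """Grid the image actually needs: one entry per full patch along each axis."""
    return height // patch_size, width // patch_size


needed = per_axis_grid(427, 488, 8)        # the 427 x 488 image from the first issue above
isotropic = isotropic_grid(28, needed[0] * needed[1])
print(needed, isotropic)
```

With these numbers the image needs a 53 x 61 grid (3233 patch tokens), while the isotropic factor yields a square 56 x 56 grid (3136 tokens), so the shapes cannot match however the rounding goes. Scaling each axis separately, as in the modification above, at least targets the right aspect ratio.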

model collapse after a few steps

I use custom data to train DINO, and the model seems to collapse after a few steps: the features seem to be uniform. I used a larger teacher temperature to enhance "sharpening", but the model still collapsed. I wonder if DINO is sensitive to the data; in other words, does DINO tend to collapse when trained on different data?

opened by Doom9234 8

ONNX pretrained

Your work looks very interesting. I'm not familiar with PyTorch / Python and it would be great if the pre-trained nets could be provided in ONNX format.

Regards Armin

opened by Armin234 8

Training/Transferring on CIFAR10

Hi

Thanks for your nice work. I wonder if you can share the hyperparameters for transfer learning on CIFAR10. Have you succeeded in training on CIFAR10 from scratch without transferring? If so, would you also kindly share the hyperparameters for that?

opened by nimaous 7

Hello, what would I need to do to apply it to a 3D medical imaging setting?

Hello, I would like to use your algorithm for the 3d setting (magnetic resonance imaging of the prostate gland). I have only image-level labels, and your algorithm seems very interesting. What would I need to do to adapt it for a 3-dimensional setting?

opened by jakubMitura14 6

Scaling up DINO to larger model size

Hi @mathildecaron31, I've recently been considering scaling DINO to a larger model size, e.g., ViT-L/16. I used almost the same parameters as for ViT-B/16 and pre-trained DINO for 400 epochs, but the k-NN and linear probing accuracies are ~1% and ~2% worse than those of the base-size model, respectively. Do you have any related experience with that? Thanks for your help!

opened by shallowtoil 6

knn_eval() with resnet50 has missing keys in state_dict

While the fc layer is not needed when extracting features from ResNet50, the following command

$ python eval_knn.py --dump_features resnet50_features --arch resnet50 --data_path imagenet1k_folder

generates this error:

RuntimeError: Error(s) in loading state_dict for ResNet:
        Missing key(s) in state_dict: "fc.weight", "fc.bias".

Here is the complete output:

Will run the code on one GPU.
| distributed init (rank 0): env://
fatal: Not a git repository (or any parent up to mount point /home)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
git:
  sha: N/A, status: clean, branch: N/A

arch: resnet50
batch_size_per_gpu: 128
checkpoint_key: teacher
data_path: imagenet1k_folder
dist_url: env://
dump_features: resnet50_features
gpu: 0
load_features: None
local_rank: 0
nb_knn: [10, 20, 100, 200]
num_workers: 10
patch_size: 16
pretrained_weights: 
rank: 0
temperature: 0.07
use_cuda: True
world_size: 1
/home/user/anaconda3/envs/vissl/lib/python3.7/site-packages/torchvision/transforms/transforms.py:258: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
  "Argument interpolation should be of type InterpolationMode instead of int. "
/home/user/anaconda3/envs/vissl/lib/python3.7/site-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
Data loaded with 1281167 train and 50000 val imgs.
Please use the `--pretrained_weights` argument to indicate the path of the checkpoint to evaluate.
Since no pretrained weights have been provided, we load the reference pretrained DINO weights.
Traceback (most recent call last):
  File "eval_knn.py", line 227, in <module>
    train_features, test_features, train_labels, test_labels = extract_feature_pipeline(args)
  File "eval_knn.py", line 70, in extract_feature_pipeline
    utils.load_pretrained_weights(model, args.pretrained_weights, args.checkpoint_key, args.arch, args.patch_size)
  File "/home/user/codes/dino-main/utils.py", line 107, in load_pretrained_weights
    model.load_state_dict(state_dict, strict=True)
  File "/home/user/anaconda3/envs/vissl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1224, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ResNet:
        Missing key(s) in state_dict: "fc.weight", "fc.bias".

opened by mgpadalkar 6
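The error above comes from a strict state-dict load, which requires the checkpoint and the model to expose exactly the same parameter names. A quick way to see what differs before loading is to compare the two sets of keys, as in the sketch below. It works on plain lists of names (for a real model these would come from `model.state_dict().keys()` and the loaded checkpoint), the key names are shortened stand-ins, and `diff_keys` is a hypothetical helper.

```python
def diff_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names for a strict load.

    missing:    names the model expects but the checkpoint lacks.
    unexpected: names the checkpoint has but the model does not define.
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Stand-in names: the model still has a classification head, the checkpoint does not.
model_names = ["conv1.weight", "layer1.0.conv1.weight", "fc.weight", "fc.bias"]
checkpoint_names = ["conv1.weight", "layer1.0.conv1.weight"]
missing, unexpected = diff_keys(model_names, checkpoint_names)
print(missing, unexpected)
```

When the only missing entries belong to a head that is not used for feature extraction, as the issue suggests for fc, two common options are to remove or replace that layer on the model before loading, or to load with `strict=False`; which one fits this codebase is not settled in the thread.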

Linear evaluation weight problems ("_IncompatibleKeys" and access issues)

Hello, thanks for all the amazing work you put into this. I tried downloading the pretrained weights for the 3 available ViT models and have encountered some issues:

ViT-S/16 gives me an "Access denied" message whenever I try downloading it.

For ViT-S/8 or ViT-B/16, their weights seem corrupted? Whenever I try loading them into eval_linear.py, I get a message listing a long list of missing and unexpected keys. Not exactly sure what is wrong here. The loss also starts at a quite high value and although it's dropping off, I don't think this is the intended behavior from a pretrained model.

Here's the training output (not the log) for ViT-S/8 TrainingOutput.txt

Thank you again for your work.

opened by KnockerPulsar 6

Supervised Fine-Tuning of Teacher / Student Transformer Weights

I used DINO to do self-supervised pre-training of a Small ViT on a dataset I have. Now I wanted to fine-tune the model on another dataset in a supervised way.

I know that, in a way, the code in eval_linear.py allows us to do that, but - as far as I was able to tell - it only updates the weights of the Linear model built on top of the representations generated by the pre-trained Transformer.

So my question is: has anyone tried to perform supervised fine-tuning in a way that the weights of the Teacher or Student Transformers are updated as well?

PS: I realize this might not be the ideal place to ask this question, since it sort of falls out of the DINO jurisdiction, but I figured it was worth a try.

Thanks for the amazing work you guys did, and for sharing it with us.

opened by MatCorr 0

About the DINO training loss

Hello, I'm training resnet18 on a custom dataset. It's been running for some time with a batch size of 325 (the max my GPU can handle). The thing is, the loss is flat and it's not getting better or worse. Is this behavior normal? And if so, how do you decide when to stop the training?

opened by Faisal-Hajari 0

How small batch sizes affect performance

Hi, thanks for your hard work. I am retraining DINO with my own custom dataset (~570k images).

On my local computer, the maximum batch size is 32 (1 GPU RTX 3080 TI) and a single epoch takes around 1 hour 20 minutes to complete. Is it normal?

Does small batch size matter to the performance?

Thank you!

opened by bryanwong17 0

Number of classes

Hello, thank you for your work! I have a small dataset containing only one class of instances (trees). I looked in the code and it seems like the number of classes in the VisionTransformer is always zero (num_classes=0). Is this normal? I am not sure I understand the difference between the num_classes and num_labels used in the eval script.

Thank you

opened by VGrondin 0

Dead code in video_generation.py

The thresholded attention maps computed in this block of code aren't being used anywhere else as far as I can tell, so this section seems to just waste computation and memory: https://github.com/facebookresearch/dino/blob/main/video_generation.py#L197

opened by eminorhan 0

🦕 Created a DINO fork with wandb.ai support and other bug fixes

Hi! Since the DINO model is no longer supported by the developers and the existing problems are not being solved, here is my fork of the DINO model, where I fix #160 and a problem with the environment variable and directory in the get_shared_folder() function. I use it and all seems fine! You can submit your changes in a pull request.

opened by MikeMACintosh 0

Owner

Facebook Research

A simple pygame dino game which can also be trained and played by a NEAT AI

[CVPR 2021] "The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models" Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Michael Carbin, Zhangyang Wang

Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR

EsViT: Efficient self-supervised Vision Transformers

The Self-Supervised Learner can be used to train a classifier with fewer labeled examples needed using self-supervised learning.

Official implementation of the method ContIG, for self-supervised learning from medical imaging with genomics

Official code for "Focal Self-attention for Local-Global Interactions in Vision Transformers"

A PyTorch implementation of ViTGAN based on paper ViTGAN: Training GANs with Vision Transformers.

PyTorch implementation of "Contrast to Divide: self-supervised pre-training for learning with noisy labels"

PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners for self-supervised ViT.

This is an official implementation for "Self-Supervised Learning with Swin Transformers".

An official reimplementation of the method described in the INTERSPEECH 2021 paper - Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Implementation of the method described in the Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

[CVPR 21] Vectorization and Rasterization: Self-Supervised Learning for Sketch and Handwriting, IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2021.

We evaluate our method on different datasets (including ShapeNet, CUB-200-2011, and Pascal3D+) and achieve state-of-the-art results, outperforming all the other supervised and unsupervised methods and 3D representations, all in terms of performance, accuracy, and training time.

Code for the paper One Thing One Click: A Self-Training Approach for Weakly Supervised 3D Semantic Segmentation, CVPR 2021.

Training code and evaluation benchmarks for the "Self-Supervised Policy Adaptation during Deployment" paper.

[CVPR 2022] "The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy" by Tianlong Chen, Zhenyu Zhang, Yu Cheng, Ahmed Awadallah, Zhangyang Wang

As-ViT: Auto-scaling Vision Transformers without Training