DeiT: Data-efficient Image Transformers

Facebook Research

Last update: Jan 6, 2023

Overview

This repository contains PyTorch evaluation code, training code and pretrained models for DeiT (Data-Efficient Image Transformers).

DeiT models offer a competitive trade-off between speed and precision.

For details, see the paper "Training data-efficient image transformers & distillation through attention" by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles and Hervé Jégou.

If you use this code for a paper please cite:

@article{touvron2020deit,
  title={Training data-efficient image transformers & distillation through attention},
  author={Hugo Touvron and Matthieu Cord and Matthijs Douze and Francisco Massa and Alexandre Sablayrolles and Herv\'e J\'egou},
  journal={arXiv preprint arXiv:2012.12877},
  year={2020}
}

Model Zoo

We provide baseline DeiT models pretrained on ImageNet 2012.

name	acc@1	acc@5	#params	url
DeiT-tiny	72.2	91.1	5M	model
DeiT-small	79.9	95.0	22M	model
DeiT-base	81.8	95.6	86M	model
DeiT-tiny distilled	74.5	91.9	6M	model
DeiT-small distilled	81.2	95.4	22M	model
DeiT-base distilled	83.4	96.5	87M	model
DeiT-base 384	82.9	96.2	87M	model
DeiT-base distilled 384 (1000 epochs)	85.2	97.2	88M	model

The models are also available via torch hub. Before using them, make sure you have the pytorch-image-models package timm==0.3.2 by Ross Wightman installed. Note that our work relies on the augmentations proposed in this library. In particular, the RandAugment and RandErasing augmentations that we invoke are the improved versions from the timm library, which already led the timm authors to report up to 79.35% top-1 accuracy with ImageNet training for their best model, i.e., an improvement of about +1.5% compared to prior art.

To load DeiT-base with pretrained weights on ImageNet simply do:

import torch
# check you have the right version of timm
import timm
assert timm.__version__ == "0.3.2"

# now load it with torchhub
model = torch.hub.load('facebookresearch/deit:main', 'deit_base_patch16_224', pretrained=True)

Additionally, we provide a Colab notebook which goes over the steps needed to perform inference with DeiT.

Usage

First, clone the repository locally:

git clone https://github.com/facebookresearch/deit.git

Then install PyTorch 1.7.0+, torchvision 0.8.1+, and pytorch-image-models (timm) 0.3.2:

conda install -c pytorch pytorch torchvision
pip install timm==0.3.2

Data preparation

Download and extract the ImageNet train and val images from http://image-net.org/. The directory structure is the standard layout for torchvision's datasets.ImageFolder, and the training and validation data are expected to be in the train/ and val/ folders respectively:

/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img3.jpeg
    class2/
      img4.jpeg
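
Before launching a long run, it can help to confirm that the dataset root really follows the layout above. The following is a minimal sketch using only the standard library; the helper name `check_imagefolder_layout` and the list of accepted file extensions are assumptions made for illustration and are not part of this repository.

```python
from pathlib import Path

# Hypothetical helper (not part of the DeiT repository): checks that a dataset
# root follows the train/<class>/<image> and val/<class>/<image> layout.
# The accepted extensions below are an assumption for this sketch.
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def check_imagefolder_layout(root):
    """Return a list of problems found under `root`; an empty list means OK."""
    problems = []
    root = Path(root)
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split_dir}")
            continue
        class_dirs = sorted(d for d in split_dir.iterdir() if d.is_dir())
        if not class_dirs:
            problems.append(f"no class folders in: {split_dir}")
        for class_dir in class_dirs:
            # Each class folder should contain at least one image file.
            has_image = any(
                f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
                for f in class_dir.iterdir()
            )
            if not has_image:
                problems.append(f"no images in: {class_dir}")
    return problems
```

Calling `check_imagefolder_layout("/path/to/imagenet")` returns an empty list when both splits exist and every class folder holds at least one image.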

Evaluation

To evaluate a pre-trained DeiT-base on ImageNet val with a single GPU run:

python main.py --eval --resume https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth --data-path /path/to/imagenet

This should give

* Acc@1 81.846 Acc@5 95.594 loss 0.820

For DeiT-small, run:

python main.py --eval --resume https://dl.fbaipublicfiles.com/deit/deit_small_patch16_224-cd65a155.pth --model deit_small_patch16_224 --data-path /path/to/imagenet

giving

* Acc@1 79.854 Acc@5 94.968 loss 0.881

Note that DeiT-small is not the same model as in timm.

And for DeiT-tiny:

python main.py --eval --resume https://dl.fbaipublicfiles.com/deit/deit_tiny_patch16_224-a1311bcf.pth --model deit_tiny_patch16_224 --data-path /path/to/imagenet

which should give

* Acc@1 72.202 Acc@5 91.124 loss 1.219

The following commands reproduce the inference results for the distilled and fine-tuned models:

deit_base_distilled_patch16_224

python main.py --eval --model deit_base_distilled_patch16_224 --resume https://dl.fbaipublicfiles.com/deit/deit_base_distilled_patch16_224-df68dfff.pth

giving

* Acc@1 83.372 Acc@5 96.482 loss 0.685

deit_small_distilled_patch16_224

python main.py --eval --model deit_small_distilled_patch16_224 --resume https://dl.fbaipublicfiles.com/deit/deit_small_distilled_patch16_224-649709d9.pth

giving

* Acc@1 81.164 Acc@5 95.376 loss 0.752

deit_tiny_distilled_patch16_224

python main.py --eval --model deit_tiny_distilled_patch16_224 --resume https://dl.fbaipublicfiles.com/deit/deit_tiny_distilled_patch16_224-b40b3cf7.pth

giving

* Acc@1 74.476 Acc@5 91.920 loss 1.021

deit_base_patch16_384

python main.py --eval --model deit_base_patch16_384 --input-size 384 --resume https://dl.fbaipublicfiles.com/deit/deit_base_patch16_384-8de9b5d1.pth

giving

* Acc@1 82.890 Acc@5 96.222 loss 0.764

deit_base_distilled_patch16_384

python main.py --eval --model deit_base_distilled_patch16_384 --input-size 384 --resume https://dl.fbaipublicfiles.com/deit/deit_base_distilled_patch16_384-d0272ac0.pth

giving

* Acc@1 85.224 Acc@5 97.186 loss 0.636
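
When reproducing these numbers, it can be convenient to parse the summary line and compare it with the reference value automatically. The sketch below assumes the summary line has exactly the format shown above; the function names and the 0.1-point tolerance are illustrative choices, not something this repository defines.

```python
import re

# Matches summary lines such as "* Acc@1 81.846 Acc@5 95.594 loss 0.820".
SUMMARY_RE = re.compile(
    r"\*\s*Acc@1\s+([\d.]+)\s+Acc@5\s+([\d.]+)\s+loss\s+([\d.]+)"
)


def parse_eval_summary(line):
    """Extract top-1 accuracy, top-5 accuracy and loss, or None if absent."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    acc1, acc5, loss = (float(group) for group in match.groups())
    return {"acc1": acc1, "acc5": acc5, "loss": loss}


def close_to_reference(line, reference_acc1, tolerance=0.1):
    """True if the parsed top-1 accuracy is within `tolerance` points."""
    summary = parse_eval_summary(line)
    if summary is None:
        return False
    return abs(summary["acc1"] - reference_acc1) <= tolerance
```

For example, `close_to_reference("* Acc@1 81.846 Acc@5 95.594 loss 0.820", 81.8)` checks a DeiT-base run against the model zoo entry.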

Training

To train DeiT-small and DeiT-tiny on ImageNet on a single node with 4 GPUs for 300 epochs, run:

DeiT-small

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model deit_small_patch16_224 --batch-size 256 --data-path /path/to/imagenet --output_dir /path/to/save

DeiT-tiny

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model deit_tiny_patch16_224 --batch-size 256 --data-path /path/to/imagenet --output_dir /path/to/save
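
The commands above use a per-GPU batch size of 256 on 4 GPUs, i.e. a total batch size of 1024. The DeiT paper scales the learning rate linearly with the total batch size relative to a reference of 512; the sketch below illustrates that arithmetic. The function name is hypothetical, and the base learning rate of 5e-4 is taken from the paper's recipe rather than from this README, so check it against main.py before relying on it.

```python
def scaled_learning_rate(base_lr, batch_size_per_gpu, num_gpus, reference_batch=512):
    """Linear scaling rule: lr = base_lr * total_batch / reference_batch."""
    total_batch = batch_size_per_gpu * num_gpus
    return base_lr * total_batch / reference_batch


# Assumed base learning rate of 5e-4 (from the paper's recipe).
lr_4_gpus = scaled_learning_rate(5e-4, batch_size_per_gpu=256, num_gpus=4)
lr_2_gpus = scaled_learning_rate(5e-4, batch_size_per_gpu=256, num_gpus=2)
print(f"total batch 1024 -> lr {lr_4_gpus:.4g}")
print(f"total batch 512  -> lr {lr_2_gpus:.4g}")
```

Under this rule, halving the number of GPUs while keeping the per-GPU batch size halves the effective learning rate as well.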

Multinode training

Distributed training is available via Slurm and submitit:

pip install submitit

To train a DeiT-base model on ImageNet on 2 nodes with 8 GPUs each for 300 epochs:

python run_with_submitit.py --model deit_base_patch16_224 --data-path /path/to/imagenet

To train DeiT-base with hard distillation using a RegNetY-160 as teacher, on 2 nodes with 8 GPUs with 32GB each for 300 epochs (make sure that the teacher's model weights have already been downloaded to the correct location, so that multiple workers do not write to the same file):

python run_with_submitit.py --model deit_base_distilled_patch16_224 --distillation-type hard --teacher-model regnety_160 --teacher-path https://dl.fbaipublicfiles.com/deit/regnety_160-a5fe301d.pth --use_volta32
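
For readers unfamiliar with hard distillation as described in the paper, the idea is that the class-token output is trained against the true label while the distillation-token output is trained against the teacher's hard prediction (its argmax), and the two cross-entropy terms are averaged. The following is a plain-Python, single-sample sketch of that objective, not the repository's implementation; the function names are hypothetical and the weighting `alpha=0.5` is an assumption.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log softmax(logits)[target]."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label, alpha=0.5):
    """Mix the supervised loss with a loss on the teacher's hard decision."""
    # The teacher's hard label is the index of its largest logit.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    base_loss = cross_entropy(cls_logits, label)
    distill_loss = cross_entropy(dist_logits, teacher_label)
    return (1.0 - alpha) * base_loss + alpha * distill_loss
```

Note that the teacher's logits only enter through their argmax, which is what distinguishes hard from soft distillation.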

To fine-tune a DeiT-base on 384 resolution images for 30 epochs, starting from a DeiT-base trained on 224 resolution images, run the following (make sure that the weights of the original model have already been downloaded, so that multiple workers do not write to the same file):

python run_with_submitit.py --model deit_base_patch16_384 --batch-size 32 --finetune https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth --input-size 384 --use_volta32 --nodes 2 --lr 5e-6 --weight-decay 1e-8 --epochs 30 --min-lr 5e-6
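
The model names encode a 16-pixel patch size (patch16), so moving from 224 to 384 resolution changes the number of patch tokens the transformer processes. That is why the position embeddings learned at 224 have to be resized when fine-tuning at 384. A minimal sketch of the token arithmetic, with a hypothetical helper name:

```python
def num_patches(image_size, patch_size=16):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


for size in (224, 384):
    side = size // 16
    print(f"{size}x{size} input -> {side}x{side} grid = {num_patches(size)} patch tokens")
```

The roughly threefold increase in tokens is also a plausible reason for the smaller per-GPU batch size in the command above.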

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Contributing

We actively welcome your pull requests! Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.

Comments

Loss NAN for Deit Base

I have reproduced the small and tiny models but ran into problems reproducing the base model at 224 and 384 image sizes. With high probability, the loss becomes NaN after a few epochs of training. My setup is 16 GPUs with a batch size of 64 per GPU, and I did not change any hyper-parameters in run_with_submitit.py. Do you have any idea how to solve this problem? Thanks for your help.
awaiting response

opened by ChengyueGongR 24

I need some help to reproduce DeiT-III finetuning result

Thank you for sharing the finetune code & training logs. With IN-1k pretraining, I got results similar to your log: ViT-S 81.43 and ViT-B 82.88. But I failed to reproduce the finetune performance even with your official finetuning setting, so I would like to ask for advice or help.

Here is my fine-tune result with ViT-B on IN-1k.

I expected the performance to increase as in your fine-tune log, but instead fine-tuning degrades it. I can't use submitit, so I used the following command on a single node with 8 A100 GPUs:

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=${num_gpus_per_node} --nnodes=${WORLD_SIZE} --node_rank=${RANK}  --master_addr=${MASTER_ADDR}  --master_port=${MASTER_PORT} --use_env main.py \
    --model deit_base_patch16_LS \
    --data-path ${local_data_path} \
    --finetune ${SAVE_BASE_PATH}/pretraining/checkpoint-${epoch}.pth \
    --output_dir ${SAVE_BASE_PATH}/finetune4 \
    --batch-size 64 \
    --print_freq 400 \
    --epochs 20 \
    --smoothing 0.1 \
    --reprob 0.0 \
    --opt adamw \
    --lr 1e-5 \
    --weight-decay 0.1 \
    --input-size 224 \
    --drop 0.0 \
    --drop-path 0.2 \
    --mixup 0.8 \
    --cutmix 1.0 \
    --unscale-lr \
    --no-repeated-aug \
    --aa rand-m9-mstd0.5-inc1 \

and full args printed on the command line

Namespace(ThreeAugment=False, aa='rand-m9-mstd0.5-inc1', attn_only=False, auto_resume=True, batch_size=64, bce_loss=False, clip_grad=None, color_jitter=0.3, cooldown_epochs=10, cutmix=1.0, cutmix_minmax=None, data_path='/mnt/ddn/datasets/ILSVRC2015/train/Data/CLS-LOC', data_set='IMNET', decay_epochs=30, decay_rate=0.1, device='cuda', dist_backend='nccl', dist_eval=False, dist_url='env://', distillation_alpha=0.5, distillation_tau=1.0, distillation_type='none', distributed=True, drop=0.0, drop_path=0.2, epochs=20, eval=False, finetune='/mnt/backbone-nfs/bhheo/checkpoints/deit_codebase_deit_base_patch16_LS_800epoch_reproduce/pretraining/checkpoint-800.pth', gpu=0, inat_category='name', input_size=224, log_dir='nsmlv2', log_name='finetune', lr=1e-05, lr_noise=None, lr_noise_pct=0.67, lr_noise_std=1.0, min_lr=1e-05, mixup=0.8, mixup_mode='batch', mixup_prob=1.0, mixup_switch_prob=0.5, model='deit_base_patch16_LS', model_ema=True, model_ema_decay=0.99996, model_ema_force_cpu=False, momentum=0.9, num_workers=10, opt='adamw', opt_betas=None, opt_eps=1e-08, output_dir='/mnt/backbone-nfs/bhheo/checkpoints/deit_codebase_deit_base_patch16_LS_800epoch_reproduce/finetune4', patience_epochs=10, pin_mem=True, print_freq=400, rank=0, recount=1, remode='pixel', repeated_aug=False, reprob=0.0, resplit=False, resume='', save_periods=['last2'], sched='cosine', seed=0, smoothing=0.1, src=False, start_epoch=0, teacher_model='regnety_160', teacher_path='', train_interpolation='bicubic', unscale_lr=True, warmup_epochs=5, warmup_lr=1e-06, weight_decay=0.1, world_size=8)

I think it is the same as your finetune setting. I double-checked my code but I still don't know why the result is totally different.

I'm using different library versions (torch: 1.11.0a0+b6df043, torchvision: 0.11.0a0, timm: 0.5.4). This might cause some problems, but there was no problem in pretraining, and the performance difference is too severe for a simple library version issue.

I'm sorry to keep bothering you, but could you please let me know if there is something wrong with my setting? Or could you please share the ViT-B weights pretrained on IN-1k 192x192 resolution without finetuning on 224x224? If you share the weights before finetune, I can verify my finetune code without doubting my pretraining.

opened by bhheo 23

Fine-tuning details

Hi,

I am trying to replicate the results of the paper for models fine-tuned on datasets such as CIFAR-10 and Stanford Cars. Could you give details about the hyper-parameters used (batch size, learning rate, etc.)?

Thanks.
question

opened by nakashima-kodai 14

No learning when transfer learning with Cait XXS24 224

Hello,

Thanks a lot for this great repo. I'm currently doing transfer learning with Cait XXS24 224, but I have a problem when loading the pretrained weights: when I train CaiT on the new task, the accuracy starts from 10 (random) and does not increase. I tried to train DeiT-small on this task with transfer learning, and this time it worked well (with the same training functions). Do you have any idea what the problem could be?

Here is the code to load the weights (it is actually the one that you provide):

v = cait_XXS_224(pretrained=False)
checkpoint = torch.load('logs/ImageNet/XXS24_24.pth')
checkpoint_no_module = {}
for k in v.state_dict().keys():
    checkpoint_no_module[k] = checkpoint["model"]['module.' + k]
v.load_state_dict(checkpoint_no_module)
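
A `module.` prefix on checkpoint keys typically comes from saving a model wrapped in DataParallel or DistributedDataParallel. A slightly more defensive way to remove it is sketched below using a plain dictionary; the helper name is hypothetical, and this only addresses key naming, so it is not claimed to resolve the training problem described in this issue.

```python
def strip_prefix(state_dict, prefix="module."):
    """Return a copy of `state_dict` with `prefix` removed from matching keys."""
    stripped = {}
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        stripped[new_key] = value
    return stripped


# Usage with the snippet above would look like (assumed, not tested here):
#   v.load_state_dict(strip_prefix(checkpoint["model"]))
example = {"module.cls_token": 1, "module.blocks.0.attn.qkv.weight": 2, "head.bias": 3}
print(strip_prefix(example))
```

Unlike indexing with `'module.' + k`, this version does not raise a KeyError when some keys were saved without the prefix; load_state_dict will still report any keys that are missing or unexpected.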

I'm using torch 1.7.1 and timm 0.4.5.
awaiting response

opened by BasileR 10

Resume Broken

It keeps complaining about no state_dict_model_ema:

Failed to find state_dict_ema, starting from loaded model weights

But in the checkpoint there is clearly a model_ema:

dict_keys(['model', 'optimizer', 'lr_scheduler', 'epoch', 'model_ema', 'scaler', 'args'])

and the loss goes to NaN a few hundred steps after resume...

opened by kyleliang919 8

The training log of DeiT III

Hi, I'm trying to reproduce the base model of DeiT III on ImageNet-1k with the suggested hyper-parameters by running:

python run_with_submitit.py --model deit_base_patch16_LS --data-path /path/to/imagenet --batch 256 --lr 3e-3 --epochs 800 --weight-decay 0.05 --sched cosine --input-size 192 --reprob 0.0 --node 1 --gpu 8 --smoothing 0.0 --warmup-epochs 5 --drop 0.0 --nb-classes 1000 --seed 0 --opt fusedlamb --warmup-lr 1e-6 --mixup .8 --drop-path 0.2 --cutmix 1.0 --unscale-lr --repeated-aug --bce-loss  --color-jitter 0.3 --ThreeAugment

Will the training log be available? Or is it ok to share the accuracy on 1/2, 1/4 of the total schedule?

Thanks!

opened by tgxs002 7

Question about implementing finetuning on iNat-18 dataset

Hi, I ran the following command to fine-tune on iNat-18:

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env main.py \
    --model deit_base_patch16_224 \
    --data-set INAT \
    --batch-size 96 \
    --lr 7.5e-5 \
    --opt AdamW \
    --weight-decay 0.05 \
    --epochs 360 \
    --repeated-aug \
    --reprob 0.1 \
    --drop-path 0.1 \
    --data-path /data/Dataset/inat2018_tar \
    --finetune ./output/deit_base_patch16_224-b5f2ef4d.pth \
    --output_dir ./output/finetune_inat18_deit

Other arguments are the same as the default values in main.py.

But I only got 71% acc within 300 epochs. Should I continue to finetune until 360 epochs?

opened by cokezrr 7

Question about Repeated Augmentation

Hi, first of all, thank you for releasing the code base. I have a small question about the sampler for Repeated Augmentation. What does this 256*256 mean?

https://github.com/facebookresearch/deit/blob/cb29b5efd522a0ac83d64aa8b41fe27cead3a030/samplers.py#L32

Thank you!
question

opened by moskomule 7

Image throughput numbers

What do the image / sec throughput numbers represent (train, inferences, batch size, mixed-prc or float32, etc)? They are lower than any inference numbers I'm familiar with for any of the listed models. They also don't seem to match expected training throughputs and have an odd spread (smallest to largest models), being quite low for the smaller models (CPU bound?).

I don't spend much time with V100, but relative to Titan RTX and RTX 3090 I have a fairly good idea where the numbers should fall...

Thanks
question

opened by rwightman 7

`kxd` matrix or `1xd` vector?

In section 3 of the paper 'Augmenting Convolutional networks with attention-based aggregation':

··· We can easily specialize the attention maps per class by replacing the CLS vector with a k × d matrix, where each of the k columns is associated with one of the classes. This specialization allows us to visualize an attention map for each class, as shown in Figure 2. ···

But I only found a 1 x d vector. Where is the k x d matrix?

https://github.com/facebookresearch/deit/blob/40ae72b79cc5cd48dac2b02e1fceb03ee4192676/patchconvnet_models.py#L201

opened by densechen 6

Question about the convergence of the Deit-base model

Great work, and thanks for sharing the code.

I am trying to re-train Deit base model but I encountered some issues. May I ask for your insights?

I can reproduce the reported result of 81.8% with all default settings; however, the performance degrades substantially if I change either of two minor hyperparameters:

1. Change the batch size to 512 (default is 1024); the learning rate is scaled automatically by your code.

2. Keep the batch size at 1024 but increase the warmup epochs to 10 (default is 5).

Here is the test accuracy over epochs

The orange line is the default setting. (81.8%) The blue line is batch size 512. (78.8%) The green line is using 10 epochs for warmup. (79.2%)

[Figure: testing accuracy curve]

[Figure: zoom in on the first 50 epochs]

With the default setting, the model seems about to diverge around the 6th epoch, but it recovers later and eventually achieves good results (81.8%). However, with a smaller batch size or 5 additional warmup epochs, the performance degrades by ~3%.

I wonder whether you observe the same trend, and whether you have any insight into why these two small changes have such a large effect?

My env: pytorch 1.7, timm 0.3.2, torchvision 0.8

Thanks.
question
opened by chunfuchen 6

Meaning of the model name (ResMLP)

Hello, thanks for sharing great work!

I have a small question about the model name. I am wondering about the meaning of 'S24' in ResMLP-S24. I think 'S' may mean a small-scale model and '24' may mean that the model consists of 24 layers, but I cannot find any description in the paper.

Could you tell me the meaning of names like 'S24' or 'B24'? Thanks!

opened by YHYeooooong 0

Can I use timm==0.4.12 instead of timm==0.3.2 ?

I created a conda env and installed the following:

conda install -c pytorch pytorch torchvision
pip install timm==0.3.2

I tried to run main.py for evaluation, but it gives the following error:

cannot import name 'container_abcs' from 'torch._six'

Is there a fix for this package issue?

Alternatively, I tried to evaluate DeiT-base with timm==0.4.12 and got Acc@1 81.802 instead of 81.846. Is this slight difference caused by the difference in timm versions?

opened by irhallac 0

What batch sizes other than 1024 have been tried when training a DeiT or ViT model?

What batch sizes other than 1024 have been tried when training a DeiT or ViT model? In the DeiT paper (https://arxiv.org/abs/2012.12877), they used a batch size of 1024 and mentioned that the learning rate should be scaled according to the batch size.

However, I was wondering whether anyone has successfully trained a DeiT model with a batch size smaller than 512. If so, what accuracy did you achieve?

opened by CharlesLeeeee 0

Multinode Slurm Training

Hello, I'm trying to use the run_with_submitit.py file to run the model on the Slurm cluster, but I do not get any output log file to see the training progress. All I have here are logs of each node initiating. Can you please help me with this multinode training? Best regards, Mehdi

opened by yazdanimehdi 0

Is the EMA used in DeiT-III?

I'm working on reproducing the accuracy of DeiT-III, and I noticed that EMA is enabled during pre-training but not used during evaluation. Is the EMA model used anywhere?
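
For context, an exponential moving average (EMA) of the weights is usually maintained with the update shown below. This is a plain-Python sketch of the generic technique on a flat list of numbers, not the timm ModelEma implementation; the function name is hypothetical, and the default decay of 0.99996 is simply the model_ema_decay value that appears in the argument dump earlier on this page.

```python
def ema_update(ema_weights, model_weights, decay=0.99996):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    return [
        decay * ema + (1.0 - decay) * weight
        for ema, weight in zip(ema_weights, model_weights)
    ]


# With a decay this close to 1, the average moves very slowly.
ema = [0.0, 0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0, -1.0])
print(ema)
```

Because the EMA copy is separate from the trained weights, it only affects reported accuracy if the evaluation code explicitly evaluates that copy.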

opened by mzr1996 2

What's the accuracy of DeiT-S on CIFAR10 without pre-training?

Hi,

What's the accuracy of DeiT-S on CIFAR10 without pre-training? Mine is only 63.2. Could you tell me yours? My hyper-parameters follow this link: https://github.com/facebookresearch/deit/issues/45#issuecomment-765213622

Thanks.

opened by hanwenran1 0
