EsViT: Efficient self-supervised Vision Transformers

Overview

Efficient Self-Supervised Vision Transformers (EsViT)

PyTorch implementation for EsViT, built with two techniques:

  • A multi-stage Transformer architecture. Three multi-stage Transformer variants are implemented under the folder models.
  • A region-level matching pre-training task. The region-level matching task is implemented in the class DDINOLoss(nn.Module) (Line 648) in main_esvit.py. Please pass --use_dense_prediction True; otherwise only the view-level task is used. A simplified sketch of the two objectives is given below.
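
For intuition, the sketch below is a hypothetical, simplified rendering of the two objectives; it is not the repository's DDINOLoss, which additionally handles multi-crop view pairs, teacher centering and temperature schedules. It shows a DINO-style cross-entropy on view-level outputs plus a region-level term that matches each student region to its most similar teacher region by cosine similarity of their features.

import torch
import torch.nn.functional as F

def view_level_loss(student_out, teacher_out, student_temp=0.1, teacher_temp=0.04):
    # Cross-entropy between the sharpened teacher distribution and the student
    # distribution over prototypes; both inputs have shape (B, K).
    t = F.softmax(teacher_out / teacher_temp, dim=-1).detach()
    s = F.log_softmax(student_out / student_temp, dim=-1)
    return -(t * s).sum(dim=-1).mean()

def region_level_loss(student_regions, teacher_regions, student_feats, teacher_feats,
                      student_temp=0.1, teacher_temp=0.04):
    # student_regions / teacher_regions: (B, T, K) prototype scores per region;
    # student_feats / teacher_feats: (B, T, D) region features used for matching.
    sim = F.normalize(teacher_feats, dim=-1) @ F.normalize(student_feats, dim=-1).transpose(1, 2)
    match = sim.argmax(dim=1)  # (B, T_student): best-matching teacher region per student region
    matched = torch.gather(teacher_regions, 1,
                           match.unsqueeze(-1).expand(-1, -1, teacher_regions.size(-1)))
    t = F.softmax(matched / teacher_temp, dim=-1).detach()
    s = F.log_softmax(student_regions / student_temp, dim=-1)
    return -(t * s).sum(dim=-1).mean()
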
Figure: Efficiency vs. accuracy comparison under the linear classification protocol on ImageNet. Left: throughput of all SoTA SSL vision systems; circle sizes indicate model parameter counts. Right: performance over varied parameter counts for models with a moderate throughput/#parameters ratio. Please refer to Section 4.1 of the paper for details.

Pretrained models

You can download the full checkpoints (trained with both view-level and region-level tasks on ImageNet-1K with batch size 512), which contain backbone and projection head weights for both the student and teacher networks.

| arch | params | linear | k-NN | download | logs |
| --- | --- | --- | --- | --- | --- |
| EsViT (Swin-T, W=7) | 28M | 78.0% | 75.7% | full ckpt | train / linear / knn |
| EsViT (Swin-S, W=7) | 49M | 79.5% | 77.7% | full ckpt | train / linear / knn |
| EsViT (Swin-B, W=7) | 87M | 80.4% | 78.9% | full ckpt | train / linear / knn |
| EsViT (Swin-T, W=14) | 28M | 78.7% | 77.0% | full ckpt | train / linear / knn |
| EsViT (Swin-S, W=14) | 49M | 80.8% | 79.1% | full ckpt | train / linear / knn |
| EsViT (Swin-B, W=14) | 87M | 81.3% | 79.3% | full ckpt | train / linear / knn |

EsViT (Swin-T, W=7) with different pre-train datasets (view-level task only)

| arch | params | batch size | pre-train dataset | linear | k-NN | download | logs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EsViT | 28M | 512 | ImageNet-1K | 77.0% | 74.2% | full ckpt | train / linear / knn |
| EsViT | 28M | 1024 | ImageNet-1K | 77.1% | 73.7% | full ckpt | train / linear / knn |
| EsViT | 28M | 1024 | WebVision-v1 | 75.4% | 69.4% | full ckpt | train / linear / knn |
| EsViT | 28M | 1024 | OpenImages-v4 | 69.6% | 60.3% | full ckpt | train / linear / knn |
| EsViT | 28M | 1024 | ImageNet-22K | 73.5% | 66.1% | full ckpt | train / linear / knn |
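
The full checkpoints above contain both the student and teacher networks. The snippet below is a hypothetical loading sketch for extracting the teacher backbone; the file name and the key names ('teacher', 'module.'/'backbone.' prefixes, 'head*') are assumptions and should be checked against the released files.

import torch

# Hypothetical sketch; the checkpoint file name and key layout are placeholders.
ckpt = torch.load("esvit_swin_tiny_full_ckpt.pth", map_location="cpu")
teacher = ckpt["teacher"] if "teacher" in ckpt else ckpt
backbone_state = {
    k.replace("module.", "").replace("backbone.", ""): v
    for k, v in teacher.items()
    if not k.replace("module.", "").startswith("head")  # drop projection-head weights
}
# model.load_state_dict(backbone_state, strict=False)  # model: a matching Swin backbone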

Pre-training

One-node training

To train on 1 node with 16 GPUs for Swin-T model size:

PROJ_PATH=your_esvit_project_path
DATA_PATH=$PROJ_PATH/project/data/imagenet

OUT_PATH=$PROJ_PATH/output/esvit_exp/ssl/swin_tiny_imagenet/
python -m torch.distributed.launch --nproc_per_node=16 main_esvit.py --arch swin_tiny --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml 

The main training script is main_esvit.py; it runs the training loop and takes the following arguments (among others):

  • --use_dense_prediction: whether or not to use the region matching task in pre-training
  • --arch: switches between different sparse self-attention mechanisms in the multi-stage Transformer architecture. Example architecture choices for EsViT training include [swin_tiny, swin_small, swin_base, swin_large, cvt_tiny, vil_2262]. The configuration file should be adjusted accordingly; we provide examples below. One may specify the network configuration by editing the YAML files under experiments/imagenet/*/*.yaml. The default window size is 7; to use a multi-stage architecture with window size 14, choose YAML files with window14 in their filenames.
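
The base learning rate in these recipes is 0.0005 (see the bl_lr0.0005 output paths below). If EsViT follows the common DINO convention, this base value corresponds to a 256-image global batch and is scaled linearly with the effective batch size; the helper below is a hypothetical illustration of that rule, to be verified against main_esvit.py.

# Hedged sketch of DINO-style linear learning-rate scaling; whether EsViT uses the
# exact same rule should be confirmed in main_esvit.py.
def scaled_lr(base_lr: float, batch_size_per_gpu: int, num_gpus: int, ref_batch: int = 256) -> float:
    global_batch = batch_size_per_gpu * num_gpus
    return base_lr * global_batch / ref_batch

# Example: the one-node Swin-T recipe above uses 16 GPUs x 32 images = 512 images per step.
print(scaled_lr(0.0005, 32, 16))  # -> 0.001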

To train on 1 node with 16 GPUs for Convolutional vision Transformer (CvT) models:

python -m torch.distributed.launch --nproc_per_node=16 main_esvit.py --arch cvt_tiny --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --aug-opt dino_aug --cfg experiments/imagenet/cvt_v4/s1.yaml

To train on 1 node with 16 GPUs for Vision Longformer (ViL) models:

python -m torch.distributed.launch --nproc_per_node=16 main_esvit.py --arch vil_2262 --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --aug-opt dino_aug --cfg experiments/imagenet/vil/vil_small/base.yaml MODEL.SPEC.MSVIT.ARCH 'l1,h3,d96,n2,s1,g1,p4,f7,a0_l2,h6,d192,n2,s1,g1,p2,f7,a0_l3,h12,d384,n6,s0,g1,p2,f7,a0_l4,h24,d768,n2,s0,g0,p2,f7,a0' MODEL.SPEC.MSVIT.MODE 1 MODEL.SPEC.MSVIT.VIL_MODE_SWITCH 0.75
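
The MODEL.SPEC.MSVIT.ARCH string packs one comma-separated spec per stage, with stages joined by '_' and each field written as a letter key followed by an integer. The hypothetical helper below only splits the string for inspection; judging from the values, d looks like the embedding dimension, h the number of heads and n the number of blocks per stage, while the remaining letters are left uninterpreted here.

# Hypothetical helper to inspect the ViL architecture string; it makes no claim
# about the semantics of fields beyond splitting them into per-stage dictionaries.
def parse_msvit_arch(arch: str):
    stages = []
    for stage_spec in arch.split("_"):
        fields = {token[0]: int(token[1:]) for token in stage_spec.split(",")}
        stages.append(fields)
    return stages

arch = ("l1,h3,d96,n2,s1,g1,p4,f7,a0_l2,h6,d192,n2,s1,g1,p2,f7,a0_"
        "l3,h12,d384,n6,s0,g1,p2,f7,a0_l4,h24,d768,n2,s0,g0,p2,f7,a0")
for stage in parse_msvit_arch(arch):
    print(stage)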

Multi-node training

To train on 2 nodes with 16 GPUs each (total 32 GPUs) for Swin-Small model size:

OUT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_small/bl_lr0.0005_gpu16_bs16_multicrop_epoch300_dino_aug_window14
python main_evsit_mnodes.py --num_nodes 2 --num_gpus_per_node 16 --data_path $DATA_PATH/train --output_dir $OUT_PATH/continued_from0200_dense --batch_size_per_gpu 16 --arch swin_small --zip_mode True --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --cfg experiments/imagenet/swin/swin_small_patch4_window14_224.yaml --use_dense_prediction True --pretrained_weights_ckpt $OUT_PATH/checkpoint0200.pth
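
Multi-node training ultimately relies on a standard torch.distributed process group. The sketch below shows the per-process setup that the launch utilities are expected to perform; it is illustrative only, and the wrapper script's actual interface may differ.

import os
import torch
import torch.distributed as dist

# Minimal sketch of the per-process distributed setup; RANK, WORLD_SIZE, LOCAL_RANK,
# MASTER_ADDR and MASTER_PORT are normally exported by the launcher.
def init_distributed():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return rank, world_size, local_rank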

Evaluation

k-NN and Linear classification on ImageNet

To train a supervised linear classifier on frozen weights on a single node with 4 GPUs, run eval_linear.py. To train a k-NN classifier on frozen weights on a single node with 4 GPUs, run eval_knn.py. Please specify --arch, --cfg and --pretrained_weights to choose a pre-trained checkpoint. If you want to evaluate the last checkpoint of EsViT with Swin-T, you can run, for example:

PROJ_PATH=your_esvit_project_path
DATA_PATH=$PROJ_PATH/project/data/imagenet

OUT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300
CKPT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300/checkpoint.pth

python -m torch.distributed.launch --nproc_per_node=4 eval_linear.py --data_path $DATA_PATH --output_dir $OUT_PATH/lincls/epoch0300 --pretrained_weights $CKPT_PATH --checkpoint_key teacher --batch_size_per_gpu 256 --arch swin_tiny --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml --n_last_blocks 4 --num_labels 1000 MODEL.NUM_CLASSES 0
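
The linear protocol trains a single linear classifier on frozen features. The sketch below is a simplified illustration only; eval_linear.py has its own feature aggregation (see the --n_last_blocks flag above) and optimization schedule.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified linear-probe sketch: one linear layer on frozen backbone features.
class LinearProbe(nn.Module):
    def __init__(self, feat_dim: int, num_labels: int = 1000):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_labels)

    def forward(self, feats):
        return self.fc(feats)

def probe_step(backbone, probe, optimizer, images, labels):
    backbone.eval()
    with torch.no_grad():                      # the backbone stays frozen
        feats = backbone(images)
    loss = F.cross_entropy(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()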

python -m torch.distributed.launch --nproc_per_node=4 eval_knn.py --data_path $DATA_PATH --dump_features $OUT_PATH/features/epoch0300 --pretrained_weights $CKPT_PATH --checkpoint_key teacher --batch_size_per_gpu 256 --arch swin_tiny --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml MODEL.NUM_CLASSES 0
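
The k-NN protocol classifies each test feature by a vote over its nearest training features; a similarity-weighted vote (the protocol popularized by DINO) is the usual implementation. The sketch below is a simplified, assumed version; details such as the temperature, the set of k values and the chunked computation in eval_knn.py may differ.

import torch

# Weighted k-NN sketch on frozen, L2-normalized features.
@torch.no_grad()
def knn_classify(train_feats, train_labels, test_feats, k=20, T=0.07, num_classes=1000):
    sim = test_feats @ train_feats.t()            # cosine similarity, (N_test, N_train)
    topk_sim, topk_idx = sim.topk(k, dim=1)
    topk_labels = train_labels[topk_idx]          # (N_test, k)
    weights = (topk_sim / T).exp()
    votes = torch.zeros(test_feats.size(0), num_classes, device=test_feats.device)
    votes.scatter_add_(1, topk_labels, weights)   # similarity-weighted vote per class
    return votes.argmax(dim=1)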

Analysis/Visualization of correspondence and attention maps

You can analyze the learned models by running python run_analysis.py. One example to analyze EsViT (Swin-T) is shown.

For an individual image (with path --image_path $IMG_PATH), we visualize the attention maps and correspondence of the last layer:

python run_analysis.py --arch swin_tiny --image_path $IMG_PATH --output_dir $OUT_PATH --pretrained_weights $CKPT_PATH --learning ssl --seed $SEED --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml --vis_attention True --vis_correspondence True MODEL.NUM_CLASSES 0 

For an image dataset (with path --data_path $DATA_PATH), we quantitatively measure the correspondence:

python run_analysis.py --arch swin_tiny --data_path $DATA_PATH --output_dir $OUT_PATH --pretrained_weights $CKPT_PATH --learning ssl --seed $SEED --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml  --measure_correspondence True MODEL.NUM_CLASSES 0 

For more examples, please see scripts/scripts_local/run_analysis.sh.
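
At its core, the correspondence analysis is nearest-neighbor matching between the dense features of two views of the same image. The sketch below is a simplified illustration of that computation; run_analysis.py adds visualization and aggregation on top.

import torch
import torch.nn.functional as F

# For each region feature of view 1, find the most similar region of view 2
# by cosine similarity.
def best_matches(feats1, feats2):
    # feats1: (T1, D), feats2: (T2, D) dense region features
    f1 = F.normalize(feats1, dim=-1)
    f2 = F.normalize(feats2, dim=-1)
    sim = f1 @ f2.t()                 # (T1, T2)
    scores, idx = sim.max(dim=1)      # best match in view 2 for every region of view 1
    return idx, scores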

Citation

If you find this repository useful, please consider giving a star and citation 🍺 :

@article{li2021esvit,
  title={Efficient Self-supervised Vision Transformers for Representation Learning},
  author={Li, Chunyuan and Yang, Jianwei and Zhang, Pengchuan and Gao, Mei and Xiao, Bin and Dai, Xiyang and Yuan, Lu and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2106.09785},
  year={2021}
}

Related Projects/Codebase

[Swin Transformers] [Vision Longformer] [Convolutional vision Transformers (CvT)] [Focal Transformers]

Acknowledgement

Our implementation is built partly upon packages: [Dino] [Timm]

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Comments
  • Questions about downstream COCO detection

    Hi, I'm wondering if you can provide a recipe to reproduce the COCO detection results. I've tried to use your pre-trained checkpoint to train the downstream task with Mask R-CNN, but cannot get the results reported in the paper. Not sure if there was something wrong during the training. Could you please provide more details? Thank you!

    opened by actuy 4
  • Unable to reproduce the KNN results

    Hi, I am trying to reproduce the knn results but fail to do so. I am using the pretrained model from the checkpoint on ImageNet-1K following the script provided.

    I got the following results:

    10-NN classifier result: Top1: 1.876, Top5: 3.462
    20-NN classifier result: Top1: 1.872, Top5: 3.912
    100-NN classifier result: Top1: 1.85, Top5: 4.884
    200-NN classifier result: Top1: 1.834, Top5: 5.352

    Is there any chance that the model checkpoint is incorrect?

    Thanks!

    opened by kikacaty 3
  • Throughput comparison (Table 1)

    Hello, I have read your paper and found it very interesting. I was particularly intrigued by Table 1, where you compare the throughput against other methods, including DINO with deit_tiny and a patch size of 16. From the table, EsViT with Swin-T (W=7) has a throughput of 808 and DINO with DeiT-T/16 has 1007, so I expected EsViT to be roughly 20% slower. Yet, when I run both, I do not get this. I attached both logs below.

    DINO

    arch: deit_tiny
    batch_size_per_gpu: 200
    clip_grad: 3.0
    data_path: /ilsvrc2012/ILSVRC2012_img_train
    dist_url: env://
    epochs: 100
    freeze_last_layer: 1
    global_crops_scale: (0.4, 1.0)
    gpu: 0
    local_crops_number: 8
    local_crops_scale: (0.05, 0.4)
    local_rank: 0
    lr: 0.0005
    min_lr: 1e-06
    momentum_teacher: 0.996
    norm_last_layer: True
    num_workers: 24
    optimizer: adamw
    out_dim: 65536
    output_dir: output_dir
    patch_size: 16
    rank: 0
    saveckp_freq: 10
    seed: 0
    teacher_temp: 0.04
    use_bn_in_head: False
    use_fp16: True
    warmup_epochs: 10
    warmup_teacher_temp: 0.04
    warmup_teacher_temp_epochs: 0
    weight_decay: 0.04
    weight_decay_end: 0.4
    world_size: 4
    Data loaded: there are 1281167 images.
    Student and Teacher are built: they are both deit_tiny network.
    Loss, optimizer and schedulers ready.
    Starting DINO training !
    
    Epoch: [0/100] Total time: 0:38:22 (1.438374 s / it)
    Averaged stats: loss: 6.691907e+00 (8.885959e+00)  lr: 1.551861e-04 (7.808108e-05)  wd: 4.008760e-02 (4.002958e-02)
    
    

    EsViT

    aa: rand-m9-mstd0.5-inc1
    arch: swin_tiny
    aug_opt: dino_aug
    batch_size_per_gpu: 48
    cfg: experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml
    clip_grad: 3.0
    color_jitter: 0.4
    cutmix: 1.0
    cutmix_minmax: None
    data_path: /ilsvrc2012/ILSVRC2012_img_train
    dataset: imagenet1k
    dist_url: env://
    epochs: 100
    freeze_last_layer: 1
    global_crops_scale: (0.4, 1.0)
    gpu: 0
    local_crops_number: (8,)
    local_crops_scale: (0.05, 0.4)
    local_crops_size: (96,)
    local_rank: 0
    lr: 0.0005
    min_lr: 1e-06
    mixup: 0.8
    mixup_mode: batch
    mixup_prob: 1.0
    mixup_switch_prob: 0.5
    momentum_teacher: 0.996
    norm_last_layer: False
    num_mixup_views: 10
    num_workers: 10
    optimizer: adamw
    opts: []
    out_dim: 65536
    output_dir: output_dir
    patch_size: 16
    pretrained_weights_ckpt: 
    rank: 0
    recount: 1
    remode: pixel
    reprob: 0.25
    resplit: False
    sampler: distributed
    saveckp_freq: 5
    seed: 0
    smoothing: 0.0
    teacher_temp: 0.07
    train_interpolation: bicubic
    tsv_mode: False
    use_bn_in_head: False
    use_dense_prediction: True
    use_fp16: True
    use_mixup: False
    warmup_epochs: 10
    warmup_teacher_temp: 0.04
    warmup_teacher_temp_epochs: 30
    weight_decay: 0.04
    weight_decay_end: 0.4
    world_size: 4
    zip_mode: False
    Data loaded: there are 1281167 images.
    => merge config from experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml
    Unknow architecture: swin_tiny
    Student and Teacher are built: they are both swin_tiny network.
    Loss, optimizer and schedulers ready.
    Starting training of EsViT ! from epoch 0
    
    Epoch: [0/100] Total time: 2:09:19 (1.162958 s / it)
    Averaged stats: loss: 4.714716 (6.780889)  lr: 0.000037 (0.000019)  wd: 0.040089 (0.040030)
    

    So EsViT (with swin_tiny W=7) is about 3 times slower than DINO (with deit_tiny and P=16). This is run on a machine with 4xV100 GPUs. In both cases, I set the batch size to roughly the highest value I could without running into out-of-memory exceptions.

    Is it the case that my run of EsViT should be this row in table 1?

    EsViT, Swin-T 28 808 78.1 75.7
    

    If so, do you know why I am getting such contradictory results?

    Thank you!

    opened by tileb1 3
  • Mixup & Cutmix during Pre-Training

    Hi @ChunyuanLI, I've noticed the use of mixup and cutmix during pre-training, which is not included in DINO. I'm wondering about the performance gain brought by applying mixup & cutmix. Have you ever run any related experiments pre-trained without mixup? I'm especially interested in vanilla DINO with Swin-T/Swin-B as the backbone, i.e., EsViT with only the view-level task and without mixup & cutmix. It would be nice if you could share those results.

    opened by cashincashout 2
  • Results without multi-crop

    Hello, thanks for the code. I have noticed that the multi-crop trick can boost performance by about 5% top-1 accuracy (on DINO, SwAV). Since your code base supports disabling this trick, did you conduct experiments without multi-crop, and would you be so kind as to share the results on ImageNet?

    enhancement 
    opened by BoPang1996 2
  • Missing requirements

    Hi!

    I am trying to load esvit on Google Colaboratory with the following code:

    !git clone https://github.com/microsoft/esvit.git
    !pip install -r ./esvit/requirements.txt
    
    import models.vision_transformer as vits
    

    I got the following error:

    ...
    /usr/local/lib/python3.7/dist-packages/timm/models/layers/helpers.py in <module>
          4 """
          5 from itertools import repeat
    ----> 6 from torch._six import container_abcs
          7 
          8
    ImportError: cannot import name 'container_abcs' from 'torch._six' (/usr/local/lib/python3.7/dist-packages/torch/_six.py)
    

    which seems to be related to the torch version. However, after downgrading torch (<1.11.0), I get errors on other torch imports.

    Is a testing notebook available?

    opened by robertanto 1
  • [QUESTION]  Results on correspondence learning

    Hello, I cannot seem to find in the paper which features are used for the correspondence matching in the appendix. Is it the last-layer features (coarse-grained), the first-layer features (fine-grained), or a combination of features at all depths (and if so, how are they combined)? Thanks!

    opened by tileb1 1
  • Maybe a bug in SwinTrans

    https://github.com/microsoft/esvit/blob/c5d73eba76d76136a5ed162263b934df57ec04dc/models/swin_transformer.py#L300

    In this line, should (self.H, self.W) be (H, W)?

    opened by BoPang1996 1
  • Is `self.head_dense` missing in model definition?

    A little confused: self.head_dense is not explicitly defined in several model files. There is only a None assignment statement in:

    https://github.com/microsoft/esvit/blob/main/models/swin_transformer.py#L655 https://github.com/microsoft/esvit/blob/main/models/vision_longformer.py#L518 https://github.com/microsoft/esvit/blob/main/models/vision_transformer.py#L171

    Am I missing something?

    opened by WarBean 1
  • Questions about paper COCO detection numbers

    Hi all,

    In Table 4 of the arXiv preprint https://arxiv.org/pdf/2106.09785.pdf, the reported AP^bb of Supervised is 46.0. Why is this number lower than the ones reported in the Swin paper?

    • See Table 2 (b) of https://arxiv.org/pdf/2103.14030.pdf
    • Swin-S AP^box=51.8

    Also, what object detection method are you using? Is it Mask RCNN or Cascade? There is no mention of the detection method used in the paper.

    Thanks!

    opened by gabrielhuang 1
  • Training on custom dataset

    What should a custom dataset structure look like, and how do I train on it? Let's say I have a binary dataset with two class folders: 1. Has cat, 2. No cat. Each sub-folder contains images. What changes to the code and dataset should I make? Thanks in advance.

    opened by madr3z 1
  • Add `$schema` to `cgmanifest.json`

    This pull request adds the JSON schema for cgmanifest.json.

    FAQ

    Why?

    A JSON schema helps you to ensure that your cgmanifest.json file is valid. JSON schema validation is a built-in feature in most modern IDEs like Visual Studio and Visual Studio Code. Most modern IDEs also provide code completion for JSON schemas.

    How can I validate my cgmanifest.json file?

    Most modern IDEs like Visual Studio and Visual Studio Code have a built-in feature to validate JSON files. You can also use this small script to validate your cgmanifest.json file.

    Why does it suggest camel case for the properties?

    Component Detection is able to read camel case and pascal case properties. However, the JSON schema doesn't have a case-insensitive mode. We therefore suggest camel case as it's the most common format for JSON.

    Why is the diff so large?

    To deserialize the cgmanifest.json file, we use JSON.parse(). However, to serialize the JSON again we use prettier. We found that, in general, it gave smaller diffs than the default JSON.stringify() function.

    opened by JamieMagee 0
  • Loss stops decreasing

    Hi,

    I'm retraining EsViT from scratch on a custom dataset (1.7M images) with Swin-T, W=14, a batch size of 64, the default lr and wd, and the following hyperparameters:

    --teacher_temp 0.04
    --warmup_teacher_temp 0.03
    --momentum_teacher 0.9996
    --warmup_epochs 10
    --warmup_teacher_temp_epochs 30
    --use_dense_prediction True
    --use_fp16 True
    --out_dim 65536
    --epochs 300

    The loss does not decrease from epoch 70 onwards.

    Which hyperparameters would you recommend tuning when resuming from, let's say, epoch 70?

    Thanks

    opened by SarahFrem 0
  • Bump numpy from 1.19.3 to 1.22.0

    Bumps numpy from 1.19.3 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • can't load swin-tiny checkpoint right

    Hi, I used swin_transformer.py to load the Swin-T model pre-trained on ImageNet-1K, and got the following message:

    msg: _IncompatibleKeys(missing_keys=['layers.0.blocks.1.attn_mask', 'layers.1.blocks.1.attn_mask', 'layers.2.blocks.1.attn_mask', 'layers.2.blocks.3.attn_mask', 'layers.2.blocks.5.attn_mask', 'head.weight', 'head.bias'], unexpected_keys=['head.mlp.0.weight', 'head.mlp.0.bias', 'head.mlp.2.weight', 'head.mlp.2.bias', 'head.mlp.4.weight', 'head.mlp.4.bias', 'head.last_layer.weight_g', 'head.last_layer.weight_v'])

    Why are there missing keys here?

    opened by ywdong 0
  • Question about the Learning Rate used for pretraining

    Hello.

    Thank you for the wonderful work! I have some questions about the learning rate used to pre-train the Swin models in Table 1. As the logs show, the learning rate for the Swin-T model is 0.0005180447994195404 at epoch 201, while the learning rate for the Swin-S/B models is 0.00025939212681290886 at epoch 201. However, the parameters shown under the 'args' keyword in the pre-trained models are the same.

    Could you please tell me why there is a difference in learning rate in the training log?

    Thanks in advance.

    opened by Annbless 0