EsViT: Efficient self-supervised Vision Transformers

Overview

Efficient Self-Supervised Vision Transformers (EsViT)

PyTorch implementation for EsViT, built with two techniques:

  • A multi-stage Transformer architecture. Three multi-stage Transformer variants are implemented under the folder models.
  • A region-level matching pre-training task. The region-level matching task is implemented in the class DDINOLoss(nn.Module) (Line 648) in main_esvit.py. Please pass --use_dense_prediction True; otherwise only the view-level task is used. A simplified sketch of the two objectives is given below.
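
For intuition, the sketch below is a hypothetical, simplified rendering of the two objectives; it is not the repository's DDINOLoss, which additionally handles multi-crop view pairs, teacher centering and temperature schedules. It shows a DINO-style cross-entropy on view-level outputs plus a region-level term that matches each student region to its most similar teacher region by cosine similarity of their features.

import torch
import torch.nn.functional as F

def view_level_loss(student_out, teacher_out, student_temp=0.1, teacher_temp=0.04):
    # Cross-entropy between the sharpened teacher distribution and the student
    # distribution over prototypes; both inputs have shape (B, K).
    t = F.softmax(teacher_out / teacher_temp, dim=-1).detach()
    s = F.log_softmax(student_out / student_temp, dim=-1)
    return -(t * s).sum(dim=-1).mean()

def region_level_loss(student_regions, teacher_regions, student_feats, teacher_feats,
                      student_temp=0.1, teacher_temp=0.04):
    # student_regions / teacher_regions: (B, T, K) prototype scores per region;
    # student_feats / teacher_feats: (B, T, D) region features used for matching.
    sim = F.normalize(teacher_feats, dim=-1) @ F.normalize(student_feats, dim=-1).transpose(1, 2)
    match = sim.argmax(dim=1)  # (B, T_student): best-matching teacher region per student region
    matched = torch.gather(teacher_regions, 1,
                           match.unsqueeze(-1).expand(-1, -1, teacher_regions.size(-1)))
    t = F.softmax(matched / teacher_temp, dim=-1).detach()
    s = F.log_softmax(student_regions / student_temp, dim=-1)
    return -(t * s).sum(dim=-1).mean()
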
Figure: Efficiency vs. accuracy comparison under the linear classification protocol on ImageNet. Left: throughput of all SoTA SSL vision systems; circle sizes indicate model parameter counts. Right: performance over varied parameter counts for models with a moderate throughput/#parameters ratio. Please refer to Section 4.1 of the paper for details.

Pretrained models

You can download the full checkpoints (trained with both view-level and region-level tasks on ImageNet-1K with batch size 512), which contain backbone and projection head weights for both the student and teacher networks.

| arch | params | linear | k-NN | download | logs |
| --- | --- | --- | --- | --- | --- |
| EsViT (Swin-T, W=7) | 28M | 78.0% | 75.7% | full ckpt | train / linear / knn |
| EsViT (Swin-S, W=7) | 49M | 79.5% | 77.7% | full ckpt | train / linear / knn |
| EsViT (Swin-B, W=7) | 87M | 80.4% | 78.9% | full ckpt | train / linear / knn |
| EsViT (Swin-T, W=14) | 28M | 78.7% | 77.0% | full ckpt | train / linear / knn |
| EsViT (Swin-S, W=14) | 49M | 80.8% | 79.1% | full ckpt | train / linear / knn |
| EsViT (Swin-B, W=14) | 87M | 81.3% | 79.3% | full ckpt | train / linear / knn |

EsViT (Swin-T, W=7) with different pre-train datasets (view-level task only)

| arch | params | batch size | pre-train dataset | linear | k-NN | download | logs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EsViT | 28M | 512 | ImageNet-1K | 77.0% | 74.2% | full ckpt | train / linear / knn |
| EsViT | 28M | 1024 | ImageNet-1K | 77.1% | 73.7% | full ckpt | train / linear / knn |
| EsViT | 28M | 1024 | WebVision-v1 | 75.4% | 69.4% | full ckpt | train / linear / knn |
| EsViT | 28M | 1024 | OpenImages-v4 | 69.6% | 60.3% | full ckpt | train / linear / knn |
| EsViT | 28M | 1024 | ImageNet-22K | 73.5% | 66.1% | full ckpt | train / linear / knn |
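
The full checkpoints above contain both the student and teacher networks. The snippet below is a hypothetical loading sketch for extracting the teacher backbone; the file name and the key names ('teacher', 'module.'/'backbone.' prefixes, 'head*') are assumptions and should be checked against the released files.

import torch

# Hypothetical sketch; the checkpoint file name and key layout are placeholders.
ckpt = torch.load("esvit_swin_tiny_full_ckpt.pth", map_location="cpu")
teacher = ckpt["teacher"] if "teacher" in ckpt else ckpt
backbone_state = {
    k.replace("module.", "").replace("backbone.", ""): v
    for k, v in teacher.items()
    if not k.replace("module.", "").startswith("head")  # drop projection-head weights
}
# model.load_state_dict(backbone_state, strict=False)  # model: a matching Swin backbone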

Pre-training

One-node training

To train on 1 node with 16 GPUs for Swin-T model size:

PROJ_PATH=your_esvit_project_path
DATA_PATH=$PROJ_PATH/project/data/imagenet

OUT_PATH=$PROJ_PATH/output/esvit_exp/ssl/swin_tiny_imagenet/
python -m torch.distributed.launch --nproc_per_node=16 main_esvit.py --arch swin_tiny --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml 

The main training script is main_esvit.py; it runs the training loop and takes the following arguments (among others):

  • --use_dense_prediction: whether or not to use the region matching task in pre-training
  • --arch: switches between different sparse self-attention mechanisms in the multi-stage Transformer architecture. Example architecture choices for EsViT training include [swin_tiny, swin_small, swin_base, swin_large, cvt_tiny, vil_2262]. The configuration file should be adjusted accordingly; we provide examples below. One may specify the network configuration by editing the YAML files under experiments/imagenet/*/*.yaml. The default window size is 7; to use a multi-stage architecture with window size 14, choose YAML files with window14 in their filenames.
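
The base learning rate in these recipes is 0.0005 (see the bl_lr0.0005 output paths below). If EsViT follows the common DINO convention, this base value corresponds to a 256-image global batch and is scaled linearly with the effective batch size; the helper below is a hypothetical illustration of that rule, to be verified against main_esvit.py.

# Hedged sketch of DINO-style linear learning-rate scaling; whether EsViT uses the
# exact same rule should be confirmed in main_esvit.py.
def scaled_lr(base_lr: float, batch_size_per_gpu: int, num_gpus: int, ref_batch: int = 256) -> float:
    global_batch = batch_size_per_gpu * num_gpus
    return base_lr * global_batch / ref_batch

# Example: the one-node Swin-T recipe above uses 16 GPUs x 32 images = 512 images per step.
print(scaled_lr(0.0005, 32, 16))  # -> 0.001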

To train on 1 node with 16 GPUs for Convolutional vision Transformer (CvT) models:

python -m torch.distributed.launch --nproc_per_node=16 main_esvit.py --arch cvt_tiny --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --aug-opt dino_aug --cfg experiments/imagenet/cvt_v4/s1.yaml

To train on 1 node with 16 GPUs for Vision Longformer (ViL) models:

python -m torch.distributed.launch --nproc_per_node=16 main_esvit.py --arch vil_2262 --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --aug-opt dino_aug --cfg experiments/imagenet/vil/vil_small/base.yaml MODEL.SPEC.MSVIT.ARCH 'l1,h3,d96,n2,s1,g1,p4,f7,a0_l2,h6,d192,n2,s1,g1,p2,f7,a0_l3,h12,d384,n6,s0,g1,p2,f7,a0_l4,h24,d768,n2,s0,g0,p2,f7,a0' MODEL.SPEC.MSVIT.MODE 1 MODEL.SPEC.MSVIT.VIL_MODE_SWITCH 0.75
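
The MODEL.SPEC.MSVIT.ARCH string packs one comma-separated spec per stage, with stages joined by '_' and each field written as a letter key followed by an integer. The hypothetical helper below only splits the string for inspection; judging from the values, d looks like the embedding dimension, h the number of heads and n the number of blocks per stage, while the remaining letters are left uninterpreted here.

# Hypothetical helper to inspect the ViL architecture string; it makes no claim
# about the semantics of fields beyond splitting them into per-stage dictionaries.
def parse_msvit_arch(arch: str):
    stages = []
    for stage_spec in arch.split("_"):
        fields = {token[0]: int(token[1:]) for token in stage_spec.split(",")}
        stages.append(fields)
    return stages

arch = ("l1,h3,d96,n2,s1,g1,p4,f7,a0_l2,h6,d192,n2,s1,g1,p2,f7,a0_"
        "l3,h12,d384,n6,s0,g1,p2,f7,a0_l4,h24,d768,n2,s0,g0,p2,f7,a0")
for stage in parse_msvit_arch(arch):
    print(stage)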

Multi-node training

To train on 2 nodes with 16 GPUs each (total 32 GPUs) for Swin-Small model size:

OUT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_small/bl_lr0.0005_gpu16_bs16_multicrop_epoch300_dino_aug_window14
python main_evsit_mnodes.py --num_nodes 2 --num_gpus_per_node 16 --data_path $DATA_PATH/train --output_dir $OUT_PATH/continued_from0200_dense --batch_size_per_gpu 16 --arch swin_small --zip_mode True --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --cfg experiments/imagenet/swin/swin_small_patch4_window14_224.yaml --use_dense_prediction True --pretrained_weights_ckpt $OUT_PATH/checkpoint0200.pth
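
Multi-node training ultimately relies on a standard torch.distributed process group. The sketch below shows the per-process setup that the launch utilities are expected to perform; it is illustrative only, and the wrapper script's actual interface may differ.

import os
import torch
import torch.distributed as dist

# Minimal sketch of the per-process distributed setup; RANK, WORLD_SIZE, LOCAL_RANK,
# MASTER_ADDR and MASTER_PORT are normally exported by the launcher.
def init_distributed():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return rank, world_size, local_rank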

Evaluation

k-NN and Linear classification on ImageNet

To train a supervised linear classifier on frozen weights on a single node with 4 GPUs, run eval_linear.py. To train a k-NN classifier on frozen weights on a single node with 4 GPUs, run eval_knn.py. Please specify --arch, --cfg and --pretrained_weights to choose a pre-trained checkpoint. If you want to evaluate the last checkpoint of EsViT with Swin-T, you can run, for example:

PROJ_PATH=your_esvit_project_path
DATA_PATH=$PROJ_PATH/project/data/imagenet

OUT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300
CKPT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300/checkpoint.pth

python -m torch.distributed.launch --nproc_per_node=4 eval_linear.py --data_path $DATA_PATH --output_dir $OUT_PATH/lincls/epoch0300 --pretrained_weights $CKPT_PATH --checkpoint_key teacher --batch_size_per_gpu 256 --arch swin_tiny --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml --n_last_blocks 4 --num_labels 1000 MODEL.NUM_CLASSES 0
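
The linear protocol trains a single linear classifier on frozen features. The sketch below is a simplified illustration only; eval_linear.py has its own feature aggregation (see the --n_last_blocks flag above) and optimization schedule.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified linear-probe sketch: one linear layer on frozen backbone features.
class LinearProbe(nn.Module):
    def __init__(self, feat_dim: int, num_labels: int = 1000):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_labels)

    def forward(self, feats):
        return self.fc(feats)

def probe_step(backbone, probe, optimizer, images, labels):
    backbone.eval()
    with torch.no_grad():                      # the backbone stays frozen
        feats = backbone(images)
    loss = F.cross_entropy(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()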

python -m torch.distributed.launch --nproc_per_node=4 eval_knn.py --data_path $DATA_PATH --dump_features $OUT_PATH/features/epoch0300 --pretrained_weights $CKPT_PATH --checkpoint_key teacher --batch_size_per_gpu 256 --arch swin_tiny --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml MODEL.NUM_CLASSES 0
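
The k-NN protocol classifies each test feature by a vote over its nearest training features; a similarity-weighted vote (the protocol popularized by DINO) is the usual implementation. The sketch below is a simplified, assumed version; details such as the temperature, the set of k values and the chunked computation in eval_knn.py may differ.

import torch

# Weighted k-NN sketch on frozen, L2-normalized features.
@torch.no_grad()
def knn_classify(train_feats, train_labels, test_feats, k=20, T=0.07, num_classes=1000):
    sim = test_feats @ train_feats.t()            # cosine similarity, (N_test, N_train)
    topk_sim, topk_idx = sim.topk(k, dim=1)
    topk_labels = train_labels[topk_idx]          # (N_test, k)
    weights = (topk_sim / T).exp()
    votes = torch.zeros(test_feats.size(0), num_classes, device=test_feats.device)
    votes.scatter_add_(1, topk_labels, weights)   # similarity-weighted vote per class
    return votes.argmax(dim=1)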

Analysis/Visualization of correspondence and attention maps

You can analyze the learned models by running python run_analysis.py. One example to analyze EsViT (Swin-T) is shown.

For an individual image (with path --image_path $IMG_PATH), we visualize the attention maps and correspondence of the last layer:

python run_analysis.py --arch swin_tiny --image_path $IMG_PATH --output_dir $OUT_PATH --pretrained_weights $CKPT_PATH --learning ssl --seed $SEED --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml --vis_attention True --vis_correspondence True MODEL.NUM_CLASSES 0 

For an image dataset (with path --data_path $DATA_PATH), we quantitatively measure the correspondence:

python run_analysis.py --arch swin_tiny --data_path $DATA_PATH --output_dir $OUT_PATH --pretrained_weights $CKPT_PATH --learning ssl --seed $SEED --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml  --measure_correspondence True MODEL.NUM_CLASSES 0 

For more examples, please see scripts/scripts_local/run_analysis.sh.
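
At its core, the correspondence analysis is nearest-neighbor matching between the dense features of two views of the same image. The sketch below is a simplified illustration of that computation; run_analysis.py adds visualization and aggregation on top.

import torch
import torch.nn.functional as F

# For each region feature of view 1, find the most similar region of view 2
# by cosine similarity.
def best_matches(feats1, feats2):
    # feats1: (T1, D), feats2: (T2, D) dense region features
    f1 = F.normalize(feats1, dim=-1)
    f2 = F.normalize(feats2, dim=-1)
    sim = f1 @ f2.t()                 # (T1, T2)
    scores, idx = sim.max(dim=1)      # best match in view 2 for every region of view 1
    return idx, scores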

Citation

If you find this repository useful, please consider giving a star and citation 🍺 :

@article{li2021esvit,
  title={Efficient Self-supervised Vision Transformers for Representation Learning},
  author={Li, Chunyuan and Yang, Jianwei and Zhang, Pengchuan and Gao, Mei and Xiao, Bin and Dai, Xiyang and Yuan, Lu and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2106.09785},
  year={2021}
}

Related Projects/Codebase

[Swin Transformers] [Vision Longformer] [Convolutional vision Transformers (CvT)] [Focal Transformers]

Acknowledgement

Our implementation is built partly upon packages: [Dino] [Timm]

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Comments
  • Questions about downstream COCO detection

    Hi, I'm wondering if you can provide a recipe to reproduce the COCO detection results. I've tried to use your pre-trained checkpoint to train the downstream task with Mask R-CNN, but cannot get the results reported in the paper. Not sure if there was something wrong during the training. Could you please provide more details? Thank you!

    opened by actuy 4
  • Unable to reproduce the KNN results

    Hi, I am trying to reproduce the knn results but fail to do so. I am using the pretrained model from the checkpoint on ImageNet-1K following the script provided.

    I got the following results:

    10-NN classifier result: Top1: 1.876, Top5: 3.462
    20-NN classifier result: Top1: 1.872, Top5: 3.912
    100-NN classifier result: Top1: 1.85, Top5: 4.884
    200-NN classifier result: Top1: 1.834, Top5: 5.352

    Is there any chance that the model checkpoint is incorrect?

    Thanks!

    opened by kikacaty 3
  • Throughput comparison (Table 1)

    Hello, I have read your paper and found it very interesting. I was particularly intrigued by Table 1, where you compare the throughput against other methods, including DINO with deit_tiny and a patch size of 16. From the table, EsViT with Swin-T (W=7) has a throughput of 808 and DINO with DeiT-T/16 has 1007, so I expected EsViT to be roughly 20% slower. Yet, when I run both, I do not get this. I attached both logs below.

    DINO

    arch: deit_tiny
    batch_size_per_gpu: 200
    clip_grad: 3.0
    data_path: /ilsvrc2012/ILSVRC2012_img_train
    dist_url: env://
    epochs: 100
    freeze_last_layer: 1
    global_crops_scale: (0.4, 1.0)
    gpu: 0
    local_crops_number: 8
    local_crops_scale: (0.05, 0.4)
    local_rank: 0
    lr: 0.0005
    min_lr: 1e-06
    momentum_teacher: 0.996
    norm_last_layer: True
    num_workers: 24
    optimizer: adamw
    out_dim: 65536
    output_dir: output_dir
    patch_size: 16
    rank: 0
    saveckp_freq: 10
    seed: 0
    teacher_temp: 0.04
    use_bn_in_head: False
    use_fp16: True
    warmup_epochs: 10
    warmup_teacher_temp: 0.04
    warmup_teacher_temp_epochs: 0
    weight_decay: 0.04
    weight_decay_end: 0.4
    world_size: 4
    Data loaded: there are 1281167 images.
    Student and Teacher are built: they are both deit_tiny network.
    Loss, optimizer and schedulers ready.
    Starting DINO training !
    
    Epoch: [0/100] Total time: 0:38:22 (1.438374 s / it)
    Averaged stats: loss: 6.691907e+00 (8.885959e+00)  lr: 1.551861e-04 (7.808108e-05)  wd: 4.008760e-02 (4.002958e-02)
    
    

    EsViT

    aa: rand-m9-mstd0.5-inc1
    arch: swin_tiny
    aug_opt: dino_aug
    batch_size_per_gpu: 48
    cfg: experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml
    clip_grad: 3.0
    color_jitter: 0.4
    cutmix: 1.0
    cutmix_minmax: None
    data_path: /ilsvrc2012/ILSVRC2012_img_train
    dataset: imagenet1k
    dist_url: env://
    epochs: 100
    freeze_last_layer: 1
    global_crops_scale: (0.4, 1.0)
    gpu: 0
    local_crops_number: (8,)
    local_crops_scale: (0.05, 0.4)
    local_crops_size: (96,)
    local_rank: 0
    lr: 0.0005
    min_lr: 1e-06
    mixup: 0.8
    mixup_mode: batch
    mixup_prob: 1.0
    mixup_switch_prob: 0.5
    momentum_teacher: 0.996
    norm_last_layer: False
    num_mixup_views: 10
    num_workers: 10
    optimizer: adamw
    opts: []
    out_dim: 65536
    output_dir: output_dir
    patch_size: 16
    pretrained_weights_ckpt: 
    rank: 0
    recount: 1
    remode: pixel
    reprob: 0.25
    resplit: False
    sampler: distributed
    saveckp_freq: 5
    seed: 0
    smoothing: 0.0
    teacher_temp: 0.07
    train_interpolation: bicubic
    tsv_mode: False
    use_bn_in_head: False
    use_dense_prediction: True
    use_fp16: True
    use_mixup: False
    warmup_epochs: 10
    warmup_teacher_temp: 0.04
    warmup_teacher_temp_epochs: 30
    weight_decay: 0.04
    weight_decay_end: 0.4
    world_size: 4
    zip_mode: False
    Data loaded: there are 1281167 images.
    => merge config from experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml
    Unknow architecture: swin_tiny
    Student and Teacher are built: they are both swin_tiny network.
    Loss, optimizer and schedulers ready.
    Starting training of EsViT ! from epoch 0
    
    Epoch: [0/100] Total time: 2:09:19 (1.162958 s / it)
    Averaged stats: loss: 4.714716 (6.780889)  lr: 0.000037 (0.000019)  wd: 0.040089 (0.040030)
    

    So EsViT (with swin_tiny W=7) is about 3 times slower than DINO (with deit_tiny and P=16). This is run on a machine with 4xV100 GPUs. In both cases, I set the batch size to roughly the highest value I could without running into out-of-memory exceptions.

    Is it the case that my run of EsViT should be this row in table 1?

    EsViT, Swin-T 28 808 78.1 75.7
    

    If so, do you know why I am getting such contradictory results?

    Thank you!

    opened by tileb1 3
  • Mixup & Cutmix during Pre-Training

    Hi @ChunyuanLI, I've noticed the use of mixup and cutmix during pre-training, which is not included in DINO. I'm wondering about the performance gain brought by applying mixup & cutmix. Have you ever run any related experiments pre-trained without mixup? I'm especially interested in vanilla DINO with Swin-T/Swin-B as the backbone, i.e., EsViT with only the view-level task and without mixup & cutmix. It would be nice if you could share those results.

    opened by cashincashout 2
  • Results without multi-crop

    Hello, thanks for the code. I have noticed that the multi-crop trick can boost performance by about 5% top-1 accuracy (on DINO, SwAV). Since your code base supports disabling this trick, did you conduct experiments without multi-crop, and would you be so kind as to share the results on ImageNet?

    enhancement 
    opened by BoPang1996 2
  • Missing requirements

    Hi!

    I am trying to load esvit on Google Colaboratory with the following code:

    !git clone https://github.com/microsoft/esvit.git
    !pip install -r ./esvit/requirements.txt
    
    import models.vision_transformer as vits
    

    I got the following error:

    ...
    /usr/local/lib/python3.7/dist-packages/timm/models/layers/helpers.py in <module>
          4 """
          5 from itertools import repeat
    ----> 6 from torch._six import container_abcs
          7 
          8
    ImportError: cannot import name 'container_abcs' from 'torch._six' (/usr/local/lib/python3.7/dist-packages/torch/_six.py)
    

    which seems to be related to the torch version. However, after downgrading torch (<1.11.0), I get errors on other torch imports.

    Is a testing notebook available?

    opened by robertanto 1
  • [QUESTION]  Results on correspondence learning

    Hello, I cannot seem to find in the paper which features are used for the correspondence matching in the appendix. Is it the last-layer features (coarse-grained), the first-layer features (fine-grained), or a combination of features at all depths (and if so, how are they combined)? Thanks!

    opened by tileb1 1
  • Maybe a bug in SwinTrans

    https://github.com/microsoft/esvit/blob/c5d73eba76d76136a5ed162263b934df57ec04dc/models/swin_transformer.py#L300

    In this line, should (self.H, self.W) be (H, W)?

    opened by BoPang1996 1
  • Is `self.head_dense` missing in model definition?

    A little confused: self.head_dense is not explicitly defined in several model files. There is only a None assignment statement in:

    https://github.com/microsoft/esvit/blob/main/models/swin_transformer.py#L655 https://github.com/microsoft/esvit/blob/main/models/vision_longformer.py#L518 https://github.com/microsoft/esvit/blob/main/models/vision_transformer.py#L171

    Am I missing something?

    opened by WarBean 1
  • Questions about paper COCO detection numbers

    Hi all,

    In Table 4 of the arXiv preprint https://arxiv.org/pdf/2106.09785.pdf, the reported AP^bb of Supervised is 46.0. Why is this number lower than the ones reported in the Swin paper?

    • See Table 2 (b) of https://arxiv.org/pdf/2103.14030.pdf
    • Swin-S AP^box=51.8

    Also, what object detection method are you using? Is it Mask RCNN or Cascade? There is no mention of the detection method used in the paper.

    Thanks!

    opened by gabrielhuang 1
  • Training on custom dataset

    What should a custom dataset structure look like, and how do I train on it? Let's say I have a binary dataset with two class folders: 1. Has cat, 2. No cat. Each sub-folder contains images. What changes to the code and dataset should I make? Thanks in advance.

    opened by madr3z 1
  • Add `$schema` to `cgmanifest.json`

    This pull request adds the JSON schema for cgmanifest.json.

    FAQ

    Why?

    A JSON schema helps you to ensure that your cgmanifest.json file is valid. JSON schema validation is a built-in feature in most modern IDEs like Visual Studio and Visual Studio Code. Most modern IDEs also provide code completion for JSON schemas.

    How can I validate my cgmanifest.json file?

    Most modern IDEs like Visual Studio and Visual Studio Code have a built-in feature to validate JSON files. You can also use this small script to validate your cgmanifest.json file.

    Why does it suggest camel case for the properties?

    Component Detection is able to read camel case and pascal case properties. However, the JSON schema doesn't have a case-insensitive mode. We therefore suggest camel case as it's the most common format for JSON.

    Why is the diff so large?

    To deserialize the cgmanifest.json file, we use JSON.parse(). However, to serialize the JSON again we use prettier. We found that, in general, it gave smaller diffs than the default JSON.stringify() function.

    opened by JamieMagee 0
  • Loss stops decreasing

    Hi,

    I'm retraining EsViT from scratch on a custom dataset (1.7M images) with Swin-T, W=14, a batch size of 64, the default lr and wd, and the following hyperparameters:

    --teacher_temp 0.04
    --warmup_teacher_temp 0.03
    --momentum_teacher 0.9996
    --warmup_epochs 10
    --warmup_teacher_temp_epochs 30
    --use_dense_prediction True
    --use_fp16 True
    --out_dim 65536
    --epochs 300

    The loss does not decrease from epoch 70 onwards.

    Which hyperparameters would you recommend tuning when resuming from, let's say, epoch 70?

    Thanks

    opened by SarahFrem 0
  • Bump numpy from 1.19.3 to 1.22.0

    Bumps numpy from 1.19.3 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • can't load swin-tiny checkpoint right

    Hi, I used swin_transformer.py to load the Swin-T model pre-trained on ImageNet-1K, and got the following message:

    msg: _IncompatibleKeys(missing_keys=['layers.0.blocks.1.attn_mask', 'layers.1.blocks.1.attn_mask', 'layers.2.blocks.1.attn_mask', 'layers.2.blocks.3.attn_mask', 'layers.2.blocks.5.attn_mask', 'head.weight', 'head.bias'], unexpected_keys=['head.mlp.0.weight', 'head.mlp.0.bias', 'head.mlp.2.weight', 'head.mlp.2.bias', 'head.mlp.4.weight', 'head.mlp.4.bias', 'head.last_layer.weight_g', 'head.last_layer.weight_v'])

    Why are there missing keys here?

    opened by ywdong 0
  • Question about the Learning Rate used for pretraining

    Hello.

    Thank you for the wonderful work! I have some questions about the learning rate used to pre-train the Swin models in Table 1. As the logs show, the learning rate for the Swin-T model is 0.0005180447994195404 at epoch 201, while the learning rate for the Swin-S/B models is 0.00025939212681290886 at epoch 201. However, the parameters shown under the 'args' keyword in the pre-trained models are the same.

    Could you please tell me why there is a difference in learning rate in the training log?

    Thanks in advance.

    opened by Annbless 0