Unofficial PyTorch reimplementation of the paper Swin Transformer V2: Scaling Up Capacity and Resolution

Overview

Swin Transformer V2: Scaling Up Capacity and Resolution

Unofficial PyTorch reimplementation of the paper Swin Transformer V2: Scaling Up Capacity and Resolution by Ze Liu, Han Hu et al. (Microsoft Research Asia).

This repository includes a pure PyTorch implementation of the Swin Transformer V2.

The official Swin Transformer V1 implementation is available here. Currently (10.01.2022), an official implementation of the Swin Transformer V2 is not publicly available.

Installation

You can simply install the Swin Transformer V2 implementation as a Python package by using pip.

pip install git+https://github.com/ChristophReich1996/Swin-Transformer-V2

Alternatively, you can clone the repository and use the implementation in swin_transformer_v2 directly in your project.

Usage

This implementation provides the configurations reported in the paper (SwinV2-T, SwinV2-S, etc.). You can build the model by calling the corresponding function. Please note that the Swin Transformer V2 (SwinTransformerV2 class) implementation returns the feature maps of each stage of the network (List[torch.Tensor]). If you want to use this implementation for image classification, simply wrap this model and take the final feature map; a wrapper sketch follows the example below.

from swin_transformer_v2 import SwinTransformerV2

from swin_transformer_v2 import swin_transformer_v2_t, swin_transformer_v2_s, swin_transformer_v2_b, \
    swin_transformer_v2_l, swin_transformer_v2_h, swin_transformer_v2_g

# SwinV2-T
swin_transformer: SwinTransformerV2 = swin_transformer_v2_t(in_channels=3,
                                                            window_size=8,
                                                            input_resolution=(256, 256),
                                                            sequential_self_attention=False,
                                                            use_checkpoint=False)
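
For image classification, the backbone can be wrapped as mentioned above. The following is a minimal sketch, not part of this repository: the ClassificationWrapper class, the global-average-pooling choice, and the in_features value (768, i.e. the channel count of the final SwinV2-T stage) are illustrative assumptions.

import torch
import torch.nn as nn

class ClassificationWrapper(nn.Module):
    # Hypothetical wrapper: pools the final stage feature map and applies a linear head
    def __init__(self, backbone: SwinTransformerV2, in_features: int, num_classes: int) -> None:
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(in_features, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        features = self.backbone(images)  # one feature map per stage (List[torch.Tensor])
        pooled = features[-1].mean(dim=(-2, -1))  # global average pooling over the final map
        return self.head(pooled)

classifier = ClassificationWrapper(backbone=swin_transformer, in_features=768, num_classes=1000)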

If you want to change the resolution and/or the window size for fine-tuning or inference, please use the update_resolution method.

# Change resolution and window size of the model
swin_transformer.update_resolution(new_window_size=16, new_input_resolution=(512, 512))
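
After the update, the model should accept inputs at the new resolution. A quick sanity check (a sketch, using a random input batch):

import torch

# Random input at the updated 512 x 512 resolution
dummy_input = torch.randn(1, 3, 512, 512)
feature_maps = swin_transformer(dummy_input)  # one feature map per stage (List[torch.Tensor])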

In case you want to use a custom configuration you can use the SwinTransformerV2 class. The constructor method takes the following parameters.

| Parameter | Description | Type |
| --- | --- | --- |
| in_channels | Number of input channels | int |
| depth | Depth of the stage (number of layers) | int |
| downscale | If true, the input is downsampled (see Fig. 3 of the V1 paper) | bool |
| input_resolution | Input resolution | Tuple[int, int] |
| number_of_heads | Number of attention heads to be utilized | int |
| window_size | Window size to be utilized | int |
| shift_size | Shifting size to be used | int |
| ff_feature_ratio | Ratio of the hidden dimension in the FFN to the input channels | int |
| dropout | Dropout in input mapping | float |
| dropout_attention | Dropout rate of attention map | float |
| dropout_path | Dropout in main path | float |
| use_checkpoint | If true, checkpointing is utilized | bool |
| sequential_self_attention | If true, sequential self-attention is performed | bool |
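
For illustration, a hedged sketch of such a custom configuration, assuming the constructor accepts the parameters exactly as named and typed in the table above (consult the swin_transformer_v2 source for the authoritative signature and any additional arguments):

from swin_transformer_v2 import SwinTransformerV2

# Hypothetical values; parameter names and types follow the table above
custom_model: SwinTransformerV2 = SwinTransformerV2(in_channels=3,
                                                    depth=2,
                                                    downscale=True,
                                                    input_resolution=(256, 256),
                                                    number_of_heads=4,
                                                    window_size=8,
                                                    shift_size=4,
                                                    ff_feature_ratio=4,
                                                    dropout=0.0,
                                                    dropout_attention=0.0,
                                                    dropout_path=0.1,
                                                    use_checkpoint=False,
                                                    sequential_self_attention=False)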

This file includes a full example of how to use this implementation.

Disclaimer

This is a very experimental implementation based on the Swin Transformer V2 paper and the official implementation of the Swin Transformer V1. Since an official implementation of the Swin Transformer V2 has not yet been published, it is not possible to say to what extent this implementation might differ from the original one. If you encounter any issues with this implementation, please open an issue.

Reference

@article{Liu2021,
    title={{Swin Transformer V2: Scaling Up Capacity and Resolution}},
    author={Liu, Ze and Hu, Han and Lin, Yutong and Yao, Zhuliang and Xie, Zhenda and Wei, Yixuan and Ning, Jia and Cao, Yue and Zhang, Zheng and Dong, Li and others},
    journal={arXiv preprint arXiv:2111.09883},
    year={2021}
}
Comments
  • problem about DeformableSwinTransformerBlock

    When I feed input data of shape [1, 3, 240, 250], an error occurs at https://github.com/ChristophReich1996/Swin-Transformer-V2/blob/d1b89227ef0045c3ab667a4f2cdea9ec4f240236/swin_transformer_v2/model_parts.py#L574. The error is listed below:

    RuntimeError: The size of tensor a (48) must match the size of tensor b (256) at non-singleton dimension 2

    It appears that the shapes of self.default_grid.repeat_interleave(repeats=offsets.shape[0], dim=0) and offsets differ. I wonder if you have encountered the same problem; it would be of great help if you could help me solve it, thank you~

    opened by nullxjx 14
  • How do I get it to work at 512*640 resolution?

    model.update_resolution(new_window_size=8, new_input_resolution=(512, 640)) ------> RuntimeError: shape '[0, 2, 2, 768, 8, 8]' is invalid for input of size 196608

    opened by WY-2022 4
  • Training and inference implementation of Swin V2 for object detection task

    Hi, I've previously worked with Swin Transformer V1 for object detection training and inference. Now I want to improve the results with Swin Transformer V2. Is it available, and is there any way to do that? Many thanks.

    opened by queman 2
  • Can you load the pretrained weights for window size 16?

    Modified from the source code. (When I load swinv2_tiny_patch4_window8_256.pth with window size 8, the code runs fine; but when I load swinv2_tiny_patch4_window16_256.pth with window size 16, the weights fail to load due to a shape mismatch. I do not know how to handle this; please advise. The problem is as follows:)

    RuntimeError: Error(s) in loading state_dict for Model:
        size mismatch for model.7.blocks.0.attn.relative_coords_table: copying a param with shape torch.Size([1, 15, 15, 2]) from checkpoint, the shape in current model is torch.Size([1, 31, 31, 2]).
        size mismatch for model.7.blocks.0.attn.relative_position_index: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for model.7.blocks.1.attn.relative_coords_table: copying a param with shape torch.Size([1, 15, 15, 2]) from checkpoint, the shape in current model is torch.Size([1, 31, 31, 2]).
        size mismatch for model.7.blocks.1.attn.relative_position_index: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([256, 256]).

    opened by LUO77123 1
  • About Checkpoints

    Hi! I have another question. If I just install via pip and then:

    class SWIN(nn.Module):
        def __init__(self, num_classes=4):
            super().__init__()
            self.num_classes = num_classes
            # self.pool = nn.MaxPool2d(2, 2)
            self.encoder: SwinTransformerV2 = swin_transformer_v2_t(in_channels=3,
                                                                    window_size=8,
                                                                    input_resolution=(1024, 1280),
                                                                    sequential_self_attention=False,
                                                                    use_checkpoint=True)
            self.p = self.encoder.patch_embedding
            self.encoder0 = self.encoder.stages[0]
            ... ...


    How do I use the checkpoint now? And is there a pre-trained model for v2_base? (Also, when I run it as above, a weird warning arises: 'warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")')

    opened by WY-2022 1
  • relative_coordinates_log

    window_attention.relative_coordinates_log: copying a param with shape torch.Size([256, 2]) from checkpoint, the shape in current model is torch.Size([4096, 2]).

    opened by yuangui0316 1
  • Problems encountered

    Hello, I have encountered some small problems while using SwinV2 over the past few days and would appreciate your answers.

    1. When my input size is small, e.g. 96*96 with window_size=8, the following appears at https://github.com/ChristophReich1996/Swin-Transformer-V2/blob/cff3824f5d75dfd93553867efcf53562c24dd555/swin_transformer_v2/model_parts.py#L289: RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

    2. The paper uses sequential self-attention computation to save GPU memory, but for large input images setting sequential_self_attention=True results in OOM, while sequential_self_attention=False does not.

    3. When I update the code at https://github.com/ChristophReich1996/Swin-Transformer-V2/blob/cff3824f5d75dfd93553867efcf53562c24dd555/swin_transformer_v2/model_parts.py#L207, GPU memory usage also soars further.

    I am currently experimenting with applying SwinV2, as an efficient and memory-saving network, to 3D data, and therefore pay particular attention to GPU memory usage.

    opened by Breeze-Zero 1
  • Scaled cosine attention error

    The scaled cosine attention part of the implementation in model_parts.py seems wrong:

    attention_map: torch.Tensor = torch.einsum("bhqd, bhkd -> bhqk", query, key) \
                                          / torch.maximum(torch.norm(query, dim=-1, keepdim=True)
                                                          * torch.norm(key, dim=-1, keepdim=True),
                                                          torch.tensor(1e-06, device=query.device, dtype=query.dtype))
    

    should be corrected as

    attention_map: torch.Tensor = torch.einsum("bhqd, bhkd -> bhqk", query, key) \
                                          / torch.maximum(torch.norm(query, dim=-1, keepdim=True)
                                                          @ torch.norm(key, dim=-1, keepdim=True).transpose(-2, -1),
                                                          torch.tensor(1e-06, device=query.device, dtype=query.dtype))
    

    since the equation normalizes the attention values for each query and key pair. The original code would produce a norm vector of shape (B, H, N, 1), while the norm matrix we actually need should be of shape (B, H, N, N).
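
    The shape mismatch can be verified directly; a standalone sketch with arbitrary tensor sizes:

    import torch

    B, H, N, D = 2, 4, 16, 32  # batch, heads, tokens, head dimension
    query = torch.randn(B, H, N, D)
    key = torch.randn(B, H, N, D)

    # Element-wise product of the two (B, H, N, 1) norm vectors stays (B, H, N, 1)
    wrong = torch.norm(query, dim=-1, keepdim=True) * torch.norm(key, dim=-1, keepdim=True)
    print(wrong.shape)  # torch.Size([2, 4, 16, 1])

    # Outer product of the norms gives one entry per query/key pair: (B, H, N, N)
    right = torch.norm(query, dim=-1, keepdim=True) @ torch.norm(key, dim=-1, keepdim=True).transpose(-2, -1)
    print(right.shape)  # torch.Size([2, 4, 16, 16])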

    opened by YeolJ00 1
  • Thank you very much for your code, it is concise and easy to read, and I would like to ask you a few more questions

    Thank you very much for your code; it is concise and easy to read. I would like to ask you a few more questions:

    1. Since the output of swin-v2 is a feature map, if I want to use it for semantic segmentation, can I directly upsample to get the final result?
    2. If I want to change the dataset, what do I need to do?
    opened by lizhenye2017 0