Unofficial PyTorch reimplementation of the paper Swin Transformer V2: Scaling Up Capacity and Resolution

Christoph Reich

Last update: Dec 12, 2022

Overview

Swin Transformer V2: Scaling Up Capacity and Resolution

Unofficial PyTorch reimplementation of the paper Swin Transformer V2: Scaling Up Capacity and Resolution by Ze Liu, Han Hu et al. (Microsoft Research Asia).

This repository includes a pure PyTorch implementation of the Swin Transformer V2.

The official Swin Transformer V1 implementation is available here. Currently (10.01.2022), an official implementation of the Swin Transformer V2 is not publicly available.

Installation

You can simply install the Swin Transformer V2 implementation as a Python package by using pip.

pip install git+https://github.com/ChristophReich1996/Swin-Transformer-V2

Alternatively, you can clone the repository and use the implementation in swin_transformer_v2 directly in your project.

Usage

This implementation provides the configurations reported in the paper (SwinV2-T, SwinV2-S, etc.). You can build a model by calling the corresponding function. Please note that the Swin Transformer V2 implementation (SwinTransformerV2 class) returns the feature maps of each stage of the network (List[torch.Tensor]). If you want to use this implementation for image classification, simply wrap the model and take the final feature map.

from swin_transformer_v2 import SwinTransformerV2

from swin_transformer_v2 import swin_transformer_v2_t, swin_transformer_v2_s, swin_transformer_v2_b, \
    swin_transformer_v2_l, swin_transformer_v2_h, swin_transformer_v2_g

# SwinV2-T
swin_transformer: SwinTransformerV2 = swin_transformer_v2_t(in_channels=3,
                                                            window_size=8,
                                                            input_resolution=(256, 256),
                                                            sequential_self_attention=False,
                                                            use_checkpoint=False)

If you want to change the resolution and/or the window size for fine-tuning or inference, please use the update_resolution method.

# Change resolution and window size of the model
swin_transformer.update_resolution(new_window_size=16, new_input_resolution=(512, 512))
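
Whether a given combination of resolution and window size is usable depends on the feature-map size in every stage. The following is a minimal sketch, not part of this package, for checking a combination before calling update_resolution. It assumes a patch size of 4, four stages, a 2x downsampling between consecutive stages, and that the feature map of every stage must split evenly into windows. The helper names stage_resolutions and is_compatible are hypothetical.

```python
from typing import List, Tuple


def stage_resolutions(input_resolution: Tuple[int, int],
                      patch_size: int = 4,
                      number_of_stages: int = 4) -> List[Tuple[int, int]]:
    """Feature-map resolution per stage (assumed: patch embedding, then 2x downscale per later stage)."""
    height, width = input_resolution
    return [(height // (patch_size * 2 ** index), width // (patch_size * 2 ** index))
            for index in range(number_of_stages)]


def is_compatible(input_resolution: Tuple[int, int],
                  window_size: int,
                  patch_size: int = 4,
                  number_of_stages: int = 4) -> bool:
    """True if every stage's feature map splits evenly into windows (assumed requirement)."""
    return all(height % window_size == 0 and width % window_size == 0
               for height, width in stage_resolutions(input_resolution, patch_size, number_of_stages))


if __name__ == "__main__":
    # The two configurations used in this README
    print(is_compatible((256, 256), window_size=8))
    print(is_compatible((512, 512), window_size=16))
    # A non-square resolution whose last stage (16 x 20) does not split into 8 x 8 windows
    print(stage_resolutions((512, 640)), is_compatible((512, 640), window_size=8))
```

Under these assumptions, both configurations shown above pass the check, while (512, 640) with a window size of 8 does not, because the last stage would have a width of 20.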

If you want to use a custom configuration, you can use the SwinTransformerV2 class directly. Its constructor takes the following parameters.

Parameter	Description	Type
in_channels	Number of input channels	int
depth	Depth of the stage (number of layers)	int
downscale	If true input is downsampled (see Fig. 3 or V1 paper)	bool
input_resolution	Input resolution	Tuple[int, int]
number_of_heads	Number of attention heads to be utilized	int
window_size	Window size to be utilized	int
shift_size	Shifting size to be used	int
ff_feature_ratio	Ratio of the hidden dimension in the FFN to the input channels	int
dropout	Dropout in input mapping	float
dropout_attention	Dropout rate of attention map	float
dropout_path	Dropout in main path	float
use_checkpoint	If true checkpointing is utilized	bool
sequential_self_attention	If true sequential self-attention is performed	bool

This file includes a full example of how to use this implementation.

Disclaimer

This is a very experimental implementation based on the Swin Transformer V2 paper and the official implementation of the Swin Transformer V1. Since an official implementation of the Swin Transformer V2 has not yet been published, it is not possible to say to what extent this implementation differs from the original one. If you have any problems with this implementation, please raise an issue.

Reference

@article{Liu2021,
    title={{Swin Transformer V2: Scaling Up Capacity and Resolution}},
    author={Liu, Ze and Hu, Han and Lin, Yutong and Yao, Zhuliang and Xie, Zhenda and Wei, Yixuan and Ning, Jia and Cao, 
            Yue and Zhang, Zheng and Dong, Li and others},
    journal={arXiv preprint arXiv:2111.09883},
    year={2021}
}

Comments

problem about DeformableSwinTransformerBlock

When I feed input data of shape [1, 3, 240, 250], an error is raised at https://github.com/ChristophReich1996/Swin-Transformer-V2/blob/d1b89227ef0045c3ab667a4f2cdea9ec4f240236/swin_transformer_v2/model_parts.py#L574 . The error is listed below:

RuntimeError: The size of tensor a (48) must match the size of tensor b (256) at non-singleton dimension 2

It appears that self.default_grid.repeat_interleave(repeats=offsets.shape[0], dim=0) and offsets have different shapes. Have you run into the same problem? It would be a great help if you could help me solve it. Thank you!

opened by nullxjx 14

How do I get it to work at 512*640 resolution?

model.update_resolution(new_window_size=8, new_input_resolution=(512, 640)) raises:

RuntimeError: shape '[0, 2, 2, 768, 8, 8]' is invalid for input of size 196608

opened by WY-2022 4

Training and inference implementation of Swin V2 for the object detection task

Hi, I've worked on Swin Transformer v.1 earlier for object detection training and inference. Now I want to improve the result with Swin Transformer v.2. Is it available and is there any way to do that? Many thanks.

opened by queman 2

Can you load the pre-trained weights with window size 16?

I am using a modified version of the source code. When I load swinv2_tiny_patch4_window8_256.pth and use a window size of 8, the code runs normally. But when I load swinv2_tiny_patch4_window16_256.pth and use a window size of 16, loading the weights fails with shape mismatches. I do not know how to handle this; could you please help? The error is as follows:

RuntimeError: Error(s) in loading state_dict for Model:
size mismatch for model.7.blocks.0.attn.relative_coords_table: copying a param with shape torch.Size([1, 15, 15, 2]) from checkpoint, the shape in current model is torch.Size([1, 31, 31, 2]).
size mismatch for model.7.blocks.0.attn.relative_position_index: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([256, 256]).
size mismatch for model.7.blocks.1.attn.relative_coords_table: copying a param with shape torch.Size([1, 15, 15, 2]) from checkpoint, the shape in current model is torch.Size([1, 31, 31, 2]).
size mismatch for model.7.blocks.1.attn.relative_position_index: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([256, 256]).
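
The shapes in this error message follow a simple pattern in the window size W: relative_coords_table has shape (1, 2W-1, 2W-1, 2) and relative_position_index has shape (W*W, W*W). The small sketch below reproduces the numbers from the traceback; the function names are hypothetical. It suggests that the affected blocks were saved with an effective window size of 8, while the current model builds them with 16. One possible cause, stated here as an assumption and not verified against the checkpoint, is that the window is clamped to the feature-map size in stages whose resolution is smaller than the requested window.

```python
from typing import Tuple


def relative_position_buffer_shapes(window_size: int) -> Tuple[Tuple[int, ...], Tuple[int, ...]]:
    """Shapes of (relative_coords_table, relative_position_index) for a square window."""
    side = 2 * window_size - 1           # number of distinct relative offsets per axis
    tokens = window_size * window_size   # number of tokens inside one window
    return (1, side, side, 2), (tokens, tokens)


def clamped_window_size(window_size: int, feature_resolution: Tuple[int, int]) -> int:
    """Assumed behaviour: the window never exceeds the smaller side of the feature map."""
    return min(window_size, min(feature_resolution))


if __name__ == "__main__":
    print(relative_position_buffer_shapes(8))    # shapes reported for the checkpoint
    print(relative_position_buffer_shapes(16))   # shapes reported for the current model
    # An 8 x 8 feature map (assumed last stage at 256 x 256 input) with a requested window of 16
    print(clamped_window_size(16, (8, 8)))
```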

opened by LUO77123 1

About Checkpoints

Hi! I have another question. If I just install the package via pip and then write:

class SWIN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.num_classes = num_classes
        # self.pool = nn.MaxPool2d(2, 2)
        self.encoder: SwinTransformerV2 = swin_transformer_v2_t(in_channels=3,
                                                                window_size=8,
                                                                input_resolution=(1024, 1280),
                                                                sequential_self_attention=False,
                                                                use_checkpoint=True)
        self.p = self.encoder.patch_embedding
        self.encoder0 = self.encoder.stages[0]
        ... ...

How do I use the checkpoint now? And is there a pre-trained model for v2_base? (Also, when I run the code above, a weird warning appears: 'warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")')

opened by WY-2022 1

relative_coordinates_log

window_attention.relative_coordinates_log: copying a param with shape torch.Size([256, 2]) from checkpoint, the shape in current model is torch.Size([4096, 2]).

opened by yuangui0316 1

Problems encountered

Hello, I have run into a few small problems while using SwinV2 over the last few days and would like to ask for your help.

When my input size is small, such as 96*96 with window_size=8, the following error appears at https://github.com/ChristophReich1996/Swin-Transformer-V2/blob/cff3824f5d75dfd93553867efcf53562c24dd555/swin_transformer_v2/model_parts.py#L289 : RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

In the paper, sequential self-attention is used to save GPU memory. However, for large images, setting sequential_self_attention=True results in an out-of-memory (OOM) error, while sequential_self_attention=False does not.

When I update the code at https://github.com/ChristophReich1996/Swin-Transformer-V2/blob/cff3824f5d75dfd93553867efcf53562c24dd555/swin_transformer_v2/model_parts.py#L207 , GPU memory usage increases even further.

I am currently experimenting with applying SwinV2, an efficient and memory-saving network, to 3D data, so I pay particular attention to GPU memory usage.

opened by Breeze-Zero 1

Scaled cosine attention error

The scaled cosine attention part of the implementation in model_parts.py seems to be wrong:

attention_map: torch.Tensor = torch.einsum("bhqd, bhkd -> bhqk", query, key) \
                                      / torch.maximum(torch.norm(query, dim=-1, keepdim=True)
                                                      * torch.norm(key, dim=-1, keepdim=True),
                                                      torch.tensor(1e-06, device=query.device, dtype=query.dtype))

should be corrected as

attention_map: torch.Tensor = torch.einsum("bhqd, bhkd -> bhqk", query, key) \
                                      / torch.maximum(torch.norm(query, dim=-1, keepdim=True)
                                                      @ torch.norm(key, dim=-1, keepdim=True).transpose(-2, -1),
                                                      torch.tensor(1e-06, device=query.device, dtype=query.dtype))

since the equation normalizes the attention value of each query-key pair. The original code produces a norm tensor of shape (B, H, N, 1), while the norm matrix we actually need has shape (B, H, N, N).
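
To make the shape argument concrete, here is a dependency-free sketch in plain Python (no PyTorch) for a single head and a single window. The function names are hypothetical. pairwise_cosine divides each entry (i, j) by |q_i| * |k_j|, which is what the (N, N) norm matrix does. rowwise_scaled divides the whole row i by |q_i| * |k_i|, which mimics the (N, 1) broadcast. The learnable temperature and the relative position bias of the paper's scaled cosine attention are deliberately left out.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Vector) -> float:
    return math.sqrt(dot(a, a))


def pairwise_cosine(queries: List[Vector], keys: List[Vector], eps: float = 1e-6) -> List[List[float]]:
    """Entry (i, j) is q_i . k_j / max(|q_i| * |k_j|, eps): one norm product per query-key pair."""
    return [[dot(q, k) / max(norm(q) * norm(k), eps) for k in keys] for q in queries]


def rowwise_scaled(queries: List[Vector], keys: List[Vector], eps: float = 1e-6) -> List[List[float]]:
    """Entry (i, j) is q_i . k_j / max(|q_i| * |k_i|, eps): the same divisor for a whole row."""
    return [[dot(q, k) / max(norm(q) * norm(keys[i]), eps) for k in keys]
            for i, q in enumerate(queries)]


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 2.0]]
    keys = [[3.0, 0.0], [1.0, 1.0]]
    print(pairwise_cosine(queries, keys))  # every entry is a true cosine similarity
    print(rowwise_scaled(queries, keys))   # off-diagonal entries use the wrong key norm
```

The two results agree on the diagonal and differ wherever the keys have different norms, which is the discrepancy described above.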

opened by YeolJ00 1

Thank you very much for your code, it is concise and easy to read, and I would like to ask you a few more questions
Thank you very much for your code, it is concise and easy to read, and I would like to ask you a few more questions:

Since the output of swin-v2 is a feature map, if I want to use it for semantic segmentation, can I directly upsample to get the final result?

If I want to change the dataset, what do I need to do?

opened by lizhenye2017 0

Owner

Christoph Reich

Autonomous systems and electrical engineering student @ Technical University of Darmstadt

GitHub https://arxiv.org/pdf/2111.09883.pdf
