A curated list and survey of awesome Vision Transformers.

Open the mind-map source file with any mind-mapping software, or download the high-resolution mind-map images if you only want to browse them.

Survey
Papers

Survey

Only typical algorithms are listed in each category.

Image Classification

Chinese Blogs

Vision Transformer Must-Read Series, Image Classification Survey (Part 1): Overview

Attention-based

Training Strategy

[DeiT] Training data-efficient image transformers & distillation through attention (ICML 2021-2020.12) [Paper]
[Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]

Model Improvements

Tokenization Module

Image to Token:

Non-overlapping Patch Embedding
- [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
- [TNT] Transformer in Transformer (NeurIPS 2021-2021.3) [Paper]
- [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
Overlapping Patch Embedding
- [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]
- [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
- [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
- [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
- [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]
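
The difference between the two patch-embedding categories above can be sketched with the standard convolution output-size formula. This is a minimal illustration, not code from any of the listed papers: `num_tokens` is a hypothetical helper, and the kernel/stride/padding values in the overlapping case are assumed for the example.

```python
def num_tokens(image_size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Number of tokens produced by a square patch-embedding convolution."""
    per_side = (image_size + 2 * padding - kernel) // stride + 1
    return per_side * per_side


# Non-overlapping: stride equals the patch (kernel) size, so patches tile the image.
non_overlapping = num_tokens(224, kernel=16, stride=16)

# Overlapping: stride is smaller than the kernel, so neighbouring patches share pixels.
# The 7/4/3 setting is an assumed, illustrative choice.
overlapping = num_tokens(224, kernel=7, stride=4, padding=3)

print(non_overlapping, overlapping)  # 196 3136
```

With a stride smaller than the kernel, adjacent tokens see shared pixels, which is what the "overlapping" label refers to.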

Token to Token:

Fixed sampling window tokenization
- [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
- [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
Dynamic sampling tokenization
- [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]
- [TokenLearner] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? (2021.6) [Paper]

Position Encoding Module

Explicit position encoding:

Absolute position encoding
- [Transformer] Attention is All You Need (NIPS 2017-2017.06) [Paper]
- [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
- [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
Relative position encoding
- [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
- [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
- [Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

Implicit position encoding:

[CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
[CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
[PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
[ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
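
As a concrete instance of explicit absolute position encoding, here is a minimal NumPy sketch of the fixed sinusoidal encoding from "Attention is All You Need". The function name and the toy shapes are assumptions made for illustration. ViT instead learns its absolute position embeddings, and the implicit methods listed above do not add an explicit table like this at all.

```python
import numpy as np


def sinusoidal_position_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Fixed sinusoidal encoding; returns an array of shape (num_positions, dim)."""
    assert dim % 2 == 0, "this sketch assumes an even embedding dimension"
    positions = np.arange(num_positions)[:, None]          # (N, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))  # (dim / 2,)
    angles = positions * freqs                             # (N, dim / 2)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)  # even channels
    pe[:, 1::2] = np.cos(angles)  # odd channels
    return pe


# Absolute encodings are simply added to the token embeddings (zeros here as a stand-in).
tokens = np.zeros((196, 64))
tokens_with_position = tokens + sinusoidal_position_encoding(196, 64)
print(tokens_with_position.shape)  # (196, 64)
```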

Attention Module

Include only global attention:

Multi-Head attention module
- [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
Reduce global attention computation
- [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
- [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
- [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
- [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]
- [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
- [MViT] Multiscale Vision Transformers (2021.4) [Paper]
- [Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
Generalized linear attention
- [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

Introduce extra local attention:

Local window mode
- [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
- [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
- [Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
- [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
- [GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]
- [Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]
- [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]
- [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
Introduce convolutional local inductive bias
- [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
- [ELSA] ELSA: Enhanced Local Self-Attention for Vision Transformer (2021.12) [Paper]
Sparse attention
- [Sparse Transformer] Sparse Transformer: Concentrated Attention Through Explicit Selection [Paper]
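
The "local window mode" above restricts attention to tokens inside the same window. A minimal sketch of the window partition step, assuming an `(H, W, C)` feature map whose sides are divisible by the window size (`window_partition` is an illustrative helper, not any paper's reference code):

```python
import numpy as np


def window_partition(x: np.ndarray, window_size: int) -> np.ndarray:
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    """
    h, w, c = x.shape
    assert h % window_size == 0 and w % window_size == 0
    x = x.reshape(h // window_size, window_size, w // window_size, window_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two window-index axes together
    return x.reshape(-1, window_size * window_size, c)


feature_map = np.arange(8 * 8).reshape(8, 8, 1)
windows = window_partition(feature_map, window_size=4)
print(windows.shape)  # (4, 16, 1)
```

In this toy case global attention compares 64 x 64 = 4096 token pairs, while attention within the four 16-token windows compares 4 x 16 x 16 = 1024, which is the saving that motivates the local designs.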

FFN Module

Improve performance with Conv's local information extraction capability:

[LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]
[CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]

Normalization Module Location

Pre Normalization
- [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
- [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
Post Normalization
- [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
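
The two placements differ only in where normalization sits relative to the residual connection. The sketch below is a simplified illustration with a parameter-free layer norm and an arbitrary `sublayer` callable standing in for attention or the FFN; all names are assumptions. The third function shows a residual variant of post-normalization, which is how Swin Transformer V2 describes its block, to the best of our reading.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Layer norm over the last axis, without learnable scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def pre_norm_block(x, sublayer):
    # Normalize the input of the sublayer; the residual path stays untouched.
    return x + sublayer(layer_norm(x))


def post_norm_block(x, sublayer):
    # Classic post-norm: normalize after the residual addition.
    return layer_norm(x + sublayer(x))


def res_post_norm_block(x, sublayer):
    # Residual variant: normalize the sublayer output before adding it back.
    return x + layer_norm(sublayer(x))


x = np.array([[1.0, 2.0, 3.0, 4.0]])
double = lambda t: 2.0 * t  # stand-in for attention / FFN
print(pre_norm_block(x, double).mean(), post_norm_block(x, double).mean())
```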

Classification Prediction Head Module

Class Tokens
- [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
- [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
Average Pooling
- [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
- [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
- [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
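
The two head designs differ in which feature is fed to the classifier. A minimal sketch with random stand-in values (the shapes, the `head` matrix and the variable names are all assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim, num_classes = 196, 64, 10
head = rng.normal(size=(dim, num_classes))  # stand-in for a trained linear classifier

# Class-token head: a class token is prepended to the patch tokens,
# and only its final output is classified.
tokens_with_cls = rng.normal(size=(1 + num_patches, dim))
logits_cls = tokens_with_cls[0] @ head

# Average-pooling head: no extra token; all patch outputs are averaged first.
patch_tokens = rng.normal(size=(num_patches, dim))
logits_avg = patch_tokens.mean(axis=0) @ head

print(logits_cls.shape, logits_avg.shape)  # (10,) (10,)
```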

Others

(1) How to output multi-scale feature maps

Patch merging
- [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
- [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
- [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
- [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
- [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
- [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
Pooling attention
- [MViT] Multiscale Vision Transformers (2021.4) [Paper]
- [Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
Dilated convolution
- [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
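
As one way to picture patch merging, the sketch below follows the Swin-style scheme of concatenating each 2x2 neighbourhood and projecting it with a linear layer. It is a simplified illustration: normalization is omitted, the random `weight` stands in for a learned projection, and other papers in this category downsample differently (for example with a strided patch embedding).

```python
import numpy as np


def patch_merging(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """(H, W, C) -> (H/2, W/2, C_out): concatenate each 2x2 neighbourhood, then project."""
    h, w, c = x.shape
    assert h % 2 == 0 and w % 2 == 0
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1
    )  # (H/2, W/2, 4C)
    return merged @ weight  # weight has shape (4C, C_out)


rng = np.random.default_rng(0)
stage1 = rng.normal(size=(56, 56, 96))
stage2 = patch_merging(stage1, rng.normal(size=(4 * 96, 2 * 96)))
print(stage1.shape, stage2.shape)  # (56, 56, 96) (28, 28, 192)
```

Each merge halves the spatial resolution and, in this configuration, doubles the channel count, which yields the pyramid of feature maps that dense prediction heads expect.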

(2) How to train a deeper Transformer

[CaiT] Going deeper with Image Transformers (2021.3) [Paper]
[DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]

MLP-based

[MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]
[ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]
[gMLP] Pay Attention to MLPs (2021.5) [Paper]
[CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]

ConvMixer-based

[ConvMixer] Patches Are All You Need [Paper]

General Architecture Analysis

Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]
A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]
[MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
[ConvNeXt] A ConvNet for the 2020s (2022.01) [Paper]

Others

Object Detection

Semantic Segmentation

Papers

Transformer Original Paper

[Transformer] Attention is All You Need (NIPS 2017-2017.06) [Paper]

ViT Original Paper

[ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]

Image Classification

2020

[DeiT] Training data-efficient image transformers & distillation through attention (ICML 2021-2020.12) [Paper]
[Sparse Transformer] Sparse Transformer: Concentrated Attention Through Explicit Selection [Paper]

2021

[T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]
[PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
[CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
[TNT] Transformer in Transformer (NeurIPS 2021-2021.3) [Paper]
[CaiT] Going deeper with Image Transformers (2021.3) [Paper]
[DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]
[Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
[CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
[LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]
[MViT] Multiscale Vision Transformers (2021.4) [Paper]
[Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
[Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]
[ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
[MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]
[ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]
[gMLP] Pay Attention to MLPs (2021.5) [Paper]
[MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]
[PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
[TokenLearner] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? (2021.6) [Paper]
Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]
[P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]
[GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]
[Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]
[ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
[CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]
[CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
[PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]
A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]
[Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
[MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
[Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
[ELSA] ELSA: Enhanced Local Self-Attention for Vision Transformer (2021.12) [Paper]
[ConvMixer] Patches Are All You Need [Paper]

2022

[ConvNeXt] A ConvNet for the 2020s (2022.01) [Paper]

Object Detection

Semantic Segmentation

Stay tuned; PRs are welcome!

Owner

OpenMMLab
