Transformer in Computer Vision

Overview


A curated list of recent Transformer-based computer vision papers. If you notice a missing paper, please open an issue or a pull request.

**Last updated: 2022/01/20**

Update log

2021/April through 2021/December - monthly updates of recent Transformer-in-Vision papers.

Survey:

  • (arXiv 2022.01) Video Transformers: A Survey. [Paper]

  • (arXiv 2021.11) A Survey of Visual Transformers. [Paper]

  • (arXiv 2021.09) Survey: Transformer based Video-Language Pre-training. [Paper]

  • (arXiv 2021.03) Multi-modal Motion Prediction with Stacked Transformers. [Paper], [Code]

  • (arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision. [Paper]

  • (arXiv 2021.01) Transformers in Vision: A Survey. [Paper]

  • (arXiv 2020.09) Efficient Transformers: A Survey. [Paper]

Recent Papers

Action

  • (CVPR'20) Speech2Action: Cross-modal Supervision for Action Recognition, [Paper]
  • (arXiv 2021.01) Trear: Transformer-based RGB-D Egocentric Action Recognition, [Paper]
  • (arXiv 2021.02) Relaxed Transformer Decoders for Direct Action Proposal Generation, [Paper], [Code]
  • (arXiv 2021.04) TubeR: Tube-Transformer for Action Detection, [Paper]
  • (arXiv 2021.04) Few-Shot Transformation of Common Actions into Time and Space, [Paper]
  • (arXiv 2021.05) Temporal Action Proposal Generation with Transformers, [Paper]
  • (arXiv 2021.06) End-to-end Temporal Action Detection with Transformer, [Paper], [Code]
  • (arXiv 2021.06) OadTR: Online Action Detection with Transformers, [Paper], [Code]
  • (arXiv 2021.07) Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition, [Paper]
  • (arXiv 2021.07) VideoLightFormer: Lightweight Action Recognition using Transformers, [Paper]
  • (arXiv 2021.07) Long Short-Term Transformer for Online Action Detection, [Paper]
  • (arXiv 2021.07) STAR: Sparse Transformer-based Action Recognition, [Paper], [Code]
  • (arXiv 2021.08) Shifted Chunk Transformer for Spatio-Temporal Representational Learning, [Paper]
  • (arXiv 2021.08) GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer, [Paper], [Code]
  • (arXiv 2021.09) GCsT: Graph Convolutional Skeleton Transformer for Action Recognition, [Paper], [Code]
  • (arXiv 2021.10) Lightweight Transformer in Federated Setting for Human Activity Recognition, [Paper]
  • (arXiv 2021.10) ASFormer: Transformer for Action Segmentation, [Paper], [Code]
  • (arXiv 2021.10) Few-Shot Temporal Action Localization with Query Adaptive Transformer, [Paper], [Code]
  • (arXiv 2021.10) IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action Recognition, [Paper], [Code]
  • (arXiv 2021.11) Evaluating Transformers for Lightweight Action Recognition, [Paper]
  • (arXiv 2021.12) MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection, [Paper]
  • (arXiv 2021.12) Co-training Transformer with Videos and Images Improves Action Recognition, [Paper]
  • (arXiv 2021.12) Temporal Transformer Networks with Self-Supervision for Action Recognition, [Paper]
  • (arXiv 2022.01) Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition, [Paper], [Code]
  • (arXiv 2022.01) Transformers in Action: Weakly Supervised Action Segmentation, [Paper]

Active Learning

  • (arXiv 2021.06) Visual Transformer for Task-aware Active Learning, [Paper], [Code]

Anomaly Detection

  • (arXiv 2021.04) VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization, [Paper]
  • (arXiv 2021.04) Inpainting Transformer for Anomaly Detection, [Paper]

Assessment

  • (arXiv 2021.01) Transformer for Image Quality Assessment, [Paper], [Code]
  • (arXiv 2021.04) Perceptual Image Quality Assessment with Transformers, [Paper], [Code]
  • (arXiv 2021.08) No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency, [Paper], [Code]
  • (arXiv 2021.08) MUSIQ: Multi-scale Image Quality Transformer, [Paper], [Code]
  • (arXiv 2021.10) VTAMIQ: Transformers for Attention Modulated Image Quality Assessment, [Paper]
  • (arXiv 2021.12) Learning Transformer Features for Image Quality Assessment, [Paper]

Captioning

  • (arXiv 2021.01) CPTR: Full Transformer Network for Image Captioning, [Paper]
  • (arXiv 2021.01) Dual-Level Collaborative Transformer for Image Captioning, [Paper]
  • (arXiv 2021.02) VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining, [Paper], [Code]
  • (arXiv 2021.06) Semi-Autoregressive Transformer for Image Captioning, [Paper], [Code]
  • (arXiv 2021.08) Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers, [Paper]
  • (arXiv 2021.08) Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning, [Paper], [Code]
  • (arXiv 2021.09) Bornon: Bengali Image Captioning with Transformer-based Deep learning approach, [Paper]
  • (arXiv 2021.09) Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning, [Paper], [Code]
  • (arXiv 2021.09) Geometry-Entangled Visual Semantic Transformer for Image Captioning, [Paper]
  • (arXiv 2021.10) Geometry Attention Transformer with Position-aware LSTMs for Image Captioning, [Paper]
  • (arXiv 2021.10) Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network, [Paper]
  • (arXiv 2021.11) SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning, [Paper]
  • (arXiv 2021.12) Injecting Semantic Concepts into End-to-End Image Captioning, [Paper]
  • (arXiv 2022.01) Compact Bidirectional Transformer for Image Captioning, [Paper], [Code]

Classification (Backbone)

  • (ICLR'21) LambdaNetworks: Modeling Long-Range Interactions without Attention, [Paper], [Code]
  • (ECCV'20) Feature Pyramid Transformer, [Paper], [Code]
  • (ICLR'21) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, [Paper], [Code]
  • (arXiv 2020.06) Visual Transformers: Token-based Image Representation and Processing for Computer Vision, [Paper]
  • (arXiv 2020.11) General Multi-label Image Classification with Transformers, [Paper]
  • (arXiv 2020.12) Training data-efficient image transformers & distillation through attention, [Paper], [Code]
  • (arXiv 2021.01) Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, [Paper], [Code]
  • (arXiv 2021.01) Bottleneck Transformers for Visual Recognition, [Paper], [Code]
  • (arXiv 2021.02) Conditional Positional Encodings for Vision Transformers, [Paper], [Code]
  • (arXiv 2021.02) Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, [Paper], [Code]
  • (arXiv 2021.03) Transformer in Transformer, [Paper], [Code]
  • (arXiv 2021.03) ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases, [Paper], [Code]
  • (arXiv 2021.03) Scalable Visual Transformers with Hierarchical Pooling, [Paper]
  • (arXiv 2021.03) Incorporating Convolution Designs into Visual Transformers, [Paper]
  • (arXiv 2021.03) DeepViT: Towards Deeper Vision Transformer, [Paper], [Code]
  • (arXiv 2021.03) Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, [Paper], [Code]
  • (arXiv 2021.03) Understanding Robustness of Transformers for Image Classification, [Paper]
  • (arXiv 2021.03) Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding, [Paper]
  • (arXiv 2021.03) CvT: Introducing Convolutions to Vision Transformers, [Paper], [Code]
  • (arXiv 2021.03) Rethinking Spatial Dimensions of Vision Transformers, [Paper], [Code]
  • (arXiv 2021.03) Going deeper with Image Transformers, [Paper]
  • (arXiv 2021.04) LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference, [Paper]
  • (arXiv 2021.04) On the Robustness of Vision Transformers to Adversarial Examples, [Paper]
  • (arXiv 2021.04) LocalViT: Bringing Locality to Vision Transformers, [Paper], [Code]
  • (arXiv 2021.04) Escaping the Big Data Paradigm with Compact Transformers, [Paper], [Code]
  • (arXiv 2021.04) Co-Scale Conv-Attentional Image Transformers, [Paper], [Code]
  • (arXiv 2021.04) Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet, [Paper], [Code]
  • (arXiv 2021.04) So-ViT: Mind Visual Tokens for Vision Transformer, [Paper]
  • (arXiv 2021.04) Multiscale Vision Transformers, [Paper], [Code]
  • (arXiv 2021.04) Visformer: The Vision-friendly Transformer, [Paper], [Code]
  • (arXiv 2021.04) Improve Vision Transformers Training by Suppressing Over-smoothing, [Paper], [Code]
  • (arXiv 2021.04) Twins: Revisiting the Design of Spatial Attention in Vision Transformers, [Paper], [Code]
  • (arXiv 2021.04) ConTNet: Why not use convolution and transformer at the same time, [Paper], [Code]
  • (arXiv 2021.05) Rethinking the Design Principles of Robust Vision Transformer, [Paper], [Code]
  • (arXiv 2021.05) Vision Transformers are Robust Learners, [Paper], [Code]
  • (arXiv 2021.05) Rethinking Skip Connection with Layer Normalization in Transformers and ResNets, [Paper], [Code]
  • (arXiv 2021.05) Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead, [Paper]
  • (arXiv 2021.05) Intriguing Properties of Vision Transformers, [Paper], [Code]
  • (arXiv 2021.05) Aggregating Nested Transformers, [Paper]
  • (arXiv 2021.05) ResT: An Efficient Transformer for Visual Recognition, [Paper], [Code]
  • (arXiv 2021.06) DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification, [Paper], [Code]
  • (arXiv 2021.06) When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations, [Paper]
  • (arXiv 2021.06) Container: Context Aggregation Network, [Paper]
  • (arXiv 2021.06) TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification, [Paper]
  • (arXiv 2021.06) KVT: k-NN Attention for Boosting Vision Transformers, [Paper]
  • (arXiv 2021.06) MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens, [Paper], [Code]
  • (arXiv 2021.06) Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length, [Paper]
  • (arXiv 2021.06) Less is More: Pay Less Attention in Vision Transformers, [Paper]
  • (arXiv 2021.06) FoveaTer: Foveated Transformer for Image Classification, [Paper]
  • (arXiv 2021.06) An Attention Free Transformer, [Paper]
  • (arXiv 2021.06) Glance-and-Gaze Vision Transformer, [Paper], [Code]
  • (arXiv 2021.06) RegionViT: Regional-to-Local Attention for Vision Transformers, [Paper]
  • (arXiv 2021.06) Chasing Sparsity in Vision Transformers: An End-to-End Exploration, [Paper], [Code]
  • (arXiv 2021.06) Scaling Vision Transformers, [Paper]
  • (arXiv 2021.06) CAT: Cross Attention in Vision Transformer, [Paper], [Code]
  • (arXiv 2021.06) On Improving Adversarial Transferability of Vision Transformers, [Paper], [Code]
  • (arXiv 2021.06) Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight, [Paper]
  • (arXiv 2021.06) Patch Slimming for Efficient Vision Transformers, [Paper]
  • (arXiv 2021.06) Transformer in Convolutional Neural Networks, [Paper], [Code]
  • (arXiv 2021.06) ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias, [Paper], [Code]
  • (arXiv 2021.06) Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, [Paper]
  • (arXiv 2021.06) Refiner: Refining Self-attention for Vision Transformers, [Paper]
  • (arXiv 2021.06) Reveal of Vision Transformers Robustness against Adversarial Attacks, [Paper]
  • (arXiv 2021.06) Efficient Training of Visual Transformers with Small-Size Datasets, [Paper]
  • (arXiv 2021.06) MlTr: Multi-label Classification with Transformer, [Paper], [Code]
  • (arXiv 2021.06) Delving Deep into the Generalization of Vision Transformers under Distribution Shifts, [Paper]
  • (arXiv 2021.06) BEIT: BERT Pre-Training of Image Transformers, [Paper], [Code]
  • (arXiv 2021.06) XCiT: Cross-Covariance Image Transformers, [Paper], [Code]
  • (arXiv 2021.06) How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers, [Paper], [Code1], [Code2]
  • (arXiv 2021.06) Exploring Vision Transformers for Fine-grained Classification, [Paper], [Code]
  • (arXiv 2021.06) TokenLearner: What Can 8 Learned Tokens Do for Images and Videos, [Paper]
  • (arXiv 2021.06) Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers, [Paper], [Code]
  • (arXiv 2021.06) VOLO: Vision Outlooker for Visual Recognition, [Paper], [Code]
  • (arXiv 2021.06) IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers, [Paper], [Project]
  • (arXiv 2021.06) PVTv2: Improved Baselines with Pyramid Vision Transformer, [Paper], [Code]
  • (arXiv 2021.06) Early Convolutions Help Transformers See Better, [Paper]
  • (arXiv 2021.06) Multi-Exit Vision Transformer for Dynamic Inference, [Paper]
  • (arXiv 2021.07) Augmented Shortcuts for Vision Transformers, [Paper]
  • (arXiv 2021.07) Improving the Efficiency of Transformers for Resource-Constrained Devices, [Paper]
  • (arXiv 2021.07) CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, [Paper], [Code]
  • (arXiv 2021.07) Focal Self-attention for Local-Global Interactions in Vision Transformers, [Paper]
  • (arXiv 2021.07) Cross-view Geo-localization with Evolving Transformer, [Paper]
  • (arXiv 2021.07) What Makes for Hierarchical Vision Transformer?, [Paper]
  • (arXiv 2021.07) Efficient Vision Transformers via Fine-Grained Manifold Distillation, [Paper]
  • (arXiv 2021.07) Vision Xformers: Efficient Attention for Image Classification, [Paper]
  • (arXiv 2021.07) Long-Short Transformer: Efficient Transformers for Language and Vision, [Paper]
  • (arXiv 2021.07) Feature Fusion Vision Transformer for Fine-Grained Visual Categorization, [Paper]
  • (arXiv 2021.07) Local-to-Global Self-Attention in Vision Transformers, [Paper], [Code]
  • (arXiv 2021.07) Visual Parser: Representing Part-whole Hierarchies with Transformers, [Paper], [Code]
  • (arXiv 2021.07) CMT: Convolutional Neural Networks Meet Vision Transformers, [Paper]
  • (arXiv 2021.07) Combiner: Full Attention Transformer with Sparse Computation Cost, [Paper]
  • (arXiv 2021.07) A Comparison of Deep Learning Classification Methods on Small-scale Image Data set: from Convolutional Neural Networks to Visual Transformers, [Paper]
  • (arXiv 2021.07) Query2Label: A Simple Transformer Way to Multi-Label Classification, [Paper], [Code]
  • (arXiv 2021.07) Contextual Transformer Networks for Visual Recognition, [Paper], [Code]
  • (arXiv 2021.07) Rethinking and Improving Relative Position Encoding for Vision Transformer, [Paper], [Code]
  • (arXiv 2021.08) CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention, [Paper], [Code]
  • (arXiv 2021.08) Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer, [Paper]
  • (arXiv 2021.08) Vision Transformer with Progressive Sampling, [Paper], [Code]
  • (arXiv 2021.08) Armour: Generalizable Compact Self-Attention for Vision Transformers, [Paper]
  • (arXiv 2021.08) ConvNets vs. Transformers: Whose Visual Representations are More Transferable?, [Paper]
  • (arXiv 2021.08) Mobile-Former: Bridging MobileNet and Transformer, [Paper]
  • (arXiv 2021.08) Do Vision Transformers See Like Convolutional Neural Networks?, [Paper]
  • (arXiv 2021.08) Exploring and Improving Mobile Level Vision Transformers, [Paper]
  • (arXiv 2021.08) A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP, [Paper]
  • (arXiv 2021.08) Scaled ReLU Matters for Training Vision Transformers, [Paper]
  • (arXiv 2021.09) Towards Transferable Adversarial Attacks on Vision Transformers, [Paper]
  • (arXiv 2021.09) DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers, [Paper], [Code]
  • (arXiv 2021.09) Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, [Paper]
  • (arXiv 2021.09) Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models, [Paper]
  • (arXiv 2021.09) UFO-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]
  • (arXiv 2021.10) MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer, [Paper]
  • (arXiv 2021.10) Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs, [Paper], [Code]
  • (arXiv 2021.10) Token Pooling in Visual Transformers, [Paper]
  • (arXiv 2021.10) NViT: Vision Transformer Compression and Parameter Redistribution, [Paper]
  • (arXiv 2021.10) Adversarial Token Attacks on Vision Transformers, [Paper]
  • (arXiv 2021.10) Certified Patch Robustness via Smoothed Vision Transformers, [Paper], [Code]
  • (arXiv 2021.10) Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation, [Paper]
  • (arXiv 2021.10) Bilateral-ViT for Robust Fovea Localization, [Paper]
  • (arXiv 2021.10) SOFT: Softmax-free Transformer with Linear Complexity, [Paper], [Code]
  • (arXiv 2021.10) Blending Anti-Aliasing into Vision Transformer, [Paper], [Code]
  • (arXiv 2021.11) Can Vision Transformers Perform Convolution?, [Paper]
  • (arXiv 2021.11) Sliced Recursive Transformer, [Paper], [Code]
  • (arXiv 2021.11) Hybrid BYOL-ViT: Efficient approach to deal with small Datasets, [Paper]
  • (arXiv 2021.11) Are Transformers More Robust Than CNNs?, [Paper], [Code]
  • (arXiv 2021.11) iBOT: Image BERT Pre-Training with Online Tokenizer, [Paper]
  • (arXiv 2021.11) Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding, [Paper]
  • (arXiv 2021.11) TransMix: Attend to Mix for Vision Transformers, [Paper], [Code]
  • (arXiv 2021.11) Swin Transformer V2: Scaling Up Capacity and Resolution, [Paper], [Code]
  • (arXiv 2021.11) Are Vision Transformers Robust to Patch Perturbations?, [Paper]
  • (arXiv 2021.11) Discrete Representations Strengthen Vision Transformer Robustness, [Paper]
  • (arXiv 2021.11) Zero-Shot Certified Defense against Adversarial Patches with Vision Transformers, [Paper]
  • (arXiv 2021.11) MetaFormer is Actually What You Need for Vision, [Paper], [Code]
  • (arXiv 2021.11) DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion, [Paper], [Code]
  • (arXiv 2021.11) Mesa: A Memory-saving Training Framework for Transformers, [Paper], [Code]
  • (arXiv 2021.11) Semi-Supervised Vision Transformers, [Paper]
  • (arXiv 2021.11) DBIA: Data-free Backdoor Injection Attack against Transformer Networks, [Paper], [Code]
  • (arXiv 2021.11) Self-slimmed Vision Transformer, [Paper]
  • (arXiv 2021.11) PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers, [Paper], [Code]
  • (arXiv 2021.11) SWAT: Spatial Structure Within and Among Tokens, [Paper]
  • (arXiv 2021.11) NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition, [Paper], [Code]
  • (arXiv 2021.11) Global Interaction Modelling in Vision Transformer via Super Tokens, [Paper]
  • (arXiv 2021.11) ATS: Adaptive Token Sampling For Efficient Vision Transformers, [Paper]
  • (arXiv 2021.11) Pyramid Adversarial Training Improves ViT Performance, [Paper]
  • (arXiv 2021.12) Improved Multiscale Vision Transformers for Classification and Detection, [Paper]
  • (arXiv 2021.12) Make A Long Image Short: Adaptive Token Length for Vision Transformers, [Paper]
  • (arXiv 2021.12) Dynamic Token Normalization Improves Vision Transformer, [Paper], [Code]
  • (arXiv 2021.12) Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training, [Paper]
  • (arXiv 2021.12) Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal, [Paper], [Code]
  • (arXiv 2021.12) Visual Transformers with Primal Object Queries for Multi-Label Image Classification, [Paper]
  • (arXiv 2021.12) Couplformer: Rethinking Vision Transformer with Coupling Attention Map, [Paper]
  • (arXiv 2021.12) AdaViT: Adaptive Tokens for Efficient Vision Transformer, [Paper]
  • (arXiv 2021.12) Lite Vision Transformer with Enhanced Self-Attention, [Paper], [Code]
  • (arXiv 2021.12) Learned Queries for Efficient Local Attention, [Paper], [Code]
  • (arXiv 2021.12) MPViT: Multi-Path Vision Transformer for Dense Prediction, [Paper], [Code]
  • (arXiv 2021.12) MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation, [Paper]
  • (arXiv 2021.12) ELSA: Enhanced Local Self-Attention for Vision Transformer, [Paper], [Code]
  • (arXiv 2021.12) SimViT: Exploring a Simple Vision Transformer with sliding windows, [Paper], [Code]
  • (arXiv 2021.12) Vision Transformer for Small-Size Datasets, [Paper]
  • (arXiv 2021.12) ViR: the Vision Reservoir, [Paper]
  • (arXiv 2021.12) Augmenting Convolutional networks with attention-based aggregation, [Paper]
  • (arXiv 2021.12) Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention, [Paper], [Code]
  • (arXiv 2021.12) SPViT: Enabling Faster Vision Transformers via Soft Token Pruning, [Paper]
  • (arXiv 2021.12) Stochastic Layers in Vision Transformers, [Paper]
  • (arXiv 2022.01) Vision Transformer with Deformable Attention, [Paper], [Code]
  • (arXiv 2022.01) PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture, [Paper], [Code]
  • (arXiv 2022.01) QuadTree Attention for Vision Transformers, [Paper], [Code]
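
Most of the backbones above build on the recipe from "An Image is Worth 16x16 Words": split an image into patches, linearly project each patch into a token, and mix the tokens with scaled dot-product self-attention. The sketch below is a minimal, dependency-free illustration of that pipeline; the tiny sizes and random weights are toy choices for this list, not any paper's implementation.

```python
# Toy ViT-style pipeline: patchify -> linear embedding -> one attention head.
# All dimensions and weights are illustrative only.
import math
import random

random.seed(0)

def patchify(image, patch):
    """Split an H x W image (list of lists) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    patches = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            patches.append([image[i + di][j + dj]
                            for di in range(patch) for dj in range(patch)])
    return patches

def linear(x, weight):
    """Project vector x (len in_dim) with weight of shape (in_dim, out_dim)."""
    return [sum(x[i] * weight[i][o] for i in range(len(x)))
            for o in range(len(weight[0]))]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product attention over a token sequence."""
    q = [linear(t, wq) for t in tokens]
    k = [linear(t, wk) for t in tokens]
    v = [linear(t, wv) for t in tokens]
    d = len(q[0])
    out = []
    for qi in q:
        scores = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                          for kj in k])
        out.append([sum(s * vj[o] for s, vj in zip(scores, v))
                    for o in range(d)])
    return out

# 8x8 "image", 4x4 patches -> 4 patch tokens of dim 16, embedded to dim 8.
image = [[random.random() for _ in range(8)] for _ in range(8)]
w_embed = [[random.gauss(0, 0.1) for _ in range(8)] for _ in range(16)]
tokens = [linear(p, w_embed) for p in patchify(image, 4)]
wq, wk, wv = ([[random.gauss(0, 0.1) for _ in range(8)] for _ in range(8)]
              for _ in range(3))
mixed = self_attention(tokens, wq, wk, wv)
print(len(mixed), len(mixed[0]))  # → 4 8
```

Real backbones add multi-head attention, positional embeddings, MLP blocks, and normalization on top of this; papers in the list such as Swin and PVT further restrict attention to shifted windows or pyramid stages for efficiency.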

Completion

  • (arXiv 2021.03) High-Fidelity Pluralistic Image Completion with Transformers, [Paper], [Code]
  • (arXiv 2021.04) TFill: Image Completion via a Transformer-Based Architecture, [Paper], [Code]

Compression

  • (arXiv 2021.11) Transformer-based Image Compression, [Paper]
  • (arXiv 2021.12) Towards End-to-End Image Compression and Analysis with Transformers, [Paper], [Code]
  • (arXiv 2021.12) CSformer: Bridging Convolution and Transformer for Compressive Sensing, [Paper]
  • (arXiv 2022.01) Multi-Dimensional Model Compression of Vision Transformer, [Paper]

Crowd Counting

  • (arXiv 2021.04) TransCrowd: Weakly-Supervised Crowd Counting with Transformer, [Paper], [Code]
  • (arXiv 2021.05) Boosting Crowd Counting with Transformers, [Paper], [Code]
  • (arXiv 2021.08) Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer, [Paper]
  • (arXiv 2021.09) Audio-Visual Transformer Based Crowd Counting, [Paper], [Code]
  • (arXiv 2021.09) CCTrans: Simplifying and Improving Crowd Counting with Transformer, [Paper]
  • (arXiv 2022.01) Scene-Adaptive Attention Network for Crowd Counting, [Paper]

Depth

  • (arXiv 2020.11) Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]
  • (arXiv 2021.03) Vision Transformers for Dense Prediction, [Paper], [Code]
  • (arXiv 2021.03) Transformers Solve the Limited Receptive Field for Monocular Depth Prediction, [Paper], [Code]
  • (arXiv 2021.09) Improving 360 Monocular Depth Estimation via Non-local Dense Prediction Transformer and Joint Supervised and Self-supervised Learning, [Paper]

Deepfake Detection

  • (arXiv 2021.02) Deepfake Video Detection Using Convolutional Vision Transformer, [Paper]
  • (arXiv 2021.04) Deepfake Detection Scheme Based on Vision Transformer and Distillation, [Paper]
  • (arXiv 2021.04) M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection, [Paper]
  • (arXiv 2021.07) Combining EfficientNet and Vision Transformers for Video Deepfake Detection, [Paper]
  • (arXiv 2021.08) Video Transformer for Deepfake Detection with Incremental Learning, [Paper]

Dehazing

  • (arXiv 2021.09) Hybrid Local-Global Transformer for Image Dehazing, [Paper]

Detection

  • (ECCV'20) DETR: End-to-End Object Detection with Transformers, [Paper], [Code]
  • (ICLR'21) Deformable DETR: Deformable Transformers for End-to-End Object Detection, [Paper], [Code]
  • (CVPR'21) UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, [Paper], [Code]
  • (arXiv 2020.11) End-to-End Object Detection with Adaptive Clustering Transformer, [Paper]
  • (arXiv 2020.11) Rethinking Transformer-based Set Prediction for Object Detection, [Paper]
  • (arXiv 2020.12) Toward Transformer-Based Object Detection, [Paper]
  • (arXiv 2020.12) DETR for Pedestrian Detection, [Paper]
  • (arXiv 2021.01) Line Segment Detection Using Transformers without Edges, [Paper]
  • (arXiv 2021.01) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper]
  • (arXiv 2021.02) GEM: Glare or Gloom, I Can Still See You – End-to-End Multimodal Object Detector, [Paper]
  • (arXiv 2021.03) SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving, [Paper]
  • (arXiv 2021.03) Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning, [Paper]
  • (arXiv 2021.03) TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization, [Paper]
  • (arXiv 2021.03) CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, [Paper]
  • (arXiv 2021.03) DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention, [Paper]
  • (arXiv 2021.04) Efficient DETR: Improving End-to-End Object Detector with Dense Prior, [Paper]
  • (arXiv 2021.04) Points as Queries: Weakly Semi-supervised Object Detection by Points, [Paper]
  • (arXiv 2021.04) CAT: Cross-Attention Transformer for One-Shot Object Detection, [Paper]
  • (arXiv 2021.05) Content-Augmented Feature Pyramid Network with Light Linear Transformers, [Paper]
  • (arXiv 2021.06) You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection, [Paper]
  • (arXiv 2021.06) DETReg: Unsupervised Pretraining with Region Priors for Object Detection, [Paper], [Project]
  • (arXiv 2021.06) Oriented Object Detection with Transformer, [Paper]
  • (arXiv 2021.06) MODETR: Moving Object Detection with Transformers, [Paper]
  • (arXiv 2021.07) ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer, [Paper]
  • (arXiv 2021.07) OODformer: Out-Of-Distribution Detection Transformer, [Paper]
  • (arXiv 2021.07) Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers, [Paper], [Code]
  • (arXiv 2021.08) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper], [Code]
  • (arXiv 2021.08) PSViT: Better Vision Transformer via Token Pooling and Attention Sharing, [Paper]
  • (arXiv 2021.08) Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation), [Paper], [Code]
  • (arXiv 2021.08) Conditional DETR for Fast Training Convergence, [Paper], [Code]
  • (arXiv 2021.08) Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads, [Paper]
  • (arXiv 2021.08) TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios, [Paper]
  • (arXiv 2021.09) Anchor DETR: Query Design for Transformer-Based Detector, [Paper], [Code]
  • (arXiv 2021.09) SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction, [Paper]
  • (arXiv 2021.09) Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds, [Paper]
  • (arXiv 2021.10) ViDT: An Efficient and Effective Fully Transformer-based Object Detector, [Paper], [Code]
  • (arXiv 2021.10) DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries, [Paper], [Code]
  • (arXiv 2021.10) CvT-ASSD: Convolutional vision-Transformer Based Attentive Single Shot MultiBox Detector, [Paper], [Code]
  • (arXiv 2021.11) Cross-Modality Fusion Transformer for Multispectral Object Detection, [Paper], [Code]
  • (arXiv 2021.11) Benchmarking Detection Transfer Learning with Vision Transformers, [Paper]
  • (arXiv 2021.11) BoxeR: Box-Attention for 2D and 3D Transformers, [Paper]
  • (arXiv 2021.11) Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity, [Paper], [Code]
  • (arXiv 2021.12) OW-DETR: Open-world Detection Transformer, [Paper]
  • (arXiv 2021.12) Recurrent Glimpse-based Decoder for Detection with Transformer, [Paper], [Code]
  • (arXiv 2021.12) BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird-Eye-View, [Paper]
  • (arXiv 2021.12) Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence, [Paper]
  • (arXiv 2022.01) Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond, [Paper], [Code]
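
The DETR-family detectors above share one core idea: predict a fixed-size set of boxes and match it one-to-one to the ground-truth objects by minimizing a pairwise cost, so no NMS or anchor heuristics are needed. The sketch below illustrates only the matching step; using plain L1 distance between box centers as the cost and brute-force search over permutations are simplifications for this list (DETR itself combines class and box costs and uses the Hungarian algorithm).

```python
# Toy set matching for DETR-style detection: assign each ground-truth
# object to the predicted box that minimizes the total pairwise cost.
# Cost = L1 distance between box centers (illustrative stand-in for the
# class + box costs in the papers); brute force is fine for tiny N.
from itertools import permutations

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def match(predictions, targets):
    """Return (assignment, cost): targets[i] matches predictions[assignment[i]]."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(predictions)), len(targets)):
        cost = sum(l1(predictions[p], t) for p, t in zip(perm, targets))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost

preds = [(0.9, 0.9), (0.1, 0.2), (0.5, 0.5)]  # N = 3 predicted box centers
gts = [(0.0, 0.2), (0.6, 0.5)]                # 2 ground-truth centers
assignment, cost = match(preds, gts)
print(assignment, round(cost, 2))  # → (1, 2) 0.2
```

Predictions left unmatched (here the first one) are supervised as "no object" in DETR-style training, which is what lets these detectors dispense with duplicate-removal post-processing.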

Face

  • (arXiv 2021.03) Face Transformer for Recognition, [Paper]
  • (arXiv 2021.03) Robust Facial Expression Recognition with Convolutional Visual Transformers, [Paper]
  • (arXiv 2021.04) TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection, [Paper]
  • (arXiv 2021.04) Facial Attribute Transformers for Precise and Robust Makeup Transfer, [Paper]
  • (arXiv 2021.04) Learning to Cluster Faces via Transformer, [Paper]
  • (arXiv 2021.06) VidFace: A Full-Transformer Solver for Video Face Hallucination with Unaligned Tiny Snapshots, [Paper]
  • (arXiv 2021.06) MViT: Mask Vision Transformer for Facial Expression Recognition in the wild, [Paper]
  • (arXiv 2021.06) Shuffle Transformer with Feature Alignment for Video Face Parsing, [Paper]
  • (arXiv 2021.06) A Latent Transformer for Disentangled and Identity-Preserving Face Editing, [Paper], [Code]
  • (arXiv 2021.08) FT-TDR: Frequency-guided Transformer and Top-Down Refinement Network for Blind Face Inpainting, [Paper]
  • (arXiv 2021.08) Learning Fair Face Representation With Progressive Cross Transformer, [Paper]
  • (arXiv 2021.08) TransFER: Learning Relation-aware Facial Expression Representations with Transformers, [Paper]
  • (arXiv 2021.09) TANet: A new Paradigm for Global Face Super-resolution via Transformer-CNN Aggregation Network, [Paper]
  • (arXiv 2021.09) Expression Snippet Transformer for Robust Video-based Facial Expression Recognition, [Paper], [Code]
  • (arXiv 2021.09) MFEViT: A Robust Lightweight Transformer-based Network for Multimodal 2D+3D Facial Expression Recognition, [Paper]
  • (arXiv 2021.11) FakeTransformer: Exposing Face Forgery From Spatial-Temporal Representation Modeled By Facial Pixel Variations, [Paper]
  • (arXiv 2021.12) SSAT: A Symmetric Semantic-Aware Transformer Network for Makeup Transfer and Removal, [Paper], [Code]
  • (arXiv 2021.12) FaceFormer: Speech-Driven 3D Facial Animation with Transformers, [Paper]
  • (arXiv 2021.12) Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition, [Paper]
  • (arXiv 2022.01) RestoreFormer: High-Quality Blind Face Restoration From Undegraded Key-Value Pairs, [Paper]

Few-shot Learning

  • (arXiv 2021.04) Rich Semantics Improve Few-shot Learning, [Paper], [Code]
  • (arXiv 2021.04) Few-Shot Segmentation via Cycle-Consistent Transformer, [Paper]
  • (arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [Paper], [Code]
  • (arXiv 2021.12) Cost Aggregation Is All You Need for Few-Shot Segmentation, [Paper], [Code]
  • (arXiv 2022.01) HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning, [Paper]

Fusion

  • (arXiv 2022.01) TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning, [Paper]

GAN

  • (arXiv 2021.02) TransGAN: Two Transformers Can Make One Strong GAN, [Paper], [Code]
  • (arXiv 2021.03) Generative Adversarial Transformers, [Paper], [Code]
  • (arXiv 2021.04) VTGAN: Semi-supervised Retinal Image Synthesis and Disease Prediction using Vision Transformers, [Paper], [Code]
  • (arXiv 2021.05) Combining Transformer Generators with Convolutional Discriminators, [Paper], [Code]
  • (arXiv 2021.06) ViT-Inception-GAN for Image Colourising, [Paper]
  • (arXiv 2021.06) Improved Transformer for High-Resolution GANs, [Paper]
  • (arXiv 2021.06) Styleformer: Transformer based Generative Adversarial Networks with Style Vector, [Paper], [Code]
  • (arXiv 2021.07) ViTGAN: Training GANs with Vision Transformers, [Paper]
  • (arXiv 2021.10) Generating Symbolic Reasoning Problems with Transformer GANs, [Paper]
  • (arXiv 2021.10) STransGAN: An Empirical Study on Transformer in GANs, [Paper], [Project]
  • (arXiv 2021.12) StyleSwin: Transformer-based GAN for High-resolution Image Generation, [Paper], [Code]
  • (arXiv 2022.01) RFormer: Transformer-based Generative Adversarial Network for Real Fundus Image Restoration on A New Clinical Benchmark, [Paper]

Gaze

  • (arXiv 2021.06) Gaze Estimation using Transformer, [Paper], [Code]

HOI

  • (CVPR'21) HOTR: End-to-End Human-Object Interaction Detection with Transformers, [Paper], [Code]
  • (arXiv 2021.03) QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information, [Paper], [Code]
  • (arXiv 2021.03) Reformulating HOI Detection as Adaptive Set Prediction, [Paper], [Code]
  • (arXiv 2021.03) End-to-End Human Object Interaction Detection with HOI Transformer, [Paper], [Code]
  • (arXiv 2021.05) Visual Composite Set Detection Using Part-and-Sum Transformers, [Paper]
  • (arXiv 2021.08) GTNet: Guided Transformer Network for Detecting Human-Object Interactions, [Paper], [Code]
  • (arXiv 2021.12) Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer, [Paper], [Code]

Hyperspectral

  • (arXiv 2021.07) SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers, [Paper], [Code]
  • (arXiv 2021.10) 3D-ANAS v2: Grafting Transformer Module on Automatically Designed ConvNet for Hyperspectral Image Classification, [Paper], [Code]
  • (arXiv 2021.11) Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction, [Paper]
  • (arXiv 2021.11) Learning A 3D-CNN and Transformer Prior for Hyperspectral Image Super-Resolution, [Paper]

Incremental Learning

  • (arXiv 2021.12) Improving Vision Transformers for Incremental Learning, [Paper]

In-painting

  • (ECCV'20) Learning Joint Spatial-Temporal Transformations for Video Inpainting, [Paper], [Code]
  • (arXiv 2021.04) Aggregated Contextual Transformations for High-Resolution Image Inpainting, [Paper], [Code]
  • (arXiv 2021.04) Decoupled Spatial-Temporal Transformer for Video Inpainting, [Paper], [Code]

Instance Segmentation

  • (CVPR'21) End-to-End Video Instance Segmentation with Transformers, [Paper], [Code]
  • (arXiv 2021.04) ISTR: End-to-End Instance Segmentation with Transformers, [Paper], [Code]
  • (arXiv 2021.08) SOTR: Segmenting Objects with Transformers, [Paper], [Code]
  • (arXiv 2021.12) SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation, [Paper], [Code]
  • (arXiv 2021.12) A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation, [Paper]
  • (arXiv 2021.12) SOIT: Segmenting Objects with Instance-Aware Transformers, [Paper], [Code]

Layout

  • (CVPR'21) Variational Transformer Networks for Layout Generation, [Paper]
  • (arXiv 2021.10) The Layout Generation Algorithm of Graphic Design Based on Transformer-CVAE, [Paper]
  • (arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable Layout Generation, [Paper]

Matching

  • (CVPR'21) LoFTR: Detector-Free Local Feature Matching with Transformers, [Paper], [Code]

Medical

  • (arXiv 2021.02) TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2021.02) Medical Transformer: Gated Axial-Attention for Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2021.03) SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation, [Paper], [Code]
  • (arXiv 2021.03) TransBTS: Multimodal Brain Tumor Segmentation Using Transformer, [Paper], [Code]
  • (arXiv 2021.03) TransMed: Transformers Advance Multi-modal Medical Image Classification, [Paper]
  • (arXiv 2021.03) U-Net Transformer: Self and Cross Attention for Medical Image Segmentation, [Paper]
  • (arXiv 2021.03) UNETR: Transformers for 3D Medical Image Segmentation, [Paper]
  • (arXiv 2021.04) DeepProg: A Multi-modal Transformer-based End-to-end Framework for Predicting Disease Prognosis, [Paper]
  • (arXiv 2021.04) ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration, [Paper], [Code]
  • (arXiv 2021.04) Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification, [Paper]
  • (arXiv 2021.04) Shoulder Implant X-Ray Manufacturer Classification: Exploring with Vision Transformer, [Paper]
  • (arXiv 2021.04) Medical Transformer: Universal Brain Encoder for 3D MRI Analysis, [Paper]
  • (arXiv 2021.04) Crossmodal Matching Transformer for Interventional in TEVAR, [Paper]
  • (arXiv 2021.04) GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification, [Paper]
  • (arXiv 2021.04) Pyramid Medical Transformer for Medical Image Segmentation, [Paper]
  • (arXiv 2021.05) Anatomy-Guided Parallel Bottleneck Transformer Network for Automated Evaluation of Root Canal Therapy, [Paper]
  • (arXiv 2021.05) Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2021.05) Is Image Size Important? A Robustness Comparison of Deep Learning Methods for Multi-scale Cell Image Classification Tasks: from Convolutional Neural Networks to Visual Transformers, [Paper]
  • (arXiv 2021.05) Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers, [Paper]
  • (arXiv 2021.05) Medical Image Segmentation using Squeeze-and-Expansion Transformers, [Paper], [Code]
  • (arXiv 2021.05) POCFormer: A Lightweight Transformer Architecture for Detection of COVID-19 Using Point of Care Ultrasound, [Paper]
  • (arXiv 2021.05) COTR: Convolution in Transformer Network for End to End Polyp Detection, [Paper]
  • (arXiv 2021.05) PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer, [Paper]
  • (arXiv 2021.06) TED-net: Convolution-free T2T Vision Transformer-based Encoder-decoder Dilation network for Low-dose CT Denoising, [Paper]
  • (arXiv 2021.06) A Multi-Branch Hybrid Transformer Network for Corneal Endothelial Cell Segmentation, [Paper]
  • (arXiv 2021.06) Task Transformer Network for Joint MRI Reconstruction and Super-Resolution, [Paper], [Code]
  • (arXiv 2021.06) DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation, [Paper]
  • (arXiv 2021.06) More than Encoder: Introducing Transformer Decoder to Upsample, [Paper]
  • (arXiv 2021.06) Instance-based Vision Transformer for Subtyping of Papillary Renal Cell Carcinoma in Histopathological Image, [Paper]
  • (arXiv 2021.06) MTrans: Multi-Modal Transformer for Accelerated MR Imaging, [Paper], [Code]
  • (arXiv 2021.06) Multi-Compound Transformer for Accurate Biomedical Image Segmentation, [Paper], [Code]
  • (arXiv 2021.07) ResViT: Residual vision transformers for multi-modal medical image synthesis, [Paper]
  • (arXiv 2021.07) E-DSSR: Efficient Dynamic Surgical Scene Reconstruction with Transformer-based Stereoscopic Depth Perception, [Paper]
  • (arXiv 2021.07) UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation, [Paper]
  • (arXiv 2021.07) COVID-VIT: Classification of Covid-19 from CT chest images based on vision transformer models, [Paper]
  • (arXiv 2021.07) RATCHET: Medical Transformer for Chest X-ray Diagnosis and Reporting, [Paper], [Code]
  • (arXiv 2021.07) Automatic size and pose homogenization with spatial transformer network to improve and accelerate pediatric segmentation, [Paper]
  • (arXiv 2021.07) Transformer Network for Significant Stenosis Detection in CCTA of Coronary Arteries, [Paper]
  • (arXiv 2021.07) EEG-ConvTransformer for Single-Trial EEG based Visual Stimuli Classification, [Paper]
  • (arXiv 2021.07) Visual Transformer with Statistical Test for COVID-19 Classification, [Paper]
  • (arXiv 2021.07) TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation, [Paper]
  • (arXiv 2021.07) Few-Shot Domain Adaptation with Polymorphic Transformers, [Paper], [Code]
  • (arXiv 2021.07) TransClaw U-Net: Claw U-Net with Transformers for Medical Image Segmentation, [Paper]
  • (arXiv 2021.07) Surgical Instruction Generation with Transformers, [Paper]
  • (arXiv 2021.07) LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2021.07) TEDS-Net: Enforcing Diffeomorphisms in Spatial Transformers to Guarantee Topology Preservation in Segmentations, [Paper], [Code]
  • (arXiv 2021.08) Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers, [Paper], [Code]
  • (arXiv 2021.08) Is it Time to Replace CNNs with Transformers for Medical Images?, [Paper], [Code]
  • (arXiv 2021.09) nnFormer: Interleaved Transformer for Volumetric Segmentation, [Paper], [Code]
  • (arXiv 2021.09) UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer, [Paper], [Code]
  • (arXiv 2021.09) MISSFormer: An Effective Medical Image Segmentation Transformer, [Paper]
  • (arXiv 2021.09) Eformer: Edge Enhancement based Transformer for Medical Image Denoising, [Paper]
  • (arXiv 2021.09) Transformer-Unet: Raw Image Processing with Unet, [Paper]
  • (arXiv 2021.09) BiTr-Unet: a CNN-Transformer Combined Network for MRI Brain Tumor Segmentation, [Paper]
  • (arXiv 2021.09) GT U-Net: A U-Net Like Group Transformer Network for Tooth Root Segmentation, [Paper]
  • (arXiv 2021.10) Transformer Assisted Convolutional Network for Cell Instance Segmentation, [Paper]
  • (arXiv 2021.10) A transformer-based deep learning approach for classifying brain metastases into primary organ sites using clinical whole brain MRI images, [Paper]
  • (arXiv 2021.10) Boundary-aware Transformers for Skin Lesion Segmentation, [Paper], [Code]
  • (arXiv 2021.10) Vision Transformer based COVID-19 Detection using Chest X-rays, [Paper]
  • (arXiv 2021.10) Combining CNNs With Transformer for Multimodal 3D MRI Brain Tumor Segmentation With Self-Supervised Pretraining, [Paper], [Code]
  • (arXiv 2021.10) CAE-Transformer: Transformer-based Model to Predict Invasiveness of Lung Adenocarcinoma Subsolid Nodules from Non-thin Section 3D CT Scans, [Paper], [Code]
  • (arXiv 2021.10) COVID-19 Detection in Chest X-ray Images Using Swin-Transformer and Transformer in Transformer, [Paper], [Code]
  • (arXiv 2021.10) AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation, [Paper]
  • (arXiv 2021.10) Vision Transformer for Classification of Breast Ultrasound Images, [Paper]
  • (arXiv 2021.11) Federated Split Vision Transformer for COVID-19 CXR Diagnosis using Task-Agnostic Training, [Paper]
  • (arXiv 2021.11) Hepatic vessel segmentation based on 3D swin-transformer with inductive biased multi-head self-attention, [Paper]
  • (arXiv 2021.11) Lymph Node Detection in T2 MRI with Transformers, [Paper]
  • (arXiv 2021.11) Mixed Transformer U-Net For Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2021.11) Transformer for Polyp Detection, [Paper]
  • (arXiv 2021.11) DuDoTrans: Dual-Domain Transformer Provides More Attention for Sinogram Restoration in Sparse-View CT Reconstruction, [Paper], [Code]
  • (arXiv 2021.11) A Volumetric Transformer for Accurate 3D Tumor Segmentation, [Paper], [Code]
  • (arXiv 2021.11) Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis, [Paper], [Code]
  • (arXiv 2021.11) MIST-net: Multi-domain Integrative Swin Transformer network for Sparse-View CT Reconstruction, [Paper]
  • (arXiv 2021.12) MT-TransUNet: Mediating Multi-Task Tokens in Transformers for Skin Lesion Segmentation and Classification, [Paper], [Code]
  • (arXiv 2021.12) 3D Medical Point Transformer: Introducing Convolution to Attention Networks for Medical Point Cloud Analysis, [Paper], [Code]
  • (arXiv 2021.12) Semi-Supervised Medical Image Segmentation via Cross Teaching between CNN and Transformer, [Paper], [Code]
  • (arXiv 2021.12) Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks, [Paper], [Code]
  • (arXiv 2021.12) MSHT: Multi-stage Hybrid Transformer for the ROSE Image Analysis of Pancreatic Cancer, [Paper], [Code]
  • (arXiv 2022.01) D-Former: A U-shaped Dilated Transformer for 3D Medical Image Segmentation, [Paper]
  • (arXiv 2022.01) Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images, [Paper], [Code]
  • (arXiv 2022.01) Swin Transformer for Fast MRI, [Paper], [Code]
  • (arXiv 2022.01) ViTBIS: Vision Transformer for Biomedical Image Segmentation, [Paper]

Motion

  • (arXiv 2021.03) Single-Shot Motion Completion with Transformer, [Paper], [Code]
  • (arXiv 2021.03) DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer, [Paper]
  • (arXiv 2021.03) Multimodal Motion Prediction with Stacked Transformers, [Paper], [Code]
  • (arXiv 2021.04) Action-Conditioned 3D Human Motion Synthesis with Transformer VAE, [Paper]
  • (arXiv 2021.10) AniFormer: Data-driven 3D Animation with Transformer, [Paper], [Code]
  • (arXiv 2021.11) Multi-Person 3D Motion Prediction with Multi-Range Transformers, [Paper], [Code]

Multi-task/modal

  • (arXiv 2021.02) Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, [Paper], [Code]
  • (arXiv 2021.04) MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding, [Paper], [Code]
  • (arXiv 2021.04) Multi-Modal Fusion Transformer for End-to-End Autonomous Driving, [Paper]
  • (arXiv 2021.04) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, [Paper]
  • (arXiv 2021.06) Scene Transformer: A Unified Multi-task Model for Behavior Prediction and Planning, [Paper]
  • (arXiv 2021.06) Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation, [Paper]
  • (arXiv 2021.06) A Transformer-based Cross-modal Fusion Model with Adversarial Training, [Paper]
  • (arXiv 2021.07) Attention Bottlenecks for Multimodal Fusion, [Paper]
  • (arXiv 2021.07) Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots, [Paper]
  • (arXiv 2021.07) Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions, [Paper]
  • (arXiv 2021.07) Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers, [Paper], [Code]
  • (arXiv 2021.08) StrucTexT: Structured Text Understanding with Multi-Modal Transformers, [Paper]
  • (arXiv 2021.08) Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations, [Paper]
  • (arXiv 2021.09) TxT: Crossmodal End-to-End Learning with Transformers, [Paper]
  • (arXiv 2021.09) Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers, [Paper]
  • (arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering, [Paper]
  • (arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for Video Grounding, [Paper], [Code]
  • (arXiv 2021.09) Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions, [Paper]
  • (arXiv 2021.09) KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, [Paper]
  • (arXiv 2021.10) Unifying Multimodal Transformer for Bi-directional Image and Text Generation, [Paper], [Code]
  • (arXiv 2021.10) VLDeformer: Learning Visual-Semantic Embeddings by Vision-Language Transformer Decomposing, [Paper]
  • (arXiv 2021.10) History Aware Multimodal Transformer for Vision-and-Language Navigation, [Paper], [Code]
  • (arXiv 2021.10) Detecting Dementia from Speech and Transcripts using Transformers, [Paper]
  • (arXiv 2021.11) MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition, [Paper]
  • (arXiv 2021.11) VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, [Paper], [Code]
  • (arXiv 2021.11) An Empirical Study of Training End-to-End Vision-and-Language Transformers, [Paper], [Code]
  • (arXiv 2021.11) Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, [Paper]
  • (arXiv 2021.11) CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval, [Paper]
  • (arXiv 2021.11) Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture, [Paper], [Code1], [Code2]
  • (arXiv 2021.11) UFO: A UniFied TransfOrmer for Vision-Language Representation Learning, [Paper]
  • (arXiv 2021.11) Multi-modal Transformers Excel at Class-agnostic Object Detection, [Paper], [Code]
  • (arXiv 2021.11) Sparse Fusion for Multimodal Transformers, [Paper]
  • (arXiv 2021.11) VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, [Paper], [Code]
  • (arXiv 2021.11) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]
  • (arXiv 2021.11) PolyViT: Co-training Vision Transformers on Images, Videos and Audio, [Paper]
  • (arXiv 2021.11) End-to-End Referring Video Object Segmentation with Multimodal Transformers, [Paper], [Code]
  • (arXiv 2021.12) TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning, [Paper], [Code]
  • (arXiv 2021.12) LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences, [Paper]
  • (arXiv 2021.12) LAVT: Language-Aware Vision Transformer for Referring Image Segmentation, [Paper], [Code]
  • (arXiv 2021.12) Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation, [Paper]
  • (arXiv 2021.12) VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling, [Paper]
  • (arXiv 2021.12) VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks, [Paper], [Code]
  • (arXiv 2021.12) Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, [Paper]
  • (arXiv 2021.12) Distilled Dual-Encoder Model for Vision-Language Understanding, [Paper], [Code]
  • (arXiv 2021.12) Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding, [Paper]
  • (arXiv 2021.12) LaTr: Layout-Aware Transformer for Scene-Text VQA, [Paper]
  • (arXiv 2021.12) SLIP: Self-supervision meets Language-Image Pre-training, [Paper], [Code]
  • (arXiv 2021.12) Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation, [Paper], [Code]
  • (arXiv 2022.01) Robust Self-Supervised Audio-Visual Speech Recognition, [Paper], [Code]
  • (arXiv 2022.01) Self-Training Vision Language BERTs with a Unified Conditional Model, [Paper]
  • (arXiv 2022.01) On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering, [Paper]
  • (arXiv 2022.01) UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning, [Paper], [Code]
  • (arXiv 2022.01) BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions, [Paper], [Code]

Multi-view Stereo

  • (arXiv 2021.11) TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers, [Paper], [Code]
  • (arXiv 2021.12) Multi-View Stereo with Transformer, [Paper]

NAS

  • (CVPR'21) HR-NAS: Searching Efficient High-Resolution Neural Architectures with Lightweight Transformers, [Paper], [Code]
  • (arXiv 2021.02) Towards Accurate and Compact Architectures via Neural Architecture Transformer, [Paper]
  • (arXiv 2021.03) BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search, [Paper], [Code]
  • (arXiv 2021.06) Vision Transformer Architecture Search, [Paper], [Code]
  • (arXiv 2021.07) AutoFormer: Searching Transformers for Visual Recognition, [Paper], [Code]
  • (arXiv 2021.07) GLiT: Neural Architecture Search for Global and Local Image Transformer, [Paper]
  • (arXiv 2021.09) Searching for Efficient Multi-Stage Vision Transformers, [Paper]
  • (arXiv 2021.10) UniNet: Unified Architecture Search with Convolution, Transformer, and MLP, [Paper]
  • (arXiv 2021.11) Searching the Search Space of Vision Transformer, [Paper], [Code]
  • (arXiv 2022.01) Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space, [Paper]

Navigation

  • (ICLR'21) VTNet: Visual Transformer Network for Object Goal Navigation, [Paper]
  • (arXiv 2021.03) MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation, [Paper]
  • (arXiv 2021.04) Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation, [Paper]
  • (arXiv 2021.05) Episodic Transformer for Vision-and-Language Navigation, [Paper]
  • (arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation, [Paper]

OCR

  • (arXiv 2021.04) Handwriting Transformers, [Paper]
  • (arXiv 2021.05) I2C2W: Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition, [Paper]
  • (arXiv 2021.05) Vision Transformer for Fast and Efficient Scene Text Recognition, [Paper]
  • (arXiv 2021.06) DocFormer: End-to-End Transformer for Document Understanding, [Paper]
  • (arXiv 2021.08) A Transformer-based Math Language Model for Handwritten Math Expression Recognition, [Paper]
  • (arXiv 2021.09) TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, [Paper], [Code]
  • (arXiv 2021.10) Robustness Evaluation of Transformer-based Form Field Extractors via Form Attacks, [Paper], [Code]
  • (arXiv 2021.10) DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction, [Paper]
  • (arXiv 2021.12) Visual-Semantic Transformer for Scene Text Recognition, [Paper]
  • (arXiv 2021.12) Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical documents, [Paper]
  • (arXiv 2021.12) SPTS: Single-Point Text Spotting, [Paper]

Octree

  • (arXiv 2021.11) Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences, [Paper]

Panoptic Segmentation

  • (arXiv 2020.12) MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers, [Paper]
  • (arXiv 2021.09) Panoptic SegFormer, [Paper]
  • (arXiv 2021.09) PnP-DETR: Towards Efficient Visual Analysis with Transformers, [Paper], [Code]
  • (arXiv 2021.10) An End-to-End Trainable Video Panoptic Segmentation Method using Transformers, [Paper]
  • (arXiv 2021.12) Masked-attention Mask Transformer for Universal Image Segmentation, [Paper], [Code]
  • (arXiv 2021.12) PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation, [Paper], [Code]

Point Cloud

  • (ICRA'21) NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation, [Paper]
  • (arXiv 2020.12) Point Transformer, [Paper]
  • (arXiv 2020.12) 3D Object Detection with Pointformer, [Paper]
  • (arXiv 2020.12) PCT: Point Cloud Transformer, [Paper]
  • (arXiv 2021.03) You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module, [Paper], [Code]
  • (arXiv 2021.04) Group-Free 3D Object Detection via Transformers, [Paper], [Code]
  • (arXiv 2021.04) M3DETR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers, [Paper]
  • (arXiv 2021.04) Dual Transformer for Point Cloud Analysis, [Paper]
  • (arXiv 2021.04) Point Cloud Learning with Transformer, [Paper]
  • (arXiv 2021.08) SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer, [Paper], [Code]
  • (arXiv 2021.08) PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds, [Paper], [Code]
  • (arXiv 2021.08) Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning, [Paper], [Code]
  • (arXiv 2021.08) PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers, [Paper], [Code]
  • (arXiv 2021.08) Improving 3D Object Detection with Channel-wise Transformer, [Paper], [Code]
  • (arXiv 2021.09) PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds, [Paper], [Code]
  • (arXiv 2021.09) An End-to-End Transformer Model for 3D Object Detection, [Paper]
  • (arXiv 2021.10) Spatial-Temporal Transformer for 3D Point Cloud Sequences, [Paper]
  • (arXiv 2021.10) PatchFormer: A Versatile 3D Transformer Based on Patch Attention, [Paper]
  • (arXiv 2021.11) CpT: Convolutional Point Transformer for 3D Point Cloud Processing, [Paper]
  • (arXiv 2021.11) PU-Transformer: Point Cloud Upsampling Transformer, [Paper]
  • (arXiv 2021.11) Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling, [Paper], [Code]
  • (arXiv 2021.11) Adaptive Channel Encoding Transformer for Point Cloud Analysis, [Paper], [Code]
  • (arXiv 2021.11) Fast Point Transformer, [Paper]
  • (arXiv 2021.12) Embracing Single Stride 3D Object Detector with Sparse Transformer, [Paper], [Code]
  • (arXiv 2021.12) Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction, [Paper], [Code]

Pose

  • (arXiv 2020.12) End-to-End Human Pose and Mesh Reconstruction with Transformers, [Paper]
  • (arXiv 2020.12) TransPose: Towards Explainable Human Pose Estimation by Transformer, [Paper]
  • (arXiv 2021.03) 3D Human Pose Estimation with Spatial and Temporal Transformers, [Paper], [Code]
  • (arXiv 2021.03) End-to-End Trainable Multi-Instance Pose Estimation with Transformers, [Paper]
  • (arXiv 2021.03) Lifting Transformer for 3D Human Pose Estimation in Video, [Paper]
  • (arXiv 2021.03) TFPose: Direct Human Pose Estimation with Transformers, [Paper]
  • (arXiv 2021.04) Pose Recognition with Cascade Transformers, [Paper], [Code]
  • (arXiv 2021.04) TokenPose: Learning Keypoint Tokens for Human Pose Estimation, [Paper]
  • (arXiv 2021.04) Skeletor: Skeletal Transformers for Robust Body-Pose Estimation, [Paper]
  • (arXiv 2021.04) HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction, [Paper]
  • (arXiv 2021.07) Test-Time Personalization with a Transformer for Human Pose Estimation, [Paper]
  • (arXiv 2021.09) Pose Transformers (POTR): Human Motion Prediction with Non-Autoregressive Transformers, [Paper], [Code]
  • (arXiv 2021.09) GraFormer: Graph Convolution Transformer for 3D Pose Estimation, [Paper], [Code]
  • (arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [Paper]
  • (arXiv 2021.10) 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning, [Paper]
  • (arXiv 2021.10) Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation, [Paper], [Code]
  • (arXiv 2021.10) HRFormer: High-Resolution Transformer for Dense Prediction, [Paper], [Code]
  • (arXiv 2021.10) TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation, [Paper], [Code]
  • (arXiv 2021.11) MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation, [Paper], [Code]
  • (arXiv 2021.11) A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose, [Paper]
  • (arXiv 2021.12) PE-former: Pose Estimation Transformer, [Paper], [Code]
  • (arXiv 2021.12) Geometry-Contrastive Transformer for Generalized 3D Pose Transfer, [Paper], [Code]
  • (arXiv 2021.12) DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer, [Paper], [Code]
  • (arXiv 2021.12) Towards Deep Learning-based 6D Bin Pose Estimation in 3D Scans, [Paper]
  • (arXiv 2021.12) End-to-End Learning of Multi-category 3D Pose and Shape Estimation, [Paper]
  • (arXiv 2022.01) Swin-Pose: Swin Transformer Based Human Pose Estimation, [Paper]
  • (arXiv 2022.01) Poseur: Direct Human Pose Regression with Transformers, [Paper]

Planning

  • (arXiv 2021.12) Differentiable Spatial Planning using Transformers, [Paper], [Project]

Pruning & Quantization

  • (arXiv 2021.04) Visual Transformer Pruning, [Paper]
  • (arXiv 2021.06) Post-Training Quantization for Vision Transformer, [Paper]
  • (arXiv 2021.11) PTQ4ViT: Post-Training Quantization Framework for Vision Transformers, [Paper], [Code]
  • (arXiv 2021.11) FQ-ViT: Fully Quantized Vision Transformer without Retraining, [Paper]
  • (arXiv 2022.01) Q-ViT: Fully Differentiable Quantization for Vision Transformer, [Paper]

Recognition

  • (arXiv 2021.03) Global Self-Attention Networks for Image Recognition, [Paper]
  • (arXiv 2021.03) TransFG: A Transformer Architecture for Fine-grained Recognition, [Paper]
  • (arXiv 2021.05) Are Convolutional Neural Networks or Transformers more like human vision?, [Paper]
  • (arXiv 2021.07) Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition, [Paper]
  • (arXiv 2021.07) RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition, [Paper]
  • (arXiv 2021.08) DPT: Deformable Patch-based Transformer for Visual Recognition, [Paper], [Code]
  • (arXiv 2021.10) A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition, [Paper]
  • (arXiv 2021.10) Transformer-based Dual Relation Graph for Multi-label Image Recognition, [Paper], [Code]
  • (arXiv 2021.10) MVT: Multi-view Vision Transformer for 3D Object Recognition, [Paper]
  • (arXiv 2021.11) AdaViT: Adaptive Vision Transformers for Efficient Image Recognition, [Paper]
  • (arXiv 2022.01) TransVPR: Transformer-based place recognition with multi-level attention aggregation, [Paper]

Reconstruction

  • (arXiv 2021.03) Multi-view 3D Reconstruction with Transformer, [Paper]
  • (arXiv 2021.06) THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers, [Paper]
  • (arXiv 2021.06) LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction, [Paper]
  • (arXiv 2021.07) TransformerFusion: Monocular RGB Scene Reconstruction using Transformers, [Paper]
  • (arXiv 2021.10) 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers, [Paper], [Code]
  • (arXiv 2021.11) Reference-based Magnetic Resonance Image Reconstruction Using Texture Transformer, [Paper]
  • (arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for Structured Reconstruction, [Paper]
  • (arXiv 2021.12) VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion, [Paper], [Code]

Re-identification

  • (arXiv 2021.02) TransReID: Transformer-based Object Re-Identification, [Paper]
  • (arXiv 2021.03) Spatiotemporal Transformer for Video-based Person Re-identification, [Paper]
  • (arXiv 2021.04) AAformer: Auto-Aligned Transformer for Person Re-Identification, [Paper]
  • (arXiv 2021.04) A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification, [Paper]
  • (arXiv 2021.06) Transformer-Based Deep Image Matching for Generalizable Person Re-identification, [Paper]
  • (arXiv 2021.06) Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer, [Paper]
  • (arXiv 2021.06) Person Re-Identification with a Locally Aware Transformer, [Paper]
  • (arXiv 2021.07) Learning Disentangled Representation Implicitly via Transformer for Occluded Person Re-Identification, [Paper], [Code]
  • (arXiv 2021.07) GiT: Graph Interactive Transformer for Vehicle Re-identification, [Paper]
  • (arXiv 2021.07) HAT: Hierarchical Aggregation Transformers for Person Re-identification, [Paper]
  • (arXiv 2021.09) Pose-guided Inter- and Intra-part Relational Transformer for Occluded Person Re-Identification, [Paper]
  • (arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification, [Paper]
  • (arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification, [Paper]
  • (arXiv 2021.11) Self-Supervised Pre-Training for Transformer-Based Person Re-Identification, [Paper], [Code]
  • (arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer, [Paper], [Code]
  • (arXiv 2022.01) Short Range Correlation Transformer for Occluded Person Re-Identification, [Paper]

Restoration

  • (arXiv 2021.06) Uformer: A General U-Shaped Transformer for Image Restoration, [Paper], [Code]
  • (arXiv 2021.08) SwinIR: Image Restoration Using Swin Transformer, [Paper], [Code]
  • (arXiv 2021.11) Restormer: Efficient Transformer for High-Resolution Image Restoration, [Paper], [Code]
  • (arXiv 2021.12) U2-Former: A Nested U-shaped Transformer for Image Restoration, [Paper], [Code]
  • (arXiv 2021.12) SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers, [Paper]

Retrieval

  • (CVPR'21) Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers, [Paper]
  • (arXiv 2021.01) Investigating the Vision Transformer Model for Image Retrieval Tasks, [Paper]
  • (arXiv 2021.02) Training Vision Transformers for Image Retrieval, [Paper]
  • (arXiv 2021.03) Instance-level Image Retrieval using Reranking Transformers, [Paper]
  • (arXiv 2021.04) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, [Paper]
  • (arXiv 2021.04) Self-supervised Video Retrieval Transformer Network, [Paper]
  • (arXiv 2021.05) TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval, [Paper], [Code]
  • (arXiv 2021.06) Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features, [Paper]
  • (arXiv 2021.06) All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers, [Paper], [Code]
  • (arXiv 2021.09) Vision Transformer Hashing for Image Retrieval, [Paper]

Salient Object Detection

  • (arXiv 2021.04) Transformer Transforms Salient Object Detection and Camouflaged Object Detection, [Paper]
  • (arXiv 2021.04) Visual Saliency Transformer, [Paper]
  • (arXiv 2021.04) CoSformer: Detecting Co-Salient Object with Transformers, [Paper]
  • (arXiv 2021.08) Unifying Global-Local Representations in Salient Object Detection with Transformer, [Paper], [Code]
  • (arXiv 2021.08) TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network, [Paper], [Code]
  • (arXiv 2021.08) Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net, [Paper]
  • (arXiv 2021.12) Transformer-based Network for RGB-D Saliency Detection, [Paper]
  • (arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection, [Paper]
  • (arXiv 2021.12) Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction, [Paper]

Scene

  • (arXiv 2020.12) SceneFormer: Indoor Scene Generation with Transformers, [Paper]
  • (arXiv 2021.05) SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation, [Paper]
  • (arXiv 2021.06) P2T: Pyramid Pooling Transformer for Scene Understanding, [Paper], [Code]
  • (arXiv 2021.07) Scenes and Surroundings: Scene Graph Generation using Relation Transformer, [Paper]
  • (arXiv 2021.07) Spatial-Temporal Transformer for Dynamic Scene Graph Generation, [Paper]
  • (arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation, [Paper]
  • (arXiv 2021.11) Compositional Transformers for Scene Generation, [Paper]
  • (arXiv 2021.11) Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations, [Paper], [Project]
  • (arXiv 2021.12) SGTR: End-to-end Scene Graph Generation with Transformer, [Paper]

Self-supervised Learning

  • (arXiv 2021.03) Can Vision Transformers Learn without Natural Images? [Paper], [Code]
  • (arXiv 2021.04) An Empirical Study of Training Self-Supervised Visual Transformers, [Paper]
  • (arXiv 2021.04) SiT: Self-supervised vIsion Transformer, [Paper], [Code]
  • (arXiv 2021.04) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, [Paper], [Code]
  • (arXiv 2021.04) Emerging Properties in Self-Supervised Vision Transformers, [Paper], [Code]
  • (arXiv 2021.05) Self-Supervised Learning with Swin Transformers, [Paper], [Code]
  • (arXiv 2021.06) MST: Masked Self-Supervised Transformer for Visual Representation, [Paper]
  • (arXiv 2021.06) Efficient Self-supervised Vision Transformers for Representation Learning, [Paper]
  • (arXiv 2021.09) Localizing Objects with Self-Supervised Transformers and no Labels, [Paper]
  • (arXiv 2021.10) Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning, [Paper], [Code]
  • (arXiv 2022.01) RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training, [Paper], [Code]
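As a quick orientation for readers new to the area, the shared building block behind every vision transformer listed above is scaled dot-product self-attention. The following is a minimal, illustrative NumPy sketch (single head, random weights), not a reproduction of any specific paper's implementation:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence.

    x: (n_tokens, d_model); w_q/w_k/w_v: (d_model, d_head) projections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (n_tokens, n_tokens)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v                              # (n_tokens, d_head)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(197, 64))  # e.g. 196 patch tokens + 1 [CLS] token
w = [rng.normal(size=(64, 32)) for _ in range(3)]
out = self_attention(tokens, *w)
print(out.shape)  # (197, 32)
```

Real models stack multiple such heads, add residual connections and MLP blocks, and learn the projection matrices; the token count 197 and dimensions here are only illustrative.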

Semantic Segmentation

  • (arXiv 2020.12) Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]
  • (arXiv 2021.01) Trans2Seg: Transparent Object Segmentation with Transformer, [Paper], [Code]
  • (arXiv 2021.05) Segmenter: Transformer for Semantic Segmentation, [Paper], [Code]
  • (arXiv 2021.06) SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, [Paper], [Code]
  • (arXiv 2021.06) Fully Transformer Networks for Semantic Image Segmentation, [Paper]
  • (arXiv 2021.06) Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images, [Paper]
  • (arXiv 2021.06) OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments, [Paper]
  • (arXiv 2021.07) Looking Outside the Window: Wider-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images, [Paper]
  • (arXiv 2021.07) Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World, [Paper]
  • (arXiv 2021.07) A Unified Efficient Pyramid Transformer for Semantic Segmentation, [Paper]
  • (arXiv 2021.08) Boosting Few-shot Semantic Segmentation with Transformers, [Paper], [Code]
  • (arXiv 2021.08) Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer, [Paper], [Code]
  • (arXiv 2021.08) Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation, [Paper], [Code]
  • (arXiv 2021.08) Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance, [Paper], [Code]
  • (arXiv 2021.08) Evaluating Transformer based Semantic Segmentation Networks for Pathological Image Segmentation, [Paper]
  • (arXiv 2021.08) Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models, [Paper]
  • (arXiv 2021.09) Efficient Hybrid Transformer: Learning Global-local Context for Urban Scene Segmentation, [Paper]
  • (arXiv 2021.11) HRViT: Multi-Scale High-Resolution Vision Transformer, [Paper]
  • (arXiv 2021.11) Dynamically pruning segformer for efficient semantic segmentation, [Paper]
  • (arXiv 2021.11) APANet: Adaptive Prototypes Alignment Network for Few-Shot Semantic Segmentation, [Paper]
  • (arXiv 2021.11) Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers, [Paper]
  • (arXiv 2021.11) GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation, [Paper]
  • (arXiv 2021.12) iSegFormer: Interactive Image Segmentation with Transformers, [Paper], [Code]
  • (arXiv 2021.12) SeMask: Semantically Masked Transformers for Semantic Segmentation, [Paper], [Code]
  • (arXiv 2022.01) Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention, [Paper], [Code]
  • (arXiv 2022.01) Pyramid Fusion Transformer for Semantic Segmentation, [Paper]

Shape

  • (WACV'21) End-to-end Lane Shape Prediction with Transformers, [Paper], [Code]

Super-Resolution

  • (CVPR'20) Learning Texture Transformer Network for Image Super-Resolution, [Paper], [Code]
  • (arXiv 2021.06) LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation, [Paper]
  • (arXiv 2021.06) Video Super-Resolution Transformer, [Paper], [Code]
  • (arXiv 2021.08) Light Field Image Super-Resolution with Transformers, [Paper], [Code]
  • (arXiv 2021.08) Efficient Transformer for Single Image Super-Resolution, [Paper]
  • (arXiv 2021.09) Fusformer: A Transformer-based Fusion Approach for Hyperspectral Image Super-resolution, [Paper]
  • (arXiv 2021.12) Implicit Transformer Network for Screen Content Image Continuous Super-Resolution, [Paper]
  • (arXiv 2021.12) On Efficient Transformer and Image Pre-training for Low-level Vision, [Paper], [Code]
  • (arXiv 2022.01) Detail-Preserving Transformer for Light Field Image Super-Resolution, [Paper], [Code]

Synthesis

  • (arXiv 2020.12) Taming Transformers for High-Resolution Image Synthesis, [Paper], [Code]
  • (arXiv 2021.04) Geometry-Free View Synthesis: Transformers and no 3D Priors, [Paper]
  • (arXiv 2021.05) High-Resolution Complex Scene Synthesis with Transformers, [Paper]
  • (arXiv 2021.06) The Image Local Autoregressive Transformer, [Paper]
  • (arXiv 2021.10) ATISS: Autoregressive Transformers for Indoor Scene Synthesis, [Paper], [Project]
  • (arXiv 2021.11) Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers, [Paper]

Tracking

  • (EMNLP'19) Effective Use of Transformer Networks for Entity Tracking, [Paper], [Code]
  • (CVPR'21) Transformer Tracking, [Paper], [Code]
  • (CVPR'21) Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking, [Paper], [Code]
  • (arXiv 2020.12) TransTrack: Multiple-Object Tracking with Transformer, [Paper], [Code]
  • (arXiv 2021.01) TrackFormer: Multi-Object Tracking with Transformers, [Paper]
  • (arXiv 2021.03) TransCenter: Transformers with Dense Queries for Multiple-Object Tracking, [Paper]
  • (arXiv 2021.03) Learning Spatio-Temporal Transformer for Visual Tracking, [Paper], [Code]
  • (arXiv 2021.04) Multitarget Tracking with Transformers, [Paper]
  • (arXiv 2021.04) Spatial-Temporal Graph Transformer for Multiple Object Tracking, [Paper]
  • (arXiv 2021.05) MOTR: End-to-End Multiple-Object Tracking with TRansformer, [Paper], [Code]
  • (arXiv 2021.05) TrTr: Visual Tracking with Transformer, [Paper], [Code]
  • (arXiv 2021.08) HiFT: Hierarchical Feature Transformer for Aerial Tracking, [Paper], [Code]
  • (arXiv 2021.10) Siamese Transformer Pyramid Networks for Real-Time UAV Tracking, [Paper], [Code]
  • (arXiv 2021.10) 3D Object Tracking with Transformer, [Paper], [Code]
  • (arXiv 2021.12) SwinTrack: A Simple and Strong Baseline for Transformer Tracking, [Paper], [Code]
  • (arXiv 2021.12) PTTR: Relational 3D Point Cloud Object Tracking with Transformer, [Paper], [Code]
  • (arXiv 2021.12) Learning Tracking Representations via Dual-Branch Fully Transformer Networks, [Paper], [Code]
  • (arXiv 2021.12) Efficient Visual Tracking with Exemplar Transformers, [Paper], [Code]

Traffic

  • (arXiv 2021.05) Novelty Detection and Analysis of Traffic Scenario Infrastructures in the Latent Space of a Vision Transformer-Based Triplet Autoencoder, [Paper]
  • (arXiv 2021.11) DetectorNet: Transformer-enhanced Spatial Temporal Graph Neural Network for Traffic Prediction, [Paper]
  • (arXiv 2021.11) ProSTformer: Pre-trained Progressive Space-Time Self-attention Model for Traffic Flow Forecasting, [Paper]
  • (arXiv 2022.01) SwinUNet3D -- A Hierarchical Architecture for Deep Traffic Prediction using Shifted Window Transformers, [Paper], [Code]

Texture

  • (arXiv 2021.09) 3D Human Texture Estimation from a Single Image with Transformers, [Paper], [Code]

Transfer learning

  • (arXiv 2021.06) Transformer-Based Source-Free Domain Adaptation, [Paper], [Code]
  • (arXiv 2021.10) Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block, [Paper]
  • (arXiv 2021.10) Dispensed Transformer Network for Unsupervised Domain Adaptation, [Paper]
  • (arXiv 2021.11) Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation, [Paper]
  • (arXiv 2021.12) Pre-Training Transformers for Domain Adaptation, [Paper]
  • (arXiv 2022.01) Domain Adaptation via Bidirectional Cross-Attention Transformer, [Paper]

Video

  • (ECCV'20) Multi-modal Transformer for Video Retrieval, [Paper]
  • (ICLR'21) Support-set bottlenecks for video-text representation learning, [Paper]
  • (arXiv 2021.01) SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation, [Paper]
  • (arXiv 2021.02) Video Transformer Network, [Paper]
  • (arXiv 2021.02) Is Space-Time Attention All You Need for Video Understanding? [Paper], [Code]
  • (arXiv 2021.02) A Straightforward Framework For Video Retrieval Using CLIP, [Paper], [Code]
  • (arXiv 2021.03) Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning, [Paper]
  • (arXiv 2021.03) Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training, [Paper]
  • (arXiv 2021.03) MDMMT: Multidomain Multimodal Transformer for Video Retrieval, [Paper]
  • (arXiv 2021.03) An Image is Worth 16x16 Words, What is a Video Worth? [Paper]
  • (arXiv 2021.03) ViViT: A Video Vision Transformer, [Paper]
  • (arXiv 2021.04) Composable Augmentation Encoding for Video Representation Learning, [Paper]
  • (arXiv 2021.04) Temporal Query Networks for Fine-grained Video Understanding, [Paper], [Project]
  • (arXiv 2021.04) Higher Order Recurrent Space-Time Transformer, [Paper], [Code]
  • (arXiv 2021.04) VideoGPT: Video Generation using VQ-VAE and Transformers, [Paper], [Code]
  • (arXiv 2021.04) VidTr: Video Transformer Without Convolutions, [Paper]
  • (arXiv 2021.05) Local Frequency Domain Transformer Networks for Video Prediction, [Paper]
  • (arXiv 2021.05) End-to-End Video Object Detection with Spatial-Temporal Transformers, [Paper], [Code]
  • (arXiv 2021.06) Anticipative Video Transformer, [Paper], [Project]
  • (arXiv 2021.06) TransVOS: Video Object Segmentation with Transformers, [Paper]
  • (arXiv 2021.06) Associating Objects with Transformers for Video Object Segmentation, [Paper]
  • (arXiv 2021.06) Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers, [Paper]
  • (arXiv 2021.06) Space-time Mixing Attention for Video Transformer, [Paper]
  • (arXiv 2021.06) Video Instance Segmentation using Inter-Frame Communication Transformers, [Paper]
  • (arXiv 2021.06) Long-Short Temporal Contrastive Learning of Video Transformers, [Paper]
  • (arXiv 2021.06) Video Swin Transformer, [Paper], [Code]
  • (arXiv 2021.06) Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection, [Paper]
  • (arXiv 2021.07) Ultrasound Video Transformers for Cardiac Ejection Fraction Estimation, [Paper], [Code]
  • (arXiv 2021.07) Generative Video Transformer: Can Objects be the Words, [Paper]
  • (arXiv 2021.07) Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection, [Paper]
  • (arXiv 2021.08) Token Shift Transformer for Video Classification, [Paper], [Code]
  • (arXiv 2021.08) Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering, [Paper]
  • (arXiv 2021.08) Video Relation Detection via Tracklet based Visual Transformer, [Paper], [Code]
  • (arXiv 2021.08) MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition, [Paper]
  • (arXiv 2021.08) ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos, [Paper]
  • (arXiv 2021.09) FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting, [Paper], [Code]
  • (arXiv 2021.09) Hierarchical Multimodal Transformer to Summarize Videos, [Paper]
  • (arXiv 2021.10) Object-Region Video Transformers, [Paper], [Code]
  • (arXiv 2021.10) Can't Fool Me: Adversarially Robust Transformer for Video Understanding, [Paper], [Code]
  • (arXiv 2021.11) Livestock Monitoring with Transformer, [Paper]
  • (arXiv 2021.11) Sparse Adversarial Video Attacks with Spatial Transformations, [Paper], [Code]
  • (arXiv 2021.11) PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer, [Paper], [Code]
  • (arXiv 2021.11) Efficient Video Transformers with Spatial-Temporal Token Selection, [Paper]
  • (arXiv 2021.11) Video Frame Interpolation Transformer, [Paper]
  • (arXiv 2021.12) Self-supervised Video Transformer, [Paper], [Code]
  • (arXiv 2021.12) BEVT: BERT Pretraining of Video Transformers, [Paper]
  • (arXiv 2021.12) TBN-ViT: Temporal Bilateral Network with Vision Transformer for Video Scene Parsing, [Paper]
  • (arXiv 2021.12) Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval, [Paper]
  • (arXiv 2021.12) DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition, [Paper]
  • (arXiv 2021.12) A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer, [Paper], [Code]
  • (arXiv 2021.12) Mask2Former for Video Instance Segmentation, [Paper], [Code]
  • (arXiv 2021.12) LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach, [Paper]
  • (arXiv 2021.12) Video Joint Modelling Based on Hierarchical Transformer for Co-summarization, [Paper]
  • (arXiv 2021.12) Siamese Network with Interactive Transformer for Video Object Segmentation, [Paper], [Code]
  • (arXiv 2022.01) Flow-Guided Sparse Transformer for Video Deblurring, [Paper]
  • (arXiv 2022.01) Multiview Transformers for Video Recognition, [Paper]
  • (arXiv 2022.01) TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers, [Paper]
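Many of the video and image transformers above start from the tokenization step of "An Image is Worth 16x16 Words" (ViT): an image is cut into fixed-size patches that become the token sequence. A minimal NumPy sketch of that patchify step (illustrative only, before the learned linear projection):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patch tokens.

    Returns an (n_patches, patch*patch*C) array, one row per patch.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * c)

img = np.zeros((224, 224, 3))   # standard ViT input resolution
tokens = patchify(img)
print(tokens.shape)  # (196, 768)
```

Video transformers such as ViViT extend this by tokenizing spatio-temporal "tubelets" across frames rather than 2D patches; the 224x224 input and 16x16 patch size here are just the common defaults.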

Visual Grounding

  • (arXiv 2021.04) TransVG: End-to-End Visual Grounding with Transformers, [Paper]
  • (arXiv 2021.05) Visual Grounding with Transformers, [Paper]
  • (arXiv 2021.06) Referring Transformer: A One-step Approach to Multi-task Visual Grounding, [Paper]
  • (arXiv 2021.08) Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding, [Paper]
  • (arXiv 2021.08) TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding, [Paper]
  • (arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation, [Paper]

Visual Reasoning

  • (arXiv 2021.11) Recurrent Vision Transformer for Solving Visual Reasoning Problems, [Paper]

Visual Relationship Detection

  • (arXiv 2021.04) RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory, [Paper]
  • (arXiv 2021.05) Visual Composite Set Detection Using Part-and-Sum Transformers, [Paper]
  • (arXiv 2021.08) Discovering Spatial Relationships by Transformers for Domain Generalization, [Paper]

Voxel

  • (arXiv 2021.05) SVT-Net: A Super Light-Weight Network for Large Scale Place Recognition using Sparse Voxel Transformers, [Paper]
  • (arXiv 2021.09) Voxel Transformer for 3D Object Detection, [Paper]

Weakly Supervised Learning

  • (arXiv 2021.12) LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization, [Paper]
  • (arXiv 2022.01) CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization, [Paper]

Zero-Shot Learning

  • (arXiv 2021.08) Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning, [Paper]
  • (arXiv 2021.12) Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, [Paper]
  • (arXiv 2021.12) TransZero: Attribute-guided Transformer for Zero-Shot Learning, [Paper], [Code]
  • (arXiv 2021.12) TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning, [Paper], [Code]

Others

  • (CVPR'21) Transformer Interpretability Beyond Attention Visualization, [Paper], [Code]
  • (CVPR'21) Pre-Trained Image Processing Transformer, [Paper]
  • (ICCV'21) PlaneTR: Structure-Guided Transformers for 3D Plane Recovery, [Paper], [Code]
  • (arXiv 2021.01) Learn to Dance with AIST++: Music Conditioned 3D Dance Generation, [Paper], [Code]
  • (arXiv 2021.01) VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search, [Paper]
  • (arXiv 2021.01) Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry, [Paper]
  • (arXiv 2021.04) Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning, [Paper]
  • (arXiv 2021.04) Cloth Interactive Transformer for Virtual Try-On, [Paper], [Code]
  • (arXiv 2021.04) Fourier Image Transformer, [Paper], [Code]
  • (arXiv 2021.05) Attention for Image Registration (AiR): an unsupervised Transformer approach, [Paper]
  • (arXiv 2021.05) IntFormer: Predicting pedestrian intention with the aid of the Transformer architecture, [Paper]
  • (arXiv 2021.05) CogView: Mastering Text-to-Image Generation via Transformers, [Paper]
  • (arXiv 2021.06) A Comparison for Anti-noise Robustness of Deep Learning Classification Methods on a Tiny Object Image Dataset: from Convolutional Neural Network to Visual Transformer and Performer, [Paper]
  • (arXiv 2021.06) Predicting Vehicles Trajectories in Urban Scenarios with Transformer Networks and Augmented Information, [Paper]
  • (arXiv 2021.06) StyTr2: Unbiased Image Style Transfer with Transformers, [Paper]
  • (arXiv 2021.06) Semantic Correspondence with Transformers, [Paper]
  • (arXiv 2021.06) Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue, [Paper]
  • (arXiv 2021.07) Grid Partitioned Attention: Efficient Transformer Approximation with Inductive Bias for High Resolution Detail Generation, [Paper], [Code]
  • (arXiv 2021.07) Image Fusion Transformer, [Paper], [Code]
  • (arXiv 2021.07) PiSLTRc: Position-informed Sign Language Transformer with Content-aware Convolution, [Paper]
  • (arXiv 2021.07) PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion, [Paper]
  • (arXiv 2021.08) Applications of Artificial Neural Networks in Microorganism Image Analysis: A Comprehensive Review from Conventional Multilayer Perceptron to Popular Convolutional Neural Network and Potential Visual Transformer, [Paper]
  • (arXiv 2021.08) Paint Transformer: Feed Forward Neural Painting with Stroke Prediction, [Paper], [Code]
  • (arXiv 2021.08) The Right to Talk: An Audio-Visual Transformer Approach, [Paper], [Code]
  • (arXiv 2021.08) Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion, [Paper], [Code]
  • (arXiv 2021.08) Vision-Language Transformer and Query Generation for Referring Segmentation, [Paper], [Code]
  • (arXiv 2021.08) TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation, [Paper]
  • (arXiv 2021.08) Investigating transformers in the decomposition of polygonal shapes as point collections, [Paper]
  • (arXiv 2021.08) Convolutional Neural Network (CNN) vs Visual Transformer (ViT) for Digital Holography, [Paper]
  • (arXiv 2021.08) Construction material classification on imbalanced datasets for construction monitoring automation using Vision Transformer (ViT) architecture, [Paper]
  • (arXiv 2021.08) Spatial Transformer Networks for Curriculum Learning, [Paper]
  • (arXiv 2021.09) TransforMesh: A Transformer Network for Longitudinal modeling of Anatomical Meshes, [Paper]
  • (arXiv 2021.09) CTRL-C: Camera calibration TRansformer with Line-Classification, [Paper], [Code]
  • (arXiv 2021.09) The Animation Transformer: Visual Correspondence via Segment Matching, [Paper]
  • (arXiv 2021.09) CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation, [Paper]
  • (arXiv 2021.09) Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer, [Paper]
  • (arXiv 2021.09) PETA: Photo Albums Event Recognition using Transformers Attention, [Paper], [Code]
  • (arXiv 2021.10) ProTo: Program-Guided Transformer for Program-Guided Tasks, [Paper]
  • (arXiv 2021.10) TranSalNet: Visual saliency prediction using transformers, [Paper]
  • (arXiv 2021.10) Development and testing of an image transformer for explainable autonomous driving systems, [Paper]
  • (arXiv 2021.10) Leveraging redundancy in attention with Reuse Transformers, [Paper]
  • (arXiv 2021.10) Tensor-to-Image: Image-to-Image Translation with Vision Transformers, [Paper]
  • (arXiv 2021.10) Accelerating Framework of Transformer by hardware Design and Model Compression Co-Optimization, [Paper]
  • (arXiv 2021.10) Vis-TOP: Visual Transformer Overlay Processor, [Paper]
  • (arXiv 2021.10) TNTC: two-stream network with transformer-based complementarity for gait-based emotion recognition, [Paper]
  • (arXiv 2021.11) The self-supervised channel-spatial attention-based transformer network for automated, accurate prediction of crop nitrogen status from UAV imagery, [Paper]
  • (arXiv 2021.11) TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance, [Paper]
  • (arXiv 2021.11) Grounded Situation Recognition with Transformers, [Paper], [Code]
  • (arXiv 2021.11) U-shape Transformer for Underwater Image Enhancement, [Paper]
  • (arXiv 2021.11) Ice hockey player identification via transformers, [Paper]
  • (arXiv 2021.11) Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes, [Paper]
  • (arXiv 2021.11) Attention-based Dual-stream Vision Transformer for Radar Gait Recognition, [Paper]
  • (arXiv 2021.11) TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions, [Paper], [Code]
  • (arXiv 2021.11) BuildFormer: Automatic building extraction with vision transformer, [Paper]
  • (arXiv 2021.12) DoodleFormer: Creative Sketch Drawing with Transformers, [Paper]
  • (arXiv 2021.12) Transformer based trajectory prediction, [Paper]
  • (arXiv 2021.12) Deep ViT Features as Dense Visual Descriptors, [Paper], [Project]
  • (arXiv 2021.12) Hformer: Hybrid CNN-Transformer for Fringe Order Prediction in Phase Unwrapping of Fringe Projection, [Paper]
  • (arXiv 2021.12) 3D Question Answering, [Paper]
  • (arXiv 2021.12) Light Field Neural Rendering, [Paper], [Project]
  • (arXiv 2021.12) Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers, [Paper]
  • (arXiv 2021.12) Nonlinear Transform Source-Channel Coding for Semantic Communications, [Paper]
  • (arXiv 2021.12) APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers, [Paper]
  • (arXiv 2022.01) Splicing ViT Features for Semantic Appearance Transfer, [Paper], [Project]
  • (arXiv 2022.01) A Transformer-Based Siamese Network for Change Detection, [Paper], [Code]
  • (arXiv 2022.01) Learning class prototypes from Synthetic InSAR with Vision Transformers, [Paper]
  • (arXiv 2022.01) Swin transformers make strong contextual encoders for VHR image road extraction, [Paper]
  • (arXiv 2022.01) Technical Report for ICCV 2021 Challenge SSLAD-Track3B: Transformers Are Better Continual Learners, [Paper]
  • (arXiv 2022.01) Spectral Compressive Imaging Reconstruction Using Convolution and Spectral Contextual Transformer, [Paper]
  • (arXiv 2022.01) VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer, [Paper]
  • (arXiv 2022.01) Continual Transformers: Redundancy-Free Attention for Online Inference, [Paper]
  • (arXiv 2022.01) Disentangled Latent Transformer for Interpretable Monocular Height Estimation, [Paper], [Code]
  • (arXiv 2022.01) Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation, [Paper], [Code]

Contact & Feedback

If you have any suggestions about this project, feel free to contact me.

  • [e-mail: yzhangcst[at]gmail.com]

Comments
  • Please add RelViT

    Please add RelViT

    Hi,

    Thanks for making this learning list and indeed I learned a lot. Just want to share one of our recent works on transformers and I hope it could help the community through your platform:

    RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning (ICLR 2022) arxiv | code In this work, we propose a better training scheme for vision transformers and testify it on VQA, HOI, and visual reasoning tasks. We further introduce concept-guided contrastive learning that helps these models master visual reasoning without massive pertaining or extra training data.

    opened by jeasinema 2
  • Please add our paper to your list

    Please add our paper to your list

    Our paper titled "Bilateral-ViT for Robust Fovea Localization" has been accepted to ISBI 2022 conference and a preprint is available at this link: https://arxiv.org/abs/2110.09860

    I would appreciate it if you could add our paper to the list. Thanks!

    opened by jacobdang 1
  • Code is available

    Hi @Yangzhangcst, thank you for this repo. The code for the paper "STAR: Sparse Transformer-based Action Recognition" is available at https://github.com/imj2185/STAR.

    opened by shi27feng 1
GluonMM is a library of transformer models for computer vision and multi-modality research. It contains reference implementations of widely adopted baseline models and also research work from Amazon Research.
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

Vision Longformer: this project provides the source code for the paper "Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding".
This is an official implementation of "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" for object detection and instance segmentation.

Swin Transformer for Object Detection: this repo contains the supported code and configuration files to reproduce the object detection results of Swin Transformer.
Shuffle Transformer: the implementation of "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer".
Unofficial implementation of "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (https://arxiv.org/abs/2103.14030)

Swin-Transformer-Tensorflow: a direct translation of the official PyTorch implementation of "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

CSWin-Transformer: the official implementation of "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows".
VSR-Transformer: this paper proposes a new Transformer for video super-resolution (called VSR-Transformer).

By Jiezhang Cao, Yawei Li, Kai Zhang, Luc Van Gool.
Open source Python module for computer vision

PCV is a pure Python library for computer vision based on the book "Programming Computer Vision with Python" by Jan Erik Solem.
PyTorchCV: A PyTorch-Based Framework for Deep Learning in Computer Vision, by Donny You ([email protected]).