Transformer-in-Vision

Overview

A curated list of recent Transformer-based computer vision (CV) works and related papers. Comments and contributions are welcome!

Updated regularly.
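
As a quick orientation for readers new to the area: nearly every entry below builds on scaled dot-product self-attention over a sequence of tokens (for vision, usually flattened image patches). The snippet below is a minimal single-head sketch of that operation in PyTorch; all shapes and names are illustrative and are not taken from any particular paper in this list.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Minimal single-head scaled dot-product self-attention (illustrative only).

    x:   (batch, num_tokens, dim) -- e.g. flattened image patches
    w_*: (dim, dim) projection matrices for queries / keys / values
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v                       # project tokens
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # (batch, tokens, tokens)
    attn = F.softmax(scores, dim=-1)                          # each token attends to all tokens
    return attn @ v                                           # attention-weighted sum of values

# Illustrative usage: 196 patch tokens (a 14x14 grid) of dimension 64
x = torch.randn(1, 196, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([1, 196, 64])
```

Many of the works collected here can be read as variations on this recipe: restricting which tokens attend to which (windowed, sparse, or quadtree attention), changing how tokens are formed (patches, points, video clips), or replacing attention with MLP mixing.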

Resources

Surveys

  • (arXiv 2021.11) ARE WE READY FOR A NEW PARADIGM SHIFT? A SURVEY ON VISUAL DEEP MLP, [Paper]

  • (arXiv 2021.11) A Survey of Visual Transformers, [Paper]

  • (arXiv 2021.09) Survey: Transformer based Video-Language Pre-training, [Paper]

  • (arXiv 2021.06) A Survey of Transformers, [Paper]

  • (arXiv 2021.06) Attention mechanisms and deep learning for machine vision: A survey of the state of the art, [Paper]

  • (arXiv 2021.06) Pre-Trained Models: Past, Present and Future, [Paper]

  • (arXiv 2021.05) Can Attention Enable MLPs To Catch Up With CNNs? [Paper]

  • (arXiv 2021.03) A Practical Survey on Faster and Lighter Transformers, [Paper]

  • (arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision, [Paper]

  • (arXiv 2021.01) Transformers in Vision: A Survey, [Paper]

  • (arXiv 2020.12) A Survey on Visual Transformer, [Paper]

  • (arXiv 2020.09) Efficient Transformers: A Survey, [Paper]

Recent Papers

2022.01

  • (arXiv 2022.01) ViT2Hash: Unsupervised Information-Preserving Hashing, [Paper]

  • (arXiv 2022.01) LANGUAGE-DRIVEN SEMANTIC SEGMENTATION, [Paper], [Code]

  • (arXiv 2022.01) Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond, [Paper], [Code]

  • (arXiv 2022.01) ImageSubject: A Large-scale Dataset for Subject Detection, [Paper]

  • (arXiv 2022.01) Detecting Twenty-thousand Classes using Image-level Supervision, [Paper], [Code]

  • (arXiv 2022.01) Generalized Category Discovery, [Paper], [Code]

  • (arXiv 2022.01) Video Summarization Based on Video-text Modelling, [Paper]

  • (arXiv 2022.01) Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition, [Paper], [Code]

  • (arXiv 2022.01) QUADTREE ATTENTION FOR VISION TRANSFORMERS, [Paper], [Code]

  • (arXiv 2022.01) A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval, [Paper], [Project]

  • (arXiv 2022.01) MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound, [Paper], [Project]

  • (arXiv 2022.01) On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering, [Paper]

  • (arXiv 2022.01) Pyramid Fusion Transformer for Semantic Segmentation, [Paper]

  • (arXiv 2022.01) Multiview Transformers for Video Recognition, [Paper]

  • (arXiv 2022.01) HYPERTRANSFORMER: MODEL GENERATION FOR SUPERVISED AND SEMI-SUPERVISED FEW-SHOT LEARNING, [Paper]

  • (arXiv 2022.01) UNIFORMER: UNIFIED TRANSFORMER FOR EFFICIENT SPATIOTEMPORAL REPRESENTATION LEARNING, [Paper], [Code]

  • (arXiv 2022.01) BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions, [Paper], [Project]

  • (arXiv 2022.01) TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers, [Paper]

  • (arXiv 2022.01) CLIP-Event: Connecting Text and Images with Event Structures, [Paper], [Code]

  • (arXiv 2022.01) Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training, [Paper]

  • (arXiv 2022.01) Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention, [Paper], [Code]

  • (arXiv 2022.01) Self-Training Vision Language BERTs with a Unified Conditional Model, [Paper]

  • (arXiv 2022.01) TransVPR: Transformer-based place recognition with multi-level attention aggregation, [Paper]

  • (arXiv 2022.01) Compact Bidirectional Transformer for Image Captioning, [Paper], [Code]

  • (arXiv 2022.01) Flow-Guided Sparse Transformer for Video Deblurring, [Paper]

  • (arXiv 2022.01) Stochastic Layers in Vision Transformers, [Paper]

  • (arXiv 2022.01) ERNIE-VILG: UNIFIED GENERATIVE PRE-TRAINING FOR BIDIRECTIONAL VISION-LANGUAGE GENERATION, [Paper]

  • (arXiv 2022.01) InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer, [Paper], [Code]

  • (arXiv 2022.01) CSformer: Bridging Convolution and Transformer for Compressive Sensing, [Paper]

  • (arXiv 2022.01) Persformer: A Transformer Architecture for Topological Machine Learning, [Paper]

  • (arXiv 2022.01) Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space, [Paper]

  • (arXiv 2022.01) Language as Queries for Referring Video Object Segmentation, [Paper], [Code]

  • (arXiv 2022.01) PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture, [Paper], [Code]

  • (arXiv 2022.01) A TRANSFORMER-BASED SIAMESE NETWORK FOR CHANGE DETECTION, [Paper], [Code]

  • (arXiv 2022.01) Vision Transformer with Deformable Attention, [Paper], [Code]

  • (arXiv 2022.01) Splicing ViT Features for Semantic Appearance Transfer, [Paper], [Project]

  • (arXiv 2022.01) Detail-Preserving Transformer for Light Field Image Super-Resolution, [Paper], [Code]

2021.12

  • (arXiv 2021.12) Multi-Dimensional Model Compression of Vision Transformer, [Paper]

  • (arXiv 2021.12) Siamese Network with Interactive Transformer for Video Object Segmentation, [Paper], [Code]

  • (arXiv 2021.12) Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention, [Paper], [Code]

  • (arXiv 2021.12) APRIL: Finding the Achilles’ Heel on Privacy for Vision Transformers, [Paper]

  • (arXiv 2021.12) Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation, [Paper]

  • (arXiv 2021.12) Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?, [Paper]

  • (arXiv 2021.12) SPViT: Enabling Faster Vision Transformers via Soft Token Pruning, [Paper]

  • (arXiv 2021.12) A FISTFUL OF WORDS: LEARNING TRANSFERABLE VISUAL MODELS FROM BAG-OF-WORDS SUPERVISION, [Paper]

  • (arXiv 2021.12) StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2, [Paper], [Code]

  • (arXiv 2021.12) A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model, [Paper], [Code]

  • (arXiv 2021.12) Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence, [Paper]

  • (arXiv 2021.12) SIMVIT: EXPLORING A SIMPLE VISION TRANSFORMER WITH SLIDING WINDOWS, [Paper], [Code]

  • (arXiv 2021.12) SGTR: End-to-end Scene Graph Generation with Transformer, [Paper]

  • (arXiv 2021.12) Video Joint Modelling Based on Hierarchical Transformer for Co-summarization, [Paper]

  • (arXiv 2021.12) Vision Transformer for Small-Size Datasets, [Paper]

  • (arXiv 2021.12) Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction, [Paper]

  • (arXiv 2021.12) ViR: the Vision Reservoir, [Paper]

  • (arXiv 2021.12) SeMask: Semantically Masked Transformers for Semantic Segmentation, [Paper], [Code]

  • (arXiv 2021.12) Open-Vocabulary Image Segmentation, [Paper]

  • (arXiv 2021.12) ELSA: Enhanced Local Self-Attention for Vision Transformer, [Paper], [Code]

  • (arXiv 2021.12) LaTr: Layout-Aware Transformer for Scene-Text VQA, [Paper]

  • (arXiv 2021.12) Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding, [Paper]

  • (arXiv 2021.12) Fine-grained Multi-Modal Self-Supervised Learning, [Paper]

  • (arXiv 2021.12) SLIP: Self-supervision meets Language-Image Pre-training, [Paper], [Code]

  • (arXiv 2021.12) CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes, [Paper]

  • (arXiv 2021.12) MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input Adaptation, [Paper]

  • (arXiv 2021.12) iSegFormer: Interactive Image Segmentation with Transformers, [Paper], [Code]

  • (arXiv 2021.12) Contrastive Object Detection Using Knowledge Graph Embeddings, [Paper]

  • (arXiv 2021.12) RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality, [Paper], [Code]

  • (arXiv 2021.12) Lite Vision Transformer with Enhanced Self-Attention, [Paper], [Code]

  • (arXiv 2021.12) MPViT: Multi-Path Vision Transformer for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.12) SOIT: Segmenting Objects with Instance-Aware Transformers, [Paper], [Code]

  • (arXiv 2021.12) Learned Queries for Efficient Local Attention, [Paper], [Code]

  • (arXiv 2021.12) On Efficient Transformer and Image Pre-training for Low-level Vision, [Paper], [Code]

  • (arXiv 2021.12) LOCFORMER: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach, [Paper]

  • (arXiv 2021.12) Tell me what you see: A zero-shot action recognition method based on natural language descriptions, [Paper], [Code]

  • (arXiv 2021.12) Pre-Training Transformers for Domain Adaptation, [Paper]

  • (arXiv 2021.12) ScanQA: 3D Question Answering for Spatial Scene Understanding, [Paper]

  • (arXiv 2021.12) Are Large-scale Datasets Necessary for Self-Supervised Pre-training? [Paper]

  • (arXiv 2021.12) StyleSwin: Transformer-based GAN for High-resolution Image Generation, [Paper], [Code]

  • (arXiv 2021.12) Mask2Former for Video Instance Segmentation, [Paper], [Code]

  • (arXiv 2021.12) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, [Paper], [Code]

  • (arXiv 2021.12) Efficient Visual Tracking with Exemplar Transformers, [Paper], [Code]

  • (arXiv 2021.12) Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers, [Paper]

  • (arXiv 2021.12) Align and Prompt: Video-and-Language Pre-training with Entity Prompts, [Paper], [Code]

  • (arXiv 2021.12) DATA EFFICIENT LANGUAGE-SUPERVISED ZERO-SHOT RECOGNITION WITH OPTIMAL TRANSPORT DISTILLATION, [Paper]

  • (arXiv 2021.12) SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers, [Paper]

  • (arXiv 2021.12) Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction, [Paper], [Code]

  • (arXiv 2021.12) ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources, [Paper]

  • (arXiv 2021.12) Towards End-to-End Image Compression and Analysis with Transformers, [Paper]

  • (arXiv 2021.12) How to augment your ViTs? Consistency loss and StyleAug, a random style transfer augmentation, [Paper]

  • (arXiv 2021.12) Learning to Prompt for Continual Learning, [Paper], [Code]

  • (arXiv 2021.12) Distilled Dual-Encoder Model for Vision-Language Understanding, [Paper], [Code]

  • (arXiv 2021.12) Dense Video Captioning Using Unsupervised Semantic Information, [Paper], [Code]

  • (arXiv 2021.12) Looking Outside the Box to Ground Language in 3D Scenes, [Paper], [Code]

  • (arXiv 2021.12) RegionCLIP: Region-based Language-Image Pretraining, [Paper], [Code]

  • (arXiv 2021.12) DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer, [Paper]

  • (arXiv 2021.12) Masked Feature Prediction for Self-Supervised Visual Pre-Training, [Paper]

  • (arXiv 2021.12) SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning, [Paper]

  • (arXiv 2021.12) TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning, [Paper], [Code]

  • (arXiv 2021.12) Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos, [Paper], [Code]

  • (arXiv 2021.12) Co-training Transformer with Videos and Images Improves Action Recognition, [Paper]

  • (arXiv 2021.12) QAHOI: Query-Based Anchors for Human-Object Interaction Detection, [Paper], [Code]

  • (arXiv 2021.12) AdaViT: Adaptive Tokens for Efficient Vision Transformer, [Paper]

  • (arXiv 2021.12) CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations, [Paper]

  • (arXiv 2021.12) Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, [Paper]

  • (arXiv 2021.12) Deep ViT Features as Dense Visual Descriptors, [Paper], [Project]

  • (arXiv 2021.12) Geometry-Contrastive Transformer for Generalized 3D Pose Transfer, [Paper], [Code]

  • (arXiv 2021.12) Temporal Transformer Networks with Self-Supervision for Action Recognition, [Paper]

  • (arXiv 2021.12) COMPOSER: Compositional Learning of Group Activity in Videos, [Paper]

  • (arXiv 2021.12) Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition, [Paper]

  • (arXiv 2021.12) Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection, [Paper]

  • (arXiv 2021.12) SVIP: Sequence VerIfication for Procedures in Videos, [Paper]

  • (arXiv 2021.12) Improving Vision Transformers for Incremental Learning, [Paper]

  • (arXiv 2021.12) VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks, [Paper], [Code]

  • (arXiv 2021.12) Embracing Single Stride 3D Object Detector with Sparse Transformer, [Paper], [Code]

  • (arXiv 2021.12) PartGlot: Learning Shape Part Segmentation from Language Reference Games, [Paper]

  • (arXiv 2021.12) Pedestrian Trajectory Prediction via Spatial Interaction Transformer Network, [Paper]

  • (arXiv 2021.12) LEARNING SEMANTIC-ALIGNED FEATURE REPRESENTATION FOR TEXT-BASED PERSON SEARCH, [Paper]

  • (arXiv 2021.12) L-Verse: Bidirectional Generation Between Image and Text, [Paper]

  • (arXiv 2021.12) SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY, [Paper]

  • (arXiv 2021.12) Are Vision Transformers Robust to Patch Perturbations? [Paper]

  • (arXiv 2021.12) Mesa: A Memory-saving Training Framework for Transformers, [Paper], [Code]

  • (arXiv 2021.12) Injecting Semantic Concepts into End-to-End Image Captioning, [Paper]

  • (arXiv 2021.12) MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning, [Paper]

  • (arXiv 2021.12) LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization, [Paper]

  • (arXiv 2021.12) FaceFormer: Speech-Driven 3D Facial Animation with Transformers, [Paper]

  • (arXiv 2021.12) Rethinking the Two-Stage Framework for Grounded Situation Recognition, [Paper], [Code]

  • (arXiv 2021.12) CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions, [Paper]

  • (arXiv 2021.12) Couplformer: Rethinking Vision Transformer with Coupling Attention Map, [Paper]

  • (arXiv 2021.12) Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation, [Paper]

  • (arXiv 2021.12) Visual Transformers with Primal Object Queries for Multi-Label Image Classification, [Paper]

  • (arXiv 2021.12) Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training, [Paper], [Code]

  • (arXiv 2021.12) MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection, [Paper]

  • (arXiv 2021.12) Grounded Language-Image Pre-training, [Paper], [Code]

  • (arXiv 2021.12) U^2-Former: A Nested U-shaped Transformer for Image Restoration, [Paper]

  • (arXiv 2021.12) ADAPTIVE CHANNEL ENCODING TRANSFORMER FOR POINT CLOUD ANALYSIS, [Paper]

  • (arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer, [Paper], [Code]

  • (arXiv 2021.12) VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts, [Paper]

  • (arXiv 2021.12) PointCLIP: Point Cloud Understanding by CLIP, [Paper], [Code]

  • (arXiv 2021.12) Learning Tracking Representations via Dual-Branch Fully Transformer Networks, [Paper], [Code]

  • (arXiv 2021.12) DYNAMIC TOKEN NORMALIZATION IMPROVES VISION TRANSFORMER, [Paper], [Code]

  • (arXiv 2021.12) PTTR: Relational 3D Point Cloud Object Tracking with Transformer, [Paper], [Code]

  • (arXiv 2021.12) GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation, [Paper]

  • (arXiv 2021.12) Text2Mesh: Text-Driven Neural Stylization for Meshes, [Paper], [Project]

  • (arXiv 2021.12) LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences, [Paper]

  • (arXiv 2021.12) Make A Long Image Short: Adaptive Token Length for Vision Transformers, [Paper]

  • (arXiv 2021.12) FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization, [Paper], [Code]

  • (arXiv 2021.12) TransZero: Attribute-guided Transformer for Zero-Shot Learning, [Paper], [Code]

  • (arXiv 2021.12) Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer, [Paper], [Code]

  • (arXiv 2021.12) Hformer: Hybrid CNN-Transformer for Fringe Order Prediction in Phase Unwrapping of Fringe Projection, [Paper]

  • (arXiv 2021.12) Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks, [Paper]

  • (arXiv 2021.12) Transformer based trajectory prediction, [Paper]

  • (arXiv 2021.12) Evaluating Transformers for Lightweight Action Recognition, [Paper]

  • (arXiv 2021.12) Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision, [Paper]

  • (arXiv 2021.12) CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification, [Paper]

  • (arXiv 2021.12) Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training, [Paper]

  • (arXiv 2021.12) Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal, [Paper], [Code]

  • (arXiv 2021.12) DoodleFormer: Creative Sketch Drawing with Transformers, [Paper]

  • (arXiv 2021.12) Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning, [Paper]

  • (arXiv 2021.12) AUDIO-VISUAL SYNCHRONISATION IN THE WILD, [Paper], [Project]

  • (arXiv 2021.12) Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs, [Paper]

  • (arXiv 2021.12) Garment4D: Garment Reconstruction from Point Cloud Sequences, [Paper], [Code]

  • (arXiv 2021.12) Locally Shifted Attention With Early Global Integration, [Paper], [Code]

  • (arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable Layout Generation, [Paper]

  • (arXiv 2021.12) PE-former: Pose Estimation Transformer, [Paper], [Project]

  • (arXiv 2021.12) HairCLIP: Design Your Hair by Text and Reference Image, [Paper], [Project]

  • (arXiv 2021.12) CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields, [Paper], [Code]

  • (arXiv 2021.12) A Bilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer, [Paper], [Code], [Dataset]

  • (arXiv 2021.12) DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition, [Paper], [Code]

  • (arXiv 2021.12) Recurrent Glimpse-based Decoder for Detection with Transformer, [Paper], [Code]

  • (arXiv 2021.12) Fast Point Transformer, [Paper]

  • (arXiv 2021.12) Assistive Tele-op: Leveraging Transformers to Collect Robotic Task Demonstrations, [Paper], [Project]

  • (arXiv 2021.12) Cross-Modality Fusion Transformer for Multispectral Object Detection, [Paper]

  • (arXiv 2021.12) PatchFormer: An Efficient Point Transformer with Patch Attention, [Paper]

  • (arXiv 2021.12) Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical documents, [Paper]

  • (arXiv 2021.12) MLP Architectures for Vision-and-Language Modeling: An Empirical Study, [Paper], [Code]

  • (arXiv 2021.12) Everything at Once – Multi-modal Fusion Transformer for Video Retrieval, [Paper]

  • (arXiv 2021.12) Prompting Visual-Language Models for Efficient Video Understanding, [Paper], [Project]

  • (arXiv 2021.12) FLAVA: A Foundational Language And Vision Alignment Model, [Paper]

  • (arXiv 2021.12) Embedding Arithmetic for Text-driven Image Transformation, [Paper]

  • (arXiv 2021.12) LAVT: Language-Aware Vision Transformer for Referring Image Segmentation, [Paper]

  • (arXiv 2021.12) Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos, [Paper], [Project]

  • (arXiv 2021.12) Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, [Paper]

  • (arXiv 2021.12) DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting, [Paper], [Code]

  • (arXiv 2021.12) Self-supervised Video Transformer, [Paper], [Code]

  • (arXiv 2021.12) OW-DETR: Open-world Detection Transformer, [Paper]

  • (arXiv 2021.12) Zero-Shot Text-Guided Object Generation with Dream Fields, [Paper], [Project]

  • (arXiv 2021.12) Video-Text Pre-training with Learned Regions, [Paper], [Code]

  • (arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection, [Paper]

  • (arXiv 2021.12) TCTN: A 3D-Temporal Convolutional Transformer Network for Spatiotemporal Predictive Learning, [Paper]

  • (arXiv 2021.12) DenseCLIP: Extract Free Dense Labels from CLIP, [Paper]

  • (arXiv 2021.12) TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning, [Paper]

  • (arXiv 2021.12) SwinTrack: A Simple and Strong Baseline for Transformer Tracking, [Paper], [Code]

  • (arXiv 2021.12) Object-Centric Unsupervised Image Captioning, [Paper]

  • (arXiv 2021.12) Vision Pair Learning: An Efficient Training Framework for Image Classification, [Paper]

  • (arXiv 2021.12) Visual-Semantic Transformer for Scene Text Recognition, [Paper]

  • (arXiv 2021.12) Differentiable Spatial Planning using Transformers, [Paper], [Project]

  • (arXiv 2021.12) Improved Multiscale Vision Transformers for Classification and Detection, [Paper]

  • (arXiv 2021.12) Masked-attention Mask Transformer for Universal Image Segmentation, [Paper], [Code]

  • (arXiv 2021.12) BEVT: BERT Pretraining of Video Transformers, [Paper]

  • (arXiv 2021.12) Human-Object Interaction Detection via Weak Supervision, [Paper]

  • (arXiv 2021.12) Learning Transformer Features for Image Quality Assessment, [Paper]

  • (arXiv 2021.12) CLIPstyler: Image Style Transfer with a Single Text Condition, [Paper]

  • (arXiv 2021.12) Multi-View Stereo with Transformer, [Paper]

  • (arXiv 2021.12) VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion, [Paper], [Code]

  • (arXiv 2021.12) Object-aware Video-language Pre-training for Retrieval, [Paper], [Code]

2021.11

  • (arXiv 2021.11) Multi-modal Transformers Excel at Class-agnostic Object Detection, [Paper], [Code]

  • (arXiv 2021.11) Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model, [Paper]

  • (arXiv 2021.11) NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.11) PolyViT: Co-training Vision Transformers on Images, Videos and Audio, [Paper]

  • (arXiv 2021.11) SWAT: Spatial Structure Within and Among Tokens, [Paper]

  • (arXiv 2021.11) ADAPTIVE FOURIER NEURAL OPERATORS: EFFICIENT TOKEN MIXERS FOR TRANSFORMERS, [Paper]

  • (arXiv 2021.11) DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion, [Paper], [Code]

  • (arXiv 2021.11) DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning, [Paper], [Code]

  • (arXiv 2021.11) Ice hockey player identification via transformers, [Paper]

  • (arXiv 2021.11) DBIA: Data-free Backdoor Injection Attack against Transformer Networks, [Paper], [Code]

  • (arXiv 2021.11) Sparse Fusion for Multimodal Transformers, [Paper]

  • (arXiv 2021.11) PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer, [Paper], [Code]

  • (arXiv 2021.11) Self-Supervised Pre-Training for Transformer-Based Person Re-Identification, [Paper], [Code]

  • (arXiv 2021.11) DISCRETE REPRESENTATIONS STRENGTHEN VISION TRANSFORMER ROBUSTNESS, [Paper]

  • (arXiv 2021.11) TRAVLR: Now You See It, Now You Don’t! Evaluating Cross-Modal Transfer of Visio-Linguistic Reasoning, [Paper]

  • (arXiv 2021.11) Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling, [Paper]

  • (arXiv 2021.11) Semi-Supervised Vision Transformers, [Paper]

  • (arXiv 2021.11) CpT: Convolutional Point Transformer for 3D Point Cloud Processing, [Paper]

  • (arXiv 2021.11) ZERO-SHOT CERTIFIED DEFENSE AGAINST ADVERSARIAL PATCHES WITH VISION TRANSFORMERS, [Paper]

  • (arXiv 2021.11) PointMixer: MLP-Mixer for Point Cloud Understanding, [Paper]

  • (arXiv 2021.11) MetaFormer is Actually What You Need for Vision, [Paper], [Code]

  • (arXiv 2021.11) Florence: A New Foundation Model for Computer Vision, [Paper]

  • (arXiv 2021.11) Benchmarking Detection Transfer Learning with Vision Transformers, [Paper]

  • (arXiv 2021.11) Learning to Compose Visual Relations, [Paper], [Project]

  • (arXiv 2021.11) REFERENCE-BASED MAGNETIC RESONANCE IMAGE RECONSTRUCTION USING TEXTURE TRANSFORMER, [Paper]

  • (arXiv 2021.11) Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval, [Paper]

  • (arXiv 2021.11) Swin Transformer V2: Scaling Up Capacity and Resolution, [Paper], [Code]

  • (arXiv 2021.11) SimMIM: A Simple Framework for Masked Image Modeling, [Paper], [Code]

  • (arXiv 2021.11) Restormer: Efficient Transformer for High-Resolution Image Restoration, [Paper], [Code]

  • (arXiv 2021.11) Simple but Effective: CLIP Embeddings for Embodied AI, [Paper]

  • (arXiv 2021.11) ClipCap: CLIP Prefix for Image Captioning, [Paper], [Code]

  • (arXiv 2021.11) TransMix: Attend to Mix for Vision Transformers, [Paper], [Code]

  • (arXiv 2021.11) TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance, [Paper]

  • (arXiv 2021.11) Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, [Paper], [Code]

  • (arXiv 2021.11) Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning, [Paper], [Code]

  • (arXiv 2021.11) Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement, [Paper], [Code]

  • (arXiv 2021.11) Tracking People with 3D Representations, [Paper], [Code]

  • (arXiv 2021.11) LiT: Zero-Shot Transfer with Locked-image Text Tuning, [Paper]

  • (arXiv 2021.11) FILIP: FINE-GRAINED INTERACTIVE LANGUAGE-IMAGE PRE-TRAINING, [Paper]

  • (arXiv 2021.11) Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture, [Paper], [Code]

  • (arXiv 2021.11) Attention Approximates Sparse Distributed Memory, [Paper]

  • (arXiv 2021.11) SLICED RECURSIVE TRANSFORMER, [Paper], [Code]

  • (arXiv 2021.11) HYBRID BYOL-VIT: EFFICIENT APPROACH TO DEAL WITH SMALL DATASETS, [Paper]

  • (arXiv 2021.11) Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling, [Paper], [Code]

  • (arXiv 2021.11) Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers, [Paper]

  • (arXiv 2021.11) StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis, [Paper], [Code]

  • (arXiv 2021.11) Revisiting spatio-temporal layouts for compositional action recognition, [Paper], [Code]

  • (arXiv 2021.11) PatchGame: Learning to Signal Mid-level Patches in Referential Games, [Paper], [Code]

  • (arXiv 2021.11) CAN VISION TRANSFORMERS PERFORM CONVOLUTION? [Paper]

  • (arXiv 2021.11) Livestock Monitoring with Transformer, [Paper]

  • (arXiv 2021.11) With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition, [Paper], [Code]

  • (arXiv 2021.11) IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning, [Paper], [Project]

  • (arXiv 2021.11) BoxeR: Box-Attention for 2D and 3D Transformers, [Paper]

  • (arXiv 2021.11) VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval, [Paper]

  • (arXiv 2021.11) Multi-Person 3D Motion Prediction with Multi-Range Transformers, [Paper], [Code]

  • (arXiv 2021.11) Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations, [Paper], [Project]

  • (arXiv 2021.11) Global Interaction Modelling in Vision Transformer via Super Tokens, [Paper]

  • (arXiv 2021.11) ML-Decoder: Scalable and Versatile Classification Head, [Paper], [Code]

  • (arXiv 2021.11) Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation, [Paper]

  • (arXiv 2021.11) SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning, [Paper]

  • (arXiv 2021.11) Amortized Prompt: Lightweight Fine-Tuning for CLIP in Domain Generalization, [Paper]

  • (arXiv 2021.11) Universal Captioner: Long-Tail Vision-and-Language Model Training through Content-Style Separation, [Paper]

  • (arXiv 2021.11) Sparse is Enough in Scaling Transformers, [Paper]

  • (arXiv 2021.11) An implementation of the “Guess who?” game using CLIP, [Paper], [Code]

  • (arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for Structured Reconstruction, [Paper]

  • (arXiv 2021.11) A Unified Pruning Framework for Vision Transformers, [Paper]

  • (arXiv 2021.11) Pyramid Adversarial Training Improves ViT Performance, [Paper]

  • (arXiv 2021.11) AssistSR: Affordance-centric Question-driven Video Segment Retrieval, [Paper], [Code & Data]

  • (arXiv 2021.11) DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation, [Paper], [Code]

  • (arXiv 2021.11) AdaViT: Adaptive Vision Transformers for Efficient Image Recognition, [Paper]

  • (arXiv 2021.11) ATS: Adaptive Token Sampling For Efficient Vision Transformers, [Paper]

  • (arXiv 2021.11) CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning, [Paper]

  • (arXiv 2021.11) CRIS: CLIP-Driven Referring Image Segmentation, [Paper]

  • (arXiv 2021.11) Shunted Self-Attention via Multi-Scale Token Aggregation, [Paper], [Code]

  • (arXiv 2021.11) MC-SSL0.0: Towards Multi-Concept Self-Supervised Learning, [Paper]

  • (arXiv 2021.11) TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions, [Paper], [Code]

  • (arXiv 2021.11) Searching the Search Space of Vision Transformer, [Paper], [Code]

  • (arXiv 2021.11) TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers, [Paper], [Code]

  • (arXiv 2021.11) Recurrent Vision Transformer for Solving Visual Reasoning Problems, [Paper]

  • (arXiv 2021.11) Video Frame Interpolation Transformer, [Paper]

  • (arXiv 2021.11) FQ-ViT: Fully Quantized Vision Transformer without Retraining, [Paper], [Code]

  • (arXiv 2021.11) LAFITE: Towards Language-Free Training for Text-to-Image Generation, [Paper]

  • (arXiv 2021.11) SPARSE DETR: EFFICIENT END-TO-END OBJECT DETECTION WITH LEARNABLE SPARSITY, [Paper], [Code]

  • (arXiv 2021.11) End-to-End Referring Video Object Segmentation with Multimodal Transformers, [Paper], [Code]

  • (arXiv 2021.11) Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling, [Paper], [Code]

  • (arXiv 2021.11) Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic, [Paper], [Code]

  • (arXiv 2021.11) Blended Diffusion for Text-driven Editing of Natural Images, [Paper], [Code]

  • (arXiv 2021.11) Mask Transfiner for High-Quality Instance Segmentation, [Paper], [Code]

  • (arXiv 2021.11) MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation, [Paper], [Code]

  • (arXiv 2021.11) PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers, [Paper], [Code]

  • (arXiv 2021.11) Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes, [Paper], [Code]

  • (arXiv 2021.11) Towards Tokenized Human Dynamics Representation, [Paper], [Code]

  • (arXiv 2021.11) Self-slimmed Vision Transformer, [Paper]

  • (arXiv 2021.11) VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, [Paper], [Code]

  • (arXiv 2021.11) A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose, [Paper]

  • (arXiv 2021.11) MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video, [Paper]

  • (arXiv 2021.11) Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences, [Paper]

  • (arXiv 2021.11) Hierarchical Modular Network for Video Captioning, [Paper]

  • (arXiv 2021.11) NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, [Paper], [Code]

  • (arXiv 2021.11) An Image Patch is a Wave: Phase-Aware Vision MLP, [Paper]

  • (arXiv 2021.11) PTQ4ViT: Post-Training Quantization Framework for Vision Transformers, [Paper]

  • (arXiv 2021.11) PU-Transformer: Point Cloud Upsampling Transformer, [Paper]

  • (arXiv 2021.11) Scaling Up Vision-Language Pre-training for Image Captioning, [Paper]

  • (arXiv 2021.11) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]

  • (arXiv 2021.11) Efficient Video Transformers with Spatial-Temporal Token Selection, [Paper]

  • (arXiv 2021.11) RedCaps: Web-curated image-text data created by the people, for the people, [Paper], [Project]

  • (arXiv 2021.11) EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching, [Paper], [Code]

  • (arXiv 2021.11) Compositional Transformers for Scene Generation, [Paper], [Code]

  • (arXiv 2021.11) Vis-TOP: Visual Transformer Overlay Processor, [Paper]

  • (arXiv 2021.11) Grounded Situation Recognition with Transformers, [Paper], [Code]

  • (arXiv 2021.11) Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints, [Paper]

  • (arXiv 2021.11) UFO: A UniFied TransfOrmer for Vision-Language Representation Learning, [Paper]

  • (arXiv 2021.11) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions, [Paper]

  • (arXiv 2021.11) Combined Scaling for Zero-shot Transfer Learning, [Paper]

  • (arXiv 2021.11) Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding, [Paper]

  • (arXiv 2021.11) IBOT: IMAGE BERT PRE-TRAINING WITH ONLINE TOKENIZER, [Paper], [Code]

  • (arXiv 2021.11) Masked Autoencoders Are Scalable Vision Learners, [Paper]

  • (arXiv 2021.11) Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction, [Paper]

  • (arXiv 2021.11) Are Transformers More Robust Than CNNs?, [Paper], [Code]

  • (arXiv 2021.11) CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval, [Paper]

  • (arXiv 2021.11) Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, [Paper]

  • (arXiv 2021.11) VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, [Paper], [Code]

  • (arXiv 2021.11) LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs, [Paper], [Project]

  • (arXiv 2021.11) An Empirical Study of Training End-to-End Vision-and-Language Transformers, [Paper], [Code]

  • (arXiv 2021.11) HRViT: Multi-Scale High-Resolution Vision Transformer, [Paper]

2021.10

  • (arXiv 2021.10) Visual Keyword Spotting with Attention, [Paper], [Project]

  • (arXiv 2021.10) Learning Co-segmentation by Segment Swapping for Retrieval and Discovery, [Paper], [Data & Code]

  • (arXiv 2021.10) Visual Spatio-Temporal Relation-Enhanced Network for Cross-Modal Text-Video Retrieval, [Paper], [Code]

  • (arXiv 2021.10) Dispensed Transformer Network for Unsupervised Domain Adaptation, [Paper]

  • (arXiv 2021.10) Scatterbrain: Unifying Sparse and Low-rank Attention Approximation, [Paper]

  • (arXiv 2021.10) 3D Object Tracking with Transformer, [Paper], [Code]

  • (arXiv 2021.10) Blending Anti-Aliasing into Vision Transformer, [Paper], [Code]

  • (arXiv 2021.10) UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model, [Paper], [Data & Code]

  • (arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation, [Paper]

  • (arXiv 2021.10) Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network, [Paper]

  • (arXiv 2021.10) History Aware Multimodal Transformer for Vision-and-Language Navigation, [Paper], [Project]

  • (arXiv 2021.10) TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation, [Paper]

  • (arXiv 2021.10) TNTC: TWO-STREAM NETWORK WITH TRANSFORMER-BASED COMPLEMENTARITY FOR GAIT-BASED EMOTION RECOGNITION, [Paper]

  • (arXiv 2021.10) Contextual Similarity Aggregation with Self-attention for Visual Re-ranking, [Paper], [Code]

  • (arXiv 2021.10) IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action Recognition, [Paper], [Code]

  • (arXiv 2021.10) IMAGE-BASED CLIP-GUIDED ESSENCE TRANSFER, [Paper], [Code]

  • (arXiv 2021.10) Sinkformers: Transformers with Doubly Stochastic Attention, [Paper]

  • (arXiv 2021.10) ILLITERATE DALL·E LEARNS TO COMPOSE, [Paper], [Project], [Code]

  • (arXiv 2021.10) Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering, [Paper]

  • (arXiv 2021.10) SOFT: Softmax-free Transformer with Linear Complexity, [Paper], [Code]

  • (arXiv 2021.10) Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation, [Paper]

  • (arXiv 2021.10) TRANSFORMER ACCELERATION WITH DYNAMIC SPARSE ATTENTION, [Paper]

  • (arXiv 2021.10) CLOOB: MODERN HOPFIELD NETWORKS WITH INFOLOOB OUTPERFORM CLIP, [Paper], [Code]

  • (arXiv 2021.10) Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization, [Paper]

  • (arXiv 2021.10) StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects, [Paper], [Project]

  • (arXiv 2021.10) Gophormer: Ego-Graph Transformer for Node Classification, [Paper]

  • (arXiv 2021.10) STRANSGAN: AN EMPIRICAL STUDY ON TRANSFORMER IN GANS, [Paper], [Code]

  • (arXiv 2021.10) MVT: Multi-view Vision Transformer for 3D Object Recognition, [Paper]

  • (arXiv 2021.10) DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction, [Paper], [Code]

  • (arXiv 2021.10) WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP, [Paper], [Code]

  • (arXiv 2021.10) AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation, [Paper]

  • (arXiv 2021.10) AniFormer: Data-driven 3D Animation with Transformer, [Paper], [Code]

  • (arXiv 2021.10) Few-Shot Temporal Action Localization with Query Adaptive Transformer, [Paper], [Code]

  • (arXiv 2021.10) 3D-ANAS v2: Grafting Transformer Module on Automatically Designed ConvNet for Hyperspectral Image Classification, [Paper], [Code]

  • (arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification, [Paper]

  • (arXiv 2021.10) 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers, [Paper], [Code]

  • (arXiv 2021.10) HRFormer: High-Resolution Transformer for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.10) Leveraging MoCap Data for Human Mesh Recovery, [Paper]

  • (arXiv 2021.10) A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models, [Paper]

  • (arXiv 2021.10) ASFormer: Transformer for Action Segmentation, [Paper], [Code]

  • (arXiv 2021.10) Multimodal Dialogue Response Generation, [Paper]

  • (arXiv 2021.10) Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals, [Paper]

  • (arXiv 2021.10) COMPOSITIONAL ATTENTION: DISENTANGLING SEARCH AND RETRIEVAL, [Paper], [Code]

  • (arXiv 2021.10) Spatial-Temporal Transformer for 3D Point Cloud Sequences, [Paper]

  • (arXiv 2021.10) TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation, [Paper], [Code]

  • (arXiv 2021.10) Unifying Multimodal Transformer for Bi-directional Image and Text Generation, [Paper]

  • (arXiv 2021.10) Transformer with a Mixture of Gaussian Keys, [Paper]

  • (arXiv 2021.10) DIFFUSIONCLIP: TEXT-GUIDED IMAGE MANIPULATION USING DIFFUSION MODELS, [Paper]

  • (arXiv 2021.10) Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs, [Paper], [Code]

  • (arXiv 2021.10) RIPPLE ATTENTION FOR VISUAL PERCEPTION WITH SUB-QUADRATIC COMPLEXITY, [Paper]

  • (arXiv 2021.10) Certified Patch Robustness via Smoothed Vision Transformers, [Paper], [Code]

  • (arXiv 2021.10) CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation, [Paper]

  • (arXiv 2021.10) Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation, [Paper]

  • (arXiv 2021.10) SPARSE MOES MEET EFFICIENT ENSEMBLES, [Paper]

  • (arXiv 2021.10) Shared Visual Representations of Drawing for Communication: How do different biases affect human interpretability and intent? [Paper]

  • (arXiv 2021.10) SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition, [Paper]

  • (arXiv 2021.10) Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning, [Paper]

  • (arXiv 2021.10) Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block, [Paper]

  • (arXiv 2021.10) SUPERVISION EXISTS EVERYWHERE: A DATA EFFICIENT CONTRASTIVE LANGUAGE-IMAGE PRE-TRAINING PARADIGM, [Paper], [Code]

  • (arXiv 2021.10) CLIP4Caption ++: Multi-CLIP for Video Caption, [Paper]

  • (arXiv 2021.10) Transformer-based Dual Relation Graph for Multi-label Image Recognition, [Paper]

  • (arXiv 2021.10) VECTOR-QUANTIZED IMAGE MODELING WITH IMPROVED VQGAN, [Paper]

  • (arXiv 2021.10) Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation, [Paper], [Code]

  • (arXiv 2021.10) NVIT: VISION TRANSFORMER COMPRESSION AND PARAMETER REDISTRIBUTION, [Paper]

  • (arXiv 2021.10) 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning, [Paper]

  • (arXiv 2021.10) CLIP-Adapter: Better Vision-Language Models with Feature Adapters, [Paper], [Code]

  • (arXiv 2021.10) ATISS: Autoregressive Transformers for Indoor Scene Synthesis, [Paper], [Code]

  • (arXiv 2021.10) MOBILEVIT: LIGHT-WEIGHT, GENERAL-PURPOSE, AND MOBILE-FRIENDLY VISION TRANSFORMER, [Paper]

  • (arXiv 2021.10) TOKEN POOLING IN VISION TRANSFORMERS, [Paper]

  • (arXiv 2021.10) VIDT: AN EFFICIENT AND EFFECTIVE FULLY TRANSFORMER-BASED OBJECT DETECTOR, [Paper], [Code]

  • (arXiv 2021.10) CLIP4Caption: CLIP for Video Caption, [Paper]

  • (arXiv 2021.10) OBJECT-REGION VIDEO TRANSFORMERS, [Paper], [Code]

  • (arXiv 2021.10) LEVERAGING REDUNDANCY IN ATTENTION WITH REUSE TRANSFORMERS, [Paper]

  • (arXiv 2021.10) Dynamic Inference with Neural Interpreters, [Paper]

  • (arXiv 2021.10) A CLIP-Enhanced Method for Video-Language Understanding, [Paper]

  • (arXiv 2021.10) Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries, [Paper]

  • (arXiv 2021.10) Discovering Human Interactions with Large-Vocabulary Objects via Query and Multi-Scale Detection, [Paper]

  • (arXiv 2021.10) Learning Structural Representations for Recipe Generation and Food Retrieval, [Paper]

  • (arXiv 2021.10) A FREE LUNCH FROM VIT: ADAPTIVE ATTENTION MULTI-SCALE FUSION TRANSFORMER FOR FINE-GRAINED VISUAL RECOGNITION, [Paper]

2021.09

  • (arXiv 2021.09) Joint Multimedia Event Extraction from Video and Article, [Paper]

  • (arXiv 2021.09) Long-Range Transformers for Dynamic Spatiotemporal Forecasting, [Paper]

  • (arXiv 2021.09) Visually Grounded Concept Composition, [Paper]

  • (arXiv 2021.09) CoSeg: Cognitively Inspired Unsupervised Generic Event Segmentation, [Paper]

  • (arXiv 2021.09) CCTrans: Simplifying and Improving Crowd Counting with Transformer, [Paper]

  • (arXiv 2021.09) UFO-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]

  • (arXiv 2021.09) Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds, [Paper]

  • (arXiv 2021.09) Localizing Objects with Self-Supervised Transformers and no Labels, [Paper], [Code]

  • (arXiv 2021.09) Geometry-Entangled Visual Semantic Transformer for Image Captioning, [Paper]

  • (arXiv 2021.09) VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding, [Paper], [Code]

  • (arXiv 2021.09) Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models, [Paper]

  • (arXiv 2021.09) CLIP-It! Language-Guided Video Summarization, [Paper], [Project]

  • (arXiv 2021.09) MFEVIT: A ROBUST LIGHTWEIGHT TRANSFORMER-BASED NETWORK FOR MULTIMODAL 2D+3D FACIAL EXPRESSION RECOGNITION, [Paper]

  • (arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [Paper], [Code]

  • (arXiv 2021.09) Vision Transformer Hashing for Image Retrieval, [Paper]

  • (arXiv 2021.09) PETA: Photo Albums Event Recognition using Transformers Attention, [Paper]

  • (arXiv 2021.09) MLIM: VISION-AND-LANGUAGE MODEL PRE-TRAINING WITH MASKED LANGUAGE AND IMAGE MODELING, [Paper]

  • (arXiv 2021.09) Dense Contrastive Visual-Linguistic Pretraining, [Paper]

  • (arXiv 2021.09) CPT: COLORFUL PROMPT TUNING FOR PRE-TRAINED VISION-LANGUAGE MODELS, [Paper]

  • (arXiv 2021.09) Localizing ∞-shaped fishes: Sketch-guided object localization in the wild, [Paper], [Code]

  • (arXiv 2021.09) CLIPORT: What and Where Pathways for Robotic Manipulation, [Paper], [Project], [Code]

  • (arXiv 2021.09) GraFormer: Graph Convolution Transformer for 3D Pose Estimation, [Paper], [Code]

  • (arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation, [Paper]

  • (arXiv 2021.09) Expression Snippet Transformer for Robust Video-based Facial Expression Recognition, [Paper], [Code]

  • (arXiv 2021.09) LOTR: Face Landmark Localization Using Localization Transformer, [Paper]

  • (arXiv 2021.09) Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions, [Paper]

  • (arXiv 2021.09) SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction, [Paper]

  • (arXiv 2021.09) KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, [Paper]

  • (arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [Paper]

  • (arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification, [Paper]

  • (arXiv 2021.09) PIX2SEQ: A LANGUAGE MODELING FRAMEWORK FOR OBJECT DETECTION, [Paper]

  • (arXiv 2021.09) ActionCLIP: A New Paradigm for Video Action Recognition, [Paper]

  • (arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation, [Paper]

  • (arXiv 2021.09) Neural Human Performer: Learning Generalizable Radiance Fields for Human Performance Rendering, [Paper], [Code]

  • (arXiv 2021.09) Anchor DETR: Query Design for Transformer-Based Detector, [Paper], [Code]

  • (arXiv 2021.09) An End-to-End Transformer Model for 3D Object Detection, [Paper], [Code]

  • (arXiv 2021.09) Hybrid Local-Global Transformer for Image Dehazing, [Paper]

  • (arXiv 2021.09) Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer, [Paper]

  • (arXiv 2021.09) Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning, [Paper]

  • (arXiv 2021.09) Pose Transformers (POTR): Human Motion Prediction with Non-Autoregressive Transformers, [Paper], [Code]

  • (arXiv 2021.09) PnP-DETR: Towards Efficient Visual Analysis with Transformers, [Paper], [Code]

  • (arXiv 2021.09) Learning to Ground Visual Objects for Visual Dialog, [Paper]

  • (arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for Video Grounding, [Paper], [Code]

  • (arXiv 2021.09) CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation, [Paper]

  • (arXiv 2021.09) IS ATTENTION BETTER THAN MATRIX DECOMPOSITION? [Paper], [Code]

  • (arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering, [Paper]

  • (arXiv 2021.09) Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization, [Paper]

  • (arXiv 2021.09) Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding, [Paper]

  • (arXiv 2021.09) LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation, [Paper], [Code]

  • (arXiv 2021.09) Panoptic Narrative Grounding, [Paper]

  • (arXiv 2021.09) An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA, [Paper]

  • (arXiv 2021.09) PlaTe: Visually-Grounded Planning with Transformers in Procedural Tasks, [Paper], [Project]

  • (arXiv 2021.09) EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling, [Paper]

  • (arXiv 2021.09) Scaled ReLU Matters for Training Vision Transformers, [Paper]

  • (arXiv 2021.09) FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting, [Paper], [Code]

  • (arXiv 2021.09) GCsT: Graph Convolutional Skeleton Transformer for Action Recognition, [Paper]

  • (arXiv 2021.09) WHYACT: Identifying Action Reasons in Lifestyle Vlogs, [Paper]

  • (arXiv 2021.09) Zero-Shot Open Set Detection by Extending CLIP, [Paper]

  • (arXiv 2021.09) Towards Transferable Adversarial Attacks on Vision Transformers, [Paper]

  • (arXiv 2021.09) Learning to Prompt for Vision-Language Models, [Paper], [Code]

  • (arXiv 2021.09) Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss, [Paper], [Code]

  • (arXiv 2021.09) UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer, [Paper], [Code]

  • (arXiv 2021.09) ConvMLP: Hierarchical Convolutional MLPs for Vision, [Paper], [Code]

  • (arXiv 2021.09) TxT: Crossmodal End-to-End Learning with Transformers, [Paper]

  • (arXiv 2021.09) Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers, [Paper]

  • (arXiv 2021.09) Sparse-MLP: A Fully-MLP Architecture with Conditional Computation, [Paper]

  • (arXiv 2021.09) SORNet: Spatial Object-Centric Representations for Sequential Manipulation, [Paper], [Project]

  • (arXiv 2021.09) Audio-Visual Transformer Based Crowd Counting, [Paper]

  • (arXiv 2021.09) Weakly Supervised Relative Spatial Reasoning for Visual Question Answering, [Paper], [Code]

  • (arXiv 2021.09) FUSFORMER: A TRANSFORMER-BASED FUSION APPROACH FOR HYPERSPECTRAL IMAGE SUPER-RESOLUTION, [Paper]

  • (arXiv 2021.09) CTRL-C: Camera calibration TRansformer with Line-Classification, [Paper], [Code]

  • (arXiv 2021.09) Learning to Generate Scene Graph from Natural Language Supervision, [Paper], [Code]

  • (arXiv 2021.09) The Animation Transformer: Visual Correspondence via Segment Matching, [Paper]

  • (arXiv 2021.09) Voxel Transformer for 3D Object Detection, [Paper]

  • (arXiv 2021.09) 3D Human Texture Estimation from a Single Image with Transformers, [Paper], [Code]

  • (arXiv 2021.09) Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation, [Paper], [Code]

  • (arXiv 2021.09) Joint Graph Learning and Matching for Semantic Feature Correspondence, [Paper]

  • (arXiv 2021.09) Searching for Efficient Multi-Stage Vision Transformers, [Paper], [Code]

2021.08

  • (arXiv 2021.08) SIGN: Spatial-information Incorporated Generative Network for Generalized Zero-shot Semantic Segmentation, [Paper]

  • (arXiv 2021.08) GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer, [Paper], [Code]

  • (arXiv 2021.08) A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP, [Paper]

  • (arXiv 2021.08) Exploring and Improving Mobile Level Vision Transformers, [Paper]

  • (arXiv 2021.08) Cross-category Video Highlight Detection via Set-based Learning, [Paper], [Code]

  • (arXiv 2021.08) Shifted Chunk Transformer for Spatio-Temporal Representational Learning, [Paper]

  • (arXiv 2021.08) SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments, [Paper]

  • (arXiv 2021.08) LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision, [Paper], [Project]

  • (arXiv 2021.08) Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads, [Paper]

  • (arXiv 2021.08) SIMVLM: SIMPLE VISUAL LANGUAGE MODEL PRETRAINING WITH WEAK SUPERVISION, [Paper]

  • (arXiv 2021.08) TransFER: Learning Relation-aware Facial Expression Representations with Transformers, [Paper]

  • (arXiv 2021.08) Efficient Transformer for Single Image Super-Resolution, [Paper]

  • (arXiv 2021.08) Discovering Spatial Relationships by Transformers for Domain Generalization, [Paper]

  • (arXiv 2021.08) TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment, [Paper]

  • (arXiv 2021.08) MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition, [Paper]

  • (arXiv 2021.08) SwinIR: Image Restoration Using Swin Transformer, [Paper], [Code]

  • (arXiv 2021.08) Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training, [Paper]

  • (arXiv 2021.08) Improving 3D Object Detection with Channel-wise Transformer, [Paper]

  • (arXiv 2021.08) No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency, [Paper], [Code]

  • (arXiv 2021.08) SOTR: Segmenting Objects with Transformers, [Paper], [Code]

  • (arXiv 2021.08) ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration, [Paper], [Code]

  • (arXiv 2021.08) Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism, [Paper], [Code]

  • (arXiv 2021.08) End-to-End Dense Video Captioning with Parallel Decoding, [Paper], [Code]

  • (arXiv 2021.08) Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance, [Paper]

  • (arXiv 2021.08) Video Relation Detection via Tracklet based Visual Transformer, [Paper], [Code]

  • (arXiv 2021.08) PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers, [Paper], [Code]

  • (arXiv 2021.08) ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis, [Paper], [Project]

  • (arXiv 2021.08) Do Vision Transformers See Like Convolutional Neural Networks? [Paper]

  • (arXiv 2021.08) TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation, [Paper]

  • (arXiv 2021.08) MUSIQ: Multi-scale Image Quality Transformer, [Paper]

  • (arXiv 2021.08) Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning, [Paper], [Code]

  • (arXiv 2021.08) Conditional DETR for Fast Training Convergence, [Paper], [Code]

  • (arXiv 2021.08) Vision-Language Transformer and Query Generation for Referring Segmentation, [Paper], [Code]

  • (arXiv 2021.08) Mobile-Former: Bridging MobileNet and Transformer, [Paper]

  • (arXiv 2021.08) Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation), [Paper], [Code]

  • (arXiv 2021.08) Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations, [Paper]

  • (arXiv 2021.08) Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion, [Paper], [Code]

  • (arXiv 2021.08) Video Transformer for Deepfake Detection with Incremental Learning, [Paper]

  • (arXiv 2021.08) ConvNets vs. Transformers: Whose Visual Representations are More Transferable? [Paper]

  • (arXiv 2021.08) A Transformer-based Math Language Model for Handwritten Math Expression Recognition, [Paper]

  • (arXiv 2021.08) Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers, [Paper]

  • (arXiv 2021.08) TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding, [Paper]

  • (arXiv 2021.08) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper], [Code]

  • (arXiv 2021.08) Token Shift Transformer for Video Classification, [Paper], [Code]

  • (arXiv 2021.08) Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer, [Paper], [Code]

  • (arXiv 2021.08) Joint Inductive and Transductive Learning for Video Object Segmentation, [Paper], [Code]

  • (arXiv 2021.08) OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning, [Paper]

  • (arXiv 2021.08) Paint Transformer: Feed Forward Neural Painting with Stroke Prediction, [Paper], [Code-1], [Code-2]

  • (arXiv 2021.08) TransForensics: Image Forgery Localization with Dense Self-Attention, [Paper]

  • (arXiv 2021.08) TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network, [Paper]

  • (arXiv 2021.08) Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models, [Paper], [Code]

  • (arXiv 2021.08) The Right to Talk: An Audio-Visual Transformer Approach, [Paper]

  • (arXiv 2021.08) PSViT: Better Vision Transformer via Token Pooling and Attention Sharing, [Paper]

  • (arXiv 2021.08) Unifying Global-Local Representations in Salient Object Detection with Transformer, [Paper], [Code]

  • (arXiv 2021.08) Boosting Few-shot Semantic Segmentation with Transformers, [Paper], [Code]

  • (arXiv 2021.08) Vision Transformer with Progressive Sampling, [Paper], [Code]

  • (arXiv 2021.08) Armour: Generalizable Compact Self-Attention for Vision Transformers, [Paper]

  • (arXiv 2021.08) Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer, [Paper]

  • (arXiv 2021.08) S^2-MLPV2: IMPROVED SPATIAL-SHIFT MLP ARCHITECTURE FOR VISION, [Paper] (see the code sketch after this month's list)

  • (arXiv 2021.08) Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer, [Paper]

  • (arXiv 2021.08) Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning, [Paper]

  • (arXiv 2021.08) CROSSFORMER: A VERSATILE VISION TRANSFORMER BASED ON CROSS-SCALE ATTENTION, [Paper], [Code]

  • (arXiv 2021.08) Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding, [Paper]

  • (arXiv 2021.08) Transformer-based deep imitation learning for dual-arm robot manipulation, [Paper]

  • (arXiv 2021.08) GTNet: Guided Transformer Network for Detecting Human-Object Interactions, [Paper], [Code]

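The spatial-shift operation named in the S^2-MLPv2 entry above is simple enough to illustrate. Below is a minimal sketch of the shift from the original S^2-MLP, which S^2-MLPv2 extends (our simplification, not the authors' code; the four-way channel split follows the paper, while shapes and border handling are ours): channel groups are displaced one pixel in four directions so that plain per-location MLPs can mix spatial information.

```python
# Minimal sketch of an S^2-MLP-style spatial shift (not the authors' code).
# Channels are split into four groups; each group is shifted one pixel in a
# different direction. Border rows/columns keep their original values here.
import torch

def spatial_shift(x: torch.Tensor) -> torch.Tensor:
    """x: (B, H, W, C) feature map with C divisible by 4."""
    b, h, w, c = x.shape
    g = c // 4
    out = x.clone()
    out[:, :, 1:, :g] = x[:, :, :-1, :g]                     # group 1: shift right
    out[:, :, :-1, g:2 * g] = x[:, :, 1:, g:2 * g]           # group 2: shift left
    out[:, 1:, :, 2 * g:3 * g] = x[:, :-1, :, 2 * g:3 * g]   # group 3: shift down
    out[:, :-1, :, 3 * g:] = x[:, 1:, :, 3 * g:]             # group 4: shift up
    return out

print(spatial_shift(torch.randn(2, 14, 14, 64)).shape)  # torch.Size([2, 14, 14, 64])
```
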
2021.07

  • (arXiv 2021.07) Perceiver IO: A General Architecture for Structured Inputs & Outputs, [Paper], [Code] (see the code sketch after this month's list)

  • (arXiv 2021.07) DPT: Deformable Patch-based Transformer for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.07) Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining, [Paper]

  • (arXiv 2021.07) Exceeding the Limits of Visual-Linguistic Multi-Task Learning, [Paper]

  • (arXiv 2021.07) UIBert: Learning Generic Multimodal Representations for UI Understanding, [Paper]

  • (arXiv 2021.07) Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection, [Paper]

  • (arXiv 2021.07) A Unified Efficient Pyramid Transformer for Semantic Segmentation, [Paper]

  • (arXiv 2021.07) PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion, [Paper]

  • (arXiv 2021.07) ReFormer: The Relational Transformer for Image Captioning, [Paper]

  • (arXiv 2021.07) Rethinking and Improving Relative Position Encoding for Vision Transformer, [Paper], [Code]

  • (arXiv 2021.07) Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers, [Paper]

  • (arXiv 2021.07) PlaneTR: Structure-Guided Transformers for 3D Plane Recovery, [Paper], [Code]

  • (arXiv 2021.07) Is Object Detection Necessary for Human-Object Interaction Recognition? [Paper]

  • (arXiv 2021.07) Don’t Sweep your Learning Rate under the Rug: A Closer Look at Cross-modal Transfer of Pretrained Transformers, [Paper]

  • (arXiv 2021.07) Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers, [Paper], [Code]

  • (arXiv 2021.07) Go Wider Instead of Deeper, [Paper]

  • (arXiv 2021.07) Contextual Transformer Networks for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.07) Mixed SIGNals: Sign Language Production via a Mixture of Motion Primitives, [Paper]

  • (arXiv 2021.07) Query2Label: A Simple Transformer Way to Multi-Label Classification, [Paper], [Code]

  • (arXiv 2021.07) EAN: Event Adaptive Network for Enhanced Action Recognition, [Paper], [Code]

  • (arXiv 2021.07) CycleMLP: A MLP-like Architecture for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.07) Generative Video Transformer: Can Objects be the Words? [Paper]

  • (arXiv 2021.07) QVHIGHLIGHTS: Detecting Moments and Highlights in Videos via Natural Language Queries, [Paper], [Code]

  • (arXiv 2021.07) PICASO: Permutation-Invariant Cascaded Attentional Set Operator, [Paper], [Code]

  • (arXiv 2021.07) RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition, [Paper]

  • (arXiv 2021.07) OODformer: Out-Of-Distribution Detection Transformer, [Paper], [Code]

  • (arXiv 2021.07) Image Fusion Transformer, [Paper], [Code]

  • (arXiv 2021.07) ResT: An Efficient Transformer for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.07) STAR: Sparse Transformer-based Action Recognition, [Paper], [Code]

  • (arXiv 2021.07) Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition, [Paper]

  • (arXiv 2021.07) How Much Can CLIP Benefit Vision-and-Language Tasks? [Paper]

  • (arXiv 2021.07) Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms, [Paper], [Code]

  • (arXiv 2021.07) Visual Parser: Representing Part-whole Hierarchies with Transformers, [Paper], [Code]

  • (arXiv 2021.07) Combiner: Full Attention Transformer with Sparse Computation Cost, [Paper]

  • (arXiv 2021.07) Per-Pixel Classification is Not All You Need for Semantic Segmentation, [Paper], [Project]

  • (arXiv 2021.07) Learning Multi-Scene Absolute Pose Regression with Transformers, [Paper]

  • (arXiv 2021.07) CMT: Convolutional Neural Networks Meet Vision Transformers, [Paper]

  • (arXiv 2021.07) HAT: Hierarchical Aggregation Transformers for Person Re-identification, [Paper], [Code]

  • (arXiv 2021.07) THE BROWNIAN MOTION IN THE TRANSFORMER MODEL, [Paper]

  • (arXiv 2021.07) Local-to-Global Self-Attention in Vision Transformers, [Paper], [Code]

  • (arXiv 2021.07) Scenes and Surroundings: Scene Graph Generation using Relation Transformer, [Paper]

  • (arXiv 2021.07) ViTGAN: Training GANs with Vision Transformers, [Paper]

  • (arXiv 2021.07) Long-Short Temporal Contrastive Learning of Video Transformers, [Paper]

  • (arXiv 2021.07) PVTv2: Improved Baselines with Pyramid Vision Transformer, [Paper], [Code]

  • (arXiv 2021.07) Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers, [Paper], [Code]

  • (arXiv 2021.07) LanguageRefer: Spatial-Language Model for 3D Visual Grounding, [Paper]

  • (arXiv 2021.07) EEG-CONVTRANSFORMER FOR SINGLE-TRIAL EEG BASED VISUAL STIMULI CLASSIFICATION, [Paper]

  • (arXiv 2021.07) Feature Fusion Vision Transformer for Fine-Grained Visual Categorization, [Paper]

  • (arXiv 2021.07) Long-Short Transformer: Efficient Transformers for Language and Vision, [Paper]

  • (arXiv 2021.07) TransformerFusion: Monocular RGB Scene Reconstruction using Transformers, [Paper]

  • (arXiv 2021.07) VIDLANKD: Improving Language Understanding via Video-Distilled Knowledge Transfer, [Paper], [Code]

  • (arXiv 2021.07) GLiT: Neural Architecture Search for Global and Local Image Transformer, [Paper]

  • (arXiv 2021.07) LEARNING VISION TRANSFORMER WITH SQUEEZE AND EXCITATION FOR FACIAL EXPRESSION RECOGNITION, [Paper]

  • (arXiv 2021.07) Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World, [Paper]

  • (arXiv 2021.07) Long Short-Term Transformer for Online Action Detection, [Paper]

  • (arXiv 2021.07) VISION XFORMERS: EFFICIENT ATTENTION FOR IMAGE CLASSIFICATION, [Paper]

  • (arXiv 2021.07) Test-Time Personalization with a Transformer for Human Pose Estimation, [Paper], [Code]

  • (arXiv 2021.07) What Makes for Hierarchical Vision Transformer? [Paper]

  • (arXiv 2021.07) Efficient Vision Transformers via Fine-Grained Manifold Distillation, [Paper]

  • (arXiv 2021.07) Visual Relationship Forecasting in Videos, [Paper]

  • (arXiv 2021.07) Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots, [Paper]

  • (arXiv 2021.07) Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions, [Paper]

  • (arXiv 2021.07) CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, [Paper], [Code]

  • (arXiv 2021.07) CLIP-It! Language-Guided Video Summarization, [Paper], [Code]

  • (arXiv 2021.07) AutoFormer: Searching Transformers for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.07) Focal Self-attention for Local-Global Interactions in Vision Transformers, [Paper]

  • (arXiv 2021.07) Global Filter Networks for Image Classification, [Paper], [Code]

  • (arXiv 2021.07) VideoLightFormer: Lightweight Action Recognition using Transformers, [Paper]

  • (arXiv 2021.07) OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation, [Paper]

  • (arXiv 2021.07) TransSC: Transformer-based Shape Completion for Grasp Evaluation, [Paper]

  • (arXiv 2021.07) Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition, [Paper]

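The Perceiver IO entry at the top of this month's list is built around one reusable pattern: a small latent array reads a large input via cross-attention, is processed with self-attention, and is then read out by task-specific output queries. A toy sketch of that pattern follows (our illustration with arbitrary sizes and PyTorch's stock attention, not the DeepMind implementation; in the paper the output queries are constructed per task rather than random):

```python
# Toy sketch of the Perceiver IO read-process-write pattern (ours, simplified).
import torch
import torch.nn as nn

class TinyPerceiverIO(nn.Module):
    def __init__(self, dim=64, n_latents=16, n_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.read = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.process = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, inputs, out_queries):
        # inputs: (B, N_in, dim); out_queries: (B, N_out, dim)
        lat = self.latents.expand(inputs.size(0), -1, -1)
        lat, _ = self.read(lat, inputs, inputs)     # encode: latents <- inputs
        lat, _ = self.process(lat, lat, lat)        # process: latent self-attention
        out, _ = self.write(out_queries, lat, lat)  # decode: outputs <- latents
        return out

m = TinyPerceiverIO()
y = m(torch.randn(2, 1024, 64), torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```

Because the expensive cross-attention touches the input only once, cost grows linearly with input size, which is what lets the architecture handle very long, multimodal inputs.
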
2021.06

  • (arXiv 2021.06) Associating Objects with Transformers for Video Object Segmentation, [Paper], [Code]

  • (arXiv 2021.06) Video Super-Resolution Transformer, [Paper], [Code]

  • (arXiv 2021.06) Thinking Like Transformers, [Paper]

  • (arXiv 2021.06) Kernel Identification Through Transformers, [Paper]

  • (arXiv 2021.06) XCiT: Cross-Covariance Image Transformers, [Paper]

  • (arXiv 2021.06) THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers, [Paper]

  • (arXiv 2021.06) Probing Image–Language Transformers for Verb Understanding, [Paper]

  • (arXiv 2021.06) How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers, [Paper], [Code], [Model]

  • (arXiv 2021.06) End-to-end Temporal Action Detection with Transformer, [Paper], [Code]

  • (arXiv 2021.06) Efficient Self-supervised Vision Transformers for Representation Learning, [Paper]

  • (arXiv 2021.06) CLIP2Video: Mastering Video-Text Retrieval via Image CLIP, [Paper], [Code]

  • (arXiv 2021.06) Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers, [Paper], [Code]

  • (arXiv 2021.06) Transformed ROIs for Capturing Visual Transformations in Videos, [Paper]

  • (arXiv 2021.06) Transformer in Convolutional Neural Networks, [Paper], [Code]

  • (arXiv 2021.06) Video Instance Segmentation using Inter-Frame Communication Transformers, [Paper]

  • (arXiv 2021.06) Patch Slimming for Efficient Vision Transformers, [Paper]

  • (arXiv 2021.06) CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings, [Paper]

  • (arXiv 2021.06) RegionViT: Regional-to-Local Attention for Vision Transformers, [Paper]

  • (arXiv 2021.06) Motion Planning Transformers: One Model to Plan Them All, [Paper]

  • (arXiv 2021.06) Oriented Object Detection with Transformer, [Paper]

  • (arXiv 2021.06) Referring Transformer: A One-step Approach to Multi-task Visual Grounding, [Paper]

  • (arXiv 2021.06) Grounding inductive biases in natural images: invariance stems from variations in data, [Paper]

  • (arXiv 2021.06) CoAtNet: Marrying Convolution and Attention for All Data Sizes, [Paper]

  • (arXiv 2021.06) Scaling Vision Transformers, [Paper]

  • (arXiv 2021.06) Uformer: A General U-Shaped Transformer for Image Restoration, [Paper], [Code]

  • (arXiv 2021.06) Visual Transformer for Task-aware Active Learning, [Paper], [Code]

  • (arXiv 2021.06) Chasing Sparsity in Vision Transformers: An End-to-End Exploration, [Paper], [Code]

  • (arXiv 2021.06) DETReg: Unsupervised Pretraining with Region Priors for Object Detection, [Paper], [Code]

  • (arXiv 2021.06) MVT: MASK VISION TRANSFORMER FOR FACIAL EXPRESSION RECOGNITION IN THE WILD, [Paper]

  • (arXiv 2021.06) Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight, [Paper]

  • (arXiv 2021.06) Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer, [Paper]

  • (arXiv 2021.06) MlTr: Multi-label Classification with Transformer, [Paper], [Code]

  • (arXiv 2021.06) Going Beyond Linear Transformers with Recurrent Fast Weight Programmers, [Paper], [Code]

  • (arXiv 2021.06) On Improving Adversarial Transferability of Vision Transformers, [Paper], [Code]

  • (arXiv 2021.06) Fully Transformer Networks for Semantic Image Segmentation, [Paper]

  • (arXiv 2021.06) MST: Masked Self-Supervised Transformer for Visual Representation, [Paper]

  • (arXiv 2021.06) Space-time Mixing Attention for Video Transformer, [Paper]

  • (arXiv 2021.06) VIT-INCEPTION-GAN FOR IMAGE COLOURISING, [Paper]

  • (arXiv 2021.06) HYBRID GENERATIVE-CONTRASTIVE REPRESENTATION LEARNING, [Paper], [Code]

  • (arXiv 2021.06) OadTR: Online Action Detection with Transformers, [Paper], [Code]

  • (arXiv 2021.06) VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning, [Paper], [Code]

  • (arXiv 2021.06) Delving Deep into the Generalization of Vision Transformers under Distribution Shifts, [Paper], [Code]

  • (arXiv 2021.06) Improved Transformer for High-Resolution GANs, [Paper]

  • (arXiv 2021.06) Towards Long-Form Video Understanding, [Paper], [Code]

  • (arXiv 2021.06) TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [Paper]

  • (arXiv 2021.06) More than Encoder: Introducing Transformer Decoder to Upsample, [Paper]

  • (arXiv 2021.06) A Picture May Be Worth a Hundred Words for Visual Question Answering, [Paper]

  • (arXiv 2021.06) Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training, [Paper]

  • (arXiv 2021.06) Shape registration in the time of transformers, [Paper]

  • (arXiv 2021.06) Vision Transformer Architecture Search, [Paper], [Code]

  • (arXiv 2021.06) Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue, [Paper]

  • (arXiv 2021.06) Multi-Exit Vision Transformer for Dynamic Inference, [Paper]

  • (arXiv 2021.06) Early Convolutions Help Transformers See Better, [Paper]

  • (arXiv 2021.06) Rethinking Token-Mixing MLP for MLP-based Vision Backbone, [Paper]

  • (arXiv 2021.06) Augmented Shortcuts for Vision Transformers, [Paper]

  • (arXiv 2021.06) CAT: Cross Attention in Vision Transformer, [Paper], [Code]

  • (arXiv 2021.06) Post-Training Quantization for Vision Transformer, [Paper]

  • (arXiv 2021.06) Attention Bottlenecks for Multimodal Fusion, [Paper]

  • (arXiv 2021.06) Improving the Efficiency of Transformers for Resource-Constrained Devices, [Paper]

  • (arXiv 2021.06) Multimodal Few-Shot Learning with Frozen Language Models, [Paper]

  • (arXiv 2021.06) Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation, [Paper]

  • (arXiv 2021.06) Exploring Vision Transformers for Fine-grained Classification, [Paper], [Code]

  • (arXiv 2021.06) S^2-MLP: Spatial-Shift MLP Architecture for Vision, [Paper]

  • (arXiv 2021.06) Styleformer: Transformer based Generative Adversarial Networks with Style Vector, [Paper], [Code]

  • (arXiv 2021.06) ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias, [Paper], [Code]

  • (arXiv 2021.06) Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, [Paper]

  • (arXiv 2021.06) Refiner: Refining Self-attention for Vision Transformers, [Paper], [Code]

  • (arXiv 2021.06) Person Re-Identification with a Locally Aware Transformer, [Paper]

  • (arXiv 2021.06) Efficient Training of Visual Transformers with Small-Size Datasets, [Paper]

  • (arXiv 2021.06) Glance-and-Gaze Vision Transformer, [Paper], [Code]

  • (arXiv 2021.06) Few-Shot Segmentation via Cycle-Consistent Transformer, [Paper]

  • (arXiv 2021.06) Semantic Correspondence with Transformers, [Paper], [Code]

  • (arXiv 2021.06) THE IMAGE LOCAL AUTOREGRESSIVE TRANSFORMER, [Paper]

  • (arXiv 2021.06) MERLOT: Multimodal Neural Script Knowledge Models, [Paper], [Project]

  • (arXiv 2021.06) SOLQ: Segmenting Objects by Learning Queries, [Paper], [Code]

  • (arXiv 2021.06) Personalizing Pre-trained Models, [Paper], [Code]

  • (arXiv 2021.06) E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning, [Paper]

  • (arXiv 2021.06) VOLO: Vision Outlooker for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.06) Container: Context Aggregation Network, [Paper]

  • (arXiv 2021.06) Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers, [Paper]

  • (arXiv 2021.06) Video Swin Transformer, [Paper], [Code]

  • (arXiv 2021.06) IA-RED^2: Interpretability-Aware Redundancy Reduction for Vision Transformers, [Paper], [Code]

  • (arXiv 2021.06) AudioCLIP: Extending CLIP to Image, Text and Audio, [Paper]

  • (arXiv 2021.06) VISION PERMUTATOR: A PERMUTABLE MLP-LIKE ARCHITECTURE FOR VISUAL RECOGNITION, [Paper], [Code]

  • (arXiv 2021.06) Co-advise: Cross Inductive Bias Distillation, [Paper]

  • (arXiv 2021.06) Team PyKale (xy9) Submission to the EPIC-Kitchens 2021 Unsupervised Domain Adaptation Challenge for Action Recognition, [Paper]

  • (arXiv 2021.06) P2T: Pyramid Pooling Transformer for Scene Understanding, [Paper], [Code]

  • (arXiv 2021.06) LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction, [Paper]

  • (arXiv 2021.06) Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding, [Paper]

  • (arXiv 2021.06) MODETR: Moving Object Detection with Transformers, [Paper]

  • (arXiv 2021.06) Multi-head or Single-head? An Empirical Comparison for Transformer Training, [Paper]

  • (arXiv 2021.06) Dynamic Head: Unifying Object Detection Heads with Attentions, [Paper], [Code]

  • (arXiv 2021.06) MLP-Mixer: An all-MLP Architecture for Vision, [Paper], [Code] (see the code sketch after this month's list)

  • (arXiv 2021.06) BEIT: BERT Pre-Training of Image Transformers, [Paper], [Code]

  • (arXiv 2021.06) Scaling Vision with Sparse Mixture of Experts, [Paper]

  • (arXiv 2021.06) Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition, [Paper]

  • (arXiv 2021.06) Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time, [Paper], [Code]

  • (arXiv 2021.06) DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification, [Paper], [Code]

  • (arXiv 2021.06) SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation, [Paper]

  • (arXiv 2021.06) Anticipative Video Transformer, [Paper], [Project]

  • (arXiv 2021.06) When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations, [Paper]

  • (arXiv 2021.06) StyTr^2: Unbiased Image Style Transfer with Transformers, [Paper]

  • (arXiv 2021.06) THG: Transformer with Hyperbolic Geometry, [Paper]

  • (arXiv 2021.06) You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection, [Paper], [Code]

  • (arXiv 2021.06) TransVOS: Video Object Segmentation with Transformers, [Paper]

  • (arXiv 2021.06) Reinforcement Learning as One Big Sequence Modeling Problem, [Paper], [Project]

  • (arXiv 2021.06) Less is More: Pay Less Attention in Vision Transformers, [Paper], [Code]

  • (arXiv 2021.06) SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, [Paper], [Code]

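Several entries this month (MLP-Mixer, S^2-MLP, the token-mixing MLP papers) revolve around replacing attention with MLPs. A single Mixer block is short enough to sketch; the following is a simplified version with hidden widths of our choosing (the paper also uses dropout and specific expansion ratios):

```python
# Simplified MLP-Mixer block (ours): token-mixing MLP across the patch axis,
# then channel-mixing MLP across the feature axis, each with a residual.
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, n_tokens, dim, token_hidden=256, channel_hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(n_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, n_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                              # x: (B, n_tokens, dim)
        y = self.norm1(x).transpose(1, 2)              # (B, dim, n_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)      # mix across tokens
        x = x + self.channel_mlp(self.norm2(x))        # mix across channels
        return x

x = torch.randn(2, 196, 128)  # e.g., 14x14 patches with 128-dim embeddings
print(MixerBlock(196, 128)(x).shape)  # torch.Size([2, 196, 128])
```
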
2021.05

  • (arXiv 2021.05) KVT: k-NN Attention for Boosting Vision Transformers, [Paper]

  • (arXiv 2021.05) Memory-Efficient Differentiable Transformer Architecture Search, [Paper]

  • (arXiv 2021.05) An Attention Free Transformer, [Paper]

  • (arXiv 2021.05) On the Bias Against Inductive Biases, [Paper]

  • (arXiv 2021.05) MixerGAN: An MLP-Based Architecture for Unpaired Image-to-Image Translation, [Paper]

  • (arXiv 2021.05) Transformer-Based Source-Free Domain Adaptation, [Paper], [Code]

  • (arXiv 2021.05) FoveaTer: Foveated Transformer for Image Classification, [Paper]

  • (arXiv 2021.05) UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis, [Paper]

  • (arXiv 2021.05) Gaze Estimation using Transformer, [Paper], [Code]

  • (arXiv 2021.05) Transformer-Based Deep Image Matching for Generalizable Person Re-identification, [Paper], [Project]

  • (arXiv 2021.05) Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length, [Paper]

  • (arXiv 2021.05) Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model, [Paper]

  • (arXiv 2021.05) MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens, [Paper], [Code]

  • (arXiv 2021.05) Sequence Parallelism: Making 4D Parallelism Possible, [Paper]

  • (arXiv 2021.05) CogView: Mastering Text-to-Image Generation via Transformers, [Paper], [Code]

  • (arXiv 2021.05) TrTr: Visual Tracking with Transformer, [Paper], [Code]

  • (arXiv 2021.05) Conformer: Local Features Coupling Global Representations for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.05) Visual Grounding with Transformers, [Paper]

  • (arXiv 2021.05) Self-Supervised Learning with Swin Transformers, [Paper], [Code]

  • (arXiv 2021.05) Are Pre-trained Convolutions Better than Pre-trained Transformers? [Paper]

  • (arXiv 2021.05) MOTR: End-to-End Multiple-Object Tracking with TRansformer, [Paper], [Code]

  • (arXiv 2021.05) Attention for Image Registration (AiR): an unsupervised Transformer approach, [Paper], [Code]

  • (arXiv 2021.05) EXPLORING EXPLICIT AND IMPLICIT VISUAL RELATIONSHIPS FOR IMAGE CAPTIONING, [Paper]

  • (arXiv 2021.05) Computer-Aided Design as Language, [Paper]

  • (arXiv 2021.05) FLEX: Parameter-free Multi-view 3D Human Motion Reconstruction, [Paper], [Project]

  • (arXiv 2021.05) TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval, [Paper]

  • (arXiv 2021.05) High-Resolution Complex Scene Synthesis with Transformers, [Paper]

  • (arXiv 2021.05) Episodic Transformer for Vision-and-Language Navigation, [Paper]

  • (arXiv 2021.05) Towards Robust Vision Transformer, [Paper], [Code]

  • (arXiv 2021.05) Vision Transformers are Robust Learners, [Paper], [Code]

  • (arXiv 2021.05) ISTR: End-to-End Instance Segmentation with Transformers, [Paper], [Code]

  • (arXiv 2021.05) SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition, [Paper]

  • (arXiv 2021.05) Rethinking Skip Connection with Layer Normalization in Transformers and ResNets, [Paper]

  • (arXiv 2021.05) IntFormer: Predicting pedestrian intention with the aid of the Transformer architecture, [Paper]

  • (arXiv 2021.05) Parallel Attention Network with Sequence Matching for Video Grounding, [Paper], [Code]

  • (arXiv 2021.05) Relative Positional Encoding for Transformers with Linear Complexity, [Paper]

  • (arXiv 2021.05) VTNET: VISUAL TRANSFORMER NETWORK FOR OBJECT GOAL NAVIGATION, [Paper]

  • (arXiv 2021.05) DeepCAD: A Deep Generative Network for Computer-Aided Design Models, [Paper]

  • (arXiv 2021.05) Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead, [Paper]

  • (arXiv 2021.05) Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks, [Paper], [Code]

  • (arXiv 2021.05) Combining Transformer Generators with Convolutional Discriminators, [Paper]

  • (arXiv 2021.05) VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding, [Paper]

  • (arXiv 2021.05) Improving Generation and Evaluation of Visual Stories via Semantic Consistency, [Paper], [Code]

  • (arXiv 2021.05) BELT: Blockwise Missing Embedding Learning Transformer, [Paper]

  • (arXiv 2021.05) End-to-End Video Object Detection with Spatial-Temporal Transformers, [Paper], [Code]

  • (arXiv 2021.05) SAT: 2D Semantics Assisted Training for 3D Visual Grounding, [Paper]

  • (arXiv 2021.05) Aggregating Nested Transformers, [Paper]

  • (arXiv 2021.05) Intriguing Properties of Vision Transformers, [Paper], [Code]

  • (arXiv 2021.05) Temporal Action Proposal Generation with Transformers, [Paper]

  • (arXiv 2021.05) Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation, [Paper], [Code]

  • (arXiv 2021.05) Perceptual Image Quality Assessment with Transformers, [Paper]

  • (arXiv 2021.05) Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet, [Paper], [Code]

  • (arXiv 2021.05) Pay Attention to MLPs, [Paper]

  • (arXiv 2021.05) ResMLP: Feedforward networks for image classification with data-efficient training, [Paper]

  • (arXiv 2021.05) RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition, [Paper], [Code]

  • (arXiv 2021.05) Are Convolutional Neural Networks or Transformers more like human vision? [Paper]

  • (arXiv 2021.05) FNet: Mixing Tokens with Fourier Transforms, [Paper] (see the code sketch after this month's list)

  • (arXiv 2021.05) Segmenter: Transformer for Semantic Segmentation, [Paper], [Code]

  • (arXiv 2021.05) Visual Composite Set Detection Using Part-and-Sum Transformers, [Paper]

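FNet, listed above, makes the token-mixing point in the most extreme way: it swaps self-attention for an unparameterized Fourier transform. The core operation fits in a few lines; this is our minimal rendering of the paper's mixing step (the full model wraps it with residuals, LayerNorm, and feed-forward layers):

```python
# FNet-style token mixing (minimal sketch): apply a 2D DFT over the sequence
# and hidden dimensions and keep only the real part. No learned parameters.
import torch

def fourier_mixing(x: torch.Tensor) -> torch.Tensor:
    # x: (B, seq_len, dim)
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

print(fourier_mixing(torch.randn(2, 196, 64)).shape)  # torch.Size([2, 196, 64])
```
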
2021.04

  • (arXiv 2021.04) HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction, [Paper]

  • (arXiv 2021.04) Chop Chop BERT: Visual Question Answering by Chopping VisualBERT’s Heads, [Paper]

  • (arXiv 2021.04) CoSformer: Detecting Co-Salient Object with Transformers, [Paper]

  • (arXiv 2021.04) CAT: Cross-Attention Transformer for One-Shot Object Detection, [Paper]

  • (arXiv 2021.04) Dual Transformer for Point Cloud Analysis, [Paper]

  • (arXiv 2021.04) Playing Lottery Tickets with Vision and Language, [Paper]

  • (arXiv 2021.04) M3DETR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers, [Paper]

  • (arXiv 2021.04) RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory, [Paper], [Code]

  • (arXiv 2021.04) MDETR - Modulated Detection for End-to-End Multi-Modal Understanding, [Paper], [Code]

  • (arXiv 2021.04) Rich Semantics Improve Few-shot Learning, [Paper], [Code]

  • (arXiv 2021.04) Effect of Vision-and-Language Extensions on Natural Language Understanding in Vision-and-Language Models, [Paper]

  • (arXiv 2021.04) Token Labeling: Training an 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet, [Paper], [Code]

  • (arXiv 2021.04) So-ViT: Mind Visual Tokens for Vision Transformer, [Paper]

  • (arXiv 2021.04) Multiscale Vision Transformers, [Paper], [Code]

  • (arXiv 2021.04) M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection, [Paper]

  • (arXiv 2021.04) Transformer Transforms Salient Object Detection and Camouflaged Object Detection, [Paper]

  • (arXiv 2021.04) T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval, [Paper]

  • (arXiv 2021.04) VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization, [Paper]

  • (arXiv 2021.04) Multi-Modal Fusion Transformer for End-to-End Autonomous Driving, [Paper], [Code]

  • (arXiv 2021.04) TransVG: End-to-End Visual Grounding with Transformers, [Paper]

  • (arXiv 2021.04) Visual Transformer Pruning, [Paper]

  • (arXiv 2021.04) Higher Order Recurrent Space-Time Transformer, [Paper], [Code]

  • (arXiv 2021.04) CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval, [Paper], [Code]

  • (arXiv 2021.04) Lessons on Parameter Sharing across Layers in Transformers, [Paper]

  • (arXiv 2021.04) Disentangled Motif-aware Graph Learning for Phrase Grounding, [Paper]

  • (arXiv 2021.04) Co-Scale Conv-Attentional Image Transformers, [Paper], [Code]

  • (arXiv 2021.04) Cloth Interactive Transformer for Virtual Try-On, [Paper], [Code]

  • (arXiv 2021.04) LocalViT: Bringing Locality to Vision Transformers, [Paper], [Code]

  • (arXiv 2021.04) Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning, [Paper]

  • (arXiv 2021.04) Facial Attribute Transformers for Precise and Robust Makeup Transfer, [Paper]

  • (arXiv 2021.04) Emerging Properties in Self-Supervised Vision Transformers, [Paper], [Code]

  • (arXiv 2021.04) ConTNet: Why not use convolution and transformer at the same time? [Paper], [Code]

  • (arXiv 2021.04) Point Cloud Learning with Transformer, [Paper]

  • (arXiv 2021.04) Twins: Revisiting the Design of Spatial Attention in Vision Transformers, [Paper], [Code]

  • (arXiv 2021.04) Inpainting Transformer for Anomaly Detection, [Paper]

  • (arXiv 2021.04) Shot Contrastive Self-Supervised Learning for Scene Boundary Detection, [Paper]

  • (arXiv 2021.04) HOTR: End-to-End Human-Object Interaction Detection with Transformers, [Paper]

  • (arXiv 2021.04) Visual Saliency Transformer, [Paper]

  • (arXiv 2021.04) Improve Vision Transformers Training by Suppressing Over-smoothing, [Paper], [Code]

  • (arXiv 2021.04) Visformer: The Vision-friendly Transformer, [Paper], [Code]

  • (arXiv 2021.04) TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking, [Paper]

  • (arXiv 2021.04) Mesh Graphormer, [Paper], [Code]

  • (arXiv 2021.04) TrajeVAE: Controllable Human Motion Generation from Trajectories, [Paper]

  • (arXiv 2021.04) UC^2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training, [Paper]

  • (arXiv 2021.04) Learning to Cluster Faces via Transformer, [Paper]

  • (arXiv 2021.04) Skeletor: Skeletal Transformers for Robust Body-Pose Estimation, [Paper]

  • (arXiv 2021.04) VidTr: Video Transformer Without Convolutions, [Paper]

  • (arXiv 2021.04) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, [Paper]

  • (arXiv 2021.04) Going deeper with Image Transformers, [Paper] (see the code sketch after this month's list)

  • (arXiv 2021.04) EFFICIENT PRE-TRAINING OBJECTIVES FOR TRANSFORMERS, [Paper], [Code]

  • (arXiv 2021.04) ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING, [Paper]

  • (arXiv 2021.04) VideoGPT: Video Generation using VQ-VAE and Transformers, [Paper], [Code]

  • (arXiv 2021.04) DODRIO: Exploring Transformer Models with Interactive Visualization, [Paper], [Code]

  • (arXiv 2021.04) Lifting Transformer for 3D Human Pose Estimation in Video, [Paper]

  • (arXiv 2021.04) Demystifying the Better Performance of Position Encoding Variants for Transformer, [Paper]

  • (arXiv 2021.04) Consistent Accelerated Inference via Confident Adaptive Transformers, [Paper], [Code]

  • (arXiv 2021.04) Temporal Query Networks for Fine-grained Video Understanding, [Paper], [Code]

  • (arXiv 2021.04) Face Transformer for Recognition, [Paper], [Code]

  • (arXiv 2021.04) VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks, [Paper]

  • (arXiv 2021.04) Self-supervised Video Retrieval Transformer Network, [Paper]

  • (arXiv 2021.04) Cross-Modal Retrieval Augmentation for Multi-Modal Classification, [Paper]

  • (arXiv 2021.04) Point-Based Modeling of Human Clothing, [Paper]

  • (arXiv 2021.04) Points as Queries: Weakly Semi-supervised Object Detection by Points, [Paper]

  • (arXiv 2021.04) Geometry-Free View Synthesis: Transformers and no 3D Priors, [Paper], [Code]

  • (arXiv 2021.04) Self-supervised Video Object Segmentation by Motion Grouping, [Paper], [Project]

  • (arXiv 2021.04) Decoupled Spatial-Temporal Transformer for Video Inpainting, [Paper], [Code]

  • (arXiv 2021.04) Pose Recognition with Cascade Transformers, [Paper], [Code]

  • (arXiv 2021.04) Action-Conditioned 3D Human Motion Synthesis with Transformer VAE, [Paper], [Project]

  • (arXiv 2021.04) Escaping the Big Data Paradigm with Compact Transformers, [Paper], [Code]

  • (arXiv 2021.04) Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation, [Paper]

  • (arXiv 2021.04) Handwriting Transformers, [Paper]

  • (arXiv 2021.04) SiT: Self-supervised vIsion Transformer, [Paper]

  • (arXiv 2021.04) EFFICIENT TRANSFORMERS IN REINFORCEMENT LEARNING USING ACTOR-LEARNER DISTILLATION, [Paper]

  • (arXiv 2021.04) Compressing Visual-linguistic Model via Knowledge Distillation, [Paper]

  • (arXiv 2021.04) When Pigs Fly: Contextual Reasoning in Synthetic and Natural Scenes, [Paper]

  • (arXiv 2021.04) Variational Transformer Networks for Layout Generation, [Paper]

  • (arXiv 2021.04) Few-Shot Transformation of Common Actions into Time and Space, [Paper]

  • (arXiv 2021.04) Fourier Image Transformer, [Paper]

  • (arXiv 2021.04) Efficient DETR: Improving End-to-End Object Detector with Dense Prior, [Paper]

  • (arXiv 2021.04) A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification, [Paper]

  • (arXiv 2021.04) An Empirical Study of Training Self-Supervised Visual Transformers, [Paper]

  • (arXiv 2021.04) Multitarget Tracking with Transformers, [Paper]

  • (arXiv 2021.04) TFill: Image Completion via a Transformer-Based Architecture, [Paper], [Code]

  • (arXiv 2021.04) AAformer: Auto-Aligned Transformer for Person Re-Identification, [Paper]

  • (arXiv 2021.04) VisQA: X-raying Vision and Language Reasoning in Transformers, [Paper]

  • (arXiv 2021.04) TubeR: Tube-Transformer for Action Detection, [Paper]

  • (arXiv 2021.04) Language-based Video Editing via Multi-Modal Multi-Level Transformer, [Paper]

  • (arXiv 2021.04) LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference, [Paper]

  • (arXiv 2021.04) LoFTR: Detector-Free Local Feature Matching with Transformers, [Paper], [Code]

  • (arXiv 2021.04) Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis, [Paper], [Project]

  • (arXiv 2021.04) Group-Free 3D Object Detection via Transformers, [Paper], [Code]

  • (arXiv 2021.04) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, [Paper]

  • (arXiv 2021.04) Composable Augmentation Encoding for Video Representation Learning, [Paper]

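From this month's list, "Going deeper with Image Transformers" (CaiT) is a good example of a one-line architectural fix: each residual branch is multiplied by a learnable per-channel vector initialized near zero (LayerScale), which stabilizes the training of very deep ViTs. A hedged sketch follows; the wrapper and initialization value are ours, and in CaiT this wraps the attention and feed-forward branches of every block:

```python
# LayerScale sketch (ours): scale a residual branch by a small learnable
# per-channel vector so deep stacks start close to the identity mapping.
import torch
import torch.nn as nn

class LayerScaleResidual(nn.Module):
    def __init__(self, dim: int, block: nn.Module, init_eps: float = 1e-5):
        super().__init__()
        self.block = block
        self.gamma = nn.Parameter(init_eps * torch.ones(dim))

    def forward(self, x):
        return x + self.gamma * self.block(x)

layer = LayerScaleResidual(64, nn.Sequential(nn.LayerNorm(64), nn.Linear(64, 64)))
print(layer(torch.randn(2, 196, 64)).shape)  # torch.Size([2, 196, 64])
```
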
2021.03

  • (arXiv 2021.03) TransCenter: Transformers with Dense Queries for Multiple-Object Tracking, [Paper]

  • (arXiv 2021.03) PixelTransformer: Sample Conditioned Signal Generation, [Paper], [Code]

  • (arXiv 2021.03) Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation, [Paper]

  • (arXiv 2021.03) DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention, [Paper]

  • (arXiv 2021.03) Learning Spatio-Temporal Transformer for Visual Tracking, [Paper], [Code]

  • (arXiv 2021.03) StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery, [Paper], [Code]

  • (arXiv 2021.03) Multimodal Motion Prediction with Stacked Transformers, [Paper], [Code]

  • (arXiv 2021.03) Robust Facial Expression Recognition with Convolutional Visual Transformers, [Paper]

  • (arXiv 2021.03) Describing and Localizing Multiple Changes with Transformers, [Paper], [Project]

  • (arXiv 2021.03) COTR: Correspondence Transformer for Matching Across Images, [Paper]

  • (arXiv 2021.03) Understanding Robustness of Transformers for Image Classification, [Paper]

  • (arXiv 2021.03) CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, [Paper]

  • (arXiv 2021.03) Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers, [Paper]

  • (arXiv 2021.03) HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval, [Paper]

  • (arXiv 2021.03) TFPose: Direct Human Pose Estimation with Transformers, [Paper], [Code]

  • (arXiv 2021.03) Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding, [Paper]

  • (arXiv 2021.03) Transformer Tracking, [Paper], [Code]

  • (arXiv 2021.03) ViViT: A Video Vision Transformer, [Paper]

  • (arXiv 2021.03) CvT: Introducing Convolutions to Vision Transformers, [Paper], [Code]

  • (arXiv 2021.03) Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, [Paper], [Code]

  • (arXiv 2021.03) On the Adversarial Robustness of Visual Transformers, [Paper]

  • (arXiv 2021.03) Rethinking Spatial Dimensions of Vision Transformers, [Paper], [Code]

  • (arXiv 2021.03) Spatiotemporal Transformer for Video-based Person Re-identification, [Paper]

  • (arXiv 2021.03) Read and Attend: Temporal Localisation in Sign Language Videos, [Paper], [Benchmark]

  • (arXiv 2021.03) Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers, [Paper]

  • (arXiv 2021.03) An Image is Worth 16x16 Words, What is a Video Worth? [Paper]

  • (arXiv 2021.03) High-Fidelity Pluralistic Image Completion with Transformers, [Paper], [Code]

  • (arXiv 2021.03) Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, [Paper], [Code] (see the code sketch after this month's list)

  • (arXiv 2021.03) Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning, [Paper], [Code]

  • (arXiv 2021.03) Multi-view 3D Reconstruction with Transformer, [Paper]

  • (arXiv 2021.03) Scene-Intuitive Agent for Remote Embodied Visual Grounding, [Paper]

  • (arXiv 2021.03) Can Vision Transformers Learn without Natural Images? [Paper]

  • (arXiv 2021.03) On the Robustness of Vision Transformers to Adversarial Examples, [Paper]

  • (arXiv 2021.03) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain, [Paper], [Code]

  • (arXiv 2021.03) End-to-End Trainable Multi-Instance Pose Estimation with Transformers, [Paper]

  • (arXiv 2021.03) Transformers Solve the Limited Receptive Field for Monocular Depth Prediction, [Paper], [Code]

  • (arXiv 2021.03) Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning, [Paper]

  • (arXiv 2021.03) Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking, [Paper], [Code]

  • (arXiv 2021.03) DeepViT: Towards Deeper Vision Transformer, [Paper], [Code]

  • (arXiv 2021.03) Incorporating Convolution Designs into Visual Transformers, [Paper]

  • (arXiv 2021.03) MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation, [Paper]

  • (arXiv 2021.03) Paying Attention to Multiscale Feature Maps in Multimodal Image Matching, [Paper]

  • (arXiv 2021.03) HOPPER: MULTI-HOP TRANSFORMER FOR SPATIOTEMPORAL REASONING, [Paper], [Code]

  • (arXiv 2021.03) Scalable Visual Transformers with Hierarchical Pooling, [Paper]

  • (arXiv 2021.03) AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting, [Paper], [Code]

  • (arXiv 2021.03) Vision Transformers for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.03) 3D Human Pose Estimation with Spatial and Temporal Transformers, [Paper], [Code]

  • (arXiv 2021.03) ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases, [Paper], [Code]

  • (arXiv 2021.03) MDMMT: Multidomain Multimodal Transformer for Video Retrieval, [Paper]

  • (arXiv 2021.03) On the Sentence Embeddings from Pre-trained Language Models, [Paper]

  • (arXiv 2021.03) Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training, [Paper]

  • (arXiv 2021.03) DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer, [Paper]

  • (arXiv 2021.03) Decoupled Spatial Temporal Graphs for Generic Visual Grounding, [Paper]

  • (arXiv 2021.03) Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning, [Paper]

  • (arXiv 2021.03) Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models, [Paper], [Code]

  • (arXiv 2021.03) TransFG: A Transformer Architecture for Fine-grained Recognition, [Paper]

  • (arXiv 2021.03) Causal Attention for Vision-Language Tasks, [Paper], [Code]

  • (arXiv 2021.03) Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks, [Paper]

  • (arXiv 2021.03) WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training, [Paper]

  • (arXiv 2021.03) Attention is not all you need: pure attention loses rank doubly exponentially with depth, [Paper]

  • (arXiv 2021.03) QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information, [Paper], [Code]

  • (arXiv 2021.03) Reformulating HOI Detection as Adaptive Set Prediction, [Paper], [Code]

  • (arXiv 2021.03) End-to-End Human Object Interaction Detection with HOI Transformer, [Paper], [Code]

  • (arXiv 2021.03) Perceiver: General Perception with Iterative Attention, [Paper]

  • (arXiv 2021.03) Transformer in Transformer, [Paper], [Code]

  • (arXiv 2021.03) Generative Adversarial Transformers, [Paper], [Code]

  • (arXiv 2021.03) OmniNet: Omnidirectional Representations from Transformers, [Paper]

  • (arXiv 2021.03) Single-Shot Motion Completion with Transformer, [Paper], [Code]

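The Swin Transformer entry above computes attention inside non-overlapping local windows and alternates with a cyclic shift so that information crosses window borders. The partition itself is a pure reshape; here is a simplified sketch (the official code additionally masks attention across the wrapped-around borders after the shift, which we omit):

```python
# Sketch of Swin's (shifted) window partition (simplified, no attention mask).
import torch

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """x: (B, H, W, C) -> windows: (B * H//ws * W//ws, ws*ws, C)."""
    b, h, w, c = x.shape
    x = x.view(b, h // ws, ws, w // ws, ws, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)

x = torch.randn(2, 56, 56, 96)
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))  # cyclic shift before partitioning
print(window_partition(shifted, 7).shape)  # torch.Size([128, 49, 96])
```

Attention then runs per window on the (49, 96) token groups, so the cost is linear in image size rather than quadratic.
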
2021.02

  • (arXiv 2021.02) Evolving Attention with Residual Convolutions, [Paper]

  • (arXiv 2021.02) GEM: Glare or Gloom, I Can Still See You – End-to-End Multimodal Object Detector, [Paper]

  • (arXiv 2021.02) SparseBERT: Rethinking the Importance Analysis in Self-attention, [Paper]

  • (arXiv 2021.02) Investigating the Limitations of Transformers with Simple Arithmetic Tasks, [Paper], [Code]

  • (arXiv 2021.02) Do Transformer Modifications Transfer Across Implementations and Applications? [Paper]

  • (arXiv 2021.02) Do We Really Need Explicit Position Encodings for Vision Transformers? [Paper], [Code]

  • (arXiv 2021.02) A Straightforward Framework For Video Retrieval Using CLIP, [Paper], [Code]

  • (arXiv 2021.02) Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, [Paper], [Code] (see the code sketch after this month's list)

  • (arXiv 2021.02) VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining, [Paper], [Code]

  • (arXiv 2021.02) Towards Accurate and Compact Architectures via Neural Architecture Transformer, [Paper]

  • (arXiv 2021.02) Linear Transformers Are Secretly Fast Weight Memory Systems, [Paper]

  • (arXiv 2021.02) POSITION INFORMATION IN TRANSFORMERS: AN OVERVIEW, [Paper]

  • (arXiv 2021.02) Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, [Paper], [Project], [Code]

  • (arXiv 2021.02) Centroid Transformer: Learning to Abstract with Attention, [Paper]

  • (arXiv 2021.02) Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts, [Paper]

  • (arXiv 2021.02) TransGAN: Two Transformers Can Make One Strong GAN, [Paper], [Code]

  • (arXiv 2021.02) END-TO-END AUDIO-VISUAL SPEECH RECOGNITION WITH CONFORMERS, [Paper]

  • (arXiv 2021.02) Is Space-Time Attention All You Need for Video Understanding? [Paper], [Code]

  • (arXiv 2021.02) Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling, [Paper], [Code]

  • (arXiv 2021.02) Video Transformer Network, [Paper]

  • (arXiv 2021.02) Training Vision Transformers for Image Retrieval, [Paper]

  • (arXiv 2021.02) Relaxed Transformer Decoders for Direct Action Proposal Generation, [Paper], [Code]

  • (arXiv 2021.02) TransReID: Transformer-based Object Re-Identification, [Paper]

  • (arXiv 2021.02) Improving Visual Reasoning by Exploiting The Knowledge in Texts, [Paper]

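The Pyramid Vision Transformer listed above keeps attention affordable on high-resolution feature maps with spatial-reduction attention (SRA): keys and values are downsampled with a strided projection before attention, while queries stay at full resolution. A rough sketch (module and parameter names are ours, and we reuse PyTorch's stock attention rather than the paper's implementation):

```python
# Rough SRA sketch (ours): downsample K/V with a strided conv, attend at full
# query resolution. Cuts attention cost by reduction^2 on the K/V side.
import torch
import torch.nn as nn

class SRAttention(nn.Module):
    def __init__(self, dim=64, heads=4, reduction=4):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, H*W, dim) tokens of an H x W feature map.
        b, n, c = x.shape
        kv = self.sr(x.transpose(1, 2).reshape(b, c, h, w))  # (B, C, H/r, W/r)
        kv = kv.flatten(2).transpose(1, 2)                   # (B, HW/r^2, C)
        out, _ = self.attn(x, kv, kv)
        return out

m = SRAttention()
print(m(torch.randn(2, 56 * 56, 64), 56, 56).shape)  # torch.Size([2, 3136, 64])
```
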
2021.01

  • (arXiv 2021.01) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper]

  • (arXiv 2021.01) Dual-Level Collaborative Transformer for Image Captioning, [Paper]

  • (arXiv 2021.01) SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation, [Paper]

  • (arXiv 2021.01) CPTR: FULL TRANSFORMER NETWORK FOR IMAGE CAPTIONING, [Paper]

  • (arXiv 2021.01) Trans2Seg: Transparent Object Segmentation with Transformer, [Paper], [Code]

  • (arXiv 2021.01) Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, [Paper], [Code]

  • (arXiv 2021.01) Trear: Transformer-based RGB-D Egocentric Action Recognition, [Paper]

  • (arXiv 2021.01) Learn to Dance with AIST++: Music Conditioned 3D Dance Generation, [Paper], [Page]

  • (arXiv 2021.01) Spherical Transformer: Adapting Spherical Signal to CNNs, [Paper]

  • (arXiv 2021.01) Are We There Yet? Learning to Localize in Embodied Instruction Following, [Paper]

  • (arXiv 2021.01) VinVL: Making Visual Representations Matter in Vision-Language Models, [Paper]

  • (arXiv 2021.01) Bottleneck Transformers for Visual Recognition, [Paper]

  • (arXiv 2021.01) Investigating the Vision Transformer Model for Image Retrieval Tasks, [Paper]

  • (arXiv 2021.01) ADDRESSING SOME LIMITATIONS OF TRANSFORMERS WITH FEEDBACK MEMORY, [Paper]

  • (arXiv 2021.01) Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, [Paper], [Code] (see the code sketch after this month's list)

  • (arXiv 2021.01) TrackFormer: Multi-Object Tracking with Transformers, [Paper]

  • (arXiv 2021.01) VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search, [Paper]

  • (arXiv 2021.01) Line Segment Detection Using Transformers without Edges, [Paper]

  • (arXiv 2021.01) Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, [Paper]

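Tokens-to-Token ViT, listed above, repeatedly re-tokenizes the image with a "soft split": neighboring tokens are unfolded into overlapping patches and concatenated, shrinking the token count while growing the embedding so local structure is modeled before global attention. The operation maps directly onto `unfold`; below is a schematic with shapes and kernel settings of our choosing:

```python
# "Soft split" re-tokenization sketch (ours): overlapping unfold over the
# token grid, concatenating each k x k neighborhood into one longer token.
import torch
import torch.nn.functional as F

def soft_split(x, h, w, kernel=3, stride=2, padding=1):
    # x: (B, H*W, C) -> (B, H'*W', C*kernel*kernel)
    b, n, c = x.shape
    img = x.transpose(1, 2).reshape(b, c, h, w)
    out = F.unfold(img, kernel, stride=stride, padding=padding)
    return out.transpose(1, 2)

x = torch.randn(2, 56 * 56, 64)
print(soft_split(x, 56, 56).shape)  # torch.Size([2, 784, 576]): 4x fewer, 9x wider tokens
```
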
2020.12

  • (arXiv 2020.12) Cloud Transformers, [Paper]

  • (arXiv 2020.12) Accurate Word Representations with Universal Visual Guidance, [Paper]

  • (arXiv 2020.12) DETR for Pedestrian Detection, [Paper]

  • (arXiv 2020.12) Transformer Interpretability Beyond Attention Visualization, [Paper], [Code]

  • (arXiv 2020.12) PCT: Point Cloud Transformer, [Paper]

  • (arXiv 2020.12) TransPose: Towards Explainable Human Pose Estimation by Transformer, [Paper]

  • (arXiv 2020.12) Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]

  • (arXiv 2020.12) Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry, [Paper]

  • (arXiv 2020.12) Transformer for Image Quality Assessment, [Paper], [Code]

  • (arXiv 2020.12) TransTrack: Multiple-Object Tracking with Transformer, [Paper], [Code]

  • (arXiv 2020.12) 3D Object Detection with Pointformer, [Paper]

  • (arXiv 2020.12) Training data-efficient image transformers & distillation through attention, [Paper] (see the code sketch after this month's list)

  • (arXiv 2020.12) Toward Transformer-Based Object Detection, [Paper]

  • (arXiv 2020.12) SceneFormer: Indoor Scene Generation with Transformers, [Paper]

  • (arXiv 2020.12) Point Transformer, [Paper]

  • (arXiv 2020.12) End-to-End Human Pose and Mesh Reconstruction with Transformers, [Paper]

  • (arXiv 2020.12) Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting, [Paper]

  • (arXiv 2020.12) Pre-Trained Image Processing Transformer, [Paper]

  • (arXiv 2020.12) Taming Transformers for High-Resolution Image Synthesis, [Paper], [Code]

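DeiT ("Training data-efficient image transformers & distillation through attention", above) adds a distillation token that is supervised by a convnet teacher. Its hard-label variant reduces to two cross-entropies; a compact sketch (the equal weighting and argmax target follow the paper's hard-distillation objective, while the function name and the tensors below are ours):

```python
# DeiT hard-label distillation sketch (ours): the class token learns from the
# true label, the distillation token from the teacher's argmax prediction.
import torch
import torch.nn.functional as F

def deit_hard_distill_loss(cls_logits, dist_logits, labels, teacher_logits):
    loss_cls = F.cross_entropy(cls_logits, labels)
    loss_dist = F.cross_entropy(dist_logits, teacher_logits.argmax(dim=-1))
    return 0.5 * (loss_cls + loss_dist)

cls_l, dist_l = torch.randn(8, 1000), torch.randn(8, 1000)
labels, teacher = torch.randint(0, 1000, (8,)), torch.randn(8, 1000)
print(deit_hard_distill_loss(cls_l, dist_l, labels, teacher))
```
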
2020.11

  • (arXiv 2020.11) End-to-end Lane Shape Prediction with Transformers, [Paper], [Code]

  • (arXiv 2020.11) UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, [Paper]

  • (arXiv 2020.11) End-to-End Video Instance Segmentation with Transformers, [Paper]

  • (arXiv 2020.11) Rethinking Transformer-based Set Prediction for Object Detection, [Paper]

  • (arXiv 2020.11) General Multi-label Image Classification with Transformers, [Paper]

  • (arXiv 2020.11) End-to-End Object Detection with Adaptive Clustering Transformer, [Paper]

before 2020.11

  • (arXiv 2020.10) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, [Paper], [Code]

  • (arXiv 2020.07) Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks, [Paper], [Code]

  • (arXiv 2020.07) Feature Pyramid Transformer, [Paper], [Code]

  • (arXiv 2020.06) Linformer: Self-Attention with Linear Complexity, [Paper]

  • (arXiv 2020.06) Visual Transformers: Token-based Image Representation and Processing for Computer Vision, [Paper]

  • (arXiv 2019.08) LXMERT: Learning Cross-Modality Encoder Representations from Transformers, [Paper], [Code]

  • (ICLR'21) IOT: INSTANCE-WISE LAYER REORDERING FOR TRANSFORMER STRUCTURES, [Paper], [Code]

  • (ICLR'21) UPDET: UNIVERSAL MULTI-AGENT REINFORCEMENT LEARNING VIA POLICY DECOUPLING WITH TRANSFORMERS, [Paper], [Code]

  • (ICLR'21) Deformable DETR: Deformable Transformers for End-to-End Object Detection, [Paper], [Code]

  • (ICLR'21) LAMBDANETWORKS: MODELING LONG-RANGE INTERACTIONS WITHOUT ATTENTION, [Paper], [Code]

  • (ICLR'21) SUPPORT-SET BOTTLENECKS FOR VIDEO-TEXT REPRESENTATION LEARNING, [Paper]

  • (ICLR'21) COLORIZATION TRANSFORMER, [Paper], [Code]

  • (ECCV'20) Multi-modal Transformer for Video Retrieval, [Paper]

  • (ECCV'20) Connecting Vision and Language with Localized Narratives, [Paper]

  • (ECCV'20) DETR: End-to-End Object Detection with Transformers, [Paper], [Code] (see the code sketch after this list)

  • (CVPR'20) PaStaNet: Toward Human Activity Knowledge Engine, [Paper], [Code]

  • (CVPR'20) Multi-Modality Cross Attention Network for Image and Sentence Matching, [Paper], [Page]

  • (CVPR'20) Learning Texture Transformer Network for Image Super-Resolution, [Paper], [Code]

  • (CVPR'20) Speech2Action: Cross-modal Supervision for Action Recognition, [Paper]

  • (ICPR'20) Transformer Encoder Reasoning Network, [Paper], [Code]

  • (EMNLP'19) Effective Use of Transformer Networks for Entity Tracking, [Paper], [Code]

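DETR, in the list above, casts detection as set prediction: each image's query predictions are matched one-to-one to ground-truth objects with the Hungarian algorithm before any loss is computed. A condensed sketch of the matching step (classification cost only, for brevity; DETR's real cost matrix also includes L1 and generalized-IoU box terms, and this example needs SciPy):

```python
# DETR-style bipartite matching sketch (classification cost only, ours).
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, tgt_labels):
    # pred_logits: (num_queries, num_classes); tgt_labels: (num_targets,)
    prob = pred_logits.softmax(-1)   # (Q, C)
    cost = -prob[:, tgt_labels]      # (Q, T): negative probability of each target class
    row, col = linear_sum_assignment(cost.detach().numpy())
    return row, col                  # matched query indices, matched target indices

q, c = torch.randn(100, 92), torch.tensor([3, 17, 42])
print(hungarian_match(q, c))
```

Unmatched queries are trained toward a "no object" class, which is what removes the need for NMS and anchor heuristics.
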
TODO
