Implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Overview

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

This is an unofficial PyTorch implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification.
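
CrossViT processes each image with two transformer branches operating on patches of different sizes and fuses them by letting the CLS token of one branch attend to the patch tokens of the other. The snippet below is a minimal, self-contained sketch of that cross-attention fusion step for illustration only; the class name, arguments, and simplifications (for example, a single nn.MultiheadAttention without the paper's CLS projection layers) are this sketch's own and do not reflect the exact modules in this repository.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    # Sketch: the CLS token from one branch (query) attends to the patch
    # tokens of the other branch (key/value) and is updated with a residual.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cls_a, tokens_b):
        # cls_a:    [B, 1, dim]  CLS token of branch A
        # tokens_b: [B, N, dim]  patch tokens of branch B
        q = self.norm_q(cls_a)
        kv = self.norm_kv(tokens_b)
        fused, _ = self.attn(q, kv, kv)
        return cls_a + fused  # residual update of the CLS token

cls_a = torch.randn(1, 1, 192)       # one CLS token, 192-dim (illustrative)
tokens_b = torch.randn(1, 196, 192)  # 14 x 14 patch tokens from the other branch
print(CrossAttentionFusion(dim=192)(cls_a, tokens_b).shape)  # torch.Size([1, 1, 192])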

Usage

import torch
from crossvit import CrossViT

img = torch.ones([1, 3, 224, 224])

model = CrossViT(image_size=224, channels=3, num_classes=100)
out = model(img)

print("Shape of out:", out.shape)  # [B, num_classes]

Citation

@misc{chen2021crossvit,
      title={CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification}, 
      author={Chun-Fu Chen and Quanfu Fan and Rameswar Panda},
      year={2021},
      eprint={2103.14899},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement


Comments
  • Multilabel Image classification

    @rishikksh20 Thanks for sharing the code base. I have a few queries:
    1. Can we train CrossViT for a multi-label classification problem? If so, what is the procedure?
    2. I have a custom dataset of 10.5k images with 25 class labels, where each instance is labelled with a multi-hot vector of 0s and 1s.
    3. Can we remove the pre-trained classifier head and add our own custom classifier?

    Thanks in advance. (A sketch of one possible multi-label setup follows after these comments.)

    opened by abhigoku10 0
  • About the questions in Table 7 of the article

    Hello, I am very interested in your work, but there is one thing I don't understand. Regarding the comparison between Table 7 of the paper and CrossViT-S, the first line should be K=3, N=1, M=4 and L=1. I don't quite understand the setting in the first line of Table 7, and it seems inconsistent with the content mentioned in the article. Maybe I haven't understood your meaning correctly; I hope to receive your reply. Thank you!

    opened by jessica9812 0
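
Regarding the multi-label question above, a common recipe (not something this repository ships out of the box) is to size the classifier head to the number of labels via num_classes, feed multi-hot target vectors, and train with BCEWithLogitsLoss instead of softmax cross-entropy. The snippet below is only a sketch of that idea, reusing the constructor arguments shown in the Usage section:

import torch
import torch.nn as nn
from crossvit import CrossViT

# Sketch: 25 labels with multi-hot 0/1 target vectors, as in the question above.
model = CrossViT(image_size=224, channels=3, num_classes=25)
criterion = nn.BCEWithLogitsLoss()               # independent sigmoid per label

imgs = torch.randn(4, 3, 224, 224)               # dummy batch
targets = torch.randint(0, 2, (4, 25)).float()   # multi-hot label vectors

logits = model(imgs)                             # [4, 25]
loss = criterion(logits, targets)
loss.backward()
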
Owner
Rishikesh (ऋषिकेश)
Deep Learning / AI Researcher | Open Source enthusiast | Text to Speech | Speech Synthesis | Generative Models | Object Detection | Language Understanding