Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding


Vision Longformer

This project provides the source code for the Vision Longformer paper, "Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding".


  • Fast PyTorch implementation of conv-like sliding-window local attention
  • Fast random-shifting training strategy for Vision Longformer
  • A versatile multi-scale vision transformer class (MsViT) that supports various efficient attention mechanisms
  • Comparison of multiple efficient attention mechanisms: Vision Longformer ("global + conv_like local") attention, performer attention, global-memory attention, linformer attention, and spatial reduction attention
  • Pre-trained models for different attention mechanisms


  • 03/29/2021: First version of the Vision Longformer paper posted on arXiv.
  • 04/30/2021: Performance improved by adding relative positional bias, inspired by Swin Transformer! Training is accelerated significantly by adding random-shifting training strategy. First version of code released.

Multi-scale Vision Transformer Architecture

Vision Longformer, and more generally the Multi-scale Vision Transformer (MsViT), follows the multi-stage design of ResNet. Each stage is a (slightly modified) vision transformer with a user-specified attention mechanism. Currently, five attention mechanisms are supported (the longformer* choices below select different implementations of the same Vision Longformer attention):

# choices=['full', 'longformerhand', 'linformer', 'srformer', 'performer', 'longformerauto', 'longformer_cuda']
_C.MODEL.VIT.MSVIT.ATTN_TYPE = 'longformerhand'

As an example, a 3-stage multi-scale model architecture is specified by the MODEL.VIT.MSVIT.ARCH:

_C.MODEL.VIT.MSVIT.ARCH = 'l1,h3,d192,n1,s1,g1,p16,f7,a1_l2,h6,d384,n10,s0,g1,p2,f7,a1_l3,h12,d768,n1,s0,g1,p2,f7,a1'

Configs of different stages are separated by _. For each stage, the meaning of the config l*,h*,d*,n*,s*,g*,p*,f*,a* is specified as below.

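| symbol | l | h | d | n | s | g | p | f | a |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Name | stage | num_heads | hidden_dim | num_layers | is_sparse_attention | num_global_tokens | patch_size | num_feats | absolute_position_embedding |
| Range | [1,2,3,4] | N+ | N+ | N+ | [0,1] | N | N | N | [0,1] |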

Here, N stands for natural numbers including 0, and N+ stands for positive integers.
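To make the format concrete, here is a minimal sketch of how such an ARCH string could be split into per-stage settings. The helper `parse_arch` and the `FIELD_NAMES` mapping are hypothetical names written for this illustration, not the repository's own parser. The divisibility check between hidden_dim and num_heads is an assumption based on how multi-head attention normally splits channels across heads.

```python
from typing import Dict, List

# Symbol-to-name mapping taken from the table above (helper is hypothetical).
FIELD_NAMES = {
    "l": "stage",
    "h": "num_heads",
    "d": "hidden_dim",
    "n": "num_layers",
    "s": "is_sparse_attention",
    "g": "num_global_tokens",
    "p": "patch_size",
    "f": "num_feats",
    "a": "absolute_position_embedding",
}


def parse_arch(arch: str) -> List[Dict[str, int]]:
    """Split an ARCH string into one dict of integer fields per stage."""
    stages = []
    for stage_cfg in arch.split("_"):
        fields: Dict[str, int] = {}
        for token in stage_cfg.split(","):
            token = token.strip()
            symbol, value = token[0], token[1:]
            if symbol not in FIELD_NAMES:
                raise ValueError(f"unknown symbol {symbol!r} in {token!r}")
            fields[FIELD_NAMES[symbol]] = int(value)
        # Assumption: each head gets hidden_dim / num_heads channels,
        # so hidden_dim must be divisible by num_heads.
        if fields["hidden_dim"] % fields["num_heads"] != 0:
            raise ValueError(
                f"hidden_dim {fields['hidden_dim']} is not divisible by "
                f"num_heads {fields['num_heads']}"
            )
        stages.append(fields)
    return stages


# The ViL-Small architecture string used in the evaluation commands.
SMALL_ARCH = (
    "l1,h3,d96,n1,s1,g1,p4,f7_l2,h3,d192,n2,s1,g1,p2,f7_"
    "l3,h6,d384,n8,s0,g1,p2,f7_l4,h12,d768,n1,s0,g0,p2,f7"
)
stages = parse_arch(SMALL_ARCH)
for cfg in stages:
    print(cfg)
```

Fields that a stage omits (such as `a` in the string above) are simply absent from that stage's dict; how the real code fills in defaults is not covered by this sketch.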

The num_feats (number of features) field, i.e., f, is overloaded for different attention mechanisms:

linformer: number of features

performer: number of (random orthogonal) features

srformer: spatial reduction ratio

longformer: one-sided window size (not including the token itself; the actual window size is 2 * f + 1 for MSVIT.SW_EXACT = 1 and 3 * f for MSVIT.SW_EXACT = 0/-1).
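As a small worked example of the longformer case, the following sketch evaluates the two window-size formulas stated above. The function name `longformer_window_size` is hypothetical and only restates the arithmetic from the text; it is not taken from the repository.

```python
def longformer_window_size(f: int, sw_exact: int) -> int:
    """Actual window size for a one-sided window size f, per the rule above."""
    if sw_exact == 1:
        # Exact sliding window: f neighbors on each side plus the token itself.
        return 2 * f + 1
    if sw_exact in (0, -1):
        # Chunk-based variants: the window spans three chunks of size f.
        return 3 * f
    raise ValueError(f"unsupported SW_EXACT value: {sw_exact}")


for mode in (1, 0, -1):
    print(f"SW_EXACT={mode}: window size {longformer_window_size(7, mode)}")
```

With the f7 setting used throughout the architecture strings, this gives a window of 15 for MSVIT.SW_EXACT = 1 and 21 for MSVIT.SW_EXACT = 0/-1.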

The following are the main model architectures used in Vision Longformer paper.

| Model size | stage_1 | stage_2 | stage_3 | stage_4 |
| --- | --- | --- | --- | --- |
| Tiny | n1,p4,h1,d48 | n1,p2,h3,d96 | n9,p2,h3,d192 | n1,p2,h6,d384 |
| Small | n1,p4,h3,d96 | n2,p2,h3,d192 | n8,p2,h6,d384 | n1,p2,h12,d768 |
| Medium-Deep | n1,p4,h3,d96 | n4,p2,h3,d192 | n16,p2,h6,d384 | n1,p2,h12,d768 |
| Medium-Wide | n1,p4,h3,d192 | n2,p2,h6,d384 | n8,p2,h8,d512 | n1,p2,h12,d768 |
| Base-Deep | n1,p4,h3,d96 | n8,p2,h3,d192 | n24,p2,h6,d384 | n1,p2,h12,d768 |
| Base-Wide | n1,p4,h3,d192 | n2,p2,h6,d384 | n8,p2,h12,d768 | n1,p2,h16,d1024 |
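The patch sizes in this table determine the feature-map resolution of each stage. The sketch below assumes that every stage applies a non-overlapping patch embedding that shrinks each spatial side by its patch_size; that reading of p is an assumption, and `stage_grid_sizes` is a hypothetical helper rather than repository code.

```python
from typing import List


def stage_grid_sizes(img_size: int, patch_sizes: List[int]) -> List[int]:
    """Side length of the token grid after each stage's patch embedding."""
    sizes = []
    side = img_size
    for p in patch_sizes:
        if side % p != 0:
            raise ValueError(f"side {side} is not divisible by patch size {p}")
        # Assumption: non-overlapping patches reduce each side by a factor of p.
        side //= p
        sizes.append(side)
    return sizes


# All models in the table use p4, p2, p2, p2 for stages 1 to 4.
print(stage_grid_sizes(224, [4, 2, 2, 2]))
print(stage_grid_sizes(384, [4, 2, 2, 2]))
```

Under this assumption a 224x224 input yields grids of side 56, 28, 14, and 7, which mirrors the four-stage pyramid of a ResNet.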

Model Performance

Main Results on ImageNet and Pretrained Models

Vision Longformer with absolute positional embedding

| name | pretrain | resolution | acc@1 | acc@5 | #params | FLOPs | 22K model | 1K model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViL-Tiny | ImageNet-1K | 224x224 | 76.3 | 93.3 | 6.7M | 1.43G | - | ckpt, config |
| ViL-Small | ImageNet-1K | 224x224 | 82.0 | 95.8 | 24.6M | 5.12G | - | ckpt, config |
| ViL-Medium-Deep | ImageNet-1K | 224x224 | 83.3 | 96.3 | 39.7M | 9.1G | - | ckpt, config |
| ViL-Medium-Wide | ImageNet-1K | 224x224 | 82.9 | 96.4 | 39.8M | 11.3G | - | ckpt, config |
| ViL-Medium-Deep | ImageNet-22K | 384x384 | 85.6 | 97.7 | 39.7M | 29.4G | ckpt, config | ckpt, config |
| ViL-Medium-Wide | ImageNet-22K | 384x384 | 84.7 | 97.3 | 39.8M | 35.1G | ckpt, config | ckpt, config |
| ViL-Base-Deep | ImageNet-22K | 384x384 | 86.0 | 97.9 | 55.7M | 45.3G | ckpt, config | ckpt, config |
| ViL-Base-Wide | ImageNet-22K | 384x384 | 86.2 | 98.0 | 79.0M | 55.8G | ckpt, config | ckpt, config |

Vision Longformer with relative positional embedding and comparison with Swin Transformers

| name | pretrain | resolution | acc@1 | acc@5 | #params | FLOPs | 22K model | 1K model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViL-Tiny | ImageNet-1K | 224x224 | 76.65 | 93.55 | 6.7M | 1.43G | - | ckpt, config |
| ViL-Small | ImageNet-1K | 224x224 | 82.39 | 95.92 | 24.6M | 5.12G | - | ckpt, config |
| ViL-Medium-Deep | ImageNet-1K | 224x224 | 83.52 | 96.52 | 39.7M | 9.1G | - | ckpt, config |
| ViL-Medium-Deep | ImageNet-22K | 384x384 | 85.73 | 97.8 | 39.7M | 29.4G | ckpt, config | ckpt, config |
| ViL-Base-Deep | ImageNet-22K | 384x384 | 86.11 | 97.89 | 55.7M | 45.3G | ckpt, config | ckpt, config |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-Tiny (2-2-6-2) | ImageNet-1K | 224x224 | 81.2 | 95.5 | 28M | 4.5G | - | from swin repo |
| ViL-Swin-Tiny (2-2-6-2) | ImageNet-1K | 224x224 | 82.71 | 95.95 | 28M | 5.33G | - | ckpt, config |
| Swin-Small (2-2-18-2) | ImageNet-1K | 224x224 | 83.2 | 96.2 | 50M | 8.7G | - | from swin repo |
| ViL-Swin-Small (2-2-18-2) | ImageNet-1K | 224x224 | 83.7 | 96.43 | 50M | 9.85G | - | ckpt, config |

Results of other attention mechanisms (Small size)

| Attention | pretrain | resolution | acc@1 | acc@5 | #params | FLOPs | 22K model | 1K model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| full | ImageNet-1K | 224x224 | 81.9 | 95.8 | 24.6M | 6.95G | - | ckpt, config |
| longformer | ImageNet-1K | 224x224 | 82.0 | 95.8 | 24.6M | 5.12G | - | ckpt, config |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| linformer | ImageNet-1K | 224x224 | 81.0 | 95.4 | 26.3M | 5.62G | - | ckpt, config |
| srformer/64 | ImageNet-1K | 224x224 | 76.4 | 92.9 | 52.9M | 3.97G | - | ckpt, config |
| srformer/32 | ImageNet-1K | 224x224 | 79.9 | 94.9 | 31.1M | 4.28G | - | ckpt, config |
| global | ImageNet-1K | 224x224 | 79.0 | 94.5 | 24.9M | 6.78G | - | ckpt, config |
| performer | ImageNet-1K | 224x224 | 78.7 | 94.3 | 24.8M | 6.26G | - | ckpt, config |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| partial linformer | ImageNet-1K | 224x224 | 81.8 | 95.9 | 25.8M | 5.21G | - | ckpt, config |
| partial srformer/32 | ImageNet-1K | 224x224 | 81.6 | 95.7 | 26.4M | 4.57G | - | ckpt, config |
| partial global | ImageNet-1K | 224x224 | 81.4 | 95.7 | 24.9M | 6.3G | - | ckpt, config |
| partial performer | ImageNet-1K | 224x224 | 81.7 | 95.7 | 24.7M | 5.52G | - | ckpt, config |

See more results on comparing different efficient attention mechanisms in Table 13 and Table 14 in the Vision Longformer paper.

Main Results on COCO object detection and instance segmentation (with absolute positional embedding)

Vision Longformer with absolute positional embedding

| Backbone | Method | pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViL-Tiny | RetinaNet | ImageNet-1K | 1x | 38.8 | -- | 16.64M | 182.7G |
| ViL-Tiny | RetinaNet | ImageNet-1K | 3x | 40.7 | -- | 16.64M | 182.7G |
| ViL-Small | RetinaNet | ImageNet-1K | 1x | 41.6 | -- | 35.68M | 254.8G |
| ViL-Small | RetinaNet | ImageNet-1K | 3x | 42.9 | -- | 35.68M | 254.8G |
| ViL-Medium (D) | RetinaNet | ImageNet-1K | 1x | 42.9 | -- | 50.77M | 330.4G |
| ViL-Medium (D) | RetinaNet | ImageNet-1K | 3x | 43.7 | -- | 50.77M | 330.4G |
| ViL-Base (D) | RetinaNet | ImageNet-1K | 1x | 44.3 | -- | 66.74M | 420.9G |
| ViL-Base (D) | RetinaNet | ImageNet-1K | 3x | 44.7 | -- | 66.74M | 420.9G |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViL-Tiny | Mask R-CNN | ImageNet-1K | 1x | 38.7 | 36.2 | 26.9M | 145.6G |
| ViL-Tiny | Mask R-CNN | ImageNet-1K | 3x | 41.2 | 37.9 | 26.9M | 145.6G |
| ViL-Small | Mask R-CNN | ImageNet-1K | 1x | 41.8 | 38.5 | 45.0M | 218.3G |
| ViL-Small | Mask R-CNN | ImageNet-1K | 3x | 43.4 | 39.6 | 45.0M | 218.3G |
| ViL-Medium (D) | Mask R-CNN | ImageNet-1K | 1x | 43.4 | 39.7 | 60.1M | 293.8G |
| ViL-Medium (D) | Mask R-CNN | ImageNet-1K | 3x | 44.6 | 40.7 | 60.1M | 293.8G |
| ViL-Base (D) | Mask R-CNN | ImageNet-1K | 1x | 45.1 | 41.0 | 76.1M | 384.4G |
| ViL-Base (D) | Mask R-CNN | ImageNet-1K | 3x | 45.7 | 41.3 | 76.1M | 384.4G |

See more fine-grained results in Table 6 and Table 7 in the Vision Longformer paper.

Results of other attention mechanisms (Small size)

| Backbone | Method | pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs | Memory |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| srformer/64 | Mask R-CNN | ImageNet-1K | 1x | 35.7 | 33.6 | 73.3M | 224.1G | 7.1G |
| srformer/32 | Mask R-CNN | ImageNet-1K | 1x | 39.8 | 36.8 | 51.5M | 268.3G | 13.6G |
| Partial srformer/32 | Mask R-CNN | ImageNet-1K | 1x | 41.1 | 38.1 | 46.8M | 352.1G | 22.6G |
| global | Mask R-CNN | ImageNet-1K | 1x | 34.1 | 32.5 | 45.2M | 226.4G | 7.6G |
| Partial global | Mask R-CNN | ImageNet-1K | 1x | 41.3 | 38.2 | 45.1M | 326.5G | 20.1G |
| performer | Mask R-CNN | ImageNet-1K | 1x | 35.0 | 33.1 | 45.0M | 251.5G | 8.4G |
| Partial performer | Mask R-CNN | ImageNet-1K | 1x | 41.7 | 38.4 | 45.0M | 343.7G | 20.0G |
| ViL | Mask R-CNN | ImageNet-1K | 1x | 41.3 | 38.1 | 45.0M | 218.3G | 7.4G |
| Partial ViL | Mask R-CNN | ImageNet-1K | 1x | 42.6 | 39.3 | 45.0M | 326.8G | 19.5G |

Compare different implementations of vision longformer

Please go to Implementation for implementation details of vision longformer.

Training/Testing Vision Longformer on Local Machine

Prepare datasets

Download the ImageNet zip files together with train_map.txt and val_map.txt into the specified data folder, e.g., the default src/datasets/imagenet. CIFAR10, CIFAR100, and MNIST are downloaded automatically.

With the default setting, we should have the following files in the /root/datasets directory:

root (root folder)
├── datasets (folder with all the datasets and pretrained models)
├──── imagenet/ (imagenet dataset and pretrained models)
├────── 2012/
├───────── train_map.txt
├───────── val_map.txt
├──── CIFAR10/ (CIFAR10 dataset and pretrained models)
├──── CIFAR100/ (CIFAR100 dataset and pretrained models)
├──── MNIST/ (MNIST dataset and pretrained models)
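Before launching a run it can help to confirm that the map files are where the scripts expect them. The following is a minimal sketch of such a check; `missing_imagenet_files` and `EXPECTED_FILES` are hypothetical names, and only the two map files that appear in the tree above are checked because the names of the zip files are not listed there.

```python
from pathlib import Path
from typing import List

# Only the files visible in the directory tree above are checked here.
EXPECTED_FILES = ["train_map.txt", "val_map.txt"]


def missing_imagenet_files(data_dir: str) -> List[str]:
    """Return the expected file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Path used by the evaluation commands in this README.
    missing = missing_imagenet_files("../datasets/imagenet/2012")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("ImageNet map files found.")
```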

Environment requirements

It is recommended to use any of the following docker images to run the experiments.

pengchuanzhang/maskrcnn:ubuntu18-py3.7-cuda10.1-pytorch1.7 # recommended
pengchuanzhang/maskrcnn:py3.7-cuda10.0-pytorch1.7 # if you want to try the customized cuda kernel of vision longformer.

For virtual environments, the following packages should be sufficient.

pytorch >= 1.5
tensorboardx, einops, timm, yacs==0.1.8

Evaluation scripts

Navigate to the src folder and run the following commands to evaluate the pre-trained models above.

Pretrained models of Vision Longformer

# tiny
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest MODEL.VIT.MSVIT.ARCH 'l1,h1,d48,n1,s1,g1,p4,f7_l2,h3,d96,n1,s1,g1,p2,f7_l3,h3,d192,n9,s0,g1,p2,f7_l4,h6,d384,n1,s0,g0,p2,f7' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/visionlongformer/msvit_tiny_longformersw_1191_train/model_best.pth 
INFO:root:ACCURACY: 76.29600524902344%
INFO:root:iter: 0  max mem: 2236
    accuracy_metrics - top1: 76.2960 (76.2960)  top5: 93.2720 (93.2720)
    epoch_metrics    - total_cnt: 50000.0000 (50000.0000)  loss: 0.0040 (0.0040)  time: 0.0022 (0.0022)

# small
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest MODEL.VIT.MSVIT.ARCH 'l1,h3,d96,n1,s1,g1,p4,f7_l2,h3,d192,n2,s1,g1,p2,f7_l3,h6,d384,n8,s0,g1,p2,f7_l4,h12,d768,n1,s0,g0,p2,f7' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/visionlongformer/msvit_small_longformersw_1281_train/model_best.pth 
INFO:root:ACCURACY: 81.97799682617188%
INFO:root:iter: 0  max mem: 6060
    accuracy_metrics - top1: 81.9780 (81.9780)  top5: 95.7880 (95.7880)
    epoch_metrics    - total_cnt: 50000.0000 (50000.0000)  loss: 0.0031 (0.0031)  time: 0.0029 (0.0029)

# medium-deep
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest MODEL.VIT.MSVIT.ARCH 'l1,h3,d96,n1,s1,g1,p4,f7_l2,h3,d192,n4,s1,g1,p2,f7_l3,h6,d384,n16,s0,g1,p2,f7_l4,h12,d768,n1,s0,g0,p2,f7' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/visionlongformer/deepmedium_14161_lr8e-4/model_best.pth

# medium-wide
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest MODEL.VIT.MSVIT.ARCH 'l1,h3,d192,n1,s1,g1,p4,f7_l2,h6,d384,n2,s1,g1,p2,f7_l3,h8,d512,n8,s0,g1,p2,f7_l4,h12,d768,n1,s0,g0,p2,f7' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/visionlongformer/wide_medium_1281/model_best.pth

# ImageNet22K pretrained and ImageNet1K finetuned medium-deep
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest FINETUNE.FINETUNE True INPUT.IMAGE_SIZE 384 INPUT.CROP_PCT 0.922 MODEL.VIT.MSVIT.ARCH 'l1,h3,d96,n1,s1,g1,p4,f7_l2,h3,d192,n4,s1,g1,p2,f7_l3,h6,d384,n16,s0,g1,p2,f7_l4,h12,d768,n1,s0,g0,p2,f7' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/IN384_IN22kpretrained/msvitdeepmedium_imagenet384_finetune_bsz256_lr001_wd0/model_best.pth

# ImageNet22K pretrained and ImageNet1K finetuned medium-wide
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest FINETUNE.FINETUNE True INPUT.IMAGE_SIZE 384 INPUT.CROP_PCT 0.922 MODEL.VIT.MSVIT.ARCH 'l1,h3,d192,n1,s1,g1,p4,f8_l2,h6,d384,n2,s1,g1,p2,f12_l3,h8,d512,n8,s0,g1,p2,f7_l4,h12,d768,n1,s0,g0,p2,f7' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/IN384_IN22kpretrained/msvitwidemedium_imagenet384_finetune_bsz512_lr004_wd0/model_best.pth

# ImageNet22K pretrained and ImageNet1K finetuned base-deep
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest FINETUNE.FINETUNE True INPUT.IMAGE_SIZE 384 INPUT.CROP_PCT 0.922 MODEL.VIT.MSVIT.LN_EPS 1e-5 MODEL.VIT.MSVIT.ARCH 'l1,h3,d96,n1,s1,g1,p4,f6_l2,h3,d192,n8,s1,g1,p2,f8_l3,h6,d384,n24,s0,g1,p2,f7_l4,h12,d768,n1,s0,g0,p2,f7' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/IN384_IN22kpretrained/msvitdeepbase_imagenet384_finetune_bsz640_lr003_wd0/model_best.pth

# ImageNet22K pretrained and ImageNet1K finetuned base-wide
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest FINETUNE.FINETUNE True INPUT.IMAGE_SIZE 384 INPUT.CROP_PCT 0.922 MODEL.VIT.MSVIT.ARCH 'l1,h3,d192,n1,s1,g1,p4,f8_l2,h6,d384,n2,s1,g1,p2,f8_l3,h12,d768,n8,s0,g1,p2,f7_l4,h16,d1024,n1,s0,g0,p2,f7' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/IN384_IN22kpretrained/msvitwidebase_imagenet384_finetune_bsz768_lr001_wd1e-7/model_best.pth DATALOADER.BSZ 64

Pretrained models of other attention mechanisms

# Small full attention
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest MODEL.VIT.MSVIT.ATTN_TYPE full MODEL.VIT.MSVIT.ARCH 'l1,h3,d96,n1,s1,g1,p4,f7_l2,h3,d192,n2,s1,g1,p2,f7_l3,h6,d384,n8,s0,g1,p2,f7_l4,h12,d768,n1,s0,g0,p2,f7' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/fullMSA/small1281/model_best.pth

# Small linformer
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest MODEL.VIT.MSVIT.ATTN_TYPE linformer MODEL.VIT.MSVIT.ARCH 'l1,h3,d96,n1,s1,g1,p4,f256_l2,h3,d192,n2,s1,g1,p2,f256_l3,h6,d384,n8,s1,g1,p2,f256_l4,h12,d768,n1,s1,g0,p2,f256' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/linformer/small1281_full/model_best.pth

# Small partial linformer
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest MODEL.VIT.MSVIT.ATTN_TYPE linformer MODEL.VIT.MSVIT.ARCH 'l1,h3,d96,n1,s1,g1,p4,f256_l2,h3,d192,n2,s1,g1,p2,f256_l3,h6,d384,n8,s0,g1,p2,f256_l4,h12,d768,n1,s0,g0,p2,f256' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/linformer/small1281_partial/model_best.pth

# Small global attention
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest MODEL.VIT.AVG_POOL True MODEL.VIT.MSVIT.ONLY_GLOBAL True MODEL.VIT.MSVIT.ATTN_TYPE longformerhand MODEL.VIT.MSVIT.ARCH 'l1,h3,d96,n1,s1,g256,p4,f7_l2,h3,d192,n2,s1,g256,p2,f7_l3,h6,d384,n8,s1,g64,p2,f7_l4,h12,d768,n1,s1,g16,p2,f7' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/globalformer/globalfull1281/model_best.pth

# Small partial global attention
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest MODEL.VIT.AVG_POOL True MODEL.VIT.MSVIT.ONLY_GLOBAL True MODEL.VIT.MSVIT.ATTN_TYPE longformerhand MODEL.VIT.MSVIT.ARCH 'l1,h3,d96,n1,s1,g256,p4,f7_l2,h3,d192,n2,s1,g256,p2,f7_l3,h6,d384,n8,s0,g1,p2,f7_l4,h6,d384,n1,s0,g0,p2,f7' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/globalformer/globalpartial1281/model_best.pth

# Small spatial reduction attention with down-sample ratio 64
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest MODEL.VIT.MSVIT.ATTN_TYPE srformer MODEL.VIT.MSVIT.ARCH 'l1,h3,d96,n1,s1,g1,p4,f16_l2,h3,d192,n2,s1,g1,p2,f8_l3,h6,d384,n8,s1,g1,p2,f4_l4,h12,d768,n1,s1,g0,p2,f2' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/srformer/srformerfull1281/model_best.pth

# Small spatial reduction attention with down-sample ratio 32
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest MODEL.VIT.MSVIT.ATTN_TYPE srformer MODEL.VIT.MSVIT.ARCH 'l1,h3,d96,n1,s1,g1,p4,f8_l2,h3,d192,n2,s1,g1,p2,f4_l3,h6,d384,n8,s1,g1,p2,f2_l4,h12,d768,n1,s0,g0,p2,f1' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/srformer/srformerfull8_1281/model_best.pth

# Small partial spatial reduction attention with down-sample ratio 32
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest MODEL.VIT.MSVIT.ATTN_TYPE srformer MODEL.VIT.MSVIT.ARCH 'l1,h3,d96,n1,s1,g1,p4,f8_l2,h3,d192,n2,s1,g1,p2,f4_l3,h6,d384,n8,s0,g1,p2,f2_l4,h12,d768,n1,s0,g0,p2,f1' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/srformer/srformerpartial1281/model_best.pth

# Small performer
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest MODEL.VIT.MSVIT.ATTN_TYPE performer MODEL.VIT.MSVIT.ARCH 'l1,h3,d96,n1,s1,g1,p4,f256_l2,h3,d192,n2,s1,g1,p2,f256_l3,h6,d384,n8,s1,g1,p2,f256_l4,h12,d768,n1,s1,g0,p2,f256' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/performer/fullperformer1281/model_best.pth

# Small partial performer
python --config-file 'config/msvit.yaml' --data ../datasets/imagenet/2012 --output_dir ../run/imagenet/msvittest MODEL.VIT.MSVIT.ATTN_TYPE performer MODEL.VIT.MSVIT.ARCH 'l1,h3,d96,n1,s1,g1,p4,f256_l2,h3,d192,n2,s1,g1,p2,f256_l3,h6,d384,n8,s0,g1,p2,f256_l4,h12,d768,n1,s0,g0,p2,f256' EVALUATE True MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/performer/partialperformer1281/model_best.pth
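The commands above pass settings as trailing KEY VALUE pairs, such as MODEL.VIT.MSVIT.ATTN_TYPE full or EVALUATE True. The sketch below shows, with the standard library only, how pairs of that shape can be folded into a nested config. It is an illustration, not the repository's config code (which presumably relies on the yacs package listed in the requirements), and `apply_overrides` is a hypothetical helper.

```python
import ast
from typing import Any, Dict, List


def apply_overrides(cfg: Dict[str, Any], opts: List[str]) -> Dict[str, Any]:
    """Apply ['A.B.C', 'value', ...] pairs to a nested dict, in place."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    for key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = key.split(".")
        node = cfg
        for name in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(name, {})
        try:
            # Turn 'True', '384', '0.922' into bool, int, float.
            node[leaf] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything else (names, paths, ARCH strings) stays a string.
            node[leaf] = raw
    return cfg


cfg = apply_overrides(
    {},
    [
        "MODEL.VIT.MSVIT.ATTN_TYPE", "full",
        "INPUT.IMAGE_SIZE", "384",
        "INPUT.CROP_PCT", "0.922",
        "EVALUATE", "True",
    ],
)
print(cfg)
```

Dotted keys address nested sections, so a single flat argument list is enough to switch the attention type, the input resolution, and evaluation mode at once.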

Training scripts

We provide three example training scripts as below.

# ViL-Tiny with relative positional embedding: Imagenet1K training with 224x224 resolution
python -m torch.distributed.launch --nproc_per_node=4 --config-file
    'config/msvit.yaml' --data '../datasets/imagenet/2012/' OPTIM.OPT adamw

# Training with random shifting strategy: accelerate the training significantly
python -m torch.distributed.launch --nproc_per_node=4 --config-file
    'config/msvit.yaml' --data '../datasets/imagenet/2012/' OPTIM.OPT adamw

# ViL-Medium-Deep: Imagenet1K finetuning with 384x384 resolution
python -m torch.distributed.launch --nproc_per_node=8 --config-file
    'config/msvit_384finetune.yaml' --data '/mnt/default/data/sasa/imagenet/2012/'
    MODEL.MODEL_PATH /home/penzhan/penzhanwu2/imagenet/msvit/IN22kpretrained/deepmedium/model_best.pth

Cite Vision Longformer

Please consider citing vision longformer if it helps your work.

  title={Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding},
  author={Zhang, Pengchuan and Dai, Xiyang and Yang, Jianwei and Xiao, Bin and Yuan, Lu and Zhang, Lei and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2103.15358},
  • Why does the implementation here reduce memory?

    @jwyang @pzzhang

    Thanks for your great work!

    In the sliding chunk approach, to achieve a conv-like local attention mechanism with window size $2 w + 1$, we split the feature map into chunks with size $w \times w$. Each chunk only attends to itself and its 8 neighbor chunks. The Pytorch Autograd will save 9 copies of the feature map (9 nodes in the computing graph) for automatic back-propagation, which is not time/memory efficient. The SCw/Handgrad version defines a customized torch.autograd.Function with hand-written backward function, which greatly saves the memory usage and also speeds up the algorithm, as shown in figures above. We would like to point out that the memory usage of the SCw/Handgrad version is nearly optimal (very close to that of the cuda_kernel).

    In my experiment, the memory is really much lower. But I am very interested in the principle behind it.

    The Pytorch Autograd will save 9 copies of the feature map (9 nodes in the computing graph) for automatic back-propagation...

    How to understand this?

    The SCw/Handgrad version defines a customized torch.autograd.Function with hand-written backward function, which greatly saves the memory usage and also speeds up the algorithm

    How does the implementation here avoid this? I am very curious about this and hope to get a more detailed explanation. Thank you very much!

    opened by lartpang 4
  • RuntimeError: shape '[2, 37, 3, 12, 66]' is invalid for input of size 176712

    When I add some code to use the MsViT model in src/models/ as follows:

        if __name__ == "__main__":
            test_tensor = torch.randn(2, 3, 384, 384)
            net = MsViT(num_classes=122, img_size=384, drop_rate=0., drop_path_rate=0.1,
                        norm_embed=True, avg_pool=False,
                        arch='l1,h3,d192,n1,s1,g1,p16,f7,a1_l2,h6,d384,n10,s0,g1,p2,f7,a1_l3,h12,d796,n1,s0,g1,p2,f7,a1',
                        sharew=True, attn_type='longformerhand', share_kv=True,
                        only_glo=False, sw_exact=0, ln_eps=1e-6, mode=0)

    the following error occurred in src/models/ line 87.

    RuntimeError: shape '[2, 37, 3, 12, 66]' is invalid for input of size 176712

    Can you help me? Thank you very much!

    opened by Yulv-git 2
  • Question on sliding chunks

    Thank you for your work. I just have one question about the implementation. I wonder if the chunk size could be reduced to one pixel, like the central pixel within the convolution. In this case, other neighboring chunks would become neighboring pixels in convolution. I am curious about the motivation of chunk within this implementation.

    opened by PeiqinZhuang 0
  • Fine-tuning on customized dataset for classification?

    Hi there,

    Thank you for contributing such great work to the community. A quick question: are there any instructions for applying transfer learning or fine-tuning this longformer on a customized classification dataset?

    Thank you in advance.

    opened by Dadatata-JZ 0
  • Would you release your implementations of CUDA optimized kernels using TVM?

    Hi @jwyang In your paper, you said that

    since it’s not making use of the highly optimized matrix multiplication libraries in CUDA, its speed is still slow in practice.

    The implementation using the customized CUDA kernel is about 20% faster than the full attention in the same setting, while achieving the theoretical memory complexity. The sliding-chunk approach is the fastest, which is 60% faster than the full attention with a cost of consuming 20% more memory than the theoretical complexity.

    Therefore, your code only contains the implementation of the sliding-chunk approach. However, have you ever tried to implement or generate optimized CUDA kernels for newer architectures (sm75+)? In my opinion, introducing tensor-core instructions and highly optimized GEMM libraries (cuBLAS, etc.) could improve the performance of longformer.

    opened by TengFeiHan0 3