Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
This repository contains the PyTorch code for Evo-ViT.
This work proposes a slow-fast token evolution approach to accelerate vanilla vision transformers of both flat and deep-narrow structures without additional pre-training and fine-tuning procedures. For details please see Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer by Yifan Xu*, Zhijie Zhang*, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun.
Our code is based on pytorch-image-models, DeiT, and LeViT.
Preparation
Download and extract the ImageNet train and val images from http://image-net.org/. The directory structure is the standard layout expected by torchvision's datasets.ImageFolder, with the training and validation data in the train/ and val/ folders respectively:
/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img3.jpeg
    class2/
      img4.jpeg
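With this layout, the dataset can be read by torchvision's standard datasets.ImageFolder. As a quick sanity check of the directory structure (a minimal sketch that is independent of this repository's training code), the validation split can be loaded directly:

```python
# Minimal sanity check of the ImageNet folder layout (not part of this repo's code).
import torchvision.transforms as T
from torchvision import datasets

transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
])

# Expects /path/to/imagenet/val/<class_name>/<image>.jpeg
val_set = datasets.ImageFolder('/path/to/imagenet/val', transform=transform)
print(len(val_set), 'validation images in', len(val_set.classes), 'classes')
```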
All distillation settings are conducted with RegNetY-160 as the teacher model, which is available at the teacher checkpoint link.
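Since main_deit.py is based on DeiT, the teacher is typically passed via DeiT-style distillation arguments. A hypothetical example (the flag names below assume main_deit.py keeps DeiT's interface, e.g. --teacher-path and --distillation-type):

python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_deit.py --model evo_deit_small_patch16_224 --batch-size 128 --data-path /path/to/imagenet --output_dir /path/to/save --teacher-model regnety_160 --teacher-path /path/to/regnety_160.pth --distillation-type hard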
Install the requirements by running:
pip3 install -r requirements.txt
NOTE: all experiments in the paper were conducted under CUDA 11.0. If necessary, install the packages built for CUDA 11.0: torch 1.7.0+cu110 and torchvision 0.8.1+cu110.
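For example, the CUDA 11.0 builds can be installed from the official PyTorch wheel index:

pip3 install torch==1.7.0+cu110 torchvision==0.8.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html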
Model Zoo
We provide our Evo-ViT models pretrained on ImageNet:
Name | Top-1 Acc (%) | Throughput (img/s) | URL |
---|---|---|---|
Evo-ViT-T | 72.0 | 4027 | Google Drive |
Evo-ViT-S | 79.4 | 1510 | Google Drive |
Evo-ViT-B | 81.3 | 462 | Google Drive |
Evo-LeViT-128S | 73.0 | 10135 | Google Drive |
Evo-LeViT-128 | 74.4 | 8323 | Google Drive |
Evo-LeViT-192 | 76.8 | 6148 | Google Drive |
Evo-LeViT-256 | 78.8 | 4277 | Google Drive |
Evo-LeViT-384 | 80.7 | 2412 | Google Drive |
Evo-ViT-B* | 82.0 | 139 | Google Drive |
Evo-LeViT-256* | 81.1 | 1285 | Google Drive |
Evo-LeViT-384* | 82.2 | 712 | Google Drive |
The input image resolution is 224 × 224 unless otherwise specified. * denotes an input image resolution of 384 × 384.
Usage
Evaluation
To evaluate a pre-trained model, run:
python3 main_deit.py --model evo_deit_small_patch16_224 --eval --resume /path/to/checkpoint.pth --batch-size 256 --data-path /path/to/imagenet
Training with input resolution of 224
To train Evo-ViT on ImageNet on a single node with 8 gpus for 300 epochs, run:
Evo-ViT-T
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_deit.py --model evo_deit_tiny_patch16_224 --drop-path 0 --batch-size 256 --data-path /path/to/imagenet --output_dir /path/to/save
Evo-ViT-S
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_deit.py --model evo_deit_small_patch16_224 --batch-size 128 --data-path /path/to/imagenet --output_dir /path/to/save
The loss sometimes becomes NaN in the early training epochs of DeiT-B, as described in this issue. Our solution is to reduce the batch size to 128, load a warmup checkpoint trained for 9 epochs, and train Evo-ViT-B for the remaining 291 epochs. To train Evo-ViT-B on ImageNet on a single node with 8 gpus for 300 epochs, run:
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_deit.py --model evo_deit_base_patch16_224 --batch-size 128 --data-path /path/to/imagenet --output_dir /path/to/save --resume /path/to/warmup_checkpoint.pth
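If you do not have such a warmup checkpoint, one way to obtain it (a sketch, assuming main_deit.py keeps DeiT's behavior of writing checkpoint.pth to --output_dir at the end of every epoch) is to launch the same run without --resume, stop it after 9 epochs, and reuse the saved checkpoint:

python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_deit.py --model evo_deit_base_patch16_224 --batch-size 128 --data-path /path/to/imagenet --output_dir /path/to/warmup_dir

Then pass /path/to/warmup_dir/checkpoint.pth to --resume in the command above.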
To train Evo-LeViT-128 on ImageNet on a single node with 8 gpus for 300 epochs, run:
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_levit.py --model EvoLeViT_128 --batch-size 256 --data-path /path/to/imagenet --output_dir /path/to/save
The other Evo-LeViT models are trained with the same command as above.
Training with input resolution of 384
To train Evo-ViT-B* on ImageNet on 2 nodes with 8 gpus each for 300 epochs, run:
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=$NODE_SIZE --node_rank=$NODE_RANK --master_port=$MASTER_PORT --master_addr=$MASTER_ADDR main_deit.py --model evo_deit_base_patch16_384 --input-size 384 --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save
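Here $NODE_SIZE, $NODE_RANK, $MASTER_PORT, and $MASTER_ADDR follow the usual torch.distributed.launch semantics. For a 2-node run, for example, you would export on the first node

export NODE_SIZE=2 NODE_RANK=0 MASTER_ADDR=<ip-of-node-0> MASTER_PORT=29500

and the same values with NODE_RANK=1 on the second node, then execute the launch command on both nodes.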
To train Evo-ViT-S* on ImageNet on a single node with 8 gpus for 300 epochs, run:
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_deit.py --model evo_deit_small_patch16_384 --batch-size 128 --input-size 384 --data-path /path/to/imagenet --output_dir /path/to/save
To train Evo-LeViT-384* on ImageNet on a single node with 8 gpus for 300 epochs, run:
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_levit.py --model EvoLeViT_384_384 --input-size 384 --batch-size 128 --data-path /path/to/imagenet --output_dir /path/to/save
The other Evo-LeViT* models are trained with the same command as Evo-LeViT-384*.
Testing inference throughput
To test inference throughput, first modify the model name in line 153 of benchmark.py. Then, run:
python3 benchmark.py
The default input resolution is 224. To test inference throughput with an input resolution of 384, add the parameter "--img_size 384".
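For reference, the measurement itself amounts to timing forward passes on random input after a GPU warmup. A standalone sketch of that procedure (illustrative only; benchmark.py may differ in details such as batch size and iteration count):

```python
# Generic GPU throughput measurement sketch (illustrative; not the exact benchmark.py logic).
import time
import torch

@torch.no_grad()
def throughput(model, batch_size=64, img_size=224, n_iters=30, device='cuda'):
    model.eval().to(device)
    images = torch.randn(batch_size, 3, img_size, img_size, device=device)
    for _ in range(10):           # warmup iterations
        model(images)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):      # timed iterations
        model(images)
    torch.cuda.synchronize()
    return n_iters * batch_size / (time.time() - start)   # images per second
```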
Visualization of token selection
The visualization code is modified from DynamicViT.
To visualize a batch of ImageNet val images, run:
python3 visualize.py --model evo_deit_small_vis_patch16_224 --resume /path/to/checkpoint.pth --output_dir /path/to/save --data-path /path/to/imagenet --batch-size 64
To visualize a single image, run:
python3 visualize.py --model evo_deit_small_vis_patch16_224 --resume /path/to/checkpoint.pth --output_dir /path/to/save --img-path ./imgs/a.jpg --save-name evo_test
Add the parameter '--layer-wise-prune' if the visualized model was not trained with the layer-to-stage training strategy.
The visualization results of Evo-ViT-S are as follows:
Citation
If you find our work useful in your research, please consider citing:
@article{xu2021evo,
  title={Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer},
  author={Xu, Yifan and Zhang, Zhijie and Zhang, Mengdan and Sheng, Kekai and Li, Ke and Dong, Weiming and Zhang, Liqing and Xu, Changsheng and Sun, Xing},
  journal={arXiv preprint arXiv:2108.01390},
  year={2021}
}