Dynamic-Vision-Transformer (Pytorch)
This repo contains the official code and pre-trained models for the Dynamic Vision Transformer (DVT).
Update on 2021/06/01: released the pre-trained models and the inference code on ImageNet.
Introduction
We develop a Dynamic Vision Transformer (DVT) to automatically configure a proper number of tokens for each individual image, leading to a significant improvement in computational efficiency, both theoretically and empirically.
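At inference time, DVT behaves as a cascade: an image is first processed with a coarse token grid and exits early if the prediction is already confident enough; otherwise it is re-processed with a finer tokenization. The following is a minimal sketch of that early-exit loop, assuming a list of per-exit sub-networks and per-exit confidence thresholds; the function name and the softmax-confidence criterion are illustrative, not the exact interface of this repo.

```python
import torch
import torch.nn.functional as F

def cascade_inference(models, thresholds, image):
    """Illustrative early-exit loop (not the repo's exact API).

    `models` are Transformers operating on progressively finer token grids
    (e.g. 7x7 -> 10x10 -> 14x14) and `thresholds` are per-exit confidence
    thresholds; `image` is a single preprocessed image of shape [1, 3, H, W].
    """
    with torch.no_grad():
        for exit_idx, (model, tau) in enumerate(zip(models, thresholds)):
            logits = model(image)                       # run the current (coarser) exit
            confidence, prediction = F.softmax(logits, dim=-1).max(dim=-1)
            if confidence.item() >= tau:                # confident enough: stop early
                return prediction.item(), exit_idx
    return prediction.item(), exit_idx                  # final (finest) exit is always accepted
```

The confidence thresholds stored in the released checkpoints (see below) are the kind of values such a loop would consume.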
Results
- Top-1 accuracy on ImageNet vs. GFLOPs
- Top-1 accuracy on CIFAR vs. GFLOPs
- Top-1 accuracy on ImageNet vs. Throughput
- Visualization
Pre-trained Models
| Backbone | # of Exits | # of Tokens | Links |
| --- | --- | --- | --- |
| T2T-ViT-12 | 3 | 7x7-10x10-14x14 | Tsinghua Cloud / Google Drive |
- What the checkpoints contain (see the loading sketch after this list):
**.pth.tar
├── model_state_dict: state dictionaries of the model
├── flops: a list containing the GFLOPs corresponding to exiting at each exit
├── anytime_classification: Top-1 accuracy of each exit
├── dynamic_threshold: the confidence thresholds used in budgeted batch classification
└── budgeted_batch_classification: results of budgeted batch classification (a two-item list; entries [0] and [1] are the two coordinate arrays of the accuracy vs. computation curve)
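As a quick way to inspect these fields without running inference.py, a checkpoint can be opened directly with torch.load; the file path below is the same placeholder used in the commands further down, and the key names are the ones listed above.

```python
import torch

# Placeholder path: point this at the downloaded *.pth.tar file.
ckpt = torch.load('PATH_TO_CHECKPOINTS', map_location='cpu')

state_dict = ckpt['model_state_dict']            # model weights
flops = ckpt['flops']                            # GFLOPs when exiting at each exit
anytime_acc = ckpt['anytime_classification']     # Top-1 accuracy of each exit
thresholds = ckpt['dynamic_threshold']           # confidence thresholds for budgeted batch classification
budgeted = ckpt['budgeted_batch_classification'] # two-item list: the coordinates of the trade-off curve

for i, (g, acc) in enumerate(zip(flops, anytime_acc)):
    print(f'exit {i}: {float(g):.2f} GFLOPs, top-1 accuracy {float(acc):.2f}')
```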
Requirements
- Python 3.7.7
- PyTorch 1.3.1
- torchvision 0.4.2
Evaluate Pre-trained Models
Read the evaluation results saved in pre-trained models
CUDA_VISIBLE_DEVICES=0 python inference.py --batch_size 128 --model DVT_T2t_vit_12 --checkpoint_path PATH_TO_CHECKPOINTS --eval_mode 0
Read the confidence thresholds saved in pre-trained models and infer the model on the validation set
CUDA_VISIBLE_DEVICES=0 python inference.py --data_url PATH_TO_DATASET --batch_size 128 --model DVT_T2t_vit_12 --checkpoint_path PATH_TO_CHECKPOINTS --eval_mode 1
Determine confidence thresholds on the training set and infer the model on the validation set
CUDA_VISIBLE_DEVICES=0 python inference.py --data_url PATH_TO_DATASET --batch_size 128 --model DVT_T2t_vit_12 --checkpoint_path PATH_TO_CHECKPOINTS --eval_mode 2
The dataset is expected to be prepared as follows (a torchvision ImageFolder loading sketch is given after the tree):
ImageNet
├── train
│ ├── folder 1 (class 1)
│ ├── folder 2 (class 2)
│ ├── ...
├── val
│ ├── folder 1 (class 1)
│ ├── folder 2 (class 2)
│ ├── ...
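This class-per-subfolder layout is the standard one consumed by torchvision.datasets.ImageFolder. The sketch below shows one way the val split could be loaded for evaluation; the image size and normalization are illustrative assumptions, and the repo's inference.py may build its data loaders differently.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Illustrative preprocessing; adjust to match the settings used by inference.py.
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# 'PATH_TO_DATASET' is the same placeholder as in the commands above.
val_dataset = datasets.ImageFolder('PATH_TO_DATASET/val', transform=val_transform)
val_loader = DataLoader(val_dataset, batch_size=128, shuffle=False, num_workers=4)
```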
Contact
If you have any questions, please feel free to contact the authors. Yulin Wang: [email protected].
Acknowledgment
Our T2T-ViT code is adapted from the official T2T-ViT repository.
To Do
- Update the code for training.