Dynamic-Vision-Transformer (Pytorch)
This repo contains the official code and pre-trained models for the Dynamic Vision Transformer (DVT).
Update on 2021/06/01: released the pre-trained models and the inference code on ImageNet.
Introduction
We develop a Dynamic Vision Transformer (DVT) to automatically configure a proper number of tokens for each individual image, leading to a significant improvement in computational efficiency, both theoretically and empirically.
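At inference time, DVT behaves as a cascade: an image is first processed with a coarse token grid and exits early if the prediction is already confident enough; otherwise it is re-processed with a finer tokenization. The following is a minimal sketch of that early-exit loop, assuming a list of per-exit sub-networks and per-exit confidence thresholds; the function name and the softmax-confidence criterion are illustrative, not the exact interface of this repo.

```python
import torch
import torch.nn.functional as F

def cascade_inference(models, thresholds, image):
    """Illustrative early-exit loop (not the repo's exact API).

    `models` are Transformers operating on progressively finer token grids
    (e.g. 7x7 -> 10x10 -> 14x14) and `thresholds` are per-exit confidence
    thresholds; `image` is a single preprocessed image of shape [1, 3, H, W].
    """
    with torch.no_grad():
        for exit_idx, (model, tau) in enumerate(zip(models, thresholds)):
            logits = model(image)                       # run the current (coarser) exit
            confidence, prediction = F.softmax(logits, dim=-1).max(dim=-1)
            if confidence.item() >= tau:                # confident enough: stop early
                return prediction.item(), exit_idx
    return prediction.item(), exit_idx                  # final (finest) exit is always accepted
```

The confidence thresholds stored in the released checkpoints (see below) are the kind of values such a loop would consume.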
Results
- Top-1 accuracy on ImageNet vs. GFLOPs
- Top-1 accuracy on CIFAR vs. GFLOPs
- Top-1 accuracy on ImageNet vs. Throughput
- Visualization
Pre-trained Models
| Backbone | # of Exits | # of Tokens | Links |
| --- | --- | --- | --- |
| T2T-ViT-12 | 3 | 7x7-10x10-14x14 | Tsinghua Cloud / Google Drive |
- What the checkpoints contain (see the loading sketch after this list):
**.pth.tar
├── model_state_dict: state dictionaries of the model
├── flops: a list containing the GFLOPs corresponding to exiting at each exit
├── anytime_classification: Top-1 accuracy of each exit
├── dynamic_threshold: the confidence thresholds used in budgeted batch classification
└── budgeted_batch_classification: results of budgeted batch classification (a two-item list; entries [0] and [1] are the two coordinate arrays of the accuracy vs. computation curve)
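As a quick way to inspect these fields without running inference.py, a checkpoint can be opened directly with torch.load; the file path below is the same placeholder used in the commands further down, and the key names are the ones listed above.

```python
import torch

# Placeholder path: point this at the downloaded *.pth.tar file.
ckpt = torch.load('PATH_TO_CHECKPOINTS', map_location='cpu')

state_dict = ckpt['model_state_dict']            # model weights
flops = ckpt['flops']                            # GFLOPs when exiting at each exit
anytime_acc = ckpt['anytime_classification']     # Top-1 accuracy of each exit
thresholds = ckpt['dynamic_threshold']           # confidence thresholds for budgeted batch classification
budgeted = ckpt['budgeted_batch_classification'] # two-item list: the coordinates of the trade-off curve

for i, (g, acc) in enumerate(zip(flops, anytime_acc)):
    print(f'exit {i}: {float(g):.2f} GFLOPs, top-1 accuracy {float(acc):.2f}')
```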
Requirements
- Python 3.7.7
- PyTorch 1.3.1
- torchvision 0.4.2
Evaluate Pre-trained Models
Read the evaluation results saved in pre-trained models
CUDA_VISIBLE_DEVICES=0 python inference.py --batch_size 128 --model DVT_T2t_vit_12 --checkpoint_path PATH_TO_CHECKPOINTS --eval_mode 0
Read the confidence thresholds saved in pre-trained models and infer the model on the validation set
CUDA_VISIBLE_DEVICES=0 python inference.py --data_url PATH_TO_DATASET --batch_size 128 --model DVT_T2t_vit_12 --checkpoint_path PATH_TO_CHECKPOINTS --eval_mode 1
Determine confidence thresholds on the training set and infer the model on the validation set
CUDA_VISIBLE_DEVICES=0 python inference.py --data_url PATH_TO_DATASET --batch_size 128 --model DVT_T2t_vit_12 --checkpoint_path PATH_TO_CHECKPOINTS --eval_mode 2
The dataset is expected to be prepared as follows (a torchvision ImageFolder loading sketch is given after the tree):
ImageNet
├── train
│ ├── folder 1 (class 1)
│ ├── folder 2 (class 2)
│ ├── ...
├── val
│ ├── folder 1 (class 1)
│ ├── folder 2 (class 2)
│ ├── ...
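This class-per-subfolder layout is the standard one consumed by torchvision.datasets.ImageFolder. The sketch below shows one way the val split could be loaded for evaluation; the image size and normalization are illustrative assumptions, and the repo's inference.py may build its data loaders differently.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Illustrative preprocessing; adjust to match the settings used by inference.py.
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# 'PATH_TO_DATASET' is the same placeholder as in the commands above.
val_dataset = datasets.ImageFolder('PATH_TO_DATASET/val', transform=val_transform)
val_loader = DataLoader(val_dataset, batch_size=128, shuffle=False, num_workers=4)
```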
Contact
If you have any questions, please feel free to contact the authors. Yulin Wang: [email protected].
Acknowledgment
Our T2T-ViT code is adapted from the official T2T-ViT repository.
To Do
- Update the code for training.