This is an official implementation of CvT: Introducing Convolutions to Vision Transformers.

Microsoft

Last update: Dec 30, 2022

Introduction

This is an official implementation of CvT: Introducing Convolutions to Vision Transformers. We present a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (e.g. shift, scale, and distortion invariance) while maintaining the merits of Transformers (e.g. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks.
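
To make the "hierarchy of Transformers" with a convolutional token embedding more concrete, the sketch below computes how a strided convolution shrinks the token grid from stage to stage. The kernel, stride, and padding values are assumptions based on the three-stage design described in the CvT paper, not values read from this README, and conv_out_size and token_grid_sizes are hypothetical helper names.

```python
# Illustrative sketch only: the stage settings below are assumed,
# not read from this repository's config files.
from typing import List, Sequence, Tuple

# (kernel, stride, padding) of the convolutional token embedding per stage (assumed).
ASSUMED_STAGES: List[Tuple[int, int, int]] = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]


def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def token_grid_sizes(
    image_size: int, stages: Sequence[Tuple[int, int, int]] = ASSUMED_STAGES
) -> List[int]:
    """Return the side length of the token grid after each stage's embedding."""
    sizes = []
    size = image_size
    for kernel, stride, padding in stages:
        size = conv_out_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    for resolution in (224, 384):
        grids = token_grid_sizes(resolution)
        tokens = [g * g for g in grids]
        print(f"{resolution}x{resolution}: grid sides {grids}, tokens per stage {tokens}")
```

Under these assumed settings, a 224x224 input yields token grids of side 56, 28, and 14, and a 384x384 input yields 96, 48, and 24, so each later stage attends over fewer tokens.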

Main results

Models pre-trained on ImageNet-1k

Model	Resolution	Param	GFLOPs	Top-1
CvT-13	224x224	20M	4.5	81.6
CvT-21	224x224	32M	7.1	82.5
CvT-13	384x384	20M	16.3	83.0
CvT-21	384x384	32M	24.9	83.3

Models pre-trained on ImageNet-22k

Model	Resolution	Param	GFLOPs	Top-1
CvT-13	384x384	20M	16.3	83.3
CvT-21	384x384	32M	24.9	84.9
CvT-W24	384x384	277M	193.2	87.6

You can download all the models from our model zoo.

Quick start

Installation

This guide assumes that PyTorch and TorchVision are already installed; if not, please follow the official instructions to install them first. Then install the dependencies with:

python -m pip install -r requirements.txt --user -q

The code is developed and tested with PyTorch 1.7.1. Other versions of PyTorch are not fully tested.

Data preparation

Please prepare the data as follows:

|-DATASET
  |-imagenet
    |-train
    | |-class1
    | | |-img1.jpg
    | | |-img2.jpg
    | | |-...
    | |-class2
    | | |-img3.jpg
    | | |-...
    | |-class3
    | | |-img4.jpg
    | | |-...
    | |-...
    |-val
      |-class1
      | |-img5.jpg
      | |-...
      |-class2
      | |-img6.jpg
      | |-...
      |-class3
      | |-img7.jpg
      | |-...
      |-...
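
Before launching a long job, it can help to confirm that the folders match the layout above. The following is a minimal sketch, not part of the repository: the helper names, the set of accepted image extensions, and the requirement that train and val contain the same class folders are all assumptions made for illustration.

```python
# Illustrative layout check; helper names and IMAGE_SUFFIXES are assumptions.
from pathlib import Path
from typing import Dict

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def summarize_split(split_dir: Path) -> Dict[str, int]:
    """Map each class folder name to the number of image files it contains."""
    if not split_dir.is_dir():
        raise FileNotFoundError(f"missing split directory: {split_dir}")
    counts = {}
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


def check_imagenet_layout(root: Path) -> Dict[str, Dict[str, int]]:
    """Check <root>/imagenet/{train,val}/<class>/<image> and return per-split counts."""
    imagenet = root / "imagenet"
    summary = {split: summarize_split(imagenet / split) for split in ("train", "val")}
    # Assumption: every class folder should appear in both splits.
    mismatch = set(summary["train"]) ^ set(summary["val"])
    if mismatch:
        raise ValueError(f"class folders differ between train and val: {sorted(mismatch)}")
    return summary


# Example usage (assumes the data lives under ./DATASET):
#   for split, counts in check_imagenet_layout(Path("DATASET")).items():
#       print(f"{split}: {len(counts)} classes, {sum(counts.values())} images")
```

The function raises early on a missing split or mismatched class folders, which is usually cheaper than discovering the problem after the data loader starts.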

Run

Each experiment is defined by a YAML config file stored under the experiments directory, which has the following tree structure:

experiments
|-{DATASET_A}
| |-{ARCH_A}
| |-{ARCH_B}
|-{DATASET_B}
| |-{ARCH_A}
| |-{ARCH_B}
|-{DATASET_C}
| |-{ARCH_A}
| |-{ARCH_B}
|-...
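
Given this tree, the available configs can be listed per dataset and architecture. The sketch below is illustrative only: list_experiment_configs is a hypothetical helper, and it assumes config files use the .yaml extension and sit exactly two levels below the experiments directory, as in experiments/imagenet/cvt/cvt-13-224x224.yaml.

```python
# Illustrative helper; assumes experiments/<dataset>/<arch>/<name>.yaml.
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple


def list_experiment_configs(root: Path) -> Dict[Tuple[str, str], List[str]]:
    """Group YAML config file names by their (dataset, architecture) folders."""
    grouped = defaultdict(list)
    for cfg in sorted(root.glob("*/*/*.yaml")):
        dataset, arch = cfg.parts[-3], cfg.parts[-2]
        grouped[(dataset, arch)].append(cfg.name)
    return dict(grouped)


# Example usage, run from the repository root:
#   for (dataset, arch), names in list_experiment_configs(Path("experiments")).items():
#       print(dataset, arch, names)
```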

We provide a run.sh script for running jobs on a local machine.

Usage: run.sh [run_options]
Options:
  -g|--gpus <1> - number of gpus to be used
  -t|--job-type <aml> - job type (train|test)
  -p|--port <9000> - master port
  -i|--install-deps - If install dependencies (default: False)

Training on local machine

bash run.sh -g 8 -t train --cfg experiments/imagenet/cvt/cvt-13-224x224.yaml

You can also modify config parameters from the command line. For example, to change the learning rate to 0.1, run:

bash run.sh -g 8 -t train --cfg experiments/imagenet/cvt/cvt-13-224x224.yaml TRAIN.LR 0.1
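
The trailing KEY VALUE pairs in the command above address nested config entries with dotted names. The sketch below shows one way such overrides could be applied to a nested dictionary. It is an illustration of the idea, not the repository's own config code: apply_overrides is a hypothetical function, and the default value in the example config is made up.

```python
# Illustrative sketch of dotted-key overrides; not the repository's config loader.
import ast
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply ["SECTION.KEY", "value", ...] pairs to a nested dict, in place."""
    if len(overrides) % 2 != 0:
        raise ValueError("overrides must be given as KEY VALUE pairs")
    for dotted_key, raw_value in zip(overrides[0::2], overrides[1::2]):
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node[name]  # KeyError if the section does not exist
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        try:
            value = ast.literal_eval(raw_value)  # "0.1" -> 0.1, "[1, 2]" -> [1, 2]
        except (ValueError, SyntaxError):
            value = raw_value  # keep plain strings unchanged
        node[leaf] = value
    return config


# Hypothetical defaults, for illustration only.
config = {"TRAIN": {"LR": 0.001}}
apply_overrides(config, ["TRAIN.LR", "0.1"])
print(config)
```

Rejecting unknown keys, as this sketch does, catches misspelled option names before a job starts with silently unchanged settings.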

Notes:

The checkpoint, model, and log files will be saved in OUTPUT/{dataset}/{training config} by default.

Testing pre-trained models

bash run.sh -t test --cfg experiments/imagenet/cvt/cvt-13-224x224.yaml TEST.MODEL_FILE ${PRETRAINED_MODEL_FILE}

Citation

If you find this work or code helpful in your research, please cite:

@article{wu2021cvt,
  title={Cvt: Introducing convolutions to vision transformers},
  author={Wu, Haiping and Xiao, Bin and Codella, Noel and Liu, Mengchen and Dai, Xiyang and Yuan, Lu and Zhang, Lei},
  journal={arXiv preprint arXiv:2103.15808},
  year={2021}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Comments

About the pretrained model

I used the pretrained model CvT-13-224x224-IN-1k.pth and tested it on ImageNet following the guide, but the result is terrible: "TEST: Loss 8.5690 Error@1 98.926% Error@5 97.844% Accuracy@1 1.074% Accuracy@5 2.156%"

Has anyone else tested this? Why does this happen?

opened by Y1YU 4

NAN loss

Hi, I just trained the cvt13-224 model with the default settings, but got a NaN loss after several epochs. Has anyone trained this model successfully?

opened by tzt101 4

Hyperparameters

Hi, thanks for this repo! Could you please share the configuration for ImageNet experiments? I suppose the config file here is not the one used for ImageNet, or at least doesn't reflect what is written in the paper (please correct me if I'm wrong). Many thanks!

opened by helia95 1

What's the accuracy of CvT-13 on CIFAR10 without pre-training?

Hi,

What's the accuracy of CvT-13 on CIFAR10 without pre-training? Mine is only 79.6. Could you tell me yours? And what are the hyper-parameters for fine-tuning on CIFAR10 without pre-training? I can't find the details in the paper.

Thanks.

opened by hanwenran1 0

Recommended code change

I recommend changing the following code in lib/models/cls_cvt.py:611: change x = torch.squeeze(x) to x = torch.squeeze(x, dim=1). Otherwise, an error will occur when the batch size is 1.
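
The shape behavior behind this report can be sketched without PyTorch. The helper below mimics how a squeeze operation changes a tensor's shape: with no dim it drops every size-1 axis, and with a dim it drops only that axis. The (batch, 1, channels) shape and the channel count of 384 are assumptions for illustration, and squeezed_shape is a hypothetical function, not a PyTorch API.

```python
# Pure-Python illustration of squeeze shape semantics; squeezed_shape is hypothetical.
from typing import Optional, Tuple


def squeezed_shape(shape: Tuple[int, ...], dim: Optional[int] = None) -> Tuple[int, ...]:
    """Shape after squeeze: drop all size-1 axes, or only axis `dim` if given."""
    if dim is None:
        return tuple(s for s in shape if s != 1)
    dim = dim % len(shape)  # normalise negative axes
    if shape[dim] != 1:
        return shape  # squeezing a non-singleton axis leaves the shape unchanged
    return shape[:dim] + shape[dim + 1:]


for batch in (4, 1):
    shape = (batch, 1, 384)  # assumed (batch, 1, channels) class-token tensor
    print(shape, "-> squeeze():", squeezed_shape(shape),
          "| squeeze(dim=1):", squeezed_shape(shape, dim=1))
```

With a batch of 4 both variants give (4, 384), but with a batch of 1 the unrestricted squeeze also removes the batch axis and gives (384,), while squeezing only axis 1 keeps (1, 384).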

opened by zlwangustc 0

Recommended torch version may be wrong

After installing torch 1.7.1, I got an error: ModuleNotFoundError: No module named 'torch.fx'. According to Stack Overflow, torch.fx was added in PyTorch 1.8.0, so maybe the recommended version is wrong?

opened by RylonW 0

How to calculate the FLOPs of the model?

Hello, thanks for the great work. How do you calculate the FLOPs of the model? I noticed that you report FLOPs for Transformer-based models, but I have only found tools for CNN models.

opened by exiawsh 0
