LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

Overview

LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

This repository contains PyTorch evaluation code, training code and pretrained models for LeViT.

The models obtain competitive trade-offs between speed and accuracy.

For details see LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference by Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou and Matthijs Douze.

If you use this code for a paper, please cite:

@article{graham2021levit,
  title={LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference},
  author={Benjamin Graham and Alaaeldin El-Nouby and Hugo Touvron and Pierre Stock and Armand Joulin and Herv\'e J\'egou and Matthijs Douze},
  journal={arXiv preprint arXiv:2104.01136},
  year={2021}
}

Model Zoo

We provide baseline LeViT models trained with distillation on ImageNet 2012.

name        acc@1  acc@5  #FLOPs  #params  url
LeViT-128S  76.6   92.9   305M    7.8M     model
LeViT-128   78.6   94.0   406M    9.2M     model
LeViT-192   80.0   94.7   658M    11M      model
LeViT-256   81.6   95.4   1120M   19M      model
LeViT-384   82.6   96.0   2353M   39M      model
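
As a quick sanity check against the table, the parameter counts can be reproduced with the model factories defined in levit.py (a minimal sketch; LeViT_128S is assumed to follow the same factory pattern as the LeViT_256 and LeViT_384 entry points used elsewhere in this readme):

import torch
from levit import LeViT_128S

model = LeViT_128S(pretrained=False)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # the table lists 7.8M for LeViT-128S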

Usage

First, clone the repository locally:

git clone https://github.com/facebookresearch/levit.git

Then install PyTorch 1.7.0+, torchvision 0.8.1+, and pytorch-image-models (timm):

conda install -c pytorch pytorch torchvision
pip install timm
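
A quick check that the installed versions meet these minimums:

import torch, torchvision, timm
# the repo expects torch >= 1.7.0 and torchvision >= 0.8.1
print(torch.__version__, torchvision.__version__, timm.__version__)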

Data preparation

Download and extract the ImageNet train and val images from http://image-net.org/. The directory structure is the standard layout expected by torchvision's datasets.ImageFolder, with the training and validation data in the train/ and val/ folders, respectively:

/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img3.jpeg
    class2/
      img4.jpeg
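
For reference, a minimal sketch of how torchvision reads this tree (the transforms here are generic placeholders, not the exact pipeline used by main.py):

import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
# each class subfolder under train/ and val/ becomes one label
train_set = torchvision.datasets.ImageFolder('/path/to/imagenet/train', transform=transform)
val_set = torchvision.datasets.ImageFolder('/path/to/imagenet/val', transform=transform)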

Evaluation

To evaluate a pre-trained LeViT-256 model on ImageNet val with a single GPU run:

python main.py --eval --model LeViT_256 --data-path /path/to/imagenet

This should give:

* Acc@1 81.636 Acc@5 95.424 loss 0.750
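
The pretrained weights can also be loaded directly from Python (a minimal sketch based on the factory-function pattern that appears in the issues below, e.g. LeViT_384(pretrained=True); the random tensor stands in for a preprocessed image):

import torch
from levit import LeViT_256

model = LeViT_256(pretrained=True).eval()
with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)  # stand-in for a normalized 224x224 image
    logits = model(x)
print(logits.argmax(dim=1))  # predicted ImageNet class index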

Training

To train LeViT-256 on ImageNet with hard distillation on a single node with 8 GPUs, run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model LeViT_256 --data-path /path/to/imagenet --output_dir /path/to/save

Multinode training

Distributed training is available via Slurm and submitit:

pip install submitit

To train a LeViT-256 model on ImageNet on one node with 8 GPUs:

python run_with_submitit.py --model LeViT_256 --data-path /path/to/imagenet

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Contributing

We actively welcome your pull requests! Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.

Comments
  • LeViT-128S without distillation 100 epoch training reproduction on 1 GPU

    Hello and thanks for the great paper and codebase! I am trying to replicate the numbers reported in Table 5 of the paper, specifically the A4 model (without distillation), which is reported to achieve 69.7% top-1 accuracy. Would you have any hints as to how to replicate these numbers with only 1 GPU? Modifying the code and using gradient accumulation techniques to replicate the 256 * 32 = 8192 batch size only reaches 63.9% top-1 accuracy. Are there any other steps / tricks that I might be missing? Thanks!
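
    For reference, a generic gradient-accumulation sketch (not the repo's training loop) that emulates an effective batch of 256 * 32 = 8192 on one GPU:

    import torch

    model = torch.nn.Linear(10, 2)  # stand-in for LeViT
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    accum_steps = 32                # 256 micro-batch * 32 steps = 8192 effective
    loader = [(torch.randn(256, 10), torch.randint(0, 2, (256,))) for _ in range(64)]
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = torch.nn.functional.cross_entropy(model(x), y) / accum_steps
        loss.backward()             # gradients accumulate across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

    Note that accumulation reproduces the gradient of a large batch but not its BatchNorm statistics, which could account for part of the gap.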

    opened by ktertikas 4
  • Why LeViT needs 1000 training epochs?

    While other ViT models are trained with only 300 epochs, LeViT needs 1000 epochs, which brings a lot of training cost. I think this is unfair for comparison. What is the accuracy of LeViT at 300 epochs?

    opened by lilujunai 4
  • Question about running the speed_test.py

    Hi~ I want to run speed_test.py, but there is an error as follows: in q, k, v = qkv.view(B, N, self.num_heads, ...), shape '[2048, 50176, 4, -1]' is invalid for input of size 308281344. When I check the code, I find that it removes the model's batchnorm, and the patch_embed of the model is also removed; therefore the transformer blocks cannot reshape the input.

    My question is: how do I fix this problem? Also, when I delete the line that removes batchnorm, the result for 'levit.LeViT_128S, 2048, 224' is 20761 images/s on an RTX 3090, which is a lot higher than what you reported (12880 images/s in Tab. 3). Is this result reasonable? I am looking forward to your reply, thanks!
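
    For comparison, a generic throughput-measurement sketch (not the repo's speed_test.py; results depend on batch size, resolution, and warm-up):

    import time
    import torch

    @torch.no_grad()
    def throughput(model, batch_size=2048, resolution=224, reps=50, device='cuda'):
        model = model.to(device).eval()
        x = torch.randn(batch_size, 3, resolution, resolution, device=device)
        for _ in range(10):  # warm-up
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(reps):
            model(x)
        torch.cuda.synchronize()
        return batch_size * reps / (time.time() - start)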

    opened by irsLu 3
  • LeViT training and bench on GTSRB dataset

    Hello

    I'm trying to use your SOTA LeViT for GTSRB but encountered some problems when testing. The accuracy after testing 12K images in GTSRB was only 13.4%, at 347 FPS on a 3080 Ti. I believe your model could break any record in my survey, and training may be the main cause. I have tried levit.py and levit_c.py to load the model with only the arg num_classes = 43 for training. I also use the same training and testing method for GhostNet 1.0 and MobileNetV3 Large. Could you please point out anything in the training code that makes my testbench not work with your model? Thank you in advance.

    import torch
    import torchvision
    import torch.optim as optim
    import torchvision.transforms as transforms
    from torch.utils.data import DataLoader
    from utils import save_plots
    from levit_c import LeViT_c_128S

    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)
    transform_train = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean, std),
    ])
    trainset = torchvision.datasets.GTSRB(root='data', download=False, transform=transform_train)  # download=True if you have not downloaded it yet
    trainloader = DataLoader(trainset, batch_size=32, shuffle=True, num_workers=8)
    model = LeViT_c_128S(num_classes=43)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    # Training hyperparameters
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
    # Lists to keep track of losses and accuracies.
    train_loss = []
    train_acc = []
    # Training
    epochs = 50
    model.train()
    for epoch in range(epochs):
        print("\n Epoch: %d" % (epoch + 1))
        sum_loss = 0.0
        correct = 0.0
        total = 0.0
        length = len(trainloader)
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            # forward + backward; the distillation variant returns two heads in train mode
            outputs, outputs_dist = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # print loss and accuracy for each iteration
            sum_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += predicted.eq(labels.data).cpu().sum()
            print("[epoch:%d, iter:%d] Loss: %.3f | Acc: %.3f%%"
                  % (epoch + 1, (i + 1 + epoch * length), sum_loss / (i + 1), 100. * correct / total))
        scheduler.step()  # adjust the learning rate for the next epoch
        epoch_loss = sum_loss / (i + 1)
        epoch_acc = 100. * correct / total
        train_loss.append(epoch_loss)
        train_acc.append(epoch_acc)
    # Display training result
    model_name = "LeViT_128s"
    save_plots(model_name, train_acc, train_loss)
    print("Model: LeViT_128s")
    print("Training hyperparameters - Epochs: %s, Batch-size: 32, Learning-rate: 0.1, Optimizer: SGD, Momentum: 0.9" % epochs)
    print("[epoch:%d, iter:%d] Loss: %.3f | Acc: %.3f%%" % (epoch + 1, (i + 1 + epoch * length), sum_loss / (i + 1), 100. * correct / total))
    print("Model was saved as %s.pth" % model_name)
    torch.save(model.state_dict(), 'LeViT_128s.pth')
    

    [attached: training accuracy and loss plots]

    opened by thaihoangminhtam 3
  • [fix] Fix Conv2d_BN fuse bug when groups > 1

    Hi there, thanks for your great work :)

    I found a bug when fusing Conv2d_BN with groups > 1: the input channel count should be w.size(1) * self.c.groups rather than w.size(1) in the function Conv2d_BN.fuse.
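
    A sketch of the corrected method (assuming the Conv2d_BN structure in levit.py; for a grouped convolution, c.weight has shape [out, in // groups, kH, kW], so the fused layer's in_channels must be scaled back up by groups):

    import torch

    @torch.no_grad()
    def fuse(self):  # intended to replace Conv2d_BN.fuse
        c, bn = self._modules.values()
        # fold the BN scale/shift into the conv weights
        w = bn.weight / (bn.running_var + bn.eps) ** 0.5
        w = c.weight * w[:, None, None, None]
        b = bn.bias - bn.running_mean * bn.weight / (bn.running_var + bn.eps) ** 0.5
        m = torch.nn.Conv2d(
            w.size(1) * self.c.groups,  # fix: was w.size(1), wrong when groups > 1
            w.size(0), w.shape[2:],
            stride=self.c.stride, padding=self.c.padding,
            dilation=self.c.dilation, groups=self.c.groups)
        m.weight.data.copy_(w)
        m.bias.data.copy_(b)
        return m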

    Reproduce Code:

    from levit import Conv2d_BN
    from levit_c import Conv2d_BN as Conv2d_BN_c
    import torch
    import numpy as np
    from itertools import product
    import utils
    
    @torch.no_grad()
    def test():
        for layer_t, a, b, ks, groups in product(
            [Conv2d_BN, Conv2d_BN_c],
            [8, 16, 32, 64],
            [8, 16, 32, 64],
            [1, 3, 5, 7],
            [1, 2, 4],
                ):
            layer = layer_t(a, b, ks, pad=ks//2, groups=groups)
            layer.eval()
    
            x = torch.randn((1, a, 16, 16))
            y1 = layer(x)
            utils.replace_batchnorm(layer)
            y2 = layer(x)
    
            np.testing.assert_almost_equal(y1.detach().numpy(), y2.detach().numpy(), decimal=4)
    
    if __name__ == '__main__':
        test()
        print("Test Over")
    

    Error:

      File "test_conv.py", line 21, in test
        layer.fuse()
      File "/home/wkcn/miniconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/home/wkcn/proj/LeViT-1/levit_c.py", line 99, in fuse
        m.weight.data.copy_(w)
    RuntimeError: The size of tensor a (2) must match the size of tensor b (4) at non-singleton dimension 1
    
    CLA Signed 
    opened by wkcn 3
  • 'NoneType' object has no attribute 'log_softmax'

    I am using the standard loss function nn.CrossEntropyLoss(). It gives the following error; please let me know, can we use nn.CrossEntropyLoss()?

    Traceback (most recent call last):

      File "/raid/khawar/PycharmProjects/thesis/train.py", line 487, in <module>
        loss = LOSS(outputs, labels)
      File "/raid/khawar/anaconda3/envs/vision-transformer-pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/raid/khawar/anaconda3/envs/vision-transformer-pytorch/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1047, in forward
        return F.cross_entropy(input, target, weight=self.weight,
      File "/raid/khawar/anaconda3/envs/vision-transformer-pytorch/lib/python3.8/site-packages/torch/nn/functional.py", line 2693, in cross_entropy
        return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
      File "/raid/khawar/anaconda3/envs/vision-transformer-pytorch/lib/python3.8/site-packages/torch/nn/functional.py", line 1672, in log_softmax
        ret = input.log_softmax(dim)
    AttributeError: 'NoneType' object has no attribute 'log_softmax'
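
    One possible cause: in training mode the distillation variant of LeViT returns a tuple of logits rather than a single tensor, which a loop written for single-output models may mishandle. A minimal defensive sketch (model, images and labels stand in for the user's training loop):

    import torch.nn.functional as F

    def compute_loss(model, images, labels):
        outputs = model(images)
        if isinstance(outputs, tuple):
            # distillation variant: (class_logits, distill_logits)
            outputs = outputs[0]  # or average the two heads
        return F.cross_entropy(outputs, labels)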
    
    opened by khawar-islam 3
  • Exporting ONNX failed.

    I used the following code to export an ONNX model:

    import torch
    from levit import LeViT_192

    # model and inputs assumed here to make the repro self-contained
    levit_model = LeViT_192(pretrained=True).eval()
    dummy_input = torch.randn(1, 3, 224, 224)
    input_names, output_names = ["input"], ["output"]
    torch.onnx.export(levit_model, dummy_input,
                      "levit192.onnx",
                      export_params=True,
                      verbose=True,
                      input_names=input_names, output_names=output_names)
    

    but an error occurred:

    raise RuntimeError("step!=1 is currently not supported")
    RuntimeError: step!=1 is currently not supported
    

    I tried to set opset_version=11, but another error occurred:

      File "/multimedia-nfs/liwei/model_selection/model_select_env/lib/python3.6/site-packages/torch/onnx/utils.py", line 500, in _model_to_graph
        _export_onnx_opset_version)
    RuntimeError: Index is supposed to be an empty tensor or a vector
    

    I need your help. Thank you!

    opened by leviome 2
  • About the shape of attention_biases

    Thanks for your work!

    When I run the code, I meet an error: too many indices for tensor of dimension 2.

    The error is in "mypath/levit.py", at the line "self.attention_biases[:, self.attention_bias_idxs]".

    attention_biases shape: [4, 196]; attention_bias_idxs shape: [196, 196]. Is there something wrong with this code?
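
    For what it's worth, a standalone check suggests this advanced-indexing pattern is valid for 2-D tensors, so the error likely means the tensors have unexpected shapes at that point:

    import torch

    biases = torch.randn(4, 196)              # [num_heads, num_points]
    idxs = torch.randint(0, 196, (196, 196))  # [N, N] index grid
    print(biases[:, idxs].shape)              # torch.Size([4, 196, 196])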

    opened by ZhangLei999 2
  • The specific setting (e.g., batch-size) to reproduce the inference speed in Tab.3?

    Tab. 3 of the paper reports inference speeds for the LeViT models, such as 12880 img/s for LeViT-128S and 9266 img/s for LeViT-128.

    Would you please list the specific settings (e.g., the batch size and the type of GPU)? The same architecture can run at very different inference speeds under different settings.

    opened by Openning07 2
  • problem of inference precision

    Thank you very much for your open-source release. When I reproduce the inference accuracy using the officially provided model, it is inconsistent with the numbers given in the readme. What could be the reason?

    Using LeViT-256 I get: Acc@1 81.584 Acc@5 95.464 loss 0.745

    opened by aso538 1
  • Inference - different output when using different batch size

    When performing inference with a pretrained model (in eval() mode), the same image may produce different logits when the batch size is changed.

    Example code:

    import torch
    from levit import LeViT_384

    with torch.no_grad():
        x = torch.stack([
            torch.zeros((3, 224, 224)),
            torch.ones((3, 224, 224)),
            torch.ones((3, 224, 224)),
        ])
        model = LeViT_384(pretrained=True)
        model = model.eval()
        print('batch=1', model(x[:1])[0][:2].numpy())
        print('batch=3', model(x[:3])[0][:2].numpy())
    

    Output: (only for the 1st sample, limited to 2 classes)

    batch=1 [-0.3287484  -0.11664876]
    batch=3 [-0.32874817 -0.11664899]
    

    While the argmax may not be significantly affected, this inconsistency makes it difficult to perform gradient analysis.

    I'm suspecting that this is caused by some batch-normalization layers not honoring eval() mode.
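
    A quick way to test that suspicion (a standalone sketch; it only checks the standard torch.nn batch-norm modules):

    import torch

    def assert_all_eval(model):
        # verify that eval() propagated to every BatchNorm submodule
        for name, m in model.named_modules():
            if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d)):
                assert not m.training, f"{name} is still in training mode"

    Note that differences on the order of 1e-7, as above, are also consistent with non-associative floating-point reductions in batched GPU kernels, even with every layer correctly in eval() mode.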

    opened by thariq-nugrohotomo 1