ViDT: An Efficient and Effective Fully Transformer-based Object Detector

by Hwanjun Song¹, Deqing Sun², Sanghyuk Chun¹, Varun Jampani², Dongyoon Han¹,
Byeongho Heo¹, Wonjae Kim¹, and Ming-Hsuan Yang²˒³

¹ NAVER AI Lab, ² Google Research, ³ University of California, Merced

ViDT: Vision and Detection Transformers


ViDT is an end-to-end fully transformer-based object detector, which directly produces predictions without using convolutional layers. Our main contributions are summarized as follows:

  • ViDT introduces a modified attention mechanism, named Reconfigured Attention Module (RAM), that enables any ViT variant to handle the appended [DET] and [PATCH] tokens for standalone object detection. Thus, we can modify the latest Swin Transformer backbone with RAM to be an object detector and obtain high scalability using its local attention mechanism with linear complexity.

  • ViDT adopts a lightweight encoder-free neck architecture to reduce the computational overhead while still enabling the additional optimization techniques on the neck module. As a result, ViDT obtains better performance than neck-free counterparts.

  • We introduce a new concept of token matching for knowledge distillation, which transfers knowledge from a large model to a small model and brings additional performance gains without compromising detection efficiency.
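
The linear-complexity claim for local attention can be checked with simple counting. The sketch below is illustrative only: it counts query-key pairs rather than FLOPs, and it assumes a patch size of 4 and a window size of 7 (the latter is suggested by the swin_base_win7_22k backbone name used in the training commands). The function names are hypothetical and are not part of the ViDT codebase.

```python
def num_tokens(image_size, patch_size=4):
    """Number of [PATCH] tokens for a square image (assumed patch size of 4)."""
    side = image_size // patch_size
    return side * side


def global_attention_pairs(n_tokens):
    """Every token attends to every token: quadratic in the token count."""
    return n_tokens * n_tokens


def window_attention_pairs(n_tokens, window=7):
    """Each token attends only to the window*window tokens in its own window,
    so the count grows linearly with the number of tokens."""
    return n_tokens * window * window


for size in (224, 448):
    n = num_tokens(size)
    print(size, n, global_attention_pairs(n), window_attention_pairs(n))
```

Doubling the resolution from 224 to 448 quadruples the token count; the global pair count grows by 16x, while the windowed count grows by only 4x.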

Architectural Advantages. First, ViDT makes it possible to combine Swin Transformer with the sequence-to-sequence paradigm for detection. Second, ViDT can use multi-scale features and additional techniques without significant computational overhead. Therefore, as a fully transformer-based object detector, ViDT facilitates better integration of vision and detection transformers.

Component Summary. There are four components: (1) RAM to extend Swin Transformer as a standalone object detector, (2) the neck decoder to exploit multi-scale features with two additional techniques, auxiliary decoding loss and iterative box refinement, (3) knowledge distillation to benefit from a large model, and (4) decoding layer drop to further accelerate inference speed.


Index: [A. ViT Backbone], [B. Main Results], [C. Complete Analysis]

|--- A. ViT Backbone used for ViDT
|--- B. Main Results in the ViDT Paper
     |--- B.1. ViDT for 50 and 150 Epochs
     |--- B.2. Distillation with Token Matching
|--- C. Complete Component Analysis

A. ViT Backbone used for ViDT

| Backbone and Size | Training Data | Epochs | Resolution | Params | ImageNet Acc. | Checkpoint |
|---|---|---|---|---|---|---|
| Swin-nano | ImageNet-1K | 300 | 224 | 6M | 74.9% | Github |
| Swin-tiny | ImageNet-1K | 300 | 224 | 28M | 81.2% | Github |
| Swin-small | ImageNet-1K | 300 | 224 | 50M | 83.2% | Github |
| Swin-base | ImageNet-22K | 90 | 224 | 88M | 86.3% | Github |

B. Main Results in the ViDT Paper

In the main experiments, auxiliary decoding loss and iterative box refinement were used as the auxiliary techniques on the neck structure.
The efficacy of distillation with token matching and decoding layer drop is verified independently in Complete Component Analysis.
All the models were re-trained with the final version of the source code. Thus, the values may differ very slightly from those in the paper.

B.1. ViDT for 50 and 150 Epochs

| Backbone | Epochs | AP | AP50 | AP75 | AP_S | AP_M | AP_L | Params | FPS | Checkpoint / Log |
|---|---|---|---|---|---|---|---|---|---|---|
| Swin-nano | 50 (150) | 40.4 (42.6) | 59.9 (62.2) | 43.0 (45.7) | 23.1 (24.9) | 42.8 (45.4) | 55.9 (59.1) | 16M | 20.0 | Github / Log (Github / Log) |
| Swin-tiny | 50 (150) | 44.9 (47.2) | 64.7 (66.7) | 48.3 (51.4) | 27.5 (28.4) | 47.9 (50.2) | 61.9 (64.7) | 38M | 17.2 | Github / Log (Github / Log) |
| Swin-small | 50 (150) | 47.4 (48.8) | 67.7 (68.8) | 51.2 (53.0) | 30.4 (30.7) | 50.7 (52.0) | 64.6 (65.9) | 60M | 12.1 | Github / Log (Github / Log) |
| Swin-base | 50 (150) | 49.4 (50.4) | 69.6 (70.4) | 53.4 (54.8) | 31.6 (34.1) | 52.4 (54.2) | 66.8 (67.4) | 0.1B | 9.0 | Github / Log (Github / Log) |

B.2. Distillation with Token Matching (Coefficient 4.0)

All the models are trained for 50 epochs with distillation. The teacher is ViDT (Swin-base) trained for 50 epochs.

| Student | ViDT (Swin-nano) | ViDT (Swin-tiny) | ViDT (Swin-small) |
|---|---|---|---|
| Coefficient = 0.0 | 40.4 | 44.9 | 47.4 |
| Coefficient = 4.0 | 41.8 (Github / Log) | 46.6 (Github / Log) | 49.2 (Github / Log) |
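
One plausible form of a token matching loss is sketched below in plain Python: it penalizes the distance between the student's and the teacher's [PATCH] and [DET] tokens and scales the result by the coefficient from the table (4.0). This is an assumption about the general shape of such a loss, not the loss used in the codebase; the function names and the choice of a mean Euclidean distance are hypothetical, and the sketch presumes that student and teacher tokens share the same count and dimension.

```python
import math


def mean_l2_distance(student_tokens, teacher_tokens):
    """Average Euclidean distance between paired token vectors."""
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher must provide the same number of tokens")
    total = 0.0
    for s, t in zip(student_tokens, teacher_tokens):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))
    return total / len(student_tokens)


def token_matching_loss(student_patch, teacher_patch,
                        student_det, teacher_det, coefficient=4.0):
    """Hypothetical distillation term: coefficient * ([PATCH] term + [DET] term)."""
    patch_term = mean_l2_distance(student_patch, teacher_patch)
    det_term = mean_l2_distance(student_det, teacher_det)
    return coefficient * (patch_term + det_term)


loss = token_matching_loss(
    student_patch=[[0.0, 0.0], [3.0, 4.0]],
    teacher_patch=[[0.0, 0.0], [0.0, 0.0]],
    student_det=[[1.0, 0.0]],
    teacher_det=[[0.0, 0.0]],
)
print(loss)
```

With a coefficient of 0.0 the term vanishes, which corresponds to the no-distillation row of the table.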

C. Complete Component Analysis

We combined the four proposed components (including distillation with token matching and decoding layer drop) to achieve high accuracy and speed for object detection. For distillation, ViDT (Swin-base) trained for 50 epochs was used as the teacher for all models.

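| # | RAM | Neck | Distil | Drop | Swin-nano AP | Swin-nano Params | Swin-nano FPS | Swin-tiny AP | Swin-tiny Params | Swin-tiny FPS | Swin-small AP | Swin-small Params | Swin-small FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (1) | ✔️ | | | | 28.7 | 7M | 36.5 | 36.3 | 29M | 28.6 | 41.6 | 52M | 16.8 |
| (2) | ✔️ | ✔️ | | | 40.4 | 16M | 20.0 | 44.9 | 38M | 17.2 | 47.4 | 60M | 12.1 |
| (3) | ✔️ | ✔️ | ✔️ | | 41.8 | 16M | 20.0 | 46.6 | 38M | 17.2 | 49.2 | 60M | 12.1 |
| (4) | ✔️ | ✔️ | ✔️ | ✔️ | 41.6 | 13M | 23.0 | 46.4 | 35M | 19.5 | 49.1 | 58M | 13.0 |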
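
The effect of decoding layer drop can be read off rows (3) and (4) with a little arithmetic. The snippet below is only a convenience sketch: the numbers are copied from the table above, and the helper name is hypothetical.

```python
# (AP, params in millions, FPS) copied from rows (3) and (4) of the table above.
WITHOUT_DROP = {
    "Swin-nano": (41.8, 16, 20.0),
    "Swin-tiny": (46.6, 38, 17.2),
    "Swin-small": (49.2, 60, 12.1),
}
WITH_DROP = {
    "Swin-nano": (41.6, 13, 23.0),
    "Swin-tiny": (46.4, 35, 19.5),
    "Swin-small": (49.1, 58, 13.0),
}


def drop_tradeoff(before, after):
    """Return (AP change, FPS gain in percent) when decoding layer drop is added."""
    ap_delta = round(after[0] - before[0], 1)
    fps_gain_pct = round((after[2] - before[2]) / before[2] * 100, 1)
    return ap_delta, fps_gain_pct


for name in WITHOUT_DROP:
    print(name, drop_tradeoff(WITHOUT_DROP[name], WITH_DROP[name]))
```

For Swin-nano, for instance, AP changes by -0.2 while FPS rises by 15.0%.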


This codebase has been developed with the setting used in Deformable DETR:
Linux, CUDA>=9.2, GCC>=5.4, Python>=3.7, PyTorch>=1.5.1, and torchvision>=0.6.1.
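
Before installing, it can help to compare installed versions against these minimums. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the parser handles only simple dotted versions with an optional local suffix such as +cu111.

```python
import re
import sys

# Minimum versions quoted in the requirements line above.
MINIMUMS = {"python": "3.7", "torch": "1.5.1", "torchvision": "0.6.1"}


def parse_version(text):
    """Turn '1.8.0+cu111' into (1, 8, 0); local build suffixes are ignored."""
    core = text.split("+", 1)[0]
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings component by component."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


python_version = ".".join(str(n) for n in sys.version_info[:3])
print("python", python_version, meets_minimum(python_version, MINIMUMS["python"]))
```

For example, meets_minimum("1.8.0+cu111", MINIMUMS["torch"]) returns True, while meets_minimum("0.5.0", MINIMUMS["torchvision"]) returns False.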

We recommend using Anaconda to create a conda environment:

conda create -n deformable_detr python=3.7 pip
conda activate deformable_detr
conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=9.2 -c pytorch

Compiling CUDA operators for deformable attention

cd ./ops
sh ./
# unit test (should see all checking is True)

Other requirements

pip install -r requirements.txt


We used the commands below to train ViDT models on a single node with 8 NVIDIA V100 GPUs.

Run this command to train the ViDT (Swin-nano) model in the paper:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env \
       --method vidt \
       --backbone_name swin_nano \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --coco_path /path/to/coco \
       --output_dir /path/for/output

Run this command to train the ViDT (Swin-tiny) model in the paper:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env \
       --method vidt \
       --backbone_name swin_tiny \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --coco_path /path/to/coco \
       --output_dir /path/for/output

Run this command to train the ViDT (Swin-small) model in the paper:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env \
       --method vidt \
       --backbone_name swin_small \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --coco_path /path/to/coco \
       --output_dir /path/for/output

Run this command to train the ViDT (Swin-base) model in the paper:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env \
       --method vidt \
       --backbone_name swin_base_win7_22k \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --coco_path /path/to/coco \
       --output_dir /path/for/output
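
The four training commands above differ only in the value of --backbone_name. As a convenience sketch, the shared flags can be generated programmatically. The helper names below are hypothetical; the training script's file name is not shown in this document, so the helper builds only the flag list. The effective batch size assumes that --batch_size is a per-GPU value, as is conventional in Deformable DETR style codebases.

```python
# Backbone names taken from the training commands above.
BACKBONES = ["swin_nano", "swin_tiny", "swin_small", "swin_base_win7_22k"]


def build_train_flags(backbone_name,
                      coco_path="/path/to/coco",
                      output_dir="/path/for/output"):
    """Return the flag list shared by the training commands (hypothetical helper)."""
    return [
        "--method", "vidt",
        "--backbone_name", backbone_name,
        "--epochs", "50",
        "--lr", "1e-4",
        "--min-lr", "1e-7",
        "--batch_size", "2",
        "--num_workers", "2",
        "--aux_loss", "True",
        "--with_box_refine", "True",
        "--coco_path", coco_path,
        "--output_dir", output_dir,
    ]


def effective_batch_size(gpus=8, per_gpu_batch=2):
    """Total images per step, assuming --batch_size is a per-GPU value."""
    return gpus * per_gpu_batch


for name in BACKBONES:
    print(" ".join(build_train_flags(name)))
print("effective batch size:", effective_batch_size())
```

Under that assumption, 8 GPUs with --batch_size 2 give an effective batch size of 16.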

When a large pre-trained ViDT model is available, distillation with token matching can be applied for training a smaller ViDT model.

Run this command to train ViDT (Swin-nano) with a large ViDT (Swin-base) teacher via knowledge distillation. The --distil_path argument takes either a local path or a URL:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env \
       --method vidt \
       --backbone_name swin_nano \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --distil_model vidt_base \
       --distil_path /path/to/vidt_base \
       --coco_path /path/to/coco \
       --output_dir /path/for/output


Run this command to evaluate the ViDT (Swin-nano) model on COCO:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env \
       --method vidt \
       --backbone_name swin_nano \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_nano \
       --pre_trained none \
       --eval True

Run this command to evaluate the ViDT (Swin-tiny) model on COCO:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env \
       --method vidt \
       --backbone_name swin_tiny \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_tiny \
       --pre_trained none \
       --eval True

Run this command to evaluate the ViDT (Swin-small) model on COCO:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env \
       --method vidt \
       --backbone_name swin_small \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_small \
       --pre_trained none \
       --eval True

Run this command to evaluate the ViDT (Swin-base) model on COCO:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env \
       --method vidt \
       --backbone_name swin_base_win7_22k \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_base \
       --pre_trained none \
       --eval True


Please consider citing our paper if it is useful in your research.

@article{song2021vidt,
  title={ViDT: An Efficient and Effective Fully Transformer-based Object Detector},
  author={Song, Hwanjun and Sun, Deqing and Chun, Sanghyuk and Jampani, Varun and Han, Dongyoon and Heo, Byeongho and Kim, Wonjae and Yang, Ming-Hsuan},
  journal={arXiv preprint arXiv:2110.03921},
  year={2021}
}


Copyright 2021-present NAVER Corp.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
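You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.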
See the License for the specific language governing permissions and
limitations under the License.
  • Inference Time of Deformable Detr with Swin-base

    Hi, from the results you provided on OpenReview, the inference speed of Deformable DETR with Swin-base is 4.8 FPS. However, in my testing it is 8.1 FPS. I am using a Tesla V100 GPU with batch size 1.

    opened by ilovecv 5
  • Simple notebook file(.ipynb) for whom wants to train/test ViDT on Colab

    Having just read your paper, I am trying to train and test ViDT on a single machine with a single GPU (specifically Colab Pro).

    Since there seems to be no tutorial material (or .ipynb file) for this kind of simple test with the COCO dataset,

    I would like to share my .ipynb file for anyone interested in this model and in testing it in a Colab environment.

    .ipynb file on this repo

    If this is a problem, please let me know and I will delete the Colab repo.

    Thanks in advance.

    opened by EherSenaw 1
  • Error while running

    I am getting the following error message while running in ops directory.

    I am exactly following the installation steps provided in the README file

    `Traceback (most recent call last): File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/site-packages/torch/utils/", line 1423, in _run_ninja_build check=True) File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/", line 512, in run output=stdout, stderr=stderr) subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last): File "", line 70, in cmdclass={"build_ext": torch.utils.cpp_extension.BuildExtension}, File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/site-packages/setuptools/", line 153, in setup return distutils.core.setup(**attrs) File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/distutils/", line 148, in setup dist.run_commands() File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/distutils/", line 966, in run_commands self.run_command(cmd) File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/distutils/", line 985, in run_command File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/distutils/command/", line 135, in run self.run_command(cmd_name) File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/distutils/", line 313, in run_command self.distribution.run_command(command) File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/distutils/", line 985, in run_command File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/site-packages/setuptools/command/", line 79, in run File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/distutils/command/", line 340, in run self.build_extensions() File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/site-packages/torch/utils/", line 603, in build_extensions build_ext.build_extensions(self) File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/distutils/command/", line 449, in build_extensions self._build_extensions_serial() File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/distutils/command/", line 474, in _build_extensions_serial self.build_extension(ext) File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/site-packages/setuptools/command/", line 202, in build_extension _build_ext.build_extension(self, ext) File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/distutils/command/", line 534, in build_extension depends=ext.depends) File 
"/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/site-packages/torch/utils/", line 437, in unix_wrap_ninja_compile with_cuda=with_cuda) File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/site-packages/torch/utils/", line 1163, in _write_ninja_file_and_compile_objects error_prefix='Error compiling objects for extension') File "/home/aditya_rastogi/anaconda3/envs/ddetr/lib/python3.7/site-packages/torch/utils/", line 1436, in _run_ninja_build raise RuntimeError(message) RuntimeError: Error compiling objects for extension`

    opened by IISCAditayTripathi 0
  • Question about feature map

    I have a question about the feature map that is extracted by the Swin backbone. Assuming an input with size (224,224), the original Swin model produces 4 feature maps, with shapes (C, 56, 56), (2C, 28, 28), (4C, 14, 14) and (8C, 7, 7).

    Your version, however, produces 4 feature maps (2C, 28, 28), (4C, 14, 14), (8C, 7, 7) and (256, 4, 4).

    Can you please explain why you are not also using the 1st feature map?

    opened by ManiadisG 0
  • Long training Time

    I am trying to train swin_nano with 4 V100 GPUs. It has been almost 20 hours, but it has not completed one epoch yet. I have followed the setup instructions in this repo. My setup is as follows:

    Package Version
    certifi 2022.6.15
    charset-normalizer 2.1.0
    cycler 0.11.0
    einops 0.4.1
    fonttools 4.33.3
    idna 3.3
    kiwisolver 1.4.3
    matplotlib 3.5.2
    MultiScaleDeformableAttention 1.0
    numpy 1.21.6
    onnx 1.10.0
    onnxruntime 1.4.0
    packaging 21.3
    Pillow 9.2.0
    pip 19.0.3
    protobuf 3.20.1
    pycocotools 2.0.4
    pyparsing 3.0.9
    python-dateutil 2.8.2
    requests 2.28.1
    scipy 1.7.3
    setuptools 40.8.0
    six 1.16.0
    timm 0.5.4
    torch 1.8.0+cu111
    torchaudio 0.8.0
    torchvision 0.9.0+cu111
    typing-extensions 4.3.0
    urllib3 1.26.9

    With the same setup DeformableDETR takes 1hr and 30 mins to complete one epoch on COCO 2017 dataset. Could anyone identify the problem?

    opened by Alam4545 0
  • What if we only do detection and classification task with vidt+

    As mentioned in the title, I have a dataset already converted to COCO format with bounding boxes and class labels but no segmentation masks. Which part of your code should be modified? Simply passing --mask=False does not work.

    opened by quyanqiu 0
  • #BUG


    when i run the, the error comes

    ViDT training and evaluation script: error: unrecognized arguments: true

    in, my code is

    args = parser.parse_args(['--method', 'vidt', '--backbone_name', 'swin_nano', '--epochs', '50', '--lr', '1e-4', '--min-lr', '1e-7', '--batch_size', '2', '--num_workers', '2', '--aux_loss', 'true', '--with_box_refine', 'true', '--det_token_num', '100', '--epff', ' true', '--token_label', 'true', '--iou_aware', 'true', '--with_vector', 'true', '--masks', 'true', '--coco_path', '/r/code/coco', '--output_dir', './output',])

    opened by ross-Hr 1