MPViT:Multi-Path Vision Transformer for Dense Prediction

Overview

MPViT : Multi-Path Vision Transformer for Dense Prediction

This repository inlcudes official implementations and model weights for MPViT.

[Arxiv] [BibTeX]

MPViT : Multi-Path Vision Transformer for Dense Prediction
🏛️ ️️ 🏫 Youngwan Lee, 🏛️ ️️Jonghee Kim, 🏫 Jeff Willette, 🏫 Sung Ju Hwang
ETRI 🏛️ ️, KAIST 🏫

Abstract

We explore multi-scale patch embedding and multi-path structure, constructing the Multi-Path Vision Transformer (MPViT). MPViT embeds features of the same size (i.e., sequence length) with patches of different scales simultaneously by using overlapping convolutional patch embedding. Tokens of different scales are then independently fed into the Transformer encoders via multiple paths and the resulting features are aggregated, enabling both fine and coarse feature representations at the same feature level. Thanks to the diverse and multi-scale feature representations, our MPViTs scaling from Tiny(5M) to Base(73M) consistently achieve superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation. These extensive results demonstrate that MPViT can serve as a versatile backbone network for various vision tasks.

Main results on ImageNet-1K

🚀 These all models are trained on ImageNet-1K with the same training recipe as DeiT and CoaT.

model resolution [email protected] #params FLOPs weight
MPViT-T 224x224 78.2 5.8M 1.6G weight
MPViT-XS 224x224 80.9 10.5M 2.9G weight
MPViT-S 224x224 83.0 22.8M 4.7G weight
MPViT-B 224x224 84.3 74.8M 16.4G weight

Main results on COCO object detection

🚀 All model are trained using ImageNet-1K pretrained weights.

☀️ MS denotes the same multi-scale training augmentation as in Swin-Transformer which follows the MS augmentation as in DETR and Sparse-RCNN. Therefore, we also follows the official implementation of DETR and Sparse-RCNN which are also based on Detectron2.

Please refer to detectron2/ for the details.

Backbone Method lr Schd box mAP mask mAP #params FLOPS weight
MPViT-T RetinaNet 1x 41.8 - 17M 196G model | metrics
MPViT-XS RetinaNet 1x 43.8 - 20M 211G model | metrics
MPViT-S RetinaNet 1x 45.7 - 32M 248G model | metrics
MPViT-B RetinaNet 1x 47.0 - 85M 482G model | metrics
MPViT-T RetinaNet MS+3x 44.4 - 17M 196G model | metrics
MPViT-XS RetinaNet MS+3x 46.1 - 20M 211G model | metrics
MPViT-S RetinaNet MS+3x 47.6 - 32M 248G model | metrics
MPViT-B RetinaNet MS+3x 48.3 - 85M 482G model | metrics
MPViT-T Mask R-CNN 1x 42.2 39.0 28M 216G model | metrics
MPViT-XS Mask R-CNN 1x 44.2 40.4 30M 231G model | metrics
MPViT-S Mask R-CNN 1x 46.4 42.4 43M 268G model | metrics
MPViT-B Mask R-CNN 1x 48.2 43.5 95M 503G model | metrics
MPViT-T Mask R-CNN MS+3x 44.8 41.0 28M 216G model | metrics
MPViT-XS Mask R-CNN MS+3x 46.6 42.3 30M 231G model | metrics
MPViT-S Mask R-CNN MS+3x 48.4 43.9 43M 268G model | metrics
MPViT-B Mask R-CNN MS+3x 49.5 44.5 95M 503G model | metrics

Deformable-DETR

All models are trained using the same training recipe.

Please refer to deformable_detr/ for the details.

backbone box mAP epochs link
ResNet-50 44.5 50 -
CoaT-lite S 47.0 50 link
CoaT-S 48.4 50 link
MPViT-S 49.0 50 link

Main results on ADE20K Semantic segmentation

All model are trained using ImageNet-1K pretrained weight.

Please refer to semantic_segmentation/ for the details.

Backbone Method Crop Size Lr Schd mIoU #params FLOPs weight
MPViT-S UperNet 512x512 160K 48.3 52M 943G weight
MPViT-B UperNet 512x512 160K 50.3 105M 1185G weight

Getting Started

We use pytorch==1.7.0 torchvision==0.8.1 cuda==10.1 libraries on NVIDIA V100 GPUs. If you use different versions of cuda, you may obtain different accuracies, but the differences are negligible.

Acknowledgement

This repository is built using the Timm library, DeiT, CoaT, Detectron2, mmsegmentation repositories.

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-00004, Development of Previsional Intelligence based on Long-term Visual Memory Network and No. 2014-3-00123, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis).

License

Please refer to MPViT LSA.

Citing MPViT

@article{lee2021mpvit,
      title={MPViT: Multi-Path Vision Transformer for Dense Prediction}, 
      author={Youngwan Lee and Jonghee Kim and Jeff Willette and Sung Ju Hwang},
      year={2021},
      journal={arXiv preprint arXiv:2112.11010}
}
Issues
  • When will open source be available?

    When will open source be available?

    When will open source be available?

    opened by 1haomingtian 3
  • Looking forward to your work

    Looking forward to your work

    null

    opened by Joysheng 2
  • Some details on the implementation.

    Some details on the implementation.

    Hi @youngwanLEE.

    Thanks for the great work. After reading the paper, I have some questions about the implementation details. One of the questions is that whether you have used the Relative Position Encoding in the transformer architecture design as Coat (Code here). I found that discarding the relative position encoding in Factorized Attention leads to a significant acc drop.

    Looking forward to your reply. Thx.

    opened by ChongjianGE 2
Owner
Youngwan Lee
Researcher at ETRI & Ph.D student in Graduate school of AI at KAIST.
Youngwan Lee
Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

Phil Wang 8.2k Jan 17, 2022
This repository builds a basic vision transformer from scratch so that one beginner can understand the theory of vision transformer.

vision-transformer-from-scratch This repository includes several kinds of vision transformers from scratch so that one beginner can understand the the

null 1 Dec 24, 2021
Exploring Relational Context for Multi-Task Dense Prediction [ICCV 2021]

Adaptive Task-Relational Context (ATRC) This repository provides source code for the ICCV 2021 paper Exploring Relational Context for Multi-Task Dense

David Brüggemann 25 Jan 12, 2022
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

Vision Longformer This project provides the source code for the vision longformer paper. Multi-Scale Vision Longformer: A New Vision Transformer for H

Microsoft 163 Jan 5, 2022
Dense Prediction Transformers

Vision Transformers for Dense Prediction This repository contains code and models for our paper: Vision Transformers for Dense Prediction René Ranftl,

Intel ISL (Intel Intelligent Systems Lab) 946 Jan 8, 2022
Implementation of "A MLP-like Architecture for Dense Prediction"

A MLP-like Architecture for Dense Prediction (arXiv) Updates (22/07/2021) Initial release. Model Zoo We provide CycleMLP models pretrained on ImageNet

Shoufa Chen 164 Jan 7, 2022
Dense Prediction Transformers

Vision Transformers for Dense Prediction This repository contains code and models for our paper: Vision Transformers for Dense Prediction René Ranftl,

Intelligent Systems Lab Org 953 Jan 15, 2022
[ICCV 2021] FaPN: Feature-aligned Pyramid Network for Dense Image Prediction

FaPN: Feature-aligned Pyramid Network for Dense Image Prediction [arXiv] [Project Page] @inproceedings{ huang2021fapn, title={{FaPN}: Feature-alig

Shihua Huang 18 Jan 4, 2022
This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" on Object Detection and Instance Segmentation.

Swin Transformer for Object Detection This repo contains the supported code and configuration files to reproduce object detection results of Swin Tran

Swin Transformer 890 Jan 11, 2022
Alex Pashevich 41 Dec 31, 2021
The implementation of "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer"

Shuffle Transformer The implementation of "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer" Introduction Very recently, window-

null 73 Jan 8, 2022
Unofficial implementation of "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (https://arxiv.org/abs/2103.14030)

Swin-Transformer-Tensorflow A direct translation of the official PyTorch implementation of "Swin Transformer: Hierarchical Vision Transformer using Sh

null 39 Jan 7, 2022
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped

CSWin-Transformer This repo is the official implementation of "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows". Th

Microsoft 266 Jan 13, 2022
Implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

CrossViT : Cross-Attention Multi-Scale Vision Transformer for Image Classification This is an unofficial PyTorch implementation of CrossViT: Cross-Att

Rishikesh (ऋषिकेश) 68 Jan 7, 2022
Official implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

CrossViT This repository is the official implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. ArXiv If

International Business Machines 81 Jan 17, 2022
GluonMM is a library of transformer models for computer vision and multi-modality research

GluonMM is a library of transformer models for computer vision and multi-modality research. It contains reference implementations of widely adopted baseline models and also research work from Amazon Research.

null 41 Dec 16, 2021
VSR-Transformer - This paper proposes a new Transformer for video super-resolution (called VSR-Transformer).

VSR-Transformer By Jiezhang Cao, Yawei Li, Kai Zhang, Luc Van Gool This paper proposes a new Transformer for video super-resolution (called VSR-Transf

Jiezhang Cao 185 Jan 10, 2022
Multi Agent Path Finding Algorithms

MATP-solver Simulator collision check path step random initial states or given states Traditional method Seperate A* algorithem Confict-based Search S

null 11 Dec 1, 2021
Price-Prediction-For-a-Dream-Home - A machine learning based linear regression trained model for house price prediction.

Price-Prediction-For-a-Dream-Home ROADMAP TO THIS LINEAR REGRESSION BASED HOUSE PRICE PREDICTION PREDICTION MODEL Import all the dependencies of the p

DIKSHA DESWAL 1 Dec 29, 2021