# Swin Transformer
This project explores deploying Swin Transformer with TensorRT, including FP16 and INT8 test results.
## Introduction (quoted from the original project)
Swin Transformer (original GitHub repo; the name Swin stands for Shifted window) was initially described in an arXiv paper and capably serves as a general-purpose backbone for computer vision. It is basically a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connections.
## Setup
- Please refer to the Install section to build the conda environment.
- Please refer to the Data preparation section to prepare ImageNet-1K.
- Install TensorRT; TensorRT 8.2 GA (8.2.1.8) is used as the test version here (a quick version check is shown below).
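After installation, one quick way to confirm the Python bindings match the expected release (assuming the `tensorrt` wheel shipped with the TensorRT tarball was installed):

```bash
$ python -c "import tensorrt; print(tensorrt.__version__)"
8.2.1.8
```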
## Code Structure

The tree below focuses on the modifications and additions.
```
.
├── export.py                 # Export the PyTorch model to ONNX format
├── get_started.md
├── main.py
├── models
│   ├── build.py
│   ├── __init__.py
│   ├── swin_mlp.py
│   └── swin_transformer.py   # Build the model; modified to export ONNX and build the TensorRT engine
├── README.md
├── trt                       # TensorRT engine evaluation and visualization
│   ├── engine.py
│   ├── eval_trt.py           # Evaluate the TensorRT engine's accuracy
│   └── onnxrt_eval.py        # Run the ONNX model and generate results (debugging only)
├── utils.py
└── weights
```
## Export to ONNX and Build the TensorRT Engine

You need to pay attention to the two modifications below.
- `Exporting the operator roll to ONNX opset version 9 is not supported.`

  A: Please refer to `torch/onnx/symbolic_opset9.py` and add support for exporting `torch.roll`; a sketch is given after this list.
- `Node (Concat_264) Op (Concat) [ShapeInferenceError] All inputs to Concat must have same rank.`

  A: Please refer to the modifications in `models/swin_transformer.py`. We use `input_resolution` and `window_size` to compute `nW` instead of reading it from the mask shape:

  ```python
  if mask is not None:
      nW = int(self.input_resolution[0] * self.input_resolution[1]
               / self.window_size[0] / self.window_size[1])
      # nW = mask.shape[0]
      attn = attn.view(B_ // nW, nW, self.num_heads, N, N) + mask.unsqueeze(1).unsqueeze(0)
      attn = attn.view(-1, self.num_heads, N, N)
      attn = self.softmax(attn)
  ```
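For the first fix, here is a minimal sketch of an opset-9 symbolic for `torch.roll`, to be added to `torch/onnx/symbolic_opset9.py`. It decomposes each rolled dimension into two `Slice` ops plus a `Concat`, which opset 9 supports; this mirrors the pattern later merged upstream, so adapt it to your PyTorch version:

```python
# Sketch: opset-9 export support for torch.roll (adapt to your PyTorch version).
# torch.roll(x, shifts, dims) == concat(x[..., -shift:], x[..., :-shift]) per dim.
from sys import maxsize
from torch.onnx.symbolic_helper import parse_args

@parse_args('v', 'is', 'is')
def roll(g, self, shifts, dims):
    assert len(shifts) == len(dims)
    result = self
    for shift, dim in zip(shifts, dims):
        # The tail [-shift:] becomes the head after rolling; the old head follows it.
        tail = g.op('Slice', result, axes_i=[dim], starts_i=[-shift], ends_i=[maxsize])
        head = g.op('Slice', result, axes_i=[dim], starts_i=[0], ends_i=[-shift])
        result = g.op('Concat', tail, head, axis_i=dim)
    return result
```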
## Accuracy Test Results on the ImageNet-1K Validation Dataset
- Download the Swin-T pretrained model from the Model Zoo and evaluate the accuracy of the PyTorch pretrained model.

  ```bash
  $ python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --eval --cfg configs/swin_tiny_patch4_window7_224.yaml --resume ./weights/swin_tiny_patch4_window7_224.pth --data-path ../imagenet_1k
  ```
- `export.py` exports the PyTorch model to ONNX format.

  ```bash
  $ python export.py --eval --cfg configs/swin_tiny_patch4_window7_224.yaml --resume ./weights/swin_tiny_patch4_window7_224.pth --data-path ../imagenet_1k --batch-size 16
  ```
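  The core of the export is a single `torch.onnx.export` call. A minimal sketch (the tensor names and opset version here are assumptions; the repo's `export.py` builds `model` from the config the same way `main.py` does):

  ```python
  # Sketch of the ONNX export step; `config` comes from the repo's
  # parse_option()/get_config() exactly as in main.py.
  import torch
  from models import build_model  # repo module

  model = build_model(config).cuda().eval()
  ckpt = torch.load('./weights/swin_tiny_patch4_window7_224.pth', map_location='cpu')
  model.load_state_dict(ckpt['model'])
  dummy = torch.randn(16, 3, 224, 224, device='cuda')  # matches --batch-size 16
  torch.onnx.export(model, dummy, './weights/swin_tiny_patch4_window7_224.onnx',
                    input_names=['input_0'], output_names=['output_0'],
                    opset_version=11,  # any opset with the roll workaround in place
                    do_constant_folding=True)
  ```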
- Build the TensorRT engine using `trtexec`.

  ```bash
  $ trtexec --onnx=./weights/swin_tiny_patch4_window7_224.onnx --buildOnly --verbose --saveEngine=./weights/swin_tiny_patch4_window7_224_batch16.engine --workspace=4096
  ```

  Add the `--fp16` or `--best` flag to build the corresponding FP16 or INT8 engine. Take FP16 as an example:

  ```bash
  $ trtexec --onnx=./weights/swin_tiny_patch4_window7_224.onnx --buildOnly --verbose --fp16 --saveEngine=./weights/swin_tiny_patch4_window7_224_batch16_fp16.engine --workspace=4096
  ```

  You can also use `trtexec` to test the throughput of the TensorRT engine:

  ```bash
  $ trtexec --loadEngine=./weights/swin_tiny_patch4_window7_224_batch16.engine
  ```
- `trt/eval_trt.py` evaluates the accuracy of the TensorRT engine.

  ```bash
  $ python trt/eval_trt.py --eval --cfg configs/swin_tiny_patch4_window7_224.yaml --resume ./weights/swin_tiny_patch4_window7_224_batch16.engine --data-path ../imagenet_1k --batch-size 16
  ```
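  Under the hood, running the engine amounts to deserializing it and executing with device buffers. A minimal sketch of this flow (standard TensorRT 8.x Python API with PyCUDA; not the repo's exact `trt/engine.py`):

  ```python
  # Sketch: deserialize a serialized engine and run one batch.
  import numpy as np
  import tensorrt as trt
  import pycuda.autoinit  # noqa: F401, creates a CUDA context
  import pycuda.driver as cuda

  logger = trt.Logger(trt.Logger.WARNING)
  with open('./weights/swin_tiny_patch4_window7_224_batch16.engine', 'rb') as f:
      engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
  context = engine.create_execution_context()

  inp = np.random.rand(16, 3, 224, 224).astype(np.float32)  # stand-in batch
  out = np.empty((16, 1000), dtype=np.float32)
  d_inp, d_out = cuda.mem_alloc(inp.nbytes), cuda.mem_alloc(out.nbytes)
  cuda.memcpy_htod(d_inp, inp)
  context.execute_v2(bindings=[int(d_inp), int(d_out)])
  cuda.memcpy_dtoh(out, d_out)
  print(out.argmax(axis=1))  # predicted class id per image
  ```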
- `trt/onnxrt_eval.py` evaluates the accuracy of the ONNX model with ONNX Runtime (for debugging only).

  ```bash
  $ python trt/onnxrt_eval.py --eval --cfg configs/swin_tiny_patch4_window7_224.yaml --resume ./weights/swin_tiny_patch4_window7_224.onnx --data-path ../imagenet_1k --batch-size 16
  ```
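  For a quick standalone check, the exported model can also be run directly with the `onnxruntime` API. A minimal sketch for a single batch (preprocessing omitted; the real script evaluates the full validation set):

  ```python
  # Sketch: run the exported ONNX model on one (random) batch with ONNX Runtime.
  import numpy as np
  import onnxruntime as ort

  sess = ort.InferenceSession('./weights/swin_tiny_patch4_window7_224.onnx',
                              providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
  images = np.random.rand(16, 3, 224, 224).astype(np.float32)  # stand-in for a preprocessed batch
  logits = sess.run(None, {sess.get_inputs()[0].name: images})[0]
  print(logits.shape)  # (16, 1000) ImageNet-1K class scores
  ```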
| SwinTransformer (T4) | Acc@1 | Notes |
| --- | --- | --- |
| PyTorch pretrained model | 81.160 | |
| TensorRT engine (FP32) | 81.156 | |
| TensorRT engine (FP16) | - | TensorRT 8.0.3.4: 81.156% vs. TensorRT 8.2.1.8: 72.768% |
Note: an NVIDIA bug was filed for the FP16 accuracy issue; please refer to nvbug 3464358.
## Speed Test of the TensorRT Engine (T4)
| SwinTransformer (T4) | FP32 | FP16 | INT8 |
| --- | --- | --- | --- |
| batchsize=1 | 245.388 qps | 510.072 qps | 514.707 qps |
| batchsize=16 | 316.8624 qps | 804.112 qps | 804.1072 qps |
| batchsize=64 | 329.13984 qps | 833.4208 qps | 849.5168 qps |
| batchsize=256 | 331.9808 qps | 844.10752 qps | 840.33024 qps |
Analysis: INT8 currently brings no speedup over FP16. The main reason is that, for the Transformer structure, most of the computation is handled by Myelin, and Myelin does not yet support the PTQ path, so these results are expected.
The INT8 and FP16 engine layer information with batchsize=128 on T4 is attached below.
Build with INT8 precision:

```
[12/04/2021-06:34:17] [V] [TRT] Engine Layer Information:
Layer(Reformat): Reformatting CopyNode for Input Tensor 0 to Conv_0, Tactic: 0, input_0[Float(128,3,224,224)] -> Reformatted Input Tensor 0 to Conv_0[Int8(128,3,224,224)]
Layer(CaskConvolution): Conv_0, Tactic: 1025026069226666066, Reformatted Input Tensor 0 to Conv_0[Int8(128,3,224,224)] -> 191[Int8(128,96,56,56)]
Layer(Reformat): Reformatting CopyNode for Input Tensor 0 to {ForeignNode[318...Transpose_2125 + Flatten_2127 + (Unnamed Layer* 4178) [Shuffle]]}, Tactic: 0, 191[Int8(128,96,56,56)] -> Reformatted Input Tensor 0 to {ForeignNode[318...Transpose_2125 + Flatten_2127 + (Unnamed Layer* 4178) [Shuffle]]}[Half(128,96,56,56)]
Layer(Myelin): {ForeignNode[318...Transpose_2125 + Flatten_2127 + (Unnamed Layer* 4178) [Shuffle]]}, Tactic: 0, Reformatted Input Tensor 0 to {ForeignNode[318...Transpose_2125 + Flatten_2127 + (Unnamed Layer* 4178) [Shuffle]]}[Half(128,96,56,56)] -> (Unnamed Layer* 4178) [Shuffle]_output[Half(128,768,1,1)]
Layer(CaskConvolution): Gemm_2128, Tactic: -1838109259315759592, (Unnamed Layer* 4178) [Shuffle]_output[Half(128,768,1,1)] -> (Unnamed Layer* 4179) [Fully Connected]_output[Half(128,1000,1,1)]
Layer(Reformat): Reformatting CopyNode for Input Tensor 0 to (Unnamed Layer* 4183) [Shuffle], Tactic: 0, (Unnamed Layer* 4179) [Fully Connected]_output[Half(128,1000,1,1)] -> Reformatted Input Tensor 0 to (Unnamed Layer* 4183) [Shuffle][Float(128,1000,1,1)]
Layer(NoOp): (Unnamed Layer* 4183) [Shuffle], Tactic: 0, Reformatted Input Tensor 0 to (Unnamed Layer* 4183) [Shuffle][Float(128,1000,1,1)] -> output_0[Float(128,1000)]
```
Build with FP16 precision:

```
[12/04/2021-06:44:31] [V] [TRT] Engine Layer Information:
Layer(Reformat): Reformatting CopyNode for Input Tensor 0 to Conv_0, Tactic: 0, input_0[Float(128,3,224,224)] -> Reformatted Input Tensor 0 to Conv_0[Half(128,3,224,224)]
Layer(CaskConvolution): Conv_0, Tactic: 1579845938601132607, Reformatted Input Tensor 0 to Conv_0[Half(128,3,224,224)] -> 191[Half(128,96,56,56)]
Layer(Myelin): {ForeignNode[318...(Unnamed Layer* 4183) [Shuffle]]}, Tactic: 0, 191[Half(128,96,56,56)] -> Reformatted Output Tensor 0 to {ForeignNode[318...(Unnamed Layer* 4183) [Shuffle]]}[Half(128,1000)]
Layer(Reformat): Reformatting CopyNode for Output Tensor 0 to {ForeignNode[318...(Unnamed Layer* 4183) [Shuffle]]}, Tactic: 0, Reformatted Output Tensor 0 to {ForeignNode[318...(Unnamed Layer* 4183) [Shuffle]]}[Half(128,1000)] -> output_0[Float(128,1000)]
```
## Todo

After the FP16 nvbug 3464358 is resolved, QAT optimization will be done.