TResNet: High Performance GPU-Dedicated Architecture

Last update: Dec 28, 2022

paperV2 | pretrained models

Official PyTorch Implementation

Tal Ridnik, Hussam Lawen, Asaf Noy, Itamar Friedman, Emanuel Ben Baruch, Gilad Sharir
DAMO Academy, Alibaba Group

Abstract

Many deep learning models, developed in recent years, reach higher ImageNet accuracy than ResNet50, with fewer or comparable FLOPs count. While FLOPs are often seen as a proxy for network efficiency, when measuring actual GPU training and inference throughput, vanilla ResNet50 is usually significantly faster than its recent competitors, offering a better throughput-accuracy trade-off. In this work, we introduce a series of architecture modifications that aim to boost neural networks' accuracy, while retaining their GPU training and inference efficiency. We first demonstrate and discuss the bottlenecks induced by FLOPs-optimizations. We then suggest alternative designs that better utilize GPU structure and assets. Finally, we introduce a new family of GPU-dedicated models, called TResNet, which achieve better accuracy and efficiency than previous ConvNets. Using a TResNet model, with similar GPU throughput to ResNet50, we reach 80.7% top-1 accuracy on ImageNet. Our TResNet models also transfer well and achieve state-of-the-art accuracy on competitive datasets such as Stanford Cars (96.0%), CIFAR-10 (99.0%), CIFAR-100 (91.5%) and Oxford-Flowers (99.1%). They also perform well on multi-label classification and object detection tasks.

29/11/2021 Update - New article released, offering new classification head with state-of-the-art results

Check out our new project, ML-Decoder, which presents a unified classification head for multi-label, single-label and zero-shot tasks. Backbones with ML-Decoder reach SOTA results while also improving the speed-accuracy tradeoff.

23/4/2021 Update - ImageNet21K Pretraining

In a new article, we share pretrained weights for TResNet models from ImageNet21K training that dramatically outperform standard pretraining. The TResNet-M model, for example, improves its ImageNet-1K score from 80.7% to 83.1%! This kind of improvement is consistently achieved on all downstream tasks.

28/8/2020: V2 of TResNet Article Released

Sotabench Comparisons

Comparative results from the sotabench benchmark demonstrate that TResNet models give an excellent speed-accuracy tradeoff.

11/6/2020: V1 of TResNet Article Released

The main change: in addition to single-label SOTA results, we also added top results for multi-label classification and object detection tasks using TResNet. For example, we set a new SOTA record on the MS-COCO multi-label dataset, surpassing the previous top result by more than 2.5% mAP!

Backbone	mAP
KSSNet (previous SOTA)	83.7
TResNet-L	86.4

2/6/2020: CVPR-Kaggle competitions

We participated and won top places in two major CVPR-Kaggle competitions:

2nd place in Herbarium 2020 competition, out of 153 teams.
7th place in Plant-Pathology 2020 competition, out of 1317 teams.

TResNet was a vital part of our solution for both competitions, allowing us to work on high resolutions and reach top scores while doing fast and efficient experiments.

Main Article Results

TResNet Models

TResNet models' accuracy and GPU throughput on ImageNet, compared to ResNet50. All measurements were done on an Nvidia V100 GPU with mixed precision. All models were trained at an input resolution of 224.

Models	Top Training Speed (img/sec)	Top Inference Speed (img/sec)	Max Train Batch Size	Top-1 Acc.
ResNet50	805	2830	288	79.0
EfficientNetB1	440	2740	196	79.2
TResNet-M	730	2930	512	80.8
TResNet-L	345	1390	316	81.5
TResNet-XL	250	1060	240	82.0
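
To read the table as a trade-off, each model's speed can be expressed relative to ResNet50. The snippet below is a minimal sketch that uses only the numbers from the table above; the helper name `relative_to_baseline` is illustrative and is not part of the TResNet code.

```python
# Rows copied from the table above:
# model -> (train img/sec, inference img/sec, max train batch, top-1 %)
RESULTS = {
    "ResNet50": (805, 2830, 288, 79.0),
    "EfficientNetB1": (440, 2740, 196, 79.2),
    "TResNet-M": (730, 2930, 512, 80.8),
    "TResNet-L": (345, 1390, 316, 81.5),
    "TResNet-XL": (250, 1060, 240, 82.0),
}


def relative_to_baseline(name, baseline="ResNet50"):
    """Return (train speed ratio, inference speed ratio, top-1 gain in points)."""
    train, infer, _, acc = RESULTS[name]
    b_train, b_infer, _, b_acc = RESULTS[baseline]
    return train / b_train, infer / b_infer, acc - b_acc


for model in RESULTS:
    t, i, d = relative_to_baseline(model)
    print(f"{model:15s} train x{t:.2f}  infer x{i:.2f}  top-1 {d:+.1f}")
```

For TResNet-M this gives roughly 0.91x the training speed and 1.04x the inference speed of ResNet50, for a gain of 1.8 points of top-1 accuracy.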

Comparison To Other Networks

Comparison of ResNet50 to top modern networks with similar top-1 ImageNet accuracy. All measurements were done on an Nvidia V100 GPU with mixed precision. To obtain optimal speeds, training and inference were measured at 90% of the maximal possible batch size. Except for TResNet-M, all the models' ImageNet scores were taken from the public repository, which specializes in providing top implementations of modern networks. Except for EfficientNet-B1, which has an input resolution of 240, all models have an input resolution of 224.

Model	Top Training Speed (img/sec)	Top Inference Speed (img/sec)	Top-1 Acc.	FLOPs [G]
ResNet50	805	2830	79.0	4.1
ResNet50-D	600	2670	79.3	4.4
ResNeXt50	490	1940	79.4	4.3
EfficientNetB1	440	2740	79.2	0.6
SEResNeXt50	400	1770	79.9	4.3
MixNet-L	400	1400	79.0	0.5
TResNet-M	730	2930	80.8	5.5
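
The measurement protocol described above (run at 90% of the maximal batch size and report images per second) can be sketched as follows. This is not the authors' benchmarking code: `benchmark_batch_size`, `measure_throughput` and `step_fn` are hypothetical names, and `step_fn` stands in for one forward (or forward and backward) pass. On a GPU, `step_fn` would also need to synchronize the device before returning so that the timer sees the completed work.

```python
import time


def benchmark_batch_size(max_batch_size, fraction=0.9):
    """Batch size used for timing: a fixed fraction of the largest batch that fits."""
    return int(max_batch_size * fraction)


def measure_throughput(step_fn, batch_size, iters=10, warmup=3):
    """Return images/sec for a callable that processes one batch per call."""
    for _ in range(warmup):  # warm-up iterations are not timed
        step_fn(batch_size)
    start = time.perf_counter()
    for _ in range(iters):
        step_fn(batch_size)
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


if __name__ == "__main__":
    # Dummy workload standing in for a model step (assumption for illustration).
    def fake_step(batch_size):
        sum(range(batch_size * 100))

    # 512 is the max train batch size listed for TResNet-M in the TResNet Models table.
    bs = benchmark_batch_size(512)
    print(bs, f"{measure_throughput(fake_step, bs):.0f} img/sec")
```

Timing several iterations after a warm-up phase, rather than a single call, keeps one-off start-up costs out of the reported figure.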

Transfer Learning SotA Results

Comparison of TResNet to state-of-the-art models on transfer learning datasets (only ImageNet-based transfer learning results). Model inference speed is measured on a V100 GPU with mixed precision. Since no official implementation of Gpipe was provided, its inference speed is unknown.

Dataset	Model	Top-1 Acc.	Speed img/sec	Input
CIFAR-10	Gpipe	99.0	-	480
CIFAR-10	TResNet-XL	99.0	1060	224
CIFAR-100	EfficientNet-B7	91.7	70	600
CIFAR-100	TResNet-XL	91.5	1060	224
Stanford Cars	EfficientNet-B7	94.7	70	600
Stanford Cars	TResNet-L	96.0	500	368
Oxford-Flowers	EfficientNet-B7	98.8	70	600
Oxford-Flowers	TResNet-L	99.1	500	368

Reproduce Article Scores

We provide code for reproducing the validation top-1 score of TResNet models on ImageNet. First, download pretrained models from here.

Then, run the infer.py script. For example, for tresnet_m (input size 224) run:

python infer.py \
--val_dir=/path/to/imagenet_val_folder \
--model_path=/model/path/to/tresnet_m.pth \
--model_name=tresnet_m \
--input_size=224
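
The reported metric is top-1 accuracy: the share of validation images whose highest-scoring class equals the ground-truth label. The sketch below illustrates the metric on plain Python lists; it is not the code in infer.py, and the function name `top1_accuracy` is an assumption made for illustration.

```python
def top1_accuracy(logits, labels):
    """Percentage of samples whose arg-max class matches the label.

    logits: list of per-class score lists, one per image
    labels: list of integer class indices
    """
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the highest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


# Three images, three classes: the first two are classified correctly.
scores = [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05], [0.3, 0.3, 0.4]]
print(top1_accuracy(scores, [1, 0, 0]))
```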

TResNet Training

Due to IP limitations, we do not provide the exact training code that was used to obtain the article results.

However, TResNet is now an integral part of the popular rwightman / pytorch-image-models repo. Using that repo, you can reach results very similar to those stated in the article.

For example, training tresnet_m on rwightman / pytorch-image-models with the command line:

python -u -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=1 --node_rank=0 ./train.py /data/imagenet/ \
-b=190 --lr=0.6 --model-ema --aa=rand-m9-mstd0.5-inc1 \
--num-gpu=8 -j=16 --amp \
--model=tresnet_m --epochs=300 --mixup=0.2 \
--sched='cosine' --reprob=0.4 --remode=pixel

gave accuracy of 80.5%.
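
As a rough guide to the flags above, the sketch below computes the global batch size and a plain cosine learning-rate curve for this run. It assumes that `-b` is the per-process batch size under distributed launch, and it ignores warmup, minimum learning rate, and any other scheduler options that the training script may apply, so it approximates the shape of the schedule rather than reproducing it. Both function names are illustrative.

```python
import math


def global_batch_size(per_gpu_batch, num_gpus):
    """Images per optimizer step, assuming one process per GPU."""
    return per_gpu_batch * num_gpus


def cosine_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Plain cosine annealing from base_lr (epoch 0) to min_lr (last epoch)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Values taken from -b=190 and --nproc_per_node=8.
print(global_batch_size(190, 8))

# Values taken from --epochs=300 and --lr=0.6.
for epoch in (0, 75, 150, 225, 300):
    print(epoch, round(cosine_lr(epoch, 300, 0.6), 4))
```

Under these assumptions each optimizer step sees 1520 images, and the learning rate falls from 0.6 to half that value at the midpoint of training.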

Also, during the merge request, we had interesting discussions and insights regarding TResNet design. I am attaching a PDF version of these discussions. They can shed more light on TResNet design considerations and directions for the future.

TResNet discussion and insights

(taken with permission from here)

Tips For Working With Inplace-ABN

See INPLACE_ABN_TIPS.

Citation

@misc{ridnik2020tresnet,
    title={TResNet: High Performance GPU-Dedicated Architecture},
    author={Tal Ridnik and Hussam Lawen and Asaf Noy and Itamar Friedman},
    year={2020},
    eprint={2003.13630},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Contact

Feel free to contact me if there are any questions or issues (Tal Ridnik, [email protected]).

Comments

How to install Inplace_ABN for multi-GPU to use TResNet?

Dear Tal, First, thank you for the valuable TResNets, which optimize GPU utilization. I really love the models. I installed your package from source and it works very well on one GPU (cuda:0), but I cannot use it on other GPUs. I am using the TResNet family with PyTorch version 1.5, CUDA version 10.2, nvcc version 10.2. I also export CUDA_HOME=/usr/local/cuda-10.2. My machine has 2 GTX 1080Ti cards: one is plugged into a 16-lane PCIe slot (cuda:0), the other into a 4-lane PCIe slot (cuda:1). After installing according to your tutorial, I found that only the cuda:0 GPU works properly with TResNet models, that is, with the INPLACE_ABN functions. The other GPU raises an error when I try to train or run inference with TResNet models. On the other hand, cuda:1 still works very well when I run other models such as EfficientNet. In other words, how can I set things up to train TResNet on the other GPUs of one machine? Do you have any tips for installing INPLACE_ABN for multiple GPUs? Best regards. Linh

opened by linhduongtuan 12

Integrate TResNet to Object Detection

Hi authors, Thanks for your great work of extremely fast TResNet.

TResNet is demonstrated to be excellent in Image Classification. I am curious about its robustness in Object Detection. Currently, there are several frameworks pushing research on Object Detection, like mmdetection and detectron.

So, do you have any plan to integrate your great work with these frameworks? Also, I see that you are working with rwightman on pytorch-image-models in order to add TResNet to that framework. I appreciate your work and hope to see the results in object detection.

Thanks,

opened by thuyngch 7

Multi GPU training error

Hi while using multiple GPUs for training I get this:

File "/workspace/TResNet/src/models/tresnet/layers/anti_aliasing.py", line 40, in __call__
    return F.conv2d(input_pad, self.filt, stride=2, padding=0, groups=input.shape[1])
RuntimeError: Assertion `THCTensor_(checkGPU)(state, 3, input, output, weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /tmp/pip-req-build-cms73_uj/aten/src/THCUNN/generic/SpatialDepthwiseConvolution.cu:19

However single GPU training using CUDA_VISIBLE_DEVICES=0 before my training script works fine. I can see the losses going down after iterations.

Can you help with this?

opened by yashnv 7

Errors while running the code for TresNetV2

Hi,

I am getting the following errors while trying to run inference on Stanford Cars.

__init__() got an unexpected keyword argument 'remove_model_jit'
  File "TResNet/src/models/tresnet_v2/tresnet_v2.py", line 117, in __init__
    anti_alias_layer(channels=planes, filt_size=3, stride=2))
  File "TResNet/src/models/tresnet_v2/tresnet_v2.py", line 215, in _make_layer
    layers.append(block(self.inplanes, planes, stride, downsample, use_se=use_se,
  File "TResNet/src/models/tresnet_v2/tresnet_v2.py", line 160, in __init__
    layer2 = self._make_layer(Bottleneck, self.planes * 2, layers[1], stride=2, use_se=True,
  File "TResNet/src/models/tresnet_v2/tresnet_v2.py", line 239, in TResnetL_V2
    model = TResNetV2(layers=layers_list, num_classes=num_classes, in_chans=in_chans,
  File "TResNet/src/models/utils/factory.py", line 23, in create_model
    model = TResnetL_V2(model_params)
  File "TResNet/infer.py", line 33, in main
    model = create_model(args).cuda()
  File "TResNet/infer.py", line 64, in <module>
    main()
  File "/home/ajmal/anaconda3/envs/alibaba_miil_dev/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ajmal/anaconda3/envs/alibaba_miil_dev/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/ajmal/anaconda3/envs/alibaba_miil_dev/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/ajmal/anaconda3/envs/alibaba_miil_dev/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ajmal/anaconda3/envs/alibaba_miil_dev/lib/python3.8/runpy.py", line 194, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,

opened by ma-siddiqui 5

Something goes wrong when loading the model

Traceback (most recent call last):
  File "infer.py", line 99, in <module>
    main()
  File "infer.py", line 35, in main
    aaa = torch.jit.script(model2)
  File "/home/kpl/.conda/envs/ASL/lib/python3.6/site-packages/torch/jit/_script.py", line 898, in script
    obj, torch.jit._recursive.infer_methods_to_compile
  File "/home/kpl/.conda/envs/ASL/lib/python3.6/site-packages/torch/jit/_recursive.py", line 352, in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
  File "/home/kpl/.conda/envs/ASL/lib/python3.6/site-packages/torch/jit/_recursive.py", line 410, in create_script_module_impl
    create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
  File "/home/kpl/.conda/envs/ASL/lib/python3.6/site-packages/torch/jit/_recursive.py", line 304, in create_methods_and_properties_from_stubs
    concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
RuntimeError: Tried to set nonexistent attribute: embeddings. Did you forget to initialize it in __init__()?:
  File "/home/kpl/code/multilabel/TResNet/src/models/tresnet/tresnet.py", line 190
    def forward(self, x):
        x = self.body(x)
        self.embeddings = self.global_pool(x)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        logits = self.head(self.embeddings)
        return logits

opened by myh12138 4

Loss falling into NaN

I trained my own model on my own dataset, using your TResNet as the feature extractor.

However, my loss becomes NaN after a few hundred iterations. If I lower the learning rate, the loss becomes NaN later. (Until it becomes NaN, my loss appears to converge well.)

Can I ask what optimizer, initial learning rate, and scheduler you used? Or do you have any idea what causes the NaN loss?

opened by YuNie24 4

Inference speed results

Hi, thanks for an interesting paper. I'm wondering how you measured inference speed in your paper. I measured the inference speed of the baseline ResNet50 from torchvision and of your TResNet-M using apex.amp with opt_level=O1 on a V100 with inputs of shape [1024, 3, 224, 224] and got the following numbers:

Resnet50 TorchVision 25.56M params
Mean of 10 runs 10 iters each BS=1024:
441.82+-0.31 msecs Forward. 0.00+-0.00 msecs Backward. Max memory: 8185.20Mb. 2317.68 imgs/sec

TResNetM 31.39M params
Mean of 10 runs 10 iters each BS=1024:
466.05+-0.19 msecs Forward. 0.00+-0.00 msecs Backward. Max memory: 4255.46Mb. 2197.17 imgs/sec
opened by bonlime 4

SelectAdaptivePool2d Layer in TResNet_v2

Hi @mrT23, I don't see the module SelectAdaptivePool2d in tresnet_v2.py (module not found error). Can I use FastGlobalAvgPool2d from tresnet.py instead of SelectAdaptivePool2d? What's the difference?

And, for create_dataloader, what is the format for 'val_dir'? I have Stanford cars test data in /data/test/class/*.jpg

opened by nikhilgunti 3

1x1 or 3x3 stem conv?

Hi, just like you, I wanted to try s2d stem after reading the isometric nets paper :)

I noticed that in your paper, Figure 1, you show using 4x4 s2d followed by a 1x1 conv64. However, in your code here you clearly follow the 4x4 s2d by a 3x3 conv64. So, which one is used for the results in the paper?

opened by lucasb-eyer 3

Where is the implementation of focal loss for multi-label classification?

Thank you very much for your insights!

I have a question: where is the implementation of focal loss for multi-label classification?

Best regards! dongliang

opened by Vipermdl 3

Does the SpaceToDepth work in the net?

Hi, I am really sorry for my rudeness, and thank you for pointing it out. I will pay attention to my words in the future and improve my poor English. I have one question about SpaceToDepth after reading the paper: I noticed that you use the S2D block in the stem instead of using the Isometric Nets? In the paper, in Table 4 (ablation study), line 2 (add_Stem_space_2_depth), top-1 accuracy increases by just 0.1%. Is that a meaningful gain for the model, or just a reasonable fluctuation?

Looking forward to your reply; sorry again for my rudeness.

opened by zj19921221 3
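
For readers unfamiliar with the operation discussed in the stem-related comments above, space-to-depth moves each bs x bs spatial block into the channel dimension, so a 3 x 224 x 224 image becomes 48 x 56 x 56 for bs = 4. The pure-Python sketch below illustrates the rearrangement on nested lists. The function name and the channel ordering are assumptions made for illustration, and the repository's layer operates on batched tensors rather than on lists.

```python
def space_to_depth(x, block_size):
    """Rearrange a C x H x W nested list into (C*bs*bs) x (H/bs) x (W/bs)."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    if height % block_size or width % block_size:
        raise ValueError("height and width must be divisible by block_size")
    out_h, out_w = height // block_size, width // block_size
    out = []
    # One output channel per (row offset, column offset, input channel) triple.
    for dy in range(block_size):
        for dx in range(block_size):
            for c in range(channels):
                out.append([
                    [x[c][i * block_size + dy][j * block_size + dx] for j in range(out_w)]
                    for i in range(out_h)
                ])
    return out


# A 1 x 4 x 4 input with block size 2 becomes 4 x 2 x 2; no values are lost.
image = [[[r * 4 + col for col in range(4)] for r in range(4)]]
packed = space_to_depth(image, 2)
print(len(packed), len(packed[0]), len(packed[0][0]))
print(packed[0])
```

Because the operation only moves values around, it reduces spatial resolution without discarding information, leaving the following convolution to mix the packed channels.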
