TResNet: High Performance GPU-Dedicated Architecture

paperV2 | pretrained models

Official PyTorch Implementation

Tal Ridnik, Hussam Lawen, Asaf Noy, Itamar Friedman, Emanuel Ben Baruch, Gilad Sharir
DAMO Academy, Alibaba Group

Abstract

Many deep learning models, developed in recent years, reach higher ImageNet accuracy than ResNet50, with fewer or comparable FLOPs count. While FLOPs are often seen as a proxy for network efficiency, when measuring actual GPU training and inference throughput, vanilla ResNet50 is usually significantly faster than its recent competitors, offering better throughput-accuracy trade-off. In this work, we introduce a series of architecture modifications that aim to boost neural networks' accuracy, while retaining their GPU training and inference efficiency. We first demonstrate and discuss the bottlenecks induced by FLOPs-optimizations. We then suggest alternative designs that better utilize GPU structure and assets. Finally, we introduce a new family of GPU-dedicated models, called TResNet, which achieve better accuracy and efficiency than previous ConvNets. Using a TResNet model, with similar GPU throughput to ResNet50, we reach 80.7% top-1 accuracy on ImageNet. Our TResNet models also transfer well and achieve state-of-the-art accuracy on competitive datasets such as Stanford Cars (96.0%), CIFAR-10 (99.0%), CIFAR-100 (91.5%) and Oxford-Flowers (99.1%). They also perform well on multi-label classification and object detection tasks.

29/11/2021 Update - New article released, offering new classification head with state-of-the-art results

Check out our new project, ML-Decoder, which presents a unified classification head for multi-label, single-label and zero-shot tasks. Backbones with ML-Decoder reach SOTA results, while also improving the speed-accuracy tradeoff.

23/4/2021 Update - ImageNet21K Pretraining

In a newly released article, we share pretrained weights for TResNet models from ImageNet21K training, which dramatically outperform standard pretraining. The TResNet-M model, for example, improves its ImageNet-1K score from 80.7% to 83.1%! This kind of improvement is consistently achieved on all downstream tasks.

28/8/2020: V2 of TResNet Article Released

Sotabench Comparisons

Comparative results from the sotabench benchmark demonstrate that TResNet models give an excellent speed-accuracy tradeoff.

11/6/2020: V1 of TResNet Article Released

The main change: in addition to single-label SOTA results, we also added top results for multi-label classification and object detection tasks, using TResNet. For example, we set a new SOTA record for the MS-COCO multi-label dataset, surpassing the previous top results by more than 2.5% mAP!

| Backbone | mAP |
|---|---|
| KSSNet (previous SOTA) | 83.7 |
| TResNet-L | 86.4 |

2/6/2020: CVPR-Kaggle competitions

We participated and won top places in two major CVPR-Kaggle competitions:

  • 2nd place in Herbarium 2020 competition, out of 153 teams.
  • 7th place in Plant-Pathology 2020 competition, out of 1317 teams.

    TResNet was a vital part of our solution for both competitions, allowing us to work on high resolutions and reach top scores while doing fast and efficient experiments.

Main Article Results

TResNet Models

TResNet models' accuracy and GPU throughput on ImageNet, compared to ResNet50. All measurements were done on an Nvidia V100 GPU with mixed precision. All models were trained at an input resolution of 224.

| Models | Top Training Speed (img/sec) | Top Inference Speed (img/sec) | Max Train Batch Size | Top-1 Acc. |
|---|---|---|---|---|
| ResNet50 | 805 | 2830 | 288 | 79.0 |
| EfficientNetB1 | 440 | 2740 | 196 | 79.2 |
| TResNet-M | 730 | 2930 | 512 | 80.8 |
| TResNet-L | 345 | 1390 | 316 | 81.5 |
| TResNet-XL | 250 | 1060 | 240 | 82.0 |
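To read the table as a trade-off, the throughputs can be normalized against ResNet50. The snippet below is only an illustrative calculation on the numbers listed above; the dictionary layout and the helper name `relative_to_baseline` are assumptions made for this example and are not part of the repository.

```python
# Values copied from the table above: (train img/sec, inference img/sec, top-1 %).
RESULTS = {
    "ResNet50": (805, 2830, 79.0),
    "EfficientNetB1": (440, 2740, 79.2),
    "TResNet-M": (730, 2930, 80.8),
    "TResNet-L": (345, 1390, 81.5),
    "TResNet-XL": (250, 1060, 82.0),
}


def relative_to_baseline(results, baseline="ResNet50"):
    """Return train/inference speed ratios and top-1 gain versus the baseline."""
    base_train, base_infer, base_acc = results[baseline]
    summary = {}
    for name, (train, infer, acc) in results.items():
        summary[name] = {
            "train_ratio": round(train / base_train, 2),
            "infer_ratio": round(infer / base_infer, 2),
            "top1_gain": round(acc - base_acc, 1),
        }
    return summary


if __name__ == "__main__":
    for name, row in relative_to_baseline(RESULTS).items():
        print(f"{name:15s} {row}")
```

Read this way, TResNet-M trains at roughly 0.91x the speed of ResNet50, runs inference at about 1.04x, and gains 1.8 points of top-1 accuracy.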

Comparison To Other Networks

Comparison of ResNet50 to top modern networks with similar top-1 ImageNet accuracy. All measurements were done on an Nvidia V100 GPU with mixed precision. To obtain optimal speeds, training and inference were measured at 90% of the maximal possible batch size. Except for TResNet-M, all the models' ImageNet scores were taken from the public repository, which specializes in providing top implementations of modern networks. Except for EfficientNet-B1, which has an input resolution of 240, all models have an input resolution of 224.

| Model | Top Training Speed (img/sec) | Top Inference Speed (img/sec) | Top-1 Acc. | FLOPs [G] |
|---|---|---|---|---|
| ResNet50 | 805 | 2830 | 79.0 | 4.1 |
| ResNet50-D | 600 | 2670 | 79.3 | 4.4 |
| ResNeXt50 | 490 | 1940 | 79.4 | 4.3 |
| EfficientNetB1 | 440 | 2740 | 79.2 | 0.6 |
| SEResNeXt50 | 400 | 1770 | 79.9 | 4.3 |
| MixNet-L | 400 | 1400 | 79.0 | 0.5 |
| TResNet-M | 730 | 2930 | 80.8 | 5.5 |
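The measurement protocol described above (throughput at a fixed fraction of the maximal batch size) can be approximated with a simple timing loop. The sketch below is not the script used for the article: `measure_throughput`, `batch_size_for_timing`, their parameters, and the optional `sync` hook are names invented for this example. On a GPU, `sync` would typically be `torch.cuda.synchronize`, so that queued kernels finish before the clock is read.

```python
import time


def measure_throughput(step, batch_size, iters=20, warmup=5, sync=None):
    """Return images/sec for `step`, a callable that processes one batch.

    `sync` is an optional, hypothetical hook (e.g. torch.cuda.synchronize)
    called before each clock read so that asynchronous work is included.
    """
    for _ in range(warmup):  # warm-up iterations are not timed
        step()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


def batch_size_for_timing(max_batch_size, fraction=0.9):
    """Batch size at a given fraction of the maximal possible batch size."""
    return int(max_batch_size * fraction)


if __name__ == "__main__":
    # Dummy CPU workload standing in for a model forward pass.
    bs = batch_size_for_timing(512)
    rate = measure_throughput(lambda: sum(range(10_000)), batch_size=bs)
    print(f"batch size {bs}: {rate:.0f} img/sec (dummy workload)")
```

The warm-up iterations matter in practice because the first few batches often include one-time costs (memory allocation, kernel selection) that would otherwise distort the average.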


Transfer Learning SotA Results

Comparison of TResNet to state-of-the-art models on transfer learning datasets (only ImageNet-based transfer learning results). Model inference speed is measured on a V100 GPU with mixed precision. Since no official implementation of Gpipe was provided, its inference speed is unknown.

| Dataset | Model | Top-1 Acc. | Speed (img/sec) | Input |
|---|---|---|---|---|
| CIFAR-10 | Gpipe | 99.0 | - | 480 |
| | TResNet-XL | 99.0 | 1060 | 224 |
| CIFAR-100 | EfficientNet-B7 | 91.7 | 70 | 600 |
| | TResNet-XL | 91.5 | 1060 | 224 |
| Stanford Cars | EfficientNet-B7 | 94.7 | 70 | 600 |
| | TResNet-L | 96.0 | 500 | 368 |
| Oxford-Flowers | EfficientNet-B7 | 98.8 | 70 | 600 |
| | TResNet-L | 99.1 | 500 | 368 |

Reproduce Article Scores

We provide code for reproducing the validation top-1 score of TResNet models on ImageNet. First, download pretrained models from here.

Then, run the infer.py script. For example, for tresnet_m (input size 224) run:

python infer.py \
--val_dir=/path/to/imagenet_val_folder \
--model_path=/model/path/to/tresnet_m.pth \
--model_name=tresnet_m \
--input_size=224
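According to the text above, the script reproduces a validation top-1 score. As a reminder of what that metric means, here is a minimal, framework-free sketch. It is not the repository's evaluation code, and `top1_accuracy` is a name chosen for illustration.

```python
def top1_accuracy(logits, labels):
    """Percentage of samples whose highest-scoring class equals the label.

    `logits` is a list of per-class score lists, `labels` a list of ints.
    """
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


if __name__ == "__main__":
    toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
    toy_labels = [1, 0, 2, 1]
    print(f"top-1: {top1_accuracy(toy_logits, toy_labels):.1f}%")
```

In the toy data, three of the four predictions match their labels, so the function returns 75.0.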

TResNet Training

Due to IP limitations, we do not provide the exact training code that was used to obtain the article results.

However, TResNet is now an integral part of the popular rwightman / pytorch-image-models repo. Using that repo, you can reach results very similar to those stated in the article.

For example, training tresnet_m on rwightman / pytorch-image-models with the command line:

python -u -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=1 --node_rank=0 ./train.py /data/imagenet/ \
-b=190 --lr=0.6 --model-ema --aa=rand-m9-mstd0.5-inc1 \
--num-gpu=8 -j=16 --amp \
--model=tresnet_m --epochs=300 --mixup=0.2 \
--sched='cosine' --reprob=0.4 --remode=pixel

gave accuracy of 80.5%.
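Two settings in the command above lend themselves to a quick calculation: the global batch size and the shape of the cosine learning-rate schedule. The sketch below assumes that `-b` is a per-process batch size and that the schedule is a plain half-cosine decay from `--lr` down to a minimum of 0 with no warm-up. The actual scheduler in pytorch-image-models has more options (warm-up, minimum learning rate, restarts), so treat this as an approximation. The function names are illustrative.

```python
import math


def global_batch_size(per_gpu_batch, num_gpus):
    """Total images per optimizer step across all processes."""
    return per_gpu_batch * num_gpus


def cosine_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Half-cosine decay from base_lr (epoch 0) to min_lr (final epoch)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Assumed reading of the command: -b=190 per GPU, 8 GPUs, --lr=0.6, --epochs=300.
    print("global batch:", global_batch_size(190, 8))
    for epoch in (0, 75, 150, 225, 300):
        print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch, 300, 0.6):.4f}")
```

Under these assumptions the global batch size is 1520, and the learning rate falls from 0.6 at the start to 0.3 at the midpoint and to 0 at the final epoch.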

Also, during the merge request, we had interesting discussions and insights regarding TResNet design. I am attaching a PDF version of those discussions. They can shed more light on TResNet design considerations and directions for the future.

TResNet discussion and insights

(taken with permission from here)

Tips For Working With Inplace-ABN

See INPLACE_ABN_TIPS.

Citation

@misc{ridnik2020tresnet,
    title={TResNet: High Performance GPU-Dedicated Architecture},
    author={Tal Ridnik and Hussam Lawen and Asaf Noy and Itamar Friedman},
    year={2020},
    eprint={2003.13630},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Contact

Feel free to contact me if there are any questions or issues (Tal Ridnik, [email protected]).

Comments
  • How to install Inplace_ABN for multi-GPU to use TResNet?

    Dear Tal, first, thank you for the valuable TResNet models, which optimize GPU utilization; I really love them. I installed your package from source, and it works very well on one GPU (cuda:0), but I cannot use it on other GPUs. I am using the TResNet family with PyTorch 1.5, CUDA 10.2, and nvcc 10.2, and I also export CUDA_HOME=/usr/local/cuda-10.2. My machine has two GTX 1080Ti cards: one is plugged into a 16-lane PCIe slot (cuda:0), the other into a 4-lane PCIe slot (cuda:1). After following your installation tutorial, I find that only cuda:0 works properly with TResNet models, that is, with the INPLACE_ABN functions. The other GPU raises an error when I try to train or run inference with TResNet models. On the other hand, cuda:1 still works very well with other models, such as EfficientNet. In other words, how can I train TResNet on the other GPUs of the same machine? Do you have any tips for installing INPLACE_ABN for multiple GPUs? Best regards, Linh

    opened by linhduongtuan 12
  • Integrate TResNet to Object Detection

    Hi authors, thanks for your great work on the extremely fast TResNet.

    TResNet has been shown to be excellent at image classification. I am curious about its robustness in object detection. Currently, several frameworks are pushing research on object detection, such as mmdetection and detectron.

    So, do you have any plans to integrate your work with these frameworks? Also, I see that you are working with rwightman to add TResNet to pytorch-image-models. I appreciate your work and hope to see results in object detection.

    Thanks,

    opened by thuyngch 7
  • Multi GPU training error

    Hi while using multiple GPUs for training I get this:

    File "/workspace/TResNet/src/models/tresnet/layers/anti_aliasing.py", line 40, in __call__    
        return F.conv2d(input_pad, self.filt, stride=2, padding=0, groups=input.shape[1])
    RuntimeError: Assertion `THCTensor_(checkGPU)(state, 3, input, output, weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /tmp/pip-req-build-cms73_uj/aten/src/THCUNN/generic/SpatialDepthwiseConvolution.cu:19
    

    However single GPU training using CUDA_VISIBLE_DEVICES=0 before my training script works fine. I can see the losses going down after iterations.

    Can you help with this?

    opened by yashnv 7
  • Errors while running the code for TresNetV2

    Hi,

    I am facing the errors while trying to run the inference for stanford cars.

    __init__() got an unexpected keyword argument 'remove_model_jit'
      File "TResNet/src/models/tresnet_v2/tresnet_v2.py", line 117, in __init__
        anti_alias_layer(channels=planes, filt_size=3, stride=2))
      File "TResNet/src/models/tresnet_v2/tresnet_v2.py", line 215, in _make_layer
        layers.append(block(self.inplanes, planes, stride, downsample, use_se=use_se,
      File "TResNet/src/models/tresnet_v2/tresnet_v2.py", line 160, in __init__
        layer2 = self._make_layer(Bottleneck, self.planes * 2, layers[1], stride=2, use_se=True,
      File "TResNet/src/models/tresnet_v2/tresnet_v2.py", line 239, in TResnetL_V2
        model = TResNetV2(layers=layers_list, num_classes=num_classes, in_chans=in_chans,
      File "TResNet/src/models/utils/factory.py", line 23, in create_model
        model = TResnetL_V2(model_params)
      File "TResNet/infer.py", line 33, in main
        model = create_model(args).cuda()
      File "TResNet/infer.py", line 64, in <module>
        main()
      File "/home/ajmal/anaconda3/envs/alibaba_miil_dev/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/ajmal/anaconda3/envs/alibaba_miil_dev/lib/python3.8/runpy.py", line 97, in _run_module_code
        _run_code(code, mod_globals, init_globals,
      File "/home/ajmal/anaconda3/envs/alibaba_miil_dev/lib/python3.8/runpy.py", line 265, in run_path
        return _run_module_code(code, init_globals, run_name,
      File "/home/ajmal/anaconda3/envs/alibaba_miil_dev/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/ajmal/anaconda3/envs/alibaba_miil_dev/lib/python3.8/runpy.py", line 194, in _run_module_as_main (Current frame)
        return _run_code(code, main_globals, None,

    opened by ma-siddiqui 5
  • something get wrong when loading model

    Traceback (most recent call last):
      File "infer.py", line 99, in <module>
        main()
      File "infer.py", line 35, in main
        aaa = torch.jit.script(model2)
      File "/home/kpl/.conda/envs/ASL/lib/python3.6/site-packages/torch/jit/_script.py", line 898, in script
        obj, torch.jit._recursive.infer_methods_to_compile
      File "/home/kpl/.conda/envs/ASL/lib/python3.6/site-packages/torch/jit/_recursive.py", line 352, in create_script_module
        return create_script_module_impl(nn_module, concrete_type, stubs_fn)
      File "/home/kpl/.conda/envs/ASL/lib/python3.6/site-packages/torch/jit/_recursive.py", line 410, in create_script_module_impl
        create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
      File "/home/kpl/.conda/envs/ASL/lib/python3.6/site-packages/torch/jit/_recursive.py", line 304, in create_methods_and_properties_from_stubs
        concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
    RuntimeError: Tried to set nonexistent attribute: embeddings. Did you forget to initialize it in __init__()?:
      File "/home/kpl/code/multilabel/TResNet/src/models/tresnet/tresnet.py", line 190
        def forward(self, x):
            x = self.body(x)
            self.embeddings = self.global_pool(x)
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            logits = self.head(self.embeddings)
            return logits

    opened by myh12138 4
  • falling into loss "NAN"

    I am training my own model on my own dataset, using your TResNet as the feature extractor.

    However, my loss becomes NaN after a few hundred iterations. If I lower the learning rate, the loss becomes NaN later. (Until it becomes NaN, the loss appears to converge well.)

    Can I ask which optimizer, initial learning rate, and scheduler you use? Or do you have any idea what causes the NaN loss?

    opened by YuNie24 4
  • Inference speed results

    Hi, thanks for an interesting paper. I'm wondering how you measured inference speed in your paper. I measured the inference speed of the baseline Resnet50 from torchvision and of your TResnetM, using apex.amp with opt_level=O1 on a V100 with inputs of shape [1024, 3, 224, 224], and got the following numbers:

    Resnet50 TorchVision 25.56M params
    Mean of 10 runs 10 iters each BS=1024:
    	 441.82+-0.31 msecs Forward. 0.00+-0.00 msecs Backward. Max memory: 8185.20Mb. 2317.68 imgs/sec
    TResNetM 31.39M params
    Mean of 10 runs 10 iters each BS=1024:
    	 466.05+-0.19 msecs Forward. 0.00+-0.00 msecs Backward. Max memory: 4255.46Mb. 2197.17 imgs/sec
    
    opened by bonlime 4
  • SelectAdaptivePool2d Layer in TResNet_v2

    Hi @mrT23, I don't see the module SelectAdaptivePool2d in tresnet_v2.py (module not found error). Can I use FastGlobalAvgPool2d from tresnet.py instead of SelectAdaptivePool2d? What's the difference?

    And, for create_dataloader, what is the format for 'val_dir'? I have Stanford cars test data in /data/test/class/*.jpg

    opened by nikhilgunti 3
  • 1x1 or 3x3 stem conv?

    Hi, just like you, I wanted to try s2d stem after reading the isometric nets paper :)

    I noticed that in your paper, Figure 1, you show using 4x4 s2d followed by a 1x1 conv64. However, in your code here you clearly follow the 4x4 s2d by a 3x3 conv64. So, which one is used for the results in the paper?

    opened by lucasb-eyer 3
  • where is the implements of focal loss for multi-label classification

    Thank you very much for your insights!

    I have a question: where is the implementation of focal loss for multi-label classification?

    Best regards! dongliang

    opened by Vipermdl 3
  • Does the spaceTodepth work in the Net?

    Hi, I am really sorry for my rudeness, and thank you for pointing it out; I will pay attention to my wording in the future and work on my English. I have one question about SpaceToDepth after reading the paper. I noticed that you use the S2D block in the stem part instead of using the Isometric Net? In Table 4 of the paper (ablation study), line 2, adding the space-to-depth stem increases top-1 accuracy by just 0.1%. Is that a meaningful gain for the model, or just reasonable fluctuation?

    Looking forward to your reply; sorry again for my rudeness.

    opened by zj19921221 3