PyTorch image models, scripts, pretrained weights -- ResNet, ResNeXT, EfficientNet, EfficientNetV2, NFNet, Vision Transformer, MixNet, MobileNet-V3/V2, RegNet, DPN, CSPNet, and more

Ross Wightman

Last update: Jan 9, 2023

Overview

PyTorch Image Models

Sponsors
What's New
Introduction
Models
Features
Results
Getting Started (Documentation)
Train, Validation, Inference Scripts
Awesome PyTorch Resources
Licenses
Citing

What's New

Oct 19, 2021

ResNet strikes back (https://arxiv.org/abs/2110.00476) weights added, plus any extra training components used. Model weights and some more details here (https://github.com/rwightman/pytorch-image-models/releases/tag/v0.1-rsb-weights)
BCE loss and Repeated Augmentation support for RSB paper
4 series of ResNet based attention model experiments being added (implemented across byobnet.py/byoanet.py). These include all sorts of attention, from channel attn like SE, ECA to 2D QKV self-attention layers such as Halo, Bottleneck, Lambda. Details here (https://github.com/rwightman/pytorch-image-models/releases/tag/v0.1-attn-weights)
Working implementations of the following 2D self-attention modules (likely to be differences from paper or eventual official impl):
- Halo (https://arxiv.org/abs/2103.12731)
- Bottleneck Transformer (https://arxiv.org/abs/2101.11605)
- LambdaNetworks (https://arxiv.org/abs/2102.08602)
A RegNetZ series of models with some attention experiments (being added to). These do not follow the paper (https://arxiv.org/abs/2103.06877) in any way other than block architecture, details of official models are not available. See more here (https://github.com/rwightman/pytorch-image-models/releases/tag/v0.1-attn-weights)
ConvMixer (https://openreview.net/forum?id=TVHS5Y4dNvM), CrossVit (https://arxiv.org/abs/2103.14899), and BeiT (https://arxiv.org/abs/2106.08254) architectures + weights added
freeze/unfreeze helpers by Alexander Soare

Aug 18, 2021

Optimizer bonanza!
- Add LAMB and LARS optimizers, incl trust ratio clipping options. Tweaked to work properly in PyTorch XLA (tested on TPUs w/ timm bits branch)
- Add MADGRAD from FB research w/ a few tweaks (decoupled decay option, step handling that works with PyTorch XLA)
- Some cleanup on all optimizers and factory. No more .data, a bit more consistency, unit tests for all!
- SGDP and AdamP still won't work with PyTorch XLA but others should (have yet to test Adabelief, Adafactor, Adahessian myself).
EfficientNet-V2 XL TF ported weights added, but they don't validate well in PyTorch (L is better). The pre-processing for the V2 TF training is a bit diff and the fine-tuned 21k -> 1k weights are very sensitive and less robust than the 1k weights.
Added PyTorch trained EfficientNet-V2 'Tiny' w/ GlobalContext attn weights. Only .1-.2 top-1 better than the SE so more of a curiosity for those interested.

July 12, 2021

Add XCiT models from official facebook impl. Contributed by Alexander Soare

July 5-9, 2021

Add efficientnetv2_rw_t weights, a custom 'tiny' 13.6M param variant that is a bit better than (non NoisyStudent) B3 models. Both faster and better accuracy (at same or lower res)
- top-1 82.34 @ 288x288 and 82.54 @ 320x320
Add SAM pretrained in1k weight for ViT B/16 (vit_base_patch16_sam_224) and B/32 (vit_base_patch32_sam_224) models.
Add 'Aggregating Nested Transformer' (NesT) w/ weights converted from official Flax impl. Contributed by Alexander Soare.
- jx_nest_base - 83.534, jx_nest_small - 83.120, jx_nest_tiny - 81.426

June 23, 2021

Reproduce gMLP model training, gmlp_s16_224 trained to 79.6 top-1, matching paper. Hparams for this and other recent MLP training here

June 20, 2021

Release Vision Transformer 'AugReg' weights from How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
- .npz weight loading support added, can load any of the 50K+ weights from the AugReg series
- See example notebook from official impl for navigating the augreg weights
- Replaced all default weights w/ best AugReg variant (if possible). All AugReg 21k classifiers work.
  - Highlights: vit_large_patch16_384 (87.1 top-1), vit_large_r50_s32_384 (86.2 top-1), vit_base_patch16_384 (86.0 top-1)
- vit_deit_* renamed to just deit_*
- Remove my old small model, replace with DeiT compatible small w/ AugReg weights
Add 1st training of my gmixer_24_224 MLP w/ GLU, 78.1 top-1 w/ 25M params.
Add weights from official ResMLP release (https://github.com/facebookresearch/deit)
Add eca_nfnet_l2 weights from my 'lightweight' series. 84.7 top-1 at 384x384.
Add distilled BiT 50x1 student and 152x2 Teacher weights from Knowledge distillation: A good teacher is patient and consistent
NFNets and ResNetV2-BiT models work w/ PyTorch XLA now
- weight standardization uses F.batch_norm instead of std_mean (std_mean wasn't lowered)
- eps values adjusted, will be slight differences but should be quite close
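The F.batch_norm substitution noted above can be illustrated roughly as follows. This is a minimal sketch, not the code used in this repository: the helper name standardize_weight and the eps default are assumptions made for illustration. It relies on the fact that batch norm in training mode normalizes each channel by its mean and biased variance, which is the same per-filter statistic that weight standardization needs.

```python
import torch
import torch.nn.functional as F


def standardize_weight(weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize each output filter of a conv weight (hypothetical helper)."""
    out_channels = weight.shape[0]
    # View the weight as one "sample" with out_channels channels and fan_in
    # positions, so batch norm computes mean/variance per output filter.
    flat = weight.reshape(1, out_channels, -1)
    normed = F.batch_norm(flat, None, None, training=True, momentum=0.0, eps=eps)
    return normed.reshape_as(weight)


torch.manual_seed(0)
w = torch.randn(4, 3, 3, 3)
w_std = standardize_weight(w)

# Reference computation with explicit mean / biased variance per filter.
mean = w.mean(dim=(1, 2, 3), keepdim=True)
var = w.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
w_ref = (w - mean) / torch.sqrt(var + 1e-6)
print(torch.allclose(w_std, w_ref, atol=1e-5))
```

Small numerical differences between the two paths are expected, which is consistent with the note about adjusted eps values.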
Improve test coverage and classifier interface of non-conv (vision transformer and mlp) models
Cleanup a few classifier / flatten details for models w/ conv classifiers or early global pool
Please report any regressions, this PR touched quite a few models.

June 8, 2021

Add first ResMLP weights, trained in PyTorch XLA on TPU-VM w/ my XLA branch. 24 block variant, 79.2 top-1.
Add ResNet51-Q model w/ pretrained weights at 82.36 top-1.
- NFNet inspired block layout with quad layer stem and no maxpool
- Same param count (35.7M) and throughput as ResNetRS-50 but +1.5 top-1 @ 224x224 and +2.5 top-1 at 288x288

May 25, 2021

Add LeViT, Visformer, ConViT (PR by Aman Arora), Twins (PR by paper authors) transformer models
Add ResMLP and gMLP MLP vision models to the existing MLP Mixer impl
Fix a number of torchscript issues with various vision transformer models
Cleanup input_size/img_size override handling and improve testing / test coverage for all vision transformer and MLP models
More flexible pos embedding resize (non-square) for ViT and TnT. Thanks Alexander Soare
Add efficientnetv2_rw_m model and weights (started training before official code). 84.8 top-1, 53M params.

May 14, 2021

Add EfficientNet-V2 official model defs w/ ported weights from official Tensorflow/Keras impl.
- 1k trained variants: tf_efficientnetv2_s/m/l
- 21k trained variants: tf_efficientnetv2_s/m/l_in21k
- 21k pretrained -> 1k fine-tuned: tf_efficientnetv2_s/m/l_in21ft1k
- v2 models w/ v1 scaling: tf_efficientnetv2_b0 through b3
- Rename my prev V2 guess efficientnet_v2s -> efficientnetv2_rw_s
- Some blank efficientnetv2_* models in-place for future native PyTorch training

May 5, 2021

Add MLP-Mixer models and port pretrained weights from Google JAX impl
Add CaiT models and pretrained weights from FB
Add ResNet-RS models and weights from TF. Thanks Aman Arora
Add CoaT models and weights. Thanks Mohammed Rizin
Add new ImageNet-21k weights & finetuned weights for TResNet, MobileNet-V3, ViT models. Thanks mrT
Add GhostNet models and weights. Thanks Kai Han
Update ByoaNet attention modules
- Improve SA module inits
- Hack together experimental stand-alone Swin based attn module and swinnet
- Consistent '26t' model defs for experiments.
Add improved Efficientnet-V2S (prelim model def) weights. 83.8 top-1.
WandB logging support

April 13, 2021

Add Swin Transformer models and weights from https://github.com/microsoft/Swin-Transformer

April 12, 2021

Add ECA-NFNet-L1 (slimmed down F1 w/ SiLU, 41M params) trained with this code. 84% top-1 @ 320x320. Trained at 256x256.
Add EfficientNet-V2S model (unverified model definition) weights. 83.3 top-1 @ 288x288. Only trained single res 224. Working on progressive training.
Add ByoaNet model definition (Bring-your-own-attention) w/ SelfAttention block and corresponding SA/SA-like modules and model defs
- Lambda Networks - https://arxiv.org/abs/2102.08602
- Bottleneck Transformers - https://arxiv.org/abs/2101.11605
- Halo Nets - https://arxiv.org/abs/2103.12731
Adabelief optimizer contributed by Juntang Zhuang

April 1, 2021

Add snazzy benchmark.py script for bulk timm model benchmarking of train and/or inference
Add Pooling-based Vision Transformer (PiT) models (from https://github.com/naver-ai/pit)
- Merged distilled variant into main for torchscript compatibility
- Some timm cleanup/style tweaks and weights have hub download support
Cleanup Vision Transformer (ViT) models
- Merge distilled (DeiT) model into main so that torchscript can work
- Support updated weight init (defaults to old still) that closer matches original JAX impl (possibly better training from scratch)
- Separate hybrid model defs into different file and add several new model defs to fiddle with, support patch_size != 1 for hybrids
- Fix fine-tuning num_class changes (PiT and ViT) and pos_embed resizing (ViT) with distilled variants
- nn.Sequential for block stack (does not break downstream compat)
TnT (Transformer-in-Transformer) models contributed by author (from https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/cv/TNT)
Add RegNetY-160 weights from DeiT teacher model
Add new NFNet-L0 w/ SE attn (rename nfnet_l0b->nfnet_l0) weights 82.75 top-1 @ 288x288
Some fixes/improvements for TFDS dataset wrapper

March 17, 2021

Add new ECA-NFNet-L0 (rename nfnet_l0c->eca_nfnet_l0) weights trained by myself.
- 82.6 top-1 @ 288x288, 82.8 @ 320x320, trained at 224x224
- Uses SiLU activation, approx 2x faster than dm_nfnet_f0 and 50% faster than nfnet_f0s w/ 1/3 param count
Integrate Hugging Face model hub into timm create_model and default_cfg handling for pretrained weight and config sharing (more on this soon!)
Merge HardCoRe NAS models contributed by https://github.com/yoniaflalo
Merge PyTorch trained EfficientNet-EL and pruned ES/EL variants contributed by DeGirum

March 7, 2021

First 0.4.x PyPi release w/ NFNets (& related), ByoB (GPU-Efficient, RepVGG, etc).
Change feature extraction for pre-activation nets (NFNets, ResNetV2) to return features before activation.
Tested with PyTorch 1.8 release. Updated CI to use 1.8.
Benchmarked several arch on RTX 3090, Titan RTX, and V100 across 1.7.1, 1.8, NGC 20.12, and 21.02. Some interesting performance variations to take note of https://gist.github.com/rwightman/bb59f9e245162cee0e38bd66bd8cd77f

Feb 18, 2021

Add pretrained weights and model variants for NFNet-F* models from DeepMind Haiku impl.
- Models are prefixed with dm_. They require SAME padding conv, skipinit enabled, and activation gains applied in act fn.
- These models are big, expect to run out of GPU memory. With the GELU activation + other options, they are roughly 1/2 the inference speed of my SiLU PyTorch optimized s variants.
- Original model results are based on pre-processing that is not the same as all other models so you'll see different results in the results csv (once updated).
- Matching the original pre-processing as closely as possible I get these results:
  - dm_nfnet_f6 - 86.352
  - dm_nfnet_f5 - 86.100
  - dm_nfnet_f4 - 85.834
  - dm_nfnet_f3 - 85.676
  - dm_nfnet_f2 - 85.178
  - dm_nfnet_f1 - 84.696
  - dm_nfnet_f0 - 83.464

Feb 16, 2021

Add Adaptive Gradient Clipping (AGC) as per https://arxiv.org/abs/2102.06171. Integrated w/ PyTorch gradient clipping via mode arg that defaults to prev 'norm' mode. For backward arg compat, clip-grad arg must be specified to enable when using train.py.
- AGC w/ default clipping factor --clip-grad .01 --clip-mode agc
- PyTorch global norm of 1.0 (old behaviour, always norm), --clip-grad 1.0
- PyTorch value clipping of 10, --clip-grad 10. --clip-mode value
- AGC performance is definitely sensitive to the clipping factor. More experimentation needed to determine good values for smaller batch sizes and optimizers besides those in paper. So far I've found .001-.005 is necessary for stable RMSProp training w/ NFNet/NF-ResNet.
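For readers unfamiliar with AGC, the idea from the paper can be sketched as below. This is an illustrative approximation and not the implementation shipped with this code: the function names unitwise_norm and adaptive_clip_grad_ and the eps default are assumptions. Each unit (row of a weight, or a whole bias vector) has its gradient rescaled when the ratio of gradient norm to weight norm exceeds the clipping factor, which plays the role of the value given to --clip-grad in agc mode.

```python
import torch


def unitwise_norm(x: torch.Tensor) -> torch.Tensor:
    # Scalars and vectors (biases, gains) are treated as a single unit.
    if x.ndim <= 1:
        return x.norm(p=2)
    # For weights, take the norm over every dim except the first (output) dim.
    return x.norm(p=2, dim=tuple(range(1, x.ndim)), keepdim=True)


def adaptive_clip_grad_(parameters, clip_factor: float = 0.01, eps: float = 1e-3):
    """Clip gradients in place, unit by unit (hypothetical sketch of AGC)."""
    with torch.no_grad():
        for p in parameters:
            if p.grad is None:
                continue
            # eps keeps zero-initialized weights from freezing their gradients.
            w_norm = unitwise_norm(p).clamp(min=eps)
            g_norm = unitwise_norm(p.grad)
            max_norm = w_norm * clip_factor
            # Rescale only the units whose gradient norm exceeds the limit.
            clipped = p.grad * (max_norm / g_norm.clamp(min=1e-6))
            p.grad.copy_(torch.where(g_norm > max_norm, clipped, p.grad))


w = torch.nn.Parameter(torch.ones(2, 4))          # each row has norm 2
w.grad = torch.tensor([[1.0, 0.0, 0.0, 0.0],      # too large, gets clipped
                       [0.001, 0.0, 0.0, 0.0]])   # small enough, untouched
adaptive_clip_grad_([w], clip_factor=0.01)
print(w.grad)
```

Because the threshold scales with the weight norm, the same clipping factor behaves differently across layers, which may be one reason the setting is sensitive to batch size and optimizer.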

Feb 12, 2021

Update Normalization-Free nets to include new NFNet-F (https://arxiv.org/abs/2102.06171) model defs

Feb 10, 2021

First Normalization-Free model training experiments done:
- nf_resnet50 - 80.68 top-1 @ 288x288, 80.31 @ 256x256
- nf_regnet_b1 - 79.30 @ 288x288, 78.75 @ 256x256
More model archs, incl a flexible ByobNet backbone ('Bring-your-own-blocks')
- GPU-Efficient-Networks (https://github.com/idstcv/GPU-Efficient-Networks), impl in byobnet.py
- RepVGG (https://github.com/DingXiaoH/RepVGG), impl in byobnet.py
- classic VGG (from torchvision, impl in vgg.py)
Refinements to normalizer layer arg handling and normalizer+act layer handling in some models
Default AMP mode changed to native PyTorch AMP instead of APEX. Issues not being fixed with APEX. Native works with --channels-last and --torchscript model training, APEX does not.
Fix a few bugs introduced since last pypi release

Feb 8, 2021

Add several ResNet weights with ECA attention. 26t & 50t trained @ 256, test @ 320. 269d train @ 256, fine-tune @320, test @ 352.
- ecaresnet26t - 79.88 top-1 @ 320x320, 79.08 @ 256x256
- ecaresnet50t - 82.35 top-1 @ 320x320, 81.52 @ 256x256
- ecaresnet269d - 84.93 top-1 @ 352x352, 84.87 @ 320x320
Remove separate tiered (t) vs tiered_narrow (tn) ResNet model defs, all tn changed to t and t models removed (seresnext26t_32x4d only model w/ weights that was removed).
Support model default_cfgs with separate train vs test resolution test_input_size and remove extra _320 suffix ResNet model defs that were just for test.

Jan 30, 2021

Add initial "Normalization Free" NF-RegNet-B* and NF-ResNet model definitions based on paper

Jan 25, 2021

Add ResNetV2 Big Transfer (BiT) models w/ ImageNet-1k and 21k weights from https://github.com/google-research/big_transfer
Add official R50+ViT-B/16 hybrid models + weights from https://github.com/google-research/vision_transformer
ImageNet-21k ViT weights are added w/ model defs and representation layer (pre logits) support
- NOTE: ImageNet-21k classifier heads were zero'd in original weights, they are only useful for transfer learning
Add model defs and weights for DeiT Vision Transformer models from https://github.com/facebookresearch/deit
Refactor dataset classes into ImageDataset/IterableImageDataset + dataset specific parser classes
Add Tensorflow-Datasets (TFDS) wrapper to allow use of TFDS image classification sets with train script
- Ex: train.py /data/tfds --dataset tfds/oxford_iiit_pet --val-split test --model resnet50 -b 256 --amp --num-classes 37 --opt adamw --lr 3e-4 --weight-decay .001 --pretrained -j 2
Add improved .tar dataset parser that reads images from .tar, folder of .tar files, or .tar within .tar
- Run validation on full ImageNet-21k directly from tar w/ BiT model: validate.py /data/fall11_whole.tar --model resnetv2_50x1_bitm_in21k --amp
Models in this update should be stable w/ possible exception of ViT/BiT, possibility of some regressions with train/val scripts and dataset handling

Jan 3, 2021

Add SE-ResNet-152D weights
- 256x256 val, 0.94 crop top-1 - 83.75
- 320x320 val, 1.0 crop - 84.36
Update results files

Introduction

PyTorch Image Models (timm) is a collection of image models, layers, utilities, optimizers, schedulers, data-loaders / augmentations, and reference training / validation scripts that aim to pull together a wide variety of SOTA models with ability to reproduce ImageNet training results.

The work of many others is present here. I've tried to make sure all source material is acknowledged via links to github, arxiv papers, etc in the README, documentation, and code docstrings. Please let me know if I missed anything.

Models

All model architecture families include variants with pretrained weights. Some specific model variants have no weights; this is NOT a bug. Help training new or better weights is always appreciated. Here are some example training hparams to get you started.

A full version of the list below with source links can be found in the documentation.

Aggregating Nested Transformers - https://arxiv.org/abs/2105.12723
Big Transfer ResNetV2 (BiT) - https://arxiv.org/abs/1912.11370
Bottleneck Transformers - https://arxiv.org/abs/2101.11605
CaiT (Class-Attention in Image Transformers) - https://arxiv.org/abs/2103.17239
CoaT (Co-Scale Conv-Attentional Image Transformers) - https://arxiv.org/abs/2104.06399
ConViT (Soft Convolutional Inductive Biases Vision Transformers)- https://arxiv.org/abs/2103.10697
CspNet (Cross-Stage Partial Networks) - https://arxiv.org/abs/1911.11929
DeiT (Vision Transformer) - https://arxiv.org/abs/2012.12877
DenseNet - https://arxiv.org/abs/1608.06993
DLA - https://arxiv.org/abs/1707.06484
DPN (Dual-Path Network) - https://arxiv.org/abs/1707.01629
EfficientNet (MBConvNet Family)
- EfficientNet NoisyStudent (B0-B7, L2) - https://arxiv.org/abs/1911.04252
- EfficientNet AdvProp (B0-B8) - https://arxiv.org/abs/1911.09665
- EfficientNet (B0-B7) - https://arxiv.org/abs/1905.11946
- EfficientNet-EdgeTPU (S, M, L) - https://ai.googleblog.com/2019/08/efficientnet-edgetpu-creating.html
- EfficientNet V2 - https://arxiv.org/abs/2104.00298
- FBNet-C - https://arxiv.org/abs/1812.03443
- MixNet - https://arxiv.org/abs/1907.09595
- MNASNet B1, A1 (Squeeze-Excite), and Small - https://arxiv.org/abs/1807.11626
- MobileNet-V2 - https://arxiv.org/abs/1801.04381
- Single-Path NAS - https://arxiv.org/abs/1904.02877
GhostNet - https://arxiv.org/abs/1911.11907
gMLP - https://arxiv.org/abs/2105.08050
GPU-Efficient Networks - https://arxiv.org/abs/2006.14090
Halo Nets - https://arxiv.org/abs/2103.12731
HardCoRe-NAS - https://arxiv.org/abs/2102.11646
HRNet - https://arxiv.org/abs/1908.07919
Inception-V3 - https://arxiv.org/abs/1512.00567
Inception-ResNet-V2 and Inception-V4 - https://arxiv.org/abs/1602.07261
Lambda Networks - https://arxiv.org/abs/2102.08602
LeViT (Vision Transformer in ConvNet's Clothing) - https://arxiv.org/abs/2104.01136
MLP-Mixer - https://arxiv.org/abs/2105.01601
MobileNet-V3 (MBConvNet w/ Efficient Head) - https://arxiv.org/abs/1905.02244
NASNet-A - https://arxiv.org/abs/1707.07012
NFNet-F - https://arxiv.org/abs/2102.06171
NF-RegNet / NF-ResNet - https://arxiv.org/abs/2101.08692
PNasNet - https://arxiv.org/abs/1712.00559
Pooling-based Vision Transformer (PiT) - https://arxiv.org/abs/2103.16302
RegNet - https://arxiv.org/abs/2003.13678
RepVGG - https://arxiv.org/abs/2101.03697
ResMLP - https://arxiv.org/abs/2105.03404
ResNet/ResNeXt
- ResNet (v1b/v1.5) - https://arxiv.org/abs/1512.03385
- ResNeXt - https://arxiv.org/abs/1611.05431
- 'Bag of Tricks' / Gluon C, D, E, S variations - https://arxiv.org/abs/1812.01187
- Weakly-supervised (WSL) Instagram pretrained / ImageNet tuned ResNeXt101 - https://arxiv.org/abs/1805.00932
- Semi-supervised (SSL) / Semi-weakly Supervised (SWSL) ResNet/ResNeXts - https://arxiv.org/abs/1905.00546
- ECA-Net (ECAResNet) - https://arxiv.org/abs/1910.03151v4
- Squeeze-and-Excitation Networks (SEResNet) - https://arxiv.org/abs/1709.01507
- ResNet-RS - https://arxiv.org/abs/2103.07579
Res2Net - https://arxiv.org/abs/1904.01169
ResNeSt - https://arxiv.org/abs/2004.08955
ReXNet - https://arxiv.org/abs/2007.00992
SelecSLS - https://arxiv.org/abs/1907.00837
Selective Kernel Networks - https://arxiv.org/abs/1903.06586
Swin Transformer - https://arxiv.org/abs/2103.14030
Transformer-iN-Transformer (TNT) - https://arxiv.org/abs/2103.00112
TResNet - https://arxiv.org/abs/2003.13630
Twins (Spatial Attention in Vision Transformers) - https://arxiv.org/pdf/2104.13840.pdf
Vision Transformer - https://arxiv.org/abs/2010.11929
VovNet V2 and V1 - https://arxiv.org/abs/1911.06667
Xception - https://arxiv.org/abs/1610.02357
Xception (Modified Aligned, Gluon) - https://arxiv.org/abs/1802.02611
Xception (Modified Aligned, TF) - https://arxiv.org/abs/1802.02611
XCiT (Cross-Covariance Image Transformers) - https://arxiv.org/abs/2106.09681

Features

Several (less common) features that I often utilize in my projects are included. Many of their additions are the reason why I maintain my own set of models, instead of using others' via PIP:

All models have a common default configuration interface and API for
- accessing/changing the classifier - get_classifier and reset_classifier
- doing a forward pass on just the features - forward_features (see documentation)
- these make it easy to write consistent network wrappers that work with any of the models
All models support multi-scale feature map extraction (feature pyramids) via create_model (see documentation)
- create_model(name, features_only=True, out_indices=..., output_stride=...)
- out_indices creation arg specifies which feature maps to return, these indices are 0 based and generally correspond to the C(i + 1) feature level.
- output_stride creation arg controls output stride of the network by using dilated convolutions. Most networks are stride 32 by default. Not all networks support this.
- feature map channel counts, reduction level (stride) can be queried AFTER model creation via the .feature_info member
All models have a consistent pretrained weight loader that adapts last linear if necessary, and from 3 to 1 channel input if desired
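One plausible way to adapt a pretrained first convolution from 3 input channels to 1 is to sum the kernel over the RGB axis. The sketch below illustrates that idea only; the helper name adapt_first_conv_to_gray is hypothetical, it assumes groups=1, and it should not be read as the exact procedure the loader here follows.

```python
import torch
import torch.nn as nn


def adapt_first_conv_to_gray(conv_rgb: nn.Conv2d) -> nn.Conv2d:
    """Build a 1-channel conv whose kernel is the RGB kernel summed over channels."""
    conv_gray = nn.Conv2d(
        1,
        conv_rgb.out_channels,
        kernel_size=conv_rgb.kernel_size,
        stride=conv_rgb.stride,
        padding=conv_rgb.padding,
        dilation=conv_rgb.dilation,
        bias=conv_rgb.bias is not None,
    )
    with torch.no_grad():
        # Summing over the input-channel axis preserves the response to an
        # image whose three channels are identical (i.e. a grayscale image).
        conv_gray.weight.copy_(conv_rgb.weight.sum(dim=1, keepdim=True))
        if conv_rgb.bias is not None:
            conv_gray.bias.copy_(conv_rgb.bias)
    return conv_gray


torch.manual_seed(0)
rgb_conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
gray_conv = adapt_first_conv_to_gray(rgb_conv)

gray_image = torch.randn(2, 1, 16, 16)
same = torch.allclose(
    gray_conv(gray_image), rgb_conv(gray_image.repeat(1, 3, 1, 1)), atol=1e-5
)
print(same)
```

With this construction the adapted layer gives the same output on a grayscale image as the original layer gives on that image replicated across three channels.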
High performance reference training, validation, and inference scripts that work in several process/GPU modes:
- NVIDIA DDP w/ a single GPU per process, multiple processes with APEX present (AMP mixed-precision optional)
- PyTorch DistributedDataParallel w/ multi-gpu, single process (AMP disabled as it crashes when enabled)
- PyTorch w/ single GPU single process (AMP optional)
A dynamic global pool implementation that allows selecting from average pooling, max pooling, average + max, or concat([average, max]) at model creation. All global pooling is adaptive average by default and compatible with pretrained weights.
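The four pooling choices could be sketched like this. It is a standalone illustration rather than the layer in this repository: the function name and the pool_type strings are assumptions, and "average + max" is taken here to mean the mean of the two pooled results.

```python
import torch
import torch.nn.functional as F


def select_global_pool(x: torch.Tensor, pool_type: str = "avg") -> torch.Tensor:
    """Pool an NCHW feature map to N x C (or N x 2C for the concat variant)."""
    avg = F.adaptive_avg_pool2d(x, 1)
    mx = F.adaptive_max_pool2d(x, 1)
    if pool_type == "avg":
        out = avg
    elif pool_type == "max":
        out = mx
    elif pool_type == "avgmax":
        out = 0.5 * (avg + mx)               # assumption: mean of avg and max
    elif pool_type == "catavgmax":
        out = torch.cat([avg, mx], dim=1)    # doubles the channel count
    else:
        raise ValueError(f"unknown pool type: {pool_type}")
    return out.flatten(1)


x = torch.randn(2, 8, 5, 5)
for name in ("avg", "max", "avgmax", "catavgmax"):
    print(name, tuple(select_global_pool(x, name).shape))
```

Only the concat variant changes the feature width, so a classifier that follows it needs twice as many input features; the other three keep the channel count and differ only in how each channel is summarized.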
A 'Test Time Pool' wrapper that can wrap any of the included models and usually provides improved performance doing inference with input images larger than the training size. Idea adapted from original DPN implementation when I ported (https://github.com/cypw/DPNs)
Learning rate schedulers
- Ideas adopted from
  - AllenNLP schedulers
  - FAIRseq lr_scheduler
  - SGDR: Stochastic Gradient Descent with Warm Restarts (https://arxiv.org/abs/1608.03983)
- Schedulers include step, cosine w/ restarts, tanh w/ restarts, plateau
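As a rough illustration of the cosine w/ restarts schedule, the SGDR formula with a fixed cycle length can be written in a few lines. The function name and default values are assumptions for this sketch; the schedulers in this repository have more options (warmup, cycle scaling, and so on) that are not shown.

```python
import math


def cosine_restart_lr(epoch: int, lr_max: float = 0.1, lr_min: float = 0.0,
                      cycle_len: int = 10) -> float:
    """Cosine annealing from lr_max to lr_min, restarting every cycle_len epochs."""
    t_cur = epoch % cycle_len  # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / cycle_len))


for epoch in (0, 5, 9, 10, 15):
    print(epoch, round(cosine_restart_lr(epoch), 5))
```

The learning rate falls along a half cosine within each cycle and jumps back to lr_max at every restart.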
Optimizers:
- rmsprop_tf adapted from PyTorch RMSProp by myself. Reproduces much improved Tensorflow RMSProp behaviour.
- radam by Liyuan Liu (https://arxiv.org/abs/1908.03265)
- novograd by Masashi Kimura (https://arxiv.org/abs/1905.11286)
- lookahead adapted from impl by Liam (https://arxiv.org/abs/1907.08610)
- fused<name> optimizers by name with NVIDIA Apex installed
- adamp and sgdp by Naver ClovAI (https://arxiv.org/abs/2006.08217)
- adafactor adapted from FAIRSeq impl (https://arxiv.org/abs/1804.04235)
- adahessian by David Samuel (https://arxiv.org/abs/2006.00719)
Random Erasing from Zhun Zhong (https://arxiv.org/abs/1708.04896)
Mixup (https://arxiv.org/abs/1710.09412)
CutMix (https://arxiv.org/abs/1905.04899)
AutoAugment (https://arxiv.org/abs/1805.09501) and RandAugment (https://arxiv.org/abs/1909.13719) ImageNet configurations modeled after impl for EfficientNet training (https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py)
AugMix w/ JSD loss (https://arxiv.org/abs/1912.02781), JSD w/ clean + augmented mixing support works with AutoAugment and RandAugment as well
SplitBatchNorm - allows splitting batch norm layers between clean and augmented (auxiliary batch norm) data
DropPath aka "Stochastic Depth" (https://arxiv.org/abs/1603.09382)
DropBlock (https://arxiv.org/abs/1810.12890)
Blur Pooling (https://arxiv.org/abs/1904.11486)
Space-to-Depth by mrT23 (https://arxiv.org/abs/1801.04590) -- original paper?
Adaptive Gradient Clipping (https://arxiv.org/abs/2102.06171, https://github.com/deepmind/deepmind-research/tree/master/nfnets)
An extensive selection of channel and/or spatial attention modules:
- Bottleneck Transformer - https://arxiv.org/abs/2101.11605
- CBAM - https://arxiv.org/abs/1807.06521
- Effective Squeeze-Excitation (ESE) - https://arxiv.org/abs/1911.06667
- Efficient Channel Attention (ECA) - https://arxiv.org/abs/1910.03151
- Gather-Excite (GE) - https://arxiv.org/abs/1810.12348
- Global Context (GC) - https://arxiv.org/abs/1904.11492
- Halo - https://arxiv.org/abs/2103.12731
- Involution - https://arxiv.org/abs/2103.06255
- Lambda Layer - https://arxiv.org/abs/2102.08602
- Non-Local (NL) - https://arxiv.org/abs/1711.07971
- Squeeze-and-Excitation (SE) - https://arxiv.org/abs/1709.01507
- Selective Kernel (SK) - https://arxiv.org/abs/1903.06586
- Split (SPLAT) - https://arxiv.org/abs/2004.08955
- Shifted Window (SWIN) - https://arxiv.org/abs/2103.14030

Results

Model validation results can be found in the documentation and in the results tables

Getting Started (Documentation)

My current documentation for timm covers the basics.

timmdocs is quickly becoming a much more comprehensive set of documentation for timm. A big thanks to Aman Arora for his efforts creating timmdocs.

paperswithcode is a good resource for browsing the models within timm.

Train, Validation, Inference Scripts

The root folder of the repository contains reference train, validation, and inference scripts that work with the included models and other features of this repository. They are adaptable for other datasets and use cases with a little hacking. See documentation for some basics and training hparams for some train examples that produce SOTA ImageNet results.

Awesome PyTorch Resources

One of the greatest assets of PyTorch is the community and their contributions. A few of my favourite resources that pair well with the models and components here are listed below.

Licenses

Code

The code here is licensed Apache 2.0. I've taken care to make sure any third party code included or adapted has compatible (permissive) licenses such as MIT, BSD, etc. I've made an effort to avoid any GPL / LGPL conflicts. That said, it is your responsibility to ensure you comply with licenses here and conditions of any dependent licenses. Where applicable, I've linked the sources/references for various components in docstrings. If you think I've missed anything please create an issue.

Pretrained Weights

So far all of the pretrained weights available here are pretrained on ImageNet with a select few that have some additional pretraining (see extra note below). ImageNet was released for non-commercial research purposes only (https://image-net.org/download). It's not clear what the implications of that are for the use of pretrained weights from that dataset. Any models I have trained with ImageNet are done for research purposes and one should assume that the original dataset license applies to the weights. It's best to seek legal advice if you intend to use the pretrained weights in a commercial product.

Pretrained on more than ImageNet

Several weights included or referenced here were pretrained with proprietary datasets that I do not have access to. These include the Facebook WSL, SSL, SWSL ResNe(Xt) and the Google Noisy Student EfficientNet models. The Facebook models have an explicit non-commercial license (CC-BY-NC 4.0, https://github.com/facebookresearch/semi-supervised-ImageNet1K-models, https://github.com/facebookresearch/WSL-Images). The Google models do not appear to have any restriction beyond the Apache 2.0 license (and ImageNet concerns). In either case, you should contact Facebook or Google with any questions.

Citing

BibTeX

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}

Comments

Efficientnetb1-b7 hyper parameters

First of all thanks for the fantastic code!

I am wondering if anyone has successfully reproduced (or come close to) the results for Efficientnetb1-b7? I am able to reproduce b0 with jiefengpeng's setting: ./distributed_train.sh 8 ../ImageNet/ --model efficientnet_b0 -b 256 --sched step --epochs 500 --decay-epochs 3 --decay-rate 0.963 --opt rmsproptf --opt-eps .001 -j 8 --warmup-epochs 5 --weight-decay 1e-5 --drop 0.2 --color-jitter .06 --model-ema --lr .128

The same setting (with adjusted drop rate) for b1 reached only 78.11% (with EMA enabled), compared to the 78.8% reported in the paper.
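To make the step schedule in that command concrete, here is a small sketch of how the learning rate would evolve, assuming the step scheduler multiplies the rate by --decay-rate once every --decay-epochs epochs. Warmup is ignored and the function name is hypothetical, so treat the numbers as approximate.

```python
def step_decay_lr(epoch: int, base_lr: float = 0.128, decay_epochs: int = 3,
                  decay_rate: float = 0.963) -> float:
    """Learning rate after integer-step decay (warmup not modeled)."""
    return base_lr * decay_rate ** (epoch // decay_epochs)


for epoch in (0, 3, 30, 300, 499):
    print(epoch, f"{step_decay_lr(epoch):.6f}")
```

Under these assumptions the rate decays by a factor of 0.963 a total of 166 times over 500 epochs.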

opened by pichuang1984 40
default training hyper-parameters

Hi, impressive work! The train scripts contain a large combination of hyper-parameter options. However, there are different types of models, and there are many models even within the efficientnet part. I wonder whether you trained the models with the default options. If not, do you plan to release model-specific hyper-parameters? Thanks!

opened by cxxgtxy 26
FX feature extraction
Added timm.models.fx_features.FeatureGraphNet as another option for feature extraction. (works as a standalone commit)

Made all models traceable (2nd commit)

Tests to enforce all models traceable (3rd commit)

~Caveat - Right now we can only safely say it works in eval mode. Control flow that depends on the value of model.training is frozen into place by the tracing operation. So if the model was traced in eval mode, it stays that way (actually only those parts that were traced through, leaf modules and leaf functions respect the training mode). Therefore, we cannot expect model.train() to have the desired effect. This is a TODO, so right now there is a warning when the user tries to do model.train().~

This is sorted but hasn't been tested in anger.

All local tests passed.

EDIT - This feature has been added to torchvision https://github.com/pytorch/vision/commit/72d650ae0bf21f4d98cb8af5e308bddd88131d5e
opened by alexander-soare 19
CUDA out of memory when load model

I trained mobilenetv3_large_100 on eight 2080Ti GPUs with a batch size of 128, which means 128 * 8 = 1024 images per batch. When I resumed the model, there was a "CUDA out of memory" error. However, when I trained it again from scratch, there was no error. I noticed that your code in "helper.py" already loads the checkpoint on the CPU, which should be the solution for this bug, so why did it happen? checkpoint = torch.load(checkpoint_path, map_location='cpu')

Another interesting problem is that acc@1 is very low in the first few epochs (nearly random), and the eval_loss even rises. Why?
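For context on the torch.load call quoted above: by default torch.load restores tensors to the device they were saved from, while map_location='cpu' keeps the loaded copy in host memory until it is explicitly moved. The sketch below shows that pattern with a toy model and an in-memory buffer; the checkpoint keys are assumptions, and it does not by itself explain the error reported here.

```python
import io

import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Save a checkpoint (an in-memory buffer stands in for a file on disk).
buffer = io.BytesIO()
torch.save({"state_dict": model.state_dict(), "epoch": 3}, buffer)
buffer.seek(0)

# Load onto the CPU first, regardless of where the tensors were saved from.
checkpoint = torch.load(buffer, map_location="cpu")

restored = nn.Linear(4, 2)
restored.load_state_dict(checkpoint["state_dict"])

# Only now move the model to the device this process should use.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
restored.to(device)
print(checkpoint["epoch"], next(restored.parameters()).device)
```

Optimizer state saved in a checkpoint is handled the same way and can be a significant share of the memory used on resume.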

opened by Andy1621 19
ViT Training Details

Hi,

Your code comments say you were able to train a small version of the model to 75% top-1 accuracy. Could you give more details about the hyper-params used (like batch size, learning rate, etc.)?

Thanks.

opened by gupta-abhay 17
[FEATURE] Method to convert feature embeddings into predictions
Is your feature request related to a problem? Please describe.

Currently, saving both the feature embedding vector and the prediction vector requires two forward passes:

    predictions = model(inputs)
    embeddings = model.forward_features(inputs)  # inefficient

This is technically not necessary, since the embeddings are computed when you compute predictions.

In addition, once you have your embeddings, there is no standardized method to convert the embeddings into predictions. For example, with ResNet Models I could do something like:

predictions = model.fc(embeddings)

...but that does not generalize to other models since not every model has a fc layer.

Describe the solution you'd like

For every model to have a forward method (I suggest the name forward_predictions) which takes an embedding as input and outputs a prediction.

For example, ResNet would go from:

    def forward(self, x):
        x = self.forward_features(x)
        x = self.global_pool(x)
        if self.drop_rate:
            x = F.dropout(x, p=float(self.drop_rate), training=self.training)
        x = self.fc(x)
        return x

to:

    def forward_predictions(self, x):
        # x is embedding vector
        x = self.global_pool(x)
        if self.drop_rate:
            x = F.dropout(x, p=float(self.drop_rate), training=self.training)
        x = self.fc(x)
        return x

    def forward(self, x):
        x = self.forward_features(x)
        x = self.forward_predictions(x)
        return x

Our inference code, then, would become:

    embeddings = model.forward_features(input)
    predictions = model.forward_predictions(embeddings)  # no redundant compute
    return embeddings, predictions

... and enables us to convert feature embedding vectors from a feature store into a prediction:

    embedding = feature_store.query(<interesting image>)
    nearest_neighbor_embedding = feature_store.get_nearest_vector(embedding)
    prediction = model.forward_predictions(nearest_neighbor_embedding)

Describe alternatives you've considered

Option 1: Two separate forward passes (for embeddings and predictions) for each image. Create "forward_predictions" helper functions for each different architecture we use to convert embeddings into predictions.

Option 2: register a custom forward hook for every architecture which intercepts the embedding vector during the forward pass.
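For reference, the forward-hook alternative can be sketched as follows. The `nn.Sequential` toy model and the `save_embedding` helper are hypothetical stand-ins; with a real architecture, the module to hook (here the final linear layer) has to be located per model, which is the drawback the request describes.

```python
import torch
import torch.nn as nn

# Toy stand-in for a classifier: a feature extractor followed by a linear head.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 8 * 8, 16),  # "backbone" producing a 16-d embedding
    nn.ReLU(),
    nn.Linear(16, 5),          # classification head
)
model.eval()

captured = {}

def save_embedding(module, inputs, output):
    # The head's input is the embedding; `inputs` is a tuple of positional args.
    captured["embedding"] = inputs[0].detach()

# Hook the head so its input is recorded on every forward pass.
handle = model[3].register_forward_hook(save_embedding)

with torch.no_grad():
    predictions = model(torch.randn(2, 3, 8, 8))
embeddings = captured["embedding"]

handle.remove()  # detach the hook once it is no longer needed
```

A single forward pass yields both tensors, at the cost of knowing which submodule to hook for each architecture.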
enhancement
opened by crypdick 16
Cutmix

clovaai/CutMix-PyTorch: Official Pytorch implementation of CutMix regularizer GitHub: https://github.com/clovaai/CutMix-PyTorch

Hi, I saw that you’ve been implementing mixup as an additional feature. However, when a model trained with mixup is used as the backbone of an object detector, the detector's performance seems to degrade.

Could you please consider cutmix in addition to mixup?

thanks!
enhancement

opened by dandelin 16
FEATURE: mobilenetv2 0.35 training on ImageNet - train (and possibly add) smaller mobilenet v2/v3/mnasnet models

Discussed in https://github.com/rwightman/pytorch-image-models/discussions/1020

Originally posted by IgorKasianenko, December 5, 2021

Hello, I want to train mobilenetv2 0.35 on ImageNet. I tried to follow the example at https://rwightman.github.io/pytorch-image-models/training_hparam_examples/#mobilenetv3-large-100-75766-top-1-92542-top-5. After reading the timm documentation, I assumed the 0.35 depth model would be named mobilenetv2_035, similarly to mobilenetv2_100, but I get the error RuntimeError: Unknown model (mobilenetv2_035). Please advise how to add this model to timm so I can use the ImageNet training script. Thanks
enhancement help wanted

opened by IgorKasianenko 13

use `Image.Resampling` namespace for PIL mapping

PIL version 9.1.0 shows a deprecation warning when accessing resampling constants via the Image namespace. The suggested namespace is Image.Resampling. This commit updates _pil_interpolation_to_str to use the Image.Resampling namespace.

/tmp/ipykernel_11959/698124036.py:2: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
  Image.NEAREST: 'nearest',
/tmp/ipykernel_11959/698124036.py:3: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  Image.BILINEAR: 'bilinear',
/tmp/ipykernel_11959/698124036.py:4: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  Image.BICUBIC: 'bicubic',
/tmp/ipykernel_11959/698124036.py:5: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
  Image.BOX: 'box',
/tmp/ipykernel_11959/698124036.py:6: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
  Image.HAMMING: 'hamming',
/tmp/ipykernel_11959/698124036.py:7: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
  Image.LANCZOS: 'lanczos',

opened by kaczmarj 12

Add option for ML-Decoder - an improved classification head

While almost every aspect of ImageNet training has improved in the last couple of years (backbones, augmentations, loss, ...), a plain classification head, GAP + fully connected, remains the default option. In our paper, "ML-Decoder: Scalable and Versatile Classification Head" (https://github.com/Alibaba-MIIL/ML_Decoder), we propose a new attention-based classification head that not only improves results, but also provides a better speed-accuracy tradeoff on various classification tasks: multi-label, single-label and zero-shot.

A technical note about the merge request: since each model has a unique coding style, systematically using a different classification head is challenging. This merge request enables the ML-Decoder head for all CNNs (I specifically checked ResNet, ResNetD, EfficientNet, RegNet and TResNet). For Transformers, the GAP operation is embedded inside the 'forward_features' pass, so it is hard to use a different classification head without editing each model separately.

opened by mrT23 11
Model request: CSPNet

https://openaccess.thecvf.com/content_CVPRW_2020/papers/w28/Wang_CSPNet_A_New_Backbone_That_Can_Enhance_Learning_Capability_of_CVPRW_2020_paper.pdf

The authors, as usual, claim that their models are faster, lighter, more accurate.

It would be nice to add them to the repo.

opened by ternaus 11
tf_efficientnet_b0_ap model was removed but is still in the doc
Describe the bug

tf_efficientnet_b0_ap model was removed in https://github.com/rwightman/pytorch-image-models/commit/6a01101905e78007e5396f5ffdaae0c4725ba72c#diff-27c2bbd967991cbb5264f93cb5da34895fdab02424b2cc8c63d3d0768e65d47aL1833, but is still in doc https://github.com/rwightman/pytorch-image-models/blob/6a01101905e78007e5396f5ffdaae0c4725ba72c/docs/models/advprop.md#how-do-i-use-this-model-on-an-image

To Reproduce Steps to reproduce the behavior:

    $ python -c "import timm; timm.create_model('tf_efficientnet_b0_ap', pretrained=True)"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/xwang/Developer/pytorch-image-models/timm/models/_factory.py", line 89, in create_model
        raise RuntimeError('Unknown model (%s)' % model_name)
    RuntimeError: Unknown model (tf_efficientnet_b0_ap)
bug
opened by xwang233 2
What batch size number other than 1024 have you tried when training a DeiT or ViT model?

What batch sizes other than 1024 have you tried when training a DeiT or ViT model? In the DeiT paper (https://arxiv.org/abs/2012.12877), they used a batch size of 1024 and mentioned that the learning rate should be scaled according to the batch size.

However, I was wondering if you have any experience with, or have successfully trained, a DeiT model with a batch size below 512? If yes, what accuracy did you achieve?

This would be helpful for someone training on constrained resources that cannot train on a batch size of 1024.
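As a rough illustration of the scaling mentioned above, the linear rule can be written as a small helper. The defaults (base learning rate 5e-4 per 512 samples) follow the rule as stated in the DeiT paper to the best of my reading; treat them as assumptions and check them against the recipe you are reproducing. `scaled_lr` is a hypothetical helper, not part of timm.

```python
def scaled_lr(batch_size: int, base_lr: float = 5e-4, base_batch_size: int = 512) -> float:
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * batch_size / base_batch_size


# Total (global) batch sizes, i.e. per-GPU batch size times number of GPUs.
for bs in (1024, 512, 256, 128):
    print(f"batch size {bs:4d} -> lr {scaled_lr(bs):.6f}")
```

Smaller batches therefore get a proportionally smaller learning rate. Whether accuracy holds up at very small batch sizes is an empirical question this sketch does not answer.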

opened by CharlesLeeeee 0
[FEATURE] Script to convert weight from Jax to PyTorch

Is your feature request related to a problem? Please describe. I am trying to create multiple checkpoints of ViT at different iterations. Is there a systematic way to perform such a conversion?

Describe the solution you'd like I would like to be able to convert a JAX ViT model to a PyTorch model, similar to this model (https://huggingface.co/google/vit-base-patch16-224)

Describe alternatives you've considered I have tried pre-training HF models on an A100, but so far have not been able to reach the same accuracy.
enhancement

opened by yazdanbakhsh 6
[FEATURE] BEIT pre-training model

Is your feature request related to a problem? Please describe. There is no problem or bug

Describe the solution you'd like I would like an implementation of the BEIT pre-training pipeline so that I can pre-train the architecture myself

Describe alternatives you've considered No

Additional context No
enhancement

opened by lorenzbaraldi 2
[BUG] ViT ImageNet1K weights

Describe the bug In version 0.5.4, for example, does vit_tiny_patch16_224 mean vit_tiny trained from scratch on ImageNet-1K? In the current version 0.6, however, vit_tiny_patch16_224 means vit_tiny pretrained on 21k and then fine-tuned on in1k, which is very misleading and leads to errors in downstream experiments.

To Reproduce Steps to reproduce the behavior: https://github.com/rwightman/pytorch-image-models/blob/18ec173f95aa220af753358bf860b16b6691edb2/timm/models/vision_transformer.py#L642

Expected behavior Regular ImageNet-1K training without extra data knowledge.
bug

opened by hellojialee 2

Pruned efficientnets don't respect the `in_chans` parameter

When creating a model using timm.create_model(arch, pretrained=True, in_chans=1, num_classes=1), single-channel input images can be used with tf_efficientnet_b2_ns, but not efficientnet_b3_pruned. The pruned models result in the following error:

  File "/home/james/miniconda3/envs/mammo/lib/python3.10/site-packages/timm/models/efficientnet.py", line 557, in forward
    x = self.forward_features(x)

  File "/home/james/miniconda3/envs/mammo/lib/python3.10/site-packages/timm/models/efficientnet.py", line 540, in forward_features
    x = self.conv_stem(x)

  File "/home/james/miniconda3/envs/mammo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
    return forward_call(*args, **kwargs)

  File "/home/james/miniconda3/envs/mammo/lib/python3.10/site-packages/timm/models/layers/conv2d_same.py", line 30, in forward
    return conv2d_same(x, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups)

  File "/home/james/miniconda3/envs/mammo/lib/python3.10/site-packages/timm/models/layers/conv2d_same.py", line 17, in conv2d_same
    return F.conv2d(x, weight, bias, stride, (0, 0), dilation, groups)

RuntimeError: Given groups=1, weight of size [40, 3, 3, 3], expected input[14, 1, 2459, 2459] to have 3 channels, but got 1 channels instead
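Until the pruned variants honor `in_chans`, one possible workaround is to collapse a 3-channel stem kernel into a 1-channel one by summing over the input-channel axis. The sketch below uses stand-in `nn.Conv2d` layers with the `[40, 3, 3, 3]` weight shape from the traceback; the stride and padding are assumed values. It is not a description of how timm handles `in_chans` internally.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained RGB stem with the weight shape from the traceback.
rgb_stem = nn.Conv2d(3, 40, kernel_size=3, stride=2, padding=1, bias=False)

# New single-channel stem with otherwise identical settings.
gray_stem = nn.Conv2d(1, 40, kernel_size=3, stride=2, padding=1, bias=False)

# Summing the kernel over its input-channel axis makes the 1-channel stem give
# the same output the RGB stem would give for a grayscale image repeated 3x.
with torch.no_grad():
    gray_stem.weight.copy_(rgb_stem.weight.sum(dim=1, keepdim=True))

x = torch.randn(2, 1, 32, 32)
with torch.no_grad():
    out_gray = gray_stem(x)
    out_rgb = rgb_stem(x.repeat(1, 3, 1, 1))
```

The two outputs agree up to floating-point error, so the pretrained stem's behavior on grayscale input is preserved while the input only needs one channel.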

bug

opened by jphdotam 0

Releases(v0.8.2dev0)

v0.8.2dev0(Dec 24, 2022)
Part way through the conversion of models to multi-weight support (model_arch.pretrain_tag), module reorg for future building, and lots of new weights and model additions as we go...

This is considered a development release. Please stick to 0.6.x if you need stability. Some model names and tags will shift a bit; some old names have already been deprecated, and remapping support has not been added yet. For code, the 0.6.x branch is considered 'stable': https://github.com/rwightman/pytorch-image-models/tree/0.6.x
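To make the `model_arch.pretrained_tag` naming concrete, here is a small sketch that splits such a name at the first dot. `split_model_name` is a hypothetical helper written for illustration; it is not timm's own parsing code, and the example names are taken from the release notes on this page.

```python
from typing import Tuple


def split_model_name(name: str) -> Tuple[str, str]:
    """Split 'arch.tag' into (arch, tag); the tag is '' when no dot is present."""
    arch, _, tag = name.partition(".")
    return arch, tag


for name in (
    "convnext_nano.in12k_ft_in1k",
    "vit_base_patch16_clip_224.laion2b_ft_in1k",
    "resnet50",  # no tag given
):
    print(split_model_name(name))
```

Read this way, a tag such as `in12k_ft_in1k` describes the pretraining and fine-tuning data of one set of weights for the same architecture.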

Dec 23, 2022 🎄☃

Add FlexiViT models and weights from https://github.com/google-research/big_vision (check out paper at https://arxiv.org/abs/2212.08013)

NOTE currently resizing is static on model creation, on-the-fly dynamic / train patch size sampling is a WIP

Many more models updated to multi-weight and downloadable via HF hub now (convnext, efficientnet, mobilenet, vision_transformer*, beit)

More model pretrained tag and adjustments, some model names changed (working on deprecation translations, consider main branch DEV branch right now, use 0.6.x for stable use)

More ImageNet-12k (subset of 22k) pretrain models popping up:

efficientnet_b5.in12k_ft_in1k - 85.9 @ 448x448

vit_medium_patch16_gap_384.in12k_ft_in1k - 85.5 @ 384x384

vit_medium_patch16_gap_256.in12k_ft_in1k - 84.5 @ 256x256

convnext_nano.in12k_ft_in1k - 82.9 @ 288x288

Dec 8, 2022

Add 'EVA l' to vision_transformer.py, MAE style ViT-L/14 MIM pretrain w/ EVA-CLIP targets, FT on ImageNet-1k (w/ ImageNet-22k intermediate for some)

original source: https://github.com/baaivision/EVA

| model | top1 | param_count | gmac | macts | hub |
|:---|---:|---:|---:|---:|:---|
| eva_large_patch14_336.in22k_ft_in22k_in1k | 89.2 | 304.5 | 191.1 | 270.2 | link |
| eva_large_patch14_336.in22k_ft_in1k | 88.7 | 304.5 | 191.1 | 270.2 | link |
| eva_large_patch14_196.in22k_ft_in22k_in1k | 88.6 | 304.1 | 61.6 | 63.5 | link |
| eva_large_patch14_196.in22k_ft_in1k | 87.9 | 304.1 | 61.6 | 63.5 | link |

Dec 6, 2022

Add 'EVA g', BEiT style ViT-g/14 model weights w/ both MIM pretrain and CLIP pretrain to beit.py.

original source: https://github.com/baaivision/EVA

paper: https://arxiv.org/abs/2211.07636

| model | top1 | param_count | gmac | macts | hub |
|:---|---:|---:|---:|---:|:---|
| eva_giant_patch14_560.m30m_ft_in22k_in1k | 89.8 | 1014.4 | 1906.8 | 2577.2 | link |
| eva_giant_patch14_336.m30m_ft_in22k_in1k | 89.6 | 1013 | 620.6 | 550.7 | link |
| eva_giant_patch14_336.clip_ft_in1k | 89.4 | 1013 | 620.6 | 550.7 | link |
| eva_giant_patch14_224.clip_ft_in1k | 89.1 | 1012.6 | 267.2 | 192.6 | link |

Dec 5, 2022

Pre-release (0.8.0dev0) of multi-weight support (model_arch.pretrained_tag). Install with pip install --pre timm

vision_transformer, maxvit, convnext are the first three model impl w/ support

model names are changing with this (previous _21k, etc. fn will merge), still sorting out deprecation handling

bugs are likely, but I need feedback so please try it out

if stability is needed, please use 0.6.x pypi releases or clone from 0.6.x branch

Support for PyTorch 2.0 compile is added in train/validate/inference/benchmark, use --torchcompile argument

Inference script allows more control over output, select k for top-class index + prob json, csv or parquet output

Add a full set of fine-tuned CLIP image tower weights from both LAION-2B and original OpenAI CLIP models

| model | top1 | param_count | gmac | macts | hub |
|:---|---:|---:|---:|---:|:---|
| vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k | 88.6 | 632.5 | 391 | 407.5 | link |
| vit_large_patch14_clip_336.openai_ft_in12k_in1k | 88.3 | 304.5 | 191.1 | 270.2 | link |
| vit_huge_patch14_clip_224.laion2b_ft_in12k_in1k | 88.2 | 632 | 167.4 | 139.4 | link |
| vit_large_patch14_clip_336.laion2b_ft_in12k_in1k | 88.2 | 304.5 | 191.1 | 270.2 | link |
| vit_large_patch14_clip_224.openai_ft_in12k_in1k | 88.2 | 304.2 | 81.1 | 88.8 | link |
| vit_large_patch14_clip_224.laion2b_ft_in12k_in1k | 87.9 | 304.2 | 81.1 | 88.8 | link |
| vit_large_patch14_clip_224.openai_ft_in1k | 87.9 | 304.2 | 81.1 | 88.8 | link |
| vit_large_patch14_clip_336.laion2b_ft_in1k | 87.9 | 304.5 | 191.1 | 270.2 | link |
| vit_huge_patch14_clip_224.laion2b_ft_in1k | 87.6 | 632 | 167.4 | 139.4 | link |
| vit_large_patch14_clip_224.laion2b_ft_in1k | 87.3 | 304.2 | 81.1 | 88.8 | link |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 87.2 | 86.9 | 55.5 | 101.6 | link |
| vit_base_patch16_clip_384.openai_ft_in12k_in1k | 87 | 86.9 | 55.5 | 101.6 | link |
| vit_base_patch16_clip_384.laion2b_ft_in1k | 86.6 | 86.9 | 55.5 | 101.6 | link |
| vit_base_patch16_clip_384.openai_ft_in1k | 86.2 | 86.9 | 55.5 | 101.6 | link |
| vit_base_patch16_clip_224.laion2b_ft_in12k_in1k | 86.2 | 86.6 | 17.6 | 23.9 | link |
| vit_base_patch16_clip_224.openai_ft_in12k_in1k | 85.9 | 86.6 | 17.6 | 23.9 | link |
| vit_base_patch32_clip_448.laion2b_ft_in12k_in1k | 85.8 | 88.3 | 17.9 | 23.9 | link |
| vit_base_patch16_clip_224.laion2b_ft_in1k | 85.5 | 86.6 | 17.6 | 23.9 | link |
| vit_base_patch32_clip_384.laion2b_ft_in12k_in1k | 85.4 | 88.3 | 13.1 | 16.5 | link |
| vit_base_patch16_clip_224.openai_ft_in1k | 85.3 | 86.6 | 17.6 | 23.9 | link |
| vit_base_patch32_clip_384.openai_ft_in12k_in1k | 85.2 | 88.3 | 13.1 | 16.5 | link |
| vit_base_patch32_clip_224.laion2b_ft_in12k_in1k | 83.3 | 88.2 | 4.4 | 5 | link |
| vit_base_patch32_clip_224.laion2b_ft_in1k | 82.6 | 88.2 | 4.4 | 5 | link |
| vit_base_patch32_clip_224.openai_ft_in1k | 81.9 | 88.2 | 4.4 | 5 | link |

Port of MaxViT Tensorflow Weights from official impl at https://github.com/google-research/maxvit

There were larger than expected drops for the upscaled 384/512 in21k fine-tune weights, possibly due to a missing detail, but the 21k FT did seem sensitive to small preprocessing differences

| model | top1 | param_count | gmac | macts | hub |
|:---|---:|---:|---:|---:|:---|
| maxvit_xlarge_tf_512.in21k_ft_in1k | 88.5 | 475.8 | 534.1 | 1413.2 | link |
| maxvit_xlarge_tf_384.in21k_ft_in1k | 88.3 | 475.3 | 292.8 | 668.8 | link |
| maxvit_base_tf_512.in21k_ft_in1k | 88.2 | 119.9 | 138 | 704 | link |
| maxvit_large_tf_512.in21k_ft_in1k | 88 | 212.3 | 244.8 | 942.2 | link |
| maxvit_large_tf_384.in21k_ft_in1k | 88 | 212 | 132.6 | 445.8 | link |
| maxvit_base_tf_384.in21k_ft_in1k | 87.9 | 119.6 | 73.8 | 332.9 | link |
| maxvit_base_tf_512.in1k | 86.6 | 119.9 | 138 | 704 | link |
| maxvit_large_tf_512.in1k | 86.5 | 212.3 | 244.8 | 942.2 | link |
| maxvit_base_tf_384.in1k | 86.3 | 119.6 | 73.8 | 332.9 | link |
| maxvit_large_tf_384.in1k | 86.2 | 212 | 132.6 | 445.8 | link |
| maxvit_small_tf_512.in1k | 86.1 | 69.1 | 67.3 | 383.8 | link |
| maxvit_tiny_tf_512.in1k | 85.7 | 31 | 33.5 | 257.6 | link |
| maxvit_small_tf_384.in1k | 85.5 | 69 | 35.9 | 183.6 | link |
| maxvit_tiny_tf_384.in1k | 85.1 | 31 | 17.5 | 123.4 | link |
| maxvit_large_tf_224.in1k | 84.9 | 211.8 | 43.7 | 127.4 | link |
| maxvit_base_tf_224.in1k | 84.9 | 119.5 | 24 | 95 | link |
| maxvit_small_tf_224.in1k | 84.4 | 68.9 | 11.7 | 53.2 | link |
| maxvit_tiny_tf_224.in1k | 83.4 | 30.9 | 5.6 | 35.8 | link |

Oct 15, 2022

Train and validation script enhancements

Non-GPU (ie CPU) device support

SLURM compatibility for train script

HF datasets support (via ReaderHfds)

TFDS/WDS dataloading improvements (sample padding/wrap for distributed use fixed wrt sample count estimate)

in_chans !=3 support for scripts / loader

Adan optimizer

Can enable per-step LR scheduling via args

Dataset 'parsers' renamed to 'readers', more descriptive of purpose

AMP args changed, APEX via --amp-impl apex, bfloat16 supported via --amp-dtype bfloat16

main branch switched to 0.7.x version, 0.6.x forked for stable release of weight only adds

master -> main branch rename

v0.6.12(Nov 23, 2022)
Minor bug fixes to HF push_to_hub, plus some more MaxVit weights

Oct 10, 2022

More weights in maxxvit series, incl first ConvNeXt block based coatnext and maxxvit experiments:

coatnext_nano_rw_224 - 82.0 @ 224 (G) -- (uses ConvNeXt conv block, no BatchNorm)

maxxvit_rmlp_nano_rw_256 - 83.0 @ 256, 83.7 @ 320 (G) (uses ConvNeXt conv block, no BN)

maxvit_rmlp_small_rw_224 - 84.5 @ 224, 85.1 @ 320 (G)

maxxvit_rmlp_small_rw_256 - 84.6 @ 256, 84.9 @ 288 (G) -- could be trained better, hparams need tuning (uses ConvNeXt block, no BN)

coatnet_rmlp_2_rw_224 - 84.6 @ 224, 85 @ 320 (T)

v0.6.11(Oct 3, 2022)
Changes Since 0.6.7

Sept 23, 2022

CLIP LAION-2B pretrained B/32, L/14, H/14, and g/14 image tower weights as vit models (for fine-tune)

Sept 7, 2022

Hugging Face timm docs home now exists, look for more here in the future

Add BEiT-v2 weights for base and large 224x224 models from https://github.com/microsoft/unilm/tree/master/beit2

Add more weights in maxxvit series incl a pico (7.5M params, 1.9 GMACs), two tiny variants:

maxvit_rmlp_pico_rw_256 - 80.5 @ 256, 81.3 @ 320 (T)

maxvit_tiny_rw_224 - 83.5 @ 224 (G)

maxvit_rmlp_tiny_rw_256 - 84.2 @ 256, 84.8 @ 320 (T)

Aug 29, 2022

MaxVit window size scales with img_size by default. Add new RelPosMlp MaxViT weight that leverages this:

maxvit_rmlp_nano_rw_256 - 83.0 @ 256, 83.6 @ 320 (T)

Aug 26, 2022

CoAtNet (https://arxiv.org/abs/2106.04803) and MaxVit (https://arxiv.org/abs/2204.01697) timm original models

both found in maxxvit.py model def, contains numerous experiments outside scope of original papers

an unfinished Tensorflow version from MaxVit authors can be found https://github.com/google-research/maxvit

Initial CoAtNet and MaxVit timm pretrained weights (working on more):

coatnet_nano_rw_224 - 81.7 @ 224 (T)

coatnet_rmlp_nano_rw_224 - 82.0 @ 224, 82.8 @ 320 (T)

coatnet_0_rw_224 - 82.4 (T) -- NOTE timm '0' coatnets have 2 more 3rd stage blocks

coatnet_bn_0_rw_224 - 82.4 (T)

maxvit_nano_rw_256 - 82.9 @ 256 (T)

coatnet_rmlp_1_rw_224 - 83.4 @ 224, 84 @ 320 (T)

coatnet_1_rw_224 - 83.6 @ 224 (G)

(T) = TPU trained with bits_and_tpu branch training code, (G) = GPU trained

GCVit (weights adapted from https://github.com/NVlabs/GCVit, code 100% timm re-write for license purposes)

MViT-V2 (multi-scale vit, adapted from https://github.com/facebookresearch/mvit)

EfficientFormer (adapted from https://github.com/snap-research/EfficientFormer)

PyramidVisionTransformer-V2 (adapted from https://github.com/whai362/PVT)

'Fast Norm' support for LayerNorm and GroupNorm that avoids float32 upcast w/ AMP (uses APEX LN if available for further boost)

Aug 15, 2022

ConvNeXt atto weights added

convnext_atto - 75.7 @ 224, 77.0 @ 288

convnext_atto_ols - 75.9 @ 224, 77.2 @ 288

Aug 5, 2022

More custom ConvNeXt smaller model defs with weights

convnext_femto - 77.5 @ 224, 78.7 @ 288

convnext_femto_ols - 77.9 @ 224, 78.9 @ 288

convnext_pico - 79.5 @ 224, 80.4 @ 288

convnext_pico_ols - 79.5 @ 224, 80.5 @ 288

convnext_nano_ols - 80.9 @ 224, 81.6 @ 288

Updated EdgeNeXt to improve ONNX export, add new base variant and weights from original (https://github.com/mmaaz60/EdgeNeXt)

July 28, 2022

Add freshly minted DeiT-III Medium (width=512, depth=12, num_heads=8) model weights. Thanks Hugo Touvron!

v0.1-weights-maxx(Aug 24, 2022)
CoAtNet (https://arxiv.org/abs/2106.04803) and MaxVit (https://arxiv.org/abs/2204.01697) timm trained weights

Weights were created reproducing the paper architectures and exploring timm specific additions such as ConvNeXt blocks, parallel partitioning, and other experiments.

Weights were trained on a mix of TPU and GPU systems. Bulk of weights were trained on TPU via the TRC program (https://sites.research.google/trc/about/).

CoAtNet variants run particularly well on TPU, it's a great combination. MaxVit is better suited to GPU due to the window partitioning, although there are some optimizations that can be made to improve TPU padding/utilization incl using 256x256 image size with (8, 8) window/grid size, and keeping format in NCHW for partition attention when using PyTorch XLA.

Glossary:

coatnet - CoAtNet (MBConv + transformer blocks)

coatnext - CoAtNet w/ ConvNeXt conv blocks

maxvit - MaxViT (MBConv + block (ala swin) and grid partitioning transformer blocks)

maxxvit - MaxViT w/ ConvNeXt conv blocks

rmlp - relative position embedding w/ MLP (can be resized) -- if this isn't in model name, it's using relative position bias (ala swin)

rw - my variations on the model, slight differences in sizing / pooling / etc from Google paper spec

Results:

maxvit_rmlp_pico_rw_256 - 80.5 @ 256, 81.3 @ 320 (T)

coatnet_nano_rw_224 - 81.7 @ 224 (T)

coatnext_nano_rw_224 - 82.0 @ 224 (G) -- (uses convnext block, no BatchNorm)

coatnet_rmlp_nano_rw_224 - 82.0 @ 224, 82.8 @ 320 (T)

coatnet_0_rw_224 - 82.4 (T) -- NOTE timm '0' coatnets have 2 more 3rd stage blocks

coatnet_bn_0_rw_224 - 82.4 (T) -- all BatchNorm, no LayerNorm

maxvit_nano_rw_256 - 82.9 @ 256 (T)

maxvit_rmlp_nano_rw_256 - 83.0 @ 256, 83.6 @ 320 (T)

maxxvit_rmlp_nano_rw_256 - 83.0 @ 256, 83.7 @ 320 (G) (uses convnext conv block, no BatchNorm)

coatnet_rmlp_1_rw_224 - 83.4 @ 224, 84 @ 320 (T)

maxvit_tiny_rw_224 - 83.5 @ 224 (G)

coatnet_1_rw_224 - 83.6 @ 224 (G)

maxvit_rmlp_tiny_rw_256 - 84.2 @ 256, 84.8 @ 320 (T)

maxvit_rmlp_small_rw_224 - 84.5 @ 224, 85.1 @ 320 (G)

maxxvit_rmlp_small_rw_256 - 84.6 @ 256, 84.9 @ 288 (G) -- could be trained better, hparams need tuning (uses convnext conv block, no BN)

coatnet_rmlp_2_rw_224 - 84.6 @ 224, 85 @ 320 (T)

(T) = TPU trained with bits_and_tpu branch training code, (G) = GPU trained
coatnet_0_rw_224_sw-a6439706.pth(104.73 MB)
coatnet_1_rw_224_sw-5cae1ea8.pth(159.30 MB)
coatnet_bn_0_rw_224_sw-c228e218.pth(104.81 MB)
coatnet_nano_rw_224_sw-f53093b4.pth(57.84 MB)
coatnet_rmlp_1_rw_224_sw-9051e6c3.pth(159.19 MB)
coatnet_rmlp_2_rw_224_sw-5ccfac55.pth(282.02 MB)
coatnet_rmlp_nano_rw_224_sw-bd1d51b3.pth(57.86 MB)
coatnext_nano_rw_224_ad-22cb71c2.pth(56.11 MB)
maxvit_nano_rw_256_sw-fb127241.pth(59.07 MB)
maxvit_rmlp_nano_rw_256_sw-c17bb0d6.pth(59.27 MB)
maxvit_rmlp_pico_rw_256_sw-8d82f2c6.pth(28.84 MB)
maxvit_rmlp_small_rw_224_sw-6ef0ae4f.pth(247.88 MB)
maxvit_rmlp_tiny_rw_256_sw-2da819a5.pth(111.44 MB)
maxvit_rmlp_tiny_rw_256_sw-bbef0ff5.pth(111.44 MB)
maxvit_tiny_rw_224_sw-7d0dffeb.pth(111.08 MB)
maxxvit_rmlp_nano_rw_256_sw-0325d459.pth(64.06 MB)
maxxvit_rmlp_small_rw_256_sw-37e217ff.pth(251.89 MB)
v0.1-weights-morevit(Aug 17, 2022)

More weights for 3rd party ViT / ViT-CNN hybrids that needed remapping / re-hosting

EfficientFormer

Rehosted and remapped checkpoints from https://github.com/snap-research/EfficientFormer (originals in Google Drive)

GCViT

Heavily remapped from originals at https://github.com/NVlabs/GCVit due to from-scratch re-write of model code

NOTE: these checkpoints have a non-commercial CC-BY-NC-SA-4.0 license.
efficientformer_l1_1000d_224-5b08fab0.pth(47.06 MB)
efficientformer_l3_300d_224-6816624f.pth(120.17 MB)
efficientformer_l7_300d_224-e957ab75.pth(314.26 MB)
gcvit_base_224_nvidia-f009139b.pth(344.62 MB)
gcvit_small_224_nvidia-4e98afa2.pth(194.98 MB)
gcvit_tiny_224_nvidia-ac783954.pth(107.73 MB)
gcvit_xtiny_224_nvidia-274b92b7.pth(76.26 MB)
gcvit_xxtiny_224_nvidia-d1d86009.pth(45.79 MB)
v0.6.7(Jul 27, 2022)
Minor bug fixes and a few more weights since 0.6.5

A few more weights & model defs added:

darknetaa53 - 79.8 @ 256, 80.5 @ 288

convnext_nano - 80.8 @ 224, 81.5 @ 288

cs3sedarknet_l - 81.2 @ 256, 81.8 @ 288

cs3darknet_x - 81.8 @ 256, 82.2 @ 288

cs3sedarknet_x - 82.2 @ 256, 82.7 @ 288

cs3edgenet_x - 82.2 @ 256, 82.7 @ 288

cs3se_edgenet_x - 82.8 @ 256, 83.5 @ 320

cs3* weights above all trained on TPU w/ bits_and_tpu branch. Thanks to TRC program!

Add output_stride=8 and 16 support to ConvNeXt (dilation)

deit3 models not being able to resize pos_emb fixed

v0.6.5(Jul 10, 2022)
First official release in a long while (since 0.5.4). All change log since 0.5.4 below.

July 8, 2022

More models, more fixes

Official research models (w/ weights) added:

EdgeNeXt from (https://github.com/mmaaz60/EdgeNeXt)

MobileViT-V2 from (https://github.com/apple/ml-cvnets)

DeiT III (Revenge of the ViT) from (https://github.com/facebookresearch/deit)

My own models:

Small ResNet defs added by request with 1 block repeats for both basic and bottleneck (resnet10 and resnet14)

CspNet refactored with dataclass config, simplified CrossStage3 (cs3) option. These are closer to YOLO-v5+ backbone defs.

More relative position vit fiddling. Two srelpos (shared relative position) models trained, and a medium w/ class token.

Add an alternate downsample mode to EdgeNeXt and train a small model. Better than original small, but not their new USI trained weights.

My own model weight results (all ImageNet-1k training)

resnet10t - 66.5 @ 176, 68.3 @ 224

resnet14t - 71.3 @ 176, 72.3 @ 224

resnetaa50 - 80.6 @ 224, 81.6 @ 288

darknet53 - 80.0 @ 256, 80.5 @ 288

cs3darknet_m - 77.0 @ 256, 77.6 @ 288

cs3darknet_focus_m - 76.7 @ 256, 77.3 @ 288

cs3darknet_l - 80.4 @ 256, 80.9 @ 288

cs3darknet_focus_l - 80.3 @ 256, 80.9 @ 288

vit_srelpos_small_patch16_224 - 81.1 @ 224, 82.1 @ 320

vit_srelpos_medium_patch16_224 - 82.3 @ 224, 83.1 @ 320

vit_relpos_small_patch16_cls_224 - 82.6 @ 224, 83.6 @ 320

edgenext_small_rw - 79.6 @ 224, 80.4 @ 320

cs3, darknet, and vit_*relpos weights above all trained on TPU thanks to TRC program! Rest trained on overheating GPUs.

Hugging Face Hub support fixes verified, demo notebook TBA

Pretrained weights / configs can be loaded externally (ie from local disk) w/ support for head adaptation.

Add support to change image extensions scanned by timm datasets/parsers. See (https://github.com/rwightman/pytorch-image-models/pull/1274#issuecomment-1178303103)

Default ConvNeXt LayerNorm impl to use F.layer_norm(x.permute(0, 2, 3, 1), ...).permute(0, 3, 1, 2) via LayerNorm2d in all cases.

a bit slower than previous custom impl on some hardware (ie Ampere w/ CL), but overall fewer regressions across wider HW / PyTorch version ranges.

previous impl exists as LayerNormExp2d in models/layers/norm.py
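The permute-based approach described above can be sketched as a thin `nn.LayerNorm` subclass. This is a simplified illustration of the idea (normalize over the channel dimension of an NCHW tensor) and may differ in detail from the `LayerNorm2d` class that ships in timm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerNorm2d(nn.LayerNorm):
    """LayerNorm over the channel dim of NCHW tensors (illustrative sketch)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.permute(0, 2, 3, 1)  # NCHW -> NHWC so channels are the last dim
        x = F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        return x.permute(0, 3, 1, 2)  # NHWC -> NCHW


norm = LayerNorm2d(8)
x = torch.randn(2, 8, 4, 4)
y = norm(x)
```

Each spatial position is normalized independently across its 8 channels, and the output keeps the NCHW layout expected by the surrounding convolutions.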

Numerous bug fixes

Currently testing for imminent PyPi 0.6.x release

LeViT pretraining of larger models still a WIP, they don't train well / easily without distillation. Time to add distill support (finally)?

ImageNet-22k weight training + finetune ongoing, work on multi-weight support (slowly) chugging along (there are a LOT of weights, sigh) ...

May 13, 2022

Official Swin-V2 models and weights added from (https://github.com/microsoft/Swin-Transformer). Cleaned up to support torchscript.

Some refactoring for existing timm Swin-V2-CR impl, will likely do a bit more to bring parts closer to official and decide whether to merge some aspects.

More Vision Transformer relative position / residual post-norm experiments (all trained on TPU thanks to TRC program)

vit_relpos_small_patch16_224 - 81.5 @ 224, 82.5 @ 320 -- rel pos, layer scale, no class token, avg pool

vit_relpos_medium_patch16_rpn_224 - 82.3 @ 224, 83.1 @ 320 -- rel pos + res-post-norm, no class token, avg pool

vit_relpos_medium_patch16_224 - 82.5 @ 224, 83.3 @ 320 -- rel pos, layer scale, no class token, avg pool

vit_relpos_base_patch16_gapcls_224 - 82.8 @ 224, 83.9 @ 320 -- rel pos, layer scale, class token, avg pool (by mistake)

Bring 512 dim, 8-head 'medium' ViT model variant back to life (after using in a pre DeiT 'small' model for first ViT impl back in 2020)

Add ViT relative position support for switching btw existing impl and some additions in official Swin-V2 impl for future trials

Sequencer2D impl (https://arxiv.org/abs/2205.01972), added via PR from author (https://github.com/okojoalg)

May 2, 2022

Vision Transformer experiments adding Relative Position (Swin-V2 log-coord) (vision_transformer_relpos.py) and Residual Post-Norm branches (from Swin-V2) (vision_transformer*.py)

vit_relpos_base_patch32_plus_rpn_256 - 79.5 @ 256, 80.6 @ 320 -- rel pos + extended width + res-post-norm, no class token, avg pool

vit_relpos_base_patch16_224 - 82.5 @ 224, 83.6 @ 320 -- rel pos, layer scale, no class token, avg pool

vit_base_patch16_rpn_224 - 82.3 @ 224 -- rel pos + res-post-norm, no class token, avg pool

Vision Transformer refactor to remove representation layer that was only used in initial vit and rarely used since with newer pretrain (ie How to Train Your ViT)

vit_* models support removal of class token, use of global average pool, use of fc_norm (ala beit, mae).

April 22, 2022

timm models are now officially supported in fast.ai! Just in time for the new Practical Deep Learning course. timmdocs documentation link updated to timm.fast.ai.

Two more model weights added in the TPU trained series. Some In22k pretrain still in progress.

seresnext101d_32x8d - 83.69 @ 224, 84.35 @ 288

seresnextaa101d_32x8d (anti-aliased w/ AvgPool2d) - 83.85 @ 224, 84.57 @ 288

March 23, 2022

Add ParallelBlock and LayerScale option to base vit models to support model configs in Three things everyone should know about ViT

convnext_tiny_hnf (head norm first) weights trained with (close to) A2 recipe, 82.2% top-1, could do better with more epochs.

March 21, 2022

Merge norm_norm_norm. IMPORTANT this update for a coming 0.6.x release will likely de-stabilize the master branch for a while. Branch 0.5.x or a previous 0.5.x release can be used if stability is required.

Significant weights update (all TPU trained) as described in this release

regnety_040 - 82.3 @ 224, 82.96 @ 288

regnety_064 - 83.0 @ 224, 83.65 @ 288

regnety_080 - 83.17 @ 224, 83.86 @ 288

regnetv_040 - 82.44 @ 224, 83.18 @ 288 (timm pre-act)

regnetv_064 - 83.1 @ 224, 83.71 @ 288 (timm pre-act)

regnetz_040 - 83.67 @ 256, 84.25 @ 320

regnetz_040h - 83.77 @ 256, 84.5 @ 320 (w/ extra fc in head)

resnetv2_50d_gn - 80.8 @ 224, 81.96 @ 288 (pre-act GroupNorm)

resnetv2_50d_evos - 80.77 @ 224, 82.04 @ 288 (pre-act EvoNormS)

regnetz_c16_evos - 81.9 @ 256, 82.64 @ 320 (EvoNormS)

regnetz_d8_evos - 83.42 @ 256, 84.04 @ 320 (EvoNormS)

xception41p - 82 @ 299 (timm pre-act)

xception65 - 83.17 @ 299

xception65p - 83.14 @ 299 (timm pre-act)

resnext101_64x4d - 82.46 @ 224, 83.16 @ 288

seresnext101_32x8d - 83.57 @ 224, 84.27 @ 288

resnetrs200 - 83.85 @ 256, 84.44 @ 320

HuggingFace hub support fixed w/ initial groundwork for allowing alternative 'config sources' for pretrained model definitions and weights (generic local file / remote url support soon)

SwinTransformer-V2 implementation added. Submitted by Christoph Reich. Training experiments and model changes by myself are ongoing so expect compat breaks.

Swin-S3 (AutoFormerV2) models / weights added from https://github.com/microsoft/Cream/tree/main/AutoFormerV2

MobileViT models w/ weights adapted from https://github.com/apple/ml-cvnets

PoolFormer models w/ weights adapted from https://github.com/sail-sg/poolformer

VOLO models w/ weights adapted from https://github.com/sail-sg/volo

Significant work experimenting with non-BatchNorm norm layers such as EvoNorm, FilterResponseNorm, GroupNorm, etc

Enhanced support for alternate norm + act ('NormAct') layers in a number of models, esp EfficientNet/MobileNetV3, RegNet, and aligned Xception

Grouped conv support added to EfficientNet family

Add 'group matching' API to all models to allow grouping model parameters for application of 'layer-wise' LR decay, lr scale added to LR scheduler

Gradient checkpointing support added to many models

forward_head(x, pre_logits=False) fn added to all models to allow separate calls of forward_features + forward_head

All vision transformer and vision MLP models updated to return non-pooled / non-token selected features from forward_features, for consistency with CNN models; token selection or pooling is now applied in forward_head
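
Layer-wise LR decay, mentioned in the 'group matching' item above, gives earlier layers a smaller learning rate than later ones. A minimal sketch of the scale computation follows. The function name, the parameter names in `layer_of_param`, and the decay value 0.75 are illustrative assumptions, not timm's API or defaults.

```python
def layer_lr_scales(num_layers, decay):
    """Return one LR multiplier per layer group, index 0 = earliest layer.

    The last group keeps scale 1.0; each step toward the input multiplies by `decay`.
    """
    return [decay ** (num_layers - 1 - i) for i in range(num_layers)]

# Hypothetical grouping: parameter name -> layer group index.
layer_of_param = {
    "stem.weight": 0,
    "blocks.0.weight": 1,
    "blocks.1.weight": 2,
    "head.weight": 3,
}

base_lr = 1e-3
scales = layer_lr_scales(num_layers=4, decay=0.75)
param_lr = {name: base_lr * scales[idx] for name, idx in layer_of_param.items()}
```

In practice the per-parameter rates would be passed to the optimizer as separate parameter groups, one per layer index.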

Feb 2, 2022

Chris Hughes posted an exhaustive run through of timm on his blog yesterday. Well worth a read. Getting Started with PyTorch Image Models (timm): A Practitioner’s Guide

I'm currently prepping to merge the norm_norm_norm branch back to master (ver 0.6.x) in next week or so.

The changes are more extensive than usual and may destabilize and break some model API use (aiming for full backwards compat). So, beware pip install git+https://github.com/rwightman/pytorch-image-models installs!

0.5.x releases and a 0.5.x branch will remain stable with a cherry pick or two until dust clears. Recommend sticking to pypi install for a bit if you want stable.

v0.1-weights-swinv2(Apr 3, 2022)
This release holds weights for timm's variant of Swin V2 (from @ChristophReich1996 impl, https://github.com/ChristophReich1996/Swin-Transformer-V2)

NOTE: ns variants of the models have extra norms on the main branch at the end of each stage, which seems to help training. The current small model is not using this, but one that does is currently training. Will have a non-ns tiny soon as well as a comparison. in21k and 1k base models are also in the works...

small checkpoints trained on TPU-VM instances via the TPU-Research Cloud (https://sites.research.google/trc/about/)

swin_v2_tiny_ns_224 - 81.80 top-1

swin_v2_small_224 - 83.13 top-1

swin_v2_small_ns_224 - 83.5 top-1

swin_v2_cr_small_224-0813c165.pth(189.63 MB)
swin_v2_cr_small_ns_224_iv-2ce90f8e.pth(189.64 MB)
swin_v2_cr_tiny_ns_224-ba8166c6.pth(108.11 MB)
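
The checkpoint names listed above end in a short hex suffix. Assuming this follows the torch.hub naming convention, where the suffix is the leading digits of the file's SHA256 digest, a downloaded file can be checked with the standard library alone. The `split_checkpoint_name` and `hash_matches` helpers are hypothetical names for this sketch.

```python
import hashlib
import re
from pathlib import Path

# <name>-<hex suffix>.pth, with at least 8 hex digits in the suffix.
_NAME_RE = re.compile(r"^(?P<name>.+)-(?P<hash>[0-9a-f]{8,})\.pth$")

def split_checkpoint_name(filename):
    """Return (model_name, hash_prefix), or None if no hash suffix is present."""
    match = _NAME_RE.match(filename)
    if match is None:
        return None
    return match.group("name"), match.group("hash")

def hash_matches(path):
    """Check that the file's SHA256 digest starts with the suffix in its name."""
    parsed = split_checkpoint_name(Path(path).name)
    if parsed is None:
        return False
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large checkpoints are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    return sha.hexdigest().startswith(parsed[1])

print(split_checkpoint_name("swin_v2_cr_small_224-0813c165.pth"))
```

Files without a hash suffix in their name cannot be checked this way, so `hash_matches` returns `False` for them.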
v0.1-tpu-weights(Mar 18, 2022)
A wide range of mid-large sized models trained in PyTorch XLA on TPU VM instances. Demonstrating viability of the TPU + PyTorch combo for excellent image model results. All models trained w/ the bits_and_tpu branch of this codebase.

A big thanks to the TPU Research Cloud (https://sites.research.google/trc/about/) for the compute used in these experiments.

This set includes several novel weights, including EvoNorm-S RegNetZ (C/D timm variants) and ResNet-V2 model experiments, as well as custom pre-activation model variants of RegNet-Y (called RegNet-V) and Xception (Xception-P) models.

Many if not all of the included RegNet weights surpass original paper results by a wide margin and remain above other known results (e.g. recent torchvision updates) in ImageNet-1k validation and especially OOD test set / robustness performance and scaling to higher resolutions.

RegNets

regnety_040 - 82.3 @ 224, 82.96 @ 288

regnety_064 - 83.0 @ 224, 83.65 @ 288

regnety_080 - 83.17 @ 224, 83.86 @ 288

regnetv_040 - 82.44 @ 224, 83.18 @ 288 (timm pre-act)

regnetv_064 - 83.1 @ 224, 83.71 @ 288 (timm pre-act)

regnetz_040 - 83.67 @ 256, 84.25 @ 320

regnetz_040h - 83.77 @ 256, 84.5 @ 320 (w/ extra fc in head)

Alternative norm layers (no BN!)

resnetv2_50d_gn - 80.8 @ 224, 81.96 @ 288 (pre-act GroupNorm)

resnetv2_50d_evos - 80.77 @ 224, 82.04 @ 288 (pre-act EvoNormS)

regnetz_c16_evos - 81.9 @ 256, 82.64 @ 320 (EvoNormS)

regnetz_d8_evos - 83.42 @ 256, 84.04 @ 320 (EvoNormS)

Xception redux

xception41p - 82 @ 299 (timm pre-act)

xception65 - 83.17 @ 299

xception65p - 83.14 @ 299 (timm pre-act)

ResNets (w/ SE and/or NeXT)

resnext101_64x4d - 82.46 @ 224, 83.16 @ 288

seresnext101_32x8d - 83.57 @ 224, 84.27 @ 288

seresnext101d_32x8d - 83.69 @ 224, 84.35 @ 288

seresnextaa101d_32x8d - 83.85 @ 224, 84.57 @ 288

resnetrs200 - 83.85 @ 256, 84.44 @ 320

Vision transformer experiments -- relpos, residual-post-norm, layer-scale, fc-norm, and GAP

vit_relpos_base_patch32_plus_rpn_256 - 79.5 @ 256, 80.6 @ 320 -- rel pos + extended width + res-post-norm, no class token, avg pool

vit_relpos_small_patch16_224 - 81.5 @ 224, 82.5 @ 320 -- rel pos, layer scale, no class token, avg pool

vit_relpos_medium_patch16_rpn_224 - 82.3 @ 224, 83.1 @ 320 -- rel pos + res-post-norm, no class token, avg pool

vit_base_patch16_rpn_224 - 82.3 @ 224 -- rel pos + res-post-norm, no class token, avg pool

vit_relpos_medium_patch16_224 - 82.5 @ 224, 83.3 @ 320 -- rel pos, layer scale, no class token, avg pool

vit_relpos_base_patch16_224 - 82.5 @ 224, 83.6 @ 320 -- rel pos, layer scale, no class token, avg pool

vit_relpos_base_patch16_gapcls_224 - 82.8 @ 224, 83.9 @ 320 -- rel pos, layer scale, class token, avg pool (by mistake)

cs3darknet_focus_l_c2ns-65ef8888.pth(80.84 MB)
cs3darknet_focus_m_c2ns-e23bed41.pth(35.59 MB)
cs3darknet_l_c2ns-16220c5d.pth(80.89 MB)
cs3darknet_m_c2ns-43f06604.pth(35.61 MB)
cs3darknet_x_c2ns-4e4490aa.pth(133.89 MB)
cs3edgenet_x_c2-2e1610a9.pth(182.65 MB)
cs3sedarknet_l_c2ns-e8d1dc13.pth(83.76 MB)
cs3sedarknet_x_c2ns-b4d0abc0.pth(135.25 MB)
cs3se_edgenet_x_c2ns-76f8e3ac.pth(193.73 MB)
darknet53_256_c2ns-3aeff817.pth(158.91 MB)
darknetaa53_c2ns-5c28ec8a.pth(137.60 MB)
regnetv_040_ra3-c248f51f.pth(79.02 MB)
regnetv_064_ra3-530616c2.pth(117.00 MB)
regnety_040_ra3-670e1166.pth(79.07 MB)
regnety_064_ra3-aa26dc7d.pth(117.06 MB)
regnety_080_ra3-1fdc4344.pth(149.84 MB)
regnetz_040h_ra3-f594343b.pth(110.99 MB)
regnetz_040_ra3-9007edf5.pth(104.03 MB)
regnetz_c16_evos_ch-d8311942.pth(51.50 MB)
regnetz_d8_evos_ch-2bc12646.pth(89.56 MB)
resnetrs200_c-6b698b88.pth(356.45 MB)
resnetv2_50d_evos_ah-7c4dd548.pth(97.65 MB)
resnetv2_50d_gn_ah-c415c11a.pth(97.56 MB)
resnext101_64x4d_c-0d0e0cc0.pth(319.22 MB)
seresnext101d_32x8d_ah-191d7b94.pth(357.89 MB)
seresnext101_32x8d_ah-e6bc4c0a.pth(357.82 MB)
seresnextaa101d_32x8d_ah-83c8ae12.pth(357.89 MB)
vit_base_patch16_rpn_224-sw-3b07e89d.pth(330.13 MB)
vit_relpos_base_patch16_224-sw-49049aed.pth(329.73 MB)
vit_relpos_base_patch16_gapcls_224-sw-1a341d6c.pth(329.73 MB)
vit_relpos_medium_patch16_224-sw-11c174af.pth(147.83 MB)
vit_relpos_medium_patch16_cls_224-sw-cfe8e259.pth(147.90 MB)
vit_relpos_medium_patch16_rpn_224-sw-5d2befd8.pth(147.78 MB)
vit_relpos_small_patch16_224-sw-ec2778b4.pth(83.89 MB)
vit_replos_base_patch32_plus_rpn_256-sw-dd486f51.pth(455.59 MB)
vit_srelpos_medium_patch16_224-sw-ad702b8c.pth(147.78 MB)
vit_srelpos_small_patch16_224-sw-6cdb8849.pth(83.84 MB)
xception41p_ra3-33195bc8.pth(102.89 MB)
xception65p_ra3-3c6114e4.pth(152.30 MB)
xception65_ra3-1447db8d.pth(153.09 MB)
v0.1-mvit-weights(Jan 31, 2022)

Pretrained weights for MobileViT and MobileViT-V2 adapted from Apple impl at https://github.com/apple/ml-cvnets

Checkpoints remapped to timm impl of the model with BGR corrected to RGB (for V1).
mobilevitv2_050-49951ee2.pth(5.29 MB)
mobilevitv2_075-b5556ef6.pth(11.01 MB)
mobilevitv2_100-e464ef3b.pth(18.79 MB)
mobilevitv2_125-0ae35027.pth(28.64 MB)
mobilevitv2_150-737c5019.pth(40.54 MB)
mobilevitv2_150_384_in22ft1k-9e142854.pth(40.54 MB)
mobilevitv2_150_in22ft1k-0b555d7b.pth(40.54 MB)
mobilevitv2_175-16462ee2.pth(54.51 MB)
mobilevitv2_175_384_in22ft1k-059cbe56.pth(54.51 MB)
mobilevitv2_175_in22ft1k-4117fa1f.pth(54.51 MB)
mobilevitv2_200-b3422f67.pth(70.53 MB)
mobilevitv2_200_384_in22ft1k-32c87503.pth(70.53 MB)
mobilevitv2_200_in22ft1k-1d7c8927.pth(70.53 MB)
mobilevit_s-38a5a959.pth(21.37 MB)
mobilevit_xs-8fbd6366.pth(8.92 MB)
mobilevit_xxs-ad385b40.pth(4.91 MB)
v0.5.4(Jan 17, 2022)

v0.1-rsb-weights(Oct 4, 2021)

Weights for ResNet Strikes Back

Paper: https://arxiv.org/abs/2110.00476

More details on weights and hparams to come...
convnext_atto_d2-01bb0f51.pth(14.11 MB)
convnext_atto_ols_a2-78d1c8f3.pth(14.14 MB)
convnext_femto_d1-d71d5b4c.pth(19.92 MB)
convnext_femto_ols_d1-246bf2ed.pth(19.95 MB)
convnext_nano_d1h-7eb4bdea.pth(59.50 MB)
convnext_nano_ols_d1h-ae424a9a.pth(59.72 MB)
convnext_pico_d1-10ad7f0d.pth(34.52 MB)
convnext_pico_ols_d1-611f0ca7.pth(34.58 MB)
convnext_tiny_hnf_a2h-ab7e9df2.pth(109.08 MB)
deit_base_patch16_a1_0-141881b8.pth(330.25 MB)
deit_base_patch16_a2_0-95d90282.pth(330.25 MB)
deit_small_patch16_a1_0-bfd3c1ab.pth(84.13 MB)
deit_small_patch16_a2_0-83b53863.pth(84.13 MB)
deit_tiny_patch16_a1_0-90f89490.pth(21.83 MB)
deit_tiny_patch16_a2_0-324fe5ea.pth(21.83 MB)
ecaresnet269d_a1_0-b848cf33.pth(390.64 MB)
ecaresnet269d_a2_0-715c86c9.pth(390.64 MB)
ecaresnet269d_a3_0-ec68cab2.pth(390.64 MB)
ecaresnet50t_a1_0-99bd76a8.pth(97.80 MB)
ecaresnet50t_a2_0-b1c7b745.pth(97.80 MB)
ecaresnet50t_a3_0-8cc311f1.pth(97.80 MB)
efficientnetv2_rw_m_a1_0-b788290c.pth(204.41 MB)
efficientnetv2_rw_m_a2_0-12297cd3.pth(204.41 MB)
efficientnetv2_rw_m_a3_0-68b15d26.pth(204.41 MB)
efficientnetv2_rw_s_a1_0-59d76611.pth(92.04 MB)
efficientnetv2_rw_s_a2_0-cafb8f99.pth(92.04 MB)
efficientnetv2_rw_s_a3_0-11105c48.pth(92.04 MB)
gluon_senet154_a1_0-ef9d383e.pth(440.12 MB)
gluon_senet154_a2_0-63cb3b08.pth(440.12 MB)
gluon_senet154_a3_0-d8df0fde.pth(440.12 MB)
regnety_040_a1_0-453380cb.pth(79.06 MB)
regnety_040_a2_0-acda0189.pth(79.06 MB)
regnety_040_a3_0-9705a0d6.pth(79.06 MB)
regnety_080_a1_0-7d647454.pth(149.83 MB)
regnety_080_a2_0-2298ae4e.pth(149.83 MB)
regnety_080_a3_0-2fb073a0.pth(149.83 MB)
regnety_160_a1_0-ed74711e.pth(319.39 MB)
regnety_160_a2_0-6631355e.pth(319.39 MB)
regnety_160_a3_0-9ee45d21.pth(319.39 MB)
regnety_320_a1_0-6c920aed.pth(553.97 MB)
regnety_320_a2_0-a9fedcbf.pth(553.97 MB)
regnety_320_a3_0-242d2987.pth(553.97 MB)
resnet101_a1h-36d3f2aa.pth(170.43 MB)
resnet101_a1_0-cdcb52a9.pth(170.42 MB)
resnet101_a2_0-6edb36c7.pth(170.42 MB)
resnet101_a3_0-1db14157.pth(170.42 MB)
resnet10t_176_c3-f3215ab1.pth(20.76 MB)
resnet14t_176_c3-c4ed2c37.pth(38.54 MB)
resnet152_a1h-dc400468.pth(230.32 MB)
resnet152_a1_0-2eee8a7a.pth(230.31 MB)
resnet152_a2_0-b4c6978f.pth(230.31 MB)
resnet152_a3_0-134d4688.pth(230.31 MB)
resnet18_a1_0-d63eafa0.pth(44.64 MB)
resnet18_a2_0-b61bd467.pth(44.64 MB)
resnet18_a3_0-40c531c8.pth(44.64 MB)
resnet34_a1_0-46f8f793.pth(83.24 MB)
resnet34_a2_0-82d47d71.pth(83.24 MB)
resnet34_a3_0-a20cabb6.pth(83.24 MB)
resnet50d_a1_0-e20cff14.pth(97.81 MB)
resnet50d_a2_0-a3adc64d.pth(97.81 MB)
resnet50d_a3_0-403fdfad.pth(97.81 MB)
resnet50_a1h-35c100f8.pth(97.74 MB)
resnet50_a1h2_176-001a1197.pth(97.74 MB)
resnet50_a1_0-14fe96d1.pth(97.73 MB)
resnet50_a2_0-a2746f79.pth(97.73 MB)
resnet50_a3_0-59cae1ef.pth(97.73 MB)
resnet50_b1k-532a802a.pth(97.74 MB)
resnet50_b2k-1ba180c1.pth(97.74 MB)
resnet50_c1-5ba5e060.pth(97.74 MB)
resnet50_c2-d01e05b2.pth(97.74 MB)
resnet50_d-f39db8af.pth(97.74 MB)
resnet50_ft_cars_a1.pth(91.49 MB)
resnet50_ft_cars_a2.pth(91.49 MB)
resnet50_ft_cars_a3.pth(91.49 MB)
resnet50_ft_cars_pt.pth(91.49 MB)
resnet50_ft_cifar100_a1.pth(90.74 MB)
resnet50_ft_cifar100_a2.pth(90.74 MB)
resnet50_ft_cifar100_a3.pth(90.74 MB)
resnet50_ft_cifar100_pt.pth(90.74 MB)
resnet50_ft_cifar10_a1.pth(90.04 MB)
resnet50_ft_cifar10_a2.pth(90.04 MB)
resnet50_ft_cifar10_a3.pth(90.04 MB)
resnet50_ft_cifar10_pt.pth(90.04 MB)
resnet50_ft_flowers_a1.pth(90.76 MB)
resnet50_ft_flowers_a2.pth(90.76 MB)
resnet50_ft_flowers_a3.pth(90.76 MB)
resnet50_ft_flowers_pt.pth(90.76 MB)
resnet50_ft_inat19_a1.pth(97.85 MB)
resnet50_ft_inat19_a2.pth(97.85 MB)
resnet50_ft_inat19_a3.pth(97.85 MB)
resnet50_ft_inat19_pt.pth(97.85 MB)
resnet50_gn_a1h2-8fe6c4d0.pth(97.51 MB)
resnetaa50_a1h-4cf422b3.pth(97.74 MB)
resnetv2_101_a1h-5d01f016.pth(170.37 MB)
resnetv2_50_a1h-000cdf49.pth(97.68 MB)
resnext50_32x4d_a1h-0146ab0a.pth(95.78 MB)
resnext50_32x4d_a1_0-b5a91a1d.pth(95.77 MB)
resnext50_32x4d_a2_0-efc76add.pth(95.77 MB)
resnext50_32x4d_a3_0-3e450271.pth(95.77 MB)
seresnet50_a1_0-ffa00869.pth(107.39 MB)
seresnet50_a2_0-850de0d9.pth(107.39 MB)
seresnet50_a3_0-317ecd56.pth(107.39 MB)
tf_efficientnet_b0_a1_0-9188dd46.pth(20.38 MB)
tf_efficientnet_b0_a2_0-48bede62.pth(20.38 MB)
tf_efficientnet_b0_a3_0-94e799dc.pth(20.38 MB)
tf_efficientnet_b1_a1_0-b55e845c.pth(30.03 MB)
tf_efficientnet_b1_a2_0-d342a7bf.pth(30.03 MB)
tf_efficientnet_b1_a3_0-ee9f9669.pth(30.03 MB)
tf_efficientnet_b2_a1_0-f1382665.pth(35.07 MB)
tf_efficientnet_b2_a2_0-ae4f4996.pth(35.07 MB)
tf_efficientnet_b2_a3_0-61f0f688.pth(35.07 MB)
tf_efficientnet_b3_a1_0-efc81b92.pth(47.07 MB)
tf_efficientnet_b3_a2_0-e183dbec.pth(47.07 MB)
tf_efficientnet_b3_a3_0-0a50fa9a.pth(47.07 MB)
tf_efficientnet_b4_a1_0-182bef54.pth(74.35 MB)
tf_efficientnet_b4_a2_0-bc5f172e.pth(74.35 MB)
tf_efficientnet_b4_a3_0-a6a8179a.pth(74.35 MB)
v0.1-attn-weights(Sep 4, 2021)
A collection of weights I've trained comparing various types of SE-like (SE, ECA, GC, etc), self-attention (bottleneck, halo, lambda) blocks, and related non-attn baselines.

ResNet-26-T series

[2, 2, 2, 2] repeat Bottleneck block ResNet architecture

ReLU activations

3 layer stem with 24, 32, 64 chs, max-pool

avg pool in shortcut downsample

self-attn blocks replace 3x3 in both blocks for last stage, and second block of penultimate stage

| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| botnet26t_256 | 79.246 | 20.754 | 94.53 | 5.47 | 12.49 | 256 | 0.95 | bicubic |
| halonet26t | 79.13 | 20.87 | 94.314 | 5.686 | 12.48 | 256 | 0.95 | bicubic |
| lambda_resnet26t | 79.112 | 20.888 | 94.59 | 5.41 | 10.96 | 256 | 0.94 | bicubic |
| lambda_resnet26rpt_256 | 78.964 | 21.036 | 94.428 | 5.572 | 10.99 | 256 | 0.94 | bicubic |
| resnet26t | 77.872 | 22.128 | 93.834 | 6.166 | 16.01 | 256 | 0.94 | bicubic |

Details:

HaloNet - 8 pixel block size, 2 pixel halo (overlap), relative position embedding

BotNet - relative position embedding

Lambda-ResNet-26-T - 3d lambda conv, kernel = 9

Lambda-ResNet-26-RPT - relative position embedding
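
To make the HaloNet setting above concrete: in blocked local attention with a halo, each non-overlapping query block attends to a key/value window that extends the block by the halo size on every side. The sketch below works out the sizes for the 8 pixel block and 2 pixel halo listed here. The 16x16 feature map is an illustrative assumption, not a value from these notes.

```python
def halo_window(block_size, halo_size):
    """Return (queries per block, keys per block) for 2D blocked attention with a halo."""
    win_size = block_size + 2 * halo_size
    return block_size ** 2, win_size ** 2

def num_blocks(feat_size, block_size):
    """Number of non-overlapping query blocks in a square feature map."""
    if feat_size % block_size != 0:
        raise ValueError("feature map size must be divisible by block size")
    return (feat_size // block_size) ** 2

queries, keys = halo_window(block_size=8, halo_size=2)
blocks = num_blocks(feat_size=16, block_size=8)
print(queries, keys, blocks)  # 64 144 4
```

Each block of 64 queries therefore attends to a 12x12 window of 144 keys, rather than to the whole feature map.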

Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet26t | 2967.55 | 86.252 | 256 | 256 | 857.62 | 297.984 | 256 | 256 | 16.01 |
| botnet26t_256 | 2642.08 | 96.879 | 256 | 256 | 809.41 | 315.706 | 256 | 256 | 12.49 |
| halonet26t | 2601.91 | 98.375 | 256 | 256 | 783.92 | 325.976 | 256 | 256 | 12.48 |
| lambda_resnet26t | 2354.1 | 108.732 | 256 | 256 | 697.28 | 366.521 | 256 | 256 | 10.96 |
| lambda_resnet26rpt_256 | 1847.34 | 138.563 | 256 | 256 | 644.84 | 197.892 | 128 | 256 | 10.99 |
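
Some rows in these benchmark tables use a train batch size of 128 rather than 256, so step times are not directly comparable across rows, while samples per second is. Assuming step time is reported in milliseconds per batch (the figures are consistent with that reading), the two columns relate as sketched below. The helper name is hypothetical.

```python
def samples_per_sec(batch_size, step_time_ms):
    """Throughput implied by a batch size and a per-batch step time in milliseconds."""
    return batch_size / (step_time_ms / 1000.0)

# (batch size, step time in ms, reported samples/sec), copied from the NCHW table above.
rows = {
    "resnet26t infer": (256, 86.252, 2967.55),
    "lambda_resnet26rpt_256 train": (128, 197.892, 644.84),
}

for name, (batch, step_ms, reported) in rows.items():
    implied = samples_per_sec(batch, step_ms)
    print(f"{name}: implied {implied:.1f}, reported {reported:.1f}")
```

The implied and reported values agree to within about one percent, which is why the lower step time of the batch-128 row does not indicate a faster model.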

Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet26t | 3691.94 | 69.327 | 256 | 256 | 1188.17 | 214.96 | 256 | 256 | 16.01 |
| botnet26t_256 | 3291.63 | 77.76 | 256 | 256 | 1126.68 | 226.653 | 256 | 256 | 12.49 |
| halonet26t | 3230.5 | 79.232 | 256 | 256 | 1077.82 | 236.934 | 256 | 256 | 12.48 |
| lambda_resnet26rpt_256 | 2324.15 | 110.133 | 256 | 256 | 864.42 | 147.485 | 128 | 256 | 10.99 |
| lambda_resnet26t | Not Supported | | | | | | | | |

ResNeXT-26-T series

[2, 2, 2, 2] repeat Bottleneck block ResNeXt architectures

SiLU activations

grouped 3x3 convolutions in bottleneck, 32 channels per group

3 layer stem with 24, 32, 64 chs, max-pool

avg pool in shortcut downsample

channel attn (active in non self-attn blocks) between 3x3 and last 1x1 conv

when active, self-attn blocks replace 3x3 conv in both blocks for last stage, and second block of penultimate stage
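
For a sense of what "32 channels per group" means for the 3x3 convolution: a grouped convolution has `k*k*(C_in/groups)*C_out` weights, so each output channel only sees the input channels in its own group. The sketch below compares dense, grouped, and depthwise weight counts. The 256-channel width is an illustrative assumption, not a value from these notes.

```python
def conv_weight_count(in_chs, out_chs, kernel_size, groups=1):
    """Weights in a 2D convolution (no bias): each output channel sees in_chs / groups inputs."""
    if in_chs % groups or out_chs % groups:
        raise ValueError("channels must be divisible by groups")
    return kernel_size * kernel_size * (in_chs // groups) * out_chs

chs = 256            # assumed bottleneck width, for illustration only
group_width = 32     # "32 channels per group" from the list above

dense = conv_weight_count(chs, chs, 3)                                # groups = 1
grouped = conv_weight_count(chs, chs, 3, groups=chs // group_width)   # groups = 8
depthwise = conv_weight_count(chs, chs, 3, groups=chs)                # one channel per group
print(dense, grouped, depthwise)  # 589824 73728 2304
```

At this width the grouped 3x3 has one eighth of the dense layer's weights, sitting between a dense and a depthwise convolution.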

| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| eca_halonext26ts | 79.484 | 20.516 | 94.600 | 5.400 | 10.76 | 256 | 0.94 | bicubic |
| eca_botnext26ts_256 | 79.270 | 20.730 | 94.594 | 5.406 | 10.59 | 256 | 0.95 | bicubic |
| bat_resnext26ts | 78.268 | 21.732 | 94.1 | 5.9 | 10.73 | 256 | 0.9 | bicubic |
| seresnext26ts | 77.852 | 22.148 | 93.784 | 6.216 | 10.39 | 256 | 0.9 | bicubic |
| gcresnext26ts | 77.804 | 22.196 | 93.824 | 6.176 | 10.48 | 256 | 0.9 | bicubic |
| eca_resnext26ts | 77.446 | 22.554 | 93.57 | 6.43 | 10.3 | 256 | 0.9 | bicubic |
| resnext26ts | 76.764 | 23.236 | 93.136 | 6.864 | 10.3 | 256 | 0.9 | bicubic |

Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnext26ts | 3006.57 | 85.134 | 256 | 256 | 864.4 | 295.646 | 256 | 256 | 10.3 |
| seresnext26ts | 2931.27 | 87.321 | 256 | 256 | 836.92 | 305.193 | 256 | 256 | 10.39 |
| eca_resnext26ts | 2925.47 | 87.495 | 256 | 256 | 837.78 | 305.003 | 256 | 256 | 10.3 |
| gcresnext26ts | 2870.01 | 89.186 | 256 | 256 | 818.35 | 311.97 | 256 | 256 | 10.48 |
| eca_botnext26ts_256 | 2652.03 | 96.513 | 256 | 256 | 790.43 | 323.257 | 256 | 256 | 10.59 |
| eca_halonext26ts | 2593.03 | 98.705 | 256 | 256 | 766.07 | 333.541 | 256 | 256 | 10.76 |
| bat_resnext26ts | 2469.78 | 103.64 | 256 | 256 | 697.21 | 365.964 | 256 | 256 | 10.73 |

Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09

NOTE: there are performance issues with certain grouped conv configs with channels last layout, backwards pass in particular is really slow. Also causing issues for RegNet and NFNet networks.

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnext26ts | 3952.37 | 64.755 | 256 | 256 | 608.67 | 420.049 | 256 | 256 | 10.3 |
| eca_resnext26ts | 3815.77 | 67.074 | 256 | 256 | 594.35 | 430.146 | 256 | 256 | 10.3 |
| seresnext26ts | 3802.75 | 67.304 | 256 | 256 | 592.82 | 431.14 | 256 | 256 | 10.39 |
| gcresnext26ts | 3626.97 | 70.57 | 256 | 256 | 581.83 | 439.119 | 256 | 256 | 10.48 |
| eca_botnext26ts_256 | 3515.84 | 72.8 | 256 | 256 | 611.71 | 417.862 | 256 | 256 | 10.59 |
| eca_halonext26ts | 3410.12 | 75.057 | 256 | 256 | 597.52 | 427.789 | 256 | 256 | 10.76 |
| bat_resnext26ts | 3053.83 | 83.811 | 256 | 256 | 533.23 | 478.839 | 256 | 256 | 10.73 |

ResNet-33-T series

[2, 3, 3, 2] repeat Bottleneck block ResNet architecture

SiLU activations

3 layer stem with 24, 32, 64 chs, no max-pool, 1st and 3rd conv stride 2

avg pool in shortcut downsample

channel attn (active in non self-attn blocks) between 3x3 and last 1x1 conv

when active, self-attn blocks replace 3x3 conv last block of stage 2 and 3, and both blocks of final stage

FC 1x1 conv between last block and classifier

The 33-layer models have an extra 1x1 FC layer between last conv block and classifier. There is both a non-attention 33 layer baseline and a 32 layer without the extra FC.

| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| sehalonet33ts | 80.986 | 19.014 | 95.272 | 4.728 | 13.69 | 256 | 0.94 | bicubic |
| seresnet33ts | 80.388 | 19.612 | 95.108 | 4.892 | 19.78 | 256 | 0.94 | bicubic |
| eca_resnet33ts | 80.132 | 19.868 | 95.054 | 4.946 | 19.68 | 256 | 0.94 | bicubic |
| gcresnet33ts | 79.99 | 20.01 | 94.988 | 5.012 | 19.88 | 256 | 0.94 | bicubic |
| resnet33ts | 79.352 | 20.648 | 94.596 | 5.404 | 19.68 | 256 | 0.94 | bicubic |
| resnet32ts | 79.028 | 20.972 | 94.444 | 5.556 | 17.96 | 256 | 0.94 | bicubic |

Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet32ts | 2502.96 | 102.266 | 256 | 256 | 733.27 | 348.507 | 256 | 256 | 17.96 |
| resnet33ts | 2473.92 | 103.466 | 256 | 256 | 725.34 | 352.309 | 256 | 256 | 19.68 |
| seresnet33ts | 2400.18 | 106.646 | 256 | 256 | 695.19 | 367.413 | 256 | 256 | 19.78 |
| eca_resnet33ts | 2394.77 | 106.886 | 256 | 256 | 696.93 | 366.637 | 256 | 256 | 19.68 |
| gcresnet33ts | 2342.81 | 109.257 | 256 | 256 | 678.22 | 376.404 | 256 | 256 | 19.88 |
| sehalonet33ts | 1857.65 | 137.794 | 256 | 256 | 577.34 | 442.545 | 256 | 256 | 13.69 |

Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet32ts | 3306.22 | 77.416 | 256 | 256 | 1012.82 | 252.158 | 256 | 256 | 17.96 |
| resnet33ts | 3257.59 | 78.573 | 256 | 256 | 1002.38 | 254.778 | 256 | 256 | 19.68 |
| seresnet33ts | 3128.08 | 81.826 | 256 | 256 | 950.27 | 268.581 | 256 | 256 | 19.78 |
| eca_resnet33ts | 3127.11 | 81.852 | 256 | 256 | 948.84 | 269.123 | 256 | 256 | 19.68 |
| gcresnet33ts | 2984.87 | 85.753 | 256 | 256 | 916.98 | 278.169 | 256 | 256 | 19.88 |
| sehalonet33ts | 2188.23 | 116.975 | 256 | 256 | 711.63 | 179.03 | 128 | 256 | 13.69 |

ResNet-50(ish) models

In Progress

RegNet"Z" series

RegNetZ inspired architecture, inverted bottleneck, SE attention, pre-classifier FC, essentially an EfficientNet w/ grouped conv instead of depthwise

b, c, and d are three different sizes I put together to cover differing flop ranges, not based on the paper (https://arxiv.org/abs/2103.06877) or a search process

for comparison to RegNetY and paper RegNetZ models, at 224x224 the b, c, and d models are 1.45, 1.92, and 4.58 GMACs respectively; c and d are trained at 256 here, so their GMACs are higher than that (see tables)

haloregnetz_c uses halo attention for all of last stage, and interleaved every 3 (for 4) of penultimate stage

b, c variants use a stem / 1st stage like the paper, d uses a 3-deep tiered stem with 2-1-2 striding
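
The GMACs quoted above for 224x224 can be related to the 256x256 figures in this section's benchmark tables by assuming compute grows with the number of input pixels, i.e. with the square of the resolution ratio. That scaling is an approximation for convolutional networks, not something stated in these notes, but it reproduces the table values closely:

```python
def scale_gmacs(gmacs, from_res, to_res):
    """Estimate GMACs at a new square input resolution, assuming cost scales with pixel count."""
    return gmacs * (to_res / from_res) ** 2

# GMACs at 224x224 as quoted in the text; c and d appear at 256x256 in the tables.
at_224 = {"regnetz_c": 1.92, "regnetz_d": 4.58}
for name, gmacs in at_224.items():
    print(name, round(scale_gmacs(gmacs, 224, 256), 2))
# regnetz_c 2.51
# regnetz_d 5.98
```

The same rule gives a quick estimate of the cost of evaluating at a higher test resolution such as 288 or 320.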

ImageNet-1k validation at train resolution

| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| regnetz_d | 83.422 | 16.578 | 96.636 | 3.364 | 27.58 | 256 | 0.95 | bicubic |
| regnetz_c | 82.164 | 17.836 | 96.058 | 3.942 | 13.46 | 256 | 0.94 | bicubic |
| haloregnetz_b | 81.058 | 18.942 | 95.2 | 4.8 | 11.68 | 224 | 0.94 | bicubic |
| regnetz_b | 79.868 | 20.132 | 94.988 | 5.012 | 9.72 | 224 | 0.94 | bicubic |

ImageNet-1k validation at optimal test res

| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| regnetz_d | 84.04 | 15.96 | 96.87 | 3.13 | 27.58 | 320 | 0.95 | bicubic |
| regnetz_c | 82.516 | 17.484 | 96.356 | 3.644 | 13.46 | 320 | 0.94 | bicubic |
| haloregnetz_b | 81.058 | 18.942 | 95.2 | 4.8 | 11.68 | 224 | 0.94 | bicubic |
| regnetz_b | 80.728 | 19.272 | 95.47 | 4.53 | 9.72 | 288 | 0.94 | bicubic |

Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | infer_GMACs | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|---|
| regnetz_b | 2703.42 | 94.68 | 256 | 224 | 1.45 | 764.85 | 333.348 | 256 | 224 | 9.72 |
| haloregnetz_b | 2086.22 | 122.695 | 256 | 224 | 1.88 | 620.1 | 411.415 | 256 | 224 | 11.68 |
| regnetz_c | 1653.19 | 154.836 | 256 | 256 | 2.51 | 459.41 | 277.268 | 128 | 256 | 13.46 |
| regnetz_d | 1060.91 | 241.284 | 256 | 256 | 5.98 | 296.51 | 430.143 | 128 | 256 | 27.58 |

Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09

NOTE: channels last layout is painfully slow for backward pass here due to some sort of cuDNN issue

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | infer_GMACs | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|---|
| regnetz_b | 4152.59 | 61.634 | 256 | 224 | 1.45 | 399.37 | 639.572 | 256 | 224 | 9.72 |
| haloregnetz_b | 2770.78 | 92.378 | 256 | 224 | 1.88 | 364.22 | 701.386 | 256 | 224 | 11.68 |
| regnetz_c | 2512.4 | 101.878 | 256 | 256 | 2.51 | 376.72 | 338.372 | 128 | 256 | 13.46 |
| regnetz_d | 1456.05 | 175.8 | 256 | 256 | 5.98 | 111.32 | 1148.279 | 128 | 256 | 27.58 |
bat_resnext26ts_256-fa6fd595.pth(41.13 MB)
botnet26t_a1h_256-f2406920.pth(47.78 MB)
botnet26t_c1_256-167a0e9f.pth(47.78 MB)
eca_botnext26ts_c_256-95a898f6.pth(40.55 MB)
eca_halonext26ts_256-1e55880b.pth(41.18 MB)
eca_halonext26ts_c_256-06906299.pth(41.18 MB)
eca_resnet33ts_256-8f98face.pth(75.24 MB)
eca_resnext26ts_256-5a1d030f.pth(39.43 MB)
edgenext_small_rw-sw-b00041bb.pth(29.88 MB)
gcresnet33ts_256-0e0cd345.pth(76.03 MB)
gcresnet50t_256-96374d1c.pth(99.06 MB)
gcresnext26ts_256-e414378b.pth(40.12 MB)
gcresnext50ts_256-3e0f515e.pth(60.03 MB)
halo2botnet50ts_a1h2_256-fd9c11a3.pth(86.60 MB)
halo2botnet50ts_a1h_256-ad9e16fb.pth(86.60 MB)
halonet26t_256-9b4bf0b3.pth(44.48 MB)
halonet26t_a1h_256-3083328c.pth(47.75 MB)
halonet50ts_256_ra3-f07eab9f.pth(86.97 MB)
halonet50ts_a1h2_256-f3a3daee.pth(86.97 MB)
halonet50ts_a1h_256-c6d7ff15.pth(86.97 MB)
haloregnetz_c_raa_256-c8ad7616.pth(44.82 MB)
lambda_resnet26rpt_a2h_256-482adad8.pth(42.07 MB)
lambda_resnet26rpt_c_256-ab00292d.pth(42.07 MB)
lambda_resnet26t_256-b040fce6.pth(41.95 MB)
lambda_resnet26t_a2h_256-25ded63d.pth(41.95 MB)
lambda_resnet26t_c_256-e5a5c857.pth(41.95 MB)
lambda_resnet50ts_a1h_256-b87370f7.pth(82.42 MB)
lamhalobotnet50ts_a1h2_256-fe3d9445.pth(86.35 MB)
lamhalobotnet_a1h_256-c9bc4e74.pth(86.35 MB)
regnetz_b_raa-677d9606.pth(37.32 MB)
regnetz_c_rab2_256-a54bf36a.pth(51.66 MB)
regnetz_c_rab_256-6bdb3c01.pth(51.66 MB)
regnetz_d8_bh-afc03c55.pth(89.59 MB)
regnetz_d_rab_256-b8073a89.pth(105.63 MB)
regnetz_e8_bh-aace8e6e.pth(220.84 MB)
resnet26t_256_ra2-6f6fa748.pth(61.22 MB)
resnet32ts_256-aacf5250.pth(68.70 MB)
resnet33ts_256-e91b09a4.pth(75.24 MB)
resnext26ts_256_ra2-8bbd9106.pth(39.42 MB)
sebotnet33ts_a1h2_256-957e3c3e.pth(52.44 MB)
sehalonet33ts_256-87e053f9.pth(52.40 MB)
seresnet33ts_256-f8ad44d9.pth(75.64 MB)
seresnext26ts_256-6f0d74a3.pth(39.77 MB)
v0.4.12(Jun 30, 2021)
Vision Transformer AugReg weights and model defs (https://arxiv.org/abs/2106.10270)

ResMLP official weights

ECA-NFNet-L2 weights

gMLP-S weights

ResNet51-Q

Visformer, LeViT, ConViT, Twins

Many fixes, improvements, better test coverage

v0.1-vt3p-weights(May 21, 2021)
A catch-all (ish) release for storing vision transformer weights adapted/rehosted from 3rd parties. Too many incoming models for one release per source...

Containing weights from:

Twins - https://github.com/Meituan-AutoML/Twins

Visformer - https://github.com/danczs/Visformer/issues/2

NesT (Aggregated Nested Transformer) - weights converted from https://github.com/google-research/nested-transformer by @alexander-soare's script

jx_nest_base-8bc41011.pth(258.39 MB)
jx_nest_small-422eaded.pth(146.34 MB)
jx_nest_tiny-e3428fb9.pth(65.09 MB)
twins_pcpvt_base-e5ecb09b.pth(167.26 MB)
twins_pcpvt_large-d273f802.pth(232.76 MB)
twins_pcpvt_small-e70e7e7a.pth(92.00 MB)
twins_svt_base-c2265010.pth(213.94 MB)
twins_svt_large-90f6aaa9.pth(378.74 MB)
twins_svt_small-42e5f78c.pth(91.82 MB)
visformer_small-839e1f5b.pth(153.55 MB)
v0.4.9(May 18, 2021)

v0.1-effv2-weights(May 14, 2021)

Weights from https://github.com/google/automl/tree/master/efficientnetv2

Paper: EfficientNetV2: Smaller Models and Faster Training - https://arxiv.org/abs/2104.00298
tf_efficientnetv2_b0-c7cc451f.pth(27.52 MB)
tf_efficientnetv2_b1-be6e41b0.pth(31.40 MB)
tf_efficientnetv2_b2-847de54e.pth(38.90 MB)
tf_efficientnetv2_b3-57773f13.pth(55.28 MB)
tf_efficientnetv2_l-d664b728.pth(454.28 MB)
tf_efficientnetv2_l_21ft1k-60127a9d.pth(454.28 MB)
tf_efficientnetv2_l_21k-91a19ec9.pth(556.13 MB)
tf_efficientnetv2_m-cc09e0cd.pth(207.80 MB)
tf_efficientnetv2_m_21ft1k-bf41664a.pth(207.80 MB)
tf_efficientnetv2_m_21k-361418a2.pth(309.65 MB)
tf_efficientnetv2_s-eb54923e.pth(82.55 MB)
tf_efficientnetv2_s_21ft1k-d7dafa41.pth(82.55 MB)
tf_efficientnetv2_s_21k-6337ad01.pth(184.41 MB)
tf_efficientnetv2_xl_in21ft1k-06c35c48.pth(797.17 MB)
tf_efficientnetv2_xl_in21k-fd7e8abf.pth(899.02 MB)
v0.1-rs-weights(May 4, 2021)

Weights for ResNet-RS models as per #554. Ported from TensorFlow impl (https://github.com/tensorflow/tpu/tree/master/models/official/resnet/resnet_rs) by @amaarora
resnetrs101-3e4bb55c.pth(243.20 MB)
resnetrs101_i192_ema-1509bbf6.pth(243.20 MB)
resnetrs152-b1efe56d.pth(331.18 MB)
resnetrs152_i256_ema-a9aff7f9.pth(331.18 MB)
resnetrs200-b455b791.pth(356.45 MB)
resnetrs200_ema-623d2f59.pth(356.45 MB)
resnetrs270-cafcfbc7.pth(496.60 MB)
resnetrs270_ema-b40e674c.pth(496.60 MB)
resnetrs350-06d9bfac.pth(627.01 MB)
resnetrs350_i256_ema-5a1aa8f1.pth(627.01 MB)
resnetrs420-d26764a5.pth(733.87 MB)
resnetrs420_ema-972dee69.pth(733.87 MB)
resnetrs50-7c9728e2.pth(136.41 MB)
resnetrs50_ema-6b53758b.pth(136.41 MB)
v0.1-coat-weights(Apr 28, 2021)

Weights for CoaT: Co-Scale Conv-Attentional Image Transformers (from https://github.com/mlpc-ucsd/CoaT)
coat_lite_mini-d7842000.pth(42.04 MB)
coat_lite_small-fea1d5a1.pth(75.73 MB)
coat_lite_tiny-461b07a7.pth(21.86 MB)
coat_mini-2c6baf49.pth(39.51 MB)
coat_tiny-473c2a20.pth(21.05 MB)
v0.1-pit-weights(Mar 31, 2021)

Weights from https://github.com/naver-ai/pit

Copyright 2021-present NAVER Corp.

Rehosted here for easy PyTorch Hub downloads.
Source code(tar.gz)
Source code(zip)
pit_b_820.pth(281.45 MB)
pit_b_distill_840.pth(285.36 MB)
pit_s_809.pth(89.56 MB)
pit_s_distill_819.pth(91.76 MB)
pit_ti_730.pth(18.55 MB)
pit_ti_distill_746.pth(19.53 MB)
pit_xs_781.pth(40.56 MB)
pit_xs_distill_791.pth(42.03 MB)
v0.4.5(Mar 8, 2021)

Source code(tar.gz)
Source code(zip)
v0.1-dnf-weights(Feb 18, 2021)

Weights converted from the DeepMind Haiku implementation of NFNets (https://github.com/deepmind/deepmind-research/tree/master/nfnets)
Source code(tar.gz)
Source code(zip)
dm_nfnet_f0-604f9c3a.pth(272.74 MB)
dm_nfnet_f1-fc540f82.pth(506.02 MB)
dm_nfnet_f2-89875923.pth(739.30 MB)
dm_nfnet_f3-d74ab3aa.pth(972.58 MB)
dm_nfnet_f4-0ac5b10b.pth(1205.86 MB)
dm_nfnet_f5-ecb20ab1.pth(1439.15 MB)
dm_nfnet_f6-e0f12116.pth(1672.43 MB)
v0.1-repvgg-weights(Feb 9, 2021)

Checkpoints remapped from official repository at https://github.com/DingXiaoH/RepVGG
Source code(tar.gz)
Source code(zip)
repvgg_a2-c1ee6d2b.pth(107.82 MB)
repvgg_b0-80ac3f1b.pth(60.54 MB)
repvgg_b1-77ca2989.pth(219.34 MB)
repvgg_b1g4-abde5d92.pth(152.78 MB)
repvgg_b2-25b7494e.pth(339.98 MB)
repvgg_b2g4-165a85f2.pth(235.98 MB)
repvgg_b3-199bc50d.pth(469.98 MB)
repvgg_b3g4-73c370bf.pth(320.21 MB)
v0.1-ger-weights(Feb 9, 2021)

Checkpoints remapped from official repo at https://github.com/idstcv/GPU-Efficient-Networks
Source code(tar.gz)
Source code(zip)
gernet_l-f31e2e8d.pth(118.99 MB)
gernet_m-0873c53a.pth(80.94 MB)
gernet_s-756b4751.pth(31.35 MB)
v0.3.4(Jan 8, 2021)

Source code(tar.gz)
Source code(zip)
v0.3.3(Jan 3, 2021)

Source code(tar.gz)
Source code(zip)
v0.1-vitjx(Oct 26, 2020)

Converted to PyTorch from https://github.com/google-research/vision_transformer
Source code(tar.gz)
Source code(zip)
jx_mixer_b16_224-76587d61.pth(228.44 MB)
jx_mixer_b16_224_in21k-617b3de2.pth(289.59 MB)
jx_mixer_l16_224-92f9adc4.pth(794.24 MB)
jx_mixer_l16_224_in21k-846aa33c.pth(875.74 MB)
jx_vit_base_p16_224-80ecf9dd.pth(330.25 MB)
jx_vit_base_p16_384-83fb41ba.pth(331.36 MB)
jx_vit_base_p32_384-830016f5.pth(336.84 MB)
jx_vit_base_patch16_224_in21k-e5005f0a.pth(393.64 MB)
jx_vit_base_patch32_224_in21k-8db57226.pth(399.96 MB)
jx_vit_base_resnet50_224_in21k-6f7c7740.pth(439.79 MB)
jx_vit_base_resnet50_384-9fd3c705.pth(377.51 MB)
jx_vit_large_p16_224-4ee7a4dc.pth(1160.95 MB)
jx_vit_large_p16_384-b3be5167.pth(1162.44 MB)
jx_vit_large_p32_384-9b920ba8.pth(1169.75 MB)
jx_vit_large_patch16_224_in21k-606da67d.pth(1246.45 MB)
jx_vit_large_patch32_224_in21k-9046d2e7.pth(1254.88 MB)
v0.2.1(Aug 13, 2020)
Aug 12, 2020

New/updated weights from training experiments

EfficientNet-B3 - 82.1 top-1 (vs 81.6 for official with AA and 81.9 for AdvProp)

RegNetY-3.2GF - 82.0 top-1 (78.9 from official ver)

CSPResNet50 - 79.6 top-1 (76.6 from official ver)

Add CutMix integrated w/ Mixup. See pull request for some usage examples

Some fixes for using pretrained weights with in_chans != 3 on several models.

Aug 5, 2020

Universal feature extraction, new models, new weights, new test sets.

All models support the features_only=True argument to the create_model call, which returns a network that extracts feature maps from the deepest layer at each stride.

New models

CSPResNet, CSPResNeXt, CSPDarkNet, DarkNet

ReXNet

(Modified Aligned) Xception41/65/71 (a proper port of TF models)

New trained weights

SEResNet50 - 80.3 top-1

CSPDarkNet53 - 80.1 top-1

CSPResNeXt50 - 80.0 top-1

DPN68b - 79.2 top-1

EfficientNet-Lite0 (non-TF ver) - 75.5 (submitted by @hal-314)

Add 'real' labels for ImageNet and ImageNet-Renditions test set, see results/README.md

Test set ranking/top-n diff script by @KushajveerSingh

Train script and loader/transform tweaks to punch through more aug arguments

README and documentation overhaul. See initial (WIP) documentation at https://rwightman.github.io/pytorch-image-models/

adamp and sgdp optimizers added by @hellbell

Source code(tar.gz)
Source code(zip)
timm-0.2.1-py3-none-any.whl(219.96 KB)
v0.1-rexnet(Jul 23, 2020)

ReXNet weights from https://github.com/clovaai/rexnet#pretrained remapped for timm model changes
Source code(tar.gz)
Source code(zip)
rexnetv1_100-1b4dddf4.pth(18.51 MB)
rexnetv1_130-590d768e.pth(29.09 MB)
rexnetv1_150-bd1a6aa8.pth(37.41 MB)
rexnetv1_200-8c0b7f2d.pth(62.81 MB)
v0.1-resnest(Jun 30, 2020)

These are a mirror of weights from the official repository (https://github.com/zhanghang1989/ResNeSt) to avoid issues with hosting changes/relocation
Source code(tar.gz)
Source code(zip)
resnest101-22405ba7.pth(184.80 MB)
resnest200-75117900.pth(268.93 MB)
resnest269-0cc87c48.pth(424.76 MB)
resnest50-528c19ca.pth(105.16 MB)
resnest50_fast_1s4x24d-d4a4f76f.pth(98.27 MB)
resnest50_fast_4s2x40d-41d14ed0.pth(116.47 MB)

Comments

Discussed in https://github.com/rwightman/pytorch-image-models/discussions/1020

Releases(v0.8.2dev0)

v0.8.2dev0(Dec 24, 2022)

Dec 23, 2022 🎄☃

Dec 8, 2022

Dec 6, 2022

Dec 5, 2022

Oct 15, 2022

v0.6.12(Nov 23, 2022)

Oct 10, 2022

v0.6.11(Oct 3, 2022)

Changes Since 0.6.7

Sept 23, 2022

Sept 7, 2022

Aug 29, 2022

Aug 26, 2022

Aug 15, 2022

Aug 5, 2022

July 28, 2022

v0.1-weights-maxx(Aug 24, 2022)

CoAtNet (https://arxiv.org/abs/2106.04803) and MaxVit (https://arxiv.org/abs/2204.01697) timm trained weights

v0.1-weights-morevit(Aug 17, 2022)

More weights for 3rd party ViT / ViT-CNN hybrids that needed remapping / re-hosting

EfficientFormer

GCViT

v0.6.7(Jul 27, 2022)

v0.6.5(Jul 10, 2022)

July 8, 2022

May 13, 2022

May 2, 2022

April 22, 2022

CoAtNet (https://arxiv.org/abs/2106.04803) and MaxVit (https://arxiv.org/abs/2204.01697) `timm` trained weights