SOTA Image Classification Models in PyTorch
A collection of SOTA image classification models, intended to be easy to use and easy to integrate as backbones into downstream tasks such as object detection, semantic segmentation, and pose estimation.
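As a sketch of that backbone-style usage, the snippet below builds a model and runs a forward pass. The `models` import, `PVTv2` class, and variant-string constructor are hypothetical placeholders for illustration, not this repo's confirmed API; check the `models` package for the actual names.

```python
import torch
from models import PVTv2  # hypothetical import; see the models package for actual names

model = PVTv2('B2')  # hypothetical constructor taking a variant name
model.eval()

x = torch.randn(1, 3, 224, 224)  # dummy batch of one RGB image
with torch.no_grad():
    logits = model(x)  # ImageNet-1k class logits, shape (1, 1000)

# For detection, segmentation, or pose estimation, the same module can act as
# a backbone: read out its intermediate feature maps instead of the logits.
```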
Model Zoo
Each cell lists values for the three variants named under Variants & Weights, in order.

Model | ImageNet-1k Top-1 Acc (%) | Params (M) | FLOPs | Variants & Weights
---|---|---|---|---
MicroNet | 51.4 / 59.4 / 62.5 | 2 / 2 / 3 | 6M / 12M / 21M | M1 / M2 / M3
MobileFormer | 76.7 / 77.9 / 79.3 | 9 / 11 / 14 | 214M / 294M / 508M | 214 / 294 / 508
GFNet | 80.1 / 81.5 / 82.9 | 15 / 32 / 54 | 2G / 5G / 8G | T / S / B
PVTv2 | 78.7 / 82.0 / 83.6 | 14 / 25 / 63 | 2G / 4G / 10G | B1 / B2 / B4
ResT | 79.6 / 81.6 / 83.6 | 14 / 30 / 52 | 2G / 4G / 8G | S / B / L
Conformer | 81.3 / 83.4 / 84.1 | 24 / 38 / 83 | 5G / 11G / 23G | T / S / B
Shuffle | 82.4 / 83.6 / 84.0 | 28 / 50 / 88 | 5G / 9G / 16G | T / S / B
CSWin | 82.7 / 83.6 / 84.2 | 23 / 35 / 78 | 4G / 7G / 15G | T / S / B
CycleMLP | 81.6 / 83.0 / 83.2 | 27 / 52 / 76 | 4G / 10G / 12G | B2 / B4 / B5
HireMLP | 81.8 / 83.1 / 83.4 | 33 / 58 / 96 | 4G / 8G / 14G | S / B / L
sMLP | 81.9 / 83.1 / 83.4 | 24 / 49 / 66 | 5G / 10G / 14G | T / S / B
XCiT | 80.4 / 83.9 / 84.3 | 12 / 48 / 84 | 2G / 9G / 16G | T / S / M
VOLO | 84.2 / 85.2 / 85.4 | 27 / 59 / 86 | 7G / 14G / 21G | D1 / D2 / D3
Table Notes
- Image size is 224x224. EfficientNetv2 uses progressive learning (image size from 128 to 380).
- All models' weights are from official repositories.
- Only models trained on ImageNet-1k are compared.
- Models with more than 200M parameters are not included.
- PVTv2, ResT, Conformer, XCiT and CycleMLP models work with any image size (see the sketch below).
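A quick check of that any-image-size claim is to forward dummy inputs at several resolutions. The `ResT` import and constructor below are hypothetical placeholders for whatever the `models` package actually exposes:

```python
import torch
from models import ResT  # hypothetical placeholder import

model = ResT('S')  # hypothetical constructor
model.eval()

# The same weights should accept different spatial resolutions.
for size in (224, 320, 384):
    x = torch.randn(1, 3, size, size)
    with torch.no_grad():
        out = model(x)
    print(size, tuple(out.shape))  # class logits at every input size
```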
Usage
Requirements
- python >= 3.6
- torch >= 1.8.1
- torchvision >= 0.9.1
Other requirements can be installed with `pip install -r requirements.txt`.
Show Available Models
$ python tools/show.py
A table with model names and variants will be shown:
Model Names Model Variants
------------- --------------------------------
ResNet ['18', '34', '50', '101', '152']
MicroNet ['M1', 'M2', 'M3']
GFNet ['T', 'S', 'B']
PVTv2 ['B1', 'B2', 'B3', 'B4', 'B5']
ResT ['S', 'B', 'L']
Conformer ['T', 'S', 'B']
Shuffle ['T', 'S', 'B']
CSWin ['T', 'S', 'B', 'L']
CycleMLP ['B1', 'B2', 'B3', 'B4', 'B5']
XciT ['T', 'S', 'M', 'L']
VOLO ['D1', 'D2', 'D3', 'D4']
Inference
- Download your desired model's weights from the Model Zoo table.
- Set the `MODEL` and `TEST` parameters in `configs/test.yaml`, then run the following command:
$ python tools/infer.py --cfg configs/test.yaml
You will see an output similar to this:
File: assests\dog.jpg >>>>> Golden retriever
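For context, the core of such an inference step in plain PyTorch looks roughly like the sketch below. The `CSWin` import, constructor, and checkpoint file name are assumptions for illustration, not the repo's verified API; the preprocessing is standard ImageNet normalization.

```python
import torch
from PIL import Image
from torchvision import transforms

from models import CSWin  # hypothetical import; use your chosen model here

# Standard ImageNet preprocessing: 224x224 crop, ImageNet mean/std.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = CSWin('T')  # hypothetical constructor
model.load_state_dict(torch.load('cswin_t.pth', map_location='cpu'))  # downloaded weights (assumed layout)
model.eval()

img = Image.open('assests/dog.jpg').convert('RGB')
x = preprocess(img).unsqueeze(0)  # add batch dimension -> (1, 3, 224, 224)

with torch.no_grad():
    probs = model(x).softmax(dim=1)
print('Top-1 ImageNet class index:', probs.argmax(dim=1).item())
```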
Training
$ python tools/train.py --cfg configs/train.yaml
Evaluate
$ python tools/val.py --cfg configs/train.yaml
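Under the hood, top-1 evaluation reduces to a loop like the following sketch (not the actual `tools/val.py`; it assumes a `DataLoader` yielding preprocessed image/label batches):

```python
import torch

@torch.no_grad()
def top1_accuracy(model, loader, device='cpu'):
    """Fraction of samples whose argmax prediction matches the label."""
    model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.size(0)
    return correct / total
```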
Fine-tune
Fine-tune on CIFAR-10:
$ python tools/finetune.py --cfg configs/finetune.yaml
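The essence of such fine-tuning is to swap the classification head for one with the new number of classes and train on the new labels. A minimal sketch, assuming a hypothetical `ResT` model whose classifier is a `Linear` layer stored in a `head` attribute (the import, attribute name, and checkpoint path are illustrative assumptions):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from models import ResT  # hypothetical import

# CIFAR-10 images are 32x32; upsample to the 224x224 these backbones expect.
tf = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = datasets.CIFAR10('data', train=True, download=True, transform=tf)
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=2)

model = ResT('S')  # hypothetical constructor
model.load_state_dict(torch.load('rest_s.pth', map_location='cpu'))  # pretrained weights
# Assumes the classifier lives in a `head` attribute; replace it for 10 classes.
model.head = nn.Linear(model.head.in_features, 10)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One epoch of plain supervised training on the new labels.
model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```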
Citations
@article{zhql2021ResT,
title={ResT: An Efficient Transformer for Visual Recognition},
author={Zhang, Qinglong and Yang, Yubin},
journal={arXiv preprint arXiv:2105.13677v3},
year={2021}
}
@article{peng2021conformer,
title={Conformer: Local Features Coupling Global Representations for Visual Recognition},
author={Zhiliang Peng and Wei Huang and Shanzhi Gu and Lingxi Xie and Yaowei Wang and Jianbin Jiao and Qixiang Ye},
journal={arXiv preprint arXiv:2105.03889},
year={2021},
}
@misc{dong2021cswin,
title={CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows},
author={Xiaoyi Dong and Jianmin Bao and Dongdong Chen and Weiming Zhang and Nenghai Yu and Lu Yuan and Dong Chen and Baining Guo},
year={2021},
eprint={2107.00652},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{chen2021cyclemlp,
title={CycleMLP: A MLP-like Architecture for Dense Prediction},
author={Shoufa Chen and Enze Xie and Chongjian Ge and Ding Liang and Ping Luo},
year={2021},
eprint={2107.10224},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{wang2021pvtv2,
title={PVTv2: Improved Baselines with Pyramid Vision Transformer},
author={Wenhai Wang and Enze Xie and Xiang Li and Deng-Ping Fan and Kaitao Song and Ding Liang and Tong Lu and Ping Luo and Ling Shao},
year={2021},
eprint={2106.13797},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{elnouby2021xcit,
title={XCiT: Cross-Covariance Image Transformers},
author={Alaaeldin El-Nouby and Hugo Touvron and Mathilde Caron and Piotr Bojanowski and Matthijs Douze and Armand Joulin and Ivan Laptev and Natalia Neverova and Gabriel Synnaeve and Jakob Verbeek and Hervé Jegou},
year={2021},
eprint={2106.09681},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{yuan2021volo,
title={VOLO: Vision Outlooker for Visual Recognition},
author={Li Yuan and Qibin Hou and Zihang Jiang and Jiashi Feng and Shuicheng Yan},
year={2021},
eprint={2106.13112},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{li2021micronet,
title={MicroNet: Improving Image Recognition with Extremely Low FLOPs},
author={Yunsheng Li and Yinpeng Chen and Xiyang Dai and Dongdong Chen and Mengchen Liu and Lu Yuan and Zicheng Liu and Lei Zhang and Nuno Vasconcelos},
year={2021},
eprint={2108.05894},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{chen2021mobileformer,
title={Mobile-Former: Bridging MobileNet and Transformer},
author={Yinpeng Chen and Xiyang Dai and Dongdong Chen and Mengchen Liu and Xiaoyi Dong and Lu Yuan and Zicheng Liu},
year={2021},
eprint={2108.05895},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@article{rao2021global,
title={Global Filter Networks for Image Classification},
author={Rao, Yongming and Zhao, Wenliang and Zhu, Zheng and Lu, Jiwen and Zhou, Jie},
journal={arXiv preprint arXiv:2107.00645},
year={2021}
}
@article{huang2021shuffle,
title={Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer},
author={Huang, Zilong and Ben, Youcheng and Luo, Guozhong and Cheng, Pei and Yu, Gang and Fu, Bin},
journal={arXiv preprint arXiv:2106.03650},
year={2021}
}