TransFG: A Transformer Architecture for Fine-grained Recognition
Official PyTorch code for the paper: TransFG: A Transformer Architecture for Fine-grained Recognition
An implementation based on DeiT pretrained on ImageNet-1K with distillation fine-tuning will be released soon.
Framework
Dependencies:
- Python 3.7.3
- PyTorch 1.5.1
- torchvision 0.6.1
- ml_collections
Usage
1. Download Google pre-trained ViT models
- Available models: ViT-B_16, ViT-B_32... Download the one you need with:
wget https://storage.googleapis.com/vit_models/imagenet21k/{MODEL_NAME}.npz
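The `{MODEL_NAME}` placeholder in the command above stands for a checkpoint name such as ViT-B_16. As a minimal sketch, the URL can be assembled in Python before downloading. The helper name `vit_checkpoint_url` is hypothetical and not part of this repository; it only assumes the URL pattern shown above.

```python
# Hypothetical helper (not part of this repository): fills in the
# {MODEL_NAME} placeholder of the download URL shown above.
BASE_URL = "https://storage.googleapis.com/vit_models/imagenet21k/{model_name}.npz"


def vit_checkpoint_url(model_name: str) -> str:
    """Return the download URL for a pre-trained ViT checkpoint name."""
    # Reject empty names and anything that would change the URL path.
    if not model_name or "/" in model_name:
        raise ValueError(f"unexpected model name: {model_name!r}")
    return BASE_URL.format(model_name=model_name)


# Print the URL for one of the model names listed above.
print(vit_checkpoint_url("ViT-B_16"))
```

The printed URL can be passed to `wget` as in the command above.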
2. Prepare data
In the paper, we use data from 5 publicly available datasets.
Please download them from their official websites and put them in the corresponding folders.
3. Install required packages
Install dependencies with the following command:
pip3 install -r requirements.txt
4. Train
To train TransFG on the CUB-200-2011 dataset with 4 GPUs in FP16 mode for 10000 steps, run:
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 train.py --dataset CUB_200_2011 --split overlap --num_steps 10000 --fp16 --name sample_run
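In the command above, the number of IDs in `CUDA_VISIBLE_DEVICES` matches `--nproc_per_node`, so one process is launched per visible GPU. The sketch below builds the same command for a different GPU list while keeping the two values consistent. The function `build_train_command` is a hypothetical convenience and not part of this repository; it uses only the flags that appear in the command above.

```python
import shlex


def build_train_command(gpu_ids, dataset="CUB_200_2011", split="overlap",
                        num_steps=10000, fp16=True, name="sample_run"):
    """Assemble the training command shown above for the given GPU IDs.

    Hypothetical helper: it only formats a string and does not launch anything.
    """
    gpu_ids = list(gpu_ids)
    if not gpu_ids:
        raise ValueError("at least one GPU ID is required")

    # One launcher process per visible GPU.
    env = "CUDA_VISIBLE_DEVICES=" + ",".join(str(g) for g in gpu_ids)
    args = [
        "python3", "-m", "torch.distributed.launch",
        f"--nproc_per_node={len(gpu_ids)}",
        "train.py",
        "--dataset", dataset,
        "--split", split,
        "--num_steps", str(num_steps),
    ]
    if fp16:
        args.append("--fp16")
    args += ["--name", name]

    # Quote each argument so run names with spaces stay a single token.
    return env + " " + " ".join(shlex.quote(a) for a in args)


# Example: the same run on two GPUs instead of four.
print(build_train_command([0, 1]))
```

Calling `build_train_command([0, 1, 2, 3])` reproduces the 4-GPU command above.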
Citation
If you find our work helpful in your research, please cite it as:
@article{he2021transfg,
title={TransFG: A Transformer Architecture for Fine-grained Recognition},
author={He, Ju and Chen, Jieneng and Liu, Shuai and Kortylewski, Adam and Yang, Cheng and Bai, Yutong and Wang, Changhu and Yuille, Alan},
journal={arXiv preprint arXiv:2103.07976},
year={2021}
}
Acknowledgement
Many thanks to ViT-pytorch for the PyTorch reimplementation of An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.