[arXiv](https://arxiv.org/abs/2106.04263)
Demystifying Local Vision Transformer

This is the official PyTorch implementation of our paper. We replace local self-attention with (dynamic) depth-wise convolution, which has lower computational cost. The performance is on par with the Swin Transformer.
Besides, the main contribution of our paper is a detailed theoretical comparison between depth-wise convolution and local self-attention from three aspects: sparse connectivity, weight sharing, and dynamic weight. With this paper, we hope the community will rethink local self-attention and depth-wise convolution, as well as the basic rules of model architecture design.
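For a concrete picture, below is a minimal PyTorch sketch of the two operations the paper compares against local self-attention. It is an illustration only, not the implementation in this repo: the class names are made up, the 7x7 window mirrors the window size in the model names below, and the kernel generator in the dynamic variant (global average pooling plus a linear layer) is a simplified stand-in for the paper's dynamic weight prediction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthWiseConvBlock(nn.Module):
    """Static depth-wise convolution over a local window."""
    def __init__(self, dim, window_size=7):
        super().__init__()
        # groups=dim -> one k x k kernel per channel (sparse connectivity across channels)
        self.dw = nn.Conv2d(dim, dim, kernel_size=window_size,
                            padding=window_size // 2, groups=dim)

    def forward(self, x):          # x: (B, C, H, W)
        return self.dw(x)

class DynamicDepthWiseConvBlock(nn.Module):
    """Dynamic variant: per-channel kernels are predicted from the input (dynamic weight)."""
    def __init__(self, dim, window_size=7):
        super().__init__()
        self.dim, self.k = dim, window_size
        # Simplified kernel generator: global context -> one k x k kernel per channel.
        self.gen = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(dim, dim * window_size * window_size),
        )

    def forward(self, x):          # x: (B, C, H, W)
        B, C, H, W = x.shape
        kernels = self.gen(x).reshape(B * C, 1, self.k, self.k)
        # Grouped conv with groups=B*C applies each sample's own per-channel kernel.
        out = F.conv2d(x.reshape(1, B * C, H, W), kernels,
                       padding=self.k // 2, groups=B * C)
        return out.reshape(B, C, H, W)

x = torch.randn(2, 96, 56, 56)             # e.g. an early-stage feature map of a tiny model
print(DepthWiseConvBlock(96)(x).shape)      # torch.Size([2, 96, 56, 56])
print(DynamicDepthWiseConvBlock(96)(x).shape)
```

The static block shares one kernel per channel across all positions, while the dynamic block predicts kernels from the input itself; the extra generator adds parameters but little compute, which is consistent with the parameter counts and FLOPs in the tables below.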
Code and models for object detection and semantic segmentation are available in Detection and Segmentation.
We also provide MLP-based Swin Transformer models and inhomogeneous dynamic convolution in the ablation studies. The corresponding code and models will be released soon.
Reference
@article{han2021demystifying,
  title={Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight},
  author={Han, Qi and Fan, Zejia and Dai, Qi and Sun, Lei and Cheng, Ming-Ming and Liu, Jiaying and Wang, Jingdong},
  journal={arXiv preprint arXiv:2106.04263},
  year={2021}
}
1. Requirements
torch>=1.5.0, torchvision, timm, pyyaml; apex-amp
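The snippet below is an optional sanity check of the environment, not part of the repo itself; apex is only needed for mixed-precision (AMP) training, so it is probed separately.

```python
# Optional environment check (not part of this repo).
import torch, torchvision, timm, yaml  # pyyaml is imported as `yaml`

print("torch:", torch.__version__)            # should be >= 1.5.0
print("torchvision:", torchvision.__version__)
print("timm:", timm.__version__)

try:
    from apex import amp  # only needed for AMP training
    print("apex AMP available")
except ImportError:
    print("apex not installed; AMP training unavailable")
```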
Data preparation: the ImageNet dataset with the following structure:
│imagenet/
├──train/
│ ├── n01440764
│ │ ├── n01440764_10026.JPEG
│ │ ├── n01440764_10027.JPEG
│ │ ├── ......
│ ├── ......
├──val/
│ ├── n01440764
│ │ ├── ILSVRC2012_val_00000293.JPEG
│ │ ├── ILSVRC2012_val_00002138.JPEG
│ │ ├── ......
│ ├── ......
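As an optional sanity check (not part of the training code), a directory laid out as above loads directly with torchvision's ImageFolder; the path below is a placeholder you should replace.

```python
from torchvision import datasets, transforms

# Placeholder path -- replace with the location of your imagenet/ directory.
data_root = "/path/to/imagenet"

to_tensor = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder(data_root + "/train", transform=to_tensor)
val_set = datasets.ImageFolder(data_root + "/val", transform=to_tensor)

# Full ImageNet-1K should report 1000 classes in both splits.
print(len(train_set.classes), "train classes,", len(train_set), "train images")
print(len(val_set.classes), "val classes,", len(val_set), "val images")
```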
2. Training
For the tiny model, we train with batch size 128 on 8 GPUs. For the base model, we use batch size 64 on 16 GPUs with OpenMPI, keeping the total batch size unchanged at 1024. (With the same training setting, the base model cannot be trained with AMP due to anomalous gradient values.)
Please change the data path in the shell scripts first.
For tiny model:
bash scripts/run_dwnet_tiny_patch4_window7_224.sh
bash scripts/run_dynamic_dwnet_tiny_patch4_window7_224.sh
For base model, use multi node with OpenMPI:
bash scripts/run_dwnet_base_patch4_window7_224.sh
bash scripts/run_dynamic_dwnet_base_patch4_window7_224.sh
3. Evaluation
python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --cfg configs/change_to_config_file --resume /path/to/model --data-path /path/to/imagenet --eval
4. Models
The following models are trained on ImageNet at resolution 224.
Model | #params | FLOPs | Top-1 Acc (%) | Download |
---|---|---|---|---|
dwnet_tiny | 24M | 3.8G | 81.2 | github |
dynamic_dwnet_tiny | 51M | 3.8G | 81.8 | github |
dwnet_base | 74M | 12.9G | 83.2 | github |
dynamic_dwnet_base | 162M | 13.0G | 83.2 | github |
Detection (see Detection for details):
Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs | config | model |
---|---|---|---|---|---|---|---|---|
DWNet-T | ImageNet-1K | 3x | 49.9 | 43.4 | 82M | 730G | config | github |
DWNet-B | ImageNet-1K | 3x | 51.0 | 44.1 | 132M | 924G | config | github |
Dynamic-DWNet-T | ImageNet-1K | 3x | 50.5 | 43.7 | 108M | 730G | config | github |
Dynamic-DWNet-B | ImageNet-1K | 3x | 51.2 | 44.4 | 219M | 924G | config | github |
Segmentation (see Segmentation for details):
Backbone | Pretrain | Lr Schd | mIoU | #params | FLOPs | config | model |
---|---|---|---|---|---|---|---|
DWNet-T | ImageNet-1K | 160K | 45.5 | 56M | 928G | config | github |
DWNet-B | ImageNet-1K | 160K | 48.3 | 108M | 1129G | config | github |
Dynamic-DWNet-T | ImageNet-1K | 160K | 45.7 | 83M | 928G | config | github |
Dynamic-DWNet-B | ImageNet-1K | 160K | 48.0 | 195M | 1129G | config | github |
LICENSE
This repo is under the MIT license. Some code is borrowed from Swin Transformer.