Distributed Arcface Training in Pytorch

Last update: Nov 23, 2021

This is a deep learning library that makes face recognition efficient and effective. It can train on tens of millions of identities on a single server.

Requirements

Install PyTorch (torch>=1.6.0); see install.md in our docs.
Run pip install -r requirements.txt.
Download the dataset from https://github.com/deepinsight/insightface/tree/master/recognition/datasets

How to Train

To train a model, run train.py with the path to a config:

1. Single node, 8 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=1234 train.py configs/ms1mv3_r50

2. Multiple nodes, each node 8 GPUs:

Node 0:

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr="ip1" --master_port=1234 train.py configs/ms1mv3_r50

Node 1:

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr="ip1" --master_port=1234 train.py configs/ms1mv3_r50
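The launch flags above determine how many processes take part in training and which global rank each one gets. As a rough illustration only, the sketch below reproduces the arithmetic used by torch.distributed.launch (global rank = node_rank * nproc_per_node + local rank). The helper name rank_layout is hypothetical and is not part of train.py.

```python
def rank_layout(nnodes, nproc_per_node, node_rank):
    """Return (world_size, global ranks of the processes started on this node)."""
    # Total number of training processes across all nodes.
    world_size = nnodes * nproc_per_node
    # Each node owns a contiguous block of global ranks.
    first_rank = node_rank * nproc_per_node
    ranks = [first_rank + local_rank for local_rank in range(nproc_per_node)]
    return world_size, ranks


if __name__ == "__main__":
    # Two nodes with 8 GPUs each, as in the multi-node commands above.
    for node_rank in range(2):
        world_size, ranks = rank_layout(nnodes=2, nproc_per_node=8, node_rank=node_rank)
        print(f"node {node_rank}: world_size={world_size}, global ranks={ranks}")
```

Every node must be given the same --master_addr and --master_port so that all processes join the same group; only --node_rank differs between nodes.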

3. Training resnet2060 with 8 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=1234 train.py configs/ms1mv3_r2060.py

Model Zoo

The models are available for non-commercial research purposes only.
All models can be found here:
Baidu Yun Pan: e8pw
onedrive

Performance on ICCV2021-MFR

The ICCV2021-MFR test set consists of non-celebrities, so it has very little overlap with publicly available face recognition training sets such as MS1M and CASIA, which were mostly collected from online celebrities. As a result, we can evaluate the FAIR performance of different algorithms.

For the ICCV2021-MFR-ALL set, TAR is measured on the all-to-all 1:1 protocol, with FAR less than 0.000001 (1e-6). The globalised multi-racial test set contains 242,143 identities and 1,624,305 images.

For the ICCV2021-MFR-MASK set, TAR is measured on the mask-to-nonmask 1:1 protocol, with FAR less than 0.0001 (1e-4). The mask test set contains 6,964 identities, 6,964 masked images and 13,928 non-masked images. In total, there are 13,928 positive pairs and 96,983,824 negative pairs.
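To make the metric concrete, here is a minimal sketch of how TAR at a fixed FAR can be computed from 1:1 similarity scores. It is not the official evaluation code: the function name tar_at_far and the toy scores are made up for illustration, and it assumes that a higher score means a more similar pair.

```python
def tar_at_far(positive_scores, negative_scores, far):
    """Return (TAR, threshold) such that at most `far` of the negatives are accepted."""
    ranked_negatives = sorted(negative_scores, reverse=True)
    # Number of negative pairs we may wrongly accept without exceeding the target FAR.
    allowed_false_accepts = int(far * len(ranked_negatives))
    if allowed_false_accepts >= len(ranked_negatives):
        threshold = float("-inf")  # every pair may be accepted
    else:
        # A pair is accepted only if its score is strictly above this value,
        # so at most `allowed_false_accepts` negatives pass.
        threshold = ranked_negatives[allowed_false_accepts]
    accepted = sum(score > threshold for score in positive_scores)
    return accepted / len(positive_scores), threshold


if __name__ == "__main__":
    positives = [0.9, 0.8, 0.3]       # same-identity pairs (toy values)
    negatives = [0.5, 0.4, 0.2, 0.1]  # different-identity pairs (toy values)
    tar, threshold = tar_at_far(positives, negatives, far=0.25)
    print(f"threshold={threshold}, TAR={tar:.4f}")
```

With a target FAR as low as 1e-6, the threshold is set by only a handful of the hardest negative pairs, which is why a very large number of negative pairs is needed for a stable measurement.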

Datasets	backbone	Training throughput	Size / MB	ICCV2021-MFR-MASK	ICCV2021-MFR-ALL
MS1MV3	r18	-	91	47.85	68.33
Glint360k	r18	8536	91	53.32	72.07
MS1MV3	r34	-	130	58.72	77.36
Glint360k	r34	6344	130	65.10	83.02
MS1MV3	r50	5500	166	63.85	80.53
Glint360k	r50	5136	166	70.23	87.08
MS1MV3	r100	-	248	69.09	84.31
Glint360k	r100	3332	248	75.57	90.66
MS1MV3	mobilefacenet	12185	7.8	41.52	65.26
Glint360k	mobilefacenet	11197	7.8	44.52	66.48

Performance on IJB-C and Verification Datasets

Datasets	backbone	IJBC(1e-05)	IJBC(1e-04)	agedb30	cfp_fp	lfw	log
MS1MV3	r18	92.07	94.66	97.77	97.73	99.77	log
MS1MV3	r34	94.10	95.90	98.10	98.67	99.80	log
MS1MV3	r50	94.79	96.46	98.35	98.96	99.83	log
MS1MV3	r100	95.31	96.81	98.48	99.06	99.85	log
MS1MV3	r2060	95.34	97.11	98.67	99.24	99.87	log
Glint360k	r18-0.1	93.16	95.33	97.72	97.73	99.77	log
Glint360k	r34-0.1	95.16	96.56	98.33	98.78	99.82	log
Glint360k	r50-0.1	95.61	96.97	98.38	99.20	99.83	log
Glint360k	r100-0.1	95.88	97.32	98.48	99.29	99.82	log

Speed Benchmark

Arcface Torch can train on large-scale face recognition training sets efficiently and quickly. When the number of classes in the training set is greater than 300K and training is sufficient, the Partial FC sampling strategy reaches the same accuracy with several times faster training and a smaller GPU memory footprint.

Partial FC is a sparse variant of the model-parallel architecture for large-scale face recognition. It uses a sparse softmax in which each batch dynamically samples a subset of class centers for training. In each iteration, only a sparse part of the parameters is updated, which greatly reduces GPU memory use and computation. With Partial FC, we can scale to a training set of 29 million identities, the largest to date. Partial FC also supports multi-machine distributed training and mixed-precision training.

For more details, see speed_benchmark.md in docs.
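The sampling idea can be sketched in a few lines. The example below is a simplified, single-process illustration and not the implementation in this repository: the function name sample_class_centers is hypothetical. It assumes that the classes present in the batch are always kept and that the rest of the subset is filled with randomly chosen other classes.

```python
import random


def sample_class_centers(labels, num_classes, sample_rate, seed=0):
    """Pick the class centers used for one batch and remap labels into that subset."""
    rng = random.Random(seed)
    # Classes that appear in the batch (positives) are always kept.
    positives = sorted(set(labels))
    num_sample = max(int(sample_rate * num_classes), len(positives))
    # Fill the remaining slots with randomly chosen negative classes.
    positive_set = set(positives)
    negatives = [c for c in range(num_classes) if c not in positive_set]
    extra = rng.sample(negatives, num_sample - len(positives))
    sampled = sorted(positives + extra)
    # Labels must index into the sampled subset, not the full class list.
    index_of = {class_id: i for i, class_id in enumerate(sampled)}
    remapped = [index_of[y] for y in labels]
    return sampled, remapped


if __name__ == "__main__":
    batch_labels = [3, 7, 3, 15]
    sampled, remapped = sample_class_centers(batch_labels, num_classes=20, sample_rate=0.5)
    print("sampled class centers:", sampled)
    print("remapped labels:", remapped)
```

Only the rows of the classification layer listed in sampled would take part in the softmax and receive gradients for this batch, which is where the memory and compute savings come from.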

1. Training speed of different parallel methods (samples / second), Tesla V100 32GB * 8. (Larger is better)

"-" means training failed because of GPU memory limitations.

Number of Identities in Dataset	Data Parallel	Model Parallel	Partial FC 0.1
125000	4681	4824	5004
1400000	1672	3043	4738
5500000	-	1389	3975
8000000	-	-	3565
16000000	-	-	2679
29000000	-	-	1855

2. GPU memory cost of different parallel methods (MB per GPU), Tesla V100 32GB * 8. (Smaller is better)

Number of Identities in Dataset	Data Parallel	Model Parallel	Partial FC 0.1
125000	7358	5306	4868
1400000	32252	11178	6056
5500000	-	32188	9854
8000000	-	-	12310
16000000	-	-	19950
29000000	-	-	32324

Evaluation on ICCV2021-MFR and IJB-C

For more details, see eval.md in docs.

Test

We tested many versions of PyTorch. Please create an issue if you are having trouble.

torch 1.6.0
torch 1.7.1
torch 1.8.0
torch 1.9.0

Citation

@inproceedings{deng2019arcface,
  title={Arcface: Additive angular margin loss for deep face recognition},
  author={Deng, Jiankang and Guo, Jia and Xue, Niannan and Zafeiriou, Stefanos},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={4690--4699},
  year={2019}
}
@inproceedings{an2020partical_fc,
  title={Partial FC: Training 10 Million Identities on a Single Machine},
  author={An, Xiang and Zhu, Xuhan and Xiao, Yang and Wu, Lan and Zhang, Ming and Gao, Yuan and Qin, Bin and
  Zhang, Debing and Fu, Ying},
  booktitle={Arxiv 2010.05222},
  year={2020}
}
