Simple tutorials on Pytorch DDP training

Ren Tianhe

Last update: Jan 6, 2023

Overview

pytorch-distributed-training

Distributed Data Parallel (DDP) Training in PyTorch

Features

Easy-to-follow code for learning DDP training
You can copy this code directly for a quick start
Shared learning notes (√ marks finished items):

Good Notes

A collection of high-quality notes found online

TODO

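Finish the notes on reading the DP and DDP source code (currently 50% done)
Polish the code details and reproduce the experimental results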

Quick start

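To run the code and see results right away, use the commands below. Be sure to pass --ip and --port to specify the host IP address and a free port; otherwise the scripts may fail to run.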

dataparallel.py

$ python dataparallel.py --gpu 0,1,2,3

distributed.py

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 distributed.py

distributed_mp.py

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_mp.py

distributed_apex.py

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_apex.py

--ip=str, e.g. --ip='10.24.82.10', sets the IP address of the main process
--port=int, e.g. --port=23456, sets the port used for launching
--batch_size=int, e.g. --batch_size=128, sets the training batch size

distributed_gradient_accumulation.py

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_gradient_accumulation.py

--ip=str, e.g. --ip='10.24.82.10', sets the IP address of the main process
--port=int, e.g. --port=23456, sets the port used for launching
--grad_accu_steps=int, e.g. --grad_accu_steps=4, sets the number of gradient accumulation steps
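
As a rough illustration of what --grad_accu_steps controls, the sketch below accumulates gradients over several micro-batches before a single parameter update. It uses a hypothetical one-parameter least-squares model in plain Python rather than the repository's training loop; the names `grad_mse` and `accumulated_grad` and the data are invented for illustration. With equally sized micro-batches, the averaged accumulated gradient matches the gradient of the full batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, grad_accu_steps):
    """Split the batch into equal micro-batches and average their gradients."""
    assert len(xs) % grad_accu_steps == 0, "sketch assumes equal micro-batches"
    micro = len(xs) // grad_accu_steps
    total = 0.0
    for step in range(grad_accu_steps):
        lo, hi = step * micro, (step + 1) * micro
        # Dividing by grad_accu_steps mirrors scaling the loss before backward()
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / grad_accu_steps
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = grad_mse(w, xs, ys)
accu = accumulated_grad(w, xs, ys, grad_accu_steps=4)
print(full, accu)
```

In a real training loop this corresponds to calling optimizer.step() and optimizer.zero_grad() only once every grad_accu_steps iterations.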

Comparison

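These results are only approximate; they can vary considerably depending on the state of the GPUs.

SyncBatchNorm is used by default in all cases. This slows execution somewhat, because computing BatchNorm statistics requires extra communication between processes, but it helps preserve accuracy.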
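
To see why synchronizing BatchNorm statistics can matter, the plain-Python sketch below compares the mean and variance each process would compute on its own shard of a batch with the statistics over the whole batch. The activations and the two-process split are invented for illustration; in PyTorch the synchronized variant is obtained with torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).

```python
from statistics import fmean, pvariance

# Hypothetical activations of one channel for a global batch of 8 samples
global_batch = [0.1, 0.3, 0.2, 0.4, 2.1, 1.9, 2.3, 2.0]

# Two processes, each seeing half of the batch
shards = [global_batch[:4], global_batch[4:]]

# Without synchronization each process normalizes with its own statistics
local_stats = [(fmean(s), pvariance(s)) for s in shards]

# With synchronization every process uses the statistics of the full batch
synced_stats = (fmean(global_batch), pvariance(global_batch))

for rank, (mean, var) in enumerate(local_stats):
    print(f"rank {rank}: local mean={mean:.3f}, local var={var:.3f}")
print(f"synced: mean={synced_stats[0]:.3f}, var={synced_stats[1]:.3f}")
```

When the shards happen to differ a lot, as in this contrived split, the per-process statistics are far from the global ones, which is the situation synchronization guards against.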

Concepts

apex
DP: DataParallel
DDP: DistributedDataParallel

Environments

4 × 2080Ti

model	dataset	training method	time(seconds/epoch)
resnet18	cifar100	DP	20s
resnet18	cifar100	DP+apex	18s
resnet18	cifar100	DDP	16s
resnet18	cifar100	DDP+apex	14.5s
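
The relative speedups implied by the table can be computed directly. This is plain arithmetic on the figures above, with DP taken as the baseline; given the caveat about measurement accuracy, the ratios are indicative only.

```python
# Seconds per epoch, copied from the table above
seconds_per_epoch = {
    "DP": 20.0,
    "DP+apex": 18.0,
    "DDP": 16.0,
    "DDP+apex": 14.5,
}

baseline = seconds_per_epoch["DP"]
speedup = {name: baseline / t for name, t in seconds_per_epoch.items()}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x relative to DP")
```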

Basic Concept

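group: a process group. By default there is only one process group.
world size: the total number of processes in the job
- e.g. 16 GPUs with one process per GPU: world size = 16
- 8 GPUs driven by a single process: world size = 1
- the program only starts running once the number of connected processes equals the world size
rank: the index of a process, used for inter-process communication; rank=0 denotes the main process
local_rank: the GPU index within a process. It is not an explicit argument; it is assigned internally by torch.distributed.launch. For example, rank=3, local_rank=0 means the first GPU of the process with rank 3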
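
A small sketch of how these quantities relate, assuming the common setup of one process per GPU and the same number of processes on every machine. The helper names `world_size` and `global_rank` are hypothetical and not part of PyTorch.

```python
def world_size(nnodes: int, nproc_per_node: int) -> int:
    """Total number of processes across all machines."""
    return nnodes * nproc_per_node


def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank of a process from its machine index and its index on that machine."""
    return node_rank * nproc_per_node + local_rank


# Two machines with 4 GPUs each, one process per GPU
print(world_size(nnodes=2, nproc_per_node=4))
# First process on the second machine
print(global_rank(node_rank=1, nproc_per_node=4, local_rank=0))
```

On a single machine node_rank is 0, so the global rank and the local rank coincide, which is why the single-machine code in this tutorial passes local_rank as the rank.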

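Usage: single machine, multiple GPUs

1. Get the index of the current process

PyTorch provides the torch.distributed.launch launcher for running a .py file in distributed fashion from the command line. While doing so, it passes the index of the current process to the Python script as an argument.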

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', default=-1, type=int,
                    help='node rank for distributed training')
args = parser.parse_args()
print(args.local_rank)
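
Newer PyTorch launchers (torchrun, or torch.distributed.launch with --use_env) pass the index through the LOCAL_RANK environment variable instead of a --local_rank argument. A minimal sketch that accepts either form; the function name `resolve_local_rank` is hypothetical.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank from --local_rank if given, else from LOCAL_RANK, else -1."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', default=-1, type=int,
                        help='node rank for distributed training')
    # parse_known_args ignores options this sketch does not define
    args, _ = parser.parse_known_args(argv)
    if args.local_rank >= 0:
        return args.local_rank
    return int(environ.get('LOCAL_RANK', -1))


print(resolve_local_rank(['--local_rank', '2'], {}))
print(resolve_local_rank([], {'LOCAL_RANK': '3'}))
```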

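2. Define the main_worker function

The main training procedure lives in main_worker, which takes three parameters (the last one is optional):

def main_worker(local_rank, nprocs, args):
    ...  # training code goes here

local_rank: receives the rank of the current process; on a single machine with multiple GPUs it corresponds to the index of the GPU to use
nprocs: the number of processes
args: any additional user-defined arguments

main_worker is the function that each process runs (every process executes the same function body; only the local_rank passed in differs)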

3. Overall flow inside main_worker

The complete training flow inside main_worker:

import torch
import torch.distributed as dist
import torch.backends.cudnn as cudnn

def main_worker(local_rank, nprocs, args):
    args.local_rank = local_rank
    cudnn.benchmark = True
    # Distributed initialization; every process must do this.
    # Replace ip:port with the main process address and a free port.
    dist.init_process_group(backend='nccl', init_method='tcp://ip:port', world_size=nprocs, rank=local_rank)
    # Define the model, loss function, and optimizer
    model = ...
    criterion = ...
    optimizer = ...
    # Bind this process to its GPU
    torch.cuda.set_device(local_rank)
    model.cuda(local_rank)
    # Wrap the model for distributed training
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # Define the datasets, using DistributedSampler
    mini_batch_size = args.batch_size // nprocs  # split the global batch size into per-process mini-batches
    train_dataset = ...
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=mini_batch_size, num_workers=..., pin_memory=...,
                                              sampler=train_sampler)

    test_dataset = ...
    test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)
    testloader = torch.utils.data.DataLoader(test_dataset, batch_size=mini_batch_size, num_workers=..., pin_memory=...,
                                             sampler=test_sampler)

    # The usual training loop
    for epoch in range(300):
        train_sampler.set_epoch(epoch)  # reshuffle with a different order each epoch
        model.train()
        for batch_idx, (images, target) in enumerate(trainloader):
            images = images.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
            ...
            pred = model(images)
            loss = criterion(pred, target)
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
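
To make the role of DistributedSampler concrete, here is a simplified pure-Python imitation of how it splits dataset indices among processes: shuffle with a seed derived from the epoch, pad so the total divides evenly, then give each rank every world_size-th index. This is a sketch of the idea, not PyTorch's implementation, and the function name `shard_indices` is hypothetical; the real sampler uses torch's random generator, so the actual permutations differ.

```python
import math
import random


def shard_indices(dataset_len, world_size, rank, epoch=0, seed=0, shuffle=True):
    """Indices that one process would iterate over in a given epoch."""
    indices = list(range(dataset_len))
    if shuffle:
        # Same seed on every rank, so all ranks agree on the permutation
        random.Random(seed + epoch).shuffle(indices)
    # Pad by wrapping around so every rank gets the same number of samples
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices += indices[: total - len(indices)]
    # Rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank:total:world_size]


shards = [shard_indices(10, world_size=4, rank=r, epoch=0) for r in range(4)]
for r, shard in enumerate(shards):
    print(f"rank {r}: {shard}")
```

In the real sampler, calling train_sampler.set_epoch(epoch) at the start of each epoch is what changes the shuffle seed; without it every epoch iterates in the same order.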

4. Define the main function

import argparse
import torch
parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('--local_rank', default=-1, type=int, help='node rank for distributed training')
parser.add_argument('--batch_size','--batch-size', default=256, type=int)
parser.add_argument('--lr', default=0.1, type=float)

def main_worker(local_rank, nprocs, args):
    ...

def main():
    args = parser.parse_args()
    args.nprocs = torch.cuda.device_count()
    # Run main_worker
    main_worker(args.local_rank, args.nprocs, args)

if __name__ == '__main__':
    main()

5. Launch from the command line

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 distributed.py

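--ip=str, e.g. --ip='10.24.82.10', sets the IP address of the main process
--port=int, e.g. --port=23456, sets the port used for launching

Arguments:

--nnodes: the number of machines
--node_rank: the index of the current machine
--nproc_per_node: the number of processes on each machine

See distributed.py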

6. torch.multiprocessing

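Using torch.multiprocessing to spawn the worker processes avoids problems that can arise when the processes are launched and managed by hand. This approach is more stable and is recommended.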

import argparse
import torch
import torch.multiprocessing as mp

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('--local_rank', default=-1, type=int, help='node rank for distributed training')
parser.add_argument('--batch_size','--batch-size', default=256, type=int)
parser.add_argument('--lr', default=0.1, type=float)

def main_worker(local_rank, nprocs, args):
    ...

def main():
    args = parser.parse_args()
    args.nprocs = torch.cuda.device_count()
    # Hand main_worker to mp.spawn
    mp.spawn(main_worker, nprocs=args.nprocs, args=(args.nprocs, args))

if __name__ == '__main__':
    main()

See distributed_mp.py. Launch it as follows:

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_mp.py

--ip=str, e.g. --ip='10.24.82.10', sets the IP address of the main process
--port=int, e.g. --port=23456, sets the port used for launching
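
mp.spawn calls the target function as fn(i, *args), where i is the process index from 0 to nprocs - 1; that is why main_worker takes local_rank first while the args tuple only holds (nprocs, args). The stand-in below imitates that calling convention sequentially in one process, purely to illustrate the argument order; `fake_spawn` is a hypothetical helper, not part of PyTorch, and it starts no processes.

```python
from types import SimpleNamespace


def fake_spawn(fn, args=(), nprocs=1):
    """Call fn(i, *args) for each index, mimicking the signature mp.spawn uses."""
    return [fn(i, *args) for i in range(nprocs)]


def main_worker(local_rank, nprocs, args):
    # A real worker would initialize the process group and train here
    return f"worker {local_rank}/{nprocs}, batch_size={args.batch_size}"


args = SimpleNamespace(batch_size=256)
results = fake_spawn(main_worker, nprocs=4, args=(4, args))
for line in results:
    print(line)
```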

Implemented Work

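The following articles were used as references (if an article with similar content is missing from this list, please open an issue and I will add it; my apologies):

PyTorch: DDP series
Distributed training
Distributed training (theory)
Issues with DistributedSampler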
I learned from this repo, and want to make it easier and cleaner.

Comments

About model parameter initialization on different process

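Hello,

I have a small question about how the model is instantiated across multiple processes.

https://github.com/rentainhe/pytorch-distributed-training/blob/33d04b235bf6711b1c5fd21c685a927bd4a2fbfa/distributed_mp.py#L66-L73

As the code shows, the model is instantiated inside the main_worker function. However, DDP requires the model parameters to be identical across processes. If my processes use different random seeds, will the random parameter initialization during instantiation leave the processes with inconsistent model parameters?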

opened by yyk-wew 10
