PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models

Overview


This repository is the official implementation of the following paper:

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
Chaoyang He (USC), Shen Li (Facebook AI Research), Mahdi Soltanolkotabi (USC), Salman Avestimehr (USC)
Accepted to ICML 2021 (International Conference on Machine Learning 2021)

1. Introduction


The size of Transformer models is growing at an unprecedented rate. It has taken less than one year to reach trillion-level parameters since the release of GPT-3 (175B). Training such models requires both substantial engineering efforts and enormous computing resources, which are luxuries most research teams cannot afford. In this paper, we propose PipeTransformer, which leverages automated elastic pipelining for efficient distributed training of Transformer models. In PipeTransformer, we design an adaptive on-the-fly freeze algorithm that can identify and freeze some layers gradually during training, and an elastic pipelining system that can dynamically allocate resources to train the remaining active layers. More specifically, PipeTransformer automatically excludes frozen layers from the pipeline, packs active layers into fewer GPUs, and forks more replicas to increase data-parallel width. We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on SQuAD and GLUE datasets. Our results show that compared to the state-of-the-art baseline, PipeTransformer attains up to $2.83$-fold speedup without losing accuracy. We also provide various performance analyses for a more comprehensive understanding of our algorithmic and system-wise design. Finally, we have modularized our training system with flexible APIs and made the source code publicly available.
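
To make the freeze-and-repack idea concrete, the sketch below shows the two steps in miniature: deciding how many leading layers to freeze with a simple gradient-norm heuristic, and packing the remaining active layers into fewer pipeline partitions so that freed GPUs can host extra data-parallel replicas. The function names, the threshold, and the partitioning rule are illustrative assumptions, not the repository's actual API or the exact criterion used in the paper.

    import torch.nn as nn

    def frozen_prefix_length(layers, threshold=1e-3):
        # Illustrative heuristic: freeze the longest prefix of layers whose
        # accumulated gradient norm has fallen below a small threshold.
        num_frozen = 0
        for layer in layers:
            grad_norm = sum(
                p.grad.norm().item() for p in layer.parameters() if p.grad is not None
            )
            if grad_norm >= threshold:
                break
            num_frozen += 1
        return num_frozen

    def repartition_active_layers(layers, num_frozen, num_gpus):
        # Exclude frozen layers from the pipeline and pack the remaining
        # active layers into (possibly fewer) roughly equal partitions.
        for layer in layers[:num_frozen]:
            for p in layer.parameters():
                p.requires_grad = False
        active = layers[num_frozen:]
        if not active:
            return []
        pipe_len = min(num_gpus, len(active))          # shrink the pipeline
        per_gpu = (len(active) + pipe_len - 1) // pipe_len
        # GPUs no longer needed by the pipeline can host extra replicas.
        return [nn.Sequential(*active[i:i + per_gpu])
                for i in range(0, len(active), per_gpu)]

In the repository itself, the load-balanced partitioning is handled by auto_pipe.py (see _auto_balanced_elastic_partition in the logs below) and the data-parallel replicas by AutoDataParallel in auto_dp.py, rather than by helpers like these.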

2. Overall Design

Figure: overall design of PipeTransformer.

3. Slides

https://docs.google.com/presentation/d/1t6HWL33KIQo2as0nSHeBpXYtTBcy0nXCoLiKd0EashY/edit?usp=sharing

4. Understanding PipeTransformer by Animation

https://videos.files.wordpress.com/3vsRzoiw/pipetransformer-animation_m4v_hd.mp4

Animation

5. Installation

Please follow INSTALL-CONDA.md.

6. Experiments

Check the README.md in each of the following directories:

examples/image_classification

examples/question_answering

examples/text_classification
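
For orientation, the text-classification example builds its trainer through the PipeTransformer class; the constructor call in the sketch below is taken from examples/text_classification/main_tc.py (it also appears in the issue tracebacks further down this page), while the wrapper function and argument names are placeholders rather than the repository's exact code.

    from pipe_transformer.pipe_transformer import PipeTransformer

    def build_trainer(config, data_manager, model_config, model):
        # Constructor call as it appears in examples/text_classification/main_tc.py;
        # internally it sets up elastic data parallelism
        # (self.auto_dp = AutoDataParallel(config), per the tracebacks below).
        return PipeTransformer(config, data_manager, model_config, model)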

7. Citation

If you use any part of this code in your research or any engineering project, please cite our paper:

@article{he2021pipetransformer,
  title={Pipetransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models},
  author={He, Chaoyang and Li, Shen and Soltanolkotabi, Mahdi and Avestimehr, Salman},
  journal={Thirty-eighth International Conference on Machine Learning},
  year={2021}
}

8. Contacts

Chaoyang He
https://chaoyanghe.com
[email protected]
[email protected]


Comments
  • Cifar100 Finetune NAN

    When reproducing the CIFAR-100 fine-tuning result on ViT, I noticed that you commented out this line.

    I tried both commenting and uncommenting this line; both lead to a NaN loss within one epoch.

    Could you provide your experimental log of the CIFAR-100 fine-tuning?

    opened by gaow0007 4
  • Some question on PipeTransformer

    Hello, thanks for the open-sourced project. I just ran some experiments and tried to understand your implementation. Could you please help me interpret these logs? (That would help me understand your paper and code... sorry, I haven't gone through the code yet, but I will.)

    1.

    _auto_balanced_elastic_partition(): {0: 4, 1: 2, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4, 7: 5}
    4189 2021-07-23,22:41:45.277 - {auto_pipe.py (141)} - _auto_balanced_elastic_partition(): {0: 10.194431999999999, 1: 7.087872, 2: 11.81184, 3: 9.451775999999999, 4: 11.81184, 5: 9.451775999999999, 6: 14.175744, 7: 11.890276}
    

    Does a number such as 10.194431999999999 here represent the parameter size? What about the values in the first dict {0: 4, 1: 2, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4, 7: 5}? For example, for 0: 4, what does the 4 mean on device 0?

    2.

    4189 2021-07-23,22:42:04.167 - {cv_trainer.py (76)} - train(): global_rank = 0. epoch = 0, batch index = 0/79
    4189 2021-07-23,22:42:07.974 - {cv_trainer.py (92)} - train(): (epoch = 0) forward_time_per_batch = 3.789346694946289
    4189 2021-07-23,22:42:09.849 - {cv_trainer.py (105)} - train(): global_rank = 0. data loading cost = 5.682324171066284
    4189 2021-07-23,22:42:09.849 - {cv_trainer.py (109)} - train(): global_rank = 0. sample_num_throughput (images/second): 4466480
    4189 2021-07-23,22:42:09.850 - {cv_trainer.py (112)} - train(): global_rank = 0. communication frequency (cross machine sync/second): 5065.584541
    4189 2021-07-23,22:42:09.850 - {cv_trainer.py (122)} - train(): -------------------------------------
    4189 2021-07-23,22:42:10.461 - {cv_trainer.py (72)} - train(): (epoch = 0) backwards_time_per_batch = 3.137924909591675
    4189 2021-07-23,22:42:10.461 - {cv_trainer.py (74)} - train(): --------------global_rank = 0. Epoch 0, batch index 1 Statistics: 
    4189 2021-07-23,22:42:10.461 - {cv_trainer.py (76)} - train(): global_rank = 0. epoch = 0, batch index = 1/79
    4189 2021-07-23,22:42:10.706 - {cv_trainer.py (92)} - train(): (epoch = 0) forward_time_per_batch = 2.006029725074768
    4189 2021-07-23,22:42:11.244 - {cv_trainer.py (109)} - train(): global_rank = 0. sample_num_throughput (images/second): 916
    4189 2021-07-23,22:42:11.244 - {cv_trainer.py (112)} - train(): global_rank = 0. communication frequency (cross machine sync/second): 1.433845
    

    What does the communication frequency mean here? I did observe that the number for the first batch is much bigger than for the others, i.e., 5065.584541 here. Do you know why?

    3. After the first epoch, PipeTransformer obtains some frozen layers, and I saw some newly added ranks. From my understanding, if some layers are frozen, the number of ranks should be reduced. Why are there newly added ranks?

    ################################ Number of frozen layers: 6 
    ################################ Pipe length: 4/8 
    ################################ Newly added ranks: [1, 9] 
    

    Furthermore, how should I interpret the frozen message here:

    frozen_message = [6.0, 4.0, 14.175744, 1.0, 2.0, 1.0, 9.0, -1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    

    4. How does PipeTransformer execute inter-node communication? My machine has RDMA enabled, but I found that it uses TCP by default. Is there any way to enable RDMA?

    Thanks

    opened by Young768 2
  • Runtime error

    Hello. I'm having difficulty running the provided code. First of all, I have a question: is it possible to run your code without InfiniBand? I'm running as follows:

    nohup sh run_tc_pipetransformer.sh 8 2 0 65.108.32.147 11111 0 "lo" 1e-5 8 0 freeze 2 > ./PipeTransformer-TC.log 2>&1 &
    nohup sh run_tc_pipetransformer.sh 8 2 1 65.108.32.147 11111 0 "lo" 1e-5 8 0 freeze 2 > ./PipeTransformer-TC.log 2>&1 &
    

    And get the following error:

    Traceback (most recent call last):
      File "/root/PipeTransformer/examples/text_classification/main_tc.py", line 261, in <module>
        pipe_transformer = PipeTransformer(config, tc_data_manager, model_config, model)
      File "/root/PipeTransformer/pipe_transformer/pipe_transformer.py", line 15, in __init__
        self.auto_dp = AutoDataParallel(config)
      File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 46, in __init__
        self.init_rpc()
      File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 117, in init_rpc
        rpc.init_rpc(
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
        _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
        rpc_agent = backend_registry.init_backend(
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in 
    init_backend
        return backend.value.init_backend_handler(*args, **kwargs)
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 360, in 
    _tensorpipe_init_backend_handler
        api._all_gather(None, timeout=rpc_backend_options.rpc_timeout)
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
        return func(*args, **kwargs)
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 224, in _all_gather
        rpc_sync(
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
        return func(*args, **kwargs)
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 809, in rpc_sync
        return fut.wait()
    RuntimeError: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
    

    Do you have an idea what it could be caused by? I was thinking that maybe it's because I haven't turned InfiniBand on, but when I change 0 "lo" to 1 "ib0" in both scripts I get another error message:

    Traceback (most recent call last):
      File "/root/PipeTransformer/examples/text_classification/main_tc.py", line 261, in <module>
        pipe_transformer = PipeTransformer(config, tc_data_manager, model_config, model)
      File "/root/PipeTransformer/pipe_transformer/pipe_transformer.py", line 15, in __init__
        self.auto_dp = AutoDataParallel(config)
      File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 45, in __init__
        self.init_ddp()
      File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 86, in init_ddp
        dist.init_process_group(init_method='tcp://' + str(self.config.master_addr) + ':' + str(self.config.master_port),
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 761, in 
    init_process_group
        default_pg = _new_process_group_helper(
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
        pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
    RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1670525541990/work/third_party/gloo/gloo/transport/tcp/device.cc:80] ifa != nullptr. Unable to find address for: ib0
    

    Any help would be appreciated

    opened by RealAntonVoronov 0
  • The example of image classification on ImageNet?

    Hello! Thanks for providing the source code; PipeTransformer is a great piece of work with huge practical value. Currently, in the uploaded "image_classification" folder, it seems that the code runs on CIFAR-100.

    How about releasing the code for ImageNet?

    opened by Openning07 0