PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models

Overview


This repository is the official implementation of the following paper:

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
Chaoyang He (USC), Shen Li (Facebook AI Research), Mahdi Soltanolkotabi (USC), Salman Avestimehr (USC)
Accepted to ICML 2021 (International Conference on Machine Learning 2021)

1. Introduction


The size of Transformer models is growing at an unprecedented rate. It has taken less than one year to reach trillion-level parameters since the release of GPT-3 (175B). Training such models requires both substantial engineering efforts and enormous computing resources, which are luxuries most research teams cannot afford. In this paper, we propose PipeTransformer, which leverages automated elastic pipelining for efficient distributed training of Transformer models. In PipeTransformer, we design an adaptive on-the-fly freeze algorithm that can identify and freeze some layers gradually during training, and an elastic pipelining system that can dynamically allocate resources to train the remaining active layers. More specifically, PipeTransformer automatically excludes frozen layers from the pipeline, packs active layers into fewer GPUs, and forks more replicas to increase data-parallel width. We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on SQuAD and GLUE datasets. Our results show that compared to the state-of-the-art baseline, PipeTransformer attains up to $2.83$-fold speedup without losing accuracy. We also provide various performance analyses for a more comprehensive understanding of our algorithmic and system-wise design. Finally, we have modularized our training system with flexible APIs and made the source code publicly available.
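
To make the freeze-and-repack idea concrete, the sketch below shows the two steps in miniature: deciding how many leading layers to freeze with a simple gradient-norm heuristic, and packing the remaining active layers into fewer pipeline partitions so that freed GPUs can host extra data-parallel replicas. The function names, the threshold, and the partitioning rule are illustrative assumptions, not the repository's actual API or the exact criterion used in the paper.

    import torch.nn as nn

    def frozen_prefix_length(layers, threshold=1e-3):
        # Illustrative heuristic: freeze the longest prefix of layers whose
        # accumulated gradient norm has fallen below a small threshold.
        num_frozen = 0
        for layer in layers:
            grad_norm = sum(
                p.grad.norm().item() for p in layer.parameters() if p.grad is not None
            )
            if grad_norm >= threshold:
                break
            num_frozen += 1
        return num_frozen

    def repartition_active_layers(layers, num_frozen, num_gpus):
        # Exclude frozen layers from the pipeline and pack the remaining
        # active layers into (possibly fewer) roughly equal partitions.
        for layer in layers[:num_frozen]:
            for p in layer.parameters():
                p.requires_grad = False
        active = layers[num_frozen:]
        if not active:
            return []
        pipe_len = min(num_gpus, len(active))          # shrink the pipeline
        per_gpu = (len(active) + pipe_len - 1) // pipe_len
        # GPUs no longer needed by the pipeline can host extra replicas.
        return [nn.Sequential(*active[i:i + per_gpu])
                for i in range(0, len(active), per_gpu)]

In the repository itself, the load-balanced partitioning is handled by auto_pipe.py (see _auto_balanced_elastic_partition in the logs below) and the data-parallel replicas by AutoDataParallel in auto_dp.py, rather than by helpers like these.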

2. Overall Design

Figure: overall design of PipeTransformer.

3. Slides

https://docs.google.com/presentation/d/1t6HWL33KIQo2as0nSHeBpXYtTBcy0nXCoLiKd0EashY/edit?usp=sharing

4. Understanding PipeTransformer by Animation

https://videos.files.wordpress.com/3vsRzoiw/pipetransformer-animation_m4v_hd.mp4

Animation

5. Installation

Please follow INSTALL-CONDA.md.

6. Experiments

Check the README.md in each of the following directories:

examples/image_classification

examples/question_answering

examples/text_classification
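
For orientation, the text-classification example builds its trainer through the PipeTransformer class; the constructor call in the sketch below is taken from examples/text_classification/main_tc.py (it also appears in the issue tracebacks further down this page), while the wrapper function and argument names are placeholders rather than the repository's exact code.

    from pipe_transformer.pipe_transformer import PipeTransformer

    def build_trainer(config, data_manager, model_config, model):
        # Constructor call as it appears in examples/text_classification/main_tc.py;
        # internally it sets up elastic data parallelism
        # (self.auto_dp = AutoDataParallel(config), per the tracebacks below).
        return PipeTransformer(config, data_manager, model_config, model)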

7. Citation

If you use any part of this code in your research or any engineering project, please cite our paper:

@article{he2021pipetransformer,
  title={Pipetransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models},
  author={He, Chaoyang and Li, Shen and Soltanolkotabi, Mahdi and Avestimehr, Salman},
  journal={Thirty-eighth International Conference on Machine Learning},
  year={2021}
}

8. Contacts

Chaoyang He
https://chaoyanghe.com
[email protected]
[email protected]


Comments
  • Cifar100 Finetune NAN

    When reproducing the CIFAR-100 fine-tuning result on ViT, I noticed that you commented out this line.

    I tried both commenting and uncommenting this line; both lead to a NaN loss within one epoch.

    Could you provide your experimental log of the CIFAR-100 fine-tuning?

    opened by gaow0007 4
  • Some question on PipeTransformer

    Hello, thanks for the open-sourced project. I just ran some experiments and tried to understand your implementation. Could you please help me interpret these logs? (That would help me understand your paper and code... sorry, I haven't gone through the code yet, but I will.)

    1.

    _auto_balanced_elastic_partition(): {0: 4, 1: 2, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4, 7: 5}
    4189 2021-07-23,22:41:45.277 - {auto_pipe.py (141)} - _auto_balanced_elastic_partition(): {0: 10.194431999999999, 1: 7.087872, 2: 11.81184, 3: 9.451775999999999, 4: 11.81184, 5: 9.451775999999999, 6: 14.175744, 7: 11.890276}
    

    Does a number such as 10.194431999999999 here represent the parameter size? What about the values in the first dict {0: 4, 1: 2, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4, 7: 5}? For example, for 0: 4, what does the 4 mean on device 0?

    2.

    4189 2021-07-23,22:42:04.167 - {cv_trainer.py (76)} - train(): global_rank = 0. epoch = 0, batch index = 0/79
    4189 2021-07-23,22:42:07.974 - {cv_trainer.py (92)} - train(): (epoch = 0) forward_time_per_batch = 3.789346694946289
    4189 2021-07-23,22:42:09.849 - {cv_trainer.py (105)} - train(): global_rank = 0. data loading cost = 5.682324171066284
    4189 2021-07-23,22:42:09.849 - {cv_trainer.py (109)} - train(): global_rank = 0. sample_num_throughput (images/second): 4466480
    4189 2021-07-23,22:42:09.850 - {cv_trainer.py (112)} - train(): global_rank = 0. communication frequency (cross machine sync/second): 5065.584541
    4189 2021-07-23,22:42:09.850 - {cv_trainer.py (122)} - train(): -------------------------------------
    4189 2021-07-23,22:42:10.461 - {cv_trainer.py (72)} - train(): (epoch = 0) backwards_time_per_batch = 3.137924909591675
    4189 2021-07-23,22:42:10.461 - {cv_trainer.py (74)} - train(): --------------global_rank = 0. Epoch 0, batch index 1 Statistics: 
    4189 2021-07-23,22:42:10.461 - {cv_trainer.py (76)} - train(): global_rank = 0. epoch = 0, batch index = 1/79
    4189 2021-07-23,22:42:10.706 - {cv_trainer.py (92)} - train(): (epoch = 0) forward_time_per_batch = 2.006029725074768
    4189 2021-07-23,22:42:11.244 - {cv_trainer.py (109)} - train(): global_rank = 0. sample_num_throughput (images/second): 916
    4189 2021-07-23,22:42:11.244 - {cv_trainer.py (112)} - train(): global_rank = 0. communication frequency (cross machine sync/second): 1.433845
    

    What does the communication frequency mean here? I did observe that the number for the first batch is much bigger than for the others, i.e., 5065.584541 here. Do you know why?

    3. After the first epoch, PipeTransformer obtains some frozen layers, and I saw some newly added ranks. From my understanding, if some layers are frozen, the number of ranks should be reduced. Why are there newly added ranks?

    ################################ Number of frozen layers: 6 
    ################################ Pipe length: 4/8 
    ################################ Newly added ranks: [1, 9] 
    

    Furthermore, how should I interpret the frozen message here:

    frozen_message = [6.0, 4.0, 14.175744, 1.0, 2.0, 1.0, 9.0, -1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    

    4. How does PipeTransformer execute inter-node communication? My machine has RDMA enabled, but I found that it uses TCP by default. Is there any way to enable RDMA?

    Thanks

    opened by Young768 2
  • Runtime error

    Hello. I'm having difficulty running the provided code. First of all, I have a question: is it possible to run your code without InfiniBand? I'm running as follows:

    nohup sh run_tc_pipetransformer.sh 8 2 0 65.108.32.147 11111 0 "lo" 1e-5 8 0 freeze 2 > ./PipeTransformer-TC.log 2>&1 &
    nohup sh run_tc_pipetransformer.sh 8 2 1 65.108.32.147 11111 0 "lo" 1e-5 8 0 freeze 2 > ./PipeTransformer-TC.log 2>&1 &
    

    And get the following error:

    Traceback (most recent call last):
      File "/root/PipeTransformer/examples/text_classification/main_tc.py", line 261, in <module>
        pipe_transformer = PipeTransformer(config, tc_data_manager, model_config, model)
      File "/root/PipeTransformer/pipe_transformer/pipe_transformer.py", line 15, in __init__
        self.auto_dp = AutoDataParallel(config)
      File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 46, in __init__
        self.init_rpc()
      File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 117, in init_rpc
        rpc.init_rpc(
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
        _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
        rpc_agent = backend_registry.init_backend(
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in 
    init_backend
        return backend.value.init_backend_handler(*args, **kwargs)
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 360, in 
    _tensorpipe_init_backend_handler
        api._all_gather(None, timeout=rpc_backend_options.rpc_timeout)
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
        return func(*args, **kwargs)
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 224, in _all_gather
        rpc_sync(
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
        return func(*args, **kwargs)
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 809, in rpc_sync
        return fut.wait()
    RuntimeError: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
    

    Do you have an idea what it could be caused by? I was thinking that maybe it's because I haven't turned InfiniBand on, but when I change 0 "lo" to 1 "ib0" in both scripts I get another error message:

    Traceback (most recent call last):
      File "/root/PipeTransformer/examples/text_classification/main_tc.py", line 261, in <module>
        pipe_transformer = PipeTransformer(config, tc_data_manager, model_config, model)
      File "/root/PipeTransformer/pipe_transformer/pipe_transformer.py", line 15, in __init__
        self.auto_dp = AutoDataParallel(config)
      File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 45, in __init__
        self.init_ddp()
      File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 86, in init_ddp
        dist.init_process_group(init_method='tcp://' + str(self.config.master_addr) + ':' + str(self.config.master_port),
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 761, in 
    init_process_group
        default_pg = _new_process_group_helper(
      File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
        pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
    RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1670525541990/work/third_party/gloo/gloo/transport/tcp/device.cc:80] ifa != nullptr. Unable to find address for: ib0
    

    Any help would be appreciated

    opened by RealAntonVoronov 0
  • The example of image classification on ImageNet?

    Hello! Thanks for providing the source code; PipeTransformer is a great piece of work with huge practical value. Currently, in the uploaded "image_classification" folder, it seems that the code runs on CIFAR-100.

    How about releasing the code for ImageNet?

    opened by Openning07 0