Sky Computing: Accelerating Geo-distributed Computing in Federated Learning

HPC-AI Tech

Last update: Dec 27, 2022

Related tags

Deep Learning SkyComputing

Overview

Sky Computing

Introduction

Sky Computing is a load-balanced framework for federated learning model parallelism. It adaptively allocate model layers to devices based on the their hardware sepcification. Sky Computing outperforms the baseline method by 55% in training time when training 160-layer BERT in a 64-node cluster. Our paper can be found at https://arxiv.org/abs/2202.11836

The concept sky computing was first introduced by Dr. Katarzyna Keahey et al. They used this word to describe a cross-cloud compute pattern. And later Prof. Stoica and Prof. Shenker generalized this word to geo-distributed computing. Our project is based on their definition. [1] [2]

Installation

git clone [email protected]:hpcaitech/SkyComputing.git
python -m pip install -r requirements.txt
cd ./scaelum
python -m pip install -v -e .

Experiment (using BERT)

To benchmark the Sky Computing, we prepared a single demo which you can run on your cluster to train BERT.

Prepare BERT model

Bidirectional Encoder Representations from Transformers (aka BERT) is one of the state-of-the-art deep learning models for Natural Language Processing. In the experiment part, we use BERT to run a simple benchmark.

cd $PROJECT
mkdir -p BERT/model && cd BERT/model 
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip

Prepare GLUE MNLI dataset

The General Language Understanding Evaluation (aka GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. And the Multi-Genre Natural Language Inference (aka MNLI) is one of the tasks in GLUE, it is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information.

cd $PROJECT
mkdir -p BERT/data && cd BERT/data
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/1502038877f6a88c225a34450793fbc3ea87eaba/download_glue_data.py
python download_glue_data.py --data_dir ./glue_data --tasks MNLI

Configuration

To run dllb in your cluster, you need to write a config file which contains the necessary information about training, e.g. model layers, useful environment variables. We have provided a well-commentted example, and here are some most important option:

# your project path
PROJECT = os.getenv("PROJECT")

# allocation type, valid values are even, optimal and dynamic
ALLOCATE_TYPE = "even"

# num of node (including the central server)
CORE_NUM = 4

Run scripts

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. We used slurm script to run our experiment.

#!/bin/sh

#SBATCH --job-name=gpu16   # Job name
#SBATCH -o gpu16.o%j       # Name of stdout output file
#SBATCH -e gpu16.e%j       # Name of stderr error file
#SBATCH -N 16              # Node numbers
#SBATCH -n 16              # GPU numbers
#SBATCH --time=02:00:00    # Run time (hh:mm:ss)

# run
python ./ip_addr.py > "./HOST"
srun python ./launch.py -c "./experiment/config.py"

Citation

@misc{zhu2022sky,
      title={Sky Computing: Accelerating Geo-distributed Computing in Federated Learning}, 
      author={Jie Zhu and Shenggui Li and Yang You},
      year={2022},
      eprint={2202.11836},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Reference

@article{keahey2009sky,
  title={Sky computing},
  author={Keahey, Katarzyna and Tsugawa, Mauricio and Matsunaga, Andrea and Fortes, Jose},
  journal={IEEE Internet Computing},
  volume={13},
  number={5},
  pages={43--51},
  year={2009},
  publisher={IEEE}
}
@inproceedings{stoica2021cloud,
  title={From cloud computing to sky computing},
  author={Stoica, Ion and Shenker, Scott},
  booktitle={Proceedings of the Workshop on Hot Topics in Operating Systems},
  pages={26--32},
  year={2021}
}

Comments

about install

I wonder if the code in the installation section cd ./scaelum should be removed ? cause it turn out an error [ERROR: File "setup.py" or "setup.cfg" not found. Directory cannot be installed in editable mode: /home/l000/SkyComputing/scaelum]

python download_glue_data.py --data_dir ./glue_data --tasks MNLI Downloading and extracting MNLI... Note (12/10/20): This script no longer downloads SNLI. You will need to manually download and format the data to use SNLI.

It seems that the error has no connected with the envrironment, so those were not put out .

opened by fermidos 0

Channel Pruning for Accelerating Very Deep Neural Networks (ICCV'17)

1k Jan 3, 2023

The code for the NSDI'21 paper "BMC: Accelerating Memcached using Safe In-kernel Caching and Pre-stack Processing".

BMC The code for the NSDI'21 paper "BMC: Accelerating Memcached using Safe In-kernel Caching and Pre-stack Processing". BibTex entry available here. B

383 Dec 16, 2022

《Geo Word Clouds》paper implementation

2 Jan 28, 2022

A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization

University1652-Baseline [Paper] [Slide] [Explore Drone-view Data] [Explore Satellite-view Data] [Explore Street-view Data] [Video Sample] [中文介绍] This

335 Jan 6, 2023

Weak-supervised Visual Geo-localization via Attention-based Knowledge Distillation

Weak-supervised Visual Geo-localization via Attention-based Knowledge Distillation Introduction WAKD is a PyTorch implementation for our ICPR-2022 pap

2 Oct 20, 2022

NPBG++: Accelerating Neural Point-Based Graphics

[CVPR 2022] NPBG++: Accelerating Neural Point-Based Graphics Project Page | Paper This repository contains the official Python implementation of the p

57 Dec 3, 2022

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

H2O H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Fl

6.1k Jan 5, 2023

FedJAX is a library for developing custom Federated Learning (FL) algorithms in JAX.

FedJAX: Federated learning with JAX What is FedJAX? FedJAX is a library for developing custom Federated Learning (FL) algorithms in JAX. FedJAX priori

208 Dec 14, 2022

An open framework for Federated Learning.

Welcome to Intel® Open Federated Learning Federated learning is a distributed machine learning approach that enables organizations to collaborate on m

397 Dec 27, 2022

Sky Computing: Accelerating Geo-distributed Computing in Federated Learning

Related tags

Overview

Sky Computing

Introduction

Installation

Experiment (using BERT)

Prepare BERT model

Prepare GLUE MNLI dataset

Configuration

Run scripts

Citation

Reference

You might also like...

Channel Pruning for Accelerating Very Deep Neural Networks (ICCV'17)

The code for the NSDI'21 paper "BMC: Accelerating Memcached using Safe In-kernel Caching and Pre-stack Processing".

《Geo Word Clouds》paper implementation

A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization

Weak-supervised Visual Geo-localization via Attention-based Knowledge Distillation

NPBG++: Accelerating Neural Point-Based Graphics

FedJAX is a library for developing custom Federated Learning (FL) algorithms in JAX.

An open framework for Federated Learning.

Comments

about install

Owner

HPC-AI Tech

Federated Learning - Including common test models for federated learning, like CNN, Resnet18 and lstm, controlled by different parser

Vertical Federated Principal Component Analysis and Its Kernel Extension on Feature-wise Distributed Data based on Pytorch Framework

Official implementation of "Accelerating Reinforcement Learning with Learned Skill Priors", Pertsch et al., CoRL 2020

A PyTorch Library for Accelerating 3D Deep Learning Research

Learn about quantum computing and algorithm on quantum computing

Pytorch implementation of Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization https://arxiv.org/abs/2008.11646

QueryDet: Cascaded Sparse Query for Accelerating High-Resolution SmallObject Detection

Source code for NAACL 2021 paper "TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference"

source code for https://arxiv.org/abs/2005.11248 "Accelerating Antimicrobial Discovery with Controllable Deep Generative Models and Molecular Dynamics"

Code and data for "Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning" (EMNLP 2021).