Near-Optimal Sparse Allreduce for Distributed Deep Learning (published in PPoPP'22)
Ok-Topk is a scheme for distributed training with sparse gradients. It integrates a novel sparse allreduce algorithm (with less than 6k communication volume, which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proven both theoretically and empirically.
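To give a feel for the building blocks, the minimal Python sketch below shows plain top-k gradient sparsification followed by a pickle-based MPI allgather of (index, value) pairs. It is a simplified baseline for illustration only, not the repository's near-optimal allreduce, and the function and parameter names are made up for this example.

# Minimal sketch of top-k sparsification plus a naive sparse exchange.
# Illustration only; the actual Ok-Topk allreduce in this repo is more
# sophisticated and bounds communication volume below 6k per worker.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def sparse_allreduce_topk(grad, k):
    # Select the k largest-magnitude local gradient entries.
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    vals = grad[idx]
    # Exchange (index, value) pairs among all workers.
    gathered = comm.allgather((idx, vals))
    # Accumulate the sparse contributions into a dense buffer
    # (averaging/scaling is left out for brevity).
    out = np.zeros_like(grad)
    for r_idx, r_vals in gathered:
        out[r_idx] += r_vals
    return out

if __name__ == "__main__":
    grad = np.random.randn(1 << 20).astype(np.float32)
    reduced = sparse_allreduce_topk(grad, k=1024)

With P processes, the naive allgather above delivers on the order of k*P values to every worker; avoiding that blow-up while staying below 6k communication volume per worker is exactly the point of Ok-Topk's sparse allreduce.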
Setup the environment
To install the required Python modules:
conda create --name py38_oktopk python=3.8
conda activate py38_oktopk
pip3 install pip==20.2.4
pip install -r requirements.txt
MPICC="cc -shared" pip install --no-binary=mpi4py mpi4py
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
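After installation, a quick sanity check such as the following (an illustrative snippet, not part of the repository) can confirm that torch, mpi4py, and the apex CUDA extensions import correctly.

# check_env.py -- illustrative environment check, not part of this repo.
import torch
from mpi4py import MPI
from apex import amp    # apex Python side
import amp_C            # should be present if apex was built with --cuda_ext

print("torch:", torch.__version__, "CUDA available:", torch.cuda.is_available())
print("MPI ranks:", MPI.COMM_WORLD.Get_size())
print("apex CUDA extensions loaded")

Running it with a single process (for example, python check_env.py) is enough; mpi4py initializes fine without mpirun.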
Prepare Datasets
Cifar-10 for VGG
cd ./VGG/vgg_data
wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -zxvf cifar-10-python.tar.gz
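To verify the extraction, a short snippet like the one below can inspect one batch; it is illustrative only (the VGG training scripts load the dataset themselves) and assumes the archive extracted into cifar-10-batches-py/ in the current directory.

# Illustrative check that the CIFAR-10 archive extracted correctly.
import pickle

with open("cifar-10-batches-py/data_batch_1", "rb") as f:
    batch = pickle.load(f, encoding="bytes")

print("images:", batch[b"data"].shape)    # expected (10000, 3072)
print("labels:", len(batch[b"labels"]))   # expected 10000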
AN4 for LSTM
cd ./LSTM/audio_data
wget https://www.dropbox.com/s/l5w4up20u5pfjxf/an4.zip
unzip an4.zip
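A quick way to confirm the unzip produced audio files is to count them from the data directory; the layout and .wav extension below are assumptions about the archive, so adjust the path if needed.

# Illustrative check that the AN4 archive unpacked; the directory layout
# and file extension are assumptions -- adjust if the archive differs.
import os

wav_count = sum(
    len([f for f in files if f.endswith(".wav")])
    for _, _, files in os.walk(".")
)
print("found", wav_count, "wav files under", os.getcwd())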
Wikipedia for BERT
cd ./BERT/bert/bert_data/
Prepare the dataset according to the README file.
Run jobs
We run experiments on GPU clusters with the SLURM job scheduler. To evaluate the performance of Ok-Topk, Gaussiank, gtopk, topkA, topkDSA, and dense, run the jobs as follows.
To run VGG jobs
cd ./VGG
./sbatch_vgg_jobs.sh
To run LSTM jobs
cd ./LSTM
./sbatch_lstm_jobs.sh
To run BERT jobs
cd ./BERT/bert/
./sbatch_bert_jobs.sh
Publication
The work of Ok-Topk is published in PPoPP'22. DOI
License
See LICENSE.