Near-Optimal Sparse Allreduce for Distributed Deep Learning (published in PPoPP'22)

Overview

Ok-Topk is a scheme for distributed training with sparse gradients. It integrates a novel sparse allreduce algorithm (communication volume of less than 6k, which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proven both theoretically and empirically.
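As background, the top-k gradient sparsification that such schemes build on can be sketched in a few lines. This is a minimal NumPy illustration under our own naming, not the Ok-Topk implementation itself:

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a flat gradient.

    Returns (indices, values), the sparse representation each worker
    would communicate instead of the dense gradient.
    """
    flat = grad.ravel()
    # argpartition selects the k largest magnitudes in O(n) time
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

g = np.array([0.1, -2.0, 0.03, 1.5, -0.2])
idx, vals = topk_sparsify(g, 2)  # keeps -2.0 and 1.5
```

A sparse allreduce then sums these (index, value) pairs across workers; the contribution of Ok-Topk is performing that reduction with less than 6k communication volume per worker.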

Set up the environment

To install the required Python modules:

conda create --name py38_oktopk python=3.8

conda activate py38_oktopk

pip3 install pip==20.2.4

pip install -r requirements.txt

MPICC="cc -shared" pip install --no-binary=mpi4py mpi4py

git clone https://github.com/NVIDIA/apex

cd apex

pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Prepare Datasets

CIFAR-10 for VGG

cd ./VGG/vgg_data

wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

tar -zxvf cifar-10-python.tar.gz

AN4 for LSTM

cd ./LSTM/audio_data

wget https://www.dropbox.com/s/l5w4up20u5pfjxf/an4.zip

unzip an4.zip

Wikipedia for BERT

cd ./BERT/bert/bert_data/

Prepare the dataset according to the README file.

Run jobs

We run experiments on GPU clusters with the SLURM job scheduler. To evaluate the performance of Ok-Topk, Gaussiank, gtopk, topkA, topkDSA, and dense, run the jobs as follows.

To run VGG jobs

cd ./VGG

./sbatch_vgg_jobs.sh

To run LSTM jobs

cd ./LSTM

./sbatch_lstm_jobs.sh

To run BERT jobs

cd ./BERT/bert/

./sbatch_bert_jobs.sh

Publication

The work of Ok-Topk is published in PPoPP'22. DOI

License

See LICENSE.

Comments
  • Error when running vgg16_oktopk.sh

    Dear authors:

Thank you for open-sourcing your work. I tried to reproduce your Ok-Topk experiments by running sbatch vgg16_oktopk.sh, but encountered the following error:

    Exception in thread allreducer:
    Traceback (most recent call last):
      File "<my-dir>/anaconda3/envs/py38_oktopk/lib/python3.8/threading.py", line 932, in _bootstrap_inner
        self.run()
      File "<my-dir>/anaconda3/envs/py38_oktopk/lib/python3.8/threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
      File "<my-dir>/frontera/Ok-Topk/VGG/allreducer.py", line 643, in run
        self._boundaries[new_name][i] = global_boundaries[i]
    IndexError: index 0 is out of bounds for axis 0 with size 0
    (the same traceback is raised by the other allreducer threads)
    

    Do you have any idea what happened here? The only change I made was decreasing the number of nodes:

    #SBATCH --nodes=4
    #SBATCH --ntasks=4
    ...
    nworkers="${nworkers:-4}"
    
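    For context, this particular IndexError is exactly what NumPy raises when indexing into an empty array, so global_boundaries appears to have had length zero under this node count (an observation from the error message, not a confirmed diagnosis):

    ```python
    import numpy as np

    # Reproducing the error class seen in the traceback:
    # indexing an empty array raises exactly this IndexError.
    global_boundaries = np.zeros(0)
    try:
        global_boundaries[0]
    except IndexError as e:
        message = str(e)  # "index 0 is out of bounds for axis 0 with size 0"
    ```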
    opened by EtoDemerzel0427 4
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks that all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher Kasimir Schulz.
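    The check described above can be sketched as follows. This is a hedged sketch of the usual form of this patch; safe_extractall and is_within_directory are illustrative names, not necessarily those in the actual pull request:

    ```python
    import os
    import tarfile

    def is_within_directory(directory, target):
        """True if `target` resolves to a path inside `directory`."""
        abs_directory = os.path.abspath(directory)
        abs_target = os.path.abspath(target)
        return os.path.commonpath([abs_directory, abs_target]) == abs_directory

    def safe_extractall(tar, path="."):
        """Refuse to extract members that would escape `path` (CVE-2007-4559)."""
        for member in tar.getmembers():
            member_path = os.path.join(path, member.name)
            if not is_within_directory(path, member_path):
                raise tarfile.ExtractError("attempted path traversal in tar file")
        tar.extractall(path)
    ```

    The traversal check runs over all members before any extraction starts, so a malicious archive is rejected without touching the filesystem.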

    opened by TrellixVulnTeam 0
  • How to reproduce the results of LSTM on AN4

    I was interested in your PPoPP'22 work; thank you for making the code open source. I tried to run the LSTM AN4 code but cannot achieve the results claimed in the paper (WER = 0.309 or 0.368); I only reach 0.46. I know I am using a different environment. Perhaps you can give me some suggestions to improve the WER?

    Here is the environment I use:

    • 8*A100 within one server
    • Horovod 0.22.1
    • Here are the parameters I use: horovodrun -np 8 python horovod_trainer.py --dnn lstman4 --dataset an4 --max-epochs 1000 --batch-size 2 --nworkers 8 --data-dir ./audio_data --lr 0.001 --nwpernode 8 --nsteps-update 1

    In addition, I also adjusted the learning-rate decay rate (in dl_trainer.py, _Adjust_Learning_Rate_LSTMan4()). The original value of 1.01 may not be suitable for my environment, so I changed it to 1.005.
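    Assuming the schedule divides the learning rate by the decay factor once per epoch (an assumption about dl_trainer.py made here only for illustration), the gap between the two factors compounds dramatically over a long run:

    ```python
    # Compare lr after 1000 epochs for decay factors 1.01 vs 1.005,
    # under the illustrative schedule lr_t = lr0 / decay**t.
    lr0 = 0.001
    lrs = {decay: lr0 / decay ** 1000 for decay in (1.01, 1.005)}
    ratio = lrs[1.005] / lrs[1.01]  # roughly 140x larger with the gentler decay
    ```

    After hundreds of epochs the effective learning rate differs by orders of magnitude between the two settings, which would plausibly explain sensitivity of the final WER to this constant.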

    Thank you for reading this; do you have any suggestions?

    opened by Jia-zb 2
  • Bump ujson from 4.0.2 to 5.2.0

    Bumps ujson from 4.0.2 to 5.2.0.

    Release notes

    Sourced from ujson's releases.

    ... (truncated)

    Commits
    • f6860f1 Remove shebang
    • c0ff7b1 python -m pytest
    • 362fed3 Clearer pytest command
    • 82917c0 actions/checkout@v3
    • 3c095f1 Widen tests to cover more possible buffer overflows
    • f4d2c87 Refactor buffer reservations to ensure sufficient space on all additions
    • 1846e08 Add fuzz test to CI/CD.
    • 5875168 Fix some more seg-faults on encoding.
    • 1a39406 Remove the hidden JSON_NO_EXTRA_WHITESPACE compile knob.
    • 20aa1a6 Add a fuzzing test to search for segfaults in encoding.
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.



    dependencies 
    opened by dependabot[bot] 0
  • Multi-Node Sparse Training Error

    Thanks for releasing Ok-Topk. It is interesting work, and I am developing some functionality based on this repo. I succeeded in single-node training. However, when I try Ok-Topk across 2 nodes (8 GPUs in total), I find that certain values in all_indexes are negative.

    May I ask for some suggestions on how to debug this?

    Thanks.

    opened by gaow0007 3
Owner
Shigang Li