Bagua is a flexible and performant framework for developing distributed training algorithms.

Overview

Bagua


Bagua is a distributed training utility developed by Kuaishou Technology and the DS3 Lab @ ETH Zürich. By adding only a few lines of code, users can extend training on a single GPU to multiple GPUs, possibly across multiple machines, with excellent speedup. Bagua also provides a flexible system abstraction that supports state-of-the-art system relaxation techniques for distributed training. Powered by this system design, Bagua can implement and extend various state-of-the-art distributed learning algorithms, and researchers can easily develop new distributed training algorithms on top of Bagua without sacrificing system performance.
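
For example, converting a single-GPU PyTorch training script usually amounts to a few lines like the following. This is a minimal sketch distilled from the gradient allreduce example scripts quoted in the issues further down this page, not a complete training program:

import torch
import torch.nn as nn
import bagua.torch_api as bagua
from bagua.torch_api.algorithms import gradient_allreduce

# Pin this process to its GPU and initialize the Bagua process group.
torch.cuda.set_device(bagua.get_local_rank())
bagua.init_process_group()

model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Wrap the model with a Bagua algorithm; the training loop itself stays unchanged.
algorithm = gradient_allreduce.GradientAllReduceAlgorithm()
model = model.with_bagua([optimizer], algorithm)

The script is then started on each machine through the bagua.distributed.launch module, as in the reproduction commands quoted below.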

So far, Bagua has integrated primitives including the following (a sketch of how a primitive is selected follows the list):

  • Centralized Synchronous Communication (AllReduce)
  • Decentralized Synchronous Communication
  • Low Precision Communication
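
In the example scripts quoted further down this page, the primitive to use is selected with the --algorithm flag when launching through bagua.distributed.launch, for instance:

python -m bagua.distributed.launch --nproc_per_node=8 main.py --algorithm gradient_allreduce

According to the help text of those scripts, the same flag also accepts bytegrad, decentralized, low_precision_decentralized, qadam and async.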

Its effectiveness has been verified in various scenarios, including VGG and ResNet on ImageNet, BERT-Large fine-tuning, and many industrial applications at Kuaishou.

The underlying communication execution engine is bagua-core, a library written in Rust.

Performance

The scalability of different systems on VGG16 with up to 128 GPUs.



Epoch time of BERT-Large Finetune under different network conditions for different systems.

For more comprehensive and up-to-date results, refer to the Bagua benchmark page.

Installation

Development version:

pip install git+https://github.com/BaguaSys/bagua.git

Release version:

pip install bagua
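
If Bagua cannot detect its bundled NCCL library at runtime (see the warning quoted in the "NCCL error when running backward" issue below), the bundled dependencies can be installed afterwards, for example with:

python3 -c "import bagua_core; bagua_core.install_deps()"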

Build API documentation locally

pip install -r docs/doc-requirements.txt
make html

Cite Bagua

@misc{gan2021bagua,
  title={BAGUA: Scaling up Distributed Learning with System Relaxations}, 
  author={Shaoduo Gan and Xiangru Lian and Rui Wang and Jianbin Chang and Chengjun Liu and Hongmei Shi and Shengzhuo Zhang and Xianghong Li and Tengxu Sun and Jiawei Jiang and Binhang Yuan and Sen Yang and Ji Liu and Ce Zhang},
  year={2021},
  eprint={2107.01499},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

@book{liu2020distributed,
  title={Distributed Learning Systems with First-Order Methods: An Instruction},
  author={Liu, J. and Zhang, C.},
  isbn={9781680837018},
  series={Foundations and trends in databases},
  url={https://books.google.com/books?id=vzQmzgEACAAJ},
  year={2020},
  publisher={now publishers}
}

Comments
  • Why does FusedOptimizer have a huge impact on model precision?

    I wrapped my custom optimizer with FusedOptimizer and the precision was way worse than that without FusedOptimizer. I think FusedOptimizer shouldn't be affecting the model precision. Or is there something wrong with my custom optimizer?

    Here is the optimizer I use:

    https://github.com/cybertronai/pytorch-lamb/blob/master/pytorch_lamb/lamb.py
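
    Roughly, the wrapping looks like the sketch below (this is only a sketch; the exact fuse_optimizer call may differ slightly from my actual script):

    import torch.nn as nn
    from pytorch_lamb import Lamb  # the LAMB implementation linked above
    from bagua.torch_api.contrib.fuse.optimizer import fuse_optimizer

    model = nn.Linear(10, 1).cuda()
    optimizer = Lamb(model.parameters(), lr=1e-3)
    # Wrap the custom optimizer with the fused one; in principle this should not change precision.
    optimizer = fuse_optimizer(optimizer)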

    bug 
    opened by ProHuper 11
  • turn off bagua-net

    Hi, I want to know if I can turn off bagua-net in this script, so that I can compare with the original PyTorch throughput. Passing the --enable_bagua_net argument in bagua.distributed.launch does not work.

    opened by CaRRotOne 8
  • My process has been blocked; the screen (shown below) did not change for 30 minutes

    Describe the bug A clear and concise description of what the bug is.

    Environment

    • Your operating system and version:Ubuntu18.04
    • Your python version:3.8
    • Your PyTorch version:11.0
    • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?:
    • Have you tried using latest bagua master (python3 -m pip install git+https://github.com/BaguaSys/bagua.git -f https://repo.arrayfire.com/python/wheels/3.8.0/)?:0.8.1.post1

    Reproducing

    Please provide a minimal working example. This means the runnable code.

    Please also write what exact commands are required to reproduce your results.

    Additional context Add any other context about the problem here.

    opened by lixiangMindSpore 8
  • Problem with AttributeError ('setuptools._distutils' has no attribute 'version') when executing the MNIST example

    I ran the MNIST example and got the following error:

    [kqian@eu-login-04 testrun]$ python3 -m bagua.distributed.launch --nproc_per_node=8 main.py --arch resnet50 --algorithm gradient_allreduce [imagenet-folder with train and val folders]
    *****************************************
    Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    *****************************************
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
        LooseVersion = distutils.version.LooseVersion
    **AttributeError: module 'setuptools._distutils' has no attribute 'version'**
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'setuptools._distutils' has no attribute 'version'
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'setuptools._distutils' has no attribute 'version'
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'setuptools._distutils' has no attribute 'version'
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'setuptools._distutils' has no attribute 'version'
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'setuptools._distutils' has no attribute 'version'
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'setuptools._distutils' has no attribute 'version'
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'setuptools._distutils' has no attribute 'version'
    Killing subprocess 26136
    Killing subprocess 26137
    Killing subprocess 26138
    Killing subprocess 26140
    Killing subprocess 26142
    Killing subprocess 26144
    Killing subprocess 26145
    Killing subprocess 26146
    Traceback (most recent call last):
      File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/distributed/launch.py", line 342, in <module>
        main()
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/distributed/launch.py", line 327, in main
        sigkill_handler(signal.SIGTERM, None)  # not coming back
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/distributed/launch.py", line 291, in sigkill_handler
        returncode=last_return_code, cmd=cmd
    subprocess.CalledProcessError: Command '['/cluster/apps/nss/python/3.7.4/x86_64/bin/python3', '-u', 'main.py', '--arch', 'resnet50', '--algorithm', 'gradient_allreduce', '[imagenet-folder', 'with', 'train', 'and', 'val', 'folders]']' returned non-zero exit status 1.
    
    
    opened by silverCore97 7
  • cannot find libnccl.so.2

    Describe the bug A clear and concise description of what the bug is. [screenshot]

    Environment

    • Your operating system and version: Ubuntu18.04
    • Your python version:3.8
    • Your PyTorch version:11.0
    • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?:
    • conda create -n torch17 python=3.8
    • Have you tried using latest bagua master (python3 -m pip install git+https://github.com/BaguaSys/bagua.git -f https://repo.arrayfire.com/python/wheels/3.8.0/)?:I use 0.8.1.post1

    Reproducing

    Please provide a minimal working example. This means the runnable code.

    Please also write what exact commands are required to reproduce your results.

    Additional context Add any other context about the problem here.

    question 
    opened by lixiangMindSpore 7
  • I use bagua with the phenomenon as follows ( bagua.broadcast(ps, 0, comm=comm) )

    Describe the bug A clear and concise description of what the bug is. [screenshot]

    Environment

    • Your operating system and version:Ubuntu18.04
    • Your python version:3.8.12
    • Your PyTorch version:11.0
    • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?:conda create -n torch17 python=3.8
    • Have you tried using latest bagua master (python3 -m pip install --pre bagua)?:0.8.1.post1

    Reproducing

    Please provide a minimal working example. This means the runnable code.

    Please also write what exact commands are required to reproduce your results.

    Additional context Add any other context about the problem here.

    question 
    opened by lixiangMindSpore 6
  • NCCL error when running backward

    I ran a very simply example and got error:

    WARNING:root:Bagua cannot detect bundled NCCL library, Bagua will try to use system NCCL instead. If you encounter any error, please run `import bagua_core; bagua_core.install_deps()` or the `bagua_install_deps.py` script to install bundled libraries.
    WARNING:root:Bagua cannot detect bundled NCCL library, Bagua will try to use system NCCL instead. If you encounter any error, please run `import bagua_core; bagua_core.install_deps()` or the `bagua_install_deps.py` script to install bundled libraries.
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Bootstrap : Using eth1:11.214.158.37<0>
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB eth1:11.214.158.37<0>
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Using network IB
    NCCL version 2.10.3+cuda10.2
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Bootstrap : Using eth1:11.214.158.37<0>
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB eth1:11.214.158.37<0>
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Using network IB
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 00/04 :    0   1
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 01/04 :    0   1
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 02/04 :    0   1
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Setting affinity for GPU 1 to 3f,07ff0000,003e07ff
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 03/04 :    0   1
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Setting affinity for GPU 0 to 3f,07ff0000,003e07ff
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[3d000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[3d000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 00 : 1[3d000] -> 0[1a000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 02 : 0[1a000] -> 1[3d000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 01 : 1[3d000] -> 0[1a000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 03 : 0[1a000] -> 1[3d000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 02 : 1[3d000] -> 0[1a000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 03 : 1[3d000] -> 0[1a000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Connected all rings
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Connected all trees
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Connected all rings
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Connected all trees
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO comm 0x55bc8aee70c0 rank 1 nranks 2 cudaDev 1 busId 3d000 - Init COMPLETE
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO comm 0x555f0e926110 rank 0 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
    2021-11-04T14:16:06.243214Z  WARN bagua_core_internal: Parameter autotuning service not detected. Enabling it may further improve the performance. See https://tutorials.baguasys.com/performance-autotuning/ for more details.
    2021-11-04T14:16:06.243246Z  WARN bagua_core_internal: Parameter autotuning service not detected. Enabling it may further improve the performance. See https://tutorials.baguasys.com/performance-autotuning/ for more details.
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Launch mode Parallel
    
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [0] enqueue.cc:329 NCCL WARN Cuda failure 'invalid resource handle'
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [0] NCCL INFO enqueue.cc:1047 -> 1
    fatal runtime error: Rust cannot catch foreign exceptions
    Killing subprocess 93207
    Killing subprocess 93208
    Traceback (most recent call last):
      File "/root/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/root/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/root/anaconda3/lib/python3.8/site-packages/bagua/distributed/launch.py", line 342, in <module>
        main()
      File "/root/anaconda3/lib/python3.8/site-packages/bagua/distributed/launch.py", line 327, in main
        sigkill_handler(signal.SIGTERM, None)  # not coming back
      File "/root/anaconda3/lib/python3.8/site-packages/bagua/distributed/launch.py", line 290, in sigkill_handler
        raise subprocess.CalledProcessError(
    subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u', 'train.py']' died with <Signals.SIGABRT: 6>.
    

    I used NCCL 2.10.3 and CUDA 10.2 with my local NCCL, but the same error occurs when I install NCCL using bagua_core.install_deps, and everything works fine if I use DDP.

    Here is my code:

    import torch
    from torch.nn.modules.loss import CrossEntropyLoss
    from torch.utils.data.dataloader import DataLoader
    from LAMB import LAMB
    from bagua.torch_api.contrib.fuse.optimizer import fuse_optimizer
    import torch.nn as nn
    import torch.optim
    from torch.utils.data import Dataset, DataLoader
    import bagua.torch_api as bagua
    from bagua.torch_api.algorithms import gradient_allreduce
    
    from torch.nn.parallel import DistributedDataParallel as DDP
    import torch.distributed as dist
    import argparse
    
    class MyDataset(Dataset):
        def __init__(self) -> None:
            self.input = torch.randn(10000, 10)
            self.laebl = torch.randn(10000, 1)
    
        def __getitem__(self, index):
            return self.input[index], self.laebl[index]
    
        def __len__(self):
            return  10000
    
    
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument("--local_rank", type=int, default=-1)
        args = parser.parse_args()
        # dist.init_process_group(backend='nccl')
        bagua.init_process_group()
    
        model = nn.Sequential(
            nn.Linear(10, 5),
            nn.Linear(5, 2),
            nn.Linear(2, 1),
        )   
    
        optimizer = torch.optim.Adam(
            params=model.parameters(),
            lr=0.1,
            betas=(0.9, 0.999),
            eps=1e-06,
            weight_decay=0
        )
    
        algorithm = gradient_allreduce.GradientAllReduceAlgorithm()
        model.to(bagua.get_local_rank())
        # model.to(args.local_rank)
        # model = DDP(model, device_ids=[args.local_rank])
        model = model.with_bagua(
            [optimizer],
            algorithm
        )
        dataset = MyDataset()
        dataloader = DataLoader(dataset, batch_size=5)
    
        for i in range(10):
            for x, y in dataloader:
                # x = x.to(args.local_rank)
                # y = y.to(args.local_rank)
                x = x.to(bagua.get_local_rank())
                y = y.to(bagua.get_local_rank())
                optimizer.zero_grad()
                output = model(x)
                loss = (output - y).pow(2).sum()
                loss.backward()
                optimizer.step()
    
    opened by ProHuper 5
  • What's wrong with this? Do I need to do anything else? Will it affect my result?

    Describe the bug A clear and concise description of what the bug is. [screenshot]

    Environment

    • Your operating system and version: Ubuntu18.04
    • Your python version:3.8.12
    • Your PyTorch version:11.1
    • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?:conda create -n torch python=3.8
    • Have you tried using latest bagua master (python3 -m pip install --pre bagua)?:0.8.1.post1

    Reproducing

    Please provide a minimal working example. This means the runnable code.

    Please also write what exact commands are required to reproduce your results.

    Additional context Add any other context about the problem here.

    question 
    opened by lixiangMindSpore 5
  • Format Python code with psf/black push

    There appear to be some python formatting errors in a05e4e345ea9ab2f7b725a5cc2e90a827cef31ff. This pull request uses the psf/black formatter to fix these issues.

    PR: unreviewed 
    opened by github-actions[bot] 5
  • Format Python code with psf/black push

    There appear to be some python formatting errors in cd499b8482cd293c584b599f55ffacea94020039. This pull request uses the psf/black formatter to fix these issues.

    PR: unreviewed 
    opened by github-actions[bot] 5
  • Format Python code with psf/black push

    There appear to be some python formatting errors in 8117b05f0b08bc3d03dc8f48572fc95b50331ff6. This pull request uses the psf/black formatter to fix these issues.

    PR: unreviewed 
    opened by github-actions[bot] 5
  • chore(deps): bump once_cell from 1.10.0 to 1.17.0 in /rust/bagua-core

    Bumps once_cell from 1.10.0 to 1.17.0.

    Changelog

    Sourced from once_cell's changelog.

    1.17.0

    • Add race::OnceRef for storing a &'a T.

    1.16.0

    • Add no_std implementation based on critical-section, #195.
    • Deprecate atomic-polyfill feature (use the new critical-section instead)

    1.15.0

    • Increase minimal supported Rust version to 1.56.0.
    • Implement UnwindSafe even if the std feature is disabled.

    1.14.0

    • Add extension to unsync and sync Lazy mut API:
      • force_mut
      • get_mut

    1.13.1

    • Make implementation compliant with strict provenance.
    • Upgrade atomic-polyfill to 1.0

    1.13.0

    • Add Lazy::get, similar to OnceCell::get.

    1.12.1

    • Remove incorrect debug_assert.

    1.12.0

    • Add OnceCell::wait, a blocking variant of get.

    1.11.0

    • Add OnceCell::with_value to create initialized OnceCell in const context.
    • Improve Clone implementation for OnceCell.
    • Rewrite parking_lot version on top of parking_lot_core, for even smaller cells!
    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    PR: unreviewed dependencies rust 
    opened by dependabot[bot] 0
  • chore(deps): bump libc from 0.2.125 to 0.2.139 in /rust/bagua-core

    Bumps libc from 0.2.125 to 0.2.139.

    Release notes

    Sourced from libc's releases.

    0.2.139

    What's Changed

    New Contributors

    Full Changelog: https://github.com/rust-lang/libc/compare/0.2.138...0.2.139

    0.2.138

    What's Changed

    ... (truncated)

    Commits
    • f4bc851 Auto merge of #3042 - flba-eb:release_0.2.139, r=JohnTitor
    • dc3d43c Prepare 0.2.139 release
    • c59ca73 Auto merge of #3041 - devnexen:linux_kernel_version, r=JohnTitor
    • 88d6a1f adding KERNEL_VERSION macro for linux.
    • 45b431a Auto merge of #2758 - fkm3:master, r=JohnTitor
    • 572e11b Add misc constants and functions for android
    • 318dccc Auto merge of #3038 - gh-tr:rebased/20221216, r=JohnTitor
    • 07636f6 Auto merge of #3036 - LegionMammal978:iso-c-funcs, r=JohnTitor
    • 720151f Add support for QNX/Neutrino 7.1
    • 6a58758 Add ISO C functions atof, atol, atoll, strtoll, strtoull
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    PR: unreviewed dependencies rust 
    opened by dependabot[bot] 0
  • chore(deps): update scikit-learn requirement from !=1.0,<=1.0.1,>=0.24 to >=0.24,!=1.0,<1.2.1

    Updates the requirements on scikit-learn to permit the latest version.

    Release notes

    Sourced from scikit-learn's releases.

    Scikit-learn 1.2.0

    We're happy to announce the 1.2.0 release.

    You can read the release highlights under https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_2_0.html and the long version of the change log under https://scikit-learn.org/stable/whats_new/v1.2.html

    This version supports Python versions 3.8 to 3.11.

    Commits

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    PR: unreviewed dependencies python 
    opened by dependabot[bot] 0
  • Programs get blocked when using multi-node training.

    Describe the bug A clear and concise description of what the bug is.

    Programs get blocked when using multiple nodes. By setting export LOG_LEVEL=DEBUG, I can see that it got stuck at BaguaSingleCommunicator, since it prints

    2022-11-21T12:40:23.673510Z DEBUG bagua_core_internal::communicators: creating communicator, nccl_unique_id AgCwgcCQEwkAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=, rank 8, nranks 16, device_id 0, stream_ptr 94639511762624

    but fails to print

    al communicator initialized at XXX.

    When I set --node_rank=0, the program can run smoothly.

    Environment

    • Your operating system and version: Linux node-162 4.4.0-131-generic #157-Ubuntu SMP Thu Jul 12 15:51:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
    • Your python version: Python 3.8.13 (default, Mar 28 2022, 11:38:47)
    • Your PyTorch version: 1.12.1
    • NCCL version: 2.10.3
    • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?: conda
    • Have you tried using latest bagua master (python3 -m pip install --pre bagua)?: yes

    Reproducing

    Please provide a minimal working example. This means the runnable code.

    import argparse
    from ast import arg
    from curses import baudrate
    import os
    import random
    import shutil
    import time
    import warnings
    import logging
    
    import torch
    import torch.nn as nn
    import torch.nn.parallel
    import torch.backends.cudnn as cudnn
    import torch.optim
    import torch.utils.data
    import torch.utils.data.distributed
    from torch.utils.tensorboard import SummaryWriter
    import torchvision.transforms as transforms
    import torchvision.datasets as datasets
    import torchvision.models as models
    import bagua.torch_api as bagua
    from bisect import bisect_right
    from pathlib import Path
    
    model_names = sorted(
        name
        for name in models.__dict__
        if name.islower() and not name.startswith("__") and callable(models.__dict__[name])
    )
    
    parser = argparse.ArgumentParser(description="PyTorch ImageNet Training")
    parser.add_argument("data", metavar="DIR", help="path to dataset")
    parser.add_argument(
        "-a",
        "--arch",
        metavar="ARCH",
        default="resnet18",
        choices=model_names,
        help="model architecture: " + " | ".join(model_names) + " (default: resnet18)",
    )
    parser.add_argument(
        "-j",
        "--workers",
        default=4,
        type=int,
        metavar="N",
        help="number of data loading workers (default: 4)",
    )
    parser.add_argument(
        "--epochs", default=90, type=int, metavar="N", help="number of total epochs to run"
    )
    parser.add_argument(
        "--warmup-epochs", type=float, default=5, help="number of warmup epochs"
    )
    parser.add_argument(
        "--start-epoch",
        default=0,
        type=int,
        metavar="N",
        help="manual epoch number (useful on restarts)",
    )
    parser.add_argument(
        "-b",
        "--batch-size",
        default=256,
        type=int,
        metavar="N",
        help="mini-batch size (default: 256), this is the total "
        "batch size of all GPUs on the current node when "
        "using Data Parallel or Distributed Data Parallel",
    )
    parser.add_argument(
        "--lr",
        "--learning-rate",
        default=0.1,
        type=float,
        metavar="LR",
        help="initial learning rate",
        dest="lr",
    )
    parser.add_argument("--momentum", default=0.9, type=float, metavar="M", help="momentum")
    parser.add_argument(
        "--wd",
        "--weight-decay",
        default=1e-4,
        type=float,
        metavar="W",
        help="weight decay (default: 1e-4)",
        dest="weight_decay",
    )
    parser.add_argument(
        "--milestones",
        default="60,70,80",
        type=str,
        help="multi-step learning rate scheduler milestones",
    )
    parser.add_argument(
        "--gama",
        type=float,
        default=0.2,
        help="multiplicative factor of learning rate decay",
    )
    parser.add_argument(
        "-p",
        "--print-freq",
        default=10,
        type=int,
        metavar="N",
        help="print frequency (default: 10)",
    )
    parser.add_argument(
        "--resume",
        default="",
        type=str,
        metavar="PATH",
        help="path to latest checkpoint (default: none)",
    )
    parser.add_argument(
        "--save-checkpoint", action="store_true", default=False, help="save checkpoint"
    )
    parser.add_argument(
        "-e",
        "--evaluate",
        dest="evaluate",
        action="store_true",
        help="evaluate model on validation set",
    )
    parser.add_argument(
        "--pretrained", dest="pretrained", action="store_true", help="use pre-trained model"
    )
    parser.add_argument(
        "--seed", default=None, type=int, help="seed for initializing training. "
    )
    parser.add_argument(
        "--amp",
        action="store_true",
        default=False,
        help="use amp",
    )
    
    parser.add_argument(
        "--prof", default=-1, type=int, help="Only run 10 iterations for profiling."
    )
    
    parser.add_argument(
        "--algorithm",
        type=str,
        default="gradient_allreduce",
        help="distributed algorithm: {gradient_allreduce, bytegrad, decentralized, low_precision_decentralized, qadam, async}",
    )
    
    parser.add_argument(
        "--async-sync-interval",
        default=500,
        type=int,
        help="Model synchronization interval(ms) for async algorithm",
    )
    
    parser.add_argument(
        "--async-warmup-steps",
        default=100,
        type=int,
        help="Warmup(allreduce) steps for async algorithm",
    )
    
    parser.add_argument(
        "--ckpt-dir",
        default="./ckpt/ckpt",
        type=str,
        help="The folder saving ckpt file",
    )
    
    parser.add_argument(
        "--log-dir",
        default="./log/log",
        type=str,
        help="The folder saving tensorboard log",
    )
    
    best_acc1 = 0
    summary_writer = None
    my_global_step = 0
    
    def main():
        args = parser.parse_args()
    
        if args.seed is not None:
            random.seed(args.seed)
            torch.manual_seed(args.seed)
            cudnn.deterministic = True
            warnings.warn(
                "You have chosen to seed training. "
                "This will turn on the CUDNN deterministic setting, "
                "which can slow down your training considerably! "
                "You may see unexpected behavior when restarting "
                "from checkpoints."
            )
    
        torch.cuda.set_device(bagua.get_local_rank())
        bagua.init_process_group()
        args.distributed = bagua.get_world_size() > 1
    
        logging.basicConfig(
            format="rank-{} %(asctime)s,%(msecs)d %(levelname)-8s [%(filename)s:%(lineno)d] %(message)s".format(
                bagua.get_rank()
            ),
            datefmt="%Y-%m-%d:%H:%M:%S",
            level=logging.ERROR,
        )
    
        if bagua.get_rank() == 0:
            logging.getLogger().setLevel(logging.INFO)
    
        main_worker(args)
    
    
    def main_worker(args):
        global best_acc1
        global summary_writer
    
        summary_writer = SummaryWriter(log_dir=args.log_dir)
    
        # create model
        if args.pretrained:
            print("=> using pre-trained model '{}'".format(args.arch))
            model = models.__dict__[args.arch](pretrained=True)
        else:
            print("=> creating model '{}'".format(args.arch))
            model = models.__dict__[args.arch]()
    
        model = model.cuda()
    
        # define loss function (criterion) and optimizer
        criterion = nn.CrossEntropyLoss().cuda()
    
        optimizer = torch.optim.SGD(
            model.parameters(),
            args.lr,
            momentum=args.momentum,
            weight_decay=args.weight_decay,
        )
    
        if args.algorithm == "gradient_allreduce":
            from bagua.torch_api.algorithms import gradient_allreduce
    
            algorithm = gradient_allreduce.GradientAllReduceAlgorithm()
        else:
            raise NotImplementedError
    
        scaler = torch.cuda.amp.GradScaler(enabled=args.amp)
    
        # optionally resume from a checkpoint
        if args.resume:
            if os.path.isfile(args.resume):
                print("=> loading checkpoint '{}'".format(args.resume))
                # Map model to be loaded to specified single gpu.
                loc = "cuda:{}".format(bagua.get_local_rank())
                checkpoint = torch.load(args.resume, map_location=loc)
                args.start_epoch = checkpoint["epoch"]
                best_acc1 = checkpoint["best_acc1"]
                if bagua.get_local_rank() is not None:
                    # best_acc1 may be from a checkpoint from a different GPU
                    best_acc1 = best_acc1.to(bagua.get_local_rank())
                model.load_state_dict(checkpoint["state_dict"])
                optimizer.load_state_dict(checkpoint["optimizer"])
                print(
                    "=> loaded checkpoint '{}' (epoch {})".format(
                        args.resume, checkpoint["epoch"]
                    )
                )
            else:
                print("=> no checkpoint found at '{}'".format(args.resume))
    
        if args.distributed:
            _test_rank = bagua.get_rank()
            model = model.with_bagua(
                [optimizer],
                algorithm,
            )
    
        cudnn.benchmark = True
    
        # Data loading code
        traindir = os.path.join(args.data, "train")
        valdir = os.path.join(args.data, "val")
        normalize = transforms.Normalize(
            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
        )
    
        train_dataset = datasets.ImageFolder(
            traindir,
            transforms.Compose(
                [
                    transforms.RandomResizedCrop(224),
                    transforms.RandomHorizontalFlip(),
                    transforms.ToTensor(),
                    normalize,
                ]
            ),
        )
        val_dataset = datasets.ImageFolder(
            valdir,
            transforms.Compose(
                [
                    transforms.Resize(256),
                    transforms.CenterCrop(224),
                    transforms.ToTensor(),
                    normalize,
                ]
            ),
        )
    
        if args.distributed:
            train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
        else:
            train_sampler = None
    
        train_loader = torch.utils.data.DataLoader(
            train_dataset,
            batch_size=args.batch_size,
            shuffle=(train_sampler is None),
            num_workers=args.workers,
            pin_memory=True,
            sampler=train_sampler,
        )
    
        val_loader = torch.utils.data.DataLoader(
            val_dataset,
            batch_size=args.batch_size,
            shuffle=False,
            num_workers=args.workers,
            pin_memory=True,
        )
    
        if args.evaluate:
            validate(val_loader, model, criterion, 0, args)
            return
    
        for epoch in range(args.start_epoch, args.epochs):
            if args.distributed:
                train_sampler.set_epoch(epoch)
    
            if args.algorithm == "async":
                model.bagua_algorithm.resume(model)
    
            # train for one epoch
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
    
            start.record()
            train(train_loader, model, criterion, optimizer, scaler, epoch, args)
            end.record()
    
            # Waits for everything to finish running
            torch.cuda.synchronize()
            elapsed_time = start.elapsed_time(end)
            write_scalar(tag='train/epoch_training_time', scalar_value=elapsed_time, global_step=epoch)
    
            if args.algorithm == "async":
                model.bagua_algorithm.abort(model)
    
            # evaluate on validation set
            acc1 = validate(val_loader, model, criterion, epoch, args)
    
            # remember best acc@1 and save checkpoint
            is_best = acc1 > best_acc1
            best_acc1 = max(acc1, best_acc1)
    
            if bagua.get_rank() == 0 and args.save_checkpoint:
                save_checkpoint(
                    {
                        "epoch": epoch + 1,
                        "arch": args.arch,
                        "state_dict": model.state_dict(),
                        "best_acc1": best_acc1,
                        "optimizer": optimizer.state_dict(),
                    },
                    is_best,
                    dir=args.ckpt_dir
                )
    
    def train(train_loader, model, criterion, optimizer, scaler, epoch, args):
        global my_global_step
    
        batch_time = AverageMeter("Time", ":6.3f")
        data_time = AverageMeter("Data", ":6.3f")
        losses = AverageMeter("Loss", ":.4e")
        top1 = AverageMeter("Acc@1", ":6.2f")
        top5 = AverageMeter("Acc@5", ":6.2f")
        progress = ProgressMeter(
            len(train_loader),
            [batch_time, data_time, losses, top1, top5],
            prefix="Epoch: [{}]".format(epoch),
        )
    
        # switch to train mode
        model.train()
    
        end = time.time()
        for i, (images, target) in enumerate(train_loader):
    
            if args.prof >= 0 and i == args.prof:
                print("Profiling begun at iteration {}".format(i))
                torch.cuda.cudart().cudaProfilerStart()
    
            if args.prof >= 0:
                torch.cuda.nvtx.range_push("Body of iteration {}".format(i))
    
            # measure data loading time
            data_time.update(time.time() - end)
    
            if torch.cuda.is_available():
                images = images.cuda(bagua.get_local_rank(), non_blocking=True)
                target = target.cuda(bagua.get_local_rank(), non_blocking=True)
    
            adjust_learning_rate(optimizer, epoch, i, len(train_loader), args)
    
            optimizer.zero_grad()
    
            if args.prof >= 0:
                torch.cuda.nvtx.range_push("forward")
    
            with torch.cuda.amp.autocast(enabled=args.amp):
                # compute output
                output = model(images)
                loss = criterion(output, target)
    
            if args.prof >= 0:
                torch.cuda.nvtx.range_pop()
    
            # measure accuracy and record loss
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            losses.update(loss.item(), images.size(0))
            top1.update(acc1[0], images.size(0))
            top5.update(acc5[0], images.size(0))
    
            if args.prof >= 0:
                torch.cuda.nvtx.range_push("backward")
    
            # compute gradient and do SGD step
            scaler.scale(loss).backward()
    
            if args.prof >= 0:
                torch.cuda.nvtx.range_pop()
    
            if args.prof >= 0:
                torch.cuda.nvtx.range_push("optimizer.step()")
    
            scaler.step(optimizer)
            scaler.update()
    
            if args.prof >= 0:
                torch.cuda.nvtx.range_pop()
    
            # measure elapsed time
            batch_time.update(time.time() - end)
            end = time.time()
    
            if i % args.print_freq == 0:
                progress.display(i)
                write_scalar(tag='train/acc_top1', scalar_value=top1.get_avg(), global_step=my_global_step)
                write_scalar(tag='train/acc_top5', scalar_value=top5.get_avg(), global_step=my_global_step)
    
            # Pop range "Body of iteration {}".format(i)
            if args.prof >= 0:
                torch.cuda.nvtx.range_pop()
    
            if args.prof >= 0 and i == args.prof + 10:
                print("Profiling ended at iteration {}".format(i))
                torch.cuda.cudart().cudaProfilerStop()
    
                if args.algorithm == "async":
                    model.bagua_algorithm.abort(model)
                quit()
    
    
    def validate(val_loader, model, criterion, epoch, args):
        batch_time = AverageMeter("Time", ":6.3f")
        losses = AverageMeter("Loss", ":.4e")
        top1 = AverageMeter("Acc@1", ":6.2f")
        top5 = AverageMeter("Acc@5", ":6.2f")
        progress = ProgressMeter(
            len(val_loader), [batch_time, losses, top1, top5], prefix="Test: "
        )
    
        # switch to evaluate mode
        model.eval()
    
        with torch.no_grad():
            end = time.time()
            for i, (images, target) in enumerate(val_loader):
                if torch.cuda.is_available():
                    images = images.cuda(bagua.get_local_rank(), non_blocking=True)
                    target = target.cuda(bagua.get_local_rank(), non_blocking=True)
    
                # compute output
                output = model(images)
                loss = criterion(output, target)
    
                # measure accuracy and record loss
                acc1, acc5 = accuracy(output, target, topk=(1, 5))
                losses.update(loss.item(), images.size(0))
                top1.update(acc1[0], images.size(0))
                top5.update(acc5[0], images.size(0))
    
                # measure elapsed time
                batch_time.update(time.time() - end)
                end = time.time()
    
                if i % args.print_freq == 0:
                    progress.display(i)
    
            # TODO: this should also be done with the ProgressMeter
            logging.info(
                " * TEST Epoch {} Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}".format(
                    epoch, top1=top1, top5=top5
                )
            )
            write_scalar(tag='validation/acc_top1', scalar_value=top1.get_avg(), global_step=epoch)
            write_scalar(tag='validation/acc_top5', scalar_value=top5.get_avg(), global_step=epoch)
    
    
        return top1.avg
    
    def write_scalar(tag, scalar_value, global_step):
        global summary_writer
        if bagua.get_rank() == 0:
            summary_writer.add_scalar(tag=tag, scalar_value=scalar_value, global_step=global_step)
    
    def save_checkpoint(state, is_best, dir="./ckpt/dir"):
        dir = Path(dir)
        if not dir.exists():
            dir.mkdir(parents=True)
        
        file_name = dir / "checkpoint.pth.tar"
        torch.save(state, file_name)
        if is_best:
            shutil.copyfile(file_name, dir / "model_best.pth.tar")
    
    class AverageMeter(object):
        """Computes and stores the average and current value"""
    
        def __init__(self, name, fmt=":f"):
            self.name = name
            self.fmt = fmt
            self.reset()
    
        def reset(self):
            self.val = 0
            self.avg = 0
            self.sum = 0
            self.count = 0
    
        def update(self, val, n=1):
            self.val = val
            self.sum += val * n
            self.count += n
            self.avg = self.sum / self.count
    
        def __str__(self):
            fmtstr = "{name} {val" + self.fmt + "} ({avg" + self.fmt + "})"
            return fmtstr.format(**self.__dict__)
        
        def get_avg(self):
            return self.avg
    
    
    class ProgressMeter(object):
        def __init__(self, num_batches, meters, prefix=""):
            self.batch_fmtstr = self._get_batch_fmtstr(num_batches)
            self.meters = meters
            self.prefix = prefix
    
        def display(self, batch):
            entries = [self.prefix + self.batch_fmtstr.format(batch)]
            entries += [str(meter) for meter in self.meters]
            logging.info("\t".join(entries))
    
        def _get_batch_fmtstr(self, num_batches):
            num_digits = len(str(num_batches // 1))
            fmt = "{:" + str(num_digits) + "d}"
            return "[" + fmt + "/" + fmt.format(num_batches) + "]"
    
    
    def adjust_learning_rate(optimizer, epoch, step, len_epoch, args):
        """Sets the learning rate to the initial LR decayed by 10 every 30 epochs"""
        # lr = args.lr * (0.1 ** (epoch // 30))
        # for param_group in optimizer.param_groups:
        #     param_group["lr"] = lr
        milestones = [int(i) for i in args.milestones.split(",")]
        lr = args.lr * (args.gama ** bisect_right(milestones, epoch))
    
        """Warmup"""
        if epoch < args.warmup_epochs:
            lr = (
                lr
                * float(1 + step + epoch * len_epoch)
                / float(args.warmup_epochs * len_epoch)
            )
    
        # logging.info("epoch = {}, step = {}, lr = {}".format(epoch, step, lr))
    
        for param_group in optimizer.param_groups:
            param_group["lr"] = lr
    
    
    def accuracy(output, target, topk=(1,)):
        """Computes the accuracy over the k top predictions for the specified values of k"""
        with torch.no_grad():
            maxk = max(topk)
            batch_size = target.size(0)
    
            _, pred = output.topk(maxk, 1, True, True)
            pred = pred.t()
            correct = pred.eq(target.view(1, -1).expand_as(pred))
    
            res = []
            for k in topk:
                correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
                res.append(correct_k.mul_(100.0 / batch_size))
            return res
    
    
    if __name__ == "__main__":
        main()
    

    Please also write what exact commands are required to reproduce your results.

    python -m bagua.distributed.launch \
            --nproc_per_node=8 --nnodes=1 --node_rank=0 \
            --master_addr="10.154.34.164" --master_port=34498 \
            main.py \
            --arch=resnet50 \
            --save-checkpoint \
            --lr 0.2 \
            --batch-size 64 \
            --print-freq 100 \
            --algorithm gradient_allreduce \
            --resume ./ckpt/multi_node_gradient_allreduce \
            --ckpt-dir ./ckpt/multi_node_gradient_allreduce \
            --log-dir ./log/multi_node_gradient_allreduce \
            $DATA_PATH
    

    Additional context Add any other context about the problem here.

    opened by zhaone 0
Releases (v0.9.0)
  • v0.9.0 (Jan 17, 2022)

    Bug Fixes

    Other

    • Reuse fused parameter tensors in fuse_step (#410)
    • Call step closure in qadam optimizer step (#432)
    • Fix need_reset condition (#454)
    • Do negotiation in async native op (#447)
    • Fix find_unused_parameters (#452)
    • Fix qadam non-deterministic (#459)
    • Add LIBRARY_PATH env in install_master.sh (#465)
    • Fix typo in install_master.sh (#471)

    Python

    • CUDA 11.5 can't get nccl package (#415)
    • Fix process group compatibility with torch 1.6.0 (#413)
    • Fix ci random fail (#445)
    • Fix async algorithm (#479)

    Features

    Core

    • Initial support for C interface (#325)

    Other

    • Support NODE_RANK environment variable (#426)
    • Choose bagua service port dynamically (#431)
    • Use bagua_module_name to identify different modules (#438)
    • Add algorithm registry (#433)
    • Add compatibility for NCCL version under 2.10 (#449)
    • Add broadcast object api (#437)
    • Support qadam in fused optimizer (#477)

    Python

    • Support PyTorch DDP compatible distributed training API (#312)
    • Support torch-api-compatiable all_reduce (#377)
    • Associate PyTorch Process Group with Bagua Process Group using cache (#402)
    • Support find_unused_parameters on BaguaDDP (#409)
    • Add BAGUA_AUTOTUNE_SERVER_WAIT_TIME env (#474)
    Source code(tar.gz)
    Source code(zip)
  • v0.8.2 (Nov 10, 2021)

    Features

    Python

    • Support switching between different algorithms (#299)
    • Support separate algorithm declaration and implementation (#246)

    Python, core

    • Support process group in with_bagua, support hierarchical communication in bytegrad algorithm (#300)
    • Support mutable bucket tensors (#271)
    • Support all_to_all_single (#361)

    Bug Fixes

    Other

    • Fuse optimizer oom and make it stateless (#207)
    • to_bagua_tensor compatibility with torch 1.6.0 (#355)

    Python

    • Use separate process group for async communication thread to avoid potential hangs (#298)
    • Do not fail if checkpoints path exist (#305)
    • Fix is_moe_param (#306)
    • Change to_bagua_tensor API to support PyTorch 1.10 (#338)
    • Fix fused optimizer with multiple param groups (#356)
    Source code(tar.gz)
    Source code(zip)
  • v0.8.1.post1 (Oct 16, 2021)

    Bug Fixes

    • Process group not yet supported in with_bagua
    • Use separate process group for async communication thread to avoid potential hangs (#298)
    Source code(tar.gz)
    Source code(zip)
  • v0.8.1 (Oct 16, 2021)

    [0.8.1] - 2021-10-16

    Features

    • Support moe (#208)
    • Support checkpointing for moe (#242)
    • Use single bucket for decentralized algorithm to improve performance (#275)
    • Support process group (#228)
    • Add barrier api (#290)
    Source code(tar.gz)
    Source code(zip)
  • v0.8.0 (Sep 26, 2021)

    [0.8.0] - 2021-09-26

    Bug Fixes

    Ci

    • Only run publish once on git tag

    Core

    • Fix compressed buffer can not be scattered to odd number of ranks

    Other

    • Fix ci pypi versioning
    • Remove __init__.py and python version, use cargo version
    • Move import bagua_install_library to install library function
    • Merge bagua_install_library and setup.py, remove nccl<=2.6 support
    • Fix alltoall_v parameter (#17)
    • Reduce and allgather python interface
    • Fix decompress incorrect pointer and typo in error msg
    • Fix python gil deadlock during getting data ptr
    • Fix benchmark script requirements
    • Fix alltoall_v parameter types (#27)
    • Always mark bagua padding tensor as ready
    • Make compress/decompress of BaguaTensor method string consistent (#33)
    • Fix scatter and reduce_scatter implementation (#40)
    • Fix subtraction overflow error for decentralized op (#39)
    • Fix QADAM params (#17)
    • Fix assert precision (#18)
    • Replace mutex with atomic bool for async op and add Aluminum submodule update (#67)
    • Fix duplicated dependency downloading during installation (#77)
    • Fix async algorithm aborting and hanging (#78, #81)
    • Fix qadam algorithm call (#20)
    • Fix missing symbols in the zip library (#24)
    • Fix random autotune server hang (#206)
    • Bagua-net library path mismatch, make --enable_bagua_net argument style consistent with other args (#218)

    Python

    • Fix random autotune-service hang
    • Handle conflicts caused by sklearn upgrade (#225)

    Features

    Ci

    • Only publish pypi for master commits

    Other

    • Add async model average algorithm (#110)
    • Add cached dataset wrapper (#148)
    • Support sync batchnorm (#151)
    • Add --enable-bagua-net option in launcher (#183)
    • Add pytorch examples for MNIST, ImageNet, SQuAD training (#1)
    • Add requirements.txt, only download dataset on local rank 0 (#2)
    • Add python packaging related files
    • Add __version__ variable
    • Install nccl deps in bagua core and add generated __version__ variable
    • Add version.py placeholder to prevent file not found error
    • Initial support for python op (#2)
    • Add 5 min timeout for buckets' comm op (#5)
    • Replace NCCL with Aluminum (#7)
    • Add synthetic benchmark script (#5)
    • Add elastic training example (#7)
    • Support alltoall_v (vector alltoall) (#14)
    • Add reduce and allgather python interface
    • Support reduce and allgather op with Reduction op enum
    • Support creating BaguaTensor by passing torch tensor directly (#19)
    • Compatible mode for getting pytorch tensor info with Python interpreter
    • Better debug log including tensor info when executing ops
    • Add native low precision decentralized operator (#26)
    • Add (scatter, gather, scatter_reduce) and all inplace version communication primitives (#37)
    • Make full precision decentralized op stateless (#36)
    • Add communication_primitives example (#12)
    • Use nccl 2.10 avg op for all algorithms using averaging (#46, #45)
    • Add opentelemetry to report tensor ready order (#42)
    • Add deterministic flag (#15)
    • Add native async model average algorithm (#41)
    • Add examples for async model average algorithm (#14)
    • Support packet splitting and multi-stream parallel transmission (#5)
    • Support ncclnet v3 and remove the dependency on nccl in the installation environment (#17)
    • Add sync interval param to async examples (#19)
    • Support tokio backend (#21)
    • Support bagua-net (#89)
    Source code(tar.gz)
    Source code(zip)
  • v0.7.0(Aug 16, 2021)

    Bug Fixes

    • Autotune api conflict (#131)

    Features

    • Add low precision decentralized algorithm (#103)
    • Add all communication primitives such as send/recv to the communication module (#128) (see the sketch after this list)
    • Make full precision decentralized op stateless (#126)
    • Support nccl 2.10 ReduceOp.AVG (#149)
    • Add support for reporting tensor completion order (#146)
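    A sketch of the point-to-point primitives from #128 is below. The send/recv names and their dst/src arguments are assumptions modeled on torch.distributed and may differ from the actual signatures.

        # Hedged sketch: point-to-point communication between rank 0 and rank 1.
        import torch
        import bagua.torch_api as bagua
        from bagua.torch_api.communication import send, recv  # names assumed

        bagua.init_process_group()
        t = torch.zeros(4).cuda()
        if bagua.get_rank() == 0:
            send(t, dst=1)
        elif bagua.get_rank() == 1:
            recv(t, src=0)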
    Source code(tar.gz)
    Source code(zip)
  • v0.7.0-rc2(Jul 22, 2021)

  • v0.7.0-rc1(Jul 22, 2021)

  • v0.6.3(Jul 8, 2021)

    Features

    • support different ssh port on different nodes (#93) 6810245
    • support multiple models in one training script (#113) 312bcc0 (#107) 0aec789

    Fixes

    • autotune service now defaults to a fixed random seed (#117) a58c2de

    Others

    • sort q_adam variables for better performance (#102) f277549
    • improve autotune speed metrics measurement for better accuracy (#86) e4ee5ee
    • install.sh upgrades existing bagua package bc69890
    • install.sh will not install Rust if it already exists on the system 67e1efe
    Source code(tar.gz)
    Source code(zip)
  • v0.6.2(Jul 2, 2021)

  • v0.6.1(Jul 2, 2021)

    Features

    • add QAdam algorithm (#92) 0dafd24 (see the sketch after this list)
    • broadcast model parameters on every algorithm reset e5b36dc
    • wrap python op in communication stream context by default 51eb656
    • add append op methods to python BaguaBucket class (#87) 84d8cbc
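    A sketch of the QAdam algorithm added in #92 is below. The module path bagua.torch_api.algorithms.q_adam and the warmup_steps argument are assumptions based on later Bagua documentation; the toy model is a placeholder.

        # Hedged sketch: quantized Adam with Bagua's QAdam algorithm (#92).
        import torch
        import bagua.torch_api as bagua
        from bagua.torch_api.algorithms.q_adam import QAdamAlgorithm, QAdamOptimizer

        bagua.init_process_group()
        torch.cuda.set_device(bagua.get_local_rank())

        model = torch.nn.Linear(10, 2).cuda()  # placeholder model
        optimizer = QAdamOptimizer(model.parameters(), lr=1e-3, warmup_steps=100)
        model = model.with_bagua([optimizer], QAdamAlgorithm(optimizer))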

    Fixes

    • BaguaBucket.tensors should only contain the originally passed-in tensors c4ff05f
    • fix append python op callable reference 04019cc
    • fix BaguaBucket.clear_ops() return value 8cb9f54
    Source code(tar.gz)
    Source code(zip)
  • v0.6.0(Jul 1, 2021)

    ⚠ BREAKING CHANGE

    • Now end users should use model.with_bagua(...) API to use Bagua for communication. Algorithm developers can use bagua.torch_api.algorithms.Algorithm to easily develop new algorithms. Installation requires bagua-core >=0.3 now.
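    A minimal sketch of the new API is shown below, assuming one process per GPU launched via Bagua's distributed launcher; the toy model and optimizer are placeholders.

        # Hedged sketch: wrapping a model with model.with_bagua(...).
        import torch
        import bagua.torch_api as bagua
        from bagua.torch_api.algorithms import gradient_allreduce

        bagua.init_process_group()
        torch.cuda.set_device(bagua.get_local_rank())

        model = torch.nn.Linear(10, 2).cuda()  # placeholder model
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        # GradientAllReduceAlgorithm mirrors standard synchronous data parallelism.
        model = model.with_bagua(
            [optimizer], gradient_allreduce.GradientAllReduceAlgorithm()
        )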

    Features

    • add algorithm import in bagua.torch_api ee73edc
    • support reduction op and reduce ac8632c
    • auto installation support centos (#50) 073a59e

    Fixes

    • fix algorithm pre-forward hook not being returned e6c7c8d
    • the environment variable LOCAL_SIZE has been renamed to LOCAL_WORLD_SIZE (#51) 801b25a
    Source code(tar.gz)
    Source code(zip)
  • 0.5.0(Jun 25, 2021)

    ⚠ BREAKING CHANGE

    • contrib: load balancing dataloader and fused optimizer are now in bagua.torch_api.contrib module (see the import sketch after this list)
    • baguaelastic/distributed/launch.py now moved to bagua/distributed/run.py
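    An import sketch for the relocated contrib utilities is below. The class names LoadBalancingDistributedSampler and FusedOptimizer and their arguments are assumptions based on Bagua's contrib documentation; the dataset and model are placeholders.

        # Hedged sketch: load-balancing sampler and fused optimizer from contrib.
        import torch
        from bagua.torch_api.contrib import LoadBalancingDistributedSampler, FusedOptimizer

        dataset = torch.utils.data.TensorDataset(torch.randn(1000, 10))  # placeholder
        model = torch.nn.Linear(10, 2)                                   # placeholder

        sampler = LoadBalancingDistributedSampler(
            dataset, complexity_fn=lambda item: item[0].numel()
        )
        loader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=32)
        optimizer = FusedOptimizer(torch.optim.SGD(model.parameters(), lr=0.01))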

    Features

    • add dependency installation script for ubuntu (#41) 4d820ab
    • Elastic training (#31) 1a5964c
    • add broadcast_buffer in bagua_init (#29) e761cc6
    • support bagua-core 0.2 (#26) f1d2bfa

    Fixes

    • autotune: fix bucket size switch not effective (#48) 30b490a
    • remove logging in load balancing dataloader to avoid deadlock (#35) e900383
    • torch_api.distributed: fix cyclic dependency (#16) 0314e24
    • fix setup.py for low version setuptools (#14) 7d315c0
    • fix baguaelastic launch script b069cd4
    Source code(tar.gz)
    Source code(zip)
  • 0.4.0(Jun 17, 2021)
