Bagua is a flexible and performant framework for developing distributed training algorithms.

Overview

Bagua


Bagua is a distributed training utility developed by Kuaishou Technology and the DS3 Lab @ ETH Zürich. By adding only a few lines of code, users can extend training on a single GPU to multiple GPUs, possibly across multiple machines, with excellent speedup. Bagua also provides a flexible system abstraction that supports state-of-the-art system relaxation techniques for distributed training. Powered by this system design, Bagua can implement and extend various state-of-the-art distributed learning algorithms, and researchers can easily develop new distributed training algorithms on top of Bagua without sacrificing system performance.
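
For example, converting a single-GPU PyTorch training script usually amounts to a few lines like the following. This is a minimal sketch distilled from the gradient allreduce example scripts quoted in the issues further down this page, not a complete training program:

import torch
import torch.nn as nn
import bagua.torch_api as bagua
from bagua.torch_api.algorithms import gradient_allreduce

# Pin this process to its GPU and initialize the Bagua process group.
torch.cuda.set_device(bagua.get_local_rank())
bagua.init_process_group()

model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Wrap the model with a Bagua algorithm; the training loop itself stays unchanged.
algorithm = gradient_allreduce.GradientAllReduceAlgorithm()
model = model.with_bagua([optimizer], algorithm)

The script is then started on each machine through the bagua.distributed.launch module, as in the reproduction commands quoted below.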

So far, Bagua has integrated primitives including the following (a sketch of how a primitive is selected follows the list):

  • Centralized Synchronous Communication (AllReduce)
  • Decentralized Synchronous Communication
  • Low Precision Communication
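
In the example scripts quoted further down this page, the primitive to use is selected with the --algorithm flag when launching through bagua.distributed.launch, for instance:

python -m bagua.distributed.launch --nproc_per_node=8 main.py --algorithm gradient_allreduce

According to the help text of those scripts, the same flag also accepts bytegrad, decentralized, low_precision_decentralized, qadam and async.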

Its effectiveness has been verified in various scenarios, including VGG and ResNet on ImageNet, BERT-Large fine-tuning, and many industrial applications at Kuaishou.

The underlying communication execution engine is bagua-core, a library written in Rust.

Performance

The scalability of different systems on VGG16 with up to 128 GPUs.



Epoch time of BERT-Large Finetune under different network conditions for different systems.

For more comprehensive and up-to-date results, refer to the Bagua benchmark page.

Installation

Development version:

pip install git+https://github.com/BaguaSys/bagua.git

Release version:

pip install bagua
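
If Bagua cannot detect its bundled NCCL library at runtime (see the warning quoted in the "NCCL error when running backward" issue below), the bundled dependencies can be installed afterwards, for example with:

python3 -c "import bagua_core; bagua_core.install_deps()"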

Build API documentation locally

pip install -r docs/doc-requirements.txt
make html

Cite Bagua

@misc{gan2021bagua,
  title={BAGUA: Scaling up Distributed Learning with System Relaxations}, 
  author={Shaoduo Gan and Xiangru Lian and Rui Wang and Jianbin Chang and Chengjun Liu and Hongmei Shi and Shengzhuo Zhang and Xianghong Li and Tengxu Sun and Jiawei Jiang and Binhang Yuan and Sen Yang and Ji Liu and Ce Zhang},
  year={2021},
  eprint={2107.01499},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

@book{liu2020distributed,
  title={Distributed Learning Systems with First-Order Methods: An Instruction},
  author={Liu, J. and Zhang, C.},
  isbn={9781680837018},
  series={Foundations and trends in databases},
  url={https://books.google.com/books?id=vzQmzgEACAAJ},
  year={2020},
  publisher={now publishers}
}

Comments
  • Why does FusedOptimizer have a huge impact on model precision?

    I wrapped my custom optimizer with FusedOptimizer and the precision was way worse than that without FusedOptimizer. I think FusedOptimizer shouldn't be affecting the model precision. Or is there something wrong with my custom optimizer?

    Here is the optimizer I use:

    https://github.com/cybertronai/pytorch-lamb/blob/master/pytorch_lamb/lamb.py
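
    Roughly, the wrapping looks like the sketch below (this is only a sketch; the exact fuse_optimizer call may differ slightly from my actual script):

    import torch.nn as nn
    from pytorch_lamb import Lamb  # the LAMB implementation linked above
    from bagua.torch_api.contrib.fuse.optimizer import fuse_optimizer

    model = nn.Linear(10, 1).cuda()
    optimizer = Lamb(model.parameters(), lr=1e-3)
    # Wrap the custom optimizer with the fused one; in principle this should not change precision.
    optimizer = fuse_optimizer(optimizer)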

    bug 
    opened by ProHuper 11
  • turn off bagua-net

    Hi, I want to know if I can turn off bagua-net in this script, so that I can compare with the original PyTorch throughput. Passing the --enable_bagua_net argument in bagua.distributed.launch does not work.

    opened by CaRRotOne 8
  • My process has been blocked; the screen (shown below) did not change for 30 minutes

    Describe the bug A clear and concise description of what the bug is.

    Environment

    • Your operating system and version:Ubuntu18.04
    • Your python version:3.8
    • Your PyTorch version:11.0
    • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?:
    • Have you tried using latest bagua master (python3 -m pip install git+https://github.com/BaguaSys/bagua.git -f https://repo.arrayfire.com/python/wheels/3.8.0/)?:0.8.1.post1

    Reproducing

    Please provide a minimal working example. This means the runnable code.

    Please also write what exact commands are required to reproduce your results.

    Additional context Add any other context about the problem here.

    opened by lixiangMindSpore 8
  • Problem with AttributeError ('setuptools._distutils' has no attribute 'version') when executing the MNIST example

    I ran the MNIST example and got the following error:

    [kqian@eu-login-04 testrun]$ python3 -m bagua.distributed.launch --nproc_per_node=8 main.py --arch resnet50 --algorithm gradient_allreduce [imagenet-folder with train and val folders]
    *****************************************
    Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    *****************************************
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
        LooseVersion = distutils.version.LooseVersion
    **AttributeError: module 'setuptools._distutils' has no attribute 'version'**
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'setuptools._distutils' has no attribute 'version'
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'setuptools._distutils' has no attribute 'version'
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'setuptools._distutils' has no attribute 'version'
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'setuptools._distutils' has no attribute 'version'
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'setuptools._distutils' has no attribute 'version'
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'setuptools._distutils' has no attribute 'version'
    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        import bagua.torch_api as bagua
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/__init__.py", line 52, in <module>
        from .tensor import BaguaTensor  # noqa: F401
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/torch_api/tensor.py", line 10, in <module>
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'setuptools._distutils' has no attribute 'version'
    Killing subprocess 26136
    Killing subprocess 26137
    Killing subprocess 26138
    Killing subprocess 26140
    Killing subprocess 26142
    Killing subprocess 26144
    Killing subprocess 26145
    Killing subprocess 26146
    Traceback (most recent call last):
      File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/distributed/launch.py", line 342, in <module>
        main()
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/distributed/launch.py", line 327, in main
        sigkill_handler(signal.SIGTERM, None)  # not coming back
      File "/cluster/home/kqian/.local/lib/python3.7/site-packages/bagua/distributed/launch.py", line 291, in sigkill_handler
        returncode=last_return_code, cmd=cmd
    subprocess.CalledProcessError: Command '['/cluster/apps/nss/python/3.7.4/x86_64/bin/python3', '-u', 'main.py', '--arch', 'resnet50', '--algorithm', 'gradient_allreduce', '[imagenet-folder', 'with', 'train', 'and', 'val', 'folders]']' returned non-zero exit status 1.
    
    
    opened by silverCore97 7
  • cannot find libnccl.so.2

    Describe the bug A clear and concise description of what the bug is. [screenshot]

    Environment

    • Your operating system and version: Ubuntu18.04
    • Your python version:3.8
    • Your PyTorch version:11.0
    • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?:
    • conda create -n torch17 python=3.8
    • Have you tried using latest bagua master (python3 -m pip install git+https://github.com/BaguaSys/bagua.git -f https://repo.arrayfire.com/python/wheels/3.8.0/)?:I use 0.8.1.post1

    Reproducing

    Please provide a minimal working example. This means the runnable code.

    Please also write what exact commands are required to reproduce your results.

    Additional context Add any other context about the problem here.

    question 
    opened by lixiangMindSpore 7
  • I use bagua with the phenomenon as follows ( bagua.broadcast(ps, 0, comm=comm) )

    Describe the bug A clear and concise description of what the bug is. [screenshot]

    Environment

    • Your operating system and version:Ubuntu18.04
    • Your python version:3.8.12
    • Your PyTorch version:11.0
    • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?:conda create -n torch17 python=3.8
    • Have you tried using latest bagua master (python3 -m pip install --pre bagua)?:0.8.1.post1

    Reproducing

    Please provide a minimal working example. This means the runnable code.

    Please also write what exact commands are required to reproduce your results.

    Additional context Add any other context about the problem here.

    question 
    opened by lixiangMindSpore 6
  • NCCL error when running backward

    I ran a very simply example and got error:

    WARNING:root:Bagua cannot detect bundled NCCL library, Bagua will try to use system NCCL instead. If you encounter any error, please run `import bagua_core; bagua_core.install_deps()` or the `bagua_install_deps.py` script to install bundled libraries.
    WARNING:root:Bagua cannot detect bundled NCCL library, Bagua will try to use system NCCL instead. If you encounter any error, please run `import bagua_core; bagua_core.install_deps()` or the `bagua_install_deps.py` script to install bundled libraries.
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Bootstrap : Using eth1:11.214.158.37<0>
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB eth1:11.214.158.37<0>
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Using network IB
    NCCL version 2.10.3+cuda10.2
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Bootstrap : Using eth1:11.214.158.37<0>
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB eth1:11.214.158.37<0>
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Using network IB
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 00/04 :    0   1
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 01/04 :    0   1
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 02/04 :    0   1
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Setting affinity for GPU 1 to 3f,07ff0000,003e07ff
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 03/04 :    0   1
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Setting affinity for GPU 0 to 3f,07ff0000,003e07ff
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[3d000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[3d000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 00 : 1[3d000] -> 0[1a000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 02 : 0[1a000] -> 1[3d000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 01 : 1[3d000] -> 0[1a000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 03 : 0[1a000] -> 1[3d000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 02 : 1[3d000] -> 0[1a000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 03 : 1[3d000] -> 0[1a000] via P2P/IPC
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Connected all rings
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Connected all trees
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Connected all rings
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Connected all trees
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO comm 0x55bc8aee70c0 rank 1 nranks 2 cudaDev 1 busId 3d000 - Init COMPLETE
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO comm 0x555f0e926110 rank 0 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
    2021-11-04T14:16:06.243214Z  WARN bagua_core_internal: Parameter autotuning service not detected. Enabling it may further improve the performance. See https://tutorials.baguasys.com/performance-autotuning/ for more details.
    2021-11-04T14:16:06.243246Z  WARN bagua_core_internal: Parameter autotuning service not detected. Enabling it may further improve the performance. See https://tutorials.baguasys.com/performance-autotuning/ for more details.
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Launch mode Parallel
    
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [0] enqueue.cc:329 NCCL WARN Cuda failure 'invalid resource handle'
    ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [0] NCCL INFO enqueue.cc:1047 -> 1
    fatal runtime error: Rust cannot catch foreign exceptions
    Killing subprocess 93207
    Killing subprocess 93208
    Traceback (most recent call last):
      File "/root/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/root/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/root/anaconda3/lib/python3.8/site-packages/bagua/distributed/launch.py", line 342, in <module>
        main()
      File "/root/anaconda3/lib/python3.8/site-packages/bagua/distributed/launch.py", line 327, in main
        sigkill_handler(signal.SIGTERM, None)  # not coming back
      File "/root/anaconda3/lib/python3.8/site-packages/bagua/distributed/launch.py", line 290, in sigkill_handler
        raise subprocess.CalledProcessError(
    subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u', 'train.py']' died with <Signals.SIGABRT: 6>.
    

    I used NCCL 2.10.3 and CUDA 10.2 with my local NCCL, but the same error occurs when I install NCCL using bagua_core.install_deps, and everything works fine if I use DDP.

    Here is my code:

    import torch
    from torch.nn.modules.loss import CrossEntropyLoss
    from torch.utils.data.dataloader import DataLoader
    from LAMB import LAMB
    from bagua.torch_api.contrib.fuse.optimizer import fuse_optimizer
    import torch.nn as nn
    import torch.optim
    from torch.utils.data import Dataset, DataLoader
    import bagua.torch_api as bagua
    from bagua.torch_api.algorithms import gradient_allreduce
    
    from torch.nn.parallel import DistributedDataParallel as DDP
    import torch.distributed as dist
    import argparse
    
    class MyDataset(Dataset):
        def __init__(self) -> None:
            self.input = torch.randn(10000, 10)
            self.laebl = torch.randn(10000, 1)
    
        def __getitem__(self, index):
            return self.input[index], self.laebl[index]
    
        def __len__(self):
            return  10000
    
    
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument("--local_rank", type=int, default=-1)
        args = parser.parse_args()
        # dist.init_process_group(backend='nccl')
        bagua.init_process_group()
    
        model = nn.Sequential(
            nn.Linear(10, 5),
            nn.Linear(5, 2),
            nn.Linear(2, 1),
        )   
    
        optimizer = torch.optim.Adam(
            params=model.parameters(),
            lr=0.1,
            betas=(0.9, 0.999),
            eps=1e-06,
            weight_decay=0
        )
    
        algorithm = gradient_allreduce.GradientAllReduceAlgorithm()
        model.to(bagua.get_local_rank())
        # model.to(args.local_rank)
        # model = DDP(model, device_ids=[args.local_rank])
        model = model.with_bagua(
            [optimizer],
            algorithm
        )
        dataset = MyDataset()
        dataloader = DataLoader(dataset, batch_size=5)
    
        for i in range(10):
            for x, y in dataloader:
                # x = x.to(args.local_rank)
                # y = y.to(args.local_rank)
                x = x.to(bagua.get_local_rank())
                y = y.to(bagua.get_local_rank())
                optimizer.zero_grad()
                output = model(x)
                loss = (output - y).pow(2).sum()
                loss.backward()
                optimizer.step()
    
    opened by ProHuper 5
  • What's wrong with this? Do I need to do anything else? Will it affect my result?

    Describe the bug A clear and concise description of what the bug is. [screenshot]

    Environment

    • Your operating system and version: Ubuntu18.04
    • Your python version:3.8.12
    • Your PyTorch version:11.1
    • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?:conda create -n torch python=3.8
    • Have you tried using latest bagua master (python3 -m pip install --pre bagua)?:0.8.1.post1

    Reproducing

    Please provide a minimal working example. This means the runnable code.

    Please also write what exact commands are required to reproduce your results.

    Additional context Add any other context about the problem here.

    question 
    opened by lixiangMindSpore 5
  • Format Python code with psf/black push

    There appear to be some python formatting errors in a05e4e345ea9ab2f7b725a5cc2e90a827cef31ff. This pull request uses the psf/black formatter to fix these issues.

    PR: unreviewed 
    opened by github-actions[bot] 5
  • Format Python code with psf/black push

    There appear to be some python formatting errors in cd499b8482cd293c584b599f55ffacea94020039. This pull request uses the psf/black formatter to fix these issues.

    PR: unreviewed 
    opened by github-actions[bot] 5
  • Format Python code with psf/black push

    There appear to be some python formatting errors in 8117b05f0b08bc3d03dc8f48572fc95b50331ff6. This pull request uses the psf/black formatter to fix these issues.

    PR: unreviewed 
    opened by github-actions[bot] 5
  • chore(deps): bump once_cell from 1.10.0 to 1.17.0 in /rust/bagua-core

    Bumps once_cell from 1.10.0 to 1.17.0.

    Changelog

    Sourced from once_cell's changelog.

    1.17.0

    • Add race::OnceRef for storing a &'a T.

    1.16.0

    • Add no_std implementation based on critical-section, #195.
    • Deprecate atomic-polyfill feature (use the new critical-section instead)

    1.15.0

    • Increase minimal supported Rust version to 1.56.0.
    • Implement UnwindSafe even if the std feature is disabled.

    1.14.0

    • Add extension to unsync and sync Lazy mut API:
      • force_mut
      • get_mut

    1.13.1

    • Make implementation compliant with strict provenance.
    • Upgrade atomic-polyfill to 1.0

    1.13.0

    • Add Lazy::get, similar to OnceCell::get.

    1.12.1

    • Remove incorrect debug_assert.

    1.12.0

    • Add OnceCell::wait, a blocking variant of get.

    1.11.0

    • Add OnceCell::with_value to create initialized OnceCell in const context.
    • Improve Clone implementation for OnceCell.
    • Rewrite parking_lot version on top of parking_lot_core, for even smaller cells!
    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    PR: unreviewed dependencies rust 
    opened by dependabot[bot] 0
  • chore(deps): bump libc from 0.2.125 to 0.2.139 in /rust/bagua-core

    Bumps libc from 0.2.125 to 0.2.139.

    Release notes

    Sourced from libc's releases.

    0.2.139

    What's Changed

    New Contributors

    Full Changelog: https://github.com/rust-lang/libc/compare/0.2.138...0.2.139

    0.2.138

    What's Changed

    ... (truncated)

    Commits
    • f4bc851 Auto merge of #3042 - flba-eb:release_0.2.139, r=JohnTitor
    • dc3d43c Prepare 0.2.139 release
    • c59ca73 Auto merge of #3041 - devnexen:linux_kernel_version, r=JohnTitor
    • 88d6a1f adding KERNEL_VERSION macro for linux.
    • 45b431a Auto merge of #2758 - fkm3:master, r=JohnTitor
    • 572e11b Add misc constants and functions for android
    • 318dccc Auto merge of #3038 - gh-tr:rebased/20221216, r=JohnTitor
    • 07636f6 Auto merge of #3036 - LegionMammal978:iso-c-funcs, r=JohnTitor
    • 720151f Add support for QNX/Neutrino 7.1
    • 6a58758 Add ISO C functions atof, atol, atoll, strtoll, strtoull
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    PR: unreviewed dependencies rust 
    opened by dependabot[bot] 0
  • chore(deps): update scikit-learn requirement from !=1.0,<=1.0.1,>=0.24 to >=0.24,!=1.0,<1.2.1

    Updates the requirements on scikit-learn to permit the latest version.

    Release notes

    Sourced from scikit-learn's releases.

    Scikit-learn 1.2.0

    We're happy to announce the 1.2.0 release.

    You can read the release highlights under https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_2_0.html and the long version of the change log under https://scikit-learn.org/stable/whats_new/v1.2.html

    This version supports Python versions 3.8 to 3.11.

    Commits

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    PR: unreviewed dependencies python 
    opened by dependabot[bot] 0
  • Programs get blocked when using multi-node training.

    Describe the bug A clear and concise description of what the bug is.

    Programs get blocked when using multiple nodes. By setting export LOG_LEVEL=DEBUG, I can see that it got stuck at BaguaSingleCommunicator, since it prints

    2022-11-21T12:40:23.673510Z DEBUG bagua_core_internal::communicators: creating communicator, nccl_unique_id AgCwgcCQEwkAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=, rank 8, nranks 16, device_id 0, stream_ptr 94639511762624

    but fails to print

    al communicator initialized at XXX.

    When I set --node_rank=0, the program can run smoothly.

    Environment

    • Your operating system and version: Linux node-162 4.4.0-131-generic #157-Ubuntu SMP Thu Jul 12 15:51:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
    • Your python version: Python 3.8.13 (default, Mar 28 2022, 11:38:47)
    • Your PyTorch version: 1.12.1
    • NCCL version: 2.10.3
    • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?: conda
    • Have you tried using latest bagua master (python3 -m pip install --pre bagua)?: yes

    Reproducing

    Please provide a minimal working example. This means the runnable code.

    import argparse
    from ast import arg
    from curses import baudrate
    import os
    import random
    import shutil
    import time
    import warnings
    import logging
    
    import torch
    import torch.nn as nn
    import torch.nn.parallel
    import torch.backends.cudnn as cudnn
    import torch.optim
    import torch.utils.data
    import torch.utils.data.distributed
    from torch.utils.tensorboard import SummaryWriter
    import torchvision.transforms as transforms
    import torchvision.datasets as datasets
    import torchvision.models as models
    import bagua.torch_api as bagua
    from bisect import bisect_right
    from pathlib import Path
    
    model_names = sorted(
        name
        for name in models.__dict__
        if name.islower() and not name.startswith("__") and callable(models.__dict__[name])
    )
    
    parser = argparse.ArgumentParser(description="PyTorch ImageNet Training")
    parser.add_argument("data", metavar="DIR", help="path to dataset")
    parser.add_argument(
        "-a",
        "--arch",
        metavar="ARCH",
        default="resnet18",
        choices=model_names,
        help="model architecture: " + " | ".join(model_names) + " (default: resnet18)",
    )
    parser.add_argument(
        "-j",
        "--workers",
        default=4,
        type=int,
        metavar="N",
        help="number of data loading workers (default: 4)",
    )
    parser.add_argument(
        "--epochs", default=90, type=int, metavar="N", help="number of total epochs to run"
    )
    parser.add_argument(
        "--warmup-epochs", type=float, default=5, help="number of warmup epochs"
    )
    parser.add_argument(
        "--start-epoch",
        default=0,
        type=int,
        metavar="N",
        help="manual epoch number (useful on restarts)",
    )
    parser.add_argument(
        "-b",
        "--batch-size",
        default=256,
        type=int,
        metavar="N",
        help="mini-batch size (default: 256), this is the total "
        "batch size of all GPUs on the current node when "
        "using Data Parallel or Distributed Data Parallel",
    )
    parser.add_argument(
        "--lr",
        "--learning-rate",
        default=0.1,
        type=float,
        metavar="LR",
        help="initial learning rate",
        dest="lr",
    )
    parser.add_argument("--momentum", default=0.9, type=float, metavar="M", help="momentum")
    parser.add_argument(
        "--wd",
        "--weight-decay",
        default=1e-4,
        type=float,
        metavar="W",
        help="weight decay (default: 1e-4)",
        dest="weight_decay",
    )
    parser.add_argument(
        "--milestones",
        default="60,70,80",
        type=str,
        help="multi-step learning rate scheduler milestones",
    )
    parser.add_argument(
        "--gama",
        type=float,
        default=0.2,
        help="multiplicative factor of learning rate decay",
    )
    parser.add_argument(
        "-p",
        "--print-freq",
        default=10,
        type=int,
        metavar="N",
        help="print frequency (default: 10)",
    )
    parser.add_argument(
        "--resume",
        default="",
        type=str,
        metavar="PATH",
        help="path to latest checkpoint (default: none)",
    )
    parser.add_argument(
        "--save-checkpoint", action="store_true", default=False, help="save checkpoint"
    )
    parser.add_argument(
        "-e",
        "--evaluate",
        dest="evaluate",
        action="store_true",
        help="evaluate model on validation set",
    )
    parser.add_argument(
        "--pretrained", dest="pretrained", action="store_true", help="use pre-trained model"
    )
    parser.add_argument(
        "--seed", default=None, type=int, help="seed for initializing training. "
    )
    parser.add_argument(
        "--amp",
        action="store_true",
        default=False,
        help="use amp",
    )
    
    parser.add_argument(
        "--prof", default=-1, type=int, help="Only run 10 iterations for profiling."
    )
    
    parser.add_argument(
        "--algorithm",
        type=str,
        default="gradient_allreduce",
        help="distributed algorithm: {gradient_allreduce, bytegrad, decentralized, low_precision_decentralized, qadam, async}",
    )
    
    parser.add_argument(
        "--async-sync-interval",
        default=500,
        type=int,
        help="Model synchronization interval(ms) for async algorithm",
    )
    
    parser.add_argument(
        "--async-warmup-steps",
        default=100,
        type=int,
        help="Warmup(allreduce) steps for async algorithm",
    )
    
    parser.add_argument(
        "--ckpt-dir",
        default="./ckpt/ckpt",
        type=str,
        help="The folder saving ckpt file",
    )
    
    parser.add_argument(
        "--log-dir",
        default="./log/log",
        type=str,
        help="The folder saving tensorboard log",
    )
    
    best_acc1 = 0
    summary_writer = None
    my_global_step = 0
    
    def main():
        args = parser.parse_args()
    
        if args.seed is not None:
            random.seed(args.seed)
            torch.manual_seed(args.seed)
            cudnn.deterministic = True
            warnings.warn(
                "You have chosen to seed training. "
                "This will turn on the CUDNN deterministic setting, "
                "which can slow down your training considerably! "
                "You may see unexpected behavior when restarting "
                "from checkpoints."
            )
    
        torch.cuda.set_device(bagua.get_local_rank())
        bagua.init_process_group()
        args.distributed = bagua.get_world_size() > 1
    
        logging.basicConfig(
            format="rank-{} %(asctime)s,%(msecs)d %(levelname)-8s [%(filename)s:%(lineno)d] %(message)s".format(
                bagua.get_rank()
            ),
            datefmt="%Y-%m-%d:%H:%M:%S",
            level=logging.ERROR,
        )
    
        if bagua.get_rank() == 0:
            logging.getLogger().setLevel(logging.INFO)
    
        main_worker(args)
    
    
    def main_worker(args):
        global best_acc1
        global summary_writer
    
        summary_writer = SummaryWriter(log_dir=args.log_dir)
    
        # create model
        if args.pretrained:
            print("=> using pre-trained model '{}'".format(args.arch))
            model = models.__dict__[args.arch](pretrained=True)
        else:
            print("=> creating model '{}'".format(args.arch))
            model = models.__dict__[args.arch]()
    
        model = model.cuda()
    
        # define loss function (criterion) and optimizer
        criterion = nn.CrossEntropyLoss().cuda()
    
        optimizer = torch.optim.SGD(
            model.parameters(),
            args.lr,
            momentum=args.momentum,
            weight_decay=args.weight_decay,
        )
    
        if args.algorithm == "gradient_allreduce":
            from bagua.torch_api.algorithms import gradient_allreduce
    
            algorithm = gradient_allreduce.GradientAllReduceAlgorithm()
        else:
            raise NotImplementedError
    
        scaler = torch.cuda.amp.GradScaler(enabled=args.amp)
    
        # optionally resume from a checkpoint
        if args.resume:
            if os.path.isfile(args.resume):
                print("=> loading checkpoint '{}'".format(args.resume))
                # Map model to be loaded to specified single gpu.
                loc = "cuda:{}".format(bagua.get_local_rank())
                checkpoint = torch.load(args.resume, map_location=loc)
                args.start_epoch = checkpoint["epoch"]
                best_acc1 = checkpoint["best_acc1"]
                if bagua.get_local_rank() is not None:
                    # best_acc1 may be from a checkpoint from a different GPU
                    best_acc1 = best_acc1.to(bagua.get_local_rank())
                model.load_state_dict(checkpoint["state_dict"])
                optimizer.load_state_dict(checkpoint["optimizer"])
                print(
                    "=> loaded checkpoint '{}' (epoch {})".format(
                        args.resume, checkpoint["epoch"]
                    )
                )
            else:
                print("=> no checkpoint found at '{}'".format(args.resume))
    
        if args.distributed:
            _test_rank = bagua.get_rank()
            model = model.with_bagua(
                [optimizer],
                algorithm,
            )
    
        cudnn.benchmark = True
    
        # Data loading code
        traindir = os.path.join(args.data, "train")
        valdir = os.path.join(args.data, "val")
        normalize = transforms.Normalize(
            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
        )
    
        train_dataset = datasets.ImageFolder(
            traindir,
            transforms.Compose(
                [
                    transforms.RandomResizedCrop(224),
                    transforms.RandomHorizontalFlip(),
                    transforms.ToTensor(),
                    normalize,
                ]
            ),
        )
        val_dataset = datasets.ImageFolder(
            valdir,
            transforms.Compose(
                [
                    transforms.Resize(256),
                    transforms.CenterCrop(224),
                    transforms.ToTensor(),
                    normalize,
                ]
            ),
        )
    
        if args.distributed:
            train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
        else:
            train_sampler = None
    
        train_loader = torch.utils.data.DataLoader(
            train_dataset,
            batch_size=args.batch_size,
            shuffle=(train_sampler is None),
            num_workers=args.workers,
            pin_memory=True,
            sampler=train_sampler,
        )
    
        val_loader = torch.utils.data.DataLoader(
            val_dataset,
            batch_size=args.batch_size,
            shuffle=False,
            num_workers=args.workers,
            pin_memory=True,
        )
    
        if args.evaluate:
            validate(val_loader, model, criterion, 0, args)
            return
    
        for epoch in range(args.start_epoch, args.epochs):
            if args.distributed:
                train_sampler.set_epoch(epoch)
    
            if args.algorithm == "async":
                model.bagua_algorithm.resume(model)
    
            # train for one epoch
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
    
            start.record()
            train(train_loader, model, criterion, optimizer, scaler, epoch, args)
            end.record()
    
            # Waits for everything to finish running
            torch.cuda.synchronize()
            elapsed_time = start.elapsed_time(end)
            write_scalar(tag='train/epoch_training_time', scalar_value=elapsed_time, global_step=epoch)
    
            if args.algorithm == "async":
                model.bagua_algorithm.abort(model)
    
            # evaluate on validation set
            acc1 = validate(val_loader, model, criterion, epoch, args)
    
            # remember best acc@1 and save checkpoint
            is_best = acc1 > best_acc1
            best_acc1 = max(acc1, best_acc1)
    
            if bagua.get_rank() == 0 and args.save_checkpoint:
                save_checkpoint(
                    {
                        "epoch": epoch + 1,
                        "arch": args.arch,
                        "state_dict": model.state_dict(),
                        "best_acc1": best_acc1,
                        "optimizer": optimizer.state_dict(),
                    },
                    is_best,
                    dir=args.ckpt_dir
                )
    
    def train(train_loader, model, criterion, optimizer, scaler, epoch, args):
        global my_global_step
    
        batch_time = AverageMeter("Time", ":6.3f")
        data_time = AverageMeter("Data", ":6.3f")
        losses = AverageMeter("Loss", ":.4e")
        top1 = AverageMeter("Acc@1", ":6.2f")
        top5 = AverageMeter("Acc@5", ":6.2f")
        progress = ProgressMeter(
            len(train_loader),
            [batch_time, data_time, losses, top1, top5],
            prefix="Epoch: [{}]".format(epoch),
        )
    
        # switch to train mode
        model.train()
    
        end = time.time()
        for i, (images, target) in enumerate(train_loader):
    
            if args.prof >= 0 and i == args.prof:
                print("Profiling begun at iteration {}".format(i))
                torch.cuda.cudart().cudaProfilerStart()
    
            if args.prof >= 0:
                torch.cuda.nvtx.range_push("Body of iteration {}".format(i))
    
            # measure data loading time
            data_time.update(time.time() - end)
    
            if torch.cuda.is_available():
                images = images.cuda(bagua.get_local_rank(), non_blocking=True)
                target = target.cuda(bagua.get_local_rank(), non_blocking=True)
    
            adjust_learning_rate(optimizer, epoch, i, len(train_loader), args)
    
            optimizer.zero_grad()
    
            if args.prof >= 0:
                torch.cuda.nvtx.range_push("forward")
    
            with torch.cuda.amp.autocast(enabled=args.amp):
                # compute output
                output = model(images)
                loss = criterion(output, target)
    
            if args.prof >= 0:
                torch.cuda.nvtx.range_pop()
    
            # measure accuracy and record loss
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            losses.update(loss.item(), images.size(0))
            top1.update(acc1[0], images.size(0))
            top5.update(acc5[0], images.size(0))
    
            if args.prof >= 0:
                torch.cuda.nvtx.range_push("backward")
    
            # compute gradient and do SGD step
            scaler.scale(loss).backward()
    
            if args.prof >= 0:
                torch.cuda.nvtx.range_pop()
    
            if args.prof >= 0:
                torch.cuda.nvtx.range_push("optimizer.step()")
    
            scaler.step(optimizer)
            scaler.update()
    
            if args.prof >= 0:
                torch.cuda.nvtx.range_pop()
    
            # measure elapsed time
            batch_time.update(time.time() - end)
            end = time.time()
    
            if i % args.print_freq == 0:
                progress.display(i)
                write_scalar(tag='train/acc_top1', scalar_value=top1.get_avg(), global_step=my_global_step)
                write_scalar(tag='train/acc_top5', scalar_value=top5.get_avg(), global_step=my_global_step)
    
            # Pop range "Body of iteration {}".format(i)
            if args.prof >= 0:
                torch.cuda.nvtx.range_pop()
    
            if args.prof >= 0 and i == args.prof + 10:
                print("Profiling ended at iteration {}".format(i))
                torch.cuda.cudart().cudaProfilerStop()
    
                if args.algorithm == "async":
                    model.bagua_algorithm.abort(model)
                quit()
    
    
    def validate(val_loader, model, criterion, epoch, args):
        batch_time = AverageMeter("Time", ":6.3f")
        losses = AverageMeter("Loss", ":.4e")
        top1 = AverageMeter("Acc@1", ":6.2f")
        top5 = AverageMeter("Acc@5", ":6.2f")
        progress = ProgressMeter(
            len(val_loader), [batch_time, losses, top1, top5], prefix="Test: "
        )
    
        # switch to evaluate mode
        model.eval()
    
        with torch.no_grad():
            end = time.time()
            for i, (images, target) in enumerate(val_loader):
                if torch.cuda.is_available():
                    images = images.cuda(bagua.get_local_rank(), non_blocking=True)
                    target = target.cuda(bagua.get_local_rank(), non_blocking=True)
    
                # compute output
                output = model(images)
                loss = criterion(output, target)
    
                # measure accuracy and record loss
                acc1, acc5 = accuracy(output, target, topk=(1, 5))
                losses.update(loss.item(), images.size(0))
                top1.update(acc1[0], images.size(0))
                top5.update(acc5[0], images.size(0))
    
                # measure elapsed time
                batch_time.update(time.time() - end)
                end = time.time()
    
                if i % args.print_freq == 0:
                    progress.display(i)
    
            # TODO: this should also be done with the ProgressMeter
            logging.info(
                " * TEST Epoch {} Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}".format(
                    epoch, top1=top1, top5=top5
                )
            )
            write_scalar(tag='validation/acc_top1', scalar_value=top1.get_avg(), global_step=epoch)
            write_scalar(tag='validation/acc_top5', scalar_value=top5.get_avg(), global_step=epoch)
    
    
        return top1.avg
    
    def write_scalar(tag, scalar_value, global_step):
        global summary_writer
        if bagua.get_rank() == 0:
            summary_writer.add_scalar(tag=tag, scalar_value=scalar_value, global_step=global_step)
    
    def save_checkpoint(state, is_best, dir="./ckpt/dir"):
        dir = Path(dir)
        if not dir.exists():
            dir.mkdir(parents=True)
        
        file_name = dir / "checkpoint.pth.tar"
        torch.save(state, file_name)
        if is_best:
            shutil.copyfile(file_name, dir / "model_best.pth.tar")
    
    class AverageMeter(object):
        """Computes and stores the average and current value"""
    
        def __init__(self, name, fmt=":f"):
            self.name = name
            self.fmt = fmt
            self.reset()
    
        def reset(self):
            self.val = 0
            self.avg = 0
            self.sum = 0
            self.count = 0
    
        def update(self, val, n=1):
            self.val = val
            self.sum += val * n
            self.count += n
            self.avg = self.sum / self.count
    
        def __str__(self):
            fmtstr = "{name} {val" + self.fmt + "} ({avg" + self.fmt + "})"
            return fmtstr.format(**self.__dict__)
        
        def get_avg(self):
            return self.avg
    
    
    class ProgressMeter(object):
        def __init__(self, num_batches, meters, prefix=""):
            self.batch_fmtstr = self._get_batch_fmtstr(num_batches)
            self.meters = meters
            self.prefix = prefix
    
        def display(self, batch):
            entries = [self.prefix + self.batch_fmtstr.format(batch)]
            entries += [str(meter) for meter in self.meters]
            logging.info("\t".join(entries))
    
        def _get_batch_fmtstr(self, num_batches):
            num_digits = len(str(num_batches // 1))
            fmt = "{:" + str(num_digits) + "d}"
            return "[" + fmt + "/" + fmt.format(num_batches) + "]"
    
    
    def adjust_learning_rate(optimizer, epoch, step, len_epoch, args):
        """Sets the learning rate to the initial LR decayed by 10 every 30 epochs"""
        # lr = args.lr * (0.1 ** (epoch // 30))
        # for param_group in optimizer.param_groups:
        #     param_group["lr"] = lr
        milestones = [int(i) for i in args.milestones.split(",")]
        lr = args.lr * (args.gama ** bisect_right(milestones, epoch))
    
        """Warmup"""
        if epoch < args.warmup_epochs:
            lr = (
                lr
                * float(1 + step + epoch * len_epoch)
                / float(args.warmup_epochs * len_epoch)
            )
    
        # logging.info("epoch = {}, step = {}, lr = {}".format(epoch, step, lr))
    
        for param_group in optimizer.param_groups:
            param_group["lr"] = lr
    
    
    def accuracy(output, target, topk=(1,)):
        """Computes the accuracy over the k top predictions for the specified values of k"""
        with torch.no_grad():
            maxk = max(topk)
            batch_size = target.size(0)
    
            _, pred = output.topk(maxk, 1, True, True)
            pred = pred.t()
            correct = pred.eq(target.view(1, -1).expand_as(pred))
    
            res = []
            for k in topk:
                correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
                res.append(correct_k.mul_(100.0 / batch_size))
            return res
    
    
    if __name__ == "__main__":
        main()
    

    Please also write what exact commands are required to reproduce your results.

    python -m bagua.distributed.launch \
            --nproc_per_node=8 --nnodes=1 --node_rank=0 \
            --master_addr="10.154.34.164" --master_port=34498 \
            main.py \
            --arch=resnet50 \
            --save-checkpoint \
            --lr 0.2 \
            --batch-size 64 \
            --print-freq 100 \
            --algorithm gradient_allreduce \
            --resume ./ckpt/multi_node_gradient_allreduce \
            --ckpt-dir ./ckpt/multi_node_gradient_allreduce \
            --log-dir ./log/multi_node_gradient_allreduce \
            $DATA_PATH
    

    Additional context Add any other context about the problem here.

    opened by zhaone 0
Releases (v0.9.0)
  • v0.9.0 (Jan 17, 2022)

    Bug Fixes

    Other

    • Reuse fused parameter tensors in fuse_step (#410)
    • Call step closure in qadam optimizer step (#432)
    • Fix need_reset condition (#454)
    • Do negotiation in async native op (#447)
    • Fix find_unused_parameters (#452)
    • Fix qadam non-deterministic (#459)
    • Add LIBRARY_PATH env in install_master.sh (#465)
    • Fix typo in install_master.sh (#471)

    Python

    • CUDA 11.5 can't get nccl package (#415)
    • Fix process group compatibility with torch 1.6.0 (#413)
    • Fix ci random fail (#445)
    • Fix async algorithm (#479)

    Features

    Core

    • Initial support for C interface (#325)

    Other

    • Support NODE_RANK environment variable (#426)
    • Choose bagua service port dynamically (#431)
    • Use bagua_module_name to identify different modules (#438)
    • Add algorithm registry (#433)
    • Add compatibility for NCCL version under 2.10 (#449)
    • Add broadcast object api (#437)
    • Support qadam in fused optimizer (#477)

    Python

    • Support PyTorch DDP compatible distributed training API (#312)
    • Support torch-api-compatiable all_reduce (#377)
    • Associate PyTorch Process Group with Bagua Process Group using cache (#402)
    • Support find_unused_parameters on BaguaDDP (#409)
    • Add BAGUA_AUTOTUNE_SERVER_WAIT_TIME env (#474)
    Source code(tar.gz)
    Source code(zip)
  • v0.8.2 (Nov 10, 2021)

    Features

    Python

    • Support switching between different algorithms (#299)
    • Support separate algorithm declaration and implementation (#246)

    Python, core

    • Support process group in with_bagua, support hierarchical communication in bytegrad algorithm (#300)
    • Support mutable bucket tensors (#271)
    • Support all_to_all_single (#361)

    Bug Fixes

    Other

    • Fuse optimizer oom and make it stateless (#207)
    • to_bagua_tensor compatibility with torch 1.6.0 (#355)

    Python

    • Use separate process group for async communication thread to avoid potential hangs (#298)
    • Do not fail if checkpoints path exist (#305)
    • Fix is_moe_param (#306)
    • Change to_bagua_tensor API to support PyTorch 1.10 (#338)
    • Fix fused optimizer with multiple param groups (#356)
    Source code(tar.gz)
    Source code(zip)
  • v0.8.1.post1 (Oct 16, 2021)

    Bug Fixes

    • Process group not yet supported in with_bagua
    • Use separate process group for async communication thread to avoid potential hangs (#298)
    Source code(tar.gz)
    Source code(zip)
  • v0.8.1 (Oct 16, 2021)

    [0.8.1] - 2021-10-16

    Features

    • Support moe (#208)
    • Support checkpointing for moe (#242)
    • Use single bucket for decentralized algorithm to improve performance (#275)
    • Support process group (#228)
    • Add barrier api (#290)
    Source code(tar.gz)
    Source code(zip)
  • v0.8.0 (Sep 26, 2021)

    [0.8.0] - 2021-09-26

    Bug Fixes

    Ci

    • Only run publish once on git tag

    Core

    • Fix compressed buffer can not be scattered to odd number of ranks

    Other

    • Fix ci pypi versioning
    • Remove __init__.py and python version, use cargo version
    • Move import bagua_install_library to install library function
    • Merge bagua_install_library and setup.py, remove nccl<=2.6 support
    • Fix alltoall_v parameter (#17)
    • Reduce and allgather python interface
    • Fix decompress incorrect pointer and typo in error msg
    • Fix python gil deadlock during getting data ptr
    • Fix benchmark script requirements
    • Fix alltoall_v parameter types (#27)
    • Always mark bagua padding tensor as ready
    • Make compress/decompress of BaguaTensor method string consistent (#33)
    • Fix scatter and reduce_scatter implementation (#40)
    • Fix subtraction overflow error for decentralized op (#39)
    • Fix QADAM params (#17)
    • Fix assert precision (#18)
    • Replace mutex with atomic bool for async op and add Aluminum submodule update (#67)
    • Fix duplicated dependency downloading during installation (#77)
    • Fix async algorithm aborting and hanging (#78, #81)
    • Fix qadam algorithm call (#20)
    • Fix missing symbols in the zip library (#24)
    • Fix random autotune server hang (#206)
    • Bagua-net library path mismatch, make --enable_bagua_net argument style consistent with other args (#218)

    Python

    • Fix random autotune-service hang
    • Handle conflicts caused by sklearn upgrade (#225)

    Features

    Ci

    • Only publish pypi for master commits

    Other

    • Add async model average algorithm (#110)
    • Add cached dataset wrapper (#148)
    • Support sync batchnorm (#151)
    • Add --enable-bagua-net option in launcher (#183)
    • Add pytorch examples for MNIST, ImageNet, SQuAD training (#1)
    • Add requirements.txt, only download dataset on local rank 0 (#2)
    • Add python packaging related files
    • Add __version__ variable
    • Install nccl deps in bagua core and add generated __version__ variable
    • Add version.py placeholder to prevent file not found error
    • Initial support for python op (#2)
    • Add 5 min timeout for buckets' comm op (#5)
    • Replace NCCL with Aluminum (#7)
    • Add synthetic benchmark script (#5)
    • Add elastic training example (#7)
    • Support alltoall_v (vector alltoall) (#14)
    • Add reduce and allgather python interface
    • Support reduce and allgather op with Reduction op enum
    • Support creating BaguaTensor by passing torch tensor directly (#19)
    • Compatible mode for getting pytorch tensor info with Python interpreter
    • Better debug log including tensor info when executing ops
    • Add native low precision decentralized operator (#26)
    • Add (scatter, gather, scatter_reduce) and all inplace version communication primitives (#37)
    • Make full precision decentralized op stateless (#36)
    • Add communication_primitives example (#12)
    • Use nccl 2.10 avg op for all algorithms using averaging (#46, #45)
    • Add opentelemetry to report tensor ready order (#42)
    • Add deterministic flag (#15)
    • Add native async model average algorithm (#41)
    • Add examples for async model average algorithm (#14)
    • Support packet splitting and multi-stream parallel transmission (#5)
    • Support ncclnet v3 and remove the dependency on nccl in the installation environment (#17)
    • Add sync interval param to async examples (#19)
    • Support tokio backend (#21)
    • Support bagua-net (#89)
    Source code(tar.gz)
    Source code(zip)
  • v0.7.0(Aug 16, 2021)

    Bug Fixes

    • Autotune api conflict (#131)

    Features

    • Add low precision decentralized algorithm (#103)
    • Add all communication primitives such as send/recv to the communication module (#128) (see the sketch after this list)
    • Make full precision decentralized op stateless (#126)
    • Support nccl 2.10 ReduceOp.AVG (#149)
    • Add support for reporting tensor completion order (#146)
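    A sketch of the point-to-point primitives from #128 is below. The send/recv names and their dst/src arguments are assumptions modeled on torch.distributed and may differ from the actual signatures.

        # Hedged sketch: point-to-point communication between rank 0 and rank 1.
        import torch
        import bagua.torch_api as bagua
        from bagua.torch_api.communication import send, recv  # names assumed

        bagua.init_process_group()
        t = torch.zeros(4).cuda()
        if bagua.get_rank() == 0:
            send(t, dst=1)
        elif bagua.get_rank() == 1:
            recv(t, src=0)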
    Source code(tar.gz)
    Source code(zip)
  • v0.7.0-rc2(Jul 22, 2021)

  • v0.7.0-rc1(Jul 22, 2021)

  • v0.6.3(Jul 8, 2021)

    Features

    • support different ssh port on different nodes (#93) 6810245
    • support multiple models in one training script (#113) 312bcc0 (#107) 0aec789

    Fixes

    • autotune service now defaults to a fixed random seed (#117) a58c2de

    Others

    • sort q_adam variables for better performance (#102) f277549
    • improve autotune speed metrics measurement for better accuracy (#86) e4ee5ee
    • install.sh upgrades existing bagua package bc69890
    • install.sh will not install Rust if it already exists on the system 67e1efe
    Source code(tar.gz)
    Source code(zip)
  • v0.6.2(Jul 2, 2021)

  • v0.6.1(Jul 2, 2021)

    Features

    • add QAdam algorithm (#92) 0dafd24 (see the sketch after this list)
    • broadcast model parameters on every algorithm reset e5b36dc
    • wrap python op in communication stream context by default 51eb656
    • add append op methods to python BaguaBucket class (#87) 84d8cbc
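    A sketch of the QAdam algorithm added in #92 is below. The module path bagua.torch_api.algorithms.q_adam and the warmup_steps argument are assumptions based on later Bagua documentation; the toy model is a placeholder.

        # Hedged sketch: quantized Adam with Bagua's QAdam algorithm (#92).
        import torch
        import bagua.torch_api as bagua
        from bagua.torch_api.algorithms.q_adam import QAdamAlgorithm, QAdamOptimizer

        bagua.init_process_group()
        torch.cuda.set_device(bagua.get_local_rank())

        model = torch.nn.Linear(10, 2).cuda()  # placeholder model
        optimizer = QAdamOptimizer(model.parameters(), lr=1e-3, warmup_steps=100)
        model = model.with_bagua([optimizer], QAdamAlgorithm(optimizer))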

    Fixes

    • BaguaBucket.tensors should only contain the originally passed-in tensors c4ff05f
    • fix append python op callable reference 04019cc
    • fix BaguaBucket.clear_ops() return value 8cb9f54
    Source code(tar.gz)
    Source code(zip)
  • v0.6.0(Jul 1, 2021)

    ⚠ BREAKING CHANGE

    • Now end users should use model.with_bagua(...) API to use Bagua for communication. Algorithm developers can use bagua.torch_api.algorithms.Algorithm to easily develop new algorithms. Installation requires bagua-core >=0.3 now.
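    A minimal sketch of the new API is shown below, assuming one process per GPU launched via Bagua's distributed launcher; the toy model and optimizer are placeholders.

        # Hedged sketch: wrapping a model with model.with_bagua(...).
        import torch
        import bagua.torch_api as bagua
        from bagua.torch_api.algorithms import gradient_allreduce

        bagua.init_process_group()
        torch.cuda.set_device(bagua.get_local_rank())

        model = torch.nn.Linear(10, 2).cuda()  # placeholder model
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        # GradientAllReduceAlgorithm mirrors standard synchronous data parallelism.
        model = model.with_bagua(
            [optimizer], gradient_allreduce.GradientAllReduceAlgorithm()
        )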

    Features

    • add algorithm import in bagua.torch_api ee73edc
    • support reduction op and reduce ac8632c
    • auto installation support centos (#50) 073a59e

    Fixes

    • fix algorithm pre-forward hook not being returned e6c7c8d
    • the environment variable LOCAL_SIZE has been renamed to LOCAL_WORLD_SIZE (#51) 801b25a
    Source code(tar.gz)
    Source code(zip)
  • 0.5.0(Jun 25, 2021)

    ⚠ BREAKING CHANGE

    • contrib: load balancing dataloader and fused optimizer are now in bagua.torch_api.contrib module (see the import sketch after this list)
    • baguaelastic/distributed/launch.py now moved to bagua/distributed/run.py
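    An import sketch for the relocated contrib utilities is below. The class names LoadBalancingDistributedSampler and FusedOptimizer and their arguments are assumptions based on Bagua's contrib documentation; the dataset and model are placeholders.

        # Hedged sketch: load-balancing sampler and fused optimizer from contrib.
        import torch
        from bagua.torch_api.contrib import LoadBalancingDistributedSampler, FusedOptimizer

        dataset = torch.utils.data.TensorDataset(torch.randn(1000, 10))  # placeholder
        model = torch.nn.Linear(10, 2)                                   # placeholder

        sampler = LoadBalancingDistributedSampler(
            dataset, complexity_fn=lambda item: item[0].numel()
        )
        loader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=32)
        optimizer = FusedOptimizer(torch.optim.SGD(model.parameters(), lr=0.01))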

    Features

    • add dependency installation script for ubuntu (#41) 4d820ab
    • Elastic training (#31) 1a5964c
    • add broadcast_buffer in bagua_init (#29) e761cc6
    • support bagua-core 0.2 (#26) f1d2bfa

    Fixes

    • autotune: fix bucket size switch not effective (#48) 30b490a
    • remove logging in load balancing dataloader to avoid deadlock (#35) e900383
    • torch_api.distributed: fix cyclic dependency (#16) 0314e24
    • fix setup.py for low version setuptools (#14) 7d315c0
    • fix baguaelastic launch script b069cd4
    Source code(tar.gz)
    Source code(zip)
  • 0.4.0(Jun 17, 2021)
