A high performance and generic framework for distributed DNN training

Overview

BytePS

BytePS is a high-performance, general-purpose distributed training framework. It supports TensorFlow, Keras, PyTorch, and MXNet, and can run on either TCP or RDMA networks.

BytePS outperforms existing open-source distributed training frameworks by a large margin. For example, on BERT-large training, BytePS can achieve ~90% scaling efficiency with 256 GPUs (see below), which is much higher than Horovod+NCCL. In certain scenarios, BytePS can double the training speed compared with Horovod+NCCL.

News

  • BytePS paper has been accepted to OSDI'20. The code to reproduce the end-to-end evaluation is available here.
  • Support gradient compression.
  • v0.2.4
    • Fix compatibility issue with tf2 + standalone keras
    • Add support for tensorflow.keras
    • Improve robustness of broadcast
  • v0.2.3
    • Add DistributedDataParallel module for PyTorch
    • Fix the problem of different CPU tensor using the same name
    • Add skip_synchronize api for PyTorch
    • Add the option for lazy/non-lazy init
  • v0.2.0
    • Largely improve RDMA performance by enforcing page aligned memory.
    • Add IPC support for RDMA. Now support colocating servers and workers without sacrificing much performance.
    • Fix a hanging bug in BytePS server.
    • Fix RDMA-related segmentation fault problem during fork() (e.g., used by PyTorch data loader).
    • New feature: Enable mixing use of colocate and non-colocate servers, along with a smart tensor allocation strategy.
    • New feature: Add bpslaunch as the command to launch tasks.
    • Add support for pip install: pip3 install byteps

Performance

We show our experiment on BERT-large training, which is based on the GluonNLP toolkit. The model uses mixed precision.

We use Tesla V100 32GB GPUs and set the batch size to 64 per GPU. Each machine has 8 V100 GPUs (32GB memory) with NVLink enabled. Machines are interconnected with a 100 Gbps RDMA network. This is the same hardware setup you can get on AWS.

BytePS achieves ~90% scaling efficiency for BERT-large with 256 GPUs. The code is available here. As a comparison, Horovod+NCCL has only ~70% scaling efficiency even after expert parameter tuning.

BERT-Large

With a slower network, BytePS offers even larger performance advantages: up to 2x of Horovod+NCCL. You can find more evaluation results at performance.md.

Goodbye MPI, Hello Cloud

How can BytePS outperform Horovod by so much? One of the main reasons is that BytePS is designed for cloud and shared clusters, and throws away MPI.

MPI was born in the HPC world and is good for a cluster built with homogeneous hardware and for running a single job. However, the cloud (or an in-house shared cluster) is different.

This leads us to rethink the best communication strategy, as explained here. In short, BytePS uses NCCL only inside a machine and re-implements the inter-machine communication.

BytePS also incorporates many acceleration techniques such as hierarchical strategy, pipelining, tensor partitioning, NUMA-aware local communication, priority-based scheduling, etc.
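The two-level idea described above (reduce inside a machine first, then communicate between machines) can be illustrated with a toy simulation. This is only a sketch: `local_reduce` and `push_pull` are hypothetical stand-ins written for this example, plain Python lists stand in for GPU tensors, and nothing here reflects the actual BytePS implementation.

```python
def local_reduce(gpu_grads):
    """Sum gradients element-wise across the GPUs of one machine.

    Stand-in for the intra-machine step (the role NCCL plays in the text).
    """
    return [sum(vals) for vals in zip(*gpu_grads)]


def push_pull(machine_grads):
    """Sum the per-machine results and hand the total back to every machine.

    Stand-in for the inter-machine step.
    """
    total = [sum(vals) for vals in zip(*machine_grads)]
    return [list(total) for _ in machine_grads]


# Two machines, two GPUs each, one 2-element gradient per GPU.
machines = [
    [[1.0, 2.0], [3.0, 4.0]],  # machine 0
    [[5.0, 6.0], [7.0, 8.0]],  # machine 1
]

per_machine = [local_reduce(gpus) for gpus in machines]
result = push_pull(per_machine)
print(per_machine)
print(result)
```

Only one reduced tensor per machine crosses the network in this sketch, regardless of how many GPUs the machine has.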

Quick Start

We provide a step-by-step tutorial for you to run benchmark training tasks. The simplest way to start is to use our docker images. Refer to the documentation for how to launch distributed jobs and for more detailed configurations. Once you can start BytePS, read best practice to get the best performance.

Below, we explain how to install BytePS yourself. There are two options.

Install by pip

pip3 install byteps

Build from source code

You can try out the latest features by directly installing from master branch:

git clone --recursive https://github.com/bytedance/byteps
cd byteps
python3 setup.py install

Notes for the above two options:

  • BytePS assumes that you have already installed one or more of the following frameworks: TensorFlow / PyTorch / MXNet.
  • BytePS depends on CUDA and NCCL. You should specify the NCCL path with export BYTEPS_NCCL_HOME=/path/to/nccl. By default it points to /usr/local/nccl.
  • The installation requires gcc>=4.9. If you are working on CentOS/Redhat and have gcc<4.9, you can try yum install devtoolset-7 before everything else. In general, we recommend using gcc 4.9 for best compatibility (how to pin gcc).
  • RDMA support: During setup, the script will automatically detect the RDMA header file. If you want to use RDMA, make sure your RDMA environment has been properly installed and tested before installing BytePS (install on Ubuntu-18.04).

Examples

Basic examples are provided under the example folder.

To reproduce the end-to-end evaluation in our OSDI'20 paper, find the code at this repo.

Use BytePS in Your Code

Although totally different at its core, BytePS is highly compatible with Horovod interfaces (Thank you, Horovod community!). We chose Horovod interfaces in order to minimize your effort in testing BytePS.

If your tasks only rely on Horovod's allreduce and broadcast, you should be able to switch to BytePS in 1 minute. Simply replace import horovod.tensorflow as hvd by import byteps.tensorflow as bps, and then replace all hvd in your code by bps. If your code invokes hvd.allreduce directly, you should also replace it by bps.push_pull.
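The renaming rules above are mechanical enough to express as a small text rewrite. The helper below is a hypothetical illustration written for this document (it is not a tool shipped with BytePS); it applies exactly the three substitutions described: the import line, `hvd.allreduce` to `bps.push_pull`, and every remaining `hvd` to `bps`.

```python
import re


def horovod_to_byteps(source):
    """Apply the Horovod -> BytePS renames described in the text to a script."""
    # 1. Swap the import line.
    source = source.replace(
        "import horovod.tensorflow as hvd",
        "import byteps.tensorflow as bps",
    )
    # 2. Direct allreduce calls become push_pull (must run before step 3).
    source = re.sub(r"\bhvd\.allreduce\b", "bps.push_pull", source)
    # 3. Every remaining use of the hvd alias becomes bps.
    return re.sub(r"\bhvd\b", "bps", source)


before = (
    "import horovod.tensorflow as hvd\n"
    "hvd.init()\n"
    "avg = hvd.allreduce(grad)\n"
)
after = horovod_to_byteps(before)
print(after)
```

A blind text rewrite like this is only a starting point; review the result by hand, since scripts that use Horovod features beyond allreduce and broadcast may need further changes.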

Many of our examples were copied from Horovod and modified in this way. For instance, compare the MNIST example for BytePS and Horovod.

BytePS also supports other native APIs, e.g., PyTorch Distributed Data Parallel and TensorFlow Mirrored Strategy. See DistributedDataParallel.md and MirroredStrategy.md for usage.

Limitations and Future Plans

BytePS does not support pure CPU training for now. One reason is that the cheap PS assumption of BytePS does not hold for CPU training. Consequently, you need CUDA and NCCL to build and run BytePS.

We would like to have the features below, and there is no fundamental difficulty in implementing them in the BytePS architecture. However, they are not implemented yet:

  • Sparse model training
  • Fault-tolerance
  • Straggler-mitigation

Publications

  1. [OSDI'20] "A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters". Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, Chuanxiong Guo.

  2. [SOSP'19] "A Generic Communication Scheduler for Distributed DNN Training Acceleration". Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, Chuanxiong Guo. (Code is at bytescheduler branch)

Comments
  • gradient compression support

    Motivation

    Currently BytePS does not fully support gradient compression. The compression it does support lives in each framework plugin, in Python. This design eases implementation but rules out more aggressive compression, because NCCL only supports a limited set of reduction operations, such as Sum and Prod, and these operations are meaningless for compressed data that has been tightly bit-packed. For example, in signSGD, one of the most popular gradient compression methods due to its simplicity and effectiveness, each bit represents the sign of one element of the original tensor, so reduction operations such as summation are meaningless on the packed data. Yet reduction is still necessary across multiple GPUs.

    Another problem is that intra-node communication, unlike inter-node communication, is not the bottleneck. Furthermore, compressing too aggressively at the start loses a lot of information, which may lower accuracy. So there is no need for aggressive compression before gradients enter the BytePS core on worker nodes.
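To make the signSGD point concrete, here is a minimal pure-Python sketch. `pack_signs` and `unpack_signs` are hypothetical helpers written for this illustration (one sign bit per element, 1 meaning non-negative); they are not BytePS or NCCL functions. The sketch shows that adding two packed payloads does not yield the signs of the summed gradients.

```python
def pack_signs(values):
    """Pack one sign bit per element into an int (bit i set if values[i] >= 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits


def unpack_signs(bits, n):
    """Recover +1 / -1 per element from the packed bits."""
    return [1 if (bits >> i) & 1 else -1 for i in range(n)]


worker_a = [0.5, -0.2, 0.1, -0.9]
worker_b = [0.3, 0.4, -0.6, -0.1]

packed_a = pack_signs(worker_a)
packed_b = pack_signs(worker_b)

# What a Sum reduction would do if applied to the packed payloads directly.
naive = unpack_signs(packed_a + packed_b, len(worker_a))

# What we actually want: the signs of the element-wise sum.
wanted = [1 if a + b >= 0 else -1 for a, b in zip(worker_a, worker_b)]

print(naive, wanted)
```

The two lists differ because integer addition carries between bit positions that belong to unrelated tensor elements.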

    Therefore, changes need to be made.

    Design Overview

    In light of the problems mentioned above, we propose two-level gradient compression:

    1. intra-node: This is just an alias for the current implementation, named after its communication property. Transform FP32 tensors into FP16 on each GPU, reduce them across multiple GPUs via NCCL, and copy them to the CPU buffer to wait for next-level compression. The purpose of this compression is to reduce the intra-node communication overhead introduced by multiple GPUs. Since intra-node communication is very fast, especially with NCCL, only mild compression methods will be applied, mostly type conversion. It is framework-specific and will be implemented in each plugin.

    2. inter-node: Usually inter-node communication is a bottleneck, so more drastic gradient compression algorithms will be applied here. This is framework-agnostic and will be implemented in BytePS core.
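As a rough illustration of why the intra-node level counts as "mild" compression, the sketch below converts a few FP32 values to FP16 using Python's standard `struct` module (format code `e` is IEEE half precision). The helper names are hypothetical and the sample values are made up; the plugins perform this conversion on GPU tensors, not with `struct`.

```python
import struct


def to_fp16_bytes(values):
    """Serialize values as little-endian IEEE half-precision floats."""
    return struct.pack(f"<{len(values)}e", *values)


def from_fp16_bytes(payload):
    """Decode a payload produced by to_fp16_bytes (2 bytes per element)."""
    return list(struct.unpack(f"<{len(payload) // 2}e", payload))


grads = [0.1, -1.5, 3.25]  # made-up sample gradient values

fp32_payload = struct.pack(f"<{len(grads)}f", *grads)
fp16_payload = to_fp16_bytes(grads)
restored = from_fp16_bytes(fp16_payload)

print(len(fp32_payload), len(fp16_payload))
print(restored)
```

The payload halves in size while each value changes only by a small rounding error, which is the trade-off the intra-node level accepts before the more aggressive inter-node compressors run.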

    It is worth mentioning that our design supports all frameworks.

    architecture

    Interface

    Users need to make only a few changes: add a few lines to the training script to specify which compression algorithm to use and the parameters that algorithm needs. Take MXNet as an example.

    compression_params = {
                "compressor": opt.compressor,
                "ef": opt.ef,
                "momentum": opt.compress_momentum,
                "scaling": opt.onebit_scaling,
                "k": opt.k
    }
    
    trainer = bps.DistributedTrainer(params, optimizer, optimizer_params, compression_params=compression_params)
    

    Here we prescribe some keys. Users can look up the documentation to determine which keys to use. Here are some common keys.

    | KEYS | DESC |
    | --- | --- |
    | compressor | compression algorithms, including onebit / dithering / topk / randomk |
    | k | an integer, must be specified when using dithering / topk / randomk |
    | scaling | optional, whether to enable scaling for onebit, default is false |
    | ef | error-feedback algorithms, e.g. vanilla |
    | momentum | momentum algorithms, e.g. nesterov |
    | seed | random seed |

    If the user's input is not correct, it will give a warning and abort.
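The rules in the key table can be sketched as a small validator. This is an assumption-laden illustration, not the BytePS code: `validate_compression_params` is a hypothetical name, and raising `ValueError` stands in for the "warning and abort" behavior described above. Only the rules stated in the table are checked.

```python
VALID_COMPRESSORS = {"onebit", "dithering", "topk", "randomk"}
NEEDS_K = {"dithering", "topk", "randomk"}


def validate_compression_params(params):
    """Check user-supplied compression_params against the key table.

    Returns a copy with the documented default for `scaling` filled in.
    """
    compressor = params.get("compressor")
    if compressor not in VALID_COMPRESSORS:
        raise ValueError(f"unknown compressor: {compressor!r}")

    if compressor in NEEDS_K:
        k = params.get("k")
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(k, int) or isinstance(k, bool):
            raise ValueError(f"{compressor} requires an integer 'k'")

    checked = dict(params)
    if compressor == "onebit":
        checked.setdefault("scaling", False)  # table: default is false
    return checked


ok = validate_compression_params({"compressor": "onebit", "ef": "vanilla"})
print(ok)

try:
    validate_compression_params({"compressor": "topk"})
except ValueError as err:
    print("rejected:", err)
```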

    Implementation

    Parameter Data Structure

    To offer users a unified interface, we have to address the registration problem. Parameters vary across compression algorithms. For example, the topk and randomk algorithms need the parameter k to be specified, while the onebit algorithm may need a flag for whether to enable scaling. Some parameters are optional but others are not. So parameter passing is a challenge.

    We address this challenge by using a string-string dictionary (std::unordered_map<std::string, std::string> for C++ or dict for Python) as our unified data structure for passing parameters. As mentioned above, we prescribe specific strings as keys, so the dictionary will look like:

    {"byteps_compressor_type": "topk", "byteps_compressor_k": "3", "byteps_error_feedback_type": "vanilla"}
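A sketch of how user-facing options might be turned into such a string-string dictionary follows. The three prefixed key names are taken from the example above; the mapping from the short user keys (`compressor`, `k`, `ef`) to them is an assumption inferred from the names, and `to_string_dict` is a hypothetical helper. Keys whose prefixed names are not shown in this document are deliberately left out.

```python
# Assumed mapping: only the prefixed names that appear in the example above.
KEY_MAP = {
    "compressor": "byteps_compressor_type",
    "k": "byteps_compressor_k",
    "ef": "byteps_error_feedback_type",
}


def to_string_dict(compression_params):
    """Convert user options into a flat str -> str dictionary."""
    out = {}
    for key, value in compression_params.items():
        if value is None:
            continue  # option not set by the user
        if key not in KEY_MAP:
            raise KeyError(f"no known prefixed name for {key!r}")
        out[KEY_MAP[key]] = str(value)  # every value is passed as a string
    return out


encoded = to_string_dict({"compressor": "topk", "k": 3, "ef": "vanilla"})
print(encoded)
```

Stringifying every value keeps the structure identical on the Python and C++ sides, at the cost of parsing numbers again in the core.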
    

    Python

    For MXNet users, the dictionary can be an attribute of ParameterDict. We can filter out those parameters by leveraging the prefix "byteps". For example,

    for i, param in enumerate(self._params):
               byteps_declare_tensor("parameter_" + str(i))
               if param.grad_req != 'null':
                   byteps_params = dict(
                       filter(lambda attr: attr[0].startswith(
                           "byteps_",), param.__dict__.items())
                   )
                   byteps_declare_tensor("gradient_" + str(i), **byteps_params)
    

    C++

    Using ctypes, we can pass the dictionary conveniently. For example,

    extern "C" void byteps_mxnet_declare_tensor(char* name, int num_params,
                                               char** param_keys,
                                               char** param_vals) {
     ...
    
     std::unordered_map<std::string, std::string> param_dict;
     std::string key, val;
     std::string::size_type pos;
     for (int i = 0; i < num_params; ++i) {
       key = param_keys[i];
       val = param_vals[i];
       param_dict[key] = val;
     }
    
     ...
    }
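On the Python side, a dictionary can be marshalled into the `num_params`, `param_keys`, `param_vals` triple that the C function above expects. The sketch below uses only the standard `ctypes` module; `to_c_arrays` is a hypothetical helper, and the final call is shown as a comment because it needs the compiled library, whose loading is not covered here.

```python
import ctypes


def to_c_arrays(param_dict):
    """Build (count, char** keys, char** values) from a str -> str dict."""
    keys = [k.encode("utf-8") for k in param_dict]
    vals = [v.encode("utf-8") for v in param_dict.values()]
    n = len(keys)
    CharPtrArray = ctypes.c_char_p * n  # array type holding n C strings
    return n, CharPtrArray(*keys), CharPtrArray(*vals)


params = {"byteps_compressor_type": "topk", "byteps_compressor_k": "3"}
num_params, c_keys, c_vals = to_c_arrays(params)
print(num_params, c_keys[0], c_vals[1])

# With the shared library loaded as `lib` (hypothetical handle), the call
# would then look like:
# lib.byteps_mxnet_declare_tensor(b"gradient_0", num_params, c_keys, c_vals)
```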
    

    Compressor - Development API

    We want developers to be able to develop their own gradient compression algorithms without fully understanding how BytePS works. All they need to know is the development API. We currently implement some commonly used gradient compression algorithms, and we hope more novel algorithms will be implemented under our API in the future. We abstract compression algorithms into a compressor. The Compressor looks like this:

    class Compressor {
     public:
      Compressor(size_t size, DataType dtype)
          : _size(size),
            _dtype(dtype),
            _buf(new byte_t[size]),
            _cpu_reducer(new CpuReducer(nullptr)){};
      virtual ~Compressor() = default;
    
      virtual tensor_t Compress(tensor_t grad) = 0;
    
      virtual tensor_t Decompress(tensor_t compressed) = 0;
    
      virtual void FastUpdateError(tensor_t error, tensor_t corrected,
                                   tensor_t compressed) {
        BPS_LOG(FATAL) << "FastUpdateError is not implemented";
      };
    
      std::unique_ptr<byte_t[]> _buf;
    
      size_t _size;
    
      DataType _dtype;
    
      std::unique_ptr<CpuReducer> _cpu_reducer;
    };
    

    To minimize modifications to BytePS core, we want compressors to be as general as possible. In the best case, a base compressor pointer/reference can represent all kinds of compressors and only needs to expose two operations to users: Compress and Decompress. This is quite challenging because gradient compression has some optional features, such as error-feedback and momentum. These are two common methods to correct the bias and to accelerate the training process, respectively. For example, with error-feedback, gradients are first corrected with the errors (the information lost during the last compression) before being compressed, and then the errors are recalculated. Therefore, the workflow differs from using vanilla gradient compression alone.

    To support all these features and expose a unified API at the same time, we use the decorator pattern. We regard error-feedback as an additional behavior of compressors. A unified API means that compressors with error-feedback should expose the same methods as those without it. Achieving that with plain inheritance would require an extra subclass for each compressor, which is too redundant. The decorator pattern solves this problem: we create a decorator class named ErrorFeedback that inherits from Compressor while also holding a Compressor member. For example,

    class ErrorFeedback : public Compressor {
     public:
      ErrorFeedback(size_t size, DataType dtype, std::unique_ptr<Compressor> cptr)
          : Compressor(size, dtype),
            _cptr(std::move(cptr)),
            _error(new byte_t[size]()) {}
      virtual ~ErrorFeedback() = default;
    
      virtual tensor_t Compress(tensor_t grad) final;
    
      virtual tensor_t Decompress(tensor_t compressed) final;
    
     protected:
    
      virtual void UpdateGradient(tensor_t grad) = 0;
    
      virtual void UpdateError(tensor_t corrected, tensor_t compressed);
    
     protected:
      std::unique_ptr<byte_t[]> _error;
    
     private:
      std::unique_ptr<Compressor> _cptr;
    };
    

    And the workflow is implemented in Compress and Decompress. For example,

    tensor_t ErrorFeedback::Compress(tensor_t grad) {
      // 1. grad <- grad + error
      UpdateGradient(grad);
    
      // 2. c <- Compress(grad)
      auto compressed = _cptr->Compress(grad);
    
      // 3. e <- grad - Decompress(c)
      UpdateError(grad, compressed);
    
      return compressed;
    }
    
    tensor_t ErrorFeedback::Decompress(tensor_t compressed) {
      // directly forward to internal compressor
      return _cptr->Decompress(compressed);
    }
    

    Momentum is implemented in the same way. ErrorFeedback and Momentum are also base classes to inherit from. In this way, error-feedback and momentum become optional features that can be added to any vanilla gradient compression algorithm.
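The decorator workflow can be mirrored in a few lines of Python. This is a sketch under stated assumptions, not a port of the C++ classes: the method names are lower-cased, tensors are plain lists, the `TopK` compressor is a simplified illustration, and the gradient correction is the plain "add the previous error" variant (in the C++ code `UpdateGradient` is left to subclasses).

```python
from abc import ABC, abstractmethod


class Compressor(ABC):
    """Unified interface: every compressor exposes compress and decompress."""

    @abstractmethod
    def compress(self, grad):
        ...

    @abstractmethod
    def decompress(self, compressed):
        ...


class TopK(Compressor):
    """Keep the k entries with the largest magnitude (simplified)."""

    def __init__(self, k):
        self.k = k

    def compress(self, grad):
        order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
        kept = sorted(order[: self.k])
        return len(grad), [(i, grad[i]) for i in kept]

    def decompress(self, compressed):
        size, pairs = compressed
        dense = [0.0] * size
        for i, v in pairs:
            dense[i] = v
        return dense


class ErrorFeedback(Compressor):
    """Decorator: same interface as Compressor, wraps another Compressor."""

    def __init__(self, inner):
        self._inner = inner
        self._error = None

    def compress(self, grad):
        if self._error is None:
            self._error = [0.0] * len(grad)
        # 1. grad <- grad + error
        corrected = [g + e for g, e in zip(grad, self._error)]
        # 2. c <- Compress(grad)
        compressed = self._inner.compress(corrected)
        # 3. e <- grad - Decompress(c)
        restored = self._inner.decompress(compressed)
        self._error = [c - r for c, r in zip(corrected, restored)]
        return compressed

    def decompress(self, compressed):
        # Forward directly to the wrapped compressor.
        return self._inner.decompress(compressed)


ef = ErrorFeedback(TopK(k=1))
first = ef.compress([1.0, 0.5, -0.25])
second = ef.compress([0.0, 0.5, 0.0])
print(first, second)
```

In the second step the entry at index 1 is selected with value 1.0: the 0.5 dropped in the first step was carried over as error and added back, which is the correction error-feedback is meant to provide.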

    BTW, momentum is not applied to servers.

    Exps

    CIFAR100

    End-to-End Training

    We train ResNet18_v2 on the CIFAR100 dataset in a distributed setting with 4 AWS P3.16xlarge instances, each equipped with 8 V100 GPUs and a 25Gbps network. The compression algorithms benchmarked here are also equipped with error-feedback and Nesterov momentum. We set k = 1 for topk and k = 8 for randomk. We train for 200 epochs.

    image

    image

    | f888c8d8f9e8483e46acd00042ed262e30c6856e | VAL ACC | TIME(s) |
    | -- | -- | -- |
    | baseline | 0.713799 | 703.1527987500002 |
    | onebit | 0.705601 | 629.4210848750001 |
    | randomk | 0.6991 | 501.99770550000005 |
    | topk | 0.704202 | 507.90769437499966 |

    The results show that compression can reduce end-to-end training time by up to 28.6% (randomk), with validation accuracy within about 1.5 percentage points of the baseline.

    Slow Network

    Gradient compression is more beneficial on slower networks. Therefore we limit the network bandwidth to 100Mbps (both downlink and uplink) and keep all other settings unchanged. The results show that we can achieve up to a 6x reduction in training time.

    image

    | b382f996d159fbe4d48c1135290f5c4183fc6b46 | TIME(s) |
    | -- | -- |
    | baseline | 518.321322125 |
    | onebit | 195.236724875 |
    | randomk | 89.672168625 |
    | topk | 83.9287285 |
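The headline figures for the two CIFAR100 runs can be recomputed from the reported times. The snippet below copies the TIME(s) values from the two tables; the function names are just for this illustration.

```python
# TIME(s) values copied from the two CIFAR100 tables above.
fast_network = {
    "baseline": 703.1527987500002,
    "onebit": 629.4210848750001,
    "randomk": 501.99770550000005,
    "topk": 507.90769437499966,
}
slow_network = {
    "baseline": 518.321322125,
    "onebit": 195.236724875,
    "randomk": 89.672168625,
    "topk": 83.9287285,
}


def reduction_percent(times):
    """Percentage of baseline time saved by each compressor."""
    base = times["baseline"]
    return {n: 100.0 * (1.0 - t / base) for n, t in times.items() if n != "baseline"}


def speedup(times):
    """Baseline time divided by each compressor's time."""
    base = times["baseline"]
    return {n: base / t for n, t in times.items() if n != "baseline"}


best_reduction = max(reduction_percent(fast_network).values())
best_speedup = max(speedup(slow_network).values())
print(f"25Gbps: up to {best_reduction:.1f}% less time")
print(f"100Mbps: up to {best_speedup:.1f}x faster")
```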

    IMAGENET

    To save time, we only tested 1bit algorithm. Topk and randomk are not guaranteed to converge on IMAGENET.

    Workload Breakdown

    In this experiment, we measure the workload breakdown into computation and communication. We use 8 Amazon EC2 p3.2xlarge instances, each of which is shipped with one Nvidia V100 GPU and 10Gbps Ethernet. We train two CNN models: Resnet-50_v2 and VGG-16. We first measure the computation time by collecting the elapsed time of running 50 iterations (t0) on one node. Then we measure the total training time for running 50 iterations (t1) on 8 nodes. Then, we get an estimate of communication time using t1 − t0.

    As the figure shows, dist-EF-SGDM can reduce communication to varying degrees. For ResNet50_v2, the drop is trivial (17.6% decrease), mainly due to the smaller model size. In contrast, a remarkable decline (73.2% decrease) occurs using dist-EF-SGDM for VGG-16, since VGG-16 has larger model size (528M).

    [ResNet50_v2] image

    [VGG-16] image

    Scaling Efficiency

    We also measure scaling efficiency when the number of nodes varies from 1 to 8. We follow the same setup as in the above experiment. The figure shows that gradient compression improves the scaling efficiency. The efficiency gain in gradient compression is much higher for VGG-16 than ResNet-50_v2, since ResNet50_v2 has smaller communication overhead.

    [ResNet50_v2] image

    [VGG-16] image


    The above two sub-experiments were conducted 2 months ago. There have been large updates since then. So the results are a little outdated. They are just for reference.

    End-to-End Training

    Finally, we train ResNet50_v2 and VGG-16 end-to-end to measure the total reduction in training time. For such large-batch training, warmup and linear learning-rate scaling are used to avoid a generalization gap. We set the number of warmup epochs to 5. We also use a cosine annealing strategy for learning rate decay. For ResNet50_v2 we use 8 AWS EC2 P3.16xlarge instances, while for VGG-16 we use 4 AWS EC2 P3.16xlarge instances.
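One common way to combine the three ingredients mentioned above (linear scaling, a 5-epoch warmup, cosine annealing) is sketched below. The exact schedule used in the experiments is not given here, so the formula, the per-epoch granularity, the reference batch size of 256, and the sample numbers in the usage lines are all assumptions made for illustration.

```python
import math


def learning_rate(epoch, total_epochs, base_lr, batch_size,
                  base_batch_size=256, warmup_epochs=5):
    """Per-epoch learning rate: linear warmup, then cosine annealing.

    The peak rate follows the linear scaling rule:
    base_lr * batch_size / base_batch_size.
    """
    peak = base_lr * batch_size / base_batch_size
    if epoch < warmup_epochs:
        # Ramp linearly so that the last warmup epoch reaches the peak.
        return peak * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings: 100 epochs, base rate 0.1, global batch size 2048.
schedule = [learning_rate(e, 100, 0.1, 2048) for e in range(100)]
print(schedule[0], schedule[4], schedule[50], schedule[99])
```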

    [ResNet50_v2] image image

    As the figure shows, we reduce the training time by 8.0% for ResNet50_v2 with comparable validation accuracy (0.763 versus 0.769 for SGDM).

    | 6c44049fd49e532781af96add6a02a0427e6a1a8 | VAL ACC | TIME(h) |
    | -- | -- | -- |
    | sgdm | 0.76914465625 | 2.6505945833029516 |
    | dist-ef-sgdm | 0.7632242968749999 | 2.4378090010373263 |

    [VGG-16] image image

    The above figure shows that our implementation of dist-EF-SGDM reduces the training time for 100 epochs by 39.04% compared to the full-precision SGDM. We note that there is a small gap in accuracy between dist-EF-SGDM and SGDM. We will investigate this problem in the future.

    TODO

    • [x] support inter-node compression
    • [x] support intra-node for MXNet
    • [x] support onebit compressor
    • [x] support error-feedback
    • [x] support momentum
    • [x] support other compressors
    • [x] support FP16
    • [ ] support PyTorch and Tensorflow

    Precautions

    1. To run successfully, ps-lite needs a one-line change. See the PR: https://github.com/dmlc/ps-lite/pull/168
    2. We only support Gluon for MXNet now. Raw MXNet's API does not support it.
    3. Since gradient compression also has some overhead, this is a trade-off. It is only suitable for some cases, e.g. slow network or large models. In other cases, gradient compression will even harm performance.
    4. Momentum here is the same as the framework's momentum. Why do we have to implement momentum again? Because for some algorithms, such as dist-EF-SGDM, momentum must be added first, but many frameworks, such as MXNet, exchange gradients first and then add the momentum. So we have to implement momentum inside BytePS. When the BytePS-internal momentum is used, the framework's own momentum should be disabled (set \mu = 0) in the user's script.
    5. FP16 is not supported now.

    Acknowledgement

    Thanks @eric-haibin-lin @szhengac for guidance! They have been giving many valuable suggestions!

    opened by jasperzhong 37
  • Can not run Distributed Training with RDMA

    Describe the bug I followed the Distributed Training with RDMA section of byteps/docs/step-by-step-tutorial. With the latest image, bytepsimage/tensorflow, I get an error, but bytepsimage/tensorflow_rdma and bytepsimage/server_rdma work. I use one scheduler, one server, and two workers, each with one GPU on a physical machine.

    To Reproduce Steps to reproduce the behavior:

    1. For the scheduler:

       docker run -it --net=host --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
       export DMLC_ENABLE_RDMA=1
       export DMLC_NUM_WORKER=2
       export DMLC_ROLE=scheduler
       export DMLC_NUM_SERVER=1
       export DMLC_INTERFACE=eth0
       export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
       export DMLC_PS_ROOT_PORT=9008
       bpslaunch

    2. For the server:

       docker run -it --net=host --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
       export DMLC_ENABLE_RDMA=1
       export DMLC_NUM_WORKER=2
       export DMLC_ROLE=server
       export DMLC_NUM_SERVER=1
       export DMLC_INTERFACE=eth0
       export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
       export DMLC_PS_ROOT_PORT=9008
       bpslaunch

    3. For worker-0:

       nvidia-docker run -it --net=host --shm-size=32768m --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
       export NVIDIA_VISIBLE_DEVICES=0
       export DMLC_ENABLE_RDMA=1
       export DMLC_WORKER_ID=0
       export DMLC_NUM_WORKER=2
       export DMLC_ROLE=worker
       export DMLC_NUM_SERVER=1
       export DMLC_INTERFACE=eth0
       export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
       export DMLC_PS_ROOT_PORT=9008
       bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000

    4. For worker-1:

       nvidia-docker run -it --net=host --shm-size=32768m --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
       export NVIDIA_VISIBLE_DEVICES=7
       export DMLC_ENABLE_RDMA=1
       export DMLC_WORKER_ID=1
       export DMLC_NUM_WORKER=2
       export DMLC_ROLE=worker
       export DMLC_NUM_SERVER=1
       export DMLC_INTERFACE=eth0
       export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
       export DMLC_PS_ROOT_PORT=9008
       bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000

    5. The scheduler shows this error:

       BytePS launching scheduler
       [02:34:49] byteps/server/server.cc:339: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
       [02:34:49] src/postoffice.cc:20: enable RDMA for networking
       [02:34:49] src/./rdma_van.h:40: Shared memory IPC has been disabled
       [02:34:49] src/./rdma_van.h:801: OnConnect to Node 1 with Transport=RDMA
       [02:34:49] src/./rdma_van.h:207: Connect to Node 1 with Transport=RDMA
       [02:34:58] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
       [02:35:23] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
       [02:35:38] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
       [02:35:38] src/./rdma_van.h:207: Connect to Node 9 with Transport=RDMA
       [02:35:38] src/./rdma_van.h:207: Connect to Node 8 with Transport=RDMA
       [02:35:38] 3rdparty/ps-lite/include/dmlc/logging.h:276: [02:35:38] src/./rdma_transport.h:130: Check failed: mr
       Stack trace returned 7 entries:
       [bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x1b98c) [0x7f788672398c]
       [bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x1bdad) [0x7f7886723dad]
       [bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x40fb8) [0x7f7886748fb8]
       [bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x57dbb) [0x7f788675fdbb]
       [bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7f7885e0866f]
       [bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f78891356db]
       [bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f788946e88f]

    Expected behavior The run succeeds.

    Screenshots scheduler image server image worker0 image worker1 image

    Environment physical machine(please complete the following information):

    • OS:16.04.2-Ubuntu
    • GCC version: 5.4.0
    • CUDA and NCCL version:10.1
    • Framework (TF, PyTorch, MXNet):TF
    opened by mengkai94 27
  • CUDA runtime error when running with pytorch benchmark_byteps.py

    Describe the bug I got a CUDA runtime error when running the PyTorch benchmark_byteps.py.

    Error info:

    BytePS launching worker
    running benchmark...
    Model: resnet50
    Batch size: 32
    Number of GPUs: 1
    Running warmup...
    THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
    Traceback (most recent call last):
      File "/usr/local/byteps/example/pytorch/benchmark_byteps.py", line 109, in <module>
        timeit.timeit(benchmark_step, number=args.num_warmup_batches)
      File "/usr/lib/python2.7/timeit.py", line 237, in timeit
        return Timer(stmt, setup, timer).timeit(number)
      File "/usr/lib/python2.7/timeit.py", line 202, in timeit
        timing = self.inner(it, self.timer)
      File "/usr/lib/python2.7/timeit.py", line 100, in inner
        _func()
      File "/usr/local/byteps/example/pytorch/benchmark_byteps.py", line 90, in benchmark_step
        output = model(data)
      File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python2.7/dist-packages/torchvision/models/resnet.py", line 150, in forward
        x = self.conv1(x)
      File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 320, in forward
        self.padding, self.dilation, self.groups)
    RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405
    

    To Reproduce Steps to reproduce the behavior: I followed the step-by-step tutorial and used the official bytepsimage/worker_pytorch image.

    Environment (please complete the following information): same as the official BytePS PyTorch worker image.

    opened by un-knight 22
  • Unable to load tensorflow plugin for bytescheduler

    I can build TF 1.13.2 from source with the patch. However, when I try to run tf_cnn_benchmarks.py, I get an error when tf.load_library('libplugin.so') is called.

    [libprotobuf ERROR external/protobuf_archive/src/google/protobuf/descriptor_database.cc:58] File already exists in database: tensorflow/core/kernels/boosted_trees/boosted_trees.proto [libprotobuf FATAL external/protobuf_archive/src/google/protobuf/descriptor.cc:1358] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size): terminate called after throwing an instance of 'google::protobuf::FatalException' what(): CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size): Aborted (core dumped)

    Environment:

    • OS: Ubuntu 16.04.6 LTS (Xenial Xerus)
    • GCC version: gcc (Ubuntu 4.9.3-13ubuntu2) 4.9.3
    • CUDA and NCCL version: Cuda V10.0.130
    • TF 1.13.2 (installed from source with)

    Any help would be appreciated. Thanks!

    bug bytescheduler 
    opened by bhetherman 21
  • AttributeError: module 'byteps.torch' has no attribute 'push_pull'

    I used bps.allreduce and an error was raised: AttributeError: module 'byteps.torch' has no attribute 'allreduce'. However, when I replaced hvd.allreduce with bps.push_pull, there was also an error: AttributeError: module 'byteps.torch' has no attribute 'push_pull'

    bug enhancement single machine 
    opened by boscotsang 21
  • BytePS w/ MXNet doesn't work w/o docker container

    Describe the bug If I try to run on a bare machine, I cannot run the MXNet example: https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#mxnet

    But if I use the container provided then I am able to run the example.

    To Reproduce Steps to reproduce the behavior:

    1. Ubuntu 16.04 DLAMI EC2 instance p2.8xlarge (8 k80gpus)
    2. pip install mxnet-cu100mkl
    3. pip install byteps==0.2.0
    4. git clone --recursive https://github.com/bytedance/byteps.git ~/byteps
    5. Run the following commands in a shell:
    
    export NVIDIA_VISIBLE_DEVICES=0,1,2,3  # gpus list
    export DMLC_WORKER_ID=0 # your worker id
    export DMLC_NUM_WORKER=1 # one worker
    export DMLC_ROLE=worker 
    
    export DMLC_NUM_SERVER=1 
    export DMLC_PS_ROOT_URI=10.0.0.1 
    export DMLC_PS_ROOT_PORT=1234 
    
    bpslaunch python3 ~/byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32  
    
    6. See the error shown in the logs

    Expected behavior The MXNet example runs.

    Logs

    (mx_byteps) ubuntu@ip-172-31-85-4:~$ bpslaunch python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32
    BytePS launching worker
    INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
    INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
    INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
    INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
    INFO:root:Launch BytePS process on GPU-2
    learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
    INFO:root:Launch BytePS process on GPU-0
    INFO:root:Launch BytePS process on GPU-1
    learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
    learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
    environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '2', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '76.126.245.87 59732 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/1', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOT_PORT': '1234', '_': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '3', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': 
'/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.o
pus=00;36:*.spx=00;36:*.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '76.126.245.87 59732 172.31.85.4 22', 'PYTHONPATH': '/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============2
    INFO:root:Launch BytePS process on GPU-3
    environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '1', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '76.126.245.87 59732 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/1', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOT_PORT': '1234', '_': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '3', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': 
'/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.o
pus=00;36:*.spx=00;36:*.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '76.126.245.87 59732 172.31.85.4 22', 'PYTHONPATH': '/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============1
    learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
    environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '0', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '76.126.245.87 59732 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/1', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOT_PORT': '1234', '_': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '3', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': 
'/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.o
pus=00;36:*.spx=00;36:*.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '76.126.245.87 59732 172.31.85.4 22', 'PYTHONPATH': '/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============0
    environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '3', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '76.126.245.87 59732 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/1', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOT_PORT': '1234', '_': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '3', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': 
'/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.o
pus=00;36:*.spx=00;36:*.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '76.126.245.87 59732 172.31.85.4 22', 'PYTHONPATH': '/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============3
    
    Segmentation fault: 11
    
    
    Segmentation fault: 11
    
    Stack trace:
      [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7fe075c25100]
      [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7fe1021c34b0]
      [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7fe102561d44]
      [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7fe07531a737]
      [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7fe07531d863]
      [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7fe075313551]
      [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7fe075279a67]
      [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7fdff9762970]
      [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fe1012dfec0]
    Stack trace:
      [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f3556934100]
      [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f35e2ed24b0]
      [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f35e3270d44]
      [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7f3556029737]
      [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7f355602c863]
      [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f3556022551]
      [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7f3555f88a67]
      [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7f34e1762970]
      [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f35e1feeec0]
    [2020-03-16 21:41:59*** Error in `.956268: F byteps/common/core_loops.cc:299] Check failed: r == ncclSuccess NCCL error: unhandled cuda error
    
    Segmentation fault: 11
    
    Stack trace:
      [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7fefa3442100]
      [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7ff02f9e04b0]
      [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7ff02fd7ed44]
      [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7fefa2b37737]
      [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7fefa2b3a863]
      [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7fefa2b30551]
      [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7fefa2a96a67]
      [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7fef2d762970]
      [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7ff02eafcec0]
    
    Segmentation fault: 11
    
    
    Segmentation fault: 11
    
    Stack trace:
      [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f14f6979100]
      [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f1582f174b0]
      [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f15832b5d44]
      [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7f14f606e737]
      [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7f14f6071863]
      [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f14f6067551]
      [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7f14f5fcda67]
      [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7f1481762970]
      [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f1582033ec0]
    
    Segmentation fault: 11
    
    
    Segmentation fault: 11
    
    Stack trace:
      [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f14f6979100]
      [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f1582f174b0]
      [bt] (2) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389f261) [0x7f14f6070261]
      [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0611) [0x7f14f6071611]
      [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f14f6067551]
      [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38974a4) [0x7f14f60684a4]
      [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x48a) [0x7f14f629056a]
      [bt] (7) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6d860a) [0x7f14f2ea960a]
      [bt] (8) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3ab7101) [0x7f14f6288101]
    Stack trace:
      [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f14f6979100]
      [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f1582f174b0]
      [bt] (2) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389f261) [0x7f14f6070261]
      [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0611) [0x7f14f6071611]
      [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f14f6067551]
      [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38974a4) [0x7f14f60684a4]
      [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x48a) [0x7f14f629056a]
      [bt] (7) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6d860a) [0x7f14f2ea960a]
      [bt] (8) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3ab7101) [0x7f14f6288101]
    Aborted (core dumped)
    Exception in thread Thread-4:
    Traceback (most recent call last):
      File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
        self.run()
      File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
        self._target(*self._args, **self._kwargs)
      File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
        subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
      File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 134.
    
    Segmentation fault (core dumped)
    Exception in thread Thread-3:
    Traceback (most recent call last):
      File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
        self.run()
      File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
        self._target(*self._args, **self._kwargs)
      File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
        subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
      File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 139.
    
    Segmentation fault (core dumped)
    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
        self.run()
      File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
        self._target(*self._args, **self._kwargs)
      File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
        subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
      File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 139.
    
    Segmentation fault (core dumped)
    Exception in thread Thread-2:
    Traceback (most recent call last):
      File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
        self.run()
      File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
        self._target(*self._args, **self._kwargs)
      File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
        subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
      File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 139.
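
The exit statuses in the tracebacks above follow the usual POSIX shell convention of 128 plus the signal number. So 134 corresponds to signal 6 (SIGABRT, matching `Aborted (core dumped)`) and 139 to signal 11 (SIGSEGV, matching `Segmentation fault (core dumped)`). A small sketch for decoding such statuses follows; `describe_exit_status` is a hypothetical helper name, not something provided by BytePS or `bpslaunch`.

```python
import signal


def describe_exit_status(status):
    """Describe an exit status as reported by a POSIX shell.

    Statuses above 128 are interpreted as 128 + signal number.
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return f"exit status {status}"
        return f"terminated by {name} (signal {signum})"
    return f"exit status {status}"


if __name__ == "__main__":
    # The two statuses that appear in the log above.
    for status in (134, 139):
        print(status, "->", describe_exit_status(status))
```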
    

    Environment (please complete the following information):

    • OS: Ubuntu 16.04
    • GCC version: 5.4.0
    • CUDA and NCCL version: CUDA 10.0 and NCCL 2.4.7
    • Framework (TF, PyTorch, MXNet): MXNet
    bug 
    opened by access2rohit 19
  • Run the distributed training on Kubernetes

    After a successful single run on Kubernetes with the workaround, I tried to run distributed training with 2 workers on Kubernetes. However, only one worker runs and the other always hangs. I assigned just 1 device (with device tag 0), but the running worker reported benchmarking on 2 GPUs. The running worker has 2 GPUs, and the hanging worker has only 1 GPU.

    1. How did you benchmark: on bare metal or on Kubernetes?
    2. Does it work if a worker has just 1 GPU? Is there any requirement on the GPU model?
    3. Is there a Kubernetes operator for setting up BytePS?
    distributed 
    opened by compete369 19
  • Using ByteScheduler is not as fast as Ring-allreduce

    Hi, I applied ByteScheduler to Horovod+PyTorch and ran pytorch_horovod_benchmark.py to compare its speed against Horovod+PyTorch without ByteScheduler. Results:

    ResNet50 model:
    • ByteScheduler+Horovod+PyTorch: 180-200 img/sec per GPU
    • Horovod+PyTorch: 254 img/sec per GPU

    The speed is much worse, whereas the paper reports that ByteScheduler can improve speed by 11%-15%. Could you tell me which parameters I have set incorrectly? I tried adjusting BYTESCHEDULER_CREDIT and BYTESCHEDULER_PARTITION, but the speed improvement was small. Are there recommended values for these two parameters?
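
To put the reported numbers in perspective, the gap can be expressed as a relative change against the Horovod+PyTorch baseline. The sketch below uses only the throughput figures quoted in this issue; the function name `relative_change` is ours. With these inputs the ByteScheduler runs come out roughly 21% to 29% slower than the baseline.

```python
def relative_change(measured: float, baseline: float) -> float:
    """Return the change of `measured` relative to `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


BASELINE_IMG_PER_SEC = 254.0            # Horovod+PyTorch, as reported
BYTESCHEDULER_RANGE = (180.0, 200.0)    # ByteScheduler+Horovod+PyTorch, as reported

for measured in BYTESCHEDULER_RANGE:
    change = relative_change(measured, BASELINE_IMG_PER_SEC)
    print(f"{measured:.0f} img/sec per GPU: {change:+.1f}% vs. baseline")
```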

    bytescheduler 
    opened by Richie-yan 18
  • Question about starting the server: Check failed: mr ibv_reg_mr failed: Cannot allocate memory

    I want to run 1 worker and 1 server, but starting the server with the following commands produces an error. Has anyone else hit the same error?

    export BYTEPS_LOG_LEVEL=INFO
    export BYTEPS_ENABLE_IPC=1
    export DMLC_ENABLE_RDMA=1
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=server
    export DMLC_NUM_SERVER=1
    export DMLC_INTERFACE=ens6f1
    export DMLC_PS_ROOT_URI=172.168.30.25
    export DMLC_PS_ROOT_PORT=9000

    bpslaunch
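
The same server configuration can be assembled programmatically, which makes it easier to vary one setting at a time while debugging. This is only a sketch: the variable names and values are copied from the commands quoted in this issue, while `build_server_env` and `launch_server` are hypothetical helper names of ours. `launch_server` assumes `bpslaunch` is on the PATH and is deliberately not called here.

```python
import os
import subprocess

# Values copied from the commands quoted above; adjust them for your cluster.
SERVER_SETTINGS = {
    "BYTEPS_LOG_LEVEL": "INFO",
    "BYTEPS_ENABLE_IPC": "1",
    "DMLC_ENABLE_RDMA": "1",
    "DMLC_NUM_WORKER": "2",
    "DMLC_ROLE": "server",
    "DMLC_NUM_SERVER": "1",
    "DMLC_INTERFACE": "ens6f1",
    "DMLC_PS_ROOT_URI": "172.168.30.25",
    "DMLC_PS_ROOT_PORT": "9000",
}


def build_server_env(base=None):
    """Return a copy of the base environment with the server settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(SERVER_SETTINGS)  # environment values must all be strings
    return env


def launch_server():
    """Start the server process; requires bpslaunch on PATH (not called here)."""
    return subprocess.check_call(["bpslaunch"], env=build_server_env())


if __name__ == "__main__":
    # Print the settings in the same form as the shell commands above.
    for name, value in SERVER_SETTINGS.items():
        print(f"export {name}={value}")
```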

    The error is as follows:

    terminate called after throwing an instance of 'dmlc::Error'
      what(): [16:23:05] src/./rdma_transport.h:130: Check failed: mr ibv_reg_mr failed: Cannot allocate memory, i=941, kMempoolChunkSize=56

    I am not using Docker. Do I need to use docker pull bytepsimage/tensorflow for the server to start correctly? Can I start BytePS without Docker?

    opened by DeruiLiu 17
  • Distributed training with RDMA errors

    When I follow the step-by-step tutorial to run multi-machine training over RDMA, I get the following error: (image)

    However, when I run Horovod RDMA training on the same machines, it works normally. What could cause this error?

    opened by wuyujiji 16
  • one weird trick to reduce PULL time significantly

    I happened to read an article about OpenMP yesterday: https://zhuanlan.zhihu.com/p/118604153

    I followed the suggestions in the article and set OMP_WAIT_POLICY=PASSIVE for both workers and servers, and throughput rose by around 25%. The profile shows that this comes from a large reduction in the time taken by the PULL operation, so I suppose the flag mainly benefits the servers. My guess is that it eases CPU contention between the server's threads and OpenMP's threads. The setup is 4 x p3.16xlarge as workers and 4 x c5dn.xlarge as servers. I wonder whether it works in other cases.
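
For anyone who wants to try this, OMP_WAIT_POLICY is a standard OpenMP environment variable and has to be in the process environment before the OpenMP runtime starts, so it is set on the launcher side rather than inside the training script. The sketch below only demonstrates that a child process inherits the setting. The helper names are ours, and the child here is a stand-in Python one-liner, not a BytePS worker or server.

```python
import os
import subprocess
import sys


def env_with_passive_omp(base=None):
    """Return a copy of the environment with OMP_WAIT_POLICY set to PASSIVE."""
    env = dict(os.environ if base is None else base)
    # PASSIVE asks waiting OpenMP threads to yield or sleep instead of spinning.
    env["OMP_WAIT_POLICY"] = "PASSIVE"
    return env


def read_policy_in_child():
    """Launch a stand-in child process and report the policy it sees."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ['OMP_WAIT_POLICY'])"],
        env=env_with_passive_omp(),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(read_policy_in_child())
```

In a real run, the stand-in command would be replaced by the actual launch command for each worker and server.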

    good first issue 
    opened by jasperzhong 16
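  • Installation problem

    Installing byteps on a CPU-only machine (intended to act as the server and scheduler): (1) Do I still need to install CUDA, cuDNN, and NCCL? (2) Is there any requirement on the PyTorch version? I currently have cuda11.3 and the pytorch10.0-CPU version installed. Running python setup.py install produces many warnings and then fails with: fatal error: THC/THC.h: No such file or directory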

    opened by QingQingR 0
  • broadcast and is_initialized api are not supported with pytorch.

    AttributeError: module 'byteps.torch' has no attribute 'broadcast'
    AttributeError: module 'byteps.torch' has no attribute 'is_initialized'

    This is not compatible with Horovod's PyTorch API.
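
When porting a Horovod script, it can help to check up front which of the expected names a module actually exposes, rather than discovering them one AttributeError at a time. The sketch below is generic Python: `missing_attributes` is a helper of ours, and `stand_in` is a hypothetical placeholder object, not a description of what `byteps.torch` really provides.

```python
from types import SimpleNamespace


def missing_attributes(module, names):
    """Return the names from `names` that `module` does not define."""
    return [name for name in names if not hasattr(module, name)]


# Hypothetical stand-in for an imported module; replace it with the real
# module object (for example, the result of an import) when checking.
stand_in = SimpleNamespace(init=lambda: None, rank=lambda: 0)

wanted = ["init", "rank", "broadcast", "is_initialized"]
print(missing_attributes(stand_in, wanted))
```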

    opened by HangJie720 0
  • support for fault tolerance and straggler mitigation

    Hi, I noticed that the future plans section lists support for fault tolerance and straggler mitigation. How is that work progressing?

    Also, a related paper from your team, "Elastic Parameter Server Load Distribution in Deep Learning Clusters", says its implementation is based on BytePS.

    opened by youshaox 0
  • update shm naming scheme

    Use the hex representation of the tensor key in shm names. It is easier to tell the operation type, tensor id, partition number, etc. from the hex representation.

    Signed-off-by: yulu.jia
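
To illustrate why a hex name is easier to read, here is a small sketch. The bit layout (operation type above bit 32, tensor id in bits 16-31, partition number in bits 0-15) and the `demo_shm_` prefix are assumptions made up for this example; the actual key layout and naming used by BytePS are not given in this description.

```python
# Hypothetical key layout, for illustration only:
#   bits 32 and up : operation type
#   bits 16 to 31  : tensor id
#   bits 0 to 15   : partition number
def make_key(op_type: int, tensor_id: int, partition: int) -> int:
    """Pack the three fields into one integer key."""
    return (op_type << 32) | (tensor_id << 16) | partition


def shm_name(key: int, use_hex: bool = True, prefix: str = "demo_shm_") -> str:
    """Build a shared-memory name from the key, in hex or decimal."""
    return f"{prefix}{key:x}" if use_hex else f"{prefix}{key}"


key = make_key(op_type=1, tensor_id=0x2A, partition=3)
print(shm_name(key, use_hex=False))  # fields are not visible in decimal
print(shm_name(key, use_hex=True))   # fields line up on hex digit boundaries
```

Because each field occupies a whole number of hex digits under this assumed layout, the hex name shows the fields side by side, while the decimal name mixes them together.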

    opened by pleasantrabbit 0
  • Installation error

    The installation fails. Is a dependency installed incorrectly? Please take a look. python 3.7.5 tensorflow 2.5.0

    ` x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DEIGEN_MPL2_ONLY=1 -I3rdparty/ps-lite/include -I/usr/local/nccl/include -I/usr/include/python3.7m -c byteps/common/compressor/impl/dithering.cc -o build/temp.linux-x86_64-3.7/byteps/common/compressor/impl/dithering.o -std=c++11 -fPIC -Ofast -Wall -fopenmp -march=native -mno-avx512f -D_GLIBCXX_USE_CXX11_ABI=0 -DBYTEPS_BUILDING_SERVER
    In file included from byteps/common/compressor/impl/../compressor.h:23:0,
                     from byteps/common/compressor/impl/../compressor_registry.h:19,
                     from byteps/common/compressor/impl/dithering.cc:19:
    byteps/common/compressor/impl/dithering.cc: In member function ‘virtual byteps::common::compressor::tensor_t byteps::common::compressor::DitheringCompressor::Compress(byteps::common::compressor::tensor_t)’:
    byteps/common/compressor/impl/../common.h:48:42: error: ISO C++ forbids declaration of ‘type name’ with no type [-fpermissive]
      reinterpret_cast<const half_t*>(src),
      ^
    byteps/common/compressor/impl/dithering.cc:119:3: note: in expansion of macro ‘COMPRESS_IMPL_SWITCH’
      COMPRESS_IMPL_SWITCH(grad.dtype, CompressImpl, _buf.get(), grad.data,
      ^~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:48:42: error: expected ‘>’ before ‘half_t’
      reinterpret_cast<const half_t*>(src),
      ^
    byteps/common/compressor/impl/dithering.cc:119:3: note: in expansion of macro ‘COMPRESS_IMPL_SWITCH’
      COMPRESS_IMPL_SWITCH(grad.dtype, CompressImpl, _buf.get(), grad.data,
      ^~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:48:42: error: expected ‘(’ before ‘half_t’
      reinterpret_cast<const half_t*>(src),
      ^
    byteps/common/compressor/impl/dithering.cc:119:3: note: in expansion of macro ‘COMPRESS_IMPL_SWITCH’
      COMPRESS_IMPL_SWITCH(grad.dtype, CompressImpl, _buf.get(), grad.data,
      ^~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:48:42: error: ‘half_t’ was not declared in this scope
      reinterpret_cast<const half_t*>(src),
      ^
    byteps/common/compressor/impl/dithering.cc:119:3: note: in expansion of macro ‘COMPRESS_IMPL_SWITCH’
      COMPRESS_IMPL_SWITCH(grad.dtype, CompressImpl, _buf.get(), grad.data,
      ^~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:48:42: note: suggested alternative: ‘off_t’
      reinterpret_cast<const half_t*>(src),
      ^
    byteps/common/compressor/impl/dithering.cc:119:3: note: in expansion of macro ‘COMPRESS_IMPL_SWITCH’
      COMPRESS_IMPL_SWITCH(grad.dtype, CompressImpl, _buf.get(), grad.data,
      ^~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:48:49: error: expected primary-expression before ‘>’ token
      reinterpret_cast<const half_t*>(src),
      ^
    byteps/common/compressor/impl/dithering.cc:119:3: note: in expansion of macro ‘COMPRESS_IMPL_SWITCH’
      COMPRESS_IMPL_SWITCH(grad.dtype, CompressImpl, _buf.get(), grad.data,
      ^~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/dithering.cc: In member function ‘virtual byteps::common::compressor::tensor_t byteps::common::compressor::DitheringCompressor::Decompress(byteps::common::compressor::tensor_t)’:
    byteps/common/compressor/impl/../common.h:64:36: error: ‘half_t’ does not name a type; did you mean ‘off_t’?
      return func(reinterpret_cast<half_t*>(dst),
      ^
    byteps/common/compressor/impl/dithering.cc:170:3: note: in expansion of macro ‘DECOMPRESS_IMPL_SWITCH’
      DECOMPRESS_IMPL_SWITCH(_dtype, DecompressImpl, dst, compressed.data,
      ^~~~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:64:42: error: expected ‘>’ before ‘*’ token
      return func(reinterpret_cast<half_t*>(dst),
      ^
    byteps/common/compressor/impl/dithering.cc:170:3: note: in expansion of macro ‘DECOMPRESS_IMPL_SWITCH’
      DECOMPRESS_IMPL_SWITCH(_dtype, DecompressImpl, dst, compressed.data,
      ^~~~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:64:42: error: expected ‘(’ before ‘*’ token
      return func(reinterpret_cast<half_t*>(dst),
      ^
    byteps/common/compressor/impl/dithering.cc:170:3: note: in expansion of macro ‘DECOMPRESS_IMPL_SWITCH’
      DECOMPRESS_IMPL_SWITCH(_dtype, DecompressImpl, dst, compressed.data,
      ^~~~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:64:43: error: expected primary-expression before ‘>’ token
      return func(reinterpret_cast<half_t*>(dst),
      ^
    byteps/common/compressor/impl/dithering.cc:170:3: note: in expansion of macro ‘DECOMPRESS_IMPL_SWITCH’
      DECOMPRESS_IMPL_SWITCH(_dtype, DecompressImpl, dst, compressed.data,
      ^~~~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/dithering.cc: In member function ‘virtual void byteps::common::compressor::DitheringCompressor::FastUpdateError(byteps::common::compressor::tensor_t, byteps::common::compressor::tensor_t, byteps::common::compressor::tensor_t)’:
    byteps/common/compressor/impl/../common.h:80:36: error: ‘half_t’ does not name a type; did you mean ‘off_t’?
      return func(reinterpret_cast<half_t*>(dst),
      ^
    byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’
      FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:80:42: error: expected ‘>’ before ‘*’ token
      return func(reinterpret_cast<half_t*>(dst),
      ^
    byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’
      FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:80:42: error: expected ‘(’ before ‘*’ token
      return func(reinterpret_cast<half_t*>(dst),
      ^
    byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’
      FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:80:43: error: expected primary-expression before ‘>’ token
      return func(reinterpret_cast<half_t*>(dst),
      ^
    byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’
      FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:81:36: error: ‘half_t’ does not name a type; did you mean ‘off_t’?
      reinterpret_cast<half_t*>(src1),
      ^
    byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’
      FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:81:42: error: expected ‘>’ before ‘*’ token
      reinterpret_cast<half_t*>(src1),
      ^
    byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’
      FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:81:42: error: expected ‘(’ before ‘*’ token
      reinterpret_cast<half_t*>(src1),
      ^
    byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’
      FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:81:43: error: expected primary-expression before ‘>’ token
      reinterpret_cast<half_t*>(src1),
      ^
    byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’
      FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:82:76: error: expected ‘)’ before ‘;’ token
      reinterpret_cast<const uint16_t*>(src2), compressed_size);
      ^
    byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’
      FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    byteps/common/compressor/impl/../common.h:82:76: error: return-statement with a value, in function returning 'void' [-fpermissive]
      reinterpret_cast<const uint16_t*>(src2), compressed_size);
      ^
    byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’
      FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    error: An ERROR occured while building the server module.

    Traceback (most recent call last):
      File "/usr/lib/python3.7/distutils/unixccompiler.py", line 118, in _compile
        extra_postargs)
      File "/usr/lib/python3.7/distutils/ccompiler.py", line 910, in spawn
        spawn(cmd, dry_run=self.dry_run)
      File "/usr/lib/python3.7/distutils/spawn.py", line 36, in spawn
        _spawn_posix(cmd, search_path, dry_run=dry_run)
      File "/usr/lib/python3.7/distutils/spawn.py", line 159, in _spawn_posix
        % (cmd, exit_status))
    distutils.errors.DistutilsExecError: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/tmp/pip-install-gg9uk_gv/byteps_d2daed5e88db4e3dbcc5a12776b83bdf/setup.py", line 944, in build_extensions
        build_server(self, options)
      File "/tmp/pip-install-gg9uk_gv/byteps_d2daed5e88db4e3dbcc5a12776b83bdf/setup.py", line 337, in build_server
        build_ext.build_extension(server_lib)
      File "/usr/local/lib/python3.7/dist-packages/setuptools/command/build_ext.py", line 196, in build_extension
        _build_ext.build_extension(self, ext)
      File "/usr/lib/python3.7/distutils/command/build_ext.py", line 534, in build_extension
        depends=ext.depends)
      File "/usr/lib/python3.7/distutils/ccompiler.py", line 574, in compile
        self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
      File "/usr/lib/python3.7/distutils/unixccompiler.py", line 120, in _compile
        raise CompileError(msg)
    distutils.errors.CompileError: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    
    ----------------------------------------
    

    ERROR: Command errored out with exit status 1: /usr/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-gg9uk_gv/byteps_d2daed5e88db4e3dbcc5a12776b83bdf/setup.py'"'"'; file='"'"'/tmp/pip-install-gg9uk_gv/byteps_d2daed5e88db4e3dbcc5a12776b83bdf/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(file) if os.path.exists(file) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-6671etx_/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.7/byteps Check the logs for full command output. `

    opened by llplay 1
Releases(v0.2)
  • v0.2(Feb 19, 2020)

    0.2.0 (2020-02)

    • Re-implement the RDMA transport and largely improve RDMA performance.
    • Add IPC support for intra-server communication.
    • Fix a hanging bug in BytePS server.
    • Fix RDMA-related segmentation fault problem during fork() (e.g., used by PyTorch data loader).
    • New feature: Enable mixing use of colocate and non-colocate servers, along with a smart tensor allocation strategy.
    • New feature: Add bpslaunch as the command to launch tasks.
    • Add support for pip install: pip3 install byteps
    • Updated documents and example docker images
Owner
Bytedance Inc.
XGBoost-Ray is a distributed backend for XGBoost, built on top of distributed computing framework Ray.

null 92 Dec 14, 2022
PyTorch extensions for high performance and large scale training.

Description FairScale is a PyTorch extension library for high performance and large scale training on one or multiple machines/nodes. This library ext

Facebook Research 2k Dec 28, 2022
nn-Meter is a novel and efficient system to accurately predict the inference latency of DNN models on diverse edge devices

A DNN inference latency prediction toolkit for accurately modeling and predicting the latency on diverse edge devices.

Microsoft 241 Dec 26, 2022
Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices

Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices. It bridges the gap between any machine learning models you just trained and the efficient online service API.

null 164 Jan 4, 2023
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Horovod Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make dis

Horovod 12.9k Jan 7, 2023
A Python implementation of GRAIL, a generic framework to learn compact time series representations.

GRAIL A Python implementation of GRAIL, a generic framework to learn compact time series representations. Requirements Python 3.6+ numpy scipy tslearn

null 3 Nov 24, 2021
Uber Open Source 1.6k Dec 31, 2022
High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

Chao Ma 3k Jan 8, 2023
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Website | Documentation | Tutorials | Installation | Release Notes CatBoost is a machine learning method based on gradient boosting over decision tree

CatBoost 6.9k Jan 5, 2023
DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective. 10x Larger Models 10x Faster Trainin

Microsoft 8.4k Dec 30, 2022
DistML is a Ray extension library to support large-scale distributed ML training on heterogeneous multi-node multi-GPU clusters

null 27 Aug 19, 2022
WAGMA-SGD is a decentralized asynchronous SGD for distributed deep learning training based on model averaging.

WAGMA-SGD is a decentralized asynchronous SGD based on wait-avoiding group model averaging. The synchronization is relaxed by making the collectives externally-triggerable, namely, a collective can be initiated without requiring that all the processes enter it. It partially reduces the data within non-overlapping groups of process, improving the parallel scalability.

Shigang Li 6 Jun 18, 2022
High performance implementation of Extreme Learning Machines (fast randomized neural networks).

High Performance toolbox for Extreme Learning Machines. Extreme learning machines (ELM) are a particular kind of Artificial Neural Networks, which sol

Anton Akusok 174 Dec 7, 2022
High performance Python GLMs with all the features!

QuantCo 200 Dec 14, 2022
An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

Ray provides a simple, universal API for building distributed applications. Ray is packaged with the following libraries for accelerating machine lear

null 23.3k Dec 31, 2022
BigDL: Distributed Deep Learning Framework for Apache Spark

BigDL: Distributed Deep Learning on Apache Spark What is BigDL? BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can w

null 4.1k Jan 9, 2023
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

eXtreme Gradient Boosting Community | Documentation | Resources | Contributors | Release Notes XGBoost is an optimized distributed gradient boosting l

Distributed (Deep) Machine Learning Community 23.6k Jan 3, 2023
Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray

A unified Data Analytics and AI platform for distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray What is Analytics Zoo? Analytics Zo

null 2.5k Dec 28, 2022
Distributed Deep learning with Keras & Spark

Elephas: Distributed Deep Learning with Keras & Spark Elephas is an extension of Keras, which allows you to run distributed deep learning models at sc

Max Pumperla 1.6k Dec 29, 2022