A high performance and generic framework for distributed DNN training

Bytedance Inc.

Last update: Dec 28, 2022

Related tags

Machine Learning machine-learning deep-learning mxnet tensorflow keras pytorch distributed-training

Overview

BytePS

BytePS is a high performance and general distributed training framework. It supports TensorFlow, Keras, PyTorch, and MXNet, and can run on either TCP or RDMA network.

BytePS outperforms existing open-sourced distributed training frameworks by a large margin. For example, on BERT-large training, BytePS can achieve ~90% scaling efficiency with 256 GPUs (see below), which is much higher than Horovod+NCCL. In certain scenarios, BytePS can double the training speed compared with Horovod+NCCL.

News

BytePS paper has been accepted to OSDI'20. The code to reproduce the end-to-end evaluation is available here.
Support gradient compression.
v0.2.4
- Fix compatibility issue with tf2 + standalone keras
- Add support for tensorflow.keras
- Improve robustness of broadcast
v0.2.3
- Add DistributedDataParallel module for PyTorch
- Fix the problem of different CPU tensor using the same name
- Add skip_synchronize api for PyTorch
- Add the option for lazy/non-lazy init
v0.2.0
- Largely improve RDMA performance by enforcing page aligned memory.
- Add IPC support for RDMA. Now support colocating servers and workers without sacrificing much performance.
- Fix a hanging bug in BytePS server.
- Fix RDMA-related segmentation fault problem during fork() (e.g., used by PyTorch data loader).
- New feature: Enable mixing use of colocate and non-colocate servers, along with a smart tensor allocation strategy.
- New feature: Add bpslaunch as the command to launch tasks.
- Add support for pip install: pip3 install byteps

Performance

We show our experiment on BERT-large training, which is based on GluonNLP toolkit. The model uses mixed precision.

We use Tesla V100 32GB GPUs and set batch size equal to 64 per GPU. Each machine has 8 V100 GPUs (32GB memory) with NVLink-enabled. Machines are inter-connected with 100 Gbps RDMA network. This is the same hardware setup you can get on AWS.

BytePS achieves ~90% scaling efficiency for BERT-large with 256 GPUs. The code is available here. As a comparison, Horovod+NCCL has only ~70% scaling efficiency even after expert parameter tunning.

With slower network, BytePS offers even more performance advantages -- up to 2x of Horovod+NCCL. You can find more evaluation results at performance.md.

Goodbye MPI, Hello Cloud

How can BytePS outperform Horovod by so much? One of the main reasons is that BytePS is designed for cloud and shared clusters, and throws away MPI.

MPI was born in the HPC world and is good for a cluster built with homogeneous hardware and for running a single job. However, cloud (or in-house shared clusters) is different.

This leads us to rethink the best communication strategy, as explained in here. In short, BytePS only uses NCCL inside a machine, while re-implements the inter-machine communication.

BytePS also incorporates many acceleration techniques such as hierarchical strategy, pipelining, tensor partitioning, NUMA-aware local communication, priority-based scheduling, etc.

Quick Start

We provide a step-by-step tutorial for you to run benchmark training tasks. The simplest way to start is to use our docker images. Refer to Documentations for how to launch distributed jobs and more detailed configurations. After you can start BytePS, read best practice to get the best performance.

Below, we explain how to install BytePS by yourself. There are two options.

Install by pip

pip3 install byteps

Build from source code

You can try out the latest features by directly installing from master branch:

git clone --recursive https://github.com/bytedance/byteps
cd byteps
python3 setup.py install

Notes for above two options:

BytePS assumes that you have already installed one or more of the following frameworks: TensorFlow / PyTorch / MXNet.
BytePS depends on CUDA and NCCL. You should specify the NCCL path with export BYTEPS_NCCL_HOME=/path/to/nccl. By default it points to /usr/local/nccl.
The installation requires gcc>=4.9. If you are working on CentOS/Redhat and have gcc<4.9, you can try yum install devtoolset-7 before everything else. In general, we recommend using gcc 4.9 for best compatibility (how to pin gcc).
RDMA support: During setup, the script will automatically detect the RDMA header file. If you want to use RDMA, make sure your RDMA environment has been properly installed and tested before install (install on Ubuntu-18.04).

Examples

Basic examples are provided under the example folder.

To reproduce the end-to-end evaluation in our OSDI'20 paper, find the code at this repo.

Use BytePS in Your Code

Though being totally different at its core, BytePS is highly compatible with Horovod interfaces (Thank you, Horovod community!). We chose Horovod interfaces in order to minimize your efforts for testing BytePS.

If your tasks only rely on Horovod's allreduce and broadcast, you should be able to switch to BytePS in 1 minute. Simply replace import horovod.tensorflow as hvd by import byteps.tensorflow as bps, and then replace all hvd in your code by bps. If your code invokes hvd.allreduce directly, you should also replace it by bps.push_pull.

Many of our examples were copied from Horovod and modified in this way. For instance, compare the MNIST example for BytePS and Horovod.

BytePS also supports other native APIs, e.g., PyTorch Distributed Data Parallel and TensorFlow Mirrored Strategy. See DistributedDataParallel.md and MirroredStrategy.md for usage.

Limitations and Future Plans

BytePS does not support pure CPU training for now. One reason is that the cheap PS assumption of BytePS do not hold for CPU training. Consequently, you need CUDA and NCCL to build and run BytePS.

We would like to have below features, and there is no fundamental difficulty to implement them in BytePS architecture. However, they are not implemented yet:

Sparse model training
Fault-tolerance
Straggler-mitigation

Publications

[OSDI'20] "A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters". Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, Chuanxiong Guo.
[SOSP'19] "A Generic Communication Scheduler for Distributed DNN Training Acceleration". Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, Chuanxiong Guo. (Code is at bytescheduler branch)

Comments

gradient compression support
Motivation

Currently BytePS does not fully support gradient compression. The compression it supports lies in each plugin in Python. Such design may ease the difficulty of the implementation but leads to major inabilities for more aggressive compression. This is because NCCL only supports limited reduction operations such as Sum, Prod etc but these operations are meaningless for the compressed data which have been highly bit-wisely packed. For example, for signSGD, one of the most popular methods for gradient compression due to its simplicity and effectiveness, each bit represents a signbit of an element in the original data tensor, making reduction operations like summation totally meaningless. But reduction is necessary for multi-GPU devices.

Another problem is that compared to inter-node communication, intra-node communication is not the bottleneck. Furthermore, too much compression at first will lose much information, which may cause low accuracy. So there is no need to make too radical compression before running into BytePS core in worker nodes.

Therefore, changes need to be made.

Design Overview

In light of the problems mentioned above, we propose two-level gradient compression:

intra-node: This is just an alias for the current implementation, named after its communication property. Transform FP32 tensors into FP16 on each GPU, reduce them across multi-GPUs via NCCL, and copy them to the CPU buffer waiting for next-level compression. The purpose of the compression is to reduce intra-node communication overhead introduced by multi-GPUs. Since intra-node communication is very fast, especially with NCCL, only mild compression methods will be applied, most of which is type-conversion. It is framework-specific and will be implemented in each plugin.

inter-node: Usually inter-node communication is a bottleneck, so more drastically gradient compression algorithms will be applied here. This is framework-agnostic and will be implemented in BytePS core.

It is worth mentioning that our design supports all frameworks.

Interface

Only a few changes to be made for users. Users only have to add a few LOC in the script to specify which compression algorithm to be used and the parameters needed by the algorithm. Take MXNet for example.

compression_params = { "compressor": opt.compressor, "ef": opt.ef, "momentum": opt.compress_momentum, "scaling": opt.onebit_scaling, "k": opt.k } trainer = bps.DistributedTrainer(params, optimizer, optimizer_params, compression_params=compression_params)

Here we prescribe some keys. Users can lookup documentations to determine which key should be used. Here are some common keys.

| KEYS | DESC | | --- | --- | | compressor | compression algorithms, including onebit / dithering / topk / randomk | | k | an integer, must be specified when using dithering / topk / randomk | | scaling | optional, whether to enable scaling for onebit, default is false | | ef | error-feedback algorithms, e.g. vanilla | | momentum | momentum algorithms, e.g. nesterov | | seed | random seed |

If the user's input is not correct, it will give a warning and abort.

Implementation

Parameter Data Structure

To offer users a unified interface to use, we have to address the registration problem. parameters vary from different kinds of compression algorithms. For example, topk and randomk algorithms need parameter k to be specified while onebit algorithm may need to input whether to enable scaling flag. Some parameters are optional but others are not. So parameter passing is a challenge.

We address this challenge using string-string dictionary (std::unorded_map<std::string, std::string> for C++ or dict for Python) as our unified data structure to pass parameters. As mentioned above, we prescribe specific strings as keys, so the dictionary will look like:

{"byteps_compressor_type": "topk", "byteps_compressor_k": "3", "byteps_error_feedback_type": "vanilla"}

Python

For MXNet users, the dictionary can be an attribute of ParameterDict. We can filter out those parameters by leveraging the prefix "byteps". For example,

for i, param in enumerate(self._params): byteps_declare_tensor("parameter_" + str(i)) if param.grad_req != 'null': byteps_params = dict( filter(lambda attr: attr[0].startswith( "byteps_",), param.__dict__.items()) ) byteps_declare_tensor("gradient_" + str(i), **byteps_params)

C++

Using ctypes, we can pass the dictionary conveniently. For example,

extern "C" void byteps_mxnet_declare_tensor(char* name, int num_params, char** param_keys, char** param_vals) { ... std::unordered_map<std::string, std::string> param_dict; std::string key, val; std::string::size_type pos; for (int i = 0; i < num_params; ++i) { key = param_keys[i]; val = param_vals[i]; param_dict[key] = val; } ... }

Compressor - Development API

We want developers to develop their own gradient compression algorithms without fully understanding how BytePS works. What they only need to know is development API. We currently implement some commonly used gradient compression algorithms, but in the future, we hope more novel algorithms will be implemented under our API. We abstract compression algorithms into compressor. The Compressor looks like this:

class Compressor { public: Compressor(size_t size, DataType dtype) : _size(size), _dtype(dtype), _buf(new byte_t[size]), _cpu_reducer(new CpuReducer(nullptr)){}; virtual ~Compressor() = default; virtual tensor_t Compress(tensor_t grad) = 0; virtual tensor_t Decompress(tensor_t compressed) = 0; virtual void FastUpdateError(tensor_t error, tensor_t corrected, tensor_t compressed) { BPS_LOG(FATAL) << "FastUpdateError is not implemented"; }; std::unique_ptr<byte_t[]> _buf; size_t _size; DataType _dtype; std::unique_ptr<CpuReducer> _cpu_reducer; };

In order to make less modifications to BytePS core, we want compressors to be as general as possible. In the best case, the base compressor pointer/reference can represent all kinds of compressors and only need to expose two operations to users: Compress and Decompress. This is quite challenging because there are some optional features for gradient compression, such as error-feedback and momentum. These are two common methods to correct the bias and accelerate the training process respectively. For example, with error-feedback, before being compressed, gradients are first corrected with errors which refer to the information loss during the last compression, and then errors are re-calculated. Therefore, the workflow is different from only using vanilla gradient compression.

In order to support all these features and expose a unified API at the same time, we use the decorator pattern. We regard error-feedback as an additional behavior of compressors. We want a unified API, which means compressors with error-feedback should expose the same method as those without error-feedback. But in that case we have to create a subclass for each compressor, which is too redundant. So the decorator pattern just solves our problem. We create a decorator class named ErrorFeedback to inherit BaseCompressor while at the same time also keeping a member of BaseCompressor. For example,

class ErrorFeedback : public Compressor { public: ErrorFeedback(size_t size, DataType dtype, std::unique_ptr<Compressor> cptr) : Compressor(size, dtype), _cptr(std::move(cptr)), _error(new byte_t[size]()) {} virtual ~ErrorFeedback() = default; virtual tensor_t Compress(tensor_t grad) final; virtual tensor_t Decompress(tensor_t compressed) final; protected: virtual void UpdateGradient(tensor_t grad) = 0; virtual void UpdateError(tensor_t corrected, tensor_t compressed); protected: std::unique_ptr<byte_t[]> _error; private: std::unique_ptr<Compressor> _cptr; };

And the workflow is implemented in Compress and Decompress. For example,

tensor_t ErrorFeedback::Compress(tensor_t grad) { // 1. grad <- grad + error UpdateGradient(grad); // 2. c <- Compress(grad) auto compressed = _cptr->Compress(grad); // 3. e <- grad - Decompress(c) UpdateError(grad, compressed); return compressed; } tensor_t ErrorFeedback::Decompress(tensor_t compressed) { // directly forward to internal compressor return _cptr->Decompress(compressed); }

Momentum is implemented in the same way. ErrorFeedBack and Momentum are also base classes to inherit. In this way, error-feedback and momentum becomes optional features to be added to any vanilla gradient compression algorithms.

BTW, momentum is not applied to servers.

Exps

CIFAR100

End-to-End Training

We conduct the experiment in distributed training ResNet18_v2 on the CIFAR100 datasets with 4 AWS P3.16xlarge instances, each equipped with 8 V100 GPUs and 25Gbps network. The compression algorithms benchmarked here are also equipped with error-feedback and nesterov momentum. We set k = 1 for topk and k = 8 for randomk. We train it for 200 epochs.

| f888c8d8f9e8483e46acd00042ed262e30c6856e | VAl ACC | TIME(s) | | -- | -- | -- | |baseline| 0.713799| 703.1527987500002| |onebit| 0.705601| 629.4210848750001| |randomk| 0.6991| 501.99770550000005| |topk| 0.704202| 507.90769437499966|

The results show that compression can reduce up to 28.6% end-to-end training time without accuracy loss.

Slow Network

Gradient compression is more beneficial in slower network. Therefore we limit the network bandwidth to 100Mbps (both downlink and uplink) and keep all other settings not changed. The results show that we can achieve up to 6x reduciton in training time.

| b382f996d159fbe4d48c1135290f5c4183fc6b46 | TIME(s) | | -- | -- | |baseline| 518.321322125| |onebit| 195.236724875| |randomk| 89.672168625| |topk| 83.9287285|

IMAGENET

To save time, we only tested 1bit algorithm. Topk and randomk are not guaranteed to converge on IMAGENET.

Workload Breakdown

In this experiment, we measure the workload breakdown into computation and communication. We use 8 Amazon EC2 p3.2xlarge instances, each of which is shipped with one Nvidia V100 GPU and 10Gbps Ethernet. We train two CNN models: Resnet-50_v2 and VGG-16. We first measure the computation time by collecting the elapsed time of running 50 iterations (t0) on one node. Then we measure the total training time for running 50 iterations (t1) on 8 nodes. Then, we get an estimate of communication time using t1 − t0.

As the figure shows, dist-EF-SGDM can reduce communication to varying degrees. For ResNet50_v2, the drop is trivial (17.6% decrease), mainly due to the smaller model size. In contrast, a remarkable decline (73.2% decrease) occurs using dist-EF-SGDM for VGG-16, since VGG-16 has larger model size (528M).

[ResNet50_v2]

[VGG-16]

Scaling Efficiency

We also measure scaling efficiency when the number of nodes varies from 1 to 8. We follow the same setup as in the above experiment. The figure shows that gradient compression improves the scaling efficiency. The efficiency gain in gradient compression is much higher for VGG-16 than ResNet-50_v2, since ResNet50_v2 has smaller communication overhead.

[ResNet50_v2]

[VGG-16]

The above two sub-experiments were conducted 2 months ago. There have been large updates since then. So the results are a little outdated. They are just for reference.

End-to-End Training

Finally, we train ResNet50_v2 and VGG-16 end-to-end to measure total reduction in training time. For such large batch training, warmup and linear scaling learning rate are used to avoid generalization gap. We set the number of warmup epochs to 5. We also leverage cosine annealing strategy for learning rate decay. For ResNet50_v2 we use 8 AWS EC2 P3.16xlarge instances while for VGG-16, we use 4 AWS EC2 P3.16xlarge.

[ResNet50_v2]

As the figure shows, we reduce the trianing time by 8.0% without accuracy loss for ResNet50_v2.

| 6c44049fd49e532781af96add6a02a0427e6a1a8 | VAl ACC | TIME(h) | | -- | -- | -- | |sgdm| 0.76914465625| 2.6505945833029516| |dist-ef-sgdm| 0.7632242968749999|2.4378090010373263 |

[VGG-16]

The above figure shows that our implementation of dist-EF-SGDM reduces the training time for 100 epochs by 39.04% compared to the full-precision SGDM. We note that there is a small gap in accuracy between dist-EF-SGDM and SGDM. We will investigate this problem in the future.

TODO

[x] support inter-node compression

[x] support intra-node for MXNet

[x] support onebit compressor

[x] support error-feedback

[x] support momentum

[x] support other compressors

[x] support FP16

[ ] support PyTorch and Tensorflow

Precautions

To run successfully, ps-lite should change one LOC. see the PR here. https://github.com/dmlc/ps-lite/pull/168

We only support Gluon for MXNet now. Raw MXNet's API does not support it.

Since gradient compression also has some overhead, this is a trade-off. It is only suitable for some cases, e.g. slow network or large models. In other cases, gradient compression will even harm performance.

Momentum here is the same as the framework's momentum. Why do we have to implement momentum again? This is because for some algorithms like dist-EF-SGDM , momentum should be added first but many frameworks like MXNet exchange gradient first and then add the momentum. So we have to implement momentum inside BytePS. When inside momentum is used, outside momentum should be disabled (set \mu = 0) in the users' scripts.

FP16 is not supported now.

Acknowledgement

Thanks @eric-haibin-lin @szhengac for guidance! They have been giving many valuable suggestions!
opened by jasperzhong 37
Can not run Distributed Training with RDMA
Describe the bug I use byteps according to Distributed Training with RDMA of byteps/docs/step-by-step-tutorial. I use the latest images : bytepsimage/tensorflow and get error. But using bytepsimage/tensorflow_rdma and bytepsimage/server_rdma can success. Using one scheduler, one server, two workers with one respective gpu in a physical worker.

To Reproduce Steps to reproduce the behavior:

For the scheduler: docker run -it --net=host --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash export DMLC_ENABLE_RDMA=1 export DMLC_NUM_WORKER=2 export DMLC_ROLE=scheduler export DMLC_NUM_SERVER=1 export DMLC_INTERFACE=eth0 export DMLC_PS_ROOT_URI=xxx.xx.xx.xx export DMLC_PS_ROOT_PORT=9008 bpslaunch

For the server: docker run -it --net=host --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash export DMLC_ENABLE_RDMA=1 export DMLC_NUM_WORKER=2 export DMLC_ROLE=server export DMLC_NUM_SERVER=1 export DMLC_INTERFACE=eth0 export DMLC_PS_ROOT_URI=xxx.xx.xx.xx export DMLC_PS_ROOT_PORT=9008 bpslaunch

For worker-0: nvidia-docker run -it --net=host --shm-size=32768m --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash export NVIDIA_VISIBLE_DEVICES=0 export DMLC_ENABLE_RDMA=1 export DMLC_WORKER_ID=0 export DMLC_NUM_WORKER=2 export DMLC_ROLE=worker export DMLC_NUM_SERVER=1 export DMLC_INTERFACE=eth0 export DMLC_PS_ROOT_URI=xxx.xx.xx.xx export DMLC_PS_ROOT_PORT=9008 bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000

For worker-1: nvidia-docker run -it --net=host --shm-size=32768m --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash export NVIDIA_VISIBLE_DEVICES=7 export DMLC_ENABLE_RDMA=1 export DMLC_WORKER_ID=1 export DMLC_NUM_WORKER=2 export DMLC_ROLE=worker export DMLC_NUM_SERVER=1 export DMLC_INTERFACE=eth0 export DMLC_PS_ROOT_URI=xxx.xx.xx.xx export DMLC_PS_ROOT_PORT=9008 bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000

scheduler see error BytePS launching scheduler [02:34:49] byteps/server/server.cc:339: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance [02:34:49] src/postoffice.cc:20: enable RDMA for networking [02:34:49] src/./rdma_van.h:40: Shared memory IPC has been disabled [02:34:49] src/./rdma_van.h:801: OnConnect to Node 1 with Transport=RDMA [02:34:49] src/./rdma_van.h:207: Connect to Node 1 with Transport=RDMA [02:34:58] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA [02:35:23] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA [02:35:38] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA [02:35:38] src/./rdma_van.h:207: Connect to Node 9 with Transport=RDMA [02:35:38] src/./rdma_van.h:207: Connect to Node 8 with Transport=RDMA [02:35:38] 3rdparty/ps-lite/include/dmlc/logging.h:276: [02:35:38] src/./rdma_transport.h:130: Check failed: mr Stack trace returned 7 entries: [bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x1b98c) [0x7f788672398c] [bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x1bdad) [0x7f7886723dad] [bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x40fb8) [0x7f7886748fb8] [bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x57dbb) [0x7f788675fdbb] [bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7f7885e0866f] [bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f78891356db] [bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f788946e88f]

Expected behavior run sucess

Screenshots scheduler server worker0 worker1

Environment physical machine(please complete the following information):

OS:16.04.2-Ubuntu

GCC version: 5.4.0

CUDA and NCCL version:10.1

Framework (TF, PyTorch, MXNet):TF
opened by mengkai94 27

CUDA runtime error when running with pytorch benchmark_byteps.py

Describe the bug Got cuda runtime error when running with pytorch benchmark_byteps.py.

Error info:

BytePS launching worker
running benchmark...
Model: resnet50
Batch size: 32
Number of GPUs: 1
Running warmup...
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
Traceback (most recent call last):
  File "/usr/local/byteps/example/pytorch/benchmark_byteps.py", line 109, in <module>
    timeit.timeit(benchmark_step, number=args.num_warmup_batches)
  File "/usr/lib/python2.7/timeit.py", line 237, in timeit
    return Timer(stmt, setup, timer).timeit(number)
  File "/usr/lib/python2.7/timeit.py", line 202, in timeit
    timing = self.inner(it, self.timer)
  File "/usr/lib/python2.7/timeit.py", line 100, in inner
    _func()
  File "/usr/local/byteps/example/pytorch/benchmark_byteps.py", line 90, in benchmark_step
    output = model(data)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torchvision/models/resnet.py", line 150, in forward
    x = self.conv1(x)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405

To Reproduce Steps to reproduce the behavior: Following the step by step tutorial, and I use the bytepsimage/worker_pytorch image from official.

Environment (please complete the following information): same as byteps official pytorch worker image.

Additional context Add any other context about the problem here.

opened by un-knight 22

Unable to load tensorflow plugin for bytescheduler
I can build TF 1.13.2 from source with the patch. However when I try to run the tf_cnn_benchmarks.py I get a error when tf.load_library('libplugin.so') is called.

[libprotobuf ERROR external/protobuf_archive/src/google/protobuf/descriptor_database.cc:58] File already exists in database: tensorflow/core/kernels/boosted_trees/boosted_trees.proto [libprotobuf FATAL external/protobuf_archive/src/google/protobuf/descriptor.cc:1358] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size): terminate called after throwing an instance of 'google::protobuf::FatalException' what(): CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size): Aborted (core dumped)

Environment:

OS: Ubuntu 16.04.6 LTS (Xenial Xerus)

GCC version: gcc (Ubuntu 4.9.3-13ubuntu2) 4.9.3

CUDA and NCCL version: Cuda V10.0.130

TF 1.13.2 (installed from source with)

Any help would be appreciated. Thanks!
bug bytescheduler
opened by bhetherman 21
AttributeError: module 'byteps.torch' has no attribute 'push_pull'

I used hps.allreduce and an error was raised AttributeError: module 'byteps.torch' has no attribute 'allreduce' However, I replace hvd.allreduce with bps.push_pull, there was alos an error AttributeError: module 'byteps.torch' has no attribute 'push_pull'
bug enhancement single machine

opened by boscotsang 21

BytePS w/ MXNet doesn't work w/o docker container

Describe the bug If i try to run on bare machine i cannot run the MXNet example: https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#mxnet

But if I use the container provided then I am able to run the example.

To Reproduce Steps to reproduce the behavior:

Ubuntu 16.04 DLAMI EC2 instance p2.8xlarge (8 k80gpus)
pip install mxnet-cu100mkl
pip install byteps==0.2.0
git clone --recursive https://github.com/bytedance/byteps.git ~/byteps
run following commands on shell:


export NVIDIA_VISIBLE_DEVICES=0,1,2,3  # gpus list
export DMLC_WORKER_ID=0 # your worker id
export DMLC_NUM_WORKER=1 # one worker
export DMLC_ROLE=worker 

export DMLC_NUM_SERVER=1 
export DMLC_PS_ROOT_URI=10.0.0.1 
export DMLC_PS_ROOT_PORT=1234 

bpslaunch python3 ~/byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32

See error as shown in logs

Expected behavior To run mxnet example

Logs

(mx_byteps) ubuntu@ip-172-31-85-4:~$ bpslaunch python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32
BytePS launching worker
INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INFO:root:Launch BytePS process on GPU-2
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
INFO:root:Launch BytePS process on GPU-0
INFO:root:Launch BytePS process on GPU-1
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '2', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '76.126.245.87 59732 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/1', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOT_PORT': '1234', '_': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '3', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': '/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '76.126.245.87 59732 172.31.85.4 22', 'PYTHONPATH': '/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============2
INFO:root:Launch BytePS process on GPU-3
environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '1', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '76.126.245.87 59732 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/1', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOT_PORT': '1234', '_': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '3', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': '/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '76.126.245.87 59732 172.31.85.4 22', 'PYTHONPATH': '/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============1
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '0', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '76.126.245.87 59732 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/1', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOT_PORT': '1234', '_': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '3', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': '/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '76.126.245.87 59732 172.31.85.4 22', 'PYTHONPATH': '/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============0
environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '3', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '76.126.245.87 59732 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/1', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOT_PORT': '1234', '_': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '3', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': '/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '76.126.245.87 59732 172.31.85.4 22', 'PYTHONPATH': '/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============3

Segmentation fault: 11


Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7fe075c25100]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7fe1021c34b0]
  [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7fe102561d44]
  [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7fe07531a737]
  [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7fe07531d863]
  [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7fe075313551]
  [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7fe075279a67]
  [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7fdff9762970]
  [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fe1012dfec0]
Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f3556934100]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f35e2ed24b0]
  [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f35e3270d44]
  [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7f3556029737]
  [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7f355602c863]
  [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f3556022551]
  [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7f3555f88a67]
  [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7f34e1762970]
  [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f35e1feeec0]
[2020-03-16 21:41:59*** Error in `.956268: F byteps/common/core_loops.cc:299] Check failed: r == ncclSuccess NCCL error: unhandled cuda error

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7fefa3442100]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7ff02f9e04b0]
  [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7ff02fd7ed44]
  [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7fefa2b37737]
  [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7fefa2b3a863]
  [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7fefa2b30551]
  [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7fefa2a96a67]
  [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7fef2d762970]
  [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7ff02eafcec0]

Segmentation fault: 11


Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f14f6979100]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f1582f174b0]
  [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f15832b5d44]
  [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7f14f606e737]
  [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7f14f6071863]
  [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f14f6067551]
  [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7f14f5fcda67]
  [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7f1481762970]
  [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f1582033ec0]

Segmentation fault: 11


Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f14f6979100]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f1582f174b0]
  [bt] (2) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389f261) [0x7f14f6070261]
  [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0611) [0x7f14f6071611]
  [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f14f6067551]
  [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38974a4) [0x7f14f60684a4]
  [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x48a) [0x7f14f629056a]
  [bt] (7) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6d860a) [0x7f14f2ea960a]
  [bt] (8) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3ab7101) [0x7f14f6288101]
Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f14f6979100]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f1582f174b0]
  [bt] (2) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389f261) [0x7f14f6070261]
  [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0611) [0x7f14f6071611]
  [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f14f6067551]
  [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38974a4) [0x7f14f60684a4]
  [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x48a) [0x7f14f629056a]
  [bt] (7) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6d860a) [0x7f14f2ea960a]
  [bt] (8) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3ab7101) [0x7f14f6288101]
Aborted (core dumped)
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 134.

Segmentation fault (core dumped)
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 139.

Segmentation fault (core dumped)
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 139.

Segmentation fault (core dumped)
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 139.

Environment (please complete the following information):

OS: Ubuntu 16.04
GCC version: 5.4.0
CUDA and NCCL version: CUDA 10.0 and NCCL 2.4.7
Framework (TF, PyTorch, MXNet): MXNet

bug

opened by access2rohit 19

Run the distributed training on Kubernetes
After the successful single run on Kubernetes with the workaround, I tried to run the distributed train with 2 workers on Kubernetes. However there is only one worker running, and the another one hangs always. I assigned just 1 device (with 0 as device tag), but the running worker said 2 GPUS benchmarking. The running worker has 2 GPUs, and hanging worker has 1 GPU only.

How did you benchmark? bare-mental or Kubernetes?

Does it work if the worker just has 1 GPU? and is there any requirement on the GPU model?

Is there any Kubernetes operator to setup bytePS?

distributed
opened by compete369 19
Using ByteScheduler is not as fast as Ring-allreduce

Hi, I applied ByteScheduler to Horovod+Pytorch, and ran pytorch_horovod_benchmark.py to compare the speed with Horovod+Pytorch without ByteScheduler, and found:

Restnet50 models: ByteScheduler+Horovod+Pytorch: 180-200 img/sec per GPU Horovod+Pytorch: 254 img/sec per GPU

The speed is much worse. Remember that ByteScheduler mentioned in the paper can increase by 11%-15%. Could you tell me where the parameters are adjusted incorrectly? I try to adjust BYTESCHEDULER_CREDIT and BYTESCHEDULER_PARTITION It is found that the speed improvement is not great. Is there a recommended configuration size for these two parameters?
bytescheduler

opened by Richie-yan 18
some question about to start server. Check failed: mr ibv_reg_mr failed: Cannot allocate memory

I want to 1 worker and 1 server, but when I use the following command to start server, I have some error, can anyone meet the same error?

export BYTEPS_LOG_LEVEL=INFO export BYTEPS_ENABLE_IPC=1 export DMLC_ENABLE_RDMA=1 export DMLC_NUM_WORKER=2 export DMLC_ROLE=server export DMLC_NUM_SERVER=1

export DMLC_INTERFACE=ens6f1 export DMLC_PS_ROOT_URI=172.168.30.25 export DMLC_PS_ROOT_PORT=9000

bpslaunch

the error is as below: terminate called after throwing an instance of 'dmlc::Error' what(): [16:23:05] src/./rdma_transport.h:130: Check failed: mr ibv_reg_mr failed: Cannot allocate memory, i=941, kMempoolChunkSize=56

i don't need to docker. maybe i must need to use docker pull bytepsimage/tensorflow to correctly start? can i start byteps without docker?

opened by DeruiLiu 17
Distributed training with RDMA errors

Excuse me! When I use step-by-step-tutorial to training multi-machines training of RDMA, the error as follows:

However，when I run Horovod RDMA training with same machines, it can works normally! What factors can cause this error to occur?

opened by wuyujiji 16
one weird trick to reduce PULL time significantly

I happened to read an article about openmp yesterday. https://zhuanlan.zhihu.com/p/118604153

I followed the suggestions in the article and set OMP_WAIT_POLICY=PASSIVE for both workers and servers. and the throughput surged around 25%. the profile shows this is because a huge reduction in time taken by PULL operation. so i suppose the flag is beneficial to servers. i guess it eases the contention of cpus between server's threads and openmp's threads. the setting is 4 x p3.16xlarge as workers and 4 x c5dn.xlarge as servers. i wonder whether it works in other cases.
good first issue

opened by jasperzhong 16
安装问题

在纯CPU的机器上（拟作为server端、scheduler端）上安装byteps：（1）还需要安装cuda，cudnn，nccl吗？（2）对于pytorch的版本有要求吗？我目前安装了cuda11.3，以及pytorch10.0-CPU版本，然后执行python setup.py install，除了报了许多warning外，报错: fatal error: THC/THC.h: 没有那个文件或目录

opened by QingQingR 0
broadcast and is_initialized api are not supported with pytorch.

AttributeError: module 'byteps.torch' has no attribute 'broadcast' AttributeError: module 'byteps.torch' has no attribute 'is_initialized'

which is not compatible with horovord in pytorch.

opened by HangJie720 0
support for fault tolerance and straggler mitigation

Hi i have noticed that there is a plan for Fault-tolerance and straggler mitigation support in the future plan section. So how is the progress going right now?

Also, there is related paper from your team said that they have made the implementation based on BytePS. "Elastic Parameter Server Load Distribution in Deep Learning Clusters"

opened by youshaox 0
update shm naming scheme

use the hex representation of the tensor key in shm names. It's easier to tell the operation type, tensor id and partition number etc from the hex representation.

Signed-off-by: yulu.jia [email protected]

opened by pleasantrabbit 0
安装报错
安装报错，是什么依赖安装的不对吗，辛苦看下 python 3.7.5 tensorflow 2.5.0

` x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DEIGEN_MPL2_ONLY=1 -I3rdparty/ps-lite/include -I/usr/local/nccl/include -I/usr/include/python3.7m -c byteps/common/compressor/impl/dithering.cc -o build/temp.linux-x86_64-3.7/byteps/common/compressor/impl/dithering.o -std=c++11 -fPIC -Ofast -Wall -fopenmp -march=native -mno-avx512f -D_GLIBCXX_USE_CXX11_ABI=0 -DBYTEPS_BUILDING_SERVER In file included from byteps/common/compressor/impl/../compressor.h:23:0, from byteps/common/compressor/impl/../compressor_registry.h:19, from byteps/common/compressor/impl/dithering.cc:19: byteps/common/compressor/impl/dithering.cc: In member function ‘virtual byteps::common::compressor::tensor_t byteps::common::compressor::DitheringCompressor::Compress(byteps::common::compressor::tensor_t)’: byteps/common/compressor/impl/../common.h:48:42: error: ISO C++ forbids declaration of ‘type name’ with no type [-fpermissive] reinterpret_cast<const half_t*>(src),
^ byteps/common/compressor/impl/dithering.cc:119:3: note: in expansion of macro ‘COMPRESS_IMPL_SWITCH’ COMPRESS_IMPL_SWITCH(grad.dtype, CompressImpl, _buf.get(), grad.data, ^~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:48:42: error: expected ‘>’ before ‘half_t’ reinterpret_cast<const half_t*>(src),
^ byteps/common/compressor/impl/dithering.cc:119:3: note: in expansion of macro ‘COMPRESS_IMPL_SWITCH’ COMPRESS_IMPL_SWITCH(grad.dtype, CompressImpl, _buf.get(), grad.data, ^~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:48:42: error: expected ‘(’ before ‘half_t’ reinterpret_cast<const half_t*>(src),
^ byteps/common/compressor/impl/dithering.cc:119:3: note: in expansion of macro ‘COMPRESS_IMPL_SWITCH’ COMPRESS_IMPL_SWITCH(grad.dtype, CompressImpl, _buf.get(), grad.data, ^~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:48:42: error: ‘half_t’ was not declared in this scope reinterpret_cast<const half_t*>(src),
^ byteps/common/compressor/impl/dithering.cc:119:3: note: in expansion of macro ‘COMPRESS_IMPL_SWITCH’ COMPRESS_IMPL_SWITCH(grad.dtype, CompressImpl, _buf.get(), grad.data, ^~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:48:42: note: suggested alternative: ‘off_t’ reinterpret_cast<const half_t*>(src),
^ byteps/common/compressor/impl/dithering.cc:119:3: note: in expansion of macro ‘COMPRESS_IMPL_SWITCH’ COMPRESS_IMPL_SWITCH(grad.dtype, CompressImpl, _buf.get(), grad.data, ^~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:48:49: error: expected primary-expression before ‘>’ token reinterpret_cast<const half_t*>(src),
^ byteps/common/compressor/impl/dithering.cc:119:3: note: in expansion of macro ‘COMPRESS_IMPL_SWITCH’ COMPRESS_IMPL_SWITCH(grad.dtype, CompressImpl, _buf.get(), grad.data, ^~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/dithering.cc: In member function ‘virtual byteps::common::compressor::tensor_t byteps::common::compressor::DitheringCompressor::Decompress(byteps::common::compressor::tensor_t)’: byteps/common/compressor/impl/../common.h:64:36: error: ‘half_t’ does not name a type; did you mean ‘off_t’? return func(reinterpret_cast<half_t*>(dst),
^ byteps/common/compressor/impl/dithering.cc:170:3: note: in expansion of macro ‘DECOMPRESS_IMPL_SWITCH’ DECOMPRESS_IMPL_SWITCH(_dtype, DecompressImpl, dst, compressed.data, ^~~~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:64:42: error: expected ‘>’ before ‘’ token return func(reinterpret_cast<half_t>(dst),
^ byteps/common/compressor/impl/dithering.cc:170:3: note: in expansion of macro ‘DECOMPRESS_IMPL_SWITCH’ DECOMPRESS_IMPL_SWITCH(_dtype, DecompressImpl, dst, compressed.data, ^~~~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:64:42: error: expected ‘(’ before ‘’ token return func(reinterpret_cast<half_t>(dst),
^ byteps/common/compressor/impl/dithering.cc:170:3: note: in expansion of macro ‘DECOMPRESS_IMPL_SWITCH’ DECOMPRESS_IMPL_SWITCH(_dtype, DecompressImpl, dst, compressed.data, ^~~~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:64:43: error: expected primary-expression before ‘>’ token return func(reinterpret_cast<half_t*>(dst),
^ byteps/common/compressor/impl/dithering.cc:170:3: note: in expansion of macro ‘DECOMPRESS_IMPL_SWITCH’ DECOMPRESS_IMPL_SWITCH(_dtype, DecompressImpl, dst, compressed.data, ^~~~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/dithering.cc: In member function ‘virtual void byteps::common::compressor::DitheringCompressor::FastUpdateError(byteps::common::compressor::tensor_t, byteps::common::compressor::tensor_t, byteps::common::compressor::tensor_t)’: byteps/common/compressor/impl/../common.h:80:36: error: ‘half_t’ does not name a type; did you mean ‘off_t’? return func(reinterpret_cast<half_t*>(dst),
^ byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’ FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:80:42: error: expected ‘>’ before ‘’ token return func(reinterpret_cast<half_t>(dst),
^ byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’ FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:80:42: error: expected ‘(’ before ‘’ token return func(reinterpret_cast<half_t>(dst),
^ byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’ FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:80:43: error: expected primary-expression before ‘>’ token return func(reinterpret_cast<half_t*>(dst),
^ byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’ FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:81:36: error: ‘half_t’ does not name a type; did you mean ‘off_t’? reinterpret_cast<half_t*>(src1),
^ byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’ FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:81:42: error: expected ‘>’ before ‘’ token reinterpret_cast<half_t>(src1),
^ byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’ FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:81:42: error: expected ‘(’ before ‘’ token reinterpret_cast<half_t>(src1),
^ byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’ FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:81:43: error: expected primary-expression before ‘>’ token reinterpret_cast<half_t*>(src1),
^ byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’ FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:82:76: error: expected ‘)’ before ‘;’ token reinterpret_cast<const uint16_t*>(src2), compressed_size);
^ byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’ FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ byteps/common/compressor/impl/../common.h:82:76: error: return-statement with a value, in function returning 'void' [-fpermissive] reinterpret_cast<const uint16_t*>(src2), compressed_size);
^ byteps/common/compressor/impl/dithering.cc:212:3: note: in expansion of macro ‘FAST_UPDATE_ERROR_IMPL_SWITCH’ FAST_UPDATE_ERROR_IMPL_SWITCH(_dtype, FastUpdateErrorImpl, error.data, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: An ERROR occured while building the server module.

Traceback (most recent call last): File "/usr/lib/python3.7/distutils/unixccompiler.py", line 118, in _compile extra_postargs) File "/usr/lib/python3.7/distutils/ccompiler.py", line 910, in spawn spawn(cmd, dry_run=self.dry_run) File "/usr/lib/python3.7/distutils/spawn.py", line 36, in spawn _spawn_posix(cmd, search_path, dry_run=dry_run) File "/usr/lib/python3.7/distutils/spawn.py", line 159, in _spawn_posix % (cmd, exit_status)) distutils.errors.DistutilsExecError: command 'x86_64-linux-gnu-gcc' failed with exit status 1 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/tmp/pip-install-gg9uk_gv/byteps_d2daed5e88db4e3dbcc5a12776b83bdf/setup.py", line 944, in build_extensions build_server(self, options) File "/tmp/pip-install-gg9uk_gv/byteps_d2daed5e88db4e3dbcc5a12776b83bdf/setup.py", line 337, in build_server build_ext.build_extension(server_lib) File "/usr/local/lib/python3.7/dist-packages/setuptools/command/build_ext.py", line 196, in build_extension _build_ext.build_extension(self, ext) File "/usr/lib/python3.7/distutils/command/build_ext.py", line 534, in build_extension depends=ext.depends) File "/usr/lib/python3.7/distutils/ccompiler.py", line 574, in compile self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts) File "/usr/lib/python3.7/distutils/unixccompiler.py", line 120, in _compile raise CompileError(msg) distutils.errors.CompileError: command 'x86_64-linux-gnu-gcc' failed with exit status 1 ----------------------------------------

ERROR: Command errored out with exit status 1: /usr/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-gg9uk_gv/byteps_d2daed5e88db4e3dbcc5a12776b83bdf/setup.py'"'"'; file='"'"'/tmp/pip-install-gg9uk_gv/byteps_d2daed5e88db4e3dbcc5a12776b83bdf/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(file) if os.path.exists(file) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-6671etx_/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.7/byteps Check the logs for full command output. `
opened by llplay 1

Releases(v0.2)

v0.2(Feb 19, 2020)
0.2.0 (2020-02)

Re-implementing the RDMA transport and largely improve RDMA performance

Add IPC support intra-server communication.

Fix a hanging bug in BytePS server.

Fix RDMA-related segmentation fault problem during fork() (e.g., used by PyTorch data loader).

New feature: Enable mixing use of colocate and non-colocate servers, along with a smart tensor allocation strategy.

New feature: Add bpslaunch as the command to launch tasks.

Add support for pip install: pip3 install byteps

Updated documents and example docker images

Source code(tar.gz)
Source code(zip)

A high performance and generic framework for distributed DNN training

Related tags

Overview

BytePS

News

Performance

Goodbye MPI, Hello Cloud

Quick Start

Install by pip

Build from source code

Examples

Use BytePS in Your Code

Limitations and Future Plans

Publications

Comments

Motivation

Design Overview

Interface

Implementation

Parameter Data Structure

Compressor - Development API

Exps

CIFAR100

End-to-End Training

Slow Network

IMAGENET

Workload Breakdown

Scaling Efficiency

End-to-End Training

TODO

Precautions

Acknowledgement

Releases(v0.2)

v0.2(Feb 19, 2020)

0.2.0 (2020-02)

Owner

Bytedance Inc.

XGBoost-Ray is a distributed backend for XGBoost, built on top of distributed computing framework Ray.

PyTorch extensions for high performance and large scale training.

nn-Meter is a novel and efficient system to accurately predict the inference latency of DNN models on diverse edge devices

Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

A Python implementation of GRAIL, a generic framework to learn compact time series representations.

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

DistML is a Ray extension library to support large-scale distributed ML training on heterogeneous multi-node multi-GPU clusters

WAGMA-SGD is a decentralized asynchronous SGD for distributed deep learning training based on model averaging.

High performance implementation of Extreme Learning Machines (fast randomized neural networks).

High performance Python GLMs with all the features!

An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

BigDL: Distributed Deep Learning Framework for Apache Spark

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray

Distributed Deep learning with Keras & Spark