Optimized primitives for collective multi-GPU communication

Overview

NCCL

Optimized primitives for inter-GPU communication.

Introduction

NCCL (pronounced "Nickel") is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive-based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVSwitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

For more information on NCCL usage, please refer to the NCCL documentation.

Build

Note: the official and tested builds of NCCL can be downloaded from: https://developer.nvidia.com/nccl. You can skip the following build steps if you choose to use the official builds.

To build the library:

$ cd nccl
$ make -j src.build

If CUDA is not installed in the default /usr/local/cuda path, you can define the CUDA path with:

$ make src.build CUDA_HOME=<path to cuda install>

NCCL will be compiled and installed in build/ unless BUILDDIR is set.
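
For example, to build into a custom directory instead of the default build/ directory (the placeholder path is illustrative):

$ make -j src.build BUILDDIR=<custom build dir>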

By default, NCCL is compiled for all supported architectures. To accelerate the compilation and reduce the binary size, consider redefining NVCC_GENCODE (defined in makefiles/common.mk) to only include the architecture of the target platform:

$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70"
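
Multiple -gencode flags can be combined to target several architectures in a single build, for example Volta and Ampere:

$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80"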

Install

To install NCCL on the system, create a package, then install it as root.

Debian/Ubuntu:

$ # Install tools to create debian packages
$ sudo apt install build-essential devscripts debhelper fakeroot
$ # Build NCCL deb package
$ make pkg.debian.build
$ ls build/pkg/deb/
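
The generated packages can then be installed with dpkg; the exact file names depend on the NCCL and CUDA versions, so the wildcards below are only illustrative:

$ sudo dpkg -i build/pkg/deb/libnccl2_*.deb build/pkg/deb/libnccl-dev_*.deb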

RedHat/CentOS:

$ # Install tools to create rpm packages
$ sudo yum install rpm-build rpmdevtools
$ # Build NCCL rpm package
$ make pkg.redhat.build
$ ls build/pkg/rpm/
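
The generated RPMs can then be installed with rpm; as above, the file names vary by version and the wildcard is only illustrative:

$ sudo rpm -ivh build/pkg/rpm/libnccl*.rpm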

OS-agnostic tarball:

$ make pkg.txz.build
$ ls build/pkg/txz/
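
The tarball can simply be extracted into the desired installation prefix; the file name and target path below are examples:

$ tar xvf build/pkg/txz/nccl_*.txz -C /opt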

Tests

Tests for NCCL are maintained separately at https://github.com/nvidia/nccl-tests.

$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make
$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g <ngpus>
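
To run the tests across multiple processes or nodes, nccl-tests can also be built with MPI support and launched with one rank per GPU; the MPI_HOME path below is only an example:

$ make MPI=1 MPI_HOME=/path/to/mpi
$ mpirun -np <nranks> ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1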

Copyright

All source code and accompanying documentation is copyright (c) 2015-2020, NVIDIA CORPORATION. All rights reserved.

Issues
  • NCCL 2.6.4 makes the system hang.

    NCCL 2.6.4 makes the system hang.

    Linux: Ubuntu 20.04 LTS. GPU driver: latest NVIDIA driver for Linux. CUDA 10.1, cuDNN 7.6.5, NCCL 2.6.4. Hardware: CPU: Intel 9400F, MB: Z370, RAM: 64 GB dual-channel, GPU: two RTX 2080 Ti cards on two PCIe 3.0 x8 slots, with an NVLink bridge between them.

    I ran all of the nccl-tests and NCCL seems to be working. But while each test is running (about 30 minutes per test), the system freezes: I can't switch to the browser or do anything else. I can only move the mouse, but the system doesn't respond to mouse clicks or keyboard input. When the test finishes running, the system goes back to normal and the log prints in the console.

    The log is here:

    #  ./all_reduce_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   3795 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   3795 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             2   float     sum     7.18    0.00    0.00  0e+00     7.02    0.00    0.00  0e+00
              16             4   float     sum     7.00    0.00    0.00  0e+00     7.02    0.00    0.00  0e+00
              32             8   float     sum     7.28    0.00    0.00  0e+00     7.19    0.00    0.00  0e+00
              64            16   float     sum     7.20    0.01    0.01  0e+00     7.05    0.01    0.01  0e+00
             128            32   float     sum     7.30    0.02    0.02  0e+00     7.19    0.02    0.02  0e+00
             256            64   float     sum     7.30    0.04    0.04  0e+00     7.20    0.04    0.04  0e+00
             512           128   float     sum     7.47    0.07    0.07  0e+00     7.12    0.07    0.07  0e+00
            1024           256   float     sum     8.14    0.13    0.13  0e+00     7.92    0.13    0.13  0e+00
            2048           512   float     sum     8.56    0.24    0.24  0e+00     8.43    0.24    0.24  0e+00
            4096          1024   float     sum     9.72    0.42    0.42  0e+00     9.49    0.43    0.43  0e+00
            8192          2048   float     sum    11.99    0.68    0.68  0e+00    11.92    0.69    0.69  0e+00
           16384          4096   float     sum    14.36    1.14    1.14  0e+00    14.21    1.15    1.15  0e+00
           32768          8192   float     sum    16.79    1.95    1.95  0e+00    16.64    1.97    1.97  0e+00
           65536         16384   float     sum    21.14    3.10    3.10  0e+00    20.55    3.19    3.19  0e+00
          131072         32768   float     sum    35.56    3.69    3.69  0e+00    35.43    3.70    3.70  0e+00
          262144         65536   float     sum    41.23    6.36    6.36  0e+00    41.21    6.36    6.36  0e+00
          524288        131072   float     sum    50.66   10.35   10.35  0e+00    50.82   10.32   10.32  0e+00
         1048576        262144   float     sum    72.54   14.45   14.45  0e+00    72.45   14.47   14.47  0e+00
         2097152        524288   float     sum    120.7   17.37   17.37  0e+00    118.4   17.71   17.71  0e+00
         4194304       1048576   float     sum    215.2   19.49   19.49  0e+00    214.7   19.53   19.53  0e+00
         8388608       2097152   float     sum    411.3   20.39   20.39  0e+00    399.1   21.02   21.02  0e+00
        16777216       4194304   float     sum    865.3   19.39   19.39  0e+00    779.6   21.52   21.52  0e+00
        33554432       8388608   float     sum   1547.9   21.68   21.68  0e+00   1699.3   19.75   19.75  0e+00
        67108864      16777216   float     sum   3115.1   21.54   21.54  0e+00   3007.4   22.31   22.31  0e+00
       134217728      33554432   float     sum   5994.3   22.39   22.39  0e+00   5991.9   22.40   22.40  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 7.43886 
    
    ./all_gather_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   9119 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   9119 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                             out-of-place                       in-place          
    #       size         count    type     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             1   float     7.14    0.00    0.00  0e+00     7.06    0.00    0.00  0e+00
              16             2   float     7.03    0.00    0.00  0e+00     7.00    0.00    0.00  0e+00
              32             4   float     6.96    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
              64             8   float     7.10    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
             128            16   float     7.10    0.01    0.01  0e+00     7.14    0.01    0.01  0e+00
             256            32   float     7.18    0.02    0.02  0e+00     7.23    0.02    0.02  0e+00
             512            64   float     7.49    0.03    0.03  0e+00     7.47    0.03    0.03  0e+00
            1024           128   float     7.03    0.07    0.07  0e+00     6.96    0.07    0.07  0e+00
            2048           256   float     6.97    0.15    0.15  0e+00     6.97    0.15    0.15  0e+00
            4096           512   float     7.41    0.28    0.28  0e+00     7.00    0.29    0.29  0e+00
            8192          1024   float     9.59    0.43    0.43  0e+00     8.80    0.47    0.47  0e+00
           16384          2048   float    11.41    0.72    0.72  0e+00    10.78    0.76    0.76  0e+00
           32768          4096   float    13.39    1.22    1.22  0e+00    11.85    1.38    1.38  0e+00
           65536          8192   float    16.57    1.98    1.98  0e+00    13.83    2.37    2.37  0e+00
          131072         16384   float    23.07    2.84    2.84  0e+00    18.39    3.56    3.56  0e+00
          262144         32768   float    31.38    4.18    4.18  0e+00    30.27    4.33    4.33  0e+00
          524288         65536   float    36.00    7.28    7.28  0e+00    35.30    7.43    7.43  0e+00
         1048576        131072   float    47.38   11.06   11.06  0e+00    46.84   11.19   11.19  0e+00
         2097152        262144   float    70.44   14.89   14.89  0e+00    69.77   15.03   15.03  0e+00
         4194304        524288   float    120.1   17.46   17.46  0e+00    115.5   18.16   18.16  0e+00
         8388608       1048576   float    212.5   19.73   19.73  0e+00    210.2   19.95   19.95  0e+00
        16777216       2097152   float    418.5   20.05   20.05  0e+00    414.0   20.26   20.26  0e+00
        33554432       4194304   float    817.8   20.51   20.51  0e+00    785.1   21.37   21.37  0e+00
        67108864       8388608   float   1568.3   21.40   21.40  0e+00   1560.9   21.50   21.50  0e+00
       134217728      16777216   float   3298.6   20.34   20.34  0e+00   3070.3   21.86   21.86  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 6.6972 
    
    ./broadcast_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid  26256 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid  26256 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type    root     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             2   float       0     7.24    0.00    0.00  0e+00     7.50    0.00    0.00  0e+00
              16             4   float       0     8.31    0.00    0.00  0e+00     7.69    0.00    0.00  0e+00
              32             8   float       0     8.15    0.00    0.00  0e+00     8.23    0.00    0.00  0e+00
              64            16   float       0     7.19    0.01    0.01  0e+00     7.13    0.01    0.01  0e+00
             128            32   float       0     7.25    0.02    0.02  0e+00     7.45    0.02    0.02  0e+00
             256            64   float       0     7.08    0.04    0.04  0e+00     7.16    0.04    0.04  0e+00
             512           128   float       0     7.47    0.07    0.07  0e+00     7.39    0.07    0.07  0e+00
            1024           256   float       0     7.19    0.14    0.14  0e+00    32.19    0.03    0.03  0e+00
            2048           512   float       0     7.36    0.28    0.28  0e+00     7.03    0.29    0.29  0e+00
            4096          1024   float       0     7.25    0.57    0.57  0e+00     7.07    0.58    0.58  0e+00
            8192          2048   float       0     9.11    0.90    0.90  0e+00     8.10    1.01    1.01  0e+00
           16384          4096   float       0    10.97    1.49    1.49  0e+00    10.52    1.56    1.56  0e+00
           32768          8192   float       0    13.36    2.45    2.45  0e+00    11.73    2.79    2.79  0e+00
           65536         16384   float       0    17.03    3.85    3.85  0e+00    14.24    4.60    4.60  0e+00
          131072         32768   float       0    22.66    5.78    5.78  0e+00    22.60    5.80    5.80  0e+00
          262144         65536   float       0    28.48    9.21    9.21  0e+00    28.45    9.21    9.21  0e+00
          524288        131072   float       0    40.26   13.02   13.02  0e+00    40.08   13.08   13.08  0e+00
         1048576        262144   float       0    63.48   16.52   16.52  0e+00    63.19   16.59   16.59  0e+00
         2097152        524288   float       0    110.1   19.04   19.04  0e+00    109.3   19.19   19.19  0e+00
         4194304       1048576   float       0    205.7   20.39   20.39  0e+00    237.1   17.69   17.69  0e+00
         8388608       2097152   float       0    425.1   19.73   19.73  0e+00    386.7   21.69   21.69  0e+00
        16777216       4194304   float       0    815.0   20.59   20.59  0e+00    824.0   20.36   20.36  0e+00
        33554432       8388608   float       0   1536.8   21.83   21.83  0e+00   1508.2   22.25   22.25  0e+00
        67108864      16777216   float       0   3139.2   21.38   21.38  0e+00   3124.3   21.48   21.48  0e+00
       134217728      33554432   float       0   6283.5   21.36   21.36  0e+00   5873.1   22.85   22.85  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 7.99748 
    
    $ ./reduce_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   4810 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   4810 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type   redop    root     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             2   float     sum       0     7.16    0.00    0.00  0e+00     7.35    0.00    0.00  0e+00
              16             4   float     sum       0     7.74    0.00    0.00  0e+00     7.67    0.00    0.00  0e+00
              32             8   float     sum       0     7.08    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
              64            16   float     sum       0     7.13    0.01    0.01  0e+00     7.14    0.01    0.01  0e+00
             128            32   float     sum       0     7.15    0.02    0.02  0e+00     7.06    0.02    0.02  0e+00
             256            64   float     sum       0     7.14    0.04    0.04  0e+00     7.12    0.04    0.04  0e+00
             512           128   float     sum       0     7.14    0.07    0.07  0e+00     7.11    0.07    0.07  0e+00
            1024           256   float     sum       0     7.09    0.14    0.14  0e+00     7.09    0.14    0.14  0e+00
            2048           512   float     sum       0     7.11    0.29    0.29  0e+00     7.12    0.29    0.29  0e+00
            4096          1024   float     sum       0     7.28    0.56    0.56  0e+00     7.20    0.57    0.57  0e+00
            8192          2048   float     sum       0     8.72    0.94    0.94  0e+00     8.59    0.95    0.95  0e+00
           16384          4096   float     sum       0    10.80    1.52    1.52  0e+00    10.78    1.52    1.52  0e+00
           32768          8192   float     sum       0    12.89    2.54    2.54  0e+00    12.64    2.59    2.59  0e+00
           65536         16384   float     sum       0    16.42    3.99    3.99  0e+00    15.88    4.13    4.13  0e+00
          131072         32768   float     sum       0    23.17    5.66    5.66  0e+00    23.27    5.63    5.63  0e+00
          262144         65536   float     sum       0    29.13    9.00    9.00  0e+00    28.88    9.08    9.08  0e+00
          524288        131072   float     sum       0    40.93   12.81   12.81  0e+00    40.93   12.81   12.81  0e+00
         1048576        262144   float     sum       0    64.30   16.31   16.31  0e+00    64.25   16.32   16.32  0e+00
         2097152        524288   float     sum       0    110.5   18.98   18.98  0e+00    110.6   18.97   18.97  0e+00
         4194304       1048576   float     sum       0    202.1   20.76   20.76  0e+00    202.1   20.76   20.76  0e+00
         8388608       2097152   float     sum       0    386.5   21.70   21.70  0e+00    386.3   21.71   21.71  0e+00
        16777216       4194304   float     sum       0    752.6   22.29   22.29  0e+00    752.5   22.30   22.30  0e+00
        33554432       8388608   float     sum       0   1485.2   22.59   22.59  0e+00   1529.3   21.94   21.94  0e+00
        67108864      16777216   float     sum       0   2947.4   22.77   22.77  0e+00   2945.2   22.79   22.79  0e+00
       134217728      33554432   float     sum       0   5873.8   22.85   22.85  0e+00   5873.8   22.85   22.85  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 8.22671 
    $ ./reduce_scatter_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   5435 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   5435 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             1   float     sum     7.21    0.00    0.00  0e+00     7.28    0.00    0.00  0e+00
              16             2   float     sum     7.12    0.00    0.00  0e+00     7.18    0.00    0.00  0e+00
              32             4   float     sum     7.14    0.00    0.00  0e+00     7.22    0.00    0.00  0e+00
              64             8   float     sum     7.20    0.00    0.00  0e+00     7.15    0.00    0.00  0e+00
             128            16   float     sum     7.14    0.01    0.01  0e+00     7.12    0.01    0.01  0e+00
             256            32   float     sum     7.16    0.02    0.02  0e+00     7.12    0.02    0.02  0e+00
             512            64   float     sum     7.18    0.04    0.04  0e+00     7.12    0.04    0.04  0e+00
            1024           128   float     sum     7.53    0.07    0.07  0e+00     7.27    0.07    0.07  0e+00
            2048           256   float     sum     7.28    0.14    0.14  0e+00     7.23    0.14    0.14  0e+00
            4096           512   float     sum     7.64    0.27    0.27  0e+00     7.57    0.27    0.27  0e+00
            8192          1024   float     sum     9.35    0.44    0.44  0e+00     9.24    0.44    0.44  0e+00
           16384          2048   float     sum    11.33    0.72    0.72  0e+00    11.23    0.73    0.73  0e+00
           32768          4096   float     sum    12.66    1.29    1.29  0e+00    12.62    1.30    1.30  0e+00
           65536          8192   float     sum    15.39    2.13    2.13  0e+00    15.31    2.14    2.14  0e+00
          131072         16384   float     sum    21.02    3.12    3.12  0e+00    21.35    3.07    3.07  0e+00
          262144         32768   float     sum    32.36    4.05    4.05  0e+00    31.98    4.10    4.10  0e+00
          524288         65536   float     sum    39.63    6.61    6.61  0e+00    39.76    6.59    6.59  0e+00
         1048576        131072   float     sum    57.11    9.18    9.18  0e+00    56.88    9.22    9.22  0e+00
         2097152        262144   float     sum    92.96   11.28   11.28  0e+00    92.54   11.33   11.33  0e+00
         4194304        524288   float     sum    166.4   12.60   12.60  0e+00    165.9   12.64   12.64  0e+00
         8388608       1048576   float     sum    308.5   13.59   13.59  0e+00    504.4    8.32    8.32  0e+00
        16777216       2097152   float     sum   1050.1    7.99    7.99  0e+00    693.5   12.10   12.10  0e+00
        33554432       4194304   float     sum   1533.4   10.94   10.94  0e+00   1414.8   11.86   11.86  0e+00
        67108864       8388608   float     sum   2529.2   13.27   13.27  0e+00   2314.2   14.50   14.50  0e+00
       134217728      16777216   float     sum   5619.2   11.94   11.94  0e+00   4905.4   13.68   13.68  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 4.44552 
    

    Originally I found this issue while training with TensorFlow. I first submitted a bug to TensorFlow; here is the link: https://github.com/tensorflow/tensorflow/issues/40027

    It shows that when I remove the NVLink bridge, the TF code runs well, and when I use the NVLink bridge but not NCCL, the TF code runs well too. But when I use both NCCL and the NVLink bridge, the system halts and I have to reboot.

    opened by AlexWang1900 37
  • NCCL InfiniBand performance

    NCCL InfiniBand performance

    Hi NCCL devs! I have two machines in a cluster communicating over InfiniBand. There is 400 Gb/sec of bandwidth available between the machines (confirmed with ib_send_bw), but:

    1. nccl-tests only achieves about 20 GB/s, roughly half of what I would expect
    2. there is a decent amount of variance

    running broadcast_perf on 2 machines:

    NCCL_DEBUG=INFO mpiexec -f <hosts file> /root/code/nccl-tests/build/broadcast_perf -b 1M -e 2048M -f 2 -g 1 -c 0 -d half
    

    nccl.txt

    This log shows that (1) nccl is getting between about 15 and 20 GB/s in busbw, and (2) the speed isn't monotonic for larger amounts of data and can change significantly across runs.

    Any ideas on what could be going wrong here? I would expect that I should be getting something closer to 45 GB/s and that there would be more consistency across runs.

    env vars:

    NCCL_IB_HCA=^mlx5_2
    NCCL_SOCKET_IFNAME=eth
    

    ibstatus

    Infiniband device 'mlx5_0' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d4:fe72
            base lid:        0x2ed
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    Infiniband device 'mlx5_1' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d4:fe74
            base lid:        0x5b3
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    Infiniband device 'mlx5_2' port 1 status:
            default gid:     0000:0000:0000:0000:0000:0000:0000:0000
            base lid:        0x0
            sm lid:          0x0
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      Ethernet
    
    Infiniband device 'mlx5_3' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d5:04c6
            base lid:        0x2f3
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    Infiniband device 'mlx5_4' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d5:04c8
            base lid:        0x679
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    opened by christopherhesse 34
  • AllReduce hangs

    AllReduce hangs

    My problem was diagnosed in https://github.com/tensorflow/tensorflow/issues/32654 - please find all the info about my environment there.

    Using the master version of nccl. I launch all_reduce_perf and it hangs with 100% volatile GPU usage reported.

    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 4
    # nThread 1 nGpus 4 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
    #
    # Using devices
    #   Rank  0 Pid  15833 on jupyter-vmarkovtsev device  0 [0x02] GeForce GTX 1080 Ti
    #   Rank  1 Pid  15833 on jupyter-vmarkovtsev device  1 [0x03] GeForce GTX 1080 Ti
    #   Rank  2 Pid  15833 on jupyter-vmarkovtsev device  2 [0x82] GeForce GTX 1080 Ti
    #   Rank  3 Pid  15833 on jupyter-vmarkovtsev device  3 [0x83] GeForce GTX 1080 Ti
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Bootstrap : Using [0]eth0:10.2.3.32<0>
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
    
    jupyter-vmarkovtsev:15833:15833 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO NET/Socket : Using [0]eth0:10.2.3.32<0>
    NCCL version 2.4.8+cuda10.0
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO nranks 4
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
    jupyter-vmarkovtsev:15833:15833 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Channel 00 :    0   1   2   3
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
    jupyter-vmarkovtsev:15833:15833 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
    jupyter-vmarkovtsev:15833:15833 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Ring 00 : 3[3] -> 0[0] via direct shared memory
    #
    #                                                     out-of-place                       in-place
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Launch mode Group/CGMD
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    
    jupyter-vmarkovtsev:15833:15833 [0] init.cc:1250 NCCL WARN Mismatched collective detected, please check your collectivecalls at and around rank 3. You can use NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL to see the collective logs
    

    I waited for 10 minutes; no more logs were printed.

    opened by vmarkovtsev 31
  • NCCL INFO NET/Plugin : No plugin found (libnccl-net.so)

    NCCL INFO NET/Plugin : No plugin found (libnccl-net.so)

    We got stuck using the Clara SDK Docker image on Kubeflow with multi-GPU training (/commands/train_2gpu.sh). It just hangs. Not sure if it's a plugin-not-found issue or a hardware config issue on our side. We are using a DGX-1 with Kubernetes / Kubeflow. Please help.

    Requested train epochs: 10; iterations: 158

    2020-06-29 22:20:20.310128: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.

    Requested train epochs: 10; iterations: 158

    2020-06-29 22:20:24.223690: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7

    2020-06-29 22:20:24.816974: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7

    ds-ml-01-0:17085:17354 [0] NCCL INFO Bootstrap : Using [0]eth0:10.233.66.147<0>

    ds-ml-01-0:17085:17354 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

    ds-ml-01-0:17085:17354 [0] NCCL INFO NET/IB : No device found.

    ds-ml-01-0:17085:17354 [0] NCCL INFO NET/Socket : Using [0]eth0:10.233.66.147<0>

    NCCL version 2.4.8+cuda10.1

    ds-ml-01-0:17086:17353 [1] NCCL INFO Bootstrap : Using [0]eth0:10.233.66.147<0>

    ds-ml-01-0:17086:17353 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

    ds-ml-01-0:17086:17353 [1] NCCL INFO NET/IB : No device found.

    ds-ml-01-0:17086:17353 [1] NCCL INFO NET/Socket : Using [0]eth0:10.233.66.147<0>

    ds-ml-01-0:17085:17354 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff

    ds-ml-01-0:17086:17353 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff

    opened by lalithvaka 28
  • Suboptimal performance with TCP over high bandwidth networks

    Suboptimal performance with TCP over high bandwidth networks

    Hi! Many thanks for creating a great framework. NCCL is widely used at our org for scaling the training of ML models and has proved very reliable.

    I am currently trying to figure out how to achieve optimal inter-node performance with NCCL running over TCP on high-bandwidth networks (32 Gbps, 100 Gbps, and higher). Even with large message sizes we have not been able to reliably obtain more than 60% of wire speed over 32 Gbps networks (see below for nccl-tests output). From what I've gathered, NCCL just hasn't been fully optimized for this configuration yet (although I'm still holding out some hope that I'm just doing it wrong 😄).

    I'm prepared to work fulltime for several weeks on lifting any limitations in the current implementation but I could use a few pointers for getting started. Do you have a sense for what the most promising changes might be and how to incorporate them into the codebase? One thing I might want to explore is using multiple threads/TCP streams. But there is still scope to better utilize a single TCP stream as well so maybe there are some simpler optimizations to try first?

    I've been looking into the codebase and there's a number of things that I don't really understand yet:

    • Running nccl-tests all_reduce_perf -w 0 -n 1 seems to spawn a total of 4 allreduce ops according to my TRACE output. I would have expected just 2 (one for in-place, one for out-of-place).
    • I'm not super clear on the control flow/threading model. In my tests NCCL is using exactly two cores; some of the main files of interest seem to be net_socket.cc, net.cc, socket.h, and enqueue.cc, and a lot of cycles are spent polling ncclSocketIrecv/ncclSocketIsend, but I'm still struggling with how everything fits together and exactly where/how the actual network transfers happen.

    Some more details on my setup. My current config consists of two GCE machines with 8xV100, plenty of cores/RAM, and a 32 Gbps network (no RDMA). I get about 28 Gbps bidirectional bandwidth by running one iperf3 server and client on each node (and >30 Gbps with -Z -P4 flags). Anecdotally, more complex setups that include Horovod have occasionally been able to hit 60% of wire speed on 32 Gbps and 50 Gbps networks. In this case, running nccl-tests only yields 16 Gbps:

    $ mpirun --allow-run-as-root -H 10.73.0.52:1,10.73.0.15:1 -np 2 -mca btl_tcp_if_include ens12 -x NCCL_IB_DISABLE=1 -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=ens12 -x NCCL_DEBUG=INFO /nccl-tests/build/all_reduce_perf -b 1G -e 1G -f 2 -g 1 -c 0
    # nThread 1 nGpus 1 minBytes 1073741824 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
    #
    # Using devices
    #   Rank  0 Pid     51 on managed-worker-l83z device  0 [0x00] Tesla V100-SXM2-16GB
    #   Rank  1 Pid     73 on managed-worker-jbk7 device  0 [0x00] Tesla V100-SXM2-16GB
    managed-worker-l83z:51:51 [0] NCCL INFO NET/Socket : Using [0]ens12:10.73.0.52<0>
    managed-worker-l83z:51:51 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
    managed-worker-l83z:51:51 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
    NCCL version 2.4.2+cuda10.0
    managed-worker-jbk7:73:73 [0] NCCL INFO NET/Socket : Using [0]ens12:10.73.0.15<0>
    managed-worker-jbk7:73:73 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
    managed-worker-jbk7:73:73 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
    managed-worker-l83z:51:57 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
    managed-worker-l83z:51:57 [0] NCCL INFO comm 0x7fd518002560 rank 0 nranks 2 cudaDev 0 nvmlDev 0
    managed-worker-jbk7:73:78 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
    managed-worker-jbk7:73:78 [0] NCCL INFO comm 0x7f9be0002560 rank 1 nranks 2 cudaDev 0 nvmlDev 0
    managed-worker-l83z:51:57 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance :  PHB
    managed-worker-jbk7:73:78 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance :  PHB
    managed-worker-l83z:51:57 [0] NCCL INFO Channel 00 :    0   1
    managed-worker-l83z:51:57 [0] NCCL INFO Channel 01 :    0   1
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 00 : 0 -> 1 [receive] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 00 : 1 -> 0 [receive] via NET/Socket/0
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 00 : 1 -> 0 [send] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 00 : 0 -> 1 [send] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 01 : 1 -> 0 [receive] via NET/Socket/0
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 01 : 0 -> 1 [receive] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 01 : 0 -> 1 [send] via NET/Socket/0
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 01 : 1 -> 0 [send] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
    managed-worker-l83z:51:57 [0] NCCL INFO comm 0x7fd518002560 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
    managed-worker-jbk7:73:78 [0] NCCL INFO comm 0x7f9be0002560 rank 1 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
    #
    #                                                     out-of-place                       in-place
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    managed-worker-l83z:51:51 [0] NCCL INFO Launch mode Parallel
      1073741824     268435456   float     sum   539383    1.99    1.99    N/A   553087    1.94    1.94    N/A
    managed-worker-l83z:51:51 [0] NCCL INFO Destroyed comm 0x7fd518002560 rank 0
    managed-worker-jbk7:73:73 [0] NCCL INFO Destroyed comm 0x7f9be0002560 rank 1
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 1.96602
    #
    
    opened by cswinter 27
  • NCCL didn't print the right log about the connection when GDR is enabled

    NCCL didn't print the right log about the connection when GDR is enabled

    Environment

    • NCCL version 2.5.7+cuda10.0
    • 8 * V100-PCIe per node, a total of 2 nodes

    test command:

    mpirun -np 16 --hostfile ../../hostfile.txt -bind-to none -map-by slot --display-map --mca pml ob1 --mca btl_vader_single_copy_mechanism none --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 --mca btl_tcp_if_exclude lo,docker0 --mca orte_base_help_aggregate 0 --mca btl_openib_receive_queues P,256,256::S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,131072,1024,1008,64 --mca btl openib,self,vader -x NCCL_SOCKET_IFNAME=^lo,docker0 -x NCCL_IB_DISABLE=0 -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=/tmp/debug.log.%h.%p -x NCCL_IB_HCA=mlx5_0:1 -x NCCL_IB_GID_INDEX=3 -x NCCL_NET_GDR_READ=0 ./all_reduce_perf -b 8 -e 128M -f 2
    

    Question: When I switched the env NCCL_NET_GDR_READ from 0 to 1, the nccl-tests showed that the latency was much worse. When NCCL_NET_GDR_READ was 0, the nccl-tests output was:

                                                         out-of-place                       in-place
           size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
            (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    38.87    0.00    0.00  2e-07    36.96    0.00    0.00  2e-07
              16             4   float     sum    36.45    0.00    0.00  2e-07    36.66    0.00    0.00  1e-07
              32             8   float     sum    36.74    0.00    0.00  1e-07    36.71    0.00    0.00  1e-07
              64            16   float     sum    37.62    0.00    0.00  1e-07    37.03    0.00    0.00  1e-07
             128            32   float     sum    38.05    0.00    0.01  1e-07    38.00    0.00    0.01  1e-07
             256            64   float     sum    38.31    0.01    0.01  6e-08    38.73    0.01    0.01  6e-08
             512           128   float     sum    39.79    0.01    0.02  6e-08    39.00    0.01    0.02  6e-08
            1024           256   float     sum    40.40    0.03    0.05  2e-07    39.96    0.03    0.05  2e-07
            2048           512   float     sum    42.57    0.05    0.09  2e-07    42.42    0.05    0.09  2e-07
            4096          1024   float     sum    73.62    0.06    0.10  5e-07    72.72    0.06    0.11  5e-07
            8192          2048   float     sum    81.68    0.10    0.19  5e-07    80.06    0.10    0.19  5e-07
           16384          4096   float     sum    84.74    0.19    0.36  5e-07    83.30    0.20    0.37  5e-07
           32768          8192   float     sum    90.39    0.36    0.68  5e-07    90.26    0.36    0.68  5e-07
           65536         16384   float     sum    104.2    0.63    1.18  5e-07    102.9    0.64    1.19  5e-07
          131072         32768   float     sum    120.0    1.09    2.05  5e-07    118.6    1.11    2.07  5e-07
          262144         65536   float     sum    218.7    1.20    2.25  5e-07    221.3    1.18    2.22  5e-07
          524288        131072   float     sum    356.1    1.47    2.76  5e-07    355.5    1.47    2.77  5e-07
         1048576        262144   float     sum    479.5    2.19    4.10  5e-07    483.1    2.17    4.07  5e-07
         2097152        524288   float     sum    765.7    2.74    5.14  5e-07    764.2    2.74    5.15  5e-07
         4194304       1048576   float     sum   1428.6    2.94    5.50  5e-07   1425.0    2.94    5.52  5e-07
         8388608       2097152   float     sum   2776.9    3.02    5.66  5e-07   2764.4    3.03    5.69  5e-07
        16777216       4194304   float     sum   5475.1    3.06    5.75  5e-07   5490.5    3.06    5.73  5e-07
        33554432       8388608   float     sum    10886    3.08    5.78  5e-07    10876    3.09    5.78  5e-07
        67108864      16777216   float     sum    37080    1.81    3.39  5e-07    75304    0.89    1.67  5e-07
       134217728      33554432   float     sum    72090    1.86    3.49  5e-07    57255    2.34    4.40  5e-07
     Out of bounds values : 0 OK
     Avg bus bandwidth    : 1.92724
    

    But when NCCL_NET_GDR_READ was 1, the nccl-tests output was:

                                                         out-of-place                       in-place
           size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
            (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    43.22    0.00    0.00  2e-07    37.00    0.00    0.00  2e-07
              16             4   float     sum    37.34    0.00    0.00  2e-07    37.79    0.00    0.00  1e-07
              32             8   float     sum    37.33    0.00    0.00  1e-07    37.20    0.00    0.00  1e-07
              64            16   float     sum    37.89    0.00    0.00  1e-07    37.73    0.00    0.00  1e-07
             128            32   float     sum    38.61    0.00    0.01  1e-07    38.53    0.00    0.01  1e-07
             256            64   float     sum    43.42    0.01    0.01  6e-08    39.17    0.01    0.01  6e-08
             512           128   float     sum    40.46    0.01    0.02  6e-08    40.32    0.01    0.02  6e-08
            1024           256   float     sum    40.59    0.03    0.05  2e-07    40.28    0.03    0.05  2e-07
            2048           512   float     sum    43.55    0.05    0.09  2e-07    43.05    0.05    0.09  2e-07
            4096          1024   float     sum    73.49    0.06    0.10  5e-07    70.96    0.06    0.11  5e-07
            8192          2048   float     sum    79.89    0.10    0.19  5e-07    79.86    0.10    0.19  5e-07
           16384          4096   float     sum    84.63    0.19    0.36  5e-07    83.82    0.20    0.37  5e-07
           32768          8192   float     sum    93.38    0.35    0.66  5e-07    91.32    0.36    0.67  5e-07
           65536         16384   float     sum    107.4    0.61    1.14  5e-07    104.1    0.63    1.18  5e-07
          131072         32768   float     sum    122.9    1.07    2.00  5e-07    121.7    1.08    2.02  5e-07
          262144         65536   float     sum    225.9    1.16    2.18  5e-07    226.2    1.16    2.17  5e-07
          524288        131072   float     sum    346.8    1.51    2.83  5e-07    345.5    1.52    2.85  5e-07
         1048576        262144   float     sum    428.7    2.45    4.59  5e-07    430.0    2.44    4.57  5e-07
         2097152        524288   float     sum    576.1    3.64    6.83  5e-07    580.9    3.61    6.77  5e-07
         4194304       1048576   float     sum    927.3    4.52    8.48  5e-07    926.1    4.53    8.49  5e-07
         8388608       2097152   float     sum   1678.7    5.00    9.37  5e-07   1683.0    4.98    9.35  5e-07
        16777216       4194304   float     sum   3393.2    4.94    9.27  5e-07   3382.5    4.96    9.30  5e-07
        33554432       8388608   float     sum   7094.9    4.73    8.87  5e-07   7055.8    4.76    8.92  5e-07
        67108864      16777216   float     sum    16353    4.10    7.69  5e-07    16348    4.10    7.70  5e-07
       134217728      33554432   float     sum    32639    4.11    7.71  5e-07    32753    4.10    7.68  5e-07
     Out of bounds values : 0 OK
     Avg bus bandwidth    : 2.89958
    

    If I manually stop the nv_peer_mem service by running the command service nv_peer_mem stop,

    then run the tests with NCCL_NET_GDR_READ=0, the result was:

                                                         out-of-place                       in-place
           size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
            (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    39.78    0.00    0.00  2e-07    38.16    0.00    0.00  2e-07
              16             4   float     sum    37.00    0.00    0.00  2e-07    37.33    0.00    0.00  1e-07
              32             8   float     sum    37.30    0.00    0.00  1e-07    37.08    0.00    0.00  1e-07
              64            16   float     sum    38.21    0.00    0.00  2e-07    38.90    0.00    0.00  2e-07
             128            32   float     sum    38.55    0.00    0.01  2e-07    38.87    0.00    0.01  2e-07
             256            64   float     sum    39.50    0.01    0.01  2e-07    39.42    0.01    0.01  2e-07
             512           128   float     sum    40.47    0.01    0.02  2e-07    39.91    0.01    0.02  2e-07
            1024           256   float     sum    41.05    0.02    0.05  2e-07    41.08    0.02    0.05  2e-07
            2048           512   float     sum    44.04    0.05    0.09  2e-07    43.84    0.05    0.09  2e-07
            4096          1024   float     sum    48.00    0.09    0.16  2e-07    47.30    0.09    0.16  2e-07
            8192          2048   float     sum    52.58    0.16    0.29  2e-07    51.76    0.16    0.30  2e-07
           16384          4096   float     sum    65.36    0.25    0.47  2e-07    64.10    0.26    0.48  2e-07
           32768          8192   float     sum    90.61    0.36    0.68  2e-07    87.10    0.38    0.71  2e-07
           65536         16384   float     sum    133.1    0.49    0.92  2e-07    258.5    0.25    0.48  2e-07
          131072         32768   float     sum    283.5    0.46    0.87  5e-07    277.1    0.47    0.89  5e-07
          262144         65536   float     sum    307.3    0.85    1.60  5e-07    300.6    0.87    1.63  5e-07
          524288        131072   float     sum    350.6    1.50    2.80  5e-07    353.6    1.48    2.78  5e-07
         1048576        262144   float     sum    475.0    2.21    4.14  5e-07    474.2    2.21    4.15  5e-07
         2097152        524288   float     sum    766.7    2.74    5.13  5e-07    762.5    2.75    5.16  5e-07
         4194304       1048576   float     sum   1453.1    2.89    5.41  5e-07   1451.9    2.89    5.42  5e-07
         8388608       2097152   float     sum   2980.8    2.81    5.28  5e-07   2984.1    2.81    5.27  5e-07
        16777216       4194304   float     sum    71226    0.24    0.44  5e-07   5877.2    2.85    5.35  5e-07
        33554432       8388608   float     sum    12570    2.67    5.01  2e-07    12543    2.68    5.02  2e-07
        67108864      16777216   float     sum    97148    0.69    1.30  2e-07    25695    2.61    4.90  2e-07
       134217728      33554432   float     sum    97671    1.37    2.58  2e-07    69526    1.93    3.62  2e-07
     Out of bounds values : 0 OK
     Avg bus bandwidth    : 1.67461
    

    So this shows that GDR did take effect.

    But the NCCL debug log is always [0] NCCL INFO Ring 00 : 15[41000] -> 0[1b000] [receive] via NET/IB/0

    opened by weberxie 25
  • peer mapping resources exhausted for < 8 GPUs

    peer mapping resources exhausted for < 8 GPUs

    I am running a NCCL reduction across multiple GPUs on an Amazon P2 16x instance in a multi-process context (one MPI rank per GPU). When I added small arrays together across 16 workers I got the error "peer mapping resources exhausted". Looking online I determined that perhaps I was limited to 8 GPUs in a group and NCCL wasn't dealing with this limitation internally.

    However, when I reduced between two groups of 8 GPUs using NCCL (by splitting MPI_COMM_WORLD into two separate communicators) and then did a standard MPI reduction in host memory to reduce the remaining two arrays, I got the same error. Same for 7 GPUs. I had to reduce the group size to 4 to get the correct behaviour.

    It seems this is unrelated to the peer ensemble limitation but instead is related to other resources needed for multi-process reductions on a single node.

    Joss Knight

    opened by extabgrad 23
  • GPU occupation during model training

    GPU occupation during model training

    Hi,

    Do you have any profiling result about GPU occupation during training ?

    Because I found that the NCCL communication overhead reached 75%; is that normal?

    (screenshot: check_gpu_utilization)

    Thanks

    opened by elevenxiang 23
  • Point-to-point operations preview

    Point-to-point operations preview

    This is a PR for people to review and provide feedback on the p2p branch (issue #212).

    enhancement 
    opened by sjeaugey 22
  • NCCL segfaults on single node with 10 GPUs

    NCCL segfaults on single node with 10 GPUs

    I was attempting to use distributed TensorFlow when I noticed I could not add the 10th GPU on my node to a distributed strategy... After running nccl-tests, it appears to be an issue with NCCL.

    ~/nccl-tests$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
    # nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
    #
    # Using devices
    #   Rank  0 Pid 226099 on node05-ccncluster device  0 [0x1a] TITAN Xp
    #   Rank  1 Pid 226099 on node05-ccncluster device  1 [0x1b] TITAN Xp
    #   Rank  2 Pid 226099 on node05-ccncluster device  2 [0x1c] TITAN Xp
    #   Rank  3 Pid 226099 on node05-ccncluster device  3 [0x1d] TITAN Xp
    #   Rank  4 Pid 226099 on node05-ccncluster device  4 [0x1e] TITAN Xp
    #   Rank  5 Pid 226099 on node05-ccncluster device  5 [0x3d] TITAN Xp
    #   Rank  6 Pid 226099 on node05-ccncluster device  6 [0x3e] TITAN Xp
    #   Rank  7 Pid 226099 on node05-ccncluster device  7 [0x3f] TITAN Xp
    #
    #                                                     out-of-place                       in-place
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    42.86    0.00    0.00  1e-07    42.51    0.00    0.00  1e-07
              16             4   float     sum    42.46    0.00    0.00  1e-07    43.06    0.00    0.00  1e-07
              32             8   float     sum    42.90    0.00    0.00  6e-08    42.75    0.00    0.00  6e-08
              64            16   float     sum    42.81    0.00    0.00  6e-08    43.06    0.00    0.00  6e-08
             128            32   float     sum    42.81    0.00    0.01  6e-08    42.92    0.00    0.01  6e-08
             256            64   float     sum    43.05    0.01    0.01  3e-08    43.34    0.01    0.01  3e-08
             512           128   float     sum    42.79    0.01    0.02  3e-08    42.65    0.01    0.02  3e-08
            1024           256   float     sum    42.91    0.02    0.04  1e-07    43.00    0.02    0.04  1e-07
            2048           512   float     sum    43.35    0.05    0.08  2e-07    43.25    0.05    0.08  2e-07
            4096          1024   float     sum    43.46    0.09    0.16  2e-07    43.40    0.09    0.17  2e-07
            8192          2048   float     sum    44.38    0.18    0.32  2e-07    43.88    0.19    0.33  2e-07
           16384          4096   float     sum    49.15    0.33    0.58  2e-07    48.86    0.34    0.59  2e-07
           32768          8192   float     sum    72.44    0.45    0.79  2e-07    71.88    0.46    0.80  2e-07
           65536         16384   float     sum    120.5    0.54    0.95  2e-07    121.7    0.54    0.94  2e-07
          131072         32768   float     sum    129.5    1.01    1.77  2e-07    129.5    1.01    1.77  2e-07
          262144         65536   float     sum    157.1    1.67    2.92  2e-07    157.0    1.67    2.92  2e-07
          524288        131072   float     sum    205.4    2.55    4.47  2e-07    205.3    2.55    4.47  2e-07
         1048576        262144   float     sum    305.1    3.44    6.01  2e-07    305.0    3.44    6.02  2e-07
         2097152        524288   float     sum    647.4    3.24    5.67  2e-07    495.1    4.24    7.41  2e-07
         4194304       1048576   float     sum    900.7    4.66    8.15  2e-07    898.9    4.67    8.17  2e-07
         8388608       2097152   float     sum   1735.0    4.83    8.46  2e-07   1718.9    4.88    8.54  2e-07
        16777216       4194304   float     sum   3425.8    4.90    8.57  2e-07   3406.6    4.92    8.62  2e-07
        33554432       8388608   float     sum   6793.3    4.94    8.64  2e-07   6792.5    4.94    8.64  2e-07
        67108864      16777216   float     sum    13579    4.94    8.65  2e-07    13574    4.94    8.65  2e-07
       134217728      33554432   float     sum    27135    4.95    8.66  2e-07    27134    4.95    8.66  2e-07
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 3.0361
    #
    ~/nccl-tests$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 10
    # nThread 1 nGpus 10 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
    #
    # Using devices
    #   Rank  0 Pid 226138 on node05-ccncluster device  0 [0x1a] TITAN Xp
    #   Rank  1 Pid 226138 on node05-ccncluster device  1 [0x1b] TITAN Xp
    #   Rank  2 Pid 226138 on node05-ccncluster device  2 [0x1c] TITAN Xp
    #   Rank  3 Pid 226138 on node05-ccncluster device  3 [0x1d] TITAN Xp
    #   Rank  4 Pid 226138 on node05-ccncluster device  4 [0x1e] TITAN Xp
    #   Rank  5 Pid 226138 on node05-ccncluster device  5 [0x3d] TITAN Xp
    #   Rank  6 Pid 226138 on node05-ccncluster device  6 [0x3e] TITAN Xp
    #   Rank  7 Pid 226138 on node05-ccncluster device  7 [0x3f] TITAN Xp
    #   Rank  8 Pid 226138 on node05-ccncluster device  8 [0x40] TITAN Xp
    #   Rank  9 Pid 226138 on node05-ccncluster device  9 [0x41] TITAN Xp
    Segmentation fault (core dumped)
    
    opened by mjlbach 22
  • Improve warning message about truncated messages

    Improve warning message about truncated messages

    The "truncated message" error is most often triggered by a mismatch in collective size or env settings between ranks.

    For better interpretability, the patch displays hints of cause so that it would be easier for user to debug.

    It also changes the error type from InternalError to InvalidUsage to reflect the above most-likely causes.
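
    As a hedged illustration of the first cause (not code from the patch itself), the sketch below shows the kind of per-rank size mismatch that can surface as a "truncated message" error; the helper name and the counts are hypothetical, and comm, stream and the device buffers are assumed to be set up as usual.

    #include <nccl.h>
    #include <cuda_runtime.h>

    /* Bug on purpose: every rank must pass the same count to a collective.
     * Here rank 0 reduces 1024 elements while all other ranks reduce 2048,
     * so peers receive more data than they expect. */
    ncclResult_t mismatched_allreduce(int rank, float *sendbuff, float *recvbuff,
                                      ncclComm_t comm, cudaStream_t stream) {
      size_t count = (rank == 0) ? 1024 : 2048;  /* should be identical on all ranks */
      return ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);
    }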

    opened by kwen2501 0
  • Memory Leak in 2.10.4 release

    Memory Leak in 2.10.4 release

    We would like some clarification on the memory leak issue that the 2.11 release notes say is fixed. Can you give us some details about the leak? We are noticing two potential memory leaks in the NCCL 2.10 release, both of which seem to be related to communicator initialize + abort (see the sketch after the list below):

    1. The leak seems to be related to NVB for the HCM topology. Setting NCCL_NVB_PRECONNECT=0 seems to fix the problem.
    2. The health check in PyTorch (https://github.com/pytorch/pytorch/pull/67668), which creates a communicator and then aborts it to verify that the host is healthy. It causes a memory regression, so we had to disable it. This one does not seem to be related to HCM (we observe it on NVSwitch hosts).
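
    The sketch below is a hedged reconstruction of that create-then-abort health-check pattern, not the PyTorch code itself: a communicator is initialized and immediately aborted without issuing any collective, so if per-communicator resources are not fully released on abort, every check leaks a little. The function name is hypothetical, and the unique id, rank and rank count are assumed to come from the usual bootstrap (e.g. an MPI broadcast).

    #include <nccl.h>

    ncclResult_t health_check(ncclUniqueId id, int nranks, int rank) {
      ncclComm_t comm;
      ncclResult_t res = ncclCommInitRank(&comm, nranks, id, rank);
      if (res != ncclSuccess) return res;
      /* Tear the communicator down right away; no collectives are launched. */
      return ncclCommAbort(comm);
    }
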
    opened by xw285cornell 6
  • Question: when SR-IOV is enabled on DGX-like GPU servers, does GPUDirect work normally?

    Question: when SR-IOV is enabled on DGX-like GPU servers, does GPUDirect work normally?

    Question: when SR-IOV is enabled on several DGX-like GPU servers, does GPUDirect, including GPUDirect P2P and GPUDirect RDMA, work normally?

    Background: In a Kubernetes cluster, every GPU server has a 1 Gb Ethernet NIC and a 100 Gb Mellanox CX5 NIC. All DGX-like GPU servers are interconnected via Ethernet and a RoCE network; the RoCE network is used for communication between workers in a distributed training job. When P2P is enabled (NCCL_P2P_DISABLE=0), training jobs sometimes hang, but when P2P is disabled (NCCL_P2P_DISABLE=1), training jobs work normally. I don't know why. Does GPUDirect work normally when SR-IOV is enabled?

    Thanks a lot for your time.
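
    Not an answer to the SR-IOV question itself, but a hedged diagnostic sketch that may help narrow it down: the CUDA runtime can be asked directly whether peer access is reported between each GPU pair inside the guest. If pairs that share a PCIe switch or NVLink report 0 here, peer access is being hidden from the guest, which is worth ruling out before looking at NCCL.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
      int n = 0;
      if (cudaGetDeviceCount(&n) != cudaSuccess) return 1;
      for (int a = 0; a < n; a++) {
        for (int b = 0; b < n; b++) {
          if (a == b) continue;
          int ok = 0;
          cudaDeviceCanAccessPeer(&ok, a, b);  /* 1 if GPU a can map GPU b's memory */
          printf("GPU %d -> GPU %d : peer access %s\n", a, b, ok ? "yes" : "no");
        }
      }
      return 0;
    }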

    opened by whisper-wind17 1
  • Add env NCCL_NET_DISABLE_INTRA

    Add env NCCL_NET_DISABLE_INTRA

    Disable the NET transport for intra-node communication by setting the env to 1. It provides an option to error out instead of falling back to NET when the superior intra-node transports (P2P and SHM) are unavailable.
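
    A minimal, hedged usage sketch for the proposed variable (assuming it is read from the environment like other NCCL_* knobs and must be set before the first communicator is created in the process):

    #include <stdlib.h>

    static void request_no_intra_node_net(void) {
      /* Ask NCCL to fail loudly rather than silently use the NET transport
       * between intra-node peers; must run before ncclCommInitRank/InitAll. */
      setenv("NCCL_NET_DISABLE_INTRA", "1", 1);
    }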

    opened by kwen2501 4
  • Does NCCL support multi-NIC on Ethernet?

    Does NCCL support multi-NIC on Ethernet?

    I was testing NCCL between two nodes, each with 4 GPUs and 4 Ethernet NICs.

    But I found that NCCL only makes use of one Ethernet NIC even when I have set NCCL_SOCKET_IFNAME=nic0,nic1,nic2,nic3.

    The NCCL debug info shows that it has detected all 4 of these NICs (NCCL INFO Bootstrap : Using xxx).

    I have seen https://github.com/NVIDIA/nccl/issues/452 and know that NCCL supports multi-NIC over RDMA automatically, so I wonder whether NCCL supports multi-NIC on Ethernet.

    opened by Dounm 9
  • Add logging and basic verification to remote allocator.

    Add logging and basic verification to remote allocator.

    Provides basic protection against spurious connections to the remote allocator service, as seen in #555.

    opened by chr1sj0nes 0
  • QUESTION❓: What is the purpose of NCCL_SPINS_BEFORE_CHECK_ABORT?

    QUESTION❓: What is the purpose of NCCL_SPINS_BEFORE_CHECK_ABORT?

    We noticed the constant #define SPINS_BEFORE_CHECK_ABORT 1000000 and found that it is used in the checkAbort function. It seems that the code spins 1000000 times before reading comm->abortFlag, which I think may affect how long ncclCommAbort() takes to take effect.

    When we change it to a smaller number (e.g. 1000 or 100), everything still works and ncclCommAbort() completes faster.

    ❓ So I wonder what the purpose of SPINS_BEFORE_CHECK_ABORT is, and why this constant is set to 1000000 (which seems very large).
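
    For context, here is a hedged sketch of the general spin-then-check pattern that such a constant controls (not a copy of NCCL's checkAbort, just the idea): the abort flag is relatively expensive to read inside a hot wait loop, so it is only sampled once every SPIN_LIMIT iterations. A larger limit means less overhead in the common case but a longer worst-case delay before an abort is noticed, which matches the behavior described above.

    #include <stdint.h>

    #define SPIN_LIMIT 1000000

    /* Spin until *ready becomes nonzero, checking *abortFlag only once per
     * SPIN_LIMIT iterations. Returns 1 if aborted, 0 if the wait completed. */
    static int wait_or_abort(volatile uint32_t *ready, volatile uint32_t *abortFlag) {
      uint64_t spins = 0;
      while (*ready == 0) {
        if (++spins == SPIN_LIMIT) {
          if (*abortFlag) return 1;
          spins = 0;
        }
      }
      return 0;
    }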

    opened by PeterSH6 4
  • Why is Mellanox NIC bandwidth affected by the "-x HCOLL_MAIN_IB=bond0" parameter when running the mpirun benchmark?

    Why is Mellanox NIC bandwidth affected by the "-x HCOLL_MAIN_IB=bond0" parameter when running the mpirun benchmark?

    We hit a strange issue when benchmarking on A100 systems. The command is as follows:

    mpirun -np 16 -H rdma1:8,rdma2:8 --allow-run-as-root -bind-to none -map-by slot \
      -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=bond0 \
      -x NCCL_IB_GID_INDEX=3 -x NCCL_NET_GDR_LEVEL=0 -x LD_LIBRARY_PATH -x PATH \
      -mca pml ob1 -mca btl_tcp_if_include bond0 -mca btl ^openib \
      python3 /mnt/horovod/examples/tensorflow/tensorflow_synthetic_benchmark.py --model=ResNet50 --batch-size=256

    Before running it, the Mellanox card bandwidth is 92 Gbps, but after running it, the bandwidth degrades to 80 Gbps. Fortunately, if we add one more parameter, "-x HCOLL_MAIN_IB=bond0", the issue disappears. Why?

    Our environment uses NVIDIA Docker, and the CX-5 IB/RoCE cards work in AS mode. The hosts are A100 systems; mpirun --version reports: mpirun (Open MPI) 4.0.4rc3

    opened by guoyaowen30 1
  • Data corruption with NCCL Allgather when one NVLink is down

    Data corruption with NCCL Allgather when one NVLink is down

    We're observing that NCCL all-gather produces corrupted data when one NVLink is down on a GPU and we run the all-gather across multiple nodes.

    Hardware setup: DGX-1-like system, V100 SXM3 + NVSwitch + RDMA, two nodes

    Software: NCCL 2.10.3

    Repro: nccl-tests all gather

    opened by xw285cornell 2
  • Support for multiple communicators

    Support for multiple communicators

    Hi,

    I've read the documentation regarding using multiple communicators (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#using-multiple-nccl-communicators-concurrently) and I'm still a little unclear about what combinations of calls are safe. My understanding is as follows...

    Same comm, same stream:

    ncclAllReduce(..., comm1, stream1);
    ncclAllReduce(..., comm1, stream1);
    

    Totally safe.

    Different comms, same stream:

    ncclAllReduce(..., comm1, stream1);
    ncclAllReduce(..., comm2, stream1);
    

    The documentation implies this is unsafe, though I don't understand why. What is meant by "Operations on different communicators should therefore be used at different epochs"? Is it necessary to perform a cudaStreamSynchronize() between the two calls?

    Same comm, different streams:

    ncclAllReduce(..., comm1, stream1);
    ncclAllReduce(..., comm1, stream2);
    

    Safe, but the operations will be serialized.

    Different comms, different streams:

    ncclAllReduce(..., comm1, stream1);
    ncclAllReduce(..., comm2, stream2);
    

    Can lead to deadlock if the operations use more blocks than are available on each device.

    I would appreciate it if you could confirm / correct the above, and consider making the documentation clearer. Thanks!
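
    Not an authoritative answer, but here is a hedged sketch of one conservative reading of the "different epochs" wording for the different-comms, same-stream case: finish the comm1 collective before the comm2 collective is enqueued on the same stream. The function name and buffers are hypothetical; comm1, comm2 and stream1 are assumed to be created identically and in the same order on all ranks.

    #include <cuda_runtime.h>
    #include <nccl.h>

    ncclResult_t two_comms_one_stream(const float *send, float *recv, size_t count,
                                      ncclComm_t comm1, ncclComm_t comm2,
                                      cudaStream_t stream1) {
      ncclResult_t res = ncclAllReduce(send, recv, count, ncclFloat, ncclSum, comm1, stream1);
      if (res != ncclSuccess) return res;
      /* Epoch boundary: wait for the comm1 collective to complete on this rank
       * before any work on comm2 is enqueued on the same stream. */
      if (cudaStreamSynchronize(stream1) != cudaSuccess) return ncclUnhandledCudaError;
      return ncclAllReduce(send, recv, count, ncclFloat, ncclSum, comm2, stream1);
    }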

    opened by chr1sj0nes 2
Releases(v1.3.4-1)
Owner
NVIDIA Corporation