Optimized primitives for collective multi-GPU communication

Overview

NCCL

Optimized primitives for inter-GPU communication.

Introduction

NCCL (pronounced "Nickel") is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive-based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVSwitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

For more information on NCCL usage, please refer to the NCCL documentation.

Build

Note: the official and tested builds of NCCL can be downloaded from: https://developer.nvidia.com/nccl. You can skip the following build steps if you choose to use the official builds.

To build the library:

$ cd nccl
$ make -j src.build

If CUDA is not installed in the default /usr/local/cuda path, you can define the CUDA path with:

$ make src.build CUDA_HOME=<path to cuda install>

NCCL will be compiled and installed in build/ unless BUILDDIR is set.
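
For example, to build into a custom directory instead of the default build/ directory (the placeholder path is illustrative):

$ make -j src.build BUILDDIR=<custom build dir>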

By default, NCCL is compiled for all supported architectures. To accelerate the compilation and reduce the binary size, consider redefining NVCC_GENCODE (defined in makefiles/common.mk) to only include the architecture of the target platform:

$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70"
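
Multiple -gencode flags can be combined to target several architectures in a single build, for example Volta and Ampere:

$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80"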

Install

To install NCCL on the system, create a package, then install it as root.

Debian/Ubuntu:

$ # Install tools to create debian packages
$ sudo apt install build-essential devscripts debhelper fakeroot
$ # Build NCCL deb package
$ make pkg.debian.build
$ ls build/pkg/deb/
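
The generated packages can then be installed with dpkg; the exact file names depend on the NCCL and CUDA versions, so the wildcards below are only illustrative:

$ sudo dpkg -i build/pkg/deb/libnccl2_*.deb build/pkg/deb/libnccl-dev_*.deb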

RedHat/CentOS:

$ # Install tools to create rpm packages
$ sudo yum install rpm-build rpmdevtools
$ # Build NCCL rpm package
$ make pkg.redhat.build
$ ls build/pkg/rpm/
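
The generated RPMs can then be installed with rpm; as above, the file names vary by version and the wildcard is only illustrative:

$ sudo rpm -ivh build/pkg/rpm/libnccl*.rpm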

OS-agnostic tarball:

$ make pkg.txz.build
$ ls build/pkg/txz/
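
The tarball can simply be extracted into the desired installation prefix; the file name and target path below are examples:

$ tar xvf build/pkg/txz/nccl_*.txz -C /opt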

Tests

Tests for NCCL are maintained separately at https://github.com/nvidia/nccl-tests.

$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make
$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g <ngpus>
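
To run the tests across multiple processes or nodes, nccl-tests can also be built with MPI support and launched with one rank per GPU; the MPI_HOME path below is only an example:

$ make MPI=1 MPI_HOME=/path/to/mpi
$ mpirun -np <nranks> ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1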

Copyright

All source code and accompanying documentation is copyright (c) 2015-2020, NVIDIA CORPORATION. All rights reserved.

Issues
  • NCCL 2.6.4 makes the system hang.

    NCCL 2.6.4 makes the system hang.

    Linux: Ubuntu 20.04 LTS. GPU driver: latest NVIDIA driver for Linux. CUDA 10.1, cuDNN 7.6.5, NCCL 2.6.4. Hardware: CPU: Intel 9400F, MB: Z370, RAM: 64 GB dual-channel, GPU: two RTX 2080 Ti cards on two PCIe 3.0 x8 slots, with an NVLink bridge between them.

    I ran all of the nccl-tests and NCCL seems to be working. But while each test is running (about 30 minutes per test), the system freezes: I can't switch to the browser or do anything else. I can only move the mouse, but the system doesn't respond to mouse clicks or keyboard input. When the test finishes running, the system goes back to normal and the log prints in the console.

    The log is here:

    #  ./all_reduce_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   3795 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   3795 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             2   float     sum     7.18    0.00    0.00  0e+00     7.02    0.00    0.00  0e+00
              16             4   float     sum     7.00    0.00    0.00  0e+00     7.02    0.00    0.00  0e+00
              32             8   float     sum     7.28    0.00    0.00  0e+00     7.19    0.00    0.00  0e+00
              64            16   float     sum     7.20    0.01    0.01  0e+00     7.05    0.01    0.01  0e+00
             128            32   float     sum     7.30    0.02    0.02  0e+00     7.19    0.02    0.02  0e+00
             256            64   float     sum     7.30    0.04    0.04  0e+00     7.20    0.04    0.04  0e+00
             512           128   float     sum     7.47    0.07    0.07  0e+00     7.12    0.07    0.07  0e+00
            1024           256   float     sum     8.14    0.13    0.13  0e+00     7.92    0.13    0.13  0e+00
            2048           512   float     sum     8.56    0.24    0.24  0e+00     8.43    0.24    0.24  0e+00
            4096          1024   float     sum     9.72    0.42    0.42  0e+00     9.49    0.43    0.43  0e+00
            8192          2048   float     sum    11.99    0.68    0.68  0e+00    11.92    0.69    0.69  0e+00
           16384          4096   float     sum    14.36    1.14    1.14  0e+00    14.21    1.15    1.15  0e+00
           32768          8192   float     sum    16.79    1.95    1.95  0e+00    16.64    1.97    1.97  0e+00
           65536         16384   float     sum    21.14    3.10    3.10  0e+00    20.55    3.19    3.19  0e+00
          131072         32768   float     sum    35.56    3.69    3.69  0e+00    35.43    3.70    3.70  0e+00
          262144         65536   float     sum    41.23    6.36    6.36  0e+00    41.21    6.36    6.36  0e+00
          524288        131072   float     sum    50.66   10.35   10.35  0e+00    50.82   10.32   10.32  0e+00
         1048576        262144   float     sum    72.54   14.45   14.45  0e+00    72.45   14.47   14.47  0e+00
         2097152        524288   float     sum    120.7   17.37   17.37  0e+00    118.4   17.71   17.71  0e+00
         4194304       1048576   float     sum    215.2   19.49   19.49  0e+00    214.7   19.53   19.53  0e+00
         8388608       2097152   float     sum    411.3   20.39   20.39  0e+00    399.1   21.02   21.02  0e+00
        16777216       4194304   float     sum    865.3   19.39   19.39  0e+00    779.6   21.52   21.52  0e+00
        33554432       8388608   float     sum   1547.9   21.68   21.68  0e+00   1699.3   19.75   19.75  0e+00
        67108864      16777216   float     sum   3115.1   21.54   21.54  0e+00   3007.4   22.31   22.31  0e+00
       134217728      33554432   float     sum   5994.3   22.39   22.39  0e+00   5991.9   22.40   22.40  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 7.43886 
    
    ./all_gather_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   9119 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   9119 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                             out-of-place                       in-place          
    #       size         count    type     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             1   float     7.14    0.00    0.00  0e+00     7.06    0.00    0.00  0e+00
              16             2   float     7.03    0.00    0.00  0e+00     7.00    0.00    0.00  0e+00
              32             4   float     6.96    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
              64             8   float     7.10    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
             128            16   float     7.10    0.01    0.01  0e+00     7.14    0.01    0.01  0e+00
             256            32   float     7.18    0.02    0.02  0e+00     7.23    0.02    0.02  0e+00
             512            64   float     7.49    0.03    0.03  0e+00     7.47    0.03    0.03  0e+00
            1024           128   float     7.03    0.07    0.07  0e+00     6.96    0.07    0.07  0e+00
            2048           256   float     6.97    0.15    0.15  0e+00     6.97    0.15    0.15  0e+00
            4096           512   float     7.41    0.28    0.28  0e+00     7.00    0.29    0.29  0e+00
            8192          1024   float     9.59    0.43    0.43  0e+00     8.80    0.47    0.47  0e+00
           16384          2048   float    11.41    0.72    0.72  0e+00    10.78    0.76    0.76  0e+00
           32768          4096   float    13.39    1.22    1.22  0e+00    11.85    1.38    1.38  0e+00
           65536          8192   float    16.57    1.98    1.98  0e+00    13.83    2.37    2.37  0e+00
          131072         16384   float    23.07    2.84    2.84  0e+00    18.39    3.56    3.56  0e+00
          262144         32768   float    31.38    4.18    4.18  0e+00    30.27    4.33    4.33  0e+00
          524288         65536   float    36.00    7.28    7.28  0e+00    35.30    7.43    7.43  0e+00
         1048576        131072   float    47.38   11.06   11.06  0e+00    46.84   11.19   11.19  0e+00
         2097152        262144   float    70.44   14.89   14.89  0e+00    69.77   15.03   15.03  0e+00
         4194304        524288   float    120.1   17.46   17.46  0e+00    115.5   18.16   18.16  0e+00
         8388608       1048576   float    212.5   19.73   19.73  0e+00    210.2   19.95   19.95  0e+00
        16777216       2097152   float    418.5   20.05   20.05  0e+00    414.0   20.26   20.26  0e+00
        33554432       4194304   float    817.8   20.51   20.51  0e+00    785.1   21.37   21.37  0e+00
        67108864       8388608   float   1568.3   21.40   21.40  0e+00   1560.9   21.50   21.50  0e+00
       134217728      16777216   float   3298.6   20.34   20.34  0e+00   3070.3   21.86   21.86  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 6.6972 
    
    ./broadcast_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid  26256 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid  26256 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type    root     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             2   float       0     7.24    0.00    0.00  0e+00     7.50    0.00    0.00  0e+00
              16             4   float       0     8.31    0.00    0.00  0e+00     7.69    0.00    0.00  0e+00
              32             8   float       0     8.15    0.00    0.00  0e+00     8.23    0.00    0.00  0e+00
              64            16   float       0     7.19    0.01    0.01  0e+00     7.13    0.01    0.01  0e+00
             128            32   float       0     7.25    0.02    0.02  0e+00     7.45    0.02    0.02  0e+00
             256            64   float       0     7.08    0.04    0.04  0e+00     7.16    0.04    0.04  0e+00
             512           128   float       0     7.47    0.07    0.07  0e+00     7.39    0.07    0.07  0e+00
            1024           256   float       0     7.19    0.14    0.14  0e+00    32.19    0.03    0.03  0e+00
            2048           512   float       0     7.36    0.28    0.28  0e+00     7.03    0.29    0.29  0e+00
            4096          1024   float       0     7.25    0.57    0.57  0e+00     7.07    0.58    0.58  0e+00
            8192          2048   float       0     9.11    0.90    0.90  0e+00     8.10    1.01    1.01  0e+00
           16384          4096   float       0    10.97    1.49    1.49  0e+00    10.52    1.56    1.56  0e+00
           32768          8192   float       0    13.36    2.45    2.45  0e+00    11.73    2.79    2.79  0e+00
           65536         16384   float       0    17.03    3.85    3.85  0e+00    14.24    4.60    4.60  0e+00
          131072         32768   float       0    22.66    5.78    5.78  0e+00    22.60    5.80    5.80  0e+00
          262144         65536   float       0    28.48    9.21    9.21  0e+00    28.45    9.21    9.21  0e+00
          524288        131072   float       0    40.26   13.02   13.02  0e+00    40.08   13.08   13.08  0e+00
         1048576        262144   float       0    63.48   16.52   16.52  0e+00    63.19   16.59   16.59  0e+00
         2097152        524288   float       0    110.1   19.04   19.04  0e+00    109.3   19.19   19.19  0e+00
         4194304       1048576   float       0    205.7   20.39   20.39  0e+00    237.1   17.69   17.69  0e+00
         8388608       2097152   float       0    425.1   19.73   19.73  0e+00    386.7   21.69   21.69  0e+00
        16777216       4194304   float       0    815.0   20.59   20.59  0e+00    824.0   20.36   20.36  0e+00
        33554432       8388608   float       0   1536.8   21.83   21.83  0e+00   1508.2   22.25   22.25  0e+00
        67108864      16777216   float       0   3139.2   21.38   21.38  0e+00   3124.3   21.48   21.48  0e+00
       134217728      33554432   float       0   6283.5   21.36   21.36  0e+00   5873.1   22.85   22.85  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 7.99748 
    
    $ ./reduce_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   4810 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   4810 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type   redop    root     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             2   float     sum       0     7.16    0.00    0.00  0e+00     7.35    0.00    0.00  0e+00
              16             4   float     sum       0     7.74    0.00    0.00  0e+00     7.67    0.00    0.00  0e+00
              32             8   float     sum       0     7.08    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
              64            16   float     sum       0     7.13    0.01    0.01  0e+00     7.14    0.01    0.01  0e+00
             128            32   float     sum       0     7.15    0.02    0.02  0e+00     7.06    0.02    0.02  0e+00
             256            64   float     sum       0     7.14    0.04    0.04  0e+00     7.12    0.04    0.04  0e+00
             512           128   float     sum       0     7.14    0.07    0.07  0e+00     7.11    0.07    0.07  0e+00
            1024           256   float     sum       0     7.09    0.14    0.14  0e+00     7.09    0.14    0.14  0e+00
            2048           512   float     sum       0     7.11    0.29    0.29  0e+00     7.12    0.29    0.29  0e+00
            4096          1024   float     sum       0     7.28    0.56    0.56  0e+00     7.20    0.57    0.57  0e+00
            8192          2048   float     sum       0     8.72    0.94    0.94  0e+00     8.59    0.95    0.95  0e+00
           16384          4096   float     sum       0    10.80    1.52    1.52  0e+00    10.78    1.52    1.52  0e+00
           32768          8192   float     sum       0    12.89    2.54    2.54  0e+00    12.64    2.59    2.59  0e+00
           65536         16384   float     sum       0    16.42    3.99    3.99  0e+00    15.88    4.13    4.13  0e+00
          131072         32768   float     sum       0    23.17    5.66    5.66  0e+00    23.27    5.63    5.63  0e+00
          262144         65536   float     sum       0    29.13    9.00    9.00  0e+00    28.88    9.08    9.08  0e+00
          524288        131072   float     sum       0    40.93   12.81   12.81  0e+00    40.93   12.81   12.81  0e+00
         1048576        262144   float     sum       0    64.30   16.31   16.31  0e+00    64.25   16.32   16.32  0e+00
         2097152        524288   float     sum       0    110.5   18.98   18.98  0e+00    110.6   18.97   18.97  0e+00
         4194304       1048576   float     sum       0    202.1   20.76   20.76  0e+00    202.1   20.76   20.76  0e+00
         8388608       2097152   float     sum       0    386.5   21.70   21.70  0e+00    386.3   21.71   21.71  0e+00
        16777216       4194304   float     sum       0    752.6   22.29   22.29  0e+00    752.5   22.30   22.30  0e+00
        33554432       8388608   float     sum       0   1485.2   22.59   22.59  0e+00   1529.3   21.94   21.94  0e+00
        67108864      16777216   float     sum       0   2947.4   22.77   22.77  0e+00   2945.2   22.79   22.79  0e+00
       134217728      33554432   float     sum       0   5873.8   22.85   22.85  0e+00   5873.8   22.85   22.85  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 8.22671 
    $ ./reduce_scatter_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   5435 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   5435 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             1   float     sum     7.21    0.00    0.00  0e+00     7.28    0.00    0.00  0e+00
              16             2   float     sum     7.12    0.00    0.00  0e+00     7.18    0.00    0.00  0e+00
              32             4   float     sum     7.14    0.00    0.00  0e+00     7.22    0.00    0.00  0e+00
              64             8   float     sum     7.20    0.00    0.00  0e+00     7.15    0.00    0.00  0e+00
             128            16   float     sum     7.14    0.01    0.01  0e+00     7.12    0.01    0.01  0e+00
             256            32   float     sum     7.16    0.02    0.02  0e+00     7.12    0.02    0.02  0e+00
             512            64   float     sum     7.18    0.04    0.04  0e+00     7.12    0.04    0.04  0e+00
            1024           128   float     sum     7.53    0.07    0.07  0e+00     7.27    0.07    0.07  0e+00
            2048           256   float     sum     7.28    0.14    0.14  0e+00     7.23    0.14    0.14  0e+00
            4096           512   float     sum     7.64    0.27    0.27  0e+00     7.57    0.27    0.27  0e+00
            8192          1024   float     sum     9.35    0.44    0.44  0e+00     9.24    0.44    0.44  0e+00
           16384          2048   float     sum    11.33    0.72    0.72  0e+00    11.23    0.73    0.73  0e+00
           32768          4096   float     sum    12.66    1.29    1.29  0e+00    12.62    1.30    1.30  0e+00
           65536          8192   float     sum    15.39    2.13    2.13  0e+00    15.31    2.14    2.14  0e+00
          131072         16384   float     sum    21.02    3.12    3.12  0e+00    21.35    3.07    3.07  0e+00
          262144         32768   float     sum    32.36    4.05    4.05  0e+00    31.98    4.10    4.10  0e+00
          524288         65536   float     sum    39.63    6.61    6.61  0e+00    39.76    6.59    6.59  0e+00
         1048576        131072   float     sum    57.11    9.18    9.18  0e+00    56.88    9.22    9.22  0e+00
         2097152        262144   float     sum    92.96   11.28   11.28  0e+00    92.54   11.33   11.33  0e+00
         4194304        524288   float     sum    166.4   12.60   12.60  0e+00    165.9   12.64   12.64  0e+00
         8388608       1048576   float     sum    308.5   13.59   13.59  0e+00    504.4    8.32    8.32  0e+00
        16777216       2097152   float     sum   1050.1    7.99    7.99  0e+00    693.5   12.10   12.10  0e+00
        33554432       4194304   float     sum   1533.4   10.94   10.94  0e+00   1414.8   11.86   11.86  0e+00
        67108864       8388608   float     sum   2529.2   13.27   13.27  0e+00   2314.2   14.50   14.50  0e+00
       134217728      16777216   float     sum   5619.2   11.94   11.94  0e+00   4905.4   13.68   13.68  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 4.44552 
    

    Originally I found this issue while training with TensorFlow. I first submitted a bug to TensorFlow; here is the link: https://github.com/tensorflow/tensorflow/issues/40027

    It shows that when I remove the NVLink bridge, the TF code runs well, and when I use the NVLink bridge but not NCCL, the TF code runs well too. But when I use both NCCL and the NVLink bridge, the system halts and I have to reboot.

    opened by AlexWang1900 37
  • NCCL InfiniBand performance

    NCCL InfiniBand performance

    Hi NCCL devs! I have two machines in a cluster communicating over InfiniBand. There is 400 Gb/sec of bandwidth available between the machines (confirmed with ib_send_bw), but:

    1. nccl-tests only achieves about 20 GB/s, roughly half of what I would expect
    2. there is a decent amount of variance

    running broadcast_perf on 2 machines:

    NCCL_DEBUG=INFO mpiexec -f <hosts file> /root/code/nccl-tests/build/broadcast_perf -b 1M -e 2048M -f 2 -g 1 -c 0 -d half
    

    nccl.txt

    This log shows that (1) nccl is getting between about 15 and 20 GB/s in busbw, and (2) the speed isn't monotonic for larger amounts of data and can change significantly across runs.

    Any ideas on what could be going wrong here? I would expect that I should be getting something closer to 45 GB/s and that there would be more consistency across runs.

    env vars:

    NCCL_IB_HCA=^mlx5_2
    NCCL_SOCKET_IFNAME=eth
    

    ibstatus

    Infiniband device 'mlx5_0' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d4:fe72
            base lid:        0x2ed
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    Infiniband device 'mlx5_1' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d4:fe74
            base lid:        0x5b3
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    Infiniband device 'mlx5_2' port 1 status:
            default gid:     0000:0000:0000:0000:0000:0000:0000:0000
            base lid:        0x0
            sm lid:          0x0
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      Ethernet
    
    Infiniband device 'mlx5_3' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d5:04c6
            base lid:        0x2f3
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    Infiniband device 'mlx5_4' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d5:04c8
            base lid:        0x679
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    opened by christopherhesse 34
  • AllReduce hangs

    AllReduce hangs

    My problem was diagnosed in https://github.com/tensorflow/tensorflow/issues/32654 - please find all the info about my environment there.

    Using the master version of nccl. I launch all_reduce_perf and it hangs with 100% volatile GPU usage reported.

    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 4
    # nThread 1 nGpus 4 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
    #
    # Using devices
    #   Rank  0 Pid  15833 on jupyter-vmarkovtsev device  0 [0x02] GeForce GTX 1080 Ti
    #   Rank  1 Pid  15833 on jupyter-vmarkovtsev device  1 [0x03] GeForce GTX 1080 Ti
    #   Rank  2 Pid  15833 on jupyter-vmarkovtsev device  2 [0x82] GeForce GTX 1080 Ti
    #   Rank  3 Pid  15833 on jupyter-vmarkovtsev device  3 [0x83] GeForce GTX 1080 Ti
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Bootstrap : Using [0]eth0:10.2.3.32<0>
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
    
    jupyter-vmarkovtsev:15833:15833 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO NET/Socket : Using [0]eth0:10.2.3.32<0>
    NCCL version 2.4.8+cuda10.0
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO nranks 4
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
    jupyter-vmarkovtsev:15833:15833 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Channel 00 :    0   1   2   3
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
    jupyter-vmarkovtsev:15833:15833 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
    jupyter-vmarkovtsev:15833:15833 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Ring 00 : 3[3] -> 0[0] via direct shared memory
    #
    #                                                     out-of-place                       in-place
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Launch mode Group/CGMD
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    
    jupyter-vmarkovtsev:15833:15833 [0] init.cc:1250 NCCL WARN Mismatched collective detected, please check your collectivecalls at and around rank 3. You can use NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL to see the collective logs
    

    I waited for 10 minutes; no more logs were printed.

    opened by vmarkovtsev 31
  • NCCL INFO NET/Plugin : No plugin found (libnccl-net.so)

    NCCL INFO NET/Plugin : No plugin found (libnccl-net.so)

    We got stuck using the Clara SDK Docker image on Kubeflow with multi-GPU training (/commands/train_2gpu.sh). It just hangs. Not sure if it's a plugin-not-found issue or a hardware config issue on our side. We are using a DGX-1 with Kubernetes / Kubeflow. Please help.

    Requested train epochs: 10; iterations: 158

    2020-06-29 22:20:20.310128: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.

    Requested train epochs: 10; iterations: 158

    2020-06-29 22:20:24.223690: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7

    2020-06-29 22:20:24.816974: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7

    ds-ml-01-0:17085:17354 [0] NCCL INFO Bootstrap : Using [0]eth0:10.233.66.147<0>

    ds-ml-01-0:17085:17354 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

    ds-ml-01-0:17085:17354 [0] NCCL INFO NET/IB : No device found.

    ds-ml-01-0:17085:17354 [0] NCCL INFO NET/Socket : Using [0]eth0:10.233.66.147<0>

    NCCL version 2.4.8+cuda10.1

    ds-ml-01-0:17086:17353 [1] NCCL INFO Bootstrap : Using [0]eth0:10.233.66.147<0>

    ds-ml-01-0:17086:17353 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

    ds-ml-01-0:17086:17353 [1] NCCL INFO NET/IB : No device found.

    ds-ml-01-0:17086:17353 [1] NCCL INFO NET/Socket : Using [0]eth0:10.233.66.147<0>

    ds-ml-01-0:17085:17354 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff

    ds-ml-01-0:17086:17353 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff

    opened by lalithvaka 28
  • Suboptimal performance with TCP over high bandwidth networks

    Suboptimal performance with TCP over high bandwidth networks

    Hi! Many thanks for creating a great framework. NCCL is widely used at our org for scaling the training of ML models and has proved very reliable.

    I am currently trying to figure out how to achieve optimal inter-node performance with NCCL running over TCP on high-bandwidth networks (32 Gbps, 100 Gbps, and higher). Even with large message sizes we have not been able to reliably obtain more than 60% of wire speed over 32 Gbps networks (see below for nccl-tests output). From what I've gathered, NCCL just hasn't been fully optimized for this configuration yet (although I'm still holding out some hope that I'm just doing it wrong 😄).

    I'm prepared to work fulltime for several weeks on lifting any limitations in the current implementation but I could use a few pointers for getting started. Do you have a sense for what the most promising changes might be and how to incorporate them into the codebase? One thing I might want to explore is using multiple threads/TCP streams. But there is still scope to better utilize a single TCP stream as well so maybe there are some simpler optimizations to try first?

    I've been looking into the codebase and there's a number of things that I don't really understand yet:

    • Running nccl-tests all_reduce_perf -w 0 -n 1 seems to spawn a total of 4 allreduce ops according to my TRACE output. I would have expected just 2 (one for in-place, one for out-of-place).
    • I'm not super clear on the control flow/threading model. In my tests NCCL is using exactly two cores; some of the main files of interest seem to be net_socket.cc, net.cc, socket.h, and enqueue.cc, and a lot of cycles are spent polling ncclSocketIrecv/ncclSocketIsend, but I'm still struggling with how everything fits together and exactly where/how the actual network transfers happen.

    Some more details on my setup. My current config consists of two GCE machines with 8xV100, plenty of cores/RAM, and a 32 Gbps network (no RDMA). I get about 28 Gbps bidirectional bandwidth by running one iperf3 server and client on each node (and >30 Gbps with -Z -P4 flags). Anecdotally, more complex setups that include Horovod have occasionally been able to hit 60% of wire speed on 32 Gbps and 50 Gbps networks. In this case, running nccl-tests only yields 16 Gbps:

    $ mpirun --allow-run-as-root -H 10.73.0.52:1,10.73.0.15:1 -np 2 -mca btl_tcp_if_include ens12 -x NCCL_IB_DISABLE=1 -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=ens12 -x NCCL_DEBUG=INFO /nccl-tests/build/all_reduce_perf -b 1G -e 1G -f 2 -g 1 -c 0
    # nThread 1 nGpus 1 minBytes 1073741824 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
    #
    # Using devices
    #   Rank  0 Pid     51 on managed-worker-l83z device  0 [0x00] Tesla V100-SXM2-16GB
    #   Rank  1 Pid     73 on managed-worker-jbk7 device  0 [0x00] Tesla V100-SXM2-16GB
    managed-worker-l83z:51:51 [0] NCCL INFO NET/Socket : Using [0]ens12:10.73.0.52<0>
    managed-worker-l83z:51:51 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
    managed-worker-l83z:51:51 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
    NCCL version 2.4.2+cuda10.0
    managed-worker-jbk7:73:73 [0] NCCL INFO NET/Socket : Using [0]ens12:10.73.0.15<0>
    managed-worker-jbk7:73:73 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
    managed-worker-jbk7:73:73 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
    managed-worker-l83z:51:57 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
    managed-worker-l83z:51:57 [0] NCCL INFO comm 0x7fd518002560 rank 0 nranks 2 cudaDev 0 nvmlDev 0
    managed-worker-jbk7:73:78 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
    managed-worker-jbk7:73:78 [0] NCCL INFO comm 0x7f9be0002560 rank 1 nranks 2 cudaDev 0 nvmlDev 0
    managed-worker-l83z:51:57 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance :  PHB
    managed-worker-jbk7:73:78 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance :  PHB
    managed-worker-l83z:51:57 [0] NCCL INFO Channel 00 :    0   1
    managed-worker-l83z:51:57 [0] NCCL INFO Channel 01 :    0   1
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 00 : 0 -> 1 [receive] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 00 : 1 -> 0 [receive] via NET/Socket/0
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 00 : 1 -> 0 [send] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 00 : 0 -> 1 [send] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 01 : 1 -> 0 [receive] via NET/Socket/0
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 01 : 0 -> 1 [receive] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 01 : 0 -> 1 [send] via NET/Socket/0
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 01 : 1 -> 0 [send] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
    managed-worker-l83z:51:57 [0] NCCL INFO comm 0x7fd518002560 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
    managed-worker-jbk7:73:78 [0] NCCL INFO comm 0x7f9be0002560 rank 1 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
    #
    #                                                     out-of-place                       in-place
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    managed-worker-l83z:51:51 [0] NCCL INFO Launch mode Parallel
      1073741824     268435456   float     sum   539383    1.99    1.99    N/A   553087    1.94    1.94    N/A
    managed-worker-l83z:51:51 [0] NCCL INFO Destroyed comm 0x7fd518002560 rank 0
    managed-worker-jbk7:73:73 [0] NCCL INFO Destroyed comm 0x7f9be0002560 rank 1
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 1.96602
    #
    
    opened by cswinter 27
  • NCCL didn't print the right log about the connection when GDR is enabled

    NCCL didn't print the right log about the connection when GDR is enabled

    Environment

    • NCCL version 2.5.7+cuda10.0
    • 8 * V100-PCIe per node, a total of 2 nodes

    test command:

    mpirun -np 16 --hostfile ../../hostfile.txt -bind-to none -map-by slot --display-map --mca pml ob1 --mca btl_vader_single_copy_mechanism none --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 --mca btl_tcp_if_exclude lo,docker0 --mca orte_base_help_aggregate 0 --mca btl_openib_receive_queues P,256,256::S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,131072,1024,1008,64 --mca btl openib,self,vader -x NCCL_SOCKET_IFNAME=^lo,docker0 -x NCCL_IB_DISABLE=0 -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=/tmp/debug.log.%h.%p -x NCCL_IB_HCA=mlx5_0:1 -x NCCL_IB_GID_INDEX=3 -x NCCL_NET_GDR_READ=0 ./all_reduce_perf -b 8 -e 128M -f 2
    

    Question: When I switched the env NCCL_NET_GDR_READ from 0 to 1, the nccl-tests showed that the latency was much worse. When NCCL_NET_GDR_READ was 0, the nccl-tests output was:

                                                         out-of-place                       in-place
           size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
            (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    38.87    0.00    0.00  2e-07    36.96    0.00    0.00  2e-07
              16             4   float     sum    36.45    0.00    0.00  2e-07    36.66    0.00    0.00  1e-07
              32             8   float     sum    36.74    0.00    0.00  1e-07    36.71    0.00    0.00  1e-07
              64            16   float     sum    37.62    0.00    0.00  1e-07    37.03    0.00    0.00  1e-07
             128            32   float     sum    38.05    0.00    0.01  1e-07    38.00    0.00    0.01  1e-07
             256            64   float     sum    38.31    0.01    0.01  6e-08    38.73    0.01    0.01  6e-08
             512           128   float     sum    39.79    0.01    0.02  6e-08    39.00    0.01    0.02  6e-08
            1024           256   float     sum    40.40    0.03    0.05  2e-07    39.96    0.03    0.05  2e-07
            2048           512   float     sum    42.57    0.05    0.09  2e-07    42.42    0.05    0.09  2e-07
            4096          1024   float     sum    73.62    0.06    0.10  5e-07    72.72    0.06    0.11  5e-07
            8192          2048   float     sum    81.68    0.10    0.19  5e-07    80.06    0.10    0.19  5e-07
           16384          4096   float     sum    84.74    0.19    0.36  5e-07    83.30    0.20    0.37  5e-07
           32768          8192   float     sum    90.39    0.36    0.68  5e-07    90.26    0.36    0.68  5e-07
           65536         16384   float     sum    104.2    0.63    1.18  5e-07    102.9    0.64    1.19  5e-07
          131072         32768   float     sum    120.0    1.09    2.05  5e-07    118.6    1.11    2.07  5e-07
          262144         65536   float     sum    218.7    1.20    2.25  5e-07    221.3    1.18    2.22  5e-07
          524288        131072   float     sum    356.1    1.47    2.76  5e-07    355.5    1.47    2.77  5e-07
         1048576        262144   float     sum    479.5    2.19    4.10  5e-07    483.1    2.17    4.07  5e-07
         2097152        524288   float     sum    765.7    2.74    5.14  5e-07    764.2    2.74    5.15  5e-07
         4194304       1048576   float     sum   1428.6    2.94    5.50  5e-07   1425.0    2.94    5.52  5e-07
         8388608       2097152   float     sum   2776.9    3.02    5.66  5e-07   2764.4    3.03    5.69  5e-07
        16777216       4194304   float     sum   5475.1    3.06    5.75  5e-07   5490.5    3.06    5.73  5e-07
        33554432       8388608   float     sum    10886    3.08    5.78  5e-07    10876    3.09    5.78  5e-07
        67108864      16777216   float     sum    37080    1.81    3.39  5e-07    75304    0.89    1.67  5e-07
       134217728      33554432   float     sum    72090    1.86    3.49  5e-07    57255    2.34    4.40  5e-07
     Out of bounds values : 0 OK
     Avg bus bandwidth    : 1.92724
    

    But when NCCL_NET_GDR_READ was 1, the nccl-tests output was:

                                                         out-of-place                       in-place
           size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
            (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    43.22    0.00    0.00  2e-07    37.00    0.00    0.00  2e-07
              16             4   float     sum    37.34    0.00    0.00  2e-07    37.79    0.00    0.00  1e-07
              32             8   float     sum    37.33    0.00    0.00  1e-07    37.20    0.00    0.00  1e-07
              64            16   float     sum    37.89    0.00    0.00  1e-07    37.73    0.00    0.00  1e-07
             128            32   float     sum    38.61    0.00    0.01  1e-07    38.53    0.00    0.01  1e-07
             256            64   float     sum    43.42    0.01    0.01  6e-08    39.17    0.01    0.01  6e-08
             512           128   float     sum    40.46    0.01    0.02  6e-08    40.32    0.01    0.02  6e-08
            1024           256   float     sum    40.59    0.03    0.05  2e-07    40.28    0.03    0.05  2e-07
            2048           512   float     sum    43.55    0.05    0.09  2e-07    43.05    0.05    0.09  2e-07
            4096          1024   float     sum    73.49    0.06    0.10  5e-07    70.96    0.06    0.11  5e-07
            8192          2048   float     sum    79.89    0.10    0.19  5e-07    79.86    0.10    0.19  5e-07
           16384          4096   float     sum    84.63    0.19    0.36  5e-07    83.82    0.20    0.37  5e-07
           32768          8192   float     sum    93.38    0.35    0.66  5e-07    91.32    0.36    0.67  5e-07
           65536         16384   float     sum    107.4    0.61    1.14  5e-07    104.1    0.63    1.18  5e-07
          131072         32768   float     sum    122.9    1.07    2.00  5e-07    121.7    1.08    2.02  5e-07
          262144         65536   float     sum    225.9    1.16    2.18  5e-07    226.2    1.16    2.17  5e-07
          524288        131072   float     sum    346.8    1.51    2.83  5e-07    345.5    1.52    2.85  5e-07
         1048576        262144   float     sum    428.7    2.45    4.59  5e-07    430.0    2.44    4.57  5e-07
         2097152        524288   float     sum    576.1    3.64    6.83  5e-07    580.9    3.61    6.77  5e-07
         4194304       1048576   float     sum    927.3    4.52    8.48  5e-07    926.1    4.53    8.49  5e-07
         8388608       2097152   float     sum   1678.7    5.00    9.37  5e-07   1683.0    4.98    9.35  5e-07
        16777216       4194304   float     sum   3393.2    4.94    9.27  5e-07   3382.5    4.96    9.30  5e-07
        33554432       8388608   float     sum   7094.9    4.73    8.87  5e-07   7055.8    4.76    8.92  5e-07
        67108864      16777216   float     sum    16353    4.10    7.69  5e-07    16348    4.10    7.70  5e-07
       134217728      33554432   float     sum    32639    4.11    7.71  5e-07    32753    4.10    7.68  5e-07
     Out of bounds values : 0 OK
     Avg bus bandwidth    : 2.89958
    

    If I manually stop the nv_peer_mem service by running the command service nv_peer_mem stop,

    then run the tests with NCCL_NET_GDR_READ=0, the result was:

                                                         out-of-place                       in-place
           size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
            (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    39.78    0.00    0.00  2e-07    38.16    0.00    0.00  2e-07
              16             4   float     sum    37.00    0.00    0.00  2e-07    37.33    0.00    0.00  1e-07
              32             8   float     sum    37.30    0.00    0.00  1e-07    37.08    0.00    0.00  1e-07
              64            16   float     sum    38.21    0.00    0.00  2e-07    38.90    0.00    0.00  2e-07
             128            32   float     sum    38.55    0.00    0.01  2e-07    38.87    0.00    0.01  2e-07
             256            64   float     sum    39.50    0.01    0.01  2e-07    39.42    0.01    0.01  2e-07
             512           128   float     sum    40.47    0.01    0.02  2e-07    39.91    0.01    0.02  2e-07
            1024           256   float     sum    41.05    0.02    0.05  2e-07    41.08    0.02    0.05  2e-07
            2048           512   float     sum    44.04    0.05    0.09  2e-07    43.84    0.05    0.09  2e-07
            4096          1024   float     sum    48.00    0.09    0.16  2e-07    47.30    0.09    0.16  2e-07
            8192          2048   float     sum    52.58    0.16    0.29  2e-07    51.76    0.16    0.30  2e-07
           16384          4096   float     sum    65.36    0.25    0.47  2e-07    64.10    0.26    0.48  2e-07
           32768          8192   float     sum    90.61    0.36    0.68  2e-07    87.10    0.38    0.71  2e-07
           65536         16384   float     sum    133.1    0.49    0.92  2e-07    258.5    0.25    0.48  2e-07
          131072         32768   float     sum    283.5    0.46    0.87  5e-07    277.1    0.47    0.89  5e-07
          262144         65536   float     sum    307.3    0.85    1.60  5e-07    300.6    0.87    1.63  5e-07
          524288        131072   float     sum    350.6    1.50    2.80  5e-07    353.6    1.48    2.78  5e-07
         1048576        262144   float     sum    475.0    2.21    4.14  5e-07    474.2    2.21    4.15  5e-07
         2097152        524288   float     sum    766.7    2.74    5.13  5e-07    762.5    2.75    5.16  5e-07
         4194304       1048576   float     sum   1453.1    2.89    5.41  5e-07   1451.9    2.89    5.42  5e-07
         8388608       2097152   float     sum   2980.8    2.81    5.28  5e-07   2984.1    2.81    5.27  5e-07
        16777216       4194304   float     sum    71226    0.24    0.44  5e-07   5877.2    2.85    5.35  5e-07
        33554432       8388608   float     sum    12570    2.67    5.01  2e-07    12543    2.68    5.02  2e-07
        67108864      16777216   float     sum    97148    0.69    1.30  2e-07    25695    2.61    4.90  2e-07
       134217728      33554432   float     sum    97671    1.37    2.58  2e-07    69526    1.93    3.62  2e-07
     Out of bounds values : 0 OK
     Avg bus bandwidth    : 1.67461
    

    So this shows that GDR did take effect.

    But the NCCL debug log is always [0] NCCL INFO Ring 00 : 15[41000] -> 0[1b000] [receive] via NET/IB/0

    opened by weberxie 25
  • peer mapping resources exhausted for < 8 GPUs

    peer mapping resources exhausted for < 8 GPUs

    I am running a NCCL reduction across multiple GPUs on an Amazon P2 16x instance in a multi-process context (one MPI rank per GPU). When I added small arrays together across 16 workers I got the error "peer mapping resources exhausted". Looking online I determined that perhaps I was limited to 8 GPUs in a group and NCCL wasn't dealing with this limitation internally.

    However, when I reduced between two groups of 8 GPUs using NCCL (by splitting MPI_COMM_WORLD into two separate communicators) and then did a standard MPI reduction in host memory to reduce the remaining two arrays, I got the same error. Same for 7 GPUs. I had to reduce the group size to 4 to get the correct behaviour.

    It seems this is unrelated to the peer ensemble limitation but instead is related to other resources needed for multi-process reductions on a single node.

    Joss Knight

    opened by extabgrad 23
  • GPU occupation during model training

    GPU occupation during model training

    Hi,

    Do you have any profiling result about GPU occupation during training ?

    Because I found that the NCCL communication overhead reached 75%; is that normal?

    (screenshot: check_gpu_utilization)

    Thanks

    opened by elevenxiang 23
  • Point-to-point operations preview

    Point-to-point operations preview

    This is a PR for people to review and provide feedback on the p2p branch (issue #212).

    enhancement 
    opened by sjeaugey 22
  • NCCL segfaults on single node with 10 GPUs

    NCCL segfaults on single node with 10 GPUs

    I was attempting to use distributed TensorFlow when I noticed I could not add the 10th GPU on my node to a distributed strategy... After running nccl-tests, it appears to be an issue with NCCL.

    ~/nccl-tests$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
    # nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
    #
    # Using devices
    #   Rank  0 Pid 226099 on node05-ccncluster device  0 [0x1a] TITAN Xp
    #   Rank  1 Pid 226099 on node05-ccncluster device  1 [0x1b] TITAN Xp
    #   Rank  2 Pid 226099 on node05-ccncluster device  2 [0x1c] TITAN Xp
    #   Rank  3 Pid 226099 on node05-ccncluster device  3 [0x1d] TITAN Xp
    #   Rank  4 Pid 226099 on node05-ccncluster device  4 [0x1e] TITAN Xp
    #   Rank  5 Pid 226099 on node05-ccncluster device  5 [0x3d] TITAN Xp
    #   Rank  6 Pid 226099 on node05-ccncluster device  6 [0x3e] TITAN Xp
    #   Rank  7 Pid 226099 on node05-ccncluster device  7 [0x3f] TITAN Xp
    #
    #                                                     out-of-place                       in-place
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    42.86    0.00    0.00  1e-07    42.51    0.00    0.00  1e-07
              16             4   float     sum    42.46    0.00    0.00  1e-07    43.06    0.00    0.00  1e-07
              32             8   float     sum    42.90    0.00    0.00  6e-08    42.75    0.00    0.00  6e-08
              64            16   float     sum    42.81    0.00    0.00  6e-08    43.06    0.00    0.00  6e-08
             128            32   float     sum    42.81    0.00    0.01  6e-08    42.92    0.00    0.01  6e-08
             256            64   float     sum    43.05    0.01    0.01  3e-08    43.34    0.01    0.01  3e-08
             512           128   float     sum    42.79    0.01    0.02  3e-08    42.65    0.01    0.02  3e-08
            1024           256   float     sum    42.91    0.02    0.04  1e-07    43.00    0.02    0.04  1e-07
            2048           512   float     sum    43.35    0.05    0.08  2e-07    43.25    0.05    0.08  2e-07
            4096          1024   float     sum    43.46    0.09    0.16  2e-07    43.40    0.09    0.17  2e-07
            8192          2048   float     sum    44.38    0.18    0.32  2e-07    43.88    0.19    0.33  2e-07
           16384          4096   float     sum    49.15    0.33    0.58  2e-07    48.86    0.34    0.59  2e-07
           32768          8192   float     sum    72.44    0.45    0.79  2e-07    71.88    0.46    0.80  2e-07
           65536         16384   float     sum    120.5    0.54    0.95  2e-07    121.7    0.54    0.94  2e-07
          131072         32768   float     sum    129.5    1.01    1.77  2e-07    129.5    1.01    1.77  2e-07
          262144         65536   float     sum    157.1    1.67    2.92  2e-07    157.0    1.67    2.92  2e-07
          524288        131072   float     sum    205.4    2.55    4.47  2e-07    205.3    2.55    4.47  2e-07
         1048576        262144   float     sum    305.1    3.44    6.01  2e-07    305.0    3.44    6.02  2e-07
         2097152        524288   float     sum    647.4    3.24    5.67  2e-07    495.1    4.24    7.41  2e-07
         4194304       1048576   float     sum    900.7    4.66    8.15  2e-07    898.9    4.67    8.17  2e-07
         8388608       2097152   float     sum   1735.0    4.83    8.46  2e-07   1718.9    4.88    8.54  2e-07
        16777216       4194304   float     sum   3425.8    4.90    8.57  2e-07   3406.6    4.92    8.62  2e-07
        33554432       8388608   float     sum   6793.3    4.94    8.64  2e-07   6792.5    4.94    8.64  2e-07
        67108864      16777216   float     sum    13579    4.94    8.65  2e-07    13574    4.94    8.65  2e-07
       134217728      33554432   float     sum    27135    4.95    8.66  2e-07    27134    4.95    8.66  2e-07
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 3.0361
    #
    ~/nccl-tests$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 10
    # nThread 1 nGpus 10 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
    #
    # Using devices
    #   Rank  0 Pid 226138 on node05-ccncluster device  0 [0x1a] TITAN Xp
    #   Rank  1 Pid 226138 on node05-ccncluster device  1 [0x1b] TITAN Xp
    #   Rank  2 Pid 226138 on node05-ccncluster device  2 [0x1c] TITAN Xp
    #   Rank  3 Pid 226138 on node05-ccncluster device  3 [0x1d] TITAN Xp
    #   Rank  4 Pid 226138 on node05-ccncluster device  4 [0x1e] TITAN Xp
    #   Rank  5 Pid 226138 on node05-ccncluster device  5 [0x3d] TITAN Xp
    #   Rank  6 Pid 226138 on node05-ccncluster device  6 [0x3e] TITAN Xp
    #   Rank  7 Pid 226138 on node05-ccncluster device  7 [0x3f] TITAN Xp
    #   Rank  8 Pid 226138 on node05-ccncluster device  8 [0x40] TITAN Xp
    #   Rank  9 Pid 226138 on node05-ccncluster device  9 [0x41] TITAN Xp
    Segmentation fault (core dumped)
    
    opened by mjlbach 22
  • Improve warning message about truncated messages

    Improve warning message about truncated messages

    The "truncated message" error is most often triggered by a mismatch in collective size or env settings between ranks.

    For better interpretability, the patch displays hints of cause so that it would be easier for user to debug.

    It also changes the error type from InternalError to InvalidUsage to reflect the above most-likely causes.
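
    As a hedged illustration of the first cause (not code from the patch itself), the sketch below shows the kind of per-rank size mismatch that can surface as a "truncated message" error; the helper name and the counts are hypothetical, and comm, stream and the device buffers are assumed to be set up as usual.

    #include <nccl.h>
    #include <cuda_runtime.h>

    /* Bug on purpose: every rank must pass the same count to a collective.
     * Here rank 0 reduces 1024 elements while all other ranks reduce 2048,
     * so peers receive more data than they expect. */
    ncclResult_t mismatched_allreduce(int rank, float *sendbuff, float *recvbuff,
                                      ncclComm_t comm, cudaStream_t stream) {
      size_t count = (rank == 0) ? 1024 : 2048;  /* should be identical on all ranks */
      return ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);
    }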

    opened by kwen2501 0
  • Memory Leak in 2.10.4 release

    Memory Leak in 2.10.4 release

    We would like some clarification on the memory leak issue that the 2.11 release notes say is fixed. Can you give us some details about the leak? We are noticing two potential memory leaks in the NCCL 2.10 release, both of which seem to be related to communicator initialize + abort (see the sketch after the list below):

    1. The leak seems to be related to NVB for the HCM topology. Setting NCCL_NVB_PRECONNECT=0 seems to fix the problem.
    2. The health check in PyTorch (https://github.com/pytorch/pytorch/pull/67668), which creates a communicator and then aborts it to verify that the host is healthy. It causes a memory regression, so we had to disable it. This one does not seem to be related to HCM (we observe it on NVSwitch hosts).
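
    The sketch below is a hedged reconstruction of that create-then-abort health-check pattern, not the PyTorch code itself: a communicator is initialized and immediately aborted without issuing any collective, so if per-communicator resources are not fully released on abort, every check leaks a little. The function name is hypothetical, and the unique id, rank and rank count are assumed to come from the usual bootstrap (e.g. an MPI broadcast).

    #include <nccl.h>

    ncclResult_t health_check(ncclUniqueId id, int nranks, int rank) {
      ncclComm_t comm;
      ncclResult_t res = ncclCommInitRank(&comm, nranks, id, rank);
      if (res != ncclSuccess) return res;
      /* Tear the communicator down right away; no collectives are launched. */
      return ncclCommAbort(comm);
    }
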
    opened by xw285cornell 6
  • Question: when SR-IOV is enabled on DGX-like GPU servers, does GPUDirect work normally?

    Question: when SR-IOV is enabled on DGX-like GPU servers, does GPUDirect work normally?

    Question: when SR-IOV is enabled on several DGX-like GPU servers, does GPUDirect, including GPUDirect P2P and GPUDirect RDMA, work normally?

    Background: In a Kubernetes cluster, every GPU server has a 1 Gb Ethernet NIC and a 100 Gb Mellanox CX5 NIC. All DGX-like GPU servers are interconnected via Ethernet and a RoCE network; the RoCE network is used for communication between workers in a distributed training job. When P2P is enabled (NCCL_P2P_DISABLE=0), training jobs sometimes hang, but when P2P is disabled (NCCL_P2P_DISABLE=1), training jobs work normally. I don't know why. Does GPUDirect work normally when SR-IOV is enabled?

    Thanks a lot for your time.
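
    Not an answer to the SR-IOV question itself, but a hedged diagnostic sketch that may help narrow it down: the CUDA runtime can be asked directly whether peer access is reported between each GPU pair inside the guest. If pairs that share a PCIe switch or NVLink report 0 here, peer access is being hidden from the guest, which is worth ruling out before looking at NCCL.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
      int n = 0;
      if (cudaGetDeviceCount(&n) != cudaSuccess) return 1;
      for (int a = 0; a < n; a++) {
        for (int b = 0; b < n; b++) {
          if (a == b) continue;
          int ok = 0;
          cudaDeviceCanAccessPeer(&ok, a, b);  /* 1 if GPU a can map GPU b's memory */
          printf("GPU %d -> GPU %d : peer access %s\n", a, b, ok ? "yes" : "no");
        }
      }
      return 0;
    }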

    opened by whisper-wind17 1
  • Add env NCCL_NET_DISABLE_INTRA

    Add env NCCL_NET_DISABLE_INTRA

    Disable the NET transport for intra-node communication by setting the env to 1. It provides an option to error out instead of falling back to NET when the superior intra-node transports (P2P and SHM) are unavailable.
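
    A minimal, hedged usage sketch for the proposed variable (assuming it is read from the environment like other NCCL_* knobs and must be set before the first communicator is created in the process):

    #include <stdlib.h>

    static void request_no_intra_node_net(void) {
      /* Ask NCCL to fail loudly rather than silently use the NET transport
       * between intra-node peers; must run before ncclCommInitRank/InitAll. */
      setenv("NCCL_NET_DISABLE_INTRA", "1", 1);
    }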

    opened by kwen2501 4
  • Does NCCL support multi-NIC on Ethernet?

    Does NCCL support multi-NIC on Ethernet?

    I was testing NCCL between two nodes, each with 4 GPUs and 4 Ethernet NICs.

    But I found that NCCL only makes use of one Ethernet NIC even when I have set NCCL_SOCKET_IFNAME=nic0,nic1,nic2,nic3.

    The NCCL debug info shows that it has detected all 4 of these NICs (NCCL INFO Bootstrap : Using xxx).

    I have seen https://github.com/NVIDIA/nccl/issues/452 and know that NCCL supports multi-NIC over RDMA automatically, so I wonder whether NCCL supports multi-NIC on Ethernet.

    opened by Dounm 9
  • Add logging and basic verification to remote allocator.

    Add logging and basic verification to remote allocator.

    Provides basic protection against spurious connections to the remote allocator service, as seen in #555.

    opened by chr1sj0nes 0
  • QUESTION❓: What is the purpose of NCCL_SPINS_BEFORE_CHECK_ABORT?

    QUESTION❓: What is the purpose of NCCL_SPINS_BEFORE_CHECK_ABORT?

    We noticed the constant #define SPINS_BEFORE_CHECK_ABORT 1000000 and found that it is used in the checkAbort function. It seems that the code spins 1000000 times before reading comm->abortFlag, which I think may affect how long ncclCommAbort() takes to take effect.

    When we change it to a smaller number (e.g. 1000 or 100), everything still works and ncclCommAbort() completes faster.

    ❓ So I wonder what the purpose of SPINS_BEFORE_CHECK_ABORT is, and why this constant is set to 1000000 (which seems very large).
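
    For context, here is a hedged sketch of the general spin-then-check pattern that such a constant controls (not a copy of NCCL's checkAbort, just the idea): the abort flag is relatively expensive to read inside a hot wait loop, so it is only sampled once every SPIN_LIMIT iterations. A larger limit means less overhead in the common case but a longer worst-case delay before an abort is noticed, which matches the behavior described above.

    #include <stdint.h>

    #define SPIN_LIMIT 1000000

    /* Spin until *ready becomes nonzero, checking *abortFlag only once per
     * SPIN_LIMIT iterations. Returns 1 if aborted, 0 if the wait completed. */
    static int wait_or_abort(volatile uint32_t *ready, volatile uint32_t *abortFlag) {
      uint64_t spins = 0;
      while (*ready == 0) {
        if (++spins == SPIN_LIMIT) {
          if (*abortFlag) return 1;
          spins = 0;
        }
      }
      return 0;
    }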

    opened by PeterSH6 4
  • Why is Mellanox NIC bandwidth affected by the "-x HCOLL_MAIN_IB=bond0" parameter when running the mpirun benchmark?

    Why is Mellanox NIC bandwidth affected by the "-x HCOLL_MAIN_IB=bond0" parameter when running the mpirun benchmark?

    We hit a strange issue when benchmarking on A100 systems. The command is as follows:

    mpirun -np 16 -H rdma1:8,rdma2:8 --allow-run-as-root -bind-to none -map-by slot \
      -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=bond0 \
      -x NCCL_IB_GID_INDEX=3 -x NCCL_NET_GDR_LEVEL=0 -x LD_LIBRARY_PATH -x PATH \
      -mca pml ob1 -mca btl_tcp_if_include bond0 -mca btl ^openib \
      python3 /mnt/horovod/examples/tensorflow/tensorflow_synthetic_benchmark.py --model=ResNet50 --batch-size=256

    Before running it, the Mellanox card bandwidth is 92 Gbps, but after running it, the bandwidth degrades to 80 Gbps. Fortunately, if we add one more parameter, "-x HCOLL_MAIN_IB=bond0", the issue disappears. Why?

    Our environment uses NVIDIA Docker, and the CX-5 IB/RoCE cards work in AS mode. The hosts are A100 systems; mpirun --version reports: mpirun (Open MPI) 4.0.4rc3

    opened by guoyaowen30 1
  • Data corruption with NCCL Allgather when one NVLink is down

    Data corruption with NCCL Allgather when one NVLink is down

    We're observing that NCCL all-gather produces corrupted data when one NVLink is down on a GPU and we run the all-gather across multiple nodes.

    Hardware setup: DGX-1-like system, V100 SXM3 + NVSwitch + RDMA, two nodes

    Software: NCCL 2.10.3

    Repro: nccl-tests all gather

    opened by xw285cornell 2
  • Support for multiple communicators

    Support for multiple communicators

    Hi,

    I've read the documentation regarding using multiple communicators (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#using-multiple-nccl-communicators-concurrently) and I'm still a little unclear about what combinations of calls are safe. My understanding is as follows...

    Same comm, same stream:

    ncclAllReduce(..., comm1, stream1);
    ncclAllReduce(..., comm1, stream1);
    

    Totally safe.

    Different comms, same stream:

    ncclAllReduce(..., comm1, stream1);
    ncclAllReduce(..., comm2, stream1);
    

    The documentation implies this is unsafe, though I don't understand why. What is meant by "Operations on different communicators should therefore be used at different epochs"? Is it necessary to perform a cudaStreamSynchronize() between the two calls?

    Same comm, different streams:

    ncclAllReduce(..., comm1, stream1);
    ncclAllReduce(..., comm1, stream2);
    

    Safe, but the operations will be serialized.

    Different comms, different streams:

    ncclAllReduce(..., comm1, stream1);
    ncclAllReduce(..., comm2, stream2);
    

    Can lead to deadlock if the operations use more blocks than are available on each device.

    I would appreciate it if you could confirm / correct the above, and consider making the documentation clearer. Thanks!
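
    Not an authoritative answer, but here is a hedged sketch of one conservative reading of the "different epochs" wording for the different-comms, same-stream case: finish the comm1 collective before the comm2 collective is enqueued on the same stream. The function name and buffers are hypothetical; comm1, comm2 and stream1 are assumed to be created identically and in the same order on all ranks.

    #include <cuda_runtime.h>
    #include <nccl.h>

    ncclResult_t two_comms_one_stream(const float *send, float *recv, size_t count,
                                      ncclComm_t comm1, ncclComm_t comm2,
                                      cudaStream_t stream1) {
      ncclResult_t res = ncclAllReduce(send, recv, count, ncclFloat, ncclSum, comm1, stream1);
      if (res != ncclSuccess) return res;
      /* Epoch boundary: wait for the comm1 collective to complete on this rank
       * before any work on comm2 is enqueued on the same stream. */
      if (cudaStreamSynchronize(stream1) != cudaSuccess) return ncclUnhandledCudaError;
      return ncclAllReduce(send, recv, count, ncclFloat, ncclSum, comm2, stream1);
    }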

    opened by chr1sj0nes 2
Releases(v1.3.4-1)
Owner
NVIDIA Corporation