Environment
- NCCL version 2.5.7+cuda10.0
- 2 nodes, 8 x V100-PCIe GPUs per node (16 GPUs in total)
Test command:
mpirun -np 16 --hostfile ../../hostfile.txt -bind-to none -map-by slot --display-map --mca pml ob1 --mca btl_vader_single_copy_mechanism none --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 --mca btl_tcp_if_exclude lo,docker0 --mca orte_base_help_aggregate 0 --mca btl_openib_receive_queues P,256,256::S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,131072,1024,1008,64 --mca btl openib,self,vader -x NCCL_SOCKET_IFNAME=^lo,docker0 -x NCCL_IB_DISABLE=0 -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=/tmp/debug.log.%h.%p -x NCCL_IB_HCA=mlx5_0:1 -x NCCL_IB_GID_INDEX=3 -x NCCL_NET_GDR_READ=0 ./all_reduce_perf -b 8 -e 128M -f 2
Question:
Changing the environment variable NCCL_NET_GDR_READ between 0 and 1 makes a large difference in the nccl-tests results: the run with NCCL_NET_GDR_READ=0 is much slower at large message sizes.
With NCCL_NET_GDR_READ=0, the nccl-tests output was:
                                              out-of-place                      in-place
     size       count   type  redop     time   algbw   busbw  error     time   algbw   busbw  error
      (B)  (elements)                   (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        8           2  float    sum    38.87    0.00    0.00  2e-07    36.96    0.00    0.00  2e-07
       16           4  float    sum    36.45    0.00    0.00  2e-07    36.66    0.00    0.00  1e-07
       32           8  float    sum    36.74    0.00    0.00  1e-07    36.71    0.00    0.00  1e-07
       64          16  float    sum    37.62    0.00    0.00  1e-07    37.03    0.00    0.00  1e-07
      128          32  float    sum    38.05    0.00    0.01  1e-07    38.00    0.00    0.01  1e-07
      256          64  float    sum    38.31    0.01    0.01  6e-08    38.73    0.01    0.01  6e-08
      512         128  float    sum    39.79    0.01    0.02  6e-08    39.00    0.01    0.02  6e-08
     1024         256  float    sum    40.40    0.03    0.05  2e-07    39.96    0.03    0.05  2e-07
     2048         512  float    sum    42.57    0.05    0.09  2e-07    42.42    0.05    0.09  2e-07
     4096        1024  float    sum    73.62    0.06    0.10  5e-07    72.72    0.06    0.11  5e-07
     8192        2048  float    sum    81.68    0.10    0.19  5e-07    80.06    0.10    0.19  5e-07
    16384        4096  float    sum    84.74    0.19    0.36  5e-07    83.30    0.20    0.37  5e-07
    32768        8192  float    sum    90.39    0.36    0.68  5e-07    90.26    0.36    0.68  5e-07
    65536       16384  float    sum    104.2    0.63    1.18  5e-07    102.9    0.64    1.19  5e-07
   131072       32768  float    sum    120.0    1.09    2.05  5e-07    118.6    1.11    2.07  5e-07
   262144       65536  float    sum    218.7    1.20    2.25  5e-07    221.3    1.18    2.22  5e-07
   524288      131072  float    sum    356.1    1.47    2.76  5e-07    355.5    1.47    2.77  5e-07
  1048576      262144  float    sum    479.5    2.19    4.10  5e-07    483.1    2.17    4.07  5e-07
  2097152      524288  float    sum    765.7    2.74    5.14  5e-07    764.2    2.74    5.15  5e-07
  4194304     1048576  float    sum   1428.6    2.94    5.50  5e-07   1425.0    2.94    5.52  5e-07
  8388608     2097152  float    sum   2776.9    3.02    5.66  5e-07   2764.4    3.03    5.69  5e-07
 16777216     4194304  float    sum   5475.1    3.06    5.75  5e-07   5490.5    3.06    5.73  5e-07
 33554432     8388608  float    sum    10886    3.08    5.78  5e-07    10876    3.09    5.78  5e-07
 67108864    16777216  float    sum    37080    1.81    3.39  5e-07    75304    0.89    1.67  5e-07
134217728    33554432  float    sum    72090    1.86    3.49  5e-07    57255    2.34    4.40  5e-07
Out of bounds values : 0 OK
Avg bus bandwidth : 1.92724
With NCCL_NET_GDR_READ=1, the nccl-tests output was:
                                              out-of-place                      in-place
     size       count   type  redop     time   algbw   busbw  error     time   algbw   busbw  error
      (B)  (elements)                   (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        8           2  float    sum    43.22    0.00    0.00  2e-07    37.00    0.00    0.00  2e-07
       16           4  float    sum    37.34    0.00    0.00  2e-07    37.79    0.00    0.00  1e-07
       32           8  float    sum    37.33    0.00    0.00  1e-07    37.20    0.00    0.00  1e-07
       64          16  float    sum    37.89    0.00    0.00  1e-07    37.73    0.00    0.00  1e-07
      128          32  float    sum    38.61    0.00    0.01  1e-07    38.53    0.00    0.01  1e-07
      256          64  float    sum    43.42    0.01    0.01  6e-08    39.17    0.01    0.01  6e-08
      512         128  float    sum    40.46    0.01    0.02  6e-08    40.32    0.01    0.02  6e-08
     1024         256  float    sum    40.59    0.03    0.05  2e-07    40.28    0.03    0.05  2e-07
     2048         512  float    sum    43.55    0.05    0.09  2e-07    43.05    0.05    0.09  2e-07
     4096        1024  float    sum    73.49    0.06    0.10  5e-07    70.96    0.06    0.11  5e-07
     8192        2048  float    sum    79.89    0.10    0.19  5e-07    79.86    0.10    0.19  5e-07
    16384        4096  float    sum    84.63    0.19    0.36  5e-07    83.82    0.20    0.37  5e-07
    32768        8192  float    sum    93.38    0.35    0.66  5e-07    91.32    0.36    0.67  5e-07
    65536       16384  float    sum    107.4    0.61    1.14  5e-07    104.1    0.63    1.18  5e-07
   131072       32768  float    sum    122.9    1.07    2.00  5e-07    121.7    1.08    2.02  5e-07
   262144       65536  float    sum    225.9    1.16    2.18  5e-07    226.2    1.16    2.17  5e-07
   524288      131072  float    sum    346.8    1.51    2.83  5e-07    345.5    1.52    2.85  5e-07
  1048576      262144  float    sum    428.7    2.45    4.59  5e-07    430.0    2.44    4.57  5e-07
  2097152      524288  float    sum    576.1    3.64    6.83  5e-07    580.9    3.61    6.77  5e-07
  4194304     1048576  float    sum    927.3    4.52    8.48  5e-07    926.1    4.53    8.49  5e-07
  8388608     2097152  float    sum   1678.7    5.00    9.37  5e-07   1683.0    4.98    9.35  5e-07
 16777216     4194304  float    sum   3393.2    4.94    9.27  5e-07   3382.5    4.96    9.30  5e-07
 33554432     8388608  float    sum   7094.9    4.73    8.87  5e-07   7055.8    4.76    8.92  5e-07
 67108864    16777216  float    sum    16353    4.10    7.69  5e-07    16348    4.10    7.70  5e-07
134217728    33554432  float    sum    32639    4.11    7.71  5e-07    32753    4.10    7.68  5e-07
Out of bounds values : 0 OK
Avg bus bandwidth : 2.89958
If I stop the nv_peer_mem service manually by running the command
service nv_peer_mem stop
and then rerun the test with NCCL_NET_GDR_READ=0, the result is:
                                              out-of-place                      in-place
     size       count   type  redop     time   algbw   busbw  error     time   algbw   busbw  error
      (B)  (elements)                   (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        8           2  float    sum    39.78    0.00    0.00  2e-07    38.16    0.00    0.00  2e-07
       16           4  float    sum    37.00    0.00    0.00  2e-07    37.33    0.00    0.00  1e-07
       32           8  float    sum    37.30    0.00    0.00  1e-07    37.08    0.00    0.00  1e-07
       64          16  float    sum    38.21    0.00    0.00  2e-07    38.90    0.00    0.00  2e-07
      128          32  float    sum    38.55    0.00    0.01  2e-07    38.87    0.00    0.01  2e-07
      256          64  float    sum    39.50    0.01    0.01  2e-07    39.42    0.01    0.01  2e-07
      512         128  float    sum    40.47    0.01    0.02  2e-07    39.91    0.01    0.02  2e-07
     1024         256  float    sum    41.05    0.02    0.05  2e-07    41.08    0.02    0.05  2e-07
     2048         512  float    sum    44.04    0.05    0.09  2e-07    43.84    0.05    0.09  2e-07
     4096        1024  float    sum    48.00    0.09    0.16  2e-07    47.30    0.09    0.16  2e-07
     8192        2048  float    sum    52.58    0.16    0.29  2e-07    51.76    0.16    0.30  2e-07
    16384        4096  float    sum    65.36    0.25    0.47  2e-07    64.10    0.26    0.48  2e-07
    32768        8192  float    sum    90.61    0.36    0.68  2e-07    87.10    0.38    0.71  2e-07
    65536       16384  float    sum    133.1    0.49    0.92  2e-07    258.5    0.25    0.48  2e-07
   131072       32768  float    sum    283.5    0.46    0.87  5e-07    277.1    0.47    0.89  5e-07
   262144       65536  float    sum    307.3    0.85    1.60  5e-07    300.6    0.87    1.63  5e-07
   524288      131072  float    sum    350.6    1.50    2.80  5e-07    353.6    1.48    2.78  5e-07
  1048576      262144  float    sum    475.0    2.21    4.14  5e-07    474.2    2.21    4.15  5e-07
  2097152      524288  float    sum    766.7    2.74    5.13  5e-07    762.5    2.75    5.16  5e-07
  4194304     1048576  float    sum   1453.1    2.89    5.41  5e-07   1451.9    2.89    5.42  5e-07
  8388608     2097152  float    sum   2980.8    2.81    5.28  5e-07   2984.1    2.81    5.27  5e-07
 16777216     4194304  float    sum    71226    0.24    0.44  5e-07   5877.2    2.85    5.35  5e-07
 33554432     8388608  float    sum    12570    2.67    5.01  2e-07    12543    2.68    5.02  2e-07
 67108864    16777216  float    sum    97148    0.69    1.30  2e-07    25695    2.61    4.90  2e-07
134217728    33554432  float    sum    97671    1.37    2.58  2e-07    69526    1.93    3.62  2e-07
Out of bounds values : 0 OK
Avg bus bandwidth : 1.67461
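To make the comparison between the three runs easier to read, here is a small Python sketch that recomputes the bandwidth columns from a few out-of-place times copied from the tables above. The run labels and function names are made up for this sketch. The bus bandwidth factor 2(n-1)/n is the one nccl-tests documents for all-reduce, with n = 16 ranks here.

```python
# Out-of-place times in microseconds, copied from the three tables above.
# Keys are message sizes in bytes; the run labels are shorthand for this sketch.
times_us = {
    "gdr_read_0": {1048576: 479.5, 8388608: 2776.9, 33554432: 10886.0},
    "gdr_read_1": {1048576: 428.7, 8388608: 1678.7, 33554432: 7094.9},
    "nv_peer_mem_stopped": {1048576: 475.0, 8388608: 2980.8, 33554432: 12570.0},
}


def algbw_gbs(size_bytes, time_us):
    """Algorithm bandwidth in GB/s: message size divided by elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def busbw_gbs(size_bytes, time_us, nranks=16):
    """Bus bandwidth for all-reduce: algbw scaled by 2(n-1)/n."""
    return algbw_gbs(size_bytes, time_us) * 2 * (nranks - 1) / nranks


for label, run in times_us.items():
    for size, t in run.items():
        print(f"{label:>20} {size:>10} B  busbw={busbw_gbs(size, t):5.2f} GB/s")

# Ratio of elapsed times at 8 MiB between the two NCCL_NET_GDR_READ settings.
speedup_8mib = times_us["gdr_read_0"][8388608] / times_us["gdr_read_1"][8388608]
print(f"GDR_READ=1 vs 0 at 8 MiB: {speedup_8mib:.2f}x")
```

At 8 MiB the run with NCCL_NET_GDR_READ=1 takes roughly 1.65 times less time than the run with NCCL_NET_GDR_READ=0, while the run with nv_peer_mem stopped is close to the NCCL_NET_GDR_READ=0 run at that size.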
So these results suggest that GDR did take effect.
However, the NCCL debug log always shows the same line:
[0] NCCL INFO Ring 00 : 15[41000] -> 0[1b000] [receive] via NET/IB/0
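One way to check the debug logs systematically is to count the NET/IB ring lines and see whether any of them carry a GPUDirect marker. The sketch below assumes that NCCL appends a "/GDRDMA" suffix to the transport name when GPUDirect RDMA is used for a connection; that suffix is my understanding of the log format, not something shown in the logs above, and the function name is made up for this sketch.

```python
import re

# Matches ring setup lines such as the one quoted above. The optional
# "/GDRDMA" suffix is an assumption about how NCCL marks GPUDirect RDMA use.
NET_LINE = re.compile(r"\[(send|receive)\] via NET/IB/(\d+)(/GDRDMA)?")


def summarize_net_lines(lines):
    """Count NET/IB connections with and without the assumed GDRDMA suffix."""
    counts = {"gdrdma": 0, "plain": 0}
    for line in lines:
        match = NET_LINE.search(line)
        if match:
            counts["gdrdma" if match.group(3) else "plain"] += 1
    return counts


# The line from the debug log quoted above.
sample = ["[0] NCCL INFO Ring 00 : 15[41000] -> 0[1b000] [receive] via NET/IB/0"]
print(summarize_net_lines(sample))
```

To run this on real output, pass in the lines of the files written under the NCCL_DEBUG_FILE pattern from the test command (/tmp/debug.log.%h.%p). If every line lands in the "plain" bucket on both the send and receive side, that would match what I see in the log.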