2 nodes x 2 GPUs per node (4 GPUs total), gpuA40x4 partition, benchmark flags: "-d cuda"

(Performance numbers are similar for slingshot11 with openmpi/4.1.5+cuda under nvhpc.)

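Configurations compared (each benchmark below has three result blocks, one per configuration, in this order):

slingshot10, openmpi/4.1.2, gcc/11.2.0

slingshot11, openmpi/4.1.5+cuda, gcc/11.4.0

slingshot11, openmpi/5.0.1+cuda, gcc/11.4.0 (use mpirun; srun is not supported)
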
osu_reduce
# OSU MPI-CUDA Reduce Latency Test v5.9
# Size       Avg Latency(us)
4                     141.83
8                     135.70
16                    133.55
32                    134.55
64                    136.96
128                   142.41
256                   149.63
512                   147.23
1024                  144.80
2048                  153.08
4096                  159.19
8192                  159.99
16384                 166.59
32768                 179.74
65536                 188.19
131072                105.19
262144                218.92
524288                340.70
1048576               726.87
osu_reduce
# OSU MPI-CUDA Reduce Latency Test v5.9
# Size       Avg Latency(us)
4                      46.61
8                      48.19
16                     47.40
32                     48.73
64                     48.64
128                    50.86
256                    51.29
512                    57.45
1024                   76.34
2048                  116.33
4096                   94.03
8192                   94.33
16384                 185.85
32768                 237.63
65536                  71.93
131072                155.68
262144                489.54
524288                291.04
1048576               923.25
osu_reduce
# OSU MPI-CUDA Reduce Latency Test v5.9
# Size       Avg Latency(us)
4                      23.22
8                      23.12
16                     23.46
32                     23.33
64                     24.01
128                    26.91
256                    32.98
512                    33.58
1024                   29.93
2048                   86.17
4096                   91.42
8192                   96.20
16384                 104.96
32768                 138.09
65536                 217.69
131072                387.73
262144               1007.76
524288               2227.94
1048576              4584.68
osu_bcast
# OSU MPI-CUDA Broadcast Latency Test v5.9
# Size       Avg Latency(us)
1                      85.52
2                      86.07
4                      86.32
8                      86.11
16                     86.22
32                     86.72
64                     87.37
128                    87.10
256                    87.52
512                    87.79
1024                   87.73
2048                   87.87
4096                   89.30
8192                   89.80
16384                 171.33
32768                 351.89
65536                 705.63
131072                904.49
262144               1117.40
524288               1320.32
1048576               133.31
osu_bcast
# OSU MPI-CUDA Broadcast Latency Test v5.9
# Size       Avg Latency(us)
1                      89.36
2                      89.29
4                      89.56
8                      89.69
16                     90.72
32                     91.85
64                     91.53
128                   101.95
256                    93.39
512                    95.99
1024                  101.72
2048                  113.58
4096                  143.00
8192                  185.05
16384                 259.32
32768                 391.85
65536                 168.12
131072                233.04
262144                326.11
524288                452.23
1048576               534.99
osu_bcast
# OSU MPI-CUDA Broadcast Latency Test v5.9
# Size       Avg Latency(us)
1                      46.61
2                      46.58
4                      46.62
8                      46.53
16                     46.65
32                     46.71
64                     46.82
128                    47.03
256                    46.98
512                    48.28
1024                   48.68
2048                  186.13
4096                  233.85
8192                  235.12
16384                 238.00
32768                 285.07
65536                 131.28
131072                220.26
262144                413.38
524288                809.30
1048576              1593.66
osu_alltoallv
# OSU MPI-CUDA All-to-Allv Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                     637.54
2                     638.52
4                     639.18
8                     637.57
16                    635.48
32                    635.14
64                    639.59
128                   643.95
256                   643.28
512                   637.44
1024                  638.55
2048                  638.64
4096                  642.52
8192                  640.43
16384                 805.22
32768                1494.24
65536                2943.63
131072               5846.05
262144              11811.71
524288              23857.81
1048576              1739.03
osu_alltoallv
# OSU MPI-CUDA All-to-Allv Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                     522.28
2                     521.31
4                     523.75
8                     522.82
16                    522.87
32                    524.22
64                    522.77
128                   526.60
256                   523.93
512                   534.46
1024                  534.16
2048                  476.68
4096                  493.51
8192                  529.08
16384                 720.68
32768                 975.54
65536                 541.50
131072                625.32
262144                856.80
524288               1256.33
1048576              2020.87
osu_alltoallv
# OSU MPI-CUDA All-to-Allv Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                      74.21
2                      74.00
4                      73.96
8                      73.95
16                     74.18
32                     74.81
64                     75.19
128                    75.67
256                    79.16
512                    88.08
1024                  109.05
2048                  155.38
4096                  190.44
8192                  251.58
16384                 378.31
32768                 654.48
65536                 225.11
131072                406.02
262144                671.12
524288               1020.81
1048576              1844.89
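
The latency tables above can be compared programmatically. The following is a minimal Python sketch and was not part of the original benchmark runs: `parse_osu` and the variable names are hypothetical helpers, and the sample rows are copied from the first and third osu_alltoallv blocks above.

```python
def parse_osu(text):
    """Return {message size in bytes: avg latency in us} from OSU latency output."""
    results = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly two numeric fields; skip titles and '#' headers.
        if len(fields) != 2 or fields[0].startswith("#"):
            continue
        try:
            results[int(fields[0])] = float(fields[1])
        except ValueError:
            continue
    return results


# Rows copied from the first and third osu_alltoallv blocks above.
first_block = parse_osu("""\
# OSU MPI-CUDA All-to-Allv Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                     637.54
1024                  638.55
1048576              1739.03
""")
third_block = parse_osu("""\
# OSU MPI-CUDA All-to-Allv Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                      74.21
1024                  109.05
1048576              1844.89
""")

# Latency ratio per message size (values above 1 mean the third block is faster).
ratios = {size: first_block[size] / third_block[size] for size in first_block}
for size, ratio in ratios.items():
    print(f"{size:>8d} B  first/third latency ratio: {ratio:.2f}")
```

Note that the 1048576-byte row of the first osu_alltoallv block (1739.03 us) is far below its 524288-byte row (23857.81 us), so ratios at that size should be read with care.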

Slingshot11 comparison of CPU and CUDA-aware ping-pong with openmpi/4.1.5+cuda, using the code from:

https://github.com/olcf-tutorials/MPI_ping_pong

Built with gcc and openmpi/4.1.5+cuda:
[arnoldg@dt-login04 cuda-aware-olcf]$ cat compile.sh
nvcc -c ping_pong_cuda_aware.cu 
mpicc -o ping_pong_cuda_aware ping_pong_cuda_aware.o -lcudart -lcuda

### 2 nodes, 1 core per node
[arnoldg@dt-login04 cuda-aware-olcf]$ srun ping_pong
Transfer size (B):          8, Transfer Time (s):     0.000028800, Bandwidth (GB/s):     0.000258699
Transfer size (B):         16, Transfer Time (s):     0.000024511, Bandwidth (GB/s):     0.000607946
Transfer size (B):         32, Transfer Time (s):     0.000024504, Bandwidth (GB/s):     0.001216236
Transfer size (B):         64, Transfer Time (s):     0.000024508, Bandwidth (GB/s):     0.002432084
Transfer size (B):        128, Transfer Time (s):     0.000024502, Bandwidth (GB/s):     0.004865362
Transfer size (B):        256, Transfer Time (s):     0.000024503, Bandwidth (GB/s):     0.009730004
Transfer size (B):        512, Transfer Time (s):     0.000051003, Bandwidth (GB/s):     0.009349196
Transfer size (B):       1024, Transfer Time (s):     0.000046495, Bandwidth (GB/s):     0.020511295
Transfer size (B):       2048, Transfer Time (s):     0.000074002, Bandwidth (GB/s):     0.025774236
Transfer size (B):       4096, Transfer Time (s):     0.000156007, Bandwidth (GB/s):     0.024452162
Transfer size (B):       8192, Transfer Time (s):     0.000212012, Bandwidth (GB/s):     0.035985738
Transfer size (B):      16384, Transfer Time (s):     0.000212993, Bandwidth (GB/s):     0.071639946
Transfer size (B):      32768, Transfer Time (s):     0.000407249, Bandwidth (GB/s):     0.074936002
Transfer size (B):      65536, Transfer Time (s):     0.000064525, Bandwidth (GB/s):     0.945919110
Transfer size (B):     131072, Transfer Time (s):     0.000123387, Bandwidth (GB/s):     0.989332648
Transfer size (B):     262144, Transfer Time (s):     0.000223807, Bandwidth (GB/s):     1.090851978
Transfer size (B):     524288, Transfer Time (s):     0.000369294, Bandwidth (GB/s):     1.322203569
Transfer size (B):    1048576, Transfer Time (s):     0.000613924, Bandwidth (GB/s):     1.590688320
Transfer size (B):    2097152, Transfer Time (s):     0.001097033, Bandwidth (GB/s):     1.780371056
Transfer size (B):    4194304, Transfer Time (s):     0.002173799, Bandwidth (GB/s):     1.796968938
Transfer size (B):    8388608, Transfer Time (s):     0.004155268, Bandwidth (GB/s):     1.880143607
Transfer size (B):   16777216, Transfer Time (s):     0.007972952, Bandwidth (GB/s):     1.959750802
Transfer size (B):   33554432, Transfer Time (s):     0.015780976, Bandwidth (GB/s):     1.980232452
Transfer size (B):   67108864, Transfer Time (s):     0.031320847, Bandwidth (GB/s):     1.995476069
Transfer size (B):  134217728, Transfer Time (s):     0.062092963, Bandwidth (GB/s):     2.013110578
Transfer size (B):  268435456, Transfer Time (s):     0.118608354, Bandwidth (GB/s):     2.107777326
Transfer size (B):  536870912, Transfer Time (s):     0.234608678, Bandwidth (GB/s):     2.131208467
Transfer size (B): 1073741824, Transfer Time (s):     0.443599408, Bandwidth (GB/s):     2.254286145

### 2 nodes, 1 core per node, 1 gpu per node
[arnoldg@dt-login04 cuda-aware-olcf]$ srun ping_pong_cuda_aware
Transfer size (B):          8, Transfer Time (s):     0.000048964, Bandwidth (GB/s):     0.000152165
Transfer size (B):         16, Transfer Time (s):     0.000048999, Bandwidth (GB/s):     0.000304111
Transfer size (B):         32, Transfer Time (s):     0.000048998, Bandwidth (GB/s):     0.000608230
Transfer size (B):         64, Transfer Time (s):     0.000049009, Bandwidth (GB/s):     0.001216208
Transfer size (B):        128, Transfer Time (s):     0.000048994, Bandwidth (GB/s):     0.002433118
Transfer size (B):        256, Transfer Time (s):     0.000049003, Bandwidth (GB/s):     0.004865372
Transfer size (B):        512, Transfer Time (s):     0.000050997, Bandwidth (GB/s):     0.009350298
Transfer size (B):       1024, Transfer Time (s):     0.000092998, Bandwidth (GB/s):     0.010254775
Transfer size (B):       2048, Transfer Time (s):     0.000141558, Bandwidth (GB/s):     0.013473972
Transfer size (B):       4096, Transfer Time (s):     0.000209211, Bandwidth (GB/s):     0.018233769
Transfer size (B):       8192, Transfer Time (s):     0.000210205, Bandwidth (GB/s):     0.036294980
Transfer size (B):      16384, Transfer Time (s):     0.000237235, Bandwidth (GB/s):     0.064319354
Transfer size (B):      32768, Transfer Time (s):     0.000449374, Bandwidth (GB/s):     0.067911337
Transfer size (B):      65536, Transfer Time (s):     0.000134989, Bandwidth (GB/s):     0.452148675
Transfer size (B):     131072, Transfer Time (s):     0.000175172, Bandwidth (GB/s):     0.696858934
Transfer size (B):     262144, Transfer Time (s):     0.000263563, Bandwidth (GB/s):     0.926307537
Transfer size (B):     524288, Transfer Time (s):     0.000454662, Bandwidth (GB/s):     1.073943296
Transfer size (B):    1048576, Transfer Time (s):     0.000636923, Bandwidth (GB/s):     1.533251233
Transfer size (B):    2097152, Transfer Time (s):     0.001033314, Bandwidth (GB/s):     1.890156478
Transfer size (B):    4194304, Transfer Time (s):     0.001887483, Bandwidth (GB/s):     2.069555042
Transfer size (B):    8388608, Transfer Time (s):     0.003595216, Bandwidth (GB/s):     2.173026678
Transfer size (B):   16777216, Transfer Time (s):     0.006722890, Bandwidth (GB/s):     2.324149242
Transfer size (B):   33554432, Transfer Time (s):     0.013281675, Bandwidth (GB/s):     2.352865892
Transfer size (B):   67108864, Transfer Time (s):     0.026546471, Bandwidth (GB/s):     2.354361872
Transfer size (B):  134217728, Transfer Time (s):     0.053021420, Bandwidth (GB/s):     2.357537783
Transfer size (B):  268435456, Transfer Time (s):     0.108629334, Bandwidth (GB/s):     2.301404150
Transfer size (B):  536870912, Transfer Time (s):     0.216473238, Bandwidth (GB/s):     2.309754331
Transfer size (B): 1073741824, Transfer Time (s):     0.446161794, Bandwidth (GB/s):     2.241339384
[arnoldg@dt-login04 cuda-aware-olcf]$ 
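
The Bandwidth column in the ping_pong output above is consistent with bytes / time / 2^30, i.e. GiB/s rather than 10^9-based GB/s. The Python sketch below rechecks that against a few rows of the CPU run; `bandwidth_gib_per_s` is a hypothetical helper, and the 2^30 divisor is an assumption inferred from the printed numbers, not taken from the benchmark source.

```python
GIB = 2**30  # bytes per GiB; assumed divisor, inferred from the output above


def bandwidth_gib_per_s(size_bytes, transfer_time_s):
    """Bandwidth in GiB/s for one transfer of size_bytes taking transfer_time_s."""
    return size_bytes / transfer_time_s / GIB


# (size in bytes, transfer time in s, reported bandwidth) from the CPU ping_pong run.
samples = [
    (8, 0.000028800, 0.000258699),
    (65536, 0.000064525, 0.945919110),
    (1073741824, 0.443599408, 2.254286145),
]

for size, seconds, reported in samples:
    computed = bandwidth_gib_per_s(size, seconds)
    print(f"{size:>10d} B  computed {computed:.9f}  reported {reported:.9f}")
```

Small differences in the last digits are expected because the printed transfer times are rounded to nine decimals. In decimal units, the 1 GiB row corresponds to about 2.42 GB/s.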

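With a non-CUDA-aware build that stages the data through host memory using traditional CUDA calls, but with our best MPI (openmpi/4.1.6 using FI_PROVIDER=cxi), performance is about 2x better than the openmpi/4.1.5+cuda results above at large message sizes (about 4.6 vs. 2.2 GB/s as reported at 1 GiB), and the gap is larger at small sizes: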

openmpi/4.1.6 with cuda staged version
Transfer size (B):          8, Transfer Time (s):     0.000013760, Bandwidth (GB/s):     0.000541470
Transfer size (B):         16, Transfer Time (s):     0.000015137, Bandwidth (GB/s):     0.000984401
Transfer size (B):         32, Transfer Time (s):     0.000014056, Bandwidth (GB/s):     0.002120306
Transfer size (B):         64, Transfer Time (s):     0.000014179, Bandwidth (GB/s):     0.004203727
Transfer size (B):        128, Transfer Time (s):     0.000014915, Bandwidth (GB/s):     0.007992320
Transfer size (B):        256, Transfer Time (s):     0.000014713, Bandwidth (GB/s):     0.016204301
Transfer size (B):        512, Transfer Time (s):     0.000015178, Bandwidth (GB/s):     0.031416626
Transfer size (B):       1024, Transfer Time (s):     0.000015057, Bandwidth (GB/s):     0.063335754
Transfer size (B):       2048, Transfer Time (s):     0.000015482, Bandwidth (GB/s):     0.123197580
Transfer size (B):       4096, Transfer Time (s):     0.000015829, Bandwidth (GB/s):     0.240994509
Transfer size (B):       8192, Transfer Time (s):     0.000019132, Bandwidth (GB/s):     0.398773506
Transfer size (B):      16384, Transfer Time (s):     0.000025477, Bandwidth (GB/s):     0.598935610
Transfer size (B):      32768, Transfer Time (s):     0.000033831, Bandwidth (GB/s):     0.902050795
Transfer size (B):      65536, Transfer Time (s):     0.000039537, Bandwidth (GB/s):     1.543744663
Transfer size (B):     131072, Transfer Time (s):     0.000058515, Bandwidth (GB/s):     2.086149580
Transfer size (B):     262144, Transfer Time (s):     0.000094330, Bandwidth (GB/s):     2.588149678
Transfer size (B):     524288, Transfer Time (s):     0.000160523, Bandwidth (GB/s):     3.041809385
Transfer size (B):    1048576, Transfer Time (s):     0.000260600, Bandwidth (GB/s):     3.747357543
Transfer size (B):    2097152, Transfer Time (s):     0.000501920, Bandwidth (GB/s):     3.891305984
Transfer size (B):    4194304, Transfer Time (s):     0.000966248, Bandwidth (GB/s):     4.042697467
Transfer size (B):    8388608, Transfer Time (s):     0.001854184, Bandwidth (GB/s):     4.213443167
Transfer size (B):   16777216, Transfer Time (s):     0.003628512, Bandwidth (GB/s):     4.306172912
Transfer size (B):   33554432, Transfer Time (s):     0.007129726, Bandwidth (GB/s):     4.383057527
Transfer size (B):   67108864, Transfer Time (s):     0.013566608, Bandwidth (GB/s):     4.606899369
Transfer size (B):  134217728, Transfer Time (s):     0.026968276, Bandwidth (GB/s):     4.635075703
Transfer size (B):  268435456, Transfer Time (s):     0.055595515, Bandwidth (GB/s):     4.496765646
Transfer size (B):  536870912, Transfer Time (s):     0.108996281, Bandwidth (GB/s):     4.587312487
Transfer size (B): 1073741824, Transfer Time (s):     0.218830109, Bandwidth (GB/s):     4.569755079
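
The "about 2x" figure can be rechecked from the transfer times. The Python sketch below divides the CUDA-aware times (openmpi/4.1.5+cuda) by the staged times (openmpi/4.1.6) for a few message sizes; the `speedup` helper and the choice of rows are illustrative assumptions, and all values are copied from the outputs above.

```python
# Transfer time in seconds per message size in bytes, copied from the outputs above.
cuda_aware = {
    8: 0.000048964,
    32768: 0.000449374,
    1048576: 0.000636923,
    1073741824: 0.446161794,
}
staged = {
    8: 0.000013760,
    32768: 0.000033831,
    1048576: 0.000260600,
    1073741824: 0.218830109,
}


def speedup(baseline, candidate):
    """Return {size: baseline time / candidate time} for sizes present in both."""
    common = sorted(baseline.keys() & candidate.keys())
    return {size: baseline[size] / candidate[size] for size in common}


for size, factor in speedup(cuda_aware, staged).items():
    print(f"{size:>10d} B  staged is {factor:.2f}x faster")
```

For the rows picked here the ratio is roughly 3.6x at 8 B, 13x at 32 KiB, 2.4x at 1 MiB and 2.0x at 1 GiB, so the 2x figure describes the large-message end of the range.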