2 nodes x 2 GPUs per node (4 GPUs total), gpuA40x4 partition, benchmark flags: "-d cuda"

(Performance numbers are similar for slingshot11 with openmpi/4.1.5+cuda under nvhpc.)

Configurations (the results for each benchmark are listed in this order):

1. slingshot10, openmpi/4.1.2, gcc/11.2.0
2. slingshot11, openmpi/4.1.5+cuda, gcc/11.4.0
3. slingshot11, openmpi/5.0.1+cuda, gcc/11.4.0 (use mpirun; srun is not supported)

osu_reduce: slingshot10, openmpi/4.1.2, gcc/11.2.0
# OSU MPI-CUDA Reduce Latency Test v5.9
# Size       Avg Latency(us)
4                     141.83
8                     135.70
16                    133.55
32                    134.55
64                    136.96
128                   142.41
256                   149.63
512                   147.23
1024                  144.80
2048                  153.08
4096                  159.19
8192                  159.99
16384                 166.59
32768                 179.74
65536                 188.19
131072                105.19
262144                218.92
524288                340.70
1048576               726.87
osu_reduce: slingshot11, openmpi/4.1.5+cuda, gcc/11.4.0
# OSU MPI-CUDA Reduce Latency Test v5.9
# Size       Avg Latency(us)
4                      46.61
8                      48.19
16                     47.40
32                     48.73
64                     48.64
128                    50.86
256                    51.29
512                    57.45
1024                   76.34
2048                  116.33
4096                   94.03
8192                   94.33
16384                 185.85
32768                 237.63
65536                  71.93
131072                155.68
262144                489.54
524288                291.04
1048576               923.25
osu_reduce: slingshot11, openmpi/5.0.1+cuda, gcc/11.4.0
# OSU MPI-CUDA Reduce Latency Test v5.9
# Size       Avg Latency(us)
4                      23.22
8                      23.12
16                     23.46
32                     23.33
64                     24.01
128                    26.91
256                    32.98
512                    33.58
1024                   29.93
2048                   86.17
4096                   91.42
8192                   96.20
16384                 104.96
32768                 138.09
65536                 217.69
131072                387.73
262144               1007.76
524288               2227.94
1048576              4584.68
osu_bcast: slingshot10, openmpi/4.1.2, gcc/11.2.0
# OSU MPI-CUDA Broadcast Latency Test v5.9
# Size       Avg Latency(us)
1                      85.52
2                      86.07
4                      86.32
8                      86.11
16                     86.22
32                     86.72
64                     87.37
128                    87.10
256                    87.52
512                    87.79
1024                   87.73
2048                   87.87
4096                   89.30
8192                   89.80
16384                 171.33
32768                 351.89
65536                 705.63
131072                904.49
262144               1117.40
524288               1320.32
1048576               133.31
osu_bcast: slingshot11, openmpi/4.1.5+cuda, gcc/11.4.0
# OSU MPI-CUDA Broadcast Latency Test v5.9
# Size       Avg Latency(us)
1                      89.36
2                      89.29
4                      89.56
8                      89.69
16                     90.72
32                     91.85
64                     91.53
128                   101.95
256                    93.39
512                    95.99
1024                  101.72
2048                  113.58
4096                  143.00
8192                  185.05
16384                 259.32
32768                 391.85
65536                 168.12
131072                233.04
262144                326.11
524288                452.23
1048576               534.99
osu_bcast: slingshot11, openmpi/5.0.1+cuda, gcc/11.4.0
# OSU MPI-CUDA Broadcast Latency Test v5.9
# Size       Avg Latency(us)
1                      46.61
2                      46.58
4                      46.62
8                      46.53
16                     46.65
32                     46.71
64                     46.82
128                    47.03
256                    46.98
512                    48.28
1024                   48.68
2048                  186.13
4096                  233.85
8192                  235.12
16384                 238.00
32768                 285.07
65536                 131.28
131072                220.26
262144                413.38
524288                809.30
1048576              1593.66
osu_alltoallv: slingshot10, openmpi/4.1.2, gcc/11.2.0
# OSU MPI-CUDA All-to-Allv Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                     637.54
2                     638.52
4                     639.18
8                     637.57
16                    635.48
32                    635.14
64                    639.59
128                   643.95
256                   643.28
512                   637.44
1024                  638.55
2048                  638.64
4096                  642.52
8192                  640.43
16384                 805.22
32768                1494.24
65536                2943.63
131072               5846.05
262144              11811.71
524288              23857.81
1048576              1739.03
osu_alltoallv: slingshot11, openmpi/4.1.5+cuda, gcc/11.4.0
# OSU MPI-CUDA All-to-Allv Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                     522.28
2                     521.31
4                     523.75
8                     522.82
16                    522.87
32                    524.22
64                    522.77
128                   526.60
256                   523.93
512                   534.46
1024                  534.16
2048                  476.68
4096                  493.51
8192                  529.08
16384                 720.68
32768                 975.54
65536                 541.50
131072                625.32
262144                856.80
524288               1256.33
1048576              2020.87
osu_alltoallv: slingshot11, openmpi/5.0.1+cuda, gcc/11.4.0
# OSU MPI-CUDA All-to-Allv Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                      74.21
2                      74.00
4                      73.96
8                      73.95
16                     74.18
32                     74.81
64                     75.19
128                    75.67
256                    79.16
512                    88.08
1024                  109.05
2048                  155.38
4096                  190.44
8192                  251.58
16384                 378.31
32768                 654.48
65536                 225.11
131072                406.02
262144                671.12
524288               1020.81
1048576              1844.89
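
The OSU tables above all use the same two-column layout (message size in bytes, average latency in microseconds), so they can be compared with a short script. The following is a minimal sketch and not part of the original benchmark runs: `parse_osu` and `ratio_by_size` are hypothetical helper names, and the sample rows are copied from the first and last osu_alltoallv tables above.

```python
def parse_osu(text):
    """Parse OSU benchmark output into {message_size_bytes: avg_latency_us}."""
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# ..." header lines
        fields = line.split()
        if len(fields) != 2:
            continue  # skip anything that is not a "size latency" row
        try:
            results[int(fields[0])] = float(fields[1])
        except ValueError:
            continue  # skip rows whose fields are not numeric
    return results


def ratio_by_size(baseline, other):
    """Latency ratio baseline/other for every size present in both tables."""
    return {s: baseline[s] / other[s] for s in sorted(baseline) if s in other}


# Rows copied from the first osu_alltoallv table above.
FIRST = """\
# OSU MPI-CUDA All-to-Allv Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                     637.54
1024                  638.55
1048576              1739.03
"""

# Rows copied from the last osu_alltoallv table above.
LAST = """\
# OSU MPI-CUDA All-to-Allv Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                      74.21
1024                  109.05
1048576              1844.89
"""

ratios = ratio_by_size(parse_osu(FIRST), parse_osu(LAST))
for size, ratio in ratios.items():
    print(f"{size:>8} B: first/last latency ratio = {ratio:.2f}")
```

A ratio above 1 means the last table shows lower latency at that message size; a ratio below 1 means the first table does.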

MPI ping-pong test (source: https://github.com/olcf-tutorials/MPI_ping_pong)

Built with gcc and openmpi/4.1.5+cuda:
[arnoldg@dt-login04 cuda-aware-olcf]$ cat compile.sh
nvcc -c ping_pong_cuda_aware.cu 
mpicc -o ping_pong_cuda_aware ping_pong_cuda_aware.o -lcudart -lcuda

### 2 nodes, 1 core per node
[arnoldg@dt-login04 cuda-aware-olcf]$ srun ping_pong
Transfer size (B):          8, Transfer Time (s):     0.000028800, Bandwidth (GB/s):     0.000258699
Transfer size (B):         16, Transfer Time (s):     0.000024511, Bandwidth (GB/s):     0.000607946
Transfer size (B):         32, Transfer Time (s):     0.000024504, Bandwidth (GB/s):     0.001216236
Transfer size (B):         64, Transfer Time (s):     0.000024508, Bandwidth (GB/s):     0.002432084
Transfer size (B):        128, Transfer Time (s):     0.000024502, Bandwidth (GB/s):     0.004865362
Transfer size (B):        256, Transfer Time (s):     0.000024503, Bandwidth (GB/s):     0.009730004
Transfer size (B):        512, Transfer Time (s):     0.000051003, Bandwidth (GB/s):     0.009349196
Transfer size (B):       1024, Transfer Time (s):     0.000046495, Bandwidth (GB/s):     0.020511295
Transfer size (B):       2048, Transfer Time (s):     0.000074002, Bandwidth (GB/s):     0.025774236
Transfer size (B):       4096, Transfer Time (s):     0.000156007, Bandwidth (GB/s):     0.024452162
Transfer size (B):       8192, Transfer Time (s):     0.000212012, Bandwidth (GB/s):     0.035985738
Transfer size (B):      16384, Transfer Time (s):     0.000212993, Bandwidth (GB/s):     0.071639946
Transfer size (B):      32768, Transfer Time (s):     0.000407249, Bandwidth (GB/s):     0.074936002
Transfer size (B):      65536, Transfer Time (s):     0.000064525, Bandwidth (GB/s):     0.945919110
Transfer size (B):     131072, Transfer Time (s):     0.000123387, Bandwidth (GB/s):     0.989332648
Transfer size (B):     262144, Transfer Time (s):     0.000223807, Bandwidth (GB/s):     1.090851978
Transfer size (B):     524288, Transfer Time (s):     0.000369294, Bandwidth (GB/s):     1.322203569
Transfer size (B):    1048576, Transfer Time (s):     0.000613924, Bandwidth (GB/s):     1.590688320
Transfer size (B):    2097152, Transfer Time (s):     0.001097033, Bandwidth (GB/s):     1.780371056
Transfer size (B):    4194304, Transfer Time (s):     0.002173799, Bandwidth (GB/s):     1.796968938
Transfer size (B):    8388608, Transfer Time (s):     0.004155268, Bandwidth (GB/s):     1.880143607
Transfer size (B):   16777216, Transfer Time (s):     0.007972952, Bandwidth (GB/s):     1.959750802
Transfer size (B):   33554432, Transfer Time (s):     0.015780976, Bandwidth (GB/s):     1.980232452
Transfer size (B):   67108864, Transfer Time (s):     0.031320847, Bandwidth (GB/s):     1.995476069
Transfer size (B):  134217728, Transfer Time (s):     0.062092963, Bandwidth (GB/s):     2.013110578
Transfer size (B):  268435456, Transfer Time (s):     0.118608354, Bandwidth (GB/s):     2.107777326
Transfer size (B):  536870912, Transfer Time (s):     0.234608678, Bandwidth (GB/s):     2.131208467
Transfer size (B): 1073741824, Transfer Time (s):     0.443599408, Bandwidth (GB/s):     2.254286145
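
The bandwidth column above is numerically consistent with dividing the transfer size by 2^30 bytes and then by the transfer time, so "GB/s" here behaves like GiB/s. This is inferred from the printed numbers only, not read from the benchmark source. A minimal sketch that checks it, with `bandwidth_gib_per_s` as a hypothetical helper name:

```python
BYTES_PER_GIB = 1 << 30  # assumption inferred from the printed values


def bandwidth_gib_per_s(size_bytes, time_s):
    """Bandwidth in GiB/s for one transfer of size_bytes taking time_s seconds."""
    return size_bytes / BYTES_PER_GIB / time_s


# (size in bytes, transfer time in s, reported bandwidth) copied from the output above.
rows = [
    (1048576, 0.000613924, 1.590688320),
    (1073741824, 0.443599408, 2.254286145),
]

for size, time_s, reported in rows:
    computed = bandwidth_gib_per_s(size, time_s)
    print(f"{size:>10} B: computed {computed:.6f}, reported {reported:.6f}")
```

Small differences in the last digits are expected because the transfer time is printed with limited precision.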

### 2 nodes, 1 core per node, 1 GPU per node
[arnoldg@dt-login04 cuda-aware-olcf]$ srun ping_pong_cuda_aware
Transfer size (B):          8, Transfer Time (s):     0.000048964, Bandwidth (GB/s):     0.000152165
Transfer size (B):         16, Transfer Time (s):     0.000048999, Bandwidth (GB/s):     0.000304111
Transfer size (B):         32, Transfer Time (s):     0.000048998, Bandwidth (GB/s):     0.000608230
Transfer size (B):         64, Transfer Time (s):     0.000049009, Bandwidth (GB/s):     0.001216208
Transfer size (B):        128, Transfer Time (s):     0.000048994, Bandwidth (GB/s):     0.002433118
Transfer size (B):        256, Transfer Time (s):     0.000049003, Bandwidth (GB/s):     0.004865372
Transfer size (B):        512, Transfer Time (s):     0.000050997, Bandwidth (GB/s):     0.009350298
Transfer size (B):       1024, Transfer Time (s):     0.000092998, Bandwidth (GB/s):     0.010254775
Transfer size (B):       2048, Transfer Time (s):     0.000141558, Bandwidth (GB/s):     0.013473972
Transfer size (B):       4096, Transfer Time (s):     0.000209211, Bandwidth (GB/s):     0.018233769
Transfer size (B):       8192, Transfer Time (s):     0.000210205, Bandwidth (GB/s):     0.036294980
Transfer size (B):      16384, Transfer Time (s):     0.000237235, Bandwidth (GB/s):     0.064319354
Transfer size (B):      32768, Transfer Time (s):     0.000449374, Bandwidth (GB/s):     0.067911337
Transfer size (B):      65536, Transfer Time (s):     0.000134989, Bandwidth (GB/s):     0.452148675
Transfer size (B):     131072, Transfer Time (s):     0.000175172, Bandwidth (GB/s):     0.696858934
Transfer size (B):     262144, Transfer Time (s):     0.000263563, Bandwidth (GB/s):     0.926307537
Transfer size (B):     524288, Transfer Time (s):     0.000454662, Bandwidth (GB/s):     1.073943296
Transfer size (B):    1048576, Transfer Time (s):     0.000636923, Bandwidth (GB/s):     1.533251233
Transfer size (B):    2097152, Transfer Time (s):     0.001033314, Bandwidth (GB/s):     1.890156478
Transfer size (B):    4194304, Transfer Time (s):     0.001887483, Bandwidth (GB/s):     2.069555042
Transfer size (B):    8388608, Transfer Time (s):     0.003595216, Bandwidth (GB/s):     2.173026678
Transfer size (B):   16777216, Transfer Time (s):     0.006722890, Bandwidth (GB/s):     2.324149242
Transfer size (B):   33554432, Transfer Time (s):     0.013281675, Bandwidth (GB/s):     2.352865892
Transfer size (B):   67108864, Transfer Time (s):     0.026546471, Bandwidth (GB/s):     2.354361872
Transfer size (B):  134217728, Transfer Time (s):     0.053021420, Bandwidth (GB/s):     2.357537783
Transfer size (B):  268435456, Transfer Time (s):     0.108629334, Bandwidth (GB/s):     2.301404150
Transfer size (B):  536870912, Transfer Time (s):     0.216473238, Bandwidth (GB/s):     2.309754331
Transfer size (B): 1073741824, Transfer Time (s):     0.446161794, Bandwidth (GB/s):     2.241339384
[arnoldg@dt-login04 cuda-aware-olcf]$ 
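
The two ping-pong runs above can be compared size by size. The following is a minimal sketch, not part of the original test: `parse_ping_pong` is a hypothetical helper, the regular expression assumes the exact line format shown above, and the sample lines are copied from the `ping_pong` and `ping_pong_cuda_aware` outputs.

```python
import re

# Matches one output line of the ping-pong program as printed above.
LINE_RE = re.compile(
    r"Transfer size \(B\):\s*(\d+), "
    r"Transfer Time \(s\):\s*([\d.]+), "
    r"Bandwidth \(GB/s\):\s*([\d.]+)"
)


def parse_ping_pong(text):
    """Return {size_bytes: (transfer_time_s, bandwidth)} for every matching line."""
    return {
        int(m.group(1)): (float(m.group(2)), float(m.group(3)))
        for m in LINE_RE.finditer(text)
    }


# Lines copied from the ping_pong run above.
PING_PONG = """\
Transfer size (B):          8, Transfer Time (s):     0.000028800, Bandwidth (GB/s):     0.000258699
Transfer size (B):   16777216, Transfer Time (s):     0.007972952, Bandwidth (GB/s):     1.959750802
Transfer size (B): 1073741824, Transfer Time (s):     0.443599408, Bandwidth (GB/s):     2.254286145
"""

# Lines copied from the ping_pong_cuda_aware run above.
PING_PONG_CUDA_AWARE = """\
Transfer size (B):          8, Transfer Time (s):     0.000048964, Bandwidth (GB/s):     0.000152165
Transfer size (B):   16777216, Transfer Time (s):     0.006722890, Bandwidth (GB/s):     2.324149242
Transfer size (B): 1073741824, Transfer Time (s):     0.446161794, Bandwidth (GB/s):     2.241339384
"""

plain = parse_ping_pong(PING_PONG)
cuda_aware = parse_ping_pong(PING_PONG_CUDA_AWARE)

for size in sorted(plain):
    ratio = cuda_aware[size][1] / plain[size][1]
    print(f"{size:>10} B: ping_pong_cuda_aware / ping_pong bandwidth = {ratio:.2f}")
```

For the three sampled sizes, the CUDA-aware run reports lower bandwidth at 8 B, higher bandwidth at 16 MiB, and roughly equal bandwidth at 1 GiB.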