2 nodes x 2 GPUs per node (4 GPUs total), gpuA40x4 partition, benchmark flags: "-d cuda"
(Performance numbers are similar for slingshot11 with openmpi/4.1.5+cuda under nvhpc.)

osu_reduce: OSU MPI-CUDA Reduce Latency Test v5.9, Avg Latency (us)

Size (bytes) | slingshot10, openmpi/4.1.2, gcc/11.2.0 | slingshot11, openmpi/4.1.5+cuda, gcc/11.4.0 | slingshot11, openmpi/5.0.1+cuda, gcc/11.4.0 (use mpirun; srun not supported) |
---|---|---|---|
4 | 141.83 | 46.61 | 23.22 |
8 | 135.70 | 48.19 | 23.12 |
16 | 133.55 | 47.40 | 23.46 |
32 | 134.55 | 48.73 | 23.33 |
64 | 136.96 | 48.64 | 24.01 |
128 | 142.41 | 50.86 | 26.91 |
256 | 149.63 | 51.29 | 32.98 |
512 | 147.23 | 57.45 | 33.58 |
1024 | 144.80 | 76.34 | 29.93 |
2048 | 153.08 | 116.33 | 86.17 |
4096 | 159.19 | 94.03 | 91.42 |
8192 | 159.99 | 94.33 | 96.20 |
16384 | 166.59 | 185.85 | 104.96 |
32768 | 179.74 | 237.63 | 138.09 |
65536 | 188.19 | 71.93 | 217.69 |
131072 | 105.19 | 155.68 | 387.73 |
262144 | 218.92 | 489.54 | 1007.76 |
524288 | 340.70 | 291.04 | 2227.94 |
1048576 | 726.87 | 923.25 | 4584.68 |

osu_bcast: OSU MPI-CUDA Broadcast Latency Test v5.9, Avg Latency (us)

Size (bytes) | slingshot10, openmpi/4.1.2, gcc/11.2.0 | slingshot11, openmpi/4.1.5+cuda, gcc/11.4.0 | slingshot11, openmpi/5.0.1+cuda, gcc/11.4.0 (use mpirun; srun not supported) |
---|---|---|---|
1 | 85.52 | 89.36 | 46.61 |
2 | 86.07 | 89.29 | 46.58 |
4 | 86.32 | 89.56 | 46.62 |
8 | 86.11 | 89.69 | 46.53 |
16 | 86.22 | 90.72 | 46.65 |
32 | 86.72 | 91.85 | 46.71 |
64 | 87.37 | 91.53 | 46.82 |
128 | 87.10 | 101.95 | 47.03 |
256 | 87.52 | 93.39 | 46.98 |
512 | 87.79 | 95.99 | 48.28 |
1024 | 87.73 | 101.72 | 48.68 |
2048 | 87.87 | 113.58 | 186.13 |
4096 | 89.30 | 143.00 | 233.85 |
8192 | 89.80 | 185.05 | 235.12 |
16384 | 171.33 | 259.32 | 238.00 |
32768 | 351.89 | 391.85 | 285.07 |
65536 | 705.63 | 168.12 | 131.28 |
131072 | 904.49 | 233.04 | 220.26 |
262144 | 1117.40 | 326.11 | 413.38 |
524288 | 1320.32 | 452.23 | 809.30 |
1048576 | 133.31 | 534.99 | 1593.66 |

osu_alltoallv: OSU MPI-CUDA All-to-Allv Personalized Exchange Latency Test v5.9, Avg Latency (us)

Size (bytes) | slingshot10, openmpi/4.1.2, gcc/11.2.0 | slingshot11, openmpi/4.1.5+cuda, gcc/11.4.0 | slingshot11, openmpi/5.0.1+cuda, gcc/11.4.0 (use mpirun; srun not supported) |
---|---|---|---|
1 | 637.54 | 522.28 | 74.21 |
2 | 638.52 | 521.31 | 74.00 |
4 | 639.18 | 523.75 | 73.96 |
8 | 637.57 | 522.82 | 73.95 |
16 | 635.48 | 522.87 | 74.18 |
32 | 635.14 | 524.22 | 74.81 |
64 | 639.59 | 522.77 | 75.19 |
128 | 643.95 | 526.60 | 75.67 |
256 | 643.28 | 523.93 | 79.16 |
512 | 637.44 | 534.46 | 88.08 |
1024 | 638.55 | 534.16 | 109.05 |
2048 | 638.64 | 476.68 | 155.38 |
4096 | 642.52 | 493.51 | 190.44 |
8192 | 640.43 | 529.08 | 251.58 |
16384 | 805.22 | 720.68 | 378.31 |
32768 | 1494.24 | 975.54 | 654.48 |
65536 | 2943.63 | 541.50 | 225.11 |
131072 | 5846.05 | 625.32 | 406.02 |
262144 | 11811.71 | 856.80 | 671.12 |
524288 | 23857.81 | 1256.33 | 1020.81 |
1048576 | 1739.03 | 2020.87 | 1844.89 |

Slingshot11: comparison of CPU and CUDA-aware ping-pong with openmpi/4.1.5+cuda:
https://github.com/olcf-tutorials/MPI_ping_pong
gcc and openmpi/4.1.5+cuda

    [arnoldg@dt-login04 cuda-aware-olcf]$ cat compile.sh
    nvcc -c ping_pong_cuda_aware.cu
    mpicc -o ping_pong_cuda_aware ping_pong_cuda_aware.o -lcudart -lcuda

    ### 2 nodes, 1 core per node
    [arnoldg@dt-login04 cuda-aware-olcf]$ srun ping_pong
    Transfer size (B): 8, Transfer Time (s): 0.000028800, Bandwidth (GB/s): 0.000258699
    Transfer size (B): 16, Transfer Time (s): 0.000024511, Bandwidth (GB/s): 0.000607946
    Transfer size (B): 32, Transfer Time (s): 0.000024504, Bandwidth (GB/s): 0.001216236
    Transfer size (B): 64, Transfer Time (s): 0.000024508, Bandwidth (GB/s): 0.002432084
    Transfer size (B): 128, Transfer Time (s): 0.000024502, Bandwidth (GB/s): 0.004865362
    Transfer size (B): 256, Transfer Time (s): 0.000024503, Bandwidth (GB/s): 0.009730004
    Transfer size (B): 512, Transfer Time (s): 0.000051003, Bandwidth (GB/s): 0.009349196
    Transfer size (B): 1024, Transfer Time (s): 0.000046495, Bandwidth (GB/s): 0.020511295
    Transfer size (B): 2048, Transfer Time (s): 0.000074002, Bandwidth (GB/s): 0.025774236
    Transfer size (B): 4096, Transfer Time (s): 0.000156007, Bandwidth (GB/s): 0.024452162
    Transfer size (B): 8192, Transfer Time (s): 0.000212012, Bandwidth (GB/s): 0.035985738
    Transfer size (B): 16384, Transfer Time (s): 0.000212993, Bandwidth (GB/s): 0.071639946
    Transfer size (B): 32768, Transfer Time (s): 0.000407249, Bandwidth (GB/s): 0.074936002
    Transfer size (B): 65536, Transfer Time (s): 0.000064525, Bandwidth (GB/s): 0.945919110
    Transfer size (B): 131072, Transfer Time (s): 0.000123387, Bandwidth (GB/s): 0.989332648
    Transfer size (B): 262144, Transfer Time (s): 0.000223807, Bandwidth (GB/s): 1.090851978
    Transfer size (B): 524288, Transfer Time (s): 0.000369294, Bandwidth (GB/s): 1.322203569
    Transfer size (B): 1048576, Transfer Time (s): 0.000613924, Bandwidth (GB/s): 1.590688320
    Transfer size (B): 2097152, Transfer Time (s): 0.001097033, Bandwidth (GB/s): 1.780371056
    Transfer size (B): 4194304, Transfer Time (s): 0.002173799, Bandwidth (GB/s): 1.796968938
    Transfer size (B): 8388608, Transfer Time (s): 0.004155268, Bandwidth (GB/s): 1.880143607
    Transfer size (B): 16777216, Transfer Time (s): 0.007972952, Bandwidth (GB/s): 1.959750802
    Transfer size (B): 33554432, Transfer Time (s): 0.015780976, Bandwidth (GB/s): 1.980232452
    Transfer size (B): 67108864, Transfer Time (s): 0.031320847, Bandwidth (GB/s): 1.995476069
    Transfer size (B): 134217728, Transfer Time (s): 0.062092963, Bandwidth (GB/s): 2.013110578
    Transfer size (B): 268435456, Transfer Time (s): 0.118608354, Bandwidth (GB/s): 2.107777326
    Transfer size (B): 536870912, Transfer Time (s): 0.234608678, Bandwidth (GB/s): 2.131208467
    Transfer size (B): 1073741824, Transfer Time (s): 0.443599408, Bandwidth (GB/s): 2.254286145

    ### 2 nodes, 1 core per node, 1 gpu per node
    [arnoldg@dt-login04 cuda-aware-olcf]$ srun ping_pong_cuda_aware
    Transfer size (B): 8, Transfer Time (s): 0.000048964, Bandwidth (GB/s): 0.000152165
    Transfer size (B): 16, Transfer Time (s): 0.000048999, Bandwidth (GB/s): 0.000304111
    Transfer size (B): 32, Transfer Time (s): 0.000048998, Bandwidth (GB/s): 0.000608230
    Transfer size (B): 64, Transfer Time (s): 0.000049009, Bandwidth (GB/s): 0.001216208
    Transfer size (B): 128, Transfer Time (s): 0.000048994, Bandwidth (GB/s): 0.002433118
    Transfer size (B): 256, Transfer Time (s): 0.000049003, Bandwidth (GB/s): 0.004865372
    Transfer size (B): 512, Transfer Time (s): 0.000050997, Bandwidth (GB/s): 0.009350298
    Transfer size (B): 1024, Transfer Time (s): 0.000092998, Bandwidth (GB/s): 0.010254775
    Transfer size (B): 2048, Transfer Time (s): 0.000141558, Bandwidth (GB/s): 0.013473972
    Transfer size (B): 4096, Transfer Time (s): 0.000209211, Bandwidth (GB/s): 0.018233769
    Transfer size (B): 8192, Transfer Time (s): 0.000210205, Bandwidth (GB/s): 0.036294980
    Transfer size (B): 16384, Transfer Time (s): 0.000237235, Bandwidth (GB/s): 0.064319354
    Transfer size (B): 32768, Transfer Time (s): 0.000449374, Bandwidth (GB/s): 0.067911337
    Transfer size (B): 65536, Transfer Time (s): 0.000134989, Bandwidth (GB/s): 0.452148675
    Transfer size (B): 131072, Transfer Time (s): 0.000175172, Bandwidth (GB/s): 0.696858934
    Transfer size (B): 262144, Transfer Time (s): 0.000263563, Bandwidth (GB/s): 0.926307537
    Transfer size (B): 524288, Transfer Time (s): 0.000454662, Bandwidth (GB/s): 1.073943296
    Transfer size (B): 1048576, Transfer Time (s): 0.000636923, Bandwidth (GB/s): 1.533251233
    Transfer size (B): 2097152, Transfer Time (s): 0.001033314, Bandwidth (GB/s): 1.890156478
    Transfer size (B): 4194304, Transfer Time (s): 0.001887483, Bandwidth (GB/s): 2.069555042
    Transfer size (B): 8388608, Transfer Time (s): 0.003595216, Bandwidth (GB/s): 2.173026678
    Transfer size (B): 16777216, Transfer Time (s): 0.006722890, Bandwidth (GB/s): 2.324149242
    Transfer size (B): 33554432, Transfer Time (s): 0.013281675, Bandwidth (GB/s): 2.352865892
    Transfer size (B): 67108864, Transfer Time (s): 0.026546471, Bandwidth (GB/s): 2.354361872
    Transfer size (B): 134217728, Transfer Time (s): 0.053021420, Bandwidth (GB/s): 2.357537783
    Transfer size (B): 268435456, Transfer Time (s): 0.108629334, Bandwidth (GB/s): 2.301404150
    Transfer size (B): 536870912, Transfer Time (s): 0.216473238, Bandwidth (GB/s): 2.309754331
    Transfer size (B): 1073741824, Transfer Time (s): 0.446161794, Bandwidth (GB/s): 2.241339384

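
The Bandwidth column in the ping-pong output can be reproduced from the other two columns. The sketch below is a quick arithmetic check, assuming that "GB" in this output means 2^30 bytes; that divisor is inferred from the numbers on this page rather than quoted from the benchmark source, and the function name `bandwidth_gb_per_s` is made up for illustration. The three sample rows are copied from the CPU run above.

```python
import math

# Assumption: the benchmark's "GB" is 2**30 bytes (inferred from the data).
BYTES_PER_GB = 1 << 30


def bandwidth_gb_per_s(size_bytes: int, transfer_time_s: float) -> float:
    """Bandwidth implied by one transfer of size_bytes taking transfer_time_s."""
    return size_bytes / BYTES_PER_GB / transfer_time_s


# (size in bytes, reported transfer time in s, reported bandwidth in GB/s)
sample_rows = [
    (8, 0.000028800, 0.000258699),
    (1048576, 0.000613924, 1.590688320),
    (1073741824, 0.443599408, 2.254286145),
]

for size, time_s, reported in sample_rows:
    computed = bandwidth_gb_per_s(size, time_s)
    # The printed times are rounded to 9 decimals, so allow a small tolerance.
    matches = math.isclose(computed, reported, rel_tol=1e-4)
    print(f"{size:>12d} B  computed={computed:.9f}  reported={reported:.9f}  match={matches}")
```

If the divisor were 10^9 instead, the computed values would come out about 7% higher than the reported ones, so the check distinguishes the two conventions.
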
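Using a non-CUDA-aware build that stages the data through host memory with traditional CUDA calls, but with our best MPI (openmpi/4.1.6 with FI_PROVIDER=cxi), bandwidth at large transfer sizes is about 2x better than in the openmpi/4.1.5+cuda runs above (roughly 4.6 GB/s versus 2.2 to 2.4 GB/s):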
openmpi/4.1.6 with cuda staged version

    Transfer size (B): 8, Transfer Time (s): 0.000013760, Bandwidth (GB/s): 0.000541470
    Transfer size (B): 16, Transfer Time (s): 0.000015137, Bandwidth (GB/s): 0.000984401
    Transfer size (B): 32, Transfer Time (s): 0.000014056, Bandwidth (GB/s): 0.002120306
    Transfer size (B): 64, Transfer Time (s): 0.000014179, Bandwidth (GB/s): 0.004203727
    Transfer size (B): 128, Transfer Time (s): 0.000014915, Bandwidth (GB/s): 0.007992320
    Transfer size (B): 256, Transfer Time (s): 0.000014713, Bandwidth (GB/s): 0.016204301
    Transfer size (B): 512, Transfer Time (s): 0.000015178, Bandwidth (GB/s): 0.031416626
    Transfer size (B): 1024, Transfer Time (s): 0.000015057, Bandwidth (GB/s): 0.063335754
    Transfer size (B): 2048, Transfer Time (s): 0.000015482, Bandwidth (GB/s): 0.123197580
    Transfer size (B): 4096, Transfer Time (s): 0.000015829, Bandwidth (GB/s): 0.240994509
    Transfer size (B): 8192, Transfer Time (s): 0.000019132, Bandwidth (GB/s): 0.398773506
    Transfer size (B): 16384, Transfer Time (s): 0.000025477, Bandwidth (GB/s): 0.598935610
    Transfer size (B): 32768, Transfer Time (s): 0.000033831, Bandwidth (GB/s): 0.902050795
    Transfer size (B): 65536, Transfer Time (s): 0.000039537, Bandwidth (GB/s): 1.543744663
    Transfer size (B): 131072, Transfer Time (s): 0.000058515, Bandwidth (GB/s): 2.086149580
    Transfer size (B): 262144, Transfer Time (s): 0.000094330, Bandwidth (GB/s): 2.588149678
    Transfer size (B): 524288, Transfer Time (s): 0.000160523, Bandwidth (GB/s): 3.041809385
    Transfer size (B): 1048576, Transfer Time (s): 0.000260600, Bandwidth (GB/s): 3.747357543
    Transfer size (B): 2097152, Transfer Time (s): 0.000501920, Bandwidth (GB/s): 3.891305984
    Transfer size (B): 4194304, Transfer Time (s): 0.000966248, Bandwidth (GB/s): 4.042697467
    Transfer size (B): 8388608, Transfer Time (s): 0.001854184, Bandwidth (GB/s): 4.213443167
    Transfer size (B): 16777216, Transfer Time (s): 0.003628512, Bandwidth (GB/s): 4.306172912
    Transfer size (B): 33554432, Transfer Time (s): 0.007129726, Bandwidth (GB/s): 4.383057527
    Transfer size (B): 67108864, Transfer Time (s): 0.013566608, Bandwidth (GB/s): 4.606899369
    Transfer size (B): 134217728, Transfer Time (s): 0.026968276, Bandwidth (GB/s): 4.635075703
    Transfer size (B): 268435456, Transfer Time (s): 0.055595515, Bandwidth (GB/s): 4.496765646
    Transfer size (B): 536870912, Transfer Time (s): 0.108996281, Bandwidth (GB/s): 4.587312487
    Transfer size (B): 1073741824, Transfer Time (s): 0.218830109, Bandwidth (GB/s): 4.569755079
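
To put a number on the staged versus CUDA-aware comparison, the ping-pong output can be parsed and the bandwidths divided size by size. The following is a minimal sketch, not part of the benchmark itself: `parse_ping_pong` and `speedup` are hypothetical helper names, and the sample lines are three rows copied from the CUDA-aware (openmpi/4.1.5+cuda) and staged (openmpi/4.1.6) runs on this page. Because it scans with a regular expression rather than splitting on newlines, it also works on output that has been pasted as one long line.

```python
from __future__ import annotations

import re

# Matches one record of the ping-pong output format shown on this page.
RECORD_RE = re.compile(
    r"Transfer size \(B\):\s*(\d+),\s*"
    r"Transfer Time \(s\):\s*([\d.]+),\s*"
    r"Bandwidth \(GB/s\):\s*([\d.]+)"
)


def parse_ping_pong(text: str) -> dict[int, float]:
    """Map transfer size in bytes to the reported bandwidth in GB/s."""
    return {int(m.group(1)): float(m.group(3)) for m in RECORD_RE.finditer(text)}


def speedup(new: dict[int, float], base: dict[int, float]) -> dict[int, float]:
    """Bandwidth ratio new/base for every size present in both runs."""
    return {size: new[size] / base[size] for size in sorted(new.keys() & base.keys())}


# Three rows from the CUDA-aware run (openmpi/4.1.5+cuda).
cuda_aware = parse_ping_pong("""
Transfer size (B): 8, Transfer Time (s): 0.000048964, Bandwidth (GB/s): 0.000152165
Transfer size (B): 1048576, Transfer Time (s): 0.000636923, Bandwidth (GB/s): 1.533251233
Transfer size (B): 1073741824, Transfer Time (s): 0.446161794, Bandwidth (GB/s): 2.241339384
""")

# The same three sizes from the staged run (openmpi/4.1.6).
staged = parse_ping_pong("""
Transfer size (B): 8, Transfer Time (s): 0.000013760, Bandwidth (GB/s): 0.000541470
Transfer size (B): 1048576, Transfer Time (s): 0.000260600, Bandwidth (GB/s): 3.747357543
Transfer size (B): 1073741824, Transfer Time (s): 0.218830109, Bandwidth (GB/s): 4.569755079
""")

ratios = speedup(staged, cuda_aware)
for size, ratio in ratios.items():
    print(f"{size:>12d} B  staged / CUDA-aware = {ratio:.2f}x")
```

For these rows the ratio is close to 2x at the 1 GiB transfer and larger at the smaller sizes. Note that the two runs also differ in MPI version, so the ratio reflects the whole stack and not only staged versus CUDA-aware transfers.
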