...
The compiler wrappers link in the libmpi_gtl_cuda library, which enables GPU-RDMA with Cray MPI.
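With Cray MPICH, the GPU-aware code paths provided by the GTL library usually also have to be switched on at run time. The environment variable below is standard Cray MPICH usage and is an assumption on our part, not something stated on this page:

```shell
# Assumption: standard Cray MPICH run-time switch (not from this page).
# Enables the GPU-aware (GPU-RDMA) paths provided by the linked
# libmpi_gtl_cuda library; without it, passing GPU buffers to MPI
# calls typically fails at run time.
export MPICH_GPU_SUPPORT_ENABLED=1
```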
GPU direct support
These MPI implementations should be used only when an application needs MPI plus CUDA GPU-direct support. Their pure-MPI performance will be lower than that of the MPI implementations above for small message sizes; for large messages, performance should be close to equivalent to the CPU-only implementations.
openmpi
choose one of:
```
module load gcc openmpi/4.1.5+cuda    # the default gcc/11.4.0
module load nvhpc openmpi/4.1.5+cuda  # will load the openmpi/4.1.5+cuda built with nvhpc compilers
# in testing mode
module load gcc openmpi/5.0.1+cuda    # only mpirun is supported, do not use with srun
```
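After loading one of these modules, you can sanity-check that the Open MPI build is actually CUDA-aware. This sketch uses the standard `ompi_info` tool and its `mpi_built_with_cuda_support` MCA parameter, which are part of Open MPI itself rather than anything specific to this page:

```shell
# Ask the loaded Open MPI build whether it was compiled with CUDA support
ompi_info --parsable --all | grep mpi_built_with_cuda_support
# A CUDA-aware build should print a line ending in :value:true
```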
See also: gpudirect s10 vs s11 performance
Running a CrayPE job
See the Running jobs section above for details on the partitions etc.
```
[gbauer@dt-login04 ~]$ module unload openmpi gcc
[gbauer@dt-login04 ~]$ module load PrgEnv-gnu cuda craype-x86-milan craype-accel-ncsa
[gbauer@dt-login04 ~]$ srun --account=bbka-delta-gpu --partition=gpuA40x4 --nodes=2 --ntasks-per-node=2 --cpus-per-task=2 --gpus-per-task=1 --mem=0 --time=00:10:00 ./xthi
srun: job 2735921 queued and waiting for resources
srun: job 2735921 has been allocated resources
Rank 0, thread 0, on gpub003.delta.ncsa.illinois.edu. core = 0,1,(6.548536 seconds).
Rank 0, thread 1, on gpub003.delta.ncsa.illinois.edu. core = 0,1,(6.548521 seconds).
Rank 1, thread 1, on gpub003.delta.ncsa.illinois.edu. core = 2,3,(18.908121 seconds).
Rank 1, thread 0, on gpub003.delta.ncsa.illinois.edu. core = 2,3,(18.908134 seconds).
Rank 2, thread 0, on gpub004.delta.ncsa.illinois.edu. core = 0,1,(10.076774 seconds).
Rank 2, thread 1, on gpub004.delta.ncsa.illinois.edu. core = 0,1,(10.076761 seconds).
Rank 3, thread 0, on gpub004.delta.ncsa.illinois.edu. core = 2,3,(16.366058 seconds).
Rank 3, thread 1, on gpub004.delta.ncsa.illinois.edu. core = 2,3,(16.366045 seconds).
```
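For a non-interactive run, the same job can be written as a batch script. This is a sketch that simply reuses the account, partition, and resource flags from the srun example above; adjust them for your own allocation:

```shell
#!/bin/bash
# Sketch of a batch-script version of the interactive example above.
# All values are taken from that example; substitute your own account.
#SBATCH --account=bbka-delta-gpu
#SBATCH --partition=gpuA40x4
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --gpus-per-task=1
#SBATCH --mem=0
#SBATCH --time=00:10:00

module unload openmpi gcc
module load PrgEnv-gnu cuda craype-x86-milan craype-accel-ncsa
srun ./xthi
```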
...
Cray Programming Environments
...