...

The compiler wrappers enable linking against the libmpi_gtl_cuda library, which enables gpu-rdma with the Cray MPI.
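With the accelerator target module loaded, building through the Cray wrappers picks up the GTL library; a minimal build sketch (the xthi.c source name and the MPICH_GPU_SUPPORT_ENABLED runtime setting are assumptions, not taken from this page):

Code Block
module load PrgEnv-gnu cuda craype-x86-milan craype-accel-ncsa
cc -o xthi xthi.c                    # Cray cc wrapper; links libmpi_gtl_cuda with the accel module loaded
export MPICH_GPU_SUPPORT_ENABLED=1   # assumption: Cray MPICH typically requires this at run time for gpu-rdma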


GPU direct support

These MPI implementations should be used only when an application needs MPI together with CUDA/GPU-direct. For small message sizes, pure-MPI performance will be lower than that of the MPI implementations above; for large messages, performance should be close to that of the CPU-only implementations.
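As a quick check that an Open MPI build has CUDA support compiled in, the standard ompi_info query can be used (a sketch; the expected output line is the generic Open MPI form, not captured on this system):

Code Block
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
# a CUDA-aware build reports:
# mca:mpi:base:param:mpi_built_with_cuda_support:value:true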

openmpi

choose one of:

Code Block
module load gcc openmpi/4.1.5+cuda     # uses the default gcc/11.4.0
module load nvhpc openmpi/4.1.5+cuda   # loads the openmpi/4.1.5+cuda build compiled with the nvhpc compilers

# in testing mode
module load gcc openmpi/5.0.1+cuda     # only mpirun is supported; do not use with srun

see also: gpudirect s10 vs s11 performance
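A minimal build-and-launch sketch with one of the +cuda stacks above (mycode.c, the account name, and the job geometry are placeholders, not values prescribed here; with openmpi/5.0.1+cuda use mpirun instead of srun, per the comment above):

Code Block
module load gcc openmpi/4.1.5+cuda
mpicc -o mycode mycode.c             # Open MPI compiler wrapper; link CUDA objects as needed
srun --account=<account> --partition=gpuA40x4 --nodes=2 --ntasks-per-node=2 \
     --cpus-per-task=2 --gpus-per-task=1 --mem=0 --time=00:10:00 ./mycode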

Running a CrayPE job

See the Running jobs section above for details on the partitions, etc.

Code Block
[gbauer@dt-login04 ~]$ module unload openmpi gcc 
[gbauer@dt-login04 ~]$ module load PrgEnv-gnu cuda craype-x86-milan craype-accel-ncsa
[gbauer@dt-login04 ~]$ srun --account=bbka-delta-gpu --partition=gpuA40x4 --nodes=2 --ntasks-per-node=2 --cpus-per-task=2 --gpus-per-task=1 --mem=0 --time=00:10:00 ./xthi
srun: job 2735921 queued and waiting for resources
srun: job 2735921 has been allocated resources
Rank 0, thread 0, on gpub003.delta.ncsa.illinois.edu. core = 0,1,(6.548536 seconds).
Rank 0, thread 1, on gpub003.delta.ncsa.illinois.edu. core = 0,1,(6.548521 seconds).
Rank 1, thread 1, on gpub003.delta.ncsa.illinois.edu. core = 2,3,(18.908121 seconds).
Rank 1, thread 0, on gpub003.delta.ncsa.illinois.edu. core = 2,3,(18.908134 seconds).
Rank 2, thread 0, on gpub004.delta.ncsa.illinois.edu. core = 0,1,(10.076774 seconds).
Rank 2, thread 1, on gpub004.delta.ncsa.illinois.edu. core = 0,1,(10.076761 seconds).
Rank 3, thread 0, on gpub004.delta.ncsa.illinois.edu. core = 2,3,(16.366058 seconds).
Rank 3, thread 1, on gpub004.delta.ncsa.illinois.edu. core = 2,3,(16.366045 seconds).
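The same launch can also be submitted as a batch job; a sketch reusing the resource flags from the interactive example above (the script body is a plain translation of that srun line, and ./xthi is assumed to be in the submit directory):

Code Block
#!/bin/bash
#SBATCH --account=bbka-delta-gpu
#SBATCH --partition=gpuA40x4
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --gpus-per-task=1
#SBATCH --mem=0
#SBATCH --time=00:10:00

module unload openmpi gcc
module load PrgEnv-gnu cuda craype-x86-milan craype-accel-ncsa
srun ./xthi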



Cray Programming Environments

...