The Delta system upgraded its high-speed network in mid-January. This upgrade involved both a hardware change in the network adaptor and a change in the network software stack.

The current Delta high-speed network is based on HPE/Cray's Slingshot 10, using a UCX-based network software stack with verbs and ConnectX5 network adaptors in an Ethernet mode.

The new Delta high-speed network is based on Slingshot 11, using CXI OFI with HPE/Cray Cassini network adaptors.

The base OS is an updated RHEL 8.8 with only sub-minor version changes to most packages compared with the OS on the production side of Delta.

All three file systems are available from the login node as well as from the compute nodes.

Slingshot 11 Testing

A dedicated Delta login node and a mix of compute nodes from the cpu, gpuA100x4 and gpuA40x4 partitions are configured to support jobs. All nodes see the shared home, projects and scratch file systems. 

Jobs that are submitted via Open OnDemand are NOT able to access the Slingshot 11 test resources. If you need to use a Jupyter notebook then please see the Delta Documentation on manual Jupyter notebook set-up using the specific dt-login04 where needed. 

Login Node

Please use login.delta.ncsa.illinois.edu 

During the test period you will need to log in to dt-login04.delta.ncsa.illinois.edu specifically:

ssh username@dt-login04.delta.ncsa.illinois.edu

Programming Environment

There are two types of programming environments supporting the Slingshot 11 network software stack: standard programming environments and CrayPE programming environments.

For a majority of the workloads on Delta we recommend using the standard programming environments.

If your application needs "gpu-direct" for MPI communication with the GPUs then you will need to use a CrayPE programming environment. The CrayPE programming environments are still undergoing integration into the Delta environment and may or may not perform as expected.

Standard Programming Environments

By default a GCC and OpenMPI based programming environment is loaded via modules.  

The Intel compiler with Intel MPI is available. 

An MVAPICH MPI build, compiled with the default GCC module, is available as well.

Use the same compiler names as you do on the production side of Delta. See https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/prog_env.html for more information.

Default modules (example)
[gbauer@dt-login04 ~]$ module list
Currently Loaded Modules:
  1) gcc/11.4.0   2) openmpi/4.1.6   3) cuda/11.8.0   4) cue-login-env/1.0   5) default-s11
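
Job scripts sometimes need to confirm that the Slingshot 11 defaults are loaded before doing anything else. The sketch below is one way to check a module listing for a given name. The function name module_loaded is hypothetical, and it assumes the listing format shown above (Lmod typically prints this listing on stderr, hence the 2>&1 in the usage comment).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if module "$1" appears in a `module list`
# listing supplied on stdin. Matches a bare name (default-s11) or a
# name/version pair (gcc/11.4.0).
module_loaded() {
    local name="$1"
    # Put each whitespace-separated token on its own line, then look for
    # the module as an exact name or as name/<version>.
    tr -s ' \t' '\n' | grep -Eq "^${name}(/.*)?$"
}

# Possible usage on the login node (assumes Lmod writes to stderr):
#   module list 2>&1 | module_loaded default-s11 && echo "ss11 defaults loaded"
```

The helper reads text from stdin rather than calling module itself, so it can also be run against a saved listing.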

Some of the modules from the production side of Delta are also available. They are listed under the "/sw/user/modules" section of the listing.

Python is still provided by the Anaconda modules.

Module availability
[gbauer@dt-login04 ~]$ module avail
-------------- /sw/spack/deltas11-2023-03/modules/lmod/openmpi/4.1.6-lranp74/gcc/11.4.0 --------------
   fftw/3.3.10                  hdf5/1.14.3     netcdf-c/4.9.2              parallel-netcdf/1.12.3
   gromacs/2022.5.cuda          ior/3.3.0       netcdf-fortran/4.6.1        parmetis/4.0.3
   gromacs/2022.5.x86_64 (D)    mdtest/1.9.3    osu-micro-benchmarks/7.3    petsc/3.20.1
------------------------- /sw/spack/deltas11-2023-03/modules/lmod/gcc/11.4.0 -------------------------
   cuda/11.8.0    cuda/12.3.0 (L,D)    gsl/2.7.1    mvapich/3.0    openmpi/4.1.6 (L)
------------------------------------------ /sw/user/modules ------------------------------------------
   ...   
anaconda3_cpu/23.3.1 default openmpi-5.0_beta/5.0.0rc9
anaconda3_cpu/23.7.4 (D) gthumb/3.12.0 paraview/5.10.0gui
anaconda3_gpu/22.10.0 gurobi-dev/10.0.3 paraview/5.10.1 (D)
anaconda3_gpu/23.3.1 gurobi/10.0.1 paraview/5.11.2
  ...  

The following is a more complete listing of the standard programming modules.

Full Listing of Module availability
[gbauer@dt-login04 ~]$ module avail

-------------- /sw/spack/deltas11-2023-03/modules/lmod/openmpi/4.1.6-lranp74/gcc/11.4.0 --------------
fftw/3.3.10 hdf5/1.14.3 netcdf-c/4.9.2 parallel-netcdf/1.12.3
gromacs/2022.5.cuda ior/3.3.0 netcdf-fortran/4.6.1 parmetis/4.0.3
gromacs/2022.5.x86_64 (D) mdtest/1.9.3 osu-micro-benchmarks/7.3 petsc/3.20.1

------------------------- /sw/spack/deltas11-2023-03/modules/lmod/gcc/11.4.0 -------------------------
cuda/11.8.0 (L,D) gsl/2.7.1 mvapich/3.0 openmpi/4.1.6 (L)

------------------------------------------ /sw/user/modules ------------------------------------------
AMDuProf/3.5 aws-cli/2.13.14 matlab_unlicensed/2021b
AMDuProf/3.6 conda-env/cegan-py3.9.18 mvapich-3.0rc_s11/3.0rc
AMDuProf/4.0 (D) craype-accel-ncsa/1.0 namd3/2022.07.multicore_cuda
ImageMagick/6.9 cudnn/8.4.1.50 (D) namd3/2022.07.multinode_cuda (D)
ImageMagick/7.1.0 (D) cudnn/8.9.0.131 node/21.2.0
Intel_AI_toolkit/2023.1 cue-login-env/1.0 (L) nvhpc_latest/22.11
anaconda3_Rcpu/22.9.0 default-s11 (L) openmpi-5.0_beta/5.0.0rc8 (D)
anaconda3_cpu/23.3.1 default openmpi-5.0_beta/5.0.0rc9
anaconda3_cpu/23.7.4 (D) gthumb/3.12.0 paraview/5.10.0gui
anaconda3_gpu/22.10.0 gurobi-dev/10.0.3 paraview/5.10.1 (D)
anaconda3_gpu/23.3.1 gurobi/10.0.1 paraview/5.11.2
anaconda3_gpu/23.7.4 julia/1.9.0 posix2ime/2020.1
anaconda3_gpu/23.9.0 (D) lammps/2022.06.cpu slurm-env/0.1
anaconda3_mi100/4.14.0 lammps/2022.06.gpu_cuda (D) visit/3.2.2 (D)
anaconda3_mi100/23.7.4 (D) lammps/2023.08.cuda.s11 visit/3.3.3
anaconda3_x86_64/23.3.1 lammps/2023.08.x86_64.s11 westpa/2022.03
anaconda3_x86_64/23.7.4 (D) llvm/15.0.0

---------------------------- /sw/spack/deltas11-2023-03/modules/lmod/Core ----------------------------
banner/1.3.5 git/2.39.3 ndiff/2.00 subversion/1.14.2
cuda/11.8.0 htop/3.2.2 nvtop/3.0.1
dos2unix/7.4.4 intel-oneapi-compilers/2024.0.0 parallel/20220522
gcc/11.4.0 (L) intel/2024.0.0 readline/8.2

Running jobs

Jobs that are submitted to the scheduler will automatically be tagged with an ss11 feature, which tells the scheduler that the jobs are only to be run on nodes with the Slingshot 11 software stack.

Slurm Partitions

All partitions are running ss11 jobs. This is now the default. 

Currently the following resources are available when submitting jobs from dt-login04.  

Partition    Nodes available for ss11 testing
cpu          16 CPU nodes: 2048 cores total
gpuA40x4     2 A40 nodes: 8 A40 GPUs total
gpuA100x4    2 A100 nodes: 8 A100 GPUs total

Submitting Jobs

You should be able to re-use your existing job scripts that work on the production side of Delta with some modifications.

You will need to make changes to any module commands to match what is available in the testing side. 

PMI and srun

srun needs to use PMI2 as the MPI process management interface. If you see a PMIX Error message, please update your batch script to use one of the following:

$ export SLURM_MPI_TYPE=pmi2

or

$ srun --mpi=pmi2 osu_reduce
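
Putting the points in this section together, a batch script for the test nodes might look like the sketch below. The helper write_ss11_job_script is hypothetical, as are the account name and output file in the usage comment. The node and task counts are illustrative values, and the module line uses the default modules shown earlier on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper that writes a batch script for the Slingshot 11 test nodes.
# Arguments: <account> <partition> <executable> <output file>
write_ss11_job_script() {
    local account="$1" partition="$2" exe="$3" out="$4"
    cat > "$out" <<EOF
#!/bin/bash
#SBATCH --account=${account}
#SBATCH --partition=${partition}
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --time=00:10:00

# Load modules that exist on the testing side.
module load gcc/11.4.0 openmpi/4.1.6

# Ask srun to use PMI2 as the MPI process management interface.
export SLURM_MPI_TYPE=pmi2

srun ${exe}
EOF
}

# Possible usage with placeholder values, followed by: sbatch ss11_test.sbatch
#   write_ss11_job_script my-account cpu ./osu_reduce ss11_test.sbatch
```

Setting SLURM_MPI_TYPE in the script means every srun call in the job picks up PMI2 without repeating the command-line option.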


As a reminder, Open OnDemand cannot be used to submit Jupyter notebook jobs for testing.   If you need to use a Jupyter notebook then please see the Delta Documentation on manual Jupyter notebook set-up using the specific dt-login04 where needed. 

Viewing job information

To view your jobs and confirm that the feature is set to ss11, check the FEATURES column in the squeue output.

squeue (the %f format field shows features)
$ squeue -u $USER
       JOBID    PARTITION         NAME           USER ST       TIME  NODES   NODELIST(REASON) FEATURES
     2734871          cpu         bash         gbauer  R       1:47      2        cn[122-123] ss11
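
With many jobs in the queue, scanning the FEATURES column by eye is error-prone. The sketch below filters a two-column "job ID, features" listing and prints only the jobs that lack ss11. The function name jobs_missing_ss11 is hypothetical, and the input is assumed to come from squeue with the %i (job ID) and %f (features) format fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the IDs of jobs whose feature list lacks ss11.
# Expects lines of "<jobid> <features>" on stdin.
jobs_missing_ss11() {
    awk '{
        # Feature expressions may combine several names with & | or ,
        n = split($2, feats, /[&|,]/)
        found = 0
        for (i = 1; i <= n; i++) if (feats[i] == "ss11") found = 1
        if (!found) print $1
    }'
}

# Possible usage:
#   squeue -h -u "$USER" -o "%i %f" | jobs_missing_ss11
```

An empty result means every listed job carries the ss11 feature.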

GPU direct support 

These MPI implementations should be used only when an application needs MPI together with CUDA/gpu_direct. For small message sizes, their pure-MPI performance will be lower than that of the MPI implementations above. For large messages, performance should be close to that of the CPU-only implementations.

openmpi

choose one of:

module load gcc openmpi/4.1.6   # the default gcc/11.4.0


module load gcc openmpi/4.1.5+cuda # if your code requires cuda-aware-mpi semantics
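
After loading one of these modules, it can be useful to confirm whether the Open MPI build is CUDA-aware. Open MPI reports this through the mpi_built_with_cuda_support parameter in ompi_info output. The sketch below checks for it. The function name openmpi_is_cuda_aware is hypothetical, and which of the two modules reports true has not been verified here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if ompi_info output supplied on stdin
# reports that Open MPI was built with CUDA support.
openmpi_is_cuda_aware() {
    grep -q 'mpi_built_with_cuda_support:value:true'
}

# Possible usage after "module load gcc openmpi/...":
#   ompi_info --parsable --all | openmpi_is_cuda_aware \
#       && echo "CUDA-aware Open MPI" || echo "not CUDA-aware"
```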

see also: gpudirect s10 vs s11 performance

CrayPE Programming Environments

(Available for testing but under construction)

HPE/Cray provides a set of programming environments that are similar to what one would find on HPE/Cray EX systems like NERSC Perlmutter etc. Please note that these programming environments use the Cray MPI library.  The modules for the HPE/Cray provided programming environments are PrgEnv-gnu, PrgEnv-cray,  and PrgEnv-nvhpc which enable the GNU, Cray and Nvidia HPC SDK compilers and matching Cray MPI libraries.  

You must use the cc, CC and ftn compiler wrappers with these programming environments.
See the cc, CC and ftn man pages for details. These wrappers wrap the GNU, Cray CCE, and NVIDIA compilers.

The following is a listing of all the modules that come with the HPE/CrayPE provided programming environment.

craype modules listing
------------------------ /opt/cray/pe/lmod/modulefiles/craype-targets/default ------------------------
   craype-accel-amd-gfx908    craype-hugepages128M    craype-hugepages512M    craype-x86-milan
   craype-accel-amd-gfx90a    craype-hugepages16M     craype-hugepages64M     craype-x86-rome
   craype-accel-amd-gfx940    craype-hugepages1G      craype-hugepages8M      craype-x86-spr-hbm
   craype-accel-host          craype-hugepages256M    craype-network-none     craype-x86-spr
   craype-accel-intel-max     craype-hugepages2G      craype-network-ofi      craype-x86-trento
   craype-accel-nvidia70      craype-hugepages2M      craype-network-ucx
   craype-accel-nvidia80      craype-hugepages32M     craype-x86-genoa
   craype-arm-grace           craype-hugepages4M      craype-x86-milan-x
--------------------------------- /opt/cray/pe/lmod/modulefiles/core ---------------------------------
   PrgEnv-cray/8.4.0      cray-R/4.2.1.2           cray-pals/1.2.12           gdb4hpc/4.15.1
   PrgEnv-gnu/8.4.0       cray-ccdb/5.0.1          cray-pmi/6.1.12            papi/7.0.1.1
   PrgEnv-nvhpc/8.4.0     cray-cti/2.18.1          cray-python/3.10.10        perftools-base/23.09.0
   PrgEnv-nvidia/8.4.0    cray-dsmml/0.2.2         cray-stat/4.12.1           sanitizers4hpc/1.1.1
   atp/3.15.1             cray-dyninst/12.3.0      craype/2.7.23              valgrind4hpc/2.13.1
   cce/16.0.1             cray-libpals/1.2.12      craypkg-gen/1.3.30
   cpe-cuda/23.09         cray-libsci/23.09.1.1    gcc-native/10.3
   cpe/23.09              cray-mrnet/5.1.1         gcc-native/11.2     (D)

Using a Cray programming environment

Here is how you can enable the GNU CrayPE programming environment. 

[gbauer@dt-login04 ~]$ module unload openmpi gcc 
[gbauer@dt-login04 ~]$ module load PrgEnv-gnu cuda craype-x86-milan craype-accel-ncsa

Compiling

Again, you will need to use the  cc, CC and ftn compiler wrappers.

# Use the HPE/Cray compiler wrappers cc, CC and ftn to compile and link
# you might need to add libraries manually 
[gbauer@dt-login04 ~]$ cc -fopenmp -o xthi xthi.c -lcuda -lcudart

The compiler wrappers enable linking of a libmpi_gtl_cuda library that enables gpu-rdma with the Cray MPI.


Running a CrayPE job

See the Running jobs section above for details on the partitions etc.

[gbauer@dt-login04 ~]$ module unload openmpi gcc 
[gbauer@dt-login04 ~]$ module load PrgEnv-gnu cuda craype-x86-milan craype-accel-ncsa
[gbauer@dt-login04 ~]$ srun --account=bbka-delta-gpu --partition=gpuA40x4 --nodes=2 --ntasks-per-node=2 --cpus-per-task=2 --gpus-per-task=1 --mem=0 --time=00:10:00 ./xthi
srun: job 2735921 queued and waiting for resources
srun: job 2735921 has been allocated resources
Rank 0, thread 0, on gpub003.delta.ncsa.illinois.edu. core = 0,1,(6.548536 seconds).
Rank 0, thread 1, on gpub003.delta.ncsa.illinois.edu. core = 0,1,(6.548521 seconds).
Rank 1, thread 1, on gpub003.delta.ncsa.illinois.edu. core = 2,3,(18.908121 seconds).
Rank 1, thread 0, on gpub003.delta.ncsa.illinois.edu. core = 2,3,(18.908134 seconds).
Rank 2, thread 0, on gpub004.delta.ncsa.illinois.edu. core = 0,1,(10.076774 seconds).
Rank 2, thread 1, on gpub004.delta.ncsa.illinois.edu. core = 0,1,(10.076761 seconds).
Rank 3, thread 0, on gpub004.delta.ncsa.illinois.edu. core = 2,3,(16.366058 seconds).
Rank 3, thread 1, on gpub004.delta.ncsa.illinois.edu. core = 2,3,(16.366045 seconds).
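
To check that ranks were placed as requested (here, two tasks per node), the output can be summarized per host. The sketch below counts distinct ranks on each node. The function name ranks_per_node is hypothetical, and it assumes the xthi output format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count distinct MPI ranks per host in xthi output
# supplied on stdin. Prints "<host> <number of ranks>", sorted by host.
ranks_per_node() {
    awk '/^Rank [0-9]+, thread/ {
        rank = $2; sub(/,$/, "", rank)      # "0," -> "0"
        host = $6; sub(/\.$/, "", host)     # drop the trailing period
        key = host SUBSEP rank
        if (!(key in seen)) { seen[key] = 1; count[host]++ }
    }
    END { for (h in count) print h, count[h] }' | sort
}

# Possible usage:
#   srun ... ./xthi | ranks_per_node
```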


Cray Programming Environments

Along with a PrgEnv-<gnu,cray> module, load craype-x86-milan, craype-accel-ncsa and cuda (needed at both compile time and run time).

  • Choose a programming environment (run one of the module load lines below):

      module load examples:
      module load PrgEnv-cray craype-x86-milan craype-accel-ncsa cuda
      module load PrgEnv-gnu craype-x86-milan craype-accel-ncsa cuda
  • Use the Cray compiler wrappers cc, CC and ftn for C, C++ and Fortran codes respectively. Do not use the mpi* compiler wrappers that are in your $PATH.
  • cudaMalloc (classic gpudirect) is supported for single-node cases only (no multi-node gpudirect at this time).
  • CUDA managed/unified memory supports multi-node gpudirect.
    • Use cudaMallocManaged where possible:
      // Excerpt: two cases from a switch statement over the buffer type
      #ifdef _ENABLE_CUDA_
              case CUDA:
                  // do not use cudaMalloc if you can avoid it
                  CUDA_CHECK(cudaMalloc((void **)buffer, size));
                  break;
              case MANAGED:
                  // use newer cuda managed memory instead, letting the cuda driver take care of what goes where at runtime
                  CUDA_CHECK(cudaMallocManaged((void **)buffer, size, cudaMemAttachGlobal));
                  break;
      #endif
  • See also: https://cpe.ext.hpe.com/docs/mpt/mpich/intro_mpi.html ("GPU Support in Cray MPICH") for a description of which PrgEnv-<compiler> might best fit your application.
    • PrgEnv-gnu and PrgEnv-cray work with CUDA managed memory (or classic cudaMalloc for single-node jobs).
    • PrgEnv-cray
      • Notes: may need -std=gnu11 -std=c++11 for compiler compatibility with GNU, and export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
    • https://docs.nvidia.com/cuda/gpudirect-rdma/index.html (NVIDIA documentation; pertains mostly to Mellanox/InfiniBand networks)
  • Ensure your application is linked against libmpi_gtl_cuda. Here is an example from the osu_micro_benchmarks (https://mvapich.cse.ohio-state.edu/benchmarks/):
    • ./configure CC=cc CXX=CC --enable-cuda ; cd mpi/collective; make osu_reduce; make osu_bcast
$ ldd osu_reduce | grep cuda
    libcudart.so.11.0 => /sw/spack/deltas11-2023-03/apps/linux-rhel8-zen/gcc-8.5.0/cuda-11.8.0-zttjnty/lib64/libcudart.so.11.0 (0x00007f0c6d888000)
    libcuda.so.1 => /lib64/libcuda.so.1 (0x00007f3b8e802000)
    libmpi_gtl_cuda.so.0 => /opt/cray/pe/lib64/libmpi_gtl_cuda.so.0 (0x00007f3b8e5bc000)
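
The same check can be scripted so that a build or job script stops early when the library is missing. This is a minimal sketch. The function name links_gtl_cuda is hypothetical, and it only inspects ldd output passed on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if ldd output supplied on stdin lists the
# Cray MPI GPU transport library (libmpi_gtl_cuda).
links_gtl_cuda() {
    grep -q 'libmpi_gtl_cuda\.so'
}

# Possible usage:
#   ldd ./osu_reduce | links_gtl_cuda \
#       || echo "libmpi_gtl_cuda not linked; rebuild with the cc/CC/ftn wrappers" >&2
```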

[arnoldg@dt-login04 collective]$ srun osu_bcast -d managed
Warning: OMB could not identify the local rank of the process.
         This can lead to multiple processes using the same GPU.
         Please use the get_local_rank script in the OMB repo for this.
Warning: OMB could not identify the local rank of the process.
         This can lead to multiple processes using the same GPU.
         Please use the get_local_rank script in the OMB repo for this.

# OSU MPI Broadcast Latency Test v5.9
# Size       Avg Latency(us)
1                       2.21
2                       2.20
4                       2.20
8                       2.20
16                      2.20
32                      2.32
64                      2.27
128                     2.67
256                     3.90
512                     4.03
1024                    4.17
2048                    4.41
4096                    4.76
8192                    5.21
16384                   5.96
32768                   9.42
65536                  13.66
131072                 19.38
262144                 29.41
524288                 55.80
1048576               100.87

# old style gpu direct "-d cuda" fails for multi-node cases:

[arnoldg@dt-login04 collective]$ srun osu_bcast -d cuda
Warning: OMB could not identify the local rank of the process.
         This can lead to multiple processes using the same GPU.
         Please use the get_local_rank script in the OMB repo for this.
Warning: OMB could not identify the local rank of the process.
         This can lead to multiple processes using the same GPU.
         Please use the get_local_rank script in the OMB repo for this.

# OSU MPI-CUDA Broadcast Latency Test v5.9
# Size       Avg Latency(us)
cxil_map: write error
cxil_map: write error
MPICH ERROR [Rank 1] [job id 2765312.9] [Thu Dec 21 12:58:18 2023] [gpub004.delta.ncsa.illinois.edu] - Abort(339288079) (rank 1 in comm 0): Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(446)..........: MPI_Bcast(buf=0x7efbe0600000, count=1, MPI_CHAR, root=0, comm=MPI_COMM_WORLD) failed
PMPI_Bcast(431)..........: 
MPIR_CRAY_Bcast(532).....: 
MPIR_CRAY_Bcast_Tree(162): 
MPIC_Recv(194)...........: 
MPID_Recv(380)...........: 
MPIDI_recv_unsafe(87)....: 
MPIDI_OFI_do_irecv(356)..: OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Invalid argument)

slurmstepd: error: *** STEP 2765312.9 ON gpub003 CANCELLED AT 2023-12-21T12:58:18 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: gpub004: task 1: Exited with exit code 1
srun: error: gpub003: task 0: Killed

# gpudirect "-d cuda" works on a single node
[arnoldg@dt-login04 collective]$ srun osu_bcast -d cuda
Warning: OMB could not identify the local rank of the process.
         This can lead to multiple processes using the same GPU.
         Please use the get_local_rank script in the OMB repo for this.
Warning: OMB could not identify the local rank of the process.
         This can lead to multiple processes using the same GPU.
         Please use the get_local_rank script in the OMB repo for this.
Warning: OMB could not identify the local rank of the process.
         This can lead to multiple processes using the same GPU.
         Please use the get_local_rank script in the OMB repo for this.
Warning: OMB could not identify the local rank of the process.
         This can lead to multiple processes using the same GPU.
         Please use the get_local_rank script in the OMB repo for this.

# OSU MPI-CUDA Broadcast Latency Test v5.9
# Size       Avg Latency(us)
1                     159.36
2                     158.99
4                     158.59
8                     159.96
16                    160.36
32                    160.27
64                    160.35
128                   160.35
256                   160.16
512                   147.08
1024                  213.06
2048                  212.95
4096                  212.71
8192                  202.39
16384                 202.36
32768                 201.82
65536                 202.83
131072                202.40
262144                203.22
524288                203.47
1048576               205.54  

# cuda managed memory with mpi pt2pt example:

[arnoldg@dt-login04 pt2pt]$ srun osu_bibw -d managed
Warning: OMB could not identify the local rank of the process.
         This can lead to multiple processes using the same GPU.
         Please use the get_local_rank script in the OMB repo for this.
Warning: OMB could not identify the local rank of the process.
         This can lead to multiple processes using the same GPU.
         Please use the get_local_rank script in the OMB repo for this.
# OSU MPI Bi-Directional Bandwidth Test v5.9
# Size      Bandwidth (MB/s)
1                       1.65
2                       3.38
4                       6.50
8                      13.12
16                     26.28
32                     52.40
64                    104.37
128                   207.78
256                   406.47
512                   843.80
1024                 1678.60
2048                 3311.43
4096                 6579.15
8192                13882.65
16384               24171.86
32768               28589.49
65536               32155.62
131072              39795.07
262144              41833.97
524288              43021.68
1048576             43479.76
2097152             43729.48
4194304             43784.78
[arnoldg@dt-login04 pt2pt]$ 
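
When comparing runs, it can help to reduce each bandwidth table to its peak value. The sketch below extracts the row with the highest bandwidth from OSU bandwidth output such as the table above. The function name osu_peak_bandwidth is hypothetical. Warning and header lines are skipped because they do not start with a numeric size.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<message size> <bandwidth>" for the row with
# the highest bandwidth in OSU bandwidth output supplied on stdin.
osu_peak_bandwidth() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9.]+$/ {
        if ($2 + 0 > best) { best = $2 + 0; size = $1 }
    }
    END { if (size != "") printf "%s %.2f\n", size, best }'
}

# Possible usage:
#   srun osu_bibw -d managed | osu_peak_bandwidth
```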

1 Comment

  1. To access the original slingshot10 modules (for non-MPI codes), choose one of:

    original modules
    module use /sw/spack/deltagpu-2022-03/modules/lmod/Core 
    module use /sw/spack/deltacpu-2022-03/modules/lmod/Core