The Delta system upgraded its high-speed network in mid-January. This upgrade involved both a hardware change in the network adaptor and a change in the network software stack.
Before the upgrade, the Delta high-speed network was based on HPE/Cray's Slingshot 10, using a UCX-based network software stack with verbs, with ConnectX5 network adaptors in Ethernet mode.
The new Delta high-speed network is based on Slingshot 11, using CXI OFI with HPE/Cray Cassini network adaptors.
The base OS is an updated RHEL 8.8, with only sub-minor version changes to most packages compared with the OS on the production side of Delta.
All three file systems are available from the login node as well as from the compute nodes.
Slingshot 11 Testing
A dedicated Delta login node and a mix of compute nodes from the cpu, gpuA100x4 and gpuA40x4 partitions are configured to support jobs. All nodes see the shared home, projects and scratch file systems.
Jobs that are submitted via Open OnDemand are NOT able to access the Slingshot 11 test resources. If you need to use a Jupyter notebook then please see the Delta Documentation on manual Jupyter notebook set-up using the specific dt-login04 where needed.
Login Node
Please use login.delta.ncsa.illinois.edu
During the test period you will need to specifically log into dt-login04.delta.ncsa.illinois.edu
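Scripts that should only run on the Slingshot 11 test login node can guard themselves with a hostname check. The following is a minimal sketch: the function name `is_s11_login` is hypothetical, and it assumes the host name starts with `dt-login04`, as in the prompts shown on this page.

```shell
# Hypothetical helper: succeed only when the given (or current) host is dt-login04.
is_s11_login() {
    case "${1:-$(hostname)}" in
        dt-login04|dt-login04.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: print a reminder when the script is started somewhere else.
if ! is_s11_login; then
    echo "note: not on dt-login04; log in to dt-login04.delta.ncsa.illinois.edu first" >&2
fi
```

Passing a host name as the first argument makes the check easy to try without logging in anywhere.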
Programming Environment
There are two types of programming environments supporting the Slingshot 11 network software stack: standard programming environments and CrayPE programming environments.
For a majority of the workloads on Delta we recommend using the standard programming environments.
If your application needs "gpu-direct" for MPI communication with the GPUs, then you will need to use a CrayPE programming environment. The CrayPE programming environments are still undergoing integration into the Delta environment and may or may not perform as expected.
Standard Programming Environments
By default, a GCC- and OpenMPI-based programming environment is loaded via modules.
The Intel compiler with Intel MPI is available.
An MVAPICH MPI, built with the default GCC module, is available as well.
Use the same compiler names as you do on the production side of Delta. See https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/prog_env.html for more information.

    [gbauer@dt-login04 ~]$ module list

    Currently Loaded Modules:
      1) gcc/11.4.0   2) openmpi/4.1.6   3) cuda/11.8.0   4) cue-login-env/1.0   5) default-s11

Some of the modules from the production side of Delta are available and are listed under the "/sw/user/modules" section of the listing.
Python is still provided by the Anaconda modules.

    [gbauer@dt-login04 ~]$ module avail

    -------------- /sw/spack/deltas11-2023-03/modules/lmod/openmpi/4.1.6-lranp74/gcc/11.4.0 --------------
       fftw/3.3.10                  hdf5/1.14.3     netcdf-c/4.9.2              parallel-netcdf/1.12.3
       gromacs/2022.5.cuda          ior/3.3.0       netcdf-fortran/4.6.1        parmetis/4.0.3
       gromacs/2022.5.x86_64 (D)    mdtest/1.9.3    osu-micro-benchmarks/7.3    petsc/3.20.1

    ------------------------- /sw/spack/deltas11-2023-03/modules/lmod/gcc/11.4.0 -------------------------
       cuda/11.8.0    cuda/12.3.0 (L,D)    gsl/2.7.1    mvapich/3.0    openmpi/4.1.6 (L)

    ------------------------------------------ /sw/user/modules ------------------------------------------
       ...
       anaconda3_cpu/23.3.1        default              openmpi-5.0_beta/5.0.0rc9
       anaconda3_cpu/23.7.4 (D)    gthumb/3.12.0        paraview/5.10.0gui
       anaconda3_gpu/22.10.0       gurobi-dev/10.0.3    paraview/5.10.1 (D)
       anaconda3_gpu/23.3.1        gurobi/10.0.1        paraview/5.11.2
       ...

The following is a more complete listing of the standard programming modules.

    [gbauer@dt-login04 ~]$ module avail

    -------------- /sw/spack/deltas11-2023-03/modules/lmod/openmpi/4.1.6-lranp74/gcc/11.4.0 --------------
       fftw/3.3.10                  hdf5/1.14.3     netcdf-c/4.9.2              parallel-netcdf/1.12.3
       gromacs/2022.5.cuda          ior/3.3.0       netcdf-fortran/4.6.1        parmetis/4.0.3
       gromacs/2022.5.x86_64 (D)    mdtest/1.9.3    osu-micro-benchmarks/7.3    petsc/3.20.1

    ------------------------- /sw/spack/deltas11-2023-03/modules/lmod/gcc/11.4.0 -------------------------
       cuda/11.8.0 (L,D)    gsl/2.7.1    mvapich/3.0    openmpi/4.1.6 (L)

    ------------------------------------------ /sw/user/modules ------------------------------------------
       AMDuProf/3.5                   aws-cli/2.13.14                matlab_unlicensed/2021b
       AMDuProf/3.6                   conda-env/cegan-py3.9.18       mvapich-3.0rc_s11/3.0rc
       AMDuProf/4.0 (D)               craype-accel-ncsa/1.0          namd3/2022.07.multicore_cuda
       ImageMagick/6.9                cudnn/8.4.1.50 (D)             namd3/2022.07.multinode_cuda (D)
       ImageMagick/7.1.0 (D)          cudnn/8.9.0.131                node/21.2.0
       Intel_AI_toolkit/2023.1        cue-login-env/1.0 (L)          nvhpc_latest/22.11
       anaconda3_Rcpu/22.9.0          default-s11 (L)                openmpi-5.0_beta/5.0.0rc8 (D)
       anaconda3_cpu/23.3.1           default                        openmpi-5.0_beta/5.0.0rc9
       anaconda3_cpu/23.7.4 (D)       gthumb/3.12.0                  paraview/5.10.0gui
       anaconda3_gpu/22.10.0          gurobi-dev/10.0.3              paraview/5.10.1 (D)
       anaconda3_gpu/23.3.1           gurobi/10.0.1                  paraview/5.11.2
       anaconda3_gpu/23.7.4           julia/1.9.0                    posix2ime/2020.1
       anaconda3_gpu/23.9.0 (D)       lammps/2022.06.cpu             slurm-env/0.1
       anaconda3_mi100/4.14.0         lammps/2022.06.gpu_cuda (D)    visit/3.2.2 (D)
       anaconda3_mi100/23.7.4 (D)     lammps/2023.08.cuda.s11        visit/3.3.3
       anaconda3_x86_64/23.3.1        lammps/2023.08.x86_64.s11      westpa/2022.03
       anaconda3_x86_64/23.7.4 (D)    llvm/15.0.0

    ---------------------------- /sw/spack/deltas11-2023-03/modules/lmod/Core ----------------------------
       banner/1.3.5      git/2.39.3                          ndiff/2.00           subversion/1.14.2
       cuda/11.8.0       htop/3.2.2                          nvtop/3.0.1
       dos2unix/7.4.4    intel-oneapi-compilers/2024.0.0     parallel/20220522
       gcc/11.4.0 (L)    intel/2024.0.0                      readline/8.2

Running jobs
Jobs that are submitted to the scheduler are automatically tagged with an ss11 feature, which tells the scheduler to run them only on nodes with the Slingshot 11 software stack.
Slurm Partitions
All partitions are running ss11 jobs. This is now the default.
Currently, nodes from the cpu, gpuA100x4 and gpuA40x4 partitions are available when submitting jobs from dt-login04.
Submitting Jobs
You should be able to re-use your existing job scripts that work on the production side of Delta with some modifications.
You will need to change any module commands to match what is available on the testing side.
Tip: srun needs to use PMI2 as the MPI process management interface. If you see a PMIX error message, change your batch script to use one of the following:

    $ export SLURM_MPI_TYPE=pmi2
    $ srun --mpi=pmi2 osu_reduce

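The PMI2 setting from the tip above can be placed near the top of a batch script so that every later srun call picks it up. A minimal sketch, in which `./my_mpi_app` is a placeholder for your own executable:

```shell
# Make srun use the PMI2 plugin for all job steps started from this script.
export SLURM_MPI_TYPE=pmi2

# Launch only when srun and the (placeholder) executable are both present.
if command -v srun >/dev/null 2>&1 && [ -x ./my_mpi_app ]; then
    srun ./my_mpi_app
fi
```

Setting the environment variable once avoids repeating the option on every srun line.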
As a reminder, Open OnDemand cannot be used to submit Jupyter notebook jobs for testing. If you need to use a Jupyter notebook then please see the Delta Documentation on manual Jupyter notebook set-up using the specific dt-login04 where needed.
Viewing job information
To view your jobs and confirm that the feature is set to ss11:

    $ squeue -u $USER
      JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) FEATURES
    2734871       cpu     bash   gbauer  R       1:47      2 cn[122-123]      ss11

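The check can also be scripted. The sketch below is one possible approach, not part of the Delta tooling: the helper name `jobs_without_ss11` is hypothetical, and the explicit `-o` format string (with `%f` for the requested features placed last) is an assumption chosen so that the feature is always the final field.

```shell
# Print the job IDs whose last column (FEATURES) is not ss11.
# Reads squeue-style text on stdin; the first (header) line is skipped.
jobs_without_ss11() {
    awk 'NR > 1 && NF > 0 && $NF != "ss11" { print $1 }'
}

# Run against the live queue only where squeue exists (i.e. on Delta).
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -o "%i %P %j %u %t %M %D %R %f" | jobs_without_ss11
fi
```

No output means every listed job carries the ss11 feature.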
GPU direct support
These MPI implementations should be used only when an application needs MPI together with CUDA/gpu_direct. For small messages, pure-MPI performance will be lower than with the MPI implementations above. For large messages, performance should be close to that of the CPU-only implementations.
openmpi
choose one of:

    module load gcc openmpi/4.1.6        # the default gcc/11.4.0
    # in testing mode
    module load gcc openmpi/5.0.1+cuda   # only mpirun is supported, do not use with srun

see also: gpudirect s10 vs s11 performance
CrayPE Programming Environments
(Available for testing but under construction)
HPE/Cray provides a set of programming environments that are similar to what one would find on HPE/Cray EX systems like NERSC Perlmutter etc. Please note that these programming environments use the Cray MPI library. The modules for the HPE/Cray provided programming environments are PrgEnv-gnu, PrgEnv-cray, and PrgEnv-nvhpc which enable the GNU, Cray and Nvidia HPC SDK compilers and matching Cray MPI libraries.
You must use the cc, CC and ftn compiler wrappers to use these programming environments.
See the man pages for cc, CC and ftn, which wrap the GNU, Cray CCE and NVIDIA compilers.
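In build scripts it can help to pick the wrapper from the source file type. The helper below is a hypothetical sketch: the name `cray_wrapper_for` and the list of file extensions are assumptions based on common conventions, not something defined by CrayPE.

```shell
# Hypothetical helper: print the CrayPE compiler wrapper for a source file.
cray_wrapper_for() {
    case "$1" in
        *.c)                              echo cc ;;   # C
        *.cc|*.cpp|*.cxx|*.C)             echo CC ;;   # C++
        *.f|*.F|*.f90|*.F90|*.f03|*.f08)  echo ftn ;;  # Fortran
        *) echo "unknown source type: $1" >&2; return 1 ;;
    esac
}

# Example: show which wrapper would be used for a few files.
for src in xthi.c solver.cpp kernel.f90; do
    echo "$src -> $(cray_wrapper_for "$src")"
done
```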
The following is a listing of all the modules that come with the HPE/CrayPE provided programming environment.

    ------------------------ /opt/cray/pe/lmod/modulefiles/craype-targets/default ------------------------
       craype-accel-amd-gfx908    craype-hugepages128M    craype-hugepages512M    craype-x86-milan
       craype-accel-amd-gfx90a    craype-hugepages16M     craype-hugepages64M     craype-x86-rome
       craype-accel-amd-gfx940    craype-hugepages1G      craype-hugepages8M      craype-x86-spr-hbm
       craype-accel-host          craype-hugepages256M    craype-network-none     craype-x86-spr
       craype-accel-intel-max     craype-hugepages2G      craype-network-ofi      craype-x86-trento
       craype-accel-nvidia70      craype-hugepages2M      craype-network-ucx
       craype-accel-nvidia80      craype-hugepages32M     craype-x86-genoa
       craype-arm-grace           craype-hugepages4M      craype-x86-milan-x

    --------------------------------- /opt/cray/pe/lmod/modulefiles/core ---------------------------------
       PrgEnv-cray/8.4.0      cray-R/4.2.1.2           cray-pals/1.2.12       gdb4hpc/4.15.1
       PrgEnv-gnu/8.4.0       cray-ccdb/5.0.1          cray-pmi/6.1.12        papi/7.0.1.1
       PrgEnv-nvhpc/8.4.0     cray-cti/2.18.1          cray-python/3.10.10    perftools-base/23.09.0
       PrgEnv-nvidia/8.4.0    cray-dsmml/0.2.2         cray-stat/4.12.1       sanitizers4hpc/1.1.1
       atp/3.15.1             cray-dyninst/12.3.0      craype/2.7.23          valgrind4hpc/2.13.1
       cce/16.0.1             cray-libpals/1.2.12      craypkg-gen/1.3.30
       cpe-cuda/23.09         cray-libsci/23.09.1.1    gcc-native/10.3
       cpe/23.09              cray-mrnet/5.1.1         gcc-native/11.2 (D)

Using a Cray programming environment
Here is how you can enable the GNU CrayPE programming environment.

    [gbauer@dt-login04 ~]$ module unload openmpi gcc
    [gbauer@dt-login04 ~]$ module load PrgEnv-gnu cuda craype-x86-milan craype-accel-ncsa

Compiling
Again, you will need to use the cc, CC and ftn compiler wrappers.

    # Use the HPE/Cray compiler wrappers cc, CC and ftn to compile and link
    # you might need to add libraries manually
    [gbauer@dt-login04 ~]$ cc -fopenmp -o xthi xthi.c -lcuda -lcudart

The compiler wrappers enable linking of a libmpi_gtl_cuda library that enables gpu-rdma with the Cray MPI.
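Whether a binary actually picked up that library can be checked from the output of ldd. A minimal sketch, in which the helper name `has_gtl_cuda` is hypothetical and `./xthi` is the example binary built above:

```shell
# Succeed if ldd-style output on stdin mentions the Cray GPU transport library.
has_gtl_cuda() {
    grep -q 'libmpi_gtl_cuda'
}

# Check the example binary only when it exists and ldd is available.
if [ -x ./xthi ] && command -v ldd >/dev/null 2>&1; then
    if ldd ./xthi | has_gtl_cuda; then
        echo "libmpi_gtl_cuda is linked"
    else
        echo "libmpi_gtl_cuda is NOT linked" >&2
    fi
fi
```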
Running a CrayPE job
See the Running jobs section above for details on the partitions etc.

    [gbauer@dt-login04 ~]$ module unload openmpi gcc
    [gbauer@dt-login04 ~]$ module load PrgEnv-gnu cuda craype-x86-milan craype-accel-ncsa
    [gbauer@dt-login04 ~]$ srun --account=bbka-delta-gpu --partition=gpuA40x4 --nodes=2 --ntasks-per-node=2 --cpus-per-task=2 --gpus-per-task=1 --mem=0 --time=00:10:00 ./xthi
    srun: job 2735921 queued and waiting for resources
    srun: job 2735921 has been allocated resources
    Rank 0, thread 0, on gpub003.delta.ncsa.illinois.edu. core = 0,1,(6.548536 seconds).
    Rank 0, thread 1, on gpub003.delta.ncsa.illinois.edu. core = 0,1,(6.548521 seconds).
    Rank 1, thread 1, on gpub003.delta.ncsa.illinois.edu. core = 2,3,(18.908121 seconds).
    Rank 1, thread 0, on gpub003.delta.ncsa.illinois.edu. core = 2,3,(18.908134 seconds).
    Rank 2, thread 0, on gpub004.delta.ncsa.illinois.edu. core = 0,1,(10.076774 seconds).
    Rank 2, thread 1, on gpub004.delta.ncsa.illinois.edu. core = 0,1,(10.076761 seconds).
    Rank 3, thread 0, on gpub004.delta.ncsa.illinois.edu. core = 2,3,(16.366058 seconds).
    Rank 3, thread 1, on gpub004.delta.ncsa.illinois.edu. core = 2,3,(16.366045 seconds).

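The interactive srun line above can also be expressed as a batch script. The sketch below writes one out; the file name `xthi_craype.sbatch` is arbitrary, the account is the one from the example and must be replaced with your own, and the `OMP_NUM_THREADS` setting and the explicit `--cpus-per-task` on srun are assumptions rather than Delta requirements.

```shell
# Write a batch script that mirrors the interactive srun example.
cat > xthi_craype.sbatch <<'EOF'
#!/bin/bash
#SBATCH --account=bbka-delta-gpu
#SBATCH --partition=gpuA40x4
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --gpus-per-task=1
#SBATCH --mem=0
#SBATCH --time=00:10:00

module unload openmpi gcc
module load PrgEnv-gnu cuda craype-x86-milan craype-accel-ncsa

# Assumption: one OpenMP thread per allocated core.
export OMP_NUM_THREADS="$SLURM_CPUS_PER_TASK"

# Pass --cpus-per-task explicitly; newer Slurm releases do not inherit it from sbatch.
srun --cpus-per-task="$SLURM_CPUS_PER_TASK" ./xthi
EOF

echo "Submit with: sbatch xthi_craype.sbatch"
```

The quoted `'EOF'` keeps the `$SLURM_...` variables unexpanded until the job runs.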
Cray Programming Environments
Along with a PrgEnv-<gnu,cray> module, load craype-x86-milan, craype-accel-ncsa and cuda (needed at both compile time and run time).
Choose a programming environment (run one of the module load lines below):
module load examples:

    module load PrgEnv-cray craype-x86-milan craype-accel-ncsa cuda
    module load PrgEnv-gnu craype-x86-milan craype-accel-ncsa cuda

- Use the Cray compiler wrappers cc, CC and ftn for C, C++ and Fortran codes respectively. Do not use the mpi* compiler wrappers that are in your $PATH.
- cudaMalloc (classic gpudirect) is supported for single-node cases only (no multi-node gpudirect at this time)
- CUDA managed/unified memory supports multi-node gpudirect
use cudaMallocManaged where possible:

    #ifdef _ENABLE_CUDA_
            case CUDA:
                // do not use cudaMalloc if you can avoid it
                CUDA_CHECK(cudaMalloc((void **)buffer, size));
                break;
            case MANAGED:
                // use newer cuda managed memory instead, letting the cuda driver take care of what goes where at runtime
                CUDA_CHECK(cudaMallocManaged((void **)buffer, size, cudaMemAttachGlobal));
                break;

- see also: https://cpe.ext.hpe.com/docs/mpt/mpich/intro_mpi.html , "GPU Support in Cray MPICH" for a description of which PrgEnv-<compiler> might best fit your application
- PrgEnv-gnu and PrgEnv-cray work with cuda managed memory (or classic cudaMalloc for single-node jobs )
- PrgEnv-cray
- notes: may need -std=gnu11 (C) or -std=c++11 (C++) for compiler compatibility with GNU, and export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
- https://docs.nvidia.com/cuda/gpudirect-rdma/index.html (pertains mostly to Mellanox/Infiniband networks -- Nvidia documentation )
- ensure your application is linked against libmpi_gtl_cuda. Here is an example from the OSU micro-benchmarks (https://mvapich.cse.ohio-state.edu/benchmarks/), built with:
./configure CC=cc CXX=CC --enable-cuda ; cd mpi/collective; make osu_reduce; make osu_bcast :

    $ ldd osu_reduce | grep cuda
            libcudart.so.11.0 => /sw/spack/deltas11-2023-03/apps/linux-rhel8-zen/gcc-8.5.0/cuda-11.8.0-zttjnty/lib64/libcudart.so.11.0 (0x00007f0c6d888000)
            libcuda.so.1 => /lib64/libcuda.so.1 (0x00007f3b8e802000)
            libmpi_gtl_cuda.so.0 => /opt/cray/pe/lib64/libmpi_gtl_cuda.so.0 (0x00007f3b8e5bc000)

    [arnoldg@dt-login04 collective]$ srun osu_bcast -d managed
    Warning: OMB could not identify the local rank of the process. This can lead to multiple processes using the same GPU. Please use the get_local_rank script in the OMB repo for this.
    Warning: OMB could not identify the local rank of the process. This can lead to multiple processes using the same GPU. Please use the get_local_rank script in the OMB repo for this.
    # OSU MPI Broadcast Latency Test v5.9
    # Size       Avg Latency(us)
    1                       2.21
    2                       2.20
    4                       2.20
    8                       2.20
    16                      2.20
    32                      2.32
    64                      2.27
    128                     2.67
    256                     3.90
    512                     4.03
    1024                    4.17
    2048                    4.41
    4096                    4.76
    8192                    5.21
    16384                   5.96
    32768                   9.42
    65536                  13.66
    131072                 19.38
    262144                 29.41
    524288                 55.80
    1048576               100.87

    # old style gpu direct "-d cuda" fails for multi-node cases:
    [arnoldg@dt-login04 collective]$ srun osu_bcast -d cuda
    Warning: OMB could not identify the local rank of the process. This can lead to multiple processes using the same GPU. Please use the get_local_rank script in the OMB repo for this.
    Warning: OMB could not identify the local rank of the process. This can lead to multiple processes using the same GPU. Please use the get_local_rank script in the OMB repo for this.
    # OSU MPI-CUDA Broadcast Latency Test v5.9
    # Size       Avg Latency(us)
    cxil_map: write error
    cxil_map: write error
    MPICH ERROR [Rank 1] [job id 2765312.9] [Thu Dec 21 12:58:18 2023] [gpub004.delta.ncsa.illinois.edu] - Abort(339288079) (rank 1 in comm 0): Fatal error in PMPI_Bcast: Other MPI error, error stack:
    PMPI_Bcast(446)..........: MPI_Bcast(buf=0x7efbe0600000, count=1, MPI_CHAR, root=0, comm=MPI_COMM_WORLD) failed
    PMPI_Bcast(431)..........:
    MPIR_CRAY_Bcast(532).....:
    MPIR_CRAY_Bcast_Tree(162):
    MPIC_Recv(194)...........:
    MPID_Recv(380)...........:
    MPIDI_recv_unsafe(87)....:
    MPIDI_OFI_do_irecv(356)..: OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Invalid argument)
    slurmstepd: error: *** STEP 2765312.9 ON gpub003 CANCELLED AT 2023-12-21T12:58:18 ***
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    srun: error: gpub004: task 1: Exited with exit code 1
    srun: error: gpub003: task 0: Killed

    # gpudirect "-d cuda" works on a single node
    [arnoldg@dt-login04 collective]$ srun osu_bcast -d cuda
    Warning: OMB could not identify the local rank of the process. This can lead to multiple processes using the same GPU. Please use the get_local_rank script in the OMB repo for this.
    Warning: OMB could not identify the local rank of the process. This can lead to multiple processes using the same GPU. Please use the get_local_rank script in the OMB repo for this.
    Warning: OMB could not identify the local rank of the process. This can lead to multiple processes using the same GPU. Please use the get_local_rank script in the OMB repo for this.
    Warning: OMB could not identify the local rank of the process. This can lead to multiple processes using the same GPU. Please use the get_local_rank script in the OMB repo for this.
    # OSU MPI-CUDA Broadcast Latency Test v5.9
    # Size       Avg Latency(us)
    1                     159.36
    2                     158.99
    4                     158.59
    8                     159.96
    16                    160.36
    32                    160.27
    64                    160.35
    128                   160.35
    256                   160.16
    512                   147.08
    1024                  213.06
    2048                  212.95
    4096                  212.71
    8192                  202.39
    16384                 202.36
    32768                 201.82
    65536                 202.83
    131072                202.40
    262144                203.22
    524288                203.47
    1048576               205.54

    # cuda managed memory with mpi pt2pt example:
    [arnoldg@dt-login04 pt2pt]$ srun osu_bibw -d managed
    Warning: OMB could not identify the local rank of the process. This can lead to multiple processes using the same GPU. Please use the get_local_rank script in the OMB repo for this.
    Warning: OMB could not identify the local rank of the process. This can lead to multiple processes using the same GPU. Please use the get_local_rank script in the OMB repo for this.
    # OSU MPI Bi-Directional Bandwidth Test v5.9
    # Size      Bandwidth (MB/s)
    1                       1.65
    2                       3.38
    4                       6.50
    8                      13.12
    16                     26.28
    32                     52.40
    64                    104.37
    128                   207.78
    256                   406.47
    512                   843.80
    1024                 1678.60
    2048                 3311.43
    4096                 6579.15
    8192                13882.65
    16384               24171.86
    32768               28589.49
    65536               32155.62
    131072              39795.07
    262144              41833.97
    524288              43021.68
    1048576             43479.76
    2097152             43729.48
    4194304             43784.78
    [arnoldg@dt-login04 pt2pt]$
