The Delta system upgraded its high-speed network in mid-January. This upgrade involved both a hardware change in the network adaptor and a change in the network software stack.

The current Delta high-speed network is based on HPE/Cray's Slingshot 10, using a UCX-based network software stack with verbs and ConnectX5 network adaptors in an Ethernet mode.

The new Delta high-speed network is based on Slingshot 11, using CXI OFI with HPE/Cray Cassini network adaptors.

...

All three file systems are available from the login node as well as from the compute nodes.

Slingshot 11 Testing

A dedicated Delta login node and a mix of compute nodes from the cpu, gpuA100x4 and gpuA40x4 partitions are configured to support jobs. All nodes see the shared home, projects and scratch file systems. 

Jobs that are submitted via Open OnDemand are NOT able to access the Slingshot 11 test resources. If you need to use a Jupyter notebook then please see the Delta Documentation on manual Jupyter notebook set-up using the specific dt-login04 where needed. 

Login Node

Please use login.delta.ncsa.illinois.edu 

During the test period you will need to specifically log into dt-login04.delta.ncsa.illinois.edu :

ssh username@dt-login04.delta.ncsa.illinois.edu

Programming Environment

There are two types of programming environments supporting the Slingshot 11 network software stack: the standard programming environments and the CrayPE programming environment.

...

Jobs that are submitted to the scheduler from dt-login04 will automatically be tagged with an ss11 feature, which tells the scheduler to run them only on nodes with the Slingshot 11 software stack.
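To confirm that a submitted job carries the ss11 feature, one option is to inspect the job record with scontrol. The helper below is a minimal sketch: job_features is a hypothetical name, and it assumes the job record contains a Features= field, as standard Slurm output does.

```shell
# Print the value of the Features= field from `scontrol show job`
# output supplied on stdin. job_features is a hypothetical helper name.
job_features() {
  grep -o 'Features=[^ ]*' | head -n 1 | cut -d= -f2-
}

# Usage on the cluster (JOBID is the ID reported by sbatch):
#   scontrol show job "$JOBID" | job_features
```

If the tagging described above has taken effect, the printed value should include ss11.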

Slurm Partitions

All partitions are running ss11 jobs. This is now the default. 

Currently the following resources are available when submitting jobs from dt-login04.  

Partition    Nodes available for ss11 testing
cpu          16 CPU nodes : 2048 cores total
gpuA40x4     2 A40 nodes : 8 A40 gpus total
gpuA100x4    2 A100 nodes : 8 A100 gpus total
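The node counts listed above can be checked against what the scheduler currently reports. This is a small sketch using sinfo; show_partitions is a hypothetical wrapper name, and whether ss11 appears in the node feature column is an assumption about how the site labels its nodes.

```shell
# List partition name, node count, and node features for the given
# partition(s). show_partitions is a hypothetical wrapper name.
show_partitions() {
  if command -v sinfo >/dev/null 2>&1; then
    # %P = partition, %D = number of nodes, %f = node features
    sinfo --partition="$1" --format='%P %D %f'
  else
    echo "sinfo not found; run this on a Delta login node" >&2
    return 1
  fi
}

# Usage on the login node:
#   show_partitions cpu,gpuA40x4,gpuA100x4
```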

Submitting Jobs

You should be able to re-use your existing job scripts that work on the production side of Delta with some modifications.
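As a starting point, a generic batch script can be generated and then adapted. The sketch below does not reproduce the specific modifications this page refers to. make_job_script, my_account, and my_app are hypothetical names, and the resource values are placeholders to adjust for your job.

```shell
# Write a minimal Slurm batch script to stdout.
# Arguments: partition name, account name (both supplied by the caller).
# make_job_script is a hypothetical helper; my_app is a placeholder executable.
make_job_script() {
  partition=$1
  account=$2
  cat <<EOF
#!/bin/bash
#SBATCH --partition=${partition}
#SBATCH --account=${account}
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00

module load gcc openmpi/4.1.5+cuda
srun ./my_app
EOF
}

# Usage from the login node:
#   make_job_script cpu my_account > job.sh
#   sbatch job.sh
```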

...

module load gcc openmpi/4.1.5+cuda     # the default gcc/11.4.0; use if your code requires cuda-aware-mpi semantics
module load nvhpc openmpi/4.1.5+cuda   # will load the openmpi/4.1.5+cuda built with nvhpc compilers

# in testing mode
module load gcc openmpi/5.0.1+cuda     # only mpirun is supported, do not use with srun

see also: gpudirect s10 vs s11 performance

...