Delta User Guide

Last update: March 10, 2022


Status Updates and Notices

Delta is tentatively scheduled to enter production in Q2 2022.

Introduction

Delta is a dedicated, Extreme Science and Engineering Discovery Environment (XSEDE) allocated resource designed by HPE and NCSA, delivering a highly capable GPU-focused compute environment for GPU and CPU workloads.  Besides offering a mix of standard and reduced precision GPU resources, Delta also offers GPU-dense nodes with both NVIDIA and AMD GPUs.  Delta provides high performance node-local SSD scratch filesystems, as well as both standard Lustre and relaxed-POSIX parallel filesystems spanning the entire resource.

Delta's standard CPU nodes are each powered by two 64-core AMD EPYC 7763 ("Milan") processors, with 256 GB of DDR4 memory.  The Delta GPU resource has four node types: one with 4 NVIDIA A100 GPUs (40 GB HBM2 RAM each) connected via NVLINK and 1 64-core AMD EPYC 7763 ("Milan") processor, the second with 4 NVIDIA A40 GPUs (48 GB GDDR6 RAM) connected via PCIe 4.0 and 1 64-core AMD EPYC 7763 ("Milan") processor, the third with 8 NVIDIA A100 GPUs in a dual socket AMD EPYC 7763 (128-cores per node) node with 2 TB of DDR4 RAM and NVLINK,  and the fourth with 8 AMD MI100 GPUs (32GB HBM2 RAM each) in a dual socket AMD EPYC 7763 (128-cores per node) node with 2 TB of DDR4 RAM and PCIe 4.0. 

Delta has 124 standard CPU nodes, 100 4-way A100-based GPU nodes, 100 4-way A40-based GPU nodes, 5 8-way A100-based GPU nodes, and 1 8-way MI100-based GPU node.  Every Delta node has high-performance node-local SSD storage (800 GB for CPU nodes, 1.6 TB for GPU nodes), and is connected to the 7 PB Lustre parallel filesystem via the high-speed interconnect.  The Delta resource uses the SLURM workload manager for job scheduling.  

Delta supports the XSEDE core software stack, including remote login, remote computation, data movement, science workflow support, and science gateway support toolkits.


Figure 1. Delta System

Delta is supported by the National Science Foundation under Grant No. OAC-2005572.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Delta is now accepting proposals.

Account Administration

  • Setting up Your Account
  • Allocation Information 

Configuring Your Account

  • default shell, changing your shell, changing your password
  • environment variables
  • using Modules (or other environment manager)

System Architecture

Delta is designed to help applications transition from CPU-only to GPU or hybrid CPU-GPU codes. Delta has some important architectural features to facilitate new discovery and insight:

  • a single processor architecture (AMD) across all node types: CPU and GPU
  • support for NVIDIA A100 MIG GPU partitioning allowing for fractional use of the A100s if your workload isn't able to exploit an entire A100 efficiently
  • ray tracing hardware support from the NVIDIA A40 GPUs
  • 9 large memory (2 TB) nodes 
  • a low latency and high bandwidth HPE/Cray Slingshot interconnect between compute nodes
  • Lustre for home, projects and scratch file systems
  • support for relaxed and non-POSIX I/O
  • shared-node jobs, down to a single core and a single MIG GPU slice
  • resources for persistent services such as Gateways, Open OnDemand, and Data Transport nodes
  • a unique AMD MI100 resource

Model Compute Nodes

The Delta compute ecosystem is composed of 5 node types: dual-socket CPU-only compute nodes, single-socket 4-way NVIDIA A100 GPU compute nodes, single-socket 4-way NVIDIA A40 GPU compute nodes, dual-socket 8-way NVIDIA A100 GPU compute nodes, and a dual-socket 8-way AMD MI100 GPU compute node. The CPU-only and 4-way GPU nodes have 256 GB of RAM per node while the 8-way GPU nodes have 2 TB of RAM. The CPU-only nodes have 0.8 TB of local storage while all GPU nodes have 1.6 TB of local storage.

Table. CPU Compute Node Specifications

| Specification | Value |
|---|---|
| Number of nodes | 124 |
| CPU | AMD Milan (PCIe Gen4) |
| Sockets per node | 2 |
| Cores per socket | 64 |
| Cores per node | 128 |
| Hardware threads per core | 1 |
| Hardware threads per node | 128 |
| Clock rate (GHz) | ~ 2.45 |
| RAM (GB) | 256 |
| Cache (MB) L1/L2/L3 | 2/32/256 |
| Local storage (TB) | 0.8 |

Table. 4-way NVIDIA A40 GPU Compute Node Specifications 

| Specification | Value |
|---|---|
| Number of nodes | 100 |
| GPU | NVIDIA A40 (Vendor page) |
| GPUs per node | 4 |
| GPU Memory (GB) | 48 GDDR6 with ECC |
| CPU | AMD Milan |
| CPU sockets per node | 1 |
| Cores per socket | 64 |
| Cores per node | 64 |
| Hardware threads per core | 1 |
| Hardware threads per node | 64 |
| Clock rate (GHz) | ~ 2.45 |
| RAM (GB) | 256 |
| Cache (MB) L1/L2/L3 | 2/32/256 |
| Local storage (TB) | 1.6 |

Table. 4-way NVIDIA A100 GPU Compute Node Specifications 

| Specification | Value |
|---|---|
| Number of nodes | 100 |
| GPU | NVIDIA A100 (Vendor page) |
| GPUs per node | 4 |
| GPU Memory (GB) | 40 |
| CPU | AMD Milan |
| CPU sockets per node | 1 |
| Cores per socket | 64 |
| Cores per node | 64 |
| Hardware threads per core | 1 |
| Hardware threads per node | 64 |
| Clock rate (GHz) | ~ 2.45 |
| RAM (GB) | 256 |
| Cache (MB) L1/L2/L3 | 2/32/256 |
| Local storage (TB) | 1.6 |

Table. 8-way NVIDIA A100 GPU Large Memory  Compute Node Specifications 

| Specification | Value |
|---|---|
| Number of nodes | 5 |
| GPU | NVIDIA A100 (Vendor page) |
| GPUs per node | 8 |
| GPU Memory (GB) | 40 |
| CPU | AMD Milan |
| CPU sockets per node | 2 |
| Cores per socket | 64 |
| Cores per node | 128 |
| Hardware threads per core | 1 |
| Hardware threads per node | 128 |
| Clock rate (GHz) | ~ 2.45 |
| RAM (GB) | 2,048 |
| Cache (MB) L1/L2/L3 | 2/32/256 |
| Local storage (TB) | 1.6 |

Table. 8-way AMD MI100 GPU Large Memory Compute Node Specifications 

| Specification | Value |
|---|---|
| Number of nodes | 1 |
| GPU | AMD MI100 (Vendor page) |
| GPUs per node | 8 |
| GPU Memory (GB) | 32 |
| CPU | AMD Milan |
| CPU sockets per node | 2 |
| Cores per socket | 64 |
| Cores per node | 128 |
| Hardware threads per core | 1 |
| Hardware threads per node | 128 |
| Clock rate (GHz) | ~ 2.45 |
| RAM (GB) | 2,048 |
| Cache (MB) L1/L2/L3 | 2/32/256 |
| Local storage (TB) | 1.6 |

Login Nodes

Three login nodes provide interactive support for code compilation.

Specialized Nodes

Delta will support data transfer nodes or nodes in support of other services.

Network

Delta will be connected to the NPCF core router and exit infrastructure via two 100 Gbps connections. NCSA's 400 Gbps+ of WAN connectivity will carry traffic to and from users on an optimal peering.

Delta resources will be interconnected with HPE/Cray's 100 Gbps/200 Gbps Slingshot interconnect.

File Systems

Note: Users of Delta have access to 3 file systems at the time of system launch. A fourth, relaxed-POSIX file system will be made available at a later date.

Delta
The Delta storage infrastructure provides users with their $HOME and $SCRATCH areas.  These file systems are mounted across all Delta systems and are accessible on the Delta DTN Endpoints.  The aggregate performance of this subsystem is 70 GB/s and it has 6 PB of usable space.  These file systems run Lustre via DDN's ExaScaler 6 stack (Lustre 2.14 based).

Hardware:
DDN SFA7990XE (Quantity: 3), each unit contains

  • One additional SS9012 enclosure
  • 168 x 16TB SAS Drives
  • 7 x 1.92TB SAS SSDs

Future Hardware:
An additional pool of NVMe flash from DDN will be installed in early Spring 2022.  This flash will initially be deployed as a tier for "hot" data in scratch.  This subsystem will have an aggregate performance of 600 GB/s and 3 PB of capacity. As noted above, this subsystem will transition to a relaxed-POSIX namespace file system; the timeline will be announced as updates become available.

Taiga
Taiga is NCSA’s global file system which provides users with their $WORK area.  This file system is mounted across all Delta systems at /taiga (also /taiga/nsf/delta is bind mounted at /projects) and is accessible on both the Delta and Taiga DTN endpoints.  For NCSA & Illinois researchers, Taiga is also mounted on HAL and Radiant.  This storage subsystem has an aggregate performance of 140GB/s and 1PB of its capacity allocated to users of the Delta system. /taiga is a Lustre file system running DDN Exascaler.  

Hardware:
DDN SFA400NVXE (Quantity: 2), each unit contains

  • 4 x SS9012 enclosures
  • NVME for metadata and small files

DDN SFA18XE (Quantity: 1), each unit contains

  • 10 x SS9012 enclosures


| File System | Quota | Snapshots | Purged | Key Features |
|---|---|---|---|---|
| $HOME | 25 GB. 400,000 files per user. | No/TBA | No | Area for software, scripts, job files, etc. NOT intended as a source/destination for I/O during jobs |
| $WORK | 500 GB. Up to 1-25 TB by allocation request | No/TBA | No | Area for shared data for a project, common data sets, software, results, etc. |
| $SCRATCH | 1000 GB. Up to 1-100 TB by allocation request. | No | Yes; files older than 30 days (access time) | Area for computation, largest allocations, where I/O from jobs should occur |
| $LOCAL_SCR (namespace mapped to /tmp) | TBD | No | After each job | Locally attached disk for fast small file I/O. |
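
Because $SCRATCH is purged by access time, it can be useful to check which files are approaching the limit. The following is a minimal sketch rather than an official Delta tool: the function name list_purge_candidates is hypothetical, and it assumes GNU find and a file system that reports access times accurately.

```shell
# List regular files under a directory whose last access is older than a
# given number of days (default 30, the purge age quoted for $SCRATCH).
# Note: find ignores fractional days, so "-atime +30" matches files last
# accessed at least 31 full days ago.
list_purge_candidates() {
  local dir="$1" days="${2:-30}"
  find "${dir}" -type f -atime +"${days}" -print
}
```

For example, list_purge_candidates /scratch/XYZ_abcd/auser 25 would list files not read for more than 25 days, leaving some margin before the purge age is reached.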

Accessing the System

Direct Access

Direct access to the Delta login nodes can be obtained using ssh. The login nodes serve both the CPU and GPU resources on Delta.

  • ssh username@login.delta.ncsa.illinois.edu
    or
  • ssh -l username login.delta.ncsa.illinois.edu

If needed, XSEDE users can look up their local username at https://portal.xsede.org/group/xup/accounts. If you need to set an NCSA password for direct access, please contact help@ncsa.illinois.edu for assistance.

Use of ssh-key pairs is disabled for general use. Please contact NCSA Help at help@ncsa.illinois.edu for key-pair use by Gateway allocations.

XSEDE Single Sign-On Hub

XSEDE users can also access Delta via the XSEDE Single Sign-On Hub.

When reporting a problem to the help desk, please execute the gsissh command with the “-vvv” option and include the verbose output in your problem description.

Citizenship

You share Delta with thousands of other users, and what you do on the system affects others. Exercise good citizenship to ensure that your activity does not adversely impact the system and the research community with whom you share it. Here are some rules of thumb.

  • Don’t run jobs on the login nodes
  • Don’t stress filesystem with known-harmful access patterns (many thousands of small files in a single directory)
  • submit an informative help-desk ticket
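
One common way to avoid the many-small-files pattern is to bundle the files into a single archive before placing them on a shared file system. The sketch below illustrates the idea with temporary, hypothetical paths and file names; it is not a Delta-specific requirement.

```shell
# Create 100 small sample files in a temporary working directory.
workdir="$(mktemp -d)"
mkdir -p "${workdir}/small_files"
for i in $(seq 1 100); do
  echo "sample ${i}" > "${workdir}/small_files/part_${i}.txt"
done

# Bundle them into one compressed archive: one large file is much kinder
# to a parallel file system than many tiny ones.
tar -czf "${workdir}/small_files.tar.gz" -C "${workdir}" small_files

# Verify the archive contents by counting the archived sample files.
file_count="$(tar -tzf "${workdir}/small_files.tar.gz" | grep -c 'part_')"
echo "archived ${file_count} files into a single archive"

# Clean up the temporary working directory.
rm -rf "${workdir}"
```

Inside a job, the archive can be unpacked onto the node-local disk so that the small-file traffic stays off the shared file systems.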

Managing and Transferring Files

File Systems

Each user will have a home directory, $HOME, located at /u/$USER. For each allocated project that a user is a member of, there will be symbolic links in $HOME to that project's shared space on the $PROJECT and $SCRATCH file systems.

For example, a user (with username auser) who has an allocated project XYZ_abcd of type XYZ (e.g. XSEDE, ILL, etc) with a local project serial code abcd will see the following entries in their $HOME and entries in the project and scratch file systems. To determine the mapping of XSEDE project to local project please use the TBD command.

$ ls -ld /u/$USER
drwxrwx---+ 12 root root 12345 Feb 21 11:54 /u/$USER

$ ls -l /u/$USER
/u/auser:
total 0
lrwxrwxrwx. 1 root root 24 Feb 21 11:54 project_abcd -> /projects/XYZ_abcd/auser
lrwxrwxrwx. 1 root root 23 Feb 21 11:54 scratch_abcd -> /scratch/XYZ_abcd/auser
...

$ ls -ld /projects/XYZ_abcd
drwxrws---+  45 root   XYZ_abcd      4096 Feb 21 11:54 /projects/XYZ_abcd

$ ls -l /projects/XYZ_abcd
total 0
drwxrws---+ 2 auser XYZ_abcd 6 Feb 21 11:54 auser
drwxrws---+ 2 buser XYZ_abcd 6 Feb 21 11:54 buser
...

$ ls -ld /scratch/XYZ_abcd
drwxrws---+  45 root   XYZ_abcd      4096 Feb 21 11:54 /scratch/XYZ_abcd

$ ls -l /scratch/XYZ_abcd
total 0
drwxrws---+ 2 auser XYZ_abcd 6 Feb 21 11:54 auser
drwxrws---+ 2 buser XYZ_abcd 6 Feb 21 11:54 buser
...


The symbolic links provide convenient short-cuts to project specific locations. 
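
To see where the links point without reading the long listing, a small helper along these lines can be used. It is a sketch only: the function name list_allocation_links is hypothetical, and it assumes the links follow the project_* and scratch_* naming shown in the example above.

```shell
# Print "link_name -> target" for every project_* and scratch_* symbolic
# link in a home directory (defaults to $HOME).
list_allocation_links() {
  local home_dir="${1:-$HOME}"
  local link
  for link in "${home_dir}"/project_* "${home_dir}"/scratch_*; do
    # Skip unmatched glob patterns and anything that is not a symlink.
    [ -L "${link}" ] || continue
    printf '%s -> %s\n' "$(basename "${link}")" "$(readlink "${link}")"
  done
}
```

Running list_allocation_links with no argument inspects your own $HOME.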

  • Detail any pertinent environment variables, e.g., $HOME, $WORK, and any built-in aliases.
  • Tips on backups/storage

Transferring your Files

Discuss methods of transferring files and provide command-line examples

Sharing Files with Collaborators

Building Software

The Delta programming environment supports the GNU, AMD (AOCC), Intel and NVIDIA HPC compilers. Support for the HPE/Cray Programming environment is forthcoming. 

Modules provide access to the compiler + MPI environment. 

The default environment is the GCC 11.2.0 compiler + OpenMPI with support for cuda and gdrcopy. nvcc is in the cuda module and is loaded by default.

AMD recommended compiler flags for GNU, AOCC, and Intel compilers for Milan processors can be found in the AMD Compiler Options Quick Reference Guide for Epyc 7xx3 processors.

Serial

To build (compile and link) a serial program in Fortran, C, and C++:

| gcc | aocc | nvhpc |
|---|---|---|
| gfortran myprog.f | flang myprog.f | nvfortran myprog.f |
| gcc myprog.c | clang myprog.c | nvc myprog.c |
| g++ myprog.cc | clang++ myprog.cc | nvc++ myprog.cc |

MPI

To build (compile and link) an MPI program in Fortran, C, and C++:

| MPI Implementation | Modulefiles for MPI/Compiler | Build Commands |
|---|---|---|
| OpenMPI (Home Page / Documentation) | aocc/3.2.0 openmpi; gcc/11.2.0 openmpi; nvhpc/22.2 openmpi | Fortran 77: mpif77 myprog.f; Fortran 90: mpif90 myprog.f90; C: mpicc myprog.c; C++: mpic++ myprog.cc |
| TBD | TBD | |


OpenMP

To build an OpenMP program, use the -fopenmp / -mp option:

| gcc | aocc | nvhpc |
|---|---|---|
| gfortran -fopenmp myprog.f | flang -fopenmp myprog.f | nvfortran -mp myprog.f |
| gcc -fopenmp myprog.c | clang -fopenmp myprog.c | nvc -mp myprog.c |
| g++ -fopenmp myprog.cc | clang++ -fopenmp myprog.cc | nvc++ -mp myprog.cc |
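
As a quick check that the OpenMP flags work in your environment, the sketch below writes a tiny test program and builds it with the GNU compiler. The file name hello_omp.c and the temporary build directory are hypothetical, and the build step is skipped if gcc is not on the PATH.

```shell
# Work in a temporary directory so nothing in the current directory is touched.
build_dir="$(mktemp -d)"

# Write a minimal OpenMP test program (hypothetical file name).
cat > "${build_dir}/hello_omp.c" <<'EOF'
#include <stdio.h>
#include <omp.h>

int main(void)
{
  #pragma omp parallel
  {
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}
EOF

# Build and run with GCC when it is available. With NVHPC, the matching
# command from the list above would be: nvc -mp hello_omp.c
if command -v gcc >/dev/null 2>&1; then
  if gcc -fopenmp "${build_dir}/hello_omp.c" -o "${build_dir}/hello_omp"; then
    OMP_NUM_THREADS=4 "${build_dir}/hello_omp" || echo "run failed" >&2
  else
    echo "build failed; check that a compiler module is loaded" >&2
  fi
fi
```

Each thread prints one line, so a successful run with OMP_NUM_THREADS=4 produces four lines in an unspecified order.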

Hybrid MPI/OpenMP

To build an MPI/OpenMP hybrid program, use the -fopenmp / -mp option with the MPI compiling commands:

| GCC | PGI/NVHPC |
|---|---|
| mpif77 -fopenmp myprog.f | mpif77 -mp myprog.f |
| mpif90 -fopenmp myprog.f90 | mpif90 -mp myprog.f90 |
| mpicc -fopenmp myprog.c | mpicc -mp myprog.c |
| mpic++ -fopenmp myprog.cc | mpic++ -mp myprog.cc |

Cray xthi.c sample code

Document - XC Series User Application Placement Guide CLE6..0UP01 S-2496 | HPE Support

This code can be compiled using the methods shown above.  The code will appear in some of the batch script examples below to demonstrate core placement options.

xthi.c
#define _GNU_SOURCE

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sched.h>
#include <mpi.h>
#include <omp.h>

/* Borrowed from util-linux-2.13-pre7/schedutils/taskset.c */
static char *cpuset_to_cstr(cpu_set_t *mask, char *str)
{
  char *ptr = str;
  int i, j, entry_made = 0;
  for (i = 0; i < CPU_SETSIZE; i++) {
    if (CPU_ISSET(i, mask)) {
      int run = 0;
      entry_made = 1;
      for (j = i + 1; j < CPU_SETSIZE; j++) {
        if (CPU_ISSET(j, mask)) run++;
        else break;
      }
      if (!run)
        sprintf(ptr, "%d,", i);
      else if (run == 1) {
        sprintf(ptr, "%d,%d,", i, i + 1);
        i++;
      } else {
        sprintf(ptr, "%d-%d,", i, i + run);
        i += run;
      }
      while (*ptr != 0) ptr++;
    }
  }
  ptr -= entry_made;
  *ptr = 0;
  return(str);
}

int main(int argc, char *argv[])
{
  int rank, thread;
  cpu_set_t coremask;
  char clbuf[7 * CPU_SETSIZE], hnbuf[64];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  memset(clbuf, 0, sizeof(clbuf));
  memset(hnbuf, 0, sizeof(hnbuf));
  (void)gethostname(hnbuf, sizeof(hnbuf));
  #pragma omp parallel private(thread, coremask, clbuf)
  {
    thread = omp_get_thread_num();
    (void)sched_getaffinity(0, sizeof(coremask), &coremask);
    cpuset_to_cstr(&coremask, clbuf);
    #pragma omp barrier
    printf("Hello from rank %d, thread %d, on %s. (core affinity = %s)\n",
            rank, thread, hnbuf, clbuf);
  }
  MPI_Finalize();
  return(0);
}

A version of xthi is also available from ORNL (affinity/Xthi.c in the repository below):

% git clone https://github.com/olcf/XC30-Training.git


OpenACC

To build an OpenACC program, use the -acc option; add the -mp option for multi-threaded builds:

| Non-multithreaded | Multithreaded |
|---|---|
| nvfortran -acc myprog.f | nvfortran -acc -mp myprog.f |
| nvc -acc myprog.c | nvc -acc -mp myprog.c |
| nvc++ -acc myprog.cc | nvc++ -acc -mp myprog.cc |


  • list compilers and recommendations
  • any architecture-specific flags
  • how to build 3rd party software in your account

Software

  • lmod
  • spack/EasyBuild
  • NVIDIA NGC containers
  • OpenCL
  • CUDA
  • URL to XSEDE software inventory 

modules/lmod

Delta provides two sets of modules and a variety of compilers in each set.  The default environment is modtree/gpu, which loads a recent version of the GNU compilers, the OpenMPI implementation of MPI, and cuda.  The environment with gpu support will build binaries that run on both the gpu nodes (with cuda) and cpu nodes (potentially with warning messages because those nodes lack cuda drivers).  For situations where the same versions of software need to be deployed to both gpu and cpu nodes but with separate builds, the modtree/cpu environment provides the same default compiler and MPI but without cuda.  Use module spider package_name to search for software in lmod and see the steps to load it for your environment.

module (lmod) commands and examples

module list

(display the currently loaded modules)

$ module list

Currently Loaded Modules:
  1) gcc/11.2.0   3) openmpi/4.1.2   5) modtree/gpu
  2) ucx/1.11.2   4) cuda/11.6.1

module load <package_name>

(loads a package or metamodule such as modtree/gpu or netcdf-c)

$ module load modtree/cpu

Due to MODULEPATH changes, the following have been reloaded:
  1) gcc/11.2.0     2) openmpi/4.1.2     3) ucx/1.11.2

The following have been reloaded with a version change:
  1) modtree/gpu => modtree/cpu

module spider <package_name>

(finds modules and displays the ways to load them)

$ module spider openblas

----------------------------------------------------------------------------
  openblas: openblas/0.3.20
----------------------------------------------------------------------------

    You will need to load all module(s) on any one of the lines below before the
 "openblas/0.3.20" module is available to load.

      aocc/3.2.0
      gcc/11.2.0
 
    Help:
      OpenBLAS: An optimized BLAS library

see also: User Guide for Lmod

Please open a service request ticket by sending email to help@ncsa.illinois.edu for help with software not currently installed on the Delta system. For single-user or single-project use cases, the preference is for the user to use the spack software package manager to install software locally against the system spack installation as documented <here>. Delta support staff are available to provide limited assistance. For general installation requests, the Delta project office will review each request for broad use and installation effort.

Launching Applications (TBD)

  • Launching One Serial Application
  • Launching One Multi-Threaded Application
  • Launching One MPI Application
  • Launching One Hybrid (MPI+Threads) Application
  • More Than One Serial Application in the Same Job
  • MPI Applications One at a Time
  • More than One MPI Application Running Concurrently
  • More than One OpenMP Application Running Concurrently

Running Jobs

Job Accounting

The charge unit for Delta is the Service Unit (SU). This corresponds to the equivalent use of one compute core utilizing less than or equal to 2G of memory for one hour, or 1 GPU or fractional GPU using less than the corresponding amount of memory or cores for 1 hour (see table below). Keep in mind that your charges are based on the resources that are reserved for your job and don't necessarily reflect how the resources are used. Charges are based on either the number of cores or the fraction of the memory requested, whichever is larger. The minimum charge for any job is 1 SU.

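Service Unit Equivalence

| Node Type | Cores | GPU Fraction | Host Memory |
|---|---|---|---|
| CPU Node | 1 | N/A | 2 GB |
| GPU Node: Quad A100 | 2 | 1/7 A100 | 8 GB |
| GPU Node: Quad A40 | 16 | 1 A40 | 64 GB |
| GPU Node: 8-way A100 | 2 | 1/7 A100 | 32 GB |
| GPU Node: 8-way MI100 | 16 | 1 MI100 | 256 GB |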

Please note that a weighting factor will discount the charge for the reduced-precision A40 nodes, as well as the novel AMD MI100 based node - this will be documented through the XSEDE SU converter.

Job Accounting Considerations

  • A node-exclusive job that runs on a CPU compute node for one hour will be charged 128 SUs (128 cores x 1 hour)
  • A node-exclusive job that runs on a 4-way GPU node for one hour will be charged 4 SUs (4 GPU x 1 hour)
  • A node-exclusive job that runs on an 8-way GPU node for one hour will be charged 8 SUs (8 GPU x 1 hour)
  • A shared job that runs on an A100 node will be charged for the fractional usage of the A100 (e.g., using 1/7 of an A100 for one hour will be 1/7 GPU x 1 hour, or 1/7 SU per hour, except that the first hour will be 1 SU, the minimum job charge).
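
The CPU-node rule above (1 SU per core-hour with up to 2 GB per core, charged on cores or memory share, whichever is larger, with a 1 SU minimum) can be sketched as a small estimator. The function name su_charge_cpu is hypothetical, and the handling of fractional memory units is an assumption; the actual accounting may round differently and may apply partition charge factors.

```shell
# Estimate the SU charge for a job on a CPU node.
# Arguments: requested cores, requested memory in GB, wall time in hours.
su_charge_cpu() {
  local cores="$1" mem_gb="$2" hours="$3"
  awk -v c="${cores}" -v m="${mem_gb}" -v h="${hours}" 'BEGIN {
    mem_units = m / 2                        # one SU covers up to 2 GB
    units = (c > mem_units) ? c : mem_units  # charge the larger of the two
    su = units * h
    if (su < 1) su = 1                       # minimum charge is 1 SU
    printf "%.2f\n", su
  }'
}

su_charge_cpu 128 256 1   # full CPU node for one hour
su_charge_cpu 4 64 2      # memory-dominated: 64 GB counts as 32 units
```

The second call shows why a small core count does not guarantee a small charge: requesting 64 GB is billed as 32 core-equivalents under this reading of the rule.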

Accessing the Compute Nodes

Delta implements the Slurm batch environment to manage access to the compute nodes.  Use the Slurm commands to run batch jobs or for interactive access to compute nodes.  See: https://slurm.schedmd.com/quickstart.html

  • batch jobs
  • interactive sessions with compute node(s) 
  • ssh from a login node directly to a compute node
    • available while you have a running batch or interactive job/session
      • show an example of ssh access, and slurm commands to find the job node(s)

Job Scheduler

https://slurm.schedmd.com/quickstart.html

Describe the job scheduler & scheduling algorithms

Most, if not all, XSEDE resources are running Slurm and this documentation already exists in some form.

https://slurm.schedmd.com/pdfs/summary.pdf

Partitions (Queues)

Describe current partitions.

Table. Delta Production Partitions/Queues

| Partition/Queue | Node Type | Max Nodes per Job | Max Duration | Max Jobs in Queue* | Charge Factor |
|---|---|---|---|---|---|
| cpu | CPU | TBD | 24 hr / 48 hr | TBD | 1.0 |
| cpu-interactive | CPU | TBD | 30 min | TBD | 2.0 |
| gpuA100x4 | quad-A100 | TBD | 24 hr / 48 hr | TBD | 1.0 |
| gpuA100x4-interactive | quad-A100 | TBD | 30 min | TBD | 2.0 |
| gpuA100x8 | octa-A100 | TBD | 24 hr / 48 hr | TBD | 1.0 |
| gpuA100x8-interactive | octa-A100 | TBD | 30 min | TBD | 2.0 |
| gpuA40x4 | quad-A40 | TBD | 24 hr / 48 hr | TBD | 0.6 |
| gpuA40x4-interactive | quad-A40 | TBD | 30 min | TBD | 1.2 |
| gpuMI100x8 | octa-MI100 | TBD | 24 hr / 48 hr | TBD | 1.0 |
| gpuMI100x8-interactive | octa-MI100 | TBD | 30 min | TBD | 2.0 |

Node Policies

Node-sharing is the default for jobs. Node-exclusive mode can be obtained by specifying all the consumable resources for that node type.
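
As an illustration of requesting all consumable resources, the directives below ask for every core and all of the memory on one CPU node. This is a sketch under assumptions: it uses the cpu partition and the 128-core CPU node described earlier, and it relies on Slurm's convention that --mem=0 requests all memory on the node. GPU nodes would additionally need all of their GPUs requested.

```shell
#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --account=account_name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128   # all 128 cores of a CPU node
#SBATCH --cpus-per-task=1
#SBATCH --mem=0                 # Slurm convention: all memory on the node
#SBATCH --time=00:10:00
```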

GPU NVIDIA MIG (GPU slicing) for the A100 will be supported.

Pre-emptive jobs will be supported.

Interactive Sessions

Describe any tools for running interactive jobs on the compute nodes.

  • built-in tools for running interactive jobs, e.g. PSC’s interact, TACC’s idev

Sample Job Scripts

Sample job scripts are the most requested documentation.

Provide sample job scripts for common job type scenarios. 

  • Serial jobs

    serial example script
    $ cat job.slurm
    #!/bin/bash
    #SBATCH --mem=16g
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=1    # <- match to OMP_NUM_THREADS
    #SBATCH --partition=cpu      # <- or one of: gpuA100x4 gpuA40x4 gpuA100x8 gpuMI100x8
    #SBATCH --account=account_name
    #SBATCH --job-name=myjobtest
    #SBATCH --time=00:10:00      # hh:mm:ss for the job
    ### GPU options ###
    ##SBATCH --gpus-per-node=2
    ##SBATCH --gpu-bind=none     # <- or closest
     
    module purge # drop modules and explicitly load the ones needed
                 # (good job metadata and reproducibility)
    module load python  # ... or any appropriate modules
    module list  # job documentation and metadata
    echo "job is starting on `hostname`"
    srun python3 myprog.py
  • MPI  

    mpi example script
    #!/bin/bash
    #SBATCH --mem=16g
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=32
    #SBATCH --cpus-per-task=1    # <- match to OMP_NUM_THREADS
    #SBATCH --partition=cpu      # <- or one of: gpuA100x4 gpuA40x4 gpuA100x8 gpuMI100x8
    #SBATCH --account=account_name
    #SBATCH --job-name=mympi
    #SBATCH --time=00:10:00      # hh:mm:ss for the job
    ### GPU options ###
    ##SBATCH --gpus-per-node=2
    ##SBATCH --gpu-bind=none     # <- or closest
    
    module purge # drop modules and explicitly load the ones needed
                 # (good job metadata and reproducibility)
    module load gcc/11.2.0 openmpi  # ... or any appropriate modules
    module list  # job documentation and metadata
    echo "job is starting on `hostname`"
    srun osu_reduce
  • OpenMP   

    openmp example script
    #!/bin/bash
    #SBATCH --mem=16g
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=32   # <- match to OMP_NUM_THREADS
    #SBATCH --partition=cpu      # <- or one of: gpuA100x4 gpuA40x4 gpuA100x8 gpuMI100x8
    #SBATCH --account=account_name
    #SBATCH --job-name=myopenmp
    #SBATCH --time=00:10:00      # hh:mm:ss for the job
    ### GPU options ###
    ##SBATCH --gpus-per-node=2
    ##SBATCH --gpu-bind=none     # <- or closest
    
    module purge # drop modules and explicitly load the ones needed
                 # (good job metadata and reproducibility)
    module load gcc/11.2.0  # ... or any appropriate modules
    module list  # job documentation and metadata
    echo "job is starting on `hostname`"
    export OMP_NUM_THREADS=32
    srun stream_gcc
  • Hybrid (MPI + OpenMP or MPI+X)

    mpi+x example script
    #!/bin/bash
    #SBATCH --mem=16g
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4
    #SBATCH --cpus-per-task=4    # <- match to OMP_NUM_THREADS
    #SBATCH --partition=cpu      # <- or one of: gpuA100x4 gpuA40x4 gpuA100x8 gpuMI100x8
    #SBATCH --account=account_name
    #SBATCH --job-name=mympi+x
    #SBATCH --time=00:10:00      # hh:mm:ss for the job
    ### GPU options ###
    ##SBATCH --gpus-per-node=2
    ##SBATCH --gpu-bind=none     # <- or closest
    
    module purge # drop modules and explicitly load the ones needed
                 # (good job metadata and reproducibility)
    module load gcc/11.2.0 openmpi # ... or any appropriate modules
    module list  # job documentation and metadata
    echo "job is starting on `hostname`"
    export OMP_NUM_THREADS=4
    srun xthi
  • Parametric / Array / HTC jobs
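
For the parametric / array case, a job array script might look like the following sketch. The --array option and SLURM_ARRAY_TASK_ID are standard Slurm features; the program name myprog, the helper input_for_task, and the input file naming scheme are hypothetical.

```shell
#!/bin/bash
#SBATCH --mem=2g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=cpu
#SBATCH --account=account_name
#SBATCH --job-name=myarray
#SBATCH --time=00:10:00      # hh:mm:ss for each array task
#SBATCH --array=0-9          # ten independent tasks with indices 0..9

# Map an array index to an input file name (hypothetical naming scheme).
input_for_task() {
  printf 'input_%03d.dat' "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID in each array task; fall back to 0 so the
# script can be dry-run outside of a job.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
input_file="$(input_for_task "${task_id}")"
echo "array task ${task_id} processes ${input_file}"

# Launch the (hypothetical) program only when running under Slurm.
if command -v srun >/dev/null 2>&1; then
  srun ./myprog "${input_file}"
fi
```

Each array member runs the same script with a different index, so ten inputs are processed by ten independently scheduled tasks.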

Job Management 

Batch jobs are submitted through a job script using the sbatch command. Job scripts generally start with a series of SLURM directives that describe requirements of the job such as number of nodes, wall time required, etc… to the batch system/scheduler (SLURM directives can also be specified as options on the sbatch command line; command line options take precedence over those in the script). The rest of the batch script consists of user commands.

The syntax for sbatch is:

sbatch [list of sbatch options] script_name

The main sbatch options are listed below.  Refer to the sbatch man page for the full list of options.

  • The common resource options are:
    --time=time

    time=maximum wall clock time (d-hh:mm:ss) [default: maximum limit of the queue (partition) submitted to]

    --nodes=n

    n=number of N-core nodes [default: 1 node]

    --ntasks=p Total number of cores for the batch job

    --ntasks-per-node=p Number of cores per node

    p=how many cores (ntasks) per job or per node (ntasks-per-node) to use (1 through 128) [default: 1 core]

    Examples:
    --time=00:30:00
    --nodes=2
    --ntasks=256

    or

    --time=00:30:00
    --nodes=2
    --ntasks-per-node=128
     

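    Memory: The compute nodes have at least 256 GB of memory. Values for --mem (per node) and --mem-per-cpu are in megabytes unless a unit suffix is given.

    Example:
    --time=00:30:00
    --nodes=2
    --ntasks=256
    --mem=118000

    or

    --time=00:30:00
    --nodes=2
    --ntasks-per-node=64
    --mem-per-cpu=7375

    Note that 64 tasks x 7375 MB is about 472 GB per node, so the second form fits only on the 2 TB nodes.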


squeue/scontrol/sinfo

Commands that display batch job and partition information.

| Slurm Example Command | Description |
|---|---|
| squeue -a | List the status of all jobs on the system. |
| squeue -u $USER | List the status of all your jobs in the batch system. |
| squeue -j JobID | List nodes allocated to a running job in addition to basic information. |
| scontrol show job JobID | List detailed information on a particular job. |
| sinfo -a | List summary information on all the partitions. |

See the manual (man) pages for other available options.


Useful Batch Job Environment Variables

| Description | Slurm Environment Variable | Detail Description |
|---|---|---|
| JobID | $SLURM_JOB_ID | Job identifier assigned to the job |
| Job Submission Directory | $SLURM_SUBMIT_DIR | By default, jobs start in the directory that the job was submitted from. So the "cd $SLURM_SUBMIT_DIR" command is not needed. |
| Machine (node) list | $SLURM_NODELIST | Variable name that contains the list of nodes assigned to the batch job |
| Array JobID | $SLURM_ARRAY_JOB_ID, $SLURM_ARRAY_TASK_ID | Each member of a job array is assigned a unique identifier |

See the sbatch man page for additional environment variables available.
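
These variables are handy for recording job context at the top of a batch script. The sketch below is illustrative: the function name job_context is hypothetical, and SLURM_CPUS_PER_TASK is a standard Slurm variable that is not listed in the table above (Slurm sets it only when --cpus-per-task is specified). Defaults are supplied so the snippet also runs outside of a job.

```shell
# Print the job context, with fallbacks for use outside of a Slurm job.
job_context() {
  echo "job id: ${SLURM_JOB_ID:-not in a job}"
  echo "submit dir: ${SLURM_SUBMIT_DIR:-$PWD}"
  echo "node list: ${SLURM_NODELIST:-$(hostname)}"
  echo "array task id: ${SLURM_ARRAY_TASK_ID:-none}"
}

# Match the OpenMP thread count to --cpus-per-task instead of hard-coding it.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

job_context
echo "OpenMP threads: ${OMP_NUM_THREADS}"
```

Deriving OMP_NUM_THREADS from the scheduler keeps the thread count and the #SBATCH --cpus-per-task request from drifting apart when one of them is edited.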

srun

The srun command initiates an interactive job on the compute nodes.

For example, the following command:

srun --time=00:30:00 --nodes=1 --ntasks-per-node=64 --pty /bin/bash

will run an interactive job in the default queue with a wall clock limit of 30 minutes, using one node and 64 cores per node. You can also use other sbatch options such as those documented above.

After you enter the command, you will have to wait for SLURM to start the job. As with any job, your interactive job will wait in the queue until the specified number of nodes is available. If you specify a small number of nodes for smaller amounts of time, the wait should be shorter because your job will backfill among larger jobs. You will see something like this:

srun: job 123456 queued and waiting for resources

Once the job starts, you will see:

srun: job 123456 has been allocated resources

and will be presented with an interactive shell prompt on the launch node. At this point, you can use the appropriate command to start your program.

When you are done with your runs, you can use the exit command to end the job.

scancel

The scancel command deletes a queued job or kills a running job.

  • scancel JobID deletes/kills a job.

Refunds

Refunds are considered, when appropriate, for jobs that failed due to circumstances beyond user control.

XSEDE users and projects that wish to request a refund should see the XSEDE Refund Policy section located here.

Other allocated users and projects wishing to request a refund should email help@ncsa.illinois.edu. Please include the batch job ids and the standard error and output files produced by the job(s). 

Visualization

Delta A40 nodes support NVIDIA raytracing hardware.

  • describe visualization capabilities & software.
  • how to establish VNC/DVC/remote desktop

Containers

Delta provides NVIDIA NGC containers that we have pre-built with Singularity.  Look for the latest binary containers in /sw/external/NGC/.  The containers are used as shown in the sample scripts below:

PyTorch example script
#!/bin/bash
#SBATCH --mem=16g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1     # <- match to OMP_NUM_THREADS
#SBATCH --partition=gpuA100x4 # <- one of: gpuA100x4 gpuA40x4 gpuA100x8 gpuMI100x8
##SBATCH --account=account_name
#SBATCH --job-name=pytorchNGC
### GPU options ###
#SBATCH --gpus-per-node=1
#SBATCH --gpu-bind=none     # <- or closest
 
module purge # drop modules and explicitly load the ones needed
             # (good job metadata and reproducibility)

module list  # job documentation and metadata

echo "job is starting on `hostname`"

# run the container binary with arguments: python3 <program.py>
/sw/external/NGC/pytorch:22.02-py3 python3 tensor_gpu.py
Tensorflow example script
#!/bin/bash
#SBATCH --mem=16g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1     # <- match to OMP_NUM_THREADS
#SBATCH --partition=gpuA100x4 # <- one of: gpuA100x4 gpuA40x4 gpuA100x8 gpuMI100x8
##SBATCH --account=account_name
#SBATCH --job-name=tfNGC
### GPU options ###
#SBATCH --gpus-per-node=2
#SBATCH --gpu-bind=none     # <- or closest
 
module purge # drop modules and explicitly load the ones needed
             # (good job metadata and reproducibility)

module list  # job documentation and metadata

echo "job is starting on `hostname`"

# run the container binary with arguments: python3 <program.py>
/sw/external/NGC/tensorflow:22.02-tf2-py3 python3 tf_gpu.py

Container list (as of March, 2022)

catalog.txt
caffe:20.03-py3
caffe2:18.08-py3
cntk:18.08-py3 , Microsoft Cognitive Toolkit
digits:21.09-tensorflow-py3
matlab:r2021b
mxnet:21.09-py3
pytorch:22.02-py3
tensorflow:22.02-tf1-py3
tensorflow:22.02-tf2-py3
tensorrt:22.02-py3
theano:18.08
torch:18.08-py2

see also: https://catalog.ngc.nvidia.com/orgs/nvidia/containers

Protected Data (N/A)

IF APPLICABLE

  • Describe the system’s capabilities for handling protected data.
  • Data Retention Policies
  • How to run jobs with protected data.
  • Describe any mandated workflows.

Help

For assistance with the use of Delta

Acknowledge

To acknowledge the NCSA Delta system in particular, please include the following:

This research is part of the Delta research computing project, which is supported by the National Science Foundation (award OAC 2005572), and the State of Illinois. Delta is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.

To include acknowledgement of XSEDE contributions to a publication or presentation please see https://portal.xsede.org/acknowledge and https://www.xsede.org/for-users/acknowledgement.

References

List any supporting documentation resources

