For complete SLURM documentation, see https://slurm.schedmd.com/. Here we only show simple examples with system-specific instructions.

HAL Slurm Wrapper Suite (Recommended)

Introduction

The HAL Slurm Wrapper Suite was designed to help users run jobs on the HAL system easily and efficiently. The current version is "swsuite-v0.4", which includes:

srun (slurm command) → swrun : request resources to run interactive jobs.

sbatch (slurm command) → swbatch : request resources to submit a batch script to Slurm.

squeue (slurm command) → swqueue : check current running jobs and computational resource status.
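
For example, a typical interactive request and batch submission look roughly like the following (a minimal sketch; the -p, -c, and -t flags are assumed here to select the partition, CPU count, and walltime in hours — check the wrapper's on-system help for the exact options in your swsuite version):

# request an interactive session with 1 GPU and 16 CPUs in the "gpux1" partition (assumed flags)
swrun -p gpux1 -c 16 -t 24

# submit a wrapper batch script, then check job and resource status
swbatch sub_gpux1_16cpu_24hrs.swb
swqueue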

Rule of Thumb

Request Only As Much As You Can Make Use Of

Many applications require some amount of modification to make use of more than one GPU for computation. Almost all programs require nontrivial optimizations to run efficiently on more than one node (partitions gpux8 and larger). Monitor your usage and avoid occupying resources that you cannot make use of.
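
For example, standard Slurm accounting tools can show what a running or finished job actually consumed (the format fields below are standard Slurm options; replace [job_id] with your own job ID):

squeue -u $USER                                            # list your running and pending jobs
sstat  -j [job_id] --format=JobID,AveCPU,MaxRSS            # CPU time and peak memory of a running job
sacct  -j [job_id] --format=JobID,Elapsed,MaxRSS,AllocTRES # resource usage after the job finishes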

New Job Queues

Partition Name | Priority | Max Walltime | Nodes Allowed | Min-Max CPUs Per Node Allowed | Min-Max Mem Per Node Allowed | GPUs Allowed | Local Scratch | Description
-------------- | -------- | ------------ | ------------- | ----------------------------- | ---------------------------- | ------------ | ------------- | -----------
gpux1    | normal | 72 hrs | 1  | 16-40  | 19.2-48 GB     | 1  | none | designed to access 1 GPU on 1 node to run sequential and/or parallel jobs.
gpux2    | normal | 72 hrs | 1  | 32-80  | 38.4-96 GB     | 2  | none | designed to access 2 GPUs on 1 node to run sequential and/or parallel jobs.
gpux3    | normal | 72 hrs | 1  | 48-120 | 57.6-144 GB    | 3  | none | designed to access 3 GPUs on 1 node to run sequential and/or parallel jobs.
gpux4    | normal | 72 hrs | 1  | 64-160 | 76.8-192 GB    | 4  | none | designed to access 4 GPUs on 1 node to run sequential and/or parallel jobs.
gpux8    | normal | 72 hrs | 2  | 64-160 | 76.8-192 GB    | 8  | none | designed to access 8 GPUs on 2 nodes to run sequential and/or parallel jobs.
gpux12   | normal | 72 hrs | 3  | 64-160 | 76.8-192 GB    | 12 | none | designed to access 12 GPUs on 3 nodes to run sequential and/or parallel jobs.
gpux16   | normal | 72 hrs | 4  | 64-160 | 76.8-192 GB    | 16 | none | designed to access 16 GPUs on 4 nodes to run sequential and/or parallel jobs.
cpun1    | normal | 72 hrs | 1  | 96-96  | 115.2-115.2 GB | 0  | none | designed to access 96 CPUs on 1 node to run sequential and/or parallel jobs.
cpun2    | normal | 72 hrs | 2  | 96-96  | 115.2-115.2 GB | 0  | none | designed to access 96 CPUs on 2 nodes to run sequential and/or parallel jobs.
cpun4    | normal | 72 hrs | 4  | 96-96  | 115.2-115.2 GB | 0  | none | designed to access 96 CPUs on 4 nodes to run sequential and/or parallel jobs.
cpun8    | normal | 72 hrs | 8  | 96-96  | 115.2-115.2 GB | 0  | none | designed to access 96 CPUs on 8 nodes to run sequential and/or parallel jobs.
cpun16   | normal | 72 hrs | 16 | 96-96  | 115.2-115.2 GB | 0  | none | designed to access 96 CPUs on 16 nodes to run sequential and/or parallel jobs.
cpu_mini | normal | 72 hrs | 1  | 8-8    | 9.6-9.6 GB     | 0  | none | designed to access 8 CPUs on 1 node to run tensorboard jobs.
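
These limits can be cross-checked on the live system with sinfo (the format string below uses standard Slurm output fields):

sinfo -o "%P %l %D %c %m %G"   # partition, time limit, node count, CPUs per node, memory per node (MB), GRES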

HAL Wrapper Suite Example Job Scripts

New users should check the example job scripts at "/opt/samples/runscripts" and request adequate resources.

Script Name | Job Type | Partition | Walltime | Nodes | CPUs | GPUs | Memory | Description
----------- | -------- | --------- | -------- | ----- | ---- | ---- | ------ | -----------
run_gpux1_16cpu_24hrs.sh    | interactive | gpux1  | 24 hrs | 1 | 16  | 1  | 19.2 GB  | submit interactive job, 1x node for 24 hours w/ 16x CPU 1x GPU task in "gpux1" partition.
run_gpux2_32cpu_24hrs.sh    | interactive | gpux2  | 24 hrs | 1 | 32  | 2  | 38.4 GB  | submit interactive job, 1x node for 24 hours w/ 32x CPU 2x GPU task in "gpux2" partition.
sub_gpux1_16cpu_24hrs.swb   | batch       | gpux1  | 24 hrs | 1 | 16  | 1  | 19.2 GB  | submit batch job, 1x node for 24 hours w/ 16x CPU 1x GPU task in "gpux1" partition.
sub_gpux2_32cpu_24hrs.swb   | batch       | gpux2  | 24 hrs | 1 | 32  | 2  | 38.4 GB  | submit batch job, 1x node for 24 hours w/ 32x CPU 2x GPU task in "gpux2" partition.
sub_gpux4_64cpu_24hrs.swb   | batch       | gpux4  | 24 hrs | 1 | 64  | 4  | 76.8 GB  | submit batch job, 1x node for 24 hours w/ 64x CPU 4x GPU task in "gpux4" partition.
sub_gpux8_128cpu_24hrs.swb  | batch       | gpux8  | 24 hrs | 2 | 128 | 8  | 153.6 GB | submit batch job, 2x node for 24 hours w/ 128x CPU 8x GPU task in "gpux8" partition.
sub_gpux16_256cpu_24hrs.swb | batch       | gpux16 | 24 hrs | 4 | 256 | 16 | 153.6 GB | submit batch job, 4x node for 24 hours w/ 256x CPU 16x GPU task in "gpux16" partition.
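
A typical workflow is to copy one of the sample scripts from "/opt/samples/runscripts", adjust it to your needs, and submit it with swbatch:

cp /opt/samples/runscripts/sub_gpux1_16cpu_24hrs.swb .
# edit the copy (partition, walltime, application commands), then submit it
swbatch sub_gpux1_16cpu_24hrs.swb
swqueue   # verify that the job is queued or running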

Native SLURM style

Submit Interactive Job with "srun"

# request an interactive shell on 1 node with 16 tasks, 1x V100 GPU,
# 1.5 GB of memory per CPU, and a 90-minute walltime in the "debug" partition
srun --partition=debug --pty --nodes=1 \
     --ntasks-per-node=16 --cores-per-socket=4 \
     --threads-per-core=4 --sockets-per-node=1 \
     --mem-per-cpu=1500 --gres=gpu:v100:1 \
     --time 01:30:00 --wait=0 \
     --export=ALL /bin/bash

Submit Batch Job

sbatch [job_script]
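
A minimal batch script for sbatch might look like the following (a sketch only; the job name, output file name, and resource values are illustrative and should be adapted to your partition):

#!/bin/bash
#SBATCH --job-name=hal_test          # illustrative job name
#SBATCH --partition=gpux1            # one of the partitions listed above
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --gres=gpu:v100:1            # 1x V100 GPU, as in the srun example above
#SBATCH --mem-per-cpu=1500           # MB per CPU, as in the srun example above
#SBATCH --time=24:00:00              # walltime, hh:mm:ss
#SBATCH --output=hal_test.%j.out     # %j expands to the job ID

hostname
nvidia-smi                           # show the allocated GPU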

Check Job Status

squeue                # check all jobs from all users
squeue -u [user_name] # check all jobs belonging to user_name
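
The output can be customized with a format string, for example:

squeue -u [user_name] -o "%.10i %.12P %.20j %.8T %.10M %.6D %R"   # job ID, partition, name, state, elapsed time, node count, reason/node list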

Cancel Running Job

scancel [job_id] # cancel job with [job_id]
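
All jobs belonging to a user can be cancelled at once:

scancel -u [user_name]   # cancel every job owned by user_name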

PBS style

Some PBS commands are supported by SLURM.

Check Node Status

pbsnodes

Check Job Status

qstat -f [job_number]

Check Queue Status

qstat

Delete Job

qdel [job_number]

Submit Batch Job

$ cat test.pbs
#!/usr/bin/sh
#PBS -N test
#PBS -l nodes=1
#PBS -l walltime=10:00

hostname
$ qsub test.pbs
107
$ cat test.pbs.o107
hal01.hal.ncsa.illinois.edu