
For complete SLURM documentation, see https://slurm.schedmd.com/. Here we only show simple examples with system-specific instructions.

HAL Slurm Wrapper Suite (Recommended)

Introduction

The HAL Slurm Wrapper Suite is designed to help users work on the HAL system easily and efficiently. The current version, "swsuite-v0.4", includes the following commands:

srun (slurm command) → swrun : request resources to run interactive jobs.

sbatch (slurm command) → swbatch : request resources to submit a batch script to Slurm.

squeue (slurm command) → swqueue : check current running jobs and computational resource status.

Rule of Thumb

Minimize the required input options.
Stay consistent with the original "slurm" run-script format.
Submit each job to a suitable partition based on the number of GPUs needed (or the number of nodes, for CPU partitions).

Usage

swrun -p <partition_name> -c <cpu_per_gpu> -t <walltime> -r <reservation_name>
- <partition_name> (required) : cpun1, cpun2, cpun4, cpun8, gpux1, gpux2, gpux3, gpux4, gpux8, gpux12, gpux16.
- <cpu_per_gpu> (optional) : 12 cpus (default), range from 12 cpus to 36 cpus.
- <walltime> (optional) : 24 hours (default), range from 1 hour to 72 hours.
- <reservation_name> (optional) : reservation name granted to user.
- example: swrun -p gpux4 -c 36 -t 72 (request a full node: 1x node, 4x gpus, 144x cpus, 72x hours)
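The totals in the swrun example can be checked with a short dry run. This is only a sketch: it assumes the number after "gpux" in the partition name is the GPU count (as the partition list suggests), the variable names are illustrative, and nothing is submitted.

```shell
#!/bin/bash
# Dry run: estimate what a swrun request asks for. Nothing is submitted.
# Assumption: the number after "gpux" in the partition name is the GPU count.
partition="gpux4"
cpu_per_gpu=36   # -c option: 12 (default) to 36
walltime=72      # -t option: hours, 1 to 72

gpus="${partition#gpux}"              # "gpux4" -> "4"
total_cpus=$(( gpus * cpu_per_gpu ))  # 4 * 36 = 144

echo "swrun -p ${partition} -c ${cpu_per_gpu} -t ${walltime}"
echo "requests ${gpus} GPUs, ${total_cpus} CPUs, ${walltime} hours"
```

Changing partition to gpux1 and cpu_per_gpu to 12 reproduces the default single-GPU request.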
swbatch <run_script>
- <run_script> (required) : same as original slurm batch.
- <job_name> (optional) : job name.
- <output_file> (optional) : output file name.
- <error_file> (optional) : error file name.
- <partition_name> (required) : cpun1, cpun2, cpun4, cpun8, gpux1, gpux2, gpux3, gpux4, gpux8, gpux12, gpux16.
- <cpu_per_gpu> (optional) : 12 cpus (default), range from 12 cpus to 36 cpus.
- <walltime> (optional) : 24 hours (default), range from 1 hour to 72 hours.
- <reservation_name> (optional) : reservation name granted to user.
- example: swbatch demo.swb
  #!/bin/bash
  #SBATCH --job-name="demo"
  #SBATCH --output="demo.%j.%N.out"
  #SBATCH --error="demo.%j.%N.err"
  #SBATCH --partition=gpux1
  srun hostname
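The demo.swb workflow can be sketched end to end: write the script, confirm that its partition is one swbatch accepts, and submit only where the wrapper is installed. The temporary directory and the validity check are illustrative assumptions, not part of swsuite.

```shell
#!/bin/bash
# Sketch: write demo.swb, sanity-check its partition, and submit if possible.
workdir="$(mktemp -d)"
script="${workdir}/demo.swb"

cat > "${script}" <<'EOF'
#!/bin/bash
#SBATCH --job-name="demo"
#SBATCH --output="demo.%j.%N.out"
#SBATCH --error="demo.%j.%N.err"
#SBATCH --partition=gpux1

srun hostname
EOF

# Pull the partition name out of the #SBATCH directive.
partition="$(sed -n 's/^#SBATCH --partition=//p' "${script}")"

# Partition names accepted by swbatch, as listed in the usage notes.
valid="cpun1 cpun2 cpun4 cpun8 gpux1 gpux2 gpux3 gpux4 gpux8 gpux12 gpux16"
case " ${valid} " in
  *" ${partition} "*) partition_ok=yes ;;
  *)                  partition_ok=no ;;
esac
echo "partition=${partition} valid=${partition_ok}"

# Submit only where the wrapper exists (for example on a HAL login node).
if [ "${partition_ok}" = yes ] && command -v swbatch >/dev/null 2>&1; then
  swbatch "${script}"
else
  echo "not submitting: swbatch unavailable or partition invalid"
fi
```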
swqueue
- example: swqueue

New Job Queues

Partition Name	Priority	Max Walltime	Nodes Allowed	Min-Max CPUs Per Node Allowed	Min-Max Mem Per Node Allowed	GPU Allowed	Local Scratch	Description
gpux1	normal	72 hrs	1	16-40	21.6-48 GB	1	none	designed to access 1 GPU on 1 node to run sequential and/or parallel jobs.
gpux2	normal	72 hrs	1	32-80	36-108 GB	2	none	designed to access 2 GPUs on 1 node to run sequential and/or parallel jobs.
gpux3	normal	72 hrs	1	48-120	54-162 GB	3	none	designed to access 3 GPUs on 1 node to run sequential and/or parallel jobs.
gpux4	normal	72 hrs	1	64-160	72-216 GB	4	none	designed to access 4 GPUs on 1 node to run sequential and/or parallel jobs.
gpux8	normal	72 hrs	2	64-160	72-216 GB	8	none	designed to access 8 GPUs on 2 nodes to run sequential and/or parallel jobs.
gpux12	normal	72 hrs	3	64-160	72-216 GB	12	none	designed to access 12 GPUs on 3 nodes to run sequential and/or parallel jobs.
gpux16	normal	72 hrs	4	64-160	72-216 GB	16	none	designed to access 16 GPUs on 4 nodes to run sequential and/or parallel jobs.
cpu_mini	normal	72 hrs	1	8-8			none
cpun1	normal	72 hrs	1	96-96	144-144 GB	0	none	designed to access 96 CPUs per node on 1-16 nodes to run sequential and/or parallel jobs.
cpun2	normal	72 hrs	2	96-96			none
cpun4	normal	72 hrs	4	96-96			none
cpun8	normal	72 hrs	8	96-96			none
cpun16	normal	72 hrs	16	96-96			none
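The GPU partitions in the table follow a simple pattern: up to 4 GPUs fit on one node, and larger requests use 4 GPUs per node. A minimal lookup sketch follows; pick_partition is a hypothetical helper written for illustration and is not part of swsuite.

```shell
#!/bin/bash
# Hypothetical helper: map a GPU count to the partition name and node count
# listed in the job queue table. Prints "<partition> <nodes>".
pick_partition() {
  case "$1" in
    1|2|3|4) echo "gpux$1 1" ;;
    8)       echo "gpux8 2" ;;
    12)      echo "gpux12 3" ;;
    16)      echo "gpux16 4" ;;
    *)       echo "no partition for $1 GPUs" >&2; return 1 ;;
  esac
}

result="$(pick_partition 8)"
partition="${result% *}"   # text before the space
nodes="${result#* }"       # text after the space
echo "8 GPUs -> partition ${partition} on ${nodes} nodes"
```

GPU counts that have no matching partition (for example 5) return a non-zero status, so a calling script can stop before submitting.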

HAL Wrapper Suite Example Job Scripts

New users should check the example job scripts at "/opt/samples/runscripts" and request adequate resources.

Script Name	Job Type	Partition	Walltime	Nodes	CPU	GPU	Memory	Description
run_gpux1_12cpu_24hrs.sh	interactive	gpux1	24 hrs	1	12	1	18 GB	submit interactive job, 1x node for 24 hours w/ 12x CPU 1x GPU task in "gpux1" partition.
run_gpux2_24cpu_24hrs.sh	interactive	gpux2	24 hrs	1	24	2	36 GB	submit interactive job, 1x node for 24 hours w/ 24x CPU 2x GPU task in "gpux2" partition.
sub_gpux1_12cpu_24hrs.sb	batch	gpux1	24 hrs	1	12	1	18 GB	submit batch job, 1x node for 24 hours w/ 12x CPU 1x GPU task in "gpux1" partition.
sub_gpux2_24cpu_24hrs.sb	batch	gpux2	24 hrs	1	24	2	36 GB	submit batch job, 1x node for 24 hours w/ 24x CPU 2x GPU task in "gpux2" partition.
sub_gpux4_48cpu_24hrs.sb	batch	gpux4	24 hrs	1	48	4	72 GB	submit batch job, 1x node for 24 hours w/ 48x CPU 4x GPU task in "gpux4" partition.
sub_gpux8_96cpu_24hrs.sb	batch	gpux8	24 hrs	2	96	8	144 GB	submit batch job, 2x node for 24 hours w/ 96x CPU 8x GPU task in "gpux8" partition.
sub_gpux16_192cpu_24hrs.sb	batch	gpux16	24 hrs	4	192	16	288 GB	submit batch job, 4x node for 24 hours w/ 192x CPU 16x GPU task in "gpux16" partition.
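Every sample script in the table requests 1.5 GB of memory per CPU (12 CPUs with 18 GB, 192 CPUs with 288 GB). The sketch below reproduces the Memory column from the CPU column. The ratio is inferred from the table rather than stated as a rule, and mem_gb_for_cpus is an illustrative name.

```shell
#!/bin/bash
# Inferred from the sample-script table: memory = 1.5 GB per CPU.
mem_gb_for_cpus() {
  echo $(( $1 * 3 / 2 ))   # 1.5 GB per CPU using integer arithmetic
}

for cpus in 12 24 48 96 192; do
  echo "${cpus} CPUs -> $(mem_gb_for_cpus "${cpus}") GB"
done
```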

Native SLURM style

Submit Interactive Job with "srun"

srun --partition=debug --pty --nodes=1 \
     --ntasks-per-node=12 --cores-per-socket=3 \
     --threads-per-core=4 --sockets-per-node=1 \
     --mem-per-cpu=1500 --gres=gpu:v100:1 \
     --time 01:30:00 --wait=0 \
     --export=ALL /bin/bash

Submit Batch Job

sbatch [job_script]
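A batch script can carry the same kind of resource request as the interactive srun command, written as #SBATCH directives. The sketch below reuses the "debug" partition and "v100" GPU type from that command. The file name native_demo.sb and the job name are placeholders, and submission is skipped when sbatch is not installed.

```shell
#!/bin/bash
# Sketch: write a native Slurm job script and submit it if sbatch is available.
workdir="$(mktemp -d)"
job_script="${workdir}/native_demo.sb"

cat > "${job_script}" <<'EOF'
#!/bin/bash
#SBATCH --job-name="native_demo"
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --mem-per-cpu=1500
#SBATCH --gres=gpu:v100:1
#SBATCH --time=01:30:00

srun hostname
EOF

# Count the directives as a quick check that the file was written correctly.
directives="$(grep -c '^#SBATCH' "${job_script}")"
echo "wrote ${job_script} with ${directives} #SBATCH directives"

if command -v sbatch >/dev/null 2>&1; then
  sbatch "${job_script}"
else
  echo "sbatch not found; skipping submission"
fi
```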

Check Job Status

squeue                # check all jobs from all users
squeue -u [user_name] # check all jobs belonging to user_name

Cancel Running Job

scancel [job_id] # cancel job with [job_id]
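The job ID needed by squeue and scancel comes from the line that sbatch prints on success, "Submitted batch job <id>". The sketch below extracts it from a sample line so that it runs without a scheduler; the ID 12345 is made up.

```shell
#!/bin/bash
# sbatch reports "Submitted batch job <id>" on success; the ID is the last field.
# A sample line stands in for real sbatch output so the sketch runs anywhere.
submit_output="Submitted batch job 12345"
job_id="${submit_output##* }"   # strip everything up to the last space

echo "squeue -j ${job_id}   # check this job only"
echo "scancel ${job_id}     # cancel it"
```

On a real system, submit_output would be captured with submit_output="$(sbatch [job_script])".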

PBS style

Some PBS commands are supported by SLURM.

Check Node Status

pbsnodes

Check Job Status

qstat -f [job_number]

Check Queue Status

qstat

Delete Job

qdel [job_number]

Submit Batch Job

$ cat test.pbs
#!/usr/bin/sh
#PBS -N test
#PBS -l nodes=1
#PBS -l walltime=10:00

hostname
$ qsub test.pbs
107
$ cat test.pbs.o107
hal01.hal.ncsa.illinois.edu
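In the session above, qsub printed job number 107 and the output appeared in test.pbs.o107, which suggests the pattern <script>.o<job_number>. The sketch below builds that file name from sample values. The pattern is taken from this one transcript and may differ under other configurations.

```shell
#!/bin/bash
# Build the expected output file name from the script name and job number.
# Sample values stand in for a real run.
script="test.pbs"
job_id="107"          # on a real system: job_id="$(qsub "${script}")"
out_file="${script}.o${job_id}"

echo "expected output file: ${out_file}"
if [ -f "${out_file}" ]; then
  cat "${out_file}"
else
  echo "(not present yet; the job may still be queued or running)"
fi
```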