
For complete SLURM documentation, see https://slurm.schedmd.com/. Here we show only simple examples with system-specific instructions.

HAL Job Queues

Partition Name | Priority | Max Walltime | Min-Max Nodes Allowed | Max CPUs Per Node | Max Memory Per CPU (GB) | Description
debug | high | 12 hrs | 1-1 | 144 | 1.5 | designed to access 1 GPU to run debug or short-term jobs
solo | normal | 72 hrs | 1-1 | 144 | 1.5 | designed to access 1 GPU to run long-term jobs
batch | normal | 72 hrs | 2-16 | 144 | 1.5 | designed to access 2-16 nodes (up to 64 GPUs) to run parallel jobs

HAL Example Job Scripts (Recommended)

New users should check the example job scripts at "/opt/apps/samples-runscript" and request adequate resources.

Script Name | Job Type | Partition | Max Walltime | Nodes | CPUs | GPUs | Memory (GB) | Description
run_debug_00gpu_036cpu_0216mem.sh | interactive | debug | 12:00:00 | 1 | 36 | 0 | 216 | submit interactive job, 1 full node for 12 hours, CPU-only task in "debug" partition
run_debug_01gpu_008cpu_0048mem.sh | interactive | debug | 12:00:00 | 1 | 8 | 1 | 48 | submit interactive job, 25% of 1 full node for 12 hours, GPU task in "debug" partition
run_debug_02gpu_016cpu_0096mem.sh | interactive | debug | 12:00:00 | 1 | 16 | 2 | 96 | submit interactive job, 50% of 1 full node for 12 hours, GPU task in "debug" partition
run_debug_04gpu_032cpu_0192mem.sh | interactive | debug | 12:00:00 | 1 | 32 | 4 | 192 | submit interactive job, 1 full node for 12 hours, GPU task in "debug" partition
sub_solo_01node_01gpu_08cpu_0048mem.sb | sbatch | solo | 72:00:00 | 1 | 8 | 1 | 48 | submit batch job, 25% of 1 full node for 72 hours, GPU task in "solo" partition
sub_solo_01node_02gpu_16cpu_0096mem.sb | sbatch | solo | 72:00:00 | 1 | 16 | 2 | 96 | submit batch job, 50% of 1 full node for 72 hours, GPU task in "solo" partition
sub_solo_01node_04gpu_32cpu_0192mem.sb | sbatch | solo | 72:00:00 | 1 | 32 | 4 | 192 | submit batch job, 1 full node for 72 hours, GPU task in "solo" partition
sub_batch_02node_08gpu_064cpu_0384mem.sb | sbatch | batch | 72:00:00 | 2 | 64 | 8 | 384 | submit batch job, 2 full nodes for 72 hours, GPU task in "batch" partition
sub_batch_16node_64gpu_512cpu_3072mem.sb | sbatch | batch | 72:00:00 | 16 | 512 | 64 | 3072 | submit batch job, 16 full nodes for 72 hours, GPU task in "batch" partition

Native SLURM style

Submit Interactive Job with "srun"

# request 1 node with 8 tasks and 1 V100 GPU for 1.5 hours in the "debug" partition,
# then start an interactive bash shell on the allocated node
srun --partition=debug --pty --nodes=1 \
--ntasks-per-node=8 --gres=gpu:v100:1 \
-t 01:30:00 --wait=0 \
--export=ALL /bin/bash

Submit Batch Job

sbatch [job_script]
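
A job script is a shell script with "#SBATCH" directives at the top. The sketch below is a minimal example modeled on the 25%-of-a-node "solo" configuration from the table above; the job name, output file name, and commands are placeholders, and the provided scripts under "/opt/apps/samples-runscript" remain the recommended starting point.

#!/bin/bash
#SBATCH --job-name=gpu_test        # placeholder job name
#SBATCH --partition=solo           # partition, see the queue table above
#SBATCH --time=72:00:00            # max walltime allowed in "solo"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8        # 8 CPUs, 25% of one node
#SBATCH --gres=gpu:v100:1          # 1 V100 GPU, same gres name as the srun example
#SBATCH --output=%x_%j.out         # output written to <job-name>_<job-id>.out

hostname
nvidia-smi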

Check Job Status

squeue                # check all jobs from all users
squeue -u [user_name] # check all jobs belonging to [user_name]

Cancel Running Job

scancel [job_id] # cancel job with [job_id]
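
scancel also accepts a user filter, which cancels every job belonging to that user:

scancel -u [user_name] # cancel all jobs belonging to [user_name]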

PBS style

Some PBS commands are supported by SLURM.

Check Node Status

pbsnodes

Check Job Status

qstat -f [job_number]

Check Queue Status

qstat

Delete Job

qdel [job_number]

Submit Batch Job

$ cat test.pbs
#!/usr/bin/sh
#PBS -N test
#PBS -l nodes=1
#PBS -l walltime=10:00

hostname
$ qsub test.pbs
107
$ cat test.pbs.o107
hal01.hal.ncsa.illinois.edu