For complete SLURM documentation, see https://slurm.schedmd.com/. Here we only show simple examples with system-specific instructions.
HAL Slurm Wrapper Suite (Recommended)
Introduction
The HAL Slurm Wrapper Suite was designed to help users use the HAL system easily and efficiently. The current version is "swsuite-v0.1", which includes:
srun (slurm command) → swrun : request resources to run interactive jobs.
sbatch (slurm command) → swbatch : request resources to submit a batch script to Slurm.
Rules of Thumb
- Minimize the required input options.
- Stay consistent with the original Slurm run-script format.
- Submit jobs to a suitable partition based on the number of GPUs requested.
Usage
- swrun -p <partition_name> -c <cpu_per_gpu> -t <walltime>
- <partition_name> (required) : cpu, gpux1, gpux2, gpux3, gpux4, gpux8, gpux12, gpux16.
- <cpu_per_gpu> (optional) : 12 cpus (default), range from 12 cpus to 36 cpus.
- <walltime> (optional) : 24 hours (default), range from 1 hour to 72 hours.
- example: swrun -p gpux4 -c 36 -t 72 (requests a full node: 1x node, 4x GPUs, 144x CPUs, 72 hours)
- swbatch <run_script>
- <run_script> (required) : same format as a standard Slurm batch script; the options below are set inside the script as #SBATCH directives.
- <job_name> (optional) : job name.
- <output_file> (optional) : output file name.
- <error_file> (optional) : error file name.
- <partition_name> (required) : cpu, gpux1, gpux2, gpux3, gpux4, gpux8, gpux12, gpux16.
- <cpu_per_gpu> (optional) : 12 cpus (default), range from 12 cpus to 36 cpus.
- <walltime> (optional) : 24 hours (default), range from 1 hour to 72 hours.
example: swbatch demo.sb
demo.sb:
#!/bin/bash
#SBATCH --job-name="demo"
#SBATCH --output="demo.%j.%N.out"
#SBATCH --error="demo.%j.%N.err"
#SBATCH --partition=gpux1

srun hostname
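The same script format extends to multi-GPU jobs. As a sketch (the job name, file names, and walltime below are illustrative; --time is a standard Slurm directive, and the CPU-per-GPU count is left at its 12-CPU default), a gpux4 script could look like:

#!/bin/bash
#SBATCH --job-name="demo4"
#SBATCH --output="demo4.%j.%N.out"
#SBATCH --error="demo4.%j.%N.err"
#SBATCH --partition=gpux4
#SBATCH --time=24:00:00

srun hostname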
New Job Queues
Partition Name | Priority | Max Walltime | Nodes Allowed | Min-Max CPUs Per Node Allowed | Min-Max Mem Per Node Allowed | GPU Allowed | Local Scratch | Description |
---|---|---|---|---|---|---|---|---|
gpu-debug | high | 4 hrs | 1 | 12-144 | 18-144 GB | 4 | none | designed to access 1 node to run debug job. |
gpux1 | normal | 72 hrs | 1 | 12-36 | 18-54 GB | 1 | none | designed to access 1 GPU on 1 node to run sequential and/or parallel job. |
gpux2 | normal | 72 hrs | 1 | 24-72 | 36-108 GB | 2 | none | designed to access 2 GPUs on 1 node to run sequential and/or parallel job. |
gpux3 | normal | 72 hrs | 1 | 36-108 | 54-162 GB | 3 | none | designed to access 3 GPUs on 1 node to run sequential and/or parallel job. |
gpux4 | normal | 72 hrs | 1 | 48-144 | 72-216 GB | 4 | none | designed to access 4 GPUs on 1 node to run sequential and/or parallel job. |
cpu | normal | 72 hrs | 1 | 96-96 | 144-144 GB | 0 | none | designed to access 96 CPUs on 1 node to run sequential and/or parallel job. |
gpux8 | low | 72 hrs | 2 | 48-144 | 72-216 GB | 8 | none | designed to access 8 GPUs on 2 nodes to run sequential and/or parallel job. |
gpux12 | low | 72 hrs | 3 | 48-144 | 72-216 GB | 12 | none | designed to access 12 GPUs on 3 nodes to run sequential and/or parallel job. |
gpux16 | low | 72 hrs | 4 | 48-144 | 72-216 GB | 16 | none | designed to access 16 GPUs on 4 nodes to run sequential and/or parallel job. |
HAL Wrapper Suite Example Job Scripts
New users should check the example job scripts at "/opt/samples/runscripts" and request adequate resources.
Script Name | Job Type | Partition | Walltime | Nodes | CPU | GPU | Memory | Description |
---|---|---|---|---|---|---|---|---|
run_gpux1_12cpu_24hrs.sh | interactive | gpux1 | 24 hrs | 1 | 12 | 1 | 18 GB | submit interactive job, 1x node for 24 hours w/ 12x CPU 1x GPU task in "gpux1" partition. |
run_gpux2_24cpu_24hrs.sh | interactive | gpux2 | 24 hrs | 1 | 24 | 2 | 36 GB | submit interactive job, 1x node for 24 hours w/ 24x CPU 2x GPU task in "gpux2" partition. |
sub_gpux1_12cpu_24hrs.sb | batch | gpux1 | 24 hrs | 1 | 12 | 1 | 18 GB | submit batch job, 1x node for 24 hours w/ 12x CPU 1x GPU task in "gpux1" partition. |
sub_gpux2_24cpu_24hrs.sb | batch | gpux2 | 24 hrs | 1 | 24 | 2 | 36 GB | submit batch job, 1x node for 24 hours w/ 24x CPU 2x GPU task in "gpux2" partition. |
sub_gpux4_48cpu_24hrs.sb | batch | gpux4 | 24 hrs | 1 | 48 | 4 | 72 GB | submit batch job, 1x node for 24 hours w/ 48x CPU 4x GPU task in "gpux4" partition. |
sub_gpux8_96cpu_24hrs.sb | batch | gpux8 | 24 hrs | 2 | 96 | 8 | 144 GB | submit batch job, 2x node for 24 hours w/ 96x CPU 8x GPU task in "gpux8" partition. |
sub_gpux16_192cpu_24hrs.sb | batch | gpux16 | 24 hrs | 4 | 192 | 16 | 288 GB | submit batch job, 4x node for 24 hours w/ 192x CPU 16x GPU task in "gpux16" partition. |
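To try one of the samples, copy it into a working directory and submit it with swbatch. A sketch (script name taken from the table above; adjust as needed):

cp /opt/samples/runscripts/sub_gpux1_12cpu_24hrs.sb .
swbatch sub_gpux1_12cpu_24hrs.sb
squeue -u $USER   # confirm the job is queued or running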
Native SLURM style
Submit Interactive Job with "srun"
srun --partition=debug --pty --nodes=1 \
     --ntasks-per-node=12 --cores-per-socket=12 --mem-per-cpu=1500 --gres=gpu:v100:1 \
     -t 01:30:00 --wait=0 \
     --export=ALL /bin/bash
Submit Batch Job
sbatch [job_script]
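As a sketch, a minimal native-style job script for the "solo" partition could reuse the resource flags from the interactive example above (the job name, GPU count, and walltime are illustrative):

#!/bin/bash
#SBATCH --job-name="native_demo"
#SBATCH --partition=solo
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --mem-per-cpu=1500
#SBATCH --gres=gpu:v100:1
#SBATCH --time=24:00:00

srun hostname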
Check Job Status
squeue                  # check all jobs from all users
squeue -u [user_name]   # check all jobs belonging to user_name
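squeue can also report on a single job by ID:

squeue -j [job_id]      # check one specific job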
Cancel Running Job
scancel [job_id] # cancel job with [job_id]
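scancel also accepts a user filter, which cancels all of your own jobs at once:

scancel -u [user_name]  # cancel all jobs belonging to user_name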
Job Queues
Partition Name | Priority | Max Walltime | Min-Max Nodes Allowed | Max CPUs Per Node | Max Memory Per CPU (GB) | Local Scratch (GB) | Description |
---|---|---|---|---|---|---|---|
debug | high | 4 hrs | 1-1 | 144 | 1.5 | None | designed to access 1 node to run debug job |
solo | normal | 72 hrs | 1-1 | 144 | 1.5 | None | designed to access 1 node to run sequential and/or parallel job |
ssd | normal | 72 hrs | 1-1 | 144 | 1.5 | 220 | similar to solo partition with extra local scratch, limited to hal[01-04] |
batch | low | 72 hrs | 2-16 | 144 | 1.5 | None | designed to access 2-16 nodes (up to 64 GPUs) to run parallel job |
HAL Example Job Scripts
New users should check the example job scripts at "/opt/apps/samples-runscript" and request adequate resources.
Script Name | Job Type | Partition | Max Walltime | Nodes | CPU | GPU | Memory (GB) | Description |
---|---|---|---|---|---|---|---|---|
run_debug_00gpu_96cpu_216GB.sh | interactive | debug | 4:00:00 | 1 | 96 | 0 | 144 | submit interactive job, 1 full node for 4 hours CPU only task in "debug" partition |
run_debug_01gpu_12cpu_18GB.sh | interactive | debug | 4:00:00 | 1 | 12 | 1 | 18 | submit interactive job, 25% of 1 full node for 4 hours GPU task in "debug" partition |
run_debug_02gpu_24cpu_36GB.sh | interactive | debug | 4:00:00 | 1 | 24 | 2 | 36 | submit interactive job, 50% of 1 full node for 4 hours GPU task in "debug" partition |
run_debug_04gpu_48cpu_72GB.sh | interactive | debug | 4:00:00 | 1 | 48 | 4 | 72 | submit interactive job, 1 full node for 4 hours GPU task in "debug" partition |
sub_solo_01node_01gpu_12cpu_18GB.sb | sbatch | solo | 72:00:00 | 1 | 12 | 1 | 18 | submit batch job, 25% of 1 full node for 72 hours GPU task in "solo" partition |
sub_solo_01node_02gpu_24cpu_36GB.sb | sbatch | solo | 72:00:00 | 1 | 24 | 2 | 36 | submit batch job, 50% of 1 full node for 72 hours GPU task in "solo" partition |
sub_solo_01node_04gpu_48cpu_72GB.sb | sbatch | solo | 72:00:00 | 1 | 48 | 4 | 72 | submit batch job, 1 full node for 72 hours GPU task in "solo" partition |
sub_ssd_01node_01gpu_12cpu_18GB.sb | sbatch | ssd | 72:00:00 | 1 | 12 | 1 | 18 | submit batch job, 25% of 1 full node for 72 hours GPU task in "ssd" partition |
sub_batch_02node_08gpu_96cpu_144GB.sb | sbatch | batch | 72:00:00 | 2 | 96 | 8 | 144 | submit batch job, 2 full nodes for 72 hours GPU task in "batch" partition |
sub_batch_16node_64gpu_768cpu_1152GB.sb | sbatch | batch | 72:00:00 | 16 | 768 | 64 | 1152 | submit batch job, 16 full nodes for 72 hours GPU task in "batch" partition |
PBS style
Some PBS commands are supported by SLURM.
Check Node Status
pbsnodes
Check Job Status
qstat -f [job_number]
Check Queue Status
qstat
Delete Job
qdel [job_number]
Submit Batch Job
$ cat test.pbs
#!/usr/bin/sh
#PBS -N test
#PBS -l nodes=1
#PBS -l walltime=10:00
hostname

$ qsub test.pbs
107

$ cat test.pbs.o107
hal01.hal.ncsa.illinois.edu