For complete SLURM documentation, see https://slurm.schedmd.com/. Here we only show simple examples with system-specific instructions.

HAL Slurm Wrapper Suite (Recommended)

Introduction

The HAL Slurm Wrapper Suite was designed to help users run jobs on the HAL system easily and efficiently. The current version is "swsuite-v0.1", which includes:

srun → swrun : request resources to run interactive jobs.

sbatch → swbatch : request resources to submit a batch script to Slurm.

Usage

  • swrun -q <queue_name> -c <cpu_per_gpu> -t <walltime>
    • <queue_name> (required) : cpu, gpux1, gpux2, gpux3, gpux4, gpux8, gpux12, gpux16.
    • <cpu_per_gpu> (optional) : 12 cpus (default), range from 12 cpus to 36 cpus.
    • <walltime> (optional) : 24 hours (default), range from 1 hour to 72 hours.
    • example: swrun -q gpux4 -c 36 -t 72 (request a full node: 1x node, 4x gpus, 144x cpus, 72x hours)
  • swbatch <run_script>
    • <run_script> (required) : same as an original Slurm batch script, which should specify the following:
    • <job_name> (required) : job name.
    • <output_file> (required) : output file name.
    • <error_file> (required) : error file name.
    • <queue_name> (required) : cpu, gpux1, gpux2, gpux3, gpux4, gpux8, gpux12, gpux16.
    • <cpu_per_gpu> (optional) : 12 cpus (default), range from 12 cpus to 36 cpus.
    • <walltime> (optional) : 24 hours (default), range from 1 hour to 72 hours.
    • example: swbatch demo.sb (a sketch of a sample run script is shown below)
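
A minimal sketch of what such a run script might look like, assuming swbatch accepts the standard #SBATCH directives of an ordinary Slurm batch script (the job name, file names, partition, and walltime below are illustrative only, and how <cpu_per_gpu> is expressed may depend on the wrapper version):

#!/bin/bash
#SBATCH --job-name=demo          # <job_name>
#SBATCH --output=demo.%j.out     # <output_file>
#SBATCH --error=demo.%j.err      # <error_file>
#SBATCH --partition=gpux1        # <queue_name>
#SBATCH --time=24:00:00          # <walltime>

# commands to run on the allocated resources
srun hostname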


New Job Queues

Partition Name | Priority | Max Walltime | Nodes Allowed | Min-Max CPUs Per Node Allowed | Min-Max Mem Per Node Allowed | GPUs Allowed | Local Scratch | Description
gpu-debug | high   | 4 hrs  | 1 | 12-144 | 18-144 GB  | 4  | none |
gpux1     | normal | 72 hrs | 1 | 12-36  | 18-54 GB   | 1  | none |
gpux2     | normal | 72 hrs | 1 | 24-72  | 36-108 GB  | 2  | none |
gpux3     | normal | 72 hrs | 1 | 36-108 | 54-162 GB  | 3  | none |
gpux4     | normal | 72 hrs | 1 | 48-144 | 72-216 GB  | 4  | none |
cpu       | normal | 72 hrs | 1 | 96-96  | 144-144 GB | 0  | none |
gpux8     | low    | 72 hrs | 2 | 48-144 | 72-216 GB  | 8  | none |
gpux12    | low    | 72 hrs | 3 | 48-144 | 72-216 GB  | 12 | none |
gpux16    | low    | 72 hrs | 4 | 48-144 | 72-216 GB  | 16 | none |

Traditional Job Queues

Partition Name | Priority | Max Walltime | Min-Max Nodes Allowed | Max CPUs Per Node | Max Memory Per CPU (GB) | Local Scratch (GB) | Description
debug | high   | 4 hrs  | 1-1  | 144 | 1.5 | None | designed to access 1 node to run debug job
solo  | normal | 72 hrs | 1-1  | 144 | 1.5 | None | designed to access 1 node to run sequential and/or parallel job
ssd   | normal | 72 hrs | 1-1  | 144 | 1.5 | 220  | similar to solo partition with extra local scratch, limited to hal[01-04]
batch | low    | 72 hrs | 2-16 | 144 | 1.5 | None | designed to access 2-16 nodes (up to 64 GPUs) to run parallel job

HAL Example Job Scripts

New users should check the example job scripts at "/opt/apps/samples-runscript" and request adequate resources; a copy-and-submit example follows the table below.

Script Name | Job Type | Partition | Max Walltime | Nodes | CPUs | GPUs | Memory (GB) | Description
run_debug_00gpu_96cpu_216GB.sh | interactive | debug | 4:00:00 | 1 | 96 | 0 | 144 | submit interactive job, 1 full node for 4 hours CPU only task in "debug" partition
run_debug_01gpu_12cpu_18GB.sh | interactive | debug | 4:00:00 | 1 | 12 | 1 | 18 | submit interactive job, 25% of 1 full node for 4 hours GPU task in "debug" partition
run_debug_02gpu_24cpu_36GB.sh | interactive | debug | 4:00:00 | 1 | 24 | 2 | 36 | submit interactive job, 50% of 1 full node for 4 hours GPU task in "debug" partition
run_debug_04gpu_48cpu_72GB.sh | interactive | debug | 4:00:00 | 1 | 48 | 4 | 72 | submit interactive job, 1 full node for 4 hours GPU task in "debug" partition
sub_solo_01node_01gpu_12cpu_18GB.sb | sbatch | solo | 72:00:00 | 1 | 12 | 1 | 18 | submit batch job, 25% of 1 full node for 72 hours GPU task in "solo" partition
sub_solo_01node_02gpu_24cpu_36GB.sb | sbatch | solo | 72:00:00 | 1 | 24 | 2 | 36 | submit batch job, 50% of 1 full node for 72 hours GPU task in "solo" partition
sub_solo_01node_04gpu_48cpu_72GB.sb | sbatch | solo | 72:00:00 | 1 | 48 | 4 | 72 | submit batch job, 1 full node for 72 hours GPU task in "solo" partition
sub_ssd_01node_01gpu_12cpu_18GB.sb | sbatch | ssd | 72:00:00 | 1 | 12 | 1 | 18 | submit batch job, 25% of 1 full node for 72 hours GPU task in "ssd" partition
sub_batch_02node_08gpu_96cpu_144GB.sb | sbatch | batch | 72:00:00 | 2 | 96 | 8 | 144 | submit batch job, 2 full nodes for 72 hours GPU task in "batch" partition
sub_batch_16node_64gpu_768cpu_1152GB.sb | sbatch | batch | 72:00:00 | 16 | 768 | 64 | 1152 | submit batch job, 16 full nodes for 72 hours GPU task in "batch" partition
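
For example, a new user could copy one of the sample scripts listed above, adjust the resource requests, and submit it (assuming the file names in the table exist under "/opt/apps/samples-runscript"):

cp /opt/apps/samples-runscript/sub_solo_01node_01gpu_12cpu_18GB.sb my_job.sb
# edit my_job.sb as needed, then submit it
sbatch my_job.sb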

Native SLURM style

Submit Interactive Job with "srun"

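# request 1 node in the "debug" partition with 12 tasks, 1500 MB per CPU, and one V100 GPU
# for 1.5 hours, then start an interactive bash shell on the allocated node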
srun --partition=debug --pty --nodes=1 \
--ntasks-per-node=12 --cores-per-socket=12 --mem-per-cpu=1500 --gres=gpu:v100:1 \
-t 01:30:00 --wait=0 \
--export=ALL /bin/bash

Submit Batch Job

sbatch [job_script]
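
A minimal batch script sketch that requests resources similar to the interactive example above (the job name, output/error file names, and resource values are illustrative only):

#!/bin/bash
#SBATCH --job-name=my_gpu_job
#SBATCH --partition=solo
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --mem-per-cpu=1500
#SBATCH --gres=gpu:v100:1
#SBATCH --time=01:30:00
#SBATCH --output=my_gpu_job.%j.out
#SBATCH --error=my_gpu_job.%j.err

# commands to run on the allocated node
srun hostname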

Check Job Status

squeue                # check all jobs from all users 
squeue -u [user_name] # check all jobs belonging to user_name

Cancel Running Job

scancel [job_id] # cancel job with [job_id]

PBS style

Some PBS commands are supported by SLURM.

Check Node Status

pbsnodes

Check Job Status

qstat -f [job_number]

Check Queue Status

qstat

Delete Job

qdel [job_number]

Submit Batch Job

$ cat test.pbs
#!/usr/bin/sh
#PBS -N test
#PBS -l nodes=1
#PBS -l walltime=10:00

hostname
$ qsub test.pbs
107
$ cat test.pbs.o107
hal01.hal.ncsa.illinois.edu