For complete SLURM documentation, see https://slurm.schedmd.com/. Here we only show simple examples with system-specific instructions.

HAL Slurm Wrapper Suite (Recommended)

Introduction

The HAL Slurm Wrapper Suite is designed to help users run jobs on the HAL system easily and efficiently. The current version, "swsuite-v0.1", provides two wrappers:

srun → swrun : request resources and run an interactive job.

sbatch → swbatch : request resources and submit a batch script to Slurm.

Rule of Thumb

  • Minimize the required input options.
  • Stay consistent with the original "slurm" run-script format.
  • Submit each job to a suitable partition based on the number of GPUs needed.

Usage

  • swrun -q <queue_name> -c <cpu_per_gpu> -t <walltime>
    • <queue_name> (required) : cpu, gpux1, gpux2, gpux3, gpux4, gpux8, gpux12, gpux16.
    • <cpu_per_gpu> (optional) : 12 cpus (default), range from 12 cpus to 36 cpus.
    • <walltime> (optional) : 24 hours (default), range from 1 hour to 72 hours.
    • example: swrun -q gpux4 -c 36 -t 72 (request a full node: 1x node, 4x gpus, 144x cpus, 72x hours)
  • swbatch <run_script>
    • <run_script> (required) : same format as an original Slurm batch script.
    • <job_name> (required) : job name.
    • <output_file> (required) : output file name.
    • <error_file> (required) : error file name.
    • <queue_name> (required) : cpu, gpux1, gpux2, gpux3, gpux4, gpux8, gpux12, gpux16.
    • <cpu_per_gpu> (optional) : 12 cpus (default), range from 12 cpus to 36 cpus.
    • <walltime> (optional) : 24 hours (default), range from 1 hour to 72 hours.
    • example: swbatch demo.sb

      demo.sb
      #!/bin/bash
      
      #SBATCH --job-name="demo"
      #SBATCH --output="demo.%j.%N.out"
      #SBATCH --error="demo.%j.%N.err"
      #SBATCH --partition=gpux1
      
      srun hostname
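
The usage list marks four directives as required in the run script. A minimal sketch of a pre-submission check follows, assuming the long-option form used in demo.sb (short forms such as -J or -p are not recognized by this check). The helper name check_sbatch_script is hypothetical and is not part of swsuite.

```shell
#!/bin/bash
# Sketch: confirm that a run script declares the four directives listed as
# required (job name, output file, error file, queue/partition).
# check_sbatch_script is a hypothetical helper, not part of swsuite.
check_sbatch_script() {
    local script="$1" missing=0 opt
    for opt in job-name output error partition; do
        # Match lines such as: #SBATCH --job-name="demo"
        if ! grep -Eq "^#SBATCH[[:space:]]+--${opt}=" "$script"; then
            echo "missing: #SBATCH --${opt}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage on a HAL login node:
#   check_sbatch_script demo.sb && swbatch demo.sb
```

The function only reads the file, so it can be tried on any machine before the script is submitted.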

New Job Queues

| Partition Name | Priority | Max Walltime | Nodes Allowed | Min-Max CPUs Per Node Allowed | Min-Max Mem Per Node Allowed | GPU Allowed | Local Scratch | Description |
|---|---|---|---|---|---|---|---|---|
| gpu-debug | high | 4 hrs | 1 | 12-144 | 18-144 GB | 4 | none | designed to access 1 node to run debug job. |
| gpux1 | normal | 72 hrs | 1 | 12-36 | 18-54 GB | 1 | none | designed to access 1 GPU on 1 node to run sequential and/or parallel job. |
| gpux2 | normal | 72 hrs | 1 | 24-72 | 36-108 GB | 2 | none | designed to access 2 GPUs on 1 node to run sequential and/or parallel job. |
| gpux3 | normal | 72 hrs | 1 | 36-108 | 54-162 GB | 3 | none | designed to access 3 GPUs on 1 node to run sequential and/or parallel job. |
| gpux4 | normal | 72 hrs | 1 | 48-144 | 72-216 GB | 4 | none | designed to access 4 GPUs on 1 node to run sequential and/or parallel job. |
| cpu | normal | 72 hrs | 1 | 96-96 | 144-144 GB | 0 | none | designed to access 96 CPUs on 1 node to run sequential and/or parallel job. |
| gpux8 | low | 72 hrs | 2 | 48-144 | 72-216 GB | 8 | none | designed to access 8 GPUs on 2 nodes to run sequential and/or parallel job. |
| gpux12 | low | 72 hrs | 3 | 48-144 | 72-216 GB | 12 | none | designed to access 12 GPUs on 3 nodes to run sequential and/or parallel job. |
| gpux16 | low | 72 hrs | 4 | 48-144 | 72-216 GB | 16 | none | designed to access 16 GPUs on 4 nodes to run sequential and/or parallel job. |
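
The per-node CPU and memory ranges in the table follow from the <cpu_per_gpu> range of 12 to 36. The sketch below reproduces that arithmetic under two assumptions inferred from the table rather than from the swsuite source: a node holds at most 4 GPUs, and memory scales at 1.5 GB per CPU. The helper name queue_request is hypothetical.

```shell
#!/bin/bash
# Sketch: estimate the per-node request implied by a queue name.
# Assumptions (inferred from the queue table, not from swsuite itself):
#   - at most 4 GPUs per node, so gpux8/gpux12/gpux16 use 4 GPUs on each node
#   - memory scales at 1.5 GB per CPU
# queue_request is a hypothetical helper name.
queue_request() {
    local queue="$1" cpu_per_gpu="${2:-12}" gpus nodes cpus mem
    case "$queue" in
        cpu)
            echo "nodes=1 gpus=0 cpus_per_node=96 mem_per_node=144GB"
            return 0 ;;
        gpux1|gpux2|gpux3|gpux4|gpux8|gpux12|gpux16)
            gpus="${queue#gpux}" ;;
        *)
            echo "unknown queue: $queue" >&2
            return 1 ;;
    esac
    if [ "$cpu_per_gpu" -lt 12 ] || [ "$cpu_per_gpu" -gt 36 ]; then
        echo "cpu_per_gpu must be between 12 and 36" >&2
        return 1
    fi
    nodes=$(( (gpus + 3) / 4 ))             # one node per 4 GPUs, rounded up
    cpus=$(( gpus / nodes * cpu_per_gpu ))  # CPUs requested on each node
    mem=$(( cpus * 3 / 2 ))                 # 1.5 GB per CPU, rounded down to whole GB
    echo "nodes=$nodes gpus=$gpus cpus_per_node=$cpus mem_per_node=${mem}GB"
}

queue_request gpux4 36   # prints: nodes=1 gpus=4 cpus_per_node=144 mem_per_node=216GB
queue_request gpux2      # prints: nodes=1 gpus=2 cpus_per_node=24 mem_per_node=36GB
```

The first call matches the full-node swrun example (4 GPUs at 36 CPUs each give 144 CPUs).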

Native SLURM style

Submit Interactive Job with "srun"

srun --partition=debug --pty --nodes=1 \
--ntasks-per-node=12 --cores-per-socket=12 --mem-per-cpu=1500 --gres=gpu:v100:1 \
-t 01:30:00 --wait=0 \
--export=ALL /bin/bash
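
The command above requests 1 GPU with 12 tasks. As a sketch, the same command can be generated for 1 to 4 GPUs, assuming the 12-CPUs-per-GPU ratio used by the example job scripts for the "debug" partition. The helper name interactive_cmd is hypothetical; it only prints the command so that it can be reviewed before being run.

```shell
#!/bin/bash
# Sketch: print an srun command for an interactive job in the "debug" partition.
# Assumption: 12 tasks per GPU, as in the single-GPU command in this section.
# interactive_cmd is a hypothetical helper name.
interactive_cmd() {
    local gpus="${1:-1}" walltime="${2:-01:30:00}" cmd
    case "$gpus" in
        1|2|3|4) ;;
        *) echo "gpus must be 1-4 on a single node" >&2; return 1 ;;
    esac
    cmd="srun --partition=debug --pty --nodes=1"
    cmd="$cmd --ntasks-per-node=$(( 12 * gpus )) --cores-per-socket=12 --mem-per-cpu=1500"
    cmd="$cmd --gres=gpu:v100:${gpus} -t ${walltime} --wait=0 --export=ALL /bin/bash"
    printf '%s\n' "$cmd"
}

interactive_cmd 2 02:00:00
```

The walltime argument should stay within the 4-hour limit of the "debug" partition.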

Submit Batch Job

sbatch [job_script]

Check Job Status

squeue                # check all jobs from all users 
squeue -u [user_name] # check all jobs belonging to user_name
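
When many jobs are queued, a per-state summary can be easier to read than the full listing. A minimal sketch, using the standard squeue options -h (no header), -u (user), and -o "%T" (print only the job state); the helper name jobs_by_state is hypothetical.

```shell
#!/bin/bash
# Sketch: count one user's jobs per state (for example PENDING, RUNNING).
# jobs_by_state is a hypothetical helper name.
jobs_by_state() {
    local user="${1:-$USER}"
    # One state per line, then count identical lines and print STATE=count.
    squeue -h -u "$user" -o "%T" | sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Usage on a system with Slurm:
#   jobs_by_state              # jobs of the current user
#   jobs_by_state [user_name]  # jobs of another user
```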

Cancel Running Job

scancel [job_id] # cancel job with [job_id]

Job Queues

| Partition Name | Priority | Max Walltime | Min-Max Nodes Allowed | Max CPUs Per Node | Max Memory Per CPU (GB) | Local Scratch (GB) | Description |
|---|---|---|---|---|---|---|---|
| debug | high | 4 hrs | 1-1 | 144 | 1.5 | None | designed to access 1 node to run debug job |
| solo | normal | 72 hrs | 1-1 | 144 | 1.5 | None | designed to access 1 node to run sequential and/or parallel job |
| ssd | normal | 72 hrs | 1-1 | 144 | 1.5 | 220 | similar to solo partition with extra local scratch, limited to hal[01-04] |
| batch | low | 72 hrs | 2-16 | 144 | 1.5 | None | designed to access 2-16 nodes (up to 64 GPUs) to run parallel job |

HAL Example Job Scripts

New users should check the example job scripts at "/opt/apps/samples-runscript" and request adequate resources.

| Script Name | Job Type | Partition | Max Walltime | Nodes | CPU | GPU | Memory (GB) | Description |
|---|---|---|---|---|---|---|---|---|
| run_debug_00gpu_96cpu_216GB.sh | interactive | debug | 4:00:00 | 1 | 96 | 0 | 144 | submit interactive job, 1 full node for 4 hours CPU only task in "debug" partition |
| run_debug_01gpu_12cpu_18GB.sh | interactive | debug | 4:00:00 | 1 | 12 | 1 | 18 | submit interactive job, 25% of 1 full node for 4 hours GPU task in "debug" partition |
| run_debug_02gpu_24cpu_36GB.sh | interactive | debug | 4:00:00 | 1 | 24 | 2 | 36 | submit interactive job, 50% of 1 full node for 4 hours GPU task in "debug" partition |
| run_debug_04gpu_48cpu_72GB.sh | interactive | debug | 4:00:00 | 1 | 48 | 4 | 72 | submit interactive job, 1 full node for 4 hours GPU task in "debug" partition |
| sub_solo_01node_01gpu_12cpu_18GB.sb | sbatch | solo | 72:00:00 | 1 | 12 | 1 | 18 | submit batch job, 25% of 1 full node for 72 hours GPU task in "solo" partition |
| sub_solo_01node_02gpu_24cpu_36GB.sb | sbatch | solo | 72:00:00 | 1 | 24 | 2 | 36 | submit batch job, 50% of 1 full node for 72 hours GPU task in "solo" partition |
| sub_solo_01node_04gpu_48cpu_72GB.sb | sbatch | solo | 72:00:00 | 1 | 48 | 4 | 72 | submit batch job, 1 full node for 72 hours GPU task in "solo" partition |
| sub_ssd_01node_01gpu_12cpu_18GB.sb | sbatch | ssd | 72:00:00 | 1 | 12 | 1 | 18 | submit batch job, 25% of 1 full node for 72 hours GPU task in "ssd" partition |
| sub_batch_02node_08gpu_96cpu_144GB.sb | sbatch | batch | 72:00:00 | 2 | 96 | 8 | 144 | submit batch job, 2 full nodes for 72 hours GPU task in "batch" partition |
| sub_batch_16node_64gpu_768cpu_1152GB.sb | sbatch | batch | 72:00:00 | 16 | 768 | 64 | 1152 | submit batch job, 16 full nodes for 72 hours GPU task in "batch" partition |

PBS style

Some PBS commands are supported by SLURM.

Check Node Status

pbsnodes

Check Job Status

qstat -f [job_number]

Check Queue Status

qstat

Delete Job

qdel [job_number]

Submit Batch Job

$ cat test.pbs
#!/usr/bin/sh
#PBS -N test
#PBS -l nodes=1
#PBS -l walltime=10:00

hostname
$ qsub test.pbs
107
$ cat test.pbs.o107
hal01.hal.ncsa.illinois.edu
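
For comparison, the same request can be written with native Slurm directives. This is a sketch of an equivalent script, not a file shipped with the system; the file name test.sb is hypothetical, and no partition is set, which assumes the site default is acceptable for such a short job.

```shell
#!/bin/bash
# Sketch: Slurm counterpart of test.pbs (hypothetical file name: test.sb).
#SBATCH --job-name=test     # PBS: -N test
#SBATCH --nodes=1           # PBS: -l nodes=1
#SBATCH --time=10:00        # PBS: -l walltime=10:00 (10 minutes)

hostname
```

Such a script would be submitted with "sbatch test.sb" rather than qsub.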