For complete SLURM documentation, see https://slurm.schedmd.com/. Here we only show simple examples with system-specific instructions.

HAL Slurm Wrapper Suite (Recommended)

Introduction

The HAL Slurm Wrapper Suite is designed to help users use the HAL system easily and efficiently. The current version is "swsuite-v0.4", which includes:

srun (Slurm command) → swrun : request resources to run interactive jobs.

sbatch (Slurm command) → swbatch : request resources and submit a batch script to Slurm.

squeue (Slurm command) → swqueue : check currently running jobs and the status of computational resources.

Rule of Thumb

  • Minimize the required input options.
  • Stay consistent with the original Slurm run-script format.
  • Submit each job to a suitable partition based on the number of GPUs needed (or the number of nodes, for CPU partitions).

Usage

  • swrun -p <partition_name> -c <cpu_per_gpu> -t <walltime> -r <reservation_name>
    • <partition_name> (required) : cpun1, cpun2, cpun4, cpun8, gpux1, gpux2, gpux3, gpux4, gpux8, gpux12, gpux16.
    • <cpu_per_gpu> (optional) : 12 cpus (default), range from 12 cpus to 36 cpus.
    • <walltime> (optional) : 24 hours (default), range from 1 hour to 72 hours.
    • <reservation_name> (optional) : reservation name granted to user.
    • example: swrun -p gpux4 -c 36 -t 72 (request a full node: 1x node, 4x GPUs, 144x CPUs, 72x hours)
  • swbatch <run_script>
    • <run_script> (required) : batch script in the same format as an original Slurm batch script.
    • <job_name> (optional) : job name.
    • <output_file> (optional) : output file name.
    • <error_file> (optional) : error file name.
    • <partition_name> (required) : cpun1, cpun2, cpun4, cpun8, gpux1, gpux2, gpux3, gpux4, gpux8, gpux12, gpux16.
    • <cpu_per_gpu> (optional) : 12 cpus (default), range from 12 cpus to 36 cpus.
    • <walltime> (optional) : 24 hours (default), range from 1 hour to 72 hours.
    • <reservation_name> (optional) : reservation name granted to user.
    • example: swbatch demo.swb

      demo.swb
      #!/bin/bash
      
      #SBATCH --job-name="demo"
      #SBATCH --output="demo.%j.%N.out"
      #SBATCH --error="demo.%j.%N.err"
      #SBATCH --partition=gpux1
      
      srun hostname
  • swqueue
    • example: swqueue
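
The option ranges listed above can be checked before a request is submitted. The following is a minimal sketch, not part of swsuite: the function name swrun_cmd is hypothetical, and it only prints the command line it would run, using the partition names, CPU range, and walltime range from the Usage list.

```shell
#!/bin/bash
# Hypothetical helper (not part of swsuite): validate the options against
# the ranges in the Usage list and print the resulting swrun command line.
swrun_cmd() {
    local partition=$1 cpus=${2:-12} hours=${3:-24}

    # Partition names as listed for <partition_name> above.
    case "$partition" in
        cpun1|cpun2|cpun4|cpun8|gpux1|gpux2|gpux3|gpux4|gpux8|gpux12|gpux16) ;;
        *) echo "unknown partition: $partition" >&2; return 1 ;;
    esac
    # <cpu_per_gpu>: 12 (default) to 36.
    if [ "$cpus" -lt 12 ] || [ "$cpus" -gt 36 ]; then
        echo "cpu_per_gpu must be between 12 and 36" >&2; return 1
    fi
    # <walltime>: 1 to 72 hours, 24 by default.
    if [ "$hours" -lt 1 ] || [ "$hours" -gt 72 ]; then
        echo "walltime must be between 1 and 72 hours" >&2; return 1
    fi
    echo "swrun -p $partition -c $cpus -t $hours"
}

# Print the full-node request from the example above.
swrun_cmd gpux4 36 72
```

Printing the command instead of running it keeps the sketch usable on machines without the wrapper suite installed; on HAL the printed line could be executed directly.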

New Job Queues

| Partition Name | Priority | Max Walltime | Nodes Allowed | Min-Max CPUs Per Node Allowed | Min-Max Mem Per Node Allowed | GPU Allowed | Local Scratch | Description |
|---|---|---|---|---|---|---|---|---|
| gpux1 | normal | 72 hrs | 1 | 16-40 | 19.2-48 GB | 1 | none | designed to access 1 GPU on 1 node to run sequential and/or parallel jobs. |
| gpux2 | normal | 72 hrs | 1 | 32-80 | 38.4-96 GB | 2 | none | designed to access 2 GPUs on 1 node to run sequential and/or parallel jobs. |
| gpux3 | normal | 72 hrs | 1 | 48-120 | 57.6-144 GB | 3 | none | designed to access 3 GPUs on 1 node to run sequential and/or parallel jobs. |
| gpux4 | normal | 72 hrs | 1 | 64-160 | 76.8-192 GB | 4 | none | designed to access 4 GPUs on 1 node to run sequential and/or parallel jobs. |
| gpux8 | normal | 72 hrs | 2 | 64-160 | 76.8-192 GB | 8 | none | designed to access 8 GPUs on 2 nodes to run sequential and/or parallel jobs. |
| gpux12 | normal | 72 hrs | 3 | 64-160 | 76.8-192 GB | 12 | none | designed to access 12 GPUs on 3 nodes to run sequential and/or parallel jobs. |
| gpux16 | normal | 72 hrs | 4 | 64-160 | 76.8-192 GB | 16 | none | designed to access 16 GPUs on 4 nodes to run sequential and/or parallel jobs. |
| cpun1 | normal | 72 hrs | 1 | 96-96 | 115.2-115.2 GB | 0 | none | designed to access 96 CPUs on 1 node to run sequential and/or parallel jobs. |
| cpun2 | normal | 72 hrs | 2 | 96-96 | 115.2-115.2 GB | 0 | none | designed to access 96 CPUs on 2 nodes to run sequential and/or parallel jobs. |
| cpun4 | normal | 72 hrs | 4 | 96-96 | 115.2-115.2 GB | 0 | none | designed to access 96 CPUs on 4 nodes to run sequential and/or parallel jobs. |
| cpun8 | normal | 72 hrs | 8 | 96-96 | 115.2-115.2 GB | 0 | none | designed to access 96 CPUs on 8 nodes to run sequential and/or parallel jobs. |
| cpun16 | normal | 72 hrs | 16 | 96-96 | 115.2-115.2 GB | 0 | none | designed to access 96 CPUs on 16 nodes to run sequential and/or parallel jobs. |
| cpu_mini | normal | 72 hrs | 1 | 8-8 | 9.6-9.6 GB | 0 | none | designed to access 8 CPUs on 1 node to run tensorboard jobs. |
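
The partition names encode their shape: the number after "gpux" is the GPU count and the number after "cpun" is the node count. The sketch below recovers the node and GPU counts from a partition name. The function name partition_shape is hypothetical, and the four-GPUs-per-node rule is inferred from the table rows above rather than taken from a published specification.

```shell
#!/bin/bash
# Hypothetical lookup: print "<nodes> <gpus>" for a partition name,
# following the rows of the job queue table.
partition_shape() {
    local name=$1 n
    case "$name" in
        gpux1|gpux2|gpux3|gpux4|gpux8|gpux12|gpux16)
            n=${name#gpux}
            # Assumption read off the table: at most 4 GPUs per node.
            echo "$(( (n + 3) / 4 )) $n" ;;
        cpun1|cpun2|cpun4|cpun8|cpun16)
            echo "${name#cpun} 0" ;;
        cpu_mini)
            echo "1 0" ;;
        *)
            echo "unknown partition: $name" >&2; return 1 ;;
    esac
}

partition_shape gpux12   # 3 nodes, 12 GPUs
partition_shape cpun4    # 4 nodes, 0 GPUs
```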

HAL Wrapper Suite Example Job Scripts

New users should check the example job scripts at "/opt/samples/runscripts" and request adequate resources.

| Script Name | Job Type | Partition | Walltime | Nodes | CPU | GPU | Memory | Description |
|---|---|---|---|---|---|---|---|---|
| run_gpux1_12cpu_24hrs.sh | interactive | gpux1 | 24 hrs | 1 | 12 | 1 | 18 GB | submit interactive job, 1x node for 24 hours w/ 12x CPU 1x GPU task in "gpux1" partition. |
| run_gpux2_24cpu_24hrs.sh | interactive | gpux2 | 24 hrs | 1 | 24 | 2 | 36 GB | submit interactive job, 1x node for 24 hours w/ 24x CPU 2x GPU task in "gpux2" partition. |
| sub_gpux1_12cpu_24hrs.sb | batch | gpux1 | 24 hrs | 1 | 12 | 1 | 18 GB | submit batch job, 1x node for 24 hours w/ 12x CPU 1x GPU task in "gpux1" partition. |
| sub_gpux2_24cpu_24hrs.sb | batch | gpux2 | 24 hrs | 1 | 24 | 2 | 36 GB | submit batch job, 1x node for 24 hours w/ 24x CPU 2x GPU task in "gpux2" partition. |
| sub_gpux4_48cpu_24hrs.sb | batch | gpux4 | 24 hrs | 1 | 48 | 4 | 72 GB | submit batch job, 1x node for 24 hours w/ 48x CPU 4x GPU task in "gpux4" partition. |
| sub_gpux8_96cpu_24hrs.sb | batch | gpux8 | 24 hrs | 2 | 96 | 8 | 144 GB | submit batch job, 2x node for 24 hours w/ 96x CPU 8x GPU task in "gpux8" partition. |
| sub_gpux16_192cpu_24hrs.sb | batch | gpux16 | 24 hrs | 4 | 192 | 16 | 288 GB | submit batch job, 4x node for 24 hours w/ 192x CPU 16x GPU task in "gpux16" partition. |

Native SLURM style

Submit Interactive Job with "srun"

srun --partition=debug --pty --nodes=1 \
     --ntasks-per-node=16 --cores-per-socket=4 \
     --threads-per-core=4 --sockets-per-node=1 \
     --mem-per-cpu=1500 --gres=gpu:v100:1 \
     --time 01:30:00 --wait=0 \
     --export=ALL /bin/bash
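
The numeric flags in the srun command above are related to each other. The sketch below works through the arithmetic, assuming one CPU per task (the Slurm default when --cpus-per-task is not given); whether a "CPU" means a hardware thread or a core depends on the cluster configuration, so treat the result as illustrative.

```shell
#!/bin/bash
# Values copied from the srun example above.
sockets_per_node=1
cores_per_socket=4
threads_per_core=4
ntasks_per_node=16
mem_per_cpu_mb=1500

# Logical CPUs implied by the socket/core/thread layout.
logical_cpus=$(( sockets_per_node * cores_per_socket * threads_per_core ))

# Memory requested on the node, assuming one CPU per task.
total_mem_mb=$(( ntasks_per_node * mem_per_cpu_mb ))

echo "logical CPUs per node: $logical_cpus"
echo "memory requested per node (MB): $total_mem_mb"
```

With these values the layout provides exactly as many logical CPUs as there are tasks per node.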

Submit Batch Job

sbatch [job_script]

Check Job Status

squeue                # check all jobs from all users 
squeue -u [user_name] # check all jobs belonging to user_name

Cancel Running Job

scancel [job_id] # cancel job with [job_id]
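
The submit, check, and cancel commands above can be chained in a script once the job ID is captured. By default sbatch prints a confirmation line of the form "Submitted batch job <id>". The sketch below extracts the ID from that line; the function name job_id_from_sbatch is hypothetical, and the demo uses a canned line so that it runs without a Slurm installation.

```shell
#!/bin/bash
# Hypothetical helper: read sbatch's confirmation line on stdin and
# print its last whitespace-separated field, which is the job ID.
job_id_from_sbatch() {
    local line
    IFS= read -r line || [ -n "$line" ] || return 1
    echo "${line##* }"
}

# Demo on a canned confirmation line.
echo "Submitted batch job 107" | job_id_from_sbatch
```

On the cluster, the same helper could be used as job_id=$(sbatch [job_script] | job_id_from_sbatch), followed by squeue -j "$job_id" or scancel "$job_id". Running sbatch with the --parsable option is an alternative that prints the job ID without the surrounding text.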

PBS style

Some PBS commands are supported by SLURM.

Check Node Status

pbsnodes

Check Job Status

qstat -f [job_number]

Check Queue Status

qstat

Delete Job

qdel [job_number]

Submit Batch Job

$ cat test.pbs
#!/usr/bin/sh
#PBS -N test
#PBS -l nodes=1
#PBS -l walltime=10:00

hostname
$ qsub test.pbs
107
$ cat test.pbs.o107
hal01.hal.ncsa.illinois.edu
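
For comparison with the native Slurm style, the three #PBS directives in test.pbs have direct #SBATCH counterparts: -N maps to --job-name, -l nodes= maps to --nodes, and -l walltime= maps to --time. The sketch below rewrites only those three directives; the function name pbs_to_sbatch is hypothetical, and other PBS directives or resource formats are outside what it handles.

```shell
#!/bin/bash
# Hypothetical converter: rewrite the three PBS directives used in
# test.pbs into their Slurm equivalents; all other lines pass through.
pbs_to_sbatch() {
    sed -E \
        -e 's/^#PBS -N (.*)$/#SBATCH --job-name=\1/' \
        -e 's/^#PBS -l nodes=([0-9]+)$/#SBATCH --nodes=\1/' \
        -e 's/^#PBS -l walltime=(.*)$/#SBATCH --time=\1/'
}

# Convert the test.pbs script shown above.
pbs_to_sbatch <<'EOF'
#!/usr/bin/sh
#PBS -N test
#PBS -l nodes=1
#PBS -l walltime=10:00

hostname
EOF
```

The walltime value 10:00 is passed through unchanged because Slurm also accepts a minutes:seconds form for --time.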