...
For complete Slurm documentation, see https://slurm.schedmd.com/. Here we show only simple examples with system-specific instructions.
HAL Slurm Wrapper Suite (Recommended)
Introduction
The HAL Slurm Wrapper Suite is designed to help users run jobs on the HAL system easily and efficiently. The current version, "swsuite-v0.1", includes two wrappers:
srun → swrun : requests resources and runs an interactive job.
sbatch → swbatch : requests resources and submits a batch script to Slurm.
Usage
- swrun -q <queue_name> -c <cpu_per_gpu> -t <walltime>
- <queue_name> (required) : cpu, gpux1, gpux2, gpux3, gpux4, gpux8, gpux12, gpux16.
- <cpu_per_gpu> (optional) : 12 cpus (default), range from 12 cpus to 36 cpus.
- <walltime> (optional) : 24 hours (default), range from 1 hour to 72 hours.
- example: swrun -q gpux4 -c 36 -t 72 (requests a full node: 1x node, 4x gpus, 144x cpus, 72x hours)
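For reference, a swrun call with the default options likely maps to a plain srun invocation along these lines. This is a hypothetical sketch: the exact flag mapping swrun uses internally is an assumption, not documented here.

```shell
# Build the srun command that "swrun -q gpux1 -c 12 -t 24" might issue.
# Flag mapping (partition/cpus-per-task/time) is an assumption.
QUEUE=gpux1
CPUS=12
HOURS=24
SRUN_CMD="srun --partition=${QUEUE} --cpus-per-task=${CPUS} --time=${HOURS}:00:00 --pty bash"
echo "${SRUN_CMD}"
```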
- swbatch <run_script>
- <run_script> (required) : a standard Slurm batch script, which must specify the following:
- <job_name> (required) : job name.
- <output_file> (required) : output file name.
- <error_file> (required) : error file name.
- <queue_name> (required) : cpu, gpux1, gpux2, gpux3, gpux4, gpux8, gpux12, gpux16.
- <cpu_per_gpu> (optional) : 12 cpus (default), range from 12 cpus to 36 cpus.
- <walltime> (optional) : 24 hours (default), range from 1 hour to 72 hours.
- example: swbatch demo.sb
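A minimal demo.sb might look like the following sketch. The partition, time, and CPU values are illustrative defaults taken from the list above; the mapping of <cpu_per_gpu> to --cpus-per-task is an assumption, and the job body is a placeholder to replace with your own workload.

```shell
#!/bin/bash
#SBATCH --job-name=demo          # <job_name>
#SBATCH --output=demo.out        # <output_file>
#SBATCH --error=demo.err         # <error_file>
#SBATCH --partition=gpux1        # <queue_name>
#SBATCH --cpus-per-task=12       # <cpu_per_gpu> (default; mapping is an assumption)
#SBATCH --time=24:00:00          # <walltime> (default)

# Placeholder workload; replace with your own commands.
RESULT="demo job body executed"
echo "${RESULT}"
```

Submit it with `swbatch demo.sb`.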
New Job Queues
Partition Name | Priority | Max Walltime | Nodes Allowed | Min-Max CPUs Per Node Allowed | Min-Max Mem Per Node Allowed | GPU Allowed | Local Scratch | Description |
---|---|---|---|---|---|---|---|---|
gpu-debug | high | 4 hrs | 1 | 12-144 | 18-144 GB | 4 | none | |
gpux1 | normal | 72 hrs | 1 | 12-36 | 18-54 GB | 1 | none | |
gpux2 | normal | 72 hrs | 1 | 24-72 | 36-108 GB | 2 | none | |
gpux3 | normal | 72 hrs | 1 | 36-108 | 54-162 GB | 3 | none | |
gpux4 | normal | 72 hrs | 1 | 48-144 | 72-216 GB | 4 | none | |
cpu | normal | 72 hrs | 1 | 96-96 | 144-144 GB | 0 | none | |
gpux8 | low | 72 hrs | 2 | 48-144 | 72-216 GB | 8 | none | |
gpux12 | low | 72 hrs | 3 | 48-144 | 72-216 GB | 12 | none | |
gpux16 | low | 72 hrs | 4 | 48-144 | 72-216 GB | 16 | none | |
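Within a single node, the per-node limits in the table above scale linearly with GPU count: each GPU brings 12-36 CPUs and 18-54 GB of memory (the multi-node queues gpux8/gpux12/gpux16 repeat the 4-GPU per-node limits across 2-4 nodes). A quick sanity check of the gpux4 row:

```shell
# Per-GPU allotments read from the table: 12-36 CPUs, 18-54 GB memory.
GPUS=4
MIN_CPUS=$((12 * GPUS)); MAX_CPUS=$((36 * GPUS))
MIN_MEM=$((18 * GPUS));  MAX_MEM=$((54 * GPUS))
echo "gpux${GPUS}: ${MIN_CPUS}-${MAX_CPUS} CPUs, ${MIN_MEM}-${MAX_MEM} GB"
# → gpux4: 48-144 CPUs, 72-216 GB (matches the gpux4 row)
```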
Traditional Job Queues
Partition Name | Priority | Max Walltime | Min-Max Nodes Allowed | Max CPUs | Max Memory | Local Scratch (GB) | Description |
---|---|---|---|---|---|---|---|
debug | high | 4 hrs | 1-1 | 144 | 1.5 | None | designed for debug jobs on a single node |
solo | normal | 72 hrs | 1-1 | 144 | 1.5 | None | designed for sequential and/or parallel jobs on a single node |
ssd | normal | 72 hrs | 1-1 | 144 | 1.5 | 220 | similar to solo but with extra local scratch; limited to hal[01-04] |
batch | low | 72 hrs | 2-16 | 144 | 1.5 | None | designed for parallel jobs across 2-16 nodes (up to 64 GPUs) |
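Jobs on the traditional queues are submitted with plain sbatch. A minimal sketch of a two-node submission to the batch partition follows; the resource values are illustrative and the job body is a placeholder.

```shell
#!/bin/bash
#SBATCH --job-name=multinode-demo
#SBATCH --output=multinode-demo.out
#SBATCH --error=multinode-demo.err
#SBATCH --partition=batch
#SBATCH --nodes=2
#SBATCH --time=24:00:00

# Placeholder workload; replace with e.g. an MPI launch across the nodes.
NODES_REQUESTED=2
echo "requested ${NODES_REQUESTED} nodes on the batch partition"
```

Submit it with `sbatch <script_name>`.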
HAL Example Job Scripts
...
New users should review the example job scripts in "/opt/apps/samples-runscript" and request resources appropriate for their jobs.
...