"My name is HAL. I became operational on March 25 2019 at the Innovative Systems Lab in Urbana, Illinois. My creators are putting me to the fullest possible use, which is all I think that any conscious entity can ever hope to do." (

paraphrazed from

paraphrased from https://en.wikipedia.org/wiki/HAL_9000)

In publications and presentations that use results obtained on this system, please include the following acknowledgement: “This work utilizes resources supported by the National Science Foundation’s Major Research Instrumentation program, grant #1725729, as well as the University of Illinois at Urbana-Champaign”.

Also, please include the following reference in your publications: “V. Kindratenko, D. Mu, Y. Zhan, J. Maloney, S. Hashemi, B. Rabe, K. Xu, R. Campbell, J. Peng, and W. Gropp. HAL: Computer System for Scalable Deep Learning. In Practice and Experience in Advanced Research Computing (PEARC ’20), July 26–30, 2020, Portland, OR, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3311790.3396649”.

Hardware-Accelerated Learning (HAL) cluster

Info
Effective May 19, 2020, two-factor authentication via NCSA Duo is required for SSH logins on HAL. See https://go.ncsa.illinois.edu/2fa for sign-up instructions.

Host name: hal.ncsa.illinois.edu

Hardware

16 IBM AC922 nodes
- IBM 8335-GTH AC922 server
  - 2x 20-core IBM POWER9 CPU @ 2.4GHz
  - 256 GB DDR4
- 4x NVIDIA V100 GPUs
  - 5120 cores
  - 16 GB HBM2
- 2-Port EDR 100 Gb/s IB ConnectX-5 Adapter
1 IBM 9006-22P storage node
- 72 TB hardware RAID array, NFS-mounted on all nodes via IB EDR
3 DDN GS400NVE Flash Arrays
- 360 TB usable, NVMe SSD-based storage
- Spectrum Scale File System

Software

Red Hat Enterprise Linux 8.4
CUDA 11.2.2
- cuDNN 8.1.1
- NCCL 2.8.3
NVIDIA HPC SDK 21.5
PowerAI 1.7.0
OpenCE 1.3.1
SLURM 20.02.3

Documentation

Job management with SLURM
Modules management
Getting started with WMLCE (former PowerAI)
Using Jupyter Notebook
How to Customize Python Environment on HAL
Installing python packages
Working with containers
Profiling GPU Programs
Data Movement In/Out of HAL
Distributed Training on HAL System

Science on HAL

Software for HAL

To request access: fill out this form. Make sure to follow the link in the application confirmation email to request an actual system account.

Frequently Asked Questions

To report problems: email us.

For our new users: New User Guide for HAL System

User group Slack space: https://join.slack.com/t/halillinoisncsa

Real-time Dashboards: Here

HAL OnDemand portal: https://hal-ondemand.ncsa.illinois.edu/

System status: https://hal-monitor.ncsa.illinois.edu:3000/

Globus Endpoint: ncsa#hal

Quick start guide (for complete details, see the Documentation section on the left):

To connect to the cluster:

```bash
ssh <username>@hal.ncsa.illinois.edu
```
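Because every SSH login now goes through NCSA Duo, it can help to reuse one authenticated connection for later `ssh`/`scp` sessions. The snippet below is a minimal sketch that prints an OpenSSH client config block using the standard `ControlMaster`, `ControlPath`, and `ControlPersist` options. The function name `hal_ssh_config`, the alias `hal`, and the 10-minute persistence are illustrative assumptions, not HAL settings, and `<username>` must be replaced with your own account name.

```bash
# Prints an OpenSSH client config block for HAL; append it to ~/.ssh/config yourself.
# Assumptions: the alias "hal" and the 10-minute ControlPersist value are illustrative.
# ControlMaster/ControlPath/ControlPersist let later ssh/scp sessions to the same
# host reuse one already-authenticated (Duo-approved) connection.
hal_ssh_config() {
  cat <<EOF
Host hal
    HostName hal.ncsa.illinois.edu
    User $1
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
EOF
}

hal_ssh_config "<username>"
```

For example, `hal_ssh_config jdoe >> ~/.ssh/config` (with `jdoe` standing in for your username) followed by `ssh hal`. The first login still prompts for Duo; if the server permits session multiplexing, sessions opened while the master connection persists reuse it instead of authenticating again.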

To submit an interactive job:

```bash
swrun -p gpux1
```

or

```bash
srun --partition=gpux1 --pty --nodes=1 \
 --ntasks-per-node=12 --cores-per-socket=3 \
 --threads-per-core=4 --sockets-per-node=1 \
 --gres=gpu:v100:1 --mem-per-cpu=1500 \
 --time=2:00:00 --wait=0 --export=ALL /bin/bash
```
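The `srun` flags above are chosen to agree with one another: 12 tasks per node equals 1 socket × 3 cores per socket × 4 threads per core, and `--mem-per-cpu` is multiplied by the number of CPUs allocated. The arithmetic below is a quick sketch of that check, assuming SLURM counts each POWER9 hardware thread as one CPU on HAL; the variable names are illustrative.

```bash
# Values copied from the interactive srun example
sockets_per_node=1
cores_per_socket=3
threads_per_core=4
mem_per_cpu_mb=1500

# One task per hardware thread: 1 socket x 3 cores x 4 threads
ntasks_per_node=$(( sockets_per_node * cores_per_socket * threads_per_core ))
# Assumes each hardware thread is counted as one SLURM CPU
total_mem_mb=$(( ntasks_per_node * mem_per_cpu_mb ))

echo "ntasks-per-node: ${ntasks_per_node}"   # 12
echo "total memory: ${total_mem_mb} MB"      # 18000
```

If you change one of the topology flags, recomputing the product this way keeps `--ntasks-per-node` consistent with the socket, core, and thread counts.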

To submit a batch job:

```bash
swbatch run_script.swb
```

or

```bash
sbatch run_script.sb
```
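The contents of `run_script.sb` are not reproduced on this page. The script below is a minimal sketch of what such a batch script might look like, assuming the same single-GPU resources as the interactive `srun` example; the job name `hal_example` and the body are placeholders, and the `swbatch` (`.swb`) format is not covered here. Lines starting with `#SBATCH` are read by `sbatch` and are plain comments to the shell, so the script can also be tried locally.

```bash
#!/bin/bash
#SBATCH --job-name=hal_example
#SBATCH --partition=gpux1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --cores-per-socket=3
#SBATCH --threads-per-core=4
#SBATCH --sockets-per-node=1
#SBATCH --gres=gpu:v100:1
#SBATCH --mem-per-cpu=1500
#SBATCH --time=2:00:00
#SBATCH --export=ALL

# Placeholder workload: report where the job ran.
# SLURM_JOB_ID is set inside a SLURM job; outside one it falls back to "local".
msg="job ${SLURM_JOB_ID:-local} running on $(uname -n)"
echo "$msg"

# Replace with your own commands, e.g. loading modules and launching training.
```

Note that `--wait=0` from the `srun` example is left out on purpose: for `sbatch`, `--wait` has a different meaning (block until the job finishes).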

The following information is out of date; see Job management with SLURM instead.

See run_script.swb and run_script.sb for a basic example.

Job queue time limits:

"debug" queue: 4 hours
"gpux<n>" and "cpun<n>" queues: 24 hours

Resource limits:

5 concurrently running jobs
Concurrently allocated resources:
- 5 nodes
- 16 GPUs
For larger or more numerous jobs, please contact the admins about a special arrangement and/or a reservation.
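Assuming these limits apply per user, the sketch below is a rough self-check: it counts your running jobs with `squeue` (`-h` suppresses the header, `-t RUNNING` keeps only running jobs, `-u` selects the user) and compares the count with the 5-job limit listed here. The variable names are illustrative, and the count falls back to 0 when `squeue` is not available, for example off the cluster.

```bash
# Compare your running-job count with the 5-job limit listed above.
MAX_RUNNING_JOBS=5
me="${USER:-$(id -un)}"

if command -v squeue >/dev/null 2>&1; then
  # -h: no header line, -t RUNNING: running jobs only, -u: this user's jobs
  running=$(( $(squeue -h -t RUNNING -u "$me" | wc -l) ))
else
  # squeue not found (e.g. not on a HAL node)
  running=0
fi

echo "running jobs: ${running} of ${MAX_RUNNING_JOBS}"
if [ "$running" -ge "$MAX_RUNNING_JOBS" ]; then
  echo "running-job limit reached"
fi
```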

To load the OpenCE module (provides PyTorch, TensorFlow, and other ML tools):

```bash
module load opence
```
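After loading the module, a quick way to confirm that the environment works is to import PyTorch and ask whether it sees a GPU. The function below is a hedged sketch: `check_opence` is a hypothetical name, the module name `opence` is taken from the command above, and it is meant to run inside a GPU job (for example, one started with `swrun -p gpux1`), since a login node may not expose a GPU.

```bash
# Hypothetical helper: load OpenCE and check that PyTorch can see a CUDA device.
check_opence() {
  if ! type module >/dev/null 2>&1; then
    echo "module command not found; run this on a HAL node"
    return 0
  fi
  module load opence || return 1
  python -c 'import torch; print("CUDA available:", torch.cuda.is_available())'
}

check_opence
```

Inside a GPU job the last line is expected to print `CUDA available: True`; `False` usually means no GPU was allocated to the job.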

To see CLI scheduler status:

```bash
swqueue
```

Contact us

Request access to this system: Application

Contact ISL staff: Email Address

Visit: NCSA, room 3050E
