Notice
This is a beta version of the deployment. If you run into any problems, please ask for help in the HAL Slack channel.
Always remember to tear down your Ray cluster after use!
Introduction
These are the instructions for setting up a Ray cluster with autoscaling capability on the HAL cluster. With autoscaling enabled, the Ray cluster adds "nodes" to the existing cluster by submitting new SLURM jobs, and disconnects "nodes" by cancelling SLURM jobs when they are idle.
Deployment Instructions
The deployment process consists of getting the Ray library, modifying it to support the autoscaler on HAL, and configuring launch-specific parameters.
Get the Ray library into your private environment
Since Ray is not directly available on HAL and cannot be installed directly from pip, we need to modify the Ray library based on the one provided by openCE.
To do so, first load opence/1.6.1:
module load opence/1.6.1
Then clone this module to your own environment using this command:
conda create --name <env_name> --clone opence/1.6.1
You can use any environment name you like (make sure you remember it). This step can take about 30 minutes.
After you've cloned the environment, the path to the Ray library looks something like: /home/<username>/.conda/envs/<env_name>/lib/python3.9/site-packages/ray
To activate the environment, use
conda activate <env_name>
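To confirm the clone worked and to note the Ray path you will need later in deploy.py, you can run a quick check inside the activated environment (a minimal sketch; the printed path should match the one above):

# Quick check inside the activated environment: print the location of
# the cloned Ray package. It should match
# /home/<username>/.conda/envs/<env_name>/lib/python3.9/site-packages/ray
import os
import ray
print(os.path.dirname(ray.__file__))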
Configure bash to load modules automatically
The Ray autoscaler requires conda to be accessible whenever a shell is opened. To ensure this, you need to modify the file ~/.bashrc, which is executed automatically every time a bash shell starts.
1. Run
vi ~/.bashrc
This will open the bashrc file using vim.
2. Hit the 'i' key to enter insert mode. Add the following line under "#User specific environment":
module load opence/1.6.1
3. Hit the escape key to leave insert mode. Enter ':wq' to save and quit.
Deploy the Ray-SLURM autoscaler module into Ray library
Download the autoscaler code and deployment script from https://github.com/TingkaiLiu/Ray-SLURM-autoscaler by running:
git clone https://github.com/TingkaiLiu/Ray-SLURM-autoscaler.git
Several changes to the deployment script are needed specifically for the HAL cluster:
1. Open deploy.py and change the following
- Set the RAY_PATH to be the path to your Ray library: /home/<username>/.conda/envs/<env_name>/lib/python3.9/site-packages/ray
- Set the SLURM_IP_LOOKUP table to be
SLURM_IP_LOOKUP = """{
    "hal01" : "192.168.20.1",
    "hal02" : "192.168.20.2",
    "hal03" : "192.168.20.3",
    "hal04" : "192.168.20.4",
    "hal05" : "192.168.20.5",
    "hal06" : "192.168.20.6",
    "hal07" : "192.168.20.7",
    "hal08" : "192.168.20.8",
    "hal09" : "192.168.20.9",
    "hal10" : "192.168.20.10",
    "hal11" : "192.168.20.11",
    "hal12" : "192.168.20.12",
    "hal13" : "192.168.20.13",
    "hal14" : "192.168.20.14",
    "hal15" : "192.168.20.15",
    "hal16" : "192.168.20.16",
}"""
- Change the HEAD and WORKER CPUS/GPUS according to the partition you want to use. See "Available queues" under the "Native SLURM Style" section of Job management with SLURM. If you want to run the Ray head node outside SLURM, set the head node's CPUS/GPUS to 0. (A sketch of these values appears after step 4 below.)
2. Change slurm/worker.slurm
- Change line 4 "#SBATCH --gpus-per-task=[_DEPLOY_WORKER_GPUS_]" to "#SBATCH --gres=gpu:[_DEPLOY_WORKER_GPUS_]"
- Under line "set -x" add "SLURM_GPUS_PER_TASK="[_DEPLOY_WORKER_GPUS_]"
3. Change slurm/head.slurm
- Change line 4 "#SBATCH --gpus-per-task=[_DEPLOY_HEAD_GPUS_]" to "#SBATCH --gres=gpu:[_DEPLOY_HEAD_GPUS_]"
- Under line "set -x" add "SLURM_GPUS_PER_TASK="[_DEPLOY_HEAD_GPUS_]"
4. Run deploy.py
python3 deploy.py
This should generate the ray-slurm.yaml file for cluster launching.
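For reference, after the edits in step 1 the relevant part of deploy.py might look roughly like the sketch below. The exact variable names for the head and worker CPU/GPU counts are assumptions for illustration; use the names that actually appear in your copy of deploy.py and the counts of your chosen partition.

# Sketch of deploy.py after editing (illustrative only; the CPU/GPU
# variable names below are assumed -- match them to your copy).
RAY_PATH = "/home/<username>/.conda/envs/<env_name>/lib/python3.9/site-packages/ray"

# Head node run outside SLURM, so its CPU/GPU counts are set to 0.
HEAD_CPUS = 0
HEAD_GPUS = 0

# Worker sizes should match the partition you plan to use on HAL.
WORKER_CPUS = 16
WORKER_GPUS = 1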
At this point, the Ray autoscaler should already be enabled.
Configuration for specific cluster launch
After the module is deployed, you may want a different configuration each time you launch a Ray cluster. For example, you may want to change the maximum number of nodes in your cluster for different workloads. These launch-specific changes require edits only to the cluster config yaml file (ray-slurm.yaml). A quick way to inspect the generated file is sketched after the list below.
For now:
- Change "init_command" on line 43 and 60 to be activating your own environment
- Set "under_slurm:" on line 37 to be 0 or 1 based on your need. (See the Github Doc for explanation)
- If you don't have a reservation, comment out lines 45 and 62 -> lines that say - "#SBATCH --reservation=username" (make sure the comment lines up correctly with rest of code)
- Change line 46 and 63 according to the partition you try to use
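If you mainly want to adjust the maximum cluster size between launches, you can open ray-slurm.yaml directly; the optional snippet below reads the generated file and prints that setting. The key name "max_workers" is an assumption based on the standard Ray cluster config format, so verify it against your generated file.

# Optional: inspect the generated cluster config before launching.
# The "max_workers" key is assumed from the standard Ray cluster YAML
# format; check ray-slurm.yaml for the exact keys it actually uses.
import yaml

with open("ray-slurm.yaml") as f:
    cfg = yaml.safe_load(f)

print("max_workers:", cfg.get("max_workers"))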
To start the Ray cluster, run
ray up ray-slurm.yaml --no-config-cache
If under_slurm is set to 1, at least one idle node is required; otherwise, the launching process keeps retrying until a node becomes idle.
If you force-terminated the launching process, run "ray down ray-slurm.yaml" to perform garbage collection.
If the "ray up" command runs successfully, the Ray cluster with autoscaling functionality should be started at this point. To connect to the Ray cluster in your Python code,
- If you launch head outside SLURM (under_slurm = 0), use
ray.init(address="192.168.20.203:<gcs_port>", redis_password="<The password generated at start time>")
- If you launched the head node inside SLURM (under_slurm = 1), find the head node IP in the startup output and use the Ray client to connect to it:
ray.init(address="ray://<head_ip>:<gcs_port>", redis_password="<The password generated at start time>")
Always remember to tear down your Ray cluster after use!
To tear down the Ray cluster, run
ray down ray-slurm.yaml
Testing
You can check whether the Ray cluster is autoscaling while you run a heavy workload with Ray in the following ways (a small test workload is sketched after this list):
- Check whether there are output messages starting with "(scheduler"
- Run "squeue -u <username>" to see whether new SLURM jobs are submitted automatically under your username
- Check the .out file produced by SLURM
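For example, one way to exercise the autoscaler is to submit more GPU tasks than the currently connected nodes can run at once and then watch squeue. The script below is a hypothetical smoke test, assuming one GPU per task and that ray.init(...) has already been called as in the previous section; adjust the task count and sleep time to your setup.

# Hypothetical autoscaling smoke test. Requesting more GPU tasks than
# the current nodes can serve should make the autoscaler submit new
# SLURM jobs (visible via "squeue -u <username>").
import time
import ray

@ray.remote(num_gpus=1)
def busy_task(i):
    time.sleep(120)  # hold the GPU long enough for scale-up to be visible
    return i

# Launch, e.g., 8 one-GPU tasks; pick a number larger than the GPUs
# currently available in the cluster.
results = ray.get([busy_task.remote(i) for i in range(8)])
print("Finished tasks:", results)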
Acknowledgment
This document was contributed by Tingkai Liu, Will Tegge, and Arnav Mehta.