Notice

This is a beta version of the deployment. If you run into any problems, please ask for help in the HAL Slack channel.


Always remember to tear down your Ray cluster after use!

Introduction

These are the instructions for setting up a Ray cluster with autoscaling capability on the HAL cluster. With autoscaling enabled, the Ray cluster adds "nodes" to the existing cluster by submitting new SLURM jobs and disconnects "nodes" by cancelling SLURM jobs when they are idle.


Deployment Instructions

The deployment process consists of getting the Ray library, modifying it to support the autoscaler on HAL, and configuring launch-specific parameters.

Get the Ray library into your private environment

Since Ray is not directly available on HAL and cannot be installed directly from pip, we need to modify the Ray library provided by Open-CE.

To do so, first load opence/1.6.1:

module load opence/1.6.1


Then clone this module to your own environment using this command: 

conda create --name <env_name> --clone opence/1.6.1

You can use any environment name you like (make sure you remember it). This step can take about 30 minutes. 

The path to ray after you've cloned the environment looks something like:  /home/<username>/.conda/envs/<env_name>/lib/python3.9/site-packages/ray


To activate the environment, use

conda activate <env_name>
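
To verify that Ray will be imported from the cloned environment (an optional sanity check, not part of the deployment steps), you can start Python inside the activated environment and print where Ray is loaded from:

import ray

# The printed path should be under
# /home/<username>/.conda/envs/<env_name>/lib/python3.9/site-packages/ray
print(ray.__file__)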


Configure bash to load modules automatically

The Ray autoscaler requires conda to be accessible whenever a shell is opened. To achieve this, you need to modify the file ~/.bashrc, which is executed automatically every time a bash shell starts.

1. Run

vi ~/.bashrc


This will open the bashrc file using vim. 


2. Hit the 'i' key to enter insert mode. Add this line under "# User specific environment":

module load opence/1.6.1


3. Hit the escape key to leave insert mode. Enter ':wq' to save and quit.

Deploy the Ray-SLURM autoscaler module into the Ray library


Download the autoscaler code and deployment script from https://github.com/TingkaiLiu/Ray-SLURM-autoscaler by running:

git clone https://github.com/TingkaiLiu/Ray-SLURM-autoscaler.git


Several changes to the deployment script are needed specifically for the HAL cluster.


1. Open deploy.py and change SLURM_IP_LOOKUP to the following, which maps HAL's node hostnames to their IP addresses:


SLURM_IP_LOOKUP = """{
    "hal01" : "192.168.20.1",
    "hal02" : "192.168.20.2",
    "hal03" : "192.168.20.3",
    "hal04" : "192.168.20.4",
    "hal05" : "192.168.20.5",
    "hal06" : "192.168.20.6",
    "hal07" : "192.168.20.7",
    "hal08" : "192.168.20.8",
    "hal09" : "192.168.20.9",
    "hal10" : "192.168.20.10",
    "hal11" : "192.168.20.11",
    "hal12" : "192.168.20.12",
    "hal13" : "192.168.20.13",
    "hal14" : "192.168.20.14",
    "hal15" : "192.168.20.15",
    "hal16" : "192.168.20.16",
}"""



2. Change slurm/worker.slurm to match HAL (typically, the environment setup, such as loading opence/1.6.1 and activating your conda environment before the worker is started)


3. Change slurm/head.slurm in the same way for the head node


4. Run deploy.py

python3 deploy.py

This should generate the ray-slurm.yaml file for cluster launching. 

Up to this point, the Ray autoscaler should already be enabled.

Configuration for specific cluster launch

After the module is deployed, you may want a different configuration each time you launch a Ray cluster. For example, you may want to change the maximum number of nodes in the cluster for different workloads. Those launch-specific changes require only edits to the cluster config yaml file (ray-slurm.yaml).

For now, to start the Ray cluster, run

ray up ray-slurm.yaml --no-config-cache


If under_slurm is set to 1, there must be at least one idle node; otherwise, the launch process keeps retrying until a node becomes idle.

If you force-terminated the launch process, run "ray down ray-slurm.yaml" to perform garbage collection.


If the "ray up" command runs successfully, the Ray cluster with autoscaling functionality should be started at this point. To connect to the Ray cluster in your Python code, 

ray.init(address="192.168.20.203:<gcs_port>", redis_password="<The password generated at start time>")
ray.init(address="ray://<head_ip>:<gcs_port>", redis_password="<The password generated at start time>")


Always remember to tear down your Ray cluster after use!

To tear down the Ray cluster, run 

ray down ray-slurm.yaml

Testing

You can check whether the Ray cluster is autoscaling by launching a heavy workload with Ray and watching whether new SLURM jobs are submitted (for example, with squeue).
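
Below is a minimal sketch of such a workload (the task count, sleep time, and num_cpus value are arbitrary assumptions, and it assumes you have already connected with ray.init as described above):

import time
import ray

# Assumes ray.init(...) has already connected to the cluster.

@ray.remote(num_cpus=1)
def busy(i):
    # Hold one CPU long enough for the autoscaler to notice the pending tasks.
    time.sleep(60)
    return i

print("Resources before scale-up:", ray.cluster_resources())

# Request far more CPUs than the head "node" provides, so the autoscaler
# should submit additional SLURM jobs for worker "nodes".
futures = [busy.remote(i) for i in range(256)]
results = ray.get(futures)

print("Resources after scale-up:", ray.cluster_resources())

While the tasks are pending, you should see new SLURM jobs appear in squeue; once the workload finishes and the nodes sit idle, the autoscaler should cancel those jobs again.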



Acknowledgment

This document was contributed by Tingkai Liu, Will Tegge, and Arnav Mehta.