...

This is a beta version of the deployment. If you encounter any problems, please ask for help in the HAL Slack channel.

The current version of the deployment requires at least one idle compute node on HAL at cluster start time.
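One way to check whether an idle node is available before launching is the standard SLURM sinfo command on the HAL login node (the partition name below is a placeholder; use one of the HAL queues):

Code Block
sinfo --states=idle --partition=<partition_name>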


Always remember to tear down your Ray cluster after use!

...

  •  Change HEAD and WORKER CPUS/GPUS according to the partition you want to use; see the "Available queues" under the "Native SLURM Style" section at Job management with SLURM. If you want to run the Ray head node outside SLURM, set the CPUS/GPUS of the head node to 0. (A sketch of the corresponding SBATCH directives is shown below.)
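For reference, the resource requests in the head/worker SLURM scripts are ordinary SBATCH directives; the directive values below are purely illustrative and must be matched to the partition you choose:

Code Block
#SBATCH --partition=<partition_name>   # pick from the "Available queues" list
#SBATCH --cpus-per-task=<CPUS>         # HEAD or WORKER CPUs for this script
#SBATCH --gres=gpu:<GPUS>              # HEAD or WORKER GPUs for this script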


2. Change slurm/worker.slurm

...

After the module is deployed, you may want a different configuration each time you launch a Ray cluster. For example, you may want to change the maximum number of nodes in the cluster for different workloads. Such launch-specific changes require editing only the cluster config YAML file.
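For example, if the config file follows the standard Ray autoscaler schema, the maximum cluster size is controlled by a single top-level field (the field name below is the standard autoscaler key; confirm it against your copy of the file):

Code Block
# ray-slurm.yaml (launch-specific excerpt)
max_workers: 4    # maximum number of worker nodes the autoscaler is allowed to start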

...

For now: 

  • Change "init_command" on lines 43 and 60 to activate your own environment
  • Set "under_slurm:" on line 37 to 0 or 1 based on your need (see the GitHub doc for an explanation)
  • If you don't have a reservation, comment out lines 45 and 62, i.e. the lines that say - "#SBATCH --reservation=username" (make sure the commented lines still line up correctly with the rest of the file)
  • Change lines 46 and 63 according to the partition you want to use (see the sketch after this list)
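The excerpt below is only a rough sketch of the kinds of lines these edits touch; the actual layout and line numbers in ray-slurm.yaml will differ, so always edit against your own copy of the file.

Code Block
# Illustrative excerpt only - the real ray-slurm.yaml layout and line numbers may differ.
under_slurm: 1                              # 0 or 1 (see the GitHub doc)
init_command: source activate <your_env>    # appears in both the head and worker sections
#SBATCH --reservation=username              # comment this line out if you have no reservation
#SBATCH --partition=<partition_name>        # change to the partition you want to use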

...

Code Block
ray up ray-slurm.yaml --no-config-cache


If under_slurm is set to 1, at least one idle node is required; otherwise, the launch process keeps retrying until an idle node becomes available.

If you force-terminated the launch process, run "ray down ray-slurm.yaml" to perform garbage collection.


If the "ray up" command above runs successfully, the Ray cluster with autoscaling functionality should now be running. To connect to the Ray cluster from your Python code:

...

Code Block
ray.init(address="192.168.20.203:<gcs_port>", redis_password="<The password generated at start time>")
  • If you launch the head node inside SLURM (under_slurm = 1), you need to find the head node IP from the printout message and use the Ray client to connect to it (a quick connection check is sketched after the code block below)
Code Block
ray.init(address="ray://<head_ip>:<gcs_port>", redis_password="<The password generated at start time>")
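Once connected, a quick way to verify that the cluster can schedule work is to run a trivial remote task. The snippet below is a minimal sketch; replace the address, port, and password placeholders with the values printed when the cluster started:

Code Block
import ray

# Placeholders: use the head IP / GCS port / password printed at cluster start time.
ray.init(address="ray://<head_ip>:<gcs_port>", redis_password="<The password generated at start time>")

@ray.remote
def ping():
    # Runs on a cluster worker; returning the hostname shows the task left the driver.
    import socket
    return socket.gethostname()

print(ray.get(ping.remote()))   # should print the hostname of a compute node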


Always remember to tear down your Ray cluster after use!
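Tearing down uses the same config file that was used to launch the cluster:

Code Block
ray down ray-slurm.yaml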

...