IBM's DDL library is included in the PowerAI module (the most recent version is 'wmlce'). It supports the TensorFlow, PyTorch, and Caffe frameworks. It requires very little code modification and is well documented at the IBM Knowledge Center. In the context of the HAL cluster, this is one of the easiest ways to run a training job across multiple nodes, allowing users to train with more than the 4 GPUs hosted on a single node. However, it is not portable to machines without the WML CE (PowerAI) software distribution.
The only setup that is required is to load the WML CE (PowerAI) module.
module load wmlce
If you also want a copy of the example scripts provided by IBM, they can be copied to your home directory with
ddl-tensorflow-install-samples [example_dir]
The following job submission script can be run with any valid DDL program using 'sbatch'. For instance, this submission script runs IBM's MNIST example script across all 4 GPUs on each of two nodes.
#!/bin/bash
#SBATCH --job-name="ddl_mnist"
#SBATCH --output="ddl_mnist.%j.%N.out"
#SBATCH --error="ddl_mnist.%j.%N.err"
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=36
#SBATCH --gres=gpu:v100:4
#SBATCH --export=ALL
#SBATCH -t 1:00:00

# Expand the Slurm node list into a comma-separated host list,
# then strip the trailing comma.
NODE_LIST=$( scontrol show hostname $SLURM_JOB_NODELIST | sed -z 's/\n/,/g' )
NODE_LIST=${NODE_LIST%?}
echo $NODE_LIST

# "-mode b:4x2" matches this allocation: 4 GPUs per node on 2 nodes.
ddlrun --no_ddloptions -cores 18 -H $NODE_LIST python ~/[example_dir]/ddl-tensorflow/examples/mnist/mnist-init.py --ddl_options="-mode b:4x2" --mpiarg "--mca btl_openib_allow_ib 1 --mca orte_base_help_aggregate 1"
Horovod is a framework that allows users of popular machine learning frameworks such as TensorFlow, Keras, and PyTorch to easily adapt their applications to run across multiple GPUs or nodes. It requires a short environment setup and (like DDL) minor code modifications, but it is portable and popular. Newer versions of PowerAI may not require the environment setup, and this page will be updated accordingly once testing is finished on our system.
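To give a sense of how small those code modifications are, here is a minimal sketch of a Horovod-enabled TensorFlow (TF1-style) training loop, in the spirit of the linked tensorflow_mnist.py example. The toy model, learning rate, and step count are illustrative placeholders of ours, not code from this page.

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod (one process per GPU, launched via mpirun).
hvd.init()

# Pin each process to a single local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy regression model so the sketch is self-contained.
x = tf.random_normal([32, 10])
loss = tf.reduce_mean(tf.square(tf.layers.dense(x, 1)))

# Scale the learning rate by the number of workers and wrap the
# optimizer so gradients are averaged across all processes.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [
    # Broadcast initial variables from rank 0 so all workers start in sync.
    hvd.BroadcastGlobalVariablesHook(0),
    tf.train.StopAtStepHook(last_step=100),
]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)

The essential pattern is just four additions: hvd.init(), GPU pinning by local rank, the DistributedOptimizer wrapper, and the rank-0 broadcast hook; the rest is an ordinary training loop.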
The first step is to clone the PowerAI conda environment. This copies the packages in the environment to the user's home directory and thus allows the user to install or uninstall packages.
conda create --name [your_env_name] --clone powerai_env
The next step is to remove Spectrum MPI from the cloned environment, which will cause it to default to OpenMPI. First, we will remove the script that sets Spectrum MPI environment variables.
rm /home/[username]/.conda/envs/[your_env_name]/etc/conda/activate.d/spectrum-mpi.sh
Next, load your cloned environment.
conda activate [your_env_name]
Now, we'll uninstall both Spectrum MPI and Horovod, as we'll need to reinstall with OpenMPI as the default MPI implementation.
conda uninstall spectrum-mpi
pip uninstall horovod -y
Finally, reinstall Horovod.
HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
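Before submitting a full job, it can be worth a quick sanity check that the rebuilt Horovod initializes and sees every MPI process. The following one-file check is our own suggestion (the file name check_horovod.py is arbitrary), not an IBM-provided script:

# check_horovod.py -- sanity check for the rebuilt Horovod.
# Run with, for example:  mpirun -np 2 python check_horovod.py
import horovod.tensorflow as hvd

hvd.init()  # initializes Horovod on top of MPI
print("Horovod rank %d of %d (local rank %d)"
      % (hvd.rank(), hvd.size(), hvd.local_rank()))

Each process should print a distinct rank; if the script hangs or reports a size of 1 for multiple processes, the MPI setup needs revisiting.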
Once your environment is properly configured, you can submit batch jobs that use the GPUs of multiple nodes to perform a single distributed training run. The following submission script can be run with any valid Horovod program. For instance, it runs the Horovod example script at https://github.com/horovod/horovod/blob/master/examples/tensorflow_mnist.py.
#!/bin/bash
#SBATCH --job-name="hvd_tutorial"
#SBATCH --output="hvd_tutorial.%j.%N.out"
#SBATCH --error="hvd_tutorial.%j.%N.err"
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=36
#SBATCH --gres=gpu:v100:4
#SBATCH --export=ALL
#SBATCH -t 1:00:00

# Build a host list of the form "node1:4,node2:4" (4 slots per node),
# then strip the trailing comma.
NODE_LIST=$( scontrol show hostname $SLURM_JOB_NODELIST | sed -z 's/\n/\:4,/g' )
NODE_LIST=${NODE_LIST%?}
echo $NODE_LIST

mpirun -np $SLURM_NTASKS -H $NODE_LIST -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=^lo -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib -mca btl_openib_verbose 1 -mca btl_tcp_if_include 192.168.0.0/16 -mca oob_tcp_if_include 192.168.0.0/16 python tensorflow_mnist.py
GitHub repo for all source code and details: https://github.com/richardkxu/distributed-pytorch
Jupyter notebook tutorial for the key points: https://github.com/richardkxu/distributed-pytorch/blob/master/ddp_apex_tutorial.ipynb
ImageNet benchmark results for performance analysis and visualization: HAL Benchmarks 2020
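The linked repository is built around PyTorch's DistributedDataParallel (DDP) with one process per GPU. As a quick orientation before working through the notebook, here is a minimal, self-contained DDP sketch; the toy model, hyperparameters, and launch command are illustrative assumptions of ours, not code taken from the repo.

import argparse
import torch
import torch.distributed as dist
import torch.nn as nn

# The classic launcher (python -m torch.distributed.launch
# --nproc_per_node=4 ddp_sketch.py) passes --local_rank to each process.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# One process per GPU; NCCL handles the gradient all-reduce.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(args.local_rank)

model = nn.Linear(10, 1).cuda(args.local_rank)
model = nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])

opt = torch.optim.SGD(model.parameters(), lr=0.01)
data = torch.randn(32, 10).cuda(args.local_rank)
target = torch.randn(32, 1).cuda(args.local_rank)

for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(data), target)
    loss.backward()  # gradients are averaged across all processes here
    opt.step()

if dist.get_rank() == 0:
    print("final loss:", loss.item())

In a real job each rank would also wrap its dataset in a DistributedSampler so workers see disjoint shards; the notebook tutorial covers that, along with Apex mixed-precision training.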