...
NVIDIA requires registration before downloading CUDA, cuDNN and NCCL, which makes it impractical to fetch them as part of an automated build. The first step is therefore to manually download cuda_9.1.85.2_linux, cuda_9.1.85.1_linux, cuda_9.1.85.3_linux, cuda_9.1.85_387.26_linux, cudnn-9.0-linux-x64-v7.5.0.56.tgz and nccl-repo-ubuntu1604-2.4.2-ga-cuda9.0_1-1_amd64.deb from the NVIDIA servers.
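Since these downloads are manual, it is worth sanity-checking that all six installers are present before starting the otherwise automated image build. A minimal sketch, using the file names listed above (run it in the directory holding the downloads):

```shell
# Verify the manually downloaded NVIDIA installers are all present.
# File names are the ones listed above; adjust the directory as needed.
missing=0
for f in cuda_9.1.85_387.26_linux cuda_9.1.85.1_linux cuda_9.1.85.2_linux \
         cuda_9.1.85.3_linux cudnn-9.0-linux-x64-v7.5.0.56.tgz \
         nccl-repo-ubuntu1604-2.4.2-ga-cuda9.0_1-1_amd64.deb; do
  if [ ! -f "$f" ]; then
    echo "missing: $f" >&2
    missing=1
  fi
done
```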
...
```bash
qsub -I -l nodes=1:xk:ppn=16 -l walltime=3:00:00 -l gres=shifter16
module load shifter
shifterimg pull $USER/tensorflow:16.04
```
...
> **Info:** TensorFlow's `configure` script is designed for interactive use. To gain interactive access to the container one can have Shifter start an ssh daemon inside it; this is documented in the Blue Waters Shifter documentation.
...
```bash
# set up env variables so that configure does not actually ask any questions
# skeleton from https://gist.github.com/PatWie/0c915d5be59a518f934392219ca65c3d
# actual numbers from compiling locally to be able to respond to interactive
# prompts, then (mostly) from .tf_configure.bazelrc
export PYTHON_BIN_PATH=/usr/bin/python3
export PYTHON_LIB_PATH="$($PYTHON_BIN_PATH -c 'import site; print(site.getsitepackages()[0])')"
export CUDA_TOOLKIT_PATH=/usr/local/cuda
export CUDNN_INSTALL_PATH=/usr/local/cuda-9.1
export NCCL_INSTALL_PATH=/usr/local/cuda/lib64
export TF_NEED_GCP=0
export TF_NEED_CUDA=1
export TF_CUDA_VERSION=9.1
export TF_CUDA_COMPUTE_CAPABILITIES=3.5
export TF_NEED_IGNITE=0
export TF_NEED_ROCM=0
export TF_NEED_HDFS=0
export TF_NEED_OPENCL=0
export TF_NEED_JEMALLOC=1
export TF_ENABLE_XLA=0
export TF_NEED_VERBS=0
export TF_CUDA_CLANG=0
export TF_CUDNN_VERSION=7
export TF_NEED_MKL=0
export TF_DOWNLOAD_MKL=0
export TF_NEED_AWS=0
export TF_NEED_MPI=1
export MPI_HOME=/usr/lib/mpich
export TF_NEED_GDR=0
export TF_NEED_S3=0
export TF_NEED_OPENCL_SYCL=0
export TF_SET_ANDROID_WORKSPACE=0
export TF_NEED_COMPUTECPP=0
export TF_NEED_KAFKA=0
export TF_NEED_TENSORRT=0
export TF_NCCL_VERSION=2
export GCC_HOST_COMPILER_PATH=$(which gcc)
export CC_OPT_FLAGS="-march=native"
PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
./configure
```
At long last we are able to build TensorFlow. Before starting the build it is advisable to redirect bazel's cache from $HOME/.cache to our work directory, keeping IO requests away from the (slower) Lustre file system and directing them to /work instead, which is fast since it resides in /dev/shm.
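One way to redirect the cache is via the TEST_TMPDIR environment variable, which bazel consults when choosing its output root; bazel's `--output_user_root` startup option has the same effect. The cache path below is an assumption, pick any directory on the fast mount:

```shell
# Point bazel's cache at /work instead of $HOME/.cache/bazel.
# /work/bazel-cache is an assumed path; any directory on the fast mount works.
export TEST_TMPDIR=/work/bazel-cache
# equivalently, as an explicit bazel startup option:
# bazel --output_user_root=/work/bazel-cache build --config=opt \
#     //tensorflow/tools/pip_package:build_pip_package
```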
...
The full compile script as well as a PBS script to submit it via qsub can be found here and here. The final pip wheel file is tensorflow-1.12.1-cp35-cp35m-linux_x86_64.whl.
Installing TensorFlow on Blue Waters
Having successfully built a TensorFlow wheel on Blue Waters, we can install it in a virtualenv spun off from the python3 installation in the container.
```bash
mkdir tensorflow
cd tensorflow
/usr/bin/python3 -m virtualenv --system-site-packages --no-download -p /usr/bin/python3 $PWD
source bin/activate
pip3 install numpy==1.13.3 h5py==2.7.1 grpcio==1.8.6
pip3 install ../packages/tensorflow-1.12.1-cp35-cp35m-linux_x86_64.whl
```
These commands need to be executed inside the container, for example by putting them into a script file install.sh and running it with aprun:
```bash
#!/bin/bash
#PBS -l nodes=1:xk:ppn=16
#PBS -l walltime=0:10:0
#PBS -l gres=shifter

cd $PBS_O_WORKDIR
module load shifter
aprun -b -n 1 -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s \
  -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- /bin/bash ./install.sh
```
Test
These tests showcase how to use the container and TensorFlow. We will run them using a somewhat more complex invocation of Shifter that links the Cray libraries into the container via the /opt/cray mount point. First, we can obtain a limited interactive shell inside the container:
```bash
#!/bin/bash
#PBS -l nodes=1:xk:ppn=16
#PBS -l walltime=0:10:0
#PBS -l gres=shifter16

cd $PBS_O_WORKDIR
module load cudatoolkit
module unload PrgEnv-cray
module load PrgEnv-gnu
module load cray-mpich-abi
module load shifter
export CUDA_VISIBLE_DEVICES=0
export TF_LD_LIBRARY_PATH="/work/tensorflow/lib:$(readlink -f /opt/cray/wlm_detect/default/lib64):$(readlink -f /opt/cray/nvidia/default/lib64):/usr/local/cuda/lib64:$LD_LIBRARY_PATH:$CRAY_LD_LIBRARY_PATH"
aprun -b -n 1 -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s \
  -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- /bin/bash -i
```
...
```bash
#!/bin/bash
#PBS -l nodes=2:xk:ppn=16
#PBS -l walltime=0:30:0
#PBS -l gres=shifter16

cd $PBS_O_WORKDIR
module load cudatoolkit
module unload PrgEnv-cray
module load PrgEnv-gnu
module load cray-mpich-abi
module load shifter
export CUDA_VISIBLE_DEVICES=0
TF_LD_LIBRARY_PATH="/work/tensorflow/lib:$(readlink -f /opt/cray/wlm_detect/default/lib64):$(readlink -f /opt/cray/nvidia/default/lib64):/usr/local/cuda/lib64:$LD_LIBRARY_PATH:$CRAY_LD_LIBRARY_PATH"
NODES=$(sort -u $PBS_NODEFILE | wc -l)
aprun -b -n $NODES -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s \
  -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- \
  bash -c "LD_LIBRARY_PATH=$TF_LD_LIBRARY_PATH tests/bin/simpleMPI"
```
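The NODES computation in the script deserves a note: with ppn=16, $PBS_NODEFILE lists each host once per requested core, so `sort -u | wc -l` recovers the number of distinct nodes, which is what we want for aprun's `-n` when running one rank per node (`-N 1`). A standalone sketch with a fabricated two-node nodefile:

```shell
# A PBS nodefile repeats each host once per requested core (ppn=16 -> 16 copies),
# so counting distinct hosts means sort -u | wc -l. Fake nodefile for illustration:
nodefile=$(mktemp)
printf 'nid00001\nnid00001\nnid00002\nnid00002\n' > "$nodefile"
NODES=$(sort -u "$nodefile" | wc -l)
echo "$NODES"   # 2
rm -f "$nodefile"
```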
...
```bash
#!/bin/bash
#PBS -l nodes=1:xk:ppn=16
#PBS -l walltime=0:30:0
#PBS -l gres=shifter16

cd $PBS_O_WORKDIR
module load cudatoolkit
module unload PrgEnv-cray
module load PrgEnv-gnu
module load cray-mpich-abi
module load shifter
export CUDA_VISIBLE_DEVICES=0
TF_LD_LIBRARY_PATH="/work/tensorflow/lib:$(readlink -f /opt/cray/wlm_detect/default/lib64):$(readlink -f /opt/cray/nvidia/default/lib64):/usr/local/cuda/lib64:$LD_LIBRARY_PATH:$CRAY_LD_LIBRARY_PATH"
aprun -b -n 1 -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s \
  -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- \
  bash -c "LD_LIBRARY_PATH=$TF_LD_LIBRARY_PATH tests/tensorflow.sh"
```
...
All scripts and code fragments shown can be downloaded here; the pip wheel file is tensorflow-1.12.1-cp35-cp35m-linux_x86_64.whl.