
Compiling TensorFlow can be non-trivial, in particular on a system that is not directly supported, like Blue Waters. There are a number of challenges that one faces:

...

NVIDIA requires that one registers before downloading CUDA, cuDNN and NCCL, which makes it impractical to download them as part of an automated build. Thus the first step is to download the toolkit installer cuda_9.1.85_387.26_linux, its patches cuda_9.1.85.1_linux, cuda_9.1.85.2_linux and cuda_9.1.85.3_linux, as well as cudnn-9.0-linux-x64-v7.5.0.56.tgz and nccl-repo-ubuntu1604-2.4.2-ga-cuda9.0_1-1_amd64.deb from the NVIDIA servers.
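Once downloaded, these files have to be installed into the image used for the build (presumably as part of building the Docker image). The exact steps are not shown on this page; the following is only a rough sketch, assuming the files sit in the current directory and the toolkit is installed to /usr/local/cuda-9.1:

Code Block
languagebash
titleinstall CUDA, cuDNN and NCCL (sketch)
linenumberstrue
collapsetrue
# sketch only: install the CUDA toolkit non-interactively; the patch runfiles
# (cuda_9.1.85.1_linux etc.) are applied the same way with their own --silent options
sh cuda_9.1.85_387.26_linux --silent --toolkit --toolkitpath=/usr/local/cuda-9.1

# cuDNN ships as a tarball: copy headers and libraries into the toolkit tree
tar -xzf cudnn-9.0-linux-x64-v7.5.0.56.tgz
cp -P cuda/include/cudnn.h /usr/local/cuda-9.1/include/
cp -P cuda/lib64/libcudnn* /usr/local/cuda-9.1/lib64/

# the NCCL .deb is a repository package: register the repository, then install the
# libraries (some repo packages also require adding the key found under /var/nccl-repo-*)
dpkg -i nccl-repo-ubuntu1604-2.4.2-ga-cuda9.0_1-1_amd64.deb
apt-get update && apt-get install -y libnccl2 libnccl-dev
# depending on where configure later looks for NCCL, the libraries may need to be
# copied or symlinked into the CUDA tree (e.g. /usr/local/cuda/lib64)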

...

Code Block
languagebash
titleinteractive shifter session
linenumberstrue
collapsetrue
qsub -I -l nodes=1:x:ppn=16 -l walltime=3:00:00 -l gres=shifter
module load shifter
shifterimg pull $USER/tensorflow:16.04

...

Info
titleInteractive logins to shifter containers

TensorFlow's configure script is designed for interactive use. To gain interactive access to the container one can have Shifter start an ssh daemon in it (this is documented in the Blue Waters Shifter documentation):

Code Block
languagebash
titlesshd
linenumberstrue
collapsetrue
# note the lowercase -v after UDI to mount volumes
qsub -I -l nodes=1:x:ppn=16 -l walltime=3:00:00 -l gres=shifter -v UDI="$USER/tensorflow:16.04 -v /dev/shm:/work"

export CRAY_ROOTFS=SHIFTER
aprun -b -n 1 -N 1 -d 16 -cc none /bin/bash -c 'echo "Connect to: $(hostname)" ; sleep 86400' &

ssh -F $HOME/.shifter/config nidXXXXX

./configure

where nidXXXXX is the name of the compute node output by the aprun command (which sleeps so that the container stays around). Use this interactive session to run TensorFlow's configure script.

...

Code Block
languagebash
titleconfigure
linenumberstrue
collapsetrue
# set up env variables so that configure does not actually ask any questions
# skeleton from https://gist.github.com/PatWie/0c915d5be59a518f934392219ca65c3d
# actual numbers from compiling locally to be able to respond to interactive
# prompt, then (mostly) from .tf_configure.bazelrc

export PYTHON_BIN_PATH=/usr/bin/python3
export PYTHON_LIB_PATH="$($PYTHON_BIN_PATH -c 'import site; print(site.getsitepackages()[0])')"
export CUDA_TOOLKIT_PATH=/usr/local/cuda
export CUDNN_INSTALL_PATH=/usr/local/cuda-9.1
export NCCL_INSTALL_PATH=/usr/local/cuda/lib64

export TF_NEED_GCP=0
export TF_NEED_CUDA=1
export TF_CUDA_VERSION=9.1
export TF_CUDA_COMPUTE_CAPABILITIES=3.5
export TF_NEED_IGNITE=0
export TF_NEED_ROCM=0
export TF_NEED_HDFS=0
export TF_NEED_OPENCL=0
export TF_NEED_JEMALLOC=1
export TF_ENABLE_XLA=0
export TF_NEED_VERBS=0
export TF_CUDA_CLANG=0
export TF_CUDNN_VERSION=7
export TF_NEED_MKL=0
export TF_DOWNLOAD_MKL=0
export TF_NEED_AWS=0
export TF_NEED_MPI=1
export MPI_HOME=/usr/lib/mpich
export TF_NEED_GDR=0
export TF_NEED_S3=0
export TF_NEED_OPENCL_SYCL=0
export TF_SET_ANDROID_WORKSPACE=0
export TF_NEED_COMPUTECPP=0
export GCC_HOST_COMPILER_PATH=/usr/bin/gcc
export CC_OPT_FLAGS="-march=native"
export TF_NEED_KAFKA=0
export TF_NEED_TENSORRT=0
export TF_NCCL_VERSION=2

PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin

./configure

At long last we are able to build TensorFlow. Before starting the build process it is advisable, though, to redirect bazel's cache from $HOME/.cache to our work directory, keeping IO requests away from the (slower) Lustre file system and directing them to /work instead (which is fast since it lives in /dev/shm).
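The exact invocation is part of the compile script linked below; as a minimal sketch, and assuming a source checkout under /work/tensorflow-src and a wheel output directory /work/packages (both names are assumptions), redirecting the cache boils down to bazel's --output_user_root startup option:

Code Block
languagebash
titlebazel build (sketch)
linenumberstrue
collapsetrue
# sketch only: keep bazel's cache on /work (backed by /dev/shm) instead of $HOME/.cache
cd /work/tensorflow-src   # assumed location of the TensorFlow source checkout
bazel --output_user_root=/work/bazel-cache \
      build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
# assemble the pip wheel (output directory is an assumption)
bazel-bin/tensorflow/tools/pip_package/build_pip_package /work/packages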

...

The full compile script, as well as a PBS script to submit it via qsub, can be found here and here. The final pip wheel file is tensorflow-1.12.1-cp35-cp35m-linux_x86_64.whl.
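A PBS wrapper for the build, modeled on the other submit scripts on this page (the build.sh name and the walltime are assumptions), looks roughly like this:

Code Block
languagebash
titlebuild submit script (sketch)
linenumberstrue
collapsetrue
#!/bin/bash
#PBS -l nodes=1:xk:ppn=16
#PBS -l walltime=24:00:00
#PBS -l gres=shifter

cd $PBS_O_WORKDIR

module load shifter

# run the (assumed) build.sh script inside the container with the working
# directory bind mounted to /work, as in the other examples on this page
aprun -b -n 1 -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s -V $(pwd -P):/work -- /bin/bash ./build.sh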

Installing TensorFlow on Blue Waters

Having successfully built a TensorFlow wheel on Blue Waters, we can install it in a virtualenv spun off from the python3 installation in the container.

Code Block
languagebash
titleinstall tensorflow
linenumberstrue
collapsetrue
mkdir tensorflow
cd tensorflow
/usr/bin/python3 -m virtualenv --system-site-packages -p $(which python3) $PWD
source bin/activate
pip3 install numpy==1.13.3 h5py==2.7.1 grpcio==1.8.6
pip3 install ../packages/tensorflow-1.12.1-cp35-cp35m-linux_x86_64.whl

These commands need to be executed inside the container, for example by putting them into a script file install.sh and running it via aprun:

Code Block
languagebash
titleaprun to install wheel
linenumberstrue
collapsetrue
#!/bin/bash
#PBS -l nodes=1:xk:ppn=16
#PBS -l walltime=0:10:0
#PBS -l gres=shifter

cd $PBS_O_WORKDIR

module load shifter

aprun -b -n 1 -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- /bin/bash ./install.sh

Test

These tests showcase how to use the container and TensorFlow. We will run them using a somewhat more complex invocation of Shifter that links the Cray libraries into the container via the /opt/cray mount point. We can obtain a limited interactive shell inside the container:

Code Block
languagebash
titlecomplex aprun
linenumberstrue
collapsetrue
#!/bin/bash
#PBS -l nodes=1:xk:ppn=16
#PBS -l walltime=0:10:0
#PBS -l gres=shifter

cd $PBS_O_WORKDIR

module load cudatoolkit
module unload PrgEnv-cray
module load PrgEnv-gnu
module load cray-mpich-abi
module load shifter

export CUDA_VISIBLE_DEVICES=0

export TF_LD_LIBRARY_PATH="/work/tensorflow/lib:$(readlink -f /opt/cray/wlm_detect/default/lib64):$(readlink -f /opt/cray/nvidia/default/lib64):/usr/local/cuda/lib64:$LD_LIBRARY_PATH:$CRAY_LD_LIBRARY_PATH"

aprun -b -n 1 -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- /bin/bash -i

...

The simpleMPI test shows how one can combine MPI and CUDA on Blue Waters. The original example was for "bare metal" Blue Waters but works from inside a container just as well:

Code Block
languagebash
titlerun simpleMPI
linenumberstrue
collapsetrue
#!/bin/bash
#PBS -l nodes=2:xk:ppn=16
#PBS -l walltime=0:30:0
#PBS -l gres=shifter

cd $PBS_O_WORKDIR

module load cudatoolkit
module unload PrgEnv-cray
module load PrgEnv-gnu
module load cray-mpich-abi
module load shifter

export CUDA_VISIBLE_DEVICES=0

TF_LD_LIBRARY_PATH="/work/tensorflow/lib:$(readlink -f /opt/cray/wlm_detect/default/lib64):$(readlink -f /opt/cray/nvidia/default/lib64):/usr/local/cuda/lib64:$LD_LIBRARY_PATH:$CRAY_LD_LIBRARY_PATH"

NODES=$(sort -u $PBS_NODEFILE | wc -l)

aprun -b -n $NODES -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- bash -c "LD_LIBRARY_PATH=$TF_LD_LIBRARY_PATH tests/bin/simpleMPI"
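Should the binary need to be rebuilt, a rough sketch of compiling the sample inside the container follows; the samples location and the use of mpicxx from the image's MPICH installation are assumptions:

Code Block
languagebash
titlecompile simpleMPI (sketch)
linenumberstrue
collapsetrue
# sketch only: compile the CUDA part with nvcc for the XK nodes' K20X GPUs
# (compute capability 3.5) and link the MPI part with the MPICH compiler wrapper
# shipped in the image; in practice this is done while building the image or on
# a writable copy of the samples
cd /usr/local/cuda/samples/0_Simple/simpleMPI   # assumed location of the CUDA samples
nvcc -arch=sm_35 -c simpleMPI.cu -o simpleMPI_cuda.o
mpicxx -o simpleMPI simpleMPI.cpp simpleMPI_cuda.o -L/usr/local/cuda/lib64 -lcudart
mkdir -p /work/tests/bin && cp simpleMPI /work/tests/bin/   # path expected by the submit script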

...

Code Block
languagebash
titleTensorFlow submit script
linenumberstrue
collapsetrue
#!/bin/bash
#PBS -l nodes=1:xk:ppn=16
#PBS -l walltime=0:30:0
#PBS -l gres=shifter

cd $PBS_O_WORKDIR

module load cudatoolkit
module unload PrgEnv-cray
module load PrgEnv-gnu
module load cray-mpich-abi
module load shifter

export CUDA_VISIBLE_DEVICES=0

TF_LD_LIBRARY_PATH="/work/tensorflow/lib:$(readlink -f /opt/cray/wlm_detect/default/lib64):$(readlink -f /opt/cray/nvidia/default/lib64):/usr/local/cuda/lib64:$LD_LIBRARY_PATH:$CRAY_LD_LIBRARY_PATH"

aprun -b -n 1 -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- bash -c "LD_LIBRARY_PATH=$TF_LD_LIBRARY_PATH tests/tensorflow.sh"
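The contents of tests/tensorflow.sh are not reproduced here; a minimal sketch of such a script, assuming the virtualenv created above ends up under /work/tensorflow, is a simple GPU smoke test:

Code Block
languagebash
titletensorflow.sh (sketch)
linenumberstrue
collapsetrue
#!/bin/bash
# sketch only: activate the virtualenv and check that TensorFlow imports and sees the GPU
source /work/tensorflow/bin/activate
python3 -c '
import tensorflow as tf
from tensorflow.python.client import device_lib
print(tf.__version__)
# list the devices TensorFlow can use; on an xk node this should include a GPU
print([d.name for d in device_lib.list_local_devices()])
'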

...

All scripts and code fragments shown can be downloaded here, and the pip wheel file from tensorflow-1.12.1-cp35-cp35m-linux_x86_64.whl.