...
NVIDIA requires registration before downloading CUDA, cuDNN and NCCL, which makes it impractical to fetch them as part of an automated build. The first step is therefore to manually download cuda_9.1.85_387.26_linux, cuda_9.1.85.1_linux, cuda_9.1.85.2_linux, cuda_9.1.85.3_linux, cudnn-9.0-linux-x64-v7.5.0.56.tgz and nccl-repo-ubuntu1604-2.4.2-ga-cuda9.0_1-1_amd64.deb from the NVIDIA servers.
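Since a missing file only surfaces partway through the build, it can be worth verifying the downloads up front. A minimal sketch (the staging directory `packages/` is an assumption, not part of the original instructions):

```shell
# Sanity-check that all required NVIDIA downloads are in place before building.
# DOWNLOAD_DIR is an assumed staging directory; adjust to where you saved the files.
DOWNLOAD_DIR="${DOWNLOAD_DIR:-$PWD/packages}"
missing=0
for f in cuda_9.1.85_387.26_linux cuda_9.1.85.1_linux cuda_9.1.85.2_linux \
         cuda_9.1.85.3_linux cudnn-9.0-linux-x64-v7.5.0.56.tgz \
         nccl-repo-ubuntu1604-2.4.2-ga-cuda9.0_1-1_amd64.deb; do
  if [ ! -f "$DOWNLOAD_DIR/$f" ]; then
    echo "missing: $f"
    missing=$((missing + 1))
  fi
done
echo "$missing file(s) missing"
```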
...
Code Block |
---|
language | bash |
---|
title | interactive shifter session |
---|
linenumbers | true |
---|
collapse | true |
---|
|
qsub -I -l nodes=1:xk:ppn=16 -l walltime=3:00:00 -l gres=shifter16
module load shifter
shifterimg pull $USER/tensorflow:16.04 |
...
Info |
---|
title | Interactive logins to shifter containers |
---|
|
TensorFlow's configure script is designed for interactive use. To gain interactive access to the container, one can have Shifter start an ssh daemon in it (this is documented in the Blue Waters Shifter documentation):
Code Block |
---|
language | bash |
---|
title | sshd |
---|
linenumbers | true |
---|
collapse | true |
---|
| # note the lowercase -v after UDI to mount volumes
qsub -I -l nodes=1:xk:ppn=16 -l walltime=3:00:00 -l gres=shifter16 -v UDI="$USER/tensorflow:16.04 -v /dev/shm:/work"
export CRAY_ROOTFS=SHIFTER
aprun -b -n 1 -N 1 -d 16 -cc none /bin/bash -c 'echo "Connect to: $(hostname)" ; sleep 86400' &
ssh -F $HOME/.shifter/config nidXXXXX
./configure |
where nidXXXXX is the name of the compute node printed by the aprun command (which sleeps so that the container stays alive). One can then use this shell to interactively configure TensorFlow. |
...
Code Block |
---|
language | bash |
---|
title | configure |
---|
linenumbers | true |
---|
collapse | true |
---|
|
# set up env variables so that configure does not actually ask any questions
# skeleton from https://gist.github.com/PatWie/0c915d5be59a518f934392219ca65c3d
# actual numbers from compiling locally to be able to respond to interactive
# prompt, then (mostly) from .tf_configure.bazelrc
export PYTHON_BIN_PATH=/usr/bin/python3
export PYTHON_LIB_PATH="$($PYTHON_BIN_PATH -c 'import site; print(site.getsitepackages()[0])')"
export CUDA_TOOLKIT_PATH=/usr/local/cuda
export CUDNN_INSTALL_PATH=/usr/local/cuda-9.1
export NCCL_INSTALL_PATH=/usr/local/cuda/lib64
export TF_NEED_GCP=0
export TF_NEED_CUDA=1
export TF_CUDA_VERSION=9.1
export TF_CUDA_COMPUTE_CAPABILITIES=3.5
export TF_NEED_IGNITE=0
export TF_NEED_ROCM=0
export TF_NEED_HDFS=0
export TF_NEED_OPENCL=0
export TF_NEED_JEMALLOC=1
export TF_ENABLE_XLA=0
export TF_NEED_VERBS=0
export TF_CUDA_CLANG=0
export TF_CUDNN_VERSION=7
export TF_NEED_MKL=0
export TF_DOWNLOAD_MKL=0
export TF_NEED_AWS=0
export TF_NEED_MPI=1
export MPI_HOME=/usr/lib/mpich
export TF_NEED_GDR=0
export TF_NEED_S3=0
export TF_NEED_OPENCL_SYCL=0
export TF_SET_ANDROID_WORKSPACE=0
export TF_NEED_COMPUTECPP=0
export GCC_HOST_COMPILER_PATH=$(which gcc)
export CC_OPT_FLAGS="-march=native"
export TF_NEED_KAFKA=0
export TF_NEED_TENSORRT=0
export TF_NCCL_VERSION=2
PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
./configure |
At long last we are able to build TensorFlow. Before starting the build, though, it is advisable to redirect bazel's cache from $HOME/.cache to our work directory, keeping IO requests away from the (slower) Lustre file system and directing them to /work instead (fast, since it resides in /dev/shm).
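One way to do the redirection (a sketch; the cache location under /work is an assumption based on the /dev/shm mount described above) is to pass bazel's --output_user_root startup option:

```shell
# Point bazel's cache at a fast directory instead of $HOME/.cache.
# On Blue Waters this would be something like /work/bazel-cache; here we fall
# back to a temporary directory so the snippet runs anywhere.
CACHE_DIR="${CACHE_DIR:-$(mktemp -d)/bazel-cache}"
mkdir -p "$CACHE_DIR"
# --output_user_root is a *startup* option and must precede the bazel command:
#   bazel --output_user_root="$CACHE_DIR" build //tensorflow/tools/pip_package:build_pip_package
echo "bazel cache directory: $CACHE_DIR"
```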
...
Code Block |
---|
language | bash |
---|
title | install tensorflow |
---|
linenumbers | true |
---|
collapse | true |
---|
|
mkdir tensorflow
cd tensorflow
/usr/bin/python3 -m virtualenv --system-site-packages --no-download -p /usr/bin/python3 $PWD
source bin/activate
pip3 install numpy==1.13.3 h5py==2.7.1 grpcio==1.8.6
pip3 install ../packages/tensorflow-1.12.1-cp35-cp35m-linux_x86_64.whl |
...
These commands need to be executed inside the container, for example by putting them into a script file install.sh and running it with aprun:
Code Block |
---|
language | bash |
---|
title | aprun to install wheel |
---|
linenumbers | true |
---|
collapse | true |
---|
|
#!/bin/bash
#PBS -l nodes=1:xk:ppn=16
#PBS -l walltime=0:10:0
#PBS -l gres=shifter
cd $PBS_O_WORKDIR
module load shifter
aprun -b -n 1 -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- /bin/bash ./install.sh |
Test
These tests showcase how to use the container and TensorFlow. We will run them using a somewhat more complex invocation of shifter that links the Cray libraries into the container via the /opt/cray
mount point. We can obtain a limited interactive shell inside of the container:
Code Block |
---|
language | bash |
---|
title | complex aprun |
---|
linenumbers | true |
---|
collapse | true |
---|
|
#!/bin/bash
#PBS -l nodes=1:xk:ppn=16
#PBS -l walltime=0:10:0
#PBS -l gres=shifter16
cd $PBS_O_WORKDIR
module load cudatoolkit
module unload PrgEnv-cray
module load PrgEnv-gnu
module load cray-mpich-abi
module load shifter
export CUDA_VISIBLE_DEVICES=0
export TF_LD_LIBRARY_PATH="/work/tensorflow/lib:$(readlink -f /opt/cray/wlm_detect/default/lib64):$(readlink -f /opt/cray/nvidia/default/lib64):/usr/local/cuda/lib64:$LD_LIBRARY_PATH:$CRAY_LD_LIBRARY_PATH"
aprun -b -n 1 -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- /bin/bash -i |
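The `readlink -f` calls in `TF_LD_LIBRARY_PATH` matter because the `default` entries under `/opt/cray` are symlinks on the host; resolving them to versioned absolute paths before handing them to the container avoids dangling links once the directory is bind-mounted. A standalone illustration (the directory layout and version number are stand-ins, not the real Blue Waters paths):

```shell
# Demonstrate why the submit scripts resolve symlinks with readlink -f.
# "390.46" is a made-up stand-in version; on the host the real path would be
# something like /opt/cray/nvidia/default.
base=$(mktemp -d)
mkdir -p "$base/nvidia/390.46/lib64"
ln -s "$base/nvidia/390.46" "$base/nvidia/default"
# readlink -f follows the symlink and returns the canonical versioned path:
resolved=$(readlink -f "$base/nvidia/default/lib64")
echo "$resolved"
```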
...
Code Block |
---|
language | bash |
---|
title | run simpleMPI |
---|
linenumbers | true |
---|
collapse | true |
---|
|
#!/bin/bash
#PBS -l nodes=2:xk:ppn=16
#PBS -l walltime=0:30:0
#PBS -l gres=shifter16
cd $PBS_O_WORKDIR
module load cudatoolkit
module unload PrgEnv-cray
module load PrgEnv-gnu
module load cray-mpich-abi
module load shifter
export CUDA_VISIBLE_DEVICES=0
TF_LD_LIBRARY_PATH="/work/tensorflow/lib:$(readlink -f /opt/cray/wlm_detect/default/lib64):$(readlink -f /opt/cray/nvidia/default/lib64):/usr/local/cuda/lib64:$LD_LIBRARY_PATH:$CRAY_LD_LIBRARY_PATH"
NODES=$(sort -u $PBS_NODEFILE | wc -l)
aprun -b -n $NODES -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- bash -c "LD_LIBRARY_PATH=$TF_LD_LIBRARY_PATH tests/bin/simpleMPI" |
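The `NODES` line works because `$PBS_NODEFILE` lists each host once per requested core, so `sort -u | wc -l` collapses the repeats down to the number of distinct nodes. A self-contained illustration with a fabricated nodefile (the `nid…` names are placeholders):

```shell
# $PBS_NODEFILE repeats each host once per core; counting unique lines
# therefore yields the number of nodes, which aprun needs for -n.
nodefile=$(mktemp)
printf '%s\n' nid00001 nid00001 nid00002 nid00002 > "$nodefile"  # 2 nodes x 2 cores
NODES=$(sort -u "$nodefile" | wc -l)
echo "node count: $NODES"
rm -f "$nodefile"
```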
...
Code Block |
---|
language | bash |
---|
title | TensorFlow submit script |
---|
linenumbers | true |
---|
collapse | true |
---|
|
#!/bin/bash
#PBS -l nodes=1:xk:ppn=16
#PBS -l walltime=0:30:0
#PBS -l gres=shifter16
cd $PBS_O_WORKDIR
module load cudatoolkit
module unload PrgEnv-cray
module load PrgEnv-gnu
module load cray-mpich-abi
module load shifter
export CUDA_VISIBLE_DEVICES=0
TF_LD_LIBRARY_PATH="/work/tensorflow/lib:$(readlink -f /opt/cray/wlm_detect/default/lib64):$(readlink -f /opt/cray/nvidia/default/lib64):/usr/local/cuda/lib64:$LD_LIBRARY_PATH:$CRAY_LD_LIBRARY_PATH"
aprun -b -n 1 -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- bash -c "LD_LIBRARY_PATH=$TF_LD_LIBRARY_PATH tests/tensorflow.sh" |
...