...
NVIDIA requires registration before downloading CUDA, cuDNN and NCCL, which makes it impractical to fetch them as part of an automated build. The first step is therefore to manually download cuda_9.1.85_387.26_linux, cuda_9.1.85.1_linux, cuda_9.1.85.2_linux, cuda_9.1.85.3_linux, cudnn-9.0-linux-x64-v7.5.0.56.tgz and nccl-repo-ubuntu1604-2.4.2-ga-cuda9.0_1-1_amd64.deb from the NVIDIA servers.
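Since a missing file only surfaces partway through the build, it can be worth verifying the downloads up front. A minimal sketch (the staging directory `packages/` is an assumption, not part of the original instructions):

```shell
# Sanity-check that all required NVIDIA downloads are in place before building.
# DOWNLOAD_DIR is an assumed staging directory; adjust to where you saved the files.
DOWNLOAD_DIR="${DOWNLOAD_DIR:-$PWD/packages}"
missing=0
for f in cuda_9.1.85_387.26_linux cuda_9.1.85.1_linux cuda_9.1.85.2_linux \
         cuda_9.1.85.3_linux cudnn-9.0-linux-x64-v7.5.0.56.tgz \
         nccl-repo-ubuntu1604-2.4.2-ga-cuda9.0_1-1_amd64.deb; do
  if [ ! -f "$DOWNLOAD_DIR/$f" ]; then
    echo "missing: $f"
    missing=$((missing + 1))
  fi
done
echo "$missing file(s) missing"
```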
...
Code Block |
---|
language | bash |
---|
title | interactive shifter session |
---|
linenumbers | true |
---|
collapse | true |
---|
|
qsub -I -l nodes=1:xk:ppn=16 -l walltime=3:00:00 -l gres=shifter16
module load shifter
shifterimg pull $USER/tensorflow:16.04 |
...
Info |
---|
title | Interactive logins to shifter containers |
---|
|
TensorFlow's configure script is designed for interactive use. To gain interactive access to the container, one can have Shifter start an ssh daemon in it (this is documented in the Blue Waters Shifter documentation):
Code Block |
---|
language | bash |
---|
title | sshd |
---|
linenumbers | true |
---|
collapse | true |
---|
| # note the lowercase -v after UDI to mount volumes
qsub -I -l nodes=1:xk:ppn=16 -l walltime=3:00:00 -l gres=shifter16 -v UDI="$USER/tensorflow:16.04 -v /dev/shm:/work"
export CRAY_ROOTFS=SHIFTER
aprun -b -n 1 -N 1 -d 16 -cc none /bin/bash -c 'echo "Connect to: $(hostname)" ; sleep 86400' &
ssh -F $HOME/.shifter/config nidXXXXX
./configure |
where nidXXXXX is the name of the compute node printed by the aprun command (which sleeps so that the container stays alive). One can then use this shell to interactively configure TensorFlow. |
...
Code Block |
---|
language | bash |
---|
title | configure |
---|
linenumbers | true |
---|
collapse | true |
---|
|
# set up env variables so that configure does not actually ask any questions
# skeleton from https://gist.github.com/PatWie/0c915d5be59a518f934392219ca65c3d
# actual numbers from compiling locally to be able to respond to interactive
# prompt, then (mostly) from .tf_configure.bazelrc
export PYTHON_BIN_PATH=/usr/bin/python3
export PYTHON_LIB_PATH="$($PYTHON_BIN_PATH -c 'import site; print(site.getsitepackages()[0])')"
export CUDA_TOOLKIT_PATH=/usr/local/cuda
export CUDNN_INSTALL_PATH=/usr/local/cuda-9.1
export NCCL_INSTALL_PATH=/usr/local/cuda/lib64
export TF_NEED_GCP=0
export TF_NEED_CUDA=1
export TF_CUDA_VERSION=9.1
export TF_CUDA_COMPUTE_CAPABILITIES=3.5
export TF_NEED_IGNITE=0
export TF_NEED_ROCM=0
export TF_NEED_HDFS=0
export TF_NEED_OPENCL=0
export TF_NEED_JEMALLOC=1
export TF_ENABLE_XLA=0
export TF_NEED_VERBS=0
export TF_CUDA_CLANG=0
export TF_CUDNN_VERSION=7
export TF_NEED_MKL=0
export TF_DOWNLOAD_MKL=0
export TF_NEED_AWS=0
export TF_NEED_MPI=1
export MPI_HOME=/usr/lib/mpich
export TF_NEED_GDR=0
export TF_NEED_S3=0
export TF_NEED_OPENCL_SYCL=0
export TF_SET_ANDROID_WORKSPACE=0
export TF_NEED_COMPUTECPP=0
export GCC_HOST_COMPILER_PATH=$(which gcc)
export CC_OPT_FLAGS="-march=native"
export TF_NEED_KAFKA=0
export TF_NEED_TENSORRT=0
export TF_NCCL_VERSION=2
PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
./configure |
At long last we are able to build TensorFlow. Before starting the build, though, it is advisable to redirect bazel's cache from $HOME/.cache to our work directory, keeping IO requests away from the (slower) Lustre file system and directing them to /work instead (fast, since it resides in /dev/shm).
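One way to do the redirection (a sketch; the cache location under /work is an assumption based on the /dev/shm mount described above) is to pass bazel's --output_user_root startup option:

```shell
# Point bazel's cache at a fast directory instead of $HOME/.cache.
# On Blue Waters this would be something like /work/bazel-cache; here we fall
# back to a temporary directory so the snippet runs anywhere.
CACHE_DIR="${CACHE_DIR:-$(mktemp -d)/bazel-cache}"
mkdir -p "$CACHE_DIR"
# --output_user_root is a *startup* option and must precede the bazel command:
#   bazel --output_user_root="$CACHE_DIR" build //tensorflow/tools/pip_package:build_pip_package
echo "bazel cache directory: $CACHE_DIR"
```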
...
Code Block |
---|
language | bash |
---|
title | install tensorflow |
---|
linenumbers | true |
---|
collapse | true |
---|
|
mkdir tensorflow
cd tensorflow
/usr/bin/python3 -m virtualenv --system-site-packages --no-download -p /usr/bin/python3 $PWD
source bin/activate
pip3 install numpy==1.13.3 h5py==2.7.1 grpcio==1.8.6
pip3 install ../packages/tensorflow-1.12.1-cp35-cp35m-linux_x86_64.whl |
...
These commands need to be executed inside the container, for example by putting them into a script file install.sh and running it with aprun:
Code Block |
---|
language | bash |
---|
title | aprun to install wheel |
---|
linenumbers | true |
---|
collapse | true |
---|
|
#!/bin/bash
#PBS -l nodes=1:xk:ppn=16
#PBS -l walltime=0:10:0
#PBS -l gres=shifter
cd $PBS_O_WORKDIR
module load shifter
aprun -b -n 1 -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- /bin/bash ./install.sh |
Test
These tests showcase how to use the container and TensorFlow. We will run them using a somewhat more complex invocation of shifter that links the Cray libraries into the container via the /opt/cray
mount point. We can obtain a limited interactive shell inside of the container:
Code Block |
---|
language | bash |
---|
title | complex aprun |
---|
linenumbers | true |
---|
collapse | true |
---|
|
#!/bin/bash
#PBS -l nodes=1:xk:ppn=16
#PBS -l walltime=0:10:0
#PBS -l gres=shifter16
cd $PBS_O_WORKDIR
module load cudatoolkit
module unload PrgEnv-cray
module load PrgEnv-gnu
module load cray-mpich-abi
module load shifter
export CUDA_VISIBLE_DEVICES=0
export TF_LD_LIBRARY_PATH="/work/tensorflow/lib:$(readlink -f /opt/cray/wlm_detect/default/lib64):$(readlink -f /opt/cray/nvidia/default/lib64):/usr/local/cuda/lib64:$LD_LIBRARY_PATH:$CRAY_LD_LIBRARY_PATH"
aprun -b -n 1 -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- /bin/bash -i |
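The `readlink -f` calls in `TF_LD_LIBRARY_PATH` matter because the `default` entries under `/opt/cray` are symlinks on the host; resolving them to versioned absolute paths before handing them to the container avoids dangling links once the directory is bind-mounted. A standalone illustration (the directory layout and version number are stand-ins, not the real Blue Waters paths):

```shell
# Demonstrate why the submit scripts resolve symlinks with readlink -f.
# "390.46" is a made-up stand-in version; on the host the real path would be
# something like /opt/cray/nvidia/default.
base=$(mktemp -d)
mkdir -p "$base/nvidia/390.46/lib64"
ln -s "$base/nvidia/390.46" "$base/nvidia/default"
# readlink -f follows the symlink and returns the canonical versioned path:
resolved=$(readlink -f "$base/nvidia/default/lib64")
echo "$resolved"
```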
...
Code Block |
---|
language | bash |
---|
title | run simpleMPI |
---|
linenumbers | true |
---|
collapse | true |
---|
|
#!/bin/bash
#PBS -l nodes=2:xk:ppn=16
#PBS -l walltime=0:30:0
#PBS -l gres=shifter16
cd $PBS_O_WORKDIR
module load cudatoolkit
module unload PrgEnv-cray
module load PrgEnv-gnu
module load cray-mpich-abi
module load shifter
export CUDA_VISIBLE_DEVICES=0
TF_LD_LIBRARY_PATH="/work/tensorflow/lib:$(readlink -f /opt/cray/wlm_detect/default/lib64):$(readlink -f /opt/cray/nvidia/default/lib64):/usr/local/cuda/lib64:$LD_LIBRARY_PATH:$CRAY_LD_LIBRARY_PATH"
NODES=$(sort -u $PBS_NODEFILE | wc -l)
aprun -b -n $NODES -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- bash -c "LD_LIBRARY_PATH=$TF_LD_LIBRARY_PATH tests/bin/simpleMPI" |
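The `NODES` line works because `$PBS_NODEFILE` lists each host once per requested core, so `sort -u | wc -l` collapses the repeats down to the number of distinct nodes. A self-contained illustration with a fabricated nodefile (the `nid…` names are placeholders):

```shell
# $PBS_NODEFILE repeats each host once per core; counting unique lines
# therefore yields the number of nodes, which aprun needs for -n.
nodefile=$(mktemp)
printf '%s\n' nid00001 nid00001 nid00002 nid00002 > "$nodefile"  # 2 nodes x 2 cores
NODES=$(sort -u "$nodefile" | wc -l)
echo "node count: $NODES"
rm -f "$nodefile"
```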
...
Code Block |
---|
language | bash |
---|
title | TensorFlow submit script |
---|
linenumbers | true |
---|
collapse | true |
---|
|
#!/bin/bash
#PBS -l nodes=1:xk:ppn=16
#PBS -l walltime=0:30:0
#PBS -l gres=shifter16
cd $PBS_O_WORKDIR
module load cudatoolkit
module unload PrgEnv-cray
module load PrgEnv-gnu
module load cray-mpich-abi
module load shifter
export CUDA_VISIBLE_DEVICES=0
TF_LD_LIBRARY_PATH="/work/tensorflow/lib:$(readlink -f /opt/cray/wlm_detect/default/lib64):$(readlink -f /opt/cray/nvidia/default/lib64):/usr/local/cuda/lib64:$LD_LIBRARY_PATH:$CRAY_LD_LIBRARY_PATH"
aprun -b -n 1 -N 1 -d 16 -cc none -- shifter --image=rhaas/tensorflow:16.04s -V $(pwd -P):/work -V /dsl/opt/cray:/opt/cray -- bash -c "LD_LIBRARY_PATH=$TF_LD_LIBRARY_PATH tests/tensorflow.sh" |
...