Getting started with WMLCE (formerly PowerAI)

IBM Watson Machine Learning Community Edition (WMLCE-1.6.1)

PowerAI is an enterprise software distribution that combines popular open-source deep learning frameworks, efficient AI development tools, and accelerated IBM Power Systems servers. It includes the following frameworks:

Framework	Version	Description
Caffe	1.0	Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by Berkeley AI Research and by community contributors.
Caffe2	n/a	Caffe2 is a companion to PyTorch. PyTorch is great for experimentation and rapid development, while Caffe2 is aimed at production environments.
PyTorch	1.1.0	PyTorch is an open-source deep learning platform that provides a seamless path from research prototyping to production deployment. It is developed by Facebook and by community contributors.
TensorFlow	1.14.0	TensorFlow is an end-to-end open-source platform for machine learning. It is developed by Google and by community contributors.

For complete PowerAI documentation, see https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.0/navigation/pai_getstarted.htm. Here we only show simple examples with system-specific instructions.

Simple Example with Caffe

Interactive mode

Get one compute node for interactive use:

swrun -p gpux1

Once on the compute node, load the PowerAI module using one of the following:

module load powerai/1.6.0-py2.7 # for python2 environment
module load powerai/1.6.0-py3.6 # for python3 environment
module load powerai             # python3 environment by default

Install samples for Caffe:

caffe-install-samples ~/caffe-samples
cd ~/caffe-samples

Download data for MNIST model:

./data/mnist/get_mnist.sh

Convert data and create MNIST model:

./examples/mnist/create_mnist.sh

Train LeNet on MNIST:

./examples/mnist/train_lenet.sh

Batch mode

The same can be accomplished in batch mode using the following caffe_sample.sb script:

wget https://wiki.ncsa.illinois.edu/download/attachments/82510352/caffe_sample.sb
sbatch caffe_sample.sb
squeue

Simple Example with Caffe2

Interactive mode

Get a node for interactive use:

srun --partition=debug --pty --nodes=1 --ntasks-per-node=8 --gres=gpu:v100:1 -t 01:30:00 --wait=0 --export=ALL /bin/bash

Once on the compute node, load the PowerAI module using one of the following:

module load powerai/1.6.0-py2.7 # for python2 environment
module load powerai/1.6.0-py3.6 # for python3 environment
module load powerai           # python3 environment by default

Install samples for Caffe2:

caffe2-install-samples ~/caffe2-samples
cd ~/caffe2-samples

Create sample data in LMDB format:

python ./examples/lmdb_create_example.py --output_file lmdb

Train ResNet50 with Caffe2:

python ./examples/resnet50_trainer.py --train_data ./lmdb

Batch mode

The same can be accomplished in batch mode using the following caffe2_sample.sb script:

wget https://wiki.ncsa.illinois.edu/download/attachments/82510352/caffe2_sample.sb
sbatch caffe2_sample.sb
squeue

Simple Example with TensorFlow

Interactive mode

Get a node for interactive use:

srun --partition=debug --pty --nodes=1 --ntasks-per-node=8 --gres=gpu:v100:1 -t 01:30:00 --wait=0 --export=ALL /bin/bash

Once on the compute node, load the PowerAI module using one of the following:

module load powerai/1.6.0-py2.7 # for python2 environment
module load powerai/1.6.0-py3.6 # for python3 environment
module load powerai           # python3 environment by default
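Before copying the training script, it may be worth confirming that the loaded environment can import TensorFlow and that the allocated GPU is visible. The following is a minimal sketch and not part of the PowerAI samples: the function name tensorflow_gpu_status is made up for illustration, and the check relies only on tf.test.gpu_device_name(), which returns an empty string when no GPU is found.

```python
def tensorflow_gpu_status():
    """Describe whether TensorFlow is importable and whether it sees a GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        # Usually means no powerai module is loaded in this shell.
        return "TensorFlow is not importable; load the powerai module first"

    device = tf.test.gpu_device_name()  # empty string when no GPU is visible
    if device:
        return "TensorFlow {} sees GPU device {}".format(tf.__version__, device)
    return "TensorFlow {} sees no GPU".format(tf.__version__)


print(tensorflow_gpu_status())
```

If the message reports no GPU, one thing to check is whether the shell is actually running on the compute node obtained with srun rather than on the login node.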

Copy the following code into a file named "mnist-demo.py":

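import tensorflow as tf

# Load the MNIST digits and scale pixel values from [0, 255] to [0, 1].
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small fully connected classifier: 28x28 input, one hidden layer, 10 classes.
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

# Labels are integers, so use the sparse variant of categorical cross-entropy.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train for five epochs, then report loss and accuracy on the test set.
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)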

Train on MNIST with the Keras API:

python ./mnist-demo.py

Batch mode

The same can be accomplished in batch mode using the following tf_sample.sb script:

wget https://wiki.ncsa.illinois.edu/download/attachments/82510352/tf_sample.sb
sbatch tf_sample.sb
squeue

Visualization with TensorBoard

Interactive mode

Get a node for interactive use:

srun --partition=debug --pty --nodes=1 --ntasks-per-node=8 --gres=gpu:v100:1 -t 01:30:00 --wait=0 --export=ALL /bin/bash

Once on the compute node, load the PowerAI module using one of the following:

module load powerai/1.6.0-py2.7 # for python2 environment
module load powerai/1.6.0-py3.6 # for python3 environment
module load powerai           # python3 environment by default

Download the script mnist-with-summaries.py to your $HOME folder:

cd ~
wget https://wiki.ncsa.illinois.edu/download/attachments/82510352/mnist-with-summaries.py

Train on MNIST while writing TensorFlow summaries, then return to the login node:

python ./mnist-with-summaries.py
exit

Batch mode

The same can be accomplished in batch mode using the following tfbd_sample.sb script:

wget https://wiki.ncsa.illinois.edu/download/attachments/82510352/tfbd_sample.sb
sbatch tfbd_sample.sb
squeue

Start the TensorBoard session

After the job has completed, the TensorFlow log files can be found in "~/tensorflow/mnist/logs". Start the TensorBoard server on the login node:

module load powerai
tensorboard --logdir ~/tensorflow/mnist/logs/ --port [user_pick_port] # pick a random port in the range 6500-6999
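Because the login node is shared, a port picked at random from that range may already be in use by someone else. One possible way to choose a value for [user_pick_port] is sketched below in Python 3, using only the standard library. The helper names port_is_free and pick_port are hypothetical, and the check is only advisory, since another process can still claim the port between the check and the moment TensorBoard starts.

```python
import random
import socket

PORT_RANGE = (6500, 6999)  # range suggested for the TensorBoard port


def port_is_free(port, host=""):
    """Return True if this process can bind host:port right now.

    An empty host means "all interfaces".
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def pick_port(low=PORT_RANGE[0], high=PORT_RANGE[1], attempts=50):
    """Try random ports in [low, high] until one can be bound."""
    for _ in range(attempts):
        port = random.randint(low, high)
        if port_is_free(port):
            return port
    raise RuntimeError("no free port found between {} and {}".format(low, high))


if __name__ == "__main__":
    print(pick_port())
```

Run on the login node, the script prints a single port number that can be passed to the --port option.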

From your local machine, forward [user_pick_port] on the remote machine to port 16006 on the local machine:

ssh -N -f -L localhost:16006:localhost:[user_pick_port] your_user_name@hal.ncsa.illinois.edu
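The argument to -L has the form local_address:local_port:remote_address:remote_port, while -N tells ssh not to run a remote command and -f sends it to the background. As a small illustration of how the pieces fit together, the sketch below assembles the same command from its parts. The function forward_command and the example port 6789 are hypothetical, chosen only for this illustration.

```python
def forward_command(user, remote_port, local_port=16006,
                    host="hal.ncsa.illinois.edu"):
    """Build the ssh argument list that forwards local_port to remote_port."""
    # local side first, then the destination as seen from the remote host
    spec = "localhost:{}:localhost:{}".format(local_port, remote_port)
    return ["ssh", "-N", "-f", "-L", spec, "{}@{}".format(user, host)]


print(" ".join(forward_command("your_user_name", 6789)))
# prints: ssh -N -f -L localhost:16006:localhost:6789 your_user_name@hal.ncsa.illinois.edu
```

The remote port passed here has to match the value given to tensorboard --port on the login node.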

Paste the following address into a web browser to open the TensorBoard session:

localhost:16006

Simple Example with PyTorch

Interactive mode

Get a node for interactive use:

srun --partition=debug --pty --nodes=1 --ntasks-per-node=8 --gres=gpu:v100:1 -t 01:30:00 --wait=0 --export=ALL /bin/bash

Once on the compute node, load the PowerAI module using one of the following:

module load powerai/1.6.0-py2.7 # for python2 environment
module load powerai/1.6.0-py3.6 # for python3 environment
module load powerai           # python3 environment by default

Install samples for PyTorch:

pytorch-install-samples ~/pytorch-samples
cd ~/pytorch-samples

Train on MNIST with PyTorch:

python ./examples/mnist/main.py
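If training runs much slower than expected, one thing worth ruling out is that PyTorch cannot see the GPU requested with --gres. The check below is a hedged sketch rather than part of the installed samples. The name pytorch_gpu_status is invented here, and the script uses only torch.cuda.is_available(), torch.cuda.device_count(), and torch.cuda.get_device_name().

```python
def pytorch_gpu_status():
    """Describe whether PyTorch is importable and which GPUs it can use."""
    try:
        import torch
    except ImportError:
        # Usually means no powerai module is loaded in this shell.
        return "PyTorch is not importable; load the powerai module first"

    if not torch.cuda.is_available():
        return "PyTorch {} sees no GPU".format(torch.__version__)

    # Collect the model name of every CUDA device visible to this process.
    names = [torch.cuda.get_device_name(i)
             for i in range(torch.cuda.device_count())]
    return "PyTorch {} sees {} GPU(s): {}".format(
        torch.__version__, len(names), ", ".join(names))


print(pytorch_gpu_status())
```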

Batch mode

The same can be accomplished in batch mode using the following pytorch_sample.sb script:

wget https://wiki.ncsa.illinois.edu/download/attachments/82510352/pytorch_sample.sb
sbatch pytorch_sample.sb
squeue

Major Installed PowerAI-Related Anaconda Modules

Name	Version	Description
caffe	1.0	Caffe is a deep learning framework made with expression, speed, and modularity in mind.
cudatoolkit	10.1.105	The NVIDIA® CUDA® Toolkit provides a development environment for creating high performance GPU-accelerated applications.
cudnn	7.5.0+10.1	The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks.
h5py	2.8.0	The h5py package is a Pythonic interface to the HDF5 binary data format.
jupyter	1.0.0	Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.
matplotlib	2.2.3	Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
nccl	2.4.2	The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs.
numpy	1.14.5	NumPy is the fundamental package for scientific computing with Python.
opencv	3.4.2	OpenCV was designed for computational efficiency and with a strong focus on real-time applications.
pytables	3.4.4	PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
pytorch	1.0.1	PyTorch enables fast, flexible experimentation and efficient production through a hybrid front-end, distributed training, and ecosystem of tools and libraries.
scikit-learn	0.19.1	Simple and efficient tools for data mining and data analysis.
scipy	1.1.0	SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering.
tensorboard	1.13.0	To make it easier to understand, debug, and optimize TensorFlow programs, we've included a suite of visualization tools called TensorBoard.
tensorflow-gpu	1.13.1	The core open source library to help you develop and train ML models.
torchvision	0.2.1	The torchvision package consists of popular datasets, model architectures, and common image transformations for computer vision.
