IBM PowerAI 1.6.0
PowerAI is an enterprise software distribution that combines popular open source deep learning frameworks, efficient AI development tools, and accelerated IBM Power Systems servers. It includes the following frameworks:
Framework | Version | Description |
---|---|---|
Caffe | 1.0 | Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by Berkeley AI Research and by community contributors. |
Caffe2 | n/a | Caffe2 is a companion to PyTorch: PyTorch targets experimentation and rapid development, while Caffe2 is aimed at production environments. |
PyTorch | 1.0.1 | PyTorch is an open source deep learning platform that provides a seamless path from research prototyping to production deployment. It is developed by Facebook and by community contributors. |
TensorFlow | 1.13.1 | TensorFlow is an end-to-end open source platform for machine learning. It is developed by Google and by community contributors. |
For complete PowerAI documentation, see https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.0/navigation/pai_getstarted.htm. This page shows only simple examples, with instructions specific to this system.
Simple Example with Caffe
Interactive mode
Request a compute node for interactive use:
srun --partition=debug --pty --nodes=1 --ntasks-per-node=8 --gres=gpu:v100:1 -t 01:30:00 --wait=0 --export=ALL /bin/bash
Once on the compute node, load the PowerAI module with one of the following commands:
module load powerai/1.6.0-py2.7   # for python2 environment
module load powerai/1.6.0-py3.6   # for python3 environment
module load powerai               # python3 environment by default
Install the Caffe samples and change into their directory:
caffe-install-samples ~/caffe-samples
cd ~/caffe-samples
Download the MNIST data:
./data/mnist/get_mnist.sh
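get_mnist.sh fetches the MNIST files in the IDX binary format (train-images-idx3-ubyte, train-labels-idx1-ubyte and their t10k- counterparts, decompressed into data/mnist/; these names come from the upstream Caffe script and are assumed to match the PowerAI copy). The minimal Python sketch below parses an IDX header with the standard struct module: a 4-byte magic number whose third byte is the element type (0x08 for unsigned bytes) and whose fourth byte is the number of dimensions, followed by one big-endian 32-bit size per dimension. `read_idx` is a hypothetical helper, not part of Caffe.

```python
import struct

IDX_UBYTE = 0x08  # IDX type code for unsigned bytes, used by all MNIST files


def read_idx(data: bytes):
    """Parse an uncompressed IDX buffer; return (shape, raw payload bytes)."""
    zero, dtype, ndim = struct.unpack(">HBB", data[:4])  # magic number
    if zero != 0 or dtype != IDX_UBYTE:
        raise ValueError("not an unsigned-byte IDX file")
    header_end = 4 + 4 * ndim
    # One big-endian unsigned 32-bit size per dimension.
    shape = struct.unpack(">" + "I" * ndim, data[4:header_end])
    count = 1
    for dim in shape:
        count *= dim
    payload = data[header_end:header_end + count]
    if len(payload) != count:
        raise ValueError("IDX payload is shorter than its header claims")
    return shape, payload
```

For example, passing the contents of data/mnist/train-images-idx3-ubyte should return the shape (60000, 28, 28), followed by one byte per pixel.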
Convert the MNIST data into LMDB databases:
./examples/mnist/create_mnist.sh
Train LeNet on MNIST:
./examples/mnist/train_lenet.sh
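train_lenet.sh trains the LeNet variant described in Caffe's examples/mnist/lenet_train_test.prototxt. Assuming the PowerAI sample keeps the upstream layer settings (conv1 with 20 and conv2 with 50 filters of size 5x5, each followed by 2x2 max pooling with stride 2, then fully connected layers ip1 with 500 units and ip2 with 10), the Python arithmetic below traces how a 28x28 input shrinks and counts the learnable parameters. `conv_out` and `lenet_summary` are hypothetical helpers for illustration.

```python
def conv_out(size: int, kernel: int, stride: int = 1) -> int:
    """Spatial output size of an unpadded convolution or pooling window."""
    # Caffe rounds pooling outputs up, but every division here is exact,
    # so floor division gives the same result.
    return (size - kernel) // stride + 1


def lenet_summary():
    """Return (feature-map size before ip1, ip1 input width, total params)."""
    size, channels, total = 28, 1, 0
    for out_channels in (20, 50):                   # conv1, conv2
        total += channels * out_channels * 5 * 5 + out_channels  # weights + biases
        size = conv_out(size, 5)                    # 5x5 conv, stride 1
        size = conv_out(size, 2, stride=2)          # 2x2 max pool, stride 2
        channels = out_channels
    flat = channels * size * size                   # input width of ip1
    for fan_in, fan_out in ((flat, 500), (500, 10)):  # ip1, ip2
        total += fan_in * fan_out + fan_out
    return size, flat, total


print(lenet_summary())  # (4, 800, 431080)
```

The 800-wide input to ip1 dominates the count: ip1 alone holds 400,500 of the 431,080 parameters.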
Batch mode
The same steps can be run in batch mode with the caffe_sample.sb script. Download it, submit it, and check the job status:
wget https://wiki.ncsa.illinois.edu/download/attachments/82510352/caffe_sample.sb
sbatch caffe_sample.sb
squeue
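The contents of caffe_sample.sb are not reproduced on this page. As a hypothetical illustration only, the Python sketch below renders a Slurm batch script whose #SBATCH directives mirror the interactive srun request (partition, node count, tasks per node, GPU request, and time limit) and whose body repeats the interactive Caffe steps. The helper name `render_sbatch` and the job name are made up; the srun-only options --pty and --wait=0 are deliberately left out.

```python
def render_sbatch(job_name, commands, partition="debug", nodes=1,
                  ntasks_per_node=8, gres="gpu:v100:1", time="01:30:00"):
    """Return the text of a Slurm batch script (hypothetical helper)."""
    # Defaults mirror the interactive srun request for this example.
    directives = [
        ("job-name", job_name),
        ("partition", partition),
        ("nodes", nodes),
        ("ntasks-per-node", ntasks_per_node),
        ("gres", gres),
        ("time", time),
        ("export", "ALL"),
    ]
    lines = ["#!/bin/bash"]
    lines += ["#SBATCH --{}={}".format(key, value) for key, value in directives]
    lines.append("")
    lines += commands
    return "\n".join(lines) + "\n"


# Assumed body: the same commands as the interactive Caffe example.
script = render_sbatch("caffe-mnist", [
    "module load powerai",
    "caffe-install-samples ~/caffe-samples",
    "cd ~/caffe-samples",
    "./data/mnist/get_mnist.sh",
    "./examples/mnist/create_mnist.sh",
    "./examples/mnist/train_lenet.sh",
])
print(script, end="")
```

Saving the printed text as, for example, my_caffe.sb would let you submit it with sbatch my_caffe.sb, the same way as caffe_sample.sb.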
Simple Example with Caffe2
Interactive mode
Request a compute node for interactive use:
srun --partition=debug --pty --nodes=1 --ntasks-per-node=8 --gres=gpu:v100:1 -t 01:30:00 --wait=0 --export=ALL /bin/bash
Once on the compute node, load the PowerAI module with one of the following commands:
module load powerai/1.6.0-py2.7   # for python2 environment
module load powerai/1.6.0-py3.6   # for python3 environment
module load powerai               # python3 environment by default
Install the Caffe2 samples and change into their directory:
caffe2-install-samples ~/caffe2-samples
cd ~/caffe2-samples
Create an example LMDB dataset:
python ./examples/lmdb_create_example.py --output_file lmdb
Train ResNet50 with Caffe2:
python ./examples/resnet50_trainer.py --train_data ./lmdb
Batch mode
The same steps can be run in batch mode with the caffe2_sample.sb script. Download it, submit it, and check the job status:
wget https://wiki.ncsa.illinois.edu/download/attachments/82510352/caffe2_sample.sb
sbatch caffe2_sample.sb
squeue
Simple Example with TensorFlow
Interactive mode
Request a compute node for interactive use:
srun --partition=debug --pty --nodes=1 --ntasks-per-node=8 --gres=gpu:v100:1 -t 01:30:00 --wait=0 --export=ALL /bin/bash
Once on the compute node, load the PowerAI module with one of the following commands:
module load powerai/1.6.0-py2.7   # for python2 environment
module load powerai/1.6.0-py3.6   # for python3 environment
module load powerai               # python3 environment by default
Save the following code as "mnist-demo.py":
import tensorflow as tf

# Load MNIST and scale pixel values from 0-255 to 0-1
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# 784 inputs -> 512 ReLU units -> dropout -> 10-way softmax
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # labels are integer class ids
              metrics=['accuracy'])

# Train for 5 epochs, then report loss and accuracy on the test set
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
Train on MNIST with the Keras API:
python ./mnist-demo.py
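The model in mnist-demo.py is small enough to size by hand. The Python arithmetic below counts weights plus biases for each Dense layer (Flatten and Dropout have no trainable parameters); calling model.summary() in the script should report the same total. `dense_params` is a hypothetical helper.

```python
def dense_params(layer_sizes):
    """Weights plus biases for each fully connected layer in a stack."""
    return [n_in * n_out + n_out
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


# Flatten turns each 28x28 image into 784 inputs; Dropout adds no parameters.
per_layer = dense_params([28 * 28, 512, 10])
print(per_layer, sum(per_layer))  # [401920, 5130] 407050
```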
Batch mode
The same steps can be run in batch mode with the tf_sample.sb script. Download it, submit it, and check the job status:
wget https://wiki.ncsa.illinois.edu/download/attachments/82510352/tf_sample.sb
sbatch tf_sample.sb
squeue
Visualization with TensorBoard
Interactive mode
Request a compute node for interactive use:
srun --partition=debug --pty --nodes=1 --ntasks-per-node=8 --gres=gpu:v100:1 -t 01:30:00 --wait=0 --export=ALL /bin/bash
Once on the compute node, load the PowerAI module with one of the following commands:
module load powerai/1.6.0-py2.7   # for python2 environment
module load powerai/1.6.0-py3.6   # for python3 environment
module load powerai               # python3 environment by default
Download mnist-with-summaries.py to your $HOME directory:
cd ~
wget https://wiki.ncsa.illinois.edu/download/attachments/82510352/mnist-with-summaries.py
Train on MNIST while writing TensorFlow summaries, then return to the login node:
python ./mnist-with-summaries.py
exit
Batch mode
The same steps can be run in batch mode with the tfbd_sample.sb script. Download it, submit it, and check the job status:
wget https://wiki.ncsa.illinois.edu/download/attachments/82510352/tfbd_sample.sb
sbatch tfbd_sample.sb
squeue
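Before starting TensorBoard, you may want to confirm that the training run actually wrote event files under ~/tensorflow/mnist/logs. The Python sketch below walks that directory and lists files with TensorFlow's usual events.out.tfevents.* name prefix; the helper name `find_event_files` is hypothetical, and the log path is the one used on this page.

```python
import os


def find_event_files(logdir):
    """Return sorted paths of TensorFlow event files anywhere under logdir."""
    logdir = os.path.expanduser(logdir)
    matches = []
    # os.walk yields nothing (rather than raising) if logdir does not exist.
    for root, _dirs, files in os.walk(logdir):
        for name in files:
            if name.startswith("events.out.tfevents."):
                matches.append(os.path.join(root, name))
    return sorted(matches)


if __name__ == "__main__":
    found = find_event_files("~/tensorflow/mnist/logs")
    if found:
        for path in found:
            print(path)
    else:
        print("No event files yet; check that the training job finished.")
```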
Start the TensorBoard session
After the job completes, the TensorFlow log files are in "~/tensorflow/mnist/logs". Start the TensorBoard server on the login node:
module load powerai
tensorboard --logdir ~/tensorflow/mnist/logs/ --port [user_pick_port]   # use a random port number in the range 6500-6999
On your local machine, forward local port 16006 to [user_pick_port] on the login node:
ssh -N -f -L localhost:16006:localhost:[user_pick_port] your_user_name@hal.ncsa.illinois.edu
Open the following address in a web browser to start the TensorBoard session:
localhost:16006
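Choosing the TensorBoard port by hand can collide with another user on the shared login node. A minimal Python sketch, run on the login node, can pick a random port in the suggested 6500-6999 range that is not currently bound on the loopback interface, then print a matching tensorboard command and ssh tunnel command. The helper names are hypothetical, your_user_name is a placeholder, and the check is only best effort: another process can still claim the port between the check and the moment TensorBoard starts.

```python
import random
import socket


def pick_free_port(low=6500, high=6999, attempts=50):
    """Return a random port in [low, high] that nothing is bound to on 127.0.0.1."""
    for _ in range(attempts):
        port = random.randint(low, high)  # inclusive on both ends
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue  # already in use; try another random port
        return port  # the probe socket is closed once the with-block exits
    raise RuntimeError("no free port found in {}-{}".format(low, high))


def tensorboard_commands(port, user, logdir="~/tensorflow/mnist/logs/", local_port=16006):
    """Return the login-node server command and the local-machine tunnel command."""
    server = "tensorboard --logdir {} --port {}".format(logdir, port)
    tunnel = "ssh -N -f -L localhost:{}:localhost:{} {}@hal.ncsa.illinois.edu".format(
        local_port, port, user)
    return server, tunnel


if __name__ == "__main__":
    server_cmd, tunnel_cmd = tensorboard_commands(pick_free_port(), "your_user_name")
    print("On the login node:    ", server_cmd)
    print("On your local machine:", tunnel_cmd)
```

Generating both commands from the same port value avoids the most common mistake, a tunnel pointing at a different port than the one TensorBoard is listening on.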
Simple Example with PyTorch
Interactive mode
Request a compute node for interactive use:
srun --partition=debug --pty --nodes=1 --ntasks-per-node=8 --gres=gpu:v100:1 -t 01:30:00 --wait=0 --export=ALL /bin/bash
Once on the compute node, load the PowerAI module with one of the following commands:
module load powerai/1.6.0-py2.7   # for python2 environment
module load powerai/1.6.0-py3.6   # for python3 environment
module load powerai               # python3 environment by default
Install the PyTorch samples and change into their directory:
pytorch-install-samples ~/pytorch-samples
cd ~/pytorch-samples
Train on MNIST with PyTorch:
python ./examples/mnist/main.py
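The upstream pytorch/examples version of mnist/main.py normalizes each image with mean 0.1307 and standard deviation 0.3081 (assumed here to be unchanged in the PowerAI copy). Those constants are the pixel statistics of the MNIST training images scaled to [0, 1]. The Python sketch below shows how such values can be recomputed from raw uint8 pixel data, for example the training-image file fetched by get_mnist.sh in the Caffe example, whose 16-byte IDX header must be skipped first. `pixel_mean_std` is a hypothetical helper and the file path is an assumption.

```python
import math
import os


def pixel_mean_std(pixels: bytes):
    """Mean and population std of uint8 pixels after scaling to [0, 1]."""
    n = len(pixels)
    if n == 0:
        raise ValueError("no pixels")
    # bytes.count(int) runs in C, so a 256-bin histogram stays fast even
    # for the ~47 million pixels in the MNIST training set.
    counts = [pixels.count(value) for value in range(256)]
    mean = sum(v * c for v, c in enumerate(counts)) / n
    mean_sq = sum(v * v * c for v, c in enumerate(counts)) / n
    std = math.sqrt(max(mean_sq - mean * mean, 0.0))
    return mean / 255.0, std / 255.0


if __name__ == "__main__":
    # Assumed location: the file downloaded by get_mnist.sh in the Caffe example.
    path = os.path.expanduser("~/caffe-samples/data/mnist/train-images-idx3-ubyte")
    if os.path.exists(path):
        with open(path, "rb") as f:
            pixels = f.read()[16:]  # skip the 16-byte IDX header (magic + 3 sizes)
        print(pixel_mean_std(pixels))
```

On the full training file, the result should come out close to (0.1307, 0.3081).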
Batch mode
The same steps can be run in batch mode with the pytorch_sample.sb script. Download it, submit it, and check the job status:
wget https://wiki.ncsa.illinois.edu/download/attachments/82510352/pytorch_sample.sb
sbatch pytorch_sample.sb
squeue
Major Installed PowerAI-Related Anaconda Packages
Name | Version | Description |
---|---|---|
caffe | 1.0 | Caffe is a deep learning framework made with expression, speed, and modularity in mind. |
cudatoolkit | 10.1.105 | The NVIDIA® CUDA® Toolkit provides a development environment for creating high performance GPU-accelerated applications. |
cudnn | 7.5.0+10.1 | The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. |
h5py | 2.8.0 | The h5py package is a Pythonic interface to the HDF5 binary data format. |
jupyter | 1.0.0 | Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages. |
matplotlib | 2.2.3 | Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. |
nccl | 2.4.2 | The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs. |
numpy | 1.14.5 | NumPy is the fundamental package for scientific computing with Python. |
opencv | 3.4.2 | OpenCV was designed for computational efficiency and with a strong focus on real-time applications. |
pytables | 3.4.4 | PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data. |
pytorch | 1.0.1 | PyTorch enables fast, flexible experimentation and efficient production through a hybrid front-end, distributed training, and ecosystem of tools and libraries. |
scikit-learn | 0.19.1 | Simple and efficient tools for data mining and data analysis. |
scipy | 1.1.0 | SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. |
tensorboard | 1.13.0 | TensorBoard is a suite of visualization tools that makes TensorFlow programs easier to understand, debug, and optimize. |
tensorflow-gpu | 1.13.1 | The core open source library to help you develop and train ML models. |
torchvision | 0.2.1 | The torchvision package consists of popular datasets, model architectures, and common image transformations for computer vision. |
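After loading the module, a short Python check can compare what is importable against the package table. The sketch assumes the usual import names for these packages (torch for pytorch, cv2 for opencv, tables for pytables, sklearn for scikit-learn, tensorflow for tensorflow-gpu) and reads each module's `__version__` attribute; cudatoolkit, cudnn, nccl, tensorboard and the jupyter metapackage are left out for simplicity. It avoids Python 3-only syntax so it should also run in the py2.7 environment. `check_versions` is a hypothetical helper.

```python
import importlib

# Expected versions copied from the package table; keys are the
# (assumed) Python import names, not the conda package names.
EXPECTED = {
    "caffe": "1.0",
    "h5py": "2.8.0",
    "matplotlib": "2.2.3",
    "numpy": "1.14.5",
    "cv2": "3.4.2",           # opencv
    "tables": "3.4.4",        # pytables
    "torch": "1.0.1",         # pytorch
    "sklearn": "0.19.1",      # scikit-learn
    "scipy": "1.1.0",
    "tensorflow": "1.13.1",   # tensorflow-gpu
    "torchvision": "0.2.1",
}


def check_versions(expected):
    """Map each module name to (expected, found); found is None if not importable."""
    report = {}
    for name, want in expected.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = (want, None)
            continue
        report[name] = (want, getattr(module, "__version__", "unknown"))
    return report


if __name__ == "__main__":
    for name, (want, found) in sorted(check_versions(EXPECTED).items()):
        status = "ok" if found == want else "CHECK"
        print("{:<12} expected {:<8} found {:<10} {}".format(name, want, str(found), status))
```

A "CHECK" line does not necessarily mean a problem: builds sometimes append a suffix to the version string, so compare such entries by eye.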