Open discussions on specific topics selected by the Software Working Group from the list of SWG Topics For Discussion.
Tuesday, May 31, 2022 Machine Learning Ops and Useful Tools moderated by Kastan Day
In-meeting "slides" found here, durable documentation: Hands-on Machine Learning Study Materials for Research Software Engineers
Recording: https://uofi.box.com/s/f74m9i70xkpe9fsak6316yhzfydgr5xw
Attendees:
Nathan Tolbert
Christopher Navarro
Sandeep Puthanveetil Satheesan
Discussion:
Check slides above for this session.
Galen Arnold discussed building containers with GPU support, which can take days; Galen will provide links that will help with this.
Singularity is the preferred way to run Docker containers on HPC systems. Discussion continued to weigh the pros and cons of this approach.
Kastan noted that the slides/content for this meeting are here: DRAFT
From Galen Arnold to Everyone 10:12 AM
caffe:20.03-py3
caffe2:18.08-py3
cntk:18.08-py3 , Microsoft Cognitive Toolkit
digits:21.09-tensorflow-py3
matlab:r2021b
mxnet:21.09-py3
pytorch:22.02-py3
tensorflow:22.02-tf1-py3
tensorflow:22.02-tf2-py3
tensorrt:22.02-py3
theano:18.08
torch:18.08-py2
^^^ current list of the Nvidia containers on Delta
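Pulling one of these NGC images with Singularity/Apptainer might look like the guarded sketch below. The nvcr.io registry path and the --nv flag follow standard NGC/Singularity conventions, but check the Delta documentation for site-specific details; the pull step is skipped if singularity is not on PATH.

```shell
IMG=pytorch_22.02.sif
if command -v singularity >/dev/null 2>&1; then
  # Convert the Docker image to a local SIF file (large download; can take a while).
  singularity pull "$IMG" docker://nvcr.io/nvidia/pytorch:22.02-py3
  # --nv exposes the host's GPUs and driver libraries inside the container.
  singularity exec --nv "$IMG" python -c "import torch; print(torch.__version__)"
else
  echo "singularity/apptainer not found; run this on an HPC system"
fi
DONE=1
```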
Kastan just mentioned that Singularity has joined the Linux Foundation and is now Apptainer.
CUDA toolkit via conda: https://anaconda.org/anaconda/cudatoolkit
mamba is a drop-in replacement for conda: it uses the same commands and configuration options. The one difference is that you should still use conda for environment activation and deactivation.
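For example (the environment name ml and the package picks are illustrative; the commands are skipped when mamba is not installed):

```shell
if command -v mamba >/dev/null 2>&1; then
  # Same CLI as conda, but with a much faster dependency solver.
  mamba create -y -n ml python=3.10 numpy
  mamba install -y -n ml pytorch
  # Activation/deactivation still goes through conda:
  conda activate ml
  conda deactivate
else
  echo "mamba not installed (try: conda install -n base -c conda-forge mamba)"
fi
OK=1
```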
A pip freeze would work as well. pip freeze is a very useful command because it lists the modules you've installed with pip install, along with the versions currently installed on your computer. In Python, many modules can be incompatible with one another, so recording exact versions matters.
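For example, to snapshot an environment and reproduce it elsewhere:

```shell
# Write every installed package and its exact version to requirements.txt.
python3 -m pip freeze > requirements.txt
# Each line looks like "package==version"; later, recreate the environment with:
# python3 -m pip install -r requirements.txt
```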
Use Cases:
Horovod, a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet, is an easy way to scale up data parallelism.
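Horovod's core operation is an allreduce that averages gradients across workers after each batch. The toy sketch below is plain Python, not Horovod's API; it only illustrates the arithmetic of that averaging step, with hypothetical gradient values.

```python
# Toy illustration of the gradient averaging Horovod's allreduce performs.
# Not Horovod's API; worker count and gradient values are made up.

def allreduce_average(per_worker_grads):
    """Average each gradient component across all workers."""
    n_workers = len(per_worker_grads)
    return [sum(vals) / n_workers for vals in zip(*per_worker_grads)]

# Each worker computed gradients on its own shard of the data:
grads = [
    [0.1, 0.4],   # worker 0
    [0.3, 0.0],   # worker 1
    [0.2, 0.4],   # worker 2
    [0.2, 0.0],   # worker 3
]
avg = allreduce_average(grads)
# Every worker then applies this same averaged gradient to its model copy,
# keeping all replicas in sync.
print(avg)
```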
DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU.
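For a flavor of how DeepSpeed is configured, a minimal ds_config.json might look like the sketch below; the values are illustrative, and the right ZeRO stage and batch size depend on the model and hardware.

```json
{
  "train_batch_size": 64,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 2 }
}
```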
Gradio is an open-source Python library that is used to build machine learning and data science demos and web applications.
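Gradio's hello-world is only a few lines; the sketch below wraps a trivial function in a web UI. The gradio calls are commented out so the script doesn't start a server when run non-interactively (they assume gradio is installed, e.g. via pip install gradio).

```python
def greet(name: str) -> str:
    """Stand-in for a real model's predict function."""
    return f"Hello, {name}!"

# Wrap the function in a web UI (requires gradio):
# import gradio as gr
# demo = gr.Interface(fn=greet, inputs="text", outputs="text")
# demo.launch()  # serves a local demo app in the browser
```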
Machine Learning Ops:
Weights & Biases for ML Ops
Clear ML
TensorBoard and TensorFlow
RAPIDS from Nvidia
Data pipeline: Globus is the best option for transferring large datasets.
Galen notes an rsync pro tip for HPC machines: grab a compute node in a job to do really big transfers with rsync…login nodes are heavily shared.
Also, if your link is faster than gigE end to end, compression is usually not worth it…just stay uncompressed if you have the space available.
Roland notes that piping through tar is an option, but recommends Globus.
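Putting the transfer tips above together, a hedged sketch (hostnames, paths, and Slurm flags are placeholders; the live part at the bottom just exercises the rsync flags locally):

```shell
# On an HPC system, grab a compute node first (placeholder Slurm flags):
# srun --account=myproj --time=02:00:00 --pty bash
# Then transfer uncompressed: -a preserves attributes, and no -z per the
# advice above when the link is faster than gigE:
# rsync -a --progress /scratch/myproj/data/ user@remote.host:/data/

# Local demonstration of the same flags:
mkdir -p /tmp/rt_src /tmp/rt_dst
echo hello > /tmp/rt_src/file.txt
if command -v rsync >/dev/null 2>&1; then
  rsync -a /tmp/rt_src/ /tmp/rt_dst/
else
  cp -R /tmp/rt_src/. /tmp/rt_dst/   # fallback if rsync is unavailable
fi
```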
Focus Group notes: ML Focus group docs: DRAFT
Dive into Deep Learning
Links mentioned in this Round Table:
Delta XSEDE Documentation#NVIDIANGCContainers
https://www.tensorflow.org/tensorboard
https://developer.nvidia.com/rapids
If you are interested in contributing to a Round Table, please see these links: