Open discussions on specific topics chosen by the Software Working Group from the list of SWG Topics For Discussion.

Tuesday, May 31, 2022 Machine Learning Ops and Useful Tools moderated by Kastan Day

In-meeting "slides" found here, durable documentation: Hands-on Machine Learning Study Materials for Research Software Engineers




Recording: https://uofi.box.com/s/f74m9i70xkpe9fsak6316yhzfydgr5xw


Attendees:

Luigi Marini 

Kastan Day 

Minu Mathew 

Dipannita Dey 

Vismayak Mohanarajan 

Galen Arnold 

Maxwell Burnette 

Nathan Tolbert 

Christopher Navarro 

Chen Wang 

Elizabeth Yanello 

Sandeep Puthanveetil Satheesan 

Michael Groves 

James Phillips 

Roland Haas 

Robert J. Brunner 

Rob Kooper 



Discussion:

Check slides above for this session.

Galen Arnold discussed building containers with GPU support, which can take days. Galen will provide links to help with this.

Singularity is the preferred way to run Docker containers on HPC systems. The discussion continued by weighing the pros and cons of this approach.

Kastan noted that the slides/content for this meeting are here: DRAFT
From Galen Arnold to Everyone 10:12 AM

caffe:20.03-py3

caffe2:18.08-py3

cntk:18.08-py3 (Microsoft Cognitive Toolkit)

digits:21.09-tensorflow-py3

matlab:r2021b

mxnet:21.09-py3

pytorch:22.02-py3

tensorflow:22.02-tf1-py3

tensorflow:22.02-tf2-py3

tensorrt:22.02-py3

theano:18.08

torch:18.08-py2

^^^ Current list of the NVIDIA NGC containers on Delta

Kastan mentioned that Singularity has joined the Linux Foundation and is now Apptainer.

Conda toolkit

CUDA toolkit: https://anaconda.org/anaconda/cudatoolkit

mamba is a drop-in replacement for conda that uses the same commands and configuration options; the only difference is that you should still use conda for environment activation and deactivation.

A pip freeze would work as well. pip freeze is a useful command because it lists the packages you have installed with pip install, along with the versions currently installed on your machine. In Python, version conflicts are common, since certain packages may be incompatible with others.
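A minimal sketch of capturing the same information from within Python using the standard-library importlib.metadata; the output file name is just a placeholder:

# Sketch: record installed packages and versions, similar in spirit to `pip freeze`.
from importlib.metadata import distributions

def freeze_environment(path="requirements-snapshot.txt"):
    """Write name==version lines for every installed distribution."""
    lines = sorted(
        f'{dist.metadata["Name"]}=={dist.version}' for dist in distributions()
    )
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines

if __name__ == "__main__":
    for line in freeze_environment():
        print(line)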


Use Cases:

Horovod is an easy way to scale up data parallelism. Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
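A minimal sketch of Horovod-style data parallelism with PyTorch, assuming Horovod is installed with the PyTorch backend and GPUs are available; the model, batch, and learning rate are placeholders, and the script would normally be launched with horovodrun or srun:

# Sketch: Horovod data parallelism with PyTorch (one process per GPU).
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())      # pin each process to its local GPU

model = torch.nn.Linear(100, 10).cuda()      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across all workers.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Make sure every worker starts from the same weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    x = torch.randn(32, 100).cuda()          # placeholder batch
    y = torch.randint(0, 10, (32,)).cuda()
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()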

DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU.
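A minimal sketch of wrapping a PyTorch model with DeepSpeed, assuming a recent DeepSpeed version that accepts an in-code config dict; the model, config values, and loop are placeholders, and training is normally started with the deepspeed launcher:

# Sketch: DeepSpeed training loop.
import torch
import deepspeed

model = torch.nn.Linear(100, 10)             # placeholder model

ds_config = {                                # placeholder config
    "train_batch_size": 32,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# deepspeed.initialize returns an engine that manages the optimizer, ZeRO, etc.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for step in range(100):
    x = torch.randn(32, 100).to(model_engine.device)   # placeholder batch
    y = torch.randint(0, 10, (32,)).to(model_engine.device)
    loss = torch.nn.functional.cross_entropy(model_engine(x), y)
    model_engine.backward(loss)              # engine handles gradient handling
    model_engine.step()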

Gradio is an open-source Python library that is used to build machine learning and data science demos and web applications.
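A minimal sketch of a Gradio demo, assuming gradio is installed; the predict function here is a placeholder for a real model call:

# Sketch: wrap a prediction function in a small Gradio web demo.
import gradio as gr

def predict(text: str) -> str:
    # Placeholder for a real model inference call.
    return text.upper()

demo = gr.Interface(fn=predict, inputs="text", outputs="text",
                    title="Demo model")
demo.launch()  # serves a local web UI; share=True gives a temporary public link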


Machine Learning Ops:

Weights & Biases for ML Ops
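A minimal sketch of experiment tracking with Weights & Biases, assuming the wandb package is installed and you are logged in; the project name and metric values are placeholders:

# Sketch: log hyperparameters and metrics to Weights & Biases.
import wandb

run = wandb.init(project="swg-demo", config={"lr": 1e-3, "epochs": 5})  # placeholder project

for epoch in range(run.config.epochs):
    train_loss = 1.0 / (epoch + 1)           # placeholder metric
    wandb.log({"epoch": epoch, "train_loss": train_loss})

run.finish()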

ClearML

TensorBoard and TensorFlow
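A minimal sketch of writing scalar metrics for TensorBoard with the TensorFlow 2 summary API; the log directory and metric are placeholders, and the dashboard is viewed with tensorboard --logdir logs:

# Sketch: write scalar summaries that TensorBoard can visualize.
import tensorflow as tf

writer = tf.summary.create_file_writer("logs/demo")  # placeholder log directory

with writer.as_default():
    for step in range(100):
        loss = 1.0 / (step + 1)                      # placeholder metric
        tf.summary.scalar("train_loss", loss, step=step)

writer.flush()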

RAPIDS from NVIDIA
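A minimal sketch of GPU DataFrames with RAPIDS cuDF, assuming a RAPIDS install and an NVIDIA GPU; the data is a placeholder:

# Sketch: pandas-like operations on the GPU with RAPIDS cuDF.
import cudf

df = cudf.DataFrame({"sensor": ["a", "a", "b", "b"],
                     "value": [1.0, 2.0, 3.0, 4.0]})  # placeholder data
means = df.groupby("sensor")["value"].mean()
print(means)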

Data pipelines: Globus is the best option for transferring large datasets.

Galen's rsync pro tip for HPC machines: grab a compute node in a job when doing really big transfers with rsync, since login nodes are heavily shared. 

Also, if the end-to-end link is faster than gigabit Ethernet, compression is usually not worth it; just stay uncompressed if you have the space available.

Roland notes that piping through tar also works, but recommends Globus.
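For scripted transfers, Globus also has a Python SDK; a minimal sketch, assuming globus-sdk is installed and the access token, endpoint UUIDs, and paths below are placeholders:

# Sketch: submit a recursive directory transfer with the Globus Python SDK.
import globus_sdk

TRANSFER_TOKEN = "PLACEHOLDER-TOKEN"   # placeholder: obtain via a Globus OAuth2 flow
SRC_ENDPOINT = "SOURCE-ENDPOINT-UUID"  # placeholder endpoint IDs
DST_ENDPOINT = "DEST-ENDPOINT-UUID"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

tdata = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT,
                                label="SWG demo transfer")
tdata.add_item("/path/on/source/", "/path/on/dest/", recursive=True)

task = tc.submit_transfer(tdata)
print("Submitted task:", task["task_id"])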

Focus Group notes (ML Focus Group docs): DRAFT

Dive into Deep Learning





Links mentioned in this Round Table:

Delta XSEDE Documentation#NVIDIANGCContainers

https://wandb.ai/site

https://clear.ml/

https://www.tensorflow.org/tensorboard

https://developer.nvidia.com/rapids

https://d2l.ai/d2l-en.pdf



If you are interested in contributing to a Round Table, please see these links:

Round Table Discussions

SWG Topics For Discussion



