[This document is under construction]

Document History

Version	Contributors	Major Changes	Date Updated

Contributors

Provide a list of contributors who have contributed to this document either by writing sections or by sharing ideas and participating in discussions.

List of contributors in the alphabetical order of their first names:

Benjamin Galewski
Kastan Day
Minu Mathew
Sandeep Puthanveetil Satheesan
Todd Nicholson
Vismayak Mohanarajan
Volodymyr Kindratenko

Introduction

Provide a brief introduction to this document, the goals, what this isn't, and the process used by the focus group to develop this document.

The Hands-on Machine Learning Study Materials for Research Software Engineers Focus Group was formed to share study materials and related resources that help Research Software Engineers at different levels of Machine Learning (ML) expertise learn ML skills. These are some of the goals of this focus group:

Come up with a set of good hands-on study materials that Research Software Engineers can use to develop and/or improve Machine Learning (ML) skills
These materials should include ones that are useful for beginners and people with intermediate skills in ML
Gather documentation on ML models that generally work for different problem areas or are based on some parameters (e.g., amount of training data for supervised learning)
Collate and adapt the collected materials if possible
Document the collected materials / URLs and categorize them (based on the focus group's criteria)
Choose different areas within ML to focus on:
1. Traditional Machine Learning
2. Deep Learning - Text Analysis
3. ML Operations and relevant services
Write some code examples that can be shared (e.g. Jupyter Notebooks)
Collect documentation on existing NCSA hardware for ML (HAL, Delta)

This working document is the Focus Group's report. It contains the study materials and other learning resources collected by the Focus Group members, organized into sections and subsections. It is not an exhaustive survey, and we do not claim that it lists all available study materials or resources. The Focus Group met every two weeks, discussed the materials collected up to that point, and documented them here.

Traditional Machine Learning (Minu Mathew, Sandeep Puthanveetil Satheesan)

Provide a brief introduction to machine learning and list major areas within machine learning with short descriptions.

Machine Learning (ML) is a branch of Artificial Intelligence (AI) that extracts information from data without being given an explicit set of instructions on how to process it. Instead, ML algorithms use mathematical models to represent the structure of the data, which are then used to make predictions on future data.

ML algorithms can be broadly classified into Traditional Machine Learning and Deep Learning (DL). Each of these is in turn classified into Supervised or Semi-Supervised Learning, Unsupervised Learning, and Reinforcement Learning. DL uses Artificial Neural Networks (ANN) for learning the representation of the data while traditional ML techniques use non-ANN-based frameworks. Supervised or Semi-Supervised Learning uses pre-labeled data that are used to provide "examples" for the ML algorithm to learn from. In Unsupervised Learning, the ML algorithms are provided with unlabelled data to work with. In Reinforcement Learning, a feedback loop is used to provide inputs to ML algorithms about how well the algorithm is performing on any given data item. This feedback loop constantly improves the system as more data is available.
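The difference between learning with and without labels can be seen in a few lines of plain Python. This is a minimal sketch on made-up 1-D data, not a recommended implementation: the nearest-neighbor classifier and the two-cluster k-means are stand-ins chosen for brevity, and all names are illustrative.

```python
# Toy 1-D data: two groups of points, around 1.0 and around 5.0.
points = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
labels = ["low", "low", "low", "high", "high", "high"]  # used only by the supervised part


def nearest_neighbor_predict(train_x, train_y, x):
    """Supervised: predict the label of the closest labeled training point."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]


def two_means(xs, iterations=10):
    """Unsupervised: split unlabeled points into two clusters (1-D k-means, k=2)."""
    centers = [min(xs), max(xs)]  # simple deterministic initialization
    for _ in range(iterations):
        groups = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


prediction = nearest_neighbor_predict(points, labels, 4.5)  # needs the labels
centers = two_means(points)                                 # never sees the labels
print(prediction)
print(centers)
```

The supervised function cannot run without `labels`, while the clustering function discovers the two groups from the points alone.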

Introductory Courses/Blogs

Machine Learning, Andrew Ng, Stanford University/Coursera, https://www.coursera.org/learn/machine-learning/
- Beginner level, Basic programming skills needed, Theory, Hands-on exercises
Machine Learning Mastery, Jason Brownlee, https://machinelearningmastery.com/
Google's Machine Learning Crash Course, https://developers.google.com/machine-learning/crash-course
- Beginner level, Intermediate level, Advanced level with limited knowledge about TensorFlow (TF) Framework, Theory
Kaggle's Introduction to Machine Learning, https://www.kaggle.com/learn/intro-to-machine-learning
- Beginner level, Theory, Hands-on tutorials using Jupyter Notebook
Machine Learning with Python: A Practical Introduction, https://www.edx.org/course/machine-learning-with-python-a-practical-introduct
- Beginner level, Basic Python knowledge recommended, Theory, Hands-on exercises

Deep Learning - Text Analysis (Minu Mathew)

Common resources :

1. Approaching (Almost) Any Machine Learning Problem

More code with a bit of theory.
To the point. (Not elaborate)
Details the code used for most ML tasks

2. EugeneYan AppliedML

A good set of relevant and recent papers on various ML topics.

3. HuggingFace

A good resource for anything NLP
Get pre-trained models, source-code for most well-known problems.

4. Stanford Deep Learning course

Theoretical and math heavy. Delves into the loss functions, activation functions, representations and word embeddings.

5. Article covering RNNs, CNNs and attention mechanism.

Theoretical. A good read to understand concepts.

Natural language has no inherent structure, but computers work best with structured input, so the first step is to introduce some structure.

Regular Expressions :

Good for quick string comparisons, transformations.

Tokenization, Normalization and stemming - methods to add some structure

NLTK (python package) can be utilized for these methods.
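To make the three steps concrete without installing anything, here is a minimal sketch using only the standard library. The suffix list and the `crude_stem` function are illustrative assumptions; they are far simpler than the stemmers that NLTK provides and are not a substitute for them.

```python
import re

# A tiny, illustrative suffix list (an assumption, not a real stemming algorithm).
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Normalize to lowercase, then extract runs of letters with a regular expression."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip one common suffix; a real stemmer (e.g. Porter) has many more rules."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, normalize, and stem a piece of raw text."""
    return [crude_stem(token) for token in tokenize(text)]


tokens = preprocess("The cats are jumping and the dog jumped!")
print(tokens)
```

After preprocessing, "jumping" and "jumped" map to the same token, which is the kind of structure the later vectorization steps rely on.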

Dimensionality reduction :

Capture the most important structure.

Convert a high-dimensional space to a low-dimensional space by preserving only the most important directions (eigenvectors): get rid of highly correlated dimensions by reducing them to a single dimension. Check out SVD (Singular Value Decomposition) for the math behind it.
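As a rough illustration of collapsing two correlated dimensions into one, the sketch below finds the dominant eigenvector of a 2x2 covariance matrix by power iteration and projects the data onto it. This is a hand-rolled stand-in for what an SVD or PCA routine would do; the data and function names are made up for the example.

```python
import math


def principal_direction(rows, iterations=100):
    """Return the unit vector along which 2-D data varies the most, and the data mean."""
    n = len(rows)
    mean_x = sum(r[0] for r in rows) / n
    mean_y = sum(r[1] for r in rows) / n
    centered = [(r[0] - mean_x, r[1] - mean_y) for r in rows]

    # Entries of the 2x2 covariance matrix of the centered data.
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n

    # Power iteration: repeatedly multiply by the covariance matrix and renormalize.
    vx, vy = 1.0, 0.0
    for _ in range(iterations):
        wx = cxx * vx + cxy * vy
        wy = cxy * vx + cyy * vy
        norm = math.hypot(wx, wy)
        vx, vy = wx / norm, wy / norm
    return (vx, vy), (mean_x, mean_y)


# Two highly correlated dimensions: the second is always twice the first.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, mean = principal_direction(data)

# Project each 2-D point onto the principal direction: one number per point.
reduced = [
    (x - mean[0]) * direction[0] + (y - mean[1]) * direction[1] for x, y in data
]
print(direction)
print(reduced)
```

Because the second column is an exact multiple of the first, the single projected coordinate keeps all of the variation in the data.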

Method to transform text to numeric :

Vocab count / Bag of Words (BOW) - no contextual info kept

Use count vectorizer from sklearn or TF-IDF (better)
Remove stop words

One-hot encoding
Frequency count - no contextual info kept
TF-IDF - no contextual info kept
Word Embeddings : preserve contextual information. Get the semantics of a word.
- Resource on implementation of various embeddings

Learn word embeddings using n-grams (PyTorch, Keras). Here the word embeddings are learned in the training phase, hence the embeddings are specific to the training set.
Word2Vec (pre-trained word embeddings from Google) - Based on word distributions and local context (window size).
GloVe (pre-trained from Stanford) - based on global context
BERT embeddings
GPT-3 embeddings
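The count-based methods above (Bag of Words and TF-IDF) are simple enough to sketch in plain Python. The example below is illustrative only: the stop-word list and documents are made up, and it uses the plain, unsmoothed TF-IDF formula, so the numbers will differ from what sklearn's vectorizers produce with their default settings.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "on"}  # tiny illustrative list (an assumption)

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Tokenize on whitespace and drop stop words.
tokenized = [[w for w in doc.split() if w not in STOP_WORDS] for doc in documents]
vocabulary = sorted({w for doc in tokenized for w in doc})


def bag_of_words(tokens):
    """Count vector over the vocabulary; word order is discarded."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def tf_idf(tokens):
    """Term frequency times inverse document frequency (plain, unsmoothed variant)."""
    counts = Counter(tokens)
    vector = []
    for word in vocabulary:
        tf = counts[word] / len(tokens)
        df = sum(1 for doc in tokenized if word in doc)
        vector.append(tf * math.log(len(tokenized) / df))
    return vector


bow_vectors = [bag_of_words(doc) for doc in tokenized]
tfidf_vectors = [tf_idf(doc) for doc in tokenized]
print(vocabulary)
print(bow_vectors[0])
print([round(v, 3) for v in tfidf_vectors[0]])
```

In the first document, "mat" receives a higher TF-IDF weight than "cat" or "sat" because it appears in only one document. Neither vector records word order, which is the "no contextual info kept" limitation that embeddings address.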

Using word embeddings:

Learn it (not recommended)
Reuse it (check what dataset the embeddings have been trained on)
Reuse + fine-tune

Models :

Recurrent Neural Network(RNN) :

The most natural fit for text. A word's meaning depends on the words before it, so whatever the current word implies depends on the preceding context. This is the principle of RNNs: the hidden state at the current step is computed from the current word and the previous hidden state. The problem is that the influence of early words fades over long sentences (the vanishing gradient problem).

1. LSTM (Long short-term memory) : An RNN with an additional cell state, controlled by gates, that carries information forward across many steps.
2. Bi-LSTM : Bi-directional LSTM. The sequence is processed in both directions, so each position sees both earlier and later context.
3. GRU (Gated Recurrent Unit) : An RNN with gates that control how much information flows from one step to the next.
4. Resources :
  1. Blog post on RNNs by Andrej Karpathy
  2. Resource on all 3 RNNs
  3. Stanford Lecture video on RNNs - heavy on math, but great to have the fundamentals + models right.
  4. Paper on RNNs - a very detailed (and lengthy) paper on RNNs, its foundations, methods, architecture, why it works and in which scenarios it doesn't.
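The recurrence behind a plain RNN can be written in a few lines. This is a deliberately tiny sketch with a scalar input and a scalar hidden state; the weights are arbitrary illustrative values rather than trained parameters, and real implementations use vectors and matrices.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrence step: combine the current input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


# A signal at the first position followed by "empty" inputs:
# the trace of the first input in the hidden state shrinks at every step.
states = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```

The shrinking hidden state is a small-scale picture of why plain RNNs struggle with long sentences, and why gated variants such as LSTM and GRU were introduced.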

2. CNN : Convolutional neural networks can also be used for NLP / sentence analysis. These are mostly used for classification tasks (rather than sentence generation or other complex language tasks).

1. Simple feed forward NN for classification, sentiment analysis
2. Paper on using CNN for spam detection
3. Paper and code on CNN for sentiment classification

3. Attention mechanism :

1. Paper
2. A good blog post explaining the concept
3. Blog post with example code
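For readers who want to see the core computation before reading the resources above, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. The vectors are made-up toy values; real models use learned projections and batched tensor operations.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is a weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attention([4.0, 0.0], keys, values)
print([round(w, 3) for w in weights])
print([round(o, 3) for o in output])
```

The query matches the first and third keys much better than the second, so their values dominate the output.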

4. Transformer architecture : Model architecture with an encoder-decoder structure built on attention. Very different from the recurrent sequence-to-sequence models.

1. https://www.kaggle.com/code/dschettler8845/transformers-course-chapter-1-tf-torch/notebook,
2. hugging face course
3. Good blog post on transformer architecture

5. BERT models

6. XLNet (by Carnegie Mellon University and Google) - BERT and GPT-3 work better in general

7. GPT-3 model:

Methods / Models for common use cases at NCSA :

Small project examples :

Twitter sentiment analysis using Word2Vec and LSTM in Keras

ML Ops (Kastan Day, Todd Nicholson, Benjamin Galewsky)

View with nicer formatting! https://kastanday.notion.site/Round-Table-ML-Ops-NCSA-27e4b6efc4cb410f8fa58ab2583340d9

Round Table Discussion

View here: Round Table Discussion May 31, 2022 - NCSA Software Wiki

Hosted by Kastan Day, learn more about me on KastanDay.com

Opening questions

How many people have worked with AI? Raise your virtual hands. Go around the room and describe it a little.

Data engineer vs data science role ← are they on your team?

Who uses python? Conda?

AI on current projects, open questions?

Objectives

Leaving this talk you should have two things:

Context: the parts of ML project / lifecycle.
Tools: A list of the best tools for the job.

My goal:

Plain language.
Offer awareness, so you can know what to look for online.

Outline

High priority topics:

[ ] Pre-trained model zoos
[ ] Environments on HPC (Docker/Singularity/Apptainer/Conda)

Model selection
1. Structured vs unstructured
2. Pre-trained models (called Model Zoos, Model hub, or model garden): PyTorch, PyTorch Lightning, TF, TF garden, HuggingFace, PapersWCode, https://github.com/openvinotoolkit/open_model_zoo
  1. B-tier: https://github.com/collections/ai-model-zoos, CoreML (iOS), iOS, Android, web/JS, ONNX zoo, Largest quantity hit-or-miss quality
  2. Fastest to use is SkLearn (AutoML).
  3. PyTorch Lightning.
  4. FastAI
  5. XGBoost & LightGBM
3. For measuring success, I like F-1 scores (can be weighted).
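To show what a weighted F-1 score measures, here is a small pure-Python sketch. The labels are made up, and the "weighted" average here means weighting each class's F-1 by its number of true examples; in practice a library metric would normally be used instead of this hand-written version.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1}, computed from precision and recall for each label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by how often each class occurs."""
    per_class = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(per_class[label] * support[label] for label in per_class) / len(y_true)


y_true = ["cat", "cat", "cat", "dog"]
y_pred = ["cat", "cat", "dog", "dog"]
per_class = f1_per_class(y_true, y_pred)
overall = weighted_f1(y_true, y_pred)
print(per_class)
print(round(overall, 4))
```

Weighting by class frequency keeps a rare class from dominating the summary number, while the per-class scores still show where the model struggles.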
Data pipelines
1. Luigi, Airflow, Ray, Snake(?), Spark.
2. Globus, APIs, S3 buckets, HPC resources.
3. Configuring and running Large ML training jobs, on Delta.
4. Normal: Pandas, Numpy
5. Big:
  1. Spark (PySpark)
  2. Dask - distributed pandas and NumPy
  3. XArray
  4. Rapids
    1. cuDF - CUDA dataframes
    2. Dask cuDF - distributed dataframe (for data that can’t fit in one GPU’s memory).
  5. Rapids w/Dask (cuDF) - distributed, on-GPU calculations. Blog, reading large CSVs.
Data cleaning (and feature engineering ← this is jargon)
1. Key idea: make data as info-dense as possible.
2. Limit correlation between input variables (Pearson or Chi-squared). This is filter-based; you can also do permutation-based importance.
3. Common workflow: Normalization → Pearson correlation → XGBoost feature importance → Kernel PCA dimensionality reduction.
4. Always normalize both inputs and outputs.
[Image: original art by Kastan Day at KastanDay/wascally_wabbit (github.com)]
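The first two steps of that workflow, normalization and a Pearson-correlation filter, can be sketched in plain Python as follows. The feature table and the 0.95 threshold are illustrative assumptions; on real data these steps would usually be done with Pandas or NumPy.

```python
import math


def standardize(column):
    """Rescale a column to zero mean and unit standard deviation.

    Assumes the column is not constant (otherwise the deviation is zero).
    """
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]


def pearson(a, b):
    """Pearson correlation coefficient between two equally long columns."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)


def drop_correlated(features, threshold=0.95):
    """Keep a feature only if it is not strongly correlated with one already kept."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # nearly a rescaled copy of height_cm
    "age": [31.0, 25.0, 47.0, 52.0, 38.0],
}
kept = drop_correlated(features)
print(kept)
```

The redundant `height_in` column is dropped because it carries almost no information beyond `height_cm`, which is the "info-dense" idea in practice.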
Easy parallelism in Python
1. HPC: Parsl, funcX: Federated Function as a Service
2. Commercial or Cloud: Ray.io
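The tools above share a common pattern: define an independent task, submit many of them, and gather the results. The sketch below shows that pattern with the standard library's `concurrent.futures` as a stand-in; it does not use the Parsl, funcX, or Ray APIs, and the `simulate` function is a hypothetical placeholder task.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(parameter):
    """Placeholder for an independent unit of work (hypothetical example task)."""
    return parameter ** 2


parameters = list(range(8))

# Run the tasks on a pool of workers; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate, parameters))

print(results)
```

Threads are used here to keep the example small; for CPU-bound work in Python, a process pool or one of the frameworks listed above is usually the better fit.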
Serving
1. Gradio & HF Spaces & Streamlit & PyDoc
2. Data and Learning Hub for Science (research soft.) Dan Katz.
3. Triton, TensorRT and ONNX. NVIDIA Triton Inference Server
Distributed training
1. XGBoost - Dask.
2. LightGBM - Dask or Spark.
3. Horovod.
4. PyTorch DDP (PyTorch Lightning): see "Speed Up Model Training" in the PyTorch Lightning 1.7.0dev documentation.
5. General idea: Pin certain layers to certain devices. Simple cases aren’t too bad in theory, but require a fair bit of specific knowledge about the model in question.
6. Flavors of Parallelism
  1. Easy: XGBoost or LightGBM. Python code: Dask, Ray, Parsl.
  2. Medium: Data parallelism in Horovod, DeepSpeed, PyTorch DDP. GPU operations with Nvidia RAPIDS.
  3. Hard: model parallelism. Must-read resource: Model Parallelism (huggingface.co)
  4. My research: FS-DDP, DeepSpeed, pipeline parallelism, tensor parallelism, distributed-all-reduce, etc.
  5. Glossary
    1. DDP — Distributed Data Parallel
    2. PP - Pipeline Parallel (DeepSpeed)
    3. TP - Tensor Parallel
    4. VP - Voting parallel (usually decision tree async updates, e.g. LightGBM)
    5. MP - Model Parallel (Model sharding, and pinning layers to devices)
    6. FS-DDP - Fully Sharded Distributed Data Parallel
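To give a feel for what data parallelism (the DDP entry in the glossary) does, here is a single-process simulation in plain Python. Each "worker" computes a gradient on its own shard of the batch, and the gradients are averaged before the shared weight is updated. This is only a conceptual sketch with a one-parameter model; it involves no real devices, communication, or framework API.

```python
def gradient(weight, batch):
    """Gradient of mean squared error for the model y = weight * x."""
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(weight, batch, num_workers, learning_rate=0.01):
    """One training step: shard the batch, compute local gradients, average them."""
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    local_gradients = [gradient(weight, shard) for shard in shards]  # one per worker
    averaged = sum(local_gradients) / num_workers  # stands in for the all-reduce
    return weight - learning_rate * averaged


# Data generated by y = 3x; the batch size is divisible by the number of workers.
batch = [(float(x), 3.0 * x) for x in range(1, 9)]

weight = 0.0
for _ in range(200):
    weight = data_parallel_step(weight, batch, num_workers=4)
print(round(weight, 4))
```

With equally sized shards, the averaged gradient equals the gradient of the full batch, which is why data parallelism can speed up training without changing the result of each step.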
Fine-tune on out-of-distribution examples?
1. TBD: What's the best way to fine-tune?
2. TBD: How do you monitor if your model is experiencing domain shift while in production? WandB Alerts is my best idea.
3. Use Fast.ai w/ your PT or TF model, I think.
4. A motivating example: the Permafrost Discovery Gateway has a great classifier for satellite images from Alaska, but we need to adjust it for a different region. How can we best fine-tune our existing model to this slightly different domain?
MLOps
1. WandB.ai — First class tool during model development & data pre-processing.
2. Spell
3. https://github.com/allegroai/clearml
4. MLOps: What It Is, Why It Matters, and How to Implement It - neptune.ai
5. The Framework Way is the Best Way: the pitfalls of MLOps and how to avoid them | ZenML Blog
HPC resources at UIUC
1. NCSA Large: Delta (and Cerebras). External, but friendly: XSEDE (Bridges2).
2. NCSA Small: Nano, Kingfisher, HAL (ppc64le).
3. NCSA Modern: DGX, and Arm-based with two A100(40G) (via Hal-login3).

Environments on HPC

module load <TAB><TAB> — discover preinstalled environments
Apptainer (previously called Singularity): Docker for HPC, requires few permissions.
1. Write Dockerfiles for HPC, syntax here.
Globus file transfer — my favorite. Wonderfully robust, parallel, lots of logging.
Towards the perfect command-line file transfer: use xargs to parallelize rsync for file transfer and sync (see the NCSA wiki resource "Xargs | Rsync" and another third-party blog).

Rsync essential reference

# My go-to command. Syntax like scp.
rsync -azP source destination

# flags explained
# -a is like scp's `-r` but it also preserves metadata and symlinks.
# -z = compression (more CPU usage, less network traffic)
# -P flag combines the flags --progress and --partial. It enables resuming.

# to truly keep in sync, add delete option
rsync -a --delete source destination

# create backups
rsync -a --delete --backup --backup-dir=/path/to/backups /path/to/source destination

# Good flags
# --exclude=pattern_to_exclude
# -n = dry run, don't actually do it. just print what WOULD have happened.
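When rsync is driven from a Python script, the flags above can be assembled programmatically. The helper below is a hypothetical convenience function, not part of any library: it only builds the argument list, defaults to a dry run, and leaves the actual execution to the caller. The paths are placeholders.

```python
import shlex


def build_rsync_command(source, destination, delete=False, dry_run=True, exclude=()):
    """Assemble an rsync argument list using only the flags described above."""
    command = ["rsync", "-azP"]
    if delete:
        command.append("--delete")
    if dry_run:
        command.append("-n")
    for pattern in exclude:
        command.append(f"--exclude={pattern}")
    command += [source, destination]
    return command


# Hypothetical paths, for illustration only.
command = build_rsync_command(
    "data/", "user@host:/scratch/data/", delete=True, exclude=["*.tmp"]
)
print(shlex.join(command))
# To execute it: subprocess.run(command, check=True), after reviewing the dry run.
```

Defaulting to `-n` is a deliberate choice here, since `--delete` removes files on the destination and is worth previewing first.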

Conda Best Practices

When sharing Conda envs: Consider, are you sharing with others or using Conda in Docker?

Adding the --from-history flag will export only the packages you explicitly installed with conda. It will NOT include pip packages or anything else, like apt/yum/brew.

# 1. cross platform conda envs (my go-to)
conda env export --from-history > environment.yml   # main yaml
conda env export > pip_env.yml                      # just for pip
Then, manually copy ONLY the pip section from pip_env.yml into environment.yml.

conda env create -f environment.yml                # usage

# 2. for dockerfiles & exact replication on identical hardware
conda list --explicit > spec-file.txt

conda create --name myenv --file spec-file.txt    # usage

Install complex dependencies with Conda: specific versions of cuda, gcc and more!

Cuda Toolkit :: Anaconda.org — check the “labels” tab for more versions! It works like Docker labels; you can pull whatever version you need.

Note: the same packages are distributed by multiple “channels” (the -c flag). It can be messy finding the right channel, definitely do some googling to find compatibilities.

# Check Cuda Version
$ nvcc --version
$ cat /usr/local/cuda/version.txt  # always check here

# Install Cuda
conda install -c nvidia/label/cuda-11.3.1 cuda-toolkit # "All" necessary cuda tools
conda install -c nvidia/label/cuda-11.3.1 cuda-nvcc    # "NVidia Cuda Compiler"

Conda vs Mamba

Mamba is a faster drop-in replacement for Conda, with identical syntax.

However, Mamba is strictly worse than Conda at resolving dependencies. At least it is conservative: it will never mess up your environment; it will just fail.

Therefore, I recommend running mamba install first, and if you get the error “cannot resolve dependencies,” try conda install for more power, at the cost of being slow. If you have to pick one, conda is strictly more capable.

Cheap compute

The benefit: sudo access on modern hardware and clean environments. That's perfect for when dependency-hell makes you want to scream, especially when dealing with outdated HPC libraries.

Google Colab (free or paid)
Kaggle kernels (free)
LambdaLabs (my favorite for cheap GPU)
DataCrunch.io (my favorite for cheap GPU, especially top-of-the-line A100s 80G)
Grid.ai (From the creator of PyTorch Lightning)
PaperSpace Gradient
GCP and Azure — lots of free credits floating around.
- Azure is one of the few that have 8x80GB A100 systems, for ~$38/hr. Pricey, but sometimes you may need that.

Best AI Courses

Practical Deep Learning for Coders (Fast.ai) — One of the fastest practical learning materials out there.
Dive into Deep Learning (d2l.ai, version 0.17.5 documentation): Good for concise topic-specific references.

New topics

Streaming data for ML inference

Event listeners...
Data + AI Summit 2021 Agenda - Databricks

Domain drift, explainable AI, dataset versioning (need to motivate, include in hyperparam search).

Apache Iceberg - ETL & high perf.
Project Nessie: Transactional Catalog for Data Lakes with Git-like semantics
- Like Git for data.

Explainability tools:

SHAP
ELI5
XGBoost
LightGBM: https://github.com/microsoft/LightGBM

Using GPUs for Speeding up ML (Vismayak Mohanarajan)

Rapids - cuDF and cuML

Colab Page - https://colab.research.google.com/drive/1bzL-mhGNvh7PF_MzsSgMmw9TQjyP6DCe?usp=sharing

ML Pathways

<List of some popular ML learning pathways and a brief comment about each>


[DRAFT] Hands-on Machine Learning Study Materials for Research Software Engineers - Focus Group Report
