[This document is under construction]
Document History
Version | Contributors | Major Changes | Date Updated |
---|---|---|---|
Contributors
Provide a list of contributors who have contributed to this document either by writing sections or by sharing ideas and participating in discussions.
List of contributors in the alphabetical order of their first names:
- Benjamin Galewsky
- Kastan Day
- Minu Mathew
- Sandeep Puthanveetil Satheesan
- Todd Nicholson
- Vismayak Mohanarajan
- Volodymyr Kindratenko
Introduction
Provide a brief introduction to this document, the goals, what this isn't, and the process used by the focus group to develop this document.
The Hands-on Machine Learning Study Materials for Research Software Engineers Focus Group was formed to share study materials and other related resources that will be useful for interested Research Software Engineers at different skill levels in Machine Learning (ML) expertise to learn ML skills. The following are some of the major goals of this focus group:
- Come up with a set of good hands-on study materials that Research Software Engineers can use to develop and/or improve Machine Learning (ML) skills
- These materials should include ones that are useful for beginners and people with intermediate skills in ML
- Gather documentation on ML models that generally work for different problem areas or are based on some parameters (e.g., amount of training data for supervised learning)
- Collate and adapt the collected materials if possible
- Document the collected materials / URLs and categorize them (based on the focus group's criteria)
- Choose different areas within ML to focus on:
- Traditional Machine Learning
- Deep Learning - Text Analysis
- ML Operations and relevant services
- Write some code examples that can be shared (e.g. Jupyter Notebooks)
- Collect documentation on existing NCSA hardware for ML (e.g., HAL, Delta)
This working document is the Focus Group's report. It contains the study materials and other learning resources collected by the Focus Group members, organized into sections and subsections. It is not an exhaustive survey, and we do not claim that it lists all available study materials or resources. The Focus Group met every two weeks, discussed the materials collected up to that point, and documented them here.
Traditional Machine Learning (Minu Mathew, Sandeep Puthanveetil Satheesan)
Provide a brief introduction to machine learning and list major areas within machine learning with short descriptions.
Machine Learning (ML) is a branch of Artificial Intelligence (AI) that extracts information from data without being given an explicit set of instructions on how to process it. Instead, ML algorithms use mathematical models to represent the structure of the data, and these models are then used to make predictions on future data.
ML algorithms can be broadly classified into Traditional Machine Learning and Deep Learning (DL). Each of these is in turn classified into Supervised or Semi-Supervised Learning, Unsupervised Learning, and Reinforcement Learning. DL uses Artificial Neural Networks (ANN) for learning the representation of the data while traditional ML techniques use non-ANN-based frameworks. Supervised or Semi-Supervised Learning uses pre-labeled data that are used to provide "examples" for the ML algorithm to learn from. In Unsupervised Learning, the ML algorithms are provided with unlabelled data to work with. In Reinforcement Learning, a feedback loop is used to provide inputs to ML algorithms about how well the algorithm is performing on any given data item. This feedback loop constantly improves the system as more data is available.
Introductory Courses/Blogs
- Machine Learning, Andrew Ng, Stanford University/Coursera, https://www.coursera.org/learn/machine-learning/
- Beginner level, Basic programming skills needed, Theory, Hands-on exercises
- Machine Learning Mastery, Jason Brownlee, https://machinelearningmastery.com/
- Foundations, Beginner level, Intermediate level, Advanced level, Theory, Hands-on exercises
- Google's Machine Learning Crash Course, https://developers.google.com/machine-learning/crash-course
- Beginner level, Intermediate level, Advanced level with limited knowledge about TensorFlow (TF) Framework, Theory
- Kaggle's Introduction to Machine Learning, https://www.kaggle.com/learn/intro-to-machine-learning
- Beginner level, Theory, Hands-on tutorials using Jupyter Notebook
- Machine Learning with Python: A Practical Introduction, https://www.edx.org/course/machine-learning-with-python-a-practical-introduct
- Beginner level, Basic Python knowledge recommended, Theory, Hands-on exercises
- Notes On Using Data Science & Machine Learning, https://chrisalbon.com/#code_machine_learning
- Hands-on, practical and applied learning resources:
- Practical Deep Learning for Coders (Fast.ai) — One of the fastest practical learning materials out there.
- Dive into Deep Learning — Dive into Deep Learning 0.17.5 documentation (d2l.ai) – Good for concise topic-specific references.
Deep Learning - Text Analysis (Minu Mathew)
Common resources :
1. Approaching (Almost) Any Machine Learning Problem
- More code with a bit of theory.
- To the point. (Not elaborate)
- Details the code used for most ML tasks
- A good set of relevant and recent papers on various ML topics.
3. HuggingFace
- A good resource for anything NLP
- Get pre-trained models, source-code for most well-known problems.
4. Stanford Deep Learning course
- Theoretical and math heavy. Delves into the loss functions, activation functions, representations and word embeddings.
5. Article covering RNNs, CNNs and attention mechanism.
- Theoretical. A good read to understand concepts.
Natural language has no fixed structure, while computers work best with structured input. The first step is therefore to introduce some structure.
Regular Expressions :
Good for quick string comparisons, transformations.
Tokenization, Normalization and stemming - methods to add some structure
NLTK (python package) can be utilized for these methods.
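To make tokenization, normalization and stemming concrete, here is a minimal sketch that uses only the Python standard library rather than NLTK. The stop-word list and the `naive_stem` suffix rules are hypothetical simplifications for illustration; NLTK's tokenizers and stemmers are considerably more careful.

```python
import re

# Hypothetical stop-word list for illustration; NLTK ships a much fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to"}


def tokenize(text):
    """Lowercase the text (normalization) and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Strip a few common English suffixes. Far cruder than a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, and stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


tokens = preprocess("The models are learning the embeddings of words.")
print(tokens)  # ['model', 'learn', 'embedding', 'word']
```

Free text goes in and a list of comparable tokens comes out, which is the kind of structure the later steps (counting, vectorizing, embedding) rely on.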
Dimensionality reduction :
Capture the most important structure.
Convert a high-dimensional space to a low-dimensional one by keeping only the most important directions (eigenvectors): highly correlated dimensions are collapsed into a single dimension. (Check out SVD, Singular Value Decomposition, for the math behind it.)
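As a rough illustration of collapsing correlated dimensions, the sketch below finds the dominant eigenvector of a 2x2 covariance matrix and projects the data onto it. It uses power iteration instead of a full SVD, purely to stay within the standard library; the data points are made up, and in practice a library routine (for example an SVD or PCA implementation) would be the usual choice.

```python
import math


def top_principal_direction(points, iterations=100):
    """Return the unit vector along which 2-D `points` vary the most.

    Uses power iteration on the 2x2 covariance matrix, which converges to
    the eigenvector with the largest eigenvalue.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the symmetric covariance matrix [[cxx, cxy], [cxy, cyy]].
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n

    vx, vy = 1.0, 0.0  # arbitrary starting vector
    for _ in range(iterations):
        vx, vy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
        norm = math.hypot(vx, vy)
        vx, vy = vx / norm, vy / norm
    return (vx, vy), centered


# Made-up data with two strongly correlated features: y is roughly 2 * x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, centered = top_principal_direction(data)

# Project each 2-D point onto the direction: two dimensions become one.
reduced = [x * direction[0] + y * direction[1] for x, y in centered]
print(direction)
print(reduced)
```

The recovered direction points roughly along (1, 2), so a single coordinate along it captures almost all of the variation in the original two columns.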
Method to transform text to numeric :
- Vocab count / Bag of Words (BOW) - no contextual info kept
- Use count vectorizer from sklearn or TF-IDF (better)
- Remove stop words
- One-hot encoding
- Frequency count - no contextual info kept
- TF-IDF - no contextual info kept
- Word Embeddings : preserve contextual information. Get the semantics of a word.
- Resource on implementation of various embeddings
- Learn word embeddings using n-gram (PyTorch, Keras). Here the word embeddings are learned in the training phase, hence embeddings are specific to the training set.
- Word2Vec (pre-trained word embeddings from Google) - Based on word distributions and local context (window size).
- GloVe (pre-trained from Stanford) - based on global context
- BERT embeddings
- GPT-3 embeddings
- Using word embeddings:
- Learn it (not recommended)
- Reuse it (check what dataset the embeddings have been trained on)
- Reuse + fine-tune
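The count-based methods in the list above can be sketched in a few lines of plain Python. This is an illustration on a made-up three-document corpus, not a replacement for sklearn's vectorizers. The TF-IDF formula used here (term frequency times `log(N / df)`) is one common textbook variant; scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Made-up corpus for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})


def bag_of_words(doc):
    """Count vector over the shared vocabulary; word order is discarded."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]


def tf_idf(doc):
    """Term frequency times inverse document frequency for each vocab word."""
    counts = Counter(doc)
    vector = []
    for w in vocab:
        tf = counts[w] / len(doc)
        df = sum(1 for other in tokenized if w in other)
        vector.append(tf * math.log(len(tokenized) / df))
    return vector


bow = bag_of_words(tokenized[0])
weights = dict(zip(vocab, tf_idf(tokenized[0])))
print(dict(zip(vocab, bow)))
print(weights)
```

In the first document "the" has the highest raw count, yet its TF-IDF weight is lower than that of "cat", because "the" also appears in another document. Neither vector records word order or context, which is the gap word embeddings are meant to fill.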
Models :
1. RNN (Recurrent Neural Network) : The most natural fit for text. The appearance of a word depends on the words before it, so whatever the current word implies depends on what came earlier. This is the principle of RNNs: the current hidden state depends on the current word and the previous hidden state. The problem is that the influence of early words fades over long sentences (the vanishing gradient problem).
- LSTM (Long short-term memory) : An RNN with an additional cell state, controlled by gates, that carries information forward across many time steps.
- Bi-LSTM : Bi-directional LSTM. The sequence is processed in both directions, so each position sees context from both earlier and later words.
- GRU (Gated Recurrent Unit) : An RNN with gates that control how much information flows from one step to the next; a simpler relative of the LSTM.
- Resources :
- Blog post on RNNs by Andrej Karpathy
- Resource on all 3 RNNs
- Stanford Lecture video on RNNs - heavy on math, but great to have the fundamentals + models right.
- Paper on RNNs - a very detailed (and lengthy) paper on RNNs, their foundations, methods, architecture, why they work and in which scenarios they don't.
2. CNN : Convolutional neural networks can also be used for NLP / sentence analysis. These are mostly used for classification tasks (rather than sentence generation or other complex language tasks).
- Simple feed forward NN for classification, sentiment analysis,
- Paper on using CNN for spam detection
- Paper and code on CNN for sentiment classification
3. Attention mechanism :
4. Transformer architecture : Model architecture with an encoder-decoder structure. Unlike earlier recurrent sequence-to-sequence models, it relies on attention instead of recurrence.
- https://www.kaggle.com/code/dschettler8845/transformers-course-chapter-1-tf-torch/notebook
- Hugging Face course
- Good blog post on transformer architecture
5. BERT models
6. XLNet (by CMU and Google Brain) - BERT and GPT-3 work better in general
7. GPT-3 model:
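Returning to the RNN entry at the top of this list, the recurrence and its fading memory can be shown with a toy one-unit network. This is a minimal sketch: the scalar weights `w_x` and `w_h` are arbitrary values chosen for illustration, and real RNNs use weight matrices learned during training.

```python
import math


def rnn_hidden_states(inputs, w_x=0.5, w_h=0.5):
    """Run a one-unit RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


# Two sequences that differ only in their very first input.
base = rnn_hidden_states([1.0] + [0.0] * 9)
other = rnn_hidden_states([0.0] + [0.0] * 9)

# How much the first input still influences each later hidden state.
influence = [abs(a - b) for a, b in zip(base, other)]
print(influence)
```

The influence of the first input shrinks at every step, which is the motivation for gated variants such as LSTM and GRU.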
Methods / Models for common use cases at NCSA :
Small project examples :
ML Ops (Kastan Day, Todd Nicholson, Benjamin Galewsky)
View with nicer formatting! https://kastanday.notion.site/Round-Table-ML-Ops-NCSA-27e4b6efc4cb410f8fa58ab2583340d9
Round Table Discussion
View here: Round Table Discussion May 31, 2022 - NCSA Software Wiki
Hosted by Kastan Day, learn more about me on KastanDay.com
Round Table Objectives
Leaving this talk you should have two things:
- Context: the parts of ML project / lifecycle.
- Tools: A list of the best tools for the job.
My goal:
- Plain language.
- Offer awareness, so you can know what to look for online.
Round table high priority topics:
- Pre-trained model zoos
- Dev environments on HPC (Docker/Singularity/Apptainer/Conda)
ML-Ops Outline of Big Ideas
Model selection
- Structured vs unstructured
- Pre-trained models (called Model Zoos, Model hub, or model garden): PyTorch, PyTorch Lightning, TF, TF garden, HuggingFace, PapersWCode, https://github.com/openvinotoolkit/open_model_zoo
- B-tier: https://github.com/collections/ai-model-zoos, CoreML (iOS), iOS, Android, web/JS, ONNX zoo, Largest quantity hit-or-miss quality
- Fastest to use is SkLearn (AutoML).
- PyTorch Lightning.
- FastAI
- XGBoost & LightGBM
- For measuring success, I like F-1 scores (can be weighted).
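For readers who have not met F-1 scores before, here is a small standard-library sketch of the plain and weighted versions. The labels are made up. The weighted variant here averages per-class F-1 by the number of true examples in each class, which is one common definition; check your library's documentation for the exact averaging it applies.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's true count."""
    labels = sorted(set(y_true))
    total = sum(
        f1_score(y_true, y_pred, positive=c) * y_true.count(c) for c in labels
    )
    return total / len(y_true)


# Made-up labels: 3 positives, 5 negatives, two mistakes.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(f1_score(y_true, y_pred))     # F1 of class 1, about 0.667
print(weighted_f1(y_true, y_pred))  # 0.75
```

Unlike accuracy, F-1 penalizes a model that ignores the minority class, which is why it is a reasonable default on imbalanced data.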
Data pipelines
- Luigi, Airflow, Ray, Snake(?), Spark.
- Globus, APIs, S3 buckets, HPC resources.
- Configuring and running Large ML training jobs, on Delta.
- Normal: Pandas, Numpy
- Big:
- Spark (PySpark)
- Dask
- XArray
- Dask - distributed pandas and Numpy
- Rapids
- cuDF - cuda dataframes
- Dask cuDF - distributed dataframe (can’t fit in one GPU’s memory).
- Rapids w/Dask (`cudf`) - distributed, on-gpu calculations. Blog, reading large CSVs.
- Key idea: make data as info-dense as possible.
- Limit correlation between input variables (Pearson or Chi-squared). This is filter-based; you can also do permutation-based importance.
- Common workflow: Normalization → Pearson correlation → XGBoost feature importance → Kernel PCA dimensionality reduction.
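The first two steps of that workflow (normalization, then a Pearson correlation filter) can be sketched with the standard library alone. The column names, values and the 0.95 threshold below are hypothetical, and the later XGBoost and Kernel PCA steps are not covered. Constant columns would cause a division by zero here and would need separate handling.

```python
import math


def z_score(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def pearson(a, b):
    """Pearson correlation; on z-scored data it is the mean of the products."""
    za, zb = z_score(a), z_score(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)


def drop_correlated(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = {}
    for name, column in features.items():
        if all(abs(pearson(column, other)) < threshold for other in kept.values()):
            kept[name] = column
    return list(kept)


# Hypothetical columns: height_in repeats height_cm in other units.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [35.0, 22.0, 41.0, 29.0, 33.0],
}
print(drop_correlated(features))  # ['height_cm', 'age']
```

Dropping the redundant column keeps the same information in fewer inputs, which is what "info-dense" means in practice.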
Data cleaning (and feature engineering ← this is jargon)
- Always normalize both inputs and outputs. Original art by Kastan Day at KastanDay/wascally_wabbit (github.com)
Easy parallelism in Python
- HPC: Parsl, funcX: Federated Function as a Service
- Commercial or Cloud: Ray.io
Serving
- Gradio & HF Spaces & Streamlit & PyDoc
- Data and Learning Hub for Science (research soft.) Dan Katz.
- Triton, TensorRT and ONNX. NVIDIA Triton Inference Server
Distributed training
XGBoost - Dask.
LightGBM - Dask or Spark.
Horovod.
PyTorch DDP (PyTorch lightning) Speed Up Model Training — PyTorch Lightning 1.7.0dev documentation
General idea: Pin certain layers to certain devices. Simple cases aren’t too bad in theory, but require a fair bit of specific knowledge about the model in question.
Flavors of Parallelism
- Easy: XGBoost or LightGBM. Python code: Dask, Ray, Parsl.
- Medium: Data parallelism in Horovod, DeepSpeed, PyTorch DDP. GPU operations with Nvidia RAPIDS.
- Hard: model parallelism. Must-read resource: Model Parallelism (huggingface.co)
- My research: FS-DDP, DeepSpeed, pipeline parallelism, tensor parallelism, distributed-all-reduce, etc.
- Glossary
- DDP — Distributed Data Parallel
- PP - Pipeline Parallel (DeepSpeed)
- TP - Tensor Parallel
- VP - Voting parallel (usually decision tree async updates, e.g. LightGBM)
- MP - Model Parallel (Model sharding, and pinning layers to devices)
- FS-DDP - Fully Sharded Distributed Data Parallel
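To make the data-parallel idea behind DDP concrete, here is a single-process simulation, under the assumption of a one-parameter model `y = w * x` and equally sized shards. No real devices or communication libraries are involved; the "all-reduce" is just a Python average standing in for what Horovod or PyTorch DDP do across GPUs.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, batch, num_workers, lr=0.01):
    """One SGD step where each simulated worker handles an equal shard."""
    shard_size = len(batch) // num_workers
    shards = [
        batch[i * shard_size : (i + 1) * shard_size] for i in range(num_workers)
    ]
    # Each worker holds a full copy of w and computes a local gradient.
    local_grads = [gradient(w, shard) for shard in shards]
    # "All-reduce": average the local gradients so every replica applies
    # the same update and the copies of w stay identical.
    avg_grad = sum(local_grads) / num_workers
    return w - lr * avg_grad


# Toy data for y = 3 * x; the batch size is divisible by the worker count.
batch = [(float(x), 3.0 * x) for x in range(1, 9)]
w_single = data_parallel_step(0.0, batch, num_workers=1)
w_parallel = data_parallel_step(0.0, batch, num_workers=4)
print(w_single, w_parallel)
```

With equal shards, averaging the per-worker gradients gives the same update as one worker processing the whole batch, so data parallelism changes the speed but not the result. Model, pipeline and tensor parallelism instead split the model itself, which is why they are harder.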
Fine-tune on out-of-distribution examples?
- TBD: What's the best way to fine-tune?
- TBD: How do you monitor if your model is experiencing domain shift while in production? WandB Alerts is my best idea.
- Use Fast.ai w/ your PT or TF model, I think.
- A motivating example: the Permafrost Discovery Gateway has a great classifier for satellite images from Alaska, but it needs to be adjusted for imagery from another region. How can we best fine-tune our existing model to this slightly different domain?
MLOps
- WandB.ai — Highly recommended. First class tool during model development & data pre-processing.
- Spell
- ClearML
- MLOps: What It Is, Why It Matters, and How to Implement It - neptune.ai
- The Framework Way is the Best Way: the pitfalls of MLOps and how to avoid them | ZenML Blog
HPC resources at UIUC
- NCSA Large: Delta (and Cerebras). External, but friendly: XSEDE (Bridges2).
- NCSA Small: Nano, Kingfisher, HAL (`ppc64le`).
- NCSA Modern: DGX, and Arm-based with two A100(40G) (via `Hal-login3`).
Environments on HPC
- `module load <TAB><TAB>`: discover preinstalled environments
- Apptainer (previously called Singularity): Docker for HPC, requires few permissions.
- Write DOCKERFILEs for HPC, syntax here.
- Globus file transfer — my favorite. Wonderfully robust, parallel, lots of logging.
- Towards the perfect command-line file transfer: Xargs | Rsync. Use Xargs to parallelize Rsync for file transfer and sync (NCSA wiki resource) and another 3rd party blog.
Cheapest GPU cloud compute
The benefit: sudo access on modern hardware and clean environments. That's perfect for when dependency-hell makes you want to scream, especially when dealing with outdated HPC libraries.
- Google Colab (free or paid)
- Kaggle kernels (free)
- LambdaLabs (my favorite for cheap GPU)
- DataCrunch.io (my favorite for cheap GPU, especially top-of-the-line A100s with 80G)
- Grid.ai (From the creator of PyTorch Lightning)
- PaperSpace Gradient
- GCP and Azure — lots of free credits floating around.
- Azure is one of the few that offer 8x80GB A100 systems, at ~$38/hr. That is expensive, but sometimes you may need it.
New topics
Streaming data for ML inference
- Event listeners...
- Data + AI Summit 2021 Agenda - Databricks
Domain drift, explainable ai, dataset versioning (need to motivate, include in hyperparam search).
- Apache Iceberg - ETL & high perf.
- Project Nessie: Transactional Catalog for Data Lakes with Git-like semantics
- Like Git for data.
Explainability tools:
- SHAP
- ELI5
- (Gradient boosted) decision trees
- XGBoost
- LightGBM: https://github.com/microsoft/LightGBM
Using GPUs for Speeding up ML (Vismayak Mohanarajan)
Rapids - cuDF and cuML
Colab Page - https://colab.research.google.com/drive/1bzL-mhGNvh7PF_MzsSgMmw9TQjyP6DCe?usp=sharing
ML Pathways
<List of some popular ML learning pathways and a brief comment about each>
Conclusion
<A brief concluding section about the report any future ideas>
Conda Best Practices
When sharing Conda envs: Consider, are you sharing with others or using Conda in Docker?
Adding the `--from-history` flag exports only the packages you explicitly installed with conda. It will NOT include pip packages or anything else, like apt/yum/brew.
# 1. cross platform conda envs (my go-to)
conda env export --from-history > environment.yml # main yaml
conda env export > pip_env.yml # just for pip
# Then, manually copy ONLY the pip section from pip_env.yml into environment.yml.
conda env create -f environment.yml # usage
# 2. for dockerfiles & exact replication on identical hardware
conda list --explicit > spec-file.txt
conda create --name myenv --file spec-file.txt # usage
Install complex dependencies with Conda: specific versions of `cuda`, `gcc` and more!
Cuda Toolkit :: Anaconda.org — check the “labels” tab for more versions! It works like Docker labels; you can pull whatever version you need.
Note: the same packages are distributed by multiple “channels” (the `-c` flag). Finding the right channel can be messy, so do some googling to check compatibility.
# Check Cuda Version
$ nvcc --version
$ cat /usr/local/cuda/version.txt  # always check here
# Install Cuda
conda install -c nvidia/label/cuda-11.3.1 cuda-toolkit # "All" necessary cuda tools
conda install -c nvidia/label/cuda-11.3.1 cuda-nvcc # "NVidia Cuda Compiler"
Conda vs Mamba
Mamba is a faster drop-in replacement for Conda with the same command syntax.
In my experience, Mamba is worse than Conda at resolving dependencies. At least it is conservative: it will not mess up your environment, it will just fail.
Therefore, I recommend running `mamba install` first and, if you get the error “cannot resolve dependencies,” then trying `conda install` for more power, at the cost of being slow. If you have to pick one, conda is more capable.
Rsync Best Practices
Rsync syntax is modeled after `scp`. Here is my favorite usage.
# My go-to command:
rsync -azP source destination
# flags explained
# -a is like scp's `-r` but it also preserves metadata and symlinks.
# -z = compression (more CPU usage, less network traffic)
# -P flag combines the flags --progress and --partial. It enables resuming.
# to truly keep in sync, add delete option
rsync -a --delete source destination
# create backups
rsync -a --delete --backup --backup-dir=/path/to/backups /path/to/source destination
# Good flags
--exclude=pattern_to_exclude
-n = dry run, don't actually do it. just print what WOULD have happened.
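When rsync is called from a Python workflow, the flags above can be assembled programmatically. The sketch below is one possible helper, not an established tool: `build_rsync_command` and `run_rsync` are hypothetical names and the paths are placeholders. It only uses the rsync flags already listed in this section (`-azP`, `--delete`, `-n`, `--exclude`).

```python
import shlex
import subprocess


def build_rsync_command(source, destination, delete=False, dry_run=False, exclude=()):
    """Assemble an rsync argument list from the flags discussed above."""
    cmd = ["rsync", "-azP"]
    if delete:
        cmd.append("--delete")
    if dry_run:
        cmd.append("-n")
    for pattern in exclude:
        cmd.append(f"--exclude={pattern}")
    return cmd + [source, destination]


def run_rsync(cmd):
    """Run the command; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(cmd, check=True)


# Hypothetical paths. Preview a mirroring sync before running it for real.
preview = build_rsync_command(
    "data/", "user@host:/scratch/data/", delete=True, dry_run=True, exclude=["*.tmp"]
)
print(shlex.join(preview))
```

Building the command as a list and doing a dry run first is a cautious pattern when `--delete` is involved, since a wrong destination path would otherwise remove files.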