[This document is under construction]
Document History
Version | Contributors | Major Changes | Date Updated
---|---|---|---
0.1.0 | | First draft version |
Contributors
The following is the list of contributors (in alphabetical order of their first names) who have contributed to this document by writing sections, sharing ideas, and participating in discussions.
- Benjamin Galewski
- Kastan Day
- Minu Mathew
- Sandeep Puthanveetil Satheesan
- Todd Nicholson
- Vismayak Mohanarajan
- Volodymyr Kindratenko
Introduction
The Hands-on Machine Learning Study Materials for Research Software Engineers Focus Group was formed to share study materials and other related resources that will be useful for interested Research Software Engineers at different Machine Learning (ML) skill levels. The following were some of the primary goals of this focus group:
- Come up with a set of good hands-on study materials that Research Software Engineers can use to develop and/or improve ML skills
- Include materials that are useful for beginners and people with intermediate skills in ML
- Gather documentation on ML models that generally work for different problem areas or are based on some parameters (e.g., amount of training data for supervised learning)
- Collate and adapt the collected materials if possible
- Document the collected materials/URLs and categorize them (based on the focus group's criteria)
- Choose different areas within ML to focus on:
- Traditional Machine Learning
- Deep Learning - Text Analysis
- ML Operations and relevant services
- Write some code examples that can be shared (e.g., Jupyter Notebooks)
- Collect documentation on existing NCSA hardware for ML (e.g., HAL, Delta)
This working document is the Focus Group's report, containing the study materials and other learning resources collected by the Focus Group members, organized into different sections and subsections. This document is not an extensive survey of the available study materials, and we do not claim that it lists all the available study materials or resources. The Focus Group met every two weeks, discussed the materials collected until then, and documented them here.
Traditional Machine Learning
ML is a branch of Artificial Intelligence (AI) that extracts information from data without using an explicit set of instructions on how to process the data. Instead, ML algorithms use mathematical models to represent the structure of the data, which are then used to provide predictions on future data.
ML algorithms can be broadly classified into Traditional Machine Learning and Deep Learning (DL). These are further classified into Supervised or Semi-Supervised Learning, Unsupervised Learning, and Reinforcement Learning. DL uses Artificial Neural Networks (ANN) to learn data representations, while Traditional ML techniques use non-ANN-based frameworks. Supervised or Semi-Supervised Learning uses pre-labeled data to provide "examples" from which the ML algorithm can learn. In Unsupervised Learning, the ML algorithms are provided with unlabeled data. In Reinforcement Learning, a feedback loop provides inputs to ML algorithms about how well the algorithm performs on any given data item. This feedback loop constantly improves the system as more data becomes available.
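The supervised-learning idea above can be illustrated in a few lines of code: a toy nearest-centroid classifier "learns" from pre-labeled examples and then predicts labels for new data points. A minimal numpy sketch (the data points and classes are made up for illustration):

```python
import numpy as np

# Pre-labeled training data: two features per sample, binary labels ("examples").
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 7.5]])
y_train = np.array([0, 0, 1, 1])

# "Training": summarize each class by the mean (centroid) of its examples.
centroids = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}

def predict(x):
    # Predict the class whose centroid is closest to the new point.
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(predict(np.array([1.2, 1.5])))  # close to the class-0 examples
print(predict(np.array([8.5, 8.0])))  # close to the class-1 examples
```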
Introductory Courses/Blogs
- Machine Learning by Andrew Ng, Stanford University/Coursera, https://www.coursera.org/learn/machine-learning/
- Skill Level: Beginner (basic programming skills needed).
- Key Features: Theoretical concepts; Hands-on exercises
- Short Description: This is a popular ML course offered by Coursera and part of their Machine Learning specialization. Learners build machine learning models using the NumPy and scikit-learn libraries in Python, and build and train supervised models for prediction and classification tasks.
- Machine Learning Mastery by Jason Brownlee, https://machinelearningmastery.com/
- Skill Levels: Foundations; Beginner; Intermediate; Advanced
- Key Features: Some theoretical concepts, Mostly hands-on exercises
- Short Description: This is a popular and developer-focused suite of study materials on different topics in ML.
- Machine Learning Crash Course from Google, https://developers.google.com/machine-learning/crash-course
- Skill Levels: Beginner to Advanced; limited knowledge of the TensorFlow (TF) framework is assumed
- Key Features: Theoretical concepts
- Introduction to Machine Learning from Kaggle, https://www.kaggle.com/learn/intro-to-machine-learning
- Skill Level: Beginner
- Key Features: Theoretical concepts; Hands-on tutorials using Jupyter Notebooks
- Machine Learning with Python: A Practical Introduction, https://www.edx.org/course/machine-learning-with-python-a-practical-introduct
- Skill Level: Beginner; basic Python knowledge recommended
- Key Features: Theoretical concepts; Hands-on exercises
- Notes On Using Data Science & Machine Learning, https://chrisalbon.com/#code_machine_learning
- Hands-on, practical, and applied learning resources:
- Practical Deep Learning for Coders (Fast.ai)
- One of the fastest practical learning materials available.
- Dive into Deep Learning (d2l.ai)
- Good for concise topic-specific references.
Deep Learning - Text Analysis
Curator: Minu Mathew (minum@illinois.edu)
Introduction
Natural Language Processing (NLP) is broadly defined as software's automatic manipulation of natural language, like speech and text. The study of natural language processing has been around for more than 50 years and has grown out of the field of linguistics with the rise of computers.
Common resources
- Approaching (Almost) Any Machine Learning Problem
- More code with a bit of theory.
- To the point rather than elaborate
- Details the code used for most practical ML tasks
- EugeneYan AppliedML
- A good set of relevant and recent papers on various ML topics.
- HuggingFace
- A good resource for anything NLP
- Get pre-trained models and source code for the most well-known problems.
- Stanford Deep Learning course
- Theoretical and math-heavy. Delves into loss functions, activation functions, representations, and word embeddings.
Basic NLP
Natural language has no structure; however, computers like some structure. So, basic NLP techniques try to introduce some structure to text to find patterns. The re Python library is commonly used for this.
Regular Expressions
This is the most basic form of text manipulation. Good for quick string comparisons and transformations.
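For example, a few quick comparisons and transformations with Python's built-in re module (the text and patterns here are illustrative):

```python
import re

text = "Meeting on 2022-05-31 at NCSA, follow-up on 2022-06-14."

# Comparison: does the string contain an ISO-style date?
assert re.search(r"\d{4}-\d{2}-\d{2}", text) is not None

# Extraction: pull out all dates.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)  # ['2022-05-31', '2022-06-14']

# Transformation: redact the dates.
print(re.sub(r"\d{4}-\d{2}-\d{2}", "<DATE>", text))
```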
Data Preparation
Tokenization, normalization, and stemming are methods to add some structure to text. NLTK (a Python package) is commonly used for these methods.
Check out this blog for the usage of NLTK for data preparation.
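NLTK provides robust implementations of these steps (e.g., word_tokenize and PorterStemmer). As a rough illustration of the ideas only, here is a toy pure-Python sketch; the crude suffix-stripping rule is far weaker than a real stemmer:

```python
import re

def tokenize(text):
    # Normalization: lowercase; tokenization: split on non-alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token):
    # Toy stemming: strip a few common suffixes. A real stemmer (e.g.,
    # NLTK's PorterStemmer) uses a much more careful rule set.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The algorithms learned representations quickly!")
print([crude_stem(t) for t in tokens])
```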
Dimensionality Reduction
This technique, in essence, captures the most important structure of the text. It converts a high-dimensional space to a low-dimensional one by preserving only the important vectors (eigenvectors), collapsing highly correlated dimensions into a single one. Check out SVD (Singular Value Decomposition) for the math behind it.
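The idea can be seen directly with numpy: keep only the top singular vectors, and the low-rank reconstruction preserves the important structure. A minimal sketch with synthetic data that is secretly low-rank:

```python
import numpy as np

rng = np.random.default_rng(0)
# A "high-dimensional" 100 x 50 matrix that is actually rank 2.
A = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 50))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top k singular vectors (the important directions).
k = 2
A_k = U[:, :k] * s[:k] @ Vt[:k, :]

# The rank-2 reconstruction recovers the matrix almost exactly.
print(np.allclose(A, A_k))  # True
```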
Text to Numeric
After cleaning, the next step is to convert text to numeric form (vectors/matrices). This blog post explains the process with code. The methods below can be used for the conversion.
- Vocab count / Bag of Words (BoW)
- The simplest technique of counting all words and giving indexes for each.
- No contextual information is preserved. Context is critical in language, and this key part is lost when employing this technique.
- Use count vectorizer from sklearn or TF-IDF (better)
- Remove stop words
- One-hot encoding
- Each word is represented as an n-dimensional vector where n is the total number of words. The index of the particular word will have a value of 1, and the rest of the index values are 0.
- Context is lost.
- Easy to manipulate and process because of 1s and 0s.
- Frequency count
- The frequency of each word is preserved, along with whether the word is present in a sentence.
- Context is lost, but the frequency is preserved.
- The idea here is that the more frequent the word, the less significant it is.
- Term-Frequency Inverse-Document Frequency (TF-IDF)
- This is the most used form of vectorization in simple NLP tasks. It uses the word frequency in each document and across documents.
- No contextual information is preserved. But word importance is highlighted in this method.
- This method provides good results for topic classification and spam filtering (identifying spam words).
- Blog on BoW and TF-IDF
- Word Embeddings: preserve contextual information and capture the semantics of a word.
- Resource on implementation of various embeddings
- Learn word embeddings using n-gram (PyTorch and Keras). Here, the word embeddings are learned in the training phase; hence, the embeddings are specific to the training set. (considers text sequences)
- Word2Vec (pre-trained word embeddings from Google) - based on word distributions and local context (window size). (considers text sequences)
- GLoVe (pre-trained from Stanford) - based on global context (considers text sequences)
- BERT embeddings (an advanced technique using transformer architecture)
- GPT-3 embeddings
- Using word embeddings:
- Learn it (not recommended)
- Reuse it (recommended - although check what dataset the embeddings have been trained on)
- Reuse + fine-tune (recommended)
- Fine-tuning usually means reusing the first few layers with the same weights as the pre-trained model. Freeze these lower layers during the training phase.
- The final few layers (usually the last 3-4) are trained with the dataset. That way, the weights of the final layers are learned specific to the task/data at hand.
- The lower layers carry rich general knowledge from being trained on a huge and varied dataset (in the pre-trained model), while the final layers have weights tuned to the specific task.
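In practice, one would reach for scikit-learn's CountVectorizer or TfidfVectorizer. The following pure-Python sketch shows what the BoW counts and TF-IDF weights actually compute (using a smoothed IDF variant, so exact numbers differ slightly between libraries):

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran fast"]
vocab = sorted({w for d in docs for w in d.split()})

# Bag of Words: one count per vocabulary word, per document.
bow = [[Counter(d.split())[w] for w in vocab] for d in docs]

# IDF: rare words get larger weights (smoothed variant).
n = len(docs)
df = {w: sum(w in d.split() for d in docs) for w in vocab}
idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}

# TF-IDF: term frequency times inverse document frequency.
tfidf = [[count * idf[w] for w, count in zip(vocab, row)] for row in bow]

print(vocab)
print(bow[0])                  # counts for "the cat sat"
print(min(idf, key=idf.get))   # "the" appears everywhere -> smallest weight
```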
Models
Recurrent Neural Networks (RNN)
Most logical to be used in text analysis. The appearance of one word depends on the previous words in the text, and hence whatever the current word implies (or has information about) depends on the previous words. This is the principle of RNNs: the current node's state depends on the current word and the previous state (auto-regressive models). The problem with this is that the signal decays for longer sentences (vanishing gradients). This blog gives an introduction to RNNs.
- LSTM (Long short-term memory): An RNN but with direct links from the current node to another node in the forward path
- Bi-LSTM: Bi-directional LSTM. Weights of nodes are propagated both ways.
- GRU (Gated Recurrent Unit): RNNs but with gates that connect/disconnect the nodes and control information flow.
- Resources :
- Blog post on RNNs by Andrej Karpathy
- Resource on all 3 RNNs
- Stanford Lecture video on RNNs - heavy on math, but great to have the fundamentals + models right.
- Paper on RNNs - a very detailed (and lengthy) paper on RNNs, its foundations, methods, architecture, why it works, and in which scenarios it doesn't.
- Article covering RNNs, CNNs, and attention mechanism - Theoretical. A good read to understand concepts.
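The recurrence the resources above describe fits in a few lines: each step mixes the current word vector with the previous hidden state, which is also where the decay problem for long sentences comes from. A minimal numpy sketch with random (untrained) weights; real models learn W, U, and b by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8                               # word-vector and hidden sizes
W = rng.normal(scale=0.1, size=(d_hid, d_in))    # input -> hidden
U = rng.normal(scale=0.1, size=(d_hid, d_hid))   # hidden -> hidden (recurrence)
b = np.zeros(d_hid)

def rnn(inputs):
    # h_t = tanh(W x_t + U h_{t-1} + b): the current state depends on the
    # current word AND everything seen so far, via the previous state.
    h = np.zeros(d_hid)
    for x in inputs:
        h = np.tanh(W @ x + U @ h + b)
    return h

sentence = rng.normal(size=(5, d_in))  # 5 "words", each a 4-dim vector
h_final = rnn(sentence)
print(h_final.shape)  # (8,)
```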
Convolutional Neural Networks (CNN)
- Although mostly used for computer vision tasks, convolutional neural networks can also be used for NLP / sentence analysis. These are mostly used for classification tasks (rather than sentence generation or other complex language tasks).
- Simple feed-forward NN for classification and sentiment analysis
- Paper on using CNN for spam detection
- Paper and code on CNN for sentiment classification
Attention Mechanism
Transformer Architecture
- Model architecture with an encoder-decoder structure. The auto-encoder model structure differs greatly from the sequence-to-sequence (auto-regressive) models.
- https://www.kaggle.com/code/dschettler8845/transformers-course-chapter-1-tf-torch/notebook
- Hugging Face course
- Good blog post on transformer architecture
BERT
- The introduction of the BERT model is commonly termed NLP’s ImageNet moment, as it was open-sourced and achieved astounding results.
- Auto-encoding language model (does not use auto-regressive techniques like in RNNs)
- Open-source model released by Google
- Source code and pre-trained models available
- BERT paper - well-written and a good read.
- Jay Alammar's blog post on BERT
- Explains how BERT works and its differences from other models.
- Illustrative theory.
- BERT in practice (using Colab, Hugging Face, and very simple code)
GPT-2
- This model, released by OpenAI, was much more mature than BERT. OpenAI initially did not release the fully trained model, fearing malicious use.
- Blog post by OpenAI
- Details the crux of the model and the experiment results
- Links to the paper and other technical blog posts on transformers.
- Jay Alammar Blog post with illustrations
- Compares BERT with GPT-2
- Open source code
XL-Net (BERT and GPT-3 generally perform better)
GPT-3
- This model forms the basis of many recent models. Pre-trained models are available through OpenAI's API, although the model itself is not open-sourced.
- State-of-the-art (SOTA) for now
- Jay Alammar Blog post explaining GPT-3 in simple terms with illustrations
- Very simple theory. No code.
- Reading up on GPT-2 first is recommended.
- Blog post describing various NLP tasks and GPT-3's achievements in those
- Theoretical read.
- Talks about NLP benchmarks like GLUE, BLEU, and SQuAD
New Topics / Cutting-edge
ML Ops (Kastan Day, Todd Nicholson, Benjamin Galewsky )
View with nicer formatting! https://kastanday.notion.site/Round-Table-ML-Ops-NCSA-27e4b6efc4cb410f8fa58ab2583340d9
Round Table Discussion
View here: Round Table Discussion May 31, 2022 - NCSA Software Wiki
Hosted by Kastan Day, learn more about me on KastanDay.com
Round Table Objectives
Leaving this talk, you should have two things:
- Context: the parts of ML project/lifecycle.
- Tools: A list of the best tools for the job.
My goal:
- Plain language.
- Offer awareness, so you can know what to look for online.
Round table high-priority topics:
- Pre-trained model zoos
- Dev environments on HPC (Docker/Singularity/Apptainer/Conda)
ML-Ops Outline of Big Ideas
Model selection
- Structured vs. unstructured
- Pre-trained models (called Model Zoos, Model hub, or model garden): PyTorch, PyTorch Lightning, TF, TF garden, HuggingFace, PapersWCode, https://github.com/openvinotoolkit/open_model_zoo
- B-tier: https://github.com/collections/ai-model-zoos, CoreML (iOS), Android, web/JS, ONNX zoo. Largest quantity, hit-or-miss quality.
- Fastest to use is SkLearn (AutoML).
- PyTorch Lightning.
- FastAI
- XGBoost & LightGBM
- For measuring success - F-1 scores (which can be weighted).
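For reference, the weighted F-1 mentioned above is the average of per-class F-1 scores weighted by class support (what sklearn.metrics.f1_score computes with average='weighted'). A small pure-Python sketch:

```python
from collections import Counter

def f1_weighted(y_true, y_pred):
    classes = sorted(set(y_true))
    support = Counter(y_true)
    total = 0.0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        # Weight each class's F-1 by how many true samples it has.
        total += f1 * support[c] / len(y_true)
    return total

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(round(f1_weighted(y_true, y_pred), 3))  # 0.833
```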
Data pipelines
- Luigi, Airflow, Ray, Snake(?), Spark.
- Globus, APIs, S3 buckets, HPC resources.
- Configuring and running Large ML training jobs on Delta.
- Normal: Pandas, Numpy
- Big:
- Spark (PySpark)
- Dask
- XArray
- Dask - distributed pandas and Numpy
- Rapids
- cuDF - cuda dataframes
- Dask cuDF - distributed data frame (can’t fit in one GPU’s memory).
- Rapids w/Dask (cudf) - distributed, on-GPU calculations. Blog, reading large CSVs.
- Key idea: make data as info-dense as possible.
- Limit correlation between input variables (Pearson or Chi-squared) — this is filter-based, you can also do permutation-based importance.
- Common workflow: Normalization → Pearson correlation → XGBoost feature importance → Kernel PCA dimensionality reduction.
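The correlation-filtering step above can be sketched with numpy: compute pairwise Pearson correlations between input columns and drop one column from each highly correlated pair (the threshold and data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x1 = x0 * 2.0 + rng.normal(scale=0.01, size=n)  # nearly a copy of x0
x2 = rng.normal(size=n)                         # independent feature
X = np.column_stack([x0, x1, x2])

corr = np.abs(np.corrcoef(X, rowvar=False))     # feature-feature Pearson |r|

keep, threshold = [], 0.95
for j in range(X.shape[1]):
    # Keep column j only if it is not highly correlated with a kept column.
    if all(corr[j, k] < threshold for k in keep):
        keep.append(j)

print(keep)  # x1 is dropped: it duplicates x0's information
```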
Data cleaning (and feature engineering ← this is jargon)
- Always normalize both inputs and outputs. Original art by Kastan Day at KastanDay/wascally_wabbit (github.com)
Easy parallelism in Python
- HPC: Parsl, funcX: Federated Function as a Service
- Commercial or Cloud: Ray.io
Serving
- Gradio & HF Spaces & Streamlit & PyDoc
- Data and Learning Hub for Science (research soft.) Dan Katz.
- Triton, TensorRT and ONNX. NVIDIA Triton Inference Server
Distributed training
- XGBoost - Dask.
- LightGBM - Dask or Spark.
- Horovod.
- PyTorch DDP (PyTorch Lightning): Speed Up Model Training — PyTorch Lightning 1.7.0dev documentation
General idea: Pin certain layers to certain devices. Simple cases aren’t too bad in theory but require a fair bit of specific knowledge about the model in question.
Flavors of Parallelism
- Easy: XGBoost or LightGBM. Python code: Dask, Ray, Parsl.
- Medium: Data parallelism in Horovod, DeepSpeed, PyTorch DDP. GPU operations with Nvidia RAPIDS.
- Hard: model parallelism. Must-read resource: Model Parallelism (Hugging Face)
- My research: FS-DDP, DeepSpeed, pipeline parallelism, tensor parallelism, distributed-all-reduce, etc.
- Glossary
- DDP — Distributed Data Parallel
- PP - Pipeline Parallel (DeepSpeed)
- TP - Tensor Parallel
- VP - Voting parallel (usually decision tree async updates, e.g., LightGBM)
- MP - Model Parallel (Model sharding and pinning layers to devices)
- FS-DDP - Fully Sharded Distributed Data Parallel
Fine-tune on out-of-distribution examples?
- TBD: What's the best way to fine-tune?
- TBD: How do you monitor if your model is experiencing domain shift while in production? WandB Alerts is my best idea.
- Use Fast.ai w/ your PT or TF model, I think.
- A motivating example: the Permafrost Discovery Gateway has a great classifier for satellite images from Alaska but needs to adjust it for a slightly different region. How can we best fine-tune our existing model to this slightly different domain?
MLOps
- WandB.ai — Highly recommended. First-class tool during model development & data pre-processing.
- Spell
- ClearML
- MLOps: What It Is, Why It Matters, and How to Implement It - neptune.ai
- The Framework Way is the Best Way: the pitfalls of MLOps and how to avoid them | ZenML Blog
HPC resources at UIUC
- NCSA Large: Delta (and Cerebras). External but friendly: XSEDE (Bridges2).
- NCSA Small: Nano, Kingfisher, HAL (ppc64le).
- NCSA Modern: DGX, and Arm-based with two A100 (40G) (via Hal-login3).
Environments on HPC
- module load <TAB><TAB> — discover preinstalled environments
- Apptainer (previously called Singularity): Docker for HPC, requires few permissions.
- Write DOCKERFILEs for HPC, syntax here.
- Globus file transfer — my favorite. Wonderfully robust, parallel, and has lots of logging.
- Towards the perfect command-line file transfer: xargs | rsync — Xargs to parallelize Rsync for file transfer and sync (NCSA wiki resource) and another 3rd party blog.
Cheapest GPU cloud compute
The benefit: sudo access to modern hardware and clean environments. That's perfect for when dependency hell makes you want to scream, especially when dealing with outdated HPC libraries.
- Google Colab (free or paid)
- Kaggle kernels (free)
- LambdaLabs (my favorite for cheap GPU)
- DataCrunch.io (my favorite for cheap GPU, especially top-of-the-line A100s with 80G memory)
- Grid.ai (From the creator of PyTorch Lightning)
- PaperSpace Gradient
- GCP and Azure — lots of free credits floating around.
- Azure is one of the few that have 8x80GB A100 systems. For ~$38/hr. Still, sometimes you may need that.
New topics
Streaming data for ML inference
- Event listeners...
- Data + AI Summit 2021 Agenda - Databricks
Domain drift, explainable AI, dataset versioning (need to motivate; include in hyperparameter search).
- Apache Iceberg - ETL & high perf.
- Project Nessie: Transactional Catalog for Data Lakes with Git-like semantics
- Like Git for data.
Explainability tools:
- SHAP
- ELI5
- (Gradient boosted) decision trees
- XGBoost
- LightGBM: https://github.com/microsoft/LightGBM
Using GPUs for Speeding up ML (Vismayak Mohanarajan)
Rapids - cuDF and cuML
Colab Page - https://colab.research.google.com/drive/1bzL-mhGNvh7PF_MzsSgMmw9TQjyP6DCe?usp=sharing
ML Pathways
<List of some popular ML learning pathways and a brief comment about each>
Conclusion
<A brief concluding section about the report and any future ideas>
Conda Best Practices
When sharing the Conda environment: Consider whether you are sharing with others or using Conda in Docker.
Adding the --from-history flag will export only the packages you manually installed while using conda. It will NOT include pip packages or anything else, like apt/yum/brew.
# 1. cross platform conda envs (my go-to)
conda env export --from-history > environment.yml # main yaml
conda env export > pip_env.yml # just for pip
Then, manually copy ONLY the pip section from pip_env.yml into environment.yml.
conda env create -f environment.yml # usage
# 2. for dockerfiles & exact replication on identical hardware
conda list --explicit > spec-file.txt
conda create --name myenv --file spec-file.txt # usage
Install complex dependencies with Conda: specific versions of cuda, gcc, and more!
Cuda Toolkit: Anaconda.org — check the “labels” tab for more versions! It works like Docker labels; you can pull whatever version you need.
Note: the same packages are distributed by multiple “channels” (the -c flag). It can be messy finding the right channel; do some googling to find compatibilities.
# Check Cuda Version
$ nvcc --version
$ cat /usr/local/cuda/version.txt -- always check here
# Install Cuda
conda install -c nvidia/label/cuda-11.3.1 cuda-toolkit # "All" necessary cuda tools
conda install -c nvidia/label/cuda-11.3.1 cuda-nvcc # "NVidia Cuda Compiler"
Conda vs. Mamba
Mamba is a faster drop-in replacement for Conda, with 100% identical syntax.
However, Mamba is worse than Conda at resolving dependencies. At least it is conservative and will never mess up your environment; it will just fail.
Therefore, I recommend running mamba install first, and if you get a “cannot resolve dependencies” error, then try conda install for more power at the cost of being slow. If you have to pick one, Conda is strictly more capable.
Rsync Best Practices
Rsync syntax is modeled after scp. Here is my favorite usage.
# My go-to command:
rsync -azP source destination
# flags explained
# -a is like scp's `-r`, but it also preserves metadata and symlinks.
# -z = compression (more CPU usage, less network traffic)
# -P combines the flags --progress and --partial. It enables resuming.
# to truly keep in sync, add delete option
rsync -a --delete source destination
# create backups
rsync -a --delete --backup --backup-dir=/path/to/backups /path/to/source destination
# Good flags
--exclude=pattern_to_exclude
-n = dry run, don't actually do it. just print what WOULD have happened.
Models :
Most logical to be used in text. The appearance of one word depends on the previous word in the text, and hence whatever the current word implies (or has information about) is dependent on the previous word. This is the principle of RNNs, the current node weight depends on the current word and the previous weights. Problem with this is the weights decay for longer sentences.
- LSTM (Long short-term memory) : An RNN but with direct links from the current node to another node in the forward path
- Bi-LSTM : Bi-directional LSTM. Weights of nodes are propagated both ways.
- GRU (Gated Recurrent Unit) : RNNs but with gates which connects / disconnects the nodes and controls information flow.
- Resources :
- Blog post on RNNs by Andrej Karpathy
- Resource on all 3 RNNs
- Stanford Lecture video on RNNs - heavy on math, but great to have the fundamentals + models right.
- Paper on RNNs - a very detailed (and lengthy) paper on RNNs, its foundations, methods, architecture, why it works and in which scenarios it doesn't..
2. CNN : Convolutional neural networks can also be used for NLP / sentence analysis. These are mostly used for classification tasks (rather than sentence generation or other complex language tasks).
- Simple feed forward NN for classification, sentiment analysis,
- Paper on using CNN for spam detection
- Paper and code on CNN for sentiment classification
3. Attention mechanism :
4. Transformer architecture : Model architecture with an encoder-decoder structure. Very different from the sequence-to-sequence models.
- https://www.kaggle.com/code/dschettler8845/transformers-course-chapter-1-tf-torch/notebook,
- hugging face course
- Good blog post on transformer architecture
5. BERT models
6. XL-Net (by microsoft) - BERT and GPT-3 works better in general
7. GPT-3 model:
Methods / Models for common use cases at NCSA :
Small project examples :
ML Ops (Kastan Day, Todd Nicholson, Benjamin Galewsky )
View with nicer formatting! https://kastanday.notion.site/Round-Table-ML-Ops-NCSA-27e4b6efc4cb410f8fa58ab2583340d9
Round Table Discussion
View here: Round Table Discussion May 31, 2022 - NCSA Software Wiki
Hosted by Kastan Day, learn more about me on KastanDay.com
Opening questions
How many people have worked with AI? Raise your virtual hands. go around the room and describe it a little.
Data engineer vs data science role ← are they on your team?
Who uses python? Conda?
Ai on current projects, open questions?
Objectives
Leaving this talk you should have two things:
- Context: the parts of ML project / lifecycle.
- Tools: A list of the best tools for the job.
My goal:
- Plain language.
- Offer awareness, so you can know what to look for online.
Outline
High priority topics:
- [ ] Pre-trained model zoos
- [ ] Environments on HPC (Docker/Singularity/Apptainer/Conda)
Model selection
- Structured vs unstructured
- Pre-trained models (called Model Zoos, Model hub, or model garden): PyTorch, PyTorch Lightning, TF, TF garden, HuggingFace, PapersWCode, https://github.com/openvinotoolkit/open_model_zoo
- B-tier: https://github.com/collections/ai-model-zoos, CoreML (iOS), iOS, Android, web/JS, ONNX zoo, Largest quantity hit-or-miss quality
- Fastest to use is SkLearn (AutoML).
- PyTorch Lightning.
- FastAI
- XGBoost & LightGBM
- For measuring success, I like F-1 scores (can be weighted).
Data pipelines
- Luigi, Airflow, Ray, Snake(?), Spark.
- Globus, APIs, S3 buckets, HPC resources.
- Configuring and running Large ML training jobs, on Delta.
- Normal: Pandas, Numpy
- Big:
- Spark (PySpark)
- Dask
- XArray
- Dask - distributed pandas and Numpy
- Rapids
- cuDF - cuda dataframes
- Dask cuDF - distributed dataframe (can’t fit in one GPU’s memory).
- Rapids w/Dask (
cudf
) - distributed, on-gpu calculations. Blog, reading large CSVs.
- Key idea: make data as info-dense as possible.
- Limit correlation between input variables (Pearson or Chi-squred) — this is filter-based, you can also do permutation-based importance.
- Common workflow: Normalization → Pearson correlation → XGBoost feature importance → Kernel PCA dimensionality reduction.
Data cleaning (and feature engineering ← this is jargon)
- Always normalize both inputs and outputs. Original art by Kastan Day at KastanDay/wascally_wabbit (github.com)
Easy parallelism in Python
- HPC: Parsl, funcX: Federated Function as a Service
- Commercial or Cloud: Ray.io
Serving
- Gradio & HF Spaces & Streamlit & PyDoc
- Data and Learning Hub for Science (research soft.) Dan Katz.
- Triton, TensorRT and ONNX. NVIDIA Triton Inference Server
Distributed training
XGBoost - Dask.
LightGBM - Dask or Spark.
Horovod.
PyTorch DDP (PyTorch lightning) Speed Up Model Training — PyTorch Lightning 1.7.0dev documentation
General idea: Pin certain layers to certain devices. Simple cases aren’t too bad in theory, but require a fair bit of specific knowledge about the model in question.
Flavors of Parallelism
- Easy: XGBoost or LightGBM. Python code: Dask, Ray, Parsl.
- Medium: Data parallelism in Horovod, DeepSpeed, PyTorch DDP. GPU operations with Nvidia RAPIDS.
- Hard: model parallelism. Must-read resource: Model Parallelism (huggingface.co)
- My research: FS-DDP, DeepSpeed, pipeline parallelism, tensor parallelism, distributed-all-reduce, etc.
- Glossary
- DDP — Distributed Data Parallel
- PP - Pipeline Parallel (DeepSpeed)
- TP - Tensor Parallel
- VP - Voting parallel (usually decision tree async updates, e.g. LightGBM)
- MP - Model Parallel (Model sharding, and pinning layers to devices)
- FS-DDP - Fully Sharded Distributed Data Parallel
Fine-tune on out-of-distribution examples?
- TBD: What's the best way to fine-tune?
- TBD: How do you monitor if your model is experiencing domain shift while in production? WandB Alerts is my best idea.
- Use Fast.ai with your PyTorch or TensorFlow model, I think.
- A motivating example: the Permafrost Discovery Gateway has a great classifier for satellite images from Alaska, but needs to adapt it to a slightly different domain. How can we best fine-tune the existing model?
MLOps
- WandB.ai — First class tool during model development & data pre-processing.
- Spell
- https://github.com/allegroai/clearml
- MLOps: What It Is, Why It Matters, and How to Implement It - neptune.ai
- The Framework Way is the Best Way: the pitfalls of MLOps and how to avoid them | ZenML Blog
HPC resources at UIUC
- NCSA Large: Delta (and Cerebras). External, but friendly: XSEDE (Bridges-2).
- NCSA Small: Nano, Kingfisher, HAL (ppc64le).
- NCSA Modern: DGX, and Arm-based with two A100 (40G) (via Hal-login3).
Environments on HPC
- module load <TAB><TAB> — discover preinstalled environments.
- Apptainer (previously called Singularity): Docker for HPC; requires few permissions.
- Write DOCKERFILEs for HPC, syntax here.
- Globus file transfer — my favorite. Wonderfully robust, parallel, lots of logging.
- Towards the perfect command-line file transfer: xargs | rsync.
- Use xargs to parallelize rsync for file transfer and sync (NCSA wiki resource); see also another 3rd-party blog on the same.
Rsync essential reference:
# My go-to command. Syntax is like scp.
rsync -azP source destination
# Flags explained:
# -a is like scp's -r, but it also preserves metadata and symlinks.
# -z = compression (more CPU usage, less network traffic).
# -P combines --progress and --partial; it enables resuming.
# To truly keep in sync, add the delete option.
rsync -a --delete source destination
# Create backups.
rsync -a --delete --backup --backup-dir=/path/to/backups /path/to/source destination
# Good flags:
# --exclude=pattern_to_exclude
# -n = dry run; don't actually do it, just print what WOULD have happened.
Conda Best Practices
When sharing Conda envs: Consider, are you sharing with others or using Conda in Docker?
Adding the --from-history flag exports only the packages you explicitly installed while using conda. It will NOT include pip packages or anything installed via apt/yum/brew.
# 1. cross platform conda envs (my go-to)
conda env export --from-history > environment.yml # main yaml
conda env export > pip_env.yml # just for pip
Then, manually copy ONLY the pip section from pip_env.yml into environment.yml.
conda env create -f environment.yml # usage
# 2. for dockerfiles & exact replication on identical hardware
conda list --explicit > spec-file.txt
conda create --name myenv --file spec-file.txt # usage
Install complex dependencies with Conda: specific versions of cuda, gcc, and more!
Cuda Toolkit :: Anaconda.org — check the "labels" tab for more versions! It works like Docker labels; you can pull whatever version you need.
Note: the same packages are distributed by multiple "channels" (the -c flag). It can be messy finding the right channel, so definitely do some googling to find compatibilities.
# Check Cuda Version
$ nvcc --version
$ cat /usr/local/cuda/version.txt  # always check here
# Install Cuda
conda install -c nvidia/label/cuda-11.3.1 cuda-toolkit # "All" necessary cuda tools
conda install -c nvidia/label/cuda-11.3.1 cuda-nvcc # "NVidia Cuda Compiler"
Conda vs Mamba
Mamba is a faster drop-in replacement to Conda — it has 100% identical syntax.
But Mamba is worse than Conda at resolving dependencies. At least it is conservative: it will never mess up your environment; it will just fail.
Therefore, I recommend running mamba install first; if you get a "cannot resolve dependencies" error, fall back to conda install for more power at the cost of speed. If you have to pick one, conda is strictly more capable.
Cheap compute
The benefit: sudo access on modern hardware and clean environments. That's perfect for when dependency-hell makes you want to scream, especially when dealing with outdated HPC libraries.
- Google Colab (free or paid)
- Kaggle kernels (free)
- LambdaLabs (my favorite for cheap GPU)
- DataCrunch.io (my favorite for cheap GPUs, especially top-of-the-line A100 80G cards)
- Grid.ai (From the creator of PyTorch Lightning)
- PaperSpace Gradient
- GCP and Azure — lots of free credits floating around.
- Azure is one of few that have 8x80GB A100 systems. For ~$38/hr. Still, sometimes you may need that.
Best AI Courses
- Practical Deep Learning for Coders (Fast.ai) — one of the fastest ways to pick up practical deep learning.
- Dive into Deep Learning (d2l.ai) — good for concise, topic-specific references.
New topics
Streaming data for ML inference
- Event listeners...
- Data + AI Summit 2021 Agenda - Databricks
- Domain drift, explainable AI, dataset versioning (need to motivate; include in hyperparameter search).
- Apache Iceberg - ETL & high perf.
- Project Nessie: Transactional Catalog for Data Lakes with Git-like semantics
- Like Git for data.
Explainability tools:
- SHAP
- ELI5
- XGBoost
- LightGBM: https://github.com/microsoft/LightGBM
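A model-agnostic baseline behind several of these tools is permutation importance: shuffle one feature's column and measure how much the score drops. A stdlib sketch with a toy model and data (real use: sklearn.inspection.permutation_importance, or SHAP for per-prediction attributions):

```python
import random

def score(model, X, y):
    """Toy accuracy: fraction of predictions that match the labels."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature_idx, seed=0):
    base = score(model, X, y)
    column = [row[feature_idx] for row in X]
    random.Random(seed).shuffle(column)  # break the feature/label link
    X_shuffled = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                  for row, v in zip(X, column)]
    return base - score(model, X_shuffled, y)  # big drop = important feature

# A toy "model" that only ever looks at feature 0.
model = lambda row: int(row[0] > 0.5)
X = [[0.1, 9.0], [0.9, 1.0], [0.2, 7.0], [0.8, 2.0]]
y = [0, 1, 0, 1]

print(permutation_importance(model, X, y, 0))  # accuracy drop; larger = more important
print(permutation_importance(model, X, y, 1))  # 0.0: feature 1 is ignored
```

In practice you would repeat the shuffle several times and average, since a single permutation is noisy.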
Using GPUs for Speeding up ML (Vismayak Mohanarajan)
Rapids - cuDF and cuML
Colab Page - https://colab.research.google.com/drive/1bzL-mhGNvh7PF_MzsSgMmw9TQjyP6DCe?usp=sharing
ML Pathways
...