[This document is under construction]


Document History

Version: 0.1.0
Contributors:
  • Benjamin Galewski
  • Kastan Day
  • Minu Mathew
  • Sandeep Puthanveetil Satheesan
  • Todd Nicholson
  • Vismayak Mohanarajan
  • Volodymyr Kindratenko
Major Changes: First draft version
Date Updated:

 





Contributors

The following contributors (listed in alphabetical order by first name) contributed to this document by writing sections, sharing ideas, and participating in discussions.

  • Benjamin Galewski
  • Kastan Day
  • Minu Mathew
  • Sandeep Puthanveetil Satheesan
  • Todd Nicholson
  • Vismayak Mohanarajan
  • Volodymyr Kindratenko

Introduction

The Hands-on Machine Learning Study Materials for Research Software Engineers Focus Group was formed to share study materials and related resources that help Research Software Engineers at different Machine Learning (ML) skill levels build those skills. The following were some of the primary goals of this focus group:

  1. Come up with a set of good hands-on study materials that Research Software Engineers can use to develop and/or improve ML skills
  2. Include materials that are useful for beginners and people with intermediate skills in ML
  3. Gather documentation on ML models that generally work for different problem areas or are based on some parameters (e.g., amount of training data for supervised learning)
  4. Collate and adapt the collected materials if possible
  5. Document the collected materials/URLs and categorize them (based on the focus group's criteria)
  6. Choose different areas within ML to focus on:
    1. Traditional Machine Learning
    2. Deep Learning - Text Analysis
    3. ML Operations and relevant services
  7. Write some code examples that can be shared (e.g., Jupyter Notebooks)
  8. Collect documentation on existing NCSA hardware for ML (e.g., HAL, Delta)

This working document is the Focus Group's report, containing the study materials and other learning resources collected by the Focus Group members, organized into different sections and subsections. This document is not an exhaustive survey of the available study materials, and we do not claim that it lists all available study materials or resources. The Focus Group met every two weeks, discussed the materials collected up to that point, and documented them here.

Traditional Machine Learning

ML is a branch of Artificial Intelligence (AI) that extracts information from data without relying on an explicit set of instructions for how to process that data. Instead, ML algorithms use mathematical models to represent the structure of the data, which are then used to make predictions on future data.

ML algorithms can be broadly classified into Traditional Machine Learning and Deep Learning (DL), and both can be further classified into Supervised or Semi-Supervised Learning, Unsupervised Learning, and Reinforcement Learning. DL uses Artificial Neural Networks (ANNs) to learn data representations, while Traditional ML techniques use non-ANN-based approaches. Supervised or Semi-Supervised Learning uses pre-labeled data to provide "examples" from which the ML algorithm can learn. In Unsupervised Learning, the ML algorithms are provided with unlabelled data. In Reinforcement Learning, a feedback loop tells the ML algorithm how well it performs on any given data item; this feedback loop continually improves the system as more data becomes available.
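To make the supervised-learning case concrete, here is a minimal scikit-learn sketch (the dataset and model are our own illustrative choices, not ones drawn from the collected materials): it fits a traditional, non-ANN model on pre-labeled data and predicts on data it has not seen.

# Minimal supervised-learning sketch using scikit-learn's bundled Iris dataset.
# The dataset and model are illustrative choices, not recommendations.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                       # pre-labeled data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)                             # learn from labeled examples
predictions = model.predict(X_test)                     # predict on unseen data
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")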

Introductory Courses/Blogs

Deep Learning - Text Analysis

Curator: Minu Mathew  (minum@illinois.edu)

Introduction

Natural Language Processing (NLP) is broadly defined as software's automatic manipulation of natural language, like speech and text. The study of natural language processing has been around for more than 50 years and has grown out of the field of linguistics with the rise of computers.

Common resources 

  1. Approaching (Almost) Any Machine Learning Problem  
    • More code with a bit of theory. 
    • To the point and not that elaborate
    • Details the code used for most practical ML tasks
  2. EugeneYan AppliedML
    • A good set of relevant and recent papers on various ML topics.
  3. HuggingFace
    • A good resource for anything NLP
    • Get pre-trained models and source code for the most well-known problems.
  4. Stanford Deep Learning course
    • Theoretical and math-heavy. Delves into loss functions, activation functions, representations, and word embeddings.

Basic NLP

Natural language has little inherent structure, but computers work best with some structure. Basic NLP techniques therefore try to introduce some structure into text so that patterns can be found. The Python re library is commonly used for this.

Regular Expressions

This is the most basic form of text manipulation. Good for quick string comparisons and transformations.
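A couple of lines with Python's built-in re module show the kind of quick comparison and transformation meant here (the pattern and sample text are invented for illustration):

# Quick regular-expression examples using Python's built-in re module.
import re

text = "Contact us at help@example.org or support@example.org by 2023-01-15."

# Find all email addresses (simplified pattern for illustration).
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(emails)            # ['help@example.org', 'support@example.org']

# Normalize dates from YYYY-MM-DD to DD/MM/YYYY.
print(re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text))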

Data Preparation

Tokenization, normalization, and stemming are methods that add some structure to text. NLTK (a Python package) is commonly used for these methods.

Check out this blog for an example of using NLTK for data preparation.
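A rough sketch of those three steps with NLTK might look like the following (assumes the nltk package is installed and its punkt and stopwords data have been downloaded):

# Tokenization, normalization, and stemming with NLTK.
# Assumes: pip install nltk, plus the one-time downloads below.
import nltk
nltk.download("punkt")
nltk.download("stopwords")

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The researchers are running several experiments on textual data."
tokens = word_tokenize(text.lower())                        # tokenize + lowercase (normalization)
tokens = [t for t in tokens if t.isalpha()
          and t not in stopwords.words("english")]          # drop punctuation and stop words
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]                   # stemming
print(stems)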

Dimensionality Reduction

In essence, this technique captures the most important structure of the text. It converts data from a high-dimensional to a low-dimensional space by preserving only the most important vectors (eigenvectors), removing highly correlated dimensions and collapsing them into fewer dimensions. Check out SVD (Singular Value Decomposition) for the math behind it.
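As a hedged sketch of the idea, scikit-learn's TruncatedSVD (the classic latent-semantic-analysis setup) reduces a high-dimensional TF-IDF matrix to a handful of components; the toy corpus and the number of components are arbitrary:

# Reduce a high-dimensional TF-IDF matrix to a few latent dimensions via truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "machine learning extracts patterns from data",
    "deep learning uses neural networks",
    "regular expressions manipulate raw text",
]
X = TfidfVectorizer().fit_transform(corpus)   # high-dimensional sparse matrix
svd = TruncatedSVD(n_components=2)            # keep only the 2 strongest components
X_reduced = svd.fit_transform(X)
print(X.shape, "->", X_reduced.shape)         # (3, vocabulary_size) -> (3, 2)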

Text to Numeric

After cleaning, the next step is to convert text to numeric form (i.e., vectors/matrices). This blog post explains the process with code. The methods below can be used to convert text to numeric form.

  1. Vocab count / Bag of Words (BoW)
    1. The simplest technique of counting all words and giving indexes for each.
    2. No contextual information is preserved. Context is critical in language, and this key part is lost when employing this technique.
    3. Use CountVectorizer from sklearn or TF-IDF (better); a short sketch using both appears after this list
    4. Remove stop words
  2. One-hot encoding
    1. Each word is represented as an n-dimensional vector where n is the total number of words. The index of the particular word will have a value of 1, and the rest of the index values are 0.
    2. Context is lost. 
    3. Easy to manipulate and process because of 1s and 0s.
  3. Frequency count
    1. The frequency of each word is preserved, along with if the word is present or not in a sentence.
    2. Context is lost, but the frequency is preserved.
    3. The idea here is that the more frequent a word is, the less significant it is.
  4. Term-Frequency Inverse-Document Frequency (TF-IDF)
    1. This is the most used form of vectorization in simple NLP tasks. It uses the word frequency in each document and across documents.
    2. No contextual information is preserved. But word importance is highlighted in this method.
    3. This method provides good results for topic classification and spam filtering (identifying spam words).
    4. Blog on BoW and TF-IDF
  5. Word Embeddings: preserve contextual information. Get the semantics of a word.
    1. Resource on implementation of various embeddings
    2. Learn word embeddings using n-grams (PyTorch and Keras). Here the word embeddings are learned during the training phase, so the embeddings are specific to the training set. (Considers text sequences.)
    3. Word2Vec (pre-trained word embeddings from Google) - Based on word distributions and local context (window size). (considers text sequences)
    4. GloVe (pre-trained, from Stanford) - based on global context (considers text sequences)
    5. BERT embeddings (an advanced technique using the transformer architecture)
    6. GPT-3 embeddings
  6. Using word embeddings:
    1. Learn it (not recommended)
    2. Reuse it (recommended - although check what dataset the embeddings have been trained on)
    3. Reuse + fine-tune (recommended)
      1. Fine-tuning usually means keeping the lower/first few layers with the same weights as the pre-trained model. Freeze these lower layers during the training phase.
      2. The final few layers (usually the last 3-4) are trained with the dataset. That way, the weights of the final layers are learned specific to the task/data at hand.
      3. The lower layers carry rich general knowledge from being trained on a huge and varied dataset (in the pre-trained model), while the last/final layers leading to the output have weights tuned to the specific task.
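To make items 1 (Bag of Words) and 4 (TF-IDF) above concrete, here is a minimal sketch contrasting a plain bag-of-words count with TF-IDF using scikit-learn (the toy corpus is invented for illustration):

# Bag-of-Words vs. TF-IDF on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

bow = CountVectorizer(stop_words="english")        # raw counts, stop words removed
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

tfidf = TfidfVectorizer(stop_words="english")      # counts reweighted by inverse document frequency
print(tfidf.fit_transform(corpus).toarray().round(2))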

Models

Recurrent Neural Networks (RNN)

RNNs are the most natural fit for text analysis. The appearance of one word depends on the previous words in the text, so whatever the current word implies (or carries information about) depends on the previous words. This is the principle of RNNs: the current node's weight depends on the current word and the previous weights (auto-regressive models). The problem is that the weights decay for longer sentences. This blog gives an introduction to RNNs; a minimal PyTorch sketch appears after the resource list below.

    1. LSTM (Long short-term memory): An RNN but with direct links from the current node to another node in the forward path
    2. Bi-LSTM: Bi-directional LSTM. Weights of nodes are propagated both ways.
    3. GRU (Gated Recurrent Unit): RNNs but with gates that connect/disconnect the nodes and control information flow.
    4. Resources :
      1. Blog post on RNNs by Andrej Karpathy
      2. Resource on all 3 RNNs
      3. Stanford Lecture video on RNNs - heavy on math, but great to have the fundamentals + models right.
      4. Paper on RNNs - a very detailed (and lengthy) paper on RNNs, its foundations, methods, architecture, why it works, and in which scenarios it doesn't.
      5. Article covering RNNs, CNNs, and attention mechanism - Theoretical. A good read to understand concepts.
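As referenced above, here is a minimal PyTorch sketch of an LSTM-based text classifier; the vocabulary size, dimensions, and random input batch are placeholder values, and no training loop is shown:

# Minimal PyTorch LSTM text classifier (shapes only; no training loop).
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)       # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)       # hidden: (1, batch, hidden_dim)
        return self.fc(hidden[-1])                 # (batch, num_classes)

model = LSTMClassifier()
dummy_batch = torch.randint(0, 5000, (8, 20))      # 8 sequences of 20 token ids
print(model(dummy_batch).shape)                    # torch.Size([8, 2])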

Convolutional Neural Networks (CNN)

    1. Although mostly used for computer vision tasks, convolutional neural networks can also be used for NLP / sentence analysis. They are mostly used for classification tasks (rather than sentence generation or other complex language tasks); a minimal sketch follows this list.
    2. Simple feed-forward NN for classification, sentiment analysis
    3. Paper on using CNN for spam detection
    4. Paper and code on CNN for sentiment classification
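As referenced in item 1, a minimal PyTorch sketch of a 1-D CNN text classifier might look like this (all sizes are illustrative):

# Minimal 1-D CNN over word embeddings for text classification.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, num_filters=100, kernel_size=3, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):                          # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)      # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))                       # (batch, num_filters, seq_len - k + 1)
        x = x.max(dim=2).values                            # max-pool over time
        return self.fc(x)

model = TextCNN()
dummy_batch = torch.randint(0, 5000, (8, 20))              # 8 sequences of 20 token ids
print(model(dummy_batch).shape)                            # torch.Size([8, 2])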

Attention Mechanism

    1. This mechanism lays the foundation of the transformer models used in all current state-of-the-art (SOTA) models; a minimal sketch of scaled dot-product attention follows this list.
    2. Paper - this paper is well-written and has a large number of citations.
    3. A good blog post explaining the concept
    4. Blog post with example code
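As referenced above, here is a minimal PyTorch sketch of scaled dot-product attention, the core operation of the paper; a real transformer adds multiple heads, masking, and learned projections:

# Scaled dot-product attention (core operation of transformer models).
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5    # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)                     # attention weights sum to 1 over keys
    return weights @ value                                  # weighted sum of values

q = k = v = torch.randn(1, 5, 16)      # (batch, seq_len, d_model) -- self-attention
print(scaled_dot_product_attention(q, k, v).shape)          # torch.Size([1, 5, 16])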

Transformer Architecture

    1. Model architecture with an encoder-decoder structure. The auto-encoder model structure differs greatly from the sequence-to-sequence (auto-regressive) models. 
    2. https://www.kaggle.com/code/dschettler8845/transformers-course-chapter-1-tf-torch/notebook
    3. Hugging Face course
    4. Good blog post on transformer architecture

BERT

    1. The introduction of the BERT model is commonly termed NLP’s ImageNet moment, as it was open-sourced and achieved astounding results.
    2. Auto-encoding language model (does not use auto-regressive techniques like in RNNs)
    3. Open-source model released by Google
      1. Source code and pre-trained models available
    4. BERT paper  - well-written and a good read.
    5. Jay Alammar's blog post on BERT
      1. Explains how BERT works and its differences from other models.
      2. Illustrative theory.
    6. BERT in practice (using Colab, Hugging Face, and very simple code)
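In the spirit of item 6, a minimal Hugging Face example might look like the following (the model name and masked sentence are illustrative; requires the transformers package and an internet connection to download the pre-trained weights):

# Fill-mask with a pre-trained BERT model via the Hugging Face pipeline API.
# Downloads several hundred MB of weights on first run.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("Research software engineers write [MASK] for scientists."):
    print(prediction["token_str"], round(prediction["score"], 3))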

GPT-2

    1. This model, released by OpenAI, was much more mature than BERT. OpenAI initially did not release the fully trained model, fearing malicious use.
    2. Blog post by OpenAI 
      1. Details the crux of the model and the experiment results
      2. Links to the paper and other technical blog posts on transformers.
    3. Jay Alammar Blog post with illustrations
      1. Compares BERT with GPT-2
    4. Open source code

XL-Net

    1. Model released by researchers from Carnegie Mellon University and Google Brain. Paper link.
    2. Uses auto-regressive techniques within an architecture similar to BERT.
    3. Blog post on what makes XLNet better than BERT
    4. In general, though, BERT and GPT-3 work better.

GPT-3

    1. This model forms the basis of many recent models. Pre-trained models are accessible through OpenAI's API (the trained weights themselves were not open-sourced).
    2. State-of-the-art (SOTA) for now
    3. Jay Alammar Blog post explaining GPT-3 in simple terms with illustrations
      1. Very simple theory. No code.
      2. Reading about GPT-2 first is recommended.
    4. Blog post describing various NLP tasks and GPT-3's achievements on them
      1. Theoretical read.
      2. Talks about NLP benchmarks like GLUE, BLEU, and SQuAD

New Topics / Cutting-edge

  • HuggingFace BLOOM: the largest open model trained on multiple languages (46 languages); blog post


ML Ops (Kastan Day, Todd Nicholson, Benjamin Galewsky)

View with nicer formatting! https://kastanday.notion.site/Round-Table-ML-Ops-NCSA-27e4b6efc4cb410f8fa58ab2583340d9 

Round Table Discussion

View here: Round Table Discussion May 31, 2022 - NCSA Software Wiki

Hosted by Kastan Day, learn more about me on KastanDay.com

Round Table Objectives

Leaving this talk, you should have two things:

  • Context: the parts of ML project/lifecycle.
  • Tools: A list of the best tools for the job.

My goal:

  • Plain language.
  • Offer awareness, so you can know what to look for online.

Round table high-priority topics:

  • Pre-trained model zoos
  • Dev environments on HPC (Docker/Singularity/Apptainer/Conda)

ML-Ops Outline of Big Ideas

  1. Model selection

    1. Structured vs. unstructured
    2. Pre-trained models (called Model Zoos, Model hub, or model garden): PyTorch, PyTorch Lightning, TF, TF garden, HuggingFace, PapersWCode, https://github.com/openvinotoolkit/open_model_zoo
      1. B-tier: https://github.com/collections/ai-model-zoos, CoreML (iOS), Android, web/JS, ONNX zoo — largest quantity, hit-or-miss quality
      2. Fastest to use is SkLearn (AutoML).
      3. PyTorch Lightning.
      4. FastAI
      5. XGBoost & LightGBM
    3. For measuring success - F-1 scores (which can be weighted).
  2. Data pipelines

    1. Luigi, Airflow, Ray, Snake(?), Spark.
    2. Globus, APIs, S3 buckets, HPC resources.
    3. Configuring and running Large ML training jobs on Delta.
    4. Normal: Pandas, Numpy
    5. Big:
      1. Spark (PySpark)
      2. Dask - distributed Pandas and NumPy
      3. XArray
      4. Rapids
        1. cuDF - CUDA dataframes
        2. Dask cuDF - distributed data frames (for data that can’t fit in one GPU’s memory).
      5. Rapids w/Dask (cuDF) - distributed, on-GPU calculations. Blog, reading large CSVs.
      Data cleaning (and feature engineering ← this is jargon)

    1. Key idea: make data as info-dense as possible.
    2. Limit correlation between input variables (Pearson or Chi-squared) — this is filter-based; you can also do permutation-based importance.
    3. Common workflow: Normalization → Pearson correlation → XGBoost feature importance → Kernel PCA dimensionality reduction (a rough sketch follows this list).
    4. Always normalize both inputs and outputs. Original art by Kastan Day at KastanDay/wascally_wabbit (github.com)
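As referenced in the common workflow above, a rough sketch of that pipeline might look like the following (random placeholder data; xgboost is a separate install, and the thresholds and component counts are arbitrary):

# Sketch of: normalize -> drop highly correlated features ->
# rank features with XGBoost -> reduce dimensionality with Kernel PCA.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA
from xgboost import XGBClassifier

X = pd.DataFrame(np.random.rand(200, 10), columns=[f"f{i}" for i in range(10)])
y = np.random.randint(0, 2, 200)

X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)   # normalization

corr = X_scaled.corr().abs()                        # Pearson correlation between inputs
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_filtered = X_scaled.drop(columns=to_drop)         # drop one of each highly correlated pair

importances = XGBClassifier(n_estimators=50).fit(X_filtered, y).feature_importances_
top = X_filtered.columns[np.argsort(importances)[-5:]]   # keep the 5 most important features

X_reduced = KernelPCA(n_components=3, kernel="rbf").fit_transform(X_filtered[top])
print(X_reduced.shape)                              # (200, 3)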
  3. Easy parallelism in Python

    1. HPC: Parsl, funcX: Federated Function as a Service
    2. Commercial or Cloud: Ray.io
  4. Serving

    1. Gradio & HF Spaces & Streamlit & PyDoc
    2. Data and Learning Hub for Science (research soft.) Dan Katz.
    3. Triton, TensorRT and ONNX. NVIDIA Triton Inference Server
  5. Distributed training

    1. XGBoost - Dask.

    2. LightGBM - Dask or Spark.

    3. Horovod.

    4. PyTorch DDP (PyTorch lightning) Speed Up Model Training — PyTorch Lightning 1.7.0dev documentation

    5. General idea: Pin certain layers to certain devices. Simple cases aren’t too bad in theory but require a fair bit of specific knowledge about the model in question.

    6. Flavors of Parallelism

      1. Easy: XGBoost or LightGBM. Python code: Dask, Ray, Parsl.
      2. Medium: Data parallelism in Horovod, DeepSpeed, PyTorch DDP. GPU operations with Nvidia RAPIDS.
      3. Hard: model parallelism. Must-read resource: Model Parallelism (Hugging Face)
      4. My research: FS-DDP, DeepSpeed, pipeline parallelism, tensor parallelism, distributed-all-reduce, etc.
      5. Glossary
        1. DDP — Distributed Data Parallel
        2. PP - Pipeline Parallel (DeepSpeed)
        3. TP - Tensor Parallel
        4. VP - Voting parallel (usually decision tree async updates, e.g., LightGBM)
        5. MP - Model Parallel (Model sharding and pinning layers to devices)
        6. FS-DDP - Fully Sharded Distributed Data Parallel
  6. Fine-tune on out-of-distribution examples?

    1. TBD: What's the best way to fine-tune? 
    2. TBD: How do you monitor if your model is experiencing domain shift while in production? WandB Alerts is my best idea.
    3. Use Fast.ai w/ your PT or TF model, I think.
    4. A motivating example: the Permafrost Discovery Gateway has a great classifier for satellite images from one region but needs to adjust it for a slightly different region. How can we best fine-tune our existing model to this slightly different domain?
  7. MLOps

    1. WandB.ai — Highly recommended. First-class tool during model development & data pre-processing. 
    2. Spell
    3. ClearML
    4. MLOps: What It Is, Why It Matters, and How to Implement It - neptune.ai
    5. The Framework Way is the Best Way: the pitfalls of MLOps and how to avoid them | ZenML Blog
  8. HPC resources at UIUC

    1. NCSA Large: Delta (and Cerebras). External but friendly: XSEDE (Bridges2).
    2. NCSA Small: Nano, Kingfisher, HAL (ppc64le).
    3. NCSA Modern: DGX, and an Arm-based system with two A100 (40 GB) GPUs (via Hal-login3).
  9. Environments on HPC

    1. module load <TAB><TAB> — discover preinstalled environments
    2. Apptainer (previously called Singularity): Docker for HPC, requires few permissions.
      1. Write Dockerfile-like definition files for HPC; syntax here.
    3. Globus file transfer — my favorite. Wonderfully robust, parallel, and has lots of logging.
    4. Towards the perfect command-line file transfer: Xargs | Rsync — use xargs to parallelize rsync for file transfer and sync (NCSA wiki resource), and another 3rd-party blog.

Cheapest GPU cloud compute

The benefit: sudo access to modern hardware and clean environments. That's perfect for when dependency hell makes you want to scream, especially when dealing with outdated HPC libraries.

  • Google Colab (free or paid)
  • Kaggle kernels (free)
  • LambdaLabs (my favorite for cheap GPU)
  • DataCrunch.io (my favorite for cheap GPU, especially top-of-the-line A100 80GB)
  • Grid.ai (From the creator of PyTorch Lightning)
  • PaperSpace Gradient
  • GCP and Azure — lots of free credits floating around.
    • Azure is one of the few that have 8x80GB A100 systems. For ~$38/hr. Still, sometimes you may need that.

New topics

Streaming data for ML inference

Domain drift, explainable AI, dataset versioning (need to motivate; include in hyperparameter search).

Explainability tools:

Using GPUs for Speeding up ML (Vismayak Mohanarajan)


Rapids - cuDF and cuML

Colab Page - https://colab.research.google.com/drive/1bzL-mhGNvh7PF_MzsSgMmw9TQjyP6DCe?usp=sharing
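A rough sketch of how the cuDF/cuML APIs mirror Pandas and scikit-learn (requires an NVIDIA GPU and a RAPIDS installation, e.g., via conda; the data and model choice are illustrative):

# cuDF mirrors the Pandas API; cuML mirrors scikit-learn, but both run on the GPU.
import cudf
from cuml.cluster import KMeans

gdf = cudf.DataFrame({"x": [1.0, 2.0, 8.0, 9.0], "y": [1.0, 1.5, 8.5, 9.0]})  # data lives on the GPU
print(gdf.describe())                     # familiar Pandas-style operations

kmeans = KMeans(n_clusters=2)
kmeans.fit(gdf)                           # clustering runs on the GPU
print(kmeans.labels_)                     # cluster assignment per row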

ML Pathways

<List of some popular ML learning pathways and a brief comment about each>

Conclusion

<A brief concluding section about the report and any future ideas>



Conda Best Practices

When sharing the Conda environment: Consider whether you are sharing with others or using Conda in Docker.

Adding the --from-history flag will export only the packages you explicitly installed while using conda. It will NOT include pip packages or anything else, like apt/yum/brew.

# 1. cross platform conda envs (my go-to)
conda env export --from-history > environment.yml   # main yaml
conda env export > pip_env.yml                      # just for pip
Then, manually copy ONLY the pip section from pip_env.yml into environment.yml.

conda env create -f environment.yml                # usage

# 2. for dockerfiles & exact replication on identical hardware
conda list --explicit > spec-file.txt

conda create --name myenv --file spec-file.txt    # usage

Install complex dependencies with Conda: specific versions of cuda, gcc and more!

Cuda Toolkit: Anaconda.org — check the “labels” tab for more versions! It works like Docker labels; you can pull whatever version you need.

Note: the same packages are distributed by multiple “channels” (the -c flag). It can be messy finding the right channel; do some googling to find compatibilities.

# Check Cuda Version
$ nvcc --version
$ cat /usr/local/cuda/version.txt   # always check here

# Install Cuda
conda install -c nvidia/label/cuda-11.3.1 cuda-toolkit # "All" necessary cuda tools
conda install -c nvidia/label/cuda-11.3.1 cuda-nvcc    # "NVidia Cuda Compiler"

Conda vs. Mamba

Mamba is a faster drop-in replacement for Conda, with 100% identical syntax.

However, Mamba is strictly worse than Conda at resolving dependencies. At least it is conservative and will never mess up your environment; it will just fail.

Therefore, I recommend running mamba install first, and if you get the error “cannot resolve dependencies,” then try conda install for more power at the cost of speed. If you have to pick one, conda is strictly more capable.


Rsync Best Practices

Rsync syntax is modeled after scp. Here is my favorite usage.


# My go-to command:
rsync -azP source destination

# flags explained
# -a is like scp's `-r`, but it also preserves metadata and symlinks.
# -z = compression (more CPU usage, less network traffic)
# -P combines the flags --progress and --partial. It enables resuming.

# to truly keep in sync, add delete option 
rsync -a --delete source destination

# create backups 
rsync -a --delete --backup --backup-dir=/path/to/backups /path/to/source destination

# Good flags 
--exclude=pattern_to_exclude
-n = dry run, don't actually do it. just print what WOULD have happened.

References


