
[This document is under construction]

Document History

Version | Contributors | Major Changes | Date Updated








Contributors

Provide a list of contributors who have contributed to this document either by writing sections or by sharing ideas and participating in discussions.

List of contributors in the alphabetical order of their first names:

  • Benjamin Galewsky
  • Kastan Day
  • Minu Mathew
  • Sandeep Puthanveetil Satheesan
  • Todd Nicholson
  • Vismayak Mohanarajan
  • Volodymyr Kindratenko

Introduction

Provide a brief introduction to this document, the goals, what this isn't, and the process used by the focus group to develop this document.

The Hands-on Machine Learning Study Materials for Research Software Engineers Focus Group was formed to share study materials and other related resources that will be useful for interested Research Software Engineers at different skill levels in Machine Learning (ML) expertise to learn ML skills. These are some of the goals of this focus group:

  1. Come up with a set of good hands-on study materials that Research Software Engineers can use to develop and/or improve Machine Learning (ML) skills
  2. These materials should include ones that are useful for beginners and people with intermediate skills in ML
  3. Gather documentation on ML models that generally work for different problem areas or are based on some parameters (e.g., amount of training data for supervised learning)
  4. Collate and adapt the collected materials if possible

  5. Document the collected materials / URLs and categorize them (based on the focus group's criteria)
  6. Choose different areas within ML to focus on:
    1. Traditional Machine Learning
    2. Deep Learning - Text Analysis
    3. ML Operations and relevant services
  7. Write some code examples that can be shared (e.g. Jupyter Notebooks)
  8. Collect documentation on existing NCSA hardware for ML (HAL, Delta)

This working document is the Focus Group's report. It contains the study materials and other learning resources collected by the Focus Group members, organized into sections and subsections. It is not an exhaustive survey, and we do not claim that it lists all available study materials or resources. The Focus Group met every two weeks, discussed the materials collected up to that point, and documented them here.

Traditional Machine Learning (Minu Mathew,  Sandeep Puthanveetil Satheesan)

Provide a brief introduction to machine learning and list major areas within machine learning with short descriptions.

Machine Learning (ML) is a branch of Artificial Intelligence (AI) that extracts information from data without being given an explicit set of instructions on how to process it. Instead, ML algorithms use mathematical models to represent the structure of the data, and these models are then used to make predictions on future data.

ML algorithms can be broadly classified into Traditional Machine Learning and Deep Learning (DL). Each of these is in turn classified into Supervised or Semi-Supervised Learning, Unsupervised Learning, and Reinforcement Learning. DL uses Artificial Neural Networks (ANNs) to learn the representation of the data, while traditional ML techniques use non-ANN-based frameworks. Supervised Learning uses pre-labeled data as "examples" for the ML algorithm to learn from; Semi-Supervised Learning combines a small amount of labeled data with a larger amount of unlabeled data. In Unsupervised Learning, the ML algorithms are given unlabeled data to work with. In Reinforcement Learning, a feedback loop tells the ML algorithm how well it is performing on any given data item, and this feedback improves the system as more data becomes available.

Introductory Courses/Blogs

Deep Learning - Text Analysis (Minu Mathew)

Common resources:

1. Approaching (Almost) Any Machine Learning Problem  

  • More code with a bit of theory. 
  • To the point. (Not elaborate)
  • Details the code used for most ML tasks

2. EugeneYan AppliedML

  • A good set of relevant and recent papers on various ML topics.

3. HuggingFace

  • A good resource for anything NLP
  • Get pre-trained models, source-code for most well-known problems.

4.  Stanford Deep Learning course

  • Theoretical and math heavy. Delves into loss functions, activation functions, representations, and word embeddings.

5. Article covering RNNs, CNNs and attention mechanism.

  •  Theoretical. A good read to understand concepts.


Natural language has no inherent structure, whereas computers work best with structured input. The first step is therefore to introduce some structure.

Regular Expressions:

Good for quick string comparisons, transformations.
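
As a small illustration of the quick comparisons and transformations meant here, the sketch below uses Python's standard `re` module. The sample text and the patterns are made up for this example; the e-mail pattern in particular is deliberately simple and is not a complete validator.

```python
import re

# Made-up sample text with irregular spacing.
text = "Contact  us at help@example.org   or visit https://example.org/docs !"

# Transformation: collapse runs of whitespace into a single space.
normalized = re.sub(r"\s+", " ", text).strip()

# Extraction: a deliberately simple e-mail pattern (an assumption for this sketch).
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", normalized)

# Comparison: case-insensitive check for a whole-word keyword.
mentions_docs = re.search(r"\bdocs\b", normalized, flags=re.IGNORECASE) is not None

print(normalized)
print(emails)
print(mentions_docs)
```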

Tokenization, normalization, and stemming: methods to add some structure

NLTK (a Python package) can be used for these methods.
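
To show what each of these steps does, here is a minimal sketch that uses only the Python standard library. It is not NLTK's implementation: the suffix rules in `toy_stem` are a hypothetical toy rule set, far cruder than the Porter stemmer that NLTK ships, and the sentence is made up.

```python
import re

def tokenize(text):
    """Split text into word tokens (letters, digits and apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", text)

def normalize(tokens):
    """Lowercase every token so 'Running' and 'running' compare equal."""
    return [t.lower() for t in tokens]

def toy_stem(token):
    """Strip a few common English suffixes. A toy rule set, not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

sentence = "The cats were jumping and the dog jumped."
tokens = normalize(tokenize(sentence))
stems = [toy_stem(t) for t in tokens]
print(tokens)
print(stems)
```

After stemming, "jumping" and "jumped" map to the same token, which is the kind of structure these methods add.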

Dimensionality reduction

Capture the most important structure.

Convert a high-dimensional space into a low-dimensional one by preserving only the most important directions (eigenvectors): highly correlated dimensions are collapsed into a single dimension. See Singular Value Decomposition (SVD) for the math behind it.
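
The idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not a recommended implementation: the data points are made up, and power iteration on the covariance matrix stands in for a proper SVD routine (in practice one would typically call a library such as NumPy or scikit-learn).

```python
import math

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)] for i in range(d)]

def top_eigenvector(matrix, iterations=200):
    """Power iteration: multiply and renormalize to approach the dominant eigenvector."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Two highly correlated features: the second is roughly twice the first.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
centered = center(data)
direction = top_eigenvector(covariance(centered))

# Project each 2-D point onto the dominant direction to get a 1-D representation.
reduced = [sum(x * w for x, w in zip(row, direction)) for row in centered]
print(direction)
print(reduced)
```

Because the two features are almost perfectly correlated, a single coordinate along `direction` retains nearly all of the variation in the data.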

Methods to transform text into numeric form:

  • Vocab count / Bag of Words (BOW) - no contextual info kept
    • Use count vectorizer from sklearn or TF-IDF (better)
    • Remove stop words
  • One-hot encoding
  • Frequency count - no contextual info kept
  • TF-IDF - no contextual info kept
  • Word Embeddings : preserve contextual information. Get the semantics of a word.
    • Resource on implementation of various embeddings
    • Learn word embeddings using n-grams (PyTorch, Keras). Here the word embeddings are learned in the training phase, so the embeddings are specific to the training set.
    • Word2Vec (pre-trained word embeddings from Google) - Based on word distributions and local context (window size). 
    • GloVe (pre-trained from Stanford) - based on global context
    • BERT embeddings
    • GPT3 embeddings 
  • Using word embeddings:
    • Learn it (not recommended)
    • Reuse it (check what dataset the embeddings have been trained on)
    • Reuse + fine-tune 



Models :

  1. Recurrent Neural Network (RNN)

A natural fit for text. The appearance of one word depends on the previous words in the text, so whatever the current word implies depends on what came before it. This is the principle of RNNs: the hidden state at the current step depends on the current word and on the hidden state from the previous step. The problem is that information from early words fades over longer sentences (the vanishing gradient problem).

    1. LSTM (Long Short-Term Memory): An RNN with a separate cell state and gates that let information pass along the sequence largely unchanged, which helps with long-range dependencies.
    2. Bi-LSTM: Bi-directional LSTM. The sequence is processed in both directions, so each position sees both earlier and later context.
    3. GRU (Gated Recurrent Unit): An RNN with gates that control how much information flows from one step to the next.
    4. Resources :
      1. Blog post on RNNs by Andrej Karpathy
      2. Resource on all 3 RNNs
      3. Stanford Lecture video on RNNs - heavy on math, but great to have the fundamentals + models right.
      4. Paper on RNNs - a very detailed (and lengthy) paper on RNNs: their foundations, methods, architecture, why they work, and in which scenarios they don't.
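
The recurrence described for plain RNNs, and why long sentences are a problem, can be seen in a one-unit toy model. This is a sketch under assumptions: the weights `w_x` and `w_h` are fixed, hypothetical values rather than learned ones, and the inputs are single numbers instead of word vectors.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.5, h0=0.0):
    """Run a one-unit RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

# Two sequences that differ only in the very first input.
seq_a = [1.0] + [0.0] * 9
seq_b = [0.0] + [0.0] * 9

states_a = rnn_forward(seq_a)
states_b = rnn_forward(seq_b)

# The gap between the two runs shows how much the first input
# still influences the hidden state at each later step.
influence = [abs(a - b) for a, b in zip(states_a, states_b)]
print(influence)
```

The influence of the first input shrinks at every step, which is the fading that LSTM and GRU gates are designed to counteract.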

2. CNN : Convolutional neural networks can also be used for NLP / sentence analysis. These are mostly used for classification tasks (rather than sentence generation or other complex language tasks). 

    1. Simple feed forward NN for classification, sentiment analysis
    2. Paper on using CNN for spam detection
    3. Paper and code on CNN for sentiment classification

3. Attention mechanism : 

    1. Paper
    2. A good blog post explaining the concept
    3. Blog post with example code
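
For orientation before reading those resources, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. The query, key, and value vectors are made-up numbers; real models learn them and operate on whole matrices at once.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attention(query, keys, values)
print(weights)
print(output)
```

Keys that point in a similar direction to the query receive larger weights, so their values dominate the output.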

4. Transformer architecture: A model architecture with an encoder-decoder structure. Unlike recurrent sequence-to-sequence models, it relies on attention instead of recurrence.

    1. https://www.kaggle.com/code/dschettler8845/transformers-course-chapter-1-tf-torch/notebook
    2. hugging face course
    3. Good blog post on transformer architecture

5. BERT models

6. XLNet (by CMU and Google Brain) - BERT and GPT-3 work better in general

7. GPT-3 model:


Methods / Models for common use cases at NCSA :


Small project examples :

  1. Twitter sentiment analysis using Word2Vec and LSTM in Keras
  2.  


ML Ops (Kastan Day, Todd Nicholson, Benjamin Galewsky)

View with nicer formatting! https://kastanday.notion.site/Round-Table-ML-Ops-NCSA-27e4b6efc4cb410f8fa58ab2583340d9 

Round Table Discussion

View here: Round Table Discussion May 31, 2022 - NCSA Software Wiki

Hosted by Kastan Day, learn more about me on KastanDay.com

Opening questions

How many people have worked with AI? Raise your virtual hands. Go around the room and describe it a little.

Data engineer vs data science role ← are they on your team?

Who uses python? Conda?

AI on current projects, open questions?

Objectives

Leaving this talk you should have two things:

  • Context: the parts of ML project / lifecycle.
  • Tools: A list of the best tools for the job.

My goal:

  • Plain language.
  • Offer awareness, so you can know what to look for online.

Outline

High priority topics:

  • [ ] Pre-trained model zoos
  • [ ] Environments on HPC (Docker/Singularity/Apptainer/Conda)
  1. Model selection

    1. Structured vs unstructured
    2. Pre-trained models (called Model Zoos, Model hub, or model garden): PyTorch, PyTorch Lightning, TF, TF garden, HuggingFace, PapersWCode, https://github.com/openvinotoolkit/open_model_zoo
      1. B-tier: https://github.com/collections/ai-model-zoos, CoreML (iOS), iOS, Android, web/JS, ONNX zoo, Largest quantity hit-or-miss quality
      2. Fastest to use is SkLearn (AutoML).
      3. PyTorch Lightning.
      4. FastAI
      5. XGBoost & LightGBM
    3. For measuring success, I like F-1 scores (can be weighted).
  2. Data pipelines

    1. Luigi, Airflow, Ray, Snake(?), Spark.
    2. Globus, APIs, S3 buckets, HPC resources.
    3. Configuring and running Large ML training jobs, on Delta.
    4. Normal: Pandas, Numpy
    5. Big:
      1. Spark (PySpark)
      2. Dask
      3. XArray
      4. Dask - distributed pandas and Numpy
      5. Rapids
        1. cuDF - cuda dataframes
        2. Dask cuDF - distributed dataframe (can’t fit in one GPU’s memory).
      6. Rapids w/Dask (cudf) - distributed, on-gpu calculations. Blog, reading large CSVs.
      Data cleaning (and feature engineering ← this is jargon)

    1. Key idea: make data as info-dense as possible.
    2. Limit correlation between input variables (Pearson or Chi-squared). This is filter-based; you can also use permutation-based importance.
    3. Common workflow: Normalization → Pearson correlation → XGBoost feature importance → Kernel PCA dimensionality reduction.
    4. Always normalize both inputs and outputs. Original art by Kastan Day at KastanDay/wascally_wabbit (github.com)
  3. Easy parallelism in Python

    1. HPC: Parsl, funcX: Federated Function as a Service
    2. Commercial or Cloud: Ray.io
  4. Serving

    1. Gradio & HF Spaces & Streamlit & PyDoc
    2. Data and Learning Hub for Science (research soft.) Dan Katz.
    3. Triton, TensorRT and ONNX. NVIDIA Triton Inference Server
  5. Distributed training

    1. XGBoost - Dask.

    2. LightGBM - Dask or Spark.

    3. Horovod.

    4. PyTorch DDP (PyTorch lightning) Speed Up Model Training — PyTorch Lightning 1.7.0dev documentation

    5. General idea: Pin certain layers to certain devices. Simple cases aren’t too bad in theory, but require a fair bit of specific knowledge about the model in question.

    6. Flavors of Parallelism

      1. Easy: XGBoost or LightGBM. Python code: Dask, Ray, Parsl.
      2. Medium: Data parallelism in Horovod, DeepSpeed, PyTorch DDP. GPU operations with Nvidia RAPIDS.
      3. Hard: model parallelism. Must-read resource: Model Parallelism (huggingface.co)
      4. My research: FS-DDP, DeepSpeed, pipeline parallelism, tensor parallelism, distributed-all-reduce, etc.
      5. Glossary
        1. DDP — Distributed Data Parallel
        2. PP - Pipeline Parallel (DeepSpeed)
        3. TP - Tensor Parallel
        4. VP - Voting parallel (usually decision tree async updates, e.g. LightGBM)
        5. MP - Model Parallel (Model sharding, and pinning layers to devices)
        6. FS-DDP - Fully Sharded Distributed Data Parallel
  6. Fine-tune on out-of-distribution examples?

    1. TBD: What's the best way to fine-tune? 
    2. TBD: How do you monitor if your model is experiencing domain shift while in production? WandB Alerts is my best idea.
    3. Use Fast.ai w/ your PT or TF model, I think.
    4. A motivating example: the Permafrost Discovery Gateway has a great classifier for satellite images from Alaska, but need to adjust it for Alaska. How can we best fine-tune our existing model to this slightly different domain? 
  7. MLOps

    1. WandB.ai — First class tool during model development & data pre-processing.
    2. Spell
    3. https://github.com/allegroai/clearml
    4. MLOps: What It Is, Why It Matters, and How to Implement It - neptune.ai
    5. The Framework Way is the Best Way: the pitfalls of MLOps and how to avoid them | ZenML Blog
  8. HPC resources at UIUC

    1. NCSA Large: Delta (and Cerebras). External, but friendly: XSEDE (Bridges2).
    2. NCSA Small: Nano, Kingfisher, HAL (ppc64le).
    3. NCSA Modern: DGX, and Arm-based with two A100(40G) (via Hal-login3).
  9. Environments on HPC

    1. module load <TAB><TAB> — discover preinstalled environments
    2. Apptainer (previously called Singularity): Docker for HPC, requires few permissions.
      1. Write DOCKERFILEs for HPC, syntax here.
    3. Globus file transfer — my favorite. Wonderfully robust, parallel, lots of logging.
    4. Towards the perfect command-line file transfer: xargs | rsync. Use xargs to parallelize rsync for file transfer and sync (see the NCSA wiki resource and another third-party blog).
    Rsync essential reference
    
    # My go-to command. Syntax is like scp.
    rsync -azP source destination
    
    # Flags explained
    # -a (archive) is like scp's `-r`, but it also preserves metadata and symlinks.
    # -z enables compression (more CPU usage, less network traffic).
    # -P combines --progress and --partial; it enables resuming interrupted transfers.
    
    # To truly keep in sync, add the delete option
    # (it removes destination files that no longer exist in the source).
    rsync -a --delete source destination
    
    # Create backups
    rsync -a --delete --backup --backup-dir=/path/to/backups /path/to/source destination
    
    # Other good flags
    # --exclude=pattern_to_exclude
    # -n = dry run: don't actually do it, just print what WOULD have happened.
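
The idea of running several rsync processes in parallel can also be sketched from Python instead of xargs. This is an illustration under assumptions, not the method from the linked resources: the function names are hypothetical, it launches one rsync per top-level entry of the source directory, and it defaults to a dry run so nothing is transferred until `dry_run=False` is passed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def build_rsync_cmd(source, destination, dry_run=True):
    """Build an rsync command using the -azP flags explained above."""
    cmd = ["rsync", "-azP"]
    if dry_run:
        cmd.append("-n")  # only report what would be transferred
    cmd += [str(source), str(destination)]
    return cmd

def parallel_rsync(source_dir, destination, workers=4, dry_run=True):
    """Run one rsync per top-level entry of source_dir, several at a time."""
    entries = sorted(Path(source_dir).iterdir())
    commands = [build_rsync_cmd(entry, destination, dry_run) for entry in entries]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: subprocess.run(c, check=False), commands))
    # One return code per entry; 0 means that rsync process succeeded.
    return [r.returncode for r in results]
```

A call such as `parallel_rsync("/path/to/source", "user@host:/path/to/destination")` would then print what each rsync would copy. Threads are sufficient here because the work happens in the rsync subprocesses, not in Python.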
    

Conda Best Practices

When sharing Conda envs: Consider, are you sharing with others or using Conda in Docker?

Adding the --from-history flag exports only the packages you explicitly asked conda to install. It will NOT include pip packages or anything installed by other means, such as apt/yum/brew.

# 1. Cross-platform conda envs (my go-to)
conda env export --from-history > environment.yml   # main yaml
conda env export > pip_env.yml                      # just for pip
# Then, manually copy ONLY the pip section from pip_env.yml into environment.yml.

conda env create -f environment.yml                # usage

# 2. for dockerfiles & exact replication on identical hardware
conda list --explicit > spec-file.txt

conda create --name myenv --file spec-file.txt    # usage
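
The manual step of copying the pip section between the two exported files could be automated along the following lines. This is a hedged sketch: the helper names are hypothetical, it treats the YAML as plain text rather than using a YAML parser, and the two sample exports are made up to resemble typical `conda env export` output.

```python
def extract_pip_block(full_export):
    """Return the lines of the '- pip:' sub-list from a full `conda env export`."""
    block = []
    pip_indent = None
    for line in full_export.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if pip_indent is None:
            if stripped == "- pip:":
                pip_indent = indent
                block.append(line)
        elif stripped and indent > pip_indent:
            block.append(line)  # a package nested under '- pip:'
        else:
            break  # the pip sub-list has ended
    return block

def merge_pip_block(history_export, pip_block):
    """Insert the pip block at the end of the 'dependencies:' list."""
    if not pip_block:
        return history_export
    out, in_deps, inserted = [], False, False
    for line in history_export.splitlines():
        starts_new_key = bool(line) and not line[0].isspace() and not line.startswith("-")
        if in_deps and starts_new_key and not inserted:
            out.extend(pip_block)
            inserted = True
        if line.startswith("dependencies:"):
            in_deps = True
        out.append(line)
    if in_deps and not inserted:
        out.extend(pip_block)
    return "\n".join(out) + "\n"

# Made-up sample of `conda env export` (full) output.
full = """\
name: demo
dependencies:
  - python=3.9.7=h12debd9_1
  - pip=21.2.4=py39h06a4308_0
  - pip:
    - requests==2.26.0
    - tqdm==4.62.3
prefix: /home/user/miniconda3/envs/demo
"""

# Made-up sample of `conda env export --from-history` output.
history = """\
name: demo
dependencies:
  - python=3.9
  - pip
prefix: /home/user/miniconda3/envs/demo
"""

merged = merge_pip_block(history, extract_pip_block(full))
print(merged)
```

In practice the two strings would be read from pip_env.yml and environment.yml, and the merged text written back to environment.yml.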

Install complex dependencies with Conda: specific versions of cuda, gcc and more!

Cuda Toolkit :: Anaconda.org — check the “labels” tab for more versions! It works like Docker labels; you can pull whatever version you need.

Note: the same packages are distributed by multiple “channels” (the -c flag). Finding the right channel can be messy, so do some searching to check compatibility.

# Check Cuda Version
$ nvcc --version
$ cat /usr/local/cuda/version.txt  # always check here

# Install Cuda
conda install -c nvidia/label/cuda-11.3.1 cuda-toolkit # "All" necessary cuda tools
conda install -c nvidia/label/cuda-11.3.1 cuda-nvcc    # "NVidia Cuda Compiler"
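
When the version check needs to happen inside a script, the release number can be pulled out of the `nvcc --version` text. A minimal sketch, assuming the output contains a "release X.Y" fragment; the sample text below is illustrative rather than captured from a specific machine, and `parse_cuda_release` is a hypothetical helper name.

```python
import re

def parse_cuda_release(nvcc_output):
    """Return the CUDA release (e.g. '11.3') from `nvcc --version` text, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None

# Illustrative sample of what `nvcc --version` prints.
sample = """nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 11.3, V11.3.109
"""

release = parse_cuda_release(sample)
print(release)
```

On a real system the text would come from something like `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`, and the result can be compared against the label used in the `conda install` command.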

Conda vs Mamba

Mamba is a faster drop-in replacement for Conda with nearly identical command syntax.

In my experience, however, Mamba is worse than Conda at resolving difficult dependencies. At least it is conservative: it will not mess up your environment, it will just fail.

Therefore, I recommend running mamba install first and, if you get the error “cannot resolve dependencies,” trying conda install, which is more powerful but slower. If you have to pick one, conda is the more capable.
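
The "mamba first, conda as fallback" recommendation can be scripted. A minimal sketch, assuming both tools accept `install --yes <packages>`; the `install` function and its `run` parameter are hypothetical names, and `run` exists only so the logic can be exercised without touching a real environment.

```python
import subprocess

def install(packages, run=subprocess.run):
    """Try `mamba install` first; fall back to `conda install` if it fails."""
    for tool in ("mamba", "conda"):
        try:
            result = run([tool, "install", "--yes", *packages])
        except FileNotFoundError:
            continue  # this tool is not installed on the machine
        if result.returncode == 0:
            return tool  # report which tool succeeded
    raise RuntimeError("neither mamba nor conda could install: " + ", ".join(packages))
```

Calling `install(["numpy"])` inside an activated environment would run mamba and only invoke conda if mamba is missing or exits with an error.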

Cheap compute

The benefit: sudo access on modern hardware and clean environments. That's perfect for when dependency-hell makes you want to scream, especially when dealing with outdated HPC libraries.

  • Google Colab (free or paid)
  • Kaggle kernels (free)
  • LambdaLabs (my favorite for cheap GPU)
  • DataCrunch.io (my favorite for cheap GPU, especially top-of-the-line 80 GB A100s)
  • Grid.ai (From the creator of PyTorch Lightning)
  • PaperSpace Gradient
  • GCP and Azure — lots of free credits floating around.
    • Azure is one of the few that offer 8x80GB A100 systems, for ~$38/hr. Still, sometimes you may need that.

Best AI Courses

New topics

Streaming data for ML inference

domain drift, explainable ai, dataset versioning (need to motivate, include in hyperparam search).

Explainability tools:

Using GPUs for Speeding up ML (Vismayak Mohanarajan)


Rapids - cuDF and cuML

Colab Page - https://colab.research.google.com/drive/1bzL-mhGNvh7PF_MzsSgMmw9TQjyP6DCe?usp=sharing

ML Pathways

<List of some popular ML learning pathways and a brief comment about each>

References


