...

  • Vocab count / Bag of Words (BOW) - no contextual info kept
    • Use CountVectorizer from sklearn, or TF-IDF (better) - see the sketch after this list
    • Remove stop words
  • One-hot encoding
  • Frequency count - no contextual info kept
  • TF-IDF - no contextual info kept
  • Word Embeddings : preserve contextual information. Get the semantics of a word.
    • Resource on implementation of various embeddings
    • Learn word embeddings using n-grams (PyTorch, Keras). Here the word embeddings are learned during the training phase, so the embeddings are specific to the training set.
    • Word2Vec (pre-trained word embeddings from Google) - Based on word distributions and local context (window size). 
    • GLoVe (pre-trained from Stanford) - based on global context
    • BERT embeddings
    • GPT-3 embeddings
  • Using word embeddings:
    • Learn it (not recommended)
    • Reuse it (check what dataset the embeddings have been trained on)
    • Reuse + fine-tune 
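
  A minimal sketch of the BOW vs. TF-IDF options above using sklearn (the toy corpus is just an assumption for illustration):

    # Bag of Words vs. TF-IDF with sklearn; stop words removed in both.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = [
        "the cat sat on the mat",
        "the dog chased the cat",
    ]

    # BOW: raw counts, no contextual info kept.
    bow = CountVectorizer(stop_words="english")
    X_bow = bow.fit_transform(docs)

    # TF-IDF: down-weights terms that appear in many documents (usually better).
    tfidf = TfidfVectorizer(stop_words="english")
    X_tfidf = tfidf.fit_transform(docs)

    print(bow.get_feature_names_out())
    print(X_tfidf.toarray())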

...

  1. Twitter sentiment analysis using Word2Vec and LSTM in Keras


ML Ops (Kastan Day, Todd Nicholson, Benjamin Galewsky)

View with nicer formatting! https://kastanday.notion.site/Round-Table-ML-Ops-NCSA-27e4b6efc4cb410f8fa58ab2583340d9 

...

  1. Model selection

    1. Structured vs unstructured
    2. Pre-trained models (called Model Zoos, Model hub, or model garden): PyTorch, PyTorch Lightning, TF, TF garden, HuggingFace, PapersWCode, https://github.com/openvinotoolkit/open_model_zoo
      1. B-tier: https://github.com/collections/ai-model-zoos, CoreML (iOS), Android, web/JS, ONNX zoo. Largest quantity, hit-or-miss quality.
      2. Fastest to use is sklearn (AutoML).
      3. PyTorch Lightning.
      4. FastAI
      5. XGBoost & LightGBM
    3. For measuring success, I like F1 scores (can be weighted) - see the sketch below.
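
    A quick sketch of the weighted F1 metric mentioned above (the labels here are made up):

    # Weighted F1 with sklearn; "weighted" averages per-class F1 by class frequency.
    from sklearn.metrics import f1_score

    y_true = [0, 1, 2, 2, 1, 0]
    y_pred = [0, 2, 2, 2, 1, 0]

    print(f1_score(y_true, y_pred, average="weighted"))
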
  2. Data pipelines

    1. Luigi, Airflow, Ray, Snake(?), Spark.
    2. Globus, APIs, S3 buckets, HPC resources.
    3. Configuring and running Large ML training jobs, on Delta.
    4. Normal: Pandas, Numpy
    5. Big:
      1. Spark (PySpark)
      2. Dask - distributed Pandas and NumPy
      3. XArray
      4. Rapids
        1. cuDF - CUDA dataframes
        2. Dask cuDF - distributed dataframes (for data that can't fit in one GPU's memory)
      5. Rapids w/ Dask (cuDF) - distributed, on-GPU calculations. Blog on reading large CSVs. See the sketch after this list.
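
    A minimal sketch of the "Big" path, using Dask as distributed Pandas (the CSV glob and column names are placeholders, not from these notes):

    import dask.dataframe as dd

    # Lazily read many CSV partitions that wouldn't fit in memory as one frame.
    df = dd.read_csv("data/large-*.csv")

    # Same API as Pandas, but nothing executes until .compute() is called.
    result = df.groupby("station")["temp"].mean()
    print(result.compute())
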
  3. Data cleaning (and feature engineering ← this is jargon)

    1. Key idea: make data as info-dense as possible.
    2. Limit correlation between input variables (Pearson or Chi-squared). This is filter-based; you can also do permutation-based importance.
    3. Common workflow: Normalization → Pearson correlation → XGBoost feature importance → Kernel PCA dimensionality reduction (sketched below).

    Always normalize both inputs and outputs. (Original art by Kastan Day at KastanDay/wascally_wabbit on github.com.)
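
    A rough sketch of that workflow on a synthetic dataset (column names, model settings, and component counts are all assumptions):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import KernelPCA
    from xgboost import XGBRegressor

    # Synthetic data, just for illustration.
    X = pd.DataFrame(np.random.rand(200, 6), columns=[f"f{i}" for i in range(6)])
    y = X["f0"] * 2 + np.random.rand(200)

    # 1. Normalize the inputs (and, for regression, usually the target too).
    X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

    # 2. Pearson correlation: inspect it and drop one of any highly correlated pair.
    corr = X_scaled.corr(method="pearson").abs()

    # 3. XGBoost feature importance as a second filter.
    model = XGBRegressor(n_estimators=50).fit(X_scaled, y)
    importances = dict(zip(X.columns, model.feature_importances_))

    # 4. Kernel PCA for (nonlinear) dimensionality reduction.
    X_reduced = KernelPCA(n_components=3, kernel="rbf").fit_transform(X_scaled)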

  4. Easy parallelism in Python

    1. HPC: Parsl, funcX: Federated Function as a Service
    2. Commercial or Cloud: Ray.io (see the sketch below)
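
    A minimal sketch of the Ray side (the function is a stand-in for real work):

    import ray

    ray.init()  # or ray.init(address="auto") to join an existing cluster

    @ray.remote
    def square(x):
        return x * x

    # The tasks run in parallel across available cores / cluster workers.
    results = ray.get([square.remote(i) for i in range(8)])
    print(results)
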
  5. Serving

    1. Gradio & HF Spaces & Streamlit & PyDoc (see the Gradio sketch below)
    2. Data and Learning Hub for Science (research software), from Dan Katz.
    3. Triton, TensorRT, and ONNX. NVIDIA Triton Inference Server.
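
    A minimal serving sketch with Gradio (the predict function is a placeholder, not a real model):

    import gradio as gr

    def predict(text: str) -> str:
        # Placeholder "model": swap in real inference here.
        return "positive" if "good" in text.lower() else "negative"

    # Launches a local web UI; share=True would give a temporary public URL.
    gr.Interface(fn=predict, inputs="text", outputs="text").launch()
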
  6. Distributed training

    1. XGBoost - Dask.

    2. LightGBM - Dask or Spark.

    3. Horovod.

    4. PyTorch DDP (PyTorch lightning)

      Speed Up Model Training — PyTorch Lightning 1.7.0dev documentation

    5. General idea: Pin certain layers to certain devices. Simple cases aren’t too bad in theory, but require a fair bit of specific knowledge about the model in question.

    6. Flavors of Parallelism

      1. Easy: XGBoost or LightGBM. Python code: Dask, Ray, Parsl. (See the Dask + XGBoost sketch after this list.)
      2. Medium: Data parallelism in Horovod, DeepSpeed, PyTorch DDP. GPU operations with Nvidia RAPIDS.
      3. Hard: model parallelism. Must-read resource: Model Parallelism (huggingface.co)
      4. My research: FS-DDP, DeepSpeed, pipeline parallelism, tensor parallelism, distributed-all-reduce, etc.
      5. Glossary
        1. DDP — Distributed Data Parallel
        2. PP - Pipeline Parallel (DeepSpeed)
        3. TP - Tensor Parallel
        4. VP - Voting parallel (usually decision tree async updates, e.g. LightGBM)
        5. MP - Model Parallel (Model sharding, and pinning layers to devices)
        6. FS-DDP - Fully Sharded Distributed Data Parallel
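
    A sketch of the "easy" flavor above: distributed XGBoost on a local Dask cluster (the data is synthetic; in practice you'd connect the Client to a real cluster):

    import dask.array as da
    import xgboost as xgb
    from dask.distributed import Client, LocalCluster

    if __name__ == "__main__":
        client = Client(LocalCluster(n_workers=2, threads_per_worker=1))

        # Synthetic chunked arrays standing in for a real distributed dataset.
        X = da.random.random((10_000, 20), chunks=(1_000, 20))
        y = da.random.randint(0, 2, size=(10_000,), chunks=(1_000,))

        dtrain = xgb.dask.DaskDMatrix(client, X, y)
        output = xgb.dask.train(
            client,
            {"objective": "binary:logistic", "tree_method": "hist"},
            dtrain,
            num_boost_round=20,
        )
        booster = output["booster"]  # trained across the Dask workers
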
  7. Fine-tuning on out-of-distribution examples?

    1. Best way to fine-tune? Use Fast.ai with your PyTorch or TF model, I think (see the sketch below).
    2. Example: a PDG project going from Russia to Alaska and struggling to fine-tune.
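
    A minimal fine-tuning sketch with fastai (the pets dataset and resnet34 backbone are just stand-ins; it downloads data on first run):

    from fastai.vision.all import *

    path = untar_data(URLs.PETS) / "images"

    def is_cat(fname):
        # In this dataset, cat breeds have capitalized filenames.
        return fname[0].isupper()

    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2, seed=42,
        label_func=is_cat, item_tfms=Resize(224),
    )

    # Freeze the pre-trained body, train the new head, then unfreeze and tune.
    learn = vision_learner(dls, resnet34, metrics=error_rate)
    learn.fine_tune(2)
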
  8. MLOps


    1. WandB.ai - first-class tool during model development & data pre-processing (see the sketch below).
    2. Spell
    3. https://github.com/allegroai/clearml
    4. MLOps: What It Is, Why It Matters, and How to Implement It - neptune.ai
    5. The Framework Way is the Best Way: the pitfalls of MLOps and how to avoid them | ZenML Blog
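
    A minimal W&B logging sketch (the project name and metrics are made up; it needs a wandb login/API key):

    import wandb

    run = wandb.init(project="demo-project", config={"lr": 1e-3, "epochs": 2})
    for epoch in range(run.config["epochs"]):
        # Log whatever you track during training; these values are dummies.
        wandb.log({"epoch": epoch, "loss": 1.0 / (epoch + 1)})
    run.finish()
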
  9. HPC resources at UIUC

    1. NCSA Large: Delta (and Cerebras). External, but friendly: XSEDE (Bridges2).
    2. NCSA Small: Nano, Kingfisher, HAL (ppcle).
    3. NCSA Modern: DGX, and an Arm-based system with two A100s (40G) (via Hal-login3).
  10. Environments on HPC

    1. module load <TAB><TAB> — discover preinstalled environments
    2. Apptainer (previously called Singularity): Docker for HPC, requires few permissions.
      1. Write DOCKERFILEs for HPC, syntax here.
    3. Globus file transfer — my favorite. Wonderfully robust, parallel, lots of logging.
    4. Towards the perfect command-line file transfer: Xargs | Rsync. Use Xargs to parallelize Rsync for file transfer and sync (NCSA wiki resource), plus another 3rd-party blog.
    Rsync essential reference
    
    # My go-to command. Syntax is like scp.
    rsync -azP source destination
    
    # Flags explained:
    # -a is like scp's `-r`, but it also preserves metadata and symlinks.
    # -z = compression (more CPU usage, less network traffic)
    # -P combines --progress and --partial. It enables resuming.
    
    # To truly keep things in sync, add the delete option
    rsync -a --delete source destination
    
    # Create backups
    rsync -a --delete --backup --backup-dir=/path/to/backups /path/to/source destination
    
    # Good flags
    --exclude=pattern_to_exclude
    -n = dry run; don't actually do it, just print what WOULD have happened.
    

Conda Best Practices

When sharing Conda envs, consider: are you sharing with others, or using Conda in Docker?

Adding the --from-history flag will export only the packages you explicitly installed with conda. It will NOT include pip packages or anything installed via apt/yum/brew.

# 1. Cross-platform conda envs (my go-to)
conda env export --from-history > environment.yml   # main yaml
conda env export > pip_env.yml                      # just for pip
Then, manually copy ONLY the pip section from pip_env.yml into environment.yml.

conda env create -f environment.yml                # usage

# 2. for dockerfiles & exact replication on identical hardware
conda list --explicit > spec-file.txt

conda create --name myenv --file spec-file.txt    # usage

...

The benefit: sudo access on modern hardware and clean environments. That's perfect when dependency hell makes you want to scream, especially when dealing with outdated HPC libraries.

  • Google Colab (free or paid)
  • Kaggle kernels (free)
  • LambdaLabs (my favorite for cheap GPU)
  • DataCrunch.io (my favorite for cheap GPU, especially top-of-the-line A100s with 80GB)
  • Grid.ai (From the creator of PyTorch Lightning)
  • PaperSpace Gradient
  • GCP and Azure — lots of free credits floating around.
    • Azure is one of the few that have 8x 80GB A100 systems, at ~$38/hr. Still, sometimes you may need that.

...