...
- Vocab count / Bag of Words (BOW) - no contextual info kept
- Use count vectorizer from sklearn or TF-IDF (better)
- Remove stop words
- One-hot encoding
- Frequency count - no contextual info kept
- TF-IDF - no contextual info kept
- Word embeddings: preserve contextual information; capture the semantics of a word.
- Resource on implementation of various embeddings
- Learn word embeddings using n-grams (PyTorch, Keras). Here the word embeddings are learned in the training phase, so they are specific to the training set.
- Word2Vec (pre-trained word embeddings from Google) - based on word distributions and local context (window size).
- GloVe (pre-trained, from Stanford) - based on global context.
- BERT embeddings
- GPT-3 embeddings
- Using word embeddings:
- Learn it (not recommended)
- Reuse it (check what dataset the embeddings has been trained on)
- Reuse + fine-tune
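A minimal sketch of the count-based options above, assuming scikit-learn is available (the toy documents are invented):

```python
# Bag of words vs. TF-IDF with scikit-learn. Both drop word order, i.e. no
# contextual info is kept; TF-IDF additionally down-weights terms that appear
# in every document.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Bag of words: raw counts, English stop words removed.
bow = CountVectorizer(stop_words="english")
counts = bow.fit_transform(docs)

# TF-IDF: same vocabulary, reweighted by inverse document frequency.
tfidf = TfidfVectorizer(stop_words="english")
weights = tfidf.fit_transform(docs)

print(sorted(bow.vocabulary_))  # shared vocabulary across both documents
print(weights.shape)            # (n_docs, n_terms)
```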
...
ML Ops (Kastan Day, Todd Nicholson, Benjamin Galewsky)
View with nicer formatting! https://kastanday.notion.site/Round-Table-ML-Ops-NCSA-27e4b6efc4cb410f8fa58ab2583340d9
...
Model selection
- Structured vs unstructured
- Pre-trained models (called Model Zoos, Model hub, or model garden): PyTorch, PyTorch Lightning, TF, TF garden, HuggingFace, PapersWCode, https://github.com/openvinotoolkit/open_model_zoo
- B-tier: https://github.com/collections/ai-model-zoos, CoreML (iOS), Android, web/JS, ONNX zoo. Largest quantity, hit-or-miss quality.
- Fastest to use is SkLearn (AutoML).
- PyTorch Lightning.
- FastAI
- XGBoost & LightGBM
- For measuring success, I like F1 scores (can be weighted).
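A quick sketch of macro vs. weighted F1 with scikit-learn (assumed installed; the labels are toy data):

```python
# Macro F1 treats every class equally; weighted F1 weights each class's F1 by
# its support, which matters on imbalanced datasets.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

macro = f1_score(y_true, y_pred, average="macro")        # unweighted mean over classes
weighted = f1_score(y_true, y_pred, average="weighted")  # weighted by class support

print(round(macro, 4), round(weighted, 4))
```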
Data pipelines
- Luigi, Airflow, Ray, Snake(?), Spark.
- Globus, APIs, S3 buckets, HPC resources.
- Configuring and running Large ML training jobs, on Delta.
- Normal: Pandas, Numpy
- Big:
    - Spark (PySpark)
    - Dask - distributed Pandas and NumPy
    - XArray
    - RAPIDS
        - cuDF - CUDA dataframes
        - Dask-cuDF - distributed dataframes (for data that can't fit in one GPU's memory)
        - RAPIDS w/ Dask (cudf) - distributed, on-GPU calculations. Blog on reading large CSVs.
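When data is too big for memory but not big enough to justify a cluster, plain pandas can stream a CSV in chunks; this is the pattern Dask's dataframe automates and distributes. A self-contained sketch (assumes pandas; the file name and column are invented):

```python
# Out-of-core aggregation: process a CSV a few rows at a time instead of
# loading it all. Dask does this across many partitions/workers automatically.
import pandas as pd

# Write a small example file so the sketch is self-contained.
pd.DataFrame({"value": range(10)}).to_csv("data.csv", index=False)

total = 0
for chunk in pd.read_csv("data.csv", chunksize=4):  # only 4 rows in memory at once
    total += chunk["value"].sum()

print(total)  # matches pd.read_csv("data.csv")["value"].sum()
```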
Data cleaning (and feature engineering ← this is jargon)
- Key idea: make data as info-dense as possible.
- Limit correlation between input variables (Pearson or Chi-squared). This is filter-based; you can also do permutation-based importance.
- Common workflow: Normalization → Pearson correlation → XGBoost feature importance → Kernel PCA dimensionality reduction.
Always normalize both inputs and outputs.
Original art by Kastan Day at KastanDay/wascally_wabbit (github.com)
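The common workflow above can be sketched with NumPy and scikit-learn (both assumed installed); the XGBoost feature-importance step is only noted in a comment, and all data here is synthetic:

```python
# Sketch: normalize -> drop highly correlated features (Pearson filter) ->
# Kernel PCA dimensionality reduction. An XGBoost importance step would rank
# the surviving columns between the filter and the PCA.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 0] * 0.99 + rng.normal(scale=0.01, size=200)  # near-duplicate column

X = StandardScaler().fit_transform(X)  # normalize inputs

corr = np.corrcoef(X, rowvar=False)    # Pearson correlation matrix
keep = []
for j in range(X.shape[1]):
    if all(abs(corr[j, k]) < 0.95 for k in keep):  # filter-based selection
        keep.append(j)
X = X[:, keep]  # the near-duplicate column 4 gets dropped

X_reduced = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)
print(X_reduced.shape)
```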
Easy parallelism in Python
- HPC: Parsl, funcX: Federated Function as a Service
- Commercial or Cloud: Ray.io
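Ray and Parsl distribute exactly this map-a-function-over-inputs pattern across processes and nodes; here is the single-machine stdlib baseline (threads shown, which suits I/O-bound work; the `fetch` function is a made-up stand-in):

```python
# Map a function over many inputs concurrently with the standard library.
# Ray/Parsl/funcX generalize this same pattern to clusters and federated HPC.
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> str:
    # stand-in for a slow I/O call (download, API request, file copy)
    return url.upper()

urls = [f"file_{i}.csv" for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))  # preserves input order

print(results)
```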
Serving
- Gradio & HF Spaces & Streamlit & PyDoc
- Data and Learning Hub for Science (research software), by Dan Katz.
- Triton, TensorRT and ONNX. NVIDIA Triton Inference Server
Distributed training
XGBoost - Dask.
LightGBM - Dask or Spark.
Horovod.
PyTorch DDP (PyTorch lightning)
Speed Up Model Training — PyTorch Lightning 1.7.0dev documentation
General idea: Pin certain layers to certain devices. Simple cases aren’t too bad in theory, but require a fair bit of specific knowledge about the model in question.
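For the common data-parallel case (as opposed to the layer-pinning described above), PyTorch Lightning reduces DDP to Trainer flags. A config sketch only, not run here; it assumes pytorch_lightning is installed, and `MyModel` / `train_loader` are hypothetical objects of your own:

```python
# Data parallelism via PyTorch Lightning: each GPU gets a full model replica
# and a shard of each batch; gradients are all-reduced behind the scenes.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,        # GPUs per node
    num_nodes=2,      # scale out across nodes
    strategy="ddp",   # PyTorch DistributedDataParallel under the hood
)
# trainer.fit(MyModel(), train_loader)
```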
Flavors of Parallelism
- Easy: XGBoost or LightGBM. Python code: Dask, Ray, Parsl.
- Medium: Data parallelism in Horovod, DeepSpeed, PyTorch DDP. GPU operations with Nvidia RAPIDS.
- Hard: model parallelism. Must-read resource: Model Parallelism (huggingface.co)
- My research: FS-DDP, DeepSpeed, pipeline parallelism, tensor parallelism, distributed-all-reduce, etc.
- Glossary
- DDP — Distributed Data Parallel
- PP - Pipeline Parallel (DeepSpeed)
- TP - Tensor Parallel
- VP - Voting parallel (usually decision tree async updates, e.g. LightGBM)
- MP - Model Parallel (Model sharding, and pinning layers to devices)
- FS-DDP - Fully Sharded Distributed Data Parallel
Fine-tune on out-of-distribution examples?
- Best way to fine-tune? Use Fast.ai w/ your PT or TF model, I think.
- Example of PDG project going from Russia to Alaska and struggling to fine-tune.
MLOps
- WandB.ai — First class tool during model development & data pre-processing.
- Spell
- https://github.com/allegroai/clearml
- MLOps: What It Is, Why It Matters, and How to Implement It - neptune.ai
- The Framework Way is the Best Way: the pitfalls of MLOps and how to avoid them | ZenML Blog
HPC resources at UIUC
- NCSA Large: Delta (and Cerebras). External, but friendly: XSEDE (Bridges2).
- NCSA Small: Nano, Kingfisher, HAL (ppc64le).
- NCSA Modern: DGX, and Arm-based with two A100 (40G) (via Hal-login3).
Environments on HPC
- `module load <TAB><TAB>` — discover preinstalled environments
- Apptainer (previously called Singularity): Docker for HPC, requires few permissions.
    - Write DOCKERFILEs for HPC, syntax here.
- Globus file transfer — my favorite. Wonderfully robust, parallel, lots of logging.
- Towards the perfect command-line file transfer: `xargs | rsync`
    - Xargs to parallelize Rsync for file transfer and sync (NCSA wiki resource) and another 3rd-party blog.

**Rsync** essential reference

```bash
# My go-to command. Syntax like scp.
rsync -azP source destination

# flags explained:
# -a is like scp's -r, but it also preserves metadata and symlinks.
# -z = compression (more CPU usage, less network traffic)
# -P combines --progress and --partial; it enables resuming.

# to truly keep in sync, add the delete option
rsync -a --delete source destination

# create backups
rsync -a --delete --backup --backup-dir=/path/to/backups /path/to/source destination

# Good flags:
# --exclude=pattern_to_exclude
# -n = dry run; don't actually do it, just print what WOULD have happened.
```
Conda Best Practices
When sharing Conda envs: Consider, are you sharing with others or using Conda in Docker?
Adding the `--from-history` flag records only the packages you explicitly installed with conda. It will NOT include pip packages or anything else, like apt/yum/brew.
```bash
# 1. cross-platform conda envs (my go-to)
conda env export --from-history > environment.yml  # main yaml
conda env export > pip_env.yml                     # just for pip
```

Then, manually copy ONLY the pip section from pip_env.yml into environment.yml.

```bash
conda env create -f environment.yml  # usage

# 2. for dockerfiles & exact replication on identical hardware
conda list --explicit > spec-file.txt
conda create --name myenv --file spec-file.txt  # usage
```
...
The benefit: sudo access, modern hardware, and clean environments. That's perfect when dependency hell makes you want to scream, especially when dealing with outdated HPC libraries.
- Google Colab (free or paid)
- Kaggle kernels (free)
- LambdaLabs (my favorite for cheap GPU)
- DataCrunch.io (my favorite for cheap GPUs, especially top-of-the-line A100 80GB)
- Grid.ai (From the creator of PyTorch Lightning)
- PaperSpace Gradient
- GCP and Azure — lots of free credits floating around.
- Azure is one of the few that offer 8x 80GB A100 systems, for ~$38/hr. Still, sometimes you may need that.
...