...
Hosted by Kastan Day. Learn more about me at KastanDay.com
Round Table Objectives
Leaving this talk, you should have two things:
...
- Plain language.
- Awareness of the offerings, so you know what to look for online.
Outline
Round table high-priority topics:
- [ ] Pre-trained model zoos
- [ ] Dev environments on HPC (Docker/Singularity/Apptainer/Conda)
ML-Ops: Outline of Big Ideas
Model selection
- Structured vs unstructured
- Pre-trained models (called Model Zoos, Model hub, or model garden): PyTorch, PyTorch Lightning, TF, TF garden, HuggingFace, PapersWCode, https://github.com/openvinotoolkit/open_model_zoo
- B-tier: https://github.com/collections/ai-model-zoos, CoreML (iOS), Android, web/JS, ONNX zoo. Largest quantity, hit-or-miss quality.
- Fastest to use is SkLearn (AutoML).
- PyTorch Lightning.
- FastAI
- XGBoost & LightGBM
- For measuring success, I like F-1 scores (can be weighted).
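To make the "fastest to use" and "weighted F-1" points concrete, here is a minimal baseline sketch, assuming scikit-learn is installed; the dataset is synthetic and the model choice is illustrative, not a recommendation.

```python
# Quick baseline: fit a simple classifier and score it with a weighted F-1.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_classes=3, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

# average="weighted" weights each class's F-1 by its support,
# which matters on imbalanced data.
score = f1_score(y_test, pred, average="weighted")
print(f"weighted F-1: {score:.3f}")
```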
Data pipelines
- Luigi, Airflow, Ray, Snake(?), Spark.
- Globus, APIs, S3 buckets, HPC resources.
- Configuring and running large ML training jobs on Delta.
- Normal: Pandas, Numpy
- Big:
  - Spark (PySpark)
  - Dask: distributed Pandas and NumPy
  - XArray
  - RAPIDS
    - cuDF: CUDA dataframes
    - Dask-cuDF: distributed dataframes, for data that can't fit in one GPU's memory
    - RAPIDS with Dask (cudf): distributed, on-GPU calculations. Blog: reading large CSVs.
- Key idea: make data as info-dense as possible.
- Limit correlation between input variables (Pearson or Chi-squared) — this is filter-based; you can also use permutation-based importance.
- Common workflow: Normalization → Pearson correlation → XGBoost feature importance → Kernel PCA dimensionality reduction.
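A hedged sketch of the workflow above with pandas and scikit-learn. The column names, the 0.9 correlation cutoff, and the synthetic data are all made up for illustration; the XGBoost feature-importance step is noted but omitted to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
df["e"] = df["a"] * 0.95 + rng.normal(scale=0.05, size=200)  # near-duplicate of "a"

# 1. Normalization.
X = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# 2. Pearson-correlation filter: drop one column of each highly correlated pair.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X = X.drop(columns=to_drop)  # "e" gets dropped here

# 3. (XGBoost feature importance would go here.)

# 4. Kernel PCA dimensionality reduction, down to 2 components.
X_reduced = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)
print(X_reduced.shape)
```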
Data cleaning (and feature engineering ← this is jargon)
- Always normalize both inputs and outputs.
Original art by Kastan Day at KastanDay/wascally_wabbit (github.com)
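A minimal NumPy sketch of "normalize both inputs and outputs" for a regression: standardize X and y, fit in normalized units, then invert the scaling on predictions. The data and model (ordinary least squares) are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 7.0

# Standardize inputs AND outputs to zero mean, unit variance.
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
y_mean, y_std = y.mean(), y.std()
Xn = (X - X_mean) / X_std
yn = (y - y_mean) / y_std

# Fit any model on (Xn, yn); here, ordinary least squares.
w, *_ = np.linalg.lstsq(Xn, yn, rcond=None)

# Predictions come out in normalized units; invert the scaling.
y_pred = (Xn @ w) * y_std + y_mean
print(np.allclose(y_pred, y))
```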
Easy parallelism in Python
- HPC: Parsl; funcX, a federated Function-as-a-Service.
- Commercial or Cloud: Ray.io
Serving
- Gradio & HF Spaces & Streamlit & PyDoc
- Data and Learning Hub for Science (research software), by Dan Katz.
- Triton, TensorRT and ONNX. NVIDIA Triton Inference Server
Distributed training
- XGBoost: Dask.
- LightGBM: Dask or Spark.
- Horovod.
- PyTorch DDP (PyTorch Lightning): Speed Up Model Training — PyTorch Lightning 1.7.0dev documentation
General idea: Pin certain layers to certain devices. Simple cases aren’t too bad in theory, but require a fair bit of specific knowledge about the model in question.
Flavors of Parallelism
- Easy: XGBoost or LightGBM. Python code: Dask, Ray, Parsl.
- Medium: Data parallelism in Horovod, DeepSpeed, PyTorch DDP. GPU operations with Nvidia RAPIDS.
- Hard: model parallelism. Must-read resource: Model Parallelism (huggingface.co)
- My research: FS-DDP, DeepSpeed, pipeline parallelism, tensor parallelism, distributed-all-reduce, etc.
- Glossary
  - DDP - Distributed Data Parallel
  - PP - Pipeline Parallel (DeepSpeed)
  - TP - Tensor Parallel
  - VP - Voting Parallel (usually async decision-tree updates, e.g. LightGBM)
  - MP - Model Parallel (model sharding; pinning layers to devices)
  - FS-DDP - Fully Sharded Distributed Data Parallel
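A toy illustration of the data-parallel (DDP) idea from the glossary, simulated in NumPy: each "worker" computes a gradient on its own shard of the batch, an all-reduce averages the gradients, and every worker applies the identical update. Real DDP does this across GPUs with collective communication; this is only the arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
shards = np.array_split(np.arange(64), 4)  # 4 simulated workers, equal shards

for _ in range(200):
    # Each worker computes a local gradient of mean-squared error on its shard.
    local_grads = [2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx) for idx in shards]
    # "All-reduce": average the gradients across workers.
    grad = np.mean(local_grads, axis=0)
    w -= 0.1 * grad  # identical update on every worker

print(np.round(w, 3))
```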
Fine-tune on out-of-distribution examples?
- TBD: What's the best way to fine-tune?
- TBD: How do you monitor if your model is experiencing domain shift while in production? WandB Alerts is my best idea.
- Use Fast.ai w/ your PT or TF model, I think.
- A motivating example: the Permafrost Discovery Gateway has a great classifier for satellite images of Alaska, but we need to adapt it to imagery from a slightly different domain. How can we best fine-tune the existing model?
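One hedged answer to the monitoring TBD above: track a drift statistic between training-time and live feature distributions, and fire an alert (e.g. via WandB Alerts) when it crosses a threshold. This sketch uses the Population Stability Index (PSI) in pure NumPy; the 0.2 alert threshold is a common rule of thumb, not a universal constant.

```python
import numpy as np

def psi(expected: np.ndarray, observed: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two 1-D samples."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    o_frac = np.histogram(observed, bins=edges)[0] / len(observed)
    # Floor the fractions to avoid log(0) / division by zero.
    e_frac = np.clip(e_frac, 1e-6, None)
    o_frac = np.clip(o_frac, 1e-6, None)
    return float(np.sum((o_frac - e_frac) * np.log(o_frac / e_frac)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5000)
live_same = rng.normal(0.0, 1.0, size=5000)     # no shift: PSI near 0
live_shifted = rng.normal(1.0, 1.0, size=5000)  # shifted domain: large PSI

print(psi(train_feature, live_same))
print(psi(train_feature, live_shifted))  # above ~0.2 -> trigger an alert
```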
MLOps
- WandB.ai — highly recommended. A first-class tool during model development & data pre-processing.
- Spell
- https://github.com/allegroai/clearml
- MLOps: What It Is, Why It Matters, and How to Implement It - neptune.ai
- The Framework Way is the Best Way: the pitfalls of MLOps and how to avoid them | ZenML Blog
HPC resources at UIUC
- NCSA large: Delta (and Cerebras). External, but friendly: XSEDE (Bridges-2).
- NCSA small: Nano, Kingfisher, HAL (ppc64le).
- NCSA modern: DGX, and an Arm-based system with two A100 (40 GB) GPUs (via Hal-login3).
Environments on HPC
- module load <TAB><TAB> — discover preinstalled environments.
- Apptainer (previously called Singularity): Docker for HPC, requires few permissions.
- Write DOCKERFILEs for HPC, syntax here.
- Globus file transfer — my favorite. Wonderfully robust, parallel, lots of logging.
- Towards the perfect command-line file transfer: xargs | rsync.
- Xargs to parallelize rsync for file transfer and sync (NCSA wiki resource), and another 3rd-party blog.
- Rsync essential reference:

```shell
# My go-to command. Syntax like scp's.
rsync -azP source destination

# Flags explained:
#   -a is like scp's `-r`, but it also preserves metadata and symlinks.
#   -z = compression (more CPU usage, less network traffic).
#   -P combines --progress and --partial; it enables resuming.

# To truly keep in sync, add the delete option:
rsync -a --delete source destination

# Create backups:
rsync -a --delete --backup --backup-dir=/path/to/backups /path/to/source destination

# Good flags:
#   --exclude=pattern_to_exclude
#   -n = dry run; don't actually do it, just print what WOULD have happened.
```
...
<List of some popular ML learning pathways and a brief comment about each>
Conclusion
<A brief concluding section about the report any future ideas>