
Objective: Enable system-wide file I/O and memory usage profiling for deep learning frameworks.

Approach: Write a script that collects the relevant file I/O and memory usage data from /proc. Users can run it alongside their deep learning applications.
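
A minimal sketch of the kind of /proc collection described above, written in Python for illustration. It assumes a Linux /proc. The function names (parse_kb_fields, parse_io, sample) and the choice of fields are hypothetical. They are not taken from memprof2.sh, whose exact inputs are not listed on this page.

```python
import time

def parse_kb_fields(text, wanted):
    """Extract 'Key:   <number> kB' fields from /proc/meminfo or /proc/<pid>/status text."""
    out = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            out[key] = int(rest.split()[0])  # first token is the value in kB
    return out

def parse_io(text):
    """Parse the 'name: value' counters found in /proc/<pid>/io."""
    out = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            out[key.strip()] = int(rest)
    return out

def sample(pid="self"):
    """Take one sample for a process; reading another user's /proc/<pid>/io may be denied."""
    with open(f"/proc/{pid}/status") as f:
        mem = parse_kb_fields(f.read(), ("VmPeak", "VmSize", "VmHWM", "VmRSS"))
    with open(f"/proc/{pid}/io") as f:
        io = parse_io(f.read())
    return {"time": time.time(), **mem, **io}
```

Calling sample() in a loop with a fixed sleep interval, and writing each result as one row, would give a time series that a plotting script can consume. The same parse_kb_fields helper can read system-wide values such as MemFree from /proc/meminfo.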

How to get started: Here are two scripts (memprof2.sh, plotone2.sh) that were developed roughly three or more years ago to collect and visualize the relevant data. They may need to be updated to work with the /proc layout of current kernels.

To Do:

Request accounts on the Nano cluster.
Copy the scripts to Nano and learn how to use them on simple batch-scheduled applications (update the scripts if needed).
Go over the TensorFlow getting-started tutorials (https://www.tensorflow.org/tutorials/).
Figure out how to use the above scripts with the TensorFlow tutorial examples, then collect and visualize the data.
Collect data from the TensorFlow benchmarks (https://www.tensorflow.org/guide/performance/benchmarks).
Look into integrating the scripts with Nano's system monitor (https://nano.ncsa.illinois.edu:3000/d/3QVrDIFmz/nano-status?refresh=1m&orgId=1).
- Yan: InfluxDB takes HTTP requests with JSON-formatted data. Also check relevant Telegraf plugins (cpu, mem, disk, diskio, zfs, etc.) as these are currently being used in admin (non-public) dashboards and may provide some of the desired metrics.
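
For the integration item, a minimal sketch of turning one collected sample into a record that could be sent to InfluxDB over HTTP. Yan's note mentions JSON. InfluxDB 1.x and later document a text "line protocol" for writes, so the format accepted by Nano's instance should be checked first. This sketch assumes line protocol with numeric fields only. The measurement name, tag names, and field names are hypothetical, and nothing is sent over the network.

```python
def _escape(text, chars):
    """Backslash-escape the characters that line protocol treats as separators."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text

def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=val field=val timestamp."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    parts = []
    for key in sorted(fields):
        value = fields[key]
        # Integers carry an 'i' suffix; floats are written as plain numbers.
        rendered = f"{value}i" if isinstance(value, int) else repr(float(value))
        parts.append(_escape(key, ",= ") + "=" + rendered)
    return f"{head} {','.join(parts)} {timestamp_ns}"

# Hypothetical example point built from /proc-style counters.
line = to_line_protocol(
    "procmem",
    {"host": "nano01", "pid": 1234},
    {"vm_rss_kb": 2048, "read_bytes": 10},
    1,
)
```

If the Telegraf plugins already provide the needed metrics, a custom writer like this may be unnecessary.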