MLPerf Issues

Nvidia's Implementation of the Benchmark :

The goal is to get the image_classification benchmark from MLCommons running on hydro and to collect timing results from it.

There are two options for doing this:

Use the image_classification benchmark from the official MLCommons training benchmark repository
Use Nvidia's implementation of the benchmark found here

MLCommons Implementation

The instructions MLCommons gives for running each benchmark are as follows:

Setup docker & dependencies. There is a shared script (install_cuda_docker.sh) to do this.
Download the dataset using ./download_dataset.sh. This should be run outside of docker, on your host machine. This should be run from the directory it is in (it may make assumptions about CWD).
Optionally, run verify_dataset.sh to ensure the dataset was successfully downloaded.
Build and run the docker image, the command to do this is included with each Benchmark.

Problems with these instructions : For this benchmark, no install_cuda_docker.sh file is provided, and there is no sign of a Dockerfile or image. download_dataset.sh and verify_dataset.sh are both stub files with "TO DO" written in them, and there is not even a requirements.txt listing the required Python packages or versions. The README for the benchmark was recently updated to include some instructions, but it is still very bare.
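The problems listed above can be checked mechanically against a checkout of the benchmark. The following is a minimal sketch, not part of either repository: the function name `audit_benchmark_dir`, the ten-line threshold, and the stub markers are assumptions made for illustration. The file names are the ones mentioned in the instructions.

```python
from pathlib import Path

# File names taken from the MLCommons instructions and the problems noted above.
EXPECTED_FILES = [
    "install_cuda_docker.sh",
    "download_dataset.sh",
    "verify_dataset.sh",
    "requirements.txt",
    "Dockerfile",
]


def audit_benchmark_dir(benchmark_dir, stub_markers=("TO DO", "TODO")):
    """Return {file name: "missing" | "stub" | "present"} for each expected file."""
    status = {}
    for name in EXPECTED_FILES:
        path = Path(benchmark_dir) / name
        if not path.is_file():
            status[name] = "missing"
            continue
        text = path.read_text(errors="replace")
        # Assumption: a short file that mentions a TODO marker is a stub.
        is_short = len(text.splitlines()) <= 10
        has_marker = any(marker in text for marker in stub_markers)
        status[name] = "stub" if (is_short and has_marker) else "present"
    return status


def print_audit(benchmark_dir):
    """Print one line per expected file with its status."""
    for name, state in audit_benchmark_dir(benchmark_dir).items():
        print(f"{name:28s} {state}")
```

Calling `print_audit()` with the path to the benchmark directory in a local checkout gives a quick way to see whether a later upstream update has filled in any of the missing pieces.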

What has been attempted : The instructions identify ImageNet as the dataset to use. It is available on hydro at /sw/unsupported/mldata/ImageNet/ (1.2 TB). I also downloaded a small-imagenet version (162 GB) from Kaggle. Both appear to work with imagenet_preprocessing.py. I also worked on figuring out, by trial and error, which Python packages are required to run resnet_ctl_imagenet_main.py, but this is where I hit a roadblock. The code uses TensorFlow 2.3.0, which requires Python 3.8 or earlier, so the Python 3.9 module on hydro will not work. This means we would likely need a container to run this code.

Blocking Problem : We need a way to run Python 3.8 or earlier. This would probably mean building our own container from scratch and further debugging which other Python packages are needed.
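The version constraint can be checked up front before installing anything. The sketch below assumes the supported range for TensorFlow 2.3.0 is Python 3.5 through 3.8; the function name `python_supports_tf_230` is made up for this example.

```python
import sys

# Assumption: TensorFlow 2.3.0 supports Python 3.5 through 3.8.
TF_230_MIN = (3, 5)
TF_230_MAX = (3, 8)


def python_supports_tf_230(version=None):
    """Return True if the given (major, minor, ...) version can run TensorFlow 2.3.0.

    With no argument, the running interpreter's version is checked.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return TF_230_MIN <= major_minor <= TF_230_MAX


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_supports_tf_230():
        print(f"Python {current} is within the TensorFlow 2.3.0 range")
    else:
        print(f"Python {current} is outside the TensorFlow 2.3.0 range (3.5 to 3.8)")
```

Running this under the Python 3.9 module on hydro should report that the interpreter is outside the range, which matches the roadblock described above.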

Nvidia Implementation

What has been attempted : The Nvidia implementation comes with a Dockerfile, so I attempted to convert it to an Apptainer definition file using spython and then build it with Apptainer on hydro. This does not appear to work due to the Dockerfile's use of multiple

Blocking Problem : Nvidia's Dockerfile either is not a valid file or does not appear to be convertible to an Apptainer definition file.
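For reference, the conversion attempt described above amounts to two commands: `spython recipe` to translate the Dockerfile, then `apptainer build` to build the result. The sketch below only assembles and optionally runs them. The file names `mlperf.def` and `mlperf.sif` and the function names are placeholders chosen for this example, and any extra flags needed on hydro (for example for unprivileged builds) are not covered.

```python
import shlex
import shutil
import subprocess


def conversion_commands(dockerfile="Dockerfile", def_file="mlperf.def", sif_file="mlperf.sif"):
    """Return the two commands for the Dockerfile -> def file -> image route."""
    return [
        # Translate the Dockerfile into an Apptainer/Singularity definition file.
        ["spython", "recipe", dockerfile, def_file],
        # Build an image from the generated definition file.
        ["apptainer", "build", sif_file, def_file],
    ]


def run_commands(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if dry_run:
            continue
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} was not found on PATH")
        # check=True stops at the first failing step.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_commands(conversion_commands())
```

With `dry_run=True` (the default) nothing is executed, so the commands can be reviewed before trying them on a login node.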

Nvidia Kaniko workaround

Jim suggested using kaniko to build the Nvidia Dockerfile into a Docker image, so that we could simply run the resulting image with Apptainer.
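One way this workaround could look is sketched below: the kaniko executor writes the image to a tarball instead of pushing it, and Apptainer then builds from that tarball. This is an untested outline, not something that has been run on hydro. Where and how the kaniko executor itself would run is not settled here, and the image name, tarball name, and function names are placeholders.

```python
def kaniko_executor_args(context_dir, dockerfile="Dockerfile",
                         tar_path="mlperf.tar",
                         image_name="mlperf-image-classification:latest"):
    """Return kaniko executor arguments that save the image as a tarball."""
    return [
        "/kaniko/executor",
        f"--context=dir://{context_dir}",   # build context directory
        f"--dockerfile={dockerfile}",
        "--no-push",                        # do not push to a registry
        f"--tar-path={tar_path}",           # write the image to this tarball
        f"--destination={image_name}",      # image name recorded in the tarball
    ]


def apptainer_build_from_tar(tar_path="mlperf.tar", sif_file="mlperf.sif"):
    """Return the Apptainer command that builds an image from a Docker tarball."""
    return ["apptainer", "build", sif_file, f"docker-archive://{tar_path}"]


if __name__ == "__main__":
    # Hypothetical context path, shown only to illustrate the argument layout.
    print(" ".join(kaniko_executor_args("/workspace")))
    print(" ".join(apptainer_build_from_tar()))
```

Saving to a tarball avoids needing a registry; if a registry is available, pushing the image there and pointing Apptainer at a `docker://` URI would be an alternative.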

What has been attempted :

Blocking Problem :
