Nvidia's Implementation of the Benchmark

The goal is to get the image_classification benchmark from MLCommons running on hydro and to get some timing results back from it.

There are two options for doing this:

  1. Use the image_classification benchmark from the official MLCommons training benchmark repository
  2. Use Nvidia's implementation of the benchmark found here

*Note that while there are prebuilt images for the inference benchmarks, I have not found any prebuilt images for the training benchmarks, which are what we are trying to use.

MLCommons Implementation

MLCommons' instructions for running each benchmark are as follows (a rough sketch of the whole workflow follows the list):

  1. Set up Docker & dependencies. There is a shared script (install_cuda_docker.sh) to do this.
  2. Download the dataset using ./download_dataset.sh. This should be run outside of Docker, on your host machine, and from the directory it is in (it may make assumptions about the CWD).
  3. Optionally, run verify_dataset.sh to ensure the dataset was successfully downloaded.
  4. Build and run the Docker image; the command to do this is included with each benchmark.
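
Put together as a shell session, the intended workflow would look roughly like this. This is a sketch only: several of the scripts are stubs or missing for this benchmark, and the actual build/run commands vary per benchmark.

    # on the host, from the benchmark's own directory
    ./install_cuda_docker.sh      # shared Docker/CUDA setup script
    ./download_dataset.sh         # download the dataset (outside Docker)
    ./verify_dataset.sh           # optional: verify the download
    # then build and run the benchmark image, e.g. something like:
    docker build -t image_classification .
    docker run --gpus all -v /path/to/dataset:/data image_classification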

Problems with these instructions: For this benchmark there is no install_cuda_docker.sh file provided, nor any indication of a Dockerfile or image. The download_dataset.sh and verify_dataset.sh scripts are both stubs with "TO DO" written in them, and there isn't even a requirements.txt file listing which Python packages or versions are required. The README for the benchmark was recently updated to include some instructions, but it is still very bare.

What has been attempted: The instructions identify the dataset to be used as ImageNet. This is available on hydro at /sw/unsupported/mldata/ImageNet/ (1.2 TB). I also downloaded a small ImageNet version (162 GB) from Kaggle; both appear to work with imagenet_preprocessing.py. I also worked on figuring out, via trial and error, which Python packages are required to run resnet_ctl_imagenet_main.py, but this is where I hit a roadblock. The code uses TensorFlow 2.3.0, which requires Python 3.8 or lower, so the Python 3.9 module on hydro will not work. This means we would likely need a container to run this code.
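
To spell out the version constraint: TensorFlow 2.3.0 only ships wheels for Python 3.5-3.8, so under the hydro Python 3.9 module the install itself cannot succeed:

    python3 --version                # Python 3.9.x on the hydro module
    pip install tensorflow==2.3.0    # fails: pip finds no matching wheel for 3.9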

Blocking Problem: We need a way to run Python 3.8 or lower, which would probably mean making our own container from scratch and further debugging which other Python packages are needed...
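
One possible starting point, as an untested sketch: base a def file on the official tensorflow/tensorflow:2.3.0-gpu image from Docker Hub, which bundles a Python version that TF 2.3.0 supports. The extra package list below is a placeholder, since it still has to be worked out:

    Bootstrap: docker
    From: tensorflow/tensorflow:2.3.0-gpu

    %post
        # add whatever further packages trial and error turns up
        pip install --no-cache-dir <packages-to-be-determined>

This would then be built with something like apptainer build tf230.sif tf230.def.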

Nvidia Implementation

What has been attempted: The Nvidia implementation comes with a Dockerfile, so I converted this to an apptainer def file with spython. Building the def file into a SIF image with apptainer, however, fails with "ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'", which doesn't make sense to me, as requirements.txt is in the directory and is copied over before the install command is run in both the Dockerfile and the def file. If somebody with some Docker/Apptainer experience could look at this, that would be useful.
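
For reference, the conversion and build steps were along these lines (the file names here are placeholders):

    # translate the Dockerfile into an Apptainer definition file
    spython recipe Dockerfile > image_classification.def
    # build the SIF image from the translated def file (this is the step that fails)
    apptainer build image_classification.sif image_classification.def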

Blocking Problem: Building the apptainer def file converted from Nvidia's Dockerfile fails with apptainer.

Nvidia Kaniko workaround

Jim suggested using kaniko, a tool that builds container images from a Dockerfile inside a container. This would get rid of any issues with the conversion between a Dockerfile and an apptainer def file.
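
For context, kaniko's documented invocation under Docker looks roughly like this (flag spellings per the kaniko README; paths and image tag are placeholders):

    docker run -v "$PWD":/workspace gcr.io/kaniko-project/executor:latest \
        --context dir:///workspace \
        --dockerfile /workspace/Dockerfile \
        --no-push --tarPath /workspace/mybuild.tar --destination mybuild

The question is how to reproduce this under apptainer instead of Docker.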

What has been attempted: To test whether this could work, I first tried building the Docker hello-world example with kaniko. This is the latest command that I was trying:

apptainer run --fakeroot docker://gcr.io/kaniko-project/executor:latest --no-push --tar-path=mybuild.tar --dockerfile=Dockerfile.build 

This fails with:

FATAL:   failed to open /bin/sh for inspection: failed to open elf binary /bin/sh: open /bin/sh: no such file or directory

I did try using the --bind option to fix this, but it appears there is a conflict between the --fakeroot and --bind options when used at the same time. Please verify that I'm not just messing it up, though.
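
A possible explanation (my reading, not verified): the kaniko executor image is built from scratch and contains no shell at all, while apptainer expects to find a /bin/sh inside the container. The kaniko README notes that the :debug tag of the executor image additionally ships a busybox shell, so one untested idea is to pull that tag and invoke the executor binary (located at /kaniko/executor) directly:

    apptainer exec --fakeroot docker://gcr.io/kaniko-project/executor:debug \
        /kaniko/executor --context dir://$PWD --dockerfile=Dockerfile.build \
        --no-push --tarPath mybuild.tar --destination mybuild

Whether busybox's shell location satisfies apptainer's /bin/sh check is exactly what would need testing.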

Blocking Problem: I can't seem to run the kaniko image properly with apptainer, or I am doing something wrong (probably the latter).



1 Comment

  1. About the NVIDIA implementation: the issue is that spython translates the Dockerfile into an Apptainer definition file which copies requirements.txt into the current directory, ., rather than into the work directory created in the Dockerfile, because in the translated apptainer def file the work directory is created after requirements.txt is copied.

    The solution I used was to add the creation of the work dir in the %setup section, and remove its creation later in the def file:

    %setup
    # %setup runs on the host before the build; APPTAINER_ROOTFS points at
    # the container's filesystem, so this pre-creates the work directory
    mkdir -p ${APPTAINER_ROOTFS}/workspace/image_classification
    cd ${APPTAINER_ROOTFS}/workspace/image_classification


    I have written a script to do this for all translated Dockerfiles.
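
    A minimal sketch of what such a script could look like (a hypothetical helper, not the actual script; it assumes the translated def file contains a later 'mkdir -p /workspace/image_classification' line, which is what gets dropped):

    #!/bin/sh
    # usage: ./fix_def.sh translated.def > fixed.def
    DEF="$1"
    WORKDIR=/workspace/image_classification
    # prepend a %setup section that pre-creates the work dir in the rootfs
    printf '%%setup\n    mkdir -p ${APPTAINER_ROOTFS}%s\n    cd ${APPTAINER_ROOTFS}%s\n\n' "$WORKDIR" "$WORKDIR"
    # drop the later creation of the same directory from the translated file
    grep -v "mkdir -p $WORKDIR" "$DEF"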
