...

  1. Use the image_classification benchmark from the official MLCommons training benchmark repository
  2. Use Nvidia's implementation of the benchmark found here

*Note that while there are prebuilt images for the inference benchmark, I have not found any prebuilt images for the training benchmarks, which are what we are trying to use.

MLCommons Implementation

The instructions MLCommons gives for running each benchmark are as follows:

...

What has been attempted : The Nvidia implementation comes with a Dockerfile, so I converted this to an apptainer def file using spython, to build with apptainer on hydro. Building the def file to a sif image, though, fails with "ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'". This doesn't make sense to me, as the file is in the directory and is copied over before the install command is run in both the Dockerfile and the def file. If somebody with some experience with docker/apptainer could look at this, that would be useful.
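One way to narrow this down is to compare where the converted def file copies requirements.txt to with the directory the install command runs in. The helper below is only a debugging sketch: def_section is a made-up function name, Singularity.def is a placeholder for the converted file, and it assumes section headers such as %files and %post start in the first column. A possible (unconfirmed) cause is that the destination in %files no longer matches the working directory that %post has changed into by the time pip runs.

```shell
# Print the body of one section (e.g. "files" or "post") of an Apptainer def file.
# Hypothetical helper for inspection only; it does not build anything.
def_section() {
    def_file=$1   # path to the def file produced by the conversion
    section=$2    # section name without the leading "%"
    awk -v want="%$section" '
        /^%/ { in_section = ($1 == want); next }  # any section header switches state
        in_section && NF > 0 { print }            # print non-blank lines of the wanted section
    ' "$def_file"
}
```

For example, `def_section Singularity.def files` shows each source and destination pair, and `def_section Singularity.def post | grep -n -e 'cd ' -e requirements.txt` shows which directory the install command runs from.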

Blocking Problem : Building the apptainer def file converted from Nvidia's Dockerfile fails with apptainer.

Nvidia Kaniko workaround

Jim suggested using kaniko, which is a tool to build container images from a Dockerfile, inside a container. The idea is to build Nvidia's Dockerfile into a Docker image with kaniko and then run that image with apptainer. This would get rid of any issues with the conversion between Dockerfile and apptainer def file.

What has been attempted : To test whether this could work I tried building the docker hello-world example in kaniko first. This is the latest command that I was trying with it :

apptainer run --fakeroot docker://gcr.io/kaniko-project/executor:latest --no-push --tar-path=mybuild.tar --dockerfile=Dockerfile.build 

This fails with:

FATAL:   failed to open /bin/sh for inspection: failed to open elf binary /bin/sh: open /bin/sh: no such file or directory

I tried the --bind option to fix this, but there appears to be a conflict when using --fakeroot and --bind at the same time. Please verify that I'm not just getting this wrong.
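Separately from the /bin/sh failure, the command above may also need two more flags once the container starts. As far as I can tell from the kaniko README, --tar-path expects --destination to be set even with --no-push, and the build context defaults to /workspace/ unless --context is given. The sketch below only prints the commands rather than running them. The function names, image tag, and sif file name are placeholders, and the docker-archive: source for apptainer build is my assumption about how the tarball would be consumed. It does not address the startup error itself.

```shell
# Assemble the kaniko executor arguments for a tarball-only build.
# All values passed in are placeholders chosen by the caller.
kaniko_args() {
    context_dir=$1   # directory holding the Dockerfile and the files it copies
    dockerfile=$2    # e.g. Dockerfile.build
    image_tag=$3     # name recorded in the tarball; nothing is pushed
    tar_path=$4      # where the image tarball is written
    printf '%s\n' \
        "--context=dir://$context_dir" \
        "--dockerfile=$dockerfile" \
        "--destination=$image_tag" \
        "--no-push" \
        "--tar-path=$tar_path"
}

# Print (do not run) the two steps: build the tarball, then convert it to a sif.
print_kaniko_workflow() {
    args=$(kaniko_args "$1" "$2" "$3" "$4" | tr '\n' ' ')
    echo "apptainer run --fakeroot docker://gcr.io/kaniko-project/executor:latest ${args% }"
    echo "apptainer build $5 docker-archive:$4"
}
```

Calling `print_kaniko_workflow "$PWD" Dockerfile.build hello:test mybuild.tar hello.sif` prints both commands so they can be reviewed before trying them on hydro.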

Blocking Problem : I can't get the kaniko image to run properly with apptainer, or I am doing something wrong (probably the latter).