ImageNet Distributed Mixed-precision Training Benchmark


GitHub repo for all source code and details: https://github.com/richardkxu/distributed-pytorch

Jupyter notebook tutorial (source code and key points): https://github.com/richardkxu/distributed-pytorch/blob/master/ddp_apex_tutorial.ipynb

HAL paper: https://dl.acm.org/doi/10.1145/3311790.3396649

Benchmark Results

Training Time: Time to solution for ImageNet training with ResNet-50 as the number of GPUs scales from 2 to 64. With 2 GPUs, training takes 20 hrs, 36 mins, 51.11 secs. With 64 GPUs across 16 compute nodes, training takes 1 hr, 7 mins, 51.31 secs, while maintaining the same level of top-1 and top-5 accuracy.
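The two end-point timings above can be turned into a speedup and a scaling efficiency. The following is a minimal sketch of that arithmetic only; the helper name `to_seconds` and the choice of the 2-GPU run as the baseline are assumptions made for illustration, not something taken from the benchmark code.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an (hours, minutes, seconds) duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Time to solution reported above for the smallest and largest runs.
t_2gpu = to_seconds(20, 36, 51.11)
t_64gpu = to_seconds(1, 7, 51.31)

# Speedup relative to the 2-GPU baseline (assumed baseline, see lead-in).
speedup = t_2gpu / t_64gpu

# Ideal linear speedup when going from 2 to 64 GPUs.
ideal_speedup = 64 / 2

# Fraction of the ideal speedup that was actually achieved.
efficiency = speedup / ideal_speedup

print(f"speedup: {speedup:.2f}x of an ideal {ideal_speedup:.0f}x")
print(f"scaling efficiency: {efficiency:.1%}")
```

Under these assumptions the 64-GPU run is roughly 18.2x faster than the 2-GPU run, which is about 57% of the ideal 32x.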

...

I/O Bandwidth: I/O bandwidth (GB/s) and IOPS of our file system throughout full-system ImageNet training using 64 GPUs. Between the 10th and 60th epochs, the average bandwidth is 3.30 GB/s and the average IOPS is 36.5K.

Software Stack

  • IBM WMLCE 1.6.2
  • Python 3.7
  • PyTorch 1.2.0
  • NVIDIA Apex 0.1.0
  • CUDA 10.1

...

Each of our compute nodes has 4 NVIDIA Volta V100 GPUs. We scale ResNet-50 from 2 GPUs on the same compute node to 64 GPUs across 16 compute nodes, doubling the number of GPUs for each intermediate run. We use a per-GPU batch size of 208 images, which is the largest batch size we can fit with distributed mixed-precision training. Therefore, our global batch size ranges from 416 images to 13312 images. We use Automatic Mixed Precision (Amp) and Distributed Data Parallel (DDP) from NVIDIA Apex for mixed-precision and distributed training. An optimization level of "O2" is used for mixed-precision training to benefit from FP16 training while keeping a few parameters in FP32. We use the NVIDIA Collective Communications Library (NCCL) [34] as our distributed backend.
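The setup described above can be sketched as follows. This is an illustrative outline, not the code from the benchmark repository: the helpers `gpu_counts`, `global_batch_size`, `setup_amp_ddp`, and `training_step` are hypothetical names, and the sketch assumes one process per GPU with the usual `env://` rendezvous variables (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`) set by the launcher. The Apex and PyTorch imports are placed inside the functions so that the batch-size arithmetic can be run on a machine without a GPU stack.

```python
PER_GPU_BATCH = 208  # per-GPU batch size stated in the text


def gpu_counts(start=2, stop=64):
    """GPU counts for the scaling runs: double from start up to stop."""
    counts = []
    n = start
    while n <= stop:
        counts.append(n)
        n *= 2
    return counts


def global_batch_size(num_gpus, per_gpu=PER_GPU_BATCH):
    """Global batch size under data parallelism: one per-GPU batch per GPU."""
    return num_gpus * per_gpu


def setup_amp_ddp(model, make_optimizer, local_rank):
    """Prepare a model for mixed-precision, multi-GPU training (hypothetical helper).

    make_optimizer is a callable that builds an optimizer from an iterable
    of parameters, e.g. lambda params: torch.optim.SGD(params, lr=0.1).
    """
    # Imported lazily so the helpers above work without torch or Apex.
    import torch
    import torch.distributed as dist
    from apex import amp
    from apex.parallel import DistributedDataParallel as DDP

    # Bind this process to its GPU, then join the NCCL process group.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    # Move the model to the GPU before building the optimizer.
    model = model.cuda()
    optimizer = make_optimizer(model.parameters())

    # O2 casts the model to FP16 while keeping batch norm in FP32 and
    # maintaining FP32 master weights.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

    # Apex DDP averages gradients across processes during backward().
    model = DDP(model)
    return model, optimizer


def training_step(model, optimizer, criterion, images, targets):
    """One optimization step with Amp loss scaling (hypothetical helper)."""
    from apex import amp

    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    # Scale the loss so that small FP16 gradients do not underflow.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
    return loss.item()
```

For example, `[global_batch_size(n) for n in gpu_counts()]` reproduces the range quoted above, from 416 images at 2 GPUs to 13312 images at 64 GPUs.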

...