ImageNet Distributed Mixed-precision Training Benchmark
Github repo for all source code and details: https://github.com/richardkxu/distributed-pytorch
Source code and Jupyter notebook tutorial for the key points: https://github.com/richardkxu/distributed-pytorch/blob/master/ddp_apex_tutorial.ipynb
HAL paper: https://dl.acm.org/doi/10.1145/3311790.3396649
Benchmark Results
Training Time: Time to solution during training, with GPU counts ranging from 2 to 64. ImageNet training of ResNet-50 on 2 GPUs takes 20 hr 36 min 51.11 s. With 64 GPUs across 16 compute nodes, we can train ResNet-50 in 1 hr 7 min 51.31 s while maintaining the same top-1 and top-5 accuracy.
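The times above imply a scaling figure worth making explicit. A quick back-of-the-envelope check in plain Python (not part of the benchmark code; the times are copied from the results above):

```python
# Scaling efficiency implied by the reported times-to-solution.
def to_seconds(hrs, mins, secs):
    """Convert an (hours, minutes, seconds) triple to total seconds."""
    return hrs * 3600 + mins * 60 + secs

t_2gpu = to_seconds(20, 36, 51.11)   # 2-GPU time to solution
t_64gpu = to_seconds(1, 7, 51.31)    # 64-GPU time to solution

speedup = t_2gpu / t_64gpu           # measured speedup going 2 -> 64 GPUs
ideal = 64 / 2                       # ideal linear speedup: 32x
efficiency = speedup / ideal         # fraction of linear scaling achieved

print(f"speedup: {speedup:.1f}x, scaling efficiency: {efficiency:.0%}")
# ~18.2x measured speedup vs. 32x ideal, i.e. ~57% scaling efficiency
```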
...
I/O Bandwidth: I/O bandwidth (GB/s) and IOPS of our file system throughout the full-system ImageNet training run on 64 GPUs. Between the 10th and 60th epochs, the average bandwidth is 3.30 GB/s and the average IOPS is 36.5K.
Software Stack
- IBM WMLCE 1.6.2
- Python 3.7
- PyTorch 1.2.0
- NVIDIA Apex 0.1.0
- CUDA 10.1
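With this stack, one training process per GPU is typically started with PyTorch 1.2's `torch.distributed.launch` utility. A sketch of the command for the first of two 4-GPU nodes; the entry-point name `main.py`, the master address, and the port are placeholders, not the repo's actual values:

```shell
# Run on node 0 (the master); on node 1, set --node_rank=1 instead.
# --nproc_per_node spawns one process per local GPU.
python -m torch.distributed.launch \
    --nproc_per_node=4 --nnodes=2 --node_rank=0 \
    --master_addr="10.0.0.1" --master_port=12345 \
    main.py
```

Each spawned process receives its rank via the `--local_rank` argument (or environment variables), which the training script uses to pin itself to one GPU.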
...