ImageNet Distributed Mixed-precision Training

...

Benchmark

References

Source code and tutorial: https://github.com/richardkxu/distributed-pytorch

Overview

We will cover the following training methods for PyTorch:

  • regular, single node, single GPU training
  • torch.nn.DataParallel
  • torch.nn.DistributedDataParallel
  • mixed precision training with NVIDIA Apex
  • TensorBoard logging under distributed training context

We will cover the following use cases:


Software Prerequisites

  • PyTorch 

Benchmark Details

To demonstrate the performance and scalability of our system, we conduct ImageNet training by scaling ResNet-50 [4] across multiple GPUs and multiple compute nodes. We use the official PyTorch implementation of ResNet-50 [8]. We use standard SGD with a momentum of 0.9 and a weight decay of 0.0001. All models are trained for 90 epochs regardless of batch size. We apply the learning rate scaling and gradual warmup described in [2] to address training instability in the early stages of training with large batch sizes. Most of our training setup is consistent with [2] and [11].
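
The sketch below illustrates the optimizer settings and warmup schedule described above: torch.optim.SGD with the stated momentum and weight decay, a linear-scaling rule for the learning rate, and a gradual warmup in the spirit of [2]. The base learning rate, warmup length, and step-decay schedule shown here are illustrative assumptions, not values taken from the benchmark itself.

```python
import torch
import torchvision

# Illustrative values (assumptions), following the scheme of [2]
base_lr = 0.1            # reference LR for a 256-image global batch
warmup_epochs = 5        # length of the gradual warmup
global_batch_size = 416  # e.g. 2 GPUs x 208 images per GPU

# Linear scaling rule: scale LR with the global batch size relative to 256
scaled_lr = base_lr * global_batch_size / 256

model = torchvision.models.resnet50()
optimizer = torch.optim.SGD(model.parameters(),
                            lr=scaled_lr,
                            momentum=0.9,
                            weight_decay=1e-4)

def adjust_learning_rate(optimizer, epoch):
    """Linear warmup for the first few epochs, then step decay every 30 epochs."""
    if epoch < warmup_epochs:
        lr = scaled_lr * (epoch + 1) / warmup_epochs
    else:
        lr = scaled_lr * (0.1 ** (epoch // 30))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
```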


Each of our compute nodes has 4 NVIDIA Volta V100 GPUs. We scale ResNet-50 from 2 GPUs on a single compute node to 64 GPUs across 16 compute nodes, doubling the number of GPUs for each intermediate run. We use a per-GPU batch size of 208 images, which is the largest batch size we can fit with distributed mixed-precision training. Our global batch size therefore ranges from 416 images to 13,312 images. We use Automatic Mixed Precision (Amp) and Distributed Data Parallel (DDP) from NVIDIA Apex for mixed-precision and distributed training. An optimization level of "O2" is used for mixed-precision training to benefit from FP16 training while keeping selected parameters in FP32. We use the NVIDIA Collective Communication Library (NCCL) as our distributed backend.
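
The following is a minimal sketch of the Amp "O2" and DDP setup described above, using the public NVIDIA Apex API (apex.amp and apex.parallel.DistributedDataParallel) with one process per GPU and the NCCL backend. The way the local rank is obtained here assumes a launcher that exports LOCAL_RANK (e.g. torchrun); the actual tutorial code may handle this differently.

```python
import os
import torch
import torch.distributed as dist
import torchvision
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

# One process per GPU; LOCAL_RANK is assumed to be set by the launcher (e.g. torchrun)
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# NCCL backend for GPU-to-GPU collectives; env:// reads MASTER_ADDR/PORT, RANK, WORLD_SIZE
dist.init_process_group(backend='nccl', init_method='env://')

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# "O2" casts most ops to FP16 while keeping master weights and batch norm in FP32
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')

# Apex DDP handles gradient all-reduce across processes
model = DDP(model, delay_allreduce=True)

# Inside the training loop, the loss is scaled through Amp before backward:
# with amp.scale_loss(loss, optimizer) as scaled_loss:
#     scaled_loss.backward()
# optimizer.step()
```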

...