Effective Distributed Training

In this topic guide, we focus on effective techniques for distributed training. Before reading this guide, we recommend that users first read the Distributed Training how-to guide, which describes how to configure distributed training, and the Optimizing Distributed Training topic guide, which describes the optimizations available in Determined for distributed training.

In this topic guide, we will cover:

  • How distributed training in Determined works.

  • Reducing computation and communication overheads.

  • How to train effectively with large batch sizes.

  • Model characteristics that affect the performance of distributed training.

  • Debugging performance bottlenecks in distributed training.

How Distributed Training in Determined Works

Distributed training in Determined utilizes data-parallelism. Data-parallelism for deep-learning consists of a set of workers, where each worker is assigned to a unique compute accelerator such as a GPU or a TPU. Each worker maintains a copy of the model parameters (weights that are being trained), which is synchronized across all the workers at the start of training.

After initialization is completed, distributed training in Determined follows a loop where:

  1. Every worker performs a forward and backward pass on a unique mini-batch of data.

  2. As the result of the backward pass, every worker generates a set of updates to the model parameters based on the data it processed.

  3. The workers communicate their updates to each other, so that all the workers see all the updates made during that batch.

  4. Every worker averages the updates by the number of workers.

  5. Every worker applies the updates to its copy of the model parameters, resulting in all the workers having identical solution states.

  6. Return to step 1.

Reducing Computation and Communication Overheads

Of the steps involved in the distributed training loop in Determined, which are described above, step 1 and step 2 introduce the majority of the computational overhead. To reduce computational overhead, it’s recommended that users maximize the utilization of their GPU. This is typically done by using the largest batch size that fits into memory. When performing distributed training, to reduce the computational overhead it’s recommended to set the global_batch_size to the largest batch size that fits into a single GPU * number of slots. This is commonly referred to as weak scaling.

Step 3 of the distributed training loop in Determined introduces the majority of the communication overhead. Because deep learning models typically perform dense updates, where every model parameter is updated for every training sample, batch size does not affect how long it takes workers to communicate updates. However, increasing global_batch_size does reduce the required number of passes through the training loop, thus reducing the total communication overhead.

Determined optimizes the communication in step 3 by using an efficient form of ring all-reduce, which minimizes the amount of communication necessary for all the workers to communicate their updates. Determined also reduces the communication overhead by overlapping computation (step 1 & step 2) and communication (step 3) by communicating updates for deeper layers concurrently with computing updates for the shallower layers. The Optimizing Distributed Training topic guide covers additional optimizations available in Determined for reducing the communication overhead.

How to Train Effectively with Large Batch Sizes

To improve the performance of distributed training, we recommend using the largest possible global_batch_size, setting it to be largest batch size that fits into a single GPU * number of slots. However, training with a large global_batch_size can have adverse effects on the convergence (accuracy) of the model. At Determined AI we have found several effective techniques for training with large batch sizes:

  • Starting with the original learning rate used for a single GPU and gradually increasing it to number of slots * original learning rate throughout the first several epochs. For more details, see Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.

  • Using custom optimizers designed for large batch training, such as RAdam, LARS, or LAMB. We have found RAdam especially effective.

These techniques often require hyperparameter modifications. To automate this process, we encourage users to utilize the Hyperparameter Tuning capabilities in Determined.

Model Characteristics that Affect Performance of Distributed Training

Deep learning models typically perform dense updates, where every model parameter is updated for every training sample. Because of this, the amount of communication per mini-batch (step 3 in the distributed training loop) is dependent on the number of model parameters. Models that have fewer parameters such as ResNet-50 (~30 million parameters) train more efficiently in distributed settings than models with more parameters such as VGG-16 (~136 million parameters). If planning to utilize distributed training, we encourage users to be mindful of their model size when designing models.

Debugging Performance Bottlenecks

When scaling up distributed training, it’s fairly common to see non-linear speed up when scaling from one machine to two machines as intra-machine communication (e.g., NVLink) is often significantly faster than inter-machine communication. Scaling up beyond two machines often provides close to linear speed-up, but it does vary depending on the model characteristics. If observing unexpected scaling performance, assuming you have scaled your global_batch_size proportionally with slots_per_trial, it’s possible that training performance is being bottlenecked by network communication or disk I/O.

To check if your training is bottlenecked by communication, we suggest setting optimizations.aggregation_frequency in the Experiment Configuration to a very large number (e.g., 1000). This setting results in communicating updates once every 1000 batches. Comparing throughput with aggregation_frequency of 1 vs. aggregation_frequency of 1000 will demonstrate the communication overhead. If you do observe significant communication overhead, refer to Optimizing Distributed Training for guidance on how to optimize communication.

To check if your training is I/O bottlenecked, we encourage users to experiment with using synthetic datasets. If you observe that I/O is a significant bottleneck, we suggest optimizing the data input pipeline to the model (e.g., copy training data to local SSDs).