Optimizing Distributed Training¶
The Distributed Training how-to guide describes how to configure distributed training. In this topic guide, we will describe how to optimize distributed training, focusing on the following topics:
setting batch size,
multi-machine training, and
configuring advanced optimizations.
Determined uses synchronous data parallelism for distributed training.
Setting Global Batch Size¶
When doing distributed training, the
global_batch_size specified in
the Experiment Configuration is partitioned across
slots_per_trial GPUs. The per-GPU batch size is set to:
not divide the
global_batch_size evenly, the batch size is rounded
down. For convenience, the per-GPU batch size can be accessed via the
Trial API, using
For improved performance, we recommend weak-scaling: increasing your
global_batch_size proportionally with
global_batch_size of 32 for
slots_per_trial of 1 to
global_batch_size of 128 for
slots_per_trial of 4).
global_batch_size can affect your model convergence, which
can affect your training and/or testing accuracy. You may need to adjust
model hyperparameters like the learning rate and/or use a different
optimizer when training with larger batch sizes.
When getting started using multiple GPUs to train a model, we recommend starting by using multiple GPUs on a single machine first, before proceeding to multi-machine training. Communication between GPUs on a single machine is typically significantly faster than communication between GPUs on different machines.
To use single-machine multi-GPU training, set the following field in the experiment configuration file:
resources: slots_per_trial: N
where N is any number less than or equal to the number of GPUs on an agent machine. In this configuration, trials will train using some or all the resources on a single machine.
Multi-machine parallelism offers the ability to further parallelize training across more GPUs. For multi-machine training, in the experiment configuration, add:
resources: slots_per_trial: M
where M is a multiple of the total number of GPUs on an agent machine. For example, if your cluster consists of 8-GPU agent machines, valid values for M would be 16, 24, 32, etc. In this configuration, trials will use all the resources of multiple machines to train a model.
For distributed multi-machine training, Determined automatically
detects a common network interface shared by the agent machines. If
your cluster has multiple common network interfaces, please specify
the fastest one in Cluster Configuration under
Determined supports several optimizations to further reduce training
time. These optimizations are available in
Experiment Configuration under
optimizations.aggregation_frequencycontrols how many batches are evaluated before exchanging gradients. It is helpful in situations where it is not possible to increase the batch size directly (e.g., due to GPU memory limitations). This optimization increases your effective batch size to
optimizations.gradient_compressionreduces the time it takes to transfer gradients between GPUs.
optimizations.auto_tune_tensor_fusionautomatically identifies the optimal message size during gradient transfers, reducing communication overhead.
optimizations.average_training_metricsaverages the training metrics across GPUs at the end of every training workload, which requires communication. This will typically not have a major impact on training performance, but if you have a very small
scheduling_unit, ensuring it is disabled may improve performance. If this option is disabled (which is the default behavior), only the training metrics from the chief GPU are used. This impacts shown in the Determined UI and TensorBoard, but does not influence model behavior or hyperparameter search.
If you do not see improved performance using distributed training, there might be a performance bottleneck in the model that cannot be directly alleviated by using multiple GPUs, e.g., data loading. We suggest experimenting with a synthetic dataset to verify the performance of multi-GPU training.
Multi-machine distributed training is designed to maximize performance by training with all the resources of a machine. This can lead to situations where an experiment is created but never becomes active: if the number of GPUs requested does not divide into the machines available, for instance, or if another experiment is already using some GPUs on a machine.
If an experiment does not become active after a minute or so, please
slots_per_trial is a multiple of the number of GPUs
available on a machine. You can also use the CLI command
list to check if any other tasks are using GPUs and preventing your
experiment from using all the GPUs on a machine.