Optimizing Distributed Training¶
The Distributed Training how-to guide describes how to configure distributed training. In this topic guide, we will describe how to optimize distributed training, focusing on the following topics:
setting batch size,
single-machine training,
multi-machine training, and
configuring advanced optimizations.
Determined uses synchronous data parallelism for distributed training.
Setting Global Batch Size¶
When doing distributed training, the global_batch_size specified in the Experiment Configuration is partitioned across slots_per_trial GPUs. The per-GPU batch size is set to global_batch_size / slots_per_trial. If slots_per_trial does not divide the global_batch_size evenly, the per-GPU batch size is rounded down. For convenience, the per-GPU batch size can be accessed via the Trial API, using context.get_per_slot_batch_size.
For improved performance, we recommend weak-scaling: increasing your global_batch_size proportionally with slots_per_trial (e.g., changing a global_batch_size of 32 for a slots_per_trial of 1 to a global_batch_size of 128 for a slots_per_trial of 4).
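For example, the following experiment configuration snippet (the values are illustrative, and assume global_batch_size is defined under hyperparameters, as is typical) trains with four GPUs at a per-GPU batch size of 128 / 4 = 32:
resources:
  slots_per_trial: 4
hyperparameters:
  global_batch_size: 128
With these settings, context.get_per_slot_batch_size would return 32.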
Adjusting global_batch_size can affect your model convergence, which can affect your training and/or testing accuracy. You may need to adjust model hyperparameters like the learning rate and/or use a different optimizer when training with larger batch sizes.
Single-Machine Training¶
When getting started with multi-GPU training, we recommend using multiple GPUs on a single machine before proceeding to multi-machine training. Communication between GPUs on a single machine is typically significantly faster than communication between GPUs on different machines.
To use single-machine multi-GPU training, set the following field in the experiment configuration file:
resources:
  slots_per_trial: N
where N is any number less than or equal to the number of GPUs on an agent machine. In this configuration, trials will train using some or all the resources on a single machine.
Multi-Machine Training¶
Multi-machine parallelism offers the ability to further parallelize training across more GPUs. For multi-machine training, in the experiment configuration, add:
resources:
  slots_per_trial: M
where M is a multiple of the total number of GPUs on an agent machine. For example, if your cluster consists of 8-GPU agent machines, valid values for M would be 16, 24, 32, etc. In this configuration, trials will use all the resources of multiple machines to train a model.
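For example, on a cluster of 8-GPU agent machines, the following (illustrative) setting trains a single trial across two machines:
resources:
  slots_per_trial: 16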
Warning
For distributed multi-machine training, Determined automatically detects a common network interface shared by the agent machines. If your cluster has multiple common network interfaces, please specify the fastest one in Cluster Configuration under task_container_defaults.dtrain_network_interface.
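For instance, a sketch of the relevant part of a cluster (master) configuration might look like the following, where the interface name ens5 is a placeholder for the fastest interface shared by your agents:
task_container_defaults:
  dtrain_network_interface: ens5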
Advanced Optimizations¶
Determined supports several optimizations to further reduce training time. These optimizations are available in the Experiment Configuration under optimizations; an example configuration is shown below.
optimizations.aggregation_frequency controls how many batches are evaluated before exchanging gradients. It is helpful in situations where it is not possible to increase the batch size directly (e.g., due to GPU memory limitations). This optimization increases your effective batch size to aggregation_frequency * global_batch_size.
optimizations.gradient_compression reduces the time it takes to transfer gradients between GPUs.
optimizations.auto_tune_tensor_fusion automatically identifies the optimal message size during gradient transfers, reducing communication overhead.
optimizations.average_training_metrics averages the training metrics across GPUs at the end of every training workload, which requires communication. This will typically not have a major impact on training performance, but if you have a very small scheduling_unit, ensuring it is disabled may improve performance. If this option is disabled (which is the default behavior), only the training metrics from the chief GPU are used. This affects the metrics shown in the Determined UI and TensorBoard, but does not influence model behavior or hyperparameter search.
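As a sketch, an experiment configuration enabling several of these optimizations might look like the following (the values are illustrative, not recommendations):
optimizations:
  aggregation_frequency: 4
  gradient_compression: true
  auto_tune_tensor_fusion: true
  average_training_metrics: true
With an aggregation_frequency of 4, gradients are exchanged every four batches, so the effective batch size is 4 * global_batch_size.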
If you do not see improved performance using distributed training, there might be a performance bottleneck in the model that cannot be directly alleviated by using multiple GPUs, e.g., data loading. We suggest experimenting with a synthetic dataset to verify the performance of multi-GPU training.
Warning
Multi-machine distributed training is designed to maximize performance by training with all the resources of a machine. This can lead to situations where an experiment is created but never becomes active: for instance, if the number of GPUs requested is not a multiple of the number of GPUs per machine, or if another experiment is already using some GPUs on a machine.
If an experiment does not become active after a minute or so, please confirm that slots_per_trial is a multiple of the number of GPUs available on a machine. You can also use the CLI command det task list to check if any other tasks are using GPUs and preventing your experiment from using all the GPUs on a machine.