Optimizing Distributed and Parallel Training¶
The Distributed and Parallel Training how-to guide describes how to configure distributed and parallel training. In this topic guide, we will describe how to optimize distributed and parallel training, focusing on the following topics:
setting batch size,
single-machine parallel training,
multi-machine distributed training, and
configuring advanced optimizations.
Determined uses synchronous data parallelism for distributed and parallel training.
Setting Global Batch Size¶
When doing distributed and parallel training, the global_batch_size specified in the Experiment Configuration is partitioned across slots_per_trial GPUs. The per-GPU batch size is set to global_batch_size / slots_per_trial. If slots_per_trial does not divide global_batch_size evenly, the per-GPU batch size is rounded down. For convenience, the per-GPU batch size can be accessed via the Trial API, using context.get_per_slot_batch_size.
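For example, assuming global_batch_size is defined as a hyperparameter (as in a typical experiment configuration), the following illustrative snippet yields a per-GPU batch size of 128 / 8 = 16, which is the value context.get_per_slot_batch_size would return:
hyperparameters:
  global_batch_size: 128
resources:
  slots_per_trial: 8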
For improved performance, we recommend weak-scaling: increasing your global_batch_size proportionally with slots_per_trial (e.g., change a global_batch_size of 32 for slots_per_trial of 1 to a global_batch_size of 128 for slots_per_trial of 4).
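A sketch of the weak-scaling example above as a configuration change (the values are illustrative):
resources:
  slots_per_trial: 4
hyperparameters:
  global_batch_size: 128   # increased 4x from 32, matching the 4x increase in slots_per_trial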
Adjusting global_batch_size can affect your model convergence, which can affect your training and/or testing accuracy. You may need to adjust model hyperparameters like the learning rate and/or use a different optimizer when training with larger batch sizes.
Single-Machine Parallelism¶
When using multi-GPU training in Determined, we recommend first trying single-machine parallel training, then multi-machine distributed training. For single-machine parallel training, in the experiment configuration, add:
resources:
  slots_per_trial: N
where N is the total number of GPUs on an agent machine. In this configuration, trials will train using all the resources on a single machine.
Distributed Multi-Machine Parallelism¶
Multi-machine parallelism offers the ability to further parallelize training across more GPUs. For distributed multi-machine training, in the experiment configuration, add:
resources:
  slots_per_trial: M
where M is a multiple of the total number of GPUs on an agent machine. For example, if your cluster consists of 8-GPU agent machines, valid values for M would be 16, 24, 32, etc. In this configuration, trials will use all the resources of multiple machines to train a model.
Warning
For distributed multi-machine training, Determined automatically detects a common network interface shared by the agent machines. If your cluster has multiple common network interfaces, please specify the fastest one in Cluster Configuration under task_container_defaults.dtrain_network_interface.
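For reference, a minimal sketch of the corresponding cluster configuration entry; the interface name ens5 is purely illustrative and should be replaced with the fastest interface shared by your agents:
task_container_defaults:
  dtrain_network_interface: ens5   # illustrative interface name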
Configuring Advanced Optimizations¶
Determined offers several optimizations to further reduce training time. These optimizations are available in the Experiment Configuration under optimizations.
The optimizations.aggregation_frequency option controls how many batches are processed before exchanging gradients and is helpful in situations where it is not possible to increase the batch size directly. This optimization increases your effective batch size to aggregation_frequency * global_batch_size.
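For example, with an aggregation_frequency of 4 and a global_batch_size of 32, gradients are exchanged once every 4 batches, for an effective batch size of 4 * 32 = 128.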
The optimizations.gradient_compression option reduces the time it takes to transfer gradients between GPUs.
The optimizations.auto_tune_tensor_fusion option automatically identifies the optimal message size during gradient transfers, reducing communication overhead.
The optimizations.average_training_metrics option averages the training metrics across GPUs at the end of every step, which requires communication. Typically this will not have an impact on training performance, but if you have a very small batches_per_step, ensuring it is disabled may improve performance.
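Putting these together, a hedged sketch of an optimizations section in the experiment configuration (the values shown are illustrative starting points, not recommendations):
optimizations:
  aggregation_frequency: 4          # effective batch size = 4 * global_batch_size
  gradient_compression: true        # compress gradients before transferring them between GPUs
  auto_tune_tensor_fusion: true     # automatically tune the gradient transfer message size
  average_training_metrics: false   # skip per-step metric averaging across GPUs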
If you do not see improved performance, there might be a performance bottleneck in the model that cannot be directly alleviated by using multiple GPUs, e.g., data loading. We suggest experimenting with a synthetic dataset to verify performance of multi-GPU training.
Warning
Distributed and parallel training is designed to maximize performance by training with all the resources of a machine. This can lead to situations where an experiment is created but never becomes active: for instance, if the number of GPUs requested does not divide evenly across the machines available, or if another experiment is already using some GPUs on a machine.
If a multi-GPU experiment does not become active after a minute or so, please confirm that slots_per_trial is a multiple of the number of GPUs available on a machine. You can also use the CLI command det task list to check whether any other tasks are using GPUs and preventing your experiment from using all the GPUs on a machine.