Shortcuts

Optimizing Distributed Training

The Distributed Training how-to guide describes how to configure distributed training. In this topic guide, we will describe how to optimize distributed training, focusing on the following topics:

  • setting batch size,

  • single-machine training,

  • multi-machine training, and

  • configuring advanced optimizations.

Determined uses synchronous data parallelism for distributed training.

Setting Global Batch Size

When doing distributed training, the global_batch_size specified in the Experiment Configuration is partitioned across slots_per_trial GPUs. The per-GPU batch size is set to: global_batch_size / slots_per_trial. If slots_per_trial does not divide the global_batch_size evenly, the batch size is rounded down. For convenience, the per-GPU batch size can be accessed via the Trial API, using context.get_per_slot_batch_size.

For improved performance, we recommend weak-scaling: increasing your global_batch_size proportionally with slots_per_trial (e.g., change global_batch_size of 32 for slots_per_trial of 1 to global_batch_size of 128 for slots_per_trial of 4).

Adjusting global_batch_size can affect your model convergence, which can affect your training and/or testing accuracy. You may need to adjust model hyperparameters like the learning rate and/or use a different optimizer when training with larger batch sizes.

Single-Machine Training

When getting started using multiple GPUs to train a model, we recommend starting by using multiple GPUs on a single machine first, before proceeding to multi-machine training. Communication between GPUs on a single machine is typically significantly faster than communication between GPUs on different machines.

To use single-machine multi-GPU training, set the following field in the experiment configuration file:

resources:
  slots_per_trial: N

where N is the total number of GPUs on an agent machine. In this configuration, trials will train using all the resources on a single machine.

Multi-Machine Training

Multi-machine parallelism offers the ability to further parallelize training across more GPUs. For multi-machine training, in the experiment configuration, add:

resources:
  slots_per_trial: M

where M is a multiple of the total number of GPUs on an agent machine. For example, if your cluster consists of 8-GPU agent machines, valid values for M would be 16, 24, 32, etc. In this configuration, trials will use all the resources of multiple machines to train a model.

Warning

For distributed multi-machine training, Determined automatically detects a common network interface shared by the agent machines. If your cluster has multiple common network interfaces, please specify the fastest one in Cluster Configuration under task_container_defaults.dtrain_network_interface.

Advanced Optimizations

Determined supports several optimizations to further reduce training time. These optimizations are available in Experiment Configuration under optimizations.

  • optimizations.aggregation_frequency controls how many batches are evaluated before exchanging gradients. It is helpful in situations where it is not possible to increase the batch size directly (e.g., due to GPU memory limitations). This optimization increases your effective batch size to aggregation_frequency * global_batch_size.

  • optimizations.gradient_compression reduces the time it takes to transfer gradients between GPUs.

  • optimizations.auto_tune_tensor_fusion automatically identifies the optimal message size during gradient transfers, reducing communication overhead.

  • optimizations.average_training_metrics averages the training metrics across GPUs at the end of every step, which requires communication. This will typically not have a major impact on training performance, but if you have a very small batches_per_step, ensuring it is disabled may improve performance. If this option is disabled (which is the default behavior), only the training metrics from the chief GPU are used. This impacts shown in the Determined UI and TensorBoard, but does not influence model behavior or hyperparameter search.

If you do not see improved performance using distributed training, there might be a performance bottleneck in the model that cannot be directly alleviated by using multiple GPUs, e.g., data loading. We suggest experimenting with a synthetic dataset to verify the performance of multi-GPU training.

Warning

Distributed training is designed to maximize performance by training with all the resources of a machine. This can lead to situations where an experiment is created but never becomes active: if the number of GPUs requested does not divide into the machines available, for instance, or if another experiment is already using some GPUs on a machine.

If a multi-GPU experiment does not become active after a minute or so, please confirm that slots_per_trial is a multiple of the number of GPUs available on a machine. You can also use the CLI command det task list to check if any other tasks are using GPUs and preventing your experiment from using all the GPUs on a machine.