Optimizing Distributed and Parallel Training

The Distributed and Parallel Training how-to guide describes how to configure distributed and parallel training. This topic guide describes how to optimize that training, focusing on the following topics:

  • setting batch size,

  • single-machine parallel training,

  • multi-machine distributed training, and

  • configuring advanced optimizations.

Determined uses synchronous data parallelism for distributed and parallel training.

Setting Global Batch Size

When doing distributed and parallel training, the global_batch_size specified in the Experiment Configuration is partitioned across slots_per_trial GPUs. The per-GPU batch size is set to global_batch_size / slots_per_trial. If slots_per_trial does not divide global_batch_size evenly, the per-GPU batch size is rounded down. For convenience, the per-GPU batch size can be accessed via the Trial API, using context.get_per_slot_batch_size().
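For example, with the purely illustrative values below (assuming, as is typical, that global_batch_size is defined under hyperparameters), 3 does not divide 32 evenly, so each GPU receives a batch of 10:

resources:
  slots_per_trial: 3
hyperparameters:
  global_batch_size: 32  # per-GPU batch size: 32 / 3, rounded down to 10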

For improved performance, we recommend weak scaling: increasing your global_batch_size proportionally with slots_per_trial (e.g., if global_batch_size is 32 with slots_per_trial: 1, use global_batch_size: 128 with slots_per_trial: 4).
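As a minimal sketch of weak scaling using the same illustrative numbers, both configurations below keep the per-GPU batch size fixed at 32 while the second uses four times the GPUs:

# Before: 1 GPU, per-GPU batch size 32
resources:
  slots_per_trial: 1
hyperparameters:
  global_batch_size: 32

# After weak scaling: 4 GPUs, per-GPU batch size still 32
resources:
  slots_per_trial: 4
hyperparameters:
  global_batch_size: 128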

Adjusting global_batch_size can affect model convergence, which in turn can affect training and/or test accuracy. You may need to adjust hyperparameters such as the learning rate, or use a different optimizer, when training with larger batch sizes.
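One common heuristic (not a rule that is guaranteed to preserve accuracy) is to scale the learning rate linearly with the global batch size. Assuming a model that exposes a hyperparameter named learning_rate (a hypothetical name; your model may differ), that might look like:

hyperparameters:
  global_batch_size: 128  # 4x the original batch size of 32
  learning_rate: 0.4      # hypothetical: 4x an original learning rate of 0.1, per the linear scaling heuristic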

Single-Machine Parallelism

When using multi-GPU training in Determined, we recommend first trying single-machine parallel training, then multi-machine distributed training. For single-machine parallel training, add the following to the experiment configuration:

resources:
  slots_per_trial: N

where N is the total number of GPUs on an agent machine. In this configuration, trials will train using all the resources on a single machine.
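For example, if each agent machine in the cluster has 4 GPUs, single-machine parallel training would use:

resources:
  slots_per_trial: 4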

Distributed Multi-Machine Parallelism

Multi-machine parallelism further parallelizes training across the GPUs of multiple machines. For distributed multi-machine training, add the following to the experiment configuration:

resources:
  slots_per_trial: M

where M is a multiple of the total number of GPUs on an agent machine. For example, if your cluster consists of 8-GPU agent machines, valid values for M would be 16, 24, 32, etc. In this configuration, trials will use all the resources of multiple machines to train a model.

Warning

For distributed multi-machine training, Determined automatically detects a common network interface shared by the agent machines. If your cluster has multiple common network interfaces, please specify the fastest one in Cluster Configuration under task_container_defaults.dtrain_network_interface.
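For example, to pin distributed training traffic to an interface named eth1 (a hypothetical name; substitute the fastest interface on your machines), the master configuration would include:

task_container_defaults:
  dtrain_network_interface: eth1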

Configuring Advanced Optimizations

Determined offers several optimizations to further reduce training time. These optimizations are available in Experiment Configuration under optimizations.

The optimizations.aggregation_frequency option controls how many batches are evaluated before gradients are exchanged, which is helpful in situations where it is not possible to increase the batch size directly. This optimization increases your effective batch size to aggregation_frequency * global_batch_size.
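As a sketch with illustrative values, the configuration below accumulates gradients over 4 batches, for an effective batch size of 4 * 32 = 128:

optimizations:
  aggregation_frequency: 4  # exchange gradients only every 4 batches
hyperparameters:
  global_batch_size: 32     # effective batch size: 4 * 32 = 128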

The optimizations.gradient_compression option reduces the time it takes to transfer gradients between GPUs. The optimizations.auto_tune_tensor_fusion option automatically identifies the optimal message size during gradient transfers, reducing communication overhead. The optimizations.average_training_metrics option averages the training metrics across GPUs at the end of every step, which requires additional communication. This typically does not affect training performance, but if batches_per_step is very small, making sure this option is disabled may improve performance.
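As an illustrative sketch (the values here are examples, not recommendations), an optimizations section enabling these options might look like:

optimizations:
  gradient_compression: true       # reduce gradient transfer time between GPUs
  auto_tune_tensor_fusion: true    # automatically tune the message size for gradient transfers
  average_training_metrics: false  # disable cross-GPU metric averaging to avoid extra communication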

If you do not see improved performance, there might be a performance bottleneck in the model that cannot be directly alleviated by using multiple GPUs, such as data loading. We suggest experimenting with a synthetic dataset to verify the performance of multi-GPU training.

Warning

Distributed and parallel training is designed to maximize performance by training with all the resources of a machine. This can lead to situations where an experiment is created but never becomes active: for example, if slots_per_trial is not a multiple of the number of GPUs per machine, or if another experiment is already using some of the GPUs on a machine.

If a multi-GPU experiment does not become active after a minute or so, confirm that slots_per_trial is a multiple of the number of GPUs available on a machine. You can also use the CLI command det task list to check whether any other tasks are using GPUs and preventing your experiment from using all the GPUs on a machine.