Multi-GPU Training

PEDL provides three main methods for taking advantage of multiple GPUs to accelerate your model training:

  1. Scheduling multiple experiments at once: more than one experiment can proceed in parallel if there are enough GPUs available.

  2. Scheduling multiple trials of an experiment at once: a hyperparameter search may run more than one trial at once, each of which will use its own GPUs.

  3. Using multiple GPUs to speed up the training of a trial: using data parallelism, PEDL can coordinate multiple GPUs to improve the performance of training a single trial.

This part of the documentation describes the last method: how to use PEDL to train a trial faster by using multiple GPUs at once.

The behavior of multi-GPU training is controlled by resources.slots_per_trial in the Experiment Configuration. This field defaults to 1, which trains each trial with a single GPU.

The field slots_per_trial indicates the number of GPUs to use to train a trial. These GPUs may be on a single agent machine or across multiple machines.

To use multi-GPU training, slots_per_trial must be set to a multiple of the number of GPUs per machine. For example, if you have a cluster of 8-GPU machines, you should set slots_per_trial to 8, 16, 24, etc. This ensures that the full network and interconnect bandwidth of each machine is available to the multi-GPU workload, which results in better performance.
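For example, on a cluster of 8-GPU agent machines, a minimal sketch of the relevant configuration for training a trial with two full machines (the value 16 is purely illustrative) would be:

resources:
  slots_per_trial: 16  # two full 8-GPU agent machines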

Warning

Multi-GPU training is designed to maximize performance by training with all the resources of a machine. This can lead to situations where an experiment is created but never becomes active: if the number of GPUs requested does not divide into the machines available, for instance, or if another experiment is already using some GPUs on a machine.

If a multi-GPU experiment does not become active after a minute or so, confirm that slots_per_trial is a multiple of the number of GPUs available on a machine. You can also use the CLI command pedl task list to check whether other tasks are using GPUs and preventing your experiment from using all the GPUs on a machine.

When using multi-GPU training, we recommend first trying the following two configurations: (1) single machine, maximum GPUs, and (2) multiple machines, maximum GPUs.

Single machine, maximum GPUs. In the experiment configuration, add

resources:
  slots_per_trial: N

where N is the total number of GPUs on an agent machine. In this configuration, trials will train using PEDL-optimized execution and will use all the resources on a single machine.

Note

Multi-GPU training works by partitioning a training batch across GPUs. If slots_per_trial is greater than the batch size, there will be no benefit from additional GPUs.

If slots_per_trial does not divide the batch size evenly, the model metrics may differ from what would be obtained when training the model with a single GPU.

For best performance, slots_per_trial should divide the batch size evenly.
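As a concrete illustration, the sketch below pairs slots_per_trial with a batch size that divides evenly across the GPUs. The hyperparameter name global_batch_size is an assumption for illustration only; use whichever batch-size hyperparameter your model actually defines.

resources:
  slots_per_trial: 8
hyperparameters:
  global_batch_size: 128  # hypothetical hyperparameter name; 128 / 8 = 16 examples per GPU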

If you do not observe any reduction in the time to train a trial, you can try increasing the batch size or evaluating options under optimizations in the Experiment Configuration.

The configuration option optimizations.aggregation_frequency controls how many batches are evaluated before gradients are exchanged; increasing it is helpful when it is not possible to increase the batch size directly. The option optimizations.gradient_compression can reduce the time it takes to transfer gradients between GPUs.
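For instance, a sketch of these options in the experiment configuration might look like the following; the specific values, and the boolean form of gradient_compression, are illustrative assumptions rather than recommendations:

optimizations:
  aggregation_frequency: 4    # assumed value: exchange gradients once every 4 batches
  gradient_compression: true  # assumed boolean form: compress gradients before transfer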

Changing any of the above options can also affect model convergence, so you may need to adjust model hyperparameters, such as the learning rate, and/or use a different optimizer to compensate.
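For example, with an aggregation_frequency of 4, each weight update reflects gradients from four batches, so the effective batch size per update is four times larger. A common framework-agnostic heuristic (not a PEDL-specific rule) is to scale the learning rate roughly linearly with the effective batch size; the hyperparameter name learning_rate below is hypothetical and depends on how your model defines its hyperparameters.

optimizations:
  aggregation_frequency: 4
hyperparameters:
  learning_rate: 0.4  # hypothetical name; ~4 x a baseline rate of 0.1, per the linear scaling heuristic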

If you still see no improvement in performance, there might be a performance bottleneck in the model that cannot be directly alleviated by using multiple GPUs. We suggest confirming that the model follows best practices for performance according to the model framework (e.g., TensorFlow guide) and also profiling the code (e.g., PyTorch bottleneck guide).

If you do observe a performance improvement from using multiple GPUs on a single machine, you may also want to try the multiple machines, maximum GPUs configuration. Set the following fields in your experiment configuration:

resources:
  slots_per_trial: M

where M is a small multiple of the total number of GPUs on an agent machine (typically 2 to 8 times that number). For example, if your cluster consists of 4-GPU agent machines, reasonable values for M would be 8, 12, 16, etc. In this configuration, trials will use all the resources of multiple machines to train a model.