Distributed and Parallel Training
Determined provides three main methods to accelerate your model training by taking advantage of multiple machines and multiple GPUs:
Parallelism across experiments. Schedule multiple experiments at once: more than one experiment can proceed in parallel if there are enough machines and GPUs available.
Parallelism within an experiment. Schedule multiple trials of an experiment at once: a hyperparameter search can run more than one trial at a time, each of which uses its own GPUs.
Parallelism within a trial. Use multiple GPUs, possibly on multiple machines, to speed up the training of a single trial via data parallelism. Determined can coordinate across multiple GPUs on a single machine (parallel training) or across multiple GPUs on multiple machines (distributed training) to improve the performance of training a single trial.
This how-to will focus on the third approach, demonstrating how to perform distributed or parallel training with Determined to speed up the training of a single trial.
In the Experiment Configuration, the resources.slots_per_trial option controls multi-GPU behavior. The default value is 1, which means that a single GPU will be used to train a trial. The slots_per_trial field indicates the number of GPUs used to train a single trial; these GPUs may be on a single agent machine or spread across multiple machines.
Note
When the slots_per_trial option is changed, the per-slot batch size is set to global_batch_size // slots_per_trial. The per-slot (per-GPU) and global batch sizes should be accessed via the context using context.get_per_slot_batch_size() and context.get_global_batch_size(), respectively. If global_batch_size is not evenly divisible by slots_per_trial, the remainder is dropped.
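As a concrete illustration, the sketch below shows how a trial might read these values. It assumes a PyTorch-based trial; the class name and dataset are placeholders, and the other methods a complete trial must implement are omitted:

from determined.pytorch import DataLoader, PyTorchTrial, PyTorchTrialContext

class MyTrial(PyTorchTrial):  # hypothetical trial class, for illustration only
    def __init__(self, context: PyTorchTrialContext) -> None:
        self.context = context
        # Per-GPU batch size: global_batch_size // slots_per_trial (remainder dropped).
        self.per_slot_batch_size = self.context.get_per_slot_batch_size()
        # Global batch size across all slots, as set in the experiment configuration.
        self.global_batch_size = self.context.get_global_batch_size()

    def build_training_data_loader(self) -> DataLoader:
        # Each slot (GPU) loads batches of the per-slot size; Determined aggregates
        # gradients across slots so the effective batch size is the global one.
        return DataLoader(my_train_dataset, batch_size=self.per_slot_batch_size)

    # build_validation_data_loader(), train_batch(), and evaluate_batch() omitted.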
Example configuration with distributed or parallel training:
resources:
  slots_per_trial: N
To use distributed or parallel training, slots_per_trial must be set to a multiple of the number of GPUs per machine. For example, if each machine in your cluster has 8 GPUs, you should set slots_per_trial to a multiple of 8, such as 8, 16, or 24. This ensures that the full network and interconnect bandwidth is available to multi-GPU workloads and results in better performance.
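For instance, on a cluster of 8-GPU machines, a configuration like the following (the value 16 is illustrative) would train a single trial across two machines:

resources:
  slots_per_trial: 16  # 2 machines x 8 GPUs each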
Warning
Distributed and parallel training is designed to maximize performance by training with all the resources of a machine. This can lead to situations where an experiment is created but never starts running on the cluster: for example, if the number of GPUs requested is not a multiple of the number of GPUs per machine, the experiment may never be scheduled. Similarly, if a task is already running on a multi-GPU machine and using one or more of its GPUs, that will prevent a distributed training job from starting on that machine.
If a multi-GPU experiment does not become active after a minute or so, please confirm that slots_per_trial is a multiple of the number of GPUs available on a machine. You can also use the CLI command det task list to check whether any other tasks are using GPUs and preventing your experiment from using all the GPUs on a machine.
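For instance, from any machine with the Determined CLI installed, you can run the command mentioned above (the exact output format depends on your Determined version):

# List tasks currently running on the cluster to see which are occupying GPUs:
det task list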