Distributed and Parallel Training
Determined provides three main methods to accelerate your model training by taking advantage of multiple machines and multiple GPUs:
Parallelism across experiments. Schedule multiple experiments at once: more than one experiment can proceed in parallel if there are enough machines and GPUs available.
Parallelism within an experiment. Schedule multiple trials of an experiment at once: a hyperparameter search may run more than one trial at once, each of which will use its own GPUs.
Parallelism within a trial. Use multiple GPUs to speed up the training of a single trial with data parallelism. Determined can coordinate across multiple GPUs on a single machine (parallel training) or across multiple GPUs on multiple machines (distributed training) to improve the performance of training a single trial.
This how-to will focus on the third approach, demonstrating how to perform distributed or parallel training with Determined to speed up the training of a single trial.
In the Experiment Configuration, the resources.slots_per_trial option controls multi-GPU behavior. The default value is 1, which means that a single GPU will be used to train a trial. The slots_per_trial field indicates the number of GPUs to use to train a single trial. These GPUs may be on a single agent machine or across multiple machines.
Note
When the slots_per_trial option is changed, the per-slot batch size is set to global_batch_size // slots_per_trial. The per-slot (per-GPU) and global batch sizes should be accessed via the context using context.get_per_slot_batch_size() and context.get_global_batch_size(), respectively. If global_batch_size is not evenly divisible by slots_per_trial, the remainder is dropped.
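For example, the per-slot batch size is typically what gets passed to the data loader, so that the effective batch size across all slots equals global_batch_size. Here is a minimal sketch using Determined's PyTorchTrial API; MyTrial and my_dataset are hypothetical names for illustration:

from determined.pytorch import DataLoader, PyTorchTrial

class MyTrial(PyTorchTrial):
    def __init__(self, context):
        # Keep a handle to the trial context so batch sizes can be queried.
        self.context = context

    def build_training_data_loader(self):
        # Each slot (GPU) loads global_batch_size // slots_per_trial records
        # per batch; across all slots, the effective batch size remains the
        # configured global_batch_size.
        return DataLoader(
            my_dataset,  # hypothetical dataset object, defined elsewhere
            batch_size=self.context.get_per_slot_batch_size(),
        )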
Example configuration with distributed or parallel training:
resources:
  slots_per_trial: N
To use distributed or parallel training, slots_per_trial must be set to a multiple of the number of GPUs per machine. For example, if you have a cluster of machines that each have 8 GPUs, you should set slots_per_trial to a multiple of 8, such as 8, 16, or 24. This ensures that the full network and interconnect bandwidth are available to multi-GPU workloads and results in better performance.
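For instance, on a cluster of 8-GPU machines, a trial that should train across two full machines (16 GPUs) would be configured as follows; the value 16 is illustrative:

resources:
  slots_per_trial: 16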
Warning
Distributed and parallel training are designed to maximize performance by training with all the resources of a machine. This can lead to situations where an experiment is created but never starts running on the cluster: for example, if the number of GPUs requested is not a multiple of the number of GPUs per machine. Similarly, if a task is already running on a multi-GPU machine and using one or more of its GPUs, it will prevent a distributed training job from starting on that machine.
If a multi-GPU experiment does not become active after a minute or so, please confirm that slots_per_trial is a multiple of the number of GPUs available on a machine. You can also use the CLI command det task list to check whether any other tasks are using GPUs and preventing your experiment from using all the GPUs on a machine.
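For example, a quick check from the command line (the exact output format depends on your cluster):

# Show running tasks; a task holding GPUs on a machine can block a
# distributed trial from scheduling on that machine.
det task list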