Distributed Training¶
Determined provides three main methods to take advantage of multiple GPUs:
Parallelism across experiments. Schedule multiple experiments at once: more than one experiment can proceed in parallel if there are enough GPUs available.
Parallelism within an experiment. Schedule multiple trials of an experiment at once: a hyperparameter search may train more than one trial at once, each of which will use its own GPUs.
Parallelism within a trial. Use multiple GPUs to speed up the training of a single trial (distributed training). Determined can coordinate across multiple GPUs on a single machine or across multiple GPUs on multiple machines to improve the performance of training a single trial.
This guide will focus on the third approach, demonstrating how to perform distributed training with Determined to speed up the training of a single trial.
Configuration¶
In the Experiment Configuration, the resources.slots_per_trial field controls the number of GPUs used to train a single trial. The default value is 1, which disables distributed training. Setting slots_per_trial to a larger value enables multi-GPU training automatically. Note that these GPUs might be on a single machine or across multiple machines; the experiment configuration simply defines how many GPUs should be used for training, and the Determined job scheduler decides whether to schedule the task on a single agent or across multiple agents, depending on the machines in the cluster and the other active workloads.
Note
When the slots_per_trial option is changed, the per-slot batch size is set to global_batch_size // slots_per_trial. The per-slot (per-GPU) and global batch sizes should be accessed via the context using context.get_per_slot_batch_size() and context.get_global_batch_size(), respectively. If global_batch_size is not evenly divisible by slots_per_trial, the remainder is dropped.
Example configuration with distributed training:
resources:
  slots_per_trial: N
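For example, with global_batch_size set to 512 in the hyperparameters and slots_per_trial set to 8, each of the 8 training processes works on batches of 64 examples. Below is a minimal sketch of how trial code reads these values through the context; the specific numbers are assumptions for illustration only.
# Assuming slots_per_trial: 8 and global_batch_size: 512 in the experiment config.
per_slot_batch_size = self.context.get_per_slot_batch_size()  # 512 // 8 == 64
global_batch_size = self.context.get_global_batch_size()      # 512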
Data Downloading¶
When performing distributed training, Determined will automatically create one process for every GPU that is being used for training. Each process will attempt to download training and/or validation data, so care should be taken to ensure that concurrent data downloads do not conflict with one another. One way to do this is to include a unique identifier in the local file system path where the downloaded data is stored. A convenient identifier is the rank of the current process: a process’s rank is automatically assigned by Determined and will be unique among all the processes in a trial.
You can do this by calling self.context.distributed.get_rank(). Below is an example of how to do this when downloading data from S3. In this example, the S3 bucket name is configured via a field data.bucket in the experiment configuration.
import boto3
import os


def download_data_from_s3(self):
    s3_bucket = self.context.get_data_config()["bucket"]
    download_directory = f"/tmp/data-rank{self.context.distributed.get_rank()}"
    data_file = "data.csv"

    s3 = boto3.client("s3")
    os.makedirs(download_directory, exist_ok=True)
    filepath = os.path.join(download_directory, data_file)
    if not os.path.exists(filepath):
        s3.download_file(s3_bucket, data_file, filepath)
    return download_directory
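The rank-specific directory returned above can then be used when constructing the trial’s datasets. Below is a minimal sketch of what this might look like in a data loader method, assuming the PyTorch API (determined.pytorch); load_csv_dataset() is a hypothetical helper standing in for your own data-loading logic, and other frameworks have equivalent mechanisms.
import os

from determined.pytorch import DataLoader


def build_training_data_loader(self):
    # Each process reads from its own rank-specific directory, so concurrent
    # downloads never collide.
    download_directory = self.download_data_from_s3()
    # load_csv_dataset() is a hypothetical helper that turns the downloaded
    # CSV file into a dataset object.
    dataset = load_csv_dataset(os.path.join(download_directory, "data.csv"))
    return DataLoader(dataset, batch_size=self.context.get_per_slot_batch_size())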
Scheduling Behavior¶
The Determined master takes care of scheduling distributed training jobs automatically, ensuring that all of the compute resources required for a job are available before the job itself is launched. Users should be aware of the following details about scheduler behavior when using distributed training:
If slots_per_trial is smaller than or equal to the number of slots on a single agent, Determined will consider scheduling multiple distributed training jobs on a single agent. This is designed to improve utilization and to allow multiple small training jobs to run on a single agent. For example, an agent with 8 GPUs could be assigned two 4-GPU jobs, or four 2-GPU jobs.
Otherwise, if slots_per_trial is greater than the number of slots on a single agent, Determined will schedule the distributed training job onto multiple agents. A multi-machine distributed training job will only be scheduled onto an agent if this will result in utilizing all of the agent’s GPUs. This is to ensure good performance and utilize the full network bandwidth of each machine while minimizing inter-machine networking. For example, if all of the agents in your cluster have 8 GPUs each, you should submit jobs with slots_per_trial set to a multiple of 8 (e.g., 8, 16, or 24). An example of such a configuration is shown below.
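For instance, on a cluster of 8-GPU agents, a two-machine job could be configured as follows (16 is an illustrative value; any multiple of 8 satisfies the constraint):
resources:
  slots_per_trial: 16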
Warning
If the scheduling constraints for multi-machine distributed training described above are not satisfied, distributed training jobs will not be scheduled and will wait indefinitely. For example, if every agent in the cluster has 8 GPUs, a job with slots_per_trial set to 12 will never be scheduled.
If a multi-GPU experiment does not become active after a minute or so, please confirm that slots_per_trial is set so that it can be scheduled within these constraints. The CLI command det task list can also be used to check whether any other tasks are using GPUs and preventing your experiment from using all the GPUs on a machine.