Shortcuts

Distributed Training

Determined provides three main methods to take advantage of multiple GPUs:

  1. Parallelism across experiments. Schedule multiple experiments at once: more than one experiment can proceed in parallel if there are enough GPUs available.

  2. Parallelism within an experiment. Schedule multiple trials of an experiment at once: a hyperparameter search may train more than one trial at once, each of which will use its own GPUs.

  3. Parallelism within a trial. Use multiple GPUs to speed up the training of a single trial (distributed training). Determined can coordinate across multiple GPUs on a single machine or across multiple GPUs on multiple machines to improve the performance of training a single trial.

This guide will focus on the third approach, demonstrating how to perform distributed training with Determined to speed up the training of a single trial.

Configuration

In the Experiment Configuration, the resources.slots_per_trial field controls the number of GPUs that will be used to train a single trial. The default value is 1, which disables distributed training. Setting slots_per_trial to a larger value enables multi-GPU training automatically. Note that these GPUs might be on a single machine or across multiple machines; the experiment configuration simply defines how many GPUs should be used for training, and the Determined job scheduler decides whether to schedule the task on a single agent or multiple agents, depending on the machines in the cluster and the other active workloads.

Note

When the slots_per_trial option is changed, the per-slot batch size is set to global_batch_size // slots_per_trial. The per-slot (per-GPU) and global batch size should be accessed via the context using context.get_per_slot_batch_size() and context.get_global_batch_size(), respectively. If global_batch_size is not evenly divisible by slots_per_trial, the remainder is dropped.

Example configuration with distributed training:

resources:
  slots_per_trial: N

Data Downloading

When performing distributed training, Determined will automatically create one process for every GPU that is being used for training. Each process will attempt to download training and/or validation data, so care should be taken to ensure that concurrent data downloads do not conflict with one another. One way to do this is to include a unique identifier in the local file system path where the downloaded data is stored. A convenient identifier is the rank of the current process: a process’s rank is automatically assigned by Determined, and will be unique among all the processes in a trial.

You can do this by leveraging the self.context.distributed.get_rank() function. Below is an example of how to do this when downloading data from S3. In this example, the S3 bucket name is configured via a field data.bucket in the experiment configuration.

import boto3
import os

def download_data_from_s3(self):
    s3_bucket = self.context.get_data_config()["bucket"]
    download_directory = f"/tmp/data-rank{self.context.distributed.get_rank()}"
    data_file = "data.csv"

    s3 = boto3.client("s3")
    os.makedirs(download_directory, exist_ok=True)
    filepath = os.path.join(download_directory, data_file)
    if not os.path.exists(filepath):
        s3.download_file(s3_bucket, data_file, filepath)
    return download_directory

Scheduling Behavior

To use distributed training, slots_per_trial must be set to a multiple of the number of GPUs per machine. For example, if you have a cluster of machines with 8 GPUs each, you should set slots_per_trial to a multiple of 8, such as 8, 16, or 24. This ensures that the full network and interconnect bandwidth are available to multi-GPU workloads and results in better performance.

Warning

Distributed training is designed to maximize performance by training with all the resources of a machine. This can lead to situations where an experiment is created but never starts running on the cluster, for example if the number of GPUs requested is not a multiple of the number of GPUs per machine. Similarly, if a task is running on a multi-GPU machine and using one or more of its GPUs, that will prevent a distributed training job from starting on that machine.

If a multi-GPU experiment does not become active after a minute or so, please confirm that slots_per_trial is a multiple of the number of GPUs available on a machine. Also, you can also use the CLI command det task list to check if any other tasks are using GPUs and preventing your experiment from using all the GPUs on a machine.