Determined AI (v0.8.26)
Concepts and Terms
An experiment is a collection of one or more deep learning (DL) training tasks that corresponds to a unified DL workflow, e.g., exploring a user-defined hyperparameter space.
Each training task in an experiment is called a trial. A trial is a fully specified training task: a fixed dataset and a deep learning model with all hyperparameters set. PEDL executes the training process associated with a trial as a sequence of steps, where each step corresponds to a fixed number of model updates.
To create an experiment, users must provide two resources. The first is called the experiment configuration file and specifies the hyperparameter search space, the location of the dataset, and other experiment-level settings. The second is called the model definition and specifies the deep learning model, e.g., via TensorFlow or Keras.
A search method is an algorithm for exploring the hyperparameter space of an experiment. Examples of search algorithms include adaptive search and random search.
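As an illustration of what a search method does, here is a minimal sketch of random search in plain Python. It is not PEDL's implementation: the function name `sample_random_search`, the `space` format (a `(low, high)` tuple for a continuous range, a list for categorical choices), and the hyperparameter names are all assumptions made for this example.

```python
import random


def sample_random_search(space, num_trials, seed=0):
    """Draw `num_trials` hyperparameter settings independently at random.

    `space` maps a hyperparameter name to either a (low, high) tuple
    (continuous range) or a list of choices. This format is hypothetical.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        hparams = {}
        for name, domain in space.items():
            if isinstance(domain, tuple):
                low, high = domain
                hparams[name] = rng.uniform(low, high)
            else:
                hparams[name] = rng.choice(domain)
        trials.append(hparams)
    return trials


# Hypothetical search space with one continuous and one categorical value.
space = {"learning_rate": (1e-4, 1e-1), "batch_size": [32, 64, 128]}
trials = sample_random_search(space, num_trials=4)
```

Each dictionary in `trials` fixes every hyperparameter, so it corresponds to one trial in the sense defined above.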
PEDL consists of a single master and one or more agents. There is typically one agent per compute server; a single machine can serve as both a master and an agent.
The master is responsible for:
- Storing experiment, trial, and step metadata.
- Scheduling and dispatching work to agents.
- Advancing the experiment, trial, and step state machines over time.
Each agent manages a number of slots, which represent computing devices, typically one slot per CPU or GPU. On startup, the agent sends the master a list of the devices it has available. It then waits for messages from the master and runs the requested workloads; agents are stateless and do not otherwise communicate with the master.
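The division of labor between master and agents can be sketched as a small data structure. This is only an illustration under assumed names (`Master`, `register_agent`, `dispatch`, and the device labels are hypothetical, not PEDL's API); the point is that all bookkeeping lives on the master, while an agent only reports its devices.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    # agent_id -> {device name -> assigned workload, or None if the slot is free}
    slots: dict = field(default_factory=dict)

    def register_agent(self, agent_id, devices):
        """Record the devices an agent reports on startup, one slot per device."""
        self.slots[agent_id] = {device: None for device in devices}

    def free_slots(self):
        """Return (agent_id, device) pairs for every slot without a workload."""
        return [
            (agent_id, device)
            for agent_id, devices in self.slots.items()
            for device, workload in devices.items()
            if workload is None
        ]

    def dispatch(self, workload):
        """Assign a workload to the first free slot; return it, or None if full."""
        free = self.free_slots()
        if not free:
            return None
        agent_id, device = free[0]
        self.slots[agent_id][device] = workload
        return agent_id, device


master = Master()
master.register_agent("agent-1", ["gpu0", "gpu1"])
assigned = master.dispatch("train step 1 of trial 7")
```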
A slot executes its workload in a containerized environment called the trial runner. We provide standard Docker containers to run typical DL workloads; containers can be customized for specific needs. Trial runners are expected to have access to the training data.
The trial runner runs workloads, which may be training steps of the trial, evaluating a trial on a validation dataset, or other operations like checkpointing model state. The master may then send more workloads or terminate the trial runner (freeing the slot). When sending a workload to the trial runner, the master consults with the searcher to determine which workload to run next.
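The interaction described above can be sketched as a loop: ask the searcher for the next workload, hand it to the trial runner, and stop when the searcher has nothing more to run. The class `FixedScheduleSearcher`, the workload tuples, and `fake_runner` are hypothetical stand-ins for this example, not PEDL interfaces.

```python
from collections import deque


class FixedScheduleSearcher:
    """Hypothetical searcher: train for N steps, then validate and checkpoint."""

    def __init__(self, num_steps):
        workloads = [("train_step", i) for i in range(1, num_steps + 1)]
        workloads += [("validate", num_steps), ("checkpoint", num_steps)]
        self.pending = deque(workloads)

    def next_workload(self):
        # Returning None signals that the trial has no more work.
        return self.pending.popleft() if self.pending else None


def run_trial(searcher, runner):
    """Play the master's role for one trial."""
    completed = []
    while True:
        workload = searcher.next_workload()
        if workload is None:
            break  # the master would terminate the trial runner, freeing the slot
        completed.append(runner(workload))
    return completed


def fake_runner(workload):
    """Stand-in for a trial runner; a real one would train or evaluate a model."""
    kind, step = workload
    return f"{kind}@{step}"


log = run_trial(FixedScheduleSearcher(num_steps=2), fake_runner)
```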
The master allocates cluster resources (slots) among the active experiments using a fair-share scheduling policy. In other words, slots are divided among the active experiments according to the demand (number of desired concurrent tasks) of each experiment. For instance, in an eight-GPU cluster with two experiments with demands of three and six respectively, the scheduler assigns three slots and five slots respectively. As new experiments become active or the resource demand of an active experiment changes, the scheduler adjusts how slots are allocated to experiments as appropriate.
Scheduling behavior can be configured via the resources section of the experiment config file; see the configuration documentation for details.
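One way to reproduce the eight-GPU example is a max-min fair allocation, sketched below. The documentation does not spell out the scheduler's exact algorithm, so treat this as an assumption that happens to agree with the example; the function name `fair_share` and the experiment names are hypothetical.

```python
def fair_share(total_slots, demands):
    """Max-min fair allocation of whole slots.

    `demands` maps an experiment name to the number of slots it wants.
    Slots are handed out one at a time, round-robin, to experiments whose
    demand is not yet met, until slots or demand run out.
    """
    allocation = {name: 0 for name in demands}
    remaining = total_slots
    while remaining > 0:
        unsatisfied = [n for n in demands if allocation[n] < demands[n]]
        if not unsatisfied:
            break  # every experiment already has all it asked for
        for name in unsatisfied:
            if remaining == 0:
                break
            allocation[name] += 1
            remaining -= 1
    return allocation


# Eight slots, demands of three and six, as in the example above.
allocation = fair_share(8, {"exp_a": 3, "exp_b": 6})
```

Under this sketch the experiment that wants three slots is fully satisfied, and the other receives the five slots that are left.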
In PEDL, there are two ways to create and monitor experiments: the WebUI and the Command Line Interface (CLI). PEDL also leverages existing application frameworks (e.g., TensorFlow, Keras) for specifying model architectures and defining the underlying iterative optimization training algorithms.
Under the hood, PEDL has the following internal components:
- Experiment Manager. This component manages the efficient and robust execution of submitted experiments.
- Trial Optimizer. This component controls the process of hyperparameter optimization by executing the work as described by the searcher.
- Resource Optimizer. This component controls low-level trial execution, ensuring efficient and full utilization of all available on-premise computational resources. It allocates jobs across GPU hardware and shares computational resources across all PEDL users. This component leverages a proprietary cluster resource manager.
- Metadata Store. This component captures experimental inputs and outputs, as well as all metadata generated as part of the model training process. It interacts with the WebUI and CLI for visualization purposes.
The WebUI allows users to create and monitor the progress of experiments. It is accessible by visiting
master-addr is the hostname or IP address where the PEDL master is running.
Users can also interact with PEDL using a command-line interface. The CLI is distributed as a Python wheel package; once the wheel has been installed (see the installation instructions for details), the CLI can be used via the pedl command.
The CLI can be installed on any machine where a user would like to access PEDL. The --master flag determines the network address of the PEDL master that the CLI connects to. The default value of this flag is given by the PEDL_MASTER_ADDR environment variable; if that is not set, the default address is localhost. The default TCP port is 8080; to specify a different port, append it to the address after a colon, as in example.org:8888.
    # Connect to PEDL master at localhost, port 8080.
    $ pedl experiment list

    # Connect to PEDL master at example.org, port 8888.
    $ pedl -m example.org:8888 e list
    $ pedl --master example.org:8888 e list

    # Set default PEDL master address to example.org, port 8888.
    $ export PEDL_MASTER_ADDR="example.org:8888"
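The precedence among the flag, the environment variable, and the built-in defaults can be sketched in Python as follows. This is an illustration of the rules described above, not the CLI's source: the function name `resolve_master` is hypothetical, and the sketch handles only plain `host` and `host:port` forms (no IPv6 literals).

```python
import os

DEFAULT_HOST = "localhost"
DEFAULT_PORT = 8080


def resolve_master(flag_value=None, env=None):
    """Return (host, port) for the master.

    Order of precedence: the --master flag value, then the
    PEDL_MASTER_ADDR environment variable, then localhost.
    A missing port falls back to 8080.
    """
    env = os.environ if env is None else env
    address = flag_value or env.get("PEDL_MASTER_ADDR") or DEFAULT_HOST
    host, sep, port = address.partition(":")
    return (host, int(port) if sep else DEFAULT_PORT)


# No flag and no environment variable: fall back to the defaults.
default_addr = resolve_master(None, env={})
# The flag takes precedence over the environment variable.
flag_addr = resolve_master("example.org:8888", env={"PEDL_MASTER_ADDR": "other.org"})
```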
CLI subcommands usually follow a <noun> <verb> form, similar to the paradigm of ip. Certain abbreviations are supported, and a missing verb is the same as list, when possible.
For example, the different commands within each of the blocks below all do the same thing:
    # List all experiments.
    $ pedl experiment list
    $ pedl exp list
    $ pedl e list
    $ pedl e

    # List all agents.
    $ pedl agent list
    $ pedl a list
    $ pedl a

    # List all slots.
    $ pedl slot list
    $ pedl slot
    $ pedl s
For a complete description of the available nouns and abbreviations, see the output of pedl help. Each noun also provides a help verb that describes the possible verbs for that noun.
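The noun/verb convention, including abbreviations and the default verb, can be modeled with a small lookup table. The sketch below covers only the abbreviations that appear in the examples on this page; the table `NOUN_ALIASES` and the function `parse_subcommand` are hypothetical, and the real CLI may accept other nouns and abbreviations.

```python
# Abbreviations taken from the examples on this page; the full set may differ.
NOUN_ALIASES = {
    "experiment": "experiment", "exp": "experiment", "e": "experiment",
    "agent": "agent", "a": "agent",
    "slot": "slot", "s": "slot",
}


def parse_subcommand(args):
    """Turn ['e'] or ['exp', 'list'] into a canonical (noun, verb) pair."""
    if not args:
        raise ValueError("expected a noun")
    noun = NOUN_ALIASES.get(args[0])
    if noun is None:
        raise ValueError(f"unknown noun: {args[0]}")
    # A missing verb is treated the same as 'list'.
    verb = args[1] if len(args) > 1 else "list"
    return noun, verb


# 'pedl e' and 'pedl experiment list' resolve to the same subcommand.
short_form = parse_subcommand(["e"])
long_form = parse_subcommand(["experiment", "list"])
```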