Terminology and Concepts¶
- agent
A machine, typically with GPUs, that is used for training models and running other tasks, such as notebooks and TensorBoards. The master handles provisioning and deprovisioning agent instances in cloud settings. More information can be found at Determined System Architecture.
- configuration file
A YAML file that contains options to pass to Determined. For example, an experiment configuration file contains information on number of steps, data location, hyperparameters, and other options for an experiment. More information can be found at Experiment Configuration, Cluster Configuration, and Command Configuration.
- context directory
The directory that is uploaded to the master when an experiment is created. It must contain all code that is part of the model definition.
- experiment
A collection of one or more trials that are exploring a user-defined hyperparameter space. For example, during a learning rate hyperparameter search, an experiment can consist of three trials with learning rates of .001, .01, and .1. In Determined, experiments are the main grouping mechanism for training tasks.
- master
The central component of the Determined system. The master serves the frontend, manages the provisioning and deprovisioning of agents in cloud settings, and schedules trials onto agents. More information can be found at Determined System Architecture.
- model definition
A specification of a deep learning model written in a supported deep learning framework. The model definition contains training code that inherits from a Python class provided by Determined (
TFKerasTrial
,PyTorchTrial
, orEstimatorTrial
). More information can be found at Model Definitions.- searcher, search algorithm
A type of hyperparameter search to use. The search algorithm determines how many trials will be run for a particular experiment and how the hyperparameters will be set. More information can be found at Hyperparameter Tuning With Determined.
- slot
A resource (GPU or CPU) that can be used for training. The maximum number of slots that an experiment can use can be set in the experiment configuration file or using the Determined CLI.
- step
A workload consisting of training a model on a certain number of batches of data, where a batch is defined by the model definition’s data loader. The default is 100 batches per step, but the number can be overridden in the configuration file or using the Determined CLI.
- trial
A training task with a dataset, a deep learning model, and a defined set of hyperparameters.
- workload
A discrete unit of work with one purpose related to training a model. A workload will either train the model on a certain amount of data, checkpoint the state, or validate the model’s performance.