Hyperparameter tuning is a common machine learning workflow that involves appropriately configuring the data, model architecture, and learning algorithm to yield an effective model. Hyperparameter tuning is a challenging problem in deep learning given the potentially large number of hyperparameters to consider.
Determined provides support for hyperparameter search as a first-class workflow that is tightly integrated with Determined’s job scheduler, which allows for efficient execution of state-of-the-art early-stopping based approaches as well as seamless parallelization of these methods.
Determined provides an intuitive interface for hyperparameter search. This document covers how to:
Specify a searcher to find effective hyperparameter settings within the predefined ranges.
Configure hyperparameter ranges to search.
Instrument model code to use hyperparameters from the experiment configuration.
Handle trial errors and early stopping requests.
Specifying the Search Algorithm
Determined supports a variety of hyperparameter search algorithms.
Aside from the single searcher, a searcher runs multiple trials and decides the hyperparameter
values to use in each trial. Every searcher is configured with the name of the validation metric to
optimize (via the metric field), in addition to other searcher-specific options. For example, the
adaptive_asha searcher, which is suitable for larger experiments with many trials, is configured
with the maximum number of trials to run, the maximum training length allowed per trial, and the
maximum number of trials that can be worked on simultaneously:
searcher:
  name: "adaptive_asha"
  metric: "validation_loss"
  max_trials: 16
  max_length:
    epochs: 1
  max_concurrent_trials: 8
For details on the supported searchers and their respective configuration options, refer to Hyperparameter Tuning.
That’s it! After submitting an experiment, users can easily see the best validation metric observed across all trials over time in the WebUI. After the experiment has completed, they can view the hyperparameter values for the best-performing trials and then export the associated model checkpoints for downstream serving.
Our default recommended search method is Adaptive (ASHA), a state-of-the-art early-stopping based technique that speeds up traditional techniques like random search by periodically abandoning low-performing hyperparameter configurations in a principled fashion.
Adaptive (ASHA) searches asynchronously, which makes it well suited to large-scale hyperparameter search experiments in a distributed setting.
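To make the idea of abandoning low-performing configurations concrete, the following is a minimal sketch of a simplified, synchronous successive-halving loop, the idea that ASHA builds on. It is not ASHA itself and not Determined's implementation. The function names, the `eta` and `rounds` parameters, and the toy loss function are all assumptions made for illustration.

```python
import random


def successive_halving(configs, evaluate, min_budget=1, eta=2, rounds=3):
    """Keep the best 1/eta of the configurations each round.

    Simplified and synchronous: every surviving configuration is evaluated
    at the current budget before any are dropped. ASHA makes these
    promotion decisions asynchronously instead.
    """
    survivors = list(configs)
    budget = min_budget
    for _ in range(rounds):
        if len(survivors) <= 1:
            break
        # Lower validation loss is better, so sort ascending.
        scored = sorted(survivors, key=lambda c: evaluate(c, budget))
        keep = max(1, len(scored) // eta)
        survivors = scored[:keep]
        # Survivors earn a larger training budget in the next round.
        budget *= eta
    return survivors


def toy_validation_loss(config, budget):
    # Hypothetical stand-in for training a model for `budget` units:
    # the loss is lowest near a dropout of 0.35 and shrinks with budget.
    return (config["dropout_probability"] - 0.35) ** 2 + 1.0 / budget


rng = random.Random(0)
configs = [{"dropout_probability": rng.uniform(0.2, 0.5)} for _ in range(16)]
best = successive_halving(configs, toy_validation_loss)
print(best)
```

Starting from 16 configurations with `eta=2` and three rounds, the sketch narrows the pool to 8, then 4, then 2, spending most of the training budget on the configurations that looked best early on.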
Other Supported Methods
Determined also supports other common hyperparameter search algorithms:
Configuring Hyperparameter Ranges
The first step toward automatic hyperparameter tuning is to define the hyperparameter space, e.g., by listing the decisions that may impact model performance. For each hyperparameter in the search space, the machine learning engineer specifies a range of possible values in the experiment configuration:
hyperparameters:
  ...
  dropout_probability:
    type: double
    minval: 0.2
    maxval: 0.5
  ...
Determined supports the following searchable hyperparameter data types:
int: an integer within a range
double: a floating point number within a range
log: a logarithmically scaled floating point number; users specify a base, and Determined searches the space of exponents within a range
categorical: a variable that can take on a value within a specified set of values—the values themselves can be of any type
The experiment configuration reference details these data types and their associated options.
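The four data types above can be illustrated with a small sampler. This is a sketch of how one value might be drawn per type, not Determined's implementation. The `minval` and `maxval` keys come from the configuration shown earlier; the `base` and `vals` keys, the inclusive treatment of integer bounds, and the hyperparameter names other than `dropout_probability` are assumptions made for illustration.

```python
import random


def sample_hparam(spec, rng):
    """Draw one value from a hyperparameter range description."""
    kind = spec["type"]
    if kind == "int":
        # Assumption: both bounds are treated as inclusive here.
        return rng.randint(spec["minval"], spec["maxval"])
    if kind == "double":
        return rng.uniform(spec["minval"], spec["maxval"])
    if kind == "log":
        # The range applies to the exponent, not to the value itself.
        exponent = rng.uniform(spec["minval"], spec["maxval"])
        return spec["base"] ** exponent
    if kind == "categorical":
        return rng.choice(spec["vals"])
    raise ValueError(f"unsupported hyperparameter type: {kind}")


# Hypothetical search space mirroring the YAML structure shown above.
space = {
    "dropout_probability": {"type": "double", "minval": 0.2, "maxval": 0.5},
    "hidden_units": {"type": "int", "minval": 32, "maxval": 256},
    "learning_rate": {"type": "log", "base": 10, "minval": -4, "maxval": -1},
    "activation": {"type": "categorical", "vals": ["relu", "tanh"]},
}

rng = random.Random(0)
sampled = {name: sample_hparam(spec, rng) for name, spec in space.items()}
print(sampled)
```

In this sketch, the `log` entry with base 10 and exponents between -4 and -1 yields learning rates between 0.0001 and 0.1, spread evenly across orders of magnitude rather than evenly across the raw values.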
Instrumenting Model Code
Determined injects hyperparameters from the experiment configuration into model code via a context
object in the Trial base class. This TrialContext object exposes a get_hparam() method that takes
the hyperparameter name and, at trial runtime, returns the value chosen for that trial. For example,
passing the result of get_hparam("dropout_probability") to the constructor of a PyTorch Dropout
layer injects the value of the dropout_probability hyperparameter defined above.
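The injection pattern can be sketched without a Determined or PyTorch installation. In the block below, `StubTrialContext`, `Dropout`, and `ExampleTrial` are hypothetical stand-ins written for this illustration: the stub models only the get_hparam() method, and the `Dropout` class merely mimics the constructor argument of a PyTorch Dropout layer. In a real trial, the context object is supplied by Determined and the layer would be the PyTorch one.

```python
class StubTrialContext:
    """Hypothetical stand-in for the trial context; only get_hparam() is modeled."""

    def __init__(self, hparams):
        self._hparams = dict(hparams)

    def get_hparam(self, name):
        # Return the value chosen for this trial.
        return self._hparams[name]


class Dropout:
    """Stand-in for a PyTorch Dropout layer so the sketch runs without PyTorch."""

    def __init__(self, p=0.5):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"dropout probability must be in [0, 1], got {p}")
        self.p = p


class ExampleTrial:
    """Hypothetical trial class that reads a hyperparameter from its context."""

    def __init__(self, context):
        self.context = context
        # In a real PyTorch trial, this would construct torch.nn.Dropout instead.
        self.dropout = Dropout(p=self.context.get_hparam("dropout_probability"))


# The searcher would pick this value from the configured range (0.2 to 0.5).
trial = ExampleTrial(StubTrialContext({"dropout_probability": 0.35}))
print(trial.dropout.p)
```

Because the model code asks for the hyperparameter by name rather than hard-coding a value, the same trial class can be reused unchanged for every configuration the searcher proposes.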
To see hyperparameter injection throughout a complete trial implementation, refer to the Training APIs.
Handling Trial Errors and Early Stopping Requests
When a trial encounters an error or fails unexpectedly, Determined restarts it from the latest
checkpoint, unless it has already done so max_restarts times; max_restarts is set in
the experiment configuration. Once max_restarts has been reached, any further trials that fail
are marked as errored and are not restarted. For search methods that adapt to validation
metric values (such as Adaptive (ASHA)), Determined does not continue training errored trials,
even if the search method would typically call for continued training. This behavior is useful
when some parts of the hyperparameter space result in models that cannot be trained successfully
(e.g., the search explores a range of batch sizes and some of those batch sizes cause GPU
out-of-memory errors). An experiment can complete successfully as long as at least one
of the trials within it completes successfully.
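The restart policy described above can be modeled with a toy sketch. It is not Determined's scheduler: `RestartBudget`, `experiment_succeeded`, and the action strings are hypothetical names, and tracking restarts with a single counter is an assumption made to mirror the wording above. The experiment configuration reference remains the authority on how max_restarts is scoped.

```python
from dataclasses import dataclass


@dataclass
class RestartBudget:
    """Toy model of the max_restarts limit."""

    max_restarts: int
    used: int = 0

    def on_failure(self):
        """Return the action taken when a trial fails."""
        if self.used < self.max_restarts:
            self.used += 1
            return "restart_from_latest_checkpoint"
        # Budget exhausted: the trial is marked as errored, not restarted.
        return "mark_errored"


def experiment_succeeded(trial_states):
    """An experiment succeeds if at least one trial completed."""
    return any(state == "completed" for state in trial_states)


budget = RestartBudget(max_restarts=2)
actions = [budget.on_failure() for _ in range(3)]
print(actions)
print(experiment_succeeded(["errored", "completed", "errored"]))
```

With `max_restarts=2`, the first two failures trigger a restart and the third marks the trial as errored, yet the experiment as a whole still counts as successful because one trial completed.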
Trial code can also request that training be stopped early, e.g., via a framework callback such as
tf.keras.callbacks.EarlyStopping, or manually by calling
determined.TrialContext.set_stop_requested(). When early stopping is requested,
Determined will finish the current training or validation workload and checkpoint the trial. Trials
that are stopped early are considered to be “completed”, whereas trials that fail are marked as