Hyperparameter Tuning With Determined
Hyperparameter tuning is a common machine learning workflow that involves appropriately configuring the data, model architecture, and learning algorithm to yield an effective model. It is a challenging problem in deep learning because of the potentially large number of hyperparameters to consider. Determined supports hyperparameter search as a first-class workflow that is tightly integrated with Determined’s job scheduler, which allows for efficient execution of state-of-the-art early-stopping-based approaches as well as seamless parallelization of these methods.
Our recommended default search method is Adaptive (ASHA), a state-of-the-art early-stopping-based technique that speeds up traditional techniques like random search by periodically abandoning low-performing hyperparameter configurations in a principled fashion.
Adaptive (ASHA) searches asynchronously, which makes it well suited to large-scale hyperparameter search experiments in a distributed setting.
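To make the idea of abandoning low-performing configurations concrete, here is a minimal sketch of synchronous successive halving, the scheme that ASHA builds on. It is not Determined's implementation: the function names, the toy loss, and the parameters `min_budget`, `eta`, and `rungs` are assumptions chosen for illustration. ASHA differs in that it promotes a configuration as soon as it ranks in the top 1/eta of the results seen so far at its rung, rather than waiting for the whole rung to finish.

```python
import random


def successive_halving(configs, evaluate, min_budget=1, eta=2, rungs=3):
    """Toy synchronous successive halving.

    Train every surviving config for the rung's budget, keep the best 1/eta,
    then multiply the budget by eta for the next rung.
    """
    survivors = list(configs)
    budget = min_budget
    history = []  # (budget, number of configs trained at that budget)
    for _ in range(rungs):
        history.append((budget, len(survivors)))
        # Lower score is better (e.g. validation loss).
        scored = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        keep = max(1, len(scored) // eta)
        survivors = scored[:keep]  # everything else is abandoned
        budget *= eta
    return survivors[0], history


def toy_loss(cfg, budget):
    # Hypothetical stand-in for "train for `budget` units, then validate":
    # the loss shrinks with budget and is lowest for learning rates near 0.1.
    return abs(cfg["lr"] - 0.1) + 1.0 / budget


rng = random.Random(0)
candidates = [{"lr": 10 ** rng.uniform(-4, 0)} for _ in range(8)]
best, history = successive_halving(candidates, toy_loss)
```

With eight candidates and `eta=2`, only two configurations ever receive the largest budget, which is where the savings over plain random search come from.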
Other Supported Methods
Determined also supports other common hyperparameter search algorithms:
Single is appropriate for manual hyperparameter tuning, as it trains a single hyperparameter configuration.
Grid evaluates every possible hyperparameter configuration by brute force and returns the best.
Random evaluates a set of hyperparameter configurations chosen at random and returns the best.
Population-based training (PBT) begins as random search but periodically replaces low-performing hyperparameter configurations with ones near the high-performing points in the hyperparameter space.
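The difference between Grid and Random can be sketched in a few lines of plain Python. This is an illustration only and does not use Determined's API: the search space, the helper names, and the toy loss are all hypothetical.

```python
import itertools
import random

# Hypothetical search space, used only for illustration.
space = {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64]}


def grid_configs(space):
    """Yield every combination of the listed values (brute force)."""
    names = list(space)
    for values in itertools.product(*(space[n] for n in names)):
        yield dict(zip(names, values))


def random_configs(space, n, seed=0):
    """Draw n configurations, picking each value independently at random."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n)
    ]


def best_of(configs, evaluate):
    """Return the configuration with the lowest score (e.g. validation loss)."""
    return min(configs, key=evaluate)


def toy_grid_loss(cfg):
    # Hypothetical stand-in for training and validating one configuration.
    return abs(cfg["lr"] - 0.01) + abs(cfg["batch_size"] - 64) / 64


grid = list(grid_configs(space))
grid_best = best_of(grid, toy_grid_loss)
random_best = best_of(random_configs(space, n=4), toy_grid_loss)
```

Grid search cost grows with the product of the number of values per hyperparameter (six configurations here), whereas random search trains exactly as many configurations as requested, at the price of possibly missing the best one.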
Handling Trial Errors and Early Stopping Requests
When a trial encounters an error or fails unexpectedly, Determined will
restart it from the latest checkpoint unless we have done so
max_restarts times, which is configured in the
experiment configuration. Once we have reached this limit, any
further trials that fail will be marked as errored and will not be
restarted. For search methods that adapt to validation metric values
(Adaptive (ASHA) and
Population-based training (PBT)), we do not continue training errored
trials, even if the search method would typically call for us to
continue training. This behavior is useful when some parts of the
hyperparameter space result in models that cannot be trained
successfully (e.g., the search explores a range of batch sizes and some
of those batch sizes cause GPU OOM errors). An experiment can complete
successfully as long as at least one of the trials within it completes
successfully.
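The restart behavior described above can be modeled with a small simulation. This is a sketch under stated assumptions, not Determined's scheduler code: `run_trial`, `checkpoint_every`, and the failure helpers are hypothetical names, and failures are represented by `RuntimeError`.

```python
def run_trial(train_step, total_steps, max_restarts, checkpoint_every=2):
    """Run a trial, resuming from the latest checkpoint after each failure
    until it finishes or has already been restarted max_restarts times."""
    checkpoint = 0  # last step saved to a checkpoint
    restarts = 0
    step = 0
    while step < total_steps:
        try:
            train_step(step)
            step += 1
            if step % checkpoint_every == 0:
                checkpoint = step
        except RuntimeError:
            if restarts >= max_restarts:
                return "errored", restarts
            restarts += 1
            step = checkpoint  # work since the last checkpoint is redone
    return "completed", restarts


def experiment_succeeded(trial_states):
    """An experiment succeeds if at least one trial completed."""
    return any(state == "completed" for state in trial_states)


def flaky_step_factory(fail_at, failures):
    """Hypothetical training step that fails `failures` times at `fail_at`."""
    remaining = {"n": failures}

    def step(i):
        if i == fail_at and remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("simulated transient failure")

    return step


def always_oom(i):
    # Models a configuration that can never train, e.g. too large a batch.
    raise RuntimeError("simulated GPU OOM")


flaky_result = run_trial(flaky_step_factory(fail_at=3, failures=1), 6, max_restarts=5)
oom_result = run_trial(always_oom, 6, max_restarts=5)
states = [flaky_result[0], oom_result[0]]
```

In this model the transiently failing trial recovers after one restart, the always-failing trial is marked errored once its restarts are used up, and the experiment as a whole still counts as successful.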
Trial code can also request that training be stopped early, e.g., via a
framework callback such as tf.keras.callbacks.EarlyStopping
or manually by calling
determined.TrialContext.set_stop_requested(). When early stopping
is requested, Determined will finish the current training or validation
workload and checkpoint the trial. Trials that are stopped early are
considered to be “completed”, whereas trials that fail are marked as
“errored”.