# Hyperparameter Search

In machine learning, a common task is attempting to find good hyperparameters for a learning algorithm. PEDL provides support for hyperparameter search as a first-class workflow. Several hyperparameter search algorithms are supported: `single`

, `random`

, `adaptive`

, and `pbt`

.

Here we give a brief introduction to the search methods. The fields mentioned in the descriptions below are specified under the `searcher`

field in the experiment configuration file.

In addition to the parameters that are specific to individual search algorithms, the following parameters may be provided for any algorithm.

`metric`

: Validation metric for evaluating trials.- (optional)
`smaller_is_better`

: Whether a smaller value of`metric`

is considered better performance (e.g., this would be`true`

for a loss metric and`false`

for an accuracy metric). Defaults to`true`

. - (optional)
`source_trial_id`

and`source_checkpoint_uuid`

: Only one of these can be specified at a time. Initializes weights of all trials to some prior checkpoint.

## Single search¶

This search method does not technically "search": it trains a single hyperparameter configuration for `max_steps`

steps, and then computes the trial's validation metrics. This method is useful for testing or for training a single model configuration until convergence.

## Random search¶

This search method generates `max_trials`

trials with hyperparameter configurations chosen uniformly at random from the ranges of values specified in the experiment configuration file. Each trial is trained for `max_steps`

steps and then the trial's validation metrics are computed.

## Grid search¶

This search method generates trials on a "grid" of hyperparameter configurations and trains each trial for `max_steps`

steps. The configuration fields `max_steps`

, `metric`

, and `smaller_is_better`

are specified under the `searcher`

field. However, the actual hyperparameter values that form the grid are specified under the `hyperparameters`

field.

The user specifies a set of values for each hyperparameter through the experiment configuration file. The "grid" of hyperparameter configurations is generated by taking the product of these sets. For example, if the set of values for three separate hyperparameters `aparam`

, `bparam`

, and `cparam`

are specified as `{0, 1, 2}`

, `{10, 20}`

, and `{"c"}`

respectively, then the grid of tuples `(aparam, bparam, cparam)`

generated is:

(0, 10, "c") (0, 20, "c") (1, 10, "c") (1, 20, "c") (2, 10, "c") (2, 20, "c")

The way the set of hyperparameter values is specified depends on the type of hyperparameter.

`const`

: The set of values contains just the single value. For example,`cparam`

above could be specified as a`const`

hyperparameter with`val: c`

.`categorical`

: The set of values is exactly the set of categorical values. For example,`bparam`

above could be specified as a`categorical`

hyperparameter with`vals: [10, 20]`

.`int`

: The set of`count`

values is taken evenly from the range`[minval, maxval]`

, inclusive of endpoints. If`count`

is larger than the number of integer values in the range, that is interpreted as the entire range of integers in`[minval, maxval]`

. For example,`aparam`

above could be specified as an`int`

hyperparameter with`minval: 0`

,`maxval: 2`

, and`count: 3`

or`count: 100`

.`double`

: The set of`count`

values is taken evenly from the range`[minval, maxval]`

, inclusive of endpoints. The set`{0.1, 0.3, 0.5}`

could be specified as a`double`

hyperparameter with`minval: 0.1`

,`maxval: 0.5`

,`count: 3`

.`log`

: The set of`count`

values is taken logarithmically evenly from the range`[base`

, inclusive of endpoints. For example, the set^{minval}, base^{maxval}]`{0.00001, 0.0001, 0.001}`

could be specified as a`log`

hyperparameter with`base: 10`

,`minval: -5`

,`maxval: -3`

, and`count: 3`

.

Under the special case of `count: 1`

for `int`

, `double`

, or `log`

, the midpoint (with rounding for `int`

and with `base`

for ^{midpoint}`log`

) is returned.

## Adaptive Search¶

This search method is a theoretically principled and empirically state-of-the-art method that adaptively allocates resources to promising hyperparameter configurations while quickly eliminating poor ones. This section explains the configuration settings that influence the behavior of the `adaptive`

searcher and gives recommendations for how to configure those settings.

### Quick start¶

If using `adaptive`

for the first time, here are some suggested initial settings that typically work well.

Search mode:

`mode`

: Set to`standard`

.

Resource budget:

`target_trial_steps`

: The maximum number of steps that any trial that survives to the end of the experiment will be trained for (a*step*is a fixed number of batches). This quantity is domain-specific and should roughly reflect the number of training steps needed for models of interest to converge on the data set. For users who would like to experimentally find this number for their models, train a model with reasonable hyperparameters using the`single`

search method.`step_budget`

: Set`step_budget`

to roughly 10 times`target_trial_steps`

. A higher`step_budget`

will result in hyperparameter search that consumes more resources and takes longer to complete.

## Details¶

Conceptually, the `adaptive`

searcher is a carefully tuned strategy for spawning multiple *ASHA* (asynchronous successive halving algorithm) searchers, themselves hyperparameter search algorithms. ASHA can be configured to make different tradeoffs between exploration and exploitation, i.e., how many trials are explored versus how long a single trial is trained for. Because the right tradeoff between exploration and exploitation is hard to know in advance, the `adaptive`

algorithm tries several ASHA searches with different tradeoffs.

The configuration settings available to PEDL users running in `adaptive`

mode mostly affect the ASHA subroutines directly. The `mode`

configuration is the only one affecting the decisions of the `adaptive`

searcher, by changing the number and types of ASHA subroutines spawned.

The first section here gives a description of ASHA. The second section describes the configuration parameters that influence how this search method behaves. The third section gives a summary of the `adaptive`

configuration settings.

#### ASHA¶

At a high level, ASHA prunes ("halves") a set of trials in successive rounds we call *rungs*. ASHA starts with an initial set of trials. (A trial means one model for training, with a fixed set of hyperparameter values.) ASHA trains all the trials for some number of steps and the trials with the worst validation performance are discarded. In the next rung, the remaining trials are trained for a longer period of time, and then trials with the worst validation performance are pruned once again. This is repeated until the maximum number of training steps is reached.

First, an example of ASHA.

- Rung 1: ASHA creates N initial trials; the hyperparameter values for each trial are randomly sampled from the hyperparameters defined in the experiment configuration file. Each trial is trained for 3 steps, and then validation metrics are computed.
- Rung 2: ASHA picks the N/4 top-performing trials according to validation metrics. These are trained for 12 steps.
- Rung 3: ASHA picks the N/16 top-performing trials according to validation metrics. These are trained for 48 steps.

At the end, the trial with best performance has the hyperparameter setting the ASHA searcher returns.

In the example above, `divisor`

is 4, which determines what fraction of trials are kept in successive rungs, as well as the number of steps in successive rungs. `target_trial_steps`

is 48, which is the maximum number of steps a trial is trained for.

The remaining degree of freedom in this ASHA example is the number N of trials initialized. This is determined by the top-level adaptive algorithm, through `step_budget`

and the number/types of ASHA subroutines called.

In general, ASHA has a fixed `divisor`

d. In the first rung, it generates an initial set of randomly chosen trials and runs until each trial has completed the same number of steps. In the next rung, it keeps 1/d of those trials and closes the rest. Then it runs each remaining trial until it has completed d times as many steps as after the previous rung. ASHA iterates this process until some stopping criterion is reached, such as completing a specified number of rungs or having only one trial remaining. The number of steps, rungs, and trials within rungs are fixed within each ASHA searcher, but vary across different calls to ASHA by the adaptive algorithm. Note that although the name "ASHA" includes the phrase "halving", the fraction of trials pruned after every rung is controlled by `divisor`

.

#### Adaptive over ASHA¶

The adaptive algorithm calls ASHA subroutines with varying parameters. The exact calls are configured through the choice of `mode`

, which specifies how aggressively to perform early stopping. One way to think about this behavior is as a spectrum that ranges from "one ASHA run" (aggressive early stopping; eliminate most trials every rung) to "`searcher: random`

" (no early stopping; all initialized trials are allowed to run to completion).

On one end, `aggressive`

applies early stopping in a very eager manner; this mode essentially corresponds to only making a single call to ASHA. With the default `divisor`

of 4, 75% of the remaining trials will be eliminated in each rung after only being trained for 25% as many training steps as will be performed in the next rung. This implies that relatively few of the trials will be allowed to finish even a small fraction of the training steps needed for a full training run (`target_trial_steps`

). This aggressive early stopping behavior allows the searcher to start more trials for a wider exploration of hyperparameter configurations, at the risk of discarding a configuration too soon.

On the other end, `conservative`

mode is more similar to `searcher: random`

in that it performs significantly less pruning. Extra ASHA subroutines are spawned with fewer rungs and larger training steps to account for the high percentage of trials eliminated after only a few steps. However, a `conservative`

adaptive search will only explore a small fraction of the configurations explored by an `aggressive`

search, given the same step budget.

Once the number and types of calls to ASHA are determined (via `mode`

), the adaptive algorithm will allocate budgets of steps to the ASHA subroutines, from the overall step_budget for the adaptive algorithm (user-specified through `step_budget`

). This determines the number of trials at each rung (N in the above ASHA example).

#### Configuration¶

Users specify configurations for the `adaptive`

searcher through the experiment configuration file. They fall into two categories described below.

**Parameters for ASHA:**

`target_trial_steps`

: The maximum number of steps that any one trial will be trained.- (optional, for advanced users only)
`divisor`

: The multiplier for eliminating trials and increasing steps trained at each rung. The default is 4. - (optional, for advanced users only)
`max_rungs`

: The maximum number of rungs. The default is 5.

**Parameters for adaptive mode:**

`mode`

: Options are`aggressive`

,`standard`

, or`conservative`

. Specifies how aggressively to perform early stopping. We suggest using either`aggressive`

or`standard`

mode.`step_budget`

: A budget for total steps taken across all trials and ASHA calls. The budget is split evenly between ASHA calls. The recommendation above was to set`step_budget = 10 * target_trial_steps`

.

### Examples¶

The table below illustrates the difference between `aggressive`

, `standard`

, and `conservative`

for an otherwise fixed configuration. While `aggressive`

tries out 64 hyperparameter configurations, `conservative`

tries only 31 hyperparameter configurations but has the budget to run more of them to the full 16 steps. More ASHA instances are generated by `conservative`

, which are responsible for creating the trials run for the full 16 steps.

The settings are `divisor: 4`

, `max_rungs: 3`

, `target_trial_steps: 16`

, and `step_budget: 160`

.

Total steps trained | Number of trials | |||||
---|---|---|---|---|---|---|

64 | 43 | 31 | ||||

`aggressive` |
`standard` |
`conservative` |
||||

ASHA0 | ASHA0 | ASHA1 | ASHA0 | ASHA1 | ASHA2 | |

1 | 48 | 23 | 14 | |||

4 | 11 | 7 | 7 | 5 | 5 | |

16 | 5 | 2 | 4 | 2 | 2 | 3 |

For an experiment generated by a specific `.yaml`

experiment configuration file, this information (ASHA instances and number of trials vs. number of steps) can be found with the command

pedl preview-search <file_name.yaml>

### FAQ¶

**Q: How do I control how many batches a trial is trained for before it is potentially discarded?**

Two factors affect the number of batches a trial is guaranteed to be trained on. The field `batches_per_step`

affects how many batches is considered one step. The number of steps guaranteed is affected by `target_trial_steps`

, and is at least `target_trial_steps / 256`

by default, or `target_trial_steps / divisor`

in general.^{max_rungs-1}

**Q: How do I set the initial number of trials? How do I make sure x trials are run the full target_trial_steps steps?**

The number of initial trials is determined by a combination of `mode`

, `step_budget`

, `divisor`

, `max_rungs`

, and `target_trial_steps`

. Here is a rule of thumb for the default configuration of `max_rungs: 5`

and `divisor: 4`

, with `mode: aggressive`

and a large enough `step_budget`

:

- The initial number of trials is
`step_budget / (4 * target_trial_steps)`

. - To ensure that
`x`

trials are run`target_trial_steps`

, set`step_budget`

to be`4 * x * target_trial_steps`

.

A configuration setting that meets set goals can also be found by trial and error. The command

pedl preview-search <file_name.yaml>

`file_name.yaml`

. Increasing `step_budget`

increases both the initial number of trials and the number of trials run the full number of steps. On the other hand, `target_trial_steps`

decreases both. The `mode`

decides on allocation of steps between trials; `mode: conservative`

runs more trials for longer, whereas `mode: aggressive`

eliminates the most trials early in training.
**Q: The adaptive algorithm sounds great so far. What are its weaknesses?**

One downside of adaptive is that it results in doing more validations, which might be expensive.

## Population-based training¶

Population-based training (PBT) is loosely based on genetic algorithms; see the original paper or blog post for full details. The motivation is that it makes sense to explore hyperparameter configurations that are known to perform well, since the performance of a model as a function of the hyperparameters is likely to show some continuity. The algorithm works by repeatedly replacing low-performing hyperparameter configurations with modified versions of high-performing ones.

### Quick start¶

A typical set of configuration values for PBT:

`population_size`

: 40`num_rounds`

,`steps_per_round`

: The product of these values is the total number of steps that a trial that survives to the end of the experiment will be trained for; it should be chosen similarly to the value of`target_trial_steps`

for adaptive search. For a given value of the product, decreasing`steps_per_round`

creates more opportunity for evaluation and selection of good configurations at the cost of higher variance and computational overhead.`replace_function`

:`truncate_fraction`

: 0.2

`explore_function`

:`resample_probability`

: 0.2`perturb_factor`

: 0.2

### Details¶

At any time, the searcher maintains a fixed number of active trials (the *population*). Initially, each trial uses a randomly chosen hyperparameter configuration, just as with the `random`

searcher. The difference is that, periodically, every trial stops training and evaluates the validation metric for the trial's current state; some of the worst-performing trials are closed, while an equal number of the best-performing trials are *cloned* to replace them. Cloning a trial involves checkpointing it and creating a new trial that continues training from that checkpoint. The hyperparameters of the new trial are not generally equal to those of the original trial, but are derived from them in a particular way; see the description of available parameters for details.

There is an important constraint on the hyperparameters that are allowed to vary when PBT is in use: it must always be possible to load a checkpoint from a model that was created with any potential hyperparameter configuration into a model using any other configuration; otherwise, the cloning process could fail. This means that, for instance, the number of hidden units in a neural network layer cannot be such a hyperparameter. If it were, the models for different configurations could have weight matrices of different dimensions, so their checkpoints would not be compatible.

### Parameters¶

One *round* consists of a period of training followed by a validate/close/clone phase. During each round, each running trial does a fixed amount of training, determined by the experiment configuration.

`population_size`

: The number of trials that should run at the same time.`num_rounds`

: The total number of rounds to run.`steps_per_round`

: The number of training steps for each trial to run during each round.

The parameters for the cloning process are also configurable using two nested objects, called `replace_function`

and `explore_function`

, within the searcher fields of the experiment configuration file.

`replace_function`

: The configuration for deciding what trials to close.`truncate_fraction`

: The fraction of the population that is closed and replaced by clones at the end of each round.

`explore_function`

: The configuration for modifying hyperparameter configurations when cloning. Each hyperparameter is either*resampled*, meaning that it is replaced by a value drawn independently from the original configuration, or*perturbed*, meaning that it is multiplied by a configurable factor.`resample_probability`

: The probability that a hyperparameter is replaced with a new value sampled from the original distribution specified in the configuration.`perturb_factor`

: The amount by which hyperparameters that are not resampled are perturbed: each numerical hyperparameter is multiplied by either`1 + perturb_factor`

or`1 - perturb_factor`

with equal probability;`categorical`

and`const`

hyperparameters are left unchanged.