Hyperparameter Search: Adaptive (Advanced)¶
The adaptive
search method employs the same underlying algorithm as the
Adaptive (Simple) method,
but it allows users to control the behavior of the search in a more fine-grained
way at the cost of being more difficult to configure. This section explains the
configuration settings that influence the behavior of the adaptive
searcher
and gives recommendations for how to configure those settings.
Quick start¶
Here are some suggested initial settings for adaptive
that typically
work well.
Search mode:
mode
: Set tostandard
.
Resource budget:
target_trial_steps
: The maximum number of steps that any trial that survives to the end of the experiment will be trained for (a step is a fixed number of batches). This quantity is domain-specific and should roughly reflect the number of training steps needed for the model to converge on the data set. For users who would like to determine this number experimentally, train a model with reasonable hyperparameters using thesingle
search method.step_budget
: Setstep_budget
to roughly 10 timestarget_trial_steps
. A higherstep_budget
will result in hyperparameter search that consumes more resources and takes longer to complete, but may produce better-performing models.
Details¶
Conceptually, the adaptive
searcher is a carefully tuned strategy
for spawning multiple ASHA (asynchronous successive halving algorithm)
searchers, themselves hyperparameter search algorithms. ASHA can be
configured to make different tradeoffs between exploration and
exploitation, i.e., how many trials are explored versus how long a
single trial is trained for. Because the right tradeoff between
exploration and exploitation is hard to know in advance, the
adaptive
algorithm tries several ASHA searches with different
tradeoffs.
The configuration settings available to Determined experiments running
in adaptive
mode mostly affect the ASHA subroutines directly. The
mode
configuration is the only one affecting the decisions of the
adaptive
searcher, by changing the number and types of ASHA
subroutines spawned.
The first section here gives a description of ASHA. The second section
describes the configuration parameters that influence how this search
method behaves. The third section gives a summary of the adaptive
configuration settings.
ASHA¶
At a high level, ASHA prunes (“halves”) a set of trials in successive rounds we call rungs. ASHA starts with an initial set of trials. (A trial means a single model with a fixed set of hyperparameter values.) ASHA trains all the trials for some number of steps and the trials with the worst validation performance are discarded. In the next rung, the remaining trials are trained for a longer period of time, and then trials with the worst validation performance are pruned once again. This is repeated until the maximum number of training steps is reached.
First, an example of ASHA.
Rung 1: ASHA creates N initial trials; the hyperparameter values for each trial are randomly sampled from the hyperparameters defined in the experiment configuration file. Each trial is trained for 3 steps, and then validation metrics are computed.
Rung 2: ASHA picks the N/4 top-performing trials according to validation metrics. These are trained for 12 steps.
Rung 3: ASHA picks the N/16 top-performing trials according to validation metrics. These are trained for 48 steps.
At the end, the trial with best performance has the hyperparameter setting the ASHA searcher returns.
In the example above, divisor
is 4, which controls the fraction
of trials that are kept in successive rungs, as well as the number of steps
in successive rungs. target_trial_steps
is 48, which is the maximum
number of steps a trial is trained for.
The remaining degree of freedom in this ASHA example is the number N of
initial trials. This is determined by the top-level adaptive algorithm,
through step_budget
and the number/types of ASHA subroutines called.
In general, ASHA has a fixed divisor
d. In the first rung, it
generates an initial set of randomly chosen trials and runs until each
trial has completed the same number of steps. In the next rung, it keeps
1/d of those trials and closes the rest. Then it runs each remaining
trial until it has completed d times as many steps as after the previous
rung. ASHA iterates this process until some stopping criterion is
reached, such as completing a specified number of rungs or having only
one trial remaining. The number of steps, rungs, and trials within rungs
are fixed within each ASHA searcher, but vary across different calls to
ASHA by the adaptive algorithm. Note that although the name “ASHA”
includes the phrase “halving”, the fraction of trials pruned after every
rung is controlled by divisor
.
Adaptive over ASHA¶
The adaptive algorithm calls ASHA subroutines with varying parameters.
The exact calls are configured through the choice of mode
, which
specifies how aggressively to perform early stopping. One way to think
about this behavior is as a spectrum that ranges from “one ASHA run”
(aggressive early stopping; eliminate most trials every rung) to
“searcher: random
” (no early stopping; all initialized trials are
allowed to run to completion).
On one end, aggressive
applies early stopping in a very eager
manner; this mode essentially corresponds to only making a single call
to ASHA. With the default divisor
of 4, 75% of the remaining trials
will be eliminated in each rung after only being trained for 25% as many
training steps as will be performed in the next rung. This implies that
relatively few of the trials will be allowed to finish even a small
fraction of the training steps needed for a full training run
(target_trial_steps
). This aggressive early stopping behavior allows
the searcher to start more trials for a wider exploration of
hyperparameter configurations, at the risk of discarding a configuration
too soon.
On the other end, conservative
mode is more similar to a random
search, in that it performs significantly less pruning. Extra ASHA
subroutines are spawned with fewer rungs and larger training steps to
account for the high percentage of trials eliminated after only a few
steps. However, a conservative
adaptive search will only explore a
small fraction of the configurations explored by an aggressive
search, given the same step budget.
Once the number and types of calls to ASHA are determined (via
mode
), the adaptive algorithm will allocate budgets of steps to the
ASHA subroutines, from the overall step_budget for the adaptive
algorithm (user-specified through step_budget
). This determines the
number of trials at each rung (N in the above ASHA example).
Configuration¶
Users specify configurations for the adaptive
searcher through the
Experiment Configuration. They fall into two categories described below.
Parameters for ASHA:
target_trial_steps
: The maximum number of steps that any one trial will be trained.(optional, for advanced users only)
divisor
: The multiplier for eliminating trials and increasing steps trained at each rung. The default is 4.(optional, for advanced users only)
max_rungs
: The maximum number of rungs. The default is 5.
Parameters for adaptive mode:
mode
: Options areaggressive
,standard
, orconservative
. Specifies how aggressively to perform early stopping. We suggest using eitheraggressive
orstandard
mode.step_budget
: A budget for total steps taken across all trials and ASHA calls. The budget is split evenly between ASHA calls. The recommendation above was to setstep_budget = 10 * target_trial_steps
.
Examples¶
The table below illustrates the difference between aggressive
,
standard
, and conservative
for an otherwise fixed configuration.
While aggressive
tries out 64 hyperparameter configurations,
conservative
tries only 31 hyperparameter configurations but has the
budget to run more of them to the full 16 steps. More ASHA instances are
generated by conservative
, which are responsible for creating the
trials run for the full 16 steps.
The settings are divisor: 4
, max_rungs: 3
,
target_trial_steps: 16
, and step_budget: 160
.
Total steps trained |
Number of trials |
|||||
---|---|---|---|---|---|---|
64 |
43 |
31 |
||||
|
|
|
||||
ASHA0 |
ASHA0 |
ASHA1 |
ASHA0 |
ASHA1 |
ASHA2 |
|
1 |
48 |
23 |
14 |
|||
4 |
11 |
7 |
7 |
5 |
5 |
|
16 |
5 |
2 |
4 |
2 |
2 |
3 |
For an experiment generated by a specific .yaml
experiment
configuration file, this information (ASHA instances and number of
trials vs. number of steps) can be found with the command
det preview-search <file_name.yaml>
FAQ¶
Q: How do I control how many batches a trial is trained for before it is potentially discarded?
Two factors affect the number of batches a trial is guaranteed to be
trained on. The field batches_per_step affects how many
batches make up one step. The number of steps guaranteed is affected by
target_trial_steps
, and is at least target_trial_steps / 256
by
default, or target_trial_steps / divisor ^ (max_rungs-1)
in general.
Q: How do I set the initial number of trials? How do I make sure that a certain number of trials are trained to completion?
The number of initial trials is determined by a combination of mode
,
step_budget
, divisor
, max_rungs
, and target_trial_steps
.
Here is a rule of thumb for the default configuration of
max_rungs: 5
and divisor: 4
, with mode: aggressive
and a
large enough step_budget
:
The initial number of trials is
step_budget / (4 * target_trial_steps)
.To ensure that
x
trials are runtarget_trial_steps
, setstep_budget
to be4 * x * target_trial_steps
.
A configuration setting that meets set goals can also be found by trial and error. The command
det preview-search <file_name.yaml>
will display information on the number of trials versus number of steps for the
configuration specified in file_name.yaml
. Increasing step_budget
increases both the initial number of trials and the number of trials that are
trained for the full number of steps. On the other hand, increasing
target_trial_steps
decreases both. The mode
decides on allocation of
steps between trials; mode: conservative
runs more trials for longer,
whereas mode: aggressive
eliminates the most trials early in training.
Q: The adaptive algorithm sounds great so far. What are its weaknesses?
One downside of adaptive is that it results in doing more validations, which might be expensive.