Hyperparameter Search: Adaptive (Advanced)
Warning
Adaptive (Advanced) is deprecated and will be removed in a future release. We recommend using the state-of-the-art Adaptive (ASHA) searcher.
The adaptive
search method employs the same underlying algorithm as
the Adaptive (Simple) method, but it allows
users to control the behavior of the search in a more fine-grained way
at the cost of being more difficult to configure. This section explains
the configuration settings that influence the behavior of the
adaptive
searcher and gives recommendations for how to configure
those settings.
Quick start
Here are some suggested initial settings for adaptive
that typically
work well.
Search mode:
mode
: Set tostandard
.
Resource budget:
max_length
: The maximum training length (see Training Units) of any trial that survives to the end of the experiment. This quantity is domain-specific and should roughly reflect the number of minibatches the model must be trained on for it to converge on the data set. For users who would like to determine this number experimentally, train a model with reasonable hyperparameters using thesingle
search method.budget
: Setbudget
to roughly 10 timesmax_length
. A higherbudget
will result in hyperparameter search that consumes more resources and takes longer to complete, but may produce better-performing models.
Details
Conceptually, the adaptive
searcher is a carefully tuned strategy
for spawning multiple SHA (successive halving algorithm) searchers,
themselves hyperparameter search algorithms. SHA can be configured to
make different tradeoffs between exploration and exploitation, i.e., how
many trials are explored versus how long a single trial is trained for.
Because the right tradeoff between exploration and exploitation is hard
to know in advance, the adaptive
algorithm tries several SHA
searches with different tradeoffs.
The configuration settings available to Determined experiments running
in adaptive
mode mostly affect the SHA subroutines directly. The
mode
configuration is the only one affecting the decisions of the
adaptive
searcher, by changing the number and types of SHA
subroutines spawned.
The first section here gives a description of SHA. The second section
describes the configuration parameters that influence how this search
method behaves. The third section gives a summary of the adaptive
configuration settings.
SHA
At a high level, SHA prunes (“halves”) a set of trials in successive rounds we call rungs. SHA starts with an initial set of trials. (A trial means one model, with a fixed set of hyperparameter values.) SHA trains all the trials for some length and the trials with the worst validation performance are discarded. In the next rung, the remaining trials are trained for a longer period of time, and then trials with the worst validation performance are pruned once again. This is repeated until the maximum training length is reached.
First, an example of SHA.
Rung 1: SHA creates N initial trials; the hyperparameter values for each trial are randomly sampled from the hyperparameters defined in the experiment configuration file. Each trial is trained for 1 epoch, and then validation metrics are computed.
Rung 2: SHA picks the N/4 top-performing trials according to validation metrics. These are trained for 4 epochs.
Rung 3: SHA picks the N/16 top-performing trials according to validation metrics. These are trained for 16 epochs.
At the end, the trial with best performance has the hyperparameter setting the SHA searcher returns.
In the example above, divisor
is 4, which determines what fraction
of trials are kept in successive rungs, as well as the training length
in successive rungs. max_length
is 16 epochs, which is the maximum
length a trial is trained for.
The remaining degree of freedom in this SHA example is the number N of
trials initialized. This is determined by the top-level adaptive
algorithm, through budget
and the number/types of SHA subroutines
called.
In general, SHA has a fixed divisor
d. In the first rung, it
generates an initial set of randomly chosen trials and runs until each
trial has trained for the same length. In the next rung, it keeps 1/d of
those trials and closes the rest. Then it runs each remaining trial
until it has trained for d times as long as the previous rung. SHA
iterates this process until some stopping criterion is reached, such as
completing a specified number of rungs or having only one trial
remaining. The total training length, rungs, and trials within rungs are
fixed within each SHA searcher, but vary across different calls to SHA
by the adaptive algorithm. Note that although the name “SHA” includes
the phrase “halving”, the fraction of trials pruned after every rung is
controlled by divisor
.
Adaptive over SHA
The adaptive algorithm calls SHA subroutines with varying parameters.
The exact calls are configured through the choice of mode
, which
specifies how aggressively to perform early stopping. One way to think
about this behavior is as a spectrum that ranges from “one SHA run”
(aggressive early stopping; eliminate most trials every rung) to
“searcher: random
” (no early stopping; all initialized trials are
allowed to run to completion).
On one end, aggressive
applies early stopping in a very eager
manner; this mode essentially corresponds to only making a single call
to SHA. With the default divisor
of 4, 75% of the remaining trials
will be eliminated in each rung after only being trained for 25% the
length of the next rung. This implies that relatively few trials will be
allowed to finish even a small fraction of the length needed train to
convergence (max_length
). This aggressive early stopping behavior
allows the searcher to start more trials for a wider exploration of
hyperparameter configurations, at the risk of discarding a configuration
too soon.
On the other end, conservative
mode is more similar to a random
search, in that it performs significantly less pruning. Extra SHA
subroutines are spawned with fewer rungs and longer training lengths to
account for the high percentage of trials eliminated after only a short
time. However, a conservative
adaptive search will only explore a
small fraction of the configurations explored by an aggressive
search, given the same budget.
Once the number and types of calls to SHA are determined (via mode
),
the adaptive algorithm will allocate training length budgets to the SHA
subroutines, from the overall budget for the adaptive algorithm
(user-specified through budget
). This determines the number of
trials at each rung (N in the above SHA example).
Configuration
Users specify configurations for the adaptive
searcher through the
Experiment Configuration. They fall into two categories described
below.
Parameters for SHA:
max_length
: The maximum training length (see Training Units) for any one trial.(optional, for advanced users only)
divisor
: The multiplier for eliminating trials and increasing time trained at each rung. The default is 4.(optional, for advanced users only)
max_rungs
: The maximum number of rungs. The default is 5.
Parameters for adaptive mode:
mode
: Options areaggressive
,standard
, orconservative
. Specifies how aggressively to perform early stopping. We suggest using eitheraggressive
orstandard
mode.budget
: A budget for the total training length across all trials and SHA calls. The budget is split evenly between SHA calls. The recommendation above was to setbudget = 10 * max_length
.
Examples
The table below illustrates the difference between aggressive
,
standard
, and conservative
for an otherwise fixed configuration.
While aggressive
tries out 64 hyperparameter configurations,
conservative
tries only 31 hyperparameter configurations but has the
budget to run more of them to the full 16 epochs. More SHA instances are
generated by conservative
, which are responsible for creating the
trials run for the full 16 epochs.
The settings are divisor: 4
, max_rungs: 3
, max_length:
{epochs: 16}
, and budget: {epochs: 160}
.
Total epochs trained |
Number of trials |
|||||
---|---|---|---|---|---|---|
64 |
43 |
31 |
||||
|
|
|
||||
SHA0 |
SHA0 |
SHA1 |
SHA0 |
SHA1 |
SHA2 |
|
1 |
48 |
23 |
14 |
|||
4 |
11 |
7 |
7 |
5 |
5 |
|
16 |
5 |
2 |
4 |
2 |
2 |
3 |
For an experiment generated by a specific .yaml
experiment
configuration file, this information (SHA instances and number of trials
vs. training length) can be found with the command
FAQ
Q: How do I control how long a trial is trained for before it is potentially discarded?
The training length is guaranteed to be at least max_length / 256
by
default, or max_length / divisor ^ max_rungs-1
in general. It is
recommended to configure this in records or epochs if the
global_batch_size
hyperparameter is not constant, to ensure each
trial trains on the same amount of data.
Q: How do I set the initial number of trials? How do I make sure ``x`` trials are run the full training length (``max_length``)?
The number of initial trials is determined by a combination of mode
,
budget
, divisor
, max_rungs
, and max_length
. Here is a
rule of thumb for the default configuration of max_rungs: 5
and
divisor: 4
, with mode: aggressive
and a large enough budget
:
The initial number of trials is
budget / (4 * max_length)
.To ensure that
x
trials are runmax_length
, setbudget
to be4 * x * max_length
.
A configuration setting that meets set goals can also be found by trial and error. The command
will display information on the number of trials versus training length
for the configuration specified in file_name.yaml
. Increasing
budget
increases both the initial number of trials and the number of
trials that train the full length. On the other hand, max_length
decreases both. The mode
decides on allocation of training length
between trials; mode: conservative
runs more trials for longer,
whereas mode: aggressive
eliminates the most trials early in
training.
Q: The adaptive algorithm sounds great so far. What are its weaknesses?
One downside of adaptive is that it results in doing more validations, which might be expensive.