Population-based Training Method
Population-based training (PBT) is loosely based on genetic algorithms; see the original paper or blog post for details. The motivation is that it makes sense to explore in the neighborhood of hyperparameter configurations that are known to perform well, because the performance of a model as a function of the hyperparameters is likely to show some continuity. The algorithm works by repeatedly replacing low-performing hyperparameter configurations with modified versions of high-performing ones.
A typical set of configuration values for PBT:
num_rounds and length_per_round: The product of these values is the total training length for a trial that survives to the end of the experiment; it should be chosen similarly to the value of max_length for Adaptive (Asynchronous) Method. For a given value of the product, decreasing length_per_round creates more opportunity for evaluation and selection of good configurations at the cost of higher variance and computational overhead.
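The trade-off above can be made concrete with a little arithmetic. The sketch below is illustrative only: the total of 10,000 batches and the candidate round lengths are made-up numbers, and split_options is a hypothetical helper, not part of any library.

```python
def split_options(total_length, candidates):
    """Return (length_per_round, num_rounds) pairs whose product is total_length."""
    return [
        (length, total_length // length)
        for length in candidates
        if total_length % length == 0
    ]


# Hypothetical budget: a surviving trial should train for 10,000 batches.
for length_per_round, num_rounds in split_options(10_000, [2_500, 1_000, 250]):
    # Each round ends with one evaluation and selection phase, so a shorter
    # round means more selection phases for the same total training length.
    print(f"length_per_round={length_per_round:>5}  num_rounds={num_rounds:>3}")
```

All three splits train a surviving trial for the same total length; they differ only in how often the population is evaluated and reshuffled.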
At any time, the searcher maintains a fixed number of active trials (the population). Initially,
each trial uses a randomly chosen hyperparameter configuration, just as with the random
searcher. The difference is that, periodically, every trial stops training and evaluates the
validation metric for the trial’s current state; some of the worst-performing trials are closed,
while an equal number of the best-performing trials are cloned to replace them. Cloning a trial
involves checkpointing it and creating a new trial that continues training from that checkpoint. The
hyperparameters of the new trial are not generally equal to those of the original trial, but are
derived from them in a particular way; see the description of available parameters for details.
There is an important constraint on the hyperparameters that are allowed to vary when PBT is in use: it must always be possible to load a checkpoint from a model that was created with any potential hyperparameter configuration into a model using any other configuration; otherwise, the cloning process could fail. This means that, for instance, the number of hidden units in a neural network layer cannot be such a hyperparameter. If it were, the models for different configurations could have weight matrices of different dimensions, so their checkpoints would not be compatible.
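The compatibility constraint can be illustrated with a toy shape check. This is a minimal sketch, assuming a two-layer model whose saved state is described only by its weight shapes; the names checkpoint_shapes, can_clone, hidden_units, and learning_rate are hypothetical and chosen for illustration.

```python
def checkpoint_shapes(hparams):
    """Return the weight shapes a toy two-layer model would save."""
    hidden = hparams["hidden_units"]
    return {"w1": (784, hidden), "w2": (hidden, 10)}


def can_clone(source_hparams, target_hparams):
    """A checkpoint can be loaded only if every saved shape matches."""
    return checkpoint_shapes(source_hparams) == checkpoint_shapes(target_hparams)


a = {"hidden_units": 128, "learning_rate": 0.1}
b = {"hidden_units": 128, "learning_rate": 0.01}
c = {"hidden_units": 256, "learning_rate": 0.1}

# Changing only the learning rate leaves every weight shape unchanged.
print(can_clone(a, b))  # True
# Changing the layer width changes the weight shapes, so loading would fail.
print(can_clone(a, c))  # False
```

In this toy setting, learning_rate is safe to vary under PBT, while hidden_units would have to be held constant.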
One round consists of a period of training followed by a validate/close/clone phase. During each round, each running trial does a fixed amount of training, determined by the experiment configuration.
population_size: The number of trials that should run at the same time.
num_rounds: The total number of rounds to run.
length_per_round: The amount of training each trial performs during a
round, in terms of records, batches or epochs (see Training Units).
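How these three parameters drive the search can be sketched with a toy simulation. This is not the searcher's actual implementation. It assumes a one-parameter "model" trained by gradient descent on (weight - 1)^2, a single hyperparameter lr, a fixed quarter of the population replaced each round, and a simple multiplicative change to lr on cloning. All of these choices are hypothetical.

```python
import copy
import random


def run_pbt(population_size, num_rounds, length_per_round, seed=0):
    rng = random.Random(seed)
    # Every trial starts from a randomly sampled hyperparameter configuration.
    population = [
        {"lr": 10 ** rng.uniform(-4, -1), "weight": 0.0, "trained": 0}
        for _ in range(population_size)
    ]
    for _ in range(num_rounds):
        # Training phase: each trial trains for length_per_round steps.
        for trial in population:
            for _ in range(length_per_round):
                trial["weight"] -= trial["lr"] * 2 * (trial["weight"] - 1.0)
            trial["trained"] += length_per_round
        # Validation phase: rank trials by loss, best first.
        population.sort(key=lambda t: (t["weight"] - 1.0) ** 2)
        # Close/clone phase (assumption: replace the worst quarter).
        n_replace = population_size // 4
        for i in range(n_replace):
            # Deep copy stands in for checkpointing and restoring a trial.
            clone = copy.deepcopy(population[i])
            clone["lr"] *= rng.choice([0.8, 1.2])  # modified hyperparameter
            population[population_size - 1 - i] = clone
    return population


final = run_pbt(population_size=8, num_rounds=5, length_per_round=20)
print(len(final), final[0]["trained"])
```

Because clones continue from their parent's state, every trial in the final population carries num_rounds * length_per_round steps of training, even if its lineage was replaced along the way.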
The parameters for the cloning process are also configurable using two nested objects, called
replace_function and explore_function, within the searcher fields of the experiment configuration:
replace_function: The configuration for deciding which trials to close.
truncate_fraction: The fraction of the population that is closed and replaced by clones at the end of each round.
explore_function: The configuration for modifying hyperparameter configurations when cloning. Each hyperparameter is either resampled, meaning that it is replaced by a value drawn independently from the original configuration, or perturbed, meaning that it is multiplied by a configurable factor.
resample_probability: The probability that a hyperparameter is replaced with a new value sampled from the original distribution specified in the configuration.
perturb_factor: The amount by which hyperparameters that are not resampled are perturbed: each numerical hyperparameter is multiplied by either 1 + perturb_factor or 1 - perturb_factor with equal probability; const hyperparameters are left unchanged.
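The resample-or-perturb rule described above can be sketched as follows. This is a minimal illustration and not the searcher's code: the explore function, the dictionary layout of space, and the uniform resampling are assumptions. Whether the real implementation clamps perturbed values to the original range is not stated here, so this sketch does not clamp.

```python
import random


def explore(parent, space, resample_probability, perturb_factor, rng):
    """Derive a clone's hyperparameters from its parent's (hypothetical helper)."""
    child = {}
    for name, value in parent.items():
        spec = space[name]
        if spec["type"] == "const":
            # const hyperparameters are left unchanged.
            child[name] = value
        elif rng.random() < resample_probability:
            # Resample: draw a fresh value from the original distribution
            # (assumed uniform between minval and maxval in this sketch).
            child[name] = rng.uniform(spec["minval"], spec["maxval"])
        else:
            # Perturb: scale up or down by perturb_factor, equally likely.
            factor = rng.choice([1 + perturb_factor, 1 - perturb_factor])
            child[name] = value * factor
    return child


space = {
    "lr": {"type": "double", "minval": 0.001, "maxval": 1.0},
    "batch_size": {"type": "const"},
}
parent = {"lr": 0.1, "batch_size": 64}
child = explore(parent, space, resample_probability=0.2,
                perturb_factor=0.2, rng=random.Random(0))
print(child)
```

With perturb_factor set to 0.2, a perturbed lr of 0.1 becomes roughly 0.12 or 0.08, while batch_size passes through untouched.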