When training a single model or performing a hyperparameter search, we need to specify how much data a model should be trained on before certain actions are taken (e.g., before training is terminated or before a checkpoint of the state of the model is taken). Determined supports a flexible system of training units to specify this. Training units can be specified as records, batches, or epochs:
records: A record is a single labeled example (sometimes called a sample).
batches: A batch is a group of records. The number of records in a batch is configured via the global_batch_size hyperparameter.
epochs: An epoch is a single copy of the entire training dataset. The number of records in an epoch is configured via the records_per_epoch experiment configuration setting.
Training units must always be positive integers.
Several experiment configuration parameters can be specified in terms of training units, including:
searcher.max_length in most searchers
searcher.length_per_round when using the
For example, an experiment that trains a single trial on 10,000 labeled examples can be configured as follows:
searcher:
  name: single
  metric: validation_error
  max_length:
    records: 10000
  smaller_is_better: true
More examples and details on each of the types of training units can be seen below.
This feature is designed to allow users to configure their experiments using whatever unit is most familiar for the task at hand. In most cases, a value expressed using one type of training unit can be converted to a different type of training unit with identical behavior, with a few caveats:
Because training units must be positive integers, converting between quantities of different types is not always possible. For example, converting 50 records into batches is not possible if the batch size is 64.
When doing a hyperparameter search over a range of values for global_batch_size, values specified in batches will differ between trials of the search, and hence cannot be converted to a fixed number of records or epochs.
When using adaptive_asha, a single training unit is treated as atomic (unable to be divided into fractional parts) when dividing max_length into the series of rounds (or rungs) by which we early-stop underperforming trials. This rounding may result in unexpected behavior when configuring max_length in terms of a small number of large epochs or batches.
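The first caveat can be illustrated with a short sketch. The helper records_to_batches below is hypothetical (it is not part of Determined); it only checks whether a record count corresponds to a whole number of batches for a given batch size:

```python
from typing import Optional


def records_to_batches(records: int, global_batch_size: int) -> Optional[int]:
    """Return the equivalent number of batches, or None if no exact conversion exists.

    Hypothetical helper for illustration only; not part of the Determined API.
    """
    if records <= 0 or global_batch_size <= 0:
        raise ValueError("training units and batch sizes must be positive integers")
    batches, remainder = divmod(records, global_batch_size)
    # The conversion is only valid when the records fill whole batches exactly.
    if remainder != 0:
        return None
    return batches


print(records_to_batches(50, 64))    # None: 50 records is not a whole number of 64-record batches
print(records_to_batches(6400, 64))  # 100
```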
To verify your search is working as intended before committing to a full run, you can use the CLI’s “preview search” feature:
det preview-search <configuration.yaml>
When using epochs, records_per_epoch must also be specified. The epoch size configured here is only used for interpreting configuration fields that are expressed in epochs. Actual epoch boundaries are still determined by the dataset itself (specifically, the end of an epoch occurs when the training data loader runs out of records).
The snippet below configures an experiment that trains a single trial on 5 epochs of data, performs validation at least once every 2 epochs, and checkpoints at least once every epoch.
records_per_epoch: 10000
searcher:
  name: single
  metric: validation_error
  max_length:
    epochs: 5
  smaller_is_better: true
min_validation_period:
  epochs: 2
min_checkpoint_period:
  epochs: 1
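Because records_per_epoch is what gives epoch-valued fields their meaning, it can help to work out the record counts that the configuration above implies. The sketch below assumes each configured epoch is interpreted as exactly records_per_epoch records; the function name epochs_to_records is hypothetical and not part of Determined:

```python
def epochs_to_records(epochs: int, records_per_epoch: int) -> int:
    """Interpret an epoch-valued setting as a number of records.

    Hypothetical helper for illustration only; not part of the Determined API.
    """
    if epochs <= 0 or records_per_epoch <= 0:
        raise ValueError("training units must be positive integers")
    return epochs * records_per_epoch


records_per_epoch = 10_000  # value taken from the configuration above

print(epochs_to_records(5, records_per_epoch))  # max_length: 50000 records
print(epochs_to_records(2, records_per_epoch))  # min_validation_period: 20000 records
print(epochs_to_records(1, records_per_epoch))  # min_checkpoint_period: 10000 records
```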
The snippet below configures an experiment that trains a single trial on 50,000 records of data, performs validation at least once every 20,000 records, and checkpoints at least once after every 10,000 records.
searcher:
  name: single
  metric: validation_error
  max_length:
    records: 50000
  smaller_is_better: true
min_validation_period:
  records: 20000
min_checkpoint_period:
  records: 10000
The number of records in a batch is configured via the global_batch_size hyperparameter. When doing a hyperparameter search that explores multiple batch sizes, the same number of batches corresponds to a different amount of data in each trial. This may result in unexpected behavior: for example, specifying a training length in batches for a grid search would result in different grid points being trained on different amounts of data, which is typically undesirable.
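To make the effect concrete, the sketch below computes how many records each trial would see when the same number of batches is combined with different batch sizes. The batch sizes 32, 64, and 128 are hypothetical search values chosen for illustration, and records_seen is not a Determined function:

```python
def records_seen(batches: int, global_batch_size: int) -> int:
    """Number of records consumed when training for a given number of batches."""
    return batches * global_batch_size


max_length_batches = 500  # the same length, in batches, for every trial

# Hypothetical batch sizes explored by a search.
for global_batch_size in (32, 64, 128):
    print(global_batch_size, records_seen(max_length_batches, global_batch_size))
# 32 16000
# 64 32000
# 128 64000
```

Expressing the length in records or epochs instead keeps the amount of training data the same across trials.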
The snippet below configures an experiment that trains a single trial on 500 batches of data, performs validation at least once every 200 batches, and checkpoints at least once after every 100 batches.
hyperparameters:
  global_batch_size: 100
searcher:
  name: single
  metric: validation_error
  max_length:
    batches: 500
  smaller_is_better: true
min_validation_period:
  batches: 200
min_checkpoint_period:
  batches: 100
If the amount of data to train a model on is specified using records or epochs and the batch size does not divide evenly into the configured number of records, the remaining “partial batch” of data will be dropped (ignored). For example, if an experiment is configured to train a single model on 10 records with a configured batch size of 3, the model will only be trained on 9 records of data. In the corner case that a trial is configured to be trained on less than a single batch of data, a single complete batch will be used instead.
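The rounding rule described above can be sketched as follows. This is an illustration of the stated behavior, not Determined's actual implementation, and the function name records_actually_trained is hypothetical:

```python
def records_actually_trained(records: int, global_batch_size: int) -> int:
    """Records used for training after dropping any partial batch.

    Illustrative sketch only; not part of the Determined API.
    """
    if records <= 0 or global_batch_size <= 0:
        raise ValueError("training units and batch sizes must be positive integers")
    # Only complete batches are used; the leftover partial batch is dropped.
    full_batches = records // global_batch_size
    # Corner case: less than one batch configured, so one complete batch is used.
    full_batches = max(full_batches, 1)
    return full_batches * global_batch_size


print(records_actually_trained(10, 3))  # 9: three full batches, one record dropped
print(records_actually_trained(2, 3))   # 3: rounded up to a single complete batch
```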