Training Units

When training a single model or performing a hyperparameter search, we need to specify how much data a model should be trained on before certain actions are taken (e.g., terminating training or taking a checkpoint of the model's state). Determined supports a flexible system of training units for this purpose. Training units can be specified as records, batches, or epochs:

  • records: A record is a single labeled example (sometimes called a sample).

  • batches: A batch is a group of records. The number of records in a batch is configured via the global_batch_size hyperparameter.

  • epochs: An epoch is a single copy of the entire training data set. The number of records in an epoch is configured via the records_per_epoch experiment configuration setting.

Training units must always be positive integers.

Several experiment configuration parameters can be specified in terms of training units, including:

  • max_length: the total amount of data on which a trial should be trained.

  • min_validation_period: the minimum frequency at which validation metrics should be computed.

  • min_checkpoint_period: the minimum frequency at which checkpoints should be taken.

For example, an experiment that trains a single trial on 10,000 labeled examples can be configured as follows:

searcher:
  name: single
  metric: validation_error
  max_length:
    records: 10000
  smaller_is_better: true

More examples and details for each type of training unit are given below.

This feature is designed to let users configure their experiments using whatever unit is most familiar for the task at hand. In most cases, a value expressed in one type of training unit can be converted to a different type of training unit with identical behavior, subject to a few caveats (illustrated in the sketch after this list):

  • Because training units must be positive integers, converting between quantities of different types is not always possible. For example, converting 50 records into batches is not possible if the batch size is 64.

  • When doing a hyperparameter search over a range of values for global_batch_size, values specified in batches will differ between trials of the search, and hence cannot be converted to a fixed number of records or epochs.

  • When using adaptive searchers, such as adaptive_asha, a single training unit is treated as atomic (unable to be divided into fractional parts) when dividing max_length into the series of rounds (or rungs) by which we early-stop underperforming trials. This rounding may result in unexpected behavior when configuring max_length in terms of a small number of large epochs or batches.
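As an illustration of the first caveat, here is a minimal sketch of records-to-batches conversion; the helper is hypothetical, not part of Determined's API, and the conversion only succeeds when it yields a positive whole number of batches:

def records_to_batches(records: int, global_batch_size: int) -> int:
    # Convert a quantity of records into an equivalent number of batches.
    batches, remainder = divmod(records, global_batch_size)
    if remainder != 0 or batches == 0:
        raise ValueError(
            f"{records} records is not a positive whole number of "
            f"batches at batch size {global_batch_size}"
        )
    return batches

print(records_to_batches(6400, 64))  # OK: exactly 100 batches
records_to_batches(50, 64)           # raises: 50 records is less than one batch of 64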

To verify your search is working as intended before committing to a full run, you can use the CLI’s “preview search” feature:

det preview-search <configuration.yaml>

Epochs

When using epochs, records_per_epoch must also be specified. The epoch size configured here is only used for interpreting configuration fields that are expressed in epochs. Actual epoch boundaries are still determined by the dataset itself (specifically, the end of an epoch occurs when the training data loader runs out of records).

The snippet below configures an experiment that trains a single trial on 5 epochs of data, performs validation at least once every 2 epochs, and checkpoints at least once after every epoch.

records_per_epoch: 10000
searcher:
  name: single
  metric: validation_error
  max_length:
    epochs: 5
  smaller_is_better: true
min_validation_period:
  epochs: 2
min_checkpoint_period:
  epochs: 1
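The arithmetic behind this interpretation is straightforward; as a minimal sketch (hypothetical code, not Determined's implementation), fields expressed in epochs are scaled by records_per_epoch:

records_per_epoch = 10_000  # from the configuration above

def epochs_to_records(epochs: int) -> int:
    # Interpret a quantity expressed in epochs as a number of records.
    return epochs * records_per_epoch

print(epochs_to_records(5))  # max_length: 5 epochs -> 50,000 records
print(epochs_to_records(2))  # min_validation_period: 2 epochs -> 20,000 records
print(epochs_to_records(1))  # min_checkpoint_period: 1 epoch -> 10,000 records

These values match the records-based configuration in the next section, which expresses an equivalent schedule directly in records.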

Records

The snippet below configures an experiment that trains a single trial on 50,000 records of data, performs validation at least once every 20,000 records, and checkpoints at least once after every 10,000 records.

searcher:
  name: single
  metric: validation_error
  max_length:
    records: 50000
  smaller_is_better: true
min_validation_period:
  records: 20000
min_checkpoint_period:
  records: 10000
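With the batch size of 100 used in the Batches section below, this records-based schedule converts exactly into whole batches; a quick sketch of the arithmetic:

global_batch_size = 100  # as configured in the next snippet

print(50_000 // global_batch_size)  # max_length -> 500 batches
print(20_000 // global_batch_size)  # min_validation_period -> 200 batches
print(10_000 // global_batch_size)  # min_checkpoint_period -> 100 batches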

Batches

The number of records in a batch is configured via the global_batch_size hyperparameter. When a hyperparameter search explores multiple batch sizes, different trials can end up using different values for fields expressed in batches. This may result in unexpected behavior: for example, specifying max_length for a grid search in batches would cause different grid points to be trained on different amounts of data, which is typically undesirable. (A configuration that avoids this is sketched after the snippet below.)

The snippet below configures an experiment that trains a single trial on 500 batches of data, performs validation at least once every 200 batches, and checkpoints at least once after every 100 batches.

hyperparameters:
  global_batch_size: 100
searcher:
  name: single
  metric: validation_error
  max_length:
    batches: 500
  smaller_is_better: true
min_validation_period:
  batches: 200
min_checkpoint_period:
  batches: 100
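To keep every grid point training on the same amount of data while searching over global_batch_size, max_length can instead be expressed in records (or epochs). A sketch of such a configuration, with illustrative hyperparameter values chosen so that both batch sizes divide the record count evenly:

hyperparameters:
  global_batch_size:
    type: categorical
    vals: [50, 100]
searcher:
  name: grid
  metric: validation_error
  max_length:
    records: 50000
  smaller_is_better: true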

Note

If the amount of data to train a model on is specified using records or epochs and the batch size does not divide evenly into the configured number of inputs, the remaining “partial batch” of data will be dropped (ignored). For example, if an experiment is configured to train a single model on 10 records with a configured batch size of 3, the model will only be trained on 9 records of data. In the corner case that a trial is configured to be trained for less than a single batch of data, a single complete batch will be used instead.
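A minimal sketch of this rule (hypothetical code, not Determined's implementation):

def effective_batches(records: int, global_batch_size: int) -> int:
    # Drop the trailing partial batch, but never train on zero batches.
    return max(1, records // global_batch_size)

print(effective_batches(10, 3))  # 3 complete batches: only 9 of the 10 records are used
print(effective_batches(2, 3))   # fewer records than one batch: 1 complete batch is used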