Training: Reproducibility

Determined aims to support reproducible machine learning experiments: that is, the result of running a Determined experiment should be deterministic, so that rerunning a previous experiment should produce an identical model. For example, this ensures that if the model produced from an experiment is ever lost, it can be recovered by rerunning the experiment that produced it.

Status

The current version of Determined provides limited support for reproducibility; unfortunately, the current state of the hardware and software stack typically used for deep learning makes perfect reproducibility very challenging.

Determined can control and reproduce the following sources of randomness:

  1. Hyperparameter sampling decisions.

  2. The initial weights for a given hyperparameter configuration.

  3. Shuffling of training data in a trial.

  4. Dropout or other random layers.

Determined currently does not offer support for:

  1. Controlling non-determinism in floating-point operations. Modern deep learning frameworks typically implement training using floating-point operations whose results are non-deterministic, particularly on GPUs. If only CPUs are used for training, reproducible results can be achieved; see "Deterministic Floating Point on CPUs" below.
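One common cause of this kind of non-determinism is that floating-point addition is not associative, so a reduction that sums its terms in a different order from run to run (as parallel reductions may) can produce slightly different results. The following is a minimal sketch in plain Python, not tied to any framework or to Determined; the helper name sum_in_order is hypothetical and exists only for illustration.

```python
def sum_in_order(values, order):
    """Sum `values` by visiting their indices in the given order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# The same three numbers, grouped differently, give different results.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# The same four numbers, summed in two different orders.
values = [1e16, 1.0, -1e16, 1.0]
forward = sum_in_order(values, [0, 1, 2, 3])    # the first 1.0 is lost to rounding
reordered = sum_in_order(values, [0, 2, 1, 3])  # the large terms cancel first
print(forward, reordered)  # 1.0 2.0
```

When the order of accumulation is fixed, as in single-threaded execution, repeated runs perform the same sequence of roundings and therefore agree.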

Random Seeds

Each Determined experiment is associated with an experiment seed: an integer ranging from 0 to 2^31 - 1. The experiment seed can be set using the reproducibility.experiment_seed field of the experiment configuration. If an experiment seed is not explicitly specified, the master will assign one automatically.

The experiment seed is used as a source of randomness for any hyperparameter sampling procedures. The experiment seed is also used to generate a trial seed for every trial associated with the experiment.
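To illustrate the idea of deriving per-trial seeds from a single experiment seed, here is a minimal sketch. It is not Determined's actual derivation scheme, which this page does not describe; the function derive_trial_seeds and the use of Python's random.Random are assumptions made purely for illustration.

```python
import random

MAX_SEED = 2**31 - 1  # largest value in the seed range


def derive_trial_seeds(experiment_seed, num_trials):
    """Hypothetical helper: derive one seed per trial from the experiment seed.

    This illustrates the concept only; it is not how Determined computes trial seeds.
    """
    rng = random.Random(experiment_seed)  # private PRNG; global state is untouched
    return [rng.randint(0, MAX_SEED) for _ in range(num_trials)]


first = derive_trial_seeds(experiment_seed=42, num_trials=3)
second = derive_trial_seeds(experiment_seed=42, num_trials=3)
print(first == second)  # True: the same experiment seed yields the same trial seeds
```

The point of the sketch is that fixing one top-level seed is enough to fix every seed derived from it, so rerunning an experiment with the same experiment seed gives each trial the same seed as before.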

In the Trial interface, the trial seed is accessible within the trial class using self.ctx.get_trial_seed().

Coding Guidelines

To achieve reproducible initial conditions in an experiment, please follow these guidelines:

  • Use the np.random or random APIs for random procedures, such as shuffling of data. Both PRNGs will be initialized with the trial seed by Determined automatically.

  • Use the trial seed to seed any randomized operations (e.g., initializers, dropout) in your framework of choice. For example, Keras initializers accept an optional seed parameter. Again, it is not necessary to set any graph-level PRNGs (e.g., TensorFlow’s tf.set_random_seed), as Determined manages this for you.
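The guidelines above can be imitated outside of Determined with the standard library alone. In this sketch, trial_seed is a placeholder for the value that self.ctx.get_trial_seed() would return, shuffled_indices is a hypothetical helper, and the explicit random.seed calls stand in for the seeding that Determined performs automatically.

```python
import random

# Placeholder: inside a trial this value would come from self.ctx.get_trial_seed().
trial_seed = 1234


def shuffled_indices(n):
    """Hypothetical helper: shuffle indices 0..n-1 using the global `random` PRNG."""
    indices = list(range(n))
    random.shuffle(indices)
    return indices


# Determined seeds the global PRNGs for you; it is done by hand here to imitate that.
random.seed(trial_seed)
run_a = shuffled_indices(10)

random.seed(trial_seed)
run_b = shuffled_indices(10)

print(run_a == run_b)  # True: same seed, same shuffle order

# For framework-specific randomness, pass the trial seed explicitly.
# A dedicated generator stands in for a seeded framework initializer here.
init_rng = random.Random(trial_seed)
initial_weights = [init_rng.gauss(0.0, 0.05) for _ in range(4)]
```

Keeping a dedicated generator for initialization means that changing how much data is shuffled does not change the initial weights.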

Deterministic Floating Point on CPUs

When doing CPU-only training with TensorFlow, it is possible to achieve floating-point reproducibility throughout optimization. If using the TFKerasTrial API, implement the optional session_config() method to override the default session configuration:

import tensorflow as tf

# Method of a TFKerasTrial subclass: run every op on a single thread.
def session_config(self) -> tf.ConfigProto:
    return tf.ConfigProto(
        intra_op_parallelism_threads=1, inter_op_parallelism_threads=1
    )

Warning

Disabling thread parallelism may negatively affect performance. Only enable this feature if you understand and accept this trade-off.

Pausing Experiments

TensorFlow does not fully support the extraction or restoration of a single, global RNG state. Consequently, pausing experiments that use a TensorFlow-based framework may introduce an additional source of entropy.
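For contrast, Python's built-in random module does expose its complete generator state, which is what makes a pause and resume invisible to the random stream. The sketch below is plain Python rather than TensorFlow or Determined code, and the "pause" is only simulated by reseeding within one process.

```python
import random

random.seed(7)
_ = [random.random() for _ in range(3)]  # some work before the "pause"

saved_state = random.getstate()  # checkpoint the full PRNG state

# What an uninterrupted run would draw next.
expected = [random.random() for _ in range(3)]

random.seed(999)              # simulate a fresh process with unrelated PRNG state
random.setstate(saved_state)  # restore the checkpointed state on resume
resumed = [random.random() for _ in range(3)]

print(resumed == expected)  # True: the resumed stream continues where it left off
```

Without an equivalent way to capture and restore every generator in use, a resumed run continues from a different random state than an uninterrupted one would have had.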