Python API determined.keras

determined.keras.TFKerasTrial

class determined.keras.TFKerasTrial(context: determined.keras._tf_keras_context.TFKerasTrialContext)

To implement a new tf.keras trial, subclass this class and implement the abstract methods described below (build_model(), build_training_data_loader(), and build_validation_data_loader()). In most cases you should provide a custom __init__() method as well.

By default, experiments use TensorFlow 1.x. To configure your trial to use TensorFlow 2.x, specify a TensorFlow 2.x image in the environment.image field of the experiment configuration (e.g., determinedai/environments:cuda-11.1-pytorch-1.9-lightning-1.3-tf-2.4-gpu-0.16.4).
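As a rough illustration, the nesting of the environment.image field can be written as a Python dictionary that mirrors the experiment configuration. The image tag is the one quoted above; the names experiment_config and uses_tf2 are only illustrative, and a real configuration would contain many other fields.

```python
# Illustrative only: a Python dictionary mirroring the nesting of the
# experiment configuration. Only the environment.image field is shown.
experiment_config = {
    "environment": {
        # TensorFlow 2.x image taken from the example above.
        "image": (
            "determinedai/environments:"
            "cuda-11.1-pytorch-1.9-lightning-1.3-tf-2.4-gpu-0.16.4"
        ),
    },
}

# The image tag names the TensorFlow version it bundles ("tf-2.4" here).
uses_tf2 = "-tf-2." in experiment_config["environment"]["image"]
```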

Trials default to using eager execution with TensorFlow 2.x but not with TensorFlow 1.x. To override the default behavior, call the appropriate function at the top of your code. For example, to disable eager execution while using TensorFlow 2.x, call tf.compat.v1.disable_eager_execution() after your import statements. If you are using TensorFlow 1.x in eager mode, pass experimental_run_tf_function=False to model.compile().

For more information on writing tf.keras trial classes, refer to the tutorial.

__init__(context: determined.keras._tf_keras_context.TFKerasTrialContext) → None

Initializes a trial using the provided context.

This method should typically be overridden by trial definitions: at minimum, it is important to store context as an instance variable so that it can be accessed by other methods of the trial class. This can also be a convenient place to initialize other state that is shared between methods.

abstract build_model() → tensorflow.python.keras.engine.training.Model

Returns the deep learning architecture associated with a trial. The architecture might depend on the current values of the model’s hyperparameters, which can be accessed via context.get_hparam(). This function returns a tf.keras.Model object.

After constructing the tf.keras.Model object, users must do two things before returning it:

  1. Wrap the model using context.wrap_model().

  2. Compile the model using model.compile().
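The order of the two steps can be sketched without TensorFlow or Determined installed. In the sketch below, RecordingModel and RecordingContext are hypothetical stand-ins for tf.keras.Model and TFKerasTrialContext, SketchTrial stands in for a TFKerasTrial subclass, and the "units" hyperparameter is an assumption made for illustration. A real trial would build an actual tf.keras.Model and receive its context from Determined.

```python
class RecordingModel:
    """Stand-in for tf.keras.Model that records the calls made on it."""

    def __init__(self):
        self.calls = []

    def compile(self, **kwargs):
        self.calls.append("compile")


class RecordingContext:
    """Stand-in for TFKerasTrialContext exposing the two methods used here."""

    def __init__(self, hparams):
        self._hparams = hparams

    def get_hparam(self, name):
        return self._hparams[name]

    def wrap_model(self, model):
        model.calls.append("wrap_model")
        return model  # the returned object replaces the original model


class SketchTrial:
    """Mirrors the shape of a TFKerasTrial subclass without importing it."""

    def __init__(self, context):
        self.context = context  # keep the context for use in other methods

    def build_model(self):
        # A real trial would construct a tf.keras.Model here, possibly
        # sized from hyperparameters such as the hypothetical "units".
        model = RecordingModel()
        model.units = self.context.get_hparam("units")
        # Step 1: wrap the model and keep the returned object.
        model = self.context.wrap_model(model)
        # Step 2: compile the wrapped model.
        model.compile(optimizer="adam", loss="mse")
        return model


trial = SketchTrial(RecordingContext({"units": 32}))
model = trial.build_model()
print(model.calls)  # ['wrap_model', 'compile']
```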

abstract build_training_data_loader() → Union[tensorflow.python.keras.utils.data_utils.Sequence, tensorflow.python.data.ops.dataset_ops.DatasetV2, SequenceAdapter, tuple]

Defines the data loader to use during training.

Should return one of the following:

1) A tuple (x_train, y_train), where x_train is a NumPy array (or array-like), a list of arrays (in case the model has multiple inputs), or a dict mapping input names to the corresponding array, if the model has named inputs. y_train should be a NumPy array.

2) A tuple (x_train, y_train, sample_weights) of NumPy arrays.

3) A tf.data.Dataset returning a tuple of either (inputs, targets) or (inputs, targets, sample_weights).

4) A keras.utils.Sequence returning a tuple of either (inputs, targets) or (inputs, targets, sample_weights).

When using tf.data.Dataset, you must wrap the dataset using determined.keras.TFKerasTrialContext.wrap_dataset(). This wrapper is used to shard the dataset for distributed training. For optimal performance, users should wrap a dataset immediately after creating it.

Warning

If you are using tf.data.Dataset, Determined’s support for automatically checkpointing the dataset does not currently work correctly. This means that resuming workloads will start from the beginning of the dataset if using tf.data.Dataset.

abstract build_validation_data_loader() → Union[tensorflow.python.keras.utils.data_utils.Sequence, tensorflow.python.data.ops.dataset_ops.DatasetV2, SequenceAdapter, tuple]

Defines the data loader to use during validation.

Should return one of the following:

1) A tuple (x_val, y_val), where x_val is a NumPy array (or array-like), a list of arrays (in case the model has multiple inputs), or a dict mapping input names to the corresponding array, if the model has named inputs. y_val should be a NumPy array.

2) A tuple (x_val, y_val, sample_weights) of NumPy arrays.

3) A tf.data.Dataset returning a tuple of either (inputs, targets) or (inputs, targets, sample_weights).

4) A keras.utils.Sequence returning a tuple of either (inputs, targets) or (inputs, targets, sample_weights).

When using tf.data.Dataset, you must wrap the dataset using determined.keras.TFKerasTrialContext.wrap_dataset(). This wrapper is used to shard the dataset for distributed training. For optimal performance, users should wrap a dataset immediately after creating it.

session_config() → tensorflow.core.protobuf.config_pb2.ConfigProto

Specifies the tf.ConfigProto to be used by the TensorFlow session. By default, tf.ConfigProto(allow_soft_placement=True) is used.

keras_callbacks() → List[tensorflow.python.keras.callbacks.Callback]

Specifies a list of determined.keras.callbacks.Callback objects to be used during training.

Callbacks should avoid calling model.predict(), as this will affect Determined training behavior.

Data Loading

There are four supported data types for loading data into tf.keras models:

  1. A tuple (x, y) of NumPy arrays. x must be a NumPy array (or array-like), a list of arrays (in case the model has multiple inputs), or a dict mapping input names to the corresponding array, if the model has named inputs. y should be a NumPy array.

  2. A tuple (x, y, sample_weights) of NumPy arrays.

  3. A tf.data.Dataset returning a tuple of either (inputs, targets) or (inputs, targets, sample_weights).

  4. A keras.utils.Sequence returning a tuple of either (inputs, targets) or (inputs, targets, sample_weights).

Loading data is done by defining build_training_data_loader() and build_validation_data_loader() methods. Each should return one of the supported data types mentioned above.
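As a dependency-free sketch of the keras.utils.Sequence option, the class below follows the same protocol: __len__ gives the number of batches and __getitem__ returns one batch as an (inputs, targets) tuple. ListSequence and SketchTrial are hypothetical names, and plain Python lists stand in for the NumPy arrays a real Sequence would return. A real trial would subclass tf.keras.utils.Sequence and determined.keras.TFKerasTrial instead.

```python
import math


class ListSequence:
    """Stand-in following the keras.utils.Sequence protocol."""

    def __init__(self, inputs, targets, batch_size):
        assert len(inputs) == len(targets)
        self.inputs = inputs
        self.targets = targets
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches, counting a final partial batch.
        return math.ceil(len(self.inputs) / self.batch_size)

    def __getitem__(self, index):
        # Return batch number `index` as an (inputs, targets) tuple.
        start = index * self.batch_size
        stop = start + self.batch_size
        return self.inputs[start:stop], self.targets[start:stop]


class SketchTrial:
    """Shows only the two data loader methods of a trial."""

    def build_training_data_loader(self):
        xs = [[float(i)] for i in range(10)]
        ys = [2.0 * i for i in range(10)]
        return ListSequence(xs, ys, batch_size=4)

    def build_validation_data_loader(self):
        xs = [[float(i)] for i in range(10, 14)]
        ys = [2.0 * i for i in range(10, 14)]
        return ListSequence(xs, ys, batch_size=4)


train = SketchTrial().build_training_data_loader()
val = SketchTrial().build_validation_data_loader()
print(len(train), len(val))  # 3 1
```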

Passing Additional Arguments to model.fit()

The TFKerasTrial interface allows the user to configure how model.fit is called by calling self.context.configure_fit().

Required Wrappers

Users are required to wrap their model prior to compiling it using self.context.wrap_model. This is typically done inside build_model().

If using tf.data.Dataset, users are required to wrap both their training and validation dataset using self.context.wrap_dataset. This wrapper is used to shard the dataset for Distributed Training. For optimal performance, users should wrap a dataset immediately after creating it.

Trial Context

determined.keras.TFKerasTrialContext is a sub-class of determined.TrialContext that provides useful methods for writing tf.keras trial definitions, as well as functions to wrap the model and dataset.

class determined.keras.TFKerasTrialContext(env: determined._env_context.EnvContext, hvd_config: determined.horovod.HorovodContext, rendezvous_info: determined._rendezvous_info.RendezvousInfo)

TFKerasTrialContext always has a DistributedContext accessible via context.distributed for information related to distributed training.

TFKerasTrialContext always has a TFKerasExperimentalContext accessible via context.experimental for information related to experimental features.

wrap_model(model: Any) → Any

This should be used to wrap tf.keras.Model objects immediately after they have been created but before they have been compiled. This function takes a tf.keras.Model and returns a wrapped version of the model; the return value should be used in place of the original model.

Parameters

model – tf.keras.Model

configure_fit(verbose: Optional[bool] = None, class_weight: Any = <determined.keras._tf_keras_context._ArgNotProvided object>, workers: Optional[int] = None, use_multiprocessing: Optional[bool] = None, max_queue_size: Optional[bool] = None, shuffle: Optional[bool] = None, validation_steps: Any = <determined.keras._tf_keras_context._ArgNotProvided object>) → None

Configure parameters of model.fit(). See the Keras documentation for the meaning of each parameter.

Note that the output of verbose=True will be visually different in Determined than with Keras, for better rendering in trial logs.

Note that if configure_fit() is called multiple times, keyword arguments omitted from a later call keep the values set by earlier calls.

Usage Example

import determined as det
import determined.keras  # loads the det.keras submodule

class MyTFKerasTrial(det.keras.TFKerasTrial):
    def __init__(self, context):
        ...
        self.context.configure_fit(verbose=False, workers=5)

        # It is safe to call configure_fit() multiple times.
        self.context.configure_fit(use_multiprocessing=True)
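The accumulation behavior across repeated calls can be sketched with a small stand-in. FitSettings and _NOT_PROVIDED are hypothetical names and this is not Determined's implementation. The sentinel default mirrors the _ArgNotProvided default shown in the signature for class_weight, presumably because None is itself a valid value for that argument.

```python
_NOT_PROVIDED = object()  # sentinel distinguishing "omitted" from None


class FitSettings:
    """Hypothetical stand-in that accumulates model.fit() keyword arguments."""

    def __init__(self):
        self.kwargs = {}

    def configure_fit(self, verbose=None, workers=None,
                      use_multiprocessing=None, class_weight=_NOT_PROVIDED):
        # Arguments left at None are treated as "not provided" and skipped,
        # so earlier settings survive later calls.
        for name, value in (("verbose", verbose),
                            ("workers", workers),
                            ("use_multiprocessing", use_multiprocessing)):
            if value is not None:
                self.kwargs[name] = value
        # For class_weight, only the sentinel means "not provided".
        if class_weight is not _NOT_PROVIDED:
            self.kwargs["class_weight"] = class_weight


settings = FitSettings()
settings.configure_fit(verbose=False, workers=5)
settings.configure_fit(use_multiprocessing=True)
print(settings.kwargs)
# {'verbose': False, 'workers': 5, 'use_multiprocessing': True}
```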

wrap_dataset(dataset: Any, shard_dataset: bool = True) → Any

This should be used to wrap tf.data.Dataset objects immediately after they have been created. Users should use the output of this wrapper as the new instance of their dataset. If users create multiple datasets (e.g., one for training and one for validation), users should wrap each dataset independently.

Parameters
  • dataset – tf.data.Dataset

  • shard_dataset – When performing multi-slot (distributed) training, this controls whether the dataset is sharded so that each training process (one per slot) sees unique data. If set to False, users must manually configure each process to use unique data.
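When shard_dataset is False, each process has to select its own unique portion of the data. One common approach, sketched below on a plain list, is to keep every num_shards-th record starting at the process's rank; tf.data.Dataset.shard(num_shards, index) selects elements from a dataset in the same way. The function shard_records and its parameters are hypothetical, and in a trial the rank and number of processes would presumably be read from context.distributed.

```python
def shard_records(records, rank, num_shards):
    """Return the records assigned to one training process.

    Every num_shards-th record is kept, starting at offset `rank`, so the
    shards are disjoint and together cover the whole input.
    """
    if not 0 <= rank < num_shards:
        raise ValueError("rank must be in [0, num_shards)")
    return records[rank::num_shards]


records = list(range(10))
shards = [shard_records(records, rank, 4) for rank in range(4)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```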

wrap_optimizer(optimizer: tensorflow.python.keras.optimizer_v2.optimizer_v2.OptimizerV2) → tensorflow.python.keras.optimizer_v2.optimizer_v2.OptimizerV2

This should be used to wrap tf.keras.optimizers.Optimizer objects. Users should use the output of this wrapper as the new instance of their optimizer. If users create multiple optimizers, users should wrap each optimizer independently.

Parameters

optimizer – tf.keras.optimizers.Optimizer

class determined.keras.TFKerasExperimentalContext(env: determined._env_context.EnvContext, hvd_config: determined.horovod.HorovodContext)

Context class that contains experimental runtime information and features for any Determined workflow that uses the tf.keras API.

TFKerasExperimentalContext extends TFKerasTrialContext under the context.experimental namespace.

cache_train_dataset(dataset_id: str, dataset_version: str, shuffle: bool = False, skip_shuffle_at_epoch_end: bool = False) → Callable

cache_train_dataset is a decorator for creating your training dataset. It should decorate a function that outputs a tf.data.Dataset object. The dataset will be stored in a cache, keyed by dataset_id and dataset_version. The cache is re-used in subsequent calls.

Parameters
  • dataset_id – A string that will be used as part of the unique identifier for this dataset.

  • dataset_version – A string that will be used as part of the unique identifier for this dataset.

  • shuffle – A bool indicating if the dataset should be shuffled. Shuffling will be performed with the trial’s random seed which can be set in Experiment Configuration.

  • skip_shuffle_at_epoch_end – A bool indicating if shuffling should be skipped at the end of epochs.

Example Usage:

def make_train_dataset(self):
    @self.context.experimental.cache_train_dataset("range_dataset", "v1")
    def make_dataset():
        ds = tf.data.Dataset.range(10)
        return ds

    dataset = make_dataset()
    dataset = dataset.batch(self.context.get_per_slot_batch_size())
    dataset = dataset.map(...)
    return dataset

Note

dataset.batch() and runtime augmentation should be done after caching. Additionally, users should never need to call dataset.repeat().

cache_validation_dataset(dataset_id: str, dataset_version: str, shuffle: bool = False) → Callable

cache_validation_dataset is a decorator for creating your validation dataset. It should decorate a function that outputs a tf.data.Dataset object. The dataset will be stored in a cache, keyed by dataset_id and dataset_version. The cache is re-used in subsequent calls.

Parameters
  • dataset_id – A string that will be used as part of the unique identifier for this dataset.

  • dataset_version – A string that will be used as part of the unique identifier for this dataset.

  • shuffle – A bool indicating if the dataset should be shuffled. Shuffling will be performed with the trial’s random seed which can be set in Experiment Configuration.

Callbacks

To execute arbitrary Python code during the lifecycle of a TFKerasTrial, implement the determined.keras.callbacks.Callback interface (an extension of the tf.keras.callbacks.Callback interface) and supply the resulting callbacks to the TFKerasTrial by implementing keras_callbacks().

determined.keras.TFKerasTrial.keras_callbacks(self) → List[tensorflow.python.keras.callbacks.Callback]

Specifies a list of determined.keras.callbacks.Callback objects to be used during training.

Callbacks should avoid calling model.predict(), as this will affect Determined training behavior.

determined.keras.callbacks

class determined.keras.callbacks.Callback

A Determined subclass of the tf.keras.callbacks.Callback interface which supports additional callbacks.

Warning

The following behaviors differ between normal Keras operation and Keras operation within Determined:

  • Keras calls on_epoch_end at the end of the training dataset, but Determined calls it based on the records_per_epoch setting in the experiment config.

  • Keras calls on_epoch_end with training and validation logs, but Determined does not schedule training or validation around epochs in general, so Determined cannot guarantee that those values are available for on_epoch_end calls. As a result, on_epoch_end will be called with an empty dictionary for its logs.

  • Keras does not support stateful callbacks, but Determined does. Therefore:

    • The tf.keras version of EarlyStopping will not work correctly in Determined. You should use determined.keras.callbacks.EarlyStopping instead.

    • The tf.keras version of ReduceLROnPlateau will not work correctly in Determined. You should use determined.keras.callbacks.ReduceLROnPlateau instead.

    The Determined versions are triggered by on_test_end rather than on_epoch_end; how often on_test_end is called can be influenced by setting min_validation_period in the experiment configuration.

get_state() → Any

get_state should return a pickleable object that represents the state of this callback.

When training is continued from a checkpoint, the value returned from get_state() will be passed back to the Callback object via load_state().

load_state(state: Any) → None

load_state should accept the exact pickleable object returned by get_state to restore the internal state of a stateful Callback as it was when get_state was called.
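The get_state() and load_state() round trip might look like the following sketch. BestMetricTracker is a hypothetical stand-in written without imports from Determined; a real callback would subclass determined.keras.callbacks.Callback. The metric name "val_loss" is also an assumption, and pickle is used here only to imitate what saving and restoring a checkpoint might do to the state.

```python
import pickle


class BestMetricTracker:
    """Stand-in with the same state hooks as a Determined callback."""

    def __init__(self, monitor="val_loss"):
        self.monitor = monitor
        self.best = None
        self.validations_seen = 0

    def on_test_end(self, logs=None):
        # Track the lowest monitored value observed across validations.
        value = (logs or {}).get(self.monitor)
        self.validations_seen += 1
        if value is not None and (self.best is None or value < self.best):
            self.best = value

    def get_state(self):
        # Return only plain, pickleable data.
        return {"best": self.best, "validations_seen": self.validations_seen}

    def load_state(self, state):
        # Restore exactly what get_state() produced.
        self.best = state["best"]
        self.validations_seen = state["validations_seen"]


before = BestMetricTracker()
before.on_test_end({"val_loss": 0.9})
before.on_test_end({"val_loss": 0.7})
saved = pickle.dumps(before.get_state())  # imitates storing the state

after = BestMetricTracker()  # fresh object created when training resumes
after.load_state(pickle.loads(saved))
print(after.best, after.validations_seen)  # 0.7 2
```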

on_checkpoint_end(checkpoint_dir: str) → None

on_checkpoint_end is called after a checkpoint is finished, and allows users to save arbitrary files alongside the checkpoint.

Parameters

checkpoint_dir – The path to the checkpoint_dir where new files may be added.
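A minimal sketch of saving an extra file alongside a checkpoint is shown below. SaveNotes, the file name notes.json, and the stored contents are all hypothetical; a real callback would subclass determined.keras.callbacks.Callback, and Determined would supply checkpoint_dir. Here a temporary directory imitates that call.

```python
import json
import os
import tempfile


class SaveNotes:
    """Stand-in showing only the on_checkpoint_end hook."""

    def __init__(self, notes):
        self.notes = notes

    def on_checkpoint_end(self, checkpoint_dir):
        # Write an extra file next to the checkpoint contents.
        path = os.path.join(checkpoint_dir, "notes.json")
        with open(path, "w") as f:
            json.dump(self.notes, f)


# Imitate the call Determined would make, using a temporary directory.
with tempfile.TemporaryDirectory() as checkpoint_dir:
    SaveNotes({"vocab_size": 1000}).on_checkpoint_end(checkpoint_dir)
    with open(os.path.join(checkpoint_dir, "notes.json")) as f:
        restored = json.load(f)
```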

on_train_workload_begin(total_batches_trained: int, batches_requested: Optional[int], logs: Dict) → None

on_train_workload_begin is called before a chunk of model training. The number of batches in the workload may vary, but will not exceed the scheduling_unit setting for the experiment.

Parameters
  • total_batches_trained – The number of batches trained at the start of the workload.

  • batches_requested – The number of batches expected to train during the workload.

  • logs – a dictionary (presently always an empty dictionary)

on_train_workload_end(total_batches_trained: int, logs: Dict) → None

on_train_workload_end is called after a chunk of model training.

Parameters
  • total_batches_trained – The number of batches trained at the end of the workload.

  • logs – a dictionary of training metrics aggregated during this workload.
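The two workload hooks can be paired to measure how many batches each workload actually trained, as in the sketch below. WorkloadSizeLogger is a hypothetical stand-in; a real callback would subclass determined.keras.callbacks.Callback, and the hooks would be called by Determined rather than by hand. The batch counts and the "loss" entry are made-up inputs.

```python
class WorkloadSizeLogger:
    """Stand-in with the workload hooks of a Determined callback."""

    def __init__(self):
        self._start = None
        self.workload_sizes = []

    def on_train_workload_begin(self, total_batches_trained,
                                batches_requested, logs):
        # Remember the batch count at which this workload started.
        self._start = total_batches_trained

    def on_train_workload_end(self, total_batches_trained, logs):
        # Batches completed in this workload = end count - start count.
        self.workload_sizes.append(total_batches_trained - self._start)


cb = WorkloadSizeLogger()
cb.on_train_workload_begin(0, 100, {})
cb.on_train_workload_end(100, {"loss": 0.5})
cb.on_train_workload_begin(100, 50, {})
cb.on_train_workload_end(150, {"loss": 0.4})
print(cb.workload_sizes)  # [100, 50]
```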

class determined.keras.callbacks.EarlyStopping(*arg: Any, **kwarg: Any)

EarlyStopping behaves exactly like the tf.keras.callbacks.EarlyStopping except that it checks after every on_test_end() rather than every on_epoch_end() and it can save and restore its state after pauses in training.

Therefore, part of configuring the Determined implementation of EarlyStopping is to configure min_validation_period for the experiment appropriately (likely it should be configured to validate every epoch).

In Determined, on_test_end may be called slightly more often than min_validation_period during some types of hyperparameter searches, but it is unlikely for that to occur often enough to have a meaningful impact on this callback’s operation.

class determined.keras.callbacks.ReduceLROnPlateau(*arg: Any, **kwarg: Any)

ReduceLROnPlateau behaves exactly like the tf.keras.callbacks.ReduceLROnPlateau except that it checks after every on_test_end() rather than every on_epoch_end() and it can save and restore its state after pauses in training.

Therefore, part of configuring the Determined implementation of ReduceLROnPlateau is to configure min_validation_period for the experiment appropriately (likely it should be configured to validate every epoch).

In Determined, on_test_end may be called slightly more often than min_validation_period during some types of hyperparameter searches, but it is unlikely for that to occur often enough to have a meaningful impact on this callback’s operation.

class determined.keras.callbacks.TensorBoard(*args: Any, **kwargs: Any)

This is a thin wrapper over the TensorBoard callback that ships with tf.keras. For more information, see the TensorBoard Guide or the upstream docs for tf.keras.callbacks.TensorBoard.

Note that if a log_dir argument is passed to the constructor, it will be ignored.

Debugging

Please see Model Debugging in Determined.