
Model Definitions

The model definition is the interface between PEDL and the user's application framework (e.g., Keras, TensorFlow): it is responsible for loading training data, describing the model architecture, and specifying the underlying iterative optimization algorithm. See the Defining Models chapter in the quick start guide for a brief introduction.

Users may specify two types of model definitions:

  1. Standard Model Definition: Implement PEDL's provided Trial interface for your desired task. This option provides finer-grained control over PEDL model construction and computation.
  2. Simple Model Definition: Specify a directory of model code together with an entrypoint script that executes a training and validation procedure. This option requires very few code changes to set up and may be simplest if you're new to PEDL.

When the model definition is a directory, a .pedlignore file at the top level may optionally be used to specify file or directory patterns to ignore. The .pedlignore file uses the same syntax and pattern format as a .gitignore file.
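
For illustration, a minimal .pedlignore might look like this (the patterns below are hypothetical):

    __pycache__/
    *.ckpt
    data/raw/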

Standard Model Definition

A standard model definition defines the interface between PEDL and user model code by implementing a framework-specific Trial subclass. Users can provide these implementations either in a single file or in a directory containing a Python package, i.e., something importable with a top-level __init__.py that exposes the Trial implementation. Unless the TensorFlow Estimator interface is used, the single file or Python package should also expose a make_data_loaders() implementation. examples/mnist_tf provides an example of a directory model definition; examples/cifar10_cnn_keras provides an example of a single-file model definition.

PEDL currently supports five Trial interfaces across three application frameworks: KerasTrial and KerasFunctionalTrial for Keras, TensorFlowTrial and EstimatorTrial for TensorFlow, and PyTorchTrial for PyTorch.

KerasTrial Interface

Keras trials are created by subclassing the abstract class KerasTrial. The KerasTrial interface supports models that use the Keras Sequential API; to use the Keras Functional API, see KerasFunctionalTrial below.

Users must define the following abstract methods to create the deep learning model associated with a specific trial, and to subsequently train and evaluate it:

  • build_model(self, hparams): Defines the deep learning architecture associated with a trial, and typically depends on the trial's specific hyperparameter settings which are stored in the hparams dictionary. This function returns a keras.models.Sequential object.
  • optimizer(self): Specifies the learning algorithm, e.g., keras.optimizers.RMSprop or keras.optimizers.Adam.
  • loss(self): Specifies the loss associated with the objective function to be optimized, e.g., keras.losses.mean_squared_error or keras.losses.categorical_crossentropy.
  • batch_size(self): Specifies the batch size to use for training.
  • validation_metrics(self): Specifies the performance metrics that will be evaluated on the validation data. This function should return a dictionary that maps user-specified metric names to metrics. The metrics can take one of two forms:
    • The first form is a valid Keras metric function, which is a Python function that takes two TensorFlow tensors containing the predictions and labels, respectively, and returns a tensor result. The element-wise mean of this tensor result across all validation batches is saved as the metric value for a given validation step.
    • The second form is a pair of a batch metric function and a reducer function. The batch metric function is a valid Keras metric function as described above, and the reducer is run on the collected per-batch results. An example of a reducer function is provided in pedl.util.elementwise_mean; this is the default reducer used if a metric is specified without one. This second form is useful if you want to override the default reduction procedure.
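
As an illustration, the following sketch shows a minimal KerasTrial subclass. The import path for KerasTrial and the hyperparameter names are assumptions; adapt them to your installation and experiment configuration.

    import keras

    from pedl.frameworks.keras import KerasTrial  # import path is an assumption


    class MNISTTrial(KerasTrial):
        def build_model(self, hparams):
            # The architecture depends on the trial's hyperparameters.
            model = keras.models.Sequential()
            model.add(keras.layers.Dense(hparams["hidden_size"],
                                         activation="relu", input_shape=(784,)))
            model.add(keras.layers.Dense(10, activation="softmax"))
            return model

        def optimizer(self):
            return keras.optimizers.Adam()

        def loss(self):
            return keras.losses.categorical_crossentropy

        def batch_size(self):
            return 32

        def validation_metrics(self):
            # Metric-function form; pedl.util.elementwise_mean is the default reducer.
            return {"accuracy": keras.metrics.categorical_accuracy}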

Optional Methods

  • training_metrics(self): Specifies performance metrics that will be evaluated on each batch of training data. Training loss is always computed and reported as a metric named loss. If supplied, this function defines a set of metrics to be computed in addition to the training loss. This function should return a dictionary that maps user-specified metric names to metric functions. A training metric function is a Python function that takes two TensorFlow tensors and returns a JSON-serializable object (e.g., a floating point value). Users can supply custom metric functions or use one of the built-in Keras metrics. Since the training metrics are evaluated on every batch, we recommend only including metrics that are computed as part of the forward pass of training, e.g., keras.metrics.categorical_accuracy.
  • session_config(self): Specifies the tf.ConfigProto to be used by the TensorFlow session. By default, tf.ConfigProto(allow_soft_placement=True) is used.


KerasFunctionalTrial Interface

The KerasFunctionalTrial interface is designed to support the Keras Functional API. This interface is appropriate for complex models that may require multiple inputs, multiple loss functions, and/or multiple outputs. The interface is similar to the KerasTrial interface with a few significant differences:

  • build_model(self, hparams): Defines the deep learning architecture associated with a trial, and typically depends on the trial's specific hyperparameter settings which are stored in the hparams dictionary. This function returns a keras.models.Model object. All output layers and input layers should be explicitly named so they can be referenced in the losses, training_metrics, and validation_metrics methods.
  • optimizer(self): Specifies the learning algorithm, e.g., keras.optimizers.RMSprop or keras.optimizers.Adam.
  • losses(self): Specifies the loss(es) associated with the objective function to be optimized, e.g., keras.losses.mean_squared_error or keras.losses.categorical_crossentropy. This function should return a dict where the keys are output layer names and the values are Keras loss functions.
  • batch_size(self): Specifies the batch size to use for training.
  • validation_metrics(self): Specifies the performance metrics that will be evaluated on the validation data. This function should return a dictionary that maps user-specified metric names to tuples of length 2 or 3, e.g.:

    {
        "metric1_name": (output_layer,        # str: name of an output layer
                         metric1_operation),  # MetricOp
        "metric2_name": (output_layer,        # str
                         metric2_operation,   # MetricOp
                         metric2_reducer),    # Reducer
        ...
    }
    
    The first element of the tuple is the string name of the Keras output layer the metric should be evaluated on. The second element of the tuple is a valid Keras metric function, which is a Python function that takes two TensorFlow tensors containing the predictions and labels, respectively, and returns a tensor result. The third and optional element of the tuple is a reducer function that defines how the per-batch values of each metric are reduced to a single value. An example of a reducer function is provided in pedl.util.elementwise_mean; this is the default reduction function used if a metric is specified without a reducer function.

    Note

    When a metric is specified on an output layer that doesn't have a loss function, PEDL will follow the behavior of Keras and ignore the metric function.
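
For example, a functional model with two named outputs might implement build_model, losses, and validation_metrics as in the following sketch (layer names, hyperparameter names, and the import path are assumptions; optimizer and batch_size are omitted for brevity):

    import keras
    import pedl.util

    from pedl.frameworks.keras import KerasFunctionalTrial  # import path is an assumption


    class TwoHeadTrial(KerasFunctionalTrial):
        def build_model(self, hparams):
            # Name the input and output layers so that losses and metrics
            # can reference them.
            inputs = keras.layers.Input(shape=(784,), name="digits")
            x = keras.layers.Dense(hparams["hidden_size"], activation="relu")(inputs)
            classes = keras.layers.Dense(10, activation="softmax", name="class_output")(x)
            recon = keras.layers.Dense(784, activation="sigmoid", name="recon_output")(x)
            return keras.models.Model(inputs=inputs, outputs=[classes, recon])

        def losses(self):
            # One loss per named output layer.
            return {
                "class_output": keras.losses.categorical_crossentropy,
                "recon_output": keras.losses.mean_squared_error,
            }

        def validation_metrics(self):
            # A 2-tuple uses the default reducer; a 3-tuple overrides it.
            return {
                "accuracy": ("class_output", keras.metrics.categorical_accuracy),
                "recon_error": ("recon_output", keras.metrics.mean_squared_error,
                                pedl.util.elementwise_mean),
            }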

Optional Methods

  • training_metrics(self): Specifies performance metrics that will be evaluated on each batch of training data. Total training loss is always computed as the sum of all specified losses and reported as a metric named loss. If supplied, this function defines a set of metrics to be computed in addition to the training loss. This function should return a dictionary that maps user-specified metric names to 2-tuples of output layer name and metric function. A layer name is a string containing the name of an output layer in the model. A metric function is a Python function that takes two TensorFlow tensors and returns a JSON-serializable object (e.g., a floating point value). Users can supply custom metric functions or use one of the built-in Keras metrics. Since the training metrics are evaluated on every batch, we recommend only including metrics that are computed as part of the forward pass of training, e.g., keras.metrics.categorical_accuracy.
  • session_config(self): Specifies the tf.ConfigProto to be used by the TensorFlow session. By default, tf.ConfigProto(allow_soft_placement=True) is used.


TensorFlowTrial Interface

TensorFlow trials are created by subclassing the abstract class TensorFlowTrial. Users must define the following abstract methods to create the deep learning model associated with a specific trial, and to subsequently train and evaluate it:

  • data_placeholders(self): Specifies the input data as a dictionary that maps string names to tf.placeholder values. If there are multiple entries in the dictionary, they should match the string names provided to the first argument of the Batch constructor in make_data_loaders().
  • label_placeholders(self): Specifies the label data as a dictionary that maps string names to tf.placeholder values. If there are multiple entries in the dictionary, they should match the string names provided to the first argument of the Batch constructor in make_data_loaders().
  • batch_size(self): Specifies the batch size to use for training.
  • optimizer(self): Specifies the learning algorithm, e.g., the minimize() method of tf.train.MomentumOptimizer.
  • build_graph(self, data_placeholders, label_placeholders, is_training): Builds the TensorFlow graph of variables and operations used for training and validation. data_placeholders and label_placeholders are dictionaries that map string names to placeholder tensors, as specified by data_placeholders(self) and label_placeholders(self). is_training is a Boolean tf.Tensor that is True during a training step and False during a validation step. This function should return a dictionary that maps string names to tf.Tensor nodes in the graph. Any outputs that may potentially be used as training and/or validation metrics should be returned by this function. This function must return at least one tf.Tensor named "loss", which is the value that will be optimized during training.
  • validation_metrics(self): Specifies a list of names of metrics that will be evaluated on the validation data. Each of these names must correspond to a tf.Tensor value returned by build_graph(self).

Note

We recommend initializing the TensorFlow graph in __init__(self, training_loader, validation_loader, hparams) because (i) the graph is required to define all of the aforementioned abstract methods; and (ii) __init__() is the only method of TensorFlowTrial that has access to hparams.
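
Following this recommendation, a minimal TensorFlowTrial might look like the sketch below (TF 1.x API; the import path, shapes, and hyperparameter names are assumptions, and the exact return convention of optimizer() may differ in your version):

    import tensorflow as tf

    from pedl.frameworks.tensorflow import TensorFlowTrial  # import path is an assumption


    class SoftmaxTrial(TensorFlowTrial):
        def __init__(self, training_loader, validation_loader, hparams):
            super().__init__(training_loader, validation_loader, hparams)
            # __init__() is the only method with access to hparams, so build
            # the graph inputs and stash hyperparameters here.
            self._lr = hparams["learning_rate"]
            self._batch_size = hparams["batch_size"]
            self._data = {"features": tf.placeholder(tf.float32, shape=(None, 784))}
            self._labels = {"labels": tf.placeholder(tf.int64, shape=(None,))}

        def data_placeholders(self):
            return self._data

        def label_placeholders(self):
            return self._labels

        def batch_size(self):
            return self._batch_size

        def optimizer(self):
            # The text above cites an optimizer's minimize() method as an example.
            return tf.train.MomentumOptimizer(self._lr, momentum=0.9).minimize

        def build_graph(self, data_placeholders, label_placeholders, is_training):
            logits = tf.layers.dense(data_placeholders["features"], 10)
            labels = label_placeholders["labels"]
            loss = tf.reduce_mean(
                tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                               logits=logits))
            accuracy = tf.reduce_mean(
                tf.cast(tf.equal(tf.argmax(logits, axis=1), labels), tf.float32))
            # Must include a tensor named "loss"; other entries may be used
            # as training or validation metrics.
            return {"loss": loss, "accuracy": accuracy}

        def validation_metrics(self):
            return ["accuracy"]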

Optional Methods

  • training_metrics(self): Specifies a list of names of metrics that will be evaluated on each batch of training data. Training loss is always computed and reported as a metric named loss. If supplied, this function defines the metrics to be computed in addition to the training loss. Each of the returned metric names must correspond to a tf.Tensor value returned by build_graph(self).
  • session_config(self): Specifies the tf.ConfigProto to be used by the TensorFlow session. By default, tf.ConfigProto(allow_soft_placement=True) is used.


EstimatorTrial Interface

To use TensorFlow's high-level tf.estimator.Estimator API with PEDL, users should subclass the abstract class EstimatorTrial. Users must define the following abstract methods to create the deep learning model associated with a specific trial, and to subsequently train and evaluate it:

  • build_estimator(self, hparams): Specifies the tf.estimator.Estimator instance to be used during training and validation. This may be an instance of a Premade Estimator provided by the TensorFlow team, or a Custom Estimator created by the user.
  • build_train_spec(self, hparams): Specifies the tf.estimator.TrainSpec to be used for training steps. This training specification will contain a TensorFlow input_fn which constructs the input data for a training step. Unlike the standard TensorFlow input_fn interface, EstimatorTrial only supports an input_fn that returns a tf.data.Dataset object; a function that returns a tuple of features and labels is not currently supported. Additionally, the max_steps attribute of the training specification is ignored; instead, the batches_per_step option in the experiment configuration determines how many batches each training step uses.
  • build_eval_spec(self, hparams): Specifies the tf.estimator.EvalSpec to be used for validation steps. This evaluation spec will contain a TensorFlow input_fn which constructs the input data for a validation step. A validation step will evaluate the number of batches given by the spec's steps attribute, or, if steps is None, will evaluate until the input_fn raises an end-of-input exception.
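
A minimal sketch of an EstimatorTrial (synthetic in-memory data; the import path and hyperparameter names are assumptions):

    import tensorflow as tf

    from pedl.frameworks.tensorflow import EstimatorTrial  # import path is an assumption


    class DNNTrial(EstimatorTrial):
        def build_estimator(self, hparams):
            return tf.estimator.DNNClassifier(
                feature_columns=[tf.feature_column.numeric_column("x", shape=(784,))],
                hidden_units=[hparams["hidden_size"]],
                n_classes=10)

        def _input_fn(self):
            # Must return a tf.data.Dataset, not a (features, labels) tuple.
            features = {"x": tf.random.uniform((1000, 784))}
            labels = tf.random.uniform((1000,), maxval=10, dtype=tf.int64)
            return tf.data.Dataset.from_tensor_slices((features, labels)).batch(32)

        def build_train_spec(self, hparams):
            # max_steps is ignored; batches_per_step in the experiment
            # configuration controls the length of a training step.
            return tf.estimator.TrainSpec(input_fn=self._input_fn)

        def build_eval_spec(self, hparams):
            # Evaluate 10 batches per validation step.
            return tf.estimator.EvalSpec(input_fn=self._input_fn, steps=10)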

Optional Methods

  • build_serving_input_receiver_fns(self, hparams): Optionally returns a Python dictionary mapping string names to serving_input_receiver_fns. If specified, each serving input receiver function will be used to export a distinct SavedModel inference graph when a PEDL checkpoint is saved, using Estimator.export_saved_model. The exported models are saved under subdirectories named by the keys of the respective serving input receiver functions. For example, returning
    {
      "raw": tf.estimator.export.build_raw_serving_input_receiver_fn(...),
      "parsing": tf.estimator.export.build_parsing_serving_input_receiver_fn(...)
    }
    
    from this function would configure PEDL to export two SavedModel inference graphs in every checkpoint under raw and parsing subdirectories, respectively. By default, this function returns an empty dictionary and the PEDL checkpoint directory only contains metadata associated with the training graph.


PyTorchTrial Interface

PyTorch trials are created by subclassing the abstract class PyTorchTrial. Users must define the following abstract methods to create the deep learning model associated with a specific trial, and to subsequently train and evaluate it:

  • build_model(self, hparams): Defines the deep learning architecture associated with a trial, which typically depends on the trial's specific hyperparameter settings stored in the hparams dictionary. This method returns the model as an instance of nn.Module.

    The input to the model's forward method will be the Batch.data for the Batch that was returned by this experiment's BatchLoader. For simple models, that data will often be a plain tensor, but for multi-input models, it will be in the form of a dictionary of named inputs.

    The output of the model's forward method will be fed directly into the user-defined losses, training_metrics, and validation_metrics methods as predictions.

  • losses(self, predictions, labels): Calculates loss(es) of the model. If the model only returns a single loss, the output of this method can be a scalar tensor which will be used for backpropagation. If the model reports multiple losses, this method must return a dictionary of losses which contains the special key "loss" corresponding to a scalar tensor which will be used for backpropagation. The output of this method is fed directly into the training_metrics and validation_metrics methods.

  • optimizer(self, model): Specifies an instance of torch.optim.Optimizer to be used for training the given model, e.g., torch.optim.SGD(model.parameters(), learning_rate).
  • batch_size(self): Specifies the batch size to use for training.
  • validation_metrics(self, predictions, labels, losses): Calculates and returns a dictionary mapping string names to validation metrics. Metrics may be non-scalar tensors. Results from each batch of validation data will be averaged to compute validation metrics for a given model.
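
A minimal sketch of the required methods (the import path and hyperparameter names are assumptions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    from pedl.frameworks.pytorch import PyTorchTrial  # import path is an assumption


    class MNISTTrial(PyTorchTrial):
        def build_model(self, hparams):
            return nn.Sequential(
                nn.Linear(784, hparams["hidden_size"]),
                nn.ReLU(),
                nn.Linear(hparams["hidden_size"], 10))

        def losses(self, predictions, labels):
            # A single scalar tensor is used directly for backpropagation.
            return F.cross_entropy(predictions, labels)

        def optimizer(self, model):
            return torch.optim.SGD(model.parameters(), lr=0.01)

        def batch_size(self):
            return 32

        def validation_metrics(self, predictions, labels, losses):
            # Per-batch results are averaged across the validation set.
            accuracy = (predictions.argmax(dim=1) == labels).float().mean()
            return {"accuracy": accuracy}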

Optional Methods

  • training_metrics(self, predictions, labels, losses): Calculates and returns a dictionary mapping string names to training metrics. If supplied, this method defines a set of metrics to be computed in addition to the training loss. Metrics may be non-scalar tensors.

Data Loading

We provide convenience classes in pedl.frameworks.data.pytorch to wrap existing PyTorch Datasets and DataLoaders. You may also provide a custom data loader by implementing the BatchLoader interface, and feeding data as torch.Tensors or np.ndarrays.


Callbacks

Trial offers an optional interface to execute arbitrary Python functions before or after each training or validation step. This is useful for integrating with external systems, such as TensorBoard (see example below). To use callbacks in your experiment, implement the following optional interface in your Trial subclass:

  • callbacks(self, hparams): Returns a list of pedl.callback.Callback instances that will be used to run arbitrary Python functions during the lifetime of a PEDL trial. Callbacks are invoked in the order specified by this list.

The following predefined callbacks are provided by PEDL:

  • pedl.frameworks.tensorflow.TensorBoard(log_directory): log_directory specifies the container path where TensorBoard event logs will be written from the trial runner containers. The event logs for each trial are saved in subdirectories of log_directory labelled with the trial ID: <trial_id>/training and <trial_id>/validation for training and validation metrics, respectively. For a complete example, see TensorBoard Integration.

Custom Callbacks

To define custom callbacks, users may subclass pedl.callback.Callback and implement one or more of its optional interface functions:

  • on_trial_begin(): Executed before the start of the first training step of a trial.
  • on_train_step_begin(step_id): Executed at the beginning of a training step.
  • on_train_step_end(step_id, metrics): Executed at the end of a training step. metrics is a list of Python dictionaries for this training step, where each dictionary contains the metrics of a single training batch.
  • on_validation_step_begin(step_id): Executed at the beginning of a validation step.
  • on_validation_step_end(step_id, metrics): Executed at the end of a validation step. metrics is a Python dictionary that contains the metrics for this validation step.
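
For instance, a custom callback that reports how long each training step takes could be sketched as follows (illustrative only):

    import time

    import pedl.callback


    class StepTimer(pedl.callback.Callback):
        def on_train_step_begin(self, step_id):
            self._start = time.time()

        def on_train_step_end(self, step_id, metrics):
            # metrics is a list of per-batch metric dictionaries.
            elapsed = time.time() - self._start
            print("step {}: {} batches in {:.2f}s".format(step_id, len(metrics), elapsed))

The trial's callbacks(self, hparams) method would then return [StepTimer()].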

Simple Model Definition

Simple model definitions provide a mechanism for running models in PEDL without needing to implement a Trial API. Instead, features like automatic checkpointing and task migration are implemented by intercepting method calls from the model code into the deep learning framework (e.g., Keras).

To create an experiment using a simple model definition, the experiment configuration file should specify an entrypoint section. The entrypoint script is the Python script that creates and loads the training data, describes a model architecture, and runs the training and validation procedure using framework APIs (e.g., Keras's fit_generator()). PEDL will run the entrypoint script in a containerized trial runner environment and intercept framework calls to control the execution of model training and validation. To access hyperparameters in model code, use the pedl.get_hyperparameter(name) function, where name is the string name of a hyperparameter as specified in the experiment configuration.

Currently, simple model definitions are only supported for Keras models.

Keras

To use a simple model definition with Keras, specify an entrypoint section in the experiment configuration, where script is set to the location of the entrypoint script relative to the model definition directory. Optionally, specify a list of arguments to be passed to the entrypoint script under args.

Please ensure that your model definition conforms to the following requirements:

  • The model is trained using the fit_generator() API during execution of the entrypoint script. The same argument requirements for fit_generator() apply in PEDL, with the following exceptions:
    • steps_per_epoch and epochs are ignored if provided. Instead, the searcher section in the experiment configuration defines how long the model will be trained for.
    • validation_data must be specified as a generator.
    • validation_steps must be specified unless the validation generator is of type keras.utils.Sequence.
      • In the case that validation_steps is unspecified and validation_data is of type keras.utils.Sequence, then len(validation_data) will be used as validation_steps. This mimics the behavior of the Keras fit_generator() API.
      • A PEDL validation step will use validation_steps batches to compute validation metrics.
    • Code cannot rely on the return value or side effects of fit_generator().
    • Certain types of callbacks may not be supported—see Callbacks below for more details.
  • Any training generator or validation generator used must not reference non-pickleable objects, including threading.Lock and file objects. One exception to this rule is Keras' ImageDataGenerator, which contains a threading.Lock instance that is specially handled by PEDL.

An example is provided at examples/mnist_keras_simple.
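
A minimal entrypoint script along these lines might look like the following sketch (synthetic data; the generator and hyperparameter names are hypothetical):

    import numpy as np
    import pedl

    from keras.layers import Dense
    from keras.models import Sequential
    from keras.utils import to_categorical


    def batch_gen():
        # Synthetic data for illustration.
        while True:
            x = np.random.rand(32, 784)
            y = to_categorical(np.random.randint(10, size=32), num_classes=10)
            yield x, y


    hidden = pedl.get_hyperparameter("hidden_size")

    model = Sequential([
        Dense(hidden, activation="relu", input_shape=(784,)),
        Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])

    # steps_per_epoch and epochs are ignored by PEDL; the searcher
    # configuration controls how long training runs. validation_data is a
    # generator, so validation_steps must be given.
    model.fit_generator(batch_gen(),
                        steps_per_epoch=100,
                        validation_data=batch_gen(),
                        validation_steps=10)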

Callbacks

The following is a non-exhaustive list of supported Keras callbacks:

  • LearningRateScheduler

    The first argument to the schedule function will be interpreted as a PEDL step ID instead of an epoch index. Note that a PEDL step ID is 1-based, as opposed to the 0-based epoch index used by Keras. The learning rate will be applied to the optimizer before the training step is executed. For example, the following code uses a learning rate of 0.01 for the first 10 training steps and a learning rate of 0.001 for the rest of training.

    from keras.callbacks import LearningRateScheduler

    def lr_schedule(step_id: int) -> float:
        if step_id <= 10:
            return 0.01
        else:
            return 0.001

    model.fit_generator(
        ...
        callbacks=[LearningRateScheduler(schedule=lr_schedule)],
        ...
    )
    
  • Validation Metric Callbacks

    The Keras metric API makes it difficult to compute unbatched metrics, such as mAP. One workaround is to pass a reference to the validation data into a Keras callback and compute the metric in on_epoch_end(), as demonstrated in this GitHub issue. To integrate this workaround into PEDL, make sure your callback inherits from pedl.frameworks.keras_simple_trial.KerasValidationCallback, and add the computed metric value to the logs argument of on_epoch_end(). This indicates to PEDL that the callback should run during a validation step instead of a training step. An example callback that computes the mean absolute error (MAE) is provided below:

    from typing import Any, Dict

    import numpy as np

    from pedl.frameworks.keras_simple_trial import KerasValidationCallback

    class ComputeMAEMetricCallback(KerasValidationCallback):
        def __init__(self, validation_gen) -> None:
            super().__init__()
            self.validation_gen = validation_gen

        def on_epoch_end(self, epoch: int, logs: Dict[str, Any]) -> None:
            data, labels = next(self.validation_gen)
            predictions = self.model.predict(data)
            predictions = np.squeeze(predictions)
            # MAE is the mean of the absolute errors.
            mae = np.mean(np.abs(predictions - labels))
            logs["mae"] = mae

    ...

    model.fit_generator(
        ...
        callbacks=[ComputeMAEMetricCallback(validation_gen)]
    )
    
  • TensorBoard

    If using the TensorBoard callback, the update_freq argument will be ignored and PEDL will serialize the metrics at the end of every training and validation step. All metrics will be serialized following the metric name conventions used by Keras ("val_" is prepended to the validation metric names).

  • ReduceLROnPlateau

    When ReduceLROnPlateau is used as part of a Keras simple model definition, it adheres to the following semantics:

      • If the monitor argument is set to a training metric, the patience and cooldown arguments refer to numbers of training steps rather than epochs.
      • If the monitor argument is set to a validation metric (any metric prefixed with "val_"), the patience and cooldown arguments refer to numbers of validation steps rather than epochs.

    When using ReduceLROnPlateau to track a validation metric, we recommend setting min_validation_period so that validation steps occur at evenly spaced intervals.
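
    For example (hypothetical values; because a "val_"-prefixed metric is monitored, patience and cooldown count validation steps):

    from keras.callbacks import ReduceLROnPlateau

    reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                  patience=3, cooldown=1)

    Pass the callback to the callbacks argument of fit_generator().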

Please reach out to the Determined AI team for more information on whether a Keras callback you are using is supported.