Model Definitions

The model definition is the interface between PEDL and the user's application framework (e.g., Keras, TensorFlow): it loads training data, describes the model architecture, and specifies the iterative optimization algorithm used for training. See the Defining Models chapter in the quick start guide for a brief introduction.

Users may specify two types of model definitions:

  1. Standard Model Definition: Implement PEDL's provided Trial interface for your desired task. This option provides finer-grained control over PEDL model construction and computation.
  2. Simple Model Definition: Specify a directory of model code together with an entrypoint script that executes a training and validation procedure. This option requires very few code changes to set up and may be simplest if you're new to PEDL.

When the model definition is a directory, a .pedlignore file at the top level may optionally be used to specify file or directory patterns to ignore. The .pedlignore file uses the same syntax and pattern format as a .gitignore file.
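
For example, a .pedlignore file with the following illustrative patterns would exclude a local data directory, checkpoint files, and Python bytecode caches from the model definition:

    data/
    *.ckpt
    __pycache__/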

Standard Model Definition

A standard model definition defines the interface between PEDL and user model code by implementing a framework-specific Trial subclass. Users can provide these implementations either in a single file or in a directory containing a Python package (i.e., something importable with a top-level __init__.py) that exposes the Trial implementation. Unless the TensorFlow Estimator interface is used, the single file or Python package should also expose a make_data_loaders() implementation. examples/mnist_tf provides an example of a directory model definition; examples/cifar10_cnn_keras provides an example of a single-file model definition.

PEDL currently supports five Trial interfaces, spanning three application frameworks: KerasTrial, KerasFunctionalTrial, TensorFlowTrial, EstimatorTrial, and PyTorchTrial.

KerasTrial Interface

Keras trials are created by subclassing the abstract class KerasTrial. The KerasTrial interface supports models that use the Keras Sequential API; to use the Keras Functional API, see KerasFunctionalTrial below.

Users must define the following abstract methods to create the deep learning model associated with a specific trial, and to subsequently train and evaluate it:

  • build_model(self, hparams): Defines the deep learning architecture associated with a trial, and typically depends on the trial's specific hyperparameter settings which are stored in the hparams dictionary. This function returns a keras.models.Sequential object.
  • optimizer(self): Specifies the learning algorithm, e.g., keras.optimizers.RMSprop or keras.optimizers.Adam.
  • loss(self): Specifies the loss associated with the objective function to be optimized, e.g., keras.losses.mean_squared_error or keras.losses.categorical_crossentropy.
  • batch_size(self): Specifies the batch size to use for training.
  • validation_metrics(self): Specifies the performance metrics that will be evaluated on the validation data. This function should return a dictionary that maps user-specified metric names to metrics. The metrics can take one of two forms:
    • The first form is a valid Keras metric function, which is a Python function that takes two TensorFlow tensors containing the predictions and labels, respectively, and returns a tensor result. The element-wise mean of this tensor result across all validation batches is saved as the metric value for a given validation step.
    • The second form is a pair of a batch metric function and a reducer function. The batch metric function is a valid Keras metric function as described above, and the reducer is run on the collected per-batch results. An example of a reducer function is provided in pedl.util.elementwise_mean; this is the default reduction function used if a metric is specified without a reducer. This second form is useful when you want to override the default reduction procedure.
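
For concreteness, here is a minimal sketch of a KerasTrial subclass. The class name, layer sizes, and the "hidden_size" hyperparameter are hypothetical, and the exact import path for KerasTrial is an assumption that may differ in your PEDL version:

    import keras

    from pedl.frameworks.keras import KerasTrial  # import path is an assumption


    class MNISTKerasTrial(KerasTrial):
        def build_model(self, hparams):
            # The architecture typically depends on hyperparameters in `hparams`.
            model = keras.models.Sequential()
            model.add(keras.layers.Dense(hparams["hidden_size"], activation="relu",
                                         input_shape=(784,)))
            model.add(keras.layers.Dense(10, activation="softmax"))
            return model

        def optimizer(self):
            return keras.optimizers.Adam()

        def loss(self):
            return keras.losses.categorical_crossentropy

        def batch_size(self):
            return 32

        def validation_metrics(self):
            # Either a bare Keras metric function, or a (metric, reducer) pair.
            return {"accuracy": keras.metrics.categorical_accuracy}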

Optional Methods

  • training_metrics(self): Specifies performance metrics that will be evaluated on each batch of training data. Training loss is always computed and reported as a metric named loss. If supplied, this function defines a set of metrics to be computed in addition to the training loss. This function should return a dictionary that maps user-specified metric names to metric functions. A training metric function is a Python function that takes two TensorFlow tensors and returns a JSON-serializable object (e.g., a floating point value). Users can supply custom metric functions or use one of the built-in Keras metrics. Since the training metrics are evaluated on every batch, we recommend only including metrics that are computed as part of the forward pass of training, e.g., keras.metrics.categorical_accuracy.
  • session_config(self): Specifies the tf.ConfigProto to be used by the TensorFlow session. By default, tf.ConfigProto(allow_soft_placement=True) is used.


KerasFunctionalTrial Interface

The KerasFunctionalTrial interface is designed to support the Keras Functional API. This interface is appropriate for complex models that may require multiple inputs, multiple loss functions, and/or multiple outputs. The interface is similar to the KerasTrial interface with a few significant differences:

  • build_model(self, hparams): Defines the deep learning architecture associated with a trial, and typically depends on the trial's specific hyperparameter settings which are stored in the hparams dictionary. This function returns a keras.models.Model object. All output layers and input layers should be explicitly named so they can be referenced in the losses, training_metrics, and validation_metrics methods.
  • optimizer(self): Specifies the learning algorithm, e.g., keras.optimizers.RMSprop or keras.optimizers.Adam.
  • losses(self): Specifies the loss(es) associated with the objective function to be optimized, e.g., keras.losses.mean_squared_error or keras.losses.categorical_crossentropy. This function should return a dict where the keys are output layer names and the values are Keras loss functions.
  • batch_size(self): Specifies the batch size to use for training.
  • validation_metrics(self): Specifies the performance metrics that will be evaluated on the validation data. This function should return a dictionary that maps user-specified metric names to tuples of length 2 or 3, e.g.:

    {
        "metric1_name": (output_layer_name, metric1_fn),
        "metric2_name": (output_layer_name, metric2_fn, metric2_reducer),
        ...
    }
    
    The first element of the tuple is the string name of the Keras output layer the metric should be evaluated on. The second element of the tuple is a valid Keras metric function, which is a Python function that takes two TensorFlow tensors containing the predictions and labels, respectively, and returns a tensor result. The third and optional element of the tuple is a reducer function that defines how the per-batch values of each metric are reduced to a single value. An example of a reducer function is provided in pedl.util.elementwise_mean; this is the default reduction function used if a metric is specified without a reducer function.

    Note

    When a metric is specified on an output layer that doesn't have a loss function, PEDL will follow the behavior of Keras and ignore the metric function.
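
For concreteness, here is a minimal sketch of a KerasFunctionalTrial with two named output layers. The class name, layer names, and the "hidden_size" hyperparameter are hypothetical, and the import path for KerasFunctionalTrial is an assumption:

    import keras

    from pedl.frameworks.keras import KerasFunctionalTrial  # import path is an assumption


    class TwoHeadTrial(KerasFunctionalTrial):
        def build_model(self, hparams):
            # Name input and output layers so they can be referenced below.
            inputs = keras.layers.Input(shape=(784,), name="images")
            hidden = keras.layers.Dense(hparams["hidden_size"], activation="relu")(inputs)
            digit = keras.layers.Dense(10, activation="softmax", name="digit")(hidden)
            parity = keras.layers.Dense(2, activation="softmax", name="parity")(hidden)
            return keras.models.Model(inputs=inputs, outputs=[digit, parity])

        def optimizer(self):
            return keras.optimizers.Adam()

        def losses(self):
            # One loss per named output layer.
            return {
                "digit": keras.losses.categorical_crossentropy,
                "parity": keras.losses.categorical_crossentropy,
            }

        def batch_size(self):
            return 32

        def validation_metrics(self):
            # (output layer name, metric function) tuples; an optional third
            # element overrides the default elementwise-mean reducer.
            return {"digit_accuracy": ("digit", keras.metrics.categorical_accuracy)}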

Optional Methods

  • training_metrics(self): Specifies performance metrics that will be evaluated on each batch of training data. Total training loss is always computed as the sum of all specified losses and reported as a metric named loss. If supplied, this function defines a set of metrics to be computed in addition to the training loss. This function should return a dictionary that maps user-specified metric names to 2-tuples of output layer name and metric function. A layer name is a string containing the name of an output layer in the model. A metric function is a Python function that takes two TensorFlow tensors and returns a JSON-serializable object (e.g., a floating point value). Users can supply custom metric functions or use one of the built-in Keras metrics. Since the training metrics are evaluated on every batch, we recommend only including metrics that are computed as part of the forward pass of training, e.g., keras.metrics.categorical_accuracy.
  • session_config(self): Specifies the tf.ConfigProto to be used by the TensorFlow session. By default, tf.ConfigProto(allow_soft_placement=True) is used.


TensorFlowTrial Interface

TensorFlow trials are created by subclassing the abstract class TensorFlowTrial. Users must define the following abstract methods to create the deep learning model associated with a specific trial, and to subsequently train and evaluate it:

  • data_placeholders(self): Specifies the input data as a dictionary that maps string names to tf.placeholder values. If there are multiple entries in the dictionary, they should match the string names provided to the first argument of the Batch constructor in make_data_loaders().
  • label_placeholders(self): Specifies the label data as a dictionary that maps string names to tf.placeholder values. If there are multiple entries in the dictionary, they should match the string names provided to the first argument of the Batch constructor in make_data_loaders().
  • batch_size(self): Specifies the batch size to use for training.
  • optimizer(self): Specifies the learning algorithm, e.g., the minimize() method of tf.train.MomentumOptimizer.
  • build_graph(self, data_placeholders, label_placeholders, is_training): Builds the TensorFlow graph of variables and operations used for training and validation. data_placeholders and label_placeholders are dictionaries that map string names to placeholder tensors, as specified by data_placeholders(self) and label_placeholders(self). is_training is a Boolean tf.Tensor that is True during a training step and False during a validation step. This function should return a dictionary that maps string names to tf.Tensor nodes in the graph. Any outputs that may potentially be used as training and/or validation metrics should be returned by this function. This function must return at least one tf.Tensor named "loss", which is the value that will be optimized during training.
  • validation_metrics(self): Specifies a list of names of metrics that will be evaluated on the validation data. Each of these names must correspond to a tf.Tensor value returned by build_graph(self).

Note

We recommend initializing the TensorFlow graph in __init__(self, training_loader, validation_loader, hparams) because (i) the graph is required to define all of the aforementioned abstract methods; and (ii) __init__() is the only method of TensorFlowTrial that has access to hparams.
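
For concreteness, here is a minimal sketch of a TensorFlowTrial that follows the recommendation above and builds its placeholders in __init__. The class name, hyperparameter names, import path, and the super().__init__ call are assumptions:

    import tensorflow as tf

    from pedl.frameworks.tensorflow import TensorFlowTrial  # import path is an assumption


    class SoftmaxTrial(TensorFlowTrial):
        def __init__(self, training_loader, validation_loader, hparams):
            super().__init__(training_loader, validation_loader, hparams)  # assumption
            # __init__ is the only method that receives hparams, so stash what we need.
            self._batch_size = hparams["batch_size"]
            self._lr = hparams["learning_rate"]
            self._data = {"features": tf.placeholder(tf.float32, shape=[None, 784])}
            self._labels = {"labels": tf.placeholder(tf.int64, shape=[None])}

        def data_placeholders(self):
            return self._data

        def label_placeholders(self):
            return self._labels

        def batch_size(self):
            return self._batch_size

        def optimizer(self):
            # Per the description above, e.g., the minimize() method of an optimizer.
            return tf.train.MomentumOptimizer(self._lr, momentum=0.9).minimize

        def build_graph(self, data_placeholders, label_placeholders, is_training):
            logits = tf.layers.dense(data_placeholders["features"], 10)
            loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=label_placeholders["labels"], logits=logits))
            correct = tf.equal(tf.argmax(logits, 1), label_placeholders["labels"])
            accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
            # Must return at least a tensor named "loss".
            return {"loss": loss, "accuracy": accuracy}

        def validation_metrics(self):
            return ["loss", "accuracy"]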

Optional Methods

  • training_metrics(self): Specifies a list of names of metrics that will be evaluated on each batch of training data. Training loss is always computed and reported as a metric named loss. If supplied, this function defines the metrics to be computed in addition to the training loss. Each of the returned metric names must correspond to a tf.Tensor value returned by build_graph(self).
  • session_config(self): Specifies the tf.ConfigProto to be used by the TensorFlow session. By default, tf.ConfigProto(allow_soft_placement=True) is used.


EstimatorTrial Interface

To use TensorFlow's high-level tf.estimator.Estimator API with PEDL, users should subclass the abstract class EstimatorTrial. Users must define the following abstract methods to create the deep learning model associated with a specific trial, and to subsequently train and evaluate it:

  • build_estimator(self, hparams): Specifies the tf.estimator.Estimator instance to be used during training and validation. This may be an instance of a Premade Estimator provided by the TensorFlow team, or a Custom Estimator created by the user.
  • build_train_spec(self, hparams): Specifies the tf.estimator.TrainSpec to be used for training steps. This training specification will contain a TensorFlow input_fn which constructs the input data for a training step. The max_steps attribute of the training specification will be ignored; instead, the batches_per_step option in the experiment configuration is used to determine how many batches each training step uses.
  • build_eval_spec(self, hparams): Specifies the tf.estimator.EvalSpec to be used for validation steps. This evaluation spec will contain a TensorFlow input_fn which constructs the input data for a validation step. The validation step will evaluate the number of batches given by the steps attribute of the evaluation spec; if steps is None, it will evaluate until the input_fn raises an end-of-input exception.
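
For concreteness, here is a minimal sketch of an EstimatorTrial built around a premade estimator. The class name, hyperparameter names, import path, and the make_dataset() helper are assumptions:

    import tensorflow as tf

    from pedl.frameworks.tensorflow import EstimatorTrial  # import path is an assumption


    def make_dataset(train, batch_size):
        # Hypothetical helper: a tiny in-memory dataset of ({"x": features}, labels).
        features = {"x": tf.random.uniform([256, 784])}
        labels = tf.random.uniform([256], maxval=10, dtype=tf.int64)
        return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(batch_size)


    class DNNTrial(EstimatorTrial):
        def build_estimator(self, hparams):
            return tf.estimator.DNNClassifier(
                feature_columns=[tf.feature_column.numeric_column("x", shape=[784])],
                hidden_units=[hparams["hidden_size"]],
                n_classes=10,
            )

        def build_train_spec(self, hparams):
            # max_steps is ignored; batches_per_step in the experiment config
            # determines how many batches each training step uses.
            return tf.estimator.TrainSpec(
                input_fn=lambda: make_dataset(train=True, batch_size=hparams["batch_size"]))

        def build_eval_spec(self, hparams):
            # steps=None would evaluate until the input_fn raises end-of-input.
            return tf.estimator.EvalSpec(
                input_fn=lambda: make_dataset(train=False, batch_size=hparams["batch_size"]),
                steps=100)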

Optional Methods

  • build_serving_input_receiver_fns(self, hparams): Optionally returns a Python dictionary mapping string names to serving_input_receiver_fns. If specified, each serving input receiver function will be used to export a distinct SavedModel inference graph when a PEDL checkpoint is saved, using Estimator.export_saved_model. The exported models are saved under subdirectories named by the keys of the respective serving input receiver functions. For example, returning
    {
      "raw": tf.estimator.export.build_raw_serving_input_receiver_fn(...),
      "parsing": tf.estimator.export.build_parsing_serving_input_receiver_fn(...)
    }
    
    from this function would configure PEDL to export two SavedModel inference graphs in every checkpoint under raw and parsing subdirectories, respectively. By default, this function returns an empty dictionary and the PEDL checkpoint directory only contains metadata associated with the training graph.


PyTorchTrial Interface

PyTorch trials are created by subclassing the abstract class PyTorchTrial. Users must define the following abstract methods to create the deep learning model associated with a specific trial, and to subsequently train and evaluate it:

  • build_model(self, hparams): Defines the deep learning architecture associated with a trial, which typically depends on the trial's specific hyperparameter settings stored in the hparams dictionary. This function returns a torch.nn.Module object that defines the network through its forward function. Prediction model(s) and calculation of loss/metrics should all be built into the model, as follows.

    • Inputs to the network are arguments of the forward function. These should include Batch inputs and outputs (data and labels of the dataset).
    • The forward function runs the network and returns a dict of output tensors (torch.Tensor objects), which define the output nodes.
    • Output nodes must contain a node named "loss" for training. Output nodes should also include any nodes of interest for prediction, training metrics, or validation metrics.
  • optimizer(self, model): Specifies an instance of torch.optim.Optimizer to be used for training the given model, e.g., torch.optim.SGD(model.parameters(), learning_rate).

  • batch_size(self): Specifies the batch size to use for training.
  • validation_metrics(self): Specifies the performance metrics that will be evaluated on the validation data. This function should return a list of strings that are keys of the output dict returned by the forward method of the model. Validation metrics are evaluated per batch then averaged over the batches.
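
For concreteness, here is a minimal sketch of a PyTorchTrial. The class names, the "hidden_size" hyperparameter, the fixed learning rate, and the import path for PyTorchTrial are assumptions; the forward arguments (data, labels) must match the batch input/output names provided by the data loaders (see Data Loading below):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    from pedl.frameworks.pytorch import PyTorchTrial  # import path is an assumption


    class MNISTNet(nn.Module):
        def __init__(self, hidden_size):
            super().__init__()
            self.fc1 = nn.Linear(784, hidden_size)
            self.fc2 = nn.Linear(hidden_size, 10)

        def forward(self, data, labels):
            logits = self.fc2(F.relu(self.fc1(data)))
            loss = F.cross_entropy(logits, labels)
            accuracy = (logits.argmax(dim=1) == labels).float().mean()
            # The output dict must contain "loss"; other keys can serve as metrics.
            return {"loss": loss, "accuracy": accuracy}


    class MNISTPyTorchTrial(PyTorchTrial):
        def build_model(self, hparams):
            return MNISTNet(hparams["hidden_size"])

        def optimizer(self, model):
            return torch.optim.SGD(model.parameters(), lr=0.01)

        def batch_size(self):
            return 32

        def validation_metrics(self):
            # Keys of the dict returned by forward(); averaged over validation batches.
            return ["accuracy"]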

Optional Methods

  • training_metrics(self): Specifies performance metrics that will be evaluated on each batch of training data. Training loss is always computed and reported as a metric named loss. If supplied, this function defines a set of metrics to be computed in addition to the training loss. This function should, like validation_metrics(), return a list of training metrics by specifying keys within the output dict of the model.

Data Loading

We provide convenience classes in pedl.frameworks.data.pytorch to wrap existing PyTorch Datasets and DataLoaders. You may also provide a custom data loader by implementing the BatchLoader interface, and feeding data as np.ndarrays.

Instances of BatchLoader used with PyTorchTrial must satisfy the following conditions: (1) no name may appear as both a batch input and a batch output; (2) the combined set of batch input and output names must match the argument names of the forward method of the model returned by build_model().


Callbacks

StandardTrial offers an optional interface to execute arbitrary Python functions before or after each training or validation step. This is useful for integrating with external systems, such as TensorBoard (see example below). To use callbacks in your experiment, implement the following optional interface in your StandardTrial subclass:

  • callbacks(self, hparams): Returns a list of pedl.callback.Callback instances that will be used to run arbitrary Python functions during the lifetime of a PEDL trial. Callbacks are invoked in the order specified by this list.

The following predefined callbacks are provided by PEDL:

  • pedl.frameworks.tensorflow.TensorBoard(log_directory): log_directory specifies the container path where TensorBoard event logs will be written from the trial runner containers. The event logs for each trial are saved in subdirectories of log_directory named by the trial ID: <trial_id>/training and <trial_id>/validation for training and validation metrics, respectively. For a complete example, see TensorBoard Integration.

Custom Callbacks

To define custom callbacks, users may subclass pedl.callback.Callback and implement one or more of its optional interface functions:

  • on_trial_begin(): Executed before the start of the first training step of a trial.
  • on_train_step_begin(step_id): Executed at the beginning of a training step.
  • on_train_step_end(step_id, metrics): Executed at the end of a training step. metrics is a list of Python dictionaries for this training step, where each dictionary contains the metrics of a single training batch.
  • on_validation_step_begin(step_id): Executed at the beginning of a validation step.
  • on_validation_step_end(step_id, metrics): Executed at the end of a validation step. metrics is a Python dictionary that contains the metrics for this validation step.
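
For concreteness, here is a minimal sketch of a custom callback; the class name and what it logs are arbitrary:

    from pedl.callback import Callback  # assuming pedl.callback is importable as referenced above


    class MetricsLogger(Callback):
        def on_trial_begin(self):
            print("starting trial")

        def on_train_step_end(self, step_id, metrics):
            # `metrics` is a list of per-batch metric dictionaries.
            losses = [m["loss"] for m in metrics]
            print("step {}: mean training loss = {}".format(step_id, sum(losses) / len(losses)))

        def on_validation_step_end(self, step_id, metrics):
            # `metrics` is a single dictionary of validation metrics.
            print("validation after step {}: {}".format(step_id, metrics))

In a Trial subclass, callbacks(self, hparams) would then return [MetricsLogger()].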

Simple Model Definition

Simple model definitions provide a mechanism for running models in PEDL without needing to implement a Trial API. Instead, features like automatic checkpointing and task migration are implemented by intercepting method calls from the model code into the deep learning framework (e.g., Keras).

To create an experiment using a simple model definition, the experiment configuration file should specify an entrypoint section. The entrypoint script is the Python script that creates and loads the training data, describes a model architecture, and runs the training and validation procedure using framework APIs (e.g., Keras's fit_generator()). PEDL will run the entrypoint script in a containerized trial runner environment and intercept framework calls to control the execution of model training and validation. To access hyperparameters in model code, use the pedl.get_hyperparameter(name) function, where name is the string name of a hyperparameter as specified in the experiment configuration.

Currently, simple model definitions are only supported for Keras models.

Keras

To use a simple model definition with Keras, specify an entrypoint section in the experiment configuration, where script is set to the location of the entrypoint script relative to the model definition directory. Optionally, specify a list of arguments to be passed to the entrypoint script under args.

Please ensure that your model definition conforms to the following requirements:

  • The model is trained using the fit_generator() API during execution of the entrypoint script. The usual fit_generator() argument requirements apply in PEDL, with the following exceptions:
    • steps_per_epoch and epochs are ignored if provided. Instead, the searcher section in the experiment configuration defines how long the model will be trained for.
    • validation_data must be specified as a generator.
    • validation_steps must be specified unless the validation generator is of type keras.utils.Sequence.
      • If validation_steps is unspecified and validation_data is of type keras.utils.Sequence, len(validation_data) will be used as validation_steps. This mimics the behavior of the Keras fit_generator() API.
      • A PEDL validation step will use validation_steps batches to compute validation metrics.
    • Code cannot rely on the return value or side effects of fit_generator().
    • Certain types of callbacks may not be supported—see Callbacks below for more details.
  • Any training generator or validation generator used must not reference non-pickleable objects, including threading.Lock and file objects. One exception to this rule is Keras' ImageDataGenerator, which contains a threading.Lock instance that is specially handled by PEDL.

An example is provided at examples/mnist_keras_simple.
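
For illustration, a minimal entrypoint script might look like the following sketch. The data generator and the "hidden_size" hyperparameter name are hypothetical; note that validation_data is a generator and validation_steps is set explicitly:

    import keras
    import numpy as np
    import pedl


    def make_generator():
        # Hypothetical generator yielding random (data, labels) batches, for illustration only.
        while True:
            x = np.random.rand(32, 784).astype("float32")
            y = keras.utils.to_categorical(np.random.randint(10, size=32), num_classes=10)
            yield x, y


    hidden_size = pedl.get_hyperparameter("hidden_size")

    model = keras.models.Sequential([
        keras.layers.Dense(hidden_size, activation="relu", input_shape=(784,)),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])

    # steps_per_epoch and epochs are ignored; the searcher section of the
    # experiment configuration controls how long the model is trained.
    model.fit_generator(
        make_generator(),
        steps_per_epoch=100,
        validation_data=make_generator(),
        validation_steps=10,
    )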

Callbacks

The following is a non-exhaustive list of supported Keras callbacks:

  • LearningRateScheduler

    The first argument to the schedule function will be interpreted as a PEDL step ID instead of an epoch index. Note that a PEDL step ID is 1-based, as opposed to the 0-based epoch index used by Keras. The learning rate will be applied to the optimizer before the training step is executed. For example, the following code uses a learning rate of 0.01 for the first 10 training steps and a learning rate of 0.001 for the rest of training.

    def lr_schedule(step_id: int) -> float:
        if step_id <= 10:
            return 0.01
        else:
            return 0.001
    
    model.fit_generator(
        ...
        callbacks=[LearningRateScheduler(schedule=lr_schedule)],
        ...
    )
    
  • Validation Metric Callbacks

    The Keras metric API makes it difficult to compute unbatched metrics, such as mAP. One workaround is to pass a reference to the validation data into a Keras callback and compute the metric in on_epoch_end(), as demonstrated in this GitHub issue. To integrate this workaround into PEDL, make sure your callback inherits from pedl.frameworks.keras_simple_trial.KerasValidationCallback, and add the computed metric value to the logs argument of on_epoch_end(). This indicates to PEDL that the callback should be run during a validation step instead of during a training step. An example callback that computes the Mean Absolute Error (MAE) is provided below:

    from typing import Any, Dict

    import numpy as np

    from pedl.frameworks.keras_simple_trial import KerasValidationCallback

    class ComputeMAEMetricCallback(KerasValidationCallback):
        def __init__(self, validation_gen) -> None:
            super().__init__()
            self.validation_gen = validation_gen

        def on_epoch_end(self, epoch: int, logs: Dict[str, Any]) -> None:
            data, labels = next(self.validation_gen)
            predictions = self.model.predict(data)
            predictions = np.squeeze(predictions)
            # Mean (not sum) of the absolute errors gives the MAE.
            mae = np.mean(np.abs(predictions - labels))
            logs["mae"] = mae
    
    ...
    
    model.fit_generator(
        ...
        callbacks=[ComputeMAEMetricCallback(validation_gen)]
    )
    
  • TensorBoard

    If using the TensorBoard callback, the update_freq argument will be ignored and PEDL will serialize the metrics at the end of every training and validation step. All metrics will be serialized following the metric name conventions used by Keras ("val_" is prepended to the validation metric names).

  • ReduceLROnPlateau

    When ReduceLROnPlateau is used as part of a Keras simple model definition, it adheres to the following semantics: If the monitor argument is set to monitor a training metric, the patience and cooldown arguments refer to the number of training steps instead of number of epochs. If the monitor argument is set to track a validation metric (any metric prefixed with "val_"), the patience and cooldown arguments refer to the number of validation steps instead of number of epochs. If using ReduceLROnPlateau to track a validation metric, it is recommended to set a min_validation_period to keep the schedule of validation steps at evenly paced intervals.
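
    For example, the following illustrative configuration tracks a validation metric, so patience and cooldown are counted in validation steps:

    from keras.callbacks import ReduceLROnPlateau

    # monitor="val_loss" tracks a validation metric, so patience/cooldown
    # count validation steps rather than epochs.
    reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3, cooldown=1)

    # Pass it to fit_generator(..., callbacks=[reduce_lr]) in the entrypoint script.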

Please reach out to the Determined AI team for more information on whether a Keras callback you are using is supported.