TensorFlow Model Definition

This part of the documentation describes how to train a TensorFlow model in PEDL. PEDL provides two interfaces for TensorFlow models:

  • TensorFlowTrial provides finer-grained control over data loading, model construction and computation flow; it is the interface that most closely supports low-level TensorFlow models.
  • EstimatorTrial supports the high-level TensorFlow Estimators interface.

Support for Keras models is described on the Keras trial page.

TensorFlowTrial Interface

There are two steps needed to define a TensorFlow model in PEDL using TensorFlowTrial.

  1. Define a make_data_loaders() function to specify data access and any preprocessing in the data pipeline.
  2. Subclass the abstract class TensorFlowTrial. This part of the interface defines the deep learning model, including the graph, loss, and optimizers.

Data Loading via make_data_loaders()

A PEDL user specifies data access in TensorFlowTrial by writing a make_data_loaders() function. This function should return a pair of batched tf.data.Dataset objects, the first for the training set and the second for the validation set.

def make_data_loaders(experiment_config, hparams):
    return trainDataset, valDataset

In TensorFlowTrial, users configure input-pipeline optimizations for their training and validation Datasets within the make_data_loaders function, before returning these Datasets. See the TensorFlow Data Input Pipeline Performance guide for more details on the capabilities of Datasets. For example, to specify prefetching and batching for Datasets generated by passing TFRecords through a map function foo:

def make_data_loaders(experiment_config, hparams):
    trainDataset = tf.data.Dataset.list_files("/path/train-*.tfrecord")
    trainDataset = trainDataset.map(map_func=foo)
    # PEDL requires that make_data_loaders() return batched Datasets.
    trainDataset = trainDataset.batch(batch_size=BATCH_SIZE)
    trainDataset = trainDataset.prefetch(tf.data.experimental.AUTOTUNE)

    valDataset = tf.data.Dataset.list_files("/path/val-*.tfrecord")
    valDataset = valDataset.map(map_func=foo)
    # PEDL requires that make_data_loaders() return batched Datasets.
    valDataset = valDataset.batch(batch_size=BATCH_SIZE)
    valDataset = valDataset.prefetch(tf.data.experimental.AUTOTUNE)

    return trainDataset, valDataset

Record format: PEDL does not restrict the format of the records in the training and validation Datasets returned by make_data_loaders. However, the two Datasets should have the same output_classes, output_shapes, and output_types properties so that they can feed the same graph. When defining the TensorFlow graph in their model code (via the build_graph(record, is_training) method explained below), the parameter record represents the output of the Dataset.

Passing fields from the experiment configuration file: The make_data_loaders function takes two arguments, experiment_config and hparams. We recommend users pass in metadata regarding the data pipeline through the data field in the experiment configuration file; its subfields can be accessed as experiment_config["data"].get("field_name"). Subfield names are up to the user to define. The second argument hparams gives the user access to this trial's sample of the hyperparameters in the experiment configuration file. As an example, if we add fields in the experiment configuration file as follows:

    data:
      path: /my_data_path/

    hyperparameters:
      batch_size:
        type: categorical
        vals: [8, 16]

then we could update the above example to be:

def make_data_loaders(experiment_config, hparams):
    # data_path will evaluate to "/my_data_path/"
    data_path = experiment_config["data"]["path"]
    # batch_size will evaluate to either 8 or 16, depending on this trial's
    # sample of hyperparameters.
    batch_size = hparams["batch_size"]

    trainDataset = tf.data.Dataset.list_files(data_path + "train-*.tfrecord")
    trainDataset = trainDataset.map(map_func=foo)
    # PEDL requires that make_data_loaders() return batched Datasets.
    trainDataset = trainDataset.batch(batch_size)
    trainDataset = trainDataset.prefetch(tf.data.experimental.AUTOTUNE)

    valDataset = tf.data.Dataset.list_files(data_path + "val-*.tfrecord")
    valDataset = valDataset.map(map_func=foo)
    # PEDL requires that make_data_loaders() return batched Datasets.
    valDataset = valDataset.batch(batch_size)
    valDataset = valDataset.prefetch(tf.data.experimental.AUTOTUNE)

    return trainDataset, valDataset
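For illustration, here is roughly what those two arguments would contain for one trial of the experiment above, shown as plain dictionaries (in practice PEDL constructs and passes them for you; the sampled value shown is one of the two possibilities):

```python
# Plain-dictionary illustration of the arguments PEDL passes into
# make_data_loaders() for a single trial; values match the example
# configuration above.
experiment_config = {"data": {"path": "/my_data_path/"}}
hparams = {"batch_size": 16}  # this trial's sample from vals: [8, 16]

data_path = experiment_config["data"].get("path")
batch_size = hparams["batch_size"]

print(data_path, batch_size)  # → /my_data_path/ 16
```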

Subclassing TensorFlowTrial

TensorFlow trials are created by subclassing the abstract class TensorFlowTrial. Users must define the following abstract methods that will specify the deep learning model (including the TensorFlow graph, with input nodes and loss) associated with a trial in the experiment, as well as how to subsequently train and evaluate it:

  • optimizer(self): Returns the optimizer to use during training, which must be an instance of a tf.train.Optimizer subclass, e.g., tf.train.MomentumOptimizer or tf.train.RMSPropOptimizer.
  • build_graph(self, record, is_training): Builds a TensorFlow graph of variables and operations used for both training and validation.
    • Arguments: The argument record is a nested structure representing the symbolic output of the appropriate Dataset (the training Dataset during training or the validation Dataset during validation). Typically, record is a list or dictionary of tf.Tensors. Users should use the tf.Tensors in record as the inputs to their computational graph (see the CIFAR10 and MNIST examples). The argument is_training is a Boolean tf.Tensor that is True during a training step and False during a validation step.
    • Return: This method should return a dict that maps string names to tf.Tensor nodes in the graph. Any outputs that may potentially be used as training and/or validation metrics should be included in the returned dict. This function must return a tf.Tensor named "loss", for the value that will be optimized during training.
  • validation_metrics(self): Specifies a list of metric names that will be evaluated on the validation data set. The returned list of strings must contain only names of tf.Tensor values returned by build_graph().
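Putting the required methods together, a subclass might be shaped as follows. This is a hypothetical sketch: the class name MyTrial, the network helper, and the metric names are illustrative, and the TensorFlow calls are shown as comments so the skeleton stands alone (in real code, MyTrial would subclass TensorFlowTrial):

```python
# Hypothetical skeleton of a TensorFlowTrial subclass. A real implementation
# would subclass TensorFlowTrial and build actual TensorFlow ops; those calls
# are sketched in comments here.
class MyTrial:  # in real code: class MyTrial(TensorFlowTrial)
    def optimizer(self):
        # Return an instance of a tf.train.Optimizer subclass, e.g.:
        # return tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
        raise NotImplementedError

    def build_graph(self, record, is_training):
        # Use the tensors in `record` as graph inputs; the returned dict must
        # include a tf.Tensor named "loss", e.g. (my_network is hypothetical):
        # logits = my_network(record["image"], is_training)
        # loss = tf.losses.sparse_softmax_cross_entropy(record["label"], logits)
        # return {"loss": loss, "accuracy": accuracy}
        raise NotImplementedError

    def validation_metrics(self):
        # Names of tensors returned by build_graph() to evaluate on the
        # validation set.
        return ["accuracy"]
```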

Optional Methods

  • training_metrics(self): Specifies a list of names of metrics that will be evaluated on each batch of training data. Defaults to the empty list. The training loss (the tf.Tensor named loss returned by build_graph()) is always computed and reported; if supplied, this function specifies additional training metrics to be computed. The returned list of strings must 1) exclude the string "loss" and 2) contain only names of tf.Tensor values returned by build_graph().
  • session_config(self): Specifies the tf.ConfigProto to be used by the TensorFlow session. By default, tf.ConfigProto(allow_soft_placement=True) is used.

PEDL: Graphs, Sessions, and Control Flow

In part to support its scheduling capabilities, PEDL handles creating sessions and sess.run() calls for users. PEDL will also call the user's implementation of build_graph() to build the TensorFlow graph for training. The user's implementation of build_graph() might not return the same graph in each trial of the experiment; build_graph() is responsible for incorporating each trial's sample of hyperparameters into the graph. PEDL initializes the session, calls build_graph(), and proceeds with training and validation of the graph.

  • Initialization: The session is initialized using session_config().
  • Training steps: Records of the training Dataset specified by the user in make_data_loaders() are fed into build_graph(), with is_training set to True. For each training batch, PEDL will make a sess.run(t_metrics) call, where t_metrics is the union of the loss (the Tensor named "loss" in the output of build_graph()) and training metrics (the Tensors in the output of build_graph() named by training_metrics()). Metrics for each step are computed by averaging over batches. In one training step, batches_per_step batches are fed through the graph. (batches_per_step defaults to 100; a custom value may be set via the experiment configuration file).
  • Validation steps: Records of the validation Dataset in make_data_loaders() are fed into build_graph(), with is_training set to False. For each validation batch, PEDL will make a sess.run(v_metrics) call, where v_metrics contains the Tensors in the output of build_graph() named by validation_metrics(). Metrics for each step are computed by averaging over batches. (Support for tf.metrics and other metric aggregation methods for validation is forthcoming.) In one validation step, the validation set is fed through once.
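The per-step aggregation described above can be sketched in plain Python. This is a hypothetical simulation, not PEDL code: fake_sess_run stands in for a sess.run() call, and the metric names and values are illustrative.

```python
# Hypothetical simulation (not PEDL code) of how per-batch metrics are
# averaged into per-step metrics.
batches_per_step = 4

def fake_sess_run(batch_index):
    # Stands in for one sess.run(t_metrics) call on one training batch.
    return {"loss": 1.0 / (batch_index + 1), "accuracy": 0.75}

per_batch = [fake_sess_run(i) for i in range(batches_per_step)]
step_metrics = {
    name: sum(metrics[name] for metrics in per_batch) / len(per_batch)
    for name in per_batch[0]
}
print(step_metrics["accuracy"])  # → 0.75
```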

Note: The same graph output by one call to build_graph() is used for both training and validation. To distinguish between these cases (e.g., to use dropout at training time but not at validation time), build_graph takes a parameter is_training, which is a Boolean tf.Tensor.

Examples for TensorFlowTrial

EstimatorTrial Interface

To use TensorFlow's high-level tf.estimator.Estimator API with PEDL, users should subclass the abstract class EstimatorTrial. (No make_data_loaders() function is needed.) Users must define the following abstract methods to create the deep learning model associated with a specific trial, and to subsequently train and evaluate it:

  • build_estimator(self, hparams): Specifies the tf.estimator.Estimator instance to be used during training and validation. This may be an instance of a Premade Estimator provided by the TensorFlow team, or a Custom Estimator created by the user.
  • build_train_spec(self, hparams): Specifies the tf.estimator.TrainSpec to be used for training steps. This training specification will contain a TensorFlow input_fn which constructs the input data for a training step. Unlike the standard TensorFlow input_fn interface, EstimatorTrial only supports an input_fn that returns a tf.data.Dataset object. A function that returns a tuple of features and labels is currently not supported by EstimatorTrial. Additionally, the max_steps attribute of the training specification will be ignored; instead, the batches_per_step option in the experiment configuration is used to determine how many batches each training step uses.
  • build_validation_spec(self, hparams): Specifies the tf.estimator.EvalSpec to be used for validation steps. This evaluation spec will contain a TensorFlow input_fn which constructs the input data for a validation step. The validation step will evaluate steps batches; if steps is None, it will evaluate until the input_fn raises an end-of-input exception.

Required Wrappers

To use EstimatorTrial, users need to wrap their optimizer and datasets using PEDL-provided wrappers.

  • pedl.frameworks.tensorflow.wrap_dataset(dataset): This should be used to wrap tf.data.Dataset objects immediately after they have been created. Users should use the output of this wrapper as the new instance of their dataset. If users create multiple datasets (e.g., one for training and one for testing), users should wrap each dataset independently. E.g., if users instantiate their training dataset within build_train_spec(), they should call dataset = wrap_dataset(dataset) prior to passing it into tf.estimator.TrainSpec.
  • pedl.frameworks.tensorflow.wrap_optimizer(optimizer): This should be used to wrap optimizer objects immediately after they have been created. Users should use the output of this wrapper as the new instance of their optimizer. E.g., if users create their optimizer within build_estimator(), they should call optimizer = wrap_optimizer(optimizer) prior to passing the optimizer into their Estimator.
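Together, the two wrappers might be placed as in the following hypothetical skeleton. The class name, the make_dataset helper, and the estimator choice are illustrative, and the TensorFlow and PEDL calls are shown as comments so the sketch stands alone (in real code, the class would subclass EstimatorTrial):

```python
# Hypothetical skeleton showing where wrap_optimizer() and wrap_dataset()
# would be applied; real TensorFlow/PEDL calls are sketched in comments.
class MyEstimatorTrial:  # in real code: class MyEstimatorTrial(EstimatorTrial)
    def build_estimator(self, hparams):
        # optimizer = tf.train.AdamOptimizer(hparams["learning_rate"])
        # optimizer = wrap_optimizer(optimizer)  # wrap right after creation
        # return tf.estimator.DNNClassifier(..., optimizer=optimizer)
        raise NotImplementedError

    def build_train_spec(self, hparams):
        # def input_fn():
        #     dataset = make_dataset("train", hparams)  # hypothetical helper
        #     return wrap_dataset(dataset)  # wrap right after creation
        # return tf.estimator.TrainSpec(input_fn=input_fn)
        raise NotImplementedError

    def build_validation_spec(self, hparams):
        # def input_fn():
        #     dataset = make_dataset("validation", hparams)
        #     return wrap_dataset(dataset)
        # return tf.estimator.EvalSpec(input_fn=input_fn, steps=None)
        raise NotImplementedError
```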

Optional Methods

  • build_serving_input_receiver_fns(self, hparams): Optionally returns a Python dictionary mapping string names to serving_input_receiver_fns. If specified, each serving input receiver function will be used to export a distinct SavedModel inference graph when a PEDL checkpoint is saved, using Estimator.export_saved_model. The exported models are saved under subdirectories named by the keys of the respective serving input receiver functions. For example, returning
      "raw": tf.estimator.export.build_raw_serving_input_receiver_fn(...),
      "parsing": tf.estimator.export.build_parsing_serving_input_receiver_fn(...)
    from this function would configure PEDL to export two SavedModel inference graphs in every checkpoint under raw and parsing subdirectories, respectively. By default, this function returns an empty dictionary and the PEDL checkpoint directory only contains metadata associated with the training graph.

Data Downloading

When doing distributed training or optimized_parallel single-machine training of an Estimator model, a single process is created for each GPU being used on a given agent. Each of these processes will invoke the build_train_spec() and build_validation_spec() functions; in most cases these calls will happen concurrently on all GPUs. If those functions download the entire data set, this causes two problems: (1) the data set will be downloaded multiple times, and (2) if the data set is stored on disk, the different copies of the download might overwrite or conflict with one another.

PEDL provides an optional API for downloading data as part of the EstimatorTrial training process. If the developer implements a download_data() API function, this function will be invoked once on each machine, before any training or validation occurs. This function can be used to download a single copy of the data set, and should return the path of a directory on disk containing the data set. This path can then be fetched by the user via pedl.get_download_data_dir().

Function signature: download_data(experiment_config: Dict[str, Any], hparams: Dict[str, Any]) -> str.
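A minimal sketch of such a function, assuming the data can be fetched into a fixed per-machine directory (the directory path and the elided fetch step are illustrative):

```python
import os
from typing import Any, Dict

def download_data(experiment_config: Dict[str, Any], hparams: Dict[str, Any]) -> str:
    # Download into one well-known directory per machine; PEDL invokes this
    # once on each machine, before any training or validation occurs.
    data_dir = os.path.join("/tmp", "my_dataset")  # illustrative path
    os.makedirs(data_dir, exist_ok=True)
    # ... fetch the data set into data_dir here (e.g., from cloud storage) ...
    return data_dir
```

The returned path can later be read back inside the trial via pedl.get_download_data_dir().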

Examples for EstimatorTrial