PyTorch Model Definition

This part of the documentation describes how to train a PyTorch model in PEDL.

There are three steps needed to define a PyTorch model in PEDL using a Standard Model Definition:

  1. If downloading data, define a download_data() function. See Data Downloading for more information.
  2. Define a make_data_loaders() function. See Data Loading for more information.
  3. Implement the PyTorchTrial interface.

Data Downloading

When doing distributed training or optimized_parallel single-machine training of a PyTorch model, a single process is created for each GPU being used on a given agent. Each of these processes invokes the make_data_loaders() function; in most cases these calls happen concurrently. If each call to make_data_loaders() downloads the entire data set, this causes two problems: (1) the data set is downloaded multiple times, and (2) if the data set is stored on disk, different copies of the download might overwrite or conflict with one another.

PEDL provides an optional API for downloading data as part of the PyTorchTrial training process. If the developer implements a download_data() function, it will be invoked once on each machine, before any data loaders are created. This function can be used to download a single copy of the data set, and should return the path of a directory on disk containing the data set. This path can then be fetched by calling pedl.get_download_data_dir(), which is commonly done in make_data_loaders().

Function signature: download_data(experiment_config: Dict[str, Any], hparams: Dict[str, Any]) -> str
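
As a rough illustration of the signature above, the following minimal sketch fetches one file into a fresh directory and returns that directory. The config layout (a "url" entry under "data") is a hypothetical choice made for this example, not something PEDL prescribes.

```python
import os
import tempfile
import urllib.request
from typing import Any, Dict


def download_data(experiment_config: Dict[str, Any], hparams: Dict[str, Any]) -> str:
    # Assumed (hypothetical) config layout: {"data": {"url": "..."}}.
    url = experiment_config["data"]["url"]

    # A fresh directory avoids collisions with earlier runs on the same machine.
    download_dir = tempfile.mkdtemp(prefix="pedl-data-")
    target = os.path.join(download_dir, os.path.basename(url) or "dataset")

    # Fetch a single copy of the data set; urlretrieve also accepts file:// URLs.
    urllib.request.urlretrieve(url, target)

    # Return the directory that holds the data set, not the file itself.
    return download_dir
```

Because this function runs once per machine, the per-GPU processes can then read from the returned directory without downloading anything themselves.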

Data Loading

Loading data into PyTorchTrial models is done by defining a make_data_loaders() function. This function must return a pair of objects (one for training and one for validation); both objects should be instances of PEDL's DataLoader class, which behaves the same as PyTorch's torch.utils.data.DataLoader and is a drop-in replacement for it.

Function signature: make_data_loaders(experiment_config: Dict[str, Any], hparams: Dict[str, Any]) -> Tuple[DataLoader, DataLoader]
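
A minimal sketch of such a function is shown below. It makes two assumptions: torch.utils.data.DataLoader stands in for PEDL's own DataLoader class so that the example runs without PEDL installed, and "batch_size" is a hypothetical hyperparameter name. The data is synthetic; a real model would typically read from the directory returned by pedl.get_download_data_dir().

```python
from typing import Any, Dict, Tuple

import torch
from torch.utils.data import DataLoader, TensorDataset


def make_data_loaders(
    experiment_config: Dict[str, Any], hparams: Dict[str, Any]
) -> Tuple[DataLoader, DataLoader]:
    # Assumption: "batch_size" is a hyperparameter defined by the experiment.
    batch_size = hparams["batch_size"]

    # Synthetic data: 100 samples with 4 features and a binary label.
    features = torch.randn(100, 4)
    labels = torch.randint(0, 2, (100,))

    # Hold out the last 20 samples for validation.
    train_set = TensorDataset(features[:80], labels[:80])
    val_set = TensorDataset(features[80:], labels[80:])

    # First element is the training loader, second the validation loader.
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)
    return train_loader, val_loader
```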

Each DataLoader is allowed to return batches with arbitrary structures of the following types, which will be fed directly to the train_batch and evaluate_batch functions:

  • np.ndarray

    np.array([[0, 0], [0, 0]])

  • torch.Tensor

    torch.Tensor([[0, 0], [0, 0]])

  • tuple of np.ndarrays or torch.Tensors

    (torch.Tensor([0, 0]), torch.Tensor([[0, 0], [0, 0]]))

  • list of np.ndarrays or torch.Tensors

    [torch.Tensor([0, 0]), torch.Tensor([[0, 0], [0, 0]])]

  • dictionary mapping strings to np.ndarrays or torch.Tensors

    {"data": torch.Tensor([[0, 0], [0, 0]]), "label": torch.Tensor([[1, 1], [1, 1]])}

  • combination of the above

    {
        "data": [
            {"sub_data1": torch.Tensor([[0, 0], [0, 0]])},
            {"sub_data2": torch.Tensor([0, 0])},
        ],
        "label": (torch.Tensor([0, 0]), torch.Tensor([[0, 0], [0, 0]])),
    }

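Because batches may nest these types arbitrarily, code that consumes a batch often has to walk the structure recursively. The helper below is a hypothetical sketch (to_tensors is not part of PEDL) that converts every np.ndarray leaf to a torch.Tensor while preserving the surrounding lists, tuples, and dictionaries.

```python
import numpy as np
import torch


def to_tensors(batch):
    """Recursively convert np.ndarray leaves to torch.Tensor, keeping the structure."""
    if isinstance(batch, torch.Tensor):
        return batch
    if isinstance(batch, np.ndarray):
        return torch.from_numpy(batch)
    if isinstance(batch, dict):
        return {key: to_tensors(value) for key, value in batch.items()}
    if isinstance(batch, (list, tuple)):
        # Rebuild the same container type (list or tuple) from converted items.
        return type(batch)(to_tensors(item) for item in batch)
    raise TypeError(f"unsupported batch element: {type(batch).__name__}")
```

The same recursive pattern could be used for other per-leaf operations, such as moving every tensor in a batch to a particular device.
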
PyTorchTrial Interface

PyTorch trials are created by subclassing the abstract class PyTorchTrial. Users must define the following abstract methods to create the deep learning model associated with a specific trial, and to subsequently train and evaluate it:

  • build_model(self): Defines the deep learning architecture associated with a trial. When doing a hyperparameter search, the value of a specific hyperparameter can be accessed via pedl.get_hyperparameter and incorporated into the model architecture as appropriate. This method returns the model as an instance or subclass of nn.Module.

  • optimizer(self, model): Specifies an instance of torch.optim.Optimizer to be used for training the given model, e.g., torch.optim.SGD(model.parameters(), learning_rate).

  • train_batch(self, batch, model, epoch_idx, batch_idx): Returns the loss (a scalar torch.Tensor) from training the model on the given batch. Alternatively, multiple training metrics can be returned as a dictionary mapping metric names to metric values, with the requirement that the special name "loss" must always map to the main training loss to be used for backpropagation.

  • evaluate_batch(self, batch, model) OR evaluate_full_dataset(self, data_loader, model): Users must implement one of these methods (but not both). evaluate_batch calculates validation metrics for a single batch and returns a dictionary mapping metric names to metric values. evaluate_full_dataset calculates validation metrics for an entire validation set and returns a dictionary mapping metric names to reduced metric values (i.e., each returned metric is the average or sum of that metric across the entire validation set). While evaluate_full_dataset is more flexible, evaluate_batch should be preferred, since it can be parallelized in distributed environments, whereas evaluate_full_dataset cannot.
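
The methods above might fit together roughly as in the following sketch. To keep it runnable without PEDL, an empty placeholder class stands in for the real PyTorchTrial base class, and the hyperparameters are fixed class attributes rather than values read through pedl.get_hyperparameter. The names LinearTrial, hidden_size, and learning_rate are all hypothetical.

```python
import torch
import torch.nn as nn


class PyTorchTrial:
    """Placeholder so the sketch runs without PEDL; a real trial subclasses PEDL's PyTorchTrial."""


class LinearTrial(PyTorchTrial):
    # Hypothetical fixed values; in PEDL these would come from pedl.get_hyperparameter.
    hidden_size = 8
    learning_rate = 0.1

    def build_model(self) -> nn.Module:
        # Small classifier: 4 input features, 2 output classes.
        return nn.Sequential(
            nn.Linear(4, self.hidden_size),
            nn.ReLU(),
            nn.Linear(self.hidden_size, 2),
        )

    def optimizer(self, model: nn.Module) -> torch.optim.Optimizer:
        return torch.optim.SGD(model.parameters(), lr=self.learning_rate)

    def train_batch(self, batch, model, epoch_idx, batch_idx):
        data, labels = batch
        output = model(data)
        loss = nn.functional.cross_entropy(output, labels)
        accuracy = (output.argmax(dim=1) == labels).float().mean()
        # "loss" must map to the tensor used for backpropagation.
        return {"loss": loss, "accuracy": accuracy}

    def evaluate_batch(self, batch, model):
        data, labels = batch
        # Gradients are not needed when computing validation metrics.
        with torch.no_grad():
            output = model(data)
            loss = nn.functional.cross_entropy(output, labels)
            accuracy = (output.argmax(dim=1) == labels).float().mean()
        return {"validation_loss": loss, "accuracy": accuracy}
```

This sketch implements evaluate_batch rather than evaluate_full_dataset, in line with the recommendation above that the per-batch variant be preferred where it suffices.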

Optional Methods

  • create_lr_scheduler(self, optimizer): Returns a pedl.frameworks.pytorch.LRScheduler to control the learning rate during training.

  • evaluation_reducer(self): Returns a pedl.frameworks.pytorch.Reducer value that controls how the results from evaluate_batch are aggregated. Must be one of Reducer.AVG (the default), Reducer.SUM, Reducer.MAX, or Reducer.MIN. If multiple validation metrics are used and require different reducers, this method may return a dictionary mapping metric names to per-metric reducers. If evaluate_full_dataset is implemented, evaluation_reducer is ignored.
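
To make the effect of a reducer concrete, the sketch below aggregates per-batch metric dictionaries with either a single reducer or a per-metric dictionary. It is an illustration only, not PEDL's implementation: the Reducer enum here is a stand-in for pedl.frameworks.pytorch.Reducer, reduce_metrics is a hypothetical helper, and AVG is assumed to be a plain unweighted mean over batches.

```python
from enum import Enum
from typing import Dict, List, Union


class Reducer(Enum):
    """Stand-in for pedl.frameworks.pytorch.Reducer so the sketch runs without PEDL."""

    AVG = "avg"
    SUM = "sum"
    MAX = "max"
    MIN = "min"


def reduce_metrics(
    per_batch: List[Dict[str, float]],
    reducer: Union[Reducer, Dict[str, Reducer]] = Reducer.AVG,
) -> Dict[str, float]:
    functions = {
        # Assumption: AVG is an unweighted mean of the per-batch values.
        Reducer.AVG: lambda values: sum(values) / len(values),
        Reducer.SUM: sum,
        Reducer.MAX: max,
        Reducer.MIN: min,
    }
    reduced = {}
    for name in per_batch[0]:
        values = [metrics[name] for metrics in per_batch]
        # A dictionary selects a reducer per metric; a single value applies to all.
        chosen = reducer[name] if isinstance(reducer, dict) else reducer
        reduced[name] = functions[chosen](values)
    return reduced
```

For example, a metric such as an error count is naturally reduced with Reducer.SUM, while a loss is usually averaged, which is the situation the per-metric dictionary is meant to cover.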