determined.pytorch

determined.pytorch.PyTorchTrial

class determined.pytorch.PyTorchTrial(trial_context: determined._train_context.TrialContext)

PyTorch trials are created by subclassing the abstract class PyTorchTrial. Users must define all abstract methods to create the deep learning model associated with a specific trial, and to subsequently train and evaluate it.
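As a point of orientation, a minimal trial skeleton under this interface might look like the following sketch; the method bodies are elided here and are described individually below.

import torch

from determined.pytorch import DataLoader, PyTorchTrial


class MyTrial(PyTorchTrial):
    def __init__(self, trial_context):
        # Keep a reference to the context for hyperparameters and runtime info.
        self.context = trial_context

    def build_model(self) -> torch.nn.Module:
        ...

    def optimizer(self, model: torch.nn.Module) -> torch.optim.Optimizer:
        ...

    def build_training_data_loader(self) -> DataLoader:
        ...

    def build_validation_data_loader(self) -> DataLoader:
        ...

    def train_batch(self, batch, model, epoch_idx, batch_idx):
        ...

    def evaluate_batch(self, batch, model):
        ...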

trial_context_class

alias of determined.pytorch._pytorch_context.PyTorchTrialContext

abstract build_model() → torch.nn.modules.module.Module

Defines the deep learning architecture associated with a trial, which typically depends on the trial’s specific hyperparameter settings stored in the hparams dictionary. This method returns the model as an instance of nn.Module (or a subclass thereof).
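For example, a build_model() implementation might size the network from the trial’s hyperparameters; the hyperparameter name below is a placeholder, and get_hparam() comes from the base TrialContext rather than this class.

import torch


def build_model(self) -> torch.nn.Module:
    # "hidden_size" is an illustrative hyperparameter assumed to be defined
    # in the experiment configuration.
    hidden_size = self.context.get_hparam("hidden_size")
    return torch.nn.Sequential(
        torch.nn.Linear(784, hidden_size),
        torch.nn.ReLU(),
        torch.nn.Linear(hidden_size, 10),
    )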

abstract optimizer(model: torch.nn.modules.module.Module) → torch.optim.optimizer.Optimizer

Describes the optimizer to be used during training of the given model, an instance of torch.optim.Optimizer.
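A typical implementation constructs the optimizer from the given model’s parameters; the learning-rate hyperparameter name below is illustrative.

import torch


def optimizer(self, model: torch.nn.Module) -> torch.optim.Optimizer:
    # "learning_rate" is an illustrative hyperparameter name.
    return torch.optim.SGD(model.parameters(), lr=self.context.get_hparam("learning_rate"))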

abstract train_batch(batch: Union[Dict[str, torch.Tensor], Sequence[torch.Tensor], torch.Tensor], model: torch.nn.modules.module.Module, epoch_idx: int, batch_idx: int) → Union[torch.Tensor, Dict[str, Any]]

Calculate the loss for a batch and return it, either directly as a tensor or as an entry in a dictionary of metrics. batch_idx represents the total number of batches processed per device (slot) since the start of training.
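A sketch of train_batch() for a classification task, assuming the data loader yields (data, labels) pairs and that the returned loss is used by the harness for the backward pass and optimizer step.

import torch


def train_batch(self, batch, model, epoch_idx, batch_idx):
    data, labels = batch
    output = model(data)
    loss = torch.nn.functional.cross_entropy(output, labels)
    # The "loss" entry drives training; additional entries are reported as metrics.
    return {"loss": loss}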

abstract build_training_data_loader() → determined.pytorch._data.DataLoader

Defines the data loader to use during training.

Must return an instance of determined.pytorch.DataLoader.

abstract build_validation_data_loader() → determined.pytorch._data.DataLoader

Defines the data loader to use during validation.

Must return an instance of determined.pytorch.DataLoader.

create_lr_scheduler(optimizer: torch.optim.optimizer.Optimizer) → Optional[determined.pytorch._lr_scheduler.LRScheduler]

Create a learning rate scheduler for the trial given an instance of the optimizer.

Parameters

optimizer (torch.optim.Optimizer) – instance of the optimizer to be used for training

Returns

Wrapper around a torch.optim.lr_scheduler._LRScheduler.

Return type

det.pytorch.LRScheduler
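For example, to step a standard PyTorch scheduler once per epoch, it can be wrapped as follows; the StepLR arguments are illustrative.

import torch

from determined.pytorch import LRScheduler


def create_lr_scheduler(self, optimizer: torch.optim.Optimizer):
    # Any torch.optim.lr_scheduler._LRScheduler subclass can be wrapped.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
    return LRScheduler(scheduler, step_mode=LRScheduler.StepMode.STEP_EVERY_EPOCH)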

build_callbacks() → Dict[str, determined.pytorch._callback.PyTorchCallback]

Defines a dictionary of string names to callbacks (if any) to be used during training and/or validation.

The string name will be used as the key to save and restore callback state for any callback that defines load_state_dict() and state_dict().

evaluate_batch(batch: Union[Dict[str, torch.Tensor], Sequence[torch.Tensor], torch.Tensor], model: torch.nn.modules.module.Module) → Dict[str, Any]

Calculate evaluation metrics for a batch and return them as a dictionary mapping metric names to metric values.

There are two ways to specify evaluation metrics. Either override evaluate_batch() or evaluate_full_dataset(). While evaluate_full_dataset() is more flexible, evaluate_batch() should be preferred, since it can be parallelized in distributed environments, whereas evaluate_full_dataset() cannot. Only one of evaluate_full_dataset() and evaluate_batch() should be overridden by a trial.

The metrics returned from this function must be JSON-serializable.
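A sketch of evaluate_batch(), again assuming (data, labels) batches; the metric names are placeholders.

import torch


def evaluate_batch(self, batch, model):
    data, labels = batch
    output = model(data)
    loss = torch.nn.functional.cross_entropy(output, labels)
    accuracy = (output.argmax(dim=1) == labels).float().mean()
    # Plain Python floats are safely JSON-serializable after reduction.
    return {"validation_loss": loss.item(), "accuracy": accuracy.item()}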

evaluation_reducer() → Union[determined.pytorch._reducer.Reducer, Dict[str, determined.pytorch._reducer.Reducer]]

Return a reducer for all evaluation metrics, or a dict mapping metric names to individual reducers. Defaults to det.pytorch.Reducer.AVG.
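For instance, to average most metrics but sum a count-style metric, return a per-metric dictionary; the metric names below are placeholders and must match those returned from evaluation.

from determined.pytorch import Reducer


def evaluation_reducer(self):
    # Keys must match the metric names returned by evaluate_batch().
    return {"validation_loss": Reducer.AVG, "num_errors": Reducer.SUM}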

evaluate_full_dataset(data_loader: torch.utils.data.dataloader.DataLoader, model: torch.nn.modules.module.Module) → Dict[str, Any]

Calculate validation metrics on the entire validation dataset and return them as a dictionary mapping metric names to reduced metric values (i.e., each returned metric is the average or sum of that metric across the entire validation set).

This validation cannot be distributed and is performed on a single device, even when multiple devices (slots) are used for training. Only one of evaluate_full_dataset() and evaluate_batch() should be overridden by a trial.

The metrics returned from this function must be JSON-serializable.
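A sketch of evaluate_full_dataset(), assuming (data, labels) batches from the plain torch.utils.data.DataLoader passed in.

import torch


def evaluate_full_dataset(self, data_loader, model):
    loss_sum, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for data, labels in data_loader:
            output = model(data)
            loss_sum += torch.nn.functional.cross_entropy(
                output, labels, reduction="sum"
            ).item()
            correct += (output.argmax(dim=1) == labels).sum().item()
            total += labels.shape[0]
    # Metrics are already reduced over the full validation set here.
    return {"validation_loss": loss_sum / total, "accuracy": correct / total}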

abstract __init__(trial_context: determined._train_context.TrialContext) → None

Initializes a trial using the provided trial_context.

Override this function to initialize any shared state between the function implementations.

class determined.pytorch.LRScheduler(scheduler: torch.optim.lr_scheduler._LRScheduler, step_mode: determined.pytorch._lr_scheduler.LRScheduler.StepMode)
class StepMode

Specifies when and how scheduler.step() should be executed.

STEP_EVERY_EPOCH
STEP_EVERY_BATCH
MANUAL_STEP
__init__(scheduler: torch.optim.lr_scheduler._LRScheduler, step_mode: determined.pytorch._lr_scheduler.LRScheduler.StepMode)

Wrapper for a PyTorch LRScheduler.

Usage of this wrapper is required to properly schedule the optimizer’s learning rate.

This wrapper fulfills two main functions:
  1. Saving and restoring the learning rate state in case a trial is paused, preempted, etc.

  2. Stepping the learning rate scheduler at a predefined frequency (every batch or every epoch).

Parameters
  • scheduler (torch.optim.lr_scheduler._LRScheduler) – Learning rate scheduler to be used by Determined.

  • step_mode (det.pytorch.LRScheduler.StepMode) –

    The strategy Determined will use to call (or not call) scheduler.step().

    1. STEP_EVERY_EPOCH: Determined will call scheduler.step() after every training epoch. No arguments will be passed to step().

    2. STEP_EVERY_BATCH: Determined will call scheduler.step() after every training batch. No arguments will be passed to step().

    3. MANUAL_STEP: Determined will not call scheduler.step() at all. It is up to the user to decide when to call scheduler.step(), and whether to pass any arguments.
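As a sketch of MANUAL_STEP, the trial can fetch the wrapper from the context and step it explicitly; the scheduler choice and the stepping condition below are illustrative.

import torch

from determined.pytorch import LRScheduler


def create_lr_scheduler(self, optimizer):
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
    return LRScheduler(scheduler, step_mode=LRScheduler.StepMode.MANUAL_STEP)


def train_batch(self, batch, model, epoch_idx, batch_idx):
    data, labels = batch
    loss = torch.nn.functional.cross_entropy(model(data), labels)
    # With MANUAL_STEP, the trial decides when (and with what arguments) to step;
    # here the wrapped scheduler is stepped every other batch.
    if batch_idx % 2 == 0:
        self.context.get_lr_scheduler().step()
    return {"loss": loss}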

get_last_lr() → List

Return the last learning rate computed by the current scheduler.

This function is equivalent to calling get_last_lr() on the wrapped LRScheduler.

step(*args: Any, **kwargs: Any) → None

Call step() on the wrapped LRScheduler instance.

class determined.pytorch.Reducer

The available methods for reducing metrics.

AVG
SUM
MAX
MIN
class determined.pytorch.PyTorchCallback

Abstract base class used to define a callback that should execute during the lifetime of a PyTorchTrial.

Warning

If you are defining a stateful callback (e.g., it mutates a self attribute over its lifetime), you must also override state_dict() and load_state_dict() to ensure this state can be serialized and deserialized over checkpoints.

Warning

If distributed training is enabled, every GPU will execute a copy of this callback (except for on_validation_step_end and on_checkpoint_end). To configure a callback implementation to execute on a subset of GPUs, please condition your implementation on trial.context.distributed.get_rank().
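For example, a callback hook that runs on every GPU can be guarded on the rank so that only the chief acts; the logging target here is purely illustrative.

from determined.pytorch import PyTorchCallback


class ChiefOnlyLogger(PyTorchCallback):
    def __init__(self, context):
        self.context = context

    def on_train_step_end(self, step_id, metrics):
        # Every GPU executes this hook during distributed training, so guard on rank.
        if self.context.distributed.get_rank() == 0:
            print("step {} training metrics: {}".format(step_id, metrics))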

load_state_dict(state_dict: Dict[str, Any]) → None

Load the state of this callback using the deserialized state_dict.

on_checkpoint_end(checkpoint_dir: str) → None

Run after every checkpoint.

Warning

This callback only executes on the chief GPU when doing distributed training.

on_train_step_end(step_id: int, metrics: Dict[str, Any]) → None

Run after every training step ends.

Warning

If distributed training is enabled, every GPU will execute a copy of this callback at the end of every training step. If optimizations.average_training_metrics is enabled, then the metrics will be averaged across all GPUs before the callback is executed. If optimizations.average_training_metrics is disabled, then the metrics will be local to the GPU.

on_train_step_start(step_id: int) → None

Run before every training step begins.

on_validation_step_end(metrics: Dict[str, Any]) → None

Run after every validation step ends.

Warning

This callback only executes on the chief GPU when doing distributed training.

on_validation_step_start() → None

Run before every validation step begins.

state_dict() → Dict[str, Any]

Serialize the state of this callback to a dictionary. Return value must be pickle-able.

Data Loading

Loading data into PyTorchTrial models is done by defining two functions, build_training_data_loader() and build_validation_data_loader(). These functions should each return an instance of determined.pytorch.DataLoader. determined.pytorch.DataLoader behaves the same as torch.utils.data.DataLoader and is a drop-in replacement.
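A sketch of a data loader builder wrapping an in-memory dataset; the tensors are stand-ins for real data, and get_per_slot_batch_size() is assumed to come from the base TrialContext.

import torch
from torch.utils.data import TensorDataset

from determined.pytorch import DataLoader


def build_training_data_loader(self):
    # Stand-in data; a real trial would construct its own Dataset here.
    data = torch.randn(1000, 784)
    labels = torch.randint(0, 10, (1000,))
    return DataLoader(
        TensorDataset(data, labels),
        batch_size=self.context.get_per_slot_batch_size(),
        shuffle=True,
    )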

Each DataLoader is allowed to return batches with arbitrary structures of the following types, which will be fed directly to the train_batch and evaluate_batch functions:

  • np.ndarray

    np.array([[0, 0], [0, 0]])
    
  • torch.Tensor

    torch.Tensor([[0, 0], [0, 0]])
    
  • tuple of np.ndarrays or torch.Tensors

    (torch.Tensor([0, 0]), torch.Tensor([[0, 0], [0, 0]]))
    
  • list of np.ndarrays or torch.Tensors

    [torch.Tensor([0, 0]), torch.Tensor([[0, 0], [0, 0]])]
    
  • dictionary mapping strings to np.ndarrays or torch.Tensors

    {"data": torch.Tensor([[0, 0], [0, 0]]), "label": torch.Tensor([[1, 1], [1, 1]])}
    
  • combination of the above

    {
        "data": [
            {"sub_data1": torch.Tensor([[0, 0], [0, 0]])},
            {"sub_data2": torch.Tensor([0, 0])},
        ],
        "label": (torch.Tensor([0, 0]), torch.Tensor([[0, 0], [0, 0]])),
    }
    

Trial Context

determined.pytorch.PyTorchTrialContext subclasses determined.TrialContext. It provides useful methods for writing Trial subclasses.

class determined.pytorch.PyTorchTrialContext(*args: Any, **kwargs: Any)

Base context class that contains runtime information for any Determined workflow that uses the PyTorch API.

get_lr_scheduler() → Optional[determined.pytorch._lr_scheduler.LRScheduler]

Get the scheduler associated with the trial, if one is defined. This function should not be called from:

  • __init__

  • build_model()

  • optimizer()

  • create_lr_scheduler()

get_model() → torch.nn.modules.module.Module

Get the model associated with the trial. This function should not be called from:

  • __init__

  • build_model()

get_optimizer() → torch.optim.optimizer.Optimizer

Get the optimizer associated with the trial. This function should not be called from:

  • __init__

  • build_model()

  • optimizer()

Callbacks

To execute arbitrary Python functionality during the lifecycle of a PyTorchTrial, implement the callback interface:

class determined.pytorch.PyTorchCallback

Abstract base class used to define a callback that should execute during the lifetime of a PyTorchTrial.

Warning

If you are defining a stateful callback (e.g., it mutates a self attribute over its lifetime), you must also override state_dict() and load_state_dict() to ensure this state can be serialized and deserialized over checkpoints.

Warning

If distributed training is enabled, every GPU will execute a copy of this callback (except for on_validation_step_end and on_checkpoint_end). To configure a callback implementation to execute on a subset of GPUs, please condition your implementation on trial.context.distributed.get_rank().

load_state_dict(state_dict: Dict[str, Any]) → None

Load the state of this callback using the deserialized state_dict.

on_checkpoint_end(checkpoint_dir: str) → None

Run after every checkpoint.

Warning

This callback only executes on the chief GPU when doing distributed training.

on_train_step_end(step_id: int, metrics: Dict[str, Any]) → None

Run after every training step ends.

Warning

If distributed training is enabled, every GPU will execute a copy of this callback at the end of every training step. If optimizations.average_training_metrics is enabled, then the metrics will be averaged across all GPUs before the callback is executed. If optimizations.average_training_metrics is disabled, then the metrics will be local to the GPU.

on_train_step_start(step_id: int) → None

Run before every training step begins.

on_validation_step_end(metrics: Dict[str, Any]) → None

Run after every validation step ends.

Warning

This callback only executes on the chief GPU when doing distributed training.

on_validation_step_start() → None

Run before every validation step begins.

state_dict() → Dict[str, Any]

Serialize the state of this callback to a dictionary. Return value must be pickle-able.

ReduceLROnPlateau

To use the torch.optim.lr_scheduler.ReduceLROnPlateau class with PyTorchTrial, implement the following callback:

import torch

from determined.pytorch import PyTorchCallback


class ReduceLROnPlateauEveryValidationStep(PyTorchCallback):
    def __init__(self, context):
        # Customize the ReduceLROnPlateau arguments as desired here.
        self.reduce_lr = torch.optim.lr_scheduler.ReduceLROnPlateau(
            context.get_optimizer(), "min", verbose=True
        )

    def on_validation_step_end(self, metrics):
        # "validation_error" must be one of the metrics returned during evaluation.
        self.reduce_lr.step(metrics["validation_error"])

    def state_dict(self):
        return self.reduce_lr.state_dict()

    def load_state_dict(self, state_dict):
        self.reduce_lr.load_state_dict(state_dict)

Then, implement build_callbacks() in your PyTorchTrial subclass:

def build_callbacks(self):
    return {"reduce_lr": ReduceLROnPlateauEveryValidationStep(self.context)}

Examples