determined.pytorch

determined.pytorch.PyTorchTrial

class determined.pytorch.PyTorchTrial(trial_context: determined.pytorch._pytorch_context.PyTorchTrialContext)
    PyTorch trials are created by subclassing the abstract class PyTorchTrial.

trial_context_class
    alias of determined.pytorch._pytorch_context.PyTorchTrialContext

abstract __init__(trial_context: determined.pytorch._pytorch_context.PyTorchTrialContext) → None
    Initializes a trial using the provided trial context.

    Override this function to initialize any shared state between the function implementations. Models, optimizers, and LR schedulers can be defined in the abstract methods.
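For orientation, a minimal subclass might look like the sketch below. It is illustrative only: the linear model, the SGD settings, and the MyTrainDataset/MyValDataset classes are placeholder assumptions, not part of the Determined API.

import torch
from torch import nn

import determined.pytorch as det_pytorch


class MyTrial(det_pytorch.PyTorchTrial):
    def __init__(self, trial_context: det_pytorch.PyTorchTrialContext) -> None:
        self.context = trial_context  # keep the context for use in the other methods

    def build_model(self) -> nn.Module:
        return nn.Linear(10, 1)  # placeholder architecture

    def optimizer(self, model: nn.Module) -> torch.optim.Optimizer:
        return torch.optim.SGD(model.parameters(), lr=0.01)

    def build_training_data_loader(self) -> det_pytorch.DataLoader:
        return det_pytorch.DataLoader(MyTrainDataset(), batch_size=32)  # hypothetical dataset

    def build_validation_data_loader(self) -> det_pytorch.DataLoader:
        return det_pytorch.DataLoader(MyValDataset(), batch_size=32)  # hypothetical dataset

    def train_batch(self, batch, model, epoch_idx, batch_idx):
        data, labels = batch
        loss = torch.nn.functional.mse_loss(model(data), labels)
        return {"loss": loss}

    def evaluate_batch(self, batch, model):
        data, labels = batch
        return {"validation_error": torch.nn.functional.mse_loss(model(data), labels)}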
build_model() → torch.nn.modules.module.Module
    Defines the deep learning architecture associated with a trial. This method returns the model as an instance or subclass of nn.Module.

optimizer(model: torch.nn.modules.module.Module) → torch.optim.optimizer.Optimizer
    Describes the optimizer to be used during training of the given model, an instance of torch.optim.Optimizer.
create_lr_scheduler(optimizer: torch.optim.optimizer.Optimizer) → Optional[determined.pytorch._lr_scheduler.LRScheduler]
    Create a learning rate scheduler for the trial given an instance of the optimizer.

    Parameters
        optimizer (torch.optim.Optimizer) – instance of the optimizer to be used for training

    Returns
        Wrapper around a torch.optim.lr_scheduler._LRScheduler.

    Return type
        det.pytorch.LRScheduler
abstract train_batch(batch: Union[Dict[str, torch.Tensor], Sequence[torch.Tensor], torch.Tensor], model: torch.nn.modules.module.Module, epoch_idx: int, batch_idx: int) → Union[torch.Tensor, Dict[str, Any]]
    Calculate the loss for a batch and return it in a dictionary. batch_idx represents the total number of batches processed per device (slot) since the start of training.
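As a sketch, assuming the training data loader yields (data, labels) tuples and the model is a classifier, train_batch could return the loss together with any additional training metrics:

import torch.nn.functional as F

def train_batch(self, batch, model, epoch_idx, batch_idx):
    data, labels = batch  # assumes a (data, labels) tuple from the data loader
    logits = model(data)
    loss = F.cross_entropy(logits, labels)
    accuracy = (logits.argmax(dim=1) == labels).float().mean()
    # The "loss" entry is used for the backward pass; additional entries are reported as training metrics.
    return {"loss": loss, "train_accuracy": accuracy}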
abstract build_training_data_loader() → determined.pytorch._data.DataLoader
    Defines the data loader to use during training. Must return an instance of determined.pytorch.DataLoader.
abstract build_validation_data_loader() → determined.pytorch._data.DataLoader
    Defines the data loader to use during validation. Must return an instance of determined.pytorch.DataLoader.
build_callbacks() → Dict[str, determined.pytorch._callback.PyTorchCallback]
    Defines a dictionary of string names to callbacks (if any) to be used during training and/or validation.

    The string name will be used as the key to save and restore callback state for any callback that defines load_state_dict() and state_dict().
evaluate_batch(batch: Union[Dict[str, torch.Tensor], Sequence[torch.Tensor], torch.Tensor], model: torch.nn.modules.module.Module) → Dict[str, Any]
    Calculate evaluation metrics for a batch and return them as a dictionary mapping metric names to metric values.

    There are two ways to specify evaluation metrics: either override evaluate_batch() or evaluate_full_dataset(). While evaluate_full_dataset() is more flexible, evaluate_batch() should be preferred, since it can be parallelized in distributed environments, whereas evaluate_full_dataset() cannot. Only one of evaluate_full_dataset() and evaluate_batch() should be overridden by a trial.

    The metrics returned from this function must be JSON-serializable.
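A sketch of a per-batch evaluation, again assuming (data, labels) batches and a classification model; the metric names are arbitrary:

import torch.nn.functional as F

def evaluate_batch(self, batch, model):
    data, labels = batch
    logits = model(data)
    # Per-batch values; by default they are averaged across batches (see evaluation_reducer()).
    return {
        "validation_loss": F.cross_entropy(logits, labels).item(),
        "accuracy": (logits.argmax(dim=1) == labels).float().mean().item(),
    }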
evaluation_reducer() → Union[determined.pytorch._reducer.Reducer, Dict[str, determined.pytorch._reducer.Reducer]]
    Return a reducer for all evaluation metrics, or a dict mapping metric names to individual reducers. Defaults to det.pytorch.Reducer.AVG.
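For example, to average most metrics but sum an error count, evaluation_reducer could be overridden as sketched below; the metric names are assumptions and would have to match those returned by evaluate_batch():

from determined.pytorch import Reducer

def evaluation_reducer(self):
    # Returning a dict maps each metric name to its own reducer;
    # alternatively, return a single Reducer to apply to all metrics.
    return {
        "validation_loss": Reducer.AVG,
        "num_errors": Reducer.SUM,
    }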
evaluate_full_dataset(data_loader: torch.utils.data.dataloader.DataLoader, model: torch.nn.modules.module.Module) → Dict[str, Any]
    Calculate validation metrics on the entire validation dataset and return them as a dictionary mapping metric names to reduced metric values (i.e., each returned metric is the average or sum of that metric across the entire validation set).

    This validation cannot be distributed and is performed on a single device, even when multiple devices (slots) are used for training. Only one of evaluate_full_dataset() and evaluate_batch() should be overridden by a trial.

    The metrics returned from this function must be JSON-serializable.
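A sketch of a full-dataset evaluation, useful when a metric cannot be computed batch by batch; it assumes (data, labels) batches and a classification model, and the metric name is illustrative:

import torch

def evaluate_full_dataset(self, data_loader, model):
    correct, total = 0, 0
    with torch.no_grad():
        for data, labels in data_loader:  # assumes (data, labels) batches
            predictions = model(data).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    # Return already-reduced values for the whole validation set.
    return {"accuracy": correct / total}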
class determined.pytorch.LRScheduler(scheduler: torch.optim.lr_scheduler._LRScheduler, step_mode: determined.pytorch._lr_scheduler.LRScheduler.StepMode)
    Wrapper for a PyTorch LRScheduler.

    This wrapper fulfills two main functions:

    - Save and restore the learning rate when a trial is paused, preempted, etc.
    - Step the learning rate scheduler at the configured frequency (e.g., every batch or every epoch).
class StepMode
    Specifies when and how scheduler.step() should be executed.

    - STEP_EVERY_EPOCH
    - STEP_EVERY_BATCH
    - MANUAL_STEP
__init__(scheduler: torch.optim.lr_scheduler._LRScheduler, step_mode: determined.pytorch._lr_scheduler.LRScheduler.StepMode)
    LRScheduler constructor.

    Parameters
        scheduler (torch.optim.lr_scheduler._LRScheduler) – Learning rate scheduler to be used by Determined.
        step_mode (det.pytorch.LRScheduler.StepMode) – The strategy Determined will use to call (or not call) scheduler.step().

            - STEP_EVERY_EPOCH: Determined will call scheduler.step() after every training epoch. No arguments will be passed to step().
            - STEP_EVERY_BATCH: Determined will call scheduler.step() after every training batch. No arguments will be passed to step().
            - MANUAL_STEP: Determined will not call scheduler.step() at all. It is up to the user to decide when to call scheduler.step(), and whether to pass any arguments.
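A sketch of wiring a scheduler into a trial via create_lr_scheduler(); the StepLR settings are arbitrary placeholders:

import torch
from determined.pytorch import LRScheduler

def create_lr_scheduler(self, optimizer):
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    # With STEP_EVERY_EPOCH, Determined calls scheduler.step() after every training epoch.
    return LRScheduler(scheduler, step_mode=LRScheduler.StepMode.STEP_EVERY_EPOCH)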
class determined.pytorch.Reducer
    The methods available to users for reducing metrics.

    - AVG
    - SUM
    - MAX
    - MIN
class determined.pytorch.PyTorchCallback
    Abstract base class used to define a callback that should execute during the lifetime of a PyTorchTrial.

    Warning
        If you are defining a stateful callback (e.g., it mutates a self attribute over its lifetime), you must also override state_dict() and load_state_dict() to ensure this state can be serialized and deserialized over checkpoints.

    Warning
        If distributed training is enabled, every GPU will execute a copy of this callback (except for on_validation_end(), on_validation_step_end(), and on_checkpoint_end()). To configure a callback implementation to execute on a subset of GPUs, please condition your implementation on trial.context.distributed.get_rank().
load_state_dict(state_dict: Dict[str, Any]) → None
    Load the state of this callback using the deserialized state_dict.

on_before_optimizer_step(parameters: Iterator) → None
    Run before every optimizer.step(). For multi-GPU training, executes after gradient updates have been communicated. Typically used to perform gradient clipping.

on_checkpoint_end(checkpoint_dir: str) → None
    Run after every checkpoint.

    Warning
        This callback only executes on the chief GPU when doing distributed training.

on_validation_end(metrics: Dict[str, Any]) → None
    Run after every validation ends.

    Warning
        This callback only executes on the chief GPU when doing distributed training.

on_validation_start() → None
    Run before every validation begins.

on_validation_step_end(metrics: Dict[str, Any]) → None
    Run after every validation step ends.

    Warning
        This callback only executes on the chief GPU when doing distributed training.

on_validation_step_start() → None
    Run before every validation step begins.

state_dict() → Dict[str, Any]
    Serialize the state of this callback to a dictionary. Return value must be pickle-able.
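As a sketch of a stateful callback, the counter below tracks how many validations have run and implements state_dict()/load_state_dict() so the count survives pauses and checkpoints; the class name and counter are illustrative assumptions:

from typing import Any, Dict

from determined.pytorch import PyTorchCallback


class ValidationCounter(PyTorchCallback):
    def __init__(self) -> None:
        self.validations_completed = 0  # mutated state, so state_dict/load_state_dict are required

    def on_validation_end(self, metrics: Dict[str, Any]) -> None:
        self.validations_completed += 1

    def state_dict(self) -> Dict[str, Any]:
        return {"validations_completed": self.validations_completed}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.validations_completed = state_dict["validations_completed"]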
Data Loading

Loading data into PyTorchTrial models is done by defining two functions, build_training_data_loader() and build_validation_data_loader(). These functions should each return an instance of determined.pytorch.DataLoader. determined.pytorch.DataLoader behaves the same as torch.utils.data.DataLoader and is a drop-in replacement.

Each DataLoader is allowed to return batches with arbitrary structures of the following types, which will be fed directly to the train_batch and evaluate_batch functions:
- np.ndarray
  np.array([[0, 0], [0, 0]])

- torch.Tensor
  torch.Tensor([[0, 0], [0, 0]])

- tuple of np.ndarrays or torch.Tensors
  (torch.Tensor([0, 0]), torch.Tensor([[0, 0], [0, 0]]))

- list of np.ndarrays or torch.Tensors
  [torch.Tensor([0, 0]), torch.Tensor([[0, 0], [0, 0]])]

- dictionary mapping strings to np.ndarrays or torch.Tensors
  {"data": torch.Tensor([[0, 0], [0, 0]]), "label": torch.Tensor([[1, 1], [1, 1]])}

- combination of the above
  {
      "data": [
          {"sub_data1": torch.Tensor([[0, 0], [0, 0]])},
          {"sub_data2": torch.Tensor([0, 0])},
      ],
      "label": (torch.Tensor([0, 0]), torch.Tensor([[0, 0], [0, 0]])),
  }
Trial Context

determined.pytorch.PyTorchTrialContext subclasses determined.TrialContext. It provides useful methods for writing Trial subclasses.

class determined.pytorch.PyTorchTrialContext(*args: Any, **kwargs: Any)
    Contains runtime information for any Determined workflow that uses the pytorch API.

get_lr_scheduler() → Optional[determined.pytorch._lr_scheduler.LRScheduler]
    Get the scheduler associated with the trial, if one is defined. This function should not be called from:
    - __init__
    - build_model()
    - optimizer()
    - create_lr_scheduler()

get_model() → torch.nn.modules.module.Module
    Get the model associated with the trial. This function should not be called from:
    - __init__
    - build_model()

get_optimizer() → torch.optim.optimizer.Optimizer
    Get the optimizer associated with the trial. This function should not be called from:
    - __init__
    - build_model()
    - optimizer()

is_epoch_end() → bool
    Returns true if the current batch is the last batch of the epoch.

    Warning
        Not accurate for variable size epochs.

is_epoch_start() → bool
    Returns true if the current batch is the first batch of the epoch.

    Warning
        Not accurate for variable size epochs.
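A small sketch of consulting the context from inside a trial method, assuming the trial stored it as self.context in __init__; the logging is illustrative only:

import torch

def train_batch(self, batch, model, epoch_idx, batch_idx):
    data, labels = batch
    loss = torch.nn.functional.mse_loss(model(data), labels)
    if self.context.is_epoch_end():
        # Example of using the runtime context; accurate only for fixed-size epochs.
        print(f"finished epoch {epoch_idx} at batch {batch_idx}")
    return {"loss": loss}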
Callbacks

To execute arbitrary Python functionality during the lifecycle of a PyTorchTrial, implement the determined.pytorch.PyTorchCallback interface documented above.
ReduceLROnPlateau

To use the torch.optim.lr_scheduler.ReduceLROnPlateau class with PyTorchTrial, implement the following callback:
import torch

from determined.pytorch import PyTorchCallback


class ReduceLROnPlateauEveryValidationStep(PyTorchCallback):
    def __init__(self, context):
        self.reduce_lr = torch.optim.lr_scheduler.ReduceLROnPlateau(
            context.get_optimizer(), "min", verbose=True
        )  # customize arguments as desired here

    def on_validation_end(self, metrics):
        self.reduce_lr.step(metrics["validation_error"])

    def state_dict(self):
        return self.reduce_lr.state_dict()

    def load_state_dict(self, state_dict):
        self.reduce_lr.load_state_dict(state_dict)
Then, implement the build_callbacks function in PyTorchTrial:

def build_callbacks(self):
    return {"reduce_lr": ReduceLROnPlateauEveryValidationStep(self.context)}
Gradient Clipping

To perform gradient clipping, Determined provides two pre-made callback classes:

class determined.pytorch.ClipGradsL2Norm(clip_value: float)
    Callback that performs gradient clipping using L2 Norm.

on_before_optimizer_step(parameters: Iterator) → None
    Run before every optimizer.step(). For multi-GPU training, executes after gradient updates have been communicated. Typically used to perform gradient clipping.

class determined.pytorch.ClipGradsL2Value(clip_value: float)
    Callback that performs gradient clipping using L2 Value.

on_before_optimizer_step(parameters: Iterator) → None
    Run before every optimizer.step(). For multi-GPU training, executes after gradient updates have been communicated. Typically used to perform gradient clipping.
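To enable one of these callbacks, return it from build_callbacks() in the trial; the clip value below is an arbitrary placeholder:

from determined.pytorch import ClipGradsL2Norm

def build_callbacks(self):
    # Clip gradients to an L2 norm of 1.0 before every optimizer.step().
    return {"clip_grads": ClipGradsL2Norm(1.0)}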