Advanced Usage¶
Training Multiple Model Engines¶
If the model engines use the same ModelParallelUnit
, you can
train multiple model engines in a single DeepSpeedTrial
by
calling wrap_model_engine()
on additional
model engines you want to use, and by modifying
train_batch()
and
evaluate_batch()
accordingly.
The accounting for number of samples is with respect to the train_batch_size
for the first model
engine passed to wrap_model_engine()
.
For more advanced cases where model engines have different model parallel topologies, contact support on the Determined community Slack.
Custom Reducers¶
Determined supports arbitrary training and validation metrics reduction, including during
distributed training, by letting you define custom reducers. Custom reducers can be a function or an
implementation of the determined.pytorch.MetricReducer
interface. See
determined.pytorch.PyTorchTrialContext.wrap_reducer()
for more information.
Manual Distributed Backend Initialization¶
By default, DeepSpeedTrial
initializes the distributed
backend by calling deepspeed.init_distributed
before a trial is created. This initializes the
torch.distributed
backend to use the NVIDIA Collective Communications Library (NCCL). If you
want to customize the distributed backend initialization, set the DET_MANUAL_INIT_DISTRIBUTED
environment variable in your experiment configuration:
environment:
environment_variables:
- DET_MANUAL_INIT_DISTRIBUTED=1
Manual Gradient Aggregation¶
DeepSpeedTrial
automatically ensures a total of
train_batch_size
samples are processed in each training iteration. With the assumption that
train_batch()
calls the model engine’s forward,
backward, and optimizer step methods once, DeepSpeedTrial
calls train_batch()
:
gradient_accumulation_steps
times when not using pipeline parallelismonce when using pipeline parallelism
to reach model_engine.train_batch_size()
for the first wrapped model engine.
To disable this behavior, call
disable_auto_grad_accumulation()
in the
__init__()
method of
DeepSpeedTrial
. In this case, make sure the first model
engine processes train_batch_size
samples in each call to
train_batch()
.
Custom Data Loaders¶
By default, build_training_data_loader()
and
build_validation_data_loader()
are expected to
return a determined.pytorch.DataLoader
, which is a thin wrapper around
torch.utils.data.DataLoader
that supports reproducibility and data sharding for distributed
training.
Override this requirement and return a torch.utils.data.DataLoader
by setting
disable_dataset_reproducibility_checks()
.
Review customizing a reproducible dataset for recommended best
practices when using a custom data loader.
A common use case for a custom data loader is if you created the data loader when building the model engine as show in this example:
class MyTrial(DeepSpeedTrial):
def __init__(self, context: DeepSpeedTrialContext) -> None:
self.context = context
self.args = AttrDict(self.context.get_hparams())
training_data = ...
model = Net(self.args)
parameters = filter(lambda p: p.requires_grad, model.parameters())
model_engine, __, __, self.train_dataloader = deepspeed.initialize(
args=self.args,
model=model,
model_parameters=parameters,
training_data=training_data,
)
self.model_engine = self.context.wrap_model_engine(model_engine)
def build_training_data_loader(self) -> torch.utils.data.DataLoader:
return self.train_dataloader
Custom Model Parallelism¶
DeepSpeedTrial
relies on a
ModelParallelUnit
to provide data parallel world size and to
determine whether a GPU slot should build the data loaders and report metrics. For data parallel
training with DeepSpeed, the data parallel world size is equal to the number of GPU slots and all
GPU slots build the data loaders and report metrics. If the model engine passed to
wrap_model_engine()
is a
PipelineEngine
, the ModelParallelUnit
is built using the
MPU associated with the model engine. To change this behavior to support custom model parallelism,
pass a custom set_mpu
as shown in the following
example:
context.set_mpu(
ModelParallelUnit(
data_parallel_rank=[fill in],
data_parallel_world_size=[fill in],
should_report_metrics=[fill in],
should_build_dataloader=[fill in]
)
)