Optimizing Training#
When optimizing the training speed of a model, the first step is to understand where and why training is slow. Once the bottlenecks have been identified, the next step is to investigate and experiment to alleviate them.
To understand the performance profile of a training job, the training code and infrastructure need to be instrumented. Many different layers can be instrumented, from raw throughput all the way down to GPU kernels.
Determined provides two tools out-of-the-box for instrumenting training:
System Metrics: measurements of hardware usage
Timings: durations of actions taken during training, such as data loading
System Metrics are useful to see if the software is taking full advantage of the available hardware, particularly around GPU usage, data loading, and network communication during distributed training. Timings are useful for identifying the section of code to focus on for optimizations. Most commonly, Timings help answer the question of whether the dataloader is the main bottleneck in training.
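For example, a quick way to get an initial answer to that question, even outside of Determined, is to time the dataloader directly. The following is a minimal sketch using only PyTorch and the standard library; the synthetic dataset, batch size, and worker count are placeholder assumptions, not recommendations.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; substitute the real training dataset.
dataset = TensorDataset(torch.randn(2000, 3, 32, 32), torch.randint(0, 10, (2000,)))
loader = DataLoader(dataset, batch_size=64, num_workers=4)

# Time each fetch from the dataloader. If these times rival the
# forward/backward pass, data loading is likely the bottleneck.
it = iter(loader)
fetch_times = []
for _ in range(20):
    start = time.perf_counter()
    batch = next(it)
    fetch_times.append(time.perf_counter() - start)

print(f"mean batch fetch time: {sum(fetch_times) / len(fetch_times):.4f}s")
```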
System Metrics#
System Metrics are statistics around hardware usage, such as GPU utilization and network throughput. These metrics are useful for seeing whether training is using the hardware effectively. When the System Metrics reported for an experiment are below what is expected from the hardware, that is a sign that the software can likely be optimized to make better use of the hardware resources.
Specifically, Determined tracks the following metrics (a rough sketch of sampling comparable statistics by hand follows the list):
GPU utilization
GPU free memory
Network throughput (sent)
Network throughput (received)
Disk IOPS
Disk throughput (read)
Disk throughput (write)
Host available memory
CPU utilization averaged across cores
For distributed training, these metrics are collected for every agent. The data are broken down by agent, and GPU metrics can be further broken down by GPU.
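As a point of reference, the snippet below is a rough sketch of how similar hardware statistics can be sampled by hand with psutil and NVIDIA's NVML bindings (pynvml). This is not how Determined collects System Metrics; it is only a way to cross-check the reported numbers on a single agent, and it assumes both packages are installed and an NVIDIA GPU is present.

```python
import psutil   # requires the psutil package
import pynvml   # requires the pynvml package and an NVIDIA driver

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on this agent

# GPU utilization and free memory
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU utilization: {util.gpu}%")
print(f"GPU free memory: {mem.free / 1e9:.2f} GB")

# Host-level metrics: CPU, memory, network, disk
print(f"CPU utilization (all cores): {psutil.cpu_percent(interval=1.0)}%")
print(f"Host available memory: {psutil.virtual_memory().available / 1e9:.2f} GB")

net = psutil.net_io_counters()
print(f"Network bytes sent / received: {net.bytes_sent} / {net.bytes_recv}")

disk = psutil.disk_io_counters()
print(f"Disk read / write operations: {disk.read_count} / {disk.write_count}")

pynvml.nvmlShutdown()
```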
Note
System Metrics are recorded at the agent level, so when multiple experiments share an agent, it is difficult to attribute hardware usage to a particular experiment. We suggest profiling with only a single experiment per agent.
Timings#
The other type of profiling metric that Determined tracks is Timings. Timings are measurements of how long specific training events take. Examples of training events include retrieving data from the dataloader, moving data between host and device, running the forward/backward pass, and executing callbacks.
Note
Timings are currently only supported for PyTorchTrial.
These measurements provide a high-level picture of where to focus optimization efforts. Specifically, Determined tracks the following Timings (a sketch mapping them onto trial code follows the list):
dataloader_next: time to retrieve the next item from the dataloader
to_device: time to transfer input from host to device
train_batch: how long the user-defined train_batch function takes to execute*
step_lr_schedulers: amount of time to update the LR schedules
from_device: time to transfer output from device to host
reduce_metrics: time taken to calculate global metrics in distributed training
* train_batch is typically the forward pass and the backward pass, but it is a user-defined function so it could include other steps.
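To make the mapping between Timings and user code concrete, here is a heavily abridged PyTorchTrial sketch with comments indicating which Timing covers which step. build_model, build_optimizer, and MyDataset are hypothetical placeholders, and the validation methods are omitted, so the class is not runnable as-is.

```python
import torch.nn.functional as F

from determined.pytorch import DataLoader, PyTorchTrial, PyTorchTrialContext


class MyTrial(PyTorchTrial):
    def __init__(self, context: PyTorchTrialContext) -> None:
        self.context = context
        self.model = self.context.wrap_model(build_model())  # hypothetical helper
        self.optimizer = self.context.wrap_optimizer(build_optimizer(self.model))
        # step_lr_schedulers covers any LR schedulers wrapped with
        # context.wrap_lr_scheduler (not shown here).

    def build_training_data_loader(self) -> DataLoader:
        # dataloader_next measures each fetch from this loader's iterator.
        return DataLoader(MyDataset(), batch_size=self.context.get_per_slot_batch_size())

    def train_batch(self, batch, epoch_idx: int, batch_idx: int):
        # to_device has already happened by the time this method runs: `batch`
        # arrives on the device. train_batch measures this entire method.
        data, labels = batch
        output = self.model(data)
        loss = F.cross_entropy(output, labels)
        self.context.backward(loss)                  # backward pass
        self.context.step_optimizer(self.optimizer)  # optimizer step
        # from_device (device-to-host transfer of outputs) and reduce_metrics
        # (distributed metric reduction) are measured after this method returns.
        return {"loss": loss}

    # build_validation_data_loader() and evaluate_batch() omitted for brevity.
```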