Distributed Training with Determined
Learn how to perform optimized distributed training with Determined to speed up the training of a single trial.
In Concepts of Distributed Training, you’ll learn about the following topics:

- How Determined distributed training works
- Reducing computation and communication overhead
- Training effectively with large batch sizes (see the sketch after this list)
- Model characteristics that affect performance
- Debugging performance bottlenecks
- Optimizing training
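As one example of working with large batch sizes, a widely used heuristic is the linear scaling rule: when the global batch size grows by a factor of k, scale the base learning rate by k as well. Here is a minimal sketch in Python; the baseline learning rate and batch sizes below are hypothetical illustrations, not recommendations:

```python
def scale_learning_rate(base_lr: float, base_batch: int, global_batch: int) -> float:
    """Linear scaling rule: grow the learning rate in proportion to the
    increase in global batch size."""
    return base_lr * (global_batch / base_batch)

# Hypothetical example: a single-worker baseline of lr=0.1 at batch size 256,
# scaled to a global batch size of 2048 spread across 8 workers.
print(scale_learning_rate(0.1, 256, 2048))  # 0.8
```

In practice this rule is often paired with a learning-rate warmup period, and very large batches may still require tuning beyond the heuristic alone.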
Visit Implementing Distributed Training to discover how to implement distributed training, including the following:

- Connectivity considerations for multi-machine training
- Configuration, including slots per trial and global batch size (see the sketch after this list)
- Considerations for concurrent data downloads
- Details to be aware of regarding scheduler behavior
- Accelerating inference workloads
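To make the configuration items above concrete, here is a minimal sketch of the relevant portion of an experiment configuration, expressed as a Python dictionary. The numeric values are illustrative assumptions, not recommendations:

```python
# Sketch of the distributed-training fields in a Determined experiment
# configuration. Values here are illustrative only.
config = {
    "resources": {
        # Number of accelerator slots devoted to this single trial; values
        # greater than 1 make the trial a distributed-training job.
        "slots_per_trial": 8,
    },
    "hyperparameters": {
        # The global batch size is divided across all slots, so each of the
        # 8 workers here would process 64 samples per batch.
        "global_batch_size": 512,
    },
    # Other experiment settings (entrypoint, searcher, etc.) omitted.
}
```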
Additional Resources:

- Learn how Configuration Templates can help reduce redundancy.
- Discover how Determined aims to support reproducible machine learning experiments in Reproducibility (see the sketch after this list).
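For instance, reproducibility in Determined is driven in part by a fixed experiment seed; a minimal sketch, assuming the reproducibility section of the experiment configuration:

```python
# Sketch: pinning the experiment seed so runs are repeatable
# (see the Reproducibility documentation for the exact guarantees).
config = {
    "reproducibility": {
        "experiment_seed": 42,  # hypothetical fixed seed
    },
}
```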
In Optimizing Training, you’ll learn about out-of-the-box tools you can use for instrumenting training.
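As one example of such instrumentation, here is a minimal sketch of enabling built-in profiling through the experiment configuration; the field names assume the profiling section described in the Optimizing Training documentation, and the batch bounds are illustrative:

```python
# Sketch: turning on system-metric profiling for part of a run.
# Field names assume the `profiling` section of the experiment
# configuration; batch bounds are hypothetical.
config = {
    "profiling": {
        "enabled": True,
        "begin_on_batch": 0,      # start collecting immediately
        "end_after_batch": 100,   # stop after 100 batches
    },
}
```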