Native API: Distributed Training
One powerful application of the Native API is that it can be used to seamlessly launch distributed training jobs (both single- and multi-instance) with a minimal set of code changes. This example builds on top of Native API: Basics to demonstrate this.
import tensorflow as tf
import determined as det
from determined import experimental
from determined.experimental.keras import init
config = {
"searcher": {"name": "single", "metric": "val_accuracy", "max_steps": 5},
"hyperparameters": {"global_batch_size": "256"},
"resources": {"slots_per_trial": 8},
}
First, configure the resources.slots_per_trial field in the experiment configuration to choose the number of slots to train on. Make sure that the Determined cluster you are using to launch the experiment has a sufficient number of slots available. In this case, we've configured our experiment to use a global_batch_size of 256 across all slots, or a sub-batch size of 32 on each slot.
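The sub-batch size follows directly from these two settings; the short sketch below spells out the arithmetic (the variable names are illustrative, not part of the Determined API):
# Illustrative arithmetic only: how the per-slot (sub-)batch size is derived
# from the experiment configuration above.
global_batch_size = 256  # config["hyperparameters"]["global_batch_size"]
slots_per_trial = 8  # config["resources"]["slots_per_trial"]
per_slot_batch_size = global_batch_size // slots_per_trial
assert per_slot_batch_size == 32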
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
# When running this code from a notebook, add a `command` argument to
# init() specifying the notebook file name.
context = init(config, mode=experimental.Mode.CLUSTER, context_dir=".")
model = tf.keras.models.Sequential(
[
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation="relu"),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10, activation="softmax"),
]
)
model = context.wrap_model(model)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(
    x_train,
    y_train,
    validation_data=(x_test, y_test),
    batch_size=context.get_per_slot_batch_size(),
    epochs=5,
)
Now, configure and launch the experiment training job as done in Native API: Basics. Note that no changes to the training code are required to scale up to distributed training; only the configuration differs, as the sketch below illustrates.
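As a sketch of how little changes when scaling, a single-slot and a multi-slot variant of the same experiment differ only in their configuration dictionaries (the values shown are illustrative):
# Single-slot (non-distributed) variant of the experiment configuration.
single_slot_config = {
    "searcher": {"name": "single", "metric": "val_accuracy", "max_steps": 5},
    "hyperparameters": {"global_batch_size": 32},
    "resources": {"slots_per_trial": 1},
}

# Distributed variant: only slots_per_trial (and, if desired, the global
# batch size) changes; the model and training code stay identical.
distributed_config = {
    "searcher": {"name": "single", "metric": "val_accuracy", "max_steps": 5},
    "hyperparameters": {"global_batch_size": 256},
    "resources": {"slots_per_trial": 8},
}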
We use determined.keras.TFKerasNativeContext.get_per_slot_batch_size() to set the framework batch_size argument. Determined initializes the context of each distributed training worker so that this function returns that worker's sub-batch size. Because Determined manages the batch size as a first-class configuration property, global_batch_size is a required hyperparameter in all experiments.
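With the configuration used here (a global_batch_size of 256 across 8 slots), each worker's context would report a sub-batch size of 32; a minimal illustrative check:
# Illustrative only: inside each worker, the context reports that worker's
# share of the configured global batch size.
assert context.get_per_slot_batch_size() == 256 // 8  # == 32 in this example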