
Native API: Basics

First, let’s consider what it looks like to train a very simple model on MNIST using tf.keras, taken directly from the TensorFlow documentation.

import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation="softmax"),
    ]
)
model.compile(
    tf.keras.optimizers.Adam(name='Adam'),
    loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=1)

Here is what it looks like to train the exact same model using the Native API to launch an experiment on a Determined cluster.

import tensorflow as tf

import determined as det
from determined import experimental
from determined.experimental.keras import init

config = {
    "searcher": {"name": "single", "metric": "val_acc", "max_length": {"batches": 500}},
    "hyperparameters": {"global_batch_size": 32},
}

# When running this code from a notebook, add a `command` argument to init()
# specifying the notebook file name.
context = init(config, context_dir=".")

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation="softmax"),
    ]
)
model = context.wrap_model(model)
model.compile(
    tf.keras.optimizers.Adam(name='Adam'),
    loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=5)

Paste the code above into a Python file named tf_keras_native.py and run it as a Python script:

$ python tf_keras_native.py

Note

Before submitting any experiments using the Native API, make sure the DET_MASTER environment variable is set to the IP address of your Determined master.

You can also use any environment that supports Python to launch an experiment with this code, such as a Jupyter notebook or an IDE.
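When launching from a notebook or IDE instead of a shell, one option is to set DET_MASTER from Python before calling init(). This is a sketch under assumptions. The address below is a placeholder. The exact address format your master expects (for example, whether a scheme such as http:// is needed) is also an assumption, so check it against your deployment.

```python
import os

# Placeholder address (assumption): replace with the host and port of your own Determined master.
DET_MASTER_ADDRESS = "localhost:8080"

# setdefault keeps any DET_MASTER value that was already exported in the shell.
os.environ.setdefault("DET_MASTER", DET_MASTER_ADDRESS)

print("Submitting experiments to", os.environ["DET_MASTER"])
```

Using setdefault rather than plain assignment means a notebook cell will not silently override an address that was already configured for the session.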

Let’s walk through some of the concepts introduced by the Native API.

Configuration

config = {
    "searcher": {"name": "single", "metric": "val_acc", "max_length": {"batches": 500}},
    "hyperparameters": {"global_batch_size": 32},
}

Configuring any experiment for use with Determined requires an Experiment Configuration. In the Native API, this is represented as a Python dictionary. There are two required fields for every configuration submitted via the Native API:

searcher:

This field describes how many different Trials (models) should be trained, for how long, and which validation metric (here "val_acc") the searcher uses to evaluate them. In this case, the "single" searcher trains one model for 500 batches.

hyperparameters:

This field describes the hyperparameters used. global_batch_size is a required hyperparameter for every experiment; we’ll revisit this requirement in Native API: Distributed Training.
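The following pure-Python sketch makes the two required fields and the training length concrete. check_config and examples_seen are hypothetical helpers and are not part of Determined, which validates configurations itself. examples_seen assumes that each searcher batch contains global_batch_size examples.

```python
# Hypothetical helpers for illustration only; not part of the Determined API.
REQUIRED_FIELDS = ("searcher", "hyperparameters")


def check_config(config: dict) -> None:
    """Raise ValueError if the config lacks the fields this tutorial calls required."""
    missing = [field for field in REQUIRED_FIELDS if field not in config]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    if "global_batch_size" not in config["hyperparameters"]:
        raise ValueError("hyperparameters must define global_batch_size")


def examples_seen(config: dict) -> int:
    """Training examples processed by the searcher: batches times global batch size."""
    batches = config["searcher"]["max_length"]["batches"]
    return batches * config["hyperparameters"]["global_batch_size"]


config = {
    "searcher": {"name": "single", "metric": "val_acc", "max_length": {"batches": 500}},
    "hyperparameters": {"global_batch_size": 32},
}
check_config(config)
print(examples_seen(config))  # 16000
```

With this tutorial's values, 500 batches of 32 examples is 16,000 examples. That is roughly a quarter of one pass over the 60,000-image MNIST training set.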

Context

context = init(config, local=False, test=False, context_dir=".")

init() is the function that initializes the Determined training context. We can think of it as the point in the training script where Determined “assumes control” of executing your code. In addition to the configuration, it accepts three arguments:

local (bool):

local=False will submit the experiment to a Determined cluster. local=True will execute the training loop in your local Python environment (although currently, local training is not implemented, so you must also set test=True). Defaults to False.

test (bool):

test=True will execute a minimal training loop rather than a full experiment. This can be useful for porting or debugging a model because many common errors will surface quickly. Defaults to False.

context_dir (str):

Specifies the location of the code you want submitted to the cluster. This is required by Determined to execute your training script in a remote environment (local=False). In the common case, “.” submits your entire working directory to the Determined cluster.
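The local/test rules above can be summarized as a small decision function. describe_init_mode is a hypothetical helper that only encodes this tutorial's description. It does not call Determined, and the real init() may behave differently as support for local training evolves.

```python
def describe_init_mode(local: bool = False, test: bool = False) -> str:
    """Hypothetical helper mirroring the local/test rules described for init()."""
    if local and not test:
        # Per the text, local training is not implemented yet, so local=True needs test=True.
        raise ValueError("local=True currently requires test=True")
    where = "in your local Python environment" if local else "on the Determined cluster"
    what = "minimal test loop" if test else "full experiment"
    return f"{what} {where}"


print(describe_init_mode())                       # default: full experiment on the cluster
print(describe_init_mode(test=True))              # quick smoke test on the cluster
print(describe_init_mode(local=True, test=True))  # quick smoke test on this machine
```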

Wrap Model (tf.keras only)

model = context.wrap_model(model)

In the case of tf.keras, we use the wrap_model API to make the Determined context aware of the model we want to train. After calling wrap_model, we proceed with the compile() and fit() interfaces defined by TensorFlow to train the model remotely.
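To show the call order (wrap first, then compile, then fit) without needing TensorFlow or a cluster, here is a sketch built on stand-ins. RecordingContext and StubModel are hypothetical classes. They are not Determined's or TensorFlow's implementations and only record what happens to the model.

```python
class RecordingContext:
    """Hypothetical stand-in for a Determined context; it only remembers the wrapped model."""

    def __init__(self):
        self.model = None

    def wrap_model(self, model):
        # A real context may return a different (wrapped) object, which is why the
        # tutorial reassigns `model = context.wrap_model(model)`.
        self.model = model
        return model


class StubModel:
    """Hypothetical stand-in for a tf.keras model that logs which methods are called."""

    def __init__(self):
        self.calls = []

    def compile(self, *args, **kwargs):
        self.calls.append("compile")

    def fit(self, *args, **kwargs):
        self.calls.append("fit")


context = RecordingContext()
model = context.wrap_model(StubModel())               # 1. wrap
model.compile(loss="sparse_categorical_crossentropy")  # 2. compile
model.fit()                                            # 3. fit
print(model.calls)  # ['compile', 'fit']
```

Reassigning the return value of wrap_model, rather than keeping the original reference, means the script keeps working even if the real context returns a wrapper object.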

Next Steps
