Native API

The Native API allows developers to seamlessly move between training in a local development environment and training at cluster-scale on a Determined cluster. It also provides an interface to train tf.keras and tf.estimator models using idiomatic framework patterns, reducing (or eliminating) the effort to port model code for use with Determined.

In this guide, we’ll cover what happens under the hood when a user submits an experiment to Determined using the Native API. This topic guide is meant as a deep dive for advanced users. For a tutorial on getting started with the Native API, see the Native API Tutorial.

Create an Experiment via init()

A user can submit an experiment from Python by executing a Python script that conforms to the Native API:

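import determined as det
import determined.experimental.keras  # makes det.experimental.keras resolvable

# config holds the experiment configuration.
context = det.experimental.keras.init(config, context_dir=".")
model = ...
model = context.wrap_model(model)
model.compile(...)
model.fit(...)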

The init() APIs require a context directory (context_dir) argument. The context directory specifies the root directory of the code containing the native implementation – for most users, this is the current working directory (.). init() also accepts two boolean keyword arguments:

local (bool):

local=False submits the experiment to a Determined cluster. local=True executes the training loop in your local Python environment. Local training is not yet implemented, so local=True currently requires test=True as well. Defaults to False.

test (bool):

test=True will execute a minimal training loop rather than a full experiment. This can be useful for porting or debugging a model because many common errors will surface quickly. Defaults to False.
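The flag combinations described above can be summarized in a small decision helper. This is only an illustrative sketch: describe_mode is a hypothetical function, not part of the determined package, and it encodes nothing beyond what this guide states about local and test.

```python
def describe_mode(local: bool = False, test: bool = False) -> str:
    """Hypothetical helper: summarize what init() does for a flag pair."""
    if local and not test:
        # Per this guide, full local training is not implemented yet.
        raise NotImplementedError("local=True currently requires test=True")
    where = "in the local Python environment" if local else "on a Determined cluster"
    what = "minimal training loop" if test else "full experiment"
    return f"{what} {where}"


# Print the three combinations that this guide describes as usable.
for local, test in [(False, False), (False, True), (True, True)]:
    print(f"local={local}, test={test}: {describe_mode(local, test)}")
```

Both flags default to False, so calling init() without either one submits a full experiment to the cluster.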

Determined requires that the Native Python script contains a high-level training loop that it can intercept. Currently, the following training loop functions are supported:

To learn more about the init() APIs, see:

Life of a Native API Experiment

[Diagram: life of a Native API experiment (life-of-a-native-experiment.png)]

The diagram above demonstrates the flow of execution when an experiment is created in cluster mode.

The Python script will first be executed in the User Environment, such as a Python virtualenv or a Jupyter notebook where the determined Python package is installed. When the code is executed, init() will create an experiment by submitting the contents of the context directory and the experiment configuration to the Determined master. Note that any code that comes after the init() call (typically involving defining the model and training loop) is not executed in the User Environment. In the case of tf.keras, building and compiling the model is only done on the Determined cluster.
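To make the submission step concrete, here is a minimal sketch of bundling a context directory together with an experiment configuration. It is an assumption-laden toy, not Determined's implementation: collect_context, build_submission, the payload layout, and the "description" config key are all hypothetical, and nothing is sent over the network.

```python
import os
import tempfile


def collect_context(context_dir: str) -> dict:
    """Hypothetical: map each file's path (relative to context_dir) to its bytes."""
    files = {}
    for root, _dirs, names in os.walk(context_dir):
        for name in names:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, context_dir)
            with open(path, "rb") as f:
                files[rel] = f.read()
    return files


def build_submission(config: dict, context_dir: str) -> dict:
    """Hypothetical payload: the configuration plus the context directory contents."""
    return {"config": config, "context": collect_context(context_dir)}


# Build a throwaway context directory containing a single script.
with tempfile.TemporaryDirectory() as context_dir:
    with open(os.path.join(context_dir, "train.py"), "wb") as f:
        f.write(b"print('hello')\n")
    # The config contents are placeholders for illustration only.
    submission = build_submission({"description": "example"}, context_dir)

print(sorted(submission["context"]))
```

The point of the sketch is that the script file itself is part of what gets submitted, which is why the cluster can later re-run it from the top.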

The Determined master will schedule the new workload on an agent machine. The agent will launch a task container and initialize a Trial Runner Environment; in this environment, the user script will be re-executed from the beginning using runpy.run_path. In the trial runner environment, init() will return a determined.NativeContext object that holds information specific to that trial (e.g. hyperparameter choices) and continue executing. Once the script hits the training loop function (in this case, tf.keras.Model.fit), Determined will launch into the managed training loop.
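The re-execution step relies on the standard library function runpy.run_path, which runs a file from its first line in a fresh module namespace. The sketch below shows only that behavior, using a hypothetical stand-in script written to a temporary directory; it does not import determined and is not how the trial runner is actually wired up.

```python
import os
import runpy
import tempfile
import textwrap

# Hypothetical stand-in for a user script; the event names are illustrative.
SCRIPT = textwrap.dedent(
    """
    events = []
    events.append("script start")
    events.append("init() call")
    events.append("model definition")
    """
)

with tempfile.TemporaryDirectory() as context_dir:
    script_path = os.path.join(context_dir, "train.py")
    with open(script_path, "w") as f:
        f.write(SCRIPT)

    # Each call executes the file from the top in a new namespace and
    # returns that namespace's globals as a dict.
    first = runpy.run_path(script_path, run_name="__main__")
    second = runpy.run_path(script_path, run_name="__main__")

print(first["events"])
print(second["events"])
```

Both runs record the same three events, and no state carries over between them. For the same reason, everything above the training loop in a Native API script runs again inside the trial runner.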

Warning

Any user code that occurs after the training loop has started will never be executed!
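The consequence of this warning can be illustrated with a toy model of control flow. Everything here is hypothetical: managed_fit stands in for the intercepted training loop function, and raising an exception is just one simple way to model "control never returns to the script". This guide does not say how Determined takes over internally.

```python
class ManagedLoopTookOver(Exception):
    """Hypothetical signal that the managed training loop now owns execution."""


def managed_fit(log: list) -> None:
    # Stand-in for the intercepted training loop function.
    log.append("managed training loop runs")
    raise ManagedLoopTookOver()


def user_script(log: list) -> None:
    log.append("define and compile model")
    managed_fit(log)
    log.append("plot results")  # never reached in this toy model


log = []
try:
    user_script(log)
except ManagedLoopTookOver:
    pass

print(log)
```

The "plot results" step never appears in the log. In a real script, the same applies to any statement placed after the training loop call, such as a final evaluation or saving plots.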