Native API

The Native API allows developers to seamlessly move between training in a local development environment and training at cluster-scale on a Determined cluster. It also provides an interface to train tf.keras and tf.estimator models using idiomatic framework patterns, reducing (or eliminating) the effort to port model code for use with Determined.

In this guide, we’ll cover what happens under the hood when a user submits an experiment to Determined using the Native API. This topic guide is meant as a deep dive for advanced users; for a tutorial on getting started with the Native API, see Native API Tutorial.

Create an Experiment via init()

A user can submit an experiment from Python by executing a Python script that conforms to the Native API:

context = det.experimental.keras.init(config, context_dir=".")
model = ...
model = context.wrap_model(model)
model.compile(...)
model.fit(...)

The init() APIs require a context directory (context_dir) argument. The context directory specifies the root directory of the code containing the native implementation; for most users this is the current working directory (.). init() also accepts two boolean keyword arguments:

local (bool):

local=False will submit the experiment to a Determined cluster. local=True will execute the training loop in your local Python environment (local training is not currently implemented, so you must also set test=True). Defaults to False.

test (bool):

test=True will execute a minimal training loop rather than a full experiment. This can be useful for porting or debugging a model because many common errors will surface quickly. Defaults to False.
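The interaction between the two flags can be summarized in a small sketch. The helper below, resolve_mode, is hypothetical and is not part of Determined; it only restates the rules described above in plain Python. The local=False, test=True combination is inferred by combining the two descriptions rather than stated explicitly in this guide.

```python
def resolve_mode(local=False, test=False):
    """Hypothetical helper (not part of Determined).

    Returns a (where, what) pair describing what a given combination of
    the `local` and `test` flags would do, based on the descriptions above.
    """
    if local and not test:
        # Full local training is not implemented, so local=True
        # is only valid together with test=True.
        raise ValueError("local=True currently requires test=True")
    where = "local Python environment" if local else "Determined cluster"
    what = "minimal training loop" if test else "full experiment"
    return where, what
```

For example, resolve_mode(local=True, test=True) describes the quick local smoke test that is useful while porting a model, and the defaults describe a full experiment submitted to the cluster.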

Determined requires that the Native Python script contains a high-level training loop that it can intercept. Currently, the following training loop functions are supported:

To learn more about the init() APIs, see:

Life of a Native API Experiment

[Figure: life of a Native API experiment]

The diagram above demonstrates the flow of execution when an experiment is created in cluster mode.

The Python script is first executed in the User environment, such as a Python virtualenv or a Jupyter notebook where the determined Python package is installed. When the code is executed, init() creates an experiment by submitting the contents of the context directory and the experiment configuration in a network request to the Determined master. Note that any code that comes after the init() call (typically defining the model and training loop) is not executed in the User environment. In the case of tf.keras, building and compiling the model is only done on the Determined cluster.

Once the Determined master has initialized a Trial Runner environment, the user script is re-executed from the beginning using runpy.run_path. In the Trial Runner environment, init() will return a determined.NativeContext object that holds information specific to that trial (e.g. hyperparameter choices) and continue executing. Once the script hits the training loop function (in this case, tf.keras.Model.fit), Determined will launch into the managed training loop.
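To make the re-execution step more concrete, here is a minimal sketch of what running a script from the top with runpy.run_path looks like, using only the standard library. The stand-in script, the rerun_script helper, and the trace and trial_hparams names are assumptions made for illustration; this is not Determined's actual Trial Runner code.

```python
import os
import runpy
import tempfile
import textwrap

# A stand-in user script. It appends to a list named `trace` (supplied by
# the caller) so we can observe which top-level statements were executed.
USER_SCRIPT = textwrap.dedent(
    """
    trace.append("top of script")
    hparams = dict(trial_hparams)  # stands in for trial-specific context
    trace.append("after init, lr=%s" % hparams["lr"])
    """
)


def rerun_script(hparams):
    """Hypothetical helper: execute the stand-in script from its first line."""
    trace = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "user_script.py")
        with open(path, "w") as f:
            f.write(USER_SCRIPT)
        # run_path executes the file in a fresh module namespace seeded
        # with init_globals and returns the resulting globals dictionary.
        result = runpy.run_path(
            path,
            init_globals={"trace": trace, "trial_hparams": hparams},
            run_name="__main__",
        )
    return trace, result["hparams"]
```

Calling rerun_script({"lr": 0.01}) runs every top-level statement of the stand-in script in order, which mirrors why module-level code in a Native API script runs again inside each trial.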

Warning

Any user code that occurs after the training loop has started will never be executed!
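The toy example below illustrates the effect of this warning. It assumes, purely for illustration, that the intercepted training loop call never hands control back to the script (simulated here with SystemExit); managed_fit and user_script are hypothetical names, and this is not a description of Determined's internals.

```python
events = []


def managed_fit():
    """Hypothetical stand-in for an intercepted training loop call."""
    events.append("managed training loop ran")
    # Assumption for illustration: control never returns to the caller.
    raise SystemExit(0)


def user_script():
    events.append("model built")
    managed_fit()
    # This line is never reached because managed_fit() does not return.
    events.append("post-training code")


try:
    user_script()
except SystemExit:
    # Caught only so the demonstration itself can finish and inspect `events`.
    pass
```

After this runs, events contains the model-building and training entries but not the post-training one, matching the behavior described in the warning.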