Native API¶
The Native API allows developers to seamlessly move between training in a local
development environment and training at cluster-scale on a Determined cluster.
It also provides an interface to train tf.keras
and tf.estimator
models
using idiomatic framework patterns, reducing (or eliminating) the effort to
port model code for use with Determined.
In this guide, we’ll cover what happens under the hood when a user submits an experiment to Determined using the Native API. This topic guide is meant as deep-dive for advanced users. For a tutorial on getting started with the Native API, see the Native API Tutorial.
Create an Experiment via init()
¶
A user can submit an experiment from Python by executing a Python script that conforms to the Native API:
context = det.experimental.keras.init(config, context_dir=".")
model = ...
model = context.wrap_model(model)
model.compile(...)
model.fit(...)
The init()
APIs require a context directory (context_dir
) argument.
The context directory specifies the root directory of the code containing the
native implementation – for most users, this is the current working
directory (.
). init()
also accepts two boolean keyword arguments:
local
(bool
):local=False
will submit the experiment to a Determined cluster.local=True
will execute the training loop in your local Python environment (although currently, local training is not implemented, so you must also settest=True
). Defaults to False.test
(bool
):test=True
will execute a minimal training loop rather than a full experiment. This can be useful for porting or debugging a model because many common errors will surface quickly. Defaults to False.
Determined requires that the Native Python script contains a high-level training loop that it can intercept. Currently, the following training loop functions are supported:
To learn more about the init()
APIs, see:
Life of a Native API Experiment¶
The diagram above demonstrates the flow of execution when an experiment is created in cluster mode.
The Python script will first be executed in the User Environment, such
as a Python virtualenv or a Jupyter notebook where the determined
Python
package is installed. When the code is executed, init()
will create an
experiment by submitting the contents of the context directory and the
experiment configuration to
the Determined master. Note that any code that comes after the init()
call
(typically involving defining the model and training loop) is not executed in
the User Environment. In the case of tf.keras
, building and compiling the
model is only done on the Determined cluster.
The Determined master will schedule the new workload on an agent
machine. The agent will launch a task container and initialize a Trial
Runner Environment; in this environment, the
user script will be re-executed from the beginning using runpy.run_path. In the trial
runner environment, init
will return a determined.NativeContext
object that holds information specific to that trial (e.g. hyperparameter
choices) and continue executing. Once the script hits the training loop
function (in this case, tf.keras.Models.fit
), Determined will launch into
the managed training loop.
Warning
Any user code that occurs after the training loop has started will never be executed!