Experiment Lifecycle
In this guide, we’ll go over some of what happens under the hood when a user submits an experiment to Determined. We’ll cover how the experiment gets to the master, how it’s turned into concrete trials to run on the cluster, and what happens within each trial.
Uploading
First, the CLI gathers all the necessary information about the
experiment and sends it to the master. The det experiment create
command requires two arguments: a configuration file and a context.
The context must be a directory containing all of the Python code
necessary to run the model.
We don’t allow the total size of the files in the context to exceed 95 MiB. As a result, datasets should not be included directly in the experiment definition; instead, users should set up data loaders to read data from an external source.
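Since the limit applies to everything in the context directory, it can be handy to check the size locally before submitting. The following sketch is not part of the CLI; it simply totals the file sizes under a directory so you can confirm you are under the 95 MiB cap before running det experiment create.

```python
import os

MAX_CONTEXT_BYTES = 95 * 1024 * 1024  # the 95 MiB cap on the context directory


def context_size(path: str) -> int:
    """Total size, in bytes, of all files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


if context_size(".") > MAX_CONTEXT_BYTES:
    print("Context is over 95 MiB; keep datasets out of the context "
          "and read them through data loaders instead.")
```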
If the context is valid, the CLI takes its contents, along with the configuration file, and sends them to the master over the network.
Trial Setup
Once the context and configuration for an experiment have reached the master, the experiment waits for the scheduler to assign slots to it. The master then creates trials to train the model. The searcher described by the experiment configuration defines a set of hyperparameter configurations, each of which corresponds to one trial.
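To make the relationship between the searcher and trials concrete, here is a rough illustration of how a grid-style searcher could expand a small hyperparameter space into one configuration per trial. This is purely conceptual: the real searchers run inside the master, and several of them are far more sophisticated than a simple grid.

```python
from itertools import product

# A toy hyperparameter space, written out by hand for illustration.
space = {
    "learning_rate": [0.01, 0.001],
    "dropout": [0.2, 0.5],
}

# A grid searcher would emit one hyperparameter configuration per trial.
trial_configs = [dict(zip(space, values)) for values in product(*space.values())]
# -> 4 configurations, so this experiment would run 4 trials.
```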
When a trial is ready to begin running, the master communicates with the appropriate agent (or agents, in the case of distributed training), which creates Docker containers holding the user model code, along with the Determined harness.
Importing the Model
From this point on, most of the work happens inside the Docker containers running on the agent machines. It’s also here that Determined begins to look at the model code in detail; until now, the code has simply been passed along without being inspected.
Using the code in the context and the entrypoint specified in the model configuration, the harness finds the user-defined Python class that describes the model to be trained.
The user-provided class must be a subclass of a trial class included in Determined. Each trial class is designed to support one deep learning framework; together, the classes provide a consistent interface so that models written in any supported framework can interact with the master in the same way.
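As an example, a PyTorch model is typically described by subclassing PyTorchTrial. The sketch below shows the general shape of such a class; the linear model, optimizer, and omitted data loaders are placeholders, and the method names follow the PyTorch trial interface, which can vary between Determined versions.

```python
import torch
import torch.nn as nn
from determined.pytorch import DataLoader, PyTorchTrial, PyTorchTrialContext


class MyTrial(PyTorchTrial):
    def __init__(self, context: PyTorchTrialContext) -> None:
        self.context = context
        # Placeholder model and optimizer; a real trial would read
        # hyperparameters from the experiment configuration via the context.
        self.model = context.wrap_model(nn.Linear(10, 2))
        self.optimizer = context.wrap_optimizer(
            torch.optim.SGD(self.model.parameters(), lr=0.01)
        )

    def build_training_data_loader(self) -> DataLoader:
        ...  # return a determined.pytorch.DataLoader over the training data

    def build_validation_data_loader(self) -> DataLoader:
        ...  # return a determined.pytorch.DataLoader over the validation data

    def train_batch(self, batch, epoch_idx, batch_idx):
        data, labels = batch
        loss = nn.functional.cross_entropy(self.model(data), labels)
        self.context.backward(loss)
        self.context.step_optimizer(self.optimizer)
        return {"loss": loss}

    def evaluate_batch(self, batch):
        data, labels = batch
        loss = nn.functional.cross_entropy(self.model(data), labels)
        return {"validation_loss": loss}
```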
Running the Code
After loading the necessary code, each trial runs a series of workloads. Each workload is a discrete unit of work with a single purpose:

- training the model,
- taking a checkpoint of the model’s state, or
- validating the model’s performance.
After each workload, the trial communicates with the master to send back the results of the workload and obtain the next workload to run. Depending on the searcher in use, the results of validation workloads may affect what workloads are run in the future.
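Conceptually, each trial spends its life in a loop like the one sketched below. This is not the actual harness code; the names are invented stand-ins for the real internals, meant only to illustrate the report-results/fetch-next-workload cycle described above.

```python
# Conceptual sketch of a trial's workload loop (names are hypothetical).
def workload_loop(master, trial):
    results = None
    while True:
        # Report the previous results and ask the master what to do next.
        workload = master.next_workload(results)
        if workload is None:
            break  # the experiment is complete or the trial has been stopped
        if workload.kind == "TRAIN":
            results = trial.run_training(workload)
        elif workload.kind == "CHECKPOINT":
            results = trial.save_checkpoint(workload)
        elif workload.kind == "VALIDATE":
            results = trial.run_validation(workload)
```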
While running training or validation workloads, the trial may need to load data from an external source; the model code has to specify how to do that by defining data loaders.
Pausing and Activating
An important feature of Determined is the ability to stop a running trial and start it again later without losing any training progress. The scheduler might stop a trial to let a trial from another experiment run, but a user can also manually pause an experiment at any time, which causes all of its trials to stop.
Checkpoint workloads are essential to this ability. When a trial is scheduled to stop, it takes a checkpoint at the next available opportunity (i.e., once its current workload finishes) and then stops running, freeing up the slots it was using. When it resumes, either because more slots become available in the cluster or because a user activates the experiment, it loads the saved checkpoint, allowing it to continue training from exactly the state it had before.
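Put another way, pausing and activating boil down to a checkpoint-and-restore cycle. The sketch below is purely illustrative; every name in it is hypothetical, and the real mechanism is coordinated by the master and the harness.

```python
# Illustrative pause/resume cycle (all names are hypothetical).
def pause(trial, checkpoint_storage):
    trial.finish_current_workload()                    # let the in-flight workload complete
    checkpoint_storage.save(trial.take_checkpoint())   # capture model and training state
    trial.release_slots()                              # free the slots for other trials


def resume(trial, checkpoint_storage):
    trial.restore(checkpoint_storage.load_latest())    # reload the saved state
    # ...training continues from exactly where it left off.
```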