In this guide, we’ll go over some of what happens under the hood when a user submits an experiment to Determined. We’ll cover how the experiment gets to the master, how it is turned into one or more trials that run on the cluster, and what happens within each trial.


First, the CLI gathers all the necessary information about the experiment and sends it to the master. The det experiment create command requires two arguments: a configuration file and a context. The context must be a directory containing all of the Python code necessary to run the model.

We don’t allow the total size of the files in the context to exceed 95 MiB. As a result, datasets should typically not be included directly in the experiment definition; instead, users should set up data loaders to read data from an external source. Refer to preparing data for more suggestions on data loading.

If the context is valid, the CLI takes its contents, along with the configuration file, and sends them to the master over the network.

Trial Setup

Once the context and configuration for an experiment have reached the master, the experiment waits for the scheduler to assign slots to it. The master then creates trials to train the model. The searcher described by the experiment configuration defines a set of hyperparameter configurations, each of which corresponds to one trial.

When a trial is ready to begin running, the master communicates with the appropriate agent (or agents, in the case of distributed training), which creates Docker containers holding the user model code, along with the Determined harness. Determined supplies a set of default Docker container images that are appropriate for many deep learning tasks, but users can also supply a custom image if desired.

Importing the Model

From this point on, most of the work happens inside the Docker containers running on the agent machines. It’s also here that Determined begins to look at the model code in detail—beforehand, it was just sent around blindly.

Using the code in the context and the entrypoint specified in the model configuration, the harness finds the user-defined Python class that describes the model to be trained.

The user-provided class must be a subclass of a trial class included in Determined. Each trial class is designed to support one deep learning application framework; together, the classes provide a consistent interface so that models of any framework can interact with the master in the same way.

Running the Code

After loading the necessary code, each trial runs a series of workloads. Each workload is a discrete unit of work with one purpose related to training the model:

  • training the model,

  • taking a checkpoint of the model’s state, or

  • validating the model’s performance.

After each workload, the trial communicates with the master to send back the results of the workload and obtain the next workload to run. Depending on the searcher in use, the results of validation workloads may affect what workloads are run in the future.

While running training or validation workloads, the trial may need to load data from an external source; the model code has to specify how to do that by defining data loaders.

Pausing and Activating

An important feature of Determined is the ability to have trials stop running and then start again later without losing any training progress. The scheduler might choose to stop running a trial to allow a trial from another experiment to run, but a user can also manually pause an experiment at any time, which causes all of its trials to stop.

Checkpoint workloads are essential to this ability. After a trial is set to be stopped, it takes a checkpoint at the next available opportunity (i.e., once its current workload finishes running) and then stops running, freeing up the slots it was using. When it resumes running, either because more slots become available in the cluster or because a user activates the experiment, it loads the saved checkpoint, allowing it to continue training from the same state it had before.


What happens when an experiment is archived?

Archiving is designed to make it easier to organize experiments by omitting information about experiment runs that are no longer relevant (e.g., training jobs that failed with an error or jobs submitted as part of the model development process).

When an experiment is archived, it is hidden from the default view in both the WebUI and the CLI, but all of the metadata associated with the experiment (including checkpoints) is preserved. An experiment can subsequently be unarchived if desired, without losing any of the experiment’s metadata.

How can I delete model checkpoints that are no longer useful?

The best way to delete a checkpoint is to modify the garbage collection policy of the experiment that created the checkpoint. For example, to delete all of the experiments associated with an experiment, run:

det experiment set gc-policy --save-experiment-best 0 --save-trial-best 0 --save-trial-latest 0 <experiment-id>