Experiment Configuration

The behavior of an experiment can be configured via a YAML file. A configuration file is typically passed as a command-line argument when an experiment is created with the Determined CLI. For example:

det experiment create config-file.yaml model-directory

Top-Level Fields

Required Fields

entrypoint

The location of the trial class in a user’s model definition, given as an entrypoint specification string. The entrypoint specification is expected to take the form <module>:<object reference>. <module> specifies the module containing the trial class within the model definition, relative to the root. <object reference> specifies the name of the trial class within the module; it may be a nested object reference, delimited by dots. For more information and examples, please see Model Definitions.
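
For example, if a trial class named MyTrial (a hypothetical name) is defined in model_def.py at the root of the model definition directory, the entrypoint could be specified as:

entrypoint: model_def:MyTrial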

Optional Fields

description

A human-readable description of the experiment. This does not need to be unique.

labels

A list of label names (strings). Assigning labels to experiments allows you to identify experiments that share the same property or should be grouped together. You can add and remove labels using either the CLI (det experiment label) or the WebUI.
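
For example, a hypothetical pair of labels grouping experiments by framework and team might look like:

labels:
  - pytorch
  - vision-team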

data

This field can be used to specify information about how the experiment accesses and loads training data. The content and format of this field is user-defined: it should be used to specify whatever configuration is needed for loading data for use by the experiment’s model definition. For example, if your experiment loads data from Amazon S3, the data field might contain the S3 bucket name, object prefix, and AWS authentication credentials.
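
Continuing the S3 example above, a hypothetical data section might look like the following; keep in mind that the keys here are entirely user-defined and purely illustrative:

data:
  bucket: my-training-data
  prefix: imagenet/train/
  access_key: <your-access-key>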

min_validation_period

Instructs Determined to periodically compute validation metrics for each trial during training. If set, this variable specifies the maximum number of training steps that can be completed for a given trial since the last validation operation for that trial; if this limit is reached, a new validation operation is performed. Validation metrics can be computed more frequently than specified by this parameter, depending on the hyperparameter search method being used by the experiment.

min_checkpoint_period

Instructs Determined to take periodic checkpoints of each trial during training. If set, this variable specifies the maximum number of training steps that can be completed for a given trial since the last checkpoint of that trial; if this limit is reached, a checkpoint of the trial is taken. There are two other situations in which a trial might be checkpointed: (a) during training, a model may be checkpointed to allow the trial’s execution to be suspended and later resumed on a different Determined agent, and (b) when the trial’s experiment is completed, to allow the resulting model to be exported from Determined (e.g., for deployment).
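
For example, to guarantee that no more than 10 training steps elapse between validations and no more than 20 between checkpoints (the values here are purely illustrative):

min_validation_period: 10
min_checkpoint_period: 20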

checkpoint_policy

Controls how Determined performs checkpoints after validation operations, if at all. Should be set to one of the following values:

  • best (default): A checkpoint will be taken after every validation operation that performs better than all previous validations for this experiment. Validation metrics are compared according to the metric and smaller_is_better options in the searcher configuration.

  • all: A checkpoint will be taken after every validation operation, regardless of the validation performance.

  • none: A checkpoint will never be taken due to a validation operation. However, even with this policy selected, checkpoints are still taken in other situations: after the last training step of a trial, when required by cluster scheduling decisions, or due to min_checkpoint_period.

batches_per_step

The number of batches in a single training step. Determined divides the training of a single trial into a sequence of steps; each step corresponds to a certain number of model updates. This configuration parameter can therefore be used to control how long a trial is trained on a single agent:

  • Doing more work in a step allows per-step overheads (such as downloading training data) to be amortized over more training work. However, if the step size is too large, a single trial might be trained for a long time before Determined gets an opportunity to suspend training of that trial and replace it with a different workload.

  • The default value is 100. As a rule of thumb, the step size should be set so that training a single step takes 60–180 seconds.

The step size is defined as a fixed number of batches; the number of records in a batch is controlled by the global_batch_size hyperparameter.
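
As a concrete (and purely illustrative) example, combining batches_per_step with the global_batch_size hyperparameter determines the number of records processed per step:

batches_per_step: 50

hyperparameters:
  global_batch_size: 64   # each step processes 50 * 64 = 3,200 records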

max_restarts

The maximum number of times that trials in this experiment will be restarted due to an error. If an error occurs while a trial is running (e.g., a container crashes abruptly), the Determined master will automatically restart the trial and continue running it. This parameter specifies a limit on the number of times to try restarting a trial; this ensures that Determined does not go into an infinite loop if a trial encounters the same error repeatedly. Once max_restarts trial failures have occurred for a given experiment, subsequent failed trials will not be restarted – instead, they will be marked as errored. The experiment itself will continue running; an experiment is considered to complete successfully if at least one of its trials completes successfully. The default value is 5.

Checkpoint Storage

The checkpoint_storage section defines how model checkpoints will be stored. A checkpoint contains the architecture and weights of the model being trained. Determined currently supports four kinds of checkpoint storage: gcs, hdfs, s3, and shared_fs, identified by the type subfield. Additional fields may also be required, depending on the type of checkpoint storage in use. For example, to store checkpoints on Google Cloud Storage:

checkpoint_storage:
  type: gcs
  bucket: <your-bucket-name>

If this field is not specified, the experiment will default to the checkpoint storage configured in the Master Configuration.

When an experiment finishes, the system will optionally delete some checkpoints to reclaim space. The save_experiment_best, save_trial_best, and save_trial_latest parameters specify which checkpoints to save. See the documentation on Checkpoint Garbage Collection for more details.

Google Cloud Storage

If type: gcs is specified, checkpoints will be stored on Google Cloud Storage (GCS). Authentication is done using GCP’s “Application Default Credentials” approach. When using Determined inside Google Compute Engine (GCE), the simplest approach is to ensure that the VMs used by Determined are running in a service account that has the “Storage Object Admin” role on the GCS bucket being used for checkpoints. As an alternative (or when running outside of GCE), you can add the appropriate service account credentials to your container (e.g., via a bind-mount), and then set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the container path where the credentials are located. See Environment Variables for more details on how to set environment variables in containers.

The following fields are required when using GCS checkpoint storage:

bucket

The GCS bucket name to use.

HDFS

If type: hdfs is specified, checkpoints will be stored in HDFS using the WebHDFS API for reading and writing checkpoint resources.

Required Fields

hdfs_url

The hostname or IP address of the HDFS namenode, prefixed with the protocol and followed by the WebHDFS port on the namenode. Multiple namenodes are allowed as a semicolon-separated list (e.g., "http://namenode1:50070;http://namenode2:50070").

hdfs_path

The prefix path where all checkpoints will be written to and read from. The resources of each checkpoint will be saved in a subdirectory of hdfs_path, where the subdirectory name is the checkpoint’s UUID.

Optional Fields

user

The user name to use for all read and write requests. If not specified, this defaults to the user of the trial runner container.
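
Putting these fields together, a hypothetical HDFS configuration (the hostname, port, path, and user shown are illustrative) might look like:

checkpoint_storage:
  type: hdfs
  hdfs_url: http://namenode1:50070
  hdfs_path: /user/determined/checkpoints
  user: hdfs-user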

Amazon S3

If type: s3 is specified, checkpoints will be stored in Amazon S3 or an S3-compatible object store such as MinIO.

Required Fields

bucket

The S3 bucket name to use.

access_key

The AWS access key to use.

secret_key

The AWS secret key to use.

Optional Fields

endpoint_url

The endpoint to use when accessing an S3-compatible object store such as MinIO, e.g., http://127.0.0.1:8080/. If not specified, Amazon S3 itself will be used.
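
For example, a hypothetical S3 configuration might look like the following; the credentials are placeholders, and endpoint_url is only needed for an S3-compatible store:

checkpoint_storage:
  type: s3
  bucket: <your-bucket-name>
  access_key: <your-access-key>
  secret_key: <your-secret-key>
  endpoint_url: http://127.0.0.1:9000/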

Shared File System

If type: shared_fs is specified, checkpoints will be written to a directory on the agent’s file system. The assumption is that the system administrator has arranged for the same directory to be mounted at every agent machine, and for the content of this directory to be the same on all agent hosts (e.g., by using a distributed or network file system such as GlusterFS or NFS).

Warning

When downloading checkpoints from a shared file system (e.g., using det checkpoint download), we assume the same shared file system is mounted locally at the same host_path.

Required Fields

host_path

The file system path on each agent to use. This directory will be mounted to /determined_shared_fs inside the trial container.

Optional Fields

storage_path

The path where checkpoints will be written to and read from. Must be a subdirectory of the host_path or an absolute path containing the host_path. If not specified, checkpoints are written to and read from the host_path.

propagation

Propagation behavior for replicas of the bind-mount. Defaults to rprivate.
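
For example, a hypothetical shared file system configuration (the paths shown are illustrative) might look like:

checkpoint_storage:
  type: shared_fs
  host_path: /mnt/nfs/determined
  storage_path: checkpoints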

Hyperparameters

The hyperparameters section defines the hyperparameter space for the experiment. Which hyperparameters are appropriate for a given model is up to the user and depends on the nature of the model being trained. In Determined, it is common to specify hyperparameters that influence many aspects of the model’s behavior, including how data augmentation is done, the architecture of the neural network, and which optimizer to use, along with how that optimizer should be configured.

The value chosen for a hyperparameter in a given trial can be accessed via the trial context using context.get_hparam(). For instance, the current value of a hyperparameter named learning_rate can be accessed by context.get_hparam("learning_rate").

Note

Every experiment must specify a hyperparameter named global_batch_size. This is because this hyperparameter is treated specially: when doing distributed training, the global batch size must be known so that the per-worker batch size can be computed appropriately. Batch size per slot is computed at runtime, based on the number of slots used to train a single trial of this experiment (see resources.slots_per_trial). The updated values should be accessed via the trial context, using context.get_per_slot_batch_size() and context.get_global_batch_size().
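
For example, assuming 4 slots per trial (see resources.slots_per_trial) and a global batch size of 64 (the values here are illustrative), each worker would train on per-slot batches of 64 / 4 = 16 records:

resources:
  slots_per_trial: 4
hyperparameters:
  global_batch_size: 64   # context.get_per_slot_batch_size() returns 16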

The hyperparameter space is defined by a dictionary. Each key in the dictionary is the name of a hyperparameter; the associated value defines the range of the hyperparameter. If the value is a scalar, the hyperparameter is a constant; otherwise, the value should be a nested map. Here is an example:

hyperparameters:
  global_batch_size: 64
  optimizer:
    type: categorical
    vals:
      - SGD
      - Adam
      - RMSprop
  layer1_dropout:
    type: double
    minval: 0.2
    maxval: 0.5
  learning_rate:
    type: log
    minval: -5.0
    maxval: 1.0
    base: 10.0

This configuration defines four hyperparameters: global_batch_size, optimizer, layer1_dropout, and learning_rate. global_batch_size is set to a constant value; the other hyperparameters can take on a range of possible values. A hyperparameter’s range is configured by the type field of the map; it must be one of categorical, double, int, or log. More details on these types are given below.

Categorical

A categorical hyperparameter ranges over a set of specified values. The possible values are defined by the vals key. vals is a list; each element of the list can be of any valid YAML type, such as a boolean, a string, a number, or a collection.

Double

A double hyperparameter is a floating point variable. The minimum and maximum values of the variable are defined by the minval and maxval keys, respectively.

When doing a grid search, the count key can also be specified; this defines the number of points in the grid for this hyperparameter. Grid points are evenly spaced between minval and maxval. See Hyperparameter Search: Grid for details.

Integer

An int hyperparameter is an integer variable. The minimum and maximum values of the variable are defined by the minval and maxval keys, respectively.

When doing a grid search, the count key can also be specified; this defines the number of points in the grid for this hyperparameter. Grid points are evenly spaced between minval and maxval. See Hyperparameter Search: Grid for details.

Log

A log hyperparameter is a floating point variable that is searched on a logarithmic scale. The base of the logarithm is specified by the base field; the minimum and maximum exponent values of the hyperparameter are given by the minval and maxval fields, respectively.

When doing a grid search, the count key can also be specified; this defines the number of points in the grid for this hyperparameter. Grid points are evenly spaced between minval and maxval. See Hyperparameter Search: Grid for details.
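
To make the exponent semantics concrete: the learning_rate hyperparameter in the earlier example uses base 10.0 with minval -5.0 and maxval 1.0, so sampled values fall between 10^-5 = 0.00001 and 10^1 = 10.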

Searcher

The searcher section defines how the experiment’s hyperparameter space will be explored. To run an experiment that trains a single trial with fixed hyperparameters, use the single searcher and specify constant values for the model’s hyperparameters. Otherwise, Determined supports six different hyperparameter search algorithms: random, grid, adaptive_simple, adaptive, adaptive_asha, and pbt.

The name of the hyperparameter search algorithm to use is configured via the name field; the remaining fields configure the behavior of the searcher and depend on the searcher being used. For example, to configure a random hyperparameter search that trains 5 trials for 10 steps each:

searcher:
  name: random
  metric: accuracy
  max_trials: 5
  max_steps: 10

For details on using Determined to perform hyperparameter search, refer to Hyperparameter Tuning. For more information on the search methods supported by Determined, refer to Hyperparameter Tuning With Determined.

Single

The single search method does not really perform a hyperparameter search at all; rather, it trains a single trial for a fixed number of steps. When using this search method, all of the hyperparameters specified in the hyperparameters section must be constants. By default, validation metrics are only computed once, after the specified number of training steps have been completed; min_validation_period can be used to specify that validation metrics should be computed more frequently.

Required Fields

metric

The name of the validation metric used to evaluate the performance of a hyperparameter configuration.

max_steps

The number of steps to train the model for.

Optional Fields

smaller_is_better

Whether to minimize or maximize the metric defined above. The default value is true (minimize).

source_trial_id

If specified, the weights of this trial will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is inconsistent with the model architecture of this experiment.

source_checkpoint_uuid

Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
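
For example, a minimal single configuration (assuming the model reports a validation metric named validation_loss) might look like:

searcher:
  name: single
  metric: validation_loss
  max_steps: 100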

Random

The random search method implements a simple random search. The user specifies how many hyperparameter configurations should be trained and how long each configuration should be trained for; the configurations are sampled randomly from the hyperparameter space. Each trial is trained for the specified number of steps and then validation metrics are computed. min_validation_period can be used to specify that validation metrics should be computed more frequently.

Required Fields

metric

The name of the validation metric used to evaluate the performance of a hyperparameter configuration.

max_trials

The number of trials, i.e., hyperparameter configurations, to evaluate.

max_steps

The number of steps to train each trial for.

Optional Fields

smaller_is_better

Whether to minimize or maximize the metric defined above. The default value is true (minimize).

source_trial_id

If specified, the weights of every trial in the search will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is incompatible with the model architecture of any of the trials in this experiment.

source_checkpoint_uuid

Like source_trial_id but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.

Grid

The grid search method performs a grid search. The coordinates of the hyperparameter grid are specified via the hyperparameters field. For more details, see Hyperparameter Search: Grid.

Required Fields

metric

The name of the validation metric used to evaluate the performance of a hyperparameter configuration.

Optional Fields

smaller_is_better

Whether to minimize or maximize the metric defined above. The default value is true (minimize).

source_trial_id

If specified, the weights of this trial will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is inconsistent with the model architecture of this experiment.

source_checkpoint_uuid

Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
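
For example, a hypothetical grid configuration that maximizes an accuracy metric might look like the following; the grid itself is determined by the count fields in the hyperparameters section:

searcher:
  name: grid
  metric: accuracy
  smaller_is_better: false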

Adaptive

The adaptive search method is a theoretically principled and empirically state-of-the-art method that adaptively allocates resources to promising hyperparameter configurations while quickly eliminating poor ones.

Required Fields

metric

The name of the validation metric used to evaluate the performance of a hyperparameter configuration.

target_trial_steps

The maximum number of training steps to allocate to any one trial. The vast majority of trials will be stopped early, and thus only a small fraction of trials will actually be trained for this number of steps. We recommend setting this to a multiple of divisor^(max_rungs-1), which is 4^(5-1) = 256 with the default values.

step_budget

The total number of steps to allocate across all trials. We recommend setting this to a multiple of target_trial_steps; the ratio step_budget / target_trial_steps can then be interpreted as the effective number of complete trials to evaluate. Note that some trials might be in progress when this budget is exhausted; adaptive search will allocate some additional steps to complete these in-progress trials.

Optional Fields

smaller_is_better

Whether to minimize or maximize the metric defined above. The default value is true (minimize).

mode

How aggressively to perform early stopping. There are three modes: aggressive, standard, and conservative; the default is standard.

These modes differ in the degree to which early-stopping is used. In aggressive mode, the searcher quickly stops underperforming trials, which enables the searcher to explore more hyperparameter configurations, but at the risk of discarding a configuration too soon. On the other end of the spectrum, conservative mode performs significantly less downsampling, but as a consequence does not explore as many configurations given the same budget. We recommend using either aggressive or standard mode.

divisor

Determines the fraction of trials to keep at each rung (1/divisor), and also how many steps are allocated at each rung. The default setting is 4; only advanced users should consider changing this value.

max_rungs

The maximum number of times we evaluate intermediate results for a trial and terminate poorly performing trials. The default value is 5; only advanced users should consider changing this value.

source_trial_id

If specified, the weights of every trial in the search will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is inconsistent with the model architecture of any of the trials in this experiment.

source_checkpoint_uuid

Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
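
Putting these fields together, a hypothetical adaptive configuration (the metric name and budget are illustrative) might look like:

searcher:
  name: adaptive
  metric: validation_error
  target_trial_steps: 256   # divisor^(max_rungs-1) = 4^4 with the defaults
  step_budget: 2560         # the effective budget of 10 complete trials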

Adaptive (Simple)

The adaptive_simple search method is a simpler interface to the adaptive search method described above; it is easier to configure for most applications of hyperparameter search.

Required Fields

metric

The name of the validation metric used to evaluate the performance of a hyperparameter configuration.

max_steps

The maximum number of training steps to allocate to any one trial. The vast majority of trials will be stopped early, and thus only a small fraction of trials will actually be trained for this number of steps. This quantity is domain-specific and should roughly reflect the number of training steps needed for the model to converge on the data set.

max_trials

The number of trials, i.e., hyperparameter configurations, to evaluate.

Optional Fields

smaller_is_better

Whether to minimize or maximize the metric defined above. The default value is true (minimize).

mode

How aggressively to perform early stopping. There are three modes: aggressive, standard, and conservative; the default is standard.

These modes differ in the degree to which early-stopping is used. In aggressive mode, the searcher quickly stops underperforming trials, which enables the searcher to explore more hyperparameter configurations, but at the risk of discarding a configuration too soon. On the other end of the spectrum, conservative mode performs significantly less downsampling, but as a consequence does not explore as many configurations given the same budget. We recommend using either aggressive or standard mode.

divisor

Determines the fraction of trials to keep at each rung (1/divisor), and also how many steps are allocated at each rung. The default setting is 4; only advanced users should consider changing this value.

max_rungs

The maximum number of times we evaluate intermediate results for a trial and terminate poorly performing trials. The default value is 5; only advanced users should consider changing this value.

source_trial_id

If specified, the weights of every trial in the search will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is inconsistent with the model architecture of any of the trials in this experiment.

source_checkpoint_uuid

Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
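
For example, a hypothetical adaptive_simple configuration (all values are illustrative) might look like:

searcher:
  name: adaptive_simple
  metric: validation_loss
  max_steps: 100
  max_trials: 64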

Adaptive (ASHA)

The adaptive_asha search method is an asynchronous version of the adaptive search method described above; it is more suitable for large experiments with many trials.

Required Fields

metric

The name of the validation metric used to evaluate the performance of a hyperparameter configuration.

target_trial_steps

The maximum number of training steps to allocate to any one trial. The vast majority of trials will be stopped early, and thus only a small fraction of trials will actually be trained for this number of steps. This quantity is domain-specific and should roughly reflect the number of training steps needed for the model to converge on the data set.

max_trials

The number of trials, i.e., hyperparameter configurations, to evaluate.

max_concurrent_trials

The maximum number of trials that can be worked on simultaneously. This is akin to controlling the degree of parallelism of the experiment.

Optional Fields

smaller_is_better

Whether to minimize or maximize the metric defined above. The default value is true (minimize).

mode

How aggressively to perform early stopping. There are three modes: aggressive, standard, and conservative; the default is standard.

These modes differ in the degree to which early-stopping is used. In aggressive mode, the searcher quickly stops underperforming trials, which enables the searcher to explore more hyperparameter configurations, but at the risk of discarding a configuration too soon. On the other end of the spectrum, conservative mode performs significantly less downsampling, but as a consequence does not explore as many configurations given the same budget. We recommend using either aggressive or standard mode.

divisor

Determines the fraction of trials to keep at each rung (1/divisor), and also how many steps are allocated at each rung. The default setting is 4; only advanced users should consider changing this value.

max_rungs

The maximum number of times we evaluate intermediate results for a trial and terminate poorly performing trials. The default value is 5; only advanced users should consider changing this value.

source_trial_id

If specified, the weights of every trial in the search will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is inconsistent with the model architecture of any of the trials in this experiment.

source_checkpoint_uuid

Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
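
For example, a hypothetical adaptive_asha configuration for a large search (all values are illustrative) might look like:

searcher:
  name: adaptive_asha
  metric: validation_loss
  target_trial_steps: 256
  max_trials: 500
  max_concurrent_trials: 16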

PBT

The pbt search method uses population-based training, which maintains a population of active trials to train. After each trial has been trained for a certain number of steps, all the trials are validated. The searcher then closes some trials and replaces them with altered copies of other trials. This process makes up one “round”; the searcher runs some number of rounds to execute a complete search. The model definition class must be able to restore from a checkpoint that was created with a different set of hyperparameters; in particular, you will not be able to vary any hyperparameters that change the sizes of weight matrices without taking special steps to save or restore models.

Required Fields

metric

Specifies the name of the validation metric used to evaluate the performance of a hyperparameter configuration.

population_size

The number of trials (i.e., different hyperparameter configurations) to keep active at a time.

steps_per_round

The number of steps to train each trial between validations.

num_rounds

The total number of rounds to execute.

replace_function

How to choose which trials to close and which trials to copy at the end of each round. At present, only a single replacement function is supported:

truncate_fraction

Defines truncation selection: the worst-performing truncate_fraction × population_size trials, ranked by validation metric, are closed, and the same number of top-performing trials are copied to replace them.

explore_function

How to alter a set of hyperparameters when a copy of a trial is made. Each parameter is either resampled (i.e., its value is chosen from the configured distribution) or perturbed (i.e., its value is computed based on the value in the original set). explore_function has two required sub-fields:

resample_probability

The probability that a parameter is replaced with a new value sampled from the original distribution specified in the configuration.

perturb_factor

The amount by which parameters that are not resampled are perturbed. Each numerical hyperparameter is multiplied by either 1 + perturb_factor or 1 - perturb_factor with equal probability; categorical and const hyperparameters are left unchanged.

Optional Fields

smaller_is_better

Whether to minimize or maximize the metric defined above. The default value is true (minimize).
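
Putting the PBT fields together, a hypothetical configuration might look like the following; all values are illustrative, and the layout assumes that replace_function and explore_function take their sub-fields as nested keys:

searcher:
  name: pbt
  metric: validation_error
  population_size: 20
  steps_per_round: 10
  num_rounds: 40
  replace_function:
    truncate_fraction: 0.2
  explore_function:
    resample_probability: 0.25
    perturb_factor: 0.2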

Resources

The resources section defines the resources that an experiment is allowed to use.

Optional Fields

max_slots

The maximum number of scheduler slots that this experiment is allowed to use at any one time. The slot limit of an active experiment can be changed using det experiment set max-slots <id> <slots>. By default, there is no limit on the number of slots an experiment can use.

Warning

max_slots is only considered when scheduling jobs; it is not currently used when provisioning dynamic agents. This means that we may provision more instances than the experiment can schedule.

weight

The weight of this experiment in the scheduler. When multiple experiments are running at the same time, the number of slots assigned to each experiment will be approximately proportional to its weight. The weight of an active experiment can be changed using det experiment set weight <id> <weight>. The default weight is 1.

slots_per_trial

The number of slots to use for each trial of this experiment. The default value is 1; specifying a value greater than 1 means that multiple GPUs will be used in parallel. Training on multiple GPUs is done using data parallelism. Configuring slots_per_trial to be greater than max_slots is not sensible and will result in an error.

Note

Using slots_per_trial to enable data parallel training for PyTorch can alter the behavior of certain models, as described in the PyTorch documentation.

shm_size

The size in bytes of /dev/shm for trial containers. Defaults to 4294967296 (4 GiB). If set, this value overrides the value specified in the master configuration.
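
For example, a hypothetical resources section for 8-GPU data-parallel trials, capped at 32 slots overall, might look like:

resources:
  slots_per_trial: 8
  max_slots: 32
  shm_size: 17179869184   # 16 GiB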

Bind Mounts

The bind_mounts section specifies directories that are bind-mounted into every container launched for this experiment. Bind mounts are often used to enable trial containers to access additional data that is not part of the model definition directory.

This field should consist of an array of entries; each entry has the form described below. Users must ensure that the specified host paths are accessible on all agent hosts (e.g., by configuring a network file system appropriately).

For each bind mount, the following fields are required:

host_path

The file system path on each agent to use. Must be an absolute filepath.

container_path

The file system path in the container to use. May be a relative filepath, in which case it will be mounted relative to the working directory inside the container. Mounting directly into the working directory (i.e., container_path == ".") is not allowed, to reduce the risk of cluttering the host filesystem.

For each bind mount, the following optional fields may also be specified:

read_only

Whether the bind-mount should be a read-only mount. Defaults to false.

propagation

Propagation behavior for replicas of the bind-mount. Defaults to rprivate.
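
For example, a hypothetical bind_mounts section exposing a read-only dataset directory (the paths shown are illustrative) might look like:

bind_mounts:
  - host_path: /mnt/nfs/datasets
    container_path: /data
    read_only: true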

Environment

The environment section defines properties of the container environment that is used to execute workloads for this experiment. For more information on customizing the trial environment, refer to Environment Configuration.

Optional Fields

image

The Docker image to use when executing the workload. This image must be accessible via docker pull to every Determined agent machine in the cluster. Users can specify different container images for GPU and CPU agents by providing a dict with two keys, cpu and gpu. Default values:

  • determinedai/environments:py-3.6.9-pytorch-1.4-tf-1.15-cpu-0.5.0 for CPU agents.

  • determinedai/environments:cuda-10.0-pytorch-1.4-tf-1.15-gpu-0.5.0 for GPU agents.

force_pull_image

Forcibly pull the image from the Docker registry and bypass the Docker cache. Defaults to false.

registry_auth

The Docker registry credentials to use when pulling a custom base Docker image, if needed. Credentials are specified as the following nested fields:

  • username (required)

  • password (required)

  • server (optional)

  • email (optional)

environment_variables

A list of environment variables that will be set in every trial container. Each element of the list should be a string of the form NAME=VALUE. See Environment Variables for more details. Users can set different environment variables for GPU and CPU agents by specifying a dict with two keys, cpu and gpu.
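
For example, a hypothetical environment section might look like the following; the images shown are the defaults listed above, and the environment variable is illustrative:

environment:
  image:
    cpu: determinedai/environments:py-3.6.9-pytorch-1.4-tf-1.15-cpu-0.5.0
    gpu: determinedai/environments:cuda-10.0-pytorch-1.4-tf-1.15-gpu-0.5.0
  environment_variables:
    - DATA_DIR=/data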

Optimizations

The optimizations section contains configuration options that influence the performance of the experiment.

Optional Fields

aggregation_frequency

The number of batches processed between gradient exchanges during Distributed Training. Defaults to 1.

average_aggregated_gradients

Whether gradients accumulated across batches (when aggregation_frequency > 1) should be divided by the aggregation_frequency. Defaults to true.

average_training_metrics

For multi-GPU training, whether to average the training metrics across GPUs instead of only using metrics from the chief GPU. This impacts the metrics shown in the Determined UI and TensorBoard, but does not impact the outcome of training or hyperparameter search. This option is currently only supported in PyTorch. Defaults to false.

gradient_compression

Whether to compress gradients when they are exchanged during Distributed Training. Compression may alter gradient values to achieve better space reduction. Defaults to false.

mixed_precision

Whether to use mixed precision training with PyTorch during Distributed Training. Setting O1 enables mixed precision and loss scaling. Defaults to O0 which disables mixed precision training. This configuration setting is deprecated; users are advised to call context.configure_apex_amp in the constructor of their trial class instead.

tensor_fusion_threshold

The threshold in MB for batching together gradients that are exchanged during Distributed Training. Defaults to 64.

tensor_fusion_cycle_time

The delay (in milliseconds) between each tensor fusion during Distributed Training. Defaults to 5.

auto_tune_tensor_fusion

When enabled, configures tensor_fusion_threshold and tensor_fusion_cycle_time automatically. Defaults to false.
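
For example, a hypothetical optimizations section that aggregates gradients over 4 batches and lets Determined tune tensor fusion automatically might look like:

optimizations:
  aggregation_frequency: 4
  auto_tune_tensor_fusion: true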

Reproducibility

The reproducibility section specifies configuration options related to reproducible experiments. See Reproducibility for more details.

Optional Fields

experiment_seed

The random seed to use to initialize random number generators for all trials in this experiment. Must be an integer between 0 and 2^31 - 1. If an experiment_seed is not explicitly specified, the master will automatically generate an experiment seed.

Data Layer

The data_layer section specifies configuration options related to the Data Layer. Determined currently supports three types of storage for the data_layer: s3, gcs, and shared_fs, identified by the type subfield. Defaults to shared_fs.

Shared File System

If type: shared_fs is specified, the cache will be stored in a directory on an agent’s file system.

Optional Fields

host_storage_path

The file system path on each agent to use.

container_storage_path

The file system path to use as the mount point in the trial runner container.
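
For example, a hypothetical data_layer section using a shared file system cache (the paths shown are illustrative) might look like:

data_layer:
  type: shared_fs
  host_storage_path: /tmp/determined-cache
  container_storage_path: /determined-cache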

Amazon S3

If type: s3 is specified, the cache will be stored on Amazon S3 or an S3-compatible object store such as MinIO.

Required Fields

bucket

The S3 bucket name to use.

bucket_directory_path

The path in the S3 bucket to store the cache.

Optional Fields

local_cache_host_path

The file system path to store a local copy of the cache, which is synchronized with the S3 cache.

local_cache_container_path

The file system path to use as the mount point in the trial runner container for storing the local cache.

access_key

The AWS access key to use.

secret_key

The AWS secret key to use.

endpoint_url

The endpoint to use when accessing an S3-compatible object store such as MinIO, e.g., http://127.0.0.1:8080/.

Google Cloud Storage

If type: gcs is specified, the cache will be stored on Google Cloud Storage (GCS). Authentication is done using GCP’s “Application Default Credentials” approach. When using Determined inside Google Compute Engine (GCE), the simplest approach is to ensure that the VMs used by Determined are running in a service account that has the “Storage Object Admin” role on the GCS bucket being used for the cache. As an alternative (or when running outside of GCE), you can add the appropriate service account credentials to your container (e.g., via a bind-mount), and then set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the container path where the credentials are located. See Environment Variables for more details on how to set environment variables in containers.

Required Fields

bucket

The GCS bucket name to use.

bucket_directory_path

The path in the GCS bucket where the cache will be stored.

Optional Fields

local_cache_host_path

The file system path to store a local copy of the cache, which is synchronized with the GCS cache.

local_cache_container_path

The file system path to use as the mount point in the trial runner container for storing the local cache.