Experiment Configuration¶
The behavior of an experiment can be configured via a YAML file. A configuration file is typically passed as a command-line argument when an experiment is created with the Determined CLI. For example:
det experiment create config-file.yaml model-directory
Top-Level Fields¶
Required Fields
entrypoint
The location of the trial class in a user’s model definition, given as an entrypoint specification string. The entrypoint specification is expected to take the form <module>:<object reference>.
<module> specifies the module containing the trial class within the model definition, relative to the root.
<object reference> specifies the name of the trial class within the module. It may be a nested object delimited by dots. For more information and examples, please see Model Definitions.
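As an illustration, here is a minimal sketch of an entrypoint setting. The module name model_def and the class name MyTrial are hypothetical; the sketch assumes a file model_def.py at the root of the model definition that defines a trial class called MyTrial.

```yaml
# Hypothetical names: model_def.py sits at the root of the model definition
# and defines a trial class named MyTrial.
entrypoint: model_def:MyTrial
```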
Optional Fields
description
A human-readable description of the experiment. This does not need to be unique.
labels
A list of label names (strings). Assigning labels to experiments allows you to identify experiments that share the same property or should be grouped together. You can add and remove labels using either the CLI (det experiment label) or the WebUI.
data
This field can be used to specify information about how the experiment accesses and loads training data. The content and format of this field are user-defined: it should be used to specify whatever configuration is needed for loading data for use by the experiment’s model definition. For example, if your experiment loads data from Amazon S3, the data field might contain the S3 bucket name, object prefix, and AWS authentication credentials.
min_validation_period
Instructs Determined to periodically compute validation metrics for each trial during training. If set, this variable specifies the maximum length, in terms of records, batches, or epochs (see Training Units), that a given trial can be trained without a validation operation; if this limit is reached, a new validation operation is performed. Validation metrics can be computed more frequently than specified by this parameter, depending on the hyperparameter search method being used by the experiment.
perform_initial_validation
Instructs Determined to perform an initial validation before any training begins, for each trial. This can be useful to determine a baseline when fine-tuning a model on a new dataset.
min_checkpoint_period
Instructs Determined to take periodic checkpoints of each trial during training. If set, this variable specifies the maximum length that a trial can be trained without a checkpoint, in terms of records, batches, or epochs (see Training Units); if this limit is reached, a checkpoint of the trial is taken. There are three other situations in which a trial might be checkpointed:
During training, a model may be checkpointed to allow the trial’s execution to be suspended and later resumed on a different Determined agent.
When the trial’s experiment is completed, to allow the resulting model to be exported from Determined (e.g., for deployment).
Before the search method makes a decision based on a validation of a trial, to maintain consistency in the event of a failure.
checkpoint_policy
Controls how Determined performs checkpoints after validation operations, if at all. Should be set to one of the following values:
best (default): A checkpoint will be taken after every validation operation that performs better than all previous validations for this experiment. Validation metrics are compared according to the metric and smaller_is_better options in the searcher configuration.
all: A checkpoint will be taken after every validation, no matter the validation performance.
none: A checkpoint will never be taken due to a validation. However, even with this policy selected, checkpoints are still expected to be taken after the trial is finished training, due to cluster scheduling decisions, before search method decisions, or due to min_checkpoint_period.
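The validation and checkpoint settings described above might be combined as in the following sketch. The values are illustrative, and the nested batches form is an assumption modeled on the max_length example in the Searcher section; see Training Units for the authoritative syntax.

```yaml
min_validation_period:
  batches: 1000              # validate at least every 1000 batches (illustrative)
perform_initial_validation: true
min_checkpoint_period:
  batches: 5000              # checkpoint at least every 5000 batches (illustrative)
checkpoint_policy: best      # the default; also accepts all or none
```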
scheduling_unit
The number of batches in a single training workload. Determined divides the training of a single trial into a sequence of training workloads; each workload corresponds to a certain number of model updates. Therefore, this configuration parameter can be used to control how long a trial is trained on a single agent:
Training longer per workload allows per-workload overheads to be amortized over more training work. However, if the size of a single workload is too large, a trial might be trained for a long time before Determined gets an opportunity to suspend training of that trial and replace it with a different workload.
The default value is 100. As a rule of thumb, the training workload size should be set so that a single workload takes 60–180 seconds.
This field is defined as a fixed number of batches; the number of records in a batch is controlled by the global_batch_size hyperparameter.
records_per_epoch
The number of records in the training data set. This is optional; it must be configured if you want to specify other fields (e.g., min_validation_period) in units of epochs. Determined does not attempt to determine the size of an epoch automatically, because the size of the training set might vary based on data augmentation, changes to external storage, or other factors. See Training Units for details.
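A short sketch of how records_per_epoch enables epoch-based units elsewhere in the configuration. The data set size is made up, and the nested epochs form is an assumption based on the batches form shown in the Searcher section.

```yaml
records_per_epoch: 50000     # illustrative training set size
min_validation_period:
  epochs: 1                  # assumed syntax; only meaningful once records_per_epoch is set
```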
max_restarts
The maximum number of times that trials in this experiment will be restarted due to an error. If an error occurs while a trial is running (e.g., a container crashes abruptly), the Determined master will automatically restart the trial and continue running it. This parameter specifies a limit on the number of times to try restarting a trial; this ensures that Determined does not go into an infinite loop if a trial encounters the same error repeatedly. Once max_restarts trial failures have occurred for a given experiment, subsequent failed trials will not be restarted; instead, they will be marked as errored. The experiment itself will continue running; an experiment is considered to complete successfully if at least one of its trials completes successfully. The default value is 5.
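As a rough sketch of tuning the workload size and restart limit: if one batch takes about 0.5 seconds (an assumed figure, not a measurement), 200 batches per workload would take about 100 seconds, which falls inside the 60 to 180 second rule of thumb for scheduling_unit.

```yaml
scheduling_unit: 200   # batches per training workload (default: 100)
max_restarts: 3        # illustrative; the default is 5
```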
Checkpoint Storage¶
The checkpoint_storage section defines how model checkpoints will be stored. A checkpoint contains the architecture and weights of the model being trained. Determined currently supports four kinds of checkpoint storage, gcs, hdfs, s3, and shared_fs, identified by the type subfield. Additional fields may also be required, depending on the type of checkpoint storage in use. For example, to store checkpoints on Google Cloud Storage:
checkpoint_storage:
  type: gcs
  bucket: <your-bucket-name>
If this field is not specified, the experiment will default to the checkpoint storage configured in the Master Configuration.
When an experiment finishes, the system will optionally delete some checkpoints to reclaim space. The save_experiment_best, save_trial_best, and save_trial_latest parameters specify which checkpoints to save. See the documentation on Checkpoint Garbage Collection for more details.
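A sketch of how the retention parameters might sit alongside the storage settings. The values are illustrative, and placing the three save_* parameters directly under checkpoint_storage is an assumption; Checkpoint Garbage Collection describes their exact semantics.

```yaml
checkpoint_storage:
  type: gcs
  bucket: <your-bucket-name>
  # Assumed placement and illustrative values.
  save_experiment_best: 0
  save_trial_best: 1
  save_trial_latest: 1
```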
Google Cloud Storage¶
If type: gcs is specified, checkpoints will be stored on Google Cloud Storage (GCS). Authentication is done using GCP’s “Application Default Credentials” approach. When using Determined inside Google Compute Engine (GCE), the simplest approach is to ensure that the VMs used by Determined are running in a service account that has the “Storage Object Admin” role on the GCS bucket being used for checkpoints. As an alternative (or when running outside of GCE), you can add the appropriate service account credentials to your container (e.g., via a bind-mount), and then set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the container path where the credentials are located. See Environment Variables for more details on how to set environment variables in containers.
The following fields are required when using GCS checkpoint storage:
bucket
The GCS bucket name to use.
HDFS¶
If type: hdfs is specified, checkpoints will be stored in HDFS using the WebHDFS API for reading and writing checkpoint resources.
Required Fields
hdfs_url
Hostname or IP address of the HDFS namenode, prefixed with the protocol and followed by the WebHDFS port on the namenode. Multiple namenodes are allowed as a semicolon-separated list (e.g., "http://namenode1:50070;http://namenode2:50070").
hdfs_path
The prefix path where all checkpoints will be written to and read from. The resources of each checkpoint will be saved in a subdirectory of hdfs_path, where the subdirectory name is the checkpoint’s UUID.
Optional Fields
user
The user name to use for all read and write requests. If not specified, this defaults to the user of the trial runner container.
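Putting the HDFS fields together, a configuration might look like the sketch below. The namenode URLs come from the example above; the path and user name are hypothetical.

```yaml
checkpoint_storage:
  type: hdfs
  hdfs_url: "http://namenode1:50070;http://namenode2:50070"
  hdfs_path: /determined/checkpoints   # hypothetical prefix path
  user: determined                     # optional; hypothetical user name
```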
Amazon S3¶
If type: s3 is specified, checkpoints will be stored in Amazon S3 or an S3-compatible object store such as MinIO.
Required Fields
bucket
The S3 bucket name to use.
access_key
The AWS access key to use.
secret_key
The AWS secret key to use.
Optional Fields
endpoint_url
The endpoint to use for S3 clones, e.g., http://127.0.0.1:8080/. If not specified, Amazon S3 will be used.
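A sketch of an S3 configuration pointing at an S3-compatible store. The bucket name is hypothetical and the credentials are placeholders; dropping endpoint_url would target Amazon S3 itself.

```yaml
checkpoint_storage:
  type: s3
  bucket: my-checkpoints                 # hypothetical bucket name
  access_key: <your-access-key>
  secret_key: <your-secret-key>
  endpoint_url: http://127.0.0.1:8080/   # optional; omit to use Amazon S3
```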
Hyperparameters¶
The hyperparameters section defines the hyperparameter space for the experiment. Which hyperparameters are appropriate for a given model is up to the user and depends on the nature of the model being trained. In Determined, it is common to specify hyperparameters that influence many aspects of the model’s behavior, including how data augmentation is done, the architecture of the neural network, and which optimizer to use, along with how that optimizer should be configured.
The value chosen for a hyperparameter in a given trial can be accessed via the trial context using context.get_hparam(). For instance, the current value of a hyperparameter named learning_rate can be accessed by context.get_hparam("learning_rate").
Note
Every experiment must specify a hyperparameter named global_batch_size. This is because this hyperparameter is treated specially: when doing distributed training, the global batch size must be known so that the per-worker batch size can be computed appropriately. Batch size per slot is computed at runtime, based on the number of slots used to train a single trial of this experiment (see resources.slots_per_trial). The updated values should be accessed via the trial context, using context.get_per_slot_batch_size() and context.get_global_batch_size().
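To make the relationship concrete, consider the sketch below. It assumes the global batch is divided evenly across slots, in which case each slot would process 64 / 4 = 16 records per batch; the exact runtime computation is not spelled out here, so treat the arithmetic as an assumption.

```yaml
hyperparameters:
  global_batch_size: 64
resources:
  slots_per_trial: 4
# Assumed even split: per-slot batch size would be 64 / 4 = 16.
```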
The hyperparameter space is defined by a dictionary. Each key in the dictionary is the name of a hyperparameter; the associated value defines the range of the hyperparameter. If the value is a scalar, the hyperparameter is a constant; otherwise, the value should be a nested map. Here is an example:
hyperparameters:
  global_batch_size: 64
  optimizer:
    type: categorical
    vals:
      - SGD
      - Adam
      - RMSprop
  layer1_dropout:
    type: double
    minval: 0.2
    maxval: 0.5
  learning_rate:
    type: log
    minval: -5.0
    maxval: 1.0
    base: 10.0
This configuration defines four hyperparameters: global_batch_size, optimizer, layer1_dropout, and learning_rate. global_batch_size is set to a constant value; the other hyperparameters can take on a range of possible values. A hyperparameter’s range is configured by the type field of the map; it must be one of categorical, double, int, or log. More details on these types are given below.
Categorical¶
A categorical hyperparameter ranges over a set of specified values. The possible values are defined by the vals key. vals is a list; each element of the list can be of any valid YAML type, such as a boolean, a string, a number, or a collection.
Double¶
A double hyperparameter is a floating point variable. The minimum and maximum values of the variable are defined by the minval and maxval keys, respectively (inclusive of endpoints).
When doing a grid search, the count key can also be specified; this defines the number of points in the grid for this hyperparameter. Grid points are evenly spaced between minval and maxval. See Hyperparameter Search: Grid for details.
Integer¶
An int hyperparameter is an integer variable. The minimum and maximum values of the variable are defined by the minval and maxval keys, respectively (inclusive of endpoints).
When doing a grid search, the count key can also be specified; this defines the number of points in the grid for this hyperparameter. Grid points are evenly spaced between minval and maxval. See Hyperparameter Search: Grid for details.
Log¶
A log hyperparameter is a floating point variable that is searched on a logarithmic scale. The base of the logarithm is specified by the base field; the minimum and maximum exponent values of the hyperparameter are given by the minval and maxval fields, respectively (inclusive of endpoints).
When doing a grid search, the count key can also be specified; this defines the number of points in the grid for this hyperparameter. Grid points are evenly spaced between minval and maxval. See Hyperparameter Search: Grid for details.
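The sketch below shows the count key on each numeric type. The hyperparameter num_layers is hypothetical, and the grid points listed in the comments assume that the grid includes both endpoints and, for log, that points are evenly spaced in exponent space.

```yaml
hyperparameters:
  global_batch_size: 64
  layer1_dropout:
    type: double
    minval: 0.2
    maxval: 0.5
    count: 4        # assumed grid: 0.2, 0.3, 0.4, 0.5
  num_layers:       # hypothetical hyperparameter name
    type: int
    minval: 1
    maxval: 4
    count: 4        # assumed grid: 1, 2, 3, 4
  learning_rate:
    type: log
    base: 10.0
    minval: -4.0
    maxval: -1.0
    count: 4        # assumed exponents -4, -3, -2, -1, i.e. 1e-4 through 1e-1
```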
Searcher¶
The searcher section defines how the experiment’s hyperparameter space will be explored. To run an experiment that trains a single trial with fixed hyperparameters, specify the single searcher and specify constant values for the model’s hyperparameters. Otherwise, Determined supports six different hyperparameter search algorithms: random, grid, adaptive_asha, adaptive_simple, adaptive, and pbt.
The name of the hyperparameter search algorithm to use is configured via the name field; the remaining fields configure the behavior of the searcher and depend on the searcher being used. For example, to configure a random hyperparameter search that trains 5 trials for 1000 batches each:
searcher:
  name: random
  metric: accuracy
  max_trials: 5
  max_length:
    batches: 1000
For details on using Determined to perform hyperparameter search, refer to Hyperparameter Tuning. For more information on the search methods supported by Determined, refer to Hyperparameter Tuning With Determined.
Single¶
The single search method does not perform a hyperparameter search at all; rather, it trains a single trial for a fixed length. When using this search method, all of the hyperparameters specified in the hyperparameters section must be constants. By default, validation metrics are only computed once, after the specified length of training has been completed; min_validation_period can be used to specify that validation metrics should be computed more frequently.
Required Fields
metric
The name of the validation metric used to evaluate the performance of a hyperparameter configuration.
max_length
The length of the trial, in terms of records, batches, or epochs (see Training Units).
Optional Fields
smaller_is_better
Whether to minimize or maximize the metric defined above. The default value is true (minimize).
source_trial_id
If specified, the weights of this trial will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is inconsistent with the model architecture of this experiment.
source_checkpoint_uuid
Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
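A minimal sketch of a single-trial experiment. The metric name validation_loss is hypothetical and must match a metric that the trial actually reports; all hyperparameters are constants, as this searcher requires.

```yaml
searcher:
  name: single
  metric: validation_loss      # hypothetical metric name
  smaller_is_better: true
  max_length:
    batches: 1000
hyperparameters:
  global_batch_size: 64
  learning_rate: 0.001         # constants only for the single searcher
```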
Random¶
The random search method implements a simple random search. The user specifies how many hyperparameter configurations should be trained and how long each configuration should be trained for; the configurations are sampled randomly from the hyperparameter space. Each trial is trained for the specified length and then validation metrics are computed. min_validation_period can be used to specify that validation metrics should be computed more frequently.
Required Fields
metric
The name of the validation metric used to evaluate the performance of a hyperparameter configuration.
max_trials
The number of trials, i.e., hyperparameter configurations, to evaluate.
max_length
The length to train each trial, in terms of records, batches, or epochs (see Training Units).
Optional Fields
smaller_is_better
Whether to minimize or maximize the metric defined above. The default value is true (minimize).
source_trial_id
If specified, the weights of every trial in the search will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is incompatible with the model architecture of any of the trials in this experiment.
source_checkpoint_uuid
Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
Grid¶
The grid search method performs a grid search. The coordinates of the hyperparameter grid are specified via the hyperparameters field. For more details, see Hyperparameter Search: Grid.
Required Fields
metric
The name of the validation metric used to evaluate the performance of a hyperparameter configuration.
max_length
The length to train each trial, in terms of records, batches, or epochs (see Training Units).
Optional Fields
smaller_is_better
Whether to minimize or maximize the metric defined above. The default value is true (minimize).
source_trial_id
If specified, the weights of every trial in the search will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is inconsistent with the model architecture of any of the trials in this experiment.
source_checkpoint_uuid
Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
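A sketch of a grid search over two hyperparameters. The metric name is hypothetical. Assuming the grid is the Cartesian product of the per-hyperparameter points, this configuration would produce 3 x 4 = 12 trials.

```yaml
searcher:
  name: grid
  metric: validation_loss      # hypothetical metric name
  max_length:
    batches: 1000
hyperparameters:
  global_batch_size: 64
  optimizer:
    type: categorical
    vals:
      - SGD
      - Adam
      - RMSprop                # 3 values
  layer1_dropout:
    type: double
    minval: 0.2
    maxval: 0.5
    count: 4                   # 4 grid points
```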
Adaptive (ASHA)¶
The adaptive_asha search method employs the same underlying algorithm as the adaptive method below, but it uses an asynchronous version of successive halving (ASHA), which is more suitable for large-scale experiments with hundreds or thousands of trials.
Required Fields
metric
The name of the validation metric used to evaluate the performance of a hyperparameter configuration.
max_length
The maximum training length of any one trial, in terms of records, batches, or epochs (see Training Units). The vast majority of trials will be stopped early, and thus only a small fraction of trials will actually be trained for this long. This quantity is domain-specific and should roughly reflect the length of training needed for the model to converge on the data set.
max_trials
The number of trials, i.e., hyperparameter configurations, to evaluate.
Optional Fields
smaller_is_better
Whether to minimize or maximize the metric defined above. The default value is true (minimize).
mode
How aggressively to perform early stopping. There are three modes: aggressive, standard, and conservative; the default is standard.
These modes differ in the degree to which early stopping is used. In aggressive mode, the searcher quickly stops underperforming trials, which enables the searcher to explore more hyperparameter configurations, but at the risk of discarding a configuration too soon. On the other end of the spectrum, conservative mode performs significantly less downsampling, but as a consequence does not explore as many configurations given the same budget. We recommend using either aggressive or standard mode.
divisor
Controls the fraction of trials to keep at each rung and also determines the training length for each rung. The default setting is 4; only advanced users should consider changing this value.
max_rungs
The maximum number of times we evaluate intermediate results for a trial and terminate poorly performing trials. The default value is 5; only advanced users should consider changing this value.
max_concurrent_trials
The maximum number of trials that can be worked on simultaneously. The default value is 0, in which case a reasonable value is chosen based on max_trials and the number of rungs in the brackets. This is akin to controlling the degree of parallelism of the experiment.
source_trial_id
If specified, the weights of every trial in the search will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is inconsistent with the model architecture of any of the trials in this experiment.
source_checkpoint_uuid
Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
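A sketch of an adaptive_asha configuration that sets only the required fields plus mode. The metric name and the specific lengths are illustrative; divisor, max_rungs, and max_concurrent_trials are left at their defaults.

```yaml
searcher:
  name: adaptive_asha
  metric: validation_loss      # hypothetical metric name
  smaller_is_better: true
  max_length:
    batches: 16000             # illustrative; roughly the length needed to converge
  max_trials: 64               # illustrative
  mode: standard               # the default
```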
Adaptive (Simple)¶
Warning
Adaptive (Simple) is deprecated and will be removed in a future release. We recommend using the state-of-the-art Adaptive (ASHA) searcher.
The adaptive_simple search method is a simpler interface to the adaptive search method described below. adaptive_simple is designed to be simpler to configure for most applications of hyperparameter search.
Required Fields
metric
The name of the validation metric used to evaluate the performance of a hyperparameter configuration.
max_length
The maximum training length of any one trial, in terms of records, batches, or epochs (see Training Units). The vast majority of trials will be stopped early, and thus only a small fraction of trials will actually be trained for this long. This quantity is domain-specific and should roughly reflect the length of training needed for the model to converge on the data set.
max_trials
The number of trials, i.e., hyperparameter configurations, to evaluate.
Optional Fields
smaller_is_better
Whether to minimize or maximize the metric defined above. The default value is true (minimize).
mode
How aggressively to perform early stopping. There are three modes: aggressive, standard, and conservative; the default is standard.
These modes differ in the degree to which early stopping is used. In aggressive mode, the searcher quickly stops underperforming trials, which enables the searcher to explore more hyperparameter configurations, but at the risk of discarding a configuration too soon. On the other end of the spectrum, conservative mode performs significantly less downsampling, but as a consequence does not explore as many configurations given the same budget. We recommend using either aggressive or standard mode.
divisor
Controls the fraction of trials to keep at each rung and also determines the training length for each rung. The default setting is 4; only advanced users should consider changing this value.
max_rungs
The maximum number of times we evaluate intermediate results for a trial and terminate poorly performing trials. The default value is 5; only advanced users should consider changing this value.
source_trial_id
If specified, the weights of every trial in the search will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is inconsistent with the model architecture of any of the trials in this experiment.
source_checkpoint_uuid
Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
Adaptive (Advanced)¶
Warning
Adaptive (Advanced) is deprecated and will be removed in a future release. We recommend using the state-of-the-art Adaptive (ASHA) searcher.
The adaptive search method is a theoretically principled and empirically state-of-the-art method that adaptively allocates resources to promising hyperparameter configurations while quickly eliminating poor ones.
Required Fields
metric
The name of the validation metric used to evaluate the performance of a hyperparameter configuration.
max_length
The maximum training length of any one trial, in terms of records, batches, or epochs (see Training Units). The vast majority of trials will be stopped early, and thus only a small fraction of trials will actually be trained for this long. We suggest setting this to a multiple of divisor^(max_rungs-1), which is 4^(5-1) = 256 with the default values.
budget
The total training length across all trials, in terms of the same unit (see Training Units) as max_length. We suggest setting this to be a multiple of max_length, which implies interpreting this subfield as the effective number of complete trials to evaluate. Note that some trials might be in-progress when this budget is exhausted; adaptive search will allow these to complete.
Optional Fields
smaller_is_better
Whether to minimize or maximize the metric defined above. The default value is true (minimize).
mode
How aggressively to perform early stopping. There are three modes: aggressive, standard, and conservative; the default is standard.
These modes differ in the degree to which early stopping is used. In aggressive mode, the searcher quickly stops underperforming trials, which enables the searcher to explore more hyperparameter configurations, but at the risk of discarding a configuration too soon. On the other end of the spectrum, conservative mode performs significantly less downsampling, but as a consequence does not explore as many configurations given the same budget. We recommend using either aggressive or standard mode.
divisor
Controls the fraction of trials to keep at each rung and also determines the training length for each rung. The default setting is 4; only advanced users should consider changing this value.
max_rungs
The maximum number of times we evaluate intermediate results for a trial and terminate poorly performing trials. The default value is 5; only advanced users should consider changing this value.
source_trial_id
If specified, the weights of every trial in the search will be initialized to the most recent checkpoint of the given trial ID. This will fail if the source trial’s model architecture is inconsistent with the model architecture of any of the trials in this experiment.
source_checkpoint_uuid
Like source_trial_id, but specifies an arbitrary checkpoint from which to initialize weights. At most one of source_trial_id or source_checkpoint_uuid should be set.
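A sketch of the sizing suggestions for the adaptive searcher with the default divisor and max_rungs. The metric name is hypothetical, and writing budget in the same nested form as max_length is an assumption. Here max_length is 10 x 256 = 2560 batches, and budget is 20 x 2560 = 51200 batches, i.e. the equivalent of 20 complete trials.

```yaml
searcher:
  name: adaptive
  metric: validation_loss      # hypothetical metric name
  max_length:
    batches: 2560              # 10 x 256, a multiple of divisor^(max_rungs-1)
  budget:
    batches: 51200             # 20 x max_length (assumed nested form)
```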
PBT¶
The pbt search method uses population-based training, which maintains a population of active trials to train. After each trial has been trained the length specified by length_per_round, all the trials are validated. The searcher then closes some trials and replaces them with altered copies of other trials. This process makes up one “round”; the searcher runs some number of rounds to execute a complete search. The model definition class must be able to restore from a checkpoint that was created with a different set of hyperparameters; in particular, you will not be able to vary any hyperparameters that change the sizes of weight matrices without taking special steps to save or restore models.
Required Fields
metric
Specifies the name of the validation metric used to evaluate the performance of a hyperparameter configuration.
population_size
The number of trials (i.e., different hyperparameter configurations) to keep active at a time.
length_per_round
The length to train each trial during a round, in terms of records, batches or epochs (see Training Units).
num_rounds
The total number of rounds to execute.
replace_function
How to choose which trials to close and which trials to copy at the end of each round. At present, only a single replacement function is supported:
truncate_fraction
Defines truncation selection, in which the worst truncate_fraction (multiplied by the population size) trials, ranked by validation metric, are closed and the same number of top trials are copied.
explore_function
How to alter a set of hyperparameters when a copy of a trial is made. Each parameter is either resampled (i.e., its value is chosen from the configured distribution) or perturbed (i.e., its value is computed based on the value in the original set). explore_function has two required sub-fields:
resample_probability
The probability that a parameter is replaced with a new value sampled from the original distribution specified in the configuration.
perturb_factor
The amount by which parameters that are not resampled are perturbed. Each numerical hyperparameter is multiplied by either 1 + perturb_factor or 1 - perturb_factor with equal probability; categorical and const hyperparameters are left unchanged.
Optional Fields
smaller_is_better
Whether to minimize or maximize the metric defined above. The default value is true (minimize).
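A sketch of a pbt configuration. The metric name and all numeric values are illustrative, and the nesting of truncate_fraction under replace_function follows the field descriptions above. With these numbers, each round would close the worst 0.2 x 20 = 4 trials and copy the top 4, and each perturbed numerical hyperparameter would be multiplied by 1.2 or 0.8.

```yaml
searcher:
  name: pbt
  metric: validation_loss        # hypothetical metric name
  population_size: 20
  length_per_round:
    batches: 1000
  num_rounds: 10
  replace_function:
    truncate_fraction: 0.2       # worst 4 of 20 closed, top 4 copied
  explore_function:
    resample_probability: 0.2
    perturb_factor: 0.2          # multiply by 1.2 or 0.8 with equal probability
```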
Resources¶
The resources section defines the resources that an experiment is allowed to use.
Optional Fields
slots_per_trial
The number of slots to use for each trial of this experiment. The default value is 1; specifying a value greater than 1 means that multiple GPUs will be used in parallel. Training on multiple GPUs is done using data parallelism. Configuring slots_per_trial to be greater than max_slots is not sensible and will result in an error.
Note
Using slots_per_trial to enable data parallel training for PyTorch can alter the behavior of certain models, as described in the PyTorch documentation.
agent_label
If set, tasks launched for this experiment will only be scheduled on agents that have the given label set. If this is not set (the default behavior), tasks launched for this experiment will only be scheduled on unlabeled agents. An agent’s label can be configured via the label field in the agent configuration.
max_slots
The maximum number of scheduler slots that this experiment is allowed to use at any one time. The slot limit of an active experiment can be changed using det experiment set max-slots <id> <slots>. By default, there is no limit on the number of slots an experiment can use.
Warning
max_slots is only considered when scheduling jobs; it is not currently used when provisioning dynamic agents. This means that we may provision more instances than the experiment can schedule.
weight
The weight of this experiment in the scheduler. When multiple experiments are running at the same time, the number of slots assigned to each experiment will be approximately proportional to its weight. The weight of an active experiment can be changed using det experiment set weight <id> <weight>. The default weight is 1.
shm_size
The size in bytes of /dev/shm for trial containers. Defaults to 4294967296 (4GiB). If set, this value overrides the value specified in the master configuration.
priority
The priority assigned to this experiment. A smaller priority value indicates higher priority. Only applicable when using the priority scheduler.
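A sketch of a resources section. The agent label is hypothetical and the numeric values are illustrative; slots_per_trial is kept at or below max_slots, since exceeding it results in an error.

```yaml
resources:
  slots_per_trial: 2
  max_slots: 8               # must not be smaller than slots_per_trial
  weight: 2                  # relative to other experiments (default: 1)
  shm_size: 4294967296       # 4GiB, the stated default
  agent_label: gpu-pool      # hypothetical label; agents must carry the same label
```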
Bind Mounts¶
The bind_mounts section specifies directories that are bind-mounted into every container launched for this experiment. Bind mounts are often used to enable trial containers to access additional data that is not part of the model definition directory.
This field should consist of an array of entries; each entry has the form described below. Users must ensure that the specified host paths are accessible on all agent hosts (e.g., by configuring a network file system appropriately).
For each bind mount, the following fields are required:
host_path
The file system path on each agent to use. Must be an absolute filepath.
container_path
The file system path in the container to use. May be a relative filepath, in which case it will be mounted relative to the working directory inside the container. It is not allowed to mount directly into the working directory (i.e., container_path == ".") to reduce the risk of cluttering the host filesystem.
For each bind mount, the following optional fields may also be specified:
read_only
Whether the bind-mount should be a read-only mount. Defaults to false.
propagation
Propagation behavior for replicas of the bind-mount. Defaults to rprivate.
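A sketch of a bind_mounts section with a single read-only entry. Both paths are hypothetical; the host path must exist on every agent.

```yaml
bind_mounts:
  - host_path: /data/datasets    # hypothetical absolute path on each agent
    container_path: /data        # hypothetical path inside the container
    read_only: true
```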
Environment¶
The environment section defines properties of the container environment that is used to execute workloads for this experiment. For more information on customizing the trial environment, refer to Environment Configuration.
Optional Fields
image
The Docker image to use when executing the workload. This image must be accessible via docker pull to every Determined agent machine in the cluster. Users can configure different container images for GPU vs. CPU agents by specifying a dict with two keys, cpu and gpu. Default values:
determinedai/environments:cuda-10.0-pytorch-1.4-tf-1.15-gpu-0.8.0 for agents with GPUs.
determinedai/environments:py-3.6.9-pytorch-1.4-tf-1.15-cpu-0.8.0 for agents with only CPUs.
force_pull_image
Forcibly pull the image from the Docker registry, bypassing the Docker cache. Defaults to
false
.
registry_auth
The Docker registry credentials to use when pulling a custom base Docker image, if needed. Credentials are specified as the following nested fields:
username (required)
password (required)
server (optional)
email (optional)
environment_variables
A list of environment variables that will be set in every trial container. Each element of the list should be a string of the form
NAME=VALUE
. See Environment Variables for more details. Users can customize environment variables for GPU vs. CPU agents differently by specifying a dict with two keys, cpu and gpu.
pod_spec
Only applicable when running Determined on Kubernetes. Applies a pod spec to the pods that are launched by Determined for this task. See Specifying Custom Pod Specs for details.
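A minimal sketch of an environment section that combines several of the fields above. The registry host, image names, credentials, and variable name are hypothetical placeholders chosen for illustration; the field names are the ones described in this section.

```yaml
environment:
  # Dict form: a separate image for CPU-only agents and for GPU agents.
  # Both image names are hypothetical.
  image:
    cpu: registry.example.com/team/trial-env:cpu
    gpu: registry.example.com/team/trial-env:gpu
  force_pull_image: false        # the documented default
  registry_auth:
    username: example-user       # required (placeholder)
    password: example-password   # required (placeholder)
    server: registry.example.com # optional (placeholder)
  environment_variables:
    - EXAMPLE_LOG_LEVEL=debug    # each entry has the form NAME=VALUE
```

If image is given as a single string instead of a dict, the same image would apply to both kinds of agents.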
Optimizations¶
The optimizations
section contains configuration options that
influence the performance of the experiment.
Optional Fields
aggregation_frequency
Specifies after how many batches gradients are exchanged during Distributed Training. Defaults to
1
.
average_aggregated_gradients
Whether gradients accumulated across batches (when aggregation_frequency > 1) should be divided by aggregation_frequency. Defaults to true.
average_training_metrics
For multi-GPU training, whether to average the training metrics across GPUs instead of only using metrics from the chief GPU. This impacts the metrics shown in the Determined UI and TensorBoard, but does not impact the outcome of training or hyperparameter search. This option is currently only supported in PyTorch. Defaults to
false
.
gradient_compression
Whether to compress gradients when they are exchanged during Distributed Training. Compression may alter gradient values to achieve better space reduction. Defaults to
false
.
mixed_precision
Whether to use mixed precision training with PyTorch during Distributed Training. Setting O1 enables mixed precision and loss scaling. Defaults to O0, which disables mixed precision training. This configuration setting is deprecated; users are advised to call context.configure_apex_amp in the constructor of their trial class instead.
tensor_fusion_threshold
The threshold in MB for batching together gradients that are exchanged during Distributed Training. Defaults to
64
.
tensor_fusion_cycle_time
The delay (in milliseconds) between each tensor fusion during Distributed Training. Defaults to
5
.
auto_tune_tensor_fusion
When enabled, configures tensor_fusion_threshold and tensor_fusion_cycle_time automatically. Defaults to false.
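For reference, the sketch below writes out every field in this section with the default value stated above, so it should behave the same as omitting the optimizations section entirely. It is meant only as a starting point for overriding individual settings.

```yaml
optimizations:
  aggregation_frequency: 1            # exchange gradients after every batch
  average_aggregated_gradients: true  # only matters when aggregation_frequency > 1
  average_training_metrics: false
  gradient_compression: false
  mixed_precision: O0                 # letter O followed by zero; deprecated setting
  tensor_fusion_threshold: 64         # MB
  tensor_fusion_cycle_time: 5         # milliseconds
  auto_tune_tensor_fusion: false
```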
Reproducibility¶
The reproducibility
section specifies configuration options related
to reproducible experiments. See Reproducibility for more
details.
Optional Fields
experiment_seed
The random seed to use to initialize random number generators for all trials in this experiment. Must be an integer between 0 and 2^31 - 1. If an
experiment_seed
is not explicitly specified, the master will automatically generate an experiment seed.
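A short sketch of pinning the seed explicitly; the value below is arbitrary and was chosen only for illustration.

```yaml
reproducibility:
  # Any integer from 0 to 2147483647 (2^31 - 1) is accepted.
  experiment_seed: 1234
```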
Data Layer¶
The data_layer
section specifies configuration options related to
the Data Layer. Determined currently supports three types of
storage for the data_layer
: s3
, gcs
, and shared_fs
,
identified by the type
subfield. Defaults to shared_fs
.
Shared File System¶
If type: shared_fs
is specified, the cache will be stored in a
directory on an agent’s file system.
Optional Fields
host_storage_path
The file system path on each agent to use.
container_storage_path
The file system path to use as the mount point in the trial runner container.
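A minimal sketch of a shared file system data layer configuration. Both paths are hypothetical placeholders.

```yaml
data_layer:
  type: shared_fs
  host_storage_path: /mnt/determined-data-cache   # hypothetical path on each agent
  container_storage_path: /determined-data-cache  # hypothetical mount point in the container
```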
Amazon S3¶
If type: s3
is specified, the cache will be stored on Amazon S3 or
an S3-compatible object store such as MinIO.
Required Fields
bucket
The S3 bucket name to use.
bucket_directory_path
The path in the S3 bucket to store the cache.
Optional Fields
local_cache_host_path
The file system path to store a local copy of the cache, which is synchronized with the S3 cache.
local_cache_container_path
The file system path to use as the mount point in the trial runner container for storing the local cache.
access_key
The AWS access key to use.
secret_key
The AWS secret key to use.
endpoint_url
The endpoint to use for S3 clones, e.g.,
http://127.0.0.1:8080/
.
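The following sketch shows how the S3 fields above might fit together when targeting an S3-compatible store. The bucket name, paths, and credentials are hypothetical placeholders; the endpoint is the example URL given above.

```yaml
data_layer:
  type: s3
  bucket: example-cache-bucket                # required (placeholder name)
  bucket_directory_path: data-layer-cache     # required (placeholder path)
  local_cache_host_path: /mnt/s3-local-cache  # optional (placeholder)
  local_cache_container_path: /s3-local-cache # optional (placeholder)
  access_key: EXAMPLE_ACCESS_KEY              # optional (placeholder)
  secret_key: EXAMPLE_SECRET_KEY              # optional (placeholder)
  endpoint_url: http://127.0.0.1:8080/        # optional; for S3-compatible stores
```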
Google Cloud Storage¶
If type: gcs
is specified, the cache will be stored on Google Cloud
Storage (GCS). Authentication is done using GCP’s “Application Default
Credentials”
approach. When using Determined inside Google Compute Engine (GCE), the
simplest approach is to ensure that the VMs used by Determined are
running in a service account that has the “Storage Object Admin” role on
the GCS bucket being used for checkpoints. As an alternative (or when
running outside of GCE), you can add the appropriate service account
credentials
to your container (e.g., via a bind-mount), and then set the
GOOGLE_APPLICATION_CREDENTIALS
environment variable to the container
path where the credentials are located. See Environment Variables
for more details on how to set environment variables in containers.
Required Fields
bucket
The GCS bucket name to use.
bucket_directory_path
The path in the GCS bucket to store the cache.
Optional Fields
local_cache_host_path
The file system path to store a local copy of the cache, which is synchronized with the GCS cache.
local_cache_container_path
The file system path to use as the mount point in the trial runner container for storing the local cache.
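As a sketch of the alternative authentication route described above, the fragment below bind-mounts a directory holding service account credentials and points GOOGLE_APPLICATION_CREDENTIALS at the file's path inside the container. The bucket name, directory paths, and credentials file name are hypothetical placeholders.

```yaml
data_layer:
  type: gcs
  bucket: example-cache-bucket             # required (placeholder name)
  bucket_directory_path: data-layer-cache  # required (placeholder path)

bind_mounts:
  # Hypothetical host directory containing the service account key file.
  - host_path: /etc/determined/gcp
    container_path: /run/determined/gcp
    read_only: true

environment:
  environment_variables:
    # Must be the path as seen from inside the container.
    - GOOGLE_APPLICATION_CREDENTIALS=/run/determined/gcp/credentials.json
```

When the agents run on GCE under a service account with the appropriate role, the bind_mounts and environment entries should not be needed.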
Training Units¶
Some configuration settings, such as searcher training lengths and
budgets, min_validation_period
, and min_checkpoint_period
, can
be expressed in terms of a few training units: records, batches, or
epochs. A record is a single labeled example. A batch is a group of
records (the number of records in a batch is configured via the
global_batch_size
hyperparameter). An epoch is a single copy of
the entire training data set; the number of records in an epoch is
configured via the records_per_epoch
configuration field.
For example, to specify the max_length
for a searcher in terms of
batches, the configuration would read as shown below.
max_length:
  batches: 900
To express it in terms of records or epochs, records
or epochs
would be specified in place of batches
. In the case of epochs,
records_per_epoch must also be
specified. Below is an example that configures a single
searcher to
train a model for 64 epochs.
records_per_epoch: 50000
searcher:
  name: single
  metric: validation_error
  max_length:
    epochs: 64
  smaller_is_better: true
The epoch size configured here is only used for interpreting configuration fields that are expressed in epochs. Actual epoch boundaries are still determined by the dataset itself (specifically, the end of an epoch occurs when the training data loader runs out of records).
For a more detailed look at training units in Determined, check out the topic guide.
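To make the relationship between units concrete, the sketch below expresses the same training length as the 64-epoch example in records instead: 64 epochs at 50000 records per epoch is 64 × 50000 = 3200000 records. It also sets the two period fields in different units; that different fields may use different units within one configuration is an assumption here, not something stated above.

```yaml
records_per_epoch: 50000   # needed because min_validation_period uses epochs
min_validation_period:
  epochs: 1
min_checkpoint_period:
  batches: 1000            # illustrative value
searcher:
  name: single
  metric: validation_error
  max_length:
    records: 3200000       # 64 epochs x 50000 records per epoch
  smaller_is_better: true
```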
Migration from v0.12.13 to v0.13.0¶
In v0.13.0, many configuration settings were renamed and updated to be configured in terms of the training units records, batches, and epochs instead of steps.
This migration guide describes the steps to migrate your experiment configurations from v0.12.13 to v0.13.0 while maintaining nearly identical behavior.
Warning
Before migrating, make sure to cancel or kill all experiments in the
ACTIVE
or PAUSED
state, as they will not be able to resume on
the new version of the Determined master. Also, we recommend taking a
database snapshot and archiving any old experiments ahead of time.
The table below describes the fields whose name or value has changed.
Old name | New name | Value changed
---|---|---
searcher.step_budget | searcher.budget | true
searcher.max_steps | searcher.max_length | true
searcher.target_trial_steps | searcher.max_length | true
searcher.steps_per_round | searcher.length_per_round | true
min_validation_period | min_validation_period | true
min_checkpoint_period | min_checkpoint_period | true
batches_per_step | scheduling_unit | false
For each configuration setting in the table, if it exists in your
configuration, it should be renamed as shown above and changed to use
the new units. To migrate the value to the new units, it should be
changed from an integer to a map, {batches: value}
, where value is
the old value multiplied by the old value for batches_per_step
. If
batches_per_step
was not specified, use the default value of 100.
For example, to migrate this snippet of a single searcher experiment
configuration:
batches_per_step: 100
min_validation_period: 1
min_checkpoint_period: 2
searcher:
  name: single
  metric: accuracy
  max_trials: 5
  max_steps: 10
We replace max_steps
by max_length
and change the value to
{batches: 100*10}
, change the value for min_validation_period
to
{batches: 100*1}
, and change the value for min_checkpoint_period
to {batches: 100*2}
.
scheduling_unit: 100
min_validation_period:
  batches: 100
min_checkpoint_period:
  batches: 200
searcher:
  name: single
  metric: accuracy
  max_trials: 5
  max_length:
    batches: 1000
Finally, we also rename batches_per_step
to scheduling_unit
,
leaving the value the same.
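The same rule applies when batches_per_step was never set in the old configuration. The sketch below, with illustrative values, assumes a v0.12.13 configuration that omitted batches_per_step, so each old value is multiplied by the default of 100.

```yaml
# v0.12.13 (before), with batches_per_step omitted so the default of 100 applies:
#   min_validation_period: 4
#   min_checkpoint_period: 8

# v0.13.0 (after): each old value multiplied by 100 and wrapped in a batches map.
min_validation_period:
  batches: 400   # 4 x 100
min_checkpoint_period:
  batches: 800   # 8 x 100
```

Since there was no batches_per_step entry to rename, no scheduling_unit entry is added in this case.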
To see how a full experiment configuration for one of our official examples changed, please compare this 0.12.13 CIFAR10 PyTorch adaptive experiment configuration to the 0.13.0 version here that uses epoch training units.