Cluster Configuration
For both the master and the agent, configuration options may be set using a configuration file, environment variables, or command-line options. (Although values from different sources will be merged, we generally recommend sticking to a single source for each service to keep things simple.)
Each option in the configuration file corresponds directly to an environment variable or command-line option. For example, the master configuration file might contain

    db:
      host: the-db-host

to configure the host where the PostgreSQL database is running. The same thing could be done by starting the master with the DET_DB_HOST=the-db-host environment variable or the --db-host the-db-host command-line option.
In the rest of this document, we will refer to options using their names in the configuration file. Periods (.) will be used to indicate nested options; for example, the option above would be indicated by db.host.
Common Options
Master Port
By default, the master listens on TCP port 8080. This can be configured via the http_port option.
Security
The master is capable of serving over HTTPS in addition to HTTP. Doing so requires a TLS private key and certificate; to configure them, set the options security.tls.cert and security.tls.key to paths to a PEM-encoded TLS certificate and private key, respectively. The https_port option determines the HTTPS listening port (default 8443).
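As a minimal sketch of the port and TLS options described above, the master configuration file might contain something like the following. The file paths are hypothetical placeholders, not defaults; only the option names and the default port numbers come from this document.

```yaml
# Sketch of the master's HTTP/HTTPS settings.
# The certificate and key paths below are assumed placeholders.
http_port: 8080    # default HTTP listening port
https_port: 8443   # default HTTPS listening port
security:
  tls:
    cert: /etc/determined/tls/master.crt   # PEM-encoded certificate (hypothetical path)
    key: /etc/determined/tls/master.key    # PEM-encoded private key (hypothetical path)
```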
Configuring Trial Runner Networking
The master is capable of selecting the network interface that trial runners will use to communicate when performing distributed (multi-machine) training. The network interface can be configured by setting the task_container_defaults.dtrain_network_interface option. If left unspecified, which is the default setting, Determined will auto-discover a common network interface shared by the trial runners.
Note
For Distributed and Parallel Training, Determined automatically detects a common network interface shared by the agent machines. If your cluster has multiple common network interfaces, please specify the fastest one.
Additionally, the ports used by the GLOO and NCCL libraries, which are used during distributed (multi-machine) training, can be configured to fall within user-defined ranges. If left unspecified, ports will be chosen randomly from the unprivileged port range (1024-65535).
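The networking options above could be set in the master configuration file roughly as follows. This is a sketch: the interface name eth11 is the example used elsewhere in this document, and the port ranges are arbitrary illustrative values in the MIN:MAX format, not recommendations.

```yaml
# Sketch of distributed-training networking settings.
# The port ranges are assumed example values; choose ranges that are open
# between your agent machines.
task_container_defaults:
  dtrain_network_interface: eth11   # omit to let Determined auto-discover
  nccl_port_range: "12350:12360"    # quoted so YAML keeps it as a string
  gloo_port_range: "12361:12371"
```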
Default Checkpoint Storage
See Checkpoint Storage Configuration for details.
Telemetry
By default, the master and WebUI collect anonymous information about how Determined is being used. This usage information is collected so that we can improve the design of the product. Determined does not report information that can be used to identify individual users of the product, nor does it include model source code, model architecture/checkpoints, training datasets, training and validation metrics, logs, or hyperparameter values.
The information we collect from the master periodically includes:
a unique, randomly generated ID for the current database and for the current instance of the master
the version of Determined
the version of Go that was used to compile the master
the number of registered users
the number of experiments that have been created
the total number of trials across all experiments
the number of active, paused, completed, and canceled experiments
We also record when the following events happen:
an experiment is created
an experiment’s state changes
an agent connects or disconnects
a user is created (the username is not transmitted)
When an experiment is created, we report:
the searcher and resources sections of the experiment config
the name of the container image used
the total number of hyperparameters
the value of the batches_per_step configuration setting
When an experiment terminates, we report:
the number of trials in the experiment
the total number of steps across all trials in the experiment
the total elapsed time for all steps across all trials in the experiment
The information we collect from the WebUI includes:
pages that are visited
errors that occur (both network errors and uncaught exceptions)
user-triggered actions
To disable telemetry reporting in both the master and the WebUI, start the master with the --telemetry-enabled=false flag (this can also be done by editing the master config file or setting an environment variable, as with any other configuration option). Disabling telemetry reporting will not affect the functionality of Determined in any way.
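For instance, the configuration-file equivalent of the flag above would be the telemetry.enabled option, sketched here:

```yaml
# Disable telemetry reporting for both the master and the WebUI.
telemetry:
  enabled: false
```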
Master Configuration
A cluster configuration is a YAML file that controls the behavior of Determined as a whole. It is normally located at /etc/determined/master.yaml on the master, and it is read when the master starts. It may contain the following fields:
scheduler: Specifies how Determined schedules tasks to agents.
  fit: The scheduling policy to use when assigning tasks to agents in the cluster.
    best: The best-fit policy ensures that tasks will be preferentially “packed” together on the smallest number of agents.
    worst: The worst-fit policy ensures that tasks will be placed on under-utilized agents.
task_container_defaults: Specifies Docker defaults for all task containers. A task represents a single schedulable unit, like a trial, command, or TensorBoard.
  shm_size_bytes: The size (in bytes) of /dev/shm for Determined task containers. Defaults to 4294967296 (4 GiB).
  network_mode: The Docker network to use for the Determined task containers. If this is set to “host”, Docker host-mode networking will be used instead. Defaults to “bridge”.
  dtrain_network_interface: The network interface to use during Distributed and Parallel Training. If not set, Determined automatically determines the network interface.
    When training a model with multiple machines, the host network interface used by each machine must have the same interface name across machines. This is usually determined automatically, but there may be issues if an interface name is common to all machines but is not routable between them. Determined already filters out common interfaces like lo and docker0, but agent machines may have others.
    If interface detection is not finding the appropriate interface, the dtrain_network_interface option can be used to set it explicitly (e.g., eth11).
  nccl_port_range: The range of ports that NCCL is permitted to use during distributed training. A valid port range is in the format MIN:MAX.
  gloo_port_range: The range of ports that GLOO is permitted to use during distributed training. A valid port range is in the format MIN:MAX.
root: Specifies the root directory of the state files. Defaults to /usr/share/determined/master.
provisioner: Specifies the configuration of dynamic agents.
  master_url: The full URL of the master. A valid URL is in the format scheme://host:port. The scheme must be either http or https. If the master is deployed on EC2, rather than hardcoding the IP address, we advise setting the host to one of the following aliases: local-ipv4, public-ipv4, local-hostname, or public-hostname. If the master is deployed on GCP, rather than hardcoding the IP address, we advise setting the host to one of the following aliases: internal-ip or external-ip. Which one you should select depends on your network configuration. On master startup, the alias is replaced with its real value. Defaults to http as the scheme, the local IP address as the host, and 8080 as the port.
  startup_script: Startup script for agents. This script runs as soon as agent instances start up. For example, it can be used for formatting and mounting a disk. Defaults to an empty string.
  agent_docker_network: The Docker network to use for the Determined agent and task containers. If this is set to “host”, Docker host-mode networking will be used instead. Defaults to “determined”.
  agent_docker_runtime: The Docker runtime to use for the Determined agent and task containers. Defaults to nvidia.
  agent_docker_image: The Docker image to use for the Determined agents. A valid form is determinedai/determined-agent:<version>. (Required)
  max_idle_agent_period: How long to wait before terminating idle dynamic agents. This string is a sequence of decimal numbers, each with an optional fraction and a unit suffix, such as “30s”, “1h”, or “1m30s”. Valid time units are “s”, “m”, and “h”.
  provider: aws: Specifies running dynamic agents on AWS. (Required)
    region: The region of the AWS resources used by Determined. We advise setting this to the same region as the Determined master for better network performance. Defaults to the same region as the master.
    root_volume_size: Size of the root volume of the Determined agent in GB. We recommend at least 100GB. Defaults to 200.
    image_id: The AMI ID of the Determined agent. (Required)
    tag_key: Key for tagging the Determined agent instances. Defaults to managed-by.
    tag_value: Value for tagging the Determined agent instances. Defaults to the master instance ID if the master is on EC2, otherwise determined-ai-determined.
    instance_name: Name to set for the Determined agent instances. Defaults to determined-ai-agent.
    ssh_key_name: The name of the SSH key registered with AWS for SSH key access to the agent instances. (Required)
    iam_instance_profile_arn: The Amazon Resource Name (ARN) of the IAM instance profile to attach to the agent instances.
    network_interface: Network interface to set for the Determined agent instances.
      public_ip: Whether to use a public IP address for the Determined agent instances. See Network Requirements for instructions on whether a public IP address should be set. Defaults to false.
      security_group_id: The ID of the security group to use for the Determined agents. This is the one you identified or created in Network Requirements. Defaults to the default security group of the specified VPC.
      subnet_id: The ID of the subnet to run the Determined agents in. Defaults to the default subnet of the default VPC.
    max_instances: Maximum number of Determined agent instances. Defaults to 5.
    instance_type: Type of instance for the Determined agents. Only P2 and P3 instance types are supported. Defaults to p3.8xlarge.
  provider: gcp: Specifies running dynamic agents on GCP. (Required)
    base_config: Instance resource base configuration that will be merged with the fields below to construct the GCP instance insert request. See REST Resource: instances for details.
    project: The project ID of the GCP resources used by Determined. Defaults to the project of the master.
    zone: The zone of the GCP resources used by Determined. Defaults to the zone of the master.
    boot_disk_size: Size of the root volume of the Determined agent in GB. We recommend at least 100GB. Defaults to 200.
    boot_disk_source_image: The boot disk source image of the Determined agent that was shared with you. To use a specific version of the Determined agent image from a specific project, it should be set in the format: projects/<project-id>/global/images/<image-id>. (Required)
    label_key: Key for labeling the Determined agent instances. Defaults to managed-by.
    label_value: Value for labeling the Determined agent instances. Defaults to the master instance name if the master is on GCP, otherwise determined-ai-determined.
    name_prefix: Name prefix to set for the Determined agent instances. The names of the Determined agent instances are a concatenation of the name prefix and a pet name. Defaults to the master instance name if the master is on GCP, otherwise determined-ai-determined.
    network_interface: Network configuration for the Determined agent instances. See the GCP API Access section for the suggested configuration. (Required)
      network: Network resource for the Determined agent instances. The network configuration should specify the project ID of the network. It should be set in the format: projects/<project>/global/networks/<network>. (Required)
      subnetwork: Subnetwork resource for the Determined agent instances. The subnet configuration should specify the project ID and the region of the subnetwork. It should be set in the format: projects/<project>/regions/<region>/subnetworks/<subnetwork>. (Required)
      external_ip: Whether to use an external IP address for the Determined agent instances. See Network Requirements for instructions on whether an external IP should be set. Defaults to false.
    network_tags: An array of network tags used to apply firewall rules to the Determined agent instances. These are the tags you identified or created in Firewall Rules. Defaults to an empty array.
    service_account: Service account for the Determined agent instances. See the GCP API Access section for the suggested configuration.
      email: Email of the service account for the Determined agent instances. Defaults to the empty string.
      scopes: List of scopes authorized for the Determined agent instances. As suggested in GCP API Access, we recommend setting the scopes to ["https://www.googleapis.com/auth/cloud-platform"]. Defaults to ["https://www.googleapis.com/auth/cloud-platform"].
    instance_type: Type of instance for the Determined agents.
      machine_type: Type of machine for the Determined agents. Defaults to n1-standard-32.
      gpu_type: Type of GPU for the Determined agents. Defaults to nvidia-tesla-v100.
      gpu_num: Number of GPUs for the Determined agents. Defaults to 4.
      preemptible: Whether to use preemptible instances. Defaults to false.
    max_instances: Maximum number of Determined agent instances. Defaults to 5.
checkpoint_storage: Specifies where model checkpoints will be stored. This can be overridden on a per-experiment basis in the Experiment Configuration. A checkpoint contains the architecture and weights of the model being trained. Determined currently supports four kinds of checkpoint storage, gcs, hdfs, s3, and shared_fs, identified by the type subfield.
  type: gcs: Checkpoints are stored on Google Cloud Storage (GCS). Authentication is done using GCP’s “Application Default Credentials” approach. When using Determined inside Google Compute Engine (GCE), the simplest approach is to ensure that the VMs used by Determined are running in a service account that has the “Storage Object Admin” role on the GCS bucket being used for checkpoints. As an alternative (or when running outside of GCE), you can add the appropriate service account credentials to your container (e.g., via a bind-mount), and then set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the container path where the credentials are located. See Environment Variables for more information on how to set environment variables in trial environments.
    bucket: The GCS bucket name to use.
  type: hdfs: Checkpoints are stored in HDFS using the WebHDFS API for reading and writing checkpoint resources.
    hdfs_url: Hostname or IP address of the HDFS namenode, prefixed with the protocol and followed by the WebHDFS port on the namenode. Multiple namenodes are allowed as a semicolon-separated list (e.g., "http://namenode1:50070;http://namenode2:50070").
    hdfs_path: The prefix path where all checkpoints will be written to and read from. The resources of each checkpoint will be saved in a subdirectory of hdfs_path, where the subdirectory name is the checkpoint’s UUID.
    user: An optional string value that indicates the user to use for all read and write requests. If left unspecified, the default user of the trial runner container will be used.
  type: s3: Checkpoints are stored in Amazon S3.
    bucket: The S3 bucket name to use.
    access_key: The AWS access key to use.
    secret_key: The AWS secret key to use.
    endpoint_url: The optional endpoint to use for S3 clones, e.g., http://127.0.0.1:8080/.
  type: shared_fs: Checkpoints are written to a directory on the agent’s file system. The assumption is that the system administrator has arranged for the same directory to be mounted at every agent host, and for the content of this directory to be the same on all agent hosts (e.g., by using a distributed or network file system such as GlusterFS or NFS).
    host_path: The file system path on each agent to use.
    container_path: The optional file system path to use as the mount point in the trial runner container. Defaults to /determined_shared_fs.
    storage_path: The optional path where checkpoints will be written to and read from. Must be a subdirectory of the host_path or an absolute path containing the host_path. If unset, checkpoints are written to and read from the host_path.
    propagation: (Advanced users only) Optional propagation behavior for replicas of the bind-mount. Defaults to rprivate.
  When an experiment finishes, the system will optionally delete some checkpoints to reclaim space. The save_experiment_best, save_trial_best, and save_trial_latest parameters specify which checkpoints to save. See Checkpoint Garbage Collection for more details.
db: Specifies the configuration of the database.
  user: The database user to use when logging in to the database. (Required)
  password: The password to use when logging in to the database. (Required)
  host: The database host to use. (Required)
  port: The database port to use. (Required)
  name: The database name to use. (Required)
hasura: Specifies the connection to Hasura.
  address: The address of the Hasura host. A valid address is in the form host:port. Defaults to localhost:8081.
  secret: The secret to use. Defaults to an empty string.
telemetry: Specifies whether we collect anonymous information about the usage of Determined.
  enabled: Set to true to allow collection. Defaults to true.
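Putting several of the fields above together, a small master.yaml might look roughly like the sketch below. Every value is an illustrative placeholder (the database credentials, database name, and host_path in particular are assumptions), and the sketch assumes that the subfields of checkpoint_storage are written alongside the type key.

```yaml
# Hypothetical /etc/determined/master.yaml; all values are placeholders.
scheduler:
  fit: best                # pack tasks onto as few agents as possible

db:
  user: postgres           # placeholder database user
  password: change-me      # placeholder password
  host: the-db-host
  port: 5432               # PostgreSQL's conventional port
  name: determined         # placeholder database name

checkpoint_storage:
  type: shared_fs
  host_path: /mnt/determined-checkpoints   # must be mounted on every agent

telemetry:
  enabled: true
```

Fields that are omitted, such as root or hasura, fall back to the defaults listed above.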