Common Configuration Options#
Master Port#
By default, the master listens on TCP port 8080. You can configure this port via the port
option.
Security#
The master can secure all incoming connections using TLS. To enable this feature, set the
security.tls.cert and security.tls.key options to the paths of a PEM-encoded TLS certificate and
private key, respectively. If TLS is enabled, the default port becomes 8443 rather than 8080.
Refer to Transport Layer Security for more information.
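As a rough illustration, the TLS options might appear in the master configuration file as shown here. This sketch assumes the file is YAML and that the dotted option names map to nested keys; the file paths are placeholders, not defaults.

```yaml
# Hypothetical paths; substitute the locations of your own PEM-encoded files.
security:
  tls:
    cert: /etc/determined/tls/master-cert.pem
    key: /etc/determined/tls/master-key.pem

# Optional: set the port explicitly instead of relying on the TLS default of 8443.
port: 8443
```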
Configuring Task Container Networking#
The master can select the network interface that task containers will use to communicate during
distributed (multi-machine) training. You can configure the network interface by editing
task_container_defaults.dtrain_network_interface. If this option is left unspecified (the default),
Determined will auto-discover a common network interface shared by the task containers.
Note
For Distributed Training with Determined, the platform automatically detects a common network interface shared by the agent machines. If your cluster has multiple common network interfaces, please specify the fastest one.
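On a cluster with several common interfaces, the option can be set explicitly. A minimal sketch, assuming a YAML master configuration file and a hypothetical interface name (eth1) that is present on every agent machine:

```yaml
task_container_defaults:
  # "eth1" is a placeholder; use the name of the fastest interface
  # that all agent machines have in common.
  dtrain_network_interface: eth1
```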
Default Checkpoint Storage#
See Checkpoint Storage for details.
Telemetry#
By default, the master and WebUI both collect information about how Determined is being used in order to improve product design. This information includes metrics and events such as the number of experiments, trials, and registered users.
Telemetry does not report model source code, model architecture/checkpoints, training datasets, training and validation metrics, logs, or hyperparameter values.
The information we collect from the master periodically includes:
a unique, randomly generated ID for the current database and for the current instance of the master
the IP address of the master
the version of Determined
the version of Go that was used to compile the master
the number of registered users
the number of experiments that have been created
the total number of trials across all experiments
the number of active, paused, completed, and canceled experiments
whether tasks are scheduled using Kubernetes or the built-in Determined scheduler
the total number of slots (e.g., GPUs)
the number of slots currently being utilized
the type of each configured resource pool
We also record when the following events happen:
an experiment is created
an experiment changes state
an agent connects or disconnects
a user is created (the username is not transmitted)
When an experiment is created, we report:
the name of the hyperparameter search method
the total number of hyperparameters
the number of slots (e.g., GPUs) used by each trial in the experiment
the name of the container image used
When a task terminates, we report:
the start and end time of the task
the number of slots (e.g., GPUs) used
for experiments, we also report:
  the number of trials in the experiment
  the total number of training workloads across all trials in the experiment
  the total elapsed time for all workloads across all trials in the experiment
The information we collect from the WebUI includes:
pages that are visited
errors that occur (both network errors and uncaught exceptions)
user-triggered actions
To disable telemetry reporting in both the master and the WebUI, start the master with the
--telemetry-enabled=false flag (this can also be done by editing the master config file or
setting an environment variable, as with any other configuration option). Disabling telemetry
reporting will not affect the functionality of Determined in any way.
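For the config-file route, a minimal sketch follows. It assumes the master configuration file is YAML and that the --telemetry-enabled flag corresponds to a nested telemetry.enabled key; check the key name against the master configuration reference for your version.

```yaml
telemetry:
  # Assumed config-file equivalent of --telemetry-enabled=false.
  enabled: false
```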
OpenTelemetry#
Separate from the telemetry reporting mentioned above, Determined also supports OpenTelemetry to
collect traces. This is disabled by default. To enable it, use the master configuration setting
telemetry.otel-enabled. When enabled, the master will send OpenTelemetry traces to a collector
running at localhost:4317. A different endpoint can be set via the telemetry.otel-endpoint
configuration setting.
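The two OpenTelemetry settings might be combined as in the sketch below, which assumes a YAML master configuration file with the dotted names expressed as nested keys. The collector hostname is a placeholder.

```yaml
telemetry:
  # Tracing is off unless this is set to true.
  otel-enabled: true
  # Hypothetical collector address; when omitted, traces go to localhost:4317.
  otel-endpoint: otel-collector.example.internal:4317
```

If the file already has a telemetry section, add these keys to it rather than repeating the section.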