Common Configuration Options¶
Master Port¶
By default, the master listens on TCP port 8080. This can be configured via the port
option.
Security¶
The master can secure all incoming connections using TLS. That ability requires a TLS private key
and certificate to be provided; set the options security.tls.cert
and security.tls.key
to
paths to a PEM-encoded TLS certificate and private key, respectively, to do so. If TLS is enabled,
the default port becomes 8443 rather than 8080. See Transport Layer Security for more information.
Configuring Trial Runner Networking¶
The master is capable of selecting the network interface that trial runners will use to communicate
when performing distributed (multi-machine) training. The network interface can be configured by
editing task_container_defaults.dtrain_network_interface
. If left unspecified, which is the
default setting, Determined will auto-discover a common network interface shared by the trial
runners.
Note
For Introduction to Distributed Training, Determined automatically detects a common network interface shared by the agent machines. If your cluster has multiple common network interfaces, please specify the fastest one.
Additionally, the ports used by the GLOO and NCCL libraries, which are used during distributed (multi-machine) training can be configured to fall within user-defined ranges. If left unspecified, ports will be chosen randomly from the unprivileged port range (1024-65535).
Default Checkpoint Storage¶
See Checkpoint Storage for details.
Telemetry¶
By default, the master and WebUI collect anonymous information about how Determined is being used. This usage information is collected so that we can improve the design of the product. Determined does not report information that can be used to identify individual users of the product, nor does it include model source code, model architecture/checkpoints, training datasets, training and validation metrics, logs, or hyperparameter values.
The information we collect from the master periodically includes:
a unique, randomly generated ID for the current database and for the current instance of the master
the version of Determined
the version of Go that was used to compile the master
the number of registered users
the number of experiments that have been created
the total number of trials across all experiments
the number of active, paused, completed, and canceled experiments
whether tasks are scheduled using Kubernetes or the built-in Determined scheduler
the total number of slots (e.g., GPUs)
the number of slots currently being utilized
the type of each configured resource pool
We also record when the following events happen:
an experiment is created
an experiment changes state
an agent connects or disconnects
a user is created (the username is not transmitted)
When an experiment is created, we report:
the name of the hyperparameter search method
the total number of hyperparameters
the number of slots (e.g., GPUs) used by each trial in the experiment
the name of the container image used
When a task terminates, we report:
the start and end time of the task
the number of slots (e.g., GPUs) used
for experiments, we also report:
the number of trials in the experiment
the total number of training workloads across all trials in the experiment
the total elapsed time for all workloads across all trials in the experiment
The information we collect from the WebUI includes:
pages that are visited
errors that occur (both network errors and uncaught exceptions)
user-triggered actions
To disable telemetry reporting in both the master and the WebUI, start the master with the
--telemetry-enabled=false
flag (this can also be done by editing the master config file or
setting an environment variable, as with any other configuration option). Disabling telemetry
reporting will not affect the functionality of Determined in any way.
OpenTelemetry¶
Separate from the telemetry reporting mentioned above, Determined also supports OpenTelemetry to collect traces. This is disabled by default; to enable it, use the
master configuration setting telemetry.otel-enabled
. When enabled, the master will send
OpenTelemetry traces to a collector running at localhost:4317
. A different endpoint can be set
via the telemetry.otel-endpoint
configuration setting.