Helm Chart Configuration¶
appVersion: Configures which version of Determined to install. Users can specify a release version (e.g.,
0.13.0) or specify any commit hash from the upstream Determined repo (e.g.,
b13461ed06f2fad339e179af8028d4575db71a81). Users are encouraged to use a released version.
If using a non-release branch of the Determined repository,
appVersion is going to be set to
X.Y.Z.dev0. This is not an
official release version and deploying this version result in a
ImagePullBackOff error. Users should remove
.dev0 to get the
latest released version, or they can specify a specific commit hash
httpPort: The port at which the Determined master listens for connections on. (Required)
db: Specifies the configuration of the database.
name: The database name to use. (Required)
user: The database user to use when logging in the database. (Required)
password: The password to use when logging in the database. (Required)
port: The database port to use. (Required)
hostAddress: Optional configuration to indicate the address of a user provisioned database If configured, the Determined helm chart will not deploy a database.
storageSize: Only used when
hostAddressis left blank. Configures the size of the PersistentVolumeClaim for the Determined deployed database.
checkpointStorage: Specifies where model checkpoints will be stored. This can be overridden on a per-experiment basis in the Experiment Configuration. A checkpoint contains the architecture and weights of the model being trained. Determined currently supports three kinds of checkpoint storage,
shared_fs, identified by the
type: gcs: Checkpoints are stored on Google Cloud Storage (GCS). Authentication is done using GCP’s “Application Default Credentials” approach. When using Determined inside Google Kubernetes Engine (GCE), the simplest approach is to ensure that the nodes used by Determined are running in a service account that has the “Storage Object Admin” role on the GCS bucket being used for checkpoints. As an alternative (or when running outside of GKE), you can add the appropriate service account credentials to your container (e.g., via a bind-mount), and then set the
GOOGLE_APPLICATION_CREDENTIALSenvironment variable to the container path where the credentials are located. See Environment Variables for more information on how to set environment variables in trial environments.
bucket: The GCS bucket name to use.
type: s3: Checkpoints are stored in Amazon S3.
bucket: The S3 bucket name to use.
accessKey: The AWS access key to use.
secretKey: The AWS secret key to use.
endpointUrl: The optional endpoint to use for S3 clones, e.g., http://127.0.0.1:8080/.
type: shared_fs: Checkpoints are written to a
hostPath Volume. Users are strongly discouraged from using
shared_fsfor storage beyond initial testing as most Kubernetes cluster nodes do not have a shared file system.
hostPath: The file system path on each node to use. This directory will be mounted to
/determined_shared_fsinside the trial pod.
When an experiment finishes, the system will optionally delete some checkpoints to reclaim space. The
saveTrialLatestparameters specify which checkpoints to save. See Checkpoint Garbage Collection for more details.
maxSlotsPerPod: Specifies number of GPUs there are per machine. Determined uses this information when scheduling multi-GPU tasks. Each multi-GPU (distributed training) task will be scheduled as a set of
slotsPerTask / maxSlotsPerPodseparate pods, with each pod assigned up to
maxSlotsPerPodGPUs. Distibuted tasks with sizes that are not divisible by
maxSlotsPerPodare never scheduled. If you have a cluster of different size nodes, set the
maxSlotsPerPodto the smallest common denominator. For example, if you have nodes with 4 GPUs and other nodes with 8 GPUs, set
maxSlotsPerPodto 4 so that all distributed experiments will launch with 4 GPUs per pod (e.g., on nodes with 8 GPUs, two such pods would be launched). (Required)
masterCpuRequest: The CPU requirements for the Determined master.
masterMemRequest: The memory requirements for the Determined master.
taskContainerDefaults: Specifies Docker defaults for all task containers. A task represents a single schedulable unit, such as a trial, command, or tensorboard.
networkMode: The Docker network to use for the Determined task containers. If this is set to “host”, Docker host-mode networking will be used instead. Defaults to “bridge”.
dtrainNetworkInterface: The network interface to use during Distributed Training. If not set, Determined automatically determines the network interface. When training a model with multiple machines, the host network interface used by each machine must have the same interface name across machines. This is usually determined automatically, but there may be issues if there is an interface name common to all machines but it is not routable between machines. Determined already filters out common interfaces like
docker0, but agent machines may have others. If interface detection is not finding the appropriate interface, the
dtrainNetworkInterfaceoption can be used to set it explicitly (e.g.,
ncclPortRange: The range of ports that nccl is permitted to use during distributed training. A valid port range is in the format of
glooPortRange: The range of ports that gloo is permitted to use during distributed training. A valid port range is in the format of
forcePullImage: Defines the default policy for forcibly pulling images from the docker registry and bypassing the docker cache. If a pull policy is specified in the experiment config this default is overriden. Please note that as of November 1st, 2020 unauthenticated users will be capped at 100 pulls from Docker per 6 hours. Defaults to
cpuPodSpec: Sets the default pod spec for all non-gpu tasks. See Specifying Custom Pod Specs for details.
gpuPodSpec: Sets the default pod spec for all ngpu tasks. See Specifying Custom Pod Specs for details.
cpuImage: Sets the default docker image for all non-gpu tasks. If a docker image is specified in the experiment config this default is overriden. Defaults to:
gpuImage: Sets the default docker image for all gpu tasks. If a docker image is specified in the experiment config this default is overriden. Defaults to:
enterpriseEdition: Specifies whether to use Determined enterprise edition.
imagePullSecretName: Specifies the image pull secret for pulling the Determined master image. Required when using the enterprise edition.
telemetry: Specifies whether we collect anonymous information about the usage of Determined.
enabled: Whether collection is enabled. Defaults to