Install Determined on Kubernetes#
This user guide describes how to install Determined on Kubernetes using the Determined Helm chart.
Tip
Store your installation commands and flags in a shell script for future use, particularly for upgrading.
When the Determined Helm chart is installed, the following entities will be created:
Deployment of the Determined master.
ConfigMap containing configurations for the Determined master.
LoadBalancer service to make the Determined master accessible. Later in this guide, we describe how to replace this with a NodePort service.
ServiceAccount which will be used by the Determined master.
Deployment of a Postgres database. Later in this guide, we describe how an external database can be used instead.
PersistentVolumeClaim for the Postgres database. Omitted if using an external database.
Service to allow the Determined master to communicate with the Postgres database. Omitted if using an external database.
When using multiple Kubernetes clusters, the following are also created in each external-to-master cluster:
Gateway service to allow north-south access to Determined proxied tasks in external-to-master clusters.
Service to expose proxied ports on Determined jobs.
TCPRoute to attach the gateway service to the proxied ports service.
Prerequisites#
Before installing Determined on a Kubernetes cluster, please ensure that the following prerequisites are satisfied:
The Kubernetes cluster should be running Kubernetes version >= 1.21.
You should have access to the cluster via kubectl.
Helm 3 should be installed.
If you are using a private image registry or the enterprise edition, you should add a secret using kubectl create secret.
Optional: for GPU-based training, the Kubernetes cluster should have GPU support enabled.
If you do not yet have a Kubernetes cluster deployed and you want to use Determined in a public cloud environment, we recommend using a managed Kubernetes offering such as Google Kubernetes Engine (GKE) on GCP or Elastic Kubernetes Service (EKS) on AWS. For more info on configuring GKE for use with Determined, refer to the Instructions for setting up a GKE cluster. For info on configuring EKS, refer to the Instructions for setting up an EKS cluster.
Quickstart#
First, add the Determined Helm chart repository:
helm repo add determined-ai https://helm.determined.ai/
Then, create a values.yaml file to configure the Determined deployment:
# values.yaml
# Minimal configuration requires you to specify the number of GPUs per node.
maxSlotsPerPod: 1
Finally, install Determined using Helm:
helm install determined determined-ai/determined --values values.yaml
You can find more details about the configuration options in the Helm Chart Configuration Reference or in the Configuration section below.
Alternatively, you can:
Download the full Helm chart using helm pull determined-ai/determined --untar=true, edit the values.yaml file, and then install it using helm install determined ./determined.
Download the packaged Determined Helm chart, extract the archive, edit values.yaml, and install it.
Use the latest main branch version from the Determined GitHub repo.
Configuration#
When installing Determined using Helm, first configure some aspects of the Determined deployment by editing the values.yaml file.
Image Registry Configuration#
To configure which image registry the Determined master image will be pulled from by the Helm chart, change imageRegistry in values.yaml. You can specify the Docker Hub public registry determinedai or any private registry that hosts the Determined master image.
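For example, in values.yaml:
# values.yaml
# Docker Hub public registry (the default), or your private registry host.
imageRegistry: determinedai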
Image Pull Secret Configuration#
To configure which image pull secret will be used by the Helm chart, change imagePullSecretName in values.yaml. You can leave it empty for the Docker Hub public registry or specify any secret that is configured using kubectl create secret.
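For example, a registry secret can be created with kubectl before installing the chart; this is a sketch, and the secret name, server, and credentials below are placeholders:
# Create a docker-registry secret for a private registry.
kubectl create secret docker-registry my-registry-secret \
  --docker-server=registry.example.com \
  --docker-username=<username> \
  --docker-password=<password>
Then reference it in values.yaml:
# values.yaml
imagePullSecretName: my-registry-secret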
Version Configuration#
To install a specific version of Determined, use the helm --version <version> flag. For example:
helm install determined determined-ai/determined --values values.yaml --version 0.30.0
Alternatively, if you have a copy of the Determined Helm chart, you can edit the Chart.yaml file and change appVersion. You can specify a release version (e.g., 0.30.0) or any commit hash from the upstream Determined repo (e.g., b13461ed06f2fad339e179af8028d4575db71a81). You are strongly encouraged to use a released version.
Resource Configuration (GPU-based setups)#
For GPU-based configurations, you must specify the number of GPUs on each node (for GPU-enabled nodes only). This is done by setting maxSlotsPerPod in values.yaml. Determined uses this information when scheduling multi-GPU tasks. Each multi-GPU (distributed training) task will be scheduled as a set of slotsPerTask / maxSlotsPerPod separate pods, with each pod assigned up to maxSlotsPerPod GPUs. Distributed tasks with sizes that are not divisible by maxSlotsPerPod are never scheduled. If you have a cluster of differently sized nodes, set maxSlotsPerPod to the greatest common divisor of all the sizes. For example, if you have some nodes with 4 GPUs and other nodes with 8 GPUs, set maxSlotsPerPod to 4 so that all distributed experiments will launch with 4 GPUs per pod (with two pods on 8-GPU nodes).
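For example, a minimal values.yaml for the mixed 4-GPU/8-GPU cluster described above:
# values.yaml
# Greatest common divisor of the per-node GPU counts (4 and 8).
maxSlotsPerPod: 4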
Resource Configuration (CPU-based setups)#
For CPU-only configurations, you need to set slotType: cpu as well as slotResourceRequests.cpu: <number of CPUs per slot> in values.yaml. Please note that the number of CPUs allocatable by Kubernetes may be lower than the number of "hardware" CPU cores. For example, an 8-core node may provide 7.91 CPUs, with the rest allocated for the Kubernetes system tasks. If slotResourceRequests.cpu was set to 8 in this example, the pods would fail to allocate, so it should be set to a lower number instead, such as 7.5.
Then, similarly to the GPU-based configuration, maxSlotsPerPod needs to be set to the greatest common divisor of all the node sizes. For example, if you have 16-core nodes with 15 allocatable CPUs, it's reasonable to set maxSlotsPerPod: 1 and slotResourceRequests.cpu: 15. If you have some 32-core nodes and some 64-core nodes, and you want to use finer-grained slotResourceRequests.cpu: 15, set maxSlotsPerPod: 2.
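For example, a sketch of a values.yaml for the 16-core nodes described above:
# values.yaml
slotType: cpu
# Each slot requests 15 CPUs, leaving headroom for Kubernetes system tasks.
slotResourceRequests:
  cpu: 15
# One 15-CPU slot per pod.
maxSlotsPerPod: 1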
Checkpoint Storage#
Checkpoints and TensorBoard events can be configured to be stored in shared_fs, AWS S3, Microsoft Azure Blob Storage, or GCS. By default, checkpoints and TensorBoard events are stored using shared_fs, which creates a hostPath volume and saves to the host file system. This configuration is intended for initial testing only; you are strongly discouraged from using shared_fs for actual deployments of Determined on Kubernetes, because most Kubernetes cluster nodes do not have a shared file system.
Instead of using shared_fs, configure either AWS S3, Microsoft Azure Blob Storage, or GCS:
AWS S3: To configure Determined to use AWS S3 for checkpoint and TensorBoard storage, set checkpointStorage.type in values.yaml to s3 and set checkpointStorage.bucket to the name of the bucket. The pods launched by the Determined master must have read, write, and delete access to the bucket. To enable this, you can optionally configure checkpointStorage.accessKey and checkpointStorage.secretKey. You can also optionally configure checkpointStorage.endpointUrl, which specifies the endpoint to use for S3 clones (e.g., http://<minio-endpoint>:<minio-port|default=9000>). See the sketch after this list.
Microsoft Azure Blob Storage: To configure Determined to use Microsoft Azure Blob Storage for checkpoint and TensorBoard storage, set checkpointStorage.type in values.yaml to azure and set checkpointStorage.container to the name of the container to store them in. You must also specify either connection_string, the connection string associated with the Azure Blob Storage service account to use, or the tuple account_url and credential, where account_url is the URL for the service account to use and credential is an optional credential.
GCS: To configure Determined to use Google Cloud Storage for checkpoints and TensorBoard data, set checkpointStorage.type in values.yaml to gcs and set checkpointStorage.bucket to the name of the bucket. The pods launched by the Determined master must have read, write, and delete access to the bucket. For example, when launching GKE nodes you need to specify --scopes=storage-full to configure proper GCS access.
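As a sketch, an S3 configuration in values.yaml might look like the following; the bucket name and credentials are placeholders, and accessKey/secretKey can be omitted if the pods already have bucket access (e.g., via IAM roles):
# values.yaml
checkpointStorage:
  type: s3
  bucket: my-checkpoint-bucket
  accessKey: <aws-access-key-id>
  secretKey: <aws-secret-access-key>
  # Optional, for S3 clones such as MinIO:
  # endpointUrl: http://minio.example.com:9000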
Default Pod Specs (Optional)#
As described in the Deploy on Kubernetes guide, when tasks (e.g., experiments, notebooks)
are started in a Determined cluster running on Kubernetes, the Determined master launches pods to
execute these tasks. The Determined helm chart makes it possible to set default pod specs for all
CPU and GPU tasks. The defaults can be defined in values.yaml
under
taskContainerDefaults.cpuPodSpec
and taskContainerDefaults.gpuPodSpec
. For examples of how
to do this and a description of permissible fields, see the specifying custom pod specs guide.
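For example, a sketch of a default GPU pod spec in values.yaml; the toleration shown is hypothetical and should match your own node taints:
# values.yaml
taskContainerDefaults:
  gpuPodSpec:
    apiVersion: v1
    kind: Pod
    spec:
      tolerations:
        - key: "accelerator"   # hypothetical taint key
          operator: "Equal"
          value: "gpu"
          effect: "NoSchedule"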
Default Password#
Setting an initialUserPassword for the admin and determined user accounts is a required step and is configured in the Helm chart. The password for these users will not affect any other user account. For additional information on managing users in Determined, visit the topic guide on users.
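For example, in values.yaml (the value shown is a placeholder):
# values.yaml
# Initial password for the built-in admin and determined users.
initialUserPassword: <your-strong-password>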
Database (Optional)#
By default, the Helm chart deploys an instance of Postgres on the same Kubernetes cluster where Determined is deployed. If this is not what you want, you can configure the Helm chart to use an external Postgres database by setting db.hostAddress to the IP address of your database. If db.hostAddress is configured, the Determined Helm chart will not deploy a database.
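As a sketch, an external database configuration might look like the following; db.hostAddress comes from this guide, while the port, name, user, and password keys are assumptions to verify against your chart version's default values.yaml:
# values.yaml
db:
  hostAddress: 10.0.0.4   # IP address of the external Postgres database
  # The keys below are assumptions; check your chart's default values.yaml.
  port: 5432
  name: determined
  user: postgres
  password: <password>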
TLS (Optional)#
By default, the Helm chart deploys a load balancer which makes the Determined master accessible over HTTP. To secure your cluster, Determined supports TLS encryption, which can be configured to terminate inside a load balancer or inside the Determined master itself. To configure TLS, set useNodePortForMaster to true. This instructs Determined to deploy a NodePort service for the master. You can then configure an Ingress that performs TLS termination in the load balancer and forwards plain text to the NodePort service, or forwards TLS-encrypted data. Please note that when configuring an Ingress, you need to have an Ingress controller running in your cluster.
TLS termination in a load balancer (e.g., nginx). This option provides TLS encryption between the client and the load balancer, with all communication inside the cluster performed via HTTP. To configure this option, set useNodePortForMaster to true and then configure an Ingress service to perform TLS termination and forward the plain-text traffic to the Determined master.
TLS termination in the Determined master. This option provides TLS encryption inside the Kubernetes cluster. All communication with the master will be encrypted. Communication between task containers (distributed training) will not be encrypted. To configure this option, create a Kubernetes TLS secret within the namespace where Determined is being installed and set tlsSecret to the name of this secret. You also need to set useNodePortForMaster to true. After the NodePort service is created, you can configure an Ingress to forward TLS-encrypted data to the NodePort service.
An example of how to configure an Ingress, which performs TLS termination in the load balancer by default:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: determined-ingress
  annotations:
    kubernetes.io/ingress.class: "nginx"
    # Uncommenting this option instructs the created load-balancer
    # to forward TLS encrypted data to the NodePort service and
    # perform TLS termination in the Determined master. In order
    # to configure ssl-passthrough, your nginx ingress controller
    # must be running with the --enable-ssl-passthrough option enabled.
    #
    # nginx.ingress.kubernetes.io/ssl-passthrough: "true"
spec:
  tls:
    - hosts:
        - your-hostname-for-determined.ai
      secretName: your-tls-secret-name
  rules:
    - host: your-hostname-for-determined.ai
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: determined-master-service-<name for your deployment>
                port:
                  number: <masterPort configured in values.yaml>
To see information about using an AWS Load Balancer instead of nginx, visit Using AWS Load Balancer.
Default Scheduler (Optional)#
Determined includes support for the lightweight coscheduling plugin, which extends the default Kubernetes scheduler to provide gang scheduling. This feature is currently in beta and is not enabled by default. To activate the plugin, set the defaultScheduler field to coscheduler. If the field is empty or doesn't exist, Determined will use the default Kubernetes scheduler to schedule all experiments and tasks.
defaultScheduler: coscheduler
Determined also includes support for priority-based scheduling with preemption. This feature allows experiments to be preempted if higher-priority ones are submitted. This feature is also in beta and is not enabled by default. To activate priority-based preemption scheduling, set defaultScheduler to preemption.
defaultScheduler: preemption
Node Taints#
Tainting nodes is optional, but you might want to taint nodes to restrict which nodes a pod may be scheduled onto. A taint consists of a taint type, tag, and effect.
When using a managed Kubernetes cluster (e.g., a GKE, AKS, or EKS cluster), it is possible to specify taints at cluster or node pool creation using the corresponding CLIs. Please refer to the setup pages for each managed cluster service for instructions on how to do so. To add taints to an existing resource, use kubectl. Tolerations can be added to pods by including the tolerations field in the pod specification.
kubectl Taints#
To taint a node with kubectl, use kubectl taint nodes.
kubectl taint nodes ${NODE_NAME} ${TAINT_TYPE}=${TAINT_TAG}:${TAINT_EFFECT}
As an example, the following snippet taints the node named node-1 with the accelerator taint type set to the gpu taint value and the NoSchedule effect, so pods without a matching toleration will not be scheduled on it.
kubectl taint nodes node-1 accelerator=gpu:NoSchedule
kubectl Tolerations#
To specify a toleration, use the tolerations field in the PodSpec.
tolerations:
  - key: "${TAINT_TYPE}"
    operator: "Equal"
    value: "${TAINT_TAG}"
    effect: "${TAINT_EFFECT}"
The following example is a toleration for when a node has the accelerator taint type equal to the gpu taint value.
tolerations:
  - key: "accelerator"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
The next example is a toleration for when a node has the gpu taint type.
tolerations:
  - key: "gpu"
    operator: "Exists"
    effect: "NoSchedule"
Setting Up Multiple Resource Pools#
To set up multiple resource pools for Determined on your Kubernetes cluster:
Create a namespace for each resource pool. The default namespace can also be mapped to a resource pool.
Because Determined ensures that tasks in a given resource pool are launched in its linked namespace, the cluster admin needs to ensure that pods in a given namespace automatically get the right nodeSelector or toleration added to their pod spec, so that they are scheduled on the nodes intended for that resource pool. This can be done using an admission controller such as PodNodeSelector or PodTolerationRestriction. Alternatively, the cluster admin can add a resource-pool-specific (and hence namespace-specific) pod spec to the task_container_defaults sub-section of the resourcePools section of the Helm values.yaml:
resourcePools:
  - pool_name: prod_pool
    kubernetes_namespace: default
    task_container_defaults:
      gpu_pod_spec:
        apiVersion: v1
        kind: Pod
        spec:
          tolerations:
            - key: "pool_taint"
              operator: "Equal"
              value: "prod"
              effect: "NoSchedule"
          # Define an example node selector label.
          nodeSelector:
            kubernetes.io/hostname: "foo"
          # Define an example node affinity.
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: topology.kubernetes.io/zone
                        operator: In
                        values:
                          - antarctica-west1
                          - antarctica-east1
Label/taint the appropriate nodes you want to include as part of each resource pool. For instance, you may add a taint like
kubectl taint nodes prod_node_name pool_taint=prod:NoSchedule
and then add the corresponding toleration either through the PodTolerationRestriction admission controller or in resourcePools.pool_name.task_container_defaults.gpu_pod_spec as above, so it is automatically added to the pod spec based on which namespace (and hence resource pool) a task runs in. Adding node selector or node affinity logic to your resource pool ensures that only nodes matching this logic are selected. You may add a node selector like kubernetes.io/hostname = foo, or match your resource pool to any nodes whose topology.kubernetes.io/zone value is in the set {antarctica-west1, antarctica-east1}.
Add the appropriate resource pool name to namespace mappings in the resourcePools section of the values.yaml file in the Helm chart.
Note
To enable north-south access to Determined proxied tasks in external-to-master clusters, set up a gateway as described in the Internal Task Gateway docs.
Install Determined#
Once finished making configuration changes in values.yaml, install Determined using:
helm install <name for your deployment> determined-ai/determined --values values.yaml
It may take a few minutes for all resources to come up. If you encounter issues during installation, refer to the list of useful kubectl commands. Helm will install Determined within the default namespace. If you wish to install Determined into a non-default namespace, add -n <namespace name> to the command shown above.
Once the installation has completed, instructions will be displayed for discovering the IP address assigned to the Determined master. The IP address can also be discovered by running kubectl get services.
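For example, a sketch for fetching the external IP of the master service directly, assuming the default LoadBalancer service and substituting your deployment name:
# Print the external IP assigned to the Determined master service.
kubectl get service determined-master-service-<name for your deployment> \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'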
When installing Determined on Kubernetes, I get an ImagePullBackOff error#
You may be trying to install a non-released version of Determined or a version in a private registry without the right secret. See the documentation on how to configure which version of Determined to install on Kubernetes.
Upgrade Determined#
To upgrade Determined or to change a configuration setting, make the appropriate changes in values.yaml, and then run:
helm repo update
helm upgrade <name for your deployment> determined-ai/determined --wait --values values.yaml
Before upgrading Determined, consider pausing all active experiments. Any experiments that are active when the Determined master restarts will resume training after the upgrade, but will be rolled back to their most recent checkpoint.
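For example, an active experiment can be paused ahead of the upgrade with the Determined CLI (a sketch; <experiment-id> is a placeholder):
# Pause a running experiment before the master restarts.
det experiment pause <experiment-id>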
If using a locally downloaded Helm chart instead of the Helm repo, make sure to update it manually.
Uninstall Determined#
To uninstall Determined, run:
# Please note that if the Postgres Database was deployed by Determined, it will
# be deleted by this command, permanently removing all records of your experiments.
helm delete <name for your deployment>
# If there were any active tasks when uninstalling, this command will
# delete all of the leftover Kubernetes resources. It is recommended to
# pause all experiments prior to upgrading or uninstalling Determined.
kubectl get pods --no-headers=true -l=determined | awk '{print $1}' | xargs kubectl delete pod