Install Determined on Kubernetes¶
This document describes how to install Determined on Kubernetes. The installation is performed using the Determined Helm Chart. For general information about using Determined with Kubernetes, refer to the Determined on Kubernetes guide.
Prerequisites¶
Before installing Determined on a Kubernetes cluster, please ensure that the following prerequisites are satisfied:
The Kubernetes cluster should be running Kubernetes >= 1.15, with GPU support enabled.
You should have access to the cluster via kubectl.
Helm 3 should be installed.
You should also download a copy of the Determined Helm Chart and extract it on your local machine.
If you do not yet have a Kubernetes cluster deployed and you want to use Determined in a public cloud environment, we recommend using a managed Kubernetes offering such as Google Kubernetes Engine (GKE) on GCP or Elastic Kubernetes Service (EKS) on AWS. For more info on configuring GKE for use with Determined, refer to our Instructions for setting up a GKE cluster. For info on configuring EKS, refer to our Instructions for setting up an EKS cluster.
What Gets Installed¶
When the Determined Helm chart is installed, the following entities will be created:
Deployment of the Determined master.
ConfigMap containing configurations for the Determined master.
LoadBalancer service to make the Determined master accessible. Later in this guide, we describe how it is possible to replace this with a NodePort service.
ServiceAccount which will be used by the Determined master.
Deployment of a Postgres database. Later in this guide, we describe how an external database can be used instead.
PersistentVolumeClaim for the Postgres database. Omitted if using an external database.
Service to allow the Determined master to communicate with the Postgres database. Omitted if using an external database.
Configuration¶
When installing Determined using Helm, you should first configure some aspects of the Determined deployment by editing the values.yaml and Chart.yaml files in the Helm chart.
Version Configuration¶
To configure which version of Determined will be installed by the Helm chart, users should modify appVersion in Chart.yaml. Users can specify a release version (e.g., 0.13.0) or specify any commit hash from the upstream Determined repo (e.g., b13461ed06f2fad339e179af8028d4575db71a81). Users are strongly encouraged to use a released version.
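As a minimal sketch, the relevant line of Chart.yaml might look like the following. The version shown is simply the example release mentioned above, and the rest of the file is assumed to stay as shipped with the chart.

```yaml
# Chart.yaml (excerpt): only appVersion is changed.
# A release version is recommended; a commit hash from the
# upstream Determined repo can be used in its place.
appVersion: "0.13.0"
```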
Number of GPUs Per Node¶
Users are required to specify the number of GPUs on each node (for GPU-enabled nodes only). This is done by setting maxSlotsPerPod in values.yaml. Determined uses this information when scheduling multi-GPU tasks. Each multi-GPU (distributed training) task will be scheduled as a set of slotsPerTask / maxSlotsPerPod separate pods, with each pod assigned up to maxSlotsPerPod GPUs. Distributed tasks with sizes that are not divisible by maxSlotsPerPod are never scheduled. If your cluster has nodes with different numbers of GPUs, set maxSlotsPerPod to the greatest common divisor of the node sizes. For example, if you have nodes with 4 GPUs and other nodes with 8 GPUs, set maxSlotsPerPod to 4 so that all distributed experiments will launch with 4 GPUs per pod (e.g., on nodes with 8 GPUs, two such pods would be launched).
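For the mixed 4-GPU and 8-GPU cluster in the example, the setting might be sketched as follows. This assumes maxSlotsPerPod is a top-level key in values.yaml.

```yaml
# values.yaml (excerpt)
# Nodes have either 4 or 8 GPUs, so use 4: the largest value
# that evenly divides both node sizes.
maxSlotsPerPod: 4
```

With this value, a distributed task requesting 16 GPUs would be scheduled as 16 / 4 = 4 pods, while a task requesting 6 GPUs would never be scheduled, because 6 is not divisible by 4.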
Checkpoint Storage¶
Checkpoints and TensorBoard events can be configured to be stored in shared_fs, AWS S3, or GCS. By default, checkpoints and TensorBoard events are stored using shared_fs, which creates a hostPath Volume and saves to the host file system. This configuration is intended for initial testing only; users are strongly discouraged from using shared_fs for actual deployments of Determined on Kubernetes, because most Kubernetes cluster nodes do not have a shared file system. Instead of using shared_fs, users should configure either AWS S3 or GCS:
AWS S3: To configure Determined to use AWS S3 for checkpoint and TensorBoard storage, users need to set checkpointStorage.type in values.yaml to s3 and set checkpointStorage.bucket to the name of the bucket. The pods launched by the Determined master must have read, write, and delete access to the bucket. To enable this, users may optionally configure checkpointStorage.accessKey and checkpointStorage.secretKey. Users may also optionally configure checkpointStorage.endpointUrl, which specifies the endpoint to use for S3 clones (e.g., http://<minio-endpoint>:<minio-port|default=9000>).
GCS: To configure Determined to use Google Cloud Storage for checkpoints and TensorBoard data, users need to set checkpointStorage.type in values.yaml to gcs and set checkpointStorage.bucket to the name of the bucket. The pods launched by the Determined master must have read, write, and delete access to the bucket. For example, when launching their GKE nodes, users need to specify --scopes=storage-full to configure proper GCS access.
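The settings above might be sketched in values.yaml as follows. This assumes the dotted names map to nested YAML keys, as is usual for Helm values; the bucket name is hypothetical.

```yaml
# values.yaml (excerpt): store checkpoints and TensorBoard events in S3.
checkpointStorage:
  type: s3
  bucket: my-determined-checkpoints   # hypothetical bucket name
  # Optional: credentials, if the task pods do not already have
  # read, write, and delete access to the bucket.
  # accessKey: <your access key>
  # secretKey: <your secret key>
  # Optional: endpoint for an S3 clone such as MinIO.
  # endpointUrl: http://<minio-endpoint>:9000

# For GCS, only the type and bucket are set:
# checkpointStorage:
#   type: gcs
#   bucket: my-determined-checkpoints
```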
Default Pod Specs (Optional)¶
As described in the Determined on Kubernetes guide, when tasks (e.g., experiments, notebooks) are started in a Determined cluster running on Kubernetes, the Determined master launches pods to execute these tasks. The Determined Helm chart makes it possible to set default pod specs for all CPU and GPU tasks. The defaults can be defined in values.yaml under taskContainerDefaults.cpuPodSpec and taskContainerDefaults.gpuPodSpec. For examples of how to do this and a description of permissible fields, please see the specifying custom pod specs guide.
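A rough sketch of what such defaults could look like is shown below. It assumes that each entry takes a standard Kubernetes Pod manifest and that labels and tolerations are among the permitted fields; the custom pod specs guide is the authority on which fields are actually allowed. The label and toleration values are hypothetical.

```yaml
# values.yaml (excerpt)
taskContainerDefaults:
  cpuPodSpec:
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        team: research            # hypothetical label
  gpuPodSpec:
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        team: research            # hypothetical label
    spec:
      tolerations:                # hypothetical: tolerate a taint on GPU nodes
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
```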
Database (Optional)¶
By default, the Helm chart will deploy an instance of Postgres on the same Kubernetes cluster where Determined itself is deployed. If this is undesirable, users can configure the Helm chart to use an external Postgres database by setting db.hostAddress to the IP address of their database. If db.hostAddress is configured, the Determined Helm chart will not deploy a database.
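A minimal sketch of pointing the chart at an external database, assuming db.hostAddress maps to a nested key in values.yaml. The IP address is hypothetical.

```yaml
# values.yaml (excerpt)
db:
  # Setting this causes the chart to skip deploying its own Postgres.
  hostAddress: 10.0.0.12   # hypothetical IP address of the external Postgres server
```

Any other connection settings already present under db in values.yaml would presumably need to match the external database; they are not shown here because the text above does not name them.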
TLS (Optional)¶
By default, the Helm chart will deploy a load-balancer which makes the Determined master accessible over HTTP. To secure your cluster, Determined supports TLS encryption, which can be configured to terminate inside a load-balancer or inside the Determined master itself. To configure TLS, users should set useNodePortForMaster to true. This instructs Determined to deploy a NodePort service for the master. Users can then configure an Ingress that either performs TLS termination in the load-balancer and forwards plain text to the NodePort service, or forwards TLS-encrypted data. Please note that configuring an Ingress requires an Ingress controller running in your cluster.
TLS termination in a load-balancer (e.g., nginx). This option provides TLS encryption between the client and the load-balancer, with all communication inside the cluster performed over HTTP. To configure this option, set useNodePortForMaster to true and then configure an Ingress service to perform TLS termination and forward the plain text traffic to the Determined master.
TLS termination in the Determined master. This option provides TLS encryption inside the Kubernetes cluster. All communication with the master will be encrypted. Communication between task containers (distributed training) will not be encrypted. To configure this option, create a Kubernetes TLS secret within the namespace where Determined is being installed and set tlsSecret to the name of this secret. Users will also have to set useNodePortForMaster to true. Once the NodePort service is created, users can configure an Ingress to forward TLS-encrypted data to the NodePort service.
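The two options differ by a single setting in values.yaml, which might be sketched as follows. This assumes both keys are top-level; the secret name is hypothetical.

```yaml
# values.yaml (excerpt)
# Required for both TLS options: expose the master through a NodePort
# service so that an Ingress can sit in front of it.
useNodePortForMaster: true

# Only for TLS termination in the Determined master: the name of a
# Kubernetes TLS secret that already exists in the install namespace.
# tlsSecret: determined-tls   # hypothetical secret name
```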
The following example shows how to configure an Ingress. By default, this Ingress performs TLS termination in the load-balancer:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: determined-ingress
  annotations:
    kubernetes.io/ingress.class: "nginx"
    # Uncommenting this option instructs the created load-balancer
    # to forward TLS encrypted data to the NodePort service and
    # perform TLS termination in the Determined master. In order
    # to configure ssl-passthrough, your nginx ingress controller
    # must be running with the --enable-ssl-passthrough option enabled.
    #
    # nginx.ingress.kubernetes.io/ssl-passthrough: "true"
spec:
  tls:
  - hosts:
    - your-hostname-for-determined.ai
    secretName: your-tls-secret-name
  rules:
  - host: your-hostname-for-determined.ai
    http:
      paths:
      - path: /
        backend:
          serviceName: determined-master-service-<name for your deployment>
          servicePort: <masterPort configured in values.yaml>
Installing Determined¶
Once you have finished making configuration changes in values.yaml and Chart.yaml, Determined is ready to be installed. To install Determined, run:
helm install <name for your deployment> determined-helm-chart
determined-helm-chart is a relative path to where the Determined Helm Chart is located. It may take a few minutes for all resources to come up. If you encounter issues during installation, please refer to our list of useful kubectl commands. Helm will install Determined within the default namespace. If you wish to install Determined into a non-default namespace, add -n <namespace name> to the command shown above.
Once the installation has completed, instructions will be displayed for discovering the IP address assigned to the Determined master. The IP address can also be discovered by running kubectl get services.
Upgrading Determined¶
To upgrade Determined or to change a configuration setting, first make the appropriate changes in values.yaml and Chart.yaml, and then run:
helm upgrade <name for your deployment> --wait determined-helm-chart
Before upgrading Determined, consider pausing all active experiments. Any experiments that are active when the Determined master restarts will resume training after the upgrade, but will be rolled back to their most recent checkpoint.
Uninstalling Determined¶
To uninstall Determined run:
# Please note that if the Postgres Database was deployed by Determined, it will
# be deleted by this command, permanently removing all records of your experiments.
helm delete <name for your deployment>
# If there were any active tasks when uninstalling, this command will
# delete all of the leftover Kubernetes resources. It is recommended to
# pause all experiments prior to upgrading or uninstalling Determined.
kubectl get pods --no-headers=true -l=determined | awk '{print $1}' | xargs kubectl delete pod