Determined on Kubernetes¶

This guide describes how the Determined deep learning platform works on Kubernetes. For instructions on installing Determined on Kubernetes, please see the installation guide.

In this topic guide, we will cover:

How Determined works on Kubernetes.
Limitations of Determined on Kubernetes.
Useful Helm and Kubectl commands.

How Determined Works on Kubernetes¶

Installing Determined on Kubernetes deploys an instance of the Determined master and a Postgres database in the Kubernetes cluster. Once the master is up and running, users can submit experiments and launch notebooks, tensorboards, commands, and shells. When new workloads are submitted to the Determined master, the master launches pods and configMaps on the Kubernetes cluster to execute those workloads. Users of Determined shouldn’t need to interact with Kubernetes directly after installation, as Determined handles all the necessary interaction with the Kubernetes cluster.

Limitations of Determined on Kubernetes¶

Deploying Determined on Kubernetes is currently in beta and under active development. This section outlines the current limitations of Determined on Kubernetes. Many of these limitations will be addressed in future releases of Determined.

Scheduling¶

Determined on Kubernetes does not currently support the scheduling policies that are available when deploying Determined on VMs. These policies include: priority scheduling, fair sharing resources across experiments, and gang-scheduling for distributed training. Determined relies on Kubernetes to handle scheduling, which does not natively support these scheduling policies.

Distributed training experiments that use multiple pods require all pods to be scheduled and running in order to make progress. Due to the lack of gang-scheduling in Kubernetes, when running distributed training experiments it is possible to deadlock the Kubernetes cluster such that none of the experiments will make any progress. For example, if you have a cluster with three 4-GPU nodes, scheduling an experiment that requires four such nodes will deadlock the cluster. Three pods will start up on the available nodes and occupy all of their GPUs while waiting for the fourth pod to launch before training can start. Because the fourth pod will never start (due to insufficient resources), the job will never make progress. Similarly, if you launch two experiments simultaneously that both attempt to use 12 GPUs on a cluster with only 12 GPUs, it is likely that Kubernetes will assign some of the GPUs to one experiment and some GPUs to the other. Because neither experiment will receive the resources it needs to begin executing, the system will wait indefinitely.

To avoid deadlocking your cluster, we recommend enabling the cluster autoscaler if possible. If a potential deadlock is detected, a warning will be displayed in the trial logs. Upon encountering a deadlock, users should pause, cancel, or kill one or more of the deadlocked experiments.

Dynamic Agents¶

Determined on AWS and GCP autoscales compute resources. Determined on Kubernetes does not yet offer this functionality. Users are encouraged to use the Kubernetes Cluster Autoscaler which is supported on GKE and EKS.

Useful Helm and Kubectl Commands¶

kubectl is a command-line tool for interacting with a Kubernetes cluster. Helm is used to install and upgrade Determined on Kubernetes. This section covers some of the useful kubectl and helm commands when running Determined on Kubernetes.

For all the commands listed below, include -n <kubernetes namespace name> if running Determined in a non-default namespace.

List Installations of Determined¶

To list the current installation of Determined on the Kubernetes cluster:

# To list in the current namespace.
helm list

# To list in all namespaces.
helm list -A

It is recommended to have just one instance of Determined per Kubernetes cluster.

Get the IP Address of the Determined Master¶

To get the IP and port address of the Determined master:

# Get all services.
kubectl get services

# Get the master service. The exact name of the master service depends on
# the name given to your helm deployment, which can be looked up by running
# ``helm list``.
kubectl get service determined-master-service-<helm deployment name>

Check the Status of the Determined Master¶

Logs for the Determined master are available via the CLI and WebUI. Kubectl commands are useful for diagnosing any issues that arise during installation.

# Get all deployments.
kubectl get deployments

# Describe the current state of Determined master deployment. The exact name
# of the master deployment depends on the name given to your helm deploy
# which can be looked up by running `helm list`.
kubectl describe deployment determined-master-deployment-<helm deployment name>

# Get all pods associated with the Determined master deployment. Note this
# will only include pods that are running the Determined master, not pods
# that are running tasks associated with Determined workloads.
kubectl get pods -l=app=determined-master-<helm deployment name>

# Get logs for the pod running the Determined master.
kubectl logs <determined-master-pod-name>

Get All the Running Task Pods¶

These kubectl commands list and delete pods which are running Determined tasks:

# Get all pods that are running Determined tasks.
kubectl get pods -l=determined

# Delete all Determined task pods. Users should never have to run this,
# unless they are removing a deployment of Determined.
kubectl get pods --no-headers=true -l=determined | awk '{print $1}' | xargs kubectl delete pod