SysAdmin: Deploy on Kubernetes

This document describes how Determined runs on Kubernetes. For instructions on installing Determined on Kubernetes, please see the installation guide.

In this topic guide, we will cover:

  1. How Determined works on Kubernetes.

  2. Limitations of Determined on Kubernetes.

  3. Useful Helm and Kubectl commands.

How Determined Works on Kubernetes

Installing Determined on Kubernetes deploys an instance of the Determined master and a Postgres database in the Kubernetes cluster. Once the master is up and running, users can submit experiments and launch notebooks, tensorboards, commands, and shells. When new workloads are submitted to the Determined master, the master creates pods and ConfigMaps on the Kubernetes cluster to execute those workloads. Users of Determined shouldn’t need to interact with Kubernetes directly after installation, as Determined handles all the necessary interaction with the Kubernetes cluster.

Limitations of Determined on Kubernetes

This section outlines the current limitations of Determined on Kubernetes.

Scheduling

By default, the Kubernetes scheduler does not support gang scheduling or preemption. This can be problematic for distributed deep learning workloads that require multiple pods to be scheduled before execution starts. Determined includes built-in support for the lightweight coscheduling plugin, which extends the default Kubernetes scheduler to support gang scheduling. Determined also includes support for priority-based preemption scheduling. Neither is enabled by default. For more details and instructions on how to enable them, refer to Gang Scheduling on Kubernetes and Priority Scheduling with Preemption on Kubernetes.
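
When gang scheduling is not enabled, one visible symptom can be a distributed job whose pods are only partly started. The sketch below lists Determined task pods that are still in the Pending phase. The function name `pending_task_pods` is hypothetical; the `determined` label selector is the one used for task pods later in this guide.

```shell
# Hypothetical helper name; the label selector matches the one used for
# Determined task pods elsewhere in this guide.
pending_task_pods() {
  # List Determined task pods still in the Pending phase (not yet
  # scheduled, or scheduled but with containers not yet started).
  kubectl get pods -l=determined --field-selector=status.phase=Pending
}
```

Running `pending_task_pods` after submitting a distributed experiment shows which of its pods are still waiting.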

Dynamic Agents

Determined is not able to autoscale your cluster, but equivalent functionality is available by using the Kubernetes Cluster Autoscaler, which is supported on GKE and EKS.

Useful Helm and Kubectl Commands

kubectl is a command-line tool for interacting with a Kubernetes cluster. Helm is used to install and upgrade Determined on Kubernetes. This section covers some useful kubectl and helm commands for running Determined on Kubernetes.

For all the commands listed below, include -n <kubernetes namespace name> if running Determined in a non-default namespace.
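
As a small sketch of the namespace rule above, the helper below prints the `-n` flag only when a namespace has been chosen. Both `DET_NAMESPACE` and `ns_args` are hypothetical names picked for illustration; Determined, Helm, and kubectl do not read them on their own.

```shell
# Hypothetical helper: DET_NAMESPACE is an illustrative variable name,
# not something Determined, Helm, or kubectl reads on its own.
ns_args() {
  # Print "-n <namespace>" when DET_NAMESPACE is set and non-empty,
  # and print nothing otherwise so the default namespace is used.
  if [ -n "${DET_NAMESPACE:-}" ]; then
    printf '%s %s' '-n' "$DET_NAMESPACE"
  fi
}
```

With this in place, `helm list $(ns_args)` and `kubectl get services $(ns_args)` target the chosen namespace without editing each command.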

List Installations of Determined

To list the current installations of Determined on the Kubernetes cluster:

# To list in the current namespace.
helm list

# To list in all namespaces.
helm list -A

It is recommended to have just one instance of Determined per Kubernetes cluster.

Get the IP Address of the Determined Master

To get the IP address and port of the Determined master:

# Get all services.
kubectl get services

# Get the master service. The exact name of the master service depends on
# the name given to your helm deployment, which can be looked up by running
# `helm list`.
kubectl get service determined-master-service-<helm deployment name>
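
To print only the address rather than the full service table, a JSONPath query can be used. The following is a minimal sketch, assuming the master service is of type LoadBalancer and reports an IP address; some providers report a hostname instead, in which case the IP part comes back empty. The function names are hypothetical.

```shell
# Build the master service name from a Helm release name, following the
# naming pattern shown above.
master_service_name() {
  printf 'determined-master-service-%s' "$1"
}

# Print "<external IP>:<port>" for the master service.
# Assumption: the service is of type LoadBalancer and reports an IP;
# some providers report a hostname instead, leaving the IP part empty.
master_address() {
  kubectl get service "$(master_service_name "$1")" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}:{.spec.ports[0].port}'
}
```

For a release named `my-release` (an example name), `master_address my-release` queries `determined-master-service-my-release`.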

Check the Status of the Determined Master

Logs for the Determined master are available via the CLI and WebUI. The kubectl commands below are useful for diagnosing issues that arise during installation, before the master is reachable.

# Get all deployments.
kubectl get deployments

# Describe the current state of the Determined master deployment. The exact
# name of the master deployment depends on the name given to your helm
# deployment, which can be looked up by running `helm list`.
kubectl describe deployment determined-master-deployment-<helm deployment name>

# Get all pods associated with the Determined master deployment. Note this
# will only include pods that are running the Determined master, not pods
# that are running tasks associated with Determined workloads.
kubectl get pods -l=app=determined-master-<helm deployment name>

# Get logs for the pod running the Determined master.
kubectl logs <determined-master-pod-name>
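
The checks above can be combined into one step. This sketch waits for the master deployment to finish rolling out and then prints recent log lines; the function name `master_status`, the 120 second timeout, and the 50 line tail are illustrative choices, not values taken from Determined.

```shell
# Hypothetical helper: wait for the master deployment, then show recent logs.
master_status() {
  release="$1"
  deployment="determined-master-deployment-${release}"
  # Block until the rollout completes, giving up after 120 seconds.
  kubectl rollout status "deployment/${deployment}" --timeout=120s || return 1
  # Print the most recent log lines from a pod in that deployment.
  kubectl logs "deployment/${deployment}" --tail=50
}
```

Call it with the Helm release name, for example `master_status my-release`. A non-zero exit status means the rollout did not complete within the timeout.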

Get All the Running Task Pods

These kubectl commands list and delete pods which are running Determined tasks:

# Get all pods that are running Determined tasks.
kubectl get pods -l=determined

# Delete all Determined task pods. Users should never have to run this,
# unless they are removing a deployment of Determined.
kubectl delete pods -l=determined
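
Before deleting anything, it can help to check how many pods the selector matches. A minimal sketch, with a hypothetical function name:

```shell
# Hypothetical helper: count the pods carrying the "determined" label.
task_pod_count() {
  # -o name prints one "pod/<name>" line per pod; awk counts the lines.
  kubectl get pods -l=determined -o name | awk 'END { print NR }'
}
```

A count of 0 means there are no task pods to remove.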

After installing Determined on Kubernetes, I can’t reach the Determined master

Useful steps for debugging this include:

# Get the name of the Helm deployment.
helm list

# Double check the IP address and port assigned to the Determined master by looking up the master service.
kubectl get service determined-master-service-<helm deployment name>

# Check the status of the master deployment.
kubectl describe deployment determined-master-deployment-<helm deployment name>

# Check the logs of the master pod.
kubectl logs <determined-master-pod-name>
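
If the master pod looks healthy but its external address is still unreachable, forwarding a local port to the service helps separate a master problem from a networking problem. This is a sketch under assumptions: the function name is hypothetical, and the default of port 8080 is assumed rather than confirmed by this guide, so pass your configured master port if it differs.

```shell
# Hypothetical helper. Assumption: the master listens on port 8080; pass a
# different port as the second argument if your installation differs.
forward_master() {
  release="$1"
  port="${2:-8080}"
  # Forward localhost:<port> to the same port on the master service.
  kubectl port-forward "service/determined-master-service-${release}" "${port}:${port}"
}
```

While `forward_master my-release` is running, the master should answer on `localhost:8080`. If it does, the problem most likely lies in how the service is exposed rather than in the master itself.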