Operating PEDL with Kubernetes

This document describes how to install, configure, and upgrade a PEDL deployment that is running on Kubernetes.

Concepts

In a standard ("bare metal") installation of PEDL, each PEDL agent runs workloads by launching containers via the local Docker daemon on each agent machine. Some customers prefer this mode because it does not require installing or configuring a third-party cluster manager or container orchestration system.

PEDL also supports Kubernetes-based deployments; this can be convenient for customers that already use Kubernetes for container orchestration. In this mode, PEDL is installed as a Helm package.

Currently, in Kubernetes-based deployments, containers are managed entirely by PEDL and do not appear as Kubernetes pods. PEDL workloads must therefore be managed via PEDL itself rather than through Kubernetes.

Prerequisites

  • Kubernetes 1.8+
  • Kubernetes nodes with NVIDIA GPU drivers and nvidia-docker installed

Some Kubernetes configurations, such as Google Container Engine, have lean host configurations that may not provide the nvidia-docker features that PEDL requires.

The following pod definition can be used to test whether a cluster supports PEDL:

apiVersion: v1
kind: Pod
metadata:
  name: test-pedl-support
spec:
  # Run the test once; without this, Kubernetes would restart the
  # container indefinitely after it exits.
  restartPolicy: Never
  containers:
  - name: docker
    image: docker
    # Launch a CUDA container via the node's Docker daemon; nvidia-smi
    # succeeds only if the NVIDIA drivers and nvidia-docker are in place.
    command:
    - docker
    - run
    - nvidia/cuda
    - nvidia-smi
    volumeMounts:
    - name: docker-socket
      mountPath: /var/run/docker.sock
  volumes:
  # Expose the host's Docker socket so the docker CLI above can talk to
  # the node's Docker daemon, just as PEDL agents do.
  - name: docker-socket
    hostPath:
      path: /var/run/docker.sock
      type: Socket
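
Assuming the definition above is saved as test-pedl-support.yaml (an illustrative filename), the test can be run and its output inspected with:

kubectl apply -f test-pedl-support.yaml
kubectl logs test-pedl-support

If the cluster supports PEDL, the logs show the usual nvidia-smi device table; otherwise the container exits with an error. The pod can then be removed with kubectl delete pod test-pedl-support.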

Installing the Chart

To install the chart with the release name my-release:

$ helm install --name my-release deploy/kubernetes/pedl-*.tgz

This command deploys the PEDL master and an accompanying PostgreSQL database on the Kubernetes cluster. The Configuration section below lists the available configuration parameters.

Tip: List all releases using helm list.
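
Configuration parameters (see the Configuration section below) can also be overridden at install time with --set. As a sketch, to expose the master through a LoadBalancer service:

$ helm install --name my-release --set service.type=LoadBalancer deploy/kubernetes/pedl-*.tgz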

PEDL agent pods are created separately from the Helm installation, by labeling nodes.

To add PEDL agents to the cluster, label the desired nodes with determined.ai/pedl-agent=present. For example,

kubectl label nodes worker-gpu-0 determined.ai/pedl-agent=present

To remove an agent from a node, remove the agent label from the node:

kubectl label nodes <node-name> determined.ai/pedl-agent-

Since PEDL agents presume they have complete control over the resources of the node, we advise tainting nodes that run PEDL agents to prevent non-PEDL workloads from being scheduled onto them. We recommend the following taint, which PEDL agents tolerate by default:

key: determined.ai/pedl-agent
value: present
effect: NoSchedule
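
For example, this taint can be applied with kubectl to the node labeled above:

kubectl taint nodes worker-gpu-0 determined.ai/pedl-agent=present:NoSchedule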

Upgrading the Chart

To upgrade the my-release deployment:

$ helm upgrade my-release deploy/kubernetes/pedl-*.tgz
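
Configuration values can be overridden during an upgrade as well. A sketch, using a parameter from the Configuration section below:

$ helm upgrade my-release deploy/kubernetes/pedl-*.tgz --set agent.enableCPUScheduling=true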

Uninstalling the Chart

To uninstall/delete the my-release deployment:

$ helm delete my-release
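
In Helm 2, a deleted release name remains registered; to delete the release and free the name for reuse, pass --purge:

$ helm delete --purge my-release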

Configuration

The following table lists the configurable parameters of the PEDL chart and their default values.

Parameter | Description | Default
masterImage | PEDL master image repository | determinedai/pedl-master
agentImage | PEDL agent image repository | determinedai/pedl-agent
agent.enableCPUScheduling | Enable agents to schedule tasks on CPUs | false
agent.env | PEDL agent environment variables | []
agent.resources | PEDL agent resource limits and requests | limits: {cpu: 1, memory: 2Gi}, requests: {cpu: 0.1, memory: 256Mi}
agent.tolerations | Taint tolerations for the agent | [{key: nvidia.com/gpu, value: present, effect: NoSchedule}]
trialRunner.network | The Docker network mode of the trial runner | default
trialRunner.uid | The UID to use when running a trial runner container | nil
trialRunner.gid | The GID to use when running a trial runner container | nil
imagePullPolicy | Image pull policy | IfNotPresent
resources | PEDL master resource limits and requests | limits: {cpu: 8, memory: 16Gi}, requests: {cpu: 4, memory: 8Gi}
tolerations | Taint tolerations for the PEDL master | []
nodeSelector | Node labels for pod assignment | {}
registry.server | Determined AI registry server | https://index.docker.io/v1/
registry.user | Determined AI registry username | determinedaicustomer
registry.password | Determined AI registry password | aPGMABpTTW6Aj2LtseRZCnVD9W3kJvtsJNVzrapD
registry.email | Determined AI registry email | hello@determined.ai
service.type | ClusterIP, NodePort, or LoadBalancer | ClusterIP
service.port | External port available to pods in the cluster | 8080
service.externalIPs | External IP addresses connected to the service | nil
service.nodePort | Exposed node port for service type NodePort | nil
service.clusterIP | Manual IP address assigned to the service | nil
service.loadBalancerIP | Manual IP address assigned to the load balancer | nil
service.loadBalancerSourceRanges | Restrict traffic through the load balancer to the given client IP ranges | nil
service.annotations | Additional annotations to append to the service | nil
ingress.enabled | Enable ingress controller resource | false
ingress.annotations | Specify ingress class | nil
ingress.hosts | PEDL master hostnames | nil
ingress.tls | TLS certificates associated with an Ingress | nil
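
These parameters can also be collected in a values file and passed to Helm with -f. A minimal sketch, assuming the dotted names above map to nested YAML keys in the usual Helm fashion (the file name values-override.yaml and the values shown are illustrative):

masterImage: determinedai/pedl-master
agent:
  enableCPUScheduling: true     # allow agents to schedule CPU-only tasks
service:
  type: LoadBalancer            # expose the master outside the cluster
  port: 8080

$ helm install --name my-release -f values-override.yaml deploy/kubernetes/pedl-*.tgz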

PostgreSQL Configuration

The following table lists the configurable parameters of the PostgreSQL dependency and their default values.

N.B.: these configurations should be under the postgresql prefix (e.g., postgresql.resources).

Parameter | Description | Default
image.registry | PostgreSQL image registry | docker.io
image.repository | PostgreSQL image repository | bitnami/postgresql
image.tag | PostgreSQL image tag | 10.8.0
image.pullPolicy | Image pull policy | IfNotPresent
image.pullSecrets | Image pull secrets | nil
resources | PostgreSQL resource limits and requests | limits: {cpu: 8, memory: 32Gi}, requests: {cpu: 4, memory: 16Gi}
postgresqlPassword | Password for the admin user | postgres
postgresqlDatabase | Name of the database to create | pedl
postgresqlInitdbArgs | initdb arguments | nil
schedulerName | Name of an alternate scheduler | nil
postgresqlExtendedConfig | Additional runtime configuration parameters | {maxConnections: 2000, sharedBuffers: 512MB}
persistence.enabled | Use a PVC to persist data | true
persistence.existingClaim | Provide an existing PersistentVolumeClaim | nil
persistence.storageClass | Storage class of the backing PVC | nil (uses alpha storage class annotation)
persistence.accessMode | Access mode for the PostgreSQL volume | ReadWriteOnce
persistence.annotations | Persistent Volume annotations | {}
persistence.size | Size of the data volume | 100Gi
persistence.subPath | Subdirectory of the volume to mount at | ""
persistence.mountPath | Mount path of the data volume | /bitnami/postgresql
service.externalIPs | External IPs to listen on | []
service.port | TCP port | 5432
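
For example, a sketch of a values file overriding a few of these defaults under the postgresql prefix (the specific values are illustrative):

postgresql:
  postgresqlPassword: change-me   # illustrative; use a proper secret
  persistence:
    size: 200Gi                   # grow the data volume beyond the 100Gi default
  resources:
    requests:
      cpu: 2
      memory: 8Gi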

Persistence

The PEDL chart mounts a Persistent Volume for the PostgreSQL instance. If the PersistentVolumeClaim should not be managed by the chart, define postgresql.persistence.existingClaim.

Existing PersistentVolumeClaims

  1. Create the PersistentVolume
  2. Create the PersistentVolumeClaim
  3. Install the chart
$ helm install --name my-release --set postgresql.persistence.existingClaim=PVC_NAME deploy/kubernetes/pedl-*.tgz
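
For step 2, a minimal sketch of a claim (the name pedl-postgresql and the storage class standard are assumptions; they must match the PersistentVolume created in step 1):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pedl-postgresql
spec:
  accessModes:
  - ReadWriteOnce            # matches the chart's default access mode
  storageClassName: standard # assumed; match your PersistentVolume
  resources:
    requests:
      storage: 100Gi         # matches the chart's default volume size

PVC_NAME in the install command above would then be pedl-postgresql.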