Operating PEDL with Kubernetes

This document describes how to install, configure, and upgrade a PEDL deployment that is running on Kubernetes.

Concepts

In a standard ("bare metal") installation of PEDL, each PEDL agent runs workloads by launching containers via the local Docker daemon on each agent machine. Some customers prefer using PEDL in this mode, because it does not require installing or configuring a third-party cluster manager or container orchestration system.

PEDL also supports Kubernetes-based deployments; this can be convenient for customers that already use Kubernetes for container orchestration. In this mode, PEDL is installed as a Helm package.

Currently, in Kubernetes-based deployments, containers are completely managed by PEDL and do not appear as Kubernetes pods. PEDL workloads must therefore be managed via PEDL itself rather than through Kubernetes.

Prerequisites

  • Kubernetes 1.8+
  • Kubernetes nodes with Nvidia GPU drivers and Nvidia Docker installed

Some managed Kubernetes offerings, such as Google Kubernetes Engine (GKE), use lean host images that may not include the Nvidia Docker features required by PEDL. In that case, a separate Nvidia Docker installation may be required.

The following pod definition can be used to test if a cluster currently supports installing PEDL:

apiVersion: v1
kind: Pod
metadata:
  name: test-pedl-support
spec:
  # Run the check once; do not restart the pod after the command exits.
  restartPolicy: Never
  containers:
  - name: docker
    image: docker
    # Use the node's Docker daemon (mounted below) to launch a GPU
    # container and run nvidia-smi, verifying Nvidia Docker support.
    command:
    - docker
    - run
    - nvidia/cuda
    - nvidia-smi
    volumeMounts:
    - name: docker-socket
      mountPath: /var/run/docker.sock
  volumes:
  - name: docker-socket
    hostPath:
      path: /var/run/docker.sock
      type: Socket
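
To run the check, save the definition to a file (the file name below is arbitrary), create the pod, and inspect its logs. On a correctly configured node, the logs will show nvidia-smi output listing the node's GPUs:

$ kubectl create -f test-pedl-support.yaml
$ kubectl logs test-pedl-support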

Installing the Chart

To install the chart with the release name my-release:

$ helm install --name my-release deploy/kubernetes/pedl-*.tgz

This command deploys the PEDL master and an accompanying PostgreSQL database on a Kubernetes cluster. The configuration section lists possible options.

Tip: List all releases using helm list.

PEDL agent pods are created separately from the Helm installation. To add PEDL agents to the cluster, label the desired nodes with determined.ai/pedl-agent=present.

For example,

kubectl label nodes worker-gpu-0 determined.ai/pedl-agent=present

To remove an agent from a node, remove the agent label from the node:

kubectl label nodes <node-name> determined.ai/pedl-agent-

PEDL agents presume they have complete control over the resources of a node. We advise tainting nodes that run PEDL agents so that non-PEDL workloads are not scheduled onto them. We recommend using the following taint, which PEDL agents tolerate by default:

key: determined.ai/pedl-agent
value: present
effect: NoSchedule
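
For example, to apply this taint to a node using kubectl:

kubectl taint nodes worker-gpu-0 determined.ai/pedl-agent=present:NoSchedule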

Upgrading the Chart

To upgrade the my-release release to a new version of the chart:

helm upgrade my-release deploy/kubernetes/pedl-*.tgz

Uninstalling the Chart

To uninstall/delete the my-release release:

helm delete my-release

Configuring the Chart

To configure a chart, supply a YAML file of chart values when upgrading the release:

helm upgrade my-release deploy/kubernetes/pedl-*.tgz --values values.yaml
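
For example, a minimal values.yaml that lets agents schedule tasks on CPUs (via the agent.enableCPUScheduling parameter; see the Configuration Reference below):

# values.yaml
agent:
  enableCPUScheduling: true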

Persistence

The PEDL chart mounts a Persistent Volume for the PostgreSQL instance.

If the PersistentVolumeClaim should not be managed by the chart, set postgresql.persistence.existingClaim to an existing claim:

  1. Create the PersistentVolume
  2. Create the PersistentVolumeClaim
  3. Install and configure the chart with the existing claim:

$ helm install --set postgresql.persistence.existingClaim=PVC_NAME deploy/kubernetes/pedl-*.tgz
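
If creating the claim by hand, a minimal sketch might look like the following. The claim name is arbitrary; the size and access mode shown match the chart defaults for persistence.size and persistence.accessMode:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pedl-postgresql
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi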

Setting Resource Limits and Requests

Resource limits can be placed on the PEDL master, its database, and agents by setting the corresponding chart values:

# Resource limits and requests for the PEDL master
resources:
  limits:
    cpu: 8
    memory: 16Gi
  requests:
    cpu: 4
    memory: 8Gi

# Resource limits and requests for PostgreSQL
postgresql:
  resources:
    limits:
      cpu: 8
      memory: 32Gi
    requests:
      cpu: 4
      memory: 16Gi

# Resource limits and requests for PEDL agents
agent:
  resources:
    limits:
      cpu: 1
      memory: 2Gi
    requests:
      cpu: 0.1
      memory: 256Mi

Configuring Network Proxies

Network proxying can be configured by setting environment variables on the PEDL agent.

For example, the following chart values will configure the PEDL agent to use http://proxy.com for all HTTP requests except requests to foo.external.com and bar.external.org:

agent:
  env:
  - name: PEDL_HTTP_PROXY
    value: http://proxy.com
  - name: PEDL_NO_PROXY
    value: foo.external.com,bar.external.org

The System Administration Guide contains more details.

Accessing the PEDL Master

By default, the PEDL master is only accessible via its cluster IP. It can be reached from outside the Kubernetes cluster with the kubectl port-forward command.
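
For example, assuming the master's service is named my-release-pedl-master (the actual name depends on the Helm release name; helm status my-release will show it), the following forwards local port 8080 to the master's default service port:

$ kubectl port-forward service/my-release-pedl-master 8080:8080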

Alternatively, the master can be exposed via a node port, a load balancer, or a cluster ingress. The latter two options require that the Kubernetes cluster be configured to work with a cloud provider.

A node port is a port that is bound on every node in the Kubernetes cluster. The master can be made accessible via a node port by setting the chart parameter service.type to NodePort and service.nodePort to the desired port.
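
For example, the following chart values expose the master on node port 30080 (the port chosen here is arbitrary but must fall within the cluster's configured node port range):

service:
  type: NodePort
  nodePort: 30080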

The master can be configured to be accessible via a cloud provider's load balancer by setting the parameter service.type to LoadBalancer.
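
For example (service.loadBalancerIP is optional, and the address shown is a placeholder):

service:
  type: LoadBalancer
  loadBalancerIP: 203.0.113.10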

The master can be configured to use a cloud ingress by setting the parameter ingress.enabled to true and setting the appropriate values for ingress.annotations, ingress.hosts and ingress.tls. The exact settings depend on the cloud provider.
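
As a sketch, assuming an nginx ingress controller and a hypothetical hostname and TLS secret:

ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: nginx
  hosts:
  - pedl.example.com
  tls:
  - secretName: pedl-tls
    hosts:
    - pedl.example.com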

For more details, see the Kubernetes documentation on Publishing Services and Ingresses.

Configuration Reference

The following table lists all the configurable parameters of the PEDL chart and their default values.

Parameter Description Default
masterImage PEDL master image repository determinedai/pedl-master
masterAddress PEDL master IP address (if not set, use service IP) nil
agentImage PEDL agent image repository determinedai/pedl-agent
agent.enableCPUScheduling Enable agents to schedule tasks on CPUs false
agent.env PEDL agent environment variables []
agent.resources PEDL agent resource limits and requests limits: {cpu: 1, memory: 2Gi}, requests: {cpu: 0.1, memory: 256Mi}
agent.tolerations Taint tolerations for the agent [{key: determined.ai/pedl-agent, value: present, effect: NoSchedule}]
trialRunner.network The Docker network mode of the trial runner default
trialRunner.uid The UID to use when running a trial runner container nil
trialRunner.gid The GID to use when running a trial runner container nil
imagePullPolicy Image pull policy IfNotPresent
resources PEDL master resource limits and requests limits: {cpu: 8, memory: 16Gi}, requests: {cpu: 4, memory: 8Gi}
tolerations Taint tolerations for PEDL master []
nodeSelector Node labels for pod assignment {}
registry.server Determined AI registry server https://index.docker.io/v1/
registry.user Determined AI registry username determinedaicustomer
registry.password Determined AI registry password aPGMABpTTW6Aj2LtseRZCnVD9W3kJvtsJNVzrapD
registry.email Determined AI registry email hello@determined.ai
service.type ClusterIP, NodePort, or LoadBalancer ClusterIP
service.port Port available to pods in the cluster 8080
service.externalIPs External IP addresses connected to the service nil
service.nodePort Exposed node port for service type NodePort nil
service.clusterIP Manual IP address assigned to the service nil
service.loadBalancerIP Manual IP address assigned to the load balancer nil
service.loadBalancerSourceRanges Restrict traffic through the load balancer to the client IP ranges nil
service.annotations Additional annotations to append to the service nil
ingress.enabled Enable ingress controller resource false
ingress.annotations Specify ingress class nil
ingress.hosts PEDL master hostnames nil
ingress.tls TLS certificates associated with an Ingress nil

PostgreSQL Configuration

The following table lists the configurable parameters of the PostgreSQL dependency and their default values.

N.B.: these configurations should be under the postgresql prefix (e.g., postgresql.resources).

Parameter Description Default
image.registry postgresql image registry docker.io
image.repository postgresql image repository bitnami/postgresql
image.tag postgresql image tag 10.8.0
image.pullPolicy Image pull policy IfNotPresent
image.pullSecrets Image pull secrets nil
resources Postgres resource limits and requests limits: {cpu: 8, memory: 32Gi}, requests: {cpu: 4, memory: 16Gi}
postgresqlPassword Password for admin user postgres
postgresqlDatabase Name for new database to create pedl
postgresqlInitdbArgs Initdb Arguments nil
schedulerName Name of an alternate scheduler nil
postgresqlExtendedConfig Additional Runtime Config Parameters {maxConnections: 2000, sharedBuffers: 512MB}
persistence.enabled Use a PVC to persist data true
persistence.existingClaim Provide an existing PersistentVolumeClaim nil
persistence.storageClass Storage class of backing PVC nil (uses alpha storage class annotation)
persistence.accessMode Access mode for PostgreSQL volume ReadWriteOnce
persistence.annotations Persistent Volume annotations {}
persistence.size Size of data volume 100Gi
persistence.subPath Subdirectory of the volume to mount at ""
persistence.mountPath Mount path of data volume /bitnami/postgresql
service.externalIPs External IPs to listen on []
service.port TCP port 5432