Set up and Manage an AWS Kubernetes (EKS) Cluster

Determined can be installed on a cluster that is hosted on a managed Kubernetes service such as Amazon EKS. This document describes how to set up an EKS cluster with GPU-enabled nodes. The recommended setup includes deploying a cluster with a single non-GPU node that will host the Determined master and database, and an autoscaling group of GPU nodes. After creating a suitable EKS cluster, you can then proceed with the standard instructions for installing Determined on Kubernetes.

Determined requires GPU-enabled nodes and a Kubernetes version between 1.19 and 1.21, inclusive, though later versions may work.

Prerequisites

Before setting up an EKS cluster, install the latest versions of the AWS CLI, kubectl, and eksctl on your local machine.

Additionally, make sure you are subscribed to the EKS-optimized AMI with GPU support; continuing without subscribing will cause node creation to fail.
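A quick way to confirm the prerequisite tools are available is to loop over them with command -v; this sketch only checks that each binary is on the PATH, not that it is the latest version:

```shell
# Check that the required CLI tools are on the PATH before proceeding.
for tool in aws kubectl eksctl; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: installed"
  else
    echo "$tool: MISSING - install it before continuing"
  fi
done
```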

Create an S3 Bucket

One resource that eksctl does not automatically create is an S3 bucket, which is necessary for Determined to store checkpoints. To quickly create an S3 bucket, use the command:

aws s3 mb s3://<bucket-name>

The bucket name needs to be specified in both the eksctl cluster config as well as the Determined Helm chart.
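As an optional sanity check, you can verify the bucket exists and is reachable with your current credentials; head-bucket exits non-zero if the bucket is missing or access is denied. This is a sketch with a placeholder bucket name:

```shell
# Verify the bucket is reachable. Substitute your actual bucket name.
BUCKET="<bucket-name>"
if command -v aws >/dev/null 2>&1; then
  aws s3api head-bucket --bucket "$BUCKET" 2>/dev/null \
    && echo "bucket reachable" \
    || echo "bucket not reachable (check the name and your credentials)"
else
  echo "aws CLI not installed"
fi
```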

Create the Cluster

The quickest and easiest way to deploy an EKS cluster is with eksctl. eksctl supports cluster creation with either command line arguments or a cluster config file. Below is a template config that deploys a managed node group for Determined’s master instance, as well as an autoscaling GPU node group for workers. To fill in the template, insert the cluster name and S3 bucket name.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: <cluster-name> # Specify your cluster name here
  region: us-west-2 # The default region is us-west-2
  version: "1.19" # 1.20 and 1.21 are also supported

# Cluster availability zones must be explicitly named in order for single availability zone node groups to work.
availabilityZones:
  - "us-west-2b"
  - "us-west-2c"
  - "us-west-2d"

iam:
  withOIDC: true # Enables the IAM OIDC provider
  serviceAccounts:
  - metadata:
      name: checkpoint-storage-s3-bucket
      # If no namespace is set, "default" will be used.
      # Namespace will be created if it does not already exist.
      namespace: default
      labels:
        aws-usage: "determined-checkpoint-storage"
    attachPolicy: # Inline policy can be defined along with `attachPolicyARNs`
      Version: "2012-10-17"
      Statement:
      - Effect: Allow
        Action:
        - "s3:ListBucket"
        Resource: 'arn:aws:s3:::<bucket-name>' # Name of the previously created bucket
      - Effect: Allow
        Action:
        - "s3:GetObject"
        - "s3:PutObject"
        - "s3:DeleteObject"
        Resource: 'arn:aws:s3:::<bucket-name>/*'
  - metadata:
      name: cluster-autoscaler
      namespace: kube-system
      labels:
        aws-usage: "determined-cluster-autoscaler"
    attachPolicy:
      Version: "2012-10-17"
      Statement:
      - Effect: Allow
        Action:
        - "autoscaling:DescribeAutoScalingGroups"
        - "autoscaling:DescribeAutoScalingInstances"
        - "autoscaling:DescribeLaunchConfigurations"
        - "autoscaling:DescribeTags"
        - "autoscaling:SetDesiredCapacity"
        - "autoscaling:TerminateInstanceInAutoScalingGroup"
        - "ec2:DescribeLaunchTemplateVersions"
        Resource: '*'

managedNodeGroups:
  - name: managed-m5-2xlarge
    instanceType: m5.2xlarge
    availabilityZones:
      - us-west-2b
      - us-west-2c
      - us-west-2d
    minSize: 1
    maxSize: 2
    volumeSize: 200
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
    ssh:
      allow: true # will use ~/.ssh/id_rsa.pub as the default ssh key
    labels:
      nodegroup-type: m5.2xlarge
      nodegroup-role: cpu-worker
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/user-eks: "owned"
      k8s.io/cluster-autoscaler/node-template/label/nodegroup-type: m5.2xlarge
      k8s.io/cluster-autoscaler/node-template/label/nodegroup-role: cpu-worker

nodeGroups:
  - name: g4dn-metal-us-west-2b
    instanceType: g4dn.metal # 8 GPUs per machine
    # Restrict to a single AZ to optimize data transfer between instances
    availabilityZones:
      - us-west-2b
    minSize: 0
    maxSize: 2
    volumeSize: 200
    volumeType: gp2
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
    ssh:
      allow: true # This will use ~/.ssh/id_rsa.pub as the default ssh key.
    labels:
      nodegroup-type: g4dn.metal-us-west-2b
      nodegroup-role: gpu-worker
      # https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#special-note-on-gpu-instances
      k8s.amazonaws.com/accelerator: nvidia-tesla-t4
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/user-eks: "owned"
      k8s.io/cluster-autoscaler/node-template/label/nodegroup-type: g4dn.metal-us-west-2b
      k8s.io/cluster-autoscaler/node-template/label/nodegroup-role: gpu-worker

The cluster specified above allows users to run experiments on untainted g4dn.metal instances with minor additions to their experiment configs. To create a cluster with tainted instances, see the Tainting Nodes section below.

To launch the cluster with eksctl, run:

eksctl create cluster --config-file <cluster config yaml>
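Cluster creation can take 15 minutes or more. Once eksctl finishes, one way to confirm that the managed node group registered is to list the nodes; this sketch is guarded so it degrades gracefully if kubectl or cluster access is unavailable:

```shell
# Confirm the cluster's nodes have registered.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes -o wide 2>/dev/null || echo "cluster not reachable yet"
else
  echo "kubectl not installed"
fi
```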

Note

For an experiment to run, its config must be modified to specify a service account for S3 access. An example of this is provided in the Configuring Per-Task Pod Specs section of the Customize a Pod guide.

Create a kubeconfig for EKS

After creating the cluster, use kubectl to deploy applications. For kubectl to work with EKS, you need to create or update the cluster kubeconfig. This can be done with the command:

aws eks --region <region-code> update-kubeconfig --name <cluster_name>
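To confirm kubectl now points at the new cluster, you can print the current context; the context name is generated by the AWS CLI from the cluster ARN. A guarded sketch:

```shell
# Print the active kubectl context after updating the kubeconfig.
if command -v kubectl >/dev/null 2>&1; then
  kubectl config current-context || echo "no current context configured"
else
  echo "kubectl not installed"
fi
```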

Enable GPU support

To use GPU instances, the NVIDIA Kubernetes device plugin needs to be installed. Use the following command to install the plugin:

# Deploy a DaemonSet that enables the GPUs.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
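After the device plugin pods start, GPU nodes should advertise an allocatable nvidia.com/gpu resource. A hedged verification sketch (requires cluster access):

```shell
# Look for nvidia.com/gpu in the node resource listings.
if command -v kubectl >/dev/null 2>&1; then
  kubectl describe nodes 2>/dev/null | grep -i "nvidia.com/gpu" \
    || echo "no GPU resources visible yet"
else
  echo "kubectl not installed"
fi
```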

Enable Autoscaler

Lastly, EKS requires manual deployment of an autoscaler. Save the following configuration in a new file such as determined-autoscaler.yaml:

You will need to update <cluster-autoscaler-image> to match the major and minor version numbers of your Kubernetes cluster. For example, if you are using Kubernetes 1.20, use the cluster-autoscaler version 1.20 image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0

For a full list of cluster-autoscaler releases see here: https://github.com/kubernetes/autoscaler/releases

After finding the particular release you want, click on the release and scroll to the bottom to see a list of image URLs. Example: https://github.com/kubernetes/autoscaler/releases/tag/cluster-autoscaler-1.20.0

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8085'
    spec:
      serviceAccountName: cluster-autoscaler
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: "Equal"
          value: "true"
          effect: NoSchedule
      containers:
        - image: <cluster-autoscaler-image>  # See https://github.com/kubernetes/autoscaler/releases
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --scale-down-delay-after-add=5m
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-bundle.crt"

To deploy an autoscaler that works with Determined, apply the official autoscaler configuration first, then apply the custom determined-autoscaler.yaml.

# Apply the official autoscaler configuration
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-run-on-control-plane.yaml

# Apply the custom deployment
kubectl apply -f <cluster-autoscaler yaml, e.g. `determined-autoscaler.yaml`>
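To verify the autoscaler came up, you can check its pods (the deployment above labels them app: cluster-autoscaler) and skim the logs for ASG discovery messages. A sketch that requires cluster access:

```shell
# Check the autoscaler pods and recent logs.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -n kube-system -l app=cluster-autoscaler 2>/dev/null \
    || echo "autoscaler pods not found"
  kubectl logs deployment/cluster-autoscaler -n kube-system --tail=20 2>/dev/null \
    || echo "logs unavailable"
else
  echo "kubectl not installed"
fi
```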

Change the Experiment Configuration

To run an experiment on EKS, two additions must be made to the experiment config: a service account must be specified to allow Determined to save checkpoints to S3, and, if there are tainted nodes, tolerations must be listed for the experiment to be scheduled. An example of the necessary changes is shown here:

environment:
  pod_spec:
    ...
    spec:
      ...
      serviceAccountName: checkpoint-storage-s3-bucket
      # Tolerations should only be included if nodes are tainted
      tolerations:
        - key: <tainted-group-key, e.g. g4dn.metal-us-west-2b>
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"

Details about pod configuration can be found in Per-task Pod Specs.

Make Changes to Determined

Following the deployment of EKS, make sure that the necessary changes to Determined have been applied in order to successfully run experiments. These changes include adding the created S3 bucket to Determined’s Helm chart and specifying a service account in the default pod specs. When modifying the Helm chart to include S3, no keys or endpoint URLs are needed. Additionally, if running on tainted nodes, be sure to add pod tolerations to the experiment spec so that the pods can be scheduled.
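The Helm chart change described above might look like the following values fragment. The checkpointStorage keys below are taken from the Determined Helm chart; double-check them against the chart version you deploy:

```yaml
checkpointStorage:
  type: s3
  bucket: <bucket-name> # The bucket created earlier; no access keys or endpoint URL are needed
```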

Use an AWS Load Balancer (optional)

It is possible to use an AWS Application Load Balancer (ALB) with the Determined EKS cluster instead of NGINX. Determined expects the health check to be on /det/, so alb.ingress.kubernetes.io/healthcheck-path must be set to /det/ in the master ingress YAML. An example of a master ingress YAML is shown here:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/inbound-cidrs: 0.0.0.0/0
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}]'
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/healthcheck-path: "/det/"
    kubernetes.io/ingress.class: alb
  name: determined-master-ingress
spec:
  rules:
   - host: yourhost.com
     http:
      paths:
      - backend:
          serviceName: determined-master-service-determined
          servicePort: 8080
        path: /*
        pathType: ImplementationSpecific

In order for this ingress to work as expected, the Helm parameter useNodePortForMaster must be set to true and the AWS Load Balancer Controller must be installed in the cluster.
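Two quick sanity checks, sketched here under the assumption that the controller was installed with its default name (a deployment called aws-load-balancer-controller in kube-system):

```shell
# Confirm the ingress exists and the AWS Load Balancer Controller is deployed.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get ingress determined-master-ingress 2>/dev/null \
    || echo "ingress not found"
  kubectl get deployment aws-load-balancer-controller -n kube-system 2>/dev/null \
    || echo "load balancer controller not found"
else
  echo "kubectl not installed"
fi
```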

Manage an EKS Cluster

For general instructions on adding taints and tolerations to nodes, see the Taints and Tolerations section in our Guide to Kubernetes. There, you can find an explanation of taints and tolerations, as well as instructions for using kubectl to add them to existing clusters.

It is important to note that if you use EKS to create nodes with taints, you must also add matching tolerations to your pods; otherwise, Kubernetes will be unable to schedule them on the tainted nodes.

To taint nodes, add a taint type and a tag to the node group specified in the cluster config from Create the Cluster. An example of the modifications is shown for a g4dn.metal node group:

- name: g4dn-metal-us-west-2b
  ...
  taints:
    g4dn.metal-us-west-2b: "true:NoSchedule"
  ...
  tags:
    ...
    k8s.io/cluster-autoscaler/node-template/taint/g4dn.metal-us-west-2b: "true:NoSchedule"
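Once a tainted node joins the cluster, you can confirm the taint was applied by listing node taints; a guarded sketch:

```shell
# Print each node name alongside its taints.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}' 2>/dev/null \
    || echo "cluster not reachable"
else
  echo "kubectl not installed"
fi
```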

Furthermore, tainting requires changes to the GPU-enabling DaemonSet and further additions to the experiment config. First, to change the DaemonSet, save a copy of the official version and add the following to its tolerations:

spec:
  tolerations:
  ...
  - key: g4dn.metal-us-west-2b
    operator: Exists
    effect: NoSchedule

To modify the experiment config to run on tainted nodes, refer to the Change the Experiment Configuration section.