Setting up an AWS Kubernetes (EKS) Cluster¶
Determined can be installed on a cluster that is hosted on a managed Kubernetes service such as Amazon EKS. This document describes how to set up an EKS cluster with GPU-enabled nodes. The recommended setup includes deploying a cluster with a single non-GPU node that will host the Determined master and database, and an autoscaling group of GPU nodes. After creating a suitable EKS cluster, you can then proceed with the standard instructions for installing Determined on Kubernetes.
Determined requires the Kubernetes cluster to be running version >= 1.15 and to have GPU-enabled nodes.
Prerequisites¶
Before setting up an EKS cluster, the user should have the latest versions of the AWS CLI, kubectl, and eksctl installed on their local machine.
Additionally, make sure you are subscribed to the EKS-optimized AMI with GPU support; continuing without a subscription will cause node creation to fail.
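Before proceeding, it can be useful to confirm that all three tools are installed and on your PATH:
# Check that the prerequisite tools are available
aws --version
kubectl version --client
eksctl version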
Making an S3 Bucket¶
One resource that eksctl does not automatically create is an S3 bucket, which is necessary for Determined to store checkpoints. To quickly create an S3 bucket, use the command:
aws s3 mb s3://<bucket-name>
The bucket name needs to be specified in both the eksctl cluster config and the Determined Helm chart.
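To confirm the bucket was created and is reachable with your current credentials, you can list it (the command produces no output for an empty bucket but fails if the bucket does not exist):
# Verify that the bucket exists and is accessible
aws s3 ls s3://<bucket-name>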
Creating the Cluster¶
The quickest and easiest way to deploy an EKS cluster is with eksctl. eksctl supports cluster creation with either command-line arguments or a cluster config file. Below is a template config that deploys a managed node group for Determined’s master instance, as well as an autoscaling GPU node group for workers. To fill in the template, insert the cluster name and S3 bucket name.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: <cluster-name> # Specify your cluster name here
region: us-west-2 # The default region is us-west-2
version: "1.15" # 1.16 and 1.17 are also supported
# Cluster availability zones must be explicitly named in order for single availability zone node groups to work.
availabilityZones:
- "us-west-2b"
- "us-west-2c"
- "us-west-2d"
iam:
  withOIDC: true # Enables the IAM OIDC provider
serviceAccounts:
- metadata:
name: checkpoint-storage-s3-bucket
# If no namespace is set, "default" will be used.
# Namespace will be created if it does not already exist.
namespace: default
labels:
aws-usage: "determined-checkpoint-storage"
attachPolicy: # Inline policy can be defined along with `attachPolicyARNs`
Version: "2012-10-17"
Statement:
- Effect: Allow
Action:
- "s3:ListBucket"
Resource: 'arn:aws:s3:::<bucket-name>' # Name of the previously created bucket
- Effect: Allow
Action:
- "s3:GetObject"
- "s3:PutObject"
- "s3:DeleteObject"
Resource: 'arn:aws:s3:::<bucket-name>/*'
- metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
aws-usage: "determined-cluster-autoscaler"
attachPolicy:
Version: "2012-10-17"
Statement:
- Effect: Allow
Action:
- "autoscaling:DescribeAutoScalingGroups"
- "autoscaling:DescribeAutoScalingInstances"
- "autoscaling:DescribeLaunchConfigurations"
- "autoscaling:DescribeTags"
- "autoscaling:SetDesiredCapacity"
- "autoscaling:TerminateInstanceInAutoScalingGroup"
Resource: '*'
managedNodeGroups:
- name: managed-m5-2xlarge
instanceType: m5.2xlarge
availabilityZones:
- us-west-2b
- us-west-2c
- us-west-2d
minSize: 1
maxSize: 2
volumeSize: 200
iam:
withAddonPolicies:
autoScaler: true
cloudWatch: true
ssh:
allow: true # will use ~/.ssh/id_rsa.pub as the default ssh key
labels:
nodegroup-type: m5.2xlarge
nodegroup-role: cpu-worker
tags:
k8s.io/cluster-autoscaler/enabled: "true"
k8s.io/cluster-autoscaler/user-eks: "owned"
k8s.io/cluster-autoscaler/node-template/label/nodegroup-type: m5.2xlarge
k8s.io/cluster-autoscaler/node-template/label/nodegroup-role: cpu-worker
nodeGroups:
- name: p2-8xlarge-us-west-2b
instanceType: p2.8xlarge # 8 GPUs per machine
# Restrict to a single AZ to optimize data transfer between instances
availabilityZones:
- us-west-2b
minSize: 0
maxSize: 2
volumeSize: 200
volumeType: gp2
iam:
withAddonPolicies:
autoScaler: true
cloudWatch: true
ssh:
allow: true # This will use ~/.ssh/id_rsa.pub as the default ssh key.
labels:
nodegroup-type: p2.8xlarge-us-west-2b
nodegroup-role: gpu-worker
# https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#special-note-on-gpu-instances
k8s.amazonaws.com/accelerator: nvidia-tesla-k80
tags:
k8s.io/cluster-autoscaler/enabled: "true"
k8s.io/cluster-autoscaler/user-eks: "owned"
k8s.io/cluster-autoscaler/node-template/label/nodegroup-type: p2.8xlarge-us-west-2b
k8s.io/cluster-autoscaler/node-template/label/nodegroup-role: gpu-worker
The cluster specified above allows users to run experiments on untainted p2.8xlarge instances with minor additions to their experiment configs. To create a cluster with tainted instances, see the Tainting Nodes section below.
To launch the cluster with eksctl, run:
eksctl create cluster --config-file <cluster config yaml>
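Cluster creation takes several minutes. Once it finishes, you can list the cluster and its node groups to confirm everything was created:
# Confirm the cluster and node groups exist
eksctl get cluster --region us-west-2
eksctl get nodegroup --cluster <cluster-name> --region us-west-2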
Note
For an experiment to run, its config must be modified to specify a service account for S3 access. An example of this is provided in the Configuring Per-Task Pod Specs section of the Specifying Custom Pod Specs guide.
Creating a kubeconfig for EKS¶
After creating the cluster, kubectl should be used to deploy apps. In order for kubectl to be used with EKS, users need to create or update the cluster kubeconfig. This can be done with the command:
aws eks --region <region-code> update-kubeconfig --name <cluster_name>
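To verify that kubectl now points at the new cluster, check the current context and list the nodes:
kubectl config current-context
kubectl get nodes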
Enabling GPU support¶
To use GPU instances, the NVIDIA Kubernetes device plugin needs to be installed. Use the following command to install the plugin:
# Deploy a DaemonSet that enables the GPUs.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
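Once a GPU node has joined the cluster (the GPU node group above starts with zero nodes), you can confirm that the plugin exposes GPUs as allocatable resources:
# Show allocatable GPUs per node
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"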
Enabling Autoscaler¶
Lastly, EKS requires manual deployment of an autoscaler. Save the
following configuration in a new file such as
determined-autoscaler.yaml
:
apiVersion: apps/v1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
app: cluster-autoscaler
spec:
replicas: 1
selector:
matchLabels:
app: cluster-autoscaler
template:
metadata:
labels:
app: cluster-autoscaler
annotations:
prometheus.io/scrape: 'true'
prometheus.io/port: '8085'
spec:
serviceAccountName: cluster-autoscaler
tolerations:
- key: node-role.kubernetes.io/master
operator: "Equal"
value: "true"
effect: NoSchedule
containers:
- image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.17.3
name: cluster-autoscaler
resources:
limits:
cpu: 100m
memory: 300Mi
requests:
cpu: 100m
memory: 300Mi
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --expander=least-waste
- --scale-down-delay-after-add=5m
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
volumeMounts:
- name: ssl-certs
mountPath: /etc/ssl/certs/ca-certificates.crt
readOnly: true
imagePullPolicy: "Always"
volumes:
- name: ssl-certs
hostPath:
path: "/etc/ssl/certs/ca-bundle.crt"
To deploy an autoscaler that works with Determined, apply the official autoscaler configuration first, then apply the custom determined-autoscaler.yaml.
# Apply the official autoscaler configuration
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-run-on-master.yaml
# Apply the custom deployment
kubectl apply -f <cluster-autoscaler yaml, e.g. `determined-autoscaler.yaml`>
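To verify the autoscaler is running, check the deployment and inspect its logs for scaling activity:
kubectl -n kube-system get deployment cluster-autoscaler
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=50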
Tainting Nodes¶
Tainting nodes is optional, but users may want to taint nodes to control which pods can be scheduled onto them. To taint nodes, users will need to add a taint and a matching tag to the node group specified in the cluster config from Creating the Cluster. An example of the modifications is shown for the p2.8xlarge node group from above:
- name: p2-8xlarge-us-west-2b
...
taints:
p2.8xlarge-us-west-2b: "true:NoSchedule"
...
tags:
...
k8s.io/cluster-autoscaler/node-template/taint/p2.8xlarge-us-west-2b: "true:NoSchedule"
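Once a node from the tainted group has scaled up, you can confirm the taint is present; this assumes the nodegroup-role label defined in the config above:
kubectl describe nodes -l nodegroup-role=gpu-worker | grep -i taint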
Furthermore, tainting requires changes to the GPU-enabling DaemonSet and further additions to the experiment config. First, to change the DaemonSet, save a copy of the official version and make the following additions to its tolerations:
spec:
tolerations:
...
- key: p2.8xlarge-us-west-2b
operator: Exists
effect: NoSchedule
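After saving the modified manifest, apply it in place of the official one; the file name is only an example:
kubectl apply -f <modified device plugin yaml, e.g. `nvidia-device-plugin.yml`>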
To modify the experiment config to run on tainted nodes, refer to the Changes to Experiment Config section below.
Changes to Experiment Config¶
To run an experiment with EKS, two additions must be made to the experiment config: a service account must be specified to allow Determined to save checkpoints to S3, and, if there are tainted nodes, tolerations must be listed for the experiment to be scheduled. An example of the necessary changes is shown here:
environment:
pod_spec:
...
spec:
...
serviceAccountName: checkpoint-storage-s3-bucket
# Tolerations should only be included if nodes are tainted
tolerations:
- key: <tainted-group-key, e.g p2.8xlarge-us-west-2b>
operator: "Equal"
value: "true"
effect: "NoSchedule"
Details about pod configuration can be found in Configuring Per-Task Pod Specs.
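For reference, an experiment with the modified config is submitted as usual through the Determined CLI; the master address and file names below are placeholders:
det -m http://<master-address>:8080 experiment create <experiment config yaml> <model directory>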
Changes to Determined¶
Following the deployment of EKS, make sure the necessary changes to Determined have been applied in order to successfully run experiments. These changes include adding the created S3 bucket to Determined’s Helm chart and specifying a service account in the default pod specs. When modifying the Helm chart to use S3, no keys or endpoint URLs are needed, since access is granted through the service account created with the cluster. Additionally, if running on tainted nodes, be sure to add pod tolerations to the experiment spec so that its pods will be scheduled.
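As a minimal sketch of the Helm change, the bucket can be supplied as values at install time; the checkpointStorage key names below are assumptions and should be verified against the chart’s values.yaml, and the chart path is a placeholder:
# Sketch: install Determined with S3 checkpoint storage (no keys or endpoint needed)
helm install determined <path to determined helm chart> \
  --set checkpointStorage.type=s3 \
  --set checkpointStorage.bucket=<bucket-name>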