Set up and Manage an Azure Kubernetes Service (AKS) Cluster#
Determined can be installed on a cluster that is hosted on a managed Kubernetes service such as AKS. This document describes how to set up an AKS cluster with GPU-enabled nodes. The recommended setup includes deploying a cluster with a single non-GPU node that will host the Determined master and database, and an autoscaling group of GPU nodes. After creating a suitable AKS cluster, you can then proceed with the standard instructions for installing Determined on Kubernetes.
Determined requires GPU-enabled nodes and a Kubernetes cluster running a version >= 1.19 and <= 1.21; later versions may also work.
Prerequisites#
To deploy an AKS cluster, the user must have a resource group to manage the resources consumed by the cluster. To create one, follow the instructions found in the Azure Resource Groups Documentation.
Additionally, users must have the Azure CLI and kubectl installed on their local machine.
Finally, authenticate with the Azure CLI by running az login so that you have access to your Azure subscription.
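Before continuing, it can be useful to confirm that both tools are on your PATH. The following is a minimal sketch, not part of the official instructions; the helper name missing_tools is hypothetical and introduced here only for illustration.

```shell
# Hypothetical helper: print the name of every listed tool that is not
# found on PATH, one per line. Prints nothing when all tools are present.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example usage (prints nothing if both tools are installed):
#   missing_tools az kubectl
```

If the helper prints nothing, both tools were found. Whether you are logged in is a separate question; running az account show is one way to check that an active subscription is available.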
Set up the Cluster#
# Specify the Azure Resource Group you will be using to deploy the cluster.
AKS_RESOURCE_GROUP=<resource group name, e.g. "determined-resource-group">
# Set a unique name for your cluster.
AKS_CLUSTER_NAME=<any unique name, e.g. "determined-cluster">
# Set a unique name for your node pool. Azure requires node pool names to consist
# solely of alphanumeric characters, start with a lowercase letter, and
# be no longer than 12 characters.
AKS_GPU_NODE_POOL_NAME=<any unique, conforming name, e.g. "gpunodepool">
# Set the GPU VM Size for your node pool. This VM size corresponds to a machine with 4 Tesla K80 GPUs.
GPU_VM_SIZE=Standard_NC24
# Launch the AKS cluster that will contain a single non-GPU node.
az aks create --resource-group ${AKS_RESOURCE_GROUP} --name ${AKS_CLUSTER_NAME} \
--node-count 1 --generate-ssh-keys --vm-set-type VirtualMachineScaleSets \
--load-balancer-sku standard --node-vm-size Standard_D8_v3
# Create a GPU node pool. This will not launch any nodes immediately; the
# cluster autoscaler will scale the pool up and down as needed. If you change
# the GPU type or the number of GPUs per node, update GPU_VM_SIZE to match.
az aks nodepool add --resource-group ${AKS_RESOURCE_GROUP} --cluster-name ${AKS_CLUSTER_NAME} \
--name ${AKS_GPU_NODE_POOL_NAME} --node-count 0 --node-vm-size ${GPU_VM_SIZE} \
--enable-cluster-autoscaler --min-count 0 --max-count 4
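Node pool creation is rejected when the pool name breaks the naming rules stated in the comments above, so it may help to check the name before calling az. The sketch below is an illustration only: the function name is_valid_nodepool_name is hypothetical, and it assumes the letters must be lowercase, which is stricter than "alphanumeric".

```shell
# Hypothetical helper: return 0 if the name starts with a lowercase letter,
# contains only lowercase letters and digits, and is 1 to 12 characters long.
# Assumption: uppercase letters are treated as invalid.
is_valid_nodepool_name() {
    printf '%s' "$1" | grep -Eq '^[a-z][a-z0-9]{0,11}$'
}

# Example usage:
#   is_valid_nodepool_name "$AKS_GPU_NODE_POOL_NAME" || echo "invalid node pool name" >&2
```

A name containing hyphens or more than 12 characters fails this check.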
Create a kubeconfig for AKS#
After creating the cluster, use kubectl to deploy apps. For kubectl to work with AKS, you need to create or update the cluster kubeconfig. This can be done with the following command:
az aks get-credentials --resource-group ${AKS_RESOURCE_GROUP} --name ${AKS_CLUSTER_NAME}
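To confirm that the kubeconfig works, one option is to ask kubectl to list the cluster's nodes. This is a minimal sketch under the assumption that kubectl is using the context written by the command above; the helper name check_cluster_access is hypothetical.

```shell
# Hypothetical helper: succeed only if kubectl can reach the cluster
# and list its nodes using the current kubeconfig context.
check_cluster_access() {
    if kubectl get nodes >/dev/null 2>&1; then
        echo "kubectl can reach the cluster"
    else
        echo "kubectl cannot reach the cluster; try re-running az aks get-credentials" >&2
        return 1
    fi
}

# Example usage:
#   check_cluster_access
```

At this stage only the single non-GPU node is expected to appear, because the GPU node pool starts with zero nodes.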
Enable GPU Support#
To allow the AKS cluster to recognize GPU hardware resources, refer to the instructions provided by Azure on the Install NVIDIA Device Plugin tutorial.
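Once the device plugin is running, GPU nodes should advertise an allocatable nvidia.com/gpu resource. The sketch below shows one way to inspect this with kubectl; the helper name list_gpu_capacity is hypothetical. Because the GPU node pool can scale to zero, there may be no GPU node to inspect until the autoscaler has added one.

```shell
# Hypothetical helper: list each node with its allocatable GPU count.
# Nodes without GPUs (or without the device plugin) show <none>.
list_gpu_capacity() {
    kubectl get nodes \
        -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Example usage:
#   list_gpu_capacity
```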
With this, the cluster is fully set up, and Determined can be deployed onto it.
Manage an AKS Cluster#
Update the Autoscaler#
To update the cluster autoscaler, use the following Azure CLI command:
az aks nodepool update --update-cluster-autoscaler --min-count <new_min_count> \
--max-count <new_max_count> --resource-group ${AKS_RESOURCE_GROUP} \
--cluster-name ${AKS_CLUSTER_NAME} --name ${AKS_GPU_NODE_POOL_NAME}
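The minimum count must not exceed the maximum count, so a small check before calling az can catch mistakes early. This is an illustrative sketch only; the helper names is_uint and valid_autoscaler_range are hypothetical.

```shell
# Hypothetical helper: return 0 if the argument is a non-empty string of digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical helper: return 0 if both counts are non-negative integers
# and the minimum does not exceed the maximum.
valid_autoscaler_range() {
    is_uint "$1" && is_uint "$2" && [ "$1" -le "$2" ]
}

# Example usage:
#   valid_autoscaler_range 0 8 || echo "invalid autoscaler range" >&2
```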
Add Taints and Tolerations to Nodes#
For general instructions on adding taints and tolerations to nodes, see the Taints and Tolerations section in our Guide to Kubernetes. There, you can find an explanation of taints and tolerations, as well as instructions for using kubectl to add them to existing clusters.
It is important to note that if you use the Azure CLI to create nodes with taints, you must also add matching tolerations using kubectl; otherwise, Kubernetes will be unable to schedule pods on the tainted nodes.
To create a node pool with a taint in AKS, use the --node-taints flag to specify the taint type (key), tag (value), and effect:
az aks nodepool add \
--resource-group ${AKS_RESOURCE_GROUP} \
--cluster-name ${AKS_CLUSTER_NAME} \
--name ${AKS_NODE_POOL_NAME} \
--node-count 1 \
--node-taints ${TAINT_TYPE}=${TAINT_TAG}:${TAINT_EFFECT} \
--no-wait
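The value passed to --node-taints has the form type=tag:effect, where the effect is one of the Kubernetes taint effects NoSchedule, PreferNoSchedule, or NoExecute. As a rough sketch, the hypothetical helper taint_spec below assembles that string and rejects an unknown effect; it is an illustration, not part of the Azure CLI.

```shell
# Hypothetical helper: build a taint string of the form type=tag:effect.
# Fails if the effect is not one of the Kubernetes taint effects.
taint_spec() {
    case "$3" in
        NoSchedule|PreferNoSchedule|NoExecute) ;;
        *) echo "unknown taint effect: $3" >&2; return 1 ;;
    esac
    printf '%s=%s:%s\n' "$1" "$2" "$3"
}

# Example usage:
#   taint_spec sku gpu NoSchedule
```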
The following command is an example of using the az CLI to create a node pool whose nodes are unschedulable unless a Pod has a toleration for a taint with type sku equal to gpu and the NoSchedule effect.
az aks nodepool add \
--resource-group ${AKS_RESOURCE_GROUP} \
--cluster-name ${AKS_CLUSTER_NAME} \
--name ${AKS_NODE_POOL_NAME} \
--node-count 1 \
--node-taints sku=gpu:NoSchedule \
--no-wait
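For pods to land on nodes tainted this way, their pod spec needs a matching toleration. The sketch below prints the standard Kubernetes toleration fragment for the sku=gpu:NoSchedule taint; the helper name print_gpu_toleration is hypothetical, and where the fragment belongs in your pod or Determined configuration is covered by the Taints and Tolerations section mentioned earlier.

```shell
# Hypothetical helper: print a pod-spec toleration fragment (YAML) that
# matches a taint with key "sku", value "gpu", and effect "NoSchedule".
print_gpu_toleration() {
    cat <<'EOF'
tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
EOF
}

# Example usage:
#   print_gpu_toleration
```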
Delete the Cluster#
To delete the AKS cluster, use the following Azure CLI command:
az aks delete --resource-group ${AKS_RESOURCE_GROUP} --name ${AKS_CLUSTER_NAME}