Setting up a Google Kubernetes Engine (GKE) Cluster¶
Determined can be installed on a cluster that is hosted on a managed Kubernetes service such as GKE. This document describes how to set up a GKE cluster with GPU-enabled nodes. The recommended setup includes deploying a cluster with a single non-GPU node that will host the Determined master and database, and an autoscaling group of GPU nodes. After creating a suitable GKE cluster, you can then proceed with the standard instructions for installing Determined on Kubernetes.
Determined requires the Kubernetes cluster to be running version >= 1.15 and to have GPU-enabled nodes.
Prerequisites¶
Before setting up a GKE cluster, the user should have Google Cloud SDK and kubectl installed on their local machine.
Setting Up the Cluster¶
# Set a unique name for your cluster.
GKE_CLUSTER_NAME=<any unique name, e.g. "determined-cluster">
# Set a unique name for your node pool.
GKE_GPU_NODE_POOL_NAME=<any unique name, e.g., "determined-node-pool">
# Set a unique name for the GCS bucket that will store your checkpoints.
# When installing Determined, set checkpointStorage.bucket to the value defined here.
GCS_BUCKET_NAME=<any unique name, e.g., "determined-checkpoint-bucket">
# Set the GPU type for your node pool. Other options include p100, p4, and v100.
GPU_TYPE=nvidia-tesla-k80
# Set the number of GPUs per node.
GPUS_PER_NODE=4
# Launch the GKE cluster that will contain a single non-GPU node.
gcloud container clusters create ${GKE_CLUSTER_NAME} \
--region us-west1 \
--node-locations us-west1-b\
--num-nodes=1 \
--image-type=UBUNTU \
--machine-type=n1-standard-16
# Create a node pool. This will not launch any nodes immediately but will
# scale up and down as needed. If you change the GPU type or the number of
# GPUs per node, you may need to change the machine-type.
gcloud container node-pools create ${GKE_GPU_NODE_POOL_NAME} \
--cluster ${GKE_CLUSTER_NAME} \
--accelerator type=${GPU_TYPE},count=${GPUS_PER_NODE} \
--zone us-west1 \
--num-nodes=0 \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=4 \
--image-type=UBUNTU \
--machine-type=n1-standard-32 \
--scopes=storage-full
# Deploy a DaemonSet that enables the GPUs.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
# Create a GCS bucket to store checkpoints.
gsutil mb gs://${GCS_BUCKET_NAME}