Setting up a Google Kubernetes Engine (GKE) Cluster

Determined can be installed on a cluster hosted on a managed Kubernetes service such as GKE. This document describes how to set up a GKE cluster with GPU-enabled nodes. The recommended setup is a cluster with a single non-GPU node that hosts the Determined master and database, plus an autoscaling group of GPU nodes. After creating a suitable GKE cluster, proceed with the standard instructions for installing Determined on Kubernetes.

Determined requires the Kubernetes cluster to be running version >= 1.15 and to have GPU-enabled nodes.
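
To check a version string against this minimum, a small shell helper can compare the numeric components. The sketch below is illustrative only: `version_at_least` is a hypothetical helper name, it compares just the major and minor components, and the sample version string is a made-up value in the style GKE reports.

```shell
# version_at_least HAVE WANT
# Succeeds when HAVE >= WANT, comparing only major.minor.
# Hypothetical helper; not part of Determined, gcloud, or kubectl.
version_at_least() {
    have=${1#v}                          # drop an optional leading "v"
    want=${2#v}
    have_major=${have%%.*}               # text before the first dot
    have_rest=${have#*.}                 # text after the first dot
    have_minor=${have_rest%%[!0-9]*}     # leading digits of the remainder
    want_major=${want%%.*}
    want_rest=${want#*.}
    want_minor=${want_rest%%[!0-9]*}
    if [ "$have_major" -gt "$want_major" ]; then
        return 0
    fi
    [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]
}

# Sample value for illustration only; substitute your cluster's version.
if version_at_least "1.27.3-gke.100" "1.15"; then
    echo "Kubernetes version is new enough for Determined."
fi
```

For an existing cluster, the version to pass as the first argument can be taken from the output of `gcloud container clusters describe`.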

Prerequisites

Before setting up a GKE cluster, install the Google Cloud SDK and kubectl on your local machine.
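
A quick way to confirm the tools are available is to look them up on `PATH` before running any of the setup commands. This is a minimal sketch: `check_tools` is a hypothetical helper name, and the tool list assumes `gsutil` was installed along with the Google Cloud SDK.

```shell
# check_tools TOOL...
# Prints each tool that cannot be found on PATH and returns non-zero
# if any are missing. Hypothetical helper; not part of any SDK.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# The commands in this guide use gcloud, kubectl, and gsutil.
check_tools gcloud kubectl gsutil || echo "Install the missing tools before continuing."
```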

Setting Up the Cluster

# Set a unique name for your cluster.
GKE_CLUSTER_NAME=<any unique name, e.g., "determined-cluster">

# Set a unique name for your node pool.
GKE_GPU_NODE_POOL_NAME=<any unique name, e.g., "determined-node-pool">

# Set a unique name for the GCS bucket that will store your checkpoints.
# When installing Determined, set checkpointStorage.bucket to the value defined here.
GCS_BUCKET_NAME=<any unique name, e.g., "determined-checkpoint-bucket">

# Set the GPU type for your node pool. Other options include
# nvidia-tesla-p100, nvidia-tesla-p4, and nvidia-tesla-v100.
GPU_TYPE=nvidia-tesla-k80

# Set the number of GPUs per node.
GPUS_PER_NODE=4

# Launch the GKE cluster that will contain a single non-GPU node.
gcloud container clusters create ${GKE_CLUSTER_NAME} \
    --region us-west1 \
    --node-locations us-west1-b \
    --num-nodes=1 \
    --image-type=UBUNTU \
    --machine-type=n1-standard-16

# Create a node pool. This will not launch any nodes immediately but will
# scale up and down as needed. If you change the GPU type or the number of
# GPUs per node, you may need to change the machine-type.
gcloud container node-pools create ${GKE_GPU_NODE_POOL_NAME} \
  --cluster ${GKE_CLUSTER_NAME} \
  --accelerator type=${GPU_TYPE},count=${GPUS_PER_NODE} \
  --region us-west1 \
  --num-nodes=0 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=4 \
  --image-type=UBUNTU \
  --machine-type=n1-standard-32 \
  --scopes=storage-full

# Deploy a DaemonSet that enables the GPUs.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml

# Create a GCS bucket to store checkpoints.
gsutil mb gs://${GCS_BUCKET_NAME}
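
Once the commands above have finished, the result can be checked with a few read-only commands. The following is a sketch under stated assumptions: `verify_gke_setup` is a hypothetical helper name, it reuses the `GKE_CLUSTER_NAME` and `GCS_BUCKET_NAME` variables and the `us-west1` region from the steps above, and it assumes `kubectl` is already pointed at the new cluster.

```shell
# verify_gke_setup
# Read-only checks for the cluster, GPU node pool, and checkpoint bucket.
# Hypothetical helper; stops at the first check that fails.
verify_gke_setup() {
    # The single non-GPU node should be listed.
    kubectl get nodes || return 1

    # The GPU node pool should exist even while it has zero nodes.
    gcloud container node-pools list \
        --cluster "${GKE_CLUSTER_NAME}" \
        --region us-west1 || return 1

    # The checkpoint bucket should exist and be accessible.
    gsutil ls -b "gs://${GCS_BUCKET_NAME}" || return 1
}
```

Run `verify_gke_setup` in the same shell session in which the variables were set. A non-zero exit status means one of the three checks failed.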