Determined on GCP

This document describes how Determined runs on Google Cloud Platform (GCP).

Overview

At a high level, Determined uses Google Compute Engine (GCE) instances as its base unit. The cluster is managed by a master node (a single, non-GPU instance), which provisions and deprovisions agent nodes (GPU instances) according to the volume of experiments currently running on the cluster. For example, if only the master node is running, you are charged only for the master. When an experiment starts, the master creates GPU instances as agents; when the experiment finishes, the master shuts the agents down so that you are not charged for them while they are idle. The master also keeps all experiment metadata in a separate database, which users can query through the Determined WebUI or CLI. All nodes in the cluster communicate with one another internally within the Virtual Private Cloud (VPC), and users interact with the master through a designated port configured during installation.
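
The billing behavior described above can be illustrated with a small calculation. This is only a minimal sketch: the function name and the hourly rates are hypothetical placeholders chosen for illustration, not actual GCP prices, and the model ignores database, storage, and network charges.

```python
def hourly_compute_cost(active_agents: int,
                        master_rate: float,
                        agent_rate: float) -> float:
    """Return the hourly compute cost of a cluster.

    The master is always running; agents are billed only while they exist.
    All rates are hypothetical inputs, not real GCP prices.
    """
    if active_agents < 0:
        raise ValueError("active_agents must be non-negative")
    return master_rate + active_agents * agent_rate


# Hypothetical rates in dollars per hour.
idle = hourly_compute_cost(active_agents=0, master_rate=0.5, agent_rate=2.0)
busy = hourly_compute_cost(active_agents=3, master_rate=0.5, agent_rate=2.0)
print(idle, busy)
```

With no experiments running, only the master's rate contributes; each agent the master provisions adds its own rate until the master shuts it down.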

Architecture Diagram

The diagram below outlines the high-level architecture of a Determined cluster in GCP.

[Figure: architecture of a Determined cluster in GCP (det-arch-gcp.png)]

Following the diagram, a typical execution proceeds as follows:

  1. User submits experiment to master

  2. Master creates one or more agents (depending on the experiment) if they don’t already exist

  3. Agent accesses required data, images, etc.

  4. Agent completes experiment and communicates completion to master

  5. Master shuts down agents that are no longer needed
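
The steps above can be sketched as a toy simulation. This is not Determined's implementation: `ToyMaster`, `slots_per_agent`, and the agent names are hypothetical, nothing is actually provisioned in GCP, and step 3 (data and image access) is not modeled. For simplicity, the sketch also dedicates each agent to one experiment and stops it as soon as that experiment completes, so it never reuses an existing agent.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Toy stand-in for the master; it tracks agents but provisions nothing."""

    slots_per_agent: int = 1  # hypothetical: GPUs per agent instance
    agents: dict = field(default_factory=dict)  # agent name -> experiment
    log: list = field(default_factory=list)
    _next_id: int = 0

    def submit(self, experiment: str, slots: int) -> None:
        # Step 1: the user submits an experiment to the master.
        self.log.append(f"submitted {experiment}")
        # Step 2: the master creates as many agents as the experiment needs.
        needed = -(-slots // self.slots_per_agent)  # ceiling division
        for _ in range(needed):
            name = f"agent-{self._next_id}"
            self._next_id += 1
            self.agents[name] = experiment
            self.log.append(f"created {name}")

    def complete(self, experiment: str) -> None:
        # Step 4: the agents report completion to the master.
        self.log.append(f"completed {experiment}")
        # Step 5: the master shuts down agents that are no longer needed.
        finished = [n for n, e in self.agents.items() if e == experiment]
        for name in finished:
            del self.agents[name]
            self.log.append(f"stopped {name}")


master = ToyMaster(slots_per_agent=4)
master.submit("exp-1", slots=8)
agents_while_running = len(master.agents)
master.complete("exp-1")
agents_after = len(master.agents)
print(agents_while_running, agents_after)
```

The agent count rises when the experiment is submitted and returns to zero once it completes, which mirrors the scale-up and scale-down behavior in the numbered steps.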

Architecture Details

There are two types of resources used to run Determined: core resources that enable the Determined platform, and periphery resources that add optional functionality. The sections below describe these resources in more detail; to deploy them in GCP, follow the Install Determined on GCP guide.

Core Resources

  • Master Node: A single Google Compute Engine (GCE) instance is designated as the master. The master’s primary functions are to:

    • host the cluster’s browser-based WebUI, where users monitor their experiments

    • respond to commands from the Determined CLI, which users install locally

    • schedule experiments

    • manage the other GCE instances (agents) that run experiments

  • Agent Node(s): For most Determined clusters in GCP, the volume of active experiments dictates the number of agents. All agents are managed by the master, and users need not interact with the agents directly.

  • Database: Determined uses a Cloud SQL (PostgreSQL) database to store all experiment metadata.

  • Service Account: A service account is used to manage the creation of compute (GCE) resources and access to Google Cloud Storage (GCS) buckets for checkpoints, TensorBoards, and other data storage as needed.

  • Firewall Rules: Firewall rules are set so that the nodes in the cluster can communicate with one another.

Periphery Resources

  • Network/Subnetwork: The Determined cluster can be configured inside an existing VPC or be set to create a new VPC.

  • Static IP: For production clusters, a static IP is recommended for the master; otherwise an ephemeral IP is automatically generated by GCP.

  • Google Filestore: The Determined cluster can leverage an existing Filestore instance (assuming it has the correct associated permissions), or the Terraform script can create a Filestore instance with the cluster.

  • Google Cloud Storage (GCS) bucket: The Determined cluster can leverage an existing GCS bucket (assuming it has the correct associated permissions), or the Terraform script can create a bucket with the cluster.