Shortcuts

Install Determined on GCP

This document describes how to deploy a Determined cluster on Google Cloud Platform (GCP). We provide the determined-deploy package for easy creation and deployment of these resources in GCP. The determined-deploy package uses Terraform to automatically deploy and configure a Determined cluster in GCP. Alternatively, if you already have a process for setting up infrastructure with Terraform, you can use our Terraform modules rather than determined-deploy.

For more information about using Determined on GCP, see the Determined on GCP topic guide.

Requirements

Project

To get started on GCP, you will need to create a project.

The following GCP APIs must be enabled on your GCP project:

Credentials

The determined-deploy package requires credentials in order to create resources in GCP. There are two ways to provide these credentials:

  • Use gcloud to authenticate your user account:

    gcloud auth application-default login
    

    This command will open a login page in your browser where you can sign in to the Google account that has access to your project. Ensure your user account has Owner access to the project you want to deploy your cluster in.

  • Use service account credentials.

Install

  1. Install Terraform.

  2. Install determined-deploy using pip:

    pip install determined-deploy
    

Deploying A Cluster

We recommend creating a new directory and running the commands below inside that directory.

Note

The deployment process will create a state file in the directory where it is run. The state file keeps track of deployed resources and their state and is used to update or delete the cluster in the future. Any future update or deletion commands should be run inside the same directory so determined-deploy can read the state file. If the state file is deleted, it will be difficult to manage the deployment afterward. Storing the state file in a safe location is strongly recommended.

To deploy the cluster, run:

det-deploy gcp up --cluster-id CLUSTER_ID --project-id PROJECT_ID

CLUSTER_ID is an arbitrary unique ID for the new cluster. We recommend choosing a cluster ID that is memorable and helps identify what the cluster is being used for.

The deployment process may take 5-10 minutes. When it completes, summary information about the newly deployed cluster will be printed, including the URL of the Determined master.

Required Arguments:

Argument

Description

Default Value

--cluster-id

A string appended to resources to uniquely identify the cluster.

required

--project-id

The project to deploy the cluster in.

required

Optional Arguments:

Argument

Description

Default Value

--keypath

The path to the service account JSON key file if using a service account. Including this flag will supersede default Google Cloud user credentials.

none

--preemptible

Whether to use preemptible agent instances.

False

--gpu-type

The type of GPU to use for the agent instances. Ensure gpu_type is available in your selected region and zone by referring to the GPUs on Compute Engine page.

nvidia-tesla-k80

--gpu-num

The number of GPUs on each agent instance. Between 1 and 8 (more GPUs require a more powerful agent-instance-type). Refer to the GPUs on Compute Engine page for specific GCP requirements.

8

--max-dynamic-agents

Maximum number of dynamic agent instances at one time.

5

--static-agents

Number of static agent instances.

0

--max-idle-agent-period

The length of time to wait before idle dynamic agents will be automatically terminated.

10m

--network

The network to create (ensure there isn’t a network with the same name already in the project, otherwise the deployment will fail).

det-default-cluster-id

--region

The region to deploy the cluster in.

us-west1

--zone

The zone to deploy the cluster in.

region-b

--master-instance-type

Instance type to use for the master instance.

n1-standard-2

--agent-instance-type

Instance type to use for the agent instances.

n1-standard-32

--min-cpu-platform-master

Minimum CPU platform for the master instance.

Intel Skylake

--min-cpu-platform-agent

Minimum CPU platform for the agent instances. Ensure the platform is compatible with your selected gpu-type and available in your selected region and zone by referring to the GPUs on Compute Engine page.

Intel Broadwell

The following gcloud commands will help to validate your configuration, including resource availability in your desired region and zone:

# Validate that the GCP Project ID exists.
gcloud projects list

# Verify that the environment_image is listed.
gcloud compute images list --filter=name:<environment_image>

# Check that a zone is available in the configured region.
gcloud compute zones list --filter=region:<region>

# List the available machine types (for master_machine_type and agent_machine_type) in the configured zone.
gcloud compute machine-types list --filter=zone:<zone>

# List the valid gpu_type values for the configured zone.
gcloud compute accelerator-types list --filter=zone:<zone>

Updating A Cluster

If you need to make changes to your cluster, you can rerun det-deploy gcp up [args] in the same directory and your cluster will be updated. The det-deploy tool will only replace resources that need to be replaced based on the changes you’ve made in the updated execution.

Note

If you’d like to change the region of a deployment after it has already been deployed, we recommend deleting the cluster first, then redeploying the cluster with the new region.

Destroying A Cluster

To bring down the cluster, run the following in the same directory where you ran det-deploy gcp up:

det-deploy gcp down [optional args]

det-deploy will use the .tfstate file in the current directory to determine which resources to destroy. In addition, the following optional arguments are supported:

Argument

Description

Default Value

--keypath

The path to the service account JSON key file if using a service account. Including this flag will supersede default Google Cloud user credentials.

none

Warning

determined-deploy will not delete active agent instances when you deprovision the cluster. Generally, the master instance will shut down any inactive agents after an idle period, but if you’d like to deprovision the cluster while these agent instances exist, you must delete all agent instances first. You can find these agent instances by filtering for instances named det-agent-<cluster-id>; these agents will have full names of the form det-agent-<cluster-id>-<pet name>.

Appendix

Using Service Account Credentials

For more security controls, you can create a service account or select an existing service account from the service account key page in the Google Cloud Console and ensure it has the following IAM roles:

  • Cloud SQL Admin

  • Compute Admin

  • Compute Network Admin

  • Security Admin

  • Service Account Admin

  • Service Account User

  • Service Networking Admin

  • Storage Admin

Roles provide the service account permissions to create specific resources in your project. You can add roles to service accounts following this guide.

Once you have a service account with the appropriate roles, go to the service account key page in the Google Cloud Console and create a JSON key file. Save it to a location you’ll remember; we’ll refer to the path to this key file as the keypath, which is an optional argument you can supply when using determined-deploy. Once you have the keypath you can use it to deploy a GCP cluster by continuing the installation section.