Install Determined on GCP
This document describes how to deploy a Determined cluster on Google Cloud Platform (GCP). The determined-deploy package automates creating and deploying the required resources in GCP.
For more information on using Determined on GCP, see the Determined on GCP topic guide.
determined-deploy Python Package
The determined-deploy package uses Terraform to automatically deploy and configure a Determined cluster in GCP. Alternatively, if you already have a process for setting up infrastructure with Terraform, you can use the Terraform modules on their own, without determined-deploy.
Requirements
Project
To get started on GCP, you will need to create a project.
The following GCP APIs must be enabled on your GCP project:
Credentials
The determined-deploy package requires credentials in order to create resources in GCP. There are two ways to provide these credentials:
Use gcloud to authenticate your user account:
gcloud auth application-default login
This command opens a login page in your browser where you can sign in to the Google account that has access to your project. Ensure your user account has Owner access to the project you want to deploy your cluster in.
Alternatively, you can use Service Account credentials.
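As a quick local check after running the login command, the sketch below tests whether an application default credentials file exists. It assumes the default gcloud layout on Linux and macOS, where the file application_default_credentials.json lives in the gcloud configuration directory (which the CLOUDSDK_CONFIG environment variable can relocate). The function names adc_file and has_adc are illustrative and are not part of gcloud or determined-deploy.

```shell
# Sketch: check for an application default credentials (ADC) file.
# Assumes the Linux/macOS gcloud layout; function names are illustrative.

adc_file() {
    # Print the path where gcloud is expected to store the ADC file.
    printf '%s/application_default_credentials.json' \
        "${CLOUDSDK_CONFIG:-$HOME/.config/gcloud}"
}

has_adc() {
    # Succeed only if the ADC file exists.
    [ -f "$(adc_file)" ]
}

if has_adc; then
    echo "Found credentials at $(adc_file)"
else
    echo "No credentials found; run: gcloud auth application-default login"
fi
```

The check only confirms that a credentials file is present. It does not verify that the account has Owner access to the project.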
Deploying
We recommend creating a new directory and running the commands below inside that directory.
Note
The deployment process will create a state file in the directory where it is run. The state file keeps track of the deployed resources and their state, and it is used for future updates and for deleting the cluster. Because the state file resides in this directory, run any future update or deletion commands from the same directory so that determined-deploy can read it.
To deploy the cluster, run:
det-deploy gcp up --cluster-id CLUSTER_ID --project-id PROJECT_ID
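The recommendation to deploy from a dedicated directory can be wrapped in a small helper. This is a sketch under stated assumptions, not part of determined-deploy: the deploy_cluster function and the DRY_RUN variable are hypothetical names, and by default the command is only printed rather than executed.

```shell
# Sketch: run the deploy from a dedicated directory so the state file
# always ends up in one known place. All names here are illustrative.

DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print only, 0 = execute

deploy_cluster() {
    # $1 = working directory, $2 = cluster ID, $3 = project ID
    mkdir -p "$1" || return 1
    (
        # The subshell keeps the caller's working directory unchanged.
        cd "$1" || exit 1
        if [ "$DRY_RUN" = "1" ]; then
            echo "det-deploy gcp up --cluster-id $2 --project-id $3"
        else
            det-deploy gcp up --cluster-id "$2" --project-id "$3"
        fi
    )
}
```

For example, deploy_cluster ./det-cluster my-cluster my-project prints the command that would run inside ./det-cluster. Setting DRY_RUN=0 executes it there instead, so the state file is created in that directory.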
Required Arguments:
| Argument | Description | Default Value | Required |
|---|---|---|---|
| --cluster-id | A string appended to resources to uniquely identify the cluster. | None | True |
| --project-id | The project to deploy the cluster in. | None | True |
Optional Arguments:
| Argument | Description | Default Value | Required |
|---|---|---|---|
|  | The path to the Service Account JSON key file if using a Service Account. Including this flag will supersede default Google Cloud user credentials. | None | False |
|  | Whether to use preemptible agent instances. | false | False |
|  | The type of GPU to use for the agent instances. Ensure | nvidia-tesla-k80 | False |
|  | The number of GPUs on each agent instance. Between 1-8 (more GPUs require more powerful | 8 | False |
|  | The maximum number of agent instances at one time. | 5 | False |
|  | The maximum amount of time an agent can sit idle before being shut down. | 10m | False |
|  | The network to create (ensure there isn’t a network with the same name already in the project, otherwise the deployment will fail). | det-default- | False |
|  | The region to deploy the cluster in. | us-west1 | False |
|  | The zone to deploy the cluster in. |  | False |
|  | Instance type to use for the master instance. | n1-standard-2 | False |
|  | Instance type to use for the agent instances. | n1-standard-32 | False |
|  | Minimum CPU platform for the master instance. | Intel Skylake | False |
|  | Minimum CPU platform for the agent instances. Ensure the platform is compatible with your selected | Intel Broadwell | False |
Note
The deployment process may take 5-10 minutes and will return the Web-UI along with additional cluster information once resources have been created.
The following gcloud commands will help to validate your configuration, including resource availability in your desired region and zone:
# Validate that the GCP Project ID exists
gcloud projects list
# Verify that the environment_image is listed
gcloud compute images list --filter=name:<environment_image>
# Check that a zone is available in the configured region
gcloud compute zones list --filter=region:<region>
# List the available machine types (for master_machine_type and agent_machine_type) in the configured zone
gcloud compute machine-types list --filter=zone:<zone>
# List the valid gpu_type values for the configured zone
gcloud compute accelerator-types list --filter=zone:<zone>
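Before querying GCP, a mismatch between the chosen region and zone can be caught locally. The sketch below relies on the GCP naming convention that a zone name is its region name followed by a hyphen and a single letter (for example, us-west1-b is in us-west1). The zone_in_region function is an illustrative helper, not part of gcloud or determined-deploy.

```shell
# Sketch: check locally that a zone name belongs to a region.
# Relies on GCP zone names having the form <region>-<letter>.

zone_in_region() {
    # $1 = zone (e.g. us-west1-b), $2 = region (e.g. us-west1)
    case "$1" in
        "$2"-?) return 0 ;;   # region, a hyphen, then exactly one character
        *)      return 1 ;;
    esac
}

if zone_in_region us-west1-b us-west1; then
    echo "zone and region are consistent"
fi
```

This only checks the names against each other. Whether the zone actually offers the machine and GPU types you need still has to be confirmed with the gcloud commands above.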
Updating the cluster
If you need to make changes to your cluster, you can re-run det-deploy gcp up [args] in the same directory and your cluster will be updated. The determined-deploy package will only replace resources that need to be replaced based on the changes you’ve made in the updated execution.
Warning
If you’d like to change the region of a deployment after it has already been deployed, we recommend deleting the cluster first, then redeploying the cluster with the new region.
De-provisioning the cluster
To bring down the cluster, run the following in the same directory where you ran the deploy command:
det-deploy gcp down [optional args]
By default, determined-deploy will use the .tfstate file in the current directory to determine which resources to de-provision. In addition, the following optional arguments are available:
| Argument | Description | Default Value | Required |
|---|---|---|---|
|  | The path to the Service Account JSON key file if using a Service Account. Including this flag will supersede default Google Cloud user credentials. | None | False |
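Because de-provisioning depends on the state file in the current directory, it can help to confirm that one is present before running the command. The following is a minimal sketch: the has_state_file helper is hypothetical, and matching any file that ends in .tfstate is an assumption about how the state file is named.

```shell
# Sketch: detect whether a directory contains a Terraform state file.
# The "*.tfstate" pattern is an assumption about the file name.

has_state_file() {
    # $1 = directory to inspect (defaults to the current directory)
    for f in "${1:-.}"/*.tfstate; do
        # If nothing matches, the pattern stays literal and -f fails.
        [ -f "$f" ] && return 0
    done
    return 1
}

if has_state_file .; then
    echo "State file found in $(pwd)"
else
    echo "No state file here; change to the deployment directory first"
fi
```

The same check applies before re-running det-deploy gcp up to update an existing cluster.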
Warning
determined-deploy will not delete active agent instances when you de-provision the cluster. Generally, the master instance shuts down inactive agents after an idle period, but if you want to de-provision the cluster while agent instances still exist, you must delete all agent instances first. You can find them by filtering for instances named det-agent-<cluster-id>; each agent has a full name of the form det-agent-<cluster-id>-<pet name>.
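One way to find leftover agents is a gcloud name filter built from the naming scheme described above. The sketch below only prints the listing command. The helper names agent_name_filter and list_agents_command are illustrative, and the cluster ID is assumed to contain no regular-expression metacharacters.

```shell
# Sketch: build a gcloud filter that matches agent instances for one cluster.
# Helper names are illustrative; nothing is executed against GCP here.

agent_name_filter() {
    # $1 = cluster ID used at deploy time
    # "~" is gcloud's regular-expression match operator.
    printf 'name~^det-agent-%s-' "$1"
}

list_agents_command() {
    # Print the gcloud command that lists the matching instances.
    printf 'gcloud compute instances list --filter="%s"\n' \
        "$(agent_name_filter "$1")"
}

list_agents_command my-cluster
```

Each instance returned by the printed command can then be removed with gcloud compute instances delete, passing the instance name and its zone, before you run det-deploy gcp down.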
Appendix
Using Service Account Credentials
For finer-grained security control, you can create a Service Account, or select an existing one from the service account key page in the Google Cloud Console, and ensure it has the following IAM Roles:
Cloud SQL Admin
Compute Admin
Compute Network Admin
Security Admin
Service Account Admin
Service Account User
Service Networking Admin
Storage Admin
Roles provide the Service Account permissions to create specific resources in your project. You can add roles to Service Accounts following this guide.
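If you prefer the command line to the console, the roles listed above can be granted with gcloud projects add-iam-policy-binding. The sketch below only prints one command per role. The role IDs are our mapping of the display names above to predefined GCP role identifiers and should be verified against your project before use; print_role_bindings and DET_ROLES are hypothetical names.

```shell
# Sketch: print one role-binding command per IAM role listed above.
# The role IDs are an assumed mapping of the display names; verify before use.

DET_ROLES="roles/cloudsql.admin
roles/compute.admin
roles/compute.networkAdmin
roles/iam.securityAdmin
roles/iam.serviceAccountAdmin
roles/iam.serviceAccountUser
roles/servicenetworking.networksAdmin
roles/storage.admin"

print_role_bindings() {
    # $1 = project ID, $2 = service account email
    for role in $DET_ROLES; do
        echo "gcloud projects add-iam-policy-binding $1 --member=serviceAccount:$2 --role=$role"
    done
}
```

Calling print_role_bindings with a project ID and a service account email prints eight commands that you can review and then run individually.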
Once you have a Service Account with the appropriate roles, go to the service account key page in the Google Cloud Console and create a JSON key file. Save it to a location you’ll remember; we’ll refer to the path to this key file as the keypath, which is an optional argument you can supply when using determined-deploy. Once you have the keypath, you can use it to deploy a GCP cluster by continuing with the installation section.