Install Determined#
This user guide describes how to deploy a Determined cluster on Google Cloud Platform (GCP). The det deploy tool uses Terraform to automatically create, deploy, and configure a Determined cluster in GCP. Alternatively, if you already have a process for setting up infrastructure with Terraform, you can use our Terraform modules rather than det deploy.
For more information about using Determined on GCP, see the Deploy on GCP topic guide.
Requirements#
Project#
To get started on GCP, you will need to create a project.
Several GCP APIs must be enabled on your GCP project, including the Compute Engine, Cloud SQL Admin, Cloud Filestore, and Service Networking APIs.
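If helpful, the required APIs can also be enabled from the command line. A sketch, assuming the Compute Engine, Cloud SQL Admin, Filestore, and Service Networking APIs are the ones needed (verify the exact list against the requirements for your Determined version):

```shell
# Enable the GCP APIs Determined depends on (API list is an assumption;
# verify against the requirements for your Determined version).
gcloud services enable \
    compute.googleapis.com \
    sqladmin.googleapis.com \
    file.googleapis.com \
    servicenetworking.googleapis.com
```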
Credentials#
The det deploy tool requires credentials in order to create resources in GCP. There are two ways to provide these credentials: authenticate your user account with gcloud, or supply a service account key file (see the Service Account Credentials section below).
Use gcloud to authenticate your user account:
gcloud auth application-default login
This command will open a sign-in page in your browser where you can sign in to the Google account that has access to your project. Ensure your user account has Owner access to the project you want to deploy your cluster in.
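Before deploying, you can confirm which account and project gcloud will use by default. For example:

```shell
# Show the account gcloud is currently authenticated as.
gcloud auth list
# Show the project that will be used by default.
gcloud config get-value project
```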
Resource Quotas#
The default GCP Resource Quotas for GPUs are relatively low; you may wish to request a quota increase.
Install#
Install Terraform.
Install determined using pip:
pip install determined
Note
The pip install determined command installs the determined library, which includes the Determined command-line interface (CLI).
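To confirm both prerequisites are installed and on your PATH before deploying, you can check their versions:

```shell
# Verify the Terraform and Determined CLI installations.
terraform version
det --version
```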
Deploy a Cluster#
We recommend creating a new directory and running the commands below inside that directory.
Note
The deployment process will create Terraform state and variables files in the directory where it is run. The state file keeps track of deployed resources and their state and is used to update or delete the cluster in the future. The variables file includes all Terraform variables used for deployment (e.g., service account keypath, cluster ID, GCP region and zone).
Any future update or deletion commands should be run inside the same directory so det deploy can read the state and variables files. If either of these files is deleted, it will be difficult to manage the deployment afterward. Storing these files in a safe location is strongly recommended.
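After a successful deployment, the directory should contain files along these lines (exact names may vary by version):

```shell
# List the files det deploy created in the current directory.
ls
# Expect to see the Terraform variables and state files, e.g.:
#   terraform.tfvars.json
#   *.tfstate
```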
To deploy the cluster, run:
det deploy gcp up --cluster-id CLUSTER_ID --project-id PROJECT_ID
CLUSTER_ID is an arbitrary unique ID for the new cluster. We recommend choosing a cluster ID that is memorable and helps identify what the cluster is being used for.
The deployment process may take 5-10 minutes. When it completes, summary information about the newly deployed cluster will be printed, including the URL of the Determined master.
Required Arguments:#
Argument | Description | Default Value
---|---|---
--cluster-id | A string appended to resources to uniquely identify the cluster. | required
--project-id | The project to deploy the cluster in. | required
Optional Arguments:#
Argument | Description | Default Value
---|---|---
--keypath | The path to the service account JSON key file, if using a service account. Including this flag will supersede default Google Cloud user credentials. | Not set
--preemptible | Whether to use preemptible dynamic agent instances. | False
--gpu-type | The type of GPU to use for the agent instances. Ensure the GPU type is available in your configured zone. | nvidia-tesla-t4
--gpu-num | The number of GPUs on each agent instance. Between 0 and 8 (more GPUs require a more powerful machine type). | 4
--max-dynamic-agents | Maximum number of dynamic agent instances at one time. | 5
--max-aux-containers-per-agent | The maximum number of containers running for agents in the auxiliary resource pool. | 100
--max-idle-agent-period | The length of time to wait before idle dynamic agents will be automatically terminated. | 10m
--network | The network to create (ensure there isn't a network with the same name already in the project, otherwise the deployment will fail). | det-default-
--region | The region to deploy the cluster in. | us-west1
--zone | The zone to deploy the cluster in. |
--master-instance-type | Instance type to use for the master instance. | n1-standard-2
--aux-agent-instance-type | Instance type to use for the agent instances in the auxiliary resource pool. | n1-standard-4
--compute-agent-instance-type | Instance type to use for the agent instances in the compute resource pool. | n1-standard-32
--min-cpu-platform-master | Minimum CPU platform for the master instance. | Intel Skylake
--min-cpu-platform-agent | Minimum CPU platform for the agent instances. Ensure the platform is compatible with your selected GPU type. | Intel Broadwell
--local-state-path | Directory used to store cluster metadata. The same directory cannot be used for multiple clusters at the same time. | Current working directory
--master-config-template-path | Path to the custom master.yaml template. | Not set
The following gcloud commands will help to validate your configuration, including resource availability in your desired region and zone:
# Validate that the GCP Project ID exists.
gcloud projects list
# Verify that the environment_image is listed.
gcloud compute images list --filter=name:<environment_image>
# Check that a zone is available in the configured region.
gcloud compute zones list --filter=region:<region>
# List the available machine types (for master_machine_type and agent_machine_type) in the configured zone.
gcloud compute machine-types list --filter=zone:<zone>
# List the valid gpu_type values for the configured zone.
gcloud compute accelerator-types list --filter=zone:<zone>
Update a Cluster#
If you need to make changes to your cluster, you can rerun det deploy gcp up [args] in the same directory and your cluster will be updated. The det deploy tool will only replace resources that need to be replaced based on the changes you've made in the updated execution.
Note
If you'd like to change the region of a deployment after it has already been deployed, we recommend deleting the cluster first, then redeploying the cluster with the new region.
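For example, to grow the cluster's maximum size, you could rerun the original command with one flag updated (all other flags unchanged):

```shell
# Rerun in the same directory; only the changed resources are replaced.
det deploy gcp up --cluster-id CLUSTER_ID --project-id PROJECT_ID \
    --max-dynamic-agents 8
```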
Destroy a Cluster#
To bring down the cluster, run the following in the same directory where you ran det deploy gcp up:
det deploy gcp down
det deploy will use the .tfstate and terraform.tfvars.json files in the current directory to determine which resources to destroy. If you deployed with a service account JSON key file, the same credentials file will be used for deprovisioning. Otherwise, default Google Cloud credentials are used.
Custom master.yaml templates#
Similar to the corresponding AWS feature, advanced users who require deep customization of master settings (i.e., the master.yaml config file) can use the master.yaml templating feature. Since det deploy gcp fills in many infrastructure-related values, such as subnetwork IDs or boot disk images, we provide a simplified templating solution, similar to Helm charts in Kubernetes. The template language is based on Go templates and includes the sprig helper library and a toYaml serialization helper.
Example workflow:
Get the default template using:
det deploy gcp dump-master-config-template > /path/to/master.yaml.tmpl
Customize the template by editing it in any text editor. For example, suppose you want to use the (default) 4-GPU instances for the default compute pool, but you also often run single-GPU notebook jobs, for which a single-GPU instance would be ideal. To do this, add a third pool, compute-pool-solo, with a customized instance type.
Start with the default template, and find the resource_pools section:
resource_pools:
  - pool_name: aux-pool
    max_aux_containers_per_agent: {{ .resource_pools.pools.aux_pool.max_aux_containers_per_agent }}
    provider:
      instance_type:
        {{- toYaml .resource_pools.pools.aux_pool.instance_type | nindent 8 }}
      {{- toYaml .resource_pools.gcp | nindent 6}}
  - pool_name: compute-pool
    max_aux_containers_per_agent: 0
    provider:
      instance_type:
        {{- toYaml .resource_pools.pools.compute_pool.instance_type | nindent 8 }}
      cpu_slots_allowed: true
      {{- toYaml .resource_pools.gcp | nindent 6}}
Then, append a new section:
- pool_name: compute-pool-solo
  max_aux_containers_per_agent: 0
  provider:
    instance_type:
      machine_type: n1-standard-4
      gpu_type: nvidia-tesla-t4
      gpu_num: 1
      preemptible: false
    {{- toYaml .resource_pools.gcp | nindent 6}}
Use the new template:
det deploy gcp <ALL PREVIOUSLY USED FLAGS> --master-config-template-path /path/to/edited/master.yaml.tmpl
All set! Check the Cluster page in the WebUI to ensure your cluster has 3 resource pools. In case of errors, ssh to the master instance as instructed by the det deploy gcp output, and check sudo journalctl -u google-startup-scripts.service, /var/log/cloud-init-output.log, or sudo docker logs determined-master.
Service Account Credentials#
For more security controls, you can create a service account or select an existing service account from the service account key page in the Google Cloud Console and ensure it has the following IAM roles:
Cloud Filestore Editor
Cloud SQL Admin
Compute Admin
Compute Network Admin
Security Admin
Service Account Admin
Service Account User
Service Networking Admin
Storage Admin
Roles provide the service account permissions to create specific resources in your project. You can add roles to service accounts following this guide.
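As a sketch, roles can also be granted from the command line with gcloud; for example, to grant the Compute Admin role (repeat for each role above; SA_NAME and PROJECT_ID are placeholders for your own values):

```shell
# Grant one of the required roles to the service account.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SA_NAME@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/compute.admin"
```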
Once you have a service account with the appropriate roles, go to the service account key page in the Google Cloud Console and create a JSON key file. Save it to a location you'll remember; we'll refer to the path to this key file as the keypath, which is an optional argument you can supply when using det deploy.
Once you have the keypath, you can use it to deploy a GCP cluster by continuing from the installation section.
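The key file can also be created with gcloud rather than the Console; a sketch, with SA_NAME and PROJECT_ID as placeholders for your own values:

```shell
# Create a JSON key for the service account and save it locally;
# pass this path as the keypath when running det deploy.
gcloud iam service-accounts keys create /path/to/keypath.json \
    --iam-account SA_NAME@PROJECT_ID.iam.gserviceaccount.com
```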
Run Determined on NVIDIA A100 GPUs#
Determined makes it possible to try out your models on the latest NVIDIA A100 GPUs; however, there are a few considerations:
A100s may not be available in your default GCP region and zone, and you may need to specify a different one explicitly. See more on GPU availability.
Make sure you have sufficient resource quota for A100s in your target region and zone. See more on quotas.
Adjust the maximum number of instances to be within your quota using --max-dynamic-agents NUMBER.
This command line will spin up a cluster of up to 2 A100s in the us-central1-c zone:
det deploy gcp up --cluster-id CLUSTER_ID --project-id PROJECT_ID \
--max-dynamic-agents 2 \
--compute-agent-instance-type a2-highgpu-1g --gpu-num 1 \
--gpu-type nvidia-tesla-a100 \
--region us-central1 --zone us-central1-c \
--gpu-env-image determinedai/environments:cuda-11.3-pytorch-1.12-tf-2.11-gpu-0.24.0 \
--cpu-env-image determinedai/environments:py-3.8-pytorch-1.12-tf-2.11-cpu-0.24.0
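To check where A100s are actually offered before picking a zone, you can list accelerator types (shown here for us-central1 zones):

```shell
# List zones in us-central1 that offer the A100 accelerator type.
gcloud compute accelerator-types list \
    --filter="name:nvidia-tesla-a100 AND zone:us-central1"
```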