Dynamic Agents on GCP

This document describes how to install, configure, and upgrade a deployment of PEDL with Dynamic Agents on GCP. PEDL consists of several components:

  • a master that schedules workloads and stores metadata
  • one or more agents that run workloads, typically using GPUs

When running PEDL with Dynamic Agents, the PEDL master dynamically provisions and terminates Compute Engine instances to meet the needs of the cluster.

  • Provisioning new PEDL agents is quick: we make API calls to GCP to provision new instances within a few seconds of new tasks arriving. Within a few minutes, the new instances will have registered themselves with the PEDL master and started running tasks.
  • When PEDL agents become idle, we give them a five-minute grace period (configurable via max_idle_agent_period) before terminating the instances. This grace period gives an idle PEDL agent instance a short window in which to receive new tasks.

The PEDL master and agents should typically be installed and configured by a system administrator. Each user of PEDL should also install a copy of the command-line tools, as described here.

These instructions describe how to install PEDL with Dynamic Agents on GCP.

System Requirements

Compute Engine Project

The PEDL master and the PEDL agents are intended to run in the same project.

Instance Labels

When using Dynamic Agents on GCP, PEDL identifies the Compute Engine instances that it is managing using a configurable instance label (see configuration for details). Administrators should be careful to ensure that this label is not used by other Compute Engine instances that are launched outside of PEDL; if that assumption is violated, unexpected behavior may occur.
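Before settling on a label, it can help to check whether any existing instances already carry it. The following is a minimal sketch using the gcloud CLI. The key and value shown are the documented defaults for a master running outside GCP and stand in for whatever you set in label_key and label_value; the guard around gcloud is only there so the snippet is harmless on a machine without the CLI.

```shell
# Assumed values: substitute the label_key and label_value from your master.yaml.
LABEL_KEY="managed-by"
LABEL_VALUE="determined-ai-pedl"
FILTER="labels.${LABEL_KEY}=${LABEL_VALUE}"

# List every instance in the current gcloud project that carries the label.
if command -v gcloud >/dev/null 2>&1; then
  gcloud compute instances list --filter="${FILTER}"
else
  echo "gcloud not found; the filter would be: ${FILTER}"
fi
```

If this lists instances that were not launched by PEDL, pick a different label key or value.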

Compute Engine Images

  • The PEDL master node will run on a custom image that will be shared with you by Determined AI.

  • PEDL agent nodes will run on a custom image that will be shared with you by Determined AI.

Compute Engine Machine Types

  • The PEDL master node should be deployed on a Compute Engine instance with at least 2 CPUs (Intel Broadwell or later), 4GB of RAM, and 100GB of disk storage, for example an n1-standard-2 or a more powerful machine type.

GCP API Access

  • The PEDL master needs to run as a service account that has the permissions to manage Compute Engine instances. There are two options:

    1. Create a dedicated service account with the Compute Admin role, then set the PEDL master to use this account. See Compute Engine IAM roles for more details on how to configure the service account.

      • In order for the PEDL agent to be associated with a service account, the PEDL master needs to have access to service accounts. Please ensure the service account of the PEDL master has the Service Account User role.

      • In order for the PEDL agent to use a shared VPC, the service account that the master runs with needs to have the Compute Network User role.

    2. Use the default service account and add the Compute Engine: Read Write scope.

  • Optionally, the PEDL agent may be associated with a service account.
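Option 1 can be set up from the command line. The following is a sketch under stated assumptions: the project ID and account name are hypothetical placeholders, and the role IDs are the standard identifiers for the Compute Admin and Service Account User roles. Adapt it to your own IAM policies before running it.

```shell
# Hypothetical names: replace with your own project and account name.
PROJECT_ID="my-project"
SA_NAME="pedl-master"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

if command -v gcloud >/dev/null 2>&1; then
  # Create the service account that the PEDL master instance will run as.
  gcloud iam service-accounts create "${SA_NAME}" \
    --project="${PROJECT_ID}" \
    --display-name="PEDL master"

  # Compute Admin lets the master manage instances; Service Account User
  # lets it attach a service account to the agents it launches.
  for ROLE in roles/compute.admin roles/iam.serviceAccountUser; do
    gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
      --member="serviceAccount:${SA_EMAIL}" \
      --role="${ROLE}"
  done
else
  echo "gcloud not found; would configure ${SA_EMAIL}"
fi
```

If the agents use a shared VPC, the same add-iam-policy-binding call with roles/compute.networkUser grants the Compute Network User role; that binding usually belongs on the project that hosts the shared network.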

Note

Access scopes are the legacy method of specifying permissions for your instance. A best practice is to set the full cloud-platform access scope on the instance, then securely limit the service account's API access with Cloud IAM roles. See Access Scopes for details.

Network Requirements

See Network Requirements for details.

Cluster Configuration

The PEDL cluster is configured with the master.yaml file located in /usr/local/pedl/etc on the PEDL master instance. Below you'll find an example configuration and an explanation of each field.

provisioner:
  master_url: <scheme://host:port>
  startup_script: <startup script>
  agent_docker_network: pedl
  max_idle_agent_period: 5m

  provider: gcp
  base_config: <instance resource base configuration>
  project: <project id>
  zone: <zone>
  boot_disk_size: 200
  boot_disk_source_image: projects/<project-id>/global/images/<image-name>
  label_key: <label key for agent discovery>
  label_value: <label value for agent discovery>
  name_prefix: <name prefix>
  network_interface:
    network: projects/<project>/global/networks/<network>
    subnetwork: projects/<project>/regions/<region>/subnetworks/<subnetwork>
    external_ip: false
  network_tags: ["<tag1>", "<tag2>"]
  service_account:
    email: "<service account email>"
    scopes: ["https://www.googleapis.com/auth/cloud-platform"]
  instance_type:
    machine_type: n1-standard-32
    gpu_type: nvidia-tesla-v100
    gpu_num: 4
  max_instances: 5

  • provisioner: top-level field that contains the configuration needed for the PEDL master to provision the PEDL agent instances.

  • max_idle_agent_period: length of the waiting period before terminating an idle agent instance. This string is a sequence of decimal numbers, each with an optional fraction and a unit suffix, such as "30s", "1h", or "1m30s". Valid time units are "s", "m", "h". (Optional)

  • master_url: the full URL of the master. A valid URL is in the format scheme://host:port. The scheme must be either http or https. If the master is deployed on GCP, rather than hardcoding the IP address, we advise you to set the host to one of the following aliases: internal-ip or external-ip. Which one you should select depends on your network configuration. On master startup, we will replace the alias with its real value. Defaults to http as the scheme, the local IP address as the host, and 8080 as the port. (Optional)

  • startup_script: startup script for agents. This script runs as soon as an agent instance starts up. For example, it can be used for formatting and mounting a disk. Defaults to an empty string. (Optional)

  • agent_docker_network: the Docker network to use for the PEDL agent and task containers. If this is set to "host", Docker host-mode networking will be used instead. The default value is "pedl".

  • provider: the provider to provision instances with. To run dynamic agents on GCP, set it to gcp. (Required)

  • base_config: instance resource base configuration that will be merged with the fields below to construct the GCP instance insert request. See REST Resource: instances (https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert) for details.

  • project: the project id of the GCP resources used by PEDL. Defaults to the project of the master. (Optional)

  • zone: the zone of the GCP resources used by PEDL. Defaults to the zone of the master. (Optional)

  • boot_disk_size: size of the root volume of the PEDL agent in GB. We recommend at least 100GB. Defaults to 200. (Optional)

  • boot_disk_source_image: the boot disk source image of the PEDL agent that was shared with you. To use a specific version of the PEDL agent image from a specific project, it should be set in the format: projects/<project-id>/global/images/<image-id>. (Required)

  • label_key: key for labeling the PEDL agent instances. Defaults to managed-by. (Optional)

  • label_value: value for labeling the PEDL agent instances. Defaults to the master instance name if the master is on GCP; otherwise, defaults to determined-ai-pedl. (Optional)

  • name_prefix: name prefix to set for the PEDL agent instances. The names of the PEDL agent instances are a concatenation of the name prefix and a pet name. Defaults to the master instance name if the master is on GCP; otherwise, defaults to determined-ai-pedl. (Optional)

  • network_interface: network configuration for the PEDL agent instances. See the GCP API Access section for the suggested configuration. (Required)

    • network: network resource for the PEDL agent instances. The network configuration should specify the project id of the network. It should be set in the format: projects/<project>/global/networks/<network>. (Required)

    • subnetwork: subnetwork resource for the PEDL agent instances. The subnet configuration should specify the project id and the region of the subnetwork. It should be set in the format: projects/<project>/regions/<region>/subnetworks/<subnetwork>. (Required)

    • external_ip: flag that controls whether the PEDL agent instances use an external IP address. See Network Requirements for guidance on whether an external IP should be set. Defaults to false. (Optional)

  • network_tags: an array of network tags used to apply firewall rules to the PEDL agent instances. These are the tags you identified or created in System Requirements - Firewall Rules. Defaults to an empty array. (Optional)

  • service_account: service account for the PEDL agent instances. See the GCP API Access section for suggested configuration. (Optional)

    • email: email of the service account for the PEDL agent instances. Defaults to an empty string. (Optional)

    • scopes: list of scopes authorized for the PEDL agent instances. As suggested in GCP API Access, we recommend you set the scopes to ["https://www.googleapis.com/auth/cloud-platform"]. Defaults to ["https://www.googleapis.com/auth/cloud-platform"]. (Optional)

  • instance_type: type of instance for the PEDL agents. (Optional)

    • machine_type: type of machine for the PEDL agents. Defaults to n1-standard-32. (Optional)

    • gpu_type: type of GPU for the PEDL agents. Defaults to nvidia-tesla-v100. (Optional)

    • gpu_num: number of GPUs for the PEDL agents. Defaults to 4. (Optional)

  • max_instances: maximum number of PEDL agent instances. Defaults to 5. (Optional)
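Since most fields have defaults, a provisioner section can be much shorter than the full example. The sketch below writes only the fields marked (Required) above to a scratch file; the file name master-minimal.yaml is hypothetical, the angle-bracket placeholders must be filled in, and it assumes every omitted field falls back to its documented default.

```shell
# Write a minimal provisioner section to a scratch file (hypothetical name).
# After filling in the placeholders, merge it into
# /usr/local/pedl/etc/master.yaml on the master instance.
cat > master-minimal.yaml << 'EOF'
provisioner:
  provider: gcp
  boot_disk_source_image: projects/<project-id>/global/images/<image-name>
  network_interface:
    network: projects/<project>/global/networks/<network>
    subnetwork: projects/<project>/regions/<region>/subnetworks/<subnetwork>
EOF
```

With this configuration, agents would be launched in the master's project and zone as n1-standard-32 instances with four nvidia-tesla-v100 GPUs, up to five at a time, per the defaults listed above.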

How to attach a disk containing a data set to each dynamic agent

If your input data set is on a persistent disk, you can attach that disk to each dynamic agent by describing it in the base instance configuration (base_config) and preparing it with commands in the startup script (startup_script). See REST Resource: instances (https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert) for the full list of configuration options supported by GCP. See Formatting and mounting a zonal persistent disk for more examples of formatting or mounting disks in GCP.

Here is an example master configuration that attaches a second existing disk.
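A configuration along these lines might look like the following sketch, written to a scratch file with a hypothetical name. The disks entry uses field names from the GCP instance resource (source, mode, boot, autoDelete). The device name /dev/sdb and the choice to mount read-only are assumptions, as is the way base_config merges with the boot disk that PEDL generates; the mount point /mnt/disks/second matches the validation command later in this section.

```shell
# Scratch file (hypothetical name); merge the result into
# /usr/local/pedl/etc/master.yaml alongside the other provisioner fields.
cat > master-disk-example.yaml << 'EOF'
provisioner:
  provider: gcp
  # Other provisioner fields as described in Cluster Configuration.
  base_config:
    disks:
      # An existing persistent disk, attached read-only to every agent.
      - source: projects/<project>/zones/<zone>/disks/<disk-name>
        mode: READ_ONLY
        boot: false
        autoDelete: false
  startup_script: |
    # Assumes the second disk appears as /dev/sdb and is already formatted.
    mkdir -p /mnt/disks/second
    mount -o ro /dev/sdb /mnt/disks/second
EOF
```

Setting autoDelete to false keeps the data set disk from being deleted when an agent instance is terminated.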

Note

If a specific non-root user needs to access the disk, run the tasks with the POSIX UID/GID of that user (see Running tasks as particular agent users for details) and grant that UID/GID access to the disk.

After installing the master, you can use the following commands to validate that tasks can access the attached disk.

cat > command.yaml << EOF
bind_mounts:
  - host_path: /mnt/disks/second
    container_path: /second
EOF
# Test attached read-only disk.
pedl command run --config-file command.yaml ls -l /second

Installation

These instructions describe how to install PEDL for the first time; for directions on how to upgrade an existing PEDL installation, see the Upgrades section below.

Ensure that you are using the most up-to-date PEDL images. Keep the image IDs handy as we will need them later.

Master

To install the master, we will launch an instance from the PEDL master image.

Let's start by navigating to the Compute Engine Dashboard of the GCP Console. Click "Create Instance" and follow the instructions below:

  1. Choose Machine Type: we recommend an n1-standard-2 or more powerful.

  2. Configure Boot Disk:

    a. Choose Boot Disk Image: find the PEDL master image in "Images" and click "Select".

    b. Set Boot Disk Size: set Size to at least 100GB. If you are upgrading a previous PEDL installation, use its snapshot or existing disk instead. This disk will be used to store all your experiment metadata and checkpoints.

  3. Configure Identity and API access: choose the service account according to the GCP API Access requirements.

  4. Configure Firewalls: choose or create firewall rules according to the Network Requirements. Select Allow HTTP traffic.

  5. Review and launch the instance.

  6. SSH into the PEDL master and edit the config at /usr/local/pedl/etc/master.yaml according to the guide on Cluster Configuration.

  7. Start the PEDL master by entering make -C /usr/local/pedl enable-master into the terminal.

Agent

There is no installation needed for the Agent. The PEDL master will dynamically launch PEDL agent instances based on the Cluster Configuration.

Upgrades

Upgrading an existing PEDL installation with Dynamic Agents on GCP requires the same steps as an installation without dynamic agents. See upgrades.