Dynamic Agents on GCP

This document describes how to install, configure, and upgrade a deployment of PEDL with Dynamic Agents on GCP. PEDL consists of several components:

  • a master that schedules workloads and stores metadata
  • one or more agents that run workloads, typically using GPUs

When running PEDL with Dynamic Agents, the PEDL master dynamically provisions and terminates Compute Engine instances to meet the needs of the cluster.

  • Provisioning new PEDL agents is quick: the master makes API calls to GCP to provision new instances within a few seconds of new tasks arriving. Within a few minutes, the new instances register themselves with the PEDL master and start running tasks.
  • When PEDL agents become idle, the master gives them a five-minute grace period before terminating the instances. This grace period gives an idle agent instance a short window in which to receive new tasks.
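
The grace-period rule above can be illustrated with a small sketch. This is not PEDL's implementation: the function name should_terminate, the idle_since bookkeeping, and the choice to terminate exactly at the five-minute mark are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

# Five-minute grace period, as described above.
GRACE_PERIOD = timedelta(minutes=5)


def should_terminate(idle_since, now, grace_period=GRACE_PERIOD):
    """Return True once an agent has been idle for at least the grace period.

    idle_since is None while the agent is still running tasks (hypothetical
    bookkeeping; the real master may track idleness differently).
    """
    if idle_since is None:
        return False
    return now - idle_since >= grace_period


now = datetime(2020, 1, 1, 12, 0, 0)
print(should_terminate(None, now))                        # agent is busy
print(should_terminate(now - timedelta(minutes=2), now))  # inside the grace period
print(should_terminate(now - timedelta(minutes=6), now))  # past the grace period
```

An agent that receives a new task during the grace period would have its idle_since reset to None in this sketch, so it is not terminated.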

The PEDL master and agents should typically be installed and configured by a system administrator. Each user of PEDL should also install a copy of the command-line tools, as described here.

These instructions describe how to install PEDL with Dynamic Agents on GCP.

System Requirements

Compute Engine Project

The PEDL master and the PEDL agents are intended to run in the same project.

Compute Engine Instance Labels

PEDL with Dynamic Agents assumes that every Compute Engine instance carrying the configured label_key:label_value pair is managed by the PEDL master (see Cluster Configuration). If this pair is not unique to your PEDL installation, the master may treat unrelated instances as its own agents, which leads to unexpected behavior for both your PEDL installation and those instances.
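
To see why the label pair must be unique, consider the following sketch of label-based selection. The instance dictionaries and the function managed_by_master are hypothetical; they only mimic the idea that any instance carrying the pair is considered managed, and they use the default pair determined-ai:agent.

```python
def managed_by_master(instances, label_key="determined-ai", label_value="agent"):
    """Return the names of instances that carry the configured label pair."""
    return [
        inst["name"]
        for inst in instances
        if inst.get("labels", {}).get(label_key) == label_value
    ]


# Hypothetical instances in one project.
instances = [
    {"name": "determined-ai-agent-happy-otter", "labels": {"determined-ai": "agent"}},
    {"name": "billing-db", "labels": {"team": "finance"}},
    # An unrelated instance that happens to reuse the same label pair.
    {"name": "legacy-worker", "labels": {"determined-ai": "agent"}},
]

print(managed_by_master(instances))
```

In this sketch legacy-worker is selected alongside the real agent, which is the situation to avoid by choosing a label pair that nothing else in the project uses.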

Compute Engine Images

  • The PEDL master node will run on a custom image that will be shared with you by Determined AI.

  • PEDL agent nodes will run on a custom image that will be shared with you by Determined AI.

Compute Engine Machine Types

  • The PEDL master node should be deployed on a Compute Engine instance with >= 2 CPUs (Intel Broadwell or later), 4GB of RAM, and 100GB of disk storage. This would be a Compute Engine n1-standard-2 or more powerful.

Master Cloud API Access

The PEDL master needs to run as a service account that has the permissions for managing Compute Engine:

  • You can create a dedicated service account with the role Compute Admin, or you can use the default service account with the access scope changed to have Compute Engine: Read Write.

Network and Firewall Rules

For better network performance, we advise running the agents on the same network as the master. We'll set up independent firewall rules for the master and the agent nodes.

Master

These are the rules needed for the PEDL master:

  • TCP inbound on port 8080 from the PEDL agent instances and from any IP needing access to PEDL.

  • TCP outbound on all ports to the PEDL agent instances.

Agent

These are the rules needed for the PEDL agent:

  • The PEDL agent should have an external IP address so that it can download Docker images.

  • TCP inbound on all ports from the PEDL master.

  • TCP outbound on all ports to the internet.

Note

You will also need to make sure that any internal services housing data or packages you need allow inbound traffic from the agent instances. For example, if your data is housed on S3, ensure that the PEDL agent instances have access to this data.

Cluster Configuration

The PEDL cluster is configured with the master.yaml file located at /usr/local/pedl/etc/ on the PEDL master instance. Below you'll find an example configuration and an explanation of each field.

provisioner:
  master_address: master_address
  max_idle_agent_period: 5m
  cloud: gcp
  project: gce.project-id
  zone: gce.zone
  boot_disk_source_image: global/images/image_name
  label_key: determined-ai
  label_value: agent
  name_prefix: determined-ai-agent-
  network_interface:
    network: default
    subnetwork: sub1
    external_ip: true
  network_tags: ["tag1", "tag2"]
  boot_disk_size: 100
  instance_type:
    machine_type: n1-standard-32
    gpu_type: nvidia-tesla-v100
    gpu_num: 4
  max_instances: 5

  • provisioner: top-level field that contains the configuration needed for the PEDL master to provision the PEDL agent instances.

  • max_idle_agent_period: length of the waiting period before terminating an idle agent instance. This string is a sequence of decimal numbers, each with optional fraction and a unit suffix, such as "30s", "1h", or "1m30s". Valid time units are "s", "m", "h". (Optional)

  • master_address: the address of the PEDL master. Rather than hardcoding an IP address, we advise setting this field to one of the aliases gce.internal-ip or gce.external-ip; which one to choose depends on your network configuration. If the field is set to one of these aliases, the master uses the Google Cloud API on startup to obtain the real address. (Required)

  • cloud: the cloud provider to provision instances with. To run dynamic agents on GCP, set it to be gcp. (Required)

  • project: the ID of the GCP project in which to provision the agent instances. We advise using the alias gce.project-id, which refers to the project where the master instance is. Defaults to gce.project-id. (Optional)

  • zone: the zone in which to provision the agent instances. We advise choosing a zone in the same region as the PEDL master for better network performance. Defaults to gce.zone. (Optional)

  • boot_disk_size: size of the root volume of the PEDL agent in GB. We recommend at least 100GB. Defaults to 100. (Optional)

  • boot_disk_source_image: the boot disk source image of the PEDL agent that was shared with you. To use a specific version of the PEDL agent image from a specific project, use the form projects/<project-id>/global/images/<image-id>. (Required)

  • label_key: key for labeling the PEDL agent instances. Defaults to determined-ai. (Optional)

  • label_value: value for labeling the PEDL agent instances. Defaults to agent. (Optional)

  • name_prefix: name prefix to set for the PEDL agent instances. The names of the PEDL agent instances are a concatenation of the name prefix and a pet name. Defaults to determined-ai-agent-. (Optional)

  • network_interface: network configuration for the PEDL agent instances. (Optional)

    • network: network resource for the PEDL agent instances. Defaults to default. (Optional)

    • subnetwork: subnetwork resource for the PEDL agent instances. It cannot be empty if network is not set to be default. Defaults to empty string. (Optional)

    • external_ip: flag that controls whether the PEDL agent instances use an external IP address. Defaults to true. (Optional)

  • network_tags: an array of network tags used to apply firewall rules to the PEDL agent instances. These are the tags you identified or created in System Requirements: Network and Firewall Rules. Defaults to an empty array. (Optional)

  • instance_type: type of instance for the PEDL agents. (Optional)

    • machine_type: type of machine for the PEDL agents. Defaults to n1-standard-32. (Optional)

    • gpu_type: type of GPU for the PEDL agents. Defaults to nvidia-tesla-v100. (Optional)

    • gpu_num: number of GPUs for the PEDL agents. Defaults to 4. (Optional)

  • max_instances: maximum number of PEDL agent instances. Defaults to 5. (Optional)
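
The duration format accepted by max_idle_agent_period can be illustrated with a simplified parser. This is a sketch under stated assumptions, not the parser the PEDL master uses: the function name parse_idle_period is hypothetical, and the sketch requires digits before any decimal point, which may be stricter than the real parser.

```python
import re
from datetime import timedelta

# Valid units per the field description: seconds, minutes, hours.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
# One decimal number with an optional fraction, followed by a unit suffix.
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)([smh])")


def parse_idle_period(text):
    """Parse strings such as "30s", "1h", or "1m30s" into a timedelta."""
    pos = 0
    total_seconds = 0.0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            # Unrecognized characters between two tokens.
            raise ValueError(f"invalid duration: {text!r}")
        total_seconds += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        # Empty string, no valid token, or trailing garbage.
        raise ValueError(f"invalid duration: {text!r}")
    return timedelta(seconds=total_seconds)


print(parse_idle_period("5m"))
print(parse_idle_period("1m30s"))
```

Strings with other units, such as "10ms" or "2d", are rejected by this sketch because only "s", "m", and "h" are listed as valid.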

Installation

These instructions describe how to install PEDL for the first time; for directions on how to upgrade an existing PEDL installation, see the Upgrades section below.

Ensure that you are using the most up-to-date PEDL images. Keep the image IDs handy as we will need them later.

Master

To install the master, we will launch an instance from the PEDL master image.

Let's start by navigating to the Compute Engine Dashboard of the GCP Console. Click "Create Instance" and follow the instructions below:

  1. Choose Machine Type: we recommend an n1-standard-2 or more powerful.

  2. Configure Boot Disk:

    a. Choose Boot Disk Image: find the PEDL master image in "Images" and click "Select".

    b. Set Boot Disk Size: set Size to at least 100GB. If you are upgrading a previous PEDL installation, use the snapshot or existing disk from that installation instead. This disk will be used to store all your experiment metadata and checkpoints.

  3. Configure Identity and API access: choose the service account according to the requirements in Master Cloud API Access.

  4. Configure Firewalls: choose or create firewall rules according to the requirements in Network and Firewall Rules. Select Allow HTTP traffic.

  5. Review and launch the instance.

  6. SSH into the PEDL master and edit the config at /usr/local/pedl/etc/master.yaml according to the guide on Cluster Configuration.

  7. Start the PEDL master by entering make -C /usr/local/pedl enable-master into the terminal.
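
Before starting the master in step 7, it can help to sanity-check the provisioner section edited in step 6. The following is a minimal sketch, not a tool shipped with PEDL: the function check_provisioner is hypothetical, it takes the provisioner section as an already-loaded dictionary, and its defaults and required fields are copied from the field descriptions in Cluster Configuration.

```python
import copy

# Fields marked (Required) in Cluster Configuration.
REQUIRED = ("master_address", "cloud", "boot_disk_source_image")

# Documented defaults for the optional fields.
DEFAULTS = {
    "project": "gce.project-id",
    "zone": "gce.zone",
    "boot_disk_size": 100,
    "label_key": "determined-ai",
    "label_value": "agent",
    "name_prefix": "determined-ai-agent-",
    "network_interface": {"network": "default", "subnetwork": "", "external_ip": True},
    "network_tags": [],
    "instance_type": {
        "machine_type": "n1-standard-32",
        "gpu_type": "nvidia-tesla-v100",
        "gpu_num": 4,
    },
    "max_instances": 5,
}


def check_provisioner(config):
    """Validate a provisioner dict and return it with defaults filled in."""
    missing = [key for key in REQUIRED if not config.get(key)]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    if config["cloud"] != "gcp":
        raise ValueError("cloud must be 'gcp' for Dynamic Agents on GCP")

    merged = copy.deepcopy(DEFAULTS)
    for key, value in config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key].update(value)  # merge nested sections field by field
        else:
            merged[key] = value

    net = merged["network_interface"]
    if net["network"] != "default" and not net["subnetwork"]:
        raise ValueError("subnetwork cannot be empty when network is not 'default'")
    return merged


minimal = {
    "master_address": "gce.internal-ip",
    "cloud": "gcp",
    "boot_disk_source_image": "global/images/image_name",
}
checked = check_provisioner(minimal)
print(checked["zone"], checked["max_instances"], checked["instance_type"]["gpu_num"])
```

The dictionary could come from loading master.yaml with a YAML library and reading its provisioner key; that step is left out here so the sketch has no third-party dependencies.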

Agent

There is no installation needed for the Agent. The PEDL master will dynamically launch PEDL agent instances based on the Cluster Configuration.

Upgrades

Upgrading an existing PEDL installation with Dynamic Agents on GCP requires the same steps as an installation without dynamic agents. See upgrades.