
Dynamic Agents on AWS

This document describes how to install, configure, and upgrade a deployment of PEDL with Dynamic Agents running on AWS. PEDL consists of two components:

  • a master that schedules workloads and stores metadata
  • one or more agents that run workloads, typically using GPUs

When running PEDL with Dynamic Agents, the PEDL master dynamically provisions and terminates EC2 instances to meet the needs of the cluster.

  • Provisioning new PEDL agents is quick: the master calls the AWS API to provision new instances within a few seconds of new tasks arriving. Within a few minutes, the new instances register themselves with the PEDL master and start running tasks.
  • When a PEDL agent becomes idle, the master waits for a five-minute grace period before terminating the instance. The grace period gives the idle agent a chance to receive new tasks before it is shut down.
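The grace-period rule can be illustrated with a minimal sketch. This is not PEDL's implementation: the Agent record, its fields, and the agents_to_terminate function are hypothetical names chosen for illustration. Only the five-minute grace period comes from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Grace period described above: idle agents get five minutes before termination.
IDLE_GRACE_PERIOD = timedelta(minutes=5)


@dataclass
class Agent:
    """Hypothetical record of one agent instance as tracked by the master."""
    instance_id: str
    running_tasks: int
    idle_since: Optional[datetime]  # None while the agent is running tasks


def agents_to_terminate(agents, now, grace=IDLE_GRACE_PERIOD):
    """Return the IDs of agents that have been idle for at least the grace period."""
    expired = []
    for agent in agents:
        if agent.running_tasks > 0 or agent.idle_since is None:
            continue  # busy agents are never candidates for termination
        if now - agent.idle_since >= grace:
            expired.append(agent.instance_id)
    return expired


now = datetime(2020, 1, 1, 12, 0, 0)
agents = [
    Agent("i-busy", running_tasks=2, idle_since=None),
    Agent("i-recently-idle", running_tasks=0, idle_since=now - timedelta(minutes=2)),
    Agent("i-long-idle", running_tasks=0, idle_since=now - timedelta(minutes=7)),
]
print(agents_to_terminate(agents, now))  # ['i-long-idle']
```

An agent that receives a task during the grace period would have its idle_since reset to None in this model, so it is skipped on the next check.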

The PEDL master and agents should typically be installed and configured by a system administrator. Each user of PEDL should also install a copy of the command-line tools, as described here.

These instructions describe how to install PEDL with Dynamic Agents on AWS.

System Requirements

EC2 Instance Tags

PEDL with Dynamic Agents assumes that every EC2 instance with the configured tag_key:tag_value pair is managed by the PEDL master (see Cluster Configuration). If this pair is not unique to your PEDL installation, both PEDL and any other EC2 instances carrying the pair will behave unexpectedly. For example, because the master finds agent instances by tag and terminates idle ones, it may treat unrelated instances as PEDL agents.
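One way to check that the tag pair is unused before installing is to list the instances that already carry it. The sketch below assumes the boto3 EC2 client's describe_instances call and its tag:<key> filter syntax. The helper names are hypothetical, pagination is not handled, and a stand-in client is used so that the example runs without AWS credentials.

```python
def agent_tag_filters(tag_key, tag_value):
    """Build DescribeInstances filters that match the configured tag pair."""
    return [
        {"Name": f"tag:{tag_key}", "Values": [tag_value]},
        # Skip instances that have already been terminated.
        {
            "Name": "instance-state-name",
            "Values": ["pending", "running", "stopping", "stopped"],
        },
    ]


def find_tagged_instance_ids(ec2_client, tag_key, tag_value):
    """Return the IDs of instances carrying tag_key:tag_value (first page only)."""
    response = ec2_client.describe_instances(
        Filters=agent_tag_filters(tag_key, tag_value)
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


class FakeEC2Client:
    """Stand-in for boto3.client("ec2") so the sketch runs offline."""

    def describe_instances(self, Filters):
        return {"Reservations": [{"Instances": [{"InstanceId": "i-0123456789abcdef0"}]}]}


ids = find_tagged_instance_ids(FakeEC2Client(), "determined-ai", "agent")
print(ids)
```

Against a real account, pass boto3.client("ec2", region_name=...) instead of the stand-in. An empty result before the first installation suggests that the pair is not already in use in that region.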

EC2 AMIs

  • The PEDL master node will run on a custom AMI that will be shared with you by Determined AI.

  • PEDL agent nodes will run on a custom AMI that will be shared with you by Determined AI.

EC2 Instance Types

  • The PEDL master node should be deployed on an EC2 instance with at least 2 CPUs (Intel Broadwell or later), 4GB of RAM, and 100GB of disk storage. An EC2 t2.medium or a more powerful instance type meets these requirements.

  • Each PEDL agent node must run on a P3 or P2 instance type. The instance type is set in the Cluster Configuration.

Master IAM Role

The PEDL master needs to have an IAM role with the following permissions:

  • ec2:CreateTags: used to tag the PEDL agent instances that the PEDL master provisions. These tags are set in the Cluster Configuration.

  • ec2:DescribeInstances: used to find active PEDL agent instances based on tags.

  • ec2:RunInstances: used to provision PEDL agent instances.

  • ec2:TerminateInstances: used to terminate idle PEDL agent instances.

An example IAM policy with the appropriate permissions is below:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:TerminateInstances",
                "ec2:CreateTags",
                "ec2:RunInstances"
            ],
            "Resource": "*"
        }
    ]
}
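Before attaching a policy to the master's IAM role, it can help to confirm that it grants all four actions listed above. The following is a rough sketch with a hypothetical helper name. It only compares literal action names in Allow statements and ignores wildcards such as ec2:*, Deny statements, and Resource restrictions.

```python
import json

# The four permissions listed in the Master IAM Role section.
REQUIRED_ACTIONS = {
    "ec2:CreateTags",
    "ec2:DescribeInstances",
    "ec2:RunInstances",
    "ec2:TerminateInstances",
}


def missing_actions(policy_json):
    """Return the required actions that no Allow statement names explicitly."""
    policy = json.loads(policy_json)
    statements = policy["Statement"]
    if isinstance(statements, dict):  # a single statement may appear as an object
        statements = [statements]
    allowed = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action may appear as a string
            actions = [actions]
        allowed.update(actions)
    return REQUIRED_ACTIONS - allowed


incomplete_policy = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ec2:DescribeInstances", "ec2:RunInstances"],
            "Resource": "*"
        }
    ]
}
"""
print(sorted(missing_actions(incomplete_policy)))
# ['ec2:CreateTags', 'ec2:TerminateInstances']
```

Running the same check on the example policy above returns an empty set.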

Security Groups

We'll set up separate security groups for the master and the agent nodes.

Master

These are the rules needed for the PEDL master to work.

  • TCP inbound on port 8080 from the PEDL agent security group and any IP needing access to PEDL.

  • TCP outbound on all ports to the PEDL agent security group.

Agent

These are the rules needed for the PEDL agent to work.

  • TCP inbound on all ports from the PEDL master security group.

  • TCP outbound on all ports to the internet.
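The master and agent rules above can be written down as data in the IpPermissions shape used by the EC2 API. This is a sketch under assumptions: the security group IDs and the client CIDR range are placeholders, the helper names are hypothetical, and "the internet" is taken to mean the IPv4 range 0.0.0.0/0.

```python
ALL_TCP_PORTS = (0, 65535)


def tcp_rule(from_port, to_port, group_ids=(), cidrs=()):
    """Build one TCP entry in the EC2 IpPermissions shape."""
    rule = {"IpProtocol": "tcp", "FromPort": from_port, "ToPort": to_port}
    if group_ids:
        rule["UserIdGroupPairs"] = [{"GroupId": group_id} for group_id in group_ids]
    if cidrs:
        rule["IpRanges"] = [{"CidrIp": cidr} for cidr in cidrs]
    return rule


def master_rules(agent_sg, client_cidrs):
    """Rules for the master security group, as described in the Master section."""
    return {
        # Port 8080 from the agents and from any IP that needs access to PEDL.
        "ingress": [tcp_rule(8080, 8080, [agent_sg], client_cidrs)],
        # All TCP ports out to the agents.
        "egress": [tcp_rule(*ALL_TCP_PORTS, [agent_sg])],
    }


def agent_rules(master_sg):
    """Rules for the agent security group, as described in the Agent section."""
    return {
        # All TCP ports in from the master.
        "ingress": [tcp_rule(*ALL_TCP_PORTS, [master_sg])],
        # All TCP ports out to the internet (assumed here to be 0.0.0.0/0).
        "egress": [tcp_rule(*ALL_TCP_PORTS, cidrs=["0.0.0.0/0"])],
    }


# Placeholder IDs and CIDR range, for illustration only.
master = master_rules(agent_sg="sg-agent-placeholder", client_cidrs=["203.0.113.0/24"])
agent = agent_rules(master_sg="sg-master-placeholder")
print(master["ingress"][0])
print(agent["egress"][0])
```

Each list could then be passed as the IpPermissions argument of boto3's authorize_security_group_ingress or authorize_security_group_egress, or the same rules can be entered by hand in the AWS Console.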

Note

You will also need to configure any internal services that house data or packages so that they allow inbound traffic from the agent security group. For example, if your data is stored on S3, ensure that the PEDL agent instances have access to it.

Cluster Configuration

The PEDL cluster is configured with the master.yaml file located in /usr/local/pedl/etc/ on the PEDL master instance. Below you'll find an example configuration and an explanation of each field.

provisioner:
  master_address: master_address
  max_idle_agent_period: 5m
  cloud: aws
  region: us-west-2
  image_id: ami-12345
  security_group: group_name
  ssh_key_name: determined-ai-ssh
  tag_key: determined-ai
  tag_value: agent
  instance_name: determined-ai-agent
  root_volume_size: 100
  max_instances: 5
  instance_type: p3.8xlarge

  • provisioner: top level field that contains the configuration needed for the PEDL master to provision the PEDL agent instances.

  • master_address: the address of the master. Rather than hardcoding an IP address, we advise setting this field to one of the following aliases: ec2.local-ipv4, ec2.public-ipv4, ec2.local-hostname, or ec2.public-hostname. Which alias to choose depends on your network configuration. If the configured value is one of these aliases, the master uses the AWS API at startup to obtain the real address. (Required)

  • max_idle_agent_period: length of the waiting period before an idle agent instance is terminated. The value is a string made up of one or more decimal numbers, each with an optional fraction and a unit suffix, such as "30s", "1h", or "1m30s". Valid time units are "s", "m", and "h". (Optional)

  • cloud: the cloud provider to provision instances with. To run dynamic agents on AWS, set this to aws. (Required)

  • region: the cloud provider region in which to provision the agent instances. We advise using the same region as the PEDL master for better network performance. Defaults to ec2.region. (Optional)

  • image_id: the AMI ID of the PEDL agent that was shared with you. (Required)

  • security_group: the security group to launch the PEDL agents in. This is the agent security group you identified or created under Security Groups in System Requirements. (Required)

  • ssh_key_name: the name of the SSH key registered with AWS that is used for SSH access to the agent instances. (Required)

  • tag_key: key for tagging the PEDL agent instances. Defaults to determined-ai. (Optional)

  • tag_value: value for tagging the PEDL agent instances. Defaults to agent. (Optional)

  • instance_name: name to set for the PEDL agent instances. Defaults to determined-ai-agent. (Optional)

  • root_volume_size: size of the root volume of the PEDL agent in GB. We recommend at least 100GB. Defaults to 100. (Optional)

  • max_instances: maximum number of PEDL agent instances. Defaults to 5. (Optional)

  • instance_type: type of instance for the PEDL agents. Only P3 and P2 instance types are supported. Defaults to p3.8xlarge. (Optional)
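The field descriptions above can be summarized in a small validation sketch. This is not how the PEDL master reads master.yaml: parse_duration and resolve_provisioner are hypothetical helpers that encode the documented duration format, required fields, and defaults, and the instance-type check is an approximation based on the name prefix.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)([smh])")

# Defaults and required fields as documented in the list above.
DEFAULTS = {
    "region": "ec2.region",
    "tag_key": "determined-ai",
    "tag_value": "agent",
    "instance_name": "determined-ai-agent",
    "root_volume_size": 100,
    "max_instances": 5,
    "instance_type": "p3.8xlarge",
}
REQUIRED = ("master_address", "cloud", "image_id", "security_group", "ssh_key_name")


def parse_duration(text):
    """Convert a duration string such as "1m30s" into seconds."""
    if not text:
        raise ValueError("empty duration")
    position = 0
    total = 0.0
    while position < len(text):
        match = _TOKEN.match(text, position)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    return total


def resolve_provisioner(config):
    """Check required fields, fill in documented defaults, and validate values."""
    missing = [key for key in REQUIRED if key not in config]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    resolved = {**DEFAULTS, **config}
    # Approximate check: only P2 and P3 instance types are supported.
    if resolved["instance_type"].split(".")[0] not in {"p2", "p3"}:
        raise ValueError(f"unsupported instance type: {resolved['instance_type']}")
    if "max_idle_agent_period" in resolved:
        parse_duration(resolved["max_idle_agent_period"])  # raises if malformed
    return resolved


config = {
    "master_address": "ec2.local-ipv4",
    "cloud": "aws",
    "image_id": "ami-12345",
    "security_group": "group_name",
    "ssh_key_name": "determined-ai-ssh",
    "max_idle_agent_period": "1m30s",
}
resolved = resolve_provisioner(config)
print(resolved["instance_type"], resolved["max_instances"])  # p3.8xlarge 5
print(parse_duration(resolved["max_idle_agent_period"]))     # 90.0
```

Running a check like this against a draft configuration before starting the master can catch a missing required field or a malformed duration early.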

Installation

These instructions describe how to install PEDL for the first time; for directions on how to upgrade an existing PEDL installation, see the Upgrades section below.

Ensure that the most up-to-date PEDL AMIs have been shared with you, and keep the AMI IDs handy (e.g. ami-0f4677bfc3161edc8), as you will need them later.

Master

To install the master, we will launch an instance from the PEDL master AMI.

Let's start by navigating to the EC2 Dashboard of the AWS Console. Click "Launch Instance" and follow the instructions below:

  1. Choose AMI: find the PEDL Master AMI in "My AMIs" and click "Select".

  2. Choose Instance Type: we recommend a t2.medium or more powerful.

  3. Configure Instance: choose an IAM role that meets the requirements in Master IAM Role.

  4. Add Storage: click Add New Volume and add an EBS volume of at least 100GB. If you are upgrading a previous PEDL installation, attach the same EBS volume that the previous installation used. This volume will be used to store all your experiment metadata and checkpoints.

  5. Configure Security Group: choose or create a security group that meets the master requirements in Security Groups.

  6. Review and launch the instance.

  7. SSH into the PEDL master and edit the config at /usr/local/pedl/etc/master.yaml according to the guide on Cluster Configuration.

  8. Start the PEDL master by running make -C /usr/local/pedl enable-master in the terminal.

Agent

The agent requires no installation. The PEDL master dynamically launches PEDL agent instances based on the Cluster Configuration.

Upgrades

Upgrading an existing PEDL installation with Dynamic Agents on AWS requires the same steps as an installation without dynamic agents. See upgrades.