Shortcuts

Install Determined on AWS

This document describes how to deploy a Determined cluster on Amazon Web Services (AWS). We provide the determined-deploy package for easy creation and deployment of these resources. If you would rather create the cluster manually, see the Manual Deployment section below.

For more information on using Determined on AWS, see the Determined on AWS topic guide.

determined-deploy Python Package

The determined-deploy package uses AWS CloudFormation to automatically deploy and configure a Determined cluster. The CloudFormation builds the necessary components for Determined into a single CloudFormation stack.

Requirements

  • The machine executing determined-deploy must be configured with either AWS credentials or have an IAM role attached with permissions to access AWS CloudFormation APIs. See the AWS Documentation for information on how to use AWS credentials.

Installation

These instructions describe how to install determined-deploy:

pip install determined-deploy

Deploying

det-deploy aws up --cluster-id CLUSTER_ID --keypair KEYPAIR

Deployment Types

Simple

The simple deployment provides an easy way to deploy a Determined cluster in AWS. This creates the master instance in the default subnet for the account.

VPC

The VPC deployment creates a separate VPC with public subnets, and the Determined cluster is deployed into these subnets.

Secure

The secure deployment creates resources to lock down the Determined cluster. These resources are:

  • A VPC with a public and private subnet

  • A NAT Gateway for the private subnet to make outbound connections

  • An S3 VPC Gateway so the private subnet can access S3

  • A bastion instance in the public subnet

  • A master instance in the private subnet

CLI Arguments

Spinning up the cluster

det-deploy aws up --cluster-id CLUSTER_ID --keypair KEYPAIR

Argument

Description

Default Value

Required

--cluster-id

Unique id for the cluster which will be used for the CloudFormation Stack Name

None

True

--keypair

AWS Keypair for master and agent instances

None

True

--master-instance-type

Instance Type for master instance

m5.large

False

--agent-instance-type

Instance Type for agent instances

p2.8xlarge

False

--deployment-type

The deployment template to use - Deployment Types

simple

False

--inbound-cidr

CIDR range for inbound traffic

0.0.0.0/0

False

--db-password

Password for postgres user for Database

postgres

False

--hasura-secret

Password for Hasura GraphQL

hasura

False

--region

AWS region to deploy into

us-west-2

False

--max-idle-agent-period

Amount of idle time an agent will wait before shutting down

10m (10 minutes)

False

--max-instances

Maximum instances that the master instance can allocate to the cluster

5

False

--dry-run

Print the template but do not execute it

False

False

Tearing down the cluster

det-deploy aws down --cluster-id CLUSTER_ID

Argument

Description

Default Value

Required

--cluster-id

Unique id for the cluster which will be used for the CloudFormation Stack Name

None

True

Manual Deployment

Database

Determined requires a Postgres-compatible database such as AWS Aurora. Configure the cluster to use the database by updating the master.yaml with the database information. Make sure to create a database before running the Determined Cluster. (e.g. CREATE DATABASE <database-name>).

Example master.yaml:

db:
  user: "${database-user}"
  password: "${database-password}"
  host: "${database-hostname}"
  port: 5432
  name: "${database-name}"

Security Groups

VPC Security Groups provide a set of rules for inbound and outbound network traffic. The requirements for a Determined cluster are:

Master

  • Egress on all ports to agent security group

  • Egress outbound to the internet

  • Ingress on 8080 to view to the Determined UI

  • Ingress on 22 for SSH (not required but strongly advised)

  • Ingress on all ports from agent security group

Example:

MasterSecurityGroupEgress:
  Type: AWS::EC2::SecurityGroupEgress
  Properties:
    GroupId: !GetAtt MasterSecurityGroup.GroupId
    DestinationSecurityGroupId: !GetAtt AgentSecurityGroup.GroupId
    FromPort: 0
    ToPort: 65535
    IpProtocol: tcp

MasterSecurityGroupInternet:
  Type: AWS::EC2::SecurityGroupEgress
  Properties:
    GroupId: !GetAtt MasterSecurityGroup.GroupId
    CidrIp: 0.0.0.0/0
    FromPort: 0
    ToPort: 65535
    IpProtocol: tcp

MasterSecurityGroupIngress:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !GetAtt MasterSecurityGroup.GroupId
    FromPort: 8080
    ToPort: 8080
    IpProtocol: tcp
    SourceSecurityGroupId: !GetAtt AgentSecurityGroup.GroupId

MasterSecurityGroupIngressUI:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !GetAtt MasterSecurityGroup.GroupId
    FromPort: 8080
    ToPort: 8080
    IpProtocol: tcp
    CidrIp: !Ref InboundCIDRRange

MasterSSHIngress:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !GetAtt MasterSecurityGroup.GroupId
    IpProtocol: tcp
    FromPort: 22
    ToPort: 22
    CidrIp: !Ref InboundCIDRRange

Agent

  • Egress on all ports to internet

  • Ingress on all ports from master security group

  • Ingress on all ports from agent security group

  • Ingress on 22 for SSH (not required but strongly advised)

Example:

AgentSecurityGroupEgress:
  Type: AWS::EC2::SecurityGroupEgress
  Properties:
    GroupId: !GetAtt AgentSecurityGroup.GroupId
    CidrIp: 0.0.0.0/0
    FromPort: 0
    ToPort: 65535
    IpProtocol: tcp

AgentSecurityGroupIngressMaster:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !GetAtt AgentSecurityGroup.GroupId
    FromPort: 0
    ToPort: 65535
    IpProtocol: tcp
    SourceSecurityGroupId: !GetAtt MasterSecurityGroup.GroupId

AgentSecurityGroupIngressAgent:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !GetAtt AgentSecurityGroup.GroupId
    FromPort: 0
    ToPort: 65535
    IpProtocol: tcp
    SourceSecurityGroupId: !GetAtt AgentSecurityGroup.GroupId

AgentSSHIngress:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !GetAtt AgentSecurityGroup.GroupId
    IpProtocol: tcp
    FromPort: 22
    ToPort: 22
    CidrIp: !Ref InboundCIDRRange

IAM Roles

IAM Roles are comprised of IAM Polices, which provide access to AWS APIs such as the EC2 or S3 API. The required IAM Policies needed for the Determined cluster are:

Master

  • Allow EC2 to assume role

  • Allow EC2 to dynamically create and terminate instances with agent role

MasterRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: 2012-10-17
      Statement:
        - Effect: Allow
          Principal:
            Service:
              - ec2.amazonaws.com
          Action:
            - sts:AssumeRole
    Policies:
      - PolicyName: determined-agent-policy
        PolicyDocument:
          Version: 2012-10-17
          Statement:
            - Effect: Allow
              Action:
                - ec2:DescribeInstances
                - ec2:TerminateInstances
                - ec2:CreateTags
                - ec2:RunInstances
              Resource: "*"
      - PolicyName: pass-role
        PolicyDocument:
          Version: 2012-10-17
          Statement:
            - Effect: Allow
              Action: iam:PassRole
              Resource: !GetAtt AgentRole.Arn

Agent

  • Allow EC2 to assume role

  • Allow S3 access for checkpoint storage

  • Allow agent instance to describe instances

AgentRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: 2012-10-17
      Statement:
        - Effect: Allow
          Principal:
            Service:
              - ec2.amazonaws.com
          Action:
            - sts:AssumeRole
    Policies:
      - PolicyName: agent-s3-policy
        PolicyDocument:
          Version: 2012-10-17
          Statement:
            - Effect: Allow
              Action: "s3:*"
              Resource: "*"
      - PolicyName: determined-ec2
        PolicyDocument:
          Version: 2012-10-17
          Statement:
            - Effect: Allow
              Action:
                - ec2:DescribeInstances
              Resource: "*"

Master Node

The master node should be deployed on an EC2 instance supporting >= 2 CPUs (Intel Broadwell or later), 4GB of RAM, and 100GB of disk storage. This corresponds to an EC2 t2.medium instance or better. The AMI should be the default Ubuntu 16.04 AMI.

Running Determined

  1. Install Docker and create the determined Docker network.

    apt-get remove docker docker-engine docker.io containerd runc
    apt-get update
    apt-get install -y \
      apt-transport-https \
      ca-certificates \
      curl \
      gnupg-agent \
      software-properties-common
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
    add-apt-repository \
      "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
      $(lsb_release -cs) \
      stable"
    apt-get update
    apt-get install -y docker-ce docker-ce-cli containerd.io
    
    docker network create determined
    
  2. Configure the cluster with master.yaml. See Cluster Configuration for more information.

    Notes:

    • hasura-secret is a user-defined random string. Ensure that the value is the same in master.yaml and the docker run command for the Hasura container.

    • agent-ami should be the latest Determined Agent AMI.

    • instance-type should be any p2 or p3 EC2 instance type.

    Warning

    An important assumption of Determined with Dynamic Agents is that any EC2 instances with the configured tag_key:tag_value pair are managed by the Determined master. This pair should be unique to your Determined installation. If it is not, Determined may inadvertently manage your non-Determined EC2 instances.

    checkpoint_storage:
      type: s3
      bucket: ${CheckpointBucket}
      save_experiment_best: 0
      save_trial_best: 1
      save_trial_latest: 1
    
    db:
      user: postgres
      password: "${DBPassword}"
      host: "${Database.Endpoint.Address}"
      port: 5432
      name: determined
    
    hasura:
      secret: <hasura-secret>
      address: determined-graphql:8080
    
    provisioner:
      iam_instance_profile_arn: ${AgentInstanceProfile.Arn}
      image_id: ${AgentAmiId}
      agent_docker_image: determinedai/determined-agent:${Version}
      instance_name: determined-agent-${UserName}
      instance_type: ${AgentInstanceType}
      master_url: http://local-ipv4:8080
      max_idle_agent_period: ${MaxIdleAgentPeriod}
      max_instances: ${MaxInstances}
      network_interface:
        public_ip: true
        security_group_id: ${AgentSecurityGroup.GroupId}
      provider: aws
      root_volume_size: 200
      ssh_key_name: ${Keypair}
      tag_key: determined-${UserName}
      tag_value: determined-${UserName}-agent
    
  3. Start Hasura.

    docker run \
      -d \
      --rm \
      --name determined-graphql \
      --network determined \
      -e HASURA_GRAPHQL_ADMIN_SECRET=<hasura-secret> \
      -e HASURA_GRAPHQL_CONSOLE_ASSETS_DIR=/srv/console-assets \
      -e HASURA_GRAPHQL_DATABASE_URL=postgres://postgres:${DBPassword}@${Database.Endpoint.Address}:5432/determined \
      -e HASURA_GRAPHQL_ENABLED_APIS=graphql,metadata \
      -e HASURA_GRAPHQL_ENABLED_LOG_TYPES=startup \
      -e HASURA_GRAPHQL_ENABLE_CONSOLE=false \
      -e HASURA_GRAPHQL_ENABLE_TELEMETRY=false \
      hasura/graphql-engine:v1.1.0
    
  4. Start the Determined master.

    docker run \
      --rm \
      --network determined \
      -p 8080:8080 \
      -v master.yaml:/etc/determined/master.yaml \
      determinedai/determined-master:${Version}
    

Once you have your Determined cluster running on AWS, try out some of our tutorials.