Install Determined on AWS¶
This document describes how to deploy a Determined cluster on Amazon Web Services (AWS). We provide the determined-deploy package for easy creation and deployment of these resources. If you would rather create the cluster manually, see the Manual Deployment section below.
For more information on using Determined on AWS, see the Determined on AWS topic guide.
determined-deploy
Python Package¶
The determined-deploy
package uses AWS CloudFormation to automatically deploy
and configure a Determined cluster. The CloudFormation builds the necessary components
for Determined into a single CloudFormation stack.
Requirements¶
The machine executing
determined-deploy
must be configured with either AWS credentials or have an IAM role attached with permissions to access AWS CloudFormation APIs. See the AWS Documentation for information on how to use AWS credentials.
Installation¶
These instructions describe how to install determined-deploy
:
pip install determined-deploy
Deploying¶
det-deploy aws up --cluster-id CLUSTER_ID --keypair KEYPAIR
Deployment Types¶
Simple¶
The simple deployment provides an easy way to deploy a Determined cluster in AWS. This creates the master instance in the default subnet for the account.
VPC¶
The VPC deployment creates a separate VPC with public subnets, and the Determined cluster is deployed into these subnets.
Secure¶
The secure deployment creates resources to lock down the Determined cluster. These resources are:
A VPC with a public and private subnet
A NAT Gateway for the private subnet to make outbound connections
An S3 VPC Gateway so the private subnet can access S3
A bastion instance in the public subnet
A master instance in the private subnet
CLI Arguments¶
Spinning up the cluster
det-deploy aws up --cluster-id CLUSTER_ID --keypair KEYPAIR
Argument |
Description |
Default Value |
Required |
---|---|---|---|
|
Unique id for the cluster which will be used for the CloudFormation Stack Name |
None |
True |
|
AWS Keypair for master and agent instances |
None |
True |
|
Instance Type for master instance |
m5.large |
False |
|
Instance Type for agent instances |
p2.8xlarge |
False |
|
The deployment template to use - Deployment Types |
simple |
False |
|
CIDR range for inbound traffic |
0.0.0.0/0 |
False |
|
Password for |
postgres |
False |
|
Password for Hasura GraphQL |
hasura |
False |
|
AWS region to deploy into |
us-west-2 |
False |
|
Amount of idle time an agent will wait before shutting down |
10m (10 minutes) |
False |
|
Maximum instances that the master instance can allocate to the cluster |
5 |
False |
|
Print the template but do not execute it |
False |
False |
Tearing down the cluster
det-deploy aws down --cluster-id CLUSTER_ID
Argument |
Description |
Default Value |
Required |
---|---|---|---|
|
Unique id for the cluster which will be used for the CloudFormation Stack Name |
None |
True |
Manual Deployment¶
Database¶
Determined requires a Postgres-compatible database such as AWS Aurora. Configure the cluster to use the database by
updating the master.yaml
with the database information. Make sure to create a database before running
the Determined Cluster. (e.g. CREATE DATABASE <database-name>
).
Example master.yaml
:
db:
user: "${database-user}"
password: "${database-password}"
host: "${database-hostname}"
port: 5432
name: "${database-name}"
Security Groups¶
VPC Security Groups provide a set of rules for inbound and outbound network traffic. The requirements for a Determined cluster are:
Master¶
Egress on all ports to agent security group
Egress outbound to the internet
Ingress on 8080 to view to the Determined UI
Ingress on 22 for SSH (not required but strongly advised)
Ingress on all ports from agent security group
Example:
MasterSecurityGroupEgress:
Type: AWS::EC2::SecurityGroupEgress
Properties:
GroupId: !GetAtt MasterSecurityGroup.GroupId
DestinationSecurityGroupId: !GetAtt AgentSecurityGroup.GroupId
FromPort: 0
ToPort: 65535
IpProtocol: tcp
MasterSecurityGroupInternet:
Type: AWS::EC2::SecurityGroupEgress
Properties:
GroupId: !GetAtt MasterSecurityGroup.GroupId
CidrIp: 0.0.0.0/0
FromPort: 0
ToPort: 65535
IpProtocol: tcp
MasterSecurityGroupIngress:
Type: AWS::EC2::SecurityGroupIngress
Properties:
GroupId: !GetAtt MasterSecurityGroup.GroupId
FromPort: 8080
ToPort: 8080
IpProtocol: tcp
SourceSecurityGroupId: !GetAtt AgentSecurityGroup.GroupId
MasterSecurityGroupIngressUI:
Type: AWS::EC2::SecurityGroupIngress
Properties:
GroupId: !GetAtt MasterSecurityGroup.GroupId
FromPort: 8080
ToPort: 8080
IpProtocol: tcp
CidrIp: !Ref InboundCIDRRange
MasterSSHIngress:
Type: AWS::EC2::SecurityGroupIngress
Properties:
GroupId: !GetAtt MasterSecurityGroup.GroupId
IpProtocol: tcp
FromPort: 22
ToPort: 22
CidrIp: !Ref InboundCIDRRange
Agent¶
Egress on all ports to internet
Ingress on all ports from master security group
Ingress on all ports from agent security group
Ingress on 22 for SSH (not required but strongly advised)
Example:
AgentSecurityGroupEgress:
Type: AWS::EC2::SecurityGroupEgress
Properties:
GroupId: !GetAtt AgentSecurityGroup.GroupId
CidrIp: 0.0.0.0/0
FromPort: 0
ToPort: 65535
IpProtocol: tcp
AgentSecurityGroupIngressMaster:
Type: AWS::EC2::SecurityGroupIngress
Properties:
GroupId: !GetAtt AgentSecurityGroup.GroupId
FromPort: 0
ToPort: 65535
IpProtocol: tcp
SourceSecurityGroupId: !GetAtt MasterSecurityGroup.GroupId
AgentSecurityGroupIngressAgent:
Type: AWS::EC2::SecurityGroupIngress
Properties:
GroupId: !GetAtt AgentSecurityGroup.GroupId
FromPort: 0
ToPort: 65535
IpProtocol: tcp
SourceSecurityGroupId: !GetAtt AgentSecurityGroup.GroupId
AgentSSHIngress:
Type: AWS::EC2::SecurityGroupIngress
Properties:
GroupId: !GetAtt AgentSecurityGroup.GroupId
IpProtocol: tcp
FromPort: 22
ToPort: 22
CidrIp: !Ref InboundCIDRRange
IAM Roles¶
IAM Roles are comprised of IAM Polices, which provide access to AWS APIs such as the EC2 or S3 API. The required IAM Policies needed for the Determined cluster are:
Master¶
Allow EC2 to assume role
Allow EC2 to dynamically create and terminate instances with agent role
MasterRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Principal:
Service:
- ec2.amazonaws.com
Action:
- sts:AssumeRole
Policies:
- PolicyName: determined-agent-policy
PolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Action:
- ec2:DescribeInstances
- ec2:TerminateInstances
- ec2:CreateTags
- ec2:RunInstances
Resource: "*"
- PolicyName: pass-role
PolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Action: iam:PassRole
Resource: !GetAtt AgentRole.Arn
Agent¶
Allow EC2 to assume role
Allow S3 access for checkpoint storage
Allow agent instance to describe instances
AgentRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Principal:
Service:
- ec2.amazonaws.com
Action:
- sts:AssumeRole
Policies:
- PolicyName: agent-s3-policy
PolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Action: "s3:*"
Resource: "*"
- PolicyName: determined-ec2
PolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Action:
- ec2:DescribeInstances
Resource: "*"
Master Node¶
The master node should be deployed on an EC2 instance supporting >= 2 CPUs (Intel Broadwell or later), 4GB of RAM, and 100GB of disk storage. This corresponds to an EC2 t2.medium instance or better. The AMI should be the default Ubuntu 16.04 AMI.
Running Determined¶
Install Docker and create the
determined
Docker network.apt-get remove docker docker-engine docker.io containerd runc apt-get update apt-get install -y \ apt-transport-https \ ca-certificates \ curl \ gnupg-agent \ software-properties-common curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - add-apt-repository \ "deb [arch=amd64] https://download.docker.com/linux/ubuntu \ $(lsb_release -cs) \ stable" apt-get update apt-get install -y docker-ce docker-ce-cli containerd.io docker network create determined
Configure the cluster with
master.yaml
. See Cluster Configuration for more information.Notes:
hasura-secret
is a user-defined random string. Ensure that the value is the same inmaster.yaml
and thedocker run
command for the Hasura container.agent-ami
should be the latest Determined Agent AMI.instance-type
should be any p2 or p3 EC2 instance type.
Warning
An important assumption of Determined with Dynamic Agents is that any EC2 instances with the configured tag_key:tag_value pair are managed by the Determined master. This pair should be unique to your Determined installation. If it is not, Determined may inadvertently manage your non-Determined EC2 instances.
checkpoint_storage: type: s3 bucket: ${CheckpointBucket} save_experiment_best: 0 save_trial_best: 1 save_trial_latest: 1 db: user: postgres password: "${DBPassword}" host: "${Database.Endpoint.Address}" port: 5432 name: determined hasura: secret: <hasura-secret> address: determined-graphql:8080 provisioner: iam_instance_profile_arn: ${AgentInstanceProfile.Arn} image_id: ${AgentAmiId} agent_docker_image: determinedai/determined-agent:${Version} instance_name: determined-agent-${UserName} instance_type: ${AgentInstanceType} master_url: http://local-ipv4:8080 max_idle_agent_period: ${MaxIdleAgentPeriod} max_instances: ${MaxInstances} network_interface: public_ip: true security_group_id: ${AgentSecurityGroup.GroupId} provider: aws root_volume_size: 200 ssh_key_name: ${Keypair} tag_key: determined-${UserName} tag_value: determined-${UserName}-agent
Start Hasura.
docker run \ -d \ --rm \ --name determined-graphql \ --network determined \ -e HASURA_GRAPHQL_ADMIN_SECRET=<hasura-secret> \ -e HASURA_GRAPHQL_CONSOLE_ASSETS_DIR=/srv/console-assets \ -e HASURA_GRAPHQL_DATABASE_URL=postgres://postgres:${DBPassword}@${Database.Endpoint.Address}:5432/determined \ -e HASURA_GRAPHQL_ENABLED_APIS=graphql,metadata \ -e HASURA_GRAPHQL_ENABLED_LOG_TYPES=startup \ -e HASURA_GRAPHQL_ENABLE_CONSOLE=false \ -e HASURA_GRAPHQL_ENABLE_TELEMETRY=false \ hasura/graphql-engine:v1.1.0
Start the Determined master.
docker run \ --rm \ --network determined \ -p 8080:8080 \ -v master.yaml:/etc/determined/master.yaml \ determinedai/determined-master:${Version}
Once you have your Determined cluster running on AWS, try out some of our tutorials.