Determined on AWS¶
This document describes the resources needed to run a Determined cluster on AWS.
Overview¶
A master node (a single, non-GPU instance) manages the cluster, provisioning and terminating agent nodes dynamically as new workloads are started by users. The master stores metadata in an external database; using AWS Aurora or RDS is recommended. Users interact with the cluster by using a CLI or a visiting a WebUI hosted on the master. Nodes in the cluster communicate with one another over a Virtual Private Cloud (VPC); users interact with the master via a designed external port configured during installation.
Following the diagram, a standard execution would be:
User submits experiment to master
Master creates one or more agents (depending on experiment) if they don’t exist
Agent accesses required data, images, etc.
Agent completes experiment and communicates completion to master
Master shuts down agents that are no longer needed
Architecture Details¶
This section provides details on the core resources which are required to run Determined, and periphery resources which are optionally configurable based on user requirements.
Core Resources¶
- Master Node: A single EC2 instance that:
hosts the Determined Web UI where users monitor their experiments
responds to commands from the Determined CLI
schedules workloads
manages other EC2 instances (agents) which run experiments
Agent Node(s): For most Determined clusters in AWS, the number of agents varies with the volume and type of workloads currently running. All agents are managed by the master and users do not have to interact with them directly. For more information on scaling clusters, or dynamic agents, see dynamic agents.
Database: Determined uses an Amazon Relational Database Service (Postgres) database to store metadata. In addition, Determined leverages a Hasura query engine on the master for GraphQL querying.
AWS Identity and Access Management (IAM)**: IAM roles are attached to the instances to manage the creation of compute (EC2) resources and access to Amazon Simple Storage Service (S3) buckets for checkpoints, TensorBoards, and other data storage as needed.
Security Groups: VPC Security Groups ensure that each node in the cluster can communicate with each other.
Periphery Resources¶
Network/Subnetwork: The Determined cluster runs in an existing or newly created VPC.
Elastic IP: For production clusters, the master should have an associated elastic IP; otherwise, AWS automatically assigns an ephemeral IP.
Amazon Simple Storage Service (S3) Bucket: The Determined cluster can leverage an existing S3 bucket (assuming it has the correct associated permissions), or the Cloudformation script can create a bucket with the cluster.
Next Steps¶
See the AWS installation guide to see how to deploy a Determined cluster on AWS.