Elastic Infrastructure

When running in a cloud environment, Determined can automatically provision and terminate GPU instances as the set of deep learning workloads on the cluster changes. We call this capability elastic infrastructure; the agents that are provisioned by the system are called dynamic agents.

The diagram below outlines the high-level system architecture when using dynamic agents:


As the diagram shows, each scaling cycle proceeds as follows:

  1. The master collects information on the agents and workloads in the cluster.

  2. The master calculates the ideal size of the cluster, and decides how many agents to launch and which agents to terminate.

  3. The master makes API calls to agent providers, such as AWS and GCP, to provision and terminate agents as necessary.
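The three steps above can be sketched as a single scaling pass. This is an illustrative sketch only, not Determined's implementation: `AgentProvider`, `ClusterSnapshot`, `ScalingDecision`, `decide`, and `scaling_pass` are hypothetical names, and the sizing rule (one agent per pending workload, idle agents released only when nothing is pending) is a deliberate simplification.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple


class AgentProvider(Protocol):
    """Hypothetical stand-in for a cloud API such as AWS or GCP."""

    def launch(self, count: int) -> None: ...

    def terminate(self, agent_ids: List[str]) -> None: ...


@dataclass
class ClusterSnapshot:
    """Step 1: what the master has collected about the cluster."""
    idle_agent_ids: List[str]
    pending_workloads: int


@dataclass
class ScalingDecision:
    """Step 2: how many agents to launch and which ones to terminate."""
    to_launch: int
    to_terminate: List[str]


def decide(snapshot: ClusterSnapshot) -> ScalingDecision:
    # Simplifying assumption: each pending workload needs one agent, and
    # idle agents are released only when no workloads are pending.
    if snapshot.pending_workloads > 0:
        shortfall = snapshot.pending_workloads - len(snapshot.idle_agent_ids)
        return ScalingDecision(to_launch=max(0, shortfall), to_terminate=[])
    return ScalingDecision(to_launch=0, to_terminate=list(snapshot.idle_agent_ids))


def scaling_pass(snapshot: ClusterSnapshot, provider: AgentProvider) -> ScalingDecision:
    """Step 3: turn the decision into provider API calls."""
    decision = decide(snapshot)
    if decision.to_launch:
        provider.launch(decision.to_launch)
    if decision.to_terminate:
        provider.terminate(decision.to_terminate)
    return decision


@dataclass
class RecordingProvider:
    """In-memory provider that records calls instead of contacting a cloud."""
    calls: List[Tuple[str, object]] = field(default_factory=list)

    def launch(self, count: int) -> None:
        self.calls.append(("launch", count))

    def terminate(self, agent_ids: List[str]) -> None:
        self.calls.append(("terminate", agent_ids))
```

Keeping the provider behind a small interface lets the same decision logic drive any cloud backend, and it can be exercised without cloud credentials by passing a `RecordingProvider`.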

Architecture Details

The master periodically collects information on idle agents (agents with no active workloads) and pending workloads (workloads waiting to be scheduled), and then uses this information to scale the cluster automatically.

  • When workloads are pending and cannot be scheduled because no agents are available, the master calculates the number of agents needed to execute them. The calculation is based on the configured scaling behavior and agent type. Within a few seconds of a new pending workload arriving, the master will attempt to provision a new instance from the current cloud provider. Once the agent instance has been created, it will automatically connect to the current master. The time it takes to create a new instance depends on the cloud provider and the configured instance type, but about 60 seconds is typical.

  • An agent that is not running any containers is considered idle. By default, idle dynamic agents will automatically be terminated after 5 minutes of inactivity. This behavior gives agents a chance to run multiple workloads after they have been provisioned.
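The two rules above can be expressed as a pair of small functions. This is a minimal sketch under stated assumptions rather than the master's actual algorithm: `slots_per_agent` stands in for the agent type, `max_agents` stands in for the scaling configuration, and all function and field names are hypothetical. Only the 5-minute idle default comes from the text above.

```python
import math
from dataclasses import dataclass
from typing import List

# Default idle period before a dynamic agent is terminated (5 minutes).
IDLE_TIMEOUT_SECONDS = 5 * 60


@dataclass
class Agent:
    agent_id: str
    running_containers: int
    idle_since: float  # time (in seconds) at which the agent last became idle


def agents_to_launch(pending_slots: int, slots_per_agent: int,
                     current_agents: int, max_agents: int) -> int:
    """Number of new agents to provision for the pending workloads.

    Assumes every agent offers `slots_per_agent` slots (e.g. GPUs) and that
    the cluster may never exceed `max_agents` agents in total.
    """
    needed = math.ceil(pending_slots / slots_per_agent)
    headroom = max(0, max_agents - current_agents)
    return min(needed, headroom)


def agents_to_terminate(agents: List[Agent], now: float,
                        idle_timeout: float = IDLE_TIMEOUT_SECONDS) -> List[str]:
    """IDs of agents with no containers that have been idle for the full timeout."""
    return [
        a.agent_id
        for a in agents
        if a.running_containers == 0 and now - a.idle_since >= idle_timeout
    ]
```

For example, 10 pending slots on 4-slot agents require three new agents, since the count is rounded up. An agent that became idle 200 seconds ago is kept, which leaves it available for the next workload without another provisioning delay.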
