Elastic InfrastructureΒΆ

When running in a cloud environment, Determined can automatically provision and terminate GPU instances as the set of deep learning workloads on the cluster changes. We call this capability elastic infrastructure; the agents that are provisioned by the system are called dynamic agents.

The diagram below outlines the high-level system architecture when using dynamic agents:

../_images/det-arch-elastic-infra.png

Following the diagram, the execution would be:

  1. The master collects information on idle agents (agents with no active workloads) and pending workloads (agents waiting to be scheduled).

  2. The master calculates the ideal size of the cluster and decides how many agents to launch and which agents to terminate. The calculation is done based on the configured scaling behavior and the specification of the resource pools.

    • An agent that is not running any containers is considered idle. By default, idle dynamic agents will automatically be terminated after 5 minutes of inactivity. This behavior gives agents a chance to run multiple workloads after they have been provisioned.

  3. The master makes API calls to agent providers, such as AWS and GCP, to provision and terminate agents as necessary.

  4. Once the agent instance has been created, it will automatically connect to the current master. The time it takes to create a new instance depends on the cloud provider and the configured instance type, but >60 seconds is typical.