Resource Pools

To run tasks such as experiments or notebooks, Determined needs resources (CPUs, GPUs) on which to run them. However, different tasks have different resource requirements and, given the cost of GPU resources, it is important to match resources to each goal so that you get the most value for your money. For example, you may want to run your training on powerful V100 GPU machines, while you want your TensorBoards to run on cheap CPU machines with minimal resources.

Determined has the concept of a resource pool: a collection of identical resources that are physically located close to each other. Determined allows you to configure your cluster with multiple resource pools and to assign tasks to a specific pool, so that you can use different sets of resources for different tasks. Each resource pool handles scheduling and instance provisioning independently.

When you configure a cluster, you set which pool is the default for auxiliary tasks and which pool is the default for compute tasks. CPU-only tasks such as TensorBoards run in the default auxiliary pool unless you specify a different pool when launching the task. Tasks that require a slot, such as experiments or GPU notebooks, use the default compute pool unless otherwise specified. For this reason, we recommend always creating a cluster with at least two pools: one with low-cost CPU instances for auxiliary tasks and one with GPU instances for compute tasks. This is the default setup when launching a cluster on AWS or GCP via det deploy.
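
For example, the Master Configuration might designate a cheap CPU pool and a GPU pool as the two defaults (a minimal sketch; the pool names here are placeholders):

resource_manager:
  type: agent
  default_aux_resource_pool: cpu-pool
  default_compute_resource_pool: gpu-pool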

Here are some scenarios where it can be valuable to use multiple resource pools:

  • Use GPUs for training while using CPUs for TensorBoards.

    You create one pool, aws-v100, that provisions p3dn.24xlarge instances (large V100 EC2 instances) and another pool, aws-cpu, that provisions m5.large instances (small, cheap CPU instances). You train your experiments in the aws-v100 pool while you run your TensorBoards in the aws-cpu pool (see the configuration sketch after this list). When your experiments complete, the aws-v100 pool can scale down to zero to save money while your TensorBoards continue to run. Without resource pools, you would have needed to keep a p3dn.24xlarge instance running to keep the TensorBoard alive. By default, TensorBoards always run in the default auxiliary pool.

  • Use GPUs in different availability zones on AWS.

    You have one pool, aws-v100-us-east-1a, that runs p3dn.24xlarge instances in the us-east-1a availability zone and another pool, aws-v100-us-east-1b, that runs p3dn.24xlarge instances in the us-east-1b availability zone. You can launch an experiment into aws-v100-us-east-1a and, if AWS does not have sufficient p3dn.24xlarge capacity in that zone, relaunch the experiment into aws-v100-us-east-1b to check whether that zone has capacity. Note that the “AWS does not have capacity” notification is currently only visible in the master logs, not on the experiment itself.

  • Use spot/preemptible instances and fall back to on-demand if needed.

    You have one pool, aws-v100-spot, that you use to try to run training on spot instances and another pool, aws-v100-on-demand, that you fall back to if AWS does not have enough spot capacity to run your job. Determined will not switch from spot to on-demand instances automatically, but with resource pools configured appropriately, users can easily select the right pool based on the job they want to run and the current spot availability in their AWS region. For more information on using spot instances, refer to AWS Spot Instances.

  • Use cheaper GPUs for prototyping on small datasets and expensive GPUs for training on full datasets.

    You have one pool with less expensive GPUs that you use for initial prototyping on small datasets and another pool that you use for training more mature models on large datasets.
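
As a concrete illustration of the scenarios above, here is a sketch of the resource_pools section of a Master Configuration defining a V100 training pool, a cheap CPU pool, and a spot pool. This is a sketch only: the values are placeholders, and the provider fields (instance_type, max_instances, spot) follow the AWS provisioner settings documented in the Master Configuration reference.

resource_pools:
  - pool_name: aws-v100
    description: On-demand V100 instances for training
    provider:
      type: aws
      instance_type: p3dn.24xlarge
      max_instances: 4
  - pool_name: aws-cpu
    description: Small, cheap CPU instances for TensorBoards
    provider:
      type: aws
      instance_type: m5.large
      max_instances: 2
  - pool_name: aws-v100-spot
    description: Spot V100 instances, used when AWS has spot capacity
    provider:
      type: aws
      instance_type: p3dn.24xlarge
      max_instances: 4
      spot: true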

Limitations

Currently, resource pools are completely independent of each other, so it is not possible to launch an experiment that tries to use one pool and then falls back to another if a certain condition is met. You must manually move an experiment from one pool to another.

We do not currently allow a cluster to have resource pools in multiple AWS/GCP regions or across multiple cloud providers. If the master is running in one AWS/GCP region, all resource pools must also be in that AWS/GCP region.

If you create a task that needs slots and specify a pool that will never have slots (i.e., a pool with CPU-only instances), the task can never be scheduled. Currently, such a task will remain in the PENDING state indefinitely.

We are constantly working to improve Determined and would love to hear your feedback either through GitHub issues or in our community Slack.

Setting Up Resource Pools

Resource pools are configured via the Master Configuration. For each resource pool, you can configure scheduler and provider information.

If you are using static resource pools and launching agents by hand, you will need to update the Agent Configuration to specify which resource pool the agent should join.
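
For example, a minimal agent configuration might look like the following (a sketch; master_host and the pool name are placeholders, and resource_pool is the field that determines which pool the agent joins):

master_host: 10.0.0.1
master_port: 8080
resource_pool: aws-cpu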

Migrating to Resource Pools

With the introduction of resource pools, the Master Configuration format has changed.

This is a backward-compatible change, and cluster configurations in the old format will continue to work. A configuration in the old format is interpreted as a cluster with a single resource pool that is the default for both CPU and GPU tasks. However, to take full advantage of resource pools, you will need to convert to the new format, which simply involves moving and renaming a few top-level fields.

The old format had the top-level fields scheduler and provisioner, which set the scheduler and provisioner settings for the cluster. The new format has the top-level fields resource_manager and resource_pools. The resource_manager section holds cluster-level settings, such as which pools are used by default, along with the default scheduler settings; its scheduler information is identical to the scheduler field in the legacy format. The resource_pools section is a list of resource pools, each of which has a name, a description, and pool-level settings. Each resource pool can be configured with a provider field that contains the same information as the provisioner field in the legacy format. Each resource pool can also have a scheduler field that sets pool-specific scheduler settings; if the scheduler field is not set for a pool, the default settings are used.
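
For example, here is a sketch of how a legacy configuration maps onto the new format (the provisioner settings are elided, since they carry over unchanged into the provider field):

# Legacy format
scheduler:
  type: fair_share
provisioner:
  # ... provisioner settings ...

# Equivalent new format
resource_manager:
  type: agent
  scheduler:
    type: fair_share
  default_aux_resource_pool: default
  default_compute_resource_pool: default

resource_pools:
  - pool_name: default
    provider:
      # ... same fields as the legacy provisioner section ...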

Note that defining pool-specific scheduler settings is all-or-nothing: if the pool-specific scheduler field is blank, all scheduler settings are inherited from resource_manager.scheduler; if any field is set in the pool-specific scheduler section, no settings are inherited from resource_manager.scheduler and you must redefine everything.

Here is an example master configuration illustrating the potential problem.

resource_manager:
  type: agent
  scheduler:
    type: round_robin
    fitting_policy: best
  default_aux_resource_pool: pool1
  default_compute_resource_pool: pool1

resource_pools:
  - pool_name: pool1
    scheduler:
      fitting_policy: worst

In this example, the cluster-wide scheduler defaults in resource_manager.scheduler specify a best-fit, round-robin scheduler. We then override the scheduler settings at the pool level for pool1. Because we only set scheduler.fitting_policy to worst, no settings are inherited from resource_manager.scheduler, so pool1 ends up using a worst-fit, fair-share scheduler (when scheduler.type is left blank, the default value is fair_share).

If you want pool1 to use a worst-fit, round-robin scheduler, you must redefine the scheduler type at the pool level:

resource_manager:
  type: agent
  scheduler:
    type: round_robin
    fitting_policy: best
  default_aux_resource_pool: pool1
  default_compute_resource_pool: pool1

resource_pools:
  - pool_name: pool1
    scheduler:
      type: round_robin
      fitting_policy: worst

Launching Tasks Into Resource Pools

When creating a task, the job configuration file has a section called “resources”. You can set its resource_pool subfield to specify the pool into which the task should be launched:

resources:
    resource_pool: pool1

If this field is not set, the task will be launched into one of the two default pools defined in the Master Configuration. Experiments are launched into the default compute pool. TensorBoards are launched into the default auxiliary pool. Commands, Shells, and Notebooks that request a slot (the default behavior if the resources.slots field is not set) are launched into the default compute pool. Commands, Shells, and Notebooks that explicitly request 0 slots (for example, via the “Launch CPU-only Notebook” button in the Web UI) use the default auxiliary pool.
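
For example, assuming a pool named aws-v100 exists, a notebook or command configuration could pin the task to that pool (a sketch; for experiments, the slot count is set via resources.slots_per_trial rather than resources.slots):

resources:
    resource_pool: aws-v100
    slots: 1

Conversely, explicitly requesting zero slots sends the task to the default auxiliary pool:

resources:
    slots: 0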