Resource Pools

To run tasks such as experiments or notebooks, Determined needs resources (CPUs, GPUs) on which to run them. However, different tasks have different resource requirements and, given the cost of GPU resources, it is important to choose the right resources for specific goals so that you get the most value for your money. For example, you may want your training to run on beefy V100 GPU machines while your TensorBoards run on cheap CPU machines with minimal resources.

Determined addresses this with resource pools: a resource pool is a set of identical resources that are located physically close to each other. Determined allows you to configure your cluster with multiple resource pools and to assign tasks to a specific resource pool, so that you can use different sets of resources for different tasks. Each resource pool handles scheduling and instance provisioning independently.

When you configure a cluster, you set which pool is the default for CPU tasks and which pool is the default for GPU tasks. CPU-only tasks such as TensorBoards run in the default CPU pool unless you specify a different pool when launching the task. Tasks that require a slot, such as experiments or GPU notebooks, launch into the default GPU pool unless otherwise specified. For this reason, we recommend always creating a cluster with at least two pools: one with low-cost CPU instances for CPU-only tasks and one with GPU instances for GPU tasks. This is the default setup when launching a cluster via det-deploy.
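
These defaults are set in the Master Configuration. Here is a minimal sketch of just the relevant fields (the pool names are hypothetical placeholders; the full format is described under "Migrating to Resource Pools" below):

resource_manager:
  type: agent
  default_cpu_resource_pool: cpu-pool
  default_gpu_resource_pool: gpu-pool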

Here are some scenarios where it can be valuable to use multiple resource pools:

  • Use GPUs for training while using CPUs for TensorBoard.

    You create one pool, aws-v100, that provisions p3dn.24xlarge instances (large V100 EC2 instances) and another pool, aws-cpu, that provisions m5.large instances (small, cheap CPU instances). You train your experiments in the aws-v100 pool while your TensorBoards run in the aws-cpu pool. When your experiments complete, the aws-v100 pool can scale down to zero to save money while your TensorBoards keep running. Without resource pools, you would have needed to keep a p3dn.24xlarge instance running just to keep a TensorBoard alive. By default, TensorBoards always run in the default CPU pool. (A configuration sketch for this scenario appears after this list.)

  • Use GPUs in different availability zones on AWS.

    You have one pool, aws-v100-us-east-1a, that runs p3dn.24xlarge instances in the us-east-1a availability zone and another pool, aws-v100-us-east-1b, that runs p3dn.24xlarge instances in the us-east-1b availability zone. You can launch an experiment into aws-v100-us-east-1a and, if AWS does not have sufficient p3dn.24xlarge capacity in that availability zone, relaunch the experiment into aws-v100-us-east-1b to check whether that zone has capacity. Note that currently the “AWS does not have capacity” notification is only visible in the master logs, not on the experiment itself.

  • Use spot/preemptible instances and fall back to on-demand if needed.

    You have one pool aws-v100-spot that you use to try to run training on spot instances and another pool aws-v100-on-demand that you fall back to if AWS does not have enough spot capacity to run your job.

  • Use cheaper GPUs for prototyping on small datasets and expensive GPUs for training on full datasets.

    You have one pool with less expensive GPUs that you use for initial prototyping on small datasets and another pool that you use for training more mature models on large datasets.
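
To make the first scenario concrete, here is a hedged sketch of the corresponding pool definitions in the Master Configuration. The pool names match the scenario above, but the provider sections are abbreviated; a real configuration would include additional provider settings (networking, credentials, scaling limits) appropriate to your account:

resource_manager:
  type: agent
  default_cpu_resource_pool: aws-cpu
  default_gpu_resource_pool: aws-v100

resource_pools:
  - pool_name: aws-cpu
    description: small, cheap CPU instances for TensorBoards
    provider:
      type: aws
      instance_type: m5.large
  - pool_name: aws-v100
    description: large V100 instances for training
    provider:
      type: aws
      instance_type: p3dn.24xlarge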

Limitations

Currently, resource pools are completely independent of each other, so it is not possible to launch an experiment that tries to use one pool and then falls back to another if a certain condition is met. You must manually decide when to shift an experiment from one pool to another.

We do not currently allow a cluster to have resource pools in multiple AWS regions/GCP zones or across multiple cloud providers. If the master is running in one AWS region/GCP zone, all resource pools must also be in that AWS region/GCP zone.

If you create a task that needs slots and assign it to a pool that will never have slots (e.g., a pool with CPU-only instances), that task can never be scheduled. Currently, such a task will appear to be PENDING permanently.

We are constantly working to improve Determined and would love to hear your feedback either through GitHub issues or in our community Slack.

Setting Up Resource Pools

Resource pools are configured via the Master Configuration. For each resource pool, you can configure scheduler and provider information.

If you are using static resource pools and launching agents by hand, you will need to update the Agent Configuration to specify which resource pool the agent should join.
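
For example, here is a minimal sketch of an agent configuration that joins a specific pool (assuming an agent config file with a resource_pool field; the host value and pool name are placeholders):

master_host: 192.0.2.1
master_port: 8080
resource_pool: pool1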

Migrating to Resource Pools

With the introduction of resource pools, the Master Configuration format has changed.

This is a backward-compatible change, and cluster configurations in the old format will continue to work. A configuration in the old format is interpreted as a cluster with a single resource pool that is the default for both CPU and GPU tasks. However, to take full advantage of resource pools, you will need to convert to the new format, which is a simple matter of moving and renaming a small number of top-level fields.

The old format had the top-level fields scheduler and provisioner, which set the scheduler and provisioner settings for the cluster. The new format has the top-level fields resource_manager and resource_pools. The resource_manager section holds cluster-level settings, such as which pools are used by default and the default scheduler settings; the scheduler information is identical to the scheduler field in the legacy format. The resource_pools section is a list of resource pools, each of which has a name, a description, and resource pool-level settings. Each resource pool can be configured with a provider field that contains the same information as the provisioner field in the legacy format. Each resource pool can also have a scheduler field that sets pool-specific scheduler settings; if the scheduler field is not set for a specific pool, the default settings are used.
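
To make the mapping concrete, here is a hedged before-and-after sketch. The values are placeholders and cloud-provider-specific provisioner fields are elided; only the moved and renamed fields matter.

Old format:

scheduler:
  type: fair_share
  fitting_policy: best
provisioner:
  instance_type: p3dn.24xlarge

New format:

resource_manager:
  type: agent
  scheduler:
    type: fair_share
    fitting_policy: best
  default_cpu_resource_pool: pool1
  default_gpu_resource_pool: pool1

resource_pools:
  - pool_name: pool1
    description: the single pool from the legacy configuration
    provider:
      instance_type: p3dn.24xlarge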

Note that defining resource pool-specific scheduler settings is all-or-nothing. If the pool-specific scheduler field is blank, all scheduler settings are inherited from the settings defined in resource_manager.scheduler. If any field is set in the pool-specific scheduler section, no settings are inherited from resource_manager.scheduler; you need to redefine everything.

Here is an example master configuration illustrating the potential problem.

resource_manager:
  type: agent
  scheduler:
    type: round_robin
    fitting_policy: best
  default_cpu_resource_pool: pool1
  default_gpu_resource_pool: pool1

resource_pools:
  - pool_name: pool1
    scheduler:
      fitting_policy: worst

In this example, we set the cluster-wide scheduler defaults to a best-fit, round-robin scheduler in resource_manager.scheduler. We then override the scheduler settings at the pool level for pool1. Because we set scheduler.fitting_policy to worst, no settings are inherited from resource_manager.scheduler, so pool1 ends up using a worst-fit, fair-share scheduler (when scheduler.type is left blank, the default value is fair_share).

If you want pool1 to use a worst-fit, round-robin scheduler, you need to redefine the scheduler type at the pool-specific level:

resource_manager:
  type: agent
  scheduler:
    type: round_robin
    fitting_policy: best
  default_cpu_resource_pool: pool1
  default_gpu_resource_pool: pool1

resource_pools:
  - pool_name: pool1
    scheduler:
      type: round_robin
      fitting_policy: worst

Launching Tasks Into Resource Pools

When creating a task, the configuration file has a section called “resources”. You can set the resource_pool subfield to specify the resource pool that a task should be launched into.

resources:
  resource_pool: pool1

If this field is not set, the task will be launched into one of the two default pools defined in the Master Configuration:

  • Experiments are launched into the default GPU pool.

  • TensorBoards are launched into the default CPU pool.

  • Commands, Shells, and Notebooks that request a slot (the default behavior if the ‘slots’ field is not set) are launched into the default GPU pool.

  • Commands, Shells, and Notebooks that explicitly request 0 slots (for example, via the ‘Launch CPU-only Notebook’ button in the Web UI) are launched into the default CPU pool.
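
You can also set this field at launch time without editing a config file. A minimal sketch, assuming a pool named pool1 and a CLI version that supports --config key=value overrides on task commands:

# Launch a notebook into a specific resource pool (pool name is hypothetical).
det notebook start --config resources.resource_pool=pool1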