To run tasks such as experiments or notebooks, Determined needs to have resources (CPUs, GPUs) on which to run the tasks. However, different tasks have different resource requirements and, given the cost of GPU resources, it’s important to choose the right resources for specific goals so that you get the most value out of your money. For example, you may want to run your training on beefy V100 GPU machines, while you want your Tensorboards to run on cheap CPU machines with minimal resources.
Determined addresses this with resource pools. A resource pool is a set of identical resources that are located physically close to each other. You can configure your cluster to have multiple resource pools and assign each task to a specific pool, so that different tasks use different sets of resources. Each resource pool handles scheduling and instance provisioning independently.
When you configure a cluster, you set which pool is the default for CPU
tasks and which pool is the default for GPU tasks. CPU-only tasks such as
Tensorboards run on the default CPU pool unless you specify a different
pool when launching the task. Tasks that require a slot, such as
Experiments or GPU notebooks, launch into the default GPU pool unless
otherwise specified. For this reason, we recommend that you always create
a cluster with at least two pools: one with low-cost CPU instances for
CPU-only tasks and one with GPU
instances for GPU tasks. This is the default setup when launching a
Here are some scenarios where it can be valuable to use multiple resource pools:
Use GPUs for training while using CPUs for TensorBoard.
You create one pool, aws-v100, that provisions p3dn.24xlarge instances (large V100 EC2 instances) and another pool, aws-cpu, that provisions m5.large instances (small and cheap CPU instances). You train your experiments using the aws-v100 pool, while you run your Tensorboards in the aws-cpu pool. When your experiments complete, the aws-v100 pool can scale down to zero to save money, but you can continue to run your Tensorboard. Without resource pools, you would have needed to keep a p3dn.24xlarge instance running to keep the Tensorboard alive. By default, Tensorboards always run on the default CPU pool.
Use GPUs in different availability zones on AWS.
You have one pool, aws-v100-us-east-1a, that provisions p3dn.24xlarge instances in the us-east-1a availability zone and another pool, aws-v100-us-east-1b, that provisions p3dn.24xlarge instances in the us-east-1b availability zone. You can launch an experiment into aws-v100-us-east-1a and, if AWS does not have sufficient p3dn.24xlarge capacity in that availability zone, you can launch the experiment in aws-v100-us-east-1b to check whether that availability zone has capacity. Note that currently the “AWS does not have capacity” notification is only visible in the master logs, not on the experiment itself.
Use spot/preemptible instances and fall back to on-demand if needed.
You have one pool, aws-v100-spot, that you use to try to run training on spot instances and another pool, aws-v100-on-demand, that you fall back to if AWS does not have enough spot capacity to run your job.
Use cheaper GPUs for prototyping on small datasets and expensive GPUs for training on full datasets.
You have one pool with less expensive GPUs that you use for initial prototyping on small datasets and another pool that you use for training more mature models on large datasets.
Currently resource pools are completely independent from each other so it is not possible to launch an experiment that tries to use one pool and then falls back to another one if a certain condition is met. You will need to manually decide to shift an experiment from one pool to another.
We do not currently allow a cluster to have resource pools in multiple AWS regions/GCP zones or across multiple cloud providers. If the master is running in one AWS region/GCP zone, all resource pools must also be in that AWS region/GCP zone.
If you create a task that needs slots and specify a pool that will never have slots (i.e. a pool with CPU-only instances), that task can never get scheduled. Currently that task will appear to be PENDING permanently.
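Because such a task stays PENDING without an error, a check before launch can catch the mismatch. The following is a minimal sketch, not part of Determined: the function name and the mapping from pool name to slots per instance are hypothetical, and you would have to maintain that mapping yourself based on your own master configuration.

```python
from typing import Dict, Optional


def check_task_placement(
    slots: int,
    pool_name: str,
    slots_per_instance_by_pool: Dict[str, int],
) -> Optional[str]:
    """Return a warning if the task could never be scheduled, else None.

    Hypothetical helper: `slots_per_instance_by_pool` is a hand-maintained
    mapping of pool name to slots per instance, not something read from
    the cluster.
    """
    pool_slots = slots_per_instance_by_pool[pool_name]
    if slots > 0 and pool_slots == 0:
        return (
            f"task requests {slots} slot(s) but pool {pool_name!r} has no "
            "slots; it would stay PENDING"
        )
    return None


# Example pools from the scenarios above: a CPU-only pool and a V100 pool
# (a p3dn.24xlarge instance has 8 GPUs).
pools = {"aws-cpu": 0, "aws-v100": 8}
warning = check_task_placement(slots=1, pool_name="aws-cpu",
                               slots_per_instance_by_pool=pools)
if warning:
    print(warning)
```

The check only covers the case described above (a slot-requesting task sent to a pool with no slots); it does not model capacity limits or scheduling order.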
We are constantly working to improve Determined and would love to hear your feedback either through GitHub issues or in our community Slack.
Setting Up Resource Pools
Resource pools are configured via the Master Configuration. For each resource pool, you can configure scheduler and provider information.
If you are using static resource pools and launching agents by hand, you will need to update the Agent Configuration to specify which resource pool the agent should join.
Migrating to Resource Pools
With the introduction of resource pools, the Master Configuration format has changed.
This is a backwards compatible change and cluster configurations in the old format will continue to work. A configuration in the old format is interpreted as a cluster with a single resource pool that is the default for both CPU and GPU tasks. However, to take full advantage of resource pools, you will need to convert to the new format, which is a simple process of moving around and renaming a small number of top-level fields.
The old format had the top-level fields scheduler and provisioner, which set the scheduler and provisioner settings for the cluster. The new format has the top-level fields resource_manager and resource_pools. The resource_manager section is for cluster-level settings, such as which pools should be used by default and the default scheduler settings. The resource_manager.scheduler information is identical to the scheduler field in the legacy format. The resource_pools section is a list of resource pools, each of which has a name, a description, and resource pool-level settings. Each resource pool can be configured with a provider field that contains the same information as the provisioner field in the legacy format. Each resource pool can also have a scheduler field that sets resource pool-specific scheduler settings. If the scheduler field is not set for a specific resource pool, the default settings are used.
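The field moves described here can be sketched as a transformation on a parsed configuration. This is an illustration only, not a tool shipped with Determined: the function name and the pool name "default" are hypothetical, the configuration is assumed to be already loaded into a Python dict, and only the fields discussed in this section are handled.

```python
from copy import deepcopy


def migrate_legacy_config(legacy: dict, pool_name: str = "default") -> dict:
    """Convert a legacy-format master config dict to the resource pool format.

    Hypothetical helper for illustration; `pool_name` is an arbitrary choice.
    """
    # Keep every top-level field except the two that move.
    new = {
        key: deepcopy(value)
        for key, value in legacy.items()
        if key not in ("scheduler", "provisioner")
    }

    # A legacy config is a single pool that is the default for both CPU
    # and GPU tasks. `type: agent` is taken from the example in this section.
    resource_manager = {
        "type": "agent",
        "default_cpu_resource_pool": pool_name,
        "default_gpu_resource_pool": pool_name,
    }
    # The old top-level `scheduler` becomes the cluster-wide default.
    if "scheduler" in legacy:
        resource_manager["scheduler"] = deepcopy(legacy["scheduler"])

    # The old top-level `provisioner` becomes the pool's `provider`.
    pool = {"pool_name": pool_name}
    if "provisioner" in legacy:
        pool["provider"] = deepcopy(legacy["provisioner"])

    new["resource_manager"] = resource_manager
    new["resource_pools"] = [pool]
    return new


# The provisioner contents below are placeholders, not a complete config.
legacy = {
    "scheduler": {"type": "round_robin", "fitting_policy": "best"},
    "provisioner": {"max_instances": 4},
}
print(migrate_legacy_config(legacy))
```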
Note that defining resource pool-specific scheduler settings is all-or-nothing. If the pool-specific scheduler field is blank, all scheduler settings are inherited from the settings defined in resource_manager.scheduler. If any field is set in the pool-specific scheduler section, no settings are inherited from resource_manager.scheduler; you need to redefine everything.
Here is an example master configuration illustrating the potential problem.
    resource_manager:
      type: agent
      scheduler:
        type: round_robin
        fitting_policy: best
      default_cpu_resource_pool: pool1
      default_gpu_resource_pool: pool1

    resource_pools:
      - pool_name: pool1
        scheduler:
          fitting_policy: worst
In this example, we are setting the cluster-wide scheduler defaults to use a best-fit, round robin scheduler in resource_manager.scheduler. We are then overwriting the scheduler settings at the pool level for pool1. Because we set scheduler.fitting_policy to worst at the pool level, no settings are inherited from resource_manager.scheduler, so pool1 will end up using a worst-fit, fair share scheduler (because when scheduler.type is left blank, the default value is fair_share).
If you want to have pool1 use a worst-fit, round robin scheduler, you need to make sure you redefine the scheduler type at the pool level:
    resource_manager:
      type: agent
      scheduler:
        type: round_robin
        fitting_policy: best
      default_cpu_resource_pool: pool1
      default_gpu_resource_pool: pool1

    resource_pools:
      - pool_name: pool1
        scheduler:
          type: round_robin
          fitting_policy: worst
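The all-or-nothing rule behind these two configurations can be modeled in a few lines. This is a sketch of the behavior as described in this section, not Determined's implementation: the function name is hypothetical, and the only built-in default modeled is the fair share scheduler type mentioned in the text.

```python
from typing import Optional

# Assumption taken from the text: a blank scheduler.type means fair share.
DEFAULT_SCHEDULER_TYPE = "fair_share"


def effective_scheduler(cluster_default: dict,
                        pool_scheduler: Optional[dict]) -> dict:
    """Return the scheduler settings a pool ends up with.

    Hypothetical model: a blank pool-level scheduler inherits everything
    from the cluster default; a pool-level scheduler with any field set
    inherits nothing.
    """
    if not pool_scheduler:
        chosen = dict(cluster_default)
    else:
        # No field-by-field merge: the cluster default is ignored entirely.
        chosen = dict(pool_scheduler)
    chosen.setdefault("type", DEFAULT_SCHEDULER_TYPE)
    return chosen


cluster = {"type": "round_robin", "fitting_policy": "best"}

# First configuration: only fitting_policy is set at the pool level.
print(effective_scheduler(cluster, {"fitting_policy": "worst"}))

# Second configuration: the type is redefined at the pool level as well.
print(effective_scheduler(cluster, {"type": "round_robin",
                                    "fitting_policy": "worst"}))
```

The first call yields a fair share type with a worst fitting policy, matching the problem described above; the second yields the intended worst-fit, round robin settings.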
Launching Tasks Into Resource Pools
When creating a task, the configuration file has a section called “resources”. You can set the resource_pool subfield to specify the resource pool that the task should be launched into.
    resources:
      resource_pool: pool1
If this field is not set, the task will be launched into one of the two default pools defined in the Master Configuration. Experiments will be launched into the default GPU pool. Tensorboards will be launched into the default CPU pool. Commands, Shells, and Notebooks that request a slot (which is the default behavior if the ‘slots’ field is not set) will be launched into the default GPU pool. Commands, Shells, and Notebooks that explicitly request 0 slots (for example the ‘Launch CPU-only Notebook’ button in the Web UI) will be launched into the CPU pool.
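The selection rules in this section can be summarized as a small decision function. This is a sketch of the behavior as described here, not code from Determined: the function name, the task type strings, and the default pool names are hypothetical.

```python
from typing import Optional


def select_resource_pool(
    task_type: str,
    slots: Optional[int] = None,
    resource_pool: Optional[str] = None,
    default_cpu_pool: str = "aws-cpu",
    default_gpu_pool: str = "aws-v100",
) -> str:
    """Return the pool a task would be launched into.

    Hypothetical model. `slots=None` means the slots field is not set.
    """
    # An explicit resources.resource_pool always wins.
    if resource_pool:
        return resource_pool
    if task_type == "experiment":
        return default_gpu_pool
    if task_type == "tensorboard":
        return default_cpu_pool
    if task_type in ("command", "shell", "notebook"):
        # An unset slots field counts as requesting a slot; only an
        # explicit 0 sends the task to the CPU pool.
        return default_cpu_pool if slots == 0 else default_gpu_pool
    raise ValueError(f"unknown task type: {task_type!r}")


print(select_resource_pool("notebook"))             # slots not set
print(select_resource_pool("notebook", slots=0))    # CPU-only notebook
print(select_resource_pool("experiment", resource_pool="pool1"))
```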