Resource Pools¶
To run tasks such as experiments or notebooks, Determined needs resources (CPUs, GPUs) on which to run them. However, different tasks have different resource requirements and, given the cost of GPU resources, it's important to choose the right resources for each goal so that you get the most value for your money. For example, you may want to run your training on powerful V100 GPU machines, while your TensorBoards run on cheap CPU machines with minimal resources.
Determined has the concept of a resource pool, which is a collection of identical resources that are located physically close to each other. Determined allows you to configure your cluster to have multiple resource pools and to assign tasks to a specific resource pool, so that you can use different sets of resources for different tasks. Each resource pool handles scheduling and instance provisioning independently.
When you configure a cluster, you set which pool is the default for auxiliary tasks and which pool
is the default for compute tasks. CPU-only tasks such as TensorBoards will run on the default
auxiliary pool unless you specify that they should run in a different pool when launching the task.
Tasks which require a slot, such as experiments or GPU notebooks, will use the default compute pool
unless otherwise specified. For this reason we recommend that you always create a cluster with at
least two pools, one with low-cost CPU instances for auxiliary tasks and one with GPU instances for
compute tasks. This is the default setup when launching a cluster on AWS or GCP via det deploy.
Here are some scenarios where it can be valuable to use multiple resource pools:
Use GPUs for training while using CPUs for TensorBoard.
You create one pool, aws-v100, that provisions p3dn.24xlarge instances (large V100 EC2 instances) and another pool, aws-cpu, that provisions m5.large instances (small and cheap CPU instances). You train your experiments using the aws-v100 pool, while you run your TensorBoards in the aws-cpu pool. When your experiments complete, the aws-v100 pool can scale down to zero to save money, but you can continue to run your TensorBoard. Without resource pools, you would have needed to keep a p3dn.24xlarge instance running to keep the TensorBoard alive. By default, TensorBoards always run on the default auxiliary pool.
Use GPUs in different availability zones on AWS.
You have one pool, aws-v100-us-east-1a, that runs p3dn.24xlarge instances in the us-east-1a availability zone and another pool, aws-v100-us-east-1b, that runs p3dn.24xlarge instances in the us-east-1b availability zone. You can launch an experiment into aws-v100-us-east-1a and, if AWS does not have sufficient p3dn.24xlarge capacity in that availability zone, you can launch the experiment in aws-v100-us-east-1b to check if that availability zone has capacity. Note that currently the “AWS does not have capacity” notification is only visible in the master logs, not on the experiment itself.
Use spot/preemptible instances and fall back to on-demand if needed.
You have one pool, aws-v100-spot, that you use to try to run training on spot instances and another pool, aws-v100-on-demand, that you fall back to if AWS does not have enough spot capacity to run your job. Determined will not switch from spot to on-demand instances automatically, but by configuring resource pools appropriately, it should be easy for users to select the appropriate pool depending on the job they want to run and the current availability of spot instances in the AWS region they are using. For more information on using spot instances, refer to AWS Spot Instances.
Use cheaper GPUs for prototyping on small datasets and expensive GPUs for training on full datasets.
You have one pool with less expensive GPUs that you use for initial prototyping on small datasets and another pool that you use for training more mature models on large datasets.
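As a rough sketch, the first scenario above might be expressed in the master configuration along the following lines. The pool names and instance types come from the scenario. The provider fields shown (type and instance_type) are assumptions about the AWS provider settings, and a real provider section needs further settings that are omitted here, so treat this as an outline rather than a working configuration.

```yaml
# Sketch only: not a complete or tested master configuration.
resource_manager:
  type: agent
  # Auxiliary tasks such as TensorBoards default to the cheap CPU pool.
  default_aux_resource_pool: aws-cpu
  # Experiments and other slot-requesting tasks default to the GPU pool.
  default_compute_resource_pool: aws-v100
resource_pools:
  - pool_name: aws-cpu
    description: Small CPU instances for auxiliary tasks
    provider:
      type: aws                 # assumed provider field
      instance_type: m5.large   # assumed provider field
      # ...other provider settings omitted...
  - pool_name: aws-v100
    description: V100 instances for training
    provider:
      type: aws                     # assumed provider field
      instance_type: p3dn.24xlarge  # assumed provider field
      # ...other provider settings omitted...
```

The other scenarios would follow the same shape, with one entry under resource_pools per pool and only the provider settings differing between them.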
Limitations¶
Currently resource pools are completely independent from each other so it is not possible to launch an experiment that tries to use one pool and then falls back to another one if a certain condition is met. You will need to manually decide to shift an experiment from one pool to another.
We do not currently allow a cluster to have resource pools in multiple AWS/GCP regions or across multiple cloud providers. If the master is running in one AWS/GCP region, all resource pools must also be in that AWS/GCP region.
If you create a task that needs slots and specify a pool that will never have slots (for example, a pool with CPU-only instances), that task can never be scheduled. Currently, such a task will appear to be PENDING permanently.
We are constantly working to improve Determined and would love to hear your feedback either through GitHub issues or in our community Slack.
Setting Up Resource Pools¶
Resource pools are configured via the Master Configuration. For each resource pool, you can configure scheduler and provider information.
If you are using static resource pools and launching agents by hand, you will need to update the Agent Configuration to specify which resource pool the agent should join.
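For a hand-launched agent, the relevant part of the Agent Configuration might look like the sketch below. The resource_pool field name is an assumption based on the description above, the master_host and master_port values are placeholders, and pool1 is a hypothetical pool name that would have to match a pool_name defined in the Master Configuration.

```yaml
# Agent configuration sketch; all values are placeholders.
master_host: 10.0.0.10   # hypothetical address of the Determined master
master_port: 8080        # hypothetical port
resource_pool: pool1     # assumed field: the pool this agent should join
```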
Migrating to Resource Pools¶
With the introduction of resource pools, the Master Configuration format has changed.
This is a backwards compatible change and cluster configurations in the old format will continue to work. A configuration in the old format is interpreted as a cluster with a single resource pool that is the default for both CPU and GPU tasks. However, to take full advantage of resource pools, you will need to convert to the new format, which is a simple process of moving around and renaming a small number of top-level fields.
The old format had the top-level fields scheduler and provisioner, which set the scheduler and
provisioner settings for the cluster. The new format has the top-level fields resource_manager and
resource_pools. The resource_manager section is for cluster-level settings such as which pools
should be used by default and the default scheduler settings. The scheduler information is
identical to the scheduler field in the legacy format. The resource_pools section is a list of
resource pools, each of which has a name, a description, and resource pool-level settings. Each
resource pool can be configured with a provider field that contains the same information as the
provisioner field in the legacy format. Each resource pool can also have a scheduler field that
sets resource pool-specific scheduler settings. If the scheduler field is not set for a specific
resource pool, the default settings are used.
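To make the mapping concrete, here is a hedged before-and-after sketch of a conversion. The pool name default, the description text, and the max_instances setting are illustrative assumptions. The point is only where each block moves: scheduler goes under resource_manager, and the contents of provisioner go under a pool's provider field.

```yaml
# Legacy format (sketch): top-level scheduler and provisioner.
scheduler:
  type: fair_share
  fitting_policy: best
provisioner:
  max_instances: 4   # assumed example setting
---
# New format (sketch): the same settings, moved and renamed.
resource_manager:
  type: agent
  scheduler:           # identical to the legacy scheduler block
    type: fair_share
    fitting_policy: best
  default_aux_resource_pool: default
  default_compute_resource_pool: default
resource_pools:
  - pool_name: default   # hypothetical pool name
    description: Single pool converted from the legacy configuration
    provider:            # same contents as the legacy provisioner block
      max_instances: 4   # assumed example setting
```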
Note that defining resource pool-specific scheduler settings is all-or-nothing. If the
pool-specific scheduler field is blank, all scheduler settings will be inherited from the settings
defined in resource_manager.scheduler. If any fields are set in the pool-specific scheduler
section, no settings will be inherited from resource_manager.scheduler; you need to redefine
everything.
Here is an example master configuration illustrating the potential problem.
resource_manager:
  type: agent
  scheduler:
    type: round_robin
    fitting_policy: best
  default_aux_resource_pool: pool1
  default_compute_resource_pool: pool1
resource_pools:
  - pool_name: pool1
    scheduler:
      fitting_policy: worst
In this example, we are setting the cluster-wide scheduler defaults to use a best-fit, round robin
scheduler in resource_manager.scheduler. We are then overriding the scheduler settings at the pool
level for pool1. Because we set scheduler.fitting_policy=worst, no settings are inherited from
resource_manager.scheduler, so pool1 will end up using a worst-fit, fair share scheduler (because
when scheduler.type is left blank, the default value is fair_share).
If you want to have pool1 use a worst-fit, round robin scheduler, you need to make sure you
redefine the scheduler type at the pool-specific level:
resource_manager:
  type: agent
  scheduler:
    type: round_robin
    fitting_policy: best
  default_aux_resource_pool: pool1
  default_compute_resource_pool: pool1
resource_pools:
  - pool_name: pool1
    scheduler:
      type: round_robin
      fitting_policy: worst
Launching Tasks Into Resource Pools¶
When creating a task, the job configuration file has a section called “resources”. You can set the
resource_pool subfield to specify the resource pool that a task should be launched into.
resources:
  resource_pool: pool1
If this field is not set, the task will be launched into one of the two default pools defined in the
Master Configuration. Experiments will be launched into the default compute pool.
TensorBoards will be launched into the default auxiliary pool. Commands, Shells, and Notebooks that
request a slot (which is the default behavior if the resources.slots field is not set) will be
launched into the default compute pool. Commands, Shells, and Notebooks that explicitly request 0
slots (for example the “Launch CPU-only Notebook” button in the Web UI) will use the auxiliary pool.
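Putting the two settings together, a CPU-only notebook or command could be sent to a particular pool with a task configuration like the sketch below. The pool name aws-cpu is a hypothetical example and must match a pool defined in your Master Configuration. If resource_pool were left out, the explicit slots: 0 would send the task to the default auxiliary pool as described above.

```yaml
# Task configuration sketch: a CPU-only task pinned to a specific pool.
resources:
  slots: 0                # explicitly request no slots (CPU-only)
  resource_pool: aws-cpu  # hypothetical pool name
```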