This topic guide covers the two different scheduling policies that are supported in Determined. Administrators can configure the desired scheduler in Master Configuration. It is also possible to configure different scheduling behavior for different resource pools.
Once the scheduling policy has been defined for the current master and/or resource pool, the scheduling behavior of an individual task is influenced by several task configuration values:
For the fair-share scheduler,
resources.weightlets users set the resource demand of a task relative to other tasks.
For the priority scheduler,
resources.prioritylets users assign a priority order to tasks.
Regardless of the scheduler,
searcher.max_concurrent_trialslets users cap the number of slots that an
adaptive_ashahyperparameter search experiment will request at any given time.
Zero-slot tasks (e.g., CPU-only notebooks, tensorboards) are scheduled independently of tasks that require slots (e.g., experiments, GPU notebooks). The fair-share scheduler schedules zero-slot tasks on a FIFO basis. The priority scheduler schedules zero-slot tasks based on priority.
The master allocates cluster resources (slots) to active tasks based on their priority. While tasks of higher priority (lower priority number) are pending, no lower priority tasks will be scheduled. For instance, if tasks with priorities of five and forty-two are pending, the latter will not be scheduled until the former has been. Tasks of equal priority are scheduled in the order in which they were created.
By default, the priority scheduler does not use preemption. If preemption is enabled (Master Configuration), when a higher priority task is pending and cannot be scheduled because no idle resources are available, the scheduler will attempt to schedule it by preempting lower priority tasks, starting with the task with the lowest priority. When a trial is preempted, its state is checkpointed so that the progress of the trial is not lost. Enabling preemption ensures that cluster resources can be reallocated to high priority tasks more promptly; however, preemption can also result in additional overhead due to checkpointing low priority tasks, which might be expensive for some models.
Notebooks, tensorboards, shells, and commands are not preemptible. These tasks will continue to occupy cluster resources until they complete or are terminated.
Here is an example of how the priority scheduler behaves with preemption enabled:
User submits a priority 2 adaptive_asha experiment with max_concurrent_trials 20 and slots_per_trial 1. 8 trials run and utilize all 8 GPUs.
User submits a priority 1 distributed training experiment with slots_per_trial 4. 4 ASHA trials are preempted so the new distributed training experiment can run. Note that if preemption was not enabled, the new experiment would not get scheduled until the ASHA experiment’s GPU demand becomes <= 4.
User starts a priority 3 notebook with resources.slots 1. The notebook has a lower priority than the two active experiments, so it will run as soon as the two active experiments collectively need <= 7 GPUs.
ASHA and the distributed training experiment both complete, and the notebook task with priority 3 will run.
User submits a priority 1 distributed training experiment with slots_per_trial 8. Although this workload has a higher priority than the active notebook task, it cannot be scheduled because it requires 8 slots, notebooks are not preemptible, and therefore only 7 slots are available.
User submits a priority 2 distributed training experiment with slots_per_trial 4. It will not be scheduled even though 7 slots are available, because it is behind a higher priority pending task.
The notebook is killed. The distributed training job that has priority 1 then starts running. Once that experiment is complete, distributed training experiment with priority 2 runs.
Scheduling on Kubernetes¶
By default, the Kubernetes scheduler does not perform gang scheduling or support preemption of pods. While it does take pod priority into account, it greedily schedules pods without consideration for the job each pod belongs to. This can result in problematic behavior for deep learning workloads, particularly for distributed training jobs that use many GPUs. A distributed training job that uses multiple pods requires all pods to be scheduled and running in order to make progress. Because Kubernetes does not support gang scheduling by default, cluster deadlocks can arise. For example, suppose that two experiments are launched simultaneously that each require 16 GPUs on a cluster with only 16 GPUs. It is possible that Kubernetes will assign some GPUs to one experiment and some GPUs to the other. Because neither experiment will receive the resources it needs to begin executing, the system will wait indefinitely.
Determined addresses these problems through the use of the lightweight coscheduling plugin, which extends the Kubernetes scheduler to support priority-based gang scheduling. To implement gang scheduling, the coscheduling plugin will not schedule a pod unless there are enough available resources to also schedule the rest of the pods in the same job. To function, the plugin requires special labels to be set that specify the number of nodes that each job needs for execution. Determined automatically calculates and sets these labels for GPU experiments that it launches.
The coscheduling plugin is in beta and is therefore not enabled by
default. To enable it, edit
values.yaml in the Determined Helm chart
to set the
defaultScheduler field to
Importantly, the coscheduling plugin does not work with Kubernetes’
cluster autoscaling feature: static node pools must be used to achieve
gang scheduling. Also, while the plugin allocates resources to jobs
based on their priority, it does not support preemption. For example, if
the cluster is full of low priority jobs and a new high priority job is
submitted, the high priority job will not be scheduled until one of the
low priority jobs finishes. Additionally, there isn’t an implementation
max_concurrent_trials that would limit the
resources of an experiment, i.e. one for hyperparameter search. Lastly,
Determined’s capability to automatically set pod labels is restricted to
GPU experiments; Determined does not currently set labels for CPU
experiments or user commands.
To enable gang scheduling with commands or CPU experiments, enable the
values.yaml and modify the experiment config to
contain the following:
environment: pod_spec: metadata: labels: pod-group.scheduling.sigs.k8s.io/name: <unique task name> pod-group.scheduling.sigs.k8s.io/min-available: <# of GPUs required> spec: schedulerName: coscheduler
You can also use
schedulerName: default-scheduler to use the default