
Scheduling

This topic guide covers the two scheduling policies supported in Determined. Administrators can configure the desired scheduler in the Master Configuration, and different scheduling behavior can be configured for different resource pools.
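For example, a master configuration might select a cluster-wide scheduler and override it for a single pool. The following is a minimal sketch, assuming an agent resource manager and an illustrative pool name; see the Master Configuration reference for the authoritative schema:

resource_manager:
  type: agent
  scheduler:
    type: fair_share

resource_pools:
  - pool_name: prod-pool
    scheduler:
      type: priority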

Once the scheduling policy has been defined for the current master and/or resource pool, the scheduling behavior of an individual task is influenced by several task configuration values, sketched in the example after this list:

  • For the fair-share scheduler, resources.weight lets users set the resource demand of a task relative to other tasks.

  • For the priority scheduler, resources.priority lets users assign a priority order to tasks.

  • Regardless of the scheduler, searcher.max_concurrent_trials lets users cap the number of slots that an adaptive_asha hyperparameter search experiment will request at any given time.
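Taken together, these fields appear in the experiment configuration as in the following excerpt. The values are illustrative, and a complete configuration requires additional fields (e.g., the searcher metric):

resources:
  weight: 2                    # fair-share scheduler: relative resource demand
  priority: 10                 # priority scheduler: lower number = higher priority
searcher:
  name: adaptive_asha
  max_concurrent_trials: 4     # cap on trials (and thus slots) requested at once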

Note

Zero-slot tasks (e.g., CPU-only notebooks, TensorBoards) are scheduled independently of tasks that require slots (e.g., experiments, GPU notebooks). The fair-share scheduler schedules zero-slot tasks on a FIFO basis. The priority scheduler schedules zero-slot tasks based on priority.

Fair-Share

The master allocates cluster resources (slots) among the active experiments using a weighted fair-share scheduling policy. Slots are divided among the active experiments according to the demand (number of desired concurrent tasks) of each experiment. For instance, in an eight-GPU cluster running two experiments with demands of ten and thirty, the scheduler assigns two slots and six slots, respectively. As new experiments become active or the resource demand of an active experiment changes, the scheduler adjusts how slots are allocated to experiments accordingly.

The behavior of the fair-share scheduler can be modified by changing the weight of a workload. A workload’s demand for slots is multiplied by its weight for scheduling purposes; hence, a workload with a higher weight is assigned proportionally more resources than a workload with a lower weight. The default weight is 1. For example, in the scenario above, if the weight of the first experiment is set to 3 and the weight of the second experiment is set to 1, each experiment will be assigned four slots.
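Concretely, the first experiment’s configuration would contain (a sketch):

resources:
  weight: 3

With weights of 3 and 1, the weighted demands are 3 × 10 = 30 and 1 × 30 = 30, so the eight slots are split evenly: 8 × 30/60 = 4 slots each.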

Priority

The master allocates cluster resources (slots) to active tasks based on their priority. While tasks of higher priority (lower priority number) are pending, no lower priority tasks will be scheduled. For instance, if tasks with priorities of five and forty-two are pending, the latter will not be scheduled until the former has been. Tasks of equal priority are scheduled in the order in which they were created.

By default, the priority scheduler does not use preemption. If preemption is enabled (see Master Configuration), when a higher priority task is pending and cannot be scheduled because no idle resources are available, the scheduler will attempt to schedule it by preempting lower priority tasks, starting with the task with the lowest priority. When a trial is preempted, its state is checkpointed so that its progress is not lost. Enabling preemption ensures that cluster resources can be reallocated to high priority tasks more promptly; however, preemption also incurs overhead from checkpointing low priority tasks, which might be expensive for some models.
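A minimal sketch of enabling preemption in the master configuration, assuming an agent resource manager (see the Master Configuration reference for the authoritative schema):

resource_manager:
  type: agent
  scheduler:
    type: priority
    preemption: true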

Note

Notebooks, TensorBoards, shells, and commands are not preemptible. These tasks will continue to occupy cluster resources until they complete or are terminated.

Here is an example of how the priority scheduler behaves with preemption enabled (a sketch of the corresponding experiment configurations appears after the list):

  1. User submits a priority 2 adaptive_asha experiment with max_concurrent_trials 20 and slots_per_trial 1. Eight trials run, utilizing all 8 GPUs.

  2. User submits a priority 1 distributed training experiment with slots_per_trial 4. 4 ASHA trials are preempted so the new distributed training experiment can run. Note that if preemption were not enabled, the new experiment would not be scheduled until the ASHA experiment’s GPU demand dropped to <= 4.

  3. User starts a priority 3 notebook with resources.slots 1. The notebook has a lower priority than the two active experiments, so it will run as soon as the two active experiments collectively need <= 7 GPUs.

  4. ASHA and the distributed training experiment both complete; the priority 3 notebook task then runs.

  5. User submits a priority 1 distributed training experiment with slots_per_trial 8. Although this workload has a higher priority than the active notebook task, it cannot be scheduled: it requires 8 slots, but notebooks are not preemptible, so only 7 slots are available.

  6. User submits a priority 2 distributed training experiment with slots_per_trial 4. It will not be scheduled even though 7 slots are available, because it is behind a higher priority pending task.

  7. The notebook is killed. The priority 1 distributed training job then starts running. Once that experiment completes, the priority 2 distributed training experiment runs.
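As a sketch, the experiment configuration excerpts for steps 1 and 2 might look like the following (illustrative values only; a complete configuration requires additional fields):

# Step 1: adaptive_asha experiment (priority 2)
resources:
  priority: 2
  slots_per_trial: 1
searcher:
  name: adaptive_asha
  max_concurrent_trials: 20

# Step 2: distributed training experiment (priority 1)
resources:
  priority: 1
  slots_per_trial: 4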

Scheduling on Kubernetes

By default, the Kubernetes scheduler does not perform gang scheduling or support preemption of pods. While it does take pod priority into account, it greedily schedules pods without consideration for the job each pod belongs to. This can result in problematic behavior for deep learning workloads, particularly for distributed training jobs that use many GPUs. A distributed training job that uses multiple pods requires all pods to be scheduled and running in order to make progress. Because Kubernetes does not support gang scheduling by default, cluster deadlocks can arise. For example, suppose that two experiments are launched simultaneously that each require 16 GPUs on a cluster with only 16 GPUs. It is possible that Kubernetes will assign some GPUs to one experiment and some GPUs to the other. Because neither experiment will receive the resources it needs to begin executing, the system will wait indefinitely.

Determined addresses these problems through the use of the lightweight coscheduling plugin, which extends the Kubernetes scheduler to support priority-based gang scheduling. To implement gang scheduling, the coscheduling plugin will not schedule a pod unless there are enough available resources to also schedule the rest of the pods in the same job. To function, the plugin requires special labels to be set that specify the number of nodes that each job needs for execution. Determined automatically calculates and sets these labels for GPU experiments that it launches.

The coscheduling plugin is in beta and is therefore not enabled by default. To enable it, edit values.yaml in the Determined Helm chart to set the defaultScheduler field to coscheduler.
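That is, in values.yaml:

defaultScheduler: coscheduler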

Importantly, the coscheduling plugin does not work with Kubernetes’ cluster autoscaling feature: static node pools must be used to achieve gang scheduling. Also, while the plugin allocates resources to jobs based on their priority, it does not support preemption: if the cluster is full of low priority jobs and a new high priority job is submitted, the high priority job will not be scheduled until one of the low priority jobs finishes. Additionally, the plugin does not implement limits such as max_slots or max_concurrent_trials that would cap the resources used by an experiment (e.g., a hyperparameter search). Lastly, Determined only sets the required pod labels automatically for GPU experiments; it does not currently set labels for CPU experiments or user commands.

To enable gang scheduling with commands or CPU experiments, enable the coscheduler in values.yaml and modify the experiment config to contain the following:

environment:
  pod_spec:
    metadata:
      labels:
        pod-group.scheduling.sigs.k8s.io/name: <unique task name>
        pod-group.scheduling.sigs.k8s.io/min-available: <# of GPUs required>
    spec:
      schedulerName: coscheduler

You can also use schedulerName: default-scheduler to use the default Kubernetes scheduler.
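For instance, to opt an individual task back into the default scheduler (a sketch, following the same pod_spec structure as above):

environment:
  pod_spec:
    spec:
      schedulerName: default-scheduler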