Scheduling¶
This topic guide covers the two scheduling policies supported in Determined. Administrators can configure the desired scheduler in the Master Configuration. Given a scheduling policy configured at the cluster level, scheduling behavior can be tuned further per task via the following task configuration values:
- For the fair-share scheduler, resources.weight lets users inflate the resource demand of a task relative to others.
- For the priority scheduler, resources.priority lets users assign a priority order to tasks.
- Regardless of the scheduler, searcher.max_concurrent_trials lets users cap the number of slots that an adaptive_asha hyperparameter search experiment can utilize concurrently.
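For concreteness, the abbreviated sketch below shows where these fields live in an experiment configuration. The field names resources.weight, resources.priority, and searcher.max_concurrent_trials are the ones described above; all other names and values are illustrative placeholders, and this is not a complete configuration.

```yaml
# Sketch of an experiment configuration (illustrative values).
name: asha-search-example   # hypothetical experiment name
resources:
  slots_per_trial: 1
  weight: 2                 # consulted by the fair-share scheduler
  priority: 42              # consulted by the priority scheduler (lower number = higher priority)
searcher:
  name: adaptive_asha
  metric: validation_loss   # assumed metric name
  smaller_is_better: true
  max_trials: 20
  max_concurrent_trials: 8  # with slots_per_trial 1, this caps the slots used at once
```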
Note
Zero-slot tasks (e.g., CPU notebooks, TensorBoards) are scheduled independently of tasks that require slots (e.g., experiments, GPU notebooks). The fair-share scheduler schedules zero-slot tasks on a FIFO basis. The priority scheduler schedules zero-slot tasks based on priority.
Priority¶
The master allocates cluster resources (slots) to active tasks based on their priority. While tasks of higher priority (lower priority number) are pending, no lower-priority tasks will be scheduled. For instance, if tasks with priorities 5 and 42 are pending, the priority 42 task will not be scheduled until the priority 5 task has been scheduled. Tasks of equal priority are scheduled in the order in which they were created.
By default, the priority scheduler does not perform any preemption. If preemption is enabled (see Master Configuration), then whenever a higher-priority task is pending and cannot be scheduled, the scheduler will attempt to make room for it by preempting lower-priority tasks, starting with the lowest priorities.
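For reference, enabling the priority scheduler with preemption in the master configuration might look like the following sketch (shown for the agent resource manager; the exact layout can vary across Determined versions):

```yaml
# master.yaml sketch: cluster-level scheduler settings.
resource_manager:
  type: agent
  scheduler:
    type: priority    # use the priority scheduler rather than fair_share
    preemption: true  # let pending higher-priority tasks preempt lower-priority ones
```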
An example of the priority scheduler in action, on a cluster with 8 GPUs and preemption enabled:
1. A user submits a priority 2 adaptive_asha experiment with max_trials 20 and slots_per_trial 1. Eight trials run, utilizing all 8 GPUs.
2. A user submits a priority 1 distributed training experiment with slots_per_trial 4. Four ASHA trials are preempted so that it can run. Note that if preemption were disabled, it would not be scheduled until the ASHA experiment's GPU demand dropped to 4 or fewer.
3. A user starts a priority 3 notebook with resources.slots 1. The notebook runs as soon as the adaptive_asha and distributed training experiments collectively need no more than 7 GPUs.
4. The ASHA and distributed training experiments both complete.
5. A user submits a priority 1 distributed training experiment with slots_per_trial 8. It is not scheduled because notebooks are not preemptible and only 7 slots are available.
6. A user submits a priority 2 distributed training experiment with slots_per_trial 4. It is not scheduled, even though 7 slots are available, because it is behind a higher-priority pending task.
7. The notebook is killed. The priority 1 distributed training experiment runs. Once it completes, the priority 2 distributed training experiment runs.
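As an illustration, the priority 1 distributed training experiment in step 2 might be submitted with a configuration like this sketch (the name and searcher details are placeholders):

```yaml
# Sketch of the priority 1 distributed training experiment from the example.
name: dtrain-priority-1     # hypothetical name
resources:
  slots_per_trial: 4        # requires 4 GPUs simultaneously
  priority: 1               # lower number = higher priority
searcher:
  name: single              # one trial, as is typical for plain distributed training
  metric: validation_loss   # assumed metric name
  max_length:
    batches: 1000           # illustrative training length
```

Such a configuration would be submitted as usual with det experiment create, and the scheduler applies the priority when allocating slots.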
Note
Notebooks, TensorBoards, shells, and commands are not preemptible. These tasks will continue to occupy cluster resources until they complete or are terminated.