Job Queue Management

Job queue management extends scheduler functionality to offer better visibility and control over scheduling decisions. It does this by using the job queue, which provides information about job ordering and which jobs are queued, and by permitting dynamic job modification. Job queue management shows all submitted jobs and job states and lets you modify configuration options, including priority, queue position, and resource pool membership.

Queue management is available to the fair share, priority, and Kubernetes preemption schedulers.

To manage job queues, navigate to the WebUI Job Queue section or use the det job set of CLI commands.

Job Queue Constraints

The following constraints are associated with using the job queue to modify jobs:

  • The priority and fair share fields are mutually exclusive. The priority field is active only for the priority scheduler and the fair share field is active only for the fair share scheduler. Both fields cannot be active simultaneously.

  • The CLI ahead-of and behind-of options and the WebUI Move to Top operation are available only for the priority scheduler and are not available for the fair share scheduler. These operations are not yet fully supported for the Kubernetes priority scheduler.

  • The change resource pool operation can only be performed on experiments. To change the resource pool of other tasks, cancel the task and resubmit it.

  • Changing priority in Kubernetes always implicitly cancels the job, including when the job priority is increased. Priority change in Kubernetes is limited to experiments.

  • Changing resource pools is not available on Kubernetes because it does not currently support multiple resource pools.

Job States

Queued jobs can be in the Queued or Scheduled state.

State

Description

Queued

Job received but resources not allocated.

Scheduled

Scheduled to run or running, and resources might have been allocated.

Completed or errored jobs are not counted as active and are omitted from the queued job list.

View the Job Queue

You can view the job queue using the CLI or WebUI. In the WebUI, click the Job Queue tab. In the CLI, use one of the following commands:

$ det job list

or

$ det job ls

These commands show the default resource pool queue. To view other resource pool queues, use the --resource-pool option, specifying the pool:

$ det job list --resource-pool compute-pool

For more information about the CLI options, see the CLI documentation or use the det job list -h command.

The WebUI and the CLI display a results table ordered by scheduling order. The scheduling order does not represent the job priority. In addition to job order, the listing shows:

Job Property

Description

ID

Unique ID assigned to the job.

Type

Job type:

  • TYPE_COMMAND

  • TYPE_EXPERIMENT

  • TYPE_NOTEBOOK

  • TYPE_SHELL

  • TYPE_TENSORBOARD

Job Name

Command or experiment name, also displayed in the WebUI.

Priority

Latest priority assigned to the job.

Submitted

Job submission timestamp.

Slots

Number of slots acquired or needed.

Status

Job status:

  • Queued

  • Scheduled

User

Job owner.

Modify the Job Queue

The job queue can be changed in the WebUI Job Queue section or by using the CLI det job update command. You can make changes on a per-job basis by selecting a job and a job operation. Operations include:

  • changing job priorities for agent and Kubernetes priority scheduler

  • changing weights for agent resource manager using the fair share scheduler

  • changing the order of queued jobs

  • changing resource pools

WebUI Method

To use the WebUI to modify the job:

  1. Go to the Job Queue section.

  2. Find the job to modify.

  3. Click the three dots in the right-most column of the job.

  4. Find and click the Manage Job option.

  5. Make the change you want on the pop-up page, and click OK.

CLI Method

To use the CLI to modify the job queue, use the det job update command. For more information, run det job update --help command.

Examples:

$ det job update jobID --priority 10
$ det job update jobID --resource-pool compute-pool
$ det job update jobID --ahead-of jobID-2

The example changes the jobID job priority to 10, the resource pool of the job to compute-pool, and moves jobID ahead of job jobID-2 in the queue. Currently running jobs with a priority lower than and including the jobID-2 priority can be preempted to run the newly promoted jobID job.

To update a job in batch, use the update-batch argument:

$ det job update-batch job1.priority=1 job2.resource-pool="compute" job3.ahead-of=job1

Example CLI workflow:

$ det job list
   # | ID       | Type            | Job Name   | Priority | Submitted            | Slots (acquired/needed) | Status          | User
-----+--------------------------------------+-----------------+--------------------------+------------+---------------------------+---------
   0 | 0d714127 | TYPE_EXPERIMENT | first_job  |       42 | 2022-01-01 00:01:00  | 1/1                     | STATE_SCHEDULED | user1
   1 | 73853c5c | TYPE_EXPERIMENT | second_job |       42 | 2022-01-01 00:01:01  | 0/1                     | STATE_QUEUED    | user1

$ det job update 73853c5c --ahead-of 0d714127

$ det job list
   # | ID       | Type            | Job Name   | Priority | Submitted            | Slots (acquired/needed) | Status          | User
-----+--------------------------------------+-----------------+--------------------------+------------+---------------------------+---------
   0 | 73853c5c | TYPE_EXPERIMENT | second_job |       42 | 2022-01-01 00:01:01  | 1/1                     | STATE_SCHEDULED | user1
   1 | 0d714127 | TYPE_EXPERIMENT | first_job  |       42 | 2022-01-01 00:01:00  | 0/1                     | STATE_QUEUED    | user1

$ det job update-batch 73853c5c.priority=1 0d714127.priority=1

$ det job list
   # | ID       | Type            | Job Name   | Priority | Submitted            | Slots (acquired/needed) | Status          | User
-----+--------------------------------------+-----------------+--------------------------+------------+---------------------------+---------
   0 | 73853c5c | TYPE_EXPERIMENT | second_job |       1 | 2022-01-01 00:01:01  | 1/1                     | STATE_SCHEDULED | user1
   1 | 0d714127 | TYPE_EXPERIMENT | first_job  |       1 | 2022-01-01 00:01:00  | 0/1                     | STATE_QUEUED    | user1