Slurm/PBS Requirements

This document describes the specific requirements for deploying Determined on Slurm or PBS workload managers.

For general environment requirements, please refer to HPC Environment Requirements.

Slurm Requirements

Determined should function with your existing Slurm configuration. To optimize how Determined interacts with Slurm, we recommend the following steps:

  • Enable Slurm for GPU Scheduling.

    Configure Slurm with SelectType=select/cons_tres. This enables Slurm to track GPU allocation instead of tracking only CPUs. When enabled, Determined submits batch jobs by specifying --gpus={slots_per_trial}. If select/cons_tres is not available, you must set the tres_supported option in the slurm section to false.

  • Configure GPU Generic Resources (GRES).

    Determined works best when allocating GPUs. Slurm reports which GPUs are available through GRES. You can use the AutoDetect feature to configure GPU GRES automatically. Otherwise, you should manually configure GPU GRES so that Slurm can schedule nodes with the GPUs you want.

    For the automatic selection of nodes with GPUs, Slurm must be configured with GresTypes=gpu, and nodes with GPUs must have properly configured GRES indicating the presence of any GPUs. When enabled, Determined can ensure GPUs are available by specifying --gres=gpu:1. If Slurm GRES cannot be properly configured, set the gres_supported option in the slurm section to false. In that case, it is the user’s responsibility to ensure that GPUs are available on the nodes selected for the job by other means, such as targeting a specific resource pool with only GPU nodes or specifying a Slurm constraint in the experiment configuration.

  • Ensure homogeneous Slurm partitions.

    Determined maps Slurm partitions to Determined resource pools. It is recommended that the nodes within a partition be homogeneous for Determined to effectively schedule GPU jobs.

    • A Slurm partition with GPUs is identified as a CUDA/ROCm resource pool. The type is inherited from the resource_manager.slot_type configuration. It can also be specified per partition using resource_manager.partition_overrides.

    • A Slurm partition with no GPUs is identified as an AUX resource pool.

    • The Determined default resource pool is set to the Slurm default partition. Override this default using the slurm section default_compute_resource_pool or default_aux_resource_pool option.

    • If a Slurm partition is not homogeneous, you may create a resource pool that provides homogeneous resources out of that partition using a custom resource pool. Configure a resource pool with provider_type: hpc, specify the underlying Slurm partition name to receive the job, and include a task_container_defaults section with the necessary slurm options to select the desired homogeneous set of resources from that partition.

  • Ensure the MaxNodes value for each partition is not less than the number of GPUs in the partition.

    Determined delegates node selection for a job to Slurm by specifying a node range (1-slots_per_trial). If slots_per_trial exceeds the MaxNodes value for the partition, the job will remain in state PENDING with reason code PartitionNodeLimit. Make sure that every partition that has MaxNodes specified uses a value at least as large as the number of GPUs in the partition.

  • Enable multiple jobs per compute node.

    Determined requests GPU or CPU resources from Slurm. When Slurm schedules jobs, however, it also considers the memory requirements of the job. To enable multiple jobs to be scheduled on a node concurrently, configuration is required in slurm.conf.

    The default memory allocated for a job is UNLIMITED. This prevents multiple jobs from executing on the same node unless this value is reduced. The default memory allocation for a job is derived from one of the slurm.conf configuration variables DefMemPerNode, DefMemPerGPU, or DefMemPerCPU. To enable scheduling of individual GPUs or CPUs by default, configure DefMemPerNode (which provides a total amount of memory for each job) or DefMemPerGPU and DefMemPerCPU (which derive the memory allocation from the number of GPUs or CPUs associated with the job). Configure one or more of these values to reduce the default memory allocation and enable jobs to divide up the available memory on compute nodes.

    An alternative to changing the default memory configuration via slurm.conf is to provide explicit options on each job via the Determined configuration (task_container_defaults, resource pool configuration, or experiment configuration slurm.sbatch_args).

    For details about how those requests are derived, see HPC Launching Architecture.

  • Enable resource separation using cgroups.

    While Slurm always allocates distinct resources for each job, by default there is no enforced separation when the resources are co-located in the same compute node. Such enforcement can be enabled using cgroups. GPU allocation is communicated to the application via the environment variables CUDA_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES. Determined uses those specifications to utilize only the GPU resources scheduled by Slurm for the job, but CPU and memory have no enforcement. If desired, you can enable such enforcement with the Slurm cgroups configuration. Enable cgroups support in slurm.conf, then enable enforcement of specific resource classes in cgroup.conf (ConstrainCores for CPU, ConstrainDevices for GPU, and ConstrainRAMSpace for memory).

  • Tune the Slurm configuration for Determined job preemption.

    Slurm preempts jobs using signals. When a Determined job receives SIGTERM, it begins a checkpoint and graceful shutdown. To prevent unnecessary loss of work, it is recommended to set GraceTime (secs) high enough to permit the job to complete an entire Determined scheduling_unit.

    To enable GPU job preemption, use PreemptMode=CANCEL or PreemptMode=REQUEUE. PreemptMode=SUSPEND does not release GPUs, so it does not allow a higher-priority job to access the allocated GPU resources. Determined manages the requeue of a successfully preempted job, so even with PreemptMode=REQUEUE, the Slurm job will be canceled and resubmitted.
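
As a rough aid for reviewing the recommendations above, the following sketch reads text in the "Key = Value" layout printed by scontrol show config and prints a hint for each setting that departs from them. It is only a sketch under stated assumptions: the function names slurm_setting and check_slurm_for_determined are hypothetical, the exact output layout of your Slurm version is assumed, and PreemptMode may also be set per partition, which this sketch does not inspect.

```shell
#!/usr/bin/env bash
# Sketch only: both function names are hypothetical helpers, not part of
# Slurm or Determined.

# Print the value of parameter $1 found in the "Key = Value" text in $2.
slurm_setting() {
    printf '%s\n' "$2" |
        sed -n -E "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*(.*[^[:space:]])[[:space:]]*$/\1/p" |
        head -n 1
}

# Read configuration text from stdin and print one hint per deviation.
check_slurm_for_determined() {
    local cfg select_type gres_types preempt_mode
    cfg="$(cat)"
    select_type="$(slurm_setting SelectType "$cfg")"
    gres_types="$(slurm_setting GresTypes "$cfg")"
    preempt_mode="$(slurm_setting PreemptMode "$cfg")"

    # Without select/cons_tres, Determined cannot rely on --gpus.
    if [ "$select_type" != "select/cons_tres" ]; then
        echo "SelectType=${select_type:-unset}: set tres_supported to false"
    fi
    # GresTypes is a comma-separated list; look for an exact "gpu" entry.
    case ",$gres_types," in
        *,gpu,*) ;;
        *) echo "GresTypes=${gres_types:-unset}: set gres_supported to false" ;;
    esac
    # Suspended jobs keep their GPUs, so SUSPEND does not free them.
    case "$preempt_mode" in
        *SUSPEND*) echo "PreemptMode=$preempt_mode: suspended jobs keep their GPUs" ;;
    esac
}

# Possible use on a cluster (prints nothing if all three settings match):
#   scontrol show config | check_slurm_for_determined
```

Reading from standard input keeps the check usable on a saved copy of the configuration output, without access to the cluster.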

PBS Requirements

Determined should function with your existing PBS configuration. To optimize how Determined interacts with PBS, we recommend the following steps:

  • Enable PBS to store job history.

    Job completion detection requires that the job history feature be enabled. PBS administrators can employ the following command to set the value of job_history_enable:

    sudo qmgr -c "set server job_history_enable = True"
    
  • Configure PBS to manage GPU resources.

    To optimize GPU allocation, Determined automatically selects compute nodes with GPUs by default using the -l select={slots_per_trial}:ngpus=1 option. If PBS cannot identify GPUs in this way, set the pbs section gres_supported option to false when configuring Determined. In this case, users must ensure GPU availability on nodes by other means, such as targeting GPU-only resource pools, or specifying a PBS constraint in the experiment configuration.

    PBS should be configured to set the environment variable CUDA_VISIBLE_DEVICES (or ROCR_VISIBLE_DEVICES for ROCm) using a PBS cgroup hook, as explained in the PBS Administrator’s Guide. If PBS is not configured to set CUDA_VISIBLE_DEVICES, Determined will utilize only a single GPU on each node. To fully utilize multiple GPUs, you must either manually configure CUDA_VISIBLE_DEVICES or set the pbs.slots_per_node setting in your experiment configuration file to indicate the desired number of GPU slots for Determined.

  • Ensure the ngpus resource is defined with the correct values.

    To ensure the successful operation of Determined, define the ngpus resource value for each node on the cluster. Additionally, the resource should have the appropriate flags to enable proper processing by PBS when scheduling jobs. You can check if the ngpus resource is defined with the appropriate flags using the following command:

    [~]$ qmgr -c "list resource ngpus"
    Resource ngpus
        type = long
        flag = hn
    

    The output should indicate that ngpus is defined as a type long resource with flags h (host-level resource) and n (consumable resource). If your HPC cluster nodes do not produce the same output, use the following commands to create and set the resource ngpus for each node:

    [~]$ qmgr -c "create resource ngpus type=long, flag=hn" # create the ngpus resource
    [~]$ qmgr -c "set node <nodename> ngpus=<number of GPUs>" # set the value for ngpus
    

    If you use virtual nodes (vnodes), make sure the ngpus value is set only on the vnodes, not the parent node. Below are the commands and sample output to ensure this:

    [~]$ sudo qmgr -c "list node node002[0] resources_available"
    Node node002[0]
        resources_available.arch = linux
        resources_available.host = node002
        resources_available.hpmem = 0b
        resources_available.mem = 45943mb
        resources_available.ncpus = 18
        resources_available.ngpus = 2 # ngpus value is set on vnode node002[0]
        resources_available.vmem = 46933mb
        resources_available.vnode = node002[0]
    [~]$ sudo qmgr -c "list node node002 resources_available"
    Node node002
        resources_available.accel_type = tesla
        resources_available.arch = linux
        resources_available.host = node002
        resources_available.hpmem = 0b
        resources_available.mem = 0b
        resources_available.ncpus = 0 # ngpus value is not set on parent node node002
        resources_available.Qlist = gpuQ,gpu_hi_priQ
        resources_available.vmem = 0b
        resources_available.vnode = node002
    

    If the ngpus value is set on the parent node, use the following command to unset it:

    [~]$ sudo qmgr -c "unset node <node_name> resources_available.ngpus"
    

    Next, make sure that ngpus is listed as a resource in the <sched_priv_directory>/sched_config file.

    [~]$ sudo cat <sched_priv_directory>/sched_config | grep "resources:"
    resources: "ngpus, ncpus, mem, arch, host, vnode, ..., foo"
    

    Finally, restart the PBS server to apply the changes.

    [~]$ sudo systemctl restart pbs
    
  • Configure PBS to report GPU Accelerator type.

    It is recommended that PBS administrators set the value for resources_available.accel_type on each node that contains an accelerator. Otherwise, the Cluster tab on the Determined WebUI will show unconfigured for the Accelerator field in the Resource Pool information.

    PBS administrators can use the following set of commands to set the value of resources_available.accel_type on a single node:

    • Check if the resources_available.accel_type value is set.

      pbsnodes -v node001 | grep resources_available.accel_type
      
    • If required, set the desired value for resources_available.accel_type.

      sudo qmgr -c "set node node001 resources_available.accel_type=tesla"
      
    • When there are multiple types of GPUs on the node, use a comma-separated value.

      sudo qmgr -c "set node node001 resources_available.accel_type=tesla,kepler"
      
    • Verify that the resources_available.accel_type value is now set.

      pbsnodes -v node001 | grep resources_available.accel_type
      

    Repeat the above steps to set the resources_available.accel_type value for every node that contains a GPU. Once the value is set for all the necessary nodes, admins can verify the Accelerator field on the Cluster pane of the WebUI.

  • Ensure homogeneous PBS queues.

    Determined maps PBS queues to Determined resource pools. It is recommended that the nodes within a queue be homogeneous for Determined to effectively schedule GPU jobs.

    • A PBS queue with GPUs is identified as a CUDA/ROCm resource pool. The type is inherited from the resource_manager.slot_type configuration. It can also be specified per queue using resource_manager.partition_overrides.

    • A PBS queue with no GPUs is identified as an AUX resource pool.

    • The Determined default resource pool is set to the PBS default queue. Override this default using the pbs section default_compute_resource_pool or default_aux_resource_pool option.

    • If a PBS queue is not homogeneous, you may create a resource pool that provides homogeneous resources out of that queue using a custom resource pool. Configure a resource pool with provider_type: hpc, specify the underlying PBS queue name to receive the job, and include a task_container_defaults section with the necessary pbs options to select the desired homogeneous set of resources from that queue.

  • Tune the PBS configuration for Determined job preemption.

    PBS supports a wide variety of criteria to trigger job preemption, and you may use any that suit your system and job requirements. Once a job is identified for preemption, PBS supports four different preemption methods, which are specified via the preemption_order scheduling parameter. The default preemption order value is 'SCR'. The preemption methods are specified by the following letters:

    S - Suspend the job.

    This is not applicable for GPU jobs.

    C - Checkpoint the job.

    This requires that a custom checkpoint script be added to PBS.

    R - Requeue the job.

    Determined does not support the re-queueing of a task. Determined jobs specify the -r n option to PBS to prevent this case.

    D - Delete the job.

    Determined jobs support this option without configuration.

    Given those options, the simplest path to enable Determined job preemption is to include D in the preemption_order. You may include R in the preemption_order, but it is disabled for Determined jobs. You may include C in the preemption_order if you additionally configure a checkpoint script. Refer to the PBS documentation for details. If you choose to implement a checkpoint script, you may initiate a Determined checkpoint by sending a SIGTERM signal to the Determined job. When a Determined job receives a SIGTERM, it begins a checkpoint and graceful shutdown. To prevent unnecessary loss of work, it is recommended that you wait long enough after sending the SIGTERM for the job to complete at least one Determined scheduling_unit. If the job has not terminated after that period, send a SIGKILL to forcibly release all resources.

Apptainer/Singularity Requirements

Determined supports Apptainer (formerly known as Singularity) as a container runtime in HPC environments. Ensure that Apptainer or Singularity is properly installed and configured on all compute nodes of your cluster.

Note

In addition to the core Apptainer/Singularity installation package, the apptainer-suid or singularity-suid component is also required for full Determined functionality.

Singularity has numerous options that may be customized in the singularity.conf file. Determined has been verified using the default values and therefore does not require any special configuration on the compute nodes of the cluster.

Podman Requirements

When Determined is configured to use Podman, the containers are launched in rootless mode. Your HPC cluster administrator should have completed most of the configuration for you, but there may be additional per-user configuration that is required. Before attempting to launch Determined jobs, verify that you can run simple Podman containers on a compute node. For example:

podman run hello-world

If you are unable to do that successfully, then one or more of the following configuration changes may be required in your $HOME/.config/containers/storage.conf file:

  1. Podman does not support rootless container storage on distributed file systems (e.g. NFS, Lustre, GPFS). On a typical HPC cluster, user directories are on a distributed file system and the default container storage location of $HOME/.local/share/containers/storage is therefore not supported. If this is the case on your HPC cluster, configure the graphroot option in your storage.conf to specify a local file system available on compute nodes. Alternatively, you can request that your system administrator configure the rootless_storage_path in /etc/containers/storage.conf on all compute nodes.

  2. Podman utilizes the directory specified by the environment variable XDG_RUNTIME_DIR. Normally, this is provided by the login process. Slurm and PBS, however, do not provide this variable when launching jobs on compute nodes. When XDG_RUNTIME_DIR is not defined, Podman attempts to create the directory /run/user/$UID for this purpose. If /run/user is not writable by a non-root user, then Podman commands will fail with a permission error. To avoid this problem, configure the runroot option in your storage.conf to a writable local directory available on all compute nodes. Alternatively, you can request that your system administrator configure /run/user to be user-writable on all compute nodes.
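
To tell whether the second issue applies, a small diagnostic along the following lines may help. It is a sketch rather than a Podman feature: podman_runtime_dir_status is a hypothetical helper that only mirrors the fallback described above (XDG_RUNTIME_DIR if defined, otherwise /run/user/$UID) and checks whether the current user could write to or create that directory.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic (not part of Podman): report the runtime directory
# described above and whether the current user can write to or create it.
podman_runtime_dir_status() {
    local dir="${XDG_RUNTIME_DIR:-/run/user/$(id -u)}"
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "ok: $dir"
    elif [ ! -e "$dir" ] && [ -w "$(dirname "$dir")" ]; then
        echo "ok: $dir can be created"
    else
        echo "problem: $dir is not writable; set runroot in storage.conf"
    fi
}

podman_runtime_dir_status
```

Because login nodes usually do define XDG_RUNTIME_DIR, the result is only meaningful when the function runs inside a Slurm or PBS job on a compute node.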

Create or update $HOME/.config/containers/storage.conf as required to resolve the issues above. The example storage.conf file below uses the file system /tmp, but there may be a more appropriate file system on your HPC cluster that you should specify for this purpose.

[storage]
driver = "overlay"
graphroot = "/tmp/$USER/storage"
runroot = "/tmp/$USER/run"

Any changes to your storage.conf should be applied using the command:

podman system migrate

Enroot Requirements

Install and configure Enroot on all compute nodes of your cluster as per the Enroot Installation instructions for your platform. There may be additional per-user configuration that is required.

  1. Enroot utilizes the directory ${ENROOT_RUNTIME_PATH} (with default value ${XDG_RUNTIME_DIR}/enroot) for temporary files. Normally XDG_RUNTIME_DIR is provided by the login process, but Slurm and PBS do not provide this variable when launching jobs on compute nodes. When neither ENROOT_RUNTIME_PATH nor XDG_RUNTIME_DIR is defined, Enroot attempts to create the directory /run/enroot for this purpose. This typically fails with a permission error for any non-root user. Select one of the following alternatives to ensure that XDG_RUNTIME_DIR or ENROOT_RUNTIME_PATH is defined and points to a user-writable directory when Slurm/PBS jobs are launched on the cluster.

    • Have your HPC cluster administrator configure Slurm/PBS to provide XDG_RUNTIME_DIR, or change the default ENROOT_RUNTIME_PATH defined in /etc/enroot/enroot.conf on each node in your HPC cluster.

    • If using Slurm, provide an ENROOT_RUNTIME_PATH definition in task_container_defaults.environment_variables in master.yaml.

      task_container_defaults:
         environment_variables:
            - ENROOT_RUNTIME_PATH=/tmp/$(whoami)
      
    • If using Slurm, provide an ENROOT_RUNTIME_PATH definition in your experiment configuration.

  2. Unlike with Singularity or Podman, you must manually download the Docker image file to the local file system (enroot import), and then each user must create an Enroot container using that image (enroot create). When the HPC launcher generates the enroot command for a job, it automatically applies the same transformation to the name that Enroot does on import (/ and : characters are replaced with +) so that Docker image references match the associated Enroot container. The following shell commands download the image and then create an Enroot container for the current user. If other users have read access to /shared/enroot/images, they need only perform the enroot create step to make the container available for their use.

    image=determinedai/pytorch-ngc:0.38.0
    cd /shared/enroot/images
    enroot import docker://$image
    enroot create /shared/enroot/images/${image//[\/:]/\+}.sqsh
    
  3. The Enroot container storage directory for the user, ${ENROOT_DATA_PATH} (which defaults to $HOME/.local/share/enroot), must be accessible on all compute nodes.

  4. A convenience script, /usr/bin/manage-enroot-cache, is provided by the HPC launcher installation to simplify the management of enroot images.
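
The name transformation mentioned in step 2 can be tried in isolation. The sketch below wraps the same bash substitution used in the commands above in a function. The name enroot_container_name is hypothetical and exists only for illustration. It is not part of Enroot or the HPC launcher.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a Docker image reference to the name described in
# step 2, where every "/" and ":" is replaced with "+".
enroot_container_name() {
    local image="$1"
    printf '%s\n' "${image//[\/:]/+}"
}

enroot_container_name "determinedai/pytorch-ngc:0.38.0"
# prints: determinedai+pytorch-ngc+0.38.0
```

With a .sqsh suffix appended, the printed name corresponds to the file that the enroot create command in step 2 refers to.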