Installation Requirements¶
Basic Requirements¶
Deploying the Determined HPC Launcher on Slurm/PBS has the following requirements.
The login node, admin node, and compute nodes must be installed and configured with one of the following Linux distributions:
Red Hat® Enterprise Linux (RHEL) or CentOS 7.9
RHEL or Rocky Linux® 8.5, 8.6
RHEL 9
SUSE® Linux Enterprise Server (SLES) 12 SP3, 15 SP3, 15 SP4
Ubuntu® 20.04, 22.04
Cray OS (COS) 2.3, 2.4
Note: More restrictive Linux distribution dependencies may be required by your choice of Slurm/PBS version and container runtime (Singularity/Apptainer®, Podman, or NVIDIA® Enroot).
Slurm 20.02 or greater (excluding 22.05.5 through at least 22.05.8 - see Slurm Known Issues) or PBS 2021.1.2 or greater.
Apptainer 1.0 or greater, Singularity 3.7 or greater, Enroot 3.4.0 or greater, or PodMan 3.3.1 or greater.
A cluster-wide shared filesystem with consistent path names across the HPC cluster.
User and group configuration must be consistent across all nodes.
All nodes must be able to resolve the hostnames of all other nodes.
To run jobs with GPUs, the NVIDIA or AMD drivers must be installed on each compute node. Determined requires NVIDIA driver version 450.80 or greater. The NVIDIA drivers can be installed as part of a CUDA installation, but the rest of the CUDA toolkit is not required.
Determined supports the active Python versions.
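The NVIDIA driver minimum stated above (450.80) can be checked with a version-aware comparison. The following is a minimal sketch, not part of Determined: the function name nvidia_driver_ok is hypothetical, and it assumes GNU sort (for the -V option) is available on the compute node.

```shell
# Hypothetical helper: succeeds when the driver version passed as $1 is
# at least 450.80, the minimum stated in the requirements above.
nvidia_driver_ok() {
  min_version=450.80
  # sort -V orders version strings numerically per component; if the
  # minimum sorts first (or ties), the given version is new enough.
  lowest=$(printf '%s\n' "$min_version" "$1" | sort -V | head -n 1)
  [ "$lowest" = "$min_version" ]
}

# Example use on a compute node with the NVIDIA driver installed:
#   version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   nvidia_driver_ok "$version" || echo "driver $version is older than 450.80"
```

The comparison is done on the version string only, so the same helper can be run against values collected from every compute node.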
Launcher Requirements¶
The launcher has the following additional requirements on the installation node:
Support for an RPM or Debian-based package installer
Java 1.8 or greater
Sudo is configured to process configuration files present in the /etc/sudoers.d directory
Access to the Slurm or PBS command line interface for the cluster
Access to a cluster-wide file system with consistent path names across the cluster
Proxy Configuration Requirements¶
If internet connectivity requires the use of a proxy, verify the following requirements:
Ensure that the proxy variables are defined in /etc/environment (or /etc/sysconfig/proxy on SLES).
Ensure that the no_proxy setting covers the login and admin nodes. If these nodes may be referenced by short names known only within the cluster, they must explicitly be included in the no_proxy setting.
If your experiment code communicates between compute nodes with a protocol that honors proxy environment variables, you should additionally include the names of all compute nodes in the no_proxy variable setting.
The HPC launcher imports http_proxy, https_proxy, ftp_proxy, rsync_proxy, gopher_proxy, socks_proxy, socks5_server, and no_proxy from /etc/environment and /etc/sysconfig/proxy. These environment variables are automatically exported in lowercase and uppercase into any launched jobs and containers.
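One way to confirm that the no_proxy setting covers the required nodes is to compare it against a list of host names. The sketch below is illustrative only: the function name missing_from_no_proxy and the host names in the usage comment are hypothetical, and the check is an exact match against the comma-separated entries. It does not model the domain-suffix or wildcard matching that some tools apply to no_proxy, and it assumes entries are not padded with spaces.

```shell
# Hypothetical helper: print each host (arguments 2..n) that is not
# listed verbatim in the comma-separated no_proxy value given as $1.
missing_from_no_proxy() {
  np_value=$1
  shift
  for np_host in "$@"; do
    case ",${np_value}," in
      *",${np_host},"*) ;;               # listed exactly, nothing to report
      *) printf '%s\n' "$np_host" ;;     # not listed
    esac
  done
}

# Example use (login1, admin1, and node001 are placeholder names):
#   . /etc/environment
#   missing_from_no_proxy "$no_proxy" login1 admin1 node001
```

An empty result means every name passed in appears in the no_proxy value as written.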
Slurm Requirements¶
Determined should function with your existing Slurm configuration. The following steps are recommended to optimize how Determined interacts with Slurm:
Enable Slurm for GPU Scheduling.
Configure Slurm with SelectType=select/cons_tres. This enables Slurm to track GPU allocation instead of tracking only CPUs. When enabled, Determined submits batch jobs by specifying --gpus={slots_per_trial}. If this is not available, you must change the slurm section tres_supported option to false.
Configure GPU Generic Resources (GRES).
Determined works best when allocating GPUs. Slurm exposes information about the available GPUs through GRES. You can use the AutoDetect feature to configure GPU GRES automatically. Otherwise, you should manually configure GRES GPUs such that Slurm can schedule nodes with the GPUs you want.
For the automatic selection of nodes with GPUs, Slurm must be configured for GresTypes=gpu and nodes with GPUs must have properly configured GRES indicating the presence of any GPUs. When enabled, Determined can ensure GPUs are available by specifying --gres=gpus:1. If Slurm GRES cannot be properly configured, set the slurm section gres_supported option to false. It is then the user’s responsibility to ensure that GPUs will be available on nodes selected for the job using other configurations, such as targeting a specific resource pool with only GPU nodes or specifying a Slurm constraint in the experiment configuration.
Ensure homogeneous Slurm partitions.
Determined maps Slurm partitions to Determined resource pools. It is recommended that the nodes within a partition be homogeneous for Determined to effectively schedule GPU jobs.
A Slurm partition with GPUs is identified as a CUDA/ROCM resource pool. The type is inherited from the resource_manager.slot_type configuration. It can also be specified per partition using resource_manager.partition_overrides.
A Slurm partition with no GPUs is identified as an AUX resource pool.
The Determined default resource pool is set to the Slurm default partition. Override this default using the slurm section default_compute_resource_pool or default_aux_resource_pool option.
If a Slurm partition is not homogeneous, you may create a resource pool that provides homogeneous resources out of that partition using a custom resource pool. Configure a resource pool with provider_type: hpc, specify the underlying Slurm partition name to receive the job, and include a task_container_defaults section with the necessary slurm options to select the desired homogeneous set of resources from that partition.
Ensure the MaxNodes value for each partition is not less than the number of GPUs in the partition.
Determined delegates node selection for a job to Slurm by specifying a node range (1-slots_per_trial). If slots_per_trial exceeds the MaxNodes value for the partition, the job will remain in state PENDING with reason code PartitionNodeLimit. Make sure that every partition that specifies MaxNodes uses a value at least as large as the number of GPUs in the partition.
Tune the Slurm configuration for Determined job preemption.
Slurm preempts jobs using signals. When a Determined job receives SIGTERM, it begins a checkpoint and graceful shutdown. To prevent unnecessary loss of work, it is recommended to set GraceTime (secs) high enough to permit the job to complete an entire Determined scheduling_unit.
To enable GPU job preemption, use PreemptMode=CANCEL or PreemptMode=REQUEUE, because PreemptMode=SUSPEND does not release GPUs and therefore does not allow a higher-priority job to access the allocated GPU resources. Determined manages the requeue of a successfully preempted job, so even with PreemptMode=REQUEUE, the Slurm job will be canceled and resubmitted.
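The SelectType and GresTypes recommendations above can be verified against a running cluster by inspecting the output of scontrol show config. The following is a minimal sketch under stated assumptions: the function name check_slurm_gpu_config is hypothetical, and it assumes the usual "Name = value" layout of that command's output. It reads the configuration text on standard input so it can also be run against a saved copy.

```shell
# Hypothetical helper: read `scontrol show config` output on stdin and
# report whether SelectType and GresTypes match the recommendations.
check_slurm_gpu_config() {
  awk -F' *= *' '
    $1 == "SelectType" { sel = $2;  sub(/[ \t]+$/, "", sel) }
    $1 == "GresTypes"  { gres = $2; sub(/[ \t]+$/, "", gres) }
    END {
      ok = 1
      if (sel != "select/cons_tres") {
        print "SelectType is \"" sel "\", expected select/cons_tres"
        ok = 0
      }
      # GresTypes may be a comma-separated list; gpu must be one entry.
      if (gres !~ /(^|,)gpu(,|$)/) {
        print "GresTypes does not include gpu"
        ok = 0
      }
      exit (ok ? 0 : 1)
    }'
}

# Example use on a node with the Slurm command line tools:
#   scontrol show config | check_slurm_gpu_config
```

If either check fails, the corresponding tres_supported or gres_supported option described above would need to be set to false instead.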
PBS Requirements¶
Determined should function with your existing PBS configuration. The following steps are recommended to optimize how Determined interacts with PBS:
Configure PBS to manage GPU resources.
Determined works best when allocating GPUs. By default, Determined selects compute nodes with GPUs using the option -select={slots_per_trial}:ngpus=1. If PBS cannot be configured to identify GPUs in this manner, set the pbs section gres_supported option to false when configuring Determined. It is then the user’s responsibility to ensure that GPUs will be available on nodes selected for the job using other configurations, such as targeting a specific resource pool with only GPU nodes or specifying a PBS constraint in the experiment configuration.
PBS should be configured to provide the environment variable CUDA_VISIBLE_DEVICES (ROCR_VISIBLE_DEVICES for ROCm) using a PBS cgroup hook as described in the PBS Administrator’s Guide. If PBS is not configured to set CUDA_VISIBLE_DEVICES, Determined will utilize a single GPU on each node. To fully utilize multiple GPUs, you must either manually define CUDA_VISIBLE_DEVICES appropriately or provide the pbs.slots_per_node setting in your experiment configuration to indicate how many GPU slots are intended for Determined to use.
Ensure homogeneous PBS queues.
Determined maps PBS queues to Determined resource pools. It is recommended that the nodes within a queue be homogeneous for Determined to effectively schedule GPU jobs.
A PBS queue with GPUs is identified as a CUDA/ROCM resource pool. The type is inherited from the resource_manager.slot_type configuration. It can also be specified per partition using resource_manager.partition_overrides.
A PBS queue with no GPUs is identified as an AUX resource pool.
The Determined default resource pool is set to the PBS default queue. Override this default using the pbs section default_compute_resource_pool or default_aux_resource_pool option.
If a PBS queue is not homogeneous, you may create a resource pool that provides homogeneous resources out of that queue using a custom resource pool. Configure a resource pool with provider_type: hpc, specify the underlying PBS queue name to receive the job, and include a task_container_defaults section with the necessary pbs options to select the desired homogeneous set of resources from that queue.
Tune the PBS configuration for Determined job preemption.
PBS supports a wide variety of criteria to trigger job preemption, and you may use any per your system and job requirements. Once a job is identified for preemption, PBS supports four different options for job preemption, which are specified via the preemption_order scheduling parameter. The default preemption order value is 'SCR'. The preemption methods are specified by the following letters:
S - Suspend the job. This is not applicable for GPU jobs.
C - Checkpoint the job. This requires that a custom checkpoint script be added to PBS.
R - Requeue the job. Determined does not support the re-queueing of a task. Determined jobs specify the -r n option to PBS to prevent this case.
D - Delete the job. Determined jobs support this option without configuration.
Given those options, the simplest path to enable Determined job preemption is to include D in the preemption_order. You may include R in the preemption_order, but it is disabled for Determined jobs. You may include C in the preemption_order if you additionally configure a checkpoint script. Refer to the PBS documentation for details. If you choose to implement a checkpoint script, you may initiate a Determined checkpoint by sending a SIGTERM signal to the Determined job. When a Determined job receives a SIGTERM, it begins a checkpoint and graceful shutdown. To prevent unnecessary loss of work, it is recommended that you wait at least one Determined scheduling_unit after sending the SIGTERM for the job to complete. If the job has not terminated after that period, send a SIGKILL to forcibly release all resources.
Singularity/Apptainer Requirements¶
Singularity/Apptainer is the recommended container runtime for Determined on HPC clusters. Apptainer is a fork of Singularity 3.8 and provides both the apptainer and singularity commands. For purposes of this documentation, you can consider all references to Singularity to also apply to Apptainer. The Determined launcher interacts with Singularity/Apptainer using the singularity command.
Singularity has numerous options that may be customized in the singularity.conf file. Determined has been verified using the default values and therefore does not require any special configuration on the compute nodes of the cluster.
PodMan Requirements¶
When Determined is configured to use PodMan, the containers are launched in rootless mode. Your HPC cluster administrator should have completed most of the configuration for you, but there may be additional per-user configuration that is required. Before attempting to launch Determined jobs, verify that you can run simple PodMan containers on a compute node. For example:
podman run hello-world
If you are unable to do that successfully, then one or more of the following configuration changes may be required in your $HOME/.config/containers/storage.conf file:
PodMan does not support rootless container storage on distributed file systems (e.g. NFS, Lustre, GPFS). On a typical HPC cluster, user directories are on a distributed file system and the default container storage location of $HOME/.local/share/containers/storage is therefore not supported. If this is the case on your HPC cluster, configure the graphroot option in your storage.conf to specify a local file system available on compute nodes. Alternatively, you can request that your system administrator configure the rootless_storage_path in /etc/containers/storage.conf on all compute nodes.
PodMan utilizes the directory specified by the environment variable XDG_RUNTIME_DIR. Normally, this is provided by the login process. Slurm and PBS, however, do not provide this variable when launching jobs on compute nodes. When XDG_RUNTIME_DIR is not defined, PodMan attempts to create the directory /run/user/$UID for this purpose. If /run/user is not writable by a non-root user, then PodMan commands will fail with a permission error. To avoid this problem, configure the runroot option in your storage.conf to a writable local directory available on all compute nodes. Alternatively, you can request that your system administrator configure /run/user to be user-writable on all compute nodes.
Create or update $HOME/.config/containers/storage.conf as required to resolve the issues above. The example storage.conf file below uses the file system /tmp, but there may be a more appropriate file system on your HPC cluster that you should specify for this purpose.
[storage]
driver = "overlay"
graphroot = "/tmp/$USER/storage"
runroot = "/tmp/$USER/run"
Any changes to your storage.conf should be applied using the command:
podman system migrate
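To decide whether the default container storage location is affected by the distributed file system restriction described above, you can look at the file system type that backs a directory. This is a rough sketch, not part of Determined or PodMan: the function name is_distributed_fs_type is hypothetical, it covers only the three file systems named above, and it assumes GNU stat reports them as nfs, lustre, and gpfs.

```shell
# Hypothetical helper: succeeds when the file system type name in $1 is
# one of the distributed file systems named in this section.
is_distributed_fs_type() {
  case "$1" in
    nfs|nfs4|lustre|gpfs) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use on a compute node (GNU stat prints the type with %T):
#   fs_type=$(stat -f -c %T "$HOME")
#   if is_distributed_fs_type "$fs_type"; then
#     echo "$HOME is on $fs_type; set graphroot to a local file system"
#   fi
```

The same check can be applied to the directory you intend to use for graphroot or runroot to confirm that it is local.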
Enroot Requirements¶
Install and configure Enroot on all compute nodes of your cluster as per the Enroot Installation instructions for your platform. There may be additional per-user configuration that is required.
Enroot utilizes the directory ${ENROOT_RUNTIME_PATH} (with default value ${XDG_RUNTIME_DIR}/enroot) for temporary files. Normally XDG_RUNTIME_DIR is provided by the login process, but Slurm and PBS do not provide this variable when launching jobs on compute nodes. When neither ENROOT_RUNTIME_PATH nor XDG_RUNTIME_DIR is defined, Enroot attempts to create the directory /run/enroot for this purpose. This typically fails with a permission error for any non-root user. Select one of the following alternatives to ensure that XDG_RUNTIME_DIR or ENROOT_RUNTIME_PATH is defined and points to a user-writable directory when Slurm/PBS jobs are launched on the cluster.
Have your HPC cluster administrator configure Slurm/PBS to provide XDG_RUNTIME_DIR, or change the default ENROOT_RUNTIME_PATH defined in /etc/enroot/enroot.conf on each node in your HPC cluster.
If using Slurm, provide an ENROOT_RUNTIME_PATH definition in task_container_defaults.environment_variables in master.yaml.
task_container_defaults:
  environment_variables:
    - ENROOT_RUNTIME_PATH=/tmp/$(whoami)
If using Slurm, provide an ENROOT_RUNTIME_PATH definition in your experiment configuration.
Unlike Singularity or PodMan, you must manually download the docker image file to the local file system (enroot import) and then each user must create an Enroot container using that image (enroot create). When the HPC launcher generates the enroot command for a job, it automatically applies the same transformation to the name that Enroot does on import (/ and : characters are replaced with +) to enable docker image references to match the associated Enroot container. The following shell commands will download and then create an Enroot container for the current user. If other users have read access to /shared/enroot/images, they need only perform the enroot create step to make the container available for their use.
image=determinedai/environments:cuda-11.3-pytorch-1.12-tf-2.8-gpu-835d8b1
cd /shared/enroot/images
enroot import docker://$image
enroot create /shared/enroot/images/${image//[\/:]/\+}.sqsh
The Enroot container storage directory for the user ${ENROOT_CACHE_PATH} (which defaults to $HOME/.local/share/enroot) must be accessible on all compute nodes.