Install Determined on Slurm¶
This document describes how to deploy Determined on a Slurm cluster.
The Determined master and launcher installation packages are configured for installation on a single login or administrator Slurm cluster node.
Install Determined Master¶
After the node has been selected and the Installation Requirements have been fulfilled and configured, install and configure the Determined master:
Install the on-premises Determined master component (not including the Determined agent) as described in the Install Determined Using Linux Packages document. Perform the installation and configuration steps, but stop before starting the determined-master service, and continue with the steps below.
Install the launcher.
For an example RPM-based installation, run:
sudo rpm -ivh hpe-hpc-launcher-<version>.rpm
On Debian distributions, instead run:
sudo apt install ./hpe-hpc-launcher-<version>.deb
The installation configures and enables the launcher service, which provides Slurm management capabilities.
If launcher dependencies are not satisfied, warning messages are displayed. Install or update the missing dependencies, or adjust ld_library_path in the next step so that the dependencies can be located.
Configure and Verify Determined Master on Slurm¶
The launcher automatically adds a prototype resource_manager section for Slurm. Edit the provided resource_manager configuration section for your particular deployment. For RPM-based installations, the configuration file is typically /etc/determined/master.yaml.
In this example, with Determined and the launcher colocated on a node named login, the section might resemble:
port: 8080
...
resource_manager:
  type: slurm
  master_host: login
  master_port: 8080
  host: localhost
  port: 8181
  protocol: http
  container_run_type: singularity
  auth_file: /root/.launcher.token
  job_storage_root:
  path:
  tres_supported: true
  slot_type: cuda
The installer provides default values; however, you should explicitly configure the following cluster options:
port: The communication port used by the launcher. Update this value if it conflicts with other services on your cluster.
job_storage_root: The shared directory where job-related files are stored. This directory must be visible to the launcher and from the compute nodes.
container_run_type: The container type to be launched on Slurm (for example, singularity or podman). The default type is singularity.
The shared directory where Singularity images are hosted. Unused unless the container type is singularity. See Provide a Singularity Images Cache for details on how this option is used.
By default, the launcher runs from the root account. Create a local account and group and update these values to enable running from another account.
ld_library_path: If any of the launcher dependencies are not on the default search path, you can override the default by updating this value.
See the slurm section of the cluster configuration reference for the full list of configuration options.
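As a rough sketch only, an ld_library_path override might sit alongside the other resource_manager options shown earlier; the placement of the key and the path below are assumptions, so verify both against the cluster configuration reference before use:

```yaml
resource_manager:
  type: slurm
  # Assumed placement and illustrative path; confirm the option's exact
  # location in the cluster configuration reference for your version.
  ld_library_path: /usr/local/cuda/lib64
```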
After changing values in the resource_manager section of the /etc/determined/master.yaml file, restart the launcher service:
sudo systemctl restart launcher
Verify successful launcher startup using the systemctl status launcher command. If the launcher fails to start, check system log diagnostics, such as journalctl --since=10m -u launcher, make the needed changes to the /etc/determined/master.yaml file, and restart the launcher.
If the installer reported missing or incorrect dependencies, verify that they have been resolved by the changes to ld_library_path in the previous step:
Reload the Determined master to get the updated configuration:
sudo systemctl restart determined-master
Verify successful determined-master startup using the systemctl status determined-master command. If determined-master fails to start, check system log diagnostics, such as journalctl --since=10m -u determined-master, make the needed changes to the /etc/determined/master.yaml file, and restart determined-master.
If the compute nodes of your cluster do not have internet connectivity to download Docker images, see Provide a Singularity Images Cache.
Verify the configuration by sanity-checking your Determined Slurm configuration:
det command run hostname
A successful configuration reports the hostname of the compute node selected by Slurm to run the job.
Run a simple distributed training job, such as the PyTorch MNIST Tutorial, to verify that it completes successfully. This validates Determined master and launcher communication, access to the shared filesystem, GPU scheduling, and high-speed interconnect configuration. For more complete validation, ensure that slots_per_trial is at least twice the number of GPUs available on a single node.
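For instance, on a cluster whose nodes each have 8 GPUs (an assumed node size used only for illustration), the experiment configuration could request two nodes' worth of slots:

```yaml
# Sketch of an experiment configuration fragment; the slot count assumes
# a hypothetical 8-GPU node, forcing a multi-node allocation.
resources:
  slots_per_trial: 16
```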
Determined should function with your existing Slurm configuration. The following steps are recommended to optimize how Determined interacts with Slurm:
Enable Slurm for GPU Scheduling.
Configure Slurm with SelectType=select/cons_tres. This enables Slurm to track GPU allocation instead of tracking only CPUs. If this is not available, you must set the slurm section tres_supported option to false.
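When select/cons_tres cannot be enabled, the fallback might look like the following sketch, reusing the tres_supported option from the example configuration earlier in this document:

```yaml
resource_manager:
  type: slurm
  # Disable TRES-based GPU tracking when select/cons_tres is unavailable.
  tres_supported: false
```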
Configure GPU Generic Resources (GRES).
Determined works best when allocating GPUs. Information about what GPUs are available is available using GRES. You can use the AutoDetect feature to configure GPU GRES automatically. Otherwise, you should manually configure GRES GPUs such that Slurm can schedule nodes with the GPUs you want.
For the automatic selection of nodes with GPUs, Slurm must be configured with GresTypes=gpu, and nodes with GPUs must have properly configured GRES entries indicating the presence of those GPUs. If Slurm GRES cannot be properly configured, set the slurm section gres_supported option to false; it is then the user's responsibility to ensure that GPUs will be available on the nodes selected for the job, using other means such as targeting a specific resource pool containing only GPU nodes or specifying a Slurm constraint in the experiment configuration.
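As an illustration, GPU GRES configuration typically involves lines like the following; the node names and GPU counts are assumptions for a hypothetical four-node cluster:

```ini
# slurm.conf: declare the GPU GRES type and the per-node GPU counts
GresTypes=gpu
NodeName=gpu[01-04] Gres=gpu:4 State=UNKNOWN

# gres.conf: alternatively, let Slurm detect GPUs automatically via NVML
AutoDetect=nvml
```

Note that AutoDetect=nvml requires a Slurm build with NVML support; otherwise, enumerate the GPU devices explicitly in gres.conf.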
Ensure homogeneous Slurm partitions.
Determined maps Slurm partitions to Determined resource pools. It is recommended that the nodes within a partition are homogeneous for Determined to effectively schedule GPU jobs.
A Slurm partition with GPUs is identified as a CUDA/ROCm resource pool. The type is inherited from the resource_manager.slot_type configuration and can also be specified per partition.
A Slurm partition with no GPUs is identified as an AUX resource pool.
The Determined default resource pool is set to the Slurm default partition.
Tune the Slurm configuration for Determined job preemption.
Slurm preempts jobs using signals. When a Determined job receives SIGTERM, it begins a checkpoint and graceful shutdown. To prevent unnecessary loss of work, it is recommended to set GraceTime (secs) high enough to permit the job to complete a checkpoint and shut down gracefully.
To enable GPU job preemption, use PreemptMode=REQUEUE or PreemptMode=CANCEL; PreemptMode=SUSPEND does not release GPUs, so it does not allow a higher-priority job to access the allocated GPU resources.
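A hypothetical slurm.conf preemption setup might resemble the following; the preemption type, partition layout, and 600-second grace time are illustrative assumptions to tune for your site:

```ini
# Preempt lower-priority jobs by requeueing them so their GPUs are released
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
# GraceTime (seconds) between SIGTERM and SIGKILL, allowing a checkpoint
PartitionName=batch Nodes=gpu[01-04] Default=YES GraceTime=600 State=UP
```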