Known Issues
Agent-Specific Scheduling Options are Ignored

When using the HPC Launcher, Determined delegates all job scheduling and prioritization to the HPC workload manager (either Slurm or PBS), and the following experiment configuration options are ignored:

- resources.max_slots
- resources.priority
- resources.weight
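As a sketch, an experiment configuration containing these fields would still submit successfully under the HPC Launcher, but the values shown (hypothetical) would have no effect; scheduling behavior is instead controlled by your Slurm/PBS configuration:

```yaml
# Hypothetical experiment configuration fragment: under the HPC Launcher,
# these three fields are accepted but ignored.
resources:
  max_slots: 4   # ignored; limits are enforced by the workload manager
  priority: 50   # ignored; use Slurm/PBS priority or QOS settings instead
  weight: 1      # ignored
```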
Singularity and Docker Differences

Some constraints are due to differences in behavior between Docker and Singularity, summarized here:

- Singularity tends to explicitly share resources and devices from the host compute node on which it is running, which results in more opportunities for conflicts with other programs running on the cluster, or between multiple Determined experiments launched concurrently on the same compute node.

  - By default, /tmp and /dev/shm are mounted from the compute node instead of being private to the container. If multiple containers are running on the same node, there can be more sharing than expected. The contents of /tmp persist beyond the container lifetime and are visible to other trials. The experiment configuration might need to be updated to accommodate these issues.

  - Determined mitigates potential file name and disk space conflicts on /tmp by automatically using space in job_storage_root for a per-job /tmp directory. You can override this behavior by providing an explicit bind mount with a container_path of /tmp in the Singularity container.

  - You can restore the default Singularity behavior of sharing /tmp on the compute node by including the following bind mount in your experiment configuration, or globally by using the task_container_defaults section in your master configuration:

    ```yaml
    bind_mounts:
      - host_path: /tmp
        container_path: /tmp
    ```

  - The singularity.conf options can also be used to change this behavior, or individual environment variables can be added to your experiment. Some singularity.conf options that may be useful for tuning sharing:

    - sessiondir max size: Controls the disk space, in MB, allocated to support directories not shared from the host compute node, such as /tmp and /usr/tmp, depending upon your configuration.
    - mount tmp: Isolates /tmp from the host compute node. The size of this area is configured by sessiondir max size.
- Singularity attempts to automatically download and convert Docker images; however, the behavior is somewhat different than with Docker.

  - By default, converted Singularity images are stored per user in ~/.singularity. Determined environment images are relatively large, and this can result in excessive duplication. You likely want to pre-download images under singularity_image_root as described in Provide a Container Image Cache, or configure SINGULARITY_CACHEDIR to point to a shared directory.
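For example, the cache can be redirected with a single environment variable. The path below is hypothetical; in practice it should point at a directory on a file system shared by the relevant users or nodes:

```shell
# Hypothetical shared cache location; substitute a path on your shared file system.
export SINGULARITY_CACHEDIR="${HOME}/shared/singularity-cache"
mkdir -p "${SINGULARITY_CACHEDIR}"
echo "Singularity cache: ${SINGULARITY_CACHEDIR}"
```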
- Some Docker features do not have an exact replacement in Singularity, and therefore the associated Determined features are not supported:

  - resources.devices: By default, /dev is mounted from the compute host, so all devices are available. This can be overridden by the singularity.conf mount dev option.
  - resources.shm_size: By default, /dev/shm is mounted from the compute host. This can be overridden by the singularity.conf mount tmp option. When enabled, the size can be increased using compute node /etc/fstab settings.
  - environment.registry_auth.server: No equivalent setting in Singularity.
  - environment.registry_auth.email: No equivalent setting in Singularity.
Singularity Known Issues

- Launching a PBS job with an experiment configuration that includes an embedded double quote character (") may cause the job to fail unless you have Singularity 3.10 or greater or Apptainer 1.1 or greater. For example, the error might be json.decoder.JSONDecodeError, or the experiment log may contain source: /.inject-singularity-env.sh:224:1563: "export" must be followed by names or assignments and RuntimeError: missing environment keys [DET_MASTER, DET_CLUSTER_ID, DET_AGENT_ID, DET_SLOT_IDS, DET_TASK_ID, DET_ALLOCATION_ID, DET_SESSION_TOKEN, DET_TASK_TYPE], is this running on-cluster? The version of Singularity is detected by the HPC Launcher invoking the singularity command and checking for the --no-eval option. If the singularity command is not on the path for the HPC Launcher, or its version is inconsistent with that on the compute nodes, embedded double quote characters may still not work.

- When launching a shell or distributed experiment, the sshd inside the container may log the error chown(/dev/pts/0, 100165687, 5) failed: Invalid argument, and the shell or experiment may hang. This can be resolved by installing the singularity-suid installation package.
Apptainer Known Issues

- Starting with Apptainer version 1.1.0, some changes may trigger permission problems inside Determined containers for shells, TensorBoards, and experiments. For example, a TensorBoard log may contain ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device, a shell may fail to function with the log message chown(/dev/pts/1, 63200, 5) failed: Invalid argument, or an experiment may fail to launch with FATAL: container creation failed: mount /var/tmp->/var/tmp error: while mounting /var/tmp: could not mount /var/tmp: operation not supported. This likely indicates an installation or configuration error for unprivileged containers. Review the Installing Apptainer documentation. These errors are sometimes resolved by additionally installing the apptainer-setuid package.

- When launching a shell or distributed experiment, the sshd inside the container may log the error chown(/dev/pts/0, 100165687, 5) failed: Invalid argument, and the shell or experiment may hang. This can be resolved by installing the singularity-suid installation package.
Podman Known Issues

- Determined uses Podman in rootless mode. Several configuration errors may be encountered:

  - stat /run/user/NNN: no such file or directory likely indicates that the environment variable XDG_RUNTIME_DIR references a directory that does not exist.
  - stat /run/user/NNN: permission denied may indicate a problem with the default runroot configuration.
  - Error: A network file system with user namespaces is not supported. Please use a mount_program: backing file system is unsupported for this graph driver indicates that the graphroot references a distributed file system.

  Refer to Podman Requirements for recommendations.
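A quick diagnostic for the first two errors is to confirm that XDG_RUNTIME_DIR points at an existing, writable directory on the compute node. A minimal sketch (the helper function is our own, not part of Determined or Podman):

```shell
# check_runtime_dir: report whether a candidate XDG_RUNTIME_DIR is usable.
check_runtime_dir() {
  dir="$1"
  if [ -z "$dir" ]; then
    echo "unset"
  elif [ ! -d "$dir" ]; then
    echo "missing"        # corresponds to 'no such file or directory'
  elif [ ! -w "$dir" ]; then
    echo "unwritable"     # corresponds to 'permission denied'
  else
    echo "ok"
  fi
}

check_runtime_dir "${XDG_RUNTIME_DIR:-}"
check_runtime_dir "/tmp"
```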
- On a Slurm cluster, it is common to rely upon /etc/hosts (instead of DNS) to resolve the addresses of the login node and other compute nodes in the cluster. If jobs are unable to resolve the address of the Determined master or other compute nodes in the job, and you are relying on /etc/hosts, check the following:

  - Ensure that the /etc/hosts file is being mounted in the container by a bind mount in the task_container_defaults section of your master configuration, as shown below. Unlike Singularity, Podman v4.0+ no longer maps /etc/hosts from the host into the running container by default. On initial startup, the Determined Slurm launcher automatically adds the task_container_defaults fragment below when adding the resource_manager section. If, however, you have since changed the file, you may need to manually add the bind mount to ensure that jobs can resolve all host addresses in the cluster:

    ```yaml
    task_container_defaults:
      bind_mounts:
        - host_path: /etc/hosts
          container_path: /etc/hosts
    ```

  - Ensure that the names and addresses of the login node, admin node, and all compute nodes are consistently available in /etc/hosts on all nodes.
- Podman containers only inherit environment variables that have been explicitly specified. Determined adds Podman arguments to provide any Determined-configured environment variables, and the launcher enables inheritance of the following variables: SLURM_*, CUDA_VISIBLE_DEVICES, NVIDIA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES. You may enable the inheritance of additional variables from the host environment by specifying the variable name with an empty value in the environment_variables section of your experiment configuration or task container defaults:

  ```yaml
  environment_variables:
    - INHERITED_ENV_VAR=
  ```
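For instance, a sketch combining inherited and explicitly set variables (the variable names here are hypothetical):

```yaml
environment_variables:
  - INHERITED_ENV_VAR=        # empty value: inherit from the host environment
  - MY_SETTING=some-value     # non-empty value: set explicitly in the container
```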
- Terminating a Determined AI job may cause the following conditions to occur:

  - Compute nodes go into the drain state.
  - Processes inside the container continue to run.
  - An attempt to run another job results in the error level=error msg="invalid internal status, try resetting the pause process with \"/usr/local/bin/podman system migrate\": could not find any running process: no such process".

  Podman creates several processes when running a container, such as podman, conmon, and catatonit. When a user terminates a Determined AI job, Slurm sends a SIGTERM to the podman processes; however, sometimes the container continues running even after the SIGTERM has been sent. On Slurm versions prior to version 22, Slurm will place the node in the drain state, requiring the use of the scontrol command to set the node back to the idle state. It may also require podman system migrate to be run to clean up the running containers.

  To ensure the container associated with the job is stopped when a Determined AI job is terminated, create a Slurm task epilog script to stop the container:
  - Set the TaskEpilog option in the slurm.conf file, as shown below, to point to a script that resides in a shared filesystem accessible from all compute nodes.

    ```
    TaskEpilog=/path/to/task_epilog.sh
    ```
  - Set the contents of the task epilog script as shown below.

    ```bash
    #!/usr/bin/env bash
    slurm_job_name_suffix=$(echo ${SLURM_JOB_NAME} | sed 's/^\S\+-\([a-z0-9]\+-[a-z0-9]\+\)$/\1/')
    if ps -fe | grep -E "[p]odman run .*-name ${SLURM_JOB_USER}-\S+-${slurm_job_name_suffix}" > /dev/null
    then
        timeout -k 15s 15s bash -c "while ps -fe | grep -E \"[c]onmon .*-n ${SLURM_JOB_USER}-\S+-${slurm_job_name_suffix}\" > /dev/null 2>&1; do sleep 1; done"
        podman_container_stop_command="podman container stop --filter name='.+-${slurm_job_name_suffix}'"
        echo "$(date):$0: Running \"${podman_container_stop_command}\"" 1>&2
        eval ${podman_container_stop_command}
    fi
    exit 0
    ```
  - Restart the slurmd daemon on all compute nodes.
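The sed expression in the epilog script extracts the trailing two hyphen-separated tokens of the Slurm job name, which are then used to match the podman/conmon process and container names. The extraction can be exercised standalone (the job name below is a made-up example following the Determined naming pattern; real names may contain additional characters):

```shell
# Extract the suffix used to match the podman/conmon process names,
# using the same sed expression as the epilog script.
suffix_of() {
  echo "$1" | sed 's/^\S\+-\([a-z0-9]\+-[a-z0-9]\+\)$/\1/'
}

# Hypothetical Determined job name: a prefix plus two trailing tokens.
suffix_of "det-ai_exp-2221-trial-15853-abc123-0"
# → abc123-0
```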
Enroot Known Issues

- Enroot uses XDG_RUNTIME_DIR, which is not provided to compute jobs by Slurm/PBS by default. The error mkdir: cannot create directory ‘/run/enroot’: Permission denied indicates that the environment variable XDG_RUNTIME_DIR is not defined on the compute nodes. See Podman Requirements for recommendations.

- Enroot requires manual download and creation of containers. The error [ERROR] No such file or directory: /home/users/test/.local/share/enroot/determinedai+environments+cuda-11.1-base-gpu-mpi-0.18.5 indicates that the user test has not created an Enroot container for the Docker image determinedai/environments:cuda-11.1-base-gpu-mpi-0.18.5. Check the available containers using the enroot list command. See Enroot Requirements for guidance on creating Enroot containers. Enroot does not provide a mechanism for sharing containers; each user must create any containers needed by their Determined experiments prior to creating the experiment.
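The Enroot path in the error above appears to be derived from the Docker image reference by replacing / and : with +. A small sketch of that mapping, useful for predicting the container name enroot list should show (the helper is our own, not an Enroot command):

```shell
# Map a Docker image reference to the Enroot container name it is looked up as.
enroot_name() {
  echo "$1" | tr '/:' '++'
}

enroot_name "determinedai/environments:cuda-11.1-base-gpu-mpi-0.18.5"
# → determinedai+environments+cuda-11.1-base-gpu-mpi-0.18.5
```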
- Some Docker features do not have an exact replacement in Enroot, and therefore the associated Determined features are not supported:

  - resources.devices: Managed via Enroot configuration files.
  - resources.shm_size: Managed via Enroot configuration files.
  - environment.registry_auth.server: No equivalent setting in Enroot.
  - environment.registry_auth.email: No equivalent setting in Enroot.
Slurm Known Issues

- Jobs may fail to submit on Slurm versions 22.05.5 through 22.05.8 with the message error: Unable to allocate resources: Requested node configuration is not available. These Slurm versions are not supported due to Slurm Bug 15857, which was addressed in 22.05.9 and 23.02.0.
- A Determined experiment remains QUEUED for an extended period:

  - If Slurm provides a reason code for the QUEUED state of the job, the reason description from JOB REASON CODES is added to the experiment/task log as an informational message, such as:

    ```
    INFO: HPC job waiting to be scheduled: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions
    ```
  - In some cases, it may be helpful to inspect the details of your queued jobs using the Slurm scontrol show job command with the HPC Job ID displayed in the experiment/task log. An example of the command output is shown below.

    ```
    $ scontrol show job 109084
    JobId=109084 JobName=det-ai_exp-2221-trial-15853-2221.33b6fcca-564d-47a7-ab2e-0d2a4a90a0f1.1
       UserId=user(1234) GroupId=users(100) MCS_label=N/A
       Priority=4294866349 Nice=0 Account=(null) QOS=normal
       JobState=PENDING Reason=Priority Dependency=(null)
       Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
       RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
       SubmitTime=2023-07-03T16:01:35 EligibleTime=2023-07-03T16:01:35
       AccrueTime=2023-07-03T16:01:35
       StartTime=Unknown EndTime=Unknown Deadline=N/A
       SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-03T16:06:15 Scheduler=Backfill:*
       Partition=mlde_rocm AllocNode:Sid=o184i054:755599
       ReqNodeList=o186i[122-123] ExcNodeList=(null)
       NodeList=
       NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
       ReqTRES=cpu=1,mem=256G,node=1,billing=1,gres/gpu=1
       AllocTRES=(null)
       Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
       MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
       Features=(null) DelayBoot=00:00:00
       OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
       Command=/cstor/determined/o184i054-jobs/jobs/environments/vishnu/2221.33b6fcca-564d-47a7-ab2e-0d2a4a90a0f1.1/ai_exp-2221-trial-15853-job.sh
       WorkDir=/var/tmp
       StdErr=/cstor/determined/o184i054-jobs/jobs/environments/vishnu/2221.33b6fcca-564d-47a7-ab2e-0d2a4a90a0f1.1/ai_exp-2221-trial-15853-error.log
       StdIn=/dev/null
       StdOut=/cstor/determined/o184i054-jobs/jobs/environments/vishnu/2221.33b6fcca-564d-47a7-ab2e-0d2a4a90a0f1.1/ai_exp-2221-trial-15853-output.log
       Power=
       CpusPerTres=gres:gpu:64
       MemPerTres=gres:gpu:262144
       TresPerJob=gres:gpu:1
    ```
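When skimming long scontrol output, the fields that usually matter for a stuck job are JobState, Reason, and the requested resources. A sketch that pulls those out of saved output (the sample below is abbreviated and hypothetical):

```shell
# Extract the scheduling-relevant fields from saved 'scontrol show job' output.
sample='JobId=109084 JobName=det-ai_exp-2221
   JobState=PENDING Reason=Priority Dependency=(null)
   ReqTRES=cpu=1,mem=256G,node=1,billing=1,gres/gpu=1'

state=$(echo "$sample" | grep -o 'JobState=[A-Z]*' | cut -d= -f2)
reason=$(echo "$sample" | grep -o 'Reason=[A-Za-z]*' | cut -d= -f2)
echo "state=${state} reason=${reason}"
# → state=PENDING reason=Priority
```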
  - The Slurm job state (see JOB STATE CODES) may help identify the delay in scheduling. If the Slurm job state is PENDING, review the resources being requested and the Reason code to identify the cause. To better understand how resource requests are derived by Determined, see HPC Launching Architecture. Some common reason codes for PENDING are:

    - PartitionNodeLimit: Ensure that the job is not requesting more nodes than MaxNodes of the partition.

      - Ensure that the MaxNodes setting for the partition is at least as high as the number of GPUs in the partition. The MaxNodes value for a partition can be viewed in the JOB_SIZE column of the command:

        ```
        sinfo -O Partition,Size,Gres,OverSubscribe,NodeList,StateComplete,Reason
        PARTITION JOB_SIZE GRES OVERSUBSCRIBE NODELIST STATECOMPLETE REASON
        defq* 1-infinite gpu:tesla:4 NO node002 idle none
        ```

      - Until scheduled, the job’s NumNodes is shown as the range 1-slots_per_trial. Ensure the slots_per_trial shown is not larger than the value shown in the JOB_SIZE column for the partition.

      - A second potential cause of PartitionNodeLimit is submitting CPU experiments (or when the Determined cluster is configured with gres_supported: false) without specifying slurm.slots_per_node to enable multiple CPUs to be used on each node. Without slurm.slots_per_node, the job will request slots_per_trial nodes.

    - Priority: One or more higher-priority jobs exist for this partition or advanced reservation.

    - Resources: Expected when resources are in use by other jobs. Otherwise, verify you have not requested more resources (GPUs, CPUs, nodes, memory) than are available in your cluster.
PBS Known Issues

- If the Cluster tab in the WebUI does not display the GPU information, there may be an issue with the PBS configuration. Visit the Ensure the ngpus resource is defined with the correct values section to ensure PBS is properly configured.

- Jobs are treated as successful, even in the presence of a failure, when PBS job history is not enabled. Without job history enabled, the launcher is unable to obtain the exit status of jobs, and therefore they are all reported as successful. This prevents failed jobs from automatically restarting, and a job that fails to start running at all may be reported as completed with no error message. Refer to PBS Requirements.
AMD/ROCm Known Issues

- AMD/ROCm support is available only with Singularity containers. While Determined does add the proper Podman arguments to enable ROCm GPU support, those capabilities have not yet been verified.

- Launching experiments with slot_type: rocm may fail with the error RuntimeError: No HIP GPUs are available. Ensure that the compute nodes provide ROCm drivers and libraries compatible with the environment image that you are using, and that they are available in the default locations or are added to the path and/or ld_library_path variables in the Slurm configuration. Depending upon your system configuration, you may need to select a different ROCm image. See Set Environment Images for the images available.

- Launching experiments with slot_type: rocm may fail in the AMD/ROCm libraries with the error terminate called after throwing an instance of 'boost::filesystem::filesystem_error' what(): boost::filesystem::remove: Directory not empty: "/tmp/miopen-.... A potential workaround is to disable the per-container /tmp by adding the following bind mount in your experiment configuration, or globally by using the task_container_defaults section in your master configuration:

  ```yaml
  bind_mounts:
    - host_path: /tmp
      container_path: /tmp
  ```
Determined AI Experiment Requirements

Ensure that the following requirements are met in your experiment configuration.

Distributed jobs must allocate the same number of resources on each compute node. Slurm/PBS will not enforce this constraint by default. It is, therefore, recommended that you include a slots_per_node setting in your experiment configuration to ensure that Slurm/PBS provides a consistent allocation on each node. Your slots_per_trial configuration should then be a multiple of slots_per_node.
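As a sketch of a configuration satisfying this constraint (values hypothetical, using the slurm section mentioned above): with slots_per_trial: 16 and slots_per_node: 8, the workload manager would allocate exactly 2 nodes with 8 slots each.

```yaml
resources:
  slots_per_trial: 16   # total slots for the trial
slurm:
  slots_per_node: 8     # 16 / 8 = 2 nodes, evenly allocated
```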
Additional Known Issues

- The Determined master may fail to show HPC cluster information and report Failed to communicate with launcher due to error: in the Master Logs tab of the Determined UI. If so, verify the following:

  - Ensure that the launcher service is up and running:

    ```
    sudo systemctl status launcher
    ```

  - If the full error is Failed to communicate with launcher due to error: {401 Unauthorized}, the Determined master does not have an up-to-date authorization token to access the launcher. Restart the launcher to ensure all configuration changes have been applied:

    ```
    sudo systemctl restart launcher
    sudo systemctl status launcher
    ```

  - Once it has successfully started, you should see the message INFO: launcher server ready ...; then restart the Determined master so it likewise loads the latest configuration:

    ```
    sudo systemctl restart determined-master
    sudo systemctl status determined-master
    ```

  - Additional diagnostic messages may be present in the system logs, such as /var/log/messages, journalctl --since=yesterday -u launcher, and journalctl --since=yesterday -u determined-master.
- The SSH server process within Determined environment images can fail with a free(): double free detected in tcache 2 message, a Fatal error: glibc detected an invalid stdio handle message, or simply close the connection with no message. This problem has been observed when using the det shell start command and when running distributed, multi-node training jobs. It is suspected to be triggered by passwd/group configurations that use NIS/YP/LDAP accounts on the compute host. By default, these settings are propagated to the Singularity container and can result in sshd aborting the connection, with or without an error message, depending on the exact configuration.

  A workaround is to supply a customized nsswitch.conf file to the Singularity container and enable only files for the passwd/group elements. This can be accomplished using the following steps:

  - Create a file on a shared file system, such as /home/shared/determined/nsswitch.conf, with the following content, potentially further tuned for your environment:

    ```
    passwd: files determined
    shadow: files determined
    group: files determined
    hosts: files dns
    ```

  - Update the Determined cluster configuration to supply a default bind mount to override the /etc/nsswitch.conf in the container:

    ```yaml
    task_container_defaults:
      bind_mounts:
        - host_path: /home/shared/determined/nsswitch.conf
          container_path: /etc/nsswitch.conf
    ```

  - Reload the Determined master to allow it to pull in the updated configuration.
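The file-creation step can be sketched end-to-end as follows; the destination here is a temporary directory for illustration, whereas in practice the file belongs on a shared file system such as the /home/shared/determined path used above:

```shell
# Create the customized nsswitch.conf (a temp dir stands in for the shared path).
dest=$(mktemp -d)/nsswitch.conf
cat > "$dest" <<'EOF'
passwd: files determined
shadow: files determined
group: files determined
hosts: files dns
EOF

# Sanity-check that only 'files determined' is enabled for passwd/group.
grep -E '^(passwd|group): files determined$' "$dest"
```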
  The user/group configuration is typically injected in /etc/passwd within the Singularity container, so disabling the NIS/YP/LDAP accounts within the container should not result in any lost capability.

- The Determined CLI can fail with a Your requested host "localhost" could not be resolved by DNS. message. This has been observed when the http_proxy or https_proxy environment variables are set but have not excluded sending localhost, or the Determined master hostname, to the proxy server. Update the environment settings configured for the proxy to also include:

  ```
  export no_proxy=localhost,127.0.0.1
  ```
- The automated download of Docker containers by Singularity may fail with the error loading registries configuration: reading registries.conf.d: lstat /root/.config/containers/registries.conf.d: permission denied when Docker login information is not provided. This happens when access to an otherwise public container image is blocked by the Docker Hub download rate limit, or when the container is in a private registry. You can avoid this problem by either:

  - Manually downloading the container image as described in Provide a Container Image Cache.
  - Providing a Docker login via the experiment configuration using the environment.registry_auth.username and environment.registry_auth.password options.
- Use of NVIDIA Multi-Process Service (MPS) with Determined may trigger the error RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable. By default, MPS depends upon a shared /tmp directory between the compute node and the container to function properly. As noted in Singularity and Docker Differences, sharing /tmp between the compute node and the container is not the default behavior for the Determined Slurm integration. When using MPS, use one of the following workarounds:

  - If the capabilities of MPS are not required, disable or uninstall the MPS service. See nvidia-cuda-mps-control or the relevant documentation associated with your installation package.
  - Configure the MPS variable CUDA_MPS_PIPE_DIRECTORY to use a directory other than /tmp (e.g. /dev/shm).
  - Restore the sharing of /tmp between the compute node and the container as described in Singularity and Docker Differences.

  For more information on MPS, refer to the NVIDIA Multi-Process Service (MPS) Documentation.
- Experiments on CPU-only clusters will fail when the requested slot count exceeds the maximum number of CPUs on any single node. This behavior is due to a limitation of the Slurm workload manager: Slurm does not provide an option to request a certain number of CPUs without specifying the number of nodes/tasks. To overcome this limitation, Determined sets a default value of 1 for the number of nodes. With this workaround, when a user launches an experiment on a CPU-only cluster, Slurm tries to identify a single node that can completely satisfy the requested number of slots (CPUs). If such a node is available, Slurm allocates the resources and continues the execution of the experiment. Otherwise, Slurm reports that the resource request could not be satisfied, as shown in the example below.

  ```
  ERROR: task failed without an associated exit code: sbatch: error: CPU count per node can not be satisfied
  sbatch: error: Batch job submission failed: Requested node configuration is not available.
  ```
- A job may fail with the message resources failed with non-zero exit code; Determined reports the exit code in the experiment logs. For example, the experiment log contains srun: error: node002: task 0: Exited with exit code 7.

- The det slot enable and det slot disable commands are not supported; use of these commands will print an error message. det slot list will not display the name of any active Determined tasks.
Package Verification
The launcher installation package supports the verification of both RPM and DEB packages. There will be several configuration files that the package manager will identify as modified, and with RPM-based installs, some files will show user/group modifications.
For an RPM-based installation, run sudo rpm -V hpe-hpc-launcher which should produce output
similar to that shown below:
S.5....T. c /etc/launcher/launcher.conf
S.5....T. /etc/launcher/suid.conf
S.5....T. /etc/sudoers.d/zz_launcher
.....U... /opt/launcher/bin/capsules-dev-keytool.jar
.....U... /opt/launcher/bin/dev-keytool
.....U... /opt/launcher/bin/user-keytool
.....U... /opt/launcher/jetty/base/etc/keystore
S.5....T. /opt/launcher/jetty/base/resources/dispatcher.properties
.....U... /opt/launcher/sbin
......G.. /opt/launcher/sbin/suid
INFO: The following file modifications are expected:
/etc/launcher/launcher.conf
/etc/launcher/suid.conf
/etc/sudoers.d/zz_launcher
/opt/launcher/jetty/base/resources/dispatcher.properties
INFO: The following file owner/group changes are expected:
/opt/launcher/bin/capsules-dev-keytool.jar
/opt/launcher/bin/dev-keytool
/opt/launcher/bin/user-keytool
/opt/launcher/sbin
/opt/launcher/sbin/suid
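The leading characters in each rpm -V line encode what changed about the file. A small decoder sketch for the codes appearing in the output above, following the standard rpm verify codes (S size, 5 digest, T mtime, U user, G group):

```shell
# Decode the rpm -V attribute column: each letter reports one failed check.
decode_rpm_verify() {
  flags="$1"
  case "$flags" in *S*) echo "size differs";; esac
  case "$flags" in *5*) echo "digest (MD5) differs";; esac
  case "$flags" in *T*) echo "mtime differs";; esac
  case "$flags" in *U*) echo "user ownership differs";; esac
  case "$flags" in *G*) echo "group ownership differs";; esac
}

decode_rpm_verify "S.5....T."
# → size differs
# → digest (MD5) differs
# → mtime differs
```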
On Debian distributions, run sudo dpkg -V hpe-hpc-launcher which should produce output similar
to that shown below:
??5?????? c /etc/launcher/launcher.conf
??5?????? c /etc/launcher/suid.conf
??5?????? c /etc/sudoers.d/zz_launcher
??5?????? /opt/launcher/jetty/base/resources/dispatcher.properties