Provide a Container Image Cache#

When the cluster does not have Internet access or if you want to provide a local cache of container images to improve performance, you can download the desired container images to a shared directory and then reference them using file system paths instead of Docker registry references.

There are two mechanisms you can use to reference cached container images depending upon the container runtime in use.

Default Docker Images#

Each version of Determined utilizes specifically-tagged Docker containers. The image tags referenced by default in this version of Determined are described below.

Environment     File Name
-----------     -------------------------------------------------------------------
CPUs            determinedai/pytorch-ngc:0.38.0
NVIDIA GPUs     determinedai/pytorch-ngc:0.38.0
AMD GPUs        determinedai/environments:rocm-5.0-pytorch-1.10-tf-2.7-rocm-622d512

See Set Environment Images for the Docker Hub location of these images, and add each tagged image needed by your experiments to the image cache.
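
For example, assuming internet access, you could first pull the default NVIDIA GPU image listed above into the local Podman image store before saving it to the shared cache:

podman pull docker.io/determinedai/pytorch-ngc:0.38.0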

Referencing Local Image Paths#

Singularity and Podman each support various local container file formats and reference them using slightly different syntax. To use a cached image, reference its local path in the environment.image setting of your experiment configuration. When using this strategy, the local directory must be accessible on all compute nodes.

When using Podman, you could save images in OCI archive format to files in a local directory /shared/containers:

podman save determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-096d730 \
   --format=oci-archive \
   -o /shared/containers/cuda-11.3-pytorch-1.10-tf-2.8-gpu

and then reference the image in your experiment configuration using the syntax below.

environment:
   image: oci-archive:/shared/containers/cuda-11.3-pytorch-1.10-tf-2.8-gpu

When using Singularity, you could save SIF files in a local directory /shared/containers:

singularity pull /shared/containers/cuda-11.3-pytorch-1.10-tf-2.8-gpu.sif \
   docker://determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-096d730

and then reference the image in your experiment configuration using its full path, as shown below.

environment:
   image: /shared/containers/cuda-11.3-pytorch-1.10-tf-2.8-gpu.sif

Set these image file references as the default for all jobs by specifying them in the task_container_defaults section of the /etc/determined/master.yaml file.
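
For example, a minimal task_container_defaults sketch reusing the Singularity SIF path from above (a Podman oci-archive: reference works the same way):

task_container_defaults:
   image: /shared/containers/cuda-11.3-pytorch-1.10-tf-2.8-gpu.sif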

Note: If you specify an image using task_container_defaults, you prevent new environment container image versions from being adopted on each update of Determined.

Configuring an Apptainer/Singularity Image Cache Directory#

When using Apptainer/Singularity, you may use Referencing Local Image Paths as described above, or you may instead configure a directory tree of images to be searched. To utilize this capability, configure a shared directory in resource_manager.singularity_image_root. The shared directory needs to be accessible to the launcher and on all compute nodes. Whenever an image is referenced, it is translated to a local file path as described in environment.image. If found, the local path is substituted in the singularity run command to avoid the need for Singularity to download and convert the image for each user.
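
A minimal sketch of the corresponding master.yaml fragment, assuming a Slurm-based resource manager and an illustrative shared path:

resource_manager:
   type: slurm
   singularity_image_root: /shared/containers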

You can manage the content of this directory tree manually, or you can use the manage-singularity-cache script, which automates the same steps. To populate the cache manually, add each tagged image required by your environment and your experiments using the following steps:

  1. Under the singularity_image_root directory, create a subdirectory matching the prefix of the image name. For example, the image determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-096d730 is added under the directory determinedai.

    cd $singularity_image_root
    mkdir determinedai
    
  2. If your system has internet access, you can download images directly into the cache.

    cd $singularity_image_root
    image="determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-096d730"
    singularity pull $image docker://$image
    
  3. Otherwise, from an internet-connected system, download the desired image using the Singularity pull command, then copy it to the determinedai folder under singularity_image_root.

    image="determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-096d730"
    singularity pull \
          temporary-image \
          docker://$image
    scp temporary-image mycluster:$singularity_image_root/$image
    

Managing the Singularity Image Cache using the manage-singularity-cache script#

The HPC launcher installation provides a convenience script, /usr/bin/manage-singularity-cache, to simplify the management of the Singularity image cache. The script manages the content of the cache directory and helps ensure the proper naming, placement, and permissions of content added to it. Adding container images to the cache avoids the overhead of repeated downloads and allows images to be shared between multiple users. The script provides the following features:

  • Download the Determined default cuda, cpu, or rocm environment images

  • Download an arbitrary Docker image reference

  • Copy a local Singularity image file into the cache

  • List the currently available images in the cache

If your system has internet access, you can download images directly into the cache. Use the --cuda, --cpu, or --rocm options to download the current default CUDA, CPU, or ROCM environment container image into the cache. For example, to download the default CUDA container image, use the following command:

manage-singularity-cache --cuda

If your system has internet access, you can download any desired Docker container image (e.g. determinedai/environments:py-3.8-pytorch-1.10-tf-2.8-cpu-096d730) into the cache using the command:

manage-singularity-cache determinedai/environments:py-3.8-pytorch-1.10-tf-2.8-cpu-096d730

Otherwise, from an internet-connected system, download the desired image using the Singularity pull command, then copy it to a system with access to the singularity_image_root folder. You can then add the image to the cache by specifying the local file name with -i, along with the Docker image reference, which determines the name under which the image is added to the cache.

manage-singularity-cache -i localfile.sif determinedai/environments:py-3.8-pytorch-1.10-tf-2.8-cpu-096d730

You can view the current set of Docker image names in the cache with the -l option.

manage-singularity-cache -l
determinedai/environments:py-3.8-pytorch-1.10-tf-2.8-cpu-096d730
determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-096d730

Managing the Enroot Image Cache using the manage-enroot-cache script#

The convenience script /usr/bin/manage-enroot-cache simplifies the management of a set of shared Enroot .sqsh file downloads and can then create an Enroot container for use by the current user. It provides the following features:

  • Download the Determined default cuda, cpu, or rocm environment images

  • Download an arbitrary Docker image reference

  • Share a directory of re-usable imported .sqsh files

  • Optionally, create a per-user container from a shared .sqsh file

  • List the currently available images in the shared .sqsh file cache

When using manage-enroot-cache, you must provide a download directory via the -s option, which is used to download (enroot import) the associated Enroot .sqsh file. The .sqsh file is read by the enroot create command to generate the container. The directory need only be accessible on the local host. If the directory you specify is shared with other users, the script re-uses any previously downloaded .sqsh files and directly creates an Enroot container (enroot create) without needing a separate download.

To download the shared cache .sqsh files for the current default Determined CUDA and CPU images (enroot import) and then create the associated containers for the current user (enroot create), use the following command:

manage-enroot-cache -s /shared/enroot --cuda --cpu

To download the shared cache .sqsh file for an arbitrary Docker image (enroot import) and then create a container from it for the current user (enroot create), use the following command:

manage-enroot-cache -s /shared/enroot determinedai/environments:cuda-10.2-base-gpu-mpi-0.19.4

If you only want the sharable .sqsh file without the overhead of container creation, use the --nocreate option:

manage-enroot-cache -s /shared/enroot --nocreate determinedai/environments:cuda-10.2-base-gpu-mpi-0.19.4

To configure credentials for image downloads, follow the Enroot documentation and specify the user name with the --username option:

manage-enroot-cache -s /shared/enroot --username <username-here> --cuda --cpu

Note that --username is positional: if used, it must appear before any image reference, as in the example below.
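
For example, the same option combined with the arbitrary image reference used earlier:

manage-enroot-cache -s /shared/enroot --username <username-here> determinedai/environments:cuda-10.2-base-gpu-mpi-0.19.4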

You can view the current set of Docker image names in the cache with the -l option.

manage-enroot-cache -s /shared/enroot -l

Singularity/Apptainer Shells#

When using Determined AI with Singularity/Apptainer for interactive shells, there are some important behaviors to understand, particularly regarding home directories and file locations.

Home Directory Binding#

Singularity/Apptainer automatically binds the user’s home directory to the container. This means:

  • When you start a shell, you will be placed in your actual system home directory, rather than the working directory where files are copied.

  • This behavior differs from Docker containers and can be confusing if you’re expecting to see your files immediately.

Usage Example#

To start a shell and access your copied files:

det shell start --config-file config.yaml --context .
cd /run/determined/workdir
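
The config.yaml referenced above can be as small as an environment.image entry; a minimal sketch reusing the cached SIF path from earlier (the path is illustrative):

environment:
   image: /shared/containers/cuda-11.3-pytorch-1.10-tf-2.8-gpu.sif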

Note

Remember that your initial working directory will be your home directory, not where the files were copied. Always navigate to the correct directory to find your copied files.

For more information on shell configuration options, refer to Job Configuration Reference.

File Locations#

When using the det shell start command with the --context option:

  • Files are copied into the container at /run/determined/workdir.

  • Due to the home directory binding, you won’t see these files in your initial working directory.

  • To access your files, navigate to /run/determined/workdir once your shell starts.