Provide a Container Image Cache

When the cluster does not have internet access, or when you want a local cache of container images to improve performance, you can download the desired container images to a shared directory and then reference them by file system path instead of by Docker registry reference.

There are two mechanisms for referencing cached container images, depending on the container runtime in use.

Default Docker Images

Each version of Determined uses specifically tagged Docker containers. The image tags referenced by default in this version of Determined are listed below.

Environment    File Name
CPUs           determinedai/environments:py-3.8-pytorch-1.10-tf-2.8-cpu-24586f0
NVIDIA GPUs    determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-24586f0
AMD GPUs       determinedai/environments:rocm-5.0-pytorch-1.10-tf-2.7-rocm-24586f0

See Set Environment Images for the images' Docker Hub location, and add each tagged image needed by your experiments to the image cache.

Referencing Local Image Paths

Singularity and Podman each support various local container file formats and reference them using slightly different syntax. To use a cached image, reference its local path in the experiment configuration option environment.image. When using this strategy, the local directory needs to be accessible on all compute nodes.

When using Podman, you could save images in OCI archive format to files in a local directory, /shared/containers:

podman save determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-096d730 \
  --format=oci-archive \
  -o /shared/containers/cuda-11.3-pytorch-1.10-tf-2.8-gpu

and then reference the image in your experiment configuration using the syntax below.

environment:
   image: oci-archive:/shared/containers/cuda-11.3-pytorch-1.10-tf-2.8-gpu

When using Singularity, you could save SIF files in a local directory, /shared/containers:

singularity pull /shared/containers/cuda-11.3-pytorch-1.10-tf-2.8-gpu.sif \
   docker://determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-096d730

and then reference the image in your experiment configuration by its full path, using the syntax below.

environment:
   image: /shared/containers/cuda-11.3-pytorch-1.10-tf-2.8-gpu.sif

To make these image file references the default for all jobs, specify them in the task_container_defaults section of the /etc/determined/master.yaml file.

Note: If you specify an image using task_container_defaults, new environment container image versions are no longer adopted automatically when you update Determined.
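
As a rough sketch of the task_container_defaults setting described above, the shell function below prints a fragment that could be reviewed and merged into master.yaml by hand. The nested image keys cpu and cuda are an assumption about the master configuration layout and should be checked against the master configuration reference for your Determined version. The function name print_image_defaults is hypothetical.

```shell
# Hypothetical helper, not part of Determined: print a
# task_container_defaults fragment for manual review.
# $1 = image path for CPU tasks, $2 = image path for NVIDIA GPU tasks.
# The cpu/cuda key names are an assumption; verify them for your version.
print_image_defaults() {
    cat <<EOF
task_container_defaults:
  image:
    cpu: $1
    cuda: $2
EOF
}

# Example call (paths are illustrative):
#   print_image_defaults /shared/containers/cpu.sif \
#       /shared/containers/cuda-11.3-pytorch-1.10-tf-2.8-gpu.sif
```

With Podman, the same call would take oci-archive: references instead of SIF paths.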

Configuring a Singularity Image Cache Directory

When using Singularity, you may reference local image paths as described above, or you may instead configure a directory tree of images to be searched. To use this capability, set resource_manager.singularity_image_root to a shared directory that is accessible to the launcher and to all compute nodes. Whenever an image is referenced, it is translated to a local file path as described in environment.image. If the file is found, the local path is substituted in the singularity run command, so that Singularity does not need to download and convert the image for each user.
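
The lookup described above can be pictured as joining the cache root and the image reference. The function below is only an illustration, assuming the layout used in the manual steps in this section (the image is stored at singularity_image_root followed by the image reference) and assuming that an image missing from the cache is passed through unchanged. It is not the launcher's actual code, and the name resolve_cached_image is hypothetical.

```shell
# Hypothetical helper, not part of Determined: resolve an image reference
# against the cache root and report which value would be used.
resolve_cached_image() {
    root="$1"     # value of resource_manager.singularity_image_root
    image="$2"    # image reference, e.g. determinedai/environments:<tag>
    candidate="$root/$image"
    if [ -f "$candidate" ]; then
        printf '%s\n' "$candidate"    # cached copy found: use the local path
    else
        printf '%s\n' "$image"        # not cached: keep the original reference
    fi
}
```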

You can manage the content of this directory tree manually, or you may use the manage-singularity-cache script, which automates the same steps. To populate the cache manually, add each tagged image required by your environment and your experiments using the following steps:

  1. In the singularity_image_root directory, create a directory path that matches the prefix of the image name. For example, the image determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-096d730 is stored under the directory determinedai.

    cd $singularity_image_root
    mkdir determinedai
    
  2. If your system has internet access, you can download images directly into the cache.

    cd $singularity_image_root
    image="determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-096d730"
    singularity pull $image docker://$image
    
  3. Otherwise, from an internet-connected system, download the desired image using the singularity pull command, then copy it to the determinedai folder under singularity_image_root.

    image="determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-096d730"
    singularity pull \
          temporary-image \
          docker://$image
    scp temporary-image mycluster:$singularity_image_root/$image
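
Steps 1 and 2 above can be combined into a small shell function. This is a minimal sketch, not part of Determined: the function name add_to_cache and the SINGULARITY override variable are hypothetical, and the override exists only so that the command can be previewed with SINGULARITY=echo before it is run for real.

```shell
# Hypothetical helper: add one image to the cache directory tree.
# $1 = cache root (singularity_image_root), $2 = image reference.
add_to_cache() {
    root="$1"
    image="$2"
    # Directory prefix is everything before the last "/" in the image name,
    # e.g. "determinedai" for determinedai/environments:<tag>.
    prefix=$(dirname "$image")
    mkdir -p "$root/$prefix"
    # Pull relative to the cache root so the file lands at $root/$image.
    # Set SINGULARITY=echo to print the arguments instead of pulling.
    ( cd "$root" && ${SINGULARITY:-singularity} pull "$image" "docker://$image" )
}

# Example call:
#   add_to_cache "$singularity_image_root" \
#       determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-096d730
```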
    

Managing the Singularity Image Cache using the manage-singularity-cache script

A convenience script, /usr/bin/manage-singularity-cache, is provided by the HPC launcher installation to simplify management of the Singularity image cache directory. It helps ensure the proper name, placement, and permissions of content added to the cache. Adding container images to the Singularity image cache avoids the overhead of downloading the images and allows images to be shared between multiple users. The script provides the following features:

  • Download the Determined default cuda, cpu, or rocm environment images

  • Download an arbitrary docker image reference

  • Copy a local Singularity image file into the cache

  • List the currently available images in the cache

If your system has internet access, you can download images directly into the cache. Use the --cuda, --cpu, or --rocm option to download the current default CUDA, CPU, or ROCm environment container image into the cache. For example, to download the default CUDA container image, use the following command:

manage-singularity-cache --cuda

If your system has internet access, you can also download any desired Docker container image (e.g. determinedai/environments:py-3.8-pytorch-1.10-tf-2.8-cpu-096d730) into the cache using the command:

manage-singularity-cache determinedai/environments:py-3.8-pytorch-1.10-tf-2.8-cpu-096d730

Otherwise, from an internet-connected system, download the desired image using the singularity pull command, then copy it to a system with access to the singularity_image_root folder. You can then add the image to the cache by specifying the local file name with -i, followed by the Docker image reference, which determines the name under which the image is added to the cache.

manage-singularity-cache -i localfile.sif determinedai/environments:py-3.8-pytorch-1.10-tf-2.8-cpu-096d730
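
The offline workflow described above involves three commands run on two systems. The sketch below strings them together using only the commands shown in this section. The function names, the DRY_RUN variable, and the destination argument of copy_image are hypothetical and are there so the sequence can be previewed without running anything.

```shell
# Hypothetical wrapper: with DRY_RUN set, print each command instead of
# running it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# On the internet-connected system: pull the image ($1) into a SIF file ($2).
fetch_image() {
    run singularity pull "$2" "docker://$1"
}

# On the internet-connected system: copy the SIF file ($1) to a destination
# ($2, in scp host:path form) that can reach singularity_image_root.
copy_image() {
    run scp "$1" "$2"
}

# On the cluster: add the local file ($2) to the cache under the name
# given by the Docker image reference ($1).
register_image() {
    run manage-singularity-cache -i "$2" "$1"
}
```

Running the three functions with DRY_RUN=1 prints the commands in order, which is a cheap way to check the image reference and file name before transferring a large file.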

You can view the current set of Docker image names in the cache with the -l option:

manage-singularity-cache -l
determinedai/environments:py-3.8-pytorch-1.10-tf-2.8-cpu-096d730
determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-096d730
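
The listing can also be used to check that every image an experiment needs is already cached. The function below is a small sketch that assumes the -l output contains one image name per line, as in the example above. The name missing_images is hypothetical.

```shell
# Hypothetical helper: read a cache listing on stdin and print each
# required image (given as arguments) that does not appear in it.
missing_images() {
    listing=$(cat)
    for image in "$@"; do
        # -F: fixed string, -x: whole-line match, -q: quiet
        printf '%s\n' "$listing" | grep -Fxq -- "$image" || printf '%s\n' "$image"
    done
}

# Example call:
#   manage-singularity-cache -l | missing_images \
#       determinedai/environments:py-3.8-pytorch-1.10-tf-2.8-cpu-096d730
```

An empty result means all of the named images are present in the cache.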