Upgrades

Warning

There are occasionally incompatible changes introduced in new versions of Determined – for example, the format of the master and agent configuration files might change. While we try to preserve backward compatibility whenever possible, you should read the Release Notes for a description of recent changes before upgrading Determined.

Upgrading an existing Determined installation requires the same steps as installing Determined for the first time. Before starting an upgrade, first follow the steps below to safely shut down the cluster. Once the upgrade is complete and Determined is restarted, all suspended experiments will be resumed automatically.

  1. Disable all Determined agents in the cluster:

    det -m <MASTER_ADDRESS> agent disable --all
    

    where MASTER_ADDRESS is the IP address or host name where the Determined master can be found. This will cause all tasks running on those agents to be checkpointed and terminated. The checkpoint process might take some time to complete; you can monitor which tasks are still running via det slot list.

  2. Take a backup of the Determined database using pg_dump. This is a safety precaution in case any problems occur after upgrading Determined.

All users should also upgrade the CLI by running

pip install determined-cli

Be sure to do this for every user or virtualenv that has installed the old version of the CLI.

Troubleshooting Tips

Validating Nvidia Container Toolkit

To verify that a Determined agent instance can run containers that use GPUs, run:

docker run --gpus all --rm debian:10-slim nvidia-smi

You should see output that describes the GPUs available on the agent instance, such as:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 56%   84C    P2   177W / 250W |  10729MiB / 11176MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:06:00.0 Off |                  N/A |
| 28%   62C    P0    56W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 31%   64C    P0    57W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:0A:00.0 Off |                  N/A |
| 20%   36C    P0    57W / 250W |      0MiB / 12196MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4638      C   python3                                    10719MiB |
+-----------------------------------------------------------------------------+

Error messages

docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=35777 /var/lib/docker/devicemapper/mnt/7b5b6d59cd4fe9307b7523f1cc9ce3bc37438cc793ff4a5a18a0c0824ec03982/rootfs]\\\\nnvidia-container-cli: requirement error: unsatisfied condition: brand = tesla\\\\n\\\"\"": unknown.

If you see the above error message, the GPU hardware and/or NVIDIA drivers installed on the agent are not compatible with CUDA 10, but you are trying to run a Docker image that depends on CUDA 10. Please run the commands below; if the first succeeds and the second fails, you should be able to use Determined as long as you use Docker images based on CUDA 9.

docker run --gpus all --rm nvidia/cuda:9.0-runtime nvidia-smi
docker run --gpus all --rm nvidia/cuda:10.0-runtime nvidia-smi