Upgrades
Warning
Newer versions of master configuration might not be compatible with older versions. Please see the breaking changes in the Release Notes for upgrading the configuration.
Upgrading an existing Determined installation requires the same steps as installing Determined for the first time. You should additionally follow the steps below to safely shut down the cluster before beginning an upgrade. Once the upgrade is complete and Determined is restarted, all suspended experiments will be resumed automatically.
Disable all Determined agents in the cluster:
where
MASTER_ADDRESS
is the IP address or host name where the Determined master can be found. This will cause all tasks running on those agents to be checkpointed and terminated. The checkpoint process might take some time to complete; you can monitor which tasks are still running viadet slot list
.Take a backup of the Determined database using pg_dump. This is a safety precaution in case any problems occur after upgrading Determined.
All users should also upgrade the CLI by running
Be sure to do this for every user or virtualenv that has installed the old version of the CLI.
Troubleshooting Tips
Validating Nvidia Container Toolkit
To verify that a Determined agent instance can run containers that use GPUs, run:
You should see output that describes the GPUs available on the agent instance, such as:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 56% 84C P2 177W / 250W | 10729MiB / 11176MiB | 76% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:06:00.0 Off | N/A |
| 28% 62C P0 56W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 31% 64C P0 57W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:0A:00.0 Off | N/A |
| 20% 36C P0 57W / 250W | 0MiB / 12196MiB | 6% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 4638 C python3.6 10719MiB |
+-----------------------------------------------------------------------------+
Error messages
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=35777 /var/lib/docker/devicemapper/mnt/7b5b6d59cd4fe9307b7523f1cc9ce3bc37438cc793ff4a5a18a0c0824ec03982/rootfs]\\\\nnvidia-container-cli: requirement error: unsatisfied condition: brand = tesla\\\\n\\\"\"": unknown.
If you see the above error message, the GPU hardware and/or NVIDIA drivers installed on the agent are not compatible with CUDA 10, but you are trying to run a Docker image that depends on CUDA 10. Please run the commands below; if the first succeeds and the second fails, you should be able to use Determined as long as you use Docker images based on CUDA 9.