Install Determined Using Docker
Preliminary Setup
Install Docker on all machines in the cluster. If the agent machines have GPUs, ensure that the NVIDIA Container Toolkit is working as expected on each one.
Pull the official Docker image for PostgreSQL. We recommend using the version listed below.
docker pull postgres:10
This image is not provided by Determined AI; please see its Docker Hub page for more information.
Pull the Docker image for the master or agent on each machine where these services will run. A Determined cluster runs a single master container, and typically one agent container per machine. A single machine can host both the master container and an agent container. Run the commands below, replacing VERSION with a valid Determined version, such as the current version, 0.13.7:
docker pull determinedai/determined-master:VERSION
docker pull determinedai/determined-agent:VERSION
If you are running multiple containerized services on the Determined master machine, optionally create the determined network that the service startup commands below use:
docker network create determined
Configuring and Starting the Cluster
PostgreSQL
The following command starts the PostgreSQL container on the master
machine, using the determined network created above:
docker run \
--name determined-db \
--network determined \
-p 5432:5432 \
-v determined_db:/var/lib/postgresql/data \
-e POSTGRES_DB=determined \
-e POSTGRES_PASSWORD=<DB password> \
postgres:10
If the master will connect to PostgreSQL via Docker networking, exposing
port 5432 via the -p argument isn’t necessary; however, you may still
want to expose it for administrative or debugging purposes. To expose
the port only on the master machine’s loopback network interface, pass
-p 127.0.0.1:5432:5432 instead of -p 5432:5432.
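The loopback variant can be sketched as a small shell function. This is an illustration only: start_postgres, its arguments, and the DOCKER override (which allows a dry run with DOCKER=echo) are names assumed for this example, while the docker flags are the ones shown above.

```shell
# Sketch: start PostgreSQL with port 5432 published on one host address only.
# start_postgres, its arguments, and the DOCKER override are hypothetical
# names introduced for this example.
start_postgres() {
    db_password=$1
    bind_addr=${2:-127.0.0.1}   # loopback unless another address is given

    # With DOCKER=echo the command is printed instead of executed (dry run).
    "${DOCKER:-docker}" run \
        --name determined-db \
        --network determined \
        -p "${bind_addr}:5432:5432" \
        -v determined_db:/var/lib/postgresql/data \
        -e POSTGRES_DB=determined \
        -e POSTGRES_PASSWORD="${db_password}" \
        postgres:10
}
```

Calling start_postgres '<DB password>' publishes the port on 127.0.0.1 only; passing 0.0.0.0 as the second argument behaves like -p 5432:5432.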
Determined Master
Determined master configuration values can come from a file, environment variables, or command-line arguments.
To start the master with a configuration file, we recommend starting
from our default master configuration file, which lists the available
options along with their descriptions. Download and edit the master.yaml
configuration file as appropriate and start the master container with
the edited configuration:
docker run \
-v "$PWD"/master.yaml:/etc/determined/master.yaml \
determinedai/determined-master:VERSION
To start the master with environment variables instead of a configuration file:
docker run \
--name determined-master \
--network determined \
-p 8080:8080 \
-e DET_DB_HOST=determined-db \
-e DET_DB_NAME=determined \
-e DET_DB_PORT=5432 \
-e DET_DB_USER=postgres \
-e DET_DB_PASSWORD=<DB password> \
determinedai/determined-master:VERSION
Note that this references the PostgreSQL container running on the
determined Docker network created above. If you are running PostgreSQL
externally, specify the hostname of the PostgreSQL server in place of
determined-db.
To prevent the master from listening on port 8080 on all network
interfaces on the master machine, you may specify the loopback interface
in the published port mapping, i.e., -p 127.0.0.1:8080:8080.
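For the configuration-file route, a minimal master.yaml that mirrors the DET_DB_* environment variables might look like the file written by the sketch below. The key names (port, db.host, db.port, db.name, db.user, db.password) are assumed to match the default master configuration file and should be checked against it; write_master_yaml is a hypothetical helper.

```shell
# Sketch: write a minimal master.yaml mirroring the DET_DB_* variables.
# write_master_yaml is a hypothetical helper; the key names are an assumption
# to verify against the default master configuration file.
write_master_yaml() {
    db_password=$1
    out_file=${2:-master.yaml}

    # The password is double-quoted for YAML; a password containing a double
    # quote or a backslash would need escaping first.
    cat > "$out_file" <<EOF
port: 8080
db:
  host: determined-db
  port: 5432
  name: determined
  user: postgres
  password: "${db_password}"
EOF
}
```

Because determined-db is resolved through Docker networking, a master started from this file still needs --network determined, and -p 8080:8080 if the master port should be reachable from outside the Docker network.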
Determined Agents
As is the case for the master, Determined agent configuration values can come from a file, environment variables, or command-line arguments.
To start the agent with a configuration file, we recommend starting from
our default agent configuration file, which lists the available options
along with their descriptions. Download and edit the agent.yaml
configuration file as appropriate and start the agent container with the
edited configuration:
docker run \
-v /var/run/docker.sock:/var/run/docker.sock \
-v "$PWD"/agent.yaml:/etc/determined/agent.yaml \
determinedai/determined-agent:VERSION
Note that the agent container must bind mount the host’s Docker daemon socket. This allows the agent container to orchestrate the containers that execute trials and other tasks.
If you are providing command-line arguments to the container (e.g.,
using --master-port as opposed to the DET_MASTER_PORT environment
variable), run must be provided as the first argument:
docker run \
-v /var/run/docker.sock:/var/run/docker.sock \
-v "$PWD"/agent.yaml:/etc/determined/agent.yaml \
determinedai/determined-agent:VERSION \
run --master-port=8080
To start an agent container with environment variables instead of a configuration file:
docker run \
-v /var/run/docker.sock:/var/run/docker.sock \
--name determined-agent \
--network determined \
-e DET_MASTER_HOST=<Determined master hostname or IP> \
-e DET_MASTER_PORT=8080 \
determinedai/determined-agent:VERSION
Note that if an agent container is running on the same machine as the
master container, you may use determined-master (or whatever the name of
the master container is) as the DET_MASTER_HOST.
Selecting GPUs
The --gpus flag should be used to specify which GPUs the agent container
will have access to; without it, the agent will not have access to any
GPUs. For example:
# Use all GPUs.
docker run --gpus all ...
# Use any four GPUs (selected by Docker).
docker run --gpus 4 ...
# Use the GPUs with the given IDs or UUIDs.
docker run --gpus '"device=1,3"' ...
GPUs can also be disabled and enabled at runtime using the det slot
disable and det slot enable CLI commands, respectively.
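To show how the --gpus flag combines with the other agent options, the sketch below wraps the environment-variable form of the agent command in a function. start_agent, DET_VERSION, and the DOCKER override (for dry runs with DOCKER=echo) are hypothetical names introduced here, not part of Determined or Docker.

```shell
# Sketch: start an agent with a chosen GPU selection.
# start_agent, DET_VERSION, and the DOCKER override are hypothetical names
# introduced for this example.
start_agent() {
    master_host=$1
    gpus=${2:-all}   # any value accepted by `docker run --gpus`

    # The Docker socket mount lets the agent launch task containers.
    "${DOCKER:-docker}" run \
        --name determined-agent \
        --network determined \
        --gpus "${gpus}" \
        -v /var/run/docker.sock:/var/run/docker.sock \
        -e DET_MASTER_HOST="${master_host}" \
        -e DET_MASTER_PORT=8080 \
        "determinedai/determined-agent:${DET_VERSION:-0.13.7}"
}
```

For instance, start_agent determined-master '"device=1,3"' restricts the agent to GPUs 1 and 3; the inner double quotes are passed through to Docker, as in the example above.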
Docker Networking for Master and Agents
As with any Docker container, the networking mode of the master and
agent containers can be changed using the --network option to docker
run. In particular, host mode networking (--network host) can be useful
to optimize performance and in situations where a container needs to
handle a large range of ports, as it does not require network address
translation (NAT) and no “userland-proxy” is created for each port.
The host networking driver only works on Linux hosts, and is not supported on Docker Desktop for Mac, Docker Desktop for Windows, or Docker EE for Windows Server.
See Docker’s documentation for more details.
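As a rough sketch of host mode networking for the master: with --network host, Docker discards published ports, so no -p flag is given, and container names on the determined network no longer resolve, so the database host has to be a reachable address. The example assumes PostgreSQL publishes port 5432 on the same machine; start_master_host_network, DET_VERSION, and the DOCKER override are hypothetical names.

```shell
# Sketch: run the master with host networking instead of the `determined`
# network. start_master_host_network, DET_VERSION, and the DOCKER override
# are hypothetical names introduced for this example.
start_master_host_network() {
    db_password=$1
    db_host=${2:-127.0.0.1}   # container names do not resolve in host mode

    # No -p flag: in host mode the master binds the host's port directly.
    "${DOCKER:-docker}" run \
        --name determined-master \
        --network host \
        -e DET_DB_HOST="${db_host}" \
        -e DET_DB_NAME=determined \
        -e DET_DB_PORT=5432 \
        -e DET_DB_USER=postgres \
        -e DET_DB_PASSWORD="${db_password}" \
        "determinedai/determined-master:${DET_VERSION:-0.13.7}"
}
```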
Managing the Cluster
By default, docker run will run in the foreground, so that a container
can be stopped simply by pressing Control-C. If you wish to keep
Determined running for the long term, consider running the containers
detached and/or with restart policies. Using our deployment tool is also
an option.
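For the detached, restart-policy route, one possible sketch is a thin wrapper that adds Docker's --detach and --restart flags to any of the docker run commands on this page. run_detached and the DOCKER override are hypothetical names; unless-stopped is one of Docker's restart policies, chosen here only as an example.

```shell
# Sketch: run any container from this page in the background with a restart
# policy. run_detached and the DOCKER override are hypothetical names.
run_detached() {
    # "$@" carries the remaining `docker run` options and the image name.
    "${DOCKER:-docker}" run --detach --restart unless-stopped "$@"
}
```

A detached container no longer responds to Control-C; docker logs -f determined-master follows its output and docker stop determined-master stops it, assuming the container was started with --name determined-master.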