This document describes how to install and upgrade PEDL. PEDL consists of several components:

  • a master that schedules workloads and stores metadata
  • one or more agents that run workloads, typically using GPUs

The PEDL master and agents should typically be installed and configured by a system administrator. Each user of PEDL should also install a copy of the command-line tools, as described here.

These instructions describe how to install PEDL in "bare metal" mode. PEDL can also run on top of Kubernetes; for instructions on installing and using PEDL with Kubernetes, see the documentation.

System Requirements


  • PEDL agent and master nodes must be configured with either Ubuntu 16.04 LTS or CentOS 7.

  • To run jobs with GPUs, the Nvidia drivers must be installed on each PEDL agent. PEDL requires version >= 384.81 of the Nvidia drivers. (The Nvidia drivers can be installed as part of installing CUDA, but the rest of the CUDA toolkit is not required.)


  • The PEDL master node should be configured with >= 4 CPUs (Intel Broadwell or later), 8GB of RAM, and 200GB of disk storage. Note that the PEDL master can be run on a machine that does not have GPUs.

  • Each PEDL agent node should be configured with >= 2 CPUs (Intel Broadwell or later), 4GB of RAM, and 50GB of disk space. If using GPUs, Nvidia GPUs with compute capability 3.7 or greater are required (e.g., K80, P100, V100, GTX 1080, GTX 1080 Ti, TITAN, TITAN XP).
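
The driver requirement above can be checked before installing anything else. The following is a minimal sketch, assuming nvidia-smi is already on the PATH and that GNU sort (with the -V version-sort option) is available; the helper names version_ge and check_driver are hypothetical and not part of PEDL.

```shell
# version_ge A B: succeed when dotted version A is >= B.
# Relies on GNU sort -V (version sort), which Ubuntu and CentOS ship.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_driver: compare the installed driver with the documented minimum.
check_driver() {
    installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if version_ge "$installed" 384.81; then
        echo "driver $installed: OK"
    else
        echo "driver $installed: older than 384.81"
        return 1
    fi
}
```

Running check_driver on each agent prints the detected version and returns a non-zero status when the driver is older than 384.81.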


Installation

These instructions describe how to install PEDL for the first time; for directions on how to upgrade an existing PEDL installation, see the Upgrades section below.

On each machine in the PEDL cluster (both master and agents), do the following:

  1. Install the latest release of Docker and nvidia-docker2.

On Ubuntu:

sudo apt-get update && sudo apt-get install -y software-properties-common
curl -fsSL | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] $(lsb_release -cs) stable"

curl -fsSL | sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y --no-install-recommends docker-ce nvidia-docker2
sudo systemctl reload docker
sudo usermod -aG docker $USER
On CentOS:
sudo yum install -y yum-utils device-mapper-persistent-data lvm2
sudo yum-config-manager --add-repo

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo

sudo yum install -y docker-ce nvidia-docker2
sudo systemctl start docker
2. Log out and start a new terminal session. Verify that the current user is in the docker group and that the nvidia Docker runtime is installed:
docker info | grep Runtimes
3. Copy the PEDL installation tarball to the machine, extract it, download the necessary Docker images (bin/pedl-pull-images), and run make install as root.
tar xzvf pedl-0.10.10.tar.gz
cd pedl-0.10.10
sudo make install
4. If using CentOS 7, enable the persistent storage of journalctl log messages so that logs are saved on machine reboot.
sudo mkdir /var/log/journal
sudo systemd-tmpfiles --create --prefix /var/log/journal
sudo systemctl restart systemd-journald
5. (Optional) Add /usr/local/pedl/bin to your PATH.
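
The checks in step 2 can be scripted so they are easy to repeat on every machine. This is a minimal sketch, assuming that docker info prints a "Runtimes:" line listing the installed runtimes; the helper names has_nvidia_runtime and in_docker_group are hypothetical.

```shell
# has_nvidia_runtime: read `docker info` output on stdin and succeed when
# the Runtimes line lists the nvidia runtime.
has_nvidia_runtime() {
    grep 'Runtimes' | grep -qw nvidia
}

# in_docker_group: succeed when the current user belongs to the docker group.
in_docker_group() {
    id -nG | tr ' ' '\n' | grep -qx docker
}
```

For example, `in_docker_group && docker info 2>/dev/null | has_nvidia_runtime && echo "Docker setup looks good"` prints the message only when both checks pass.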


These instructions describe how to configure a PEDL-managed cluster. See Cluster Configuration for details.

Docker Networking for Master and Agents

The Docker networking of the master and the agent can be configured by editing /usr/local/pedl/etc/network.conf. Setting PEDL_NETWORK in that file selects the Docker network used by the master and the agent; for example, to use Docker host-mode networking for the master, set PEDL_NETWORK to host. The Docker network used by trial runners can also be specified by setting TRIAL_RUNNER_NETWORK.


Host mode networking can be useful to optimize performance, and in situations where a container needs to handle a large range of ports, as it does not require network address translation (NAT), and no “userland-proxy” is created for each port. The host networking driver only works on Linux hosts, and is not supported on Docker Desktop for Mac, Docker Desktop for Windows, or Docker EE for Windows Server. See Use host networking for details.
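
As an illustration, a network.conf that selects host-mode networking might look like the sketch below. The KEY=VALUE syntax is an assumption about the file format, and the commented-out TRIAL_RUNNER_NETWORK value is likewise only a guess at what that variable accepts.

```shell
# /usr/local/pedl/etc/network.conf (sketch; KEY=VALUE syntax is assumed)
# Use Docker host-mode networking for the master and the agent.
PEDL_NETWORK=host
# Assumption: the trial runner setting takes the same kind of value.
# TRIAL_RUNNER_NETWORK=host
```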

Master Port

By default, the master listens on TCP port 8080. This can be configured via the PEDL_MASTER_HTTP_PORT environment variable. When the master is managed by systemd (as described above), this variable can be set by creating a /etc/systemd/system/pedl-master.service.d/override.conf file with the following content:

where 80 is the desired port number.
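
A minimal sketch of such a drop-in file, assuming the standard systemd [Service] section with an Environment= line is what the master service expects; the helper name render_port_override is hypothetical.

```shell
# render_port_override PORT: print a systemd drop-in that sets the
# master's HTTP port through the PEDL_MASTER_HTTP_PORT variable.
render_port_override() {
    printf '[Service]\nEnvironment="PEDL_MASTER_HTTP_PORT=%s"\n' "$1"
}
```

One way to install it is `render_port_override 80 | sudo tee /etc/systemd/system/pedl-master.service.d/override.conf` (after creating the directory), followed by `sudo systemctl daemon-reload` and a restart of the pedl-master service so that the new value takes effect.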


The master is capable of serving over HTTPS in addition to HTTP. Doing so requires a TLS certificate and private key; to configure them, set the environment variables PEDL_SECURITY_TLS_CERT and PEDL_SECURITY_TLS_KEY to the paths of a PEM-encoded TLS certificate and private key, respectively. The variable PEDL_MASTER_HTTPS_PORT can be specified to change the HTTPS port (default 8443).
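
Because the two variables are easy to swap, it can help to confirm which file is which before setting them. The sketch below only inspects the PEM header lines and does not validate the certificate or key themselves; the helper names are hypothetical.

```shell
# looks_like_pem_cert FILE: succeed when FILE contains a PEM certificate header.
looks_like_pem_cert() {
    grep -q -- '-----BEGIN CERTIFICATE-----' "$1"
}

# looks_like_pem_key FILE: succeed when FILE contains a PEM private key header
# (plain, RSA, EC, or encrypted).
looks_like_pem_key() {
    grep -Eq -- '-----BEGIN ([A-Z]+ )?PRIVATE KEY-----' "$1"
}
```

The file that passes looks_like_pem_cert belongs in PEDL_SECURITY_TLS_CERT, and the one that passes looks_like_pem_key belongs in PEDL_SECURITY_TLS_KEY.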

Agent Network Proxies

There are several variables that control how the agent proxies network connections while building images or running tasks:


They can be set as environment variables, as command line options to the agent, or by editing /usr/local/pedl/etc/agent.conf.

The HTTP_PROXY, HTTPS_PROXY and FTP_PROXY variables control the proxy that will be used for HTTP, HTTPS and FTP connections respectively. NO_PROXY is a comma-separated list of hosts or IP addresses that will not be proxied. The agent passes all these values to any process it creates as environment variables (with both uppercase and lowercase variable names).

The RUNTIME and BUILDTIME variables can be used if a specific proxy configuration should be used only when a container is running or only when an image is being built. Otherwise, the value of the variable without RUNTIME or BUILDTIME will be used for both cases.

For example,

  • PEDL_HTTP_PROXY= will set HTTP_PROXY/http_proxy in running containers and when Docker images are built.
  • PEDL_RUNTIME_HTTP_PROXY= will set HTTP_PROXY/http_proxy only when containers are running.
  • PEDL_BUILDTIME_HTTP_PROXY= will set HTTP_PROXY/http_proxy only when Docker images are built.
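
The precedence described above can be sketched in a few lines of shell. This only illustrates the documented rule (a RUNTIME or BUILDTIME variable wins, otherwise the general variable applies); it is not the agent's actual implementation, and the function name and proxy URLs are hypothetical.

```shell
# effective_proxy PHASE NAME: print the proxy value that applies for
# PHASE (RUNTIME or BUILDTIME) and NAME (HTTP_PROXY, HTTPS_PROXY, ...).
effective_proxy() {
    phase=$1
    name=$2
    # Look up PEDL_<PHASE>_<NAME> and PEDL_<NAME>; empty when unset.
    eval "specific=\${PEDL_${phase}_${name}:-}"
    eval "general=\${PEDL_${name}:-}"
    if [ -n "$specific" ]; then
        printf '%s\n' "$specific"
    else
        printf '%s\n' "$general"
    fi
}

# Hypothetical values, for illustration only.
PEDL_HTTP_PROXY=http://proxy.example.com:3128
PEDL_BUILDTIME_HTTP_PROXY=http://build-proxy.example.com:3128

effective_proxy RUNTIME HTTP_PROXY     # prints http://proxy.example.com:3128
effective_proxy BUILDTIME HTTP_PROXY   # prints http://build-proxy.example.com:3128
```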

Configuring Trial Runner Networking

The master is capable of selecting the network interface that trial runners will use to communicate when performing distributed (multi-machine) training. The network interface can be configured by editing /usr/local/pedl/etc/master.yaml. If left unspecified, which is the default setting, PEDL will auto-discover a common network interface shared by the trial runners.

Additionally, the ports used by the GLOO and NCCL libraries, which are used during distributed (multi-machine) training, can be configured to fall within user-defined ranges. If left unspecified, ports will be chosen randomly from the unprivileged port range (1024-65535).

Default Checkpoint Storage

See Checkpoints for details.

Starting up the cluster


On the master machine, run:

sudo make enable-master
This enables and starts the PEDL master and database services. These services will now be started when the machine boots.


PEDL experiment metadata is stored in a Docker volume named pedl-db-volume. By default, all Docker volumes are stored in the /var/lib/docker/volumes/ directory on the host file system; you should ensure that there is enough free space for all experiment metadata (~100GB should be safe).
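
A quick way to check the free space on the master is sketched below. It assumes the default Docker volume location mentioned above; the helper names avail_gb and check_volume_space are hypothetical.

```shell
# avail_gb DIR: print the free space on DIR's file system in whole GB.
# Uses POSIX df output (-P) in 1024-byte blocks (-k).
avail_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# check_volume_space: warn when less than ~100GB is free for Docker volumes.
check_volume_space() {
    free=$(avail_gb /var/lib/docker/volumes)
    if [ "${free:-0}" -lt 100 ]; then
        echo "warning: only ${free:-0}GB free under /var/lib/docker/volumes"
        return 1
    fi
}
```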


If the PEDL agent has GPUs, validate that the nvidia-docker2 installation is working as expected (see Validating the nvidia-docker2 installation below).

Next, on each agent machine, edit /usr/local/pedl/etc/agent.conf and set MASTER_ADDRESS to the IP address or host name where the PEDL master can be found. Note that localhost (or another loopback address) should not be used.
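
A small sanity check can catch an unsuitable MASTER_ADDRESS before the agent is started. This is a sketch only; the helper name is hypothetical, and the patterns cover just the common loopback spellings.

```shell
# is_loopback_address ADDR: succeed when ADDR is localhost or a loopback IP,
# i.e. a value that should not be used for MASTER_ADDRESS.
is_loopback_address() {
    case "$1" in
        localhost|localhost.*|127.*|::1) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_loopback_address "$MASTER_ADDRESS" && echo "choose a routable address instead"` flags a value that needs to be changed.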

Next, run:

sudo make enable-agent
This enables and starts the PEDL agent service. This service will now be started when the machine boots.

By default, the agent will use all the GPUs on the machine to run PEDL tasks. To configure the agent to only use specific GPUs, set the GPU_LIST variable in agent.conf. GPUs can also be disabled and enabled via the pedl slot disable and pedl slot enable CLI commands, respectively.

Command-Line Tools

See here for instructions on installing the PEDL command-line tools.


PEDL supports an optional user system, which allows teams of machine learning developers to organize the assets they create inside PEDL. The user subsystem should be configured by the system administrator; see the user documentation for more details.



Upgrades

Newer versions of the master configuration might not be compatible with older versions. Please see the breaking changes in the Release Notes for upgrading the configuration.

Upgrading an existing PEDL installation requires roughly similar steps to installing PEDL for the first time. During the upgrade process, all running experiments will be checkpointed and temporarily suspended. Once the upgrade is complete, all suspended experiments will be resumed automatically.

  1. Disable all PEDL agents in the cluster:

    pedl -m <MASTER_ADDRESS> agent disable --all
    where MASTER_ADDRESS is the IP address or host name where the PEDL master can be found. This will cause all tasks running on those agents to be checkpointed and terminated. The checkpoint process might take some time to complete; you can monitor which tasks are still running via pedl slot list.

  2. Shut down the PEDL agent running on each agent machine:

    sudo systemctl stop pedl-agent
    You can verify whether the agent has stopped using systemctl is-active pedl-agent.

  3. Take a backup of the PEDL database:

    pedl-db-backup pedl-db-`date "+%m-%d-%y"`.dump
    This is a safety precaution in case any problems occur after upgrading PEDL.

  4. Shut down the PEDL master and database service:

    sudo systemctl stop pedl-master
    sudo systemctl stop pedl-db

  5. Follow the instructions in the Installation section to install the new version of PEDL. In particular, copy the new version of PEDL to each machine, download the updated Docker images (bin/pedl-pull-images), and run make install.

  6. Reload the systemd manager configuration (sudo systemctl daemon-reload) and then enable the PEDL master and agent services on each host, as appropriate. When the master is restarted, all previously active experiments will be resumed.

  7. Upgrade the CLI by installing (pip install -U) the new Python wheel. Be sure to do this for every user or virtualenv that has installed the old version of the CLI.
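
Steps 2 through 4 lend themselves to a small script that can be rehearsed before the real upgrade. The sketch below wraps the commands from those steps in a hypothetical run helper that only prints each command when DRY_RUN is set; the function names are illustrative and not part of PEDL.

```shell
# run CMD...: execute a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 2, on each agent machine.
stop_agent() {
    run sudo systemctl stop pedl-agent
}

# Steps 3 and 4, on the master machine: back up first, then stop services.
backup_and_stop_master() {
    run pedl-db-backup "pedl-db-$(date '+%m-%d-%y').dump"
    run sudo systemctl stop pedl-master
    run sudo systemctl stop pedl-db
}
```

Setting DRY_RUN=1 and calling stop_agent or backup_and_stop_master prints the commands without running them, which is a cheap way to review the sequence before touching a live cluster.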

Troubleshooting Tips

To view the logs associated with any of the PEDL services (master, agent, or metadata DB), use:

journalctl -u pedl-agent
journalctl -u pedl-db
journalctl -u pedl-master
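
When it is not yet clear which service is misbehaving, it can be quicker to scan all three at once. The sketch below uses standard journalctl options (-p err for error priority and above, -n for the number of lines, --no-pager); the function name recent_errors is hypothetical.

```shell
# recent_errors: show the last 20 error-level log lines of each PEDL service.
recent_errors() {
    for unit in pedl-master pedl-db pedl-agent; do
        echo "== $unit =="
        journalctl -u "$unit" -p err -n 20 --no-pager 2>/dev/null \
            || echo "(no journal available)"
    done
}
```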

To reset the content of the PEDL metadata DB, use pedl-db-reset.


Using pedl-db-reset will result in deleting the entire PEDL database; it should only be used in extreme circumstances.

Validating the nvidia-docker2 installation

To verify that a PEDL agent instance can run containers that use GPUs, run:

docker run --runtime=nvidia --rm nvidia/cuda:10.0-runtime nvidia-smi

You should see an output that describes the GPUs available on the agent instance, such as:

| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 56%   84C    P2   177W / 250W |  10729MiB / 11176MiB |     76%      Default |
|   1  GeForce GTX 108...  Off  | 00000000:06:00.0 Off |                  N/A |
| 28%   62C    P0    56W / 250W |      0MiB / 11178MiB |      0%      Default |
|   2  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 31%   64C    P0    57W / 250W |      0MiB / 11178MiB |      0%      Default |
|   3  TITAN Xp            Off  | 00000000:0A:00.0 Off |                  N/A |
| 20%   36C    P0    57W / 250W |      0MiB / 12196MiB |      6%      Default |

| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0      4638      C   python3.6                                  10719MiB |

Error: nvidia-container-cli: requirement error: unsatisfied condition

docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=35777 /var/lib/docker/devicemapper/mnt/7b5b6d59cd4fe9307b7523f1cc9ce3bc37438cc793ff4a5a18a0c0824ec03982/rootfs]\\\\nnvidia-container-cli: requirement error: unsatisfied condition: brand = tesla\\\\n\\\"\"": unknown.

The GPU hardware and/or NVIDIA drivers installed on the agent are not compatible with CUDA 10.0. Please try re-running the Docker command with a version of CUDA that is compatible with the hardware and driver set-up, e.g. the following for CUDA 9.0:

docker run --runtime=nvidia --rm nvidia/cuda:9.0-runtime nvidia-smi
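
The retry described above can be automated by trying the image tags in order. This is a minimal sketch that uses only the two tags mentioned in this section; the function name first_working_cuda_image is hypothetical.

```shell
# first_working_cuda_image: print the first CUDA image whose container can
# run nvidia-smi on this agent; fail when none of the listed tags works.
first_working_cuda_image() {
    for tag in 10.0-runtime 9.0-runtime; do
        if docker run --runtime=nvidia --rm "nvidia/cuda:$tag" nvidia-smi >/dev/null 2>&1; then
            printf 'nvidia/cuda:%s\n' "$tag"
            return 0
        fi
    done
    return 1
}
```

If the function prints nothing and returns a non-zero status, neither image could run, which points to a driver or nvidia-docker2 installation problem rather than a CUDA version mismatch.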