Installation

This document describes how to install and upgrade PEDL. PEDL consists of several components:

  • a master that schedules workloads and stores metadata
  • one or more agents that run workloads, typically using GPUs

The PEDL master and agents should typically be installed and configured by a system administrator. Each user of PEDL should also install a copy of the command-line tools, as described here.

These instructions describe how to install PEDL in "bare metal" mode. PEDL can also run on top of Kubernetes; for instructions on installing and using PEDL with Kubernetes, see the documentation.

System Requirements

Software

  • PEDL agent and master nodes must run either Ubuntu 16.04 LTS or CentOS 7.

  • To run jobs with GPUs, the Nvidia drivers must be installed on each PEDL agent. PEDL requires version >= 384.81 of the Nvidia drivers. (The Nvidia drivers can be installed as part of installing CUDA, but the rest of the CUDA toolkit is not required.)
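
To check which driver version is installed on an agent, nvidia-smi can be queried directly (assuming the drivers are already installed):

nvidia-smi --query-gpu=driver_version --format=csv,noheader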

Hardware

  • The PEDL master node should be configured with >= 4 CPUs (Intel Broadwell or later), 8GB of RAM, and 200GB of disk storage. Note that the PEDL master can be run on a machine that does not have GPUs.

  • Each PEDL agent node should be configured with >= 2 CPUs (Intel Broadwell or later), 4GB of RAM, and 50GB of disk space. If using GPUs, Nvidia GPUs with compute capability 3.7 or greater are required (e.g., K80, P100, V100, GTX 1080, GTX 1080 Ti, TITAN, TITAN XP).
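
As a quick sanity check of a node against these requirements, standard Linux tools suffice:

nproc     # number of CPUs
free -h   # installed RAM
df -h     # free disk space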

Installation

These instructions describe how to install PEDL for the first time; for directions on how to upgrade an existing PEDL installation, see the Upgrades section below.

On each machine in the PEDL cluster (both master and agents), do the following:

  1. Install the latest release of Docker and nvidia-docker2.

On Ubuntu:

# Add Docker's official apt repository (GPG key and package source):
sudo apt-get update && sudo apt-get install -y software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

# Add the nvidia-docker apt repository:
curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install Docker and nvidia-docker2, reload the Docker daemon so it picks up
# the nvidia container runtime, and allow the current user to run Docker commands:
sudo apt-get update && sudo apt-get install -y --no-install-recommends docker-ce nvidia-docker2
sudo systemctl reload docker
sudo usermod -aG docker $USER

On CentOS:

# Add Docker's official yum repository:
sudo yum install -y yum-utils device-mapper-persistent-data lvm2
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

# Add the nvidia-docker yum repository:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo

# Install Docker and nvidia-docker2, then start the Docker daemon:
sudo yum install -y docker-ce nvidia-docker2
sudo systemctl start docker

  2. Log out and start a new terminal session. Verify that the current user is in the docker group and that the nvidia Docker runtime is installed:

groups
docker info | grep Runtimes
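
If everything is set up correctly, the output of groups will include docker, and the Runtimes line will include the nvidia runtime; with nvidia-docker2 this typically looks like the following (exact formatting varies across Docker versions):

Runtimes: nvidia runc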

  3. Copy the PEDL installation tarball to the machine, extract it, download the necessary Docker images, and run make install as root:

tar xzvf pedl-0.9.6.tar.gz
cd pedl-0.9.6
./bin/pedl-pull-images
sudo make install

  4. If using CentOS 7, enable persistent storage of journalctl log messages so that logs are preserved across machine reboots:

sudo mkdir /var/log/journal
sudo systemd-tmpfiles --create --prefix /var/log/journal
sudo systemctl restart systemd-journald

  5. (Optional) Add /usr/local/pedl/bin to your PATH.
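
A minimal way to do this for the current user, assuming a bash shell:

echo 'export PATH=/usr/local/pedl/bin:$PATH' >> ~/.bashrc
source ~/.bashrc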

Master

On the master machine, run:

sudo make enable-master

This enables and starts the PEDL master and database services. These services will also be started automatically when the machine boots.

Note

PEDL experiment metadata is stored in a Docker volume named pedl-db-volume. By default, all Docker volumes are stored in the /var/lib/docker/volumes/ directory on the host file system; you should ensure that there is enough free space for all experiment metadata (~100GB should be safe).
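
To see where the volume lives on the host and how much space is available on that file system, standard Docker and Linux commands suffice:

docker volume inspect pedl-db-volume   # the Mountpoint field shows the host path
df -h /var/lib/docker                  # free space on the file system holding Docker data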

Agent

If the PEDL agent machine has GPUs, validate that the nvidia-docker2 installation is working as expected (see Validating the nvidia-docker2 installation under Troubleshooting Tips below).

Next, on each agent machine, edit /usr/local/pedl/etc/agent.conf and set MASTER_ADDRESS to the IP address or host name where the PEDL master can be found. Note that localhost or 127.0.0.1 should not be used.
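
For example, if the master is reachable at 10.1.0.5 (a placeholder address; substitute your own), the corresponding line in agent.conf would be:

MASTER_ADDRESS=10.1.0.5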

Next, run:

sudo make enable-agent

This enables and starts the PEDL agent service; it will also be started automatically when the machine boots. By default, the agent will use all of the GPUs on the machine to run PEDL tasks. To configure the agent to use only specific GPUs, set the GPU_LIST variable in agent.conf, as sketched below. GPUs can also be disabled and enabled via the pedl slot disable and pedl slot enable CLI commands, respectively.
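
As an illustration, restricting an agent to its first two GPUs might look like the following in agent.conf (the exact value format here is an assumption; consult the comments in agent.conf for the authoritative syntax):

GPU_LIST=0,1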

Configuring an HTTP Proxy

There are eight proxy environment variables which are treated specially by the agent(s):

  • HTTP_PROXY
  • HTTPS_PROXY
  • http_proxy
  • https_proxy
  • FTP_PROXY
  • ftp_proxy
  • NO_PROXY
  • no_proxy

Setting one of the above proxy variables, either by configuring the environment in which the agent will run or using /usr/local/pedl/etc/agent.conf, will affect three different contexts:

  • it will be set in the environment of the agent,
  • it will be passed as a predefined build arg to the docker build command, and
  • it will be set in the environment of running containers.

Each proxy variable can be set differently in each of the three contexts, by setting the variable with an appropriate prefix in agent.conf. For example:

  • http_proxy=proxy.com will set http_proxy in all three contexts
  • PEDL_AGENT_http_proxy=agent.proxy.com will overwrite http_proxy, but only for the agent itself
  • PEDL_BUILDTIME_http_proxy=build.proxy.com will overwrite http_proxy, but only during the docker build command
  • PEDL_RUNTIME_http_proxy=run.proxy.com will overwrite http_proxy, but only in the environment of running containers
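
Putting these together, a hypothetical agent.conf fragment for a site with an internal proxy (all host names below are placeholders) might look like:

http_proxy=http://proxy.corp.example.com:3128
https_proxy=http://proxy.corp.example.com:3128
NO_PROXY=localhost,127.0.0.1,.corp.example.com
PEDL_RUNTIME_http_proxy=http://runtime-proxy.corp.example.com:3128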

Command-Line Tools

See here for instructions on installing the PEDL command-line tools.

Upgrades

Upgrading an existing PEDL installation follows roughly the same steps as installing PEDL for the first time. During the upgrade process, all running experiments will be checkpointed and temporarily suspended. Once the upgrade is complete, all suspended experiments will be resumed automatically.

  1. Disable all PEDL agents in the cluster:

    pedl -m <MASTER_ADDRESS> agent disable --all
    
    where MASTER_ADDRESS is the IP address or host name where the PEDL master can be found. This will cause all tasks running on those agents to be checkpointed and terminated. The checkpoint process might take some time to complete; you can monitor which tasks are still running via pedl slot list.

  2. Shut down the PEDL agent running on each agent machine:

    sudo systemctl stop pedl-agent
    
    You can verify that the agent has stopped with systemctl is-active pedl-agent.

  3. Take a backup of the PEDL database:

    pedl-db-backup pedl-db-`date "+%m-%d-%y"`.dump
    
    This is a safety precaution in case any problems occur after upgrading PEDL.

  4. Shut down the PEDL master and database services:

    sudo systemctl stop pedl-master
    sudo systemctl stop pedl-db
    

  5. Follow the instructions in the Installation section to install the new version of PEDL. In particular, copy the new version of PEDL to each machine, download the updated Docker images (bin/pedl-pull-images), and run make install.

  6. Reload the systemd manager configuration (sudo systemctl daemon-reload) and then enable the PEDL master and agent services on each host, as appropriate. When the master is restarted, all previously active experiments will be resumed.

  7. Upgrade the CLI by installing (pip install -U) the new Python wheel. Be sure to do this for every user or virtualenv that has installed the old version of the CLI.
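
    For example, assuming the new wheel is named pedl-0.9.6-py3-none-any.whl (the actual file name will match the release being installed):

    pip install -U pedl-0.9.6-py3-none-any.whl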

Troubleshooting Tips

To view the logs associated with any of the PEDL services (master, agent, or metadata DB), use:

journalctl -u pedl-agent
journalctl -u pedl-db
journalctl -u pedl-master
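
To follow a service's logs in real time, add the -f flag; for example:

journalctl -u pedl-master -f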

To reset the content of the PEDL metadata DB, use pedl-db-reset.

Warning

Using pedl-db-reset will result in deleting the entire PEDL database; it should only be used in extreme circumstances.

Validating the nvidia-docker2 installation

To verify that a PEDL agent instance can run containers that use GPUs, run:

docker run --runtime=nvidia --rm nvidia/cuda:10.0-runtime nvidia-smi

You should see output describing the GPUs available on the agent instance, such as:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 56%   84C    P2   177W / 250W |  10729MiB / 11176MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:06:00.0 Off |                  N/A |
| 28%   62C    P0    56W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 31%   64C    P0    57W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:0A:00.0 Off |                  N/A |
| 20%   36C    P0    57W / 250W |      0MiB / 12196MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4638      C   python3.6                                  10719MiB |
+-----------------------------------------------------------------------------+

Error: nvidia-container-cli: requirement error: unsatisfied condition

docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=35777 /var/lib/docker/devicemapper/mnt/7b5b6d59cd4fe9307b7523f1cc9ce3bc37438cc793ff4a5a18a0c0824ec03982/rootfs]\\\\nnvidia-container-cli: requirement error: unsatisfied condition: brand = tesla\\\\n\\\"\"": unknown.

The GPU hardware and/or NVIDIA drivers installed on the agent are not compatible with CUDA 10.0. Try re-running the Docker command with a version of CUDA that is compatible with the hardware and driver setup, e.g. the following for CUDA 9.0:

docker run --runtime=nvidia --rm nvidia/cuda:9.0-runtime nvidia-smi