This document describes how to install and upgrade PEDL. PEDL consists of several components:
- a master that schedules workloads and stores metadata
- one or more agents that run workloads, typically using GPUs
The PEDL master and agents should typically be installed and configured by a system administrator. Each user of PEDL should also install a copy of the command-line tools, as described below.
These instructions describe how to install PEDL in "bare metal" mode. PEDL can also run on top of Kubernetes; for instructions on installing and using PEDL with Kubernetes, see the documentation.
PEDL agent and master nodes must be configured with either Ubuntu 16.04 LTS or CentOS 7.
To run jobs with GPUs, the Nvidia drivers must be installed on each PEDL agent. PEDL requires version >= 384.81 of the Nvidia drivers. (The Nvidia drivers can be installed as part of installing CUDA, but the rest of the CUDA toolkit is not required.)
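The driver requirement above can be checked on an agent before PEDL is installed. The following is a minimal sketch and not part of PEDL: `version_ge` is a hypothetical helper name, and the script assumes GNU `sort` (for `-V` version ordering) and the `nvidia-smi` tool that ships with the Nvidia driver.

```shell
#!/usr/bin/env bash
# Minimum driver version stated in the requirement above.
MIN_DRIVER="384.81"

# version_ge A B: succeeds when version A >= version B.
# Relies on GNU sort -V; the smaller of the two versions sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only query the driver when nvidia-smi is present.
if command -v nvidia-smi >/dev/null 2>&1; then
    installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if version_ge "$installed" "$MIN_DRIVER"; then
        echo "ok: Nvidia driver $installed"
    else
        echo "too old: Nvidia driver '$installed' (need >= $MIN_DRIVER)"
    fi
else
    echo "nvidia-smi not found; the Nvidia driver may not be installed"
fi
```

A plain string comparison would order 384.111 before 384.81, which is why the sketch uses version ordering.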
The PEDL master node should be configured with >= 2 CPUs (Intel Broadwell or later), 4GB of RAM, and 100GB of disk storage. Note that the PEDL master can be run on a machine that does not have GPUs.
Each PEDL agent node should be configured with >= 2 CPUs (Intel Broadwell or later), 4GB of RAM, and 50GB of disk space. If using GPUs, Nvidia GPUs with compute capability 3.7 or greater are required (e.g., K80, P100, V100, GTX 1080, GTX 1080 Ti, TITAN, TITAN XP).
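The CPU and memory minimums above can be checked with a short script. This is only a sketch under stated assumptions: `check_min` is a hypothetical helper, the host is Linux (`nproc` and `/proc/meminfo`), and the memory threshold is set to 3500 MB rather than 4096 MB because the kernel reports slightly less than the installed RAM. It does not check the CPU generation, the disk size, or the GPU model.

```shell
#!/usr/bin/env bash
# Minimums taken from the hardware requirements above.
MIN_CPUS=2
MIN_MEM_MB=3500   # assumption: leaves room for memory the kernel reserves on a 4GB host

# check_min LABEL ACTUAL REQUIRED: report whether ACTUAL meets REQUIRED.
check_min() {
    if [ "$2" -ge "$3" ]; then
        echo "ok: $1 = $2 (need >= $3)"
    else
        echo "LOW: $1 = $2 (need >= $3)"
        return 1
    fi
}

# Number of CPUs available and total memory in MB; default to 0 if unreadable.
cpus="$(nproc 2>/dev/null)"
mem_mb="$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo 2>/dev/null)"

check_min "CPUs" "${cpus:-0}" "$MIN_CPUS"
check_min "RAM (MB)" "${mem_mb:-0}" "$MIN_MEM_MB"
```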
These instructions describe how to install PEDL for the first time; for directions on how to upgrade an existing PEDL installation, see the Upgrades section below.
On each machine in the PEDL cluster (both master and agents), do the following:
- Install the latest release of Docker and nvidia-docker2.
On Ubuntu 16.04:
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -fsSL https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y --no-install-recommends docker-ce nvidia-docker2
sudo systemctl reload docker
sudo usermod -aG docker $USER
On CentOS 7:
sudo yum install -y yum-utils device-mapper-persistent-data lvm2
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y docker-ce nvidia-docker2
sudo systemctl start docker
- Verify that the current user is in the docker group and that the nvidia Docker runtime is installed:
groups
docker info | grep Runtimes
- Copy the PEDL release to the machine, unpack it, download the PEDL Docker images, and run make install as root:
tar xzvf pedl-0.8.16.tar.gz
cd pedl-0.8.16
./bin/pedl-pull-images
sudo make install
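The verification step in the list above (docker group membership and the nvidia runtime) can also be scripted so that it prints an explicit verdict. This is a sketch, not a PEDL tool: `has_group` and `has_nvidia_runtime` are hypothetical helper names, and the second helper assumes that `docker info` prints a "Runtimes:" line listing the registered runtimes. Note that a group added with `usermod -aG` only becomes active in a new login session.

```shell
#!/usr/bin/env bash
# has_group NAME: succeeds when the current user's active groups include NAME.
has_group() {
    id -nG | tr ' ' '\n' | grep -qx "$1"
}

# has_nvidia_runtime: reads `docker info` output on stdin and succeeds when
# the "Runtimes:" line lists nvidia.
has_nvidia_runtime() {
    grep 'Runtimes' | grep -qw 'nvidia'
}

if has_group docker; then
    echo "ok: user is in the docker group"
else
    echo "missing: user is not in the docker group (log out and back in after usermod)"
fi

if docker info 2>/dev/null | has_nvidia_runtime; then
    echo "ok: nvidia runtime is registered with Docker"
else
    echo "missing: nvidia runtime not found in 'docker info'"
fi
```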
On the master machine, run:
sudo make enable-master
PEDL experiment metadata is stored in a Docker volume named
pedl-db-volume. By default, all Docker volumes are stored in the
/var/lib/docker/volumes/ directory on the host file system; you
should ensure that there is enough free space for all experiment
metadata (~100GB should be safe).
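To check the free space available for experiment metadata, something like the following can be run on the master. It is a sketch under assumptions: `free_gb` is a hypothetical helper, the 100GB threshold is the rough figure given above, and the script falls back to checking `/` when the volume directory does not exist yet or is not readable by the current user (run it as root to inspect the Docker directory itself).

```shell
#!/usr/bin/env bash
VOLUME_DIR="/var/lib/docker/volumes"
MIN_FREE_GB=100   # the "~100GB should be safe" figure from the text above

# free_gb PATH: print the free space, in whole GB, on the file system holding PATH.
# df -Pk prints one POSIX-format line per file system; column 4 is available KB.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 {print int($4 / 1024 / 1024)}'
}

# Assumption: fall back to / when the volume directory is missing or unreadable.
target="$VOLUME_DIR"
[ -d "$target" ] || target="/"

free="$(free_gb "$target")"
if [ "${free:-0}" -ge "$MIN_FREE_GB" ]; then
    echo "ok: ${free}GB free on the file system holding $target"
else
    echo "LOW: ${free:-0}GB free on the file system holding $target (want >= ${MIN_FREE_GB}GB)"
fi
```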
On each agent machine, edit
/usr/local/pedl/etc/agent.conf and set
MASTER_ADDRESS to the IP address or host name where the PEDL master
can be found. Note that
127.0.0.1 should not be used.
sudo make enable-agent
agent.conf. GPUs can also be disabled and enabled via the pedl slot disable and pedl slot enable CLI commands, respectively.
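Before enabling an agent, it can help to sanity-check the value chosen for MASTER_ADDRESS. The sketch below is illustrative only: `is_loopback` and `check_master_address` are hypothetical helpers, and 192.0.2.10 is a placeholder address from the documentation range, not a real master. The text above rules out 127.0.0.1; rejecting `localhost` and `::1` as well is an assumption that the same restriction applies to any loopback address.

```shell
#!/usr/bin/env bash
# is_loopback ADDR: succeeds when ADDR is a loopback name or address.
is_loopback() {
    case "$1" in
        127.*|localhost|::1) return 0 ;;
        *) return 1 ;;
    esac
}

# check_master_address ADDR: print a verdict for a candidate MASTER_ADDRESS.
check_master_address() {
    if [ -z "$1" ]; then
        echo "error: MASTER_ADDRESS is empty"
        return 1
    elif is_loopback "$1"; then
        echo "error: $1 is a loopback address"
        return 1
    else
        echo "ok: $1"
    fi
}

# Placeholder address; replace with the host name or IP of the PEDL master.
check_master_address "192.0.2.10"
```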
PEDL also includes a CLI for interacting with the system from the command-line. This is distributed as a Python wheel; you can install this wheel into your Python installation or into one or more virtualenvs of your choosing:
pip install pedl-*.whl
Run pedl --help for usage notes.
Upgrading an existing PEDL installation requires roughly similar steps to installing PEDL for the first time. During the upgrade process, all running experiments will be checkpointed and temporarily suspended. Once the upgrade is complete, all suspended experiments will be resumed automatically.
Disable all PEDL agents in the cluster:
pedl agent disable --all
This will cause all tasks running on those agents to be checkpointed and terminated. The checkpoint process might take some time to complete; you can monitor which tasks are still running via pedl slot list.
Shut down the PEDL agent running on each agent machine:
sudo systemctl stop pedl-agent
You can verify whether the agent has stopped using systemctl is-active pedl-agent.
Take a backup of the PEDL database:
pedl-db-backup pedl-db-`date "+%m-%d-%y"`.dump
This is a safety precaution in case any problems occur after upgrading PEDL.
Shut down the PEDL master and database service:
sudo systemctl stop pedl-master
sudo systemctl stop pedl-db
Follow the instructions above to copy the new version of PEDL to each machine, download the updated Docker images, and run make install. Reload the systemd configuration (sudo systemctl daemon-reload) and then enable the PEDL master and agent services on each host, as appropriate. When the master is restarted, all previously active experiments will be resumed.
Upgrade the CLI by installing (pip install -U) the new Python wheel. Be sure to do this for every user or virtualenv that has installed the old version of the CLI.
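The master-side part of the upgrade (back up the database, then stop the master and database services) must happen in that order, so it can be convenient to wrap it in a small script. The following is a sketch rather than a PEDL tool: `run` and `stop_master` are hypothetical helpers, the commands are the ones given in the steps above, and the script defaults to a dry run that only prints what it would do.

```shell
#!/usr/bin/env bash
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 in the environment to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD ARGS...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# stop_master: back up the database, then stop the master and database
# services, in the order given in the upgrade steps. Stops at the first failure,
# so the services are not stopped if the backup did not succeed.
stop_master() {
    run pedl-db-backup "pedl-db-$(date '+%m-%d-%y').dump" &&
    run sudo systemctl stop pedl-master &&
    run sudo systemctl stop pedl-db
}

stop_master
```

Run it once as is to review the printed commands, then again with DRY_RUN=0 on the master machine after all agents have been disabled and stopped.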
To view the logs associated with any of the PEDL services (master, agent, or metadata DB), use:
journalctl -u pedl-agent
journalctl -u pedl-db
journalctl -u pedl-master
To verify that a host can run containers that use GPUs, run:
docker run --runtime=nvidia --rm nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04 nvidia-smi
To reset the content of the PEDL metadata DB, use pedl-db-reset. Note that pedl-db-reset will delete the entire PEDL database; it should only be used in extreme circumstances.