Installation

This document describes how to install and upgrade PEDL. PEDL consists of several components:

  • a master that schedules workloads and stores metadata
  • one or more agents that run workloads, typically using GPUs

The PEDL master and agents should typically be installed and configured by a system administrator. Each user of PEDL should also install a copy of the command-line tools, as described below.

These instructions describe how to install PEDL in "bare metal" mode. PEDL can also run on top of Kubernetes; for instructions on installing and using PEDL with Kubernetes, see the documentation.

System Requirements

Software

  • PEDL agent and master nodes must be configured with either Ubuntu 16.04 LTS or CentOS 7.

  • To run jobs with GPUs, the Nvidia drivers must be installed on each PEDL agent. PEDL requires version >= 384.81 of the Nvidia drivers. (The Nvidia drivers can be installed as part of installing CUDA, but the rest of the CUDA toolkit is not required.)
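
The driver requirement above can be checked from the shell. The following is a minimal sketch, assuming nvidia-smi is on the PATH and GNU sort (for the -V version ordering) is available; the version_ge helper is a hypothetical name introduced here for illustration.

```shell
# Succeed if dotted version $1 is greater than or equal to dotted version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required=384.81

if command -v nvidia-smi >/dev/null 2>&1; then
    # Ask the driver for its own version; one line is printed per GPU.
    installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if version_ge "$installed" "$required"; then
        echo "Nvidia driver $installed meets the minimum ($required)"
    else
        echo "Nvidia driver $installed is older than the minimum ($required)"
    fi
else
    echo "nvidia-smi not found; the Nvidia drivers may not be installed"
fi
```

On a machine without GPUs (for example, a master-only node), the script just reports that nvidia-smi is missing.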

Hardware

  • The PEDL master node should be configured with >= 2 CPUs (Intel Broadwell or later), 4GB of RAM, and 100GB of disk storage. Note that the PEDL master can be run on a machine that does not have GPUs.

  • Each PEDL agent node should be configured with >= 2 CPUs (Intel Broadwell or later), 4GB of RAM, and 50GB of disk space. If using GPUs, Nvidia GPUs with compute capability 3.7 or greater are required (e.g., K80, P100, V100, GTX 1080, GTX 1080 Ti, TITAN, TITAN XP).
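
As a rough aid, the CPU and RAM minimums above can be compared against a machine with a few standard commands. This is only a sketch: check_min is a hypothetical helper, the RAM figure is rounded up because the kernel reports slightly less than the installed amount, and neither the CPU generation nor disk space is checked here.

```shell
# Print one line comparing a measured value against a documented minimum.
# $1 = label, $2 = measured value, $3 = minimum
check_min() {
    if [ "$2" -ge "$3" ]; then
        echo "$1: $2 (minimum $3) OK"
    else
        echo "$1: $2 (minimum $3) TOO LOW"
    fi
}

# Number of CPUs currently online.
cpus=$(getconf _NPROCESSORS_ONLN)
check_min "CPUs" "$cpus" 2

# MemTotal is given in KiB; round up to whole GiB before comparing.
if [ -r /proc/meminfo ]; then
    mem_gb=$(awk '/^MemTotal:/ { print int(($2 + 1048575) / 1048576) }' /proc/meminfo)
    check_min "RAM (GiB)" "$mem_gb" 4
fi
```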

Installation

These instructions describe how to install PEDL for the first time; for directions on how to upgrade an existing PEDL installation, see the Upgrades section below.

On each machine in the PEDL cluster (both master and agents), do the following:

  1. Install the latest release of Docker and nvidia-docker2.

    On Ubuntu:

    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

    curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -fsSL https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

    sudo apt-get update && sudo apt-get install -y --no-install-recommends docker-ce nvidia-docker2
    sudo systemctl reload docker
    sudo usermod -aG docker $USER

    On CentOS:

    sudo yum install -y yum-utils device-mapper-persistent-data lvm2
    sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -fsSL https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo

    sudo yum install -y docker-ce nvidia-docker2
    sudo systemctl start docker
    sudo usermod -aG docker $USER

  2. Log out and start a new terminal session. Verify that the current user is in the docker group and that the nvidia Docker runtime is installed:

    groups
    docker info | grep Runtimes

  3. Copy the PEDL installation tarball to the machine, extract it, download the necessary Docker images, and run make install as root:

    tar xzvf pedl-0.8.8.tar.gz
    cd pedl-0.8.8
    ./bin/pedl-pull-images
    sudo make install

    If desired, add /usr/local/pedl/bin to your PATH.
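
The two checks in step 2 can also be scripted rather than read by eye. This is a minimal sketch; in_docker_group and has_nvidia_runtime are hypothetical helper names, and both read the output of the commands shown above from standard input.

```shell
# Succeed if "docker" appears as a whole word in the group list on stdin.
in_docker_group() {
    tr ' ' '\n' | grep -qx 'docker'
}

# Succeed if the "Runtimes:" line of `docker info` output on stdin lists nvidia.
has_nvidia_runtime() {
    grep 'Runtimes:' | tr ' ' '\n' | grep -qx 'nvidia'
}

if groups | in_docker_group; then
    echo "docker group: ok"
else
    echo "docker group: missing (has a new login session been started?)"
fi

if command -v docker >/dev/null 2>&1; then
    if docker info 2>/dev/null | has_nvidia_runtime; then
        echo "nvidia runtime: ok"
    else
        echo "nvidia runtime: not listed by docker info"
    fi
else
    echo "docker: not installed"
fi
```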

Master

On the master machine, run:

sudo make enable-master

This enables and starts the PEDL master and database services. These services will now be started when the machine boots.
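
To confirm that the services are running and will start at boot, systemd can be queried directly. The sketch below assumes the unit names pedl-master and pedl-db, which are the names used in the Upgrades section; unit_status is a hypothetical helper.

```shell
# Print "<unit>: <active state>, <enablement state>" for each unit given.
unit_status() {
    for unit in "$@"; do
        active=$(systemctl is-active "$unit" 2>/dev/null)
        enabled=$(systemctl is-enabled "$unit" 2>/dev/null)
        echo "$unit: ${active:-unknown}, ${enabled:-unknown}"
    done
}

unit_status pedl-master pedl-db
```

On a healthy master, both units would be expected to report active and enabled. The same helper can be pointed at pedl-agent on an agent machine.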

Note

PEDL experiment metadata is stored in a Docker volume named pedl-db-volume. By default, all Docker volumes are stored in the /var/lib/docker/volumes/ directory on the host file system; you should ensure that there is enough free space for all experiment metadata (~100GB should be safe).
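
One way to check the free space mentioned in the note is with df. This is a sketch that assumes the default Docker volume location; free_gb is a hypothetical helper. Because /var/lib/docker is usually readable only by root, the check may need to be run with sudo.

```shell
# Print the free space, in whole GiB, on the filesystem that holds the given path.
free_gb() {
    df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

volumes_dir=/var/lib/docker/volumes
avail=$(free_gb "$volumes_dir")

if [ -z "$avail" ]; then
    echo "could not inspect $volumes_dir (missing, or root access needed)"
elif [ "$avail" -ge 100 ]; then
    echo "$avail GiB free under $volumes_dir"
else
    echo "only $avail GiB free under $volumes_dir; ~100GB is suggested"
fi
```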

Agent

On each agent machine, edit /usr/local/pedl/etc/agent.conf and set MASTER_ADDRESS to the IP address or host name where the PEDL master can be found. Note that localhost or 127.0.0.1 should not be used.
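
A quick sanity check of the configured address can catch the loopback mistake before the agent is started. The sketch below assumes that agent.conf holds plain KEY=value lines, which is not confirmed by this document; valid_master_address is a hypothetical helper.

```shell
# Reject empty or loopback values for MASTER_ADDRESS; accept anything else.
valid_master_address() {
    case "$1" in
        ''|localhost|127.0.0.1) return 1 ;;
        *) return 0 ;;
    esac
}

conf=/usr/local/pedl/etc/agent.conf

if [ -r "$conf" ]; then
    # Assumption: the setting appears as a line of the form MASTER_ADDRESS=value.
    addr=$(sed -n 's/^MASTER_ADDRESS=//p' "$conf" | tail -n 1 | tr -d "\"'")
    if valid_master_address "$addr"; then
        echo "MASTER_ADDRESS=$addr"
    else
        echo "MASTER_ADDRESS is unset or points at the local machine: '$addr'"
    fi
fi
```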

Next, run:

sudo make enable-agent

This enables and starts the PEDL agent service, which will now be started when the machine boots. By default, the agent uses all the GPUs on the machine to run PEDL tasks. To configure the agent to use only specific GPUs, set the GPU_LIST variable in agent.conf. GPUs can also be disabled and enabled via the pedl slot disable and pedl slot enable CLI commands, respectively.
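
When deciding which GPUs to name in GPU_LIST, it can help to see how the driver enumerates them. The sketch below only lists the device indices reported by nvidia-smi -L; the exact value format that GPU_LIST expects is not stated here, so treat the mapping from these indices to GPU_LIST as an assumption to confirm. gpu_indices is a hypothetical helper.

```shell
# Extract device indices from `nvidia-smi -L` output on stdin.
# Each input line looks like: "GPU 0: <model name> (UUID: GPU-...)"
gpu_indices() {
    sed -n 's/^GPU \([0-9][0-9]*\):.*/\1/p'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L               # full listing, including model names and UUIDs
    nvidia-smi -L | gpu_indices # just the indices, one per line
fi
```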

Command-Line Tools

PEDL also includes a CLI for interacting with the system from the command line. This is distributed as a Python wheel; you can install this wheel into your Python installation or into one or more virtualenvs of your choosing:

pip install pedl-*.whl

Once the CLI has been installed, see pedl --help for usage notes.
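
For a per-user install that leaves the system Python untouched, the wheel can go into a dedicated virtual environment. This is a sketch under a few assumptions: that the CLI supports Python 3, that the wheel installs a pedl executable into the environment's bin directory, and that the paths shown are placeholders. install_pedl_cli is a hypothetical helper.

```shell
# Create a virtual environment and install the PEDL CLI wheel into it.
# $1 = directory for the new virtual environment, $2 = path to the wheel file
install_pedl_cli() {
    venv_dir=$1
    wheel=$2
    if [ ! -f "$wheel" ]; then
        echo "wheel not found: $wheel" >&2
        return 1
    fi
    python3 -m venv "$venv_dir" &&
        "$venv_dir/bin/pip" install "$wheel" &&
        "$venv_dir/bin/pedl" --help
}

# Example (placeholder paths; substitute the wheel shipped with your release):
#   install_pedl_cli "$HOME/pedl-venv" /path/to/pedl-wheel.whl
```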

Upgrades

Upgrading an existing PEDL installation follows roughly the same steps as installing PEDL for the first time. During the upgrade process, all running experiments will be checkpointed and temporarily suspended. Once the upgrade is complete, all suspended experiments will be resumed automatically.

  1. Disable all PEDL agents in the cluster:

    pedl agent disable --all
    
    This will cause all tasks running on those agents to be checkpointed and terminated. The checkpoint process might take some time to complete; you can monitor which tasks are still running via pedl slot list.

  2. Shut down the PEDL agent running on each agent machine:

    sudo systemctl stop pedl-agent
    
    You can verify whether the agent has stopped using systemctl is-active pedl-agent.

  3. Shut down the PEDL master and database services:

    sudo systemctl stop pedl-master
    sudo systemctl stop pedl-db

  4. Follow the instructions above to copy the new version of PEDL to each machine, download the updated Docker images, and run make install. Reload the systemd configuration (sudo systemctl daemon-reload), then enable the PEDL master or agent service on each host as appropriate (sudo make enable-master or sudo make enable-agent). When the master is restarted, all previously active experiments will be resumed.

  5. Upgrade the CLI by installing (pip install -U) the new Python wheel. Be sure to do this for every user or virtualenv that has installed the old version of the CLI.
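
When scripting steps 2 and 3, it can be useful to wait until a service has really stopped before moving on. The sketch below polls systemctl is-active, the same check mentioned in step 2; wait_until_stopped is a hypothetical helper, and the timeouts in the usage comments are arbitrary.

```shell
# Poll until the given systemd unit is no longer active.
# $1 = unit name, $2 = timeout in seconds (default 60)
# Returns 0 once the unit is inactive, 1 if it is still active at the timeout.
wait_until_stopped() {
    unit=$1
    timeout=${2:-60}
    waited=0
    while systemctl is-active --quiet "$unit" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example use on an agent machine:
#   sudo systemctl stop pedl-agent && wait_until_stopped pedl-agent 120
# Example use on the master machine:
#   sudo systemctl stop pedl-master && wait_until_stopped pedl-master 120
#   sudo systemctl stop pedl-db && wait_until_stopped pedl-db 120
```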

Troubleshooting Tips

To view the logs associated with any of the PEDL services (master, agent, or metadata DB), use:

journalctl -u pedl-agent
journalctl -u pedl-db
journalctl -u pedl-master
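
The standard journalctl options for limiting a time window also apply here. As a sketch, the hypothetical pedl_logs wrapper below shows recent entries for one service; reading the journal may require root or membership in a group such as systemd-journal, depending on the distribution.

```shell
# Show recent log entries for one PEDL service.
# $1 = master, agent, or db; $2 = optional start time (default "1 hour ago")
pedl_logs() {
    case "$1" in
        master|agent|db) ;;
        *)
            echo "usage: pedl_logs master|agent|db [since]" >&2
            return 2
            ;;
    esac
    journalctl -u "pedl-$1" --since "${2:-1 hour ago}" --no-pager
}

# Examples:
#   pedl_logs agent
#   pedl_logs master "2 hours ago"
# To follow a service's log live instead, use: journalctl -u pedl-agent -f
```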

To verify that a host can run containers that use GPUs, run:

docker run --runtime=nvidia --rm nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04 nvidia-smi

To reset the content of the PEDL metadata DB, use pedl-db-reset.

Warning

Running pedl-db-reset deletes the entire PEDL database; it should only be used in extreme circumstances.