Prometheus & Grafana#

Supported Versions

Grafana 8.3.0+

Prometheus 2.14.0

Determined 0.17.6+

Discover how to enable a Grafana dashboard to monitor Determined hardware and system metrics. Determined provides a Prometheus endpoint that contains mappings between internal task, GPU, and container definitions, which are used by Prometheus to collect relevant metrics on a cluster running Determined. The endpoint is enabled by default but can be disabled in the master configuration file.

Visit Kubernetes Observability for instructions on enabling observability in Kubernetes environments.

Reference#

Grafana

Prometheus

cAdvisor

dcgm-exporter

Constraints#

The Determined Prometheus endpoint is configured to work with cAdvisor for CPU metrics and dcgm-exporter (DCGM) for GPU metrics. The default ports for cAdvisor, 8080, and dcgm-exporter, 9400, are hardcoded into the Prometheus endpoint. If an agent is running on the same machine as the master, the master must be running on a port other than 8080 for cAdvisor metrics to be scraped. Although other monitoring tools can be used with the setup, this guide describes only cAdvisor and DCGM tool configuration. Prometheus queries on metrics collected by other tools can differ, depending on the format and organization of the returned metrics.

Prerequisites#

  • A Grafana installation for dashboard monitoring.

  • An on-cluster Prometheus instance for time-series data collection.

Configure Determined#

Install and run Determined on a cluster. When launching the master instance, the Prometheus endpoints are enabled by default in versions 0.32.0 and later. To manually enable the Prometheus endpoints, add a flag to the master.yaml configuration file:

observability:
    enable_prometheus: true

This enables the following two Prometheus API endpoints on the instance.

  • {$DET_MASTER_ADDR}/prom/det-state-metrics:

    The det-state-metrics endpoint includes various machine-level label mappings to internal Determined entities, such as GPU UUIDs and container IDs to task, allocation, and experiment labels, which are used by PromQL to join vectors.

  • {$DET_MASTER_ADDR}/prom/det-http-sd-config:

    The det-http-sd-config endpoint contains address and resource pool information for currently active agents. These are used by Prometheus as targets for scraping. This endpoint is configured to support running default cAdvisor, port 8080, and DCGM, port 9400, monitoring. Other tools exposing Prometheus metrics can be used instead of cAdvisor and DCGM if they are running on these ports.

Configure cAdvisor and dcgm-exporter#

The cAdvisor and dcgm-exporter monitoring tools must be running on the cluster agents to be monitored. These can be installed manually or run as individual Docker containers.

To configure dynamic agents to start up with cAdvisor and dcgm-exporter, add the following startup script to the master.yaml file:

- pool_name: compute-pool
    provider:
      startup_script: |
        # Run dcgm-exporter on 9400
        docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:2.3.2-2.6.3-ubuntu20.04

        # Run cAdvisor on 8080
        VERSION=v0.36.0
        docker run \
          --volume=/:/rootfs:ro \
          --volume=/var/run:/var/run:ro \
          --volume=/sys:/sys:ro \
          --volume=/var/lib/docker/:/var/lib/docker:ro \
          --volume=/dev/disk/:/dev/disk:ro \
          --publish=8080:8080 \
          --detach=true \
          --name=cadvisor \
          --privileged \
          --device=/dev/kmsg \
          gcr.io/cadvisor/cadvisor:$VERSION

This example startup script includes the default setup docker commands provided by dcgm-exporter and cAdvisor.

Configure Prometheus#

  1. Install Prometheus on any node in the monitored cluster.

  2. Launch Prometheus with the provided prometheus.yml configuration file.

  3. To replace the placeholder master address, you’ll need to edit the Prometheus configuration file.

    • The metric_relabel_configs parameter edits certain label names in jobs for joining in PromQL.

    • The scrape_interval parameter values can be modified to optimize for resolution/size/time.

    • The $PATH_TO_TOKEN specifies a path to an authorization token for the Determined master. This can be kept in a local file by running the token-refresh.sh script in the same directory with a CRON job (set to run daily).

Configure Grafana#

A Grafana instance can be installed on any machine that adds the above Prometheus address as a data source. After the Grafana server is running and the Web UI is accessible, follow these steps:

  1. Add a Prometheus data source in Grafana -> Configuration -> Data Sources -> Add data source.

    Configure the Prometheus data source set up in the previous section by setting the URL to your running Prometheus server address. By default, this is the machine address on port 9090.

  2. After the Prometheus data source connects, import the Determined Hardware Metrics dashboard JSON file in Grafana -> Create -> Import -> Import using panel JSON.

Example#

Following the above configuration steps and after submitting experiments on the cluster, you should see populated panels in the imported Grafana dashboard: Grafana -> Dashboards.

Grafana Dashboard

Each panel in the dashboard is powered by one or more Prometheus queries and tracks a specific metric on the cluster as a percentage of total capacity. Results can be further filtered using tags and resource pool and time range in Grafana.

Alerts#

The det-master-api-server provides a metric, determined_healthy, that can be used to set up alerts. This metric will return 1 when Determined can access its major dependencies and 0 when it cannot. On Kubernetes, inability to access the Kubernetes API server will cause this metric to return 0. On Slurm, failure to access the launcher will also cause this metric to return 0. If the database is down, it is possible Prometheus will be unable to scrape this metric.

To create an alert in Grafana, navigate to the Alert Rules page and use the Prometheus data source configured earlier. You can use the following query to set up the alert.

1 - determined_healthy{job="det-master-api-server"}

Grafana Alert Configuration

Since Prometheus may be unable to scrape Determined under certain circumstances, it is recommended to set Alert state if no data or all values are null to Alerting.

For more information on using Grafana alerts, visit the Grafana documentation.