Advanced Installation#

Using Determined requires a training environment. Your training environment can be a local development machine, an on-premise GPU cluster, or cloud resources.

This checklist helps you get started setting up a new training environment for your organization. After completing these steps, your users will be able to see and access your Determined cluster.

Prerequisites#

To complete the items in this checklist, ensure your system meets Advanced Installation Requirements.

About Offline Installations#

  • If your master and compute nodes are offline, you’ll need a local private registry that can serve the required images (PostgreSQL and task container images).

  • You can install the Determined CLI package on your client machines and then take them offline again.

  • In addition, a local PyPI mirror is highly recommended so that your task environments can install Python packages without internet access. See also: Infrastructure Considerations.

Set Up PostgreSQL#

Determined uses a PostgreSQL database to store experiment and trial metadata. Choose the installation method that best fits your environment and requirements.

Note

Kubernetes

If you are using Kubernetes, you can skip this step. Installing Determined on Kubernetes uses the Determined Helm Chart which includes deployment of a PostgreSQL database.

Note

Cloud Services

  • AWS. The Determined CLI manages the process of provisioning an Amazon RDS instance for PostgreSQL.

  • GCP. The Determined CLI manages the setup of Google Cloud SQL instances for PostgreSQL.

Installing Determined using Linux Packages pulls in the official Docker image for PostgreSQL.
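
If you run PostgreSQL yourself, the master needs to know how to reach it. The following is a minimal sketch that writes a db fragment for the master configuration to a scratch file. The file name, host name, and the DB_PASSWORD variable are placeholders assumed for this example; check the field names against the master configuration reference for your version before merging the fragment.

```shell
# Illustrative only: the password default, host name, and file name are placeholders.
DB_PASSWORD="${DB_PASSWORD:-change-me}"

# Write a db fragment that could be merged into the master configuration.
cat > master-db.yaml <<EOF
db:
  user: postgres
  password: "${DB_PASSWORD}"
  host: postgres.internal.example.com
  port: 5432
  name: determined
EOF

# Print the fragment for review.
cat master-db.yaml
```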

Install Determined#

Once PostgreSQL is set up, you’ll install Determined. This includes deploying the Determined master, configuring checkpoint storage, setting up resource pools, and configuring the cluster.

Deploy Determined Master#

To install Determined, decide whether you want to deploy the Determined master on premises or in the cloud.

If the Determined agent is your compute resource, you’ll install the Determined agent along with the Determined master. The preferred method for installing the agent is to use Linux packages; the recommended alternative is Docker.

To install the Determined master and agent on premises, first make sure your system meets the installation requirements (see Prerequisites).

Once you’ve met the installation requirements, install the Determined master and agent using your chosen method.

The installation instructions cover editing the YAML configuration files for the master and each agent, as well as configuring and starting the cluster.

Configure Checkpoint Storage#

A checkpoint contains the architecture and weights of the model being trained. If checkpoint_storage is not specified in an experiment’s configuration, the experiment defaults to the checkpoint storage configured in the master configuration.

To learn more about configuring checkpoint storage, visit Checkpoint Storage.
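
As a rough illustration of a cluster-wide default, the sketch below writes a checkpoint_storage fragment that uses the shared_fs storage type. The file name and host_path are placeholders assumed for this example, and the retention values shown are only one possible policy.

```shell
# Illustrative only: the file name and host_path are placeholders.
cat > master-checkpoint-storage.yaml <<'EOF'
checkpoint_storage:
  type: shared_fs
  host_path: /mnt/determined-checkpoints
  save_experiment_best: 0
  save_trial_best: 1
  save_trial_latest: 1
EOF

# Print the fragment so it can be reviewed before merging it into the master configuration.
cat master-checkpoint-storage.yaml
```

An experiment that sets its own checkpoint_storage block would override this default for that experiment only.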

Configure Resource Pools#

When deploying the Determined master and compute resources (such as a Determined agent), you must also configure resource pools.

How Resource Pools Work

Both the Determined master and the compute resources, such as the Determined agents, come with their individual configuration files. Among other things, these files define the resource pools and specify how resources communicate and are allocated.

For instance, a Determined agent, which is a kind of compute resource, is part of a resource pool. Its configuration file not only helps it communicate with the Determined master but also dictates which resource pool it should connect to. By default, an agent will attempt to connect to the “default” pool. However, if the “default” pool doesn’t exist, the agent will remain unconnected.

Setting Up an On-Prem Determined Agent

For an on-prem Determined agent installation, the process involves the following steps:

  • Configure resource pools. These resource pools enable the segregation of tasks based on their resource requirements.

  • Configure the agents to establish a connection to the Determined master. Then link the agents with their respective resource pools. For reference, visit resource_pool under Agent Configuration Reference.
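
The two steps above can be sketched as a pair of configuration fragments: one for the master that declares the pools, and one for an agent that joins a pool. The pool name gpu-pool, the host name, and the file names are hypothetical values chosen for this example. The final check simply confirms that the pool named in the agent fragment is declared in the master fragment.

```shell
# Illustrative only: pool names, host name, and file names are placeholders.

# Master side: declare the resource pools.
cat > master-resource-pools.yaml <<'EOF'
resource_pools:
  - pool_name: default
    description: General-purpose pool
  - pool_name: gpu-pool
    description: Agents with GPUs
EOF

# Agent side: point the agent at the master and name the pool it joins.
cat > agent-gpu.yaml <<'EOF'
master_host: determined-master.example.com
master_port: 8080
resource_pool: gpu-pool
EOF

# Sanity check: the agent's pool must be declared on the master.
agent_pool="$(awk '/^resource_pool:/ {print $2}' agent-gpu.yaml)"
if grep -q "pool_name: ${agent_pool}\$" master-resource-pools.yaml; then
  echo "agent pool '${agent_pool}' is defined on the master"
else
  echo "agent pool '${agent_pool}' is missing from the master config" >&2
fi
```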

Configure the Cluster#

Once you have set up the necessary components for your environment, configure the cluster. When configuring your cluster, keep the configuration references for the master and agents handy.

Configure Security#

After installing Determined, set up your security features.

Attention

Security features, with the exception of TLS, are only available on Determined Enterprise Edition (Determined EE).

TLS#

The use of Transport Layer Security (TLS) is highly recommended and, unlike the other security features, does not require Determined EE.
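
For orientation, a master-side TLS setting might look like the fragment written below. This is a sketch under assumptions: the certificate and key paths and the file name are placeholders, and you should confirm the exact keys in the master configuration reference for your version.

```shell
# Illustrative only: certificate paths and the file name are placeholders.
cat > master-tls.yaml <<'EOF'
security:
  tls:
    cert: /etc/determined/tls/master.crt
    key: /etc/determined/tls/master.key
EOF

# Print the fragment for review.
cat master-tls.yaml
```

Once TLS is enabled, clients would typically address the master with an https:// URL.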

User Authentication (SSO)#

Determined offers several options for user authentication:

  • OAuth 2.0 Configuration: Enable, list, and remove OAuth clients.

  • OpenID Connect Integration: Integrate OpenID Connect, with an Okta example.

  • SAML Integration: Integrate Security Assertion Markup Language (SAML) authentication to use single sign-on (SSO) with your organization’s identity provider (IdP).

  • SCIM Integration: Integrate System for Cross-domain Identity Management (SCIM) so that administrators can easily and securely provision users and groups.

Note

For Kubernetes deployments, you modify the master-related configurations through the Helm chart.

Non-Root Containers#

You can enhance security and limit potential malicious activity by running containers as non-root users. Determined allows you to run tasks as specific agent users and run unprivileged tasks by default.

Important

Red Hat® OpenShift® users should not follow these instructions for configuring non-root containers, as OpenShift’s configuration conflicts with the approach described here.

To run containers as non-root users, you’ll first need to set up your non-root user:

  • Choose a Determined user to configure, preferably one that has not yet gone through the det user link-with-agent-user process and that you plan to link with an agent user eventually. If no suitable Determined user exists, consider creating a test user for this purpose, which you can disable afterwards.

  • Link this user to the actual username/UID and groupname/GID. One way to do this is to use the following command (you can also use the WebUI):

    det user link-with-agent-user \
       --agent-user $THE_USER \
       --agent-uid $THE_UID \
       --agent-group $THE_GROUP \
       --agent-gid $THE_GID \
       $THE_DETERMINED_USER
    
  • Start a shell as the specified user:

    det -u $THE_DETERMINED_USER shell start
    
  • In the shell, verify the username/UID and groupname/GID with id -a.

  • After confirming that non-root containers work, perform a test run, as the modified Determined user, of each training job you normally run. This ensures the training jobs run successfully without root privileges.
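
The identity check from the steps above can also be scripted rather than read off id -a by eye. The following is a minimal sketch meant to be run inside the task shell. EXPECTED_UID and EXPECTED_GID are hypothetical variables introduced for this example; they default to the current values so the snippet runs anywhere, so set them to the UID and GID you linked the user to.

```shell
# Illustrative only: EXPECTED_UID and EXPECTED_GID are hypothetical variables.
# They default to the current identity so the sketch runs without any setup.
EXPECTED_UID="${EXPECTED_UID:-$(id -u)}"
EXPECTED_GID="${EXPECTED_GID:-$(id -g)}"

# Compare the current UID and GID against the expected values.
check_identity() {
  actual_uid="$(id -u)"
  actual_gid="$(id -g)"
  if [ "$actual_uid" = "$1" ] && [ "$actual_gid" = "$2" ]; then
    echo "identity matches: uid=$actual_uid gid=$actual_gid"
    return 0
  fi
  echo "identity mismatch: uid=$actual_uid gid=$actual_gid (expected uid=$1 gid=$2)" >&2
  return 1
}

check_identity "$EXPECTED_UID" "$EXPECTED_GID"

# Flag the case this procedure is meant to avoid.
if [ "$(id -u)" -eq 0 ]; then
  echo "warning: this shell is still running as root" >&2
fi
```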

Note

For Kubernetes deployments, configure the security context for running containers as a non-root user.

Configure Role-Based Access Control (RBAC)#

Consider configuring role-based access control (RBAC) before creating workspaces and projects. To configure RBAC, visit RBAC.

Attention

RBAC is only available on Determined Enterprise Edition.

Infrastructure Considerations#

When setting up Determined, you can adjust certain configurations for enhanced security and performance. While these are particularly crucial for offline installations, they can also benefit online installations by ensuring faster package retrieval and increased security.

Configure Local Docker Image Repositories#

Configuring local Docker image repositories can enhance security and optimize performance. Learn how to configure local Docker image repositories in Customizing Your Environment.

Configure Local PyPI Mirrors#

Consider configuring local PyPI mirrors for:

  • Security: An airgapped cluster, isolated from the public internet, mandates local mirrors for proper functionality. This also safeguards against potential vulnerabilities associated with fetching packages from external sources.

  • Performance: Local mirrors can substantially reduce the time taken to fetch packages, eliminating potential lags due to network issues or external server overloads.
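
One common way to point pip at a mirror is a pip configuration file. The sketch below writes one and tells pip where to find it through the PIP_CONFIG_FILE environment variable. The mirror URL and the MIRROR_URL variable are placeholders assumed for this example; how you deliver the file or variable into your task environments depends on your setup.

```shell
# Illustrative only: the mirror URL is a placeholder for your internal index.
MIRROR_URL="${MIRROR_URL:-https://pypi.internal.example.com/simple}"

# Write a pip configuration file that uses the mirror as the package index.
cat > pip.conf <<EOF
[global]
index-url = ${MIRROR_URL}
EOF

# Tell pip to read this file instead of its default configuration locations.
export PIP_CONFIG_FILE="$PWD/pip.conf"
echo "pip will read its index from: ${PIP_CONFIG_FILE}"
```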

Additional Options#

Create Workspaces and Projects#

Determined lets you organize and control access to your experiments by team or department. To do this, you can create Workspaces and Projects based on your RBAC groups. Once your workspaces are set up, you can bind resource pools to them.

Set Up Monitoring Tools#

To set up your monitoring tools, visit Prometheus & Grafana.

Configure InfiniBand#

You may choose to configure InfiniBand when connecting multiple data streams in a single connection.

Set Up Clients#

You can set up clients that interact with the Determined master through the CLI. This gives users efficient access for running tasks without having to go through the WebUI.
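
A client setup might look like the following sketch. It assumes the CLI is installed from the determined package on PyPI and that the master address is supplied through the DET_MASTER environment variable. The address shown is a placeholder for your own cluster.

```shell
# Illustrative only: the default master address is a placeholder.
export DET_MASTER="${DET_MASTER:-http://determined-master.example.com:8080}"

if command -v det >/dev/null 2>&1; then
  # With the CLI installed, list experiments to confirm the master is reachable.
  det experiment list
else
  echo "det CLI not found; install it with: pip install determined"
fi
```

Setting DET_MASTER once per shell avoids having to pass the master address on every command.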

Test Your Setup#

Test your setup to ensure it is functioning correctly.

Test that you can run a single CPU/GPU training job.

  1. Download the mnist_pytorch.tgz file to a local directory.

  2. Open a terminal window, extract the files, and cd into the mnist_pytorch directory:

    tar xzvf mnist_pytorch.tgz
    cd mnist_pytorch
    
  3. In the mnist_pytorch directory, create an experiment specifying the const.yaml configuration file:

    det experiment create const.yaml .
    

    You should receive confirmation that the experiment is created:

    Preparing files (.../mnist_pytorch) to send to master... 8.6KB and 7 files
    Created experiment 1
    
  4. Enter the cluster address in the browser address bar to view experiment progress in the WebUI.

    You should be able to see your experiment ID and its status.

Next Steps#

Congratulations! You have set up your Determined environment! Your users should be able to see and connect to the Determined master.