Data Access for Experiments

Nearly all experiments require training and validation data. In Determined, data access depends on the cluster infrastructure and the data storage.

At the end of this tutorial, the reader should know:
  1. Data Access Options

  2. How To Access Your Data

Overview

Data is an important part of model development. Each user has different storage options depending on their data characteristics, such as privacy, internal infrastructure, and data size. Determined supports the most common data storage infrastructures so that it can integrate into the user's development environment. This tutorial highlights the three main methods to access data:

  1. Data in the cloud (Recommended): Data stored with a cloud provider (GCP or AWS)

  2. Data on premises: Data available on the Determined cluster

  3. Small Data: Data smaller than 96MB

The following sections demonstrate each approach to accessing data.

Data On-Premises

This section describes how to configure our experiment workflows to reference data stored in on-premises infrastructure. This option makes the following assumptions:

  • The Determined cluster and data storage are both managed on-premises.

  • All Determined agent instances have the desired data available at the same filesystem path. In Determined clusters with more than a single agent, network-attached storage typically exposes this data to each Determined agent instance at the same path.

The network-attached storage must then be mounted into the trial containers. The experiment configuration contains a bind_mounts section. Each bind mount has a host_path and a container_path. The host_path specifies the absolute filesystem path where the data is available on each agent instance. In other words, the host_path points to where the network-attached storage is located on the instance. The container_path specifies where the data should be mounted in the trial container, which is the path the model definition source code uses to access the data from inside the container filesystem.

For easier management, set the container_path to the same location as the host_path. It is also recommended to set read_only to true on each bind mount, which guarantees that the experiment can only read this data and cannot modify it.

The following example assumes a Determined cluster is configured with some data available at /mnt/data on each agent. The experiment configuration is written as follows:

bind_mounts:
  - host_path: /mnt/data
    container_path: /mnt/data
    read_only: true
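The recommendations above (absolute host_path, matching container_path, read_only set to true) can be checked mechanically before an experiment is submitted. The following is a minimal sketch, assuming the bind_mounts section has already been loaded into a list of Python dictionaries. The function name check_bind_mounts is hypothetical and is not part of Determined.

```python
import posixpath


def check_bind_mounts(bind_mounts):
    """Return a list of warnings for a list of bind mount dictionaries.

    Hypothetical helper: it only checks the recommendations from this
    tutorial and knows nothing about Determined's own validation.
    """
    warnings = []
    for i, mount in enumerate(bind_mounts):
        host = mount.get("host_path", "")
        container = mount.get("container_path", "")
        # host_path must be an absolute path on the agent.
        if not posixpath.isabs(host):
            warnings.append(f"bind_mounts[{i}]: host_path should be an absolute path")
        # Matching paths are a convenience, not a requirement.
        if container != host:
            warnings.append(f"bind_mounts[{i}]: container_path differs from host_path")
        # read_only: true keeps the experiment from modifying the data.
        if mount.get("read_only") is not True:
            warnings.append(f"bind_mounts[{i}]: read_only is not set to true")
    return warnings


# The configuration from the example above, written as Python data.
example = [
    {"host_path": "/mnt/data", "container_path": "/mnt/data", "read_only": True}
]
print(check_bind_mounts(example))  # prints []
```

An empty list means the entry follows all three recommendations; each warning names the index of the offending bind mount.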

Now, we can write Python code in our Model Definitions to access any data under the /mnt/data folder as follows:

def build_training_data_loader(self):
    return make_data_loader(data_path="/mnt/data/training", ...)

def build_validation_data_loader(self):
    return make_data_loader(data_path="/mnt/data/validation", ...)
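If a bind mount is missing or mistyped, the error can surface late, inside the data loading code. As a minimal sketch, a small helper can verify that the mounted directory exists before the data loaders are built. The name resolve_data_dir is hypothetical and not part of Determined; it uses only the Python standard library.

```python
from pathlib import Path


def resolve_data_dir(root: str, split: str) -> Path:
    """Return root/split, raising a clear error if the directory is absent.

    Hypothetical helper: `root` is the container_path of the bind mount
    and `split` is a subdirectory such as "training" or "validation".
    """
    data_dir = Path(root) / split
    if not data_dir.is_dir():
        raise FileNotFoundError(
            f"{data_dir} not found; check the bind_mounts section "
            "of the experiment configuration"
        )
    return data_dir
```

In the data loader methods above, the literal path strings could then be replaced with calls such as resolve_data_dir("/mnt/data", "training").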

Embed in Model Definition

Determined requires every experiment submission to include a model definition directory. The model definition directory contains all the training code and must be less than 96MB. However, Determined does not restrict the file types in a submission, so files such as csv or pickle are allowed. Therefore, data smaller than 96MB can be included in the model definition directory.

For example, the model definition directory below contains the model definition and a CSV data file.

.
├── data.csv (5 KB)
├── __init__.py (0 KB)
└── model_def.py (4.1 KB)

As seen above, the data file is 5KB and the entire directory is less than 96MB. The data can therefore be submitted together with the model definition using the command:

det create experiment const.yaml .
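Before submitting, it can be useful to confirm that the directory is actually under the 96MB limit. The following is a minimal sketch that sums the file sizes on disk. It assumes 1MB means 1024 * 1024 bytes and that the limit applies to the raw, uncompressed file sizes; how Determined measures the directory is not stated here, so treat the result as an estimate. The function names are hypothetical.

```python
import os

# Assumption: the 96MB limit is 96 * 1024 * 1024 bytes of raw file content.
MAX_MODEL_DEF_BYTES = 96 * 1024 * 1024


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under `path`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def fits_in_model_definition(path: str, limit: int = MAX_MODEL_DEF_BYTES) -> bool:
    """Return True if the directory is strictly smaller than `limit` bytes."""
    return directory_size_bytes(path) < limit


if __name__ == "__main__":
    size = directory_size_bytes(".")
    print(f"{size} bytes, fits: {fits_in_model_definition('.')}")
```

Running the script from inside the model definition directory reports the total size and whether it stays below the assumed limit.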

When a new experiment is created, Determined injects the contents of the model definition directory on each agent for each trial to use. Any data placed in the directory can then be accessed through a relative file path. The code below uses the pandas library to load data.csv through its relative file path.

import pandas

df = pandas.read_csv('data.csv')
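A bare relative path is resolved against the process's current working directory. If that directory is not guaranteed to be the model definition directory (an assumption for illustration, not something stated above), a common Python pattern is to resolve the path relative to the source file instead. The helper name bundled_path is hypothetical.

```python
from pathlib import Path


def bundled_path(filename: str, anchor_file: str) -> Path:
    """Return the path of `filename` located next to `anchor_file`.

    Hypothetical helper: pass __file__ from the model definition module
    as `anchor_file` so the result does not depend on the working directory.
    """
    return Path(anchor_file).resolve().parent / filename
```

Inside model_def.py, the data could then be loaded with pandas.read_csv(bundled_path("data.csv", __file__)).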