Prepare Data

Data plays a fundamental role in machine learning model development. The best way to load data into your ML models depends on several factors, including whether you are running on-premise or in the cloud, the size of your data sets, and your security requirements. Accordingly, Determined supports a variety of methods for accessing data.

This tutorial discusses three methods for accessing data in Determined:

  1. Object Storage: Data stored in object stores such as Amazon S3.

  2. Distributed File Systems: Data stored on distributed file systems such as NFS or Ceph.

  3. Small Data: Small data sets can be uploaded as part of the model definition.

Object Storage

Object stores manage data as a collection of key-value pairs. Object storage is particularly popular in cloud environments – for example, Amazon’s Simple Storage Service (S3) and Google Cloud Storage (GCS) are both object stores. When running Determined in the cloud, it is highly recommended that you store your data using the same cloud provider being used for the Determined cluster itself.

Unless you are accessing a publicly available data set, you will need to ensure that Determined trial containers can access data in the object storage service you are using. This can be done by configuring a custom environment with the appropriate credentials. When using Dynamic Agents on GCP, a system administrator will need to configure a valid service account with read credentials. When using Dynamic Agents on AWS, the system administrator will need to configure an iam_instance_profile_arn with read credentials.

Once security access has been configured, we can use open-source libraries such as boto3 or gcsfs to access data from object storage. The simplest way to do this is for your model definition code to download the entire data set whenever a trial container starts up.

Downloading from Object Storage

The example below demonstrates how to download data from S3 using boto3. The S3 bucket name is specified in the experiment config file (using a field named data.bucket). The download_directory variable defines where data downloaded from S3 will be stored. Note that we include self.context.distributed.get_rank() in the name of this directory: when doing distributed training, multiple processes might be downloading data concurrently (one process per GPU), so embedding the rank in the directory name ensures that these processes do not conflict with one another. For more detail, see the Distributed Training How-To Guide.

Once the download directory has been created, s3.download_file(s3_bucket, data_file, filepath) fetches the file from S3 and stores it at the specified location. The data can then be accessed in the download_directory.

import boto3
import os


def download_data_from_s3(self):
    s3_bucket = self.context.get_data_config()["bucket"]
    download_directory = f"/tmp/data-rank{self.context.distributed.get_rank()}"
    data_file = "data.csv"

    s3 = boto3.client("s3")
    os.makedirs(download_directory, exist_ok=True)
    filepath = os.path.join(download_directory, data_file)
    if not os.path.exists(filepath):
        s3.download_file(s3_bucket, data_file, filepath)
    return download_directory

To use this in your trial class, start by calling download_data_from_s3 in the trial’s __init__ function. Next, implement the build_training_data_loader and build_validation_data_loader functions to load the training and validation data sets, respectively, from the downloaded data.

Streaming from Object Storage

Instead of downloading the entire training data set from object storage during trial startup, you can stream batches of data from the training and validation sets as needed. This has several advantages:

  • It avoids downloading the entire data set during trial startup, allowing training tasks to start more quickly.

  • If a container doesn’t need to access the entire data set, streaming can result in downloading less data. For example, during a hyperparameter search, many trials are often terminated after having been trained for less than a full epoch.

  • If the data set is extremely large, streaming can avoid the need to store the entire data set on disk.

  • Streaming can allow model training and data downloading to happen in parallel, improving performance.
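The last point above can be illustrated with a minimal sketch that overlaps downloading with consumption by using a small thread pool. This is an illustration only, not how Determined or any framework implements it: the fetch_batch callable, the prefetching_batches name, and the lookahead parameter are all hypothetical. In practice, the data loaders that ship with PyTorch and Keras typically provide their own worker-based prefetching.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def prefetching_batches(fetch_batch, num_batches, lookahead=2):
    """Yield batches in order while later batches download in the background.

    fetch_batch is a hypothetical callable that takes a batch index and
    returns the batch, for example by reading one object from an object store.
    """
    if lookahead < 1:
        raise ValueError("lookahead must be at least 1")
    with ThreadPoolExecutor(max_workers=lookahead) as pool:
        pending = deque()
        next_idx = 0
        while next_idx < num_batches or pending:
            # Top up the queue so that up to `lookahead` downloads are submitted.
            while next_idx < num_batches and len(pending) < lookahead:
                pending.append(pool.submit(fetch_batch, next_idx))
                next_idx += 1
            # Hand the oldest batch to the caller; newer ones keep downloading.
            yield pending.popleft().result()
```

While the training loop processes one batch, the next downloads are already in flight, so time spent waiting on the network is partly hidden behind computation.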

To perform streaming data loading, the data must be stored in a format that allows efficient random access, so that the model code can fetch a specific batch of training or validation data. One way to do this is to store each batch of data as a separate object in the object store. Alternatively, if the data set consists of fixed-size records, you can use a single object and then read the appropriate byte range from it.
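For the single-object approach, the byte range for a group of fixed-size records follows directly from the record size. The sketch below shows the arithmetic; the function names are hypothetical, and it assumes records are stored back to back with no header or padding.

```python
def byte_range_header(first_record, num_records, record_size):
    """Return an HTTP Range header value covering a run of fixed-size records.

    Assumes records are packed contiguously from byte 0 of the object.
    """
    if first_record < 0 or num_records <= 0 or record_size <= 0:
        raise ValueError("indices and sizes must be positive")
    start = first_record * record_size
    # HTTP byte ranges are inclusive at both ends, hence the "- 1".
    end = start + num_records * record_size - 1
    return f"bytes={start}-{end}"


def split_records(blob, record_size):
    """Split the downloaded bytes back into individual records."""
    if len(blob) % record_size != 0:
        raise ValueError("blob length is not a multiple of record_size")
    return [blob[i:i + record_size] for i in range(0, len(blob), record_size)]
```

With boto3, the resulting string can be passed as the Range argument of get_object, for example s3.get_object(Bucket=bucket, Key=key, Range=byte_range_header(0, 32, 128)), so that only the requested records are transferred.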

To stream data, a custom torch.utils.data.Dataset or tf.keras.utils.Sequence object is required, depending on whether you are using PyTorch or TensorFlow Keras, respectively. These classes require a __getitem__ method that is passed an index and returns the associated batch or record of data. When streaming data, the implementation of __getitem__ should fetch the required data from the object store.

The code below demonstrates a custom tf.keras.utils.Sequence class that streams data from Amazon S3. In the __getitem__ method, boto3 is used to fetch the data based on the provided bucket and key.

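import boto3
import tensorflow as tf


class ObjectStorageSequence(tf.keras.utils.Sequence):
    # Other required methods, such as __len__, are elided here.
    ...

    def __init__(self):
        super().__init__()
        # Create the client once and reuse it for every batch request.
        self.s3_client = boto3.client("s3")

    def __getitem__(self, idx):
        # Map the batch index to the object that holds that batch.
        bucket, key = get_s3_loc_for_batch(idx)
        # Download the object's bytes and decode them into a batch.
        blob_data = self.s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        return data_to_batch(blob_data)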
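The helpers get_s3_loc_for_batch and data_to_batch are not defined in the example above. One possible shape for them is sketched below, assuming each batch is stored as its own JSON object with "inputs" and "labels" fields. The bucket name, key prefix, key format, and JSON layout are all hypothetical, and a real Sequence would usually convert the decoded lists to NumPy arrays before returning them.

```python
import json

# Hypothetical storage layout: one JSON object per batch.
BUCKET = "example-training-data"
PREFIX = "train"


def get_s3_loc_for_batch(idx):
    """Return the (bucket, key) pair that holds batch number idx."""
    # Zero-padded indices keep keys sorted in batch order when listed.
    return BUCKET, f"{PREFIX}/batch-{idx:06d}.json"


def data_to_batch(blob_data):
    """Decode the raw object bytes into an (inputs, labels) pair."""
    payload = json.loads(blob_data.decode("utf-8"))
    return payload["inputs"], payload["labels"]
```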

Distributed File System

Another way to store data is to use a distributed file system, which enables a cluster of machines to access a shared data set via the familiar POSIX file system interface. Amazon’s Elastic File System and Google’s Cloud Filestore are examples of distributed file systems that are available in cloud environments. For on-premise deployments, popular distributed file systems include Ceph, GlusterFS, and NFS.

To access data on a distributed file system, you should first ensure that the file system is mounted at the same mount point on every Determined agent. For cloud deployments, this can be done by configuring provisioner.startup_script in master.yaml to point to a script that mounts the distributed file system. An example of how to do this on GCP can be found here.

Next, you will need to ensure the file system is accessible to each trial container. This can be done by configuring a bind mount in the experiment configuration file. Each bind mount consists of a host_path and a container_path; the host path specifies the absolute path where the distributed file system has been mounted on the agent, while the container path specifies the path within the container’s file system where the distributed file system will be accessible.

To avoid confusion, you may wish to set the container_path to be equal to the host_path. You may also want to set read_only to true for each bind mount, to ensure that data sets are not modified by training code.

The following example assumes a Determined cluster is configured with a distributed file system mounted at /mnt/data on each agent. To access data on this file system, we use an experiment configuration file as follows:

bind_mounts:
  - host_path: /mnt/data
    container_path: /mnt/data
    read_only: true

Our model definition code can then access data in the /mnt/data directory as follows:

def build_training_data_loader(self):
    return make_data_loader(data_path="/mnt/data/training", ...)


def build_validation_data_loader(self):
    return make_data_loader(data_path="/mnt/data/validation", ...)

Embed in Model Definition

In Determined, each experiment has an associated model definition directory. The model definition directory must include the model’s source code, but it can also include other files related to the model, such as a data set. The size of this directory must not exceed 96MB, so this method is only appropriate when the size of the data set is small.

For example, the model definition directory below contains the model definition, an experiment configuration file, and a CSV data file. All three files are small and hence the total size of the directory is much smaller than the 96MB limit:

.
├── const.yaml (0.3 KB)
├── data.csv (5 KB)
└── model_def.py (4.1 KB)
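Before submitting an experiment, it can be useful to check the directory size locally. The sketch below sums the sizes of all regular files under a directory and compares the total to the limit. The function names are hypothetical, it assumes 96MB means 96 * 1024 * 1024 bytes, and the way Determined itself measures the directory may differ.

```python
import os

# Assumption: the 96MB limit is interpreted as 96 * 1024 * 1024 bytes.
MAX_MODEL_DEF_BYTES = 96 * 1024 * 1024


def directory_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links so their targets are not counted.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_model_definition(root, limit=MAX_MODEL_DEF_BYTES):
    """Return True if the directory is within the assumed size limit."""
    return directory_size_bytes(root) <= limit
```

Calling fits_in_model_definition(".") from the model definition directory gives a quick indication of whether a data set is small enough to embed.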

The data can be submitted along with the model definition using the command:

det create experiment const.yaml .

Determined injects the contents of the model definition directory into each trial container that is launched for the experiment. Any file in that directory can then be accessed by your model code, e.g., by relative path (the model definition directory is the initial working directory for each trial container).

For example, the code below uses Pandas to load data.csv into a DataFrame:

import pandas

df = pandas.read_csv("data.csv")
