Data Loaders

To feed data to their models, users must in most cases define a function called make_data_loaders(). This function should return a pair of objects, one for training and one for validation, that implement the BatchLoader interface, i.e., have a get_batch() function that returns Batches of data and labels. This section covers how to write that function and the basic concepts of data loading in PEDL.

(The exceptions are when using the Simple Model Definition or TensorFlow Estimator interfaces; refer to the linked docs for details.)

Records and Batches

Conceptually, a data set consists of a number of records or examples. A record is a pair of some data (e.g. an image) and a label (e.g. a class). In some cases a record may have multiple named data fields (e.g. an image with metadata) and/or multiple named labels (e.g. a collection of binary labels).

A (mini-)batch is a small (e.g. 32 or 64) collection of records. We typically train deep learning models using (mini-)batches. A batch loader is an interface to read batches of records from the data set.

Batches in PEDL are instances of the pedl.data.Batch class. Internally, the records are stored as a pair of data and label dictionaries whose keys are the data and label names and whose values are numpy ndarrays. The Batch packages these arrays together and ensures that their first dimensions match, i.e., that every array contains the same number of records. In the usual single-input, single-output case, the data array is named input and the label array is named output by default.
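
For example, a Batch wrapping a few records can be constructed directly from a pair of numpy arrays (a minimal sketch; the two-argument constructor shown here matches the usage later in this section):

import numpy as np

import pedl.data

# Four records of 8-dimensional data with integer class labels. The Batch
# checks that both arrays have the same first dimension (four records); with
# a single data array and a single label array, the default names "input"
# and "output" apply.
data = np.random.rand(4, 8).astype("float32")
labels = np.array([0, 1, 1, 0])
batch = pedl.data.Batch(data, labels)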

BatchLoaders and make_data_loaders

Batches are loaded from a data set via the pedl.data.BatchLoader interface. This is a lightweight interface that represents a sequence of records: an implementation provides __len__(), which returns the total number of records, and get_batch(self, start, end), which returns a Batch containing the records in the half-open range [start, end).
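
As an illustration (a toy loader over an in-memory list of (data, label) pairs, not one of PEDL's built-in classes), a BatchLoader subclass needs only these two methods:

import numpy as np

import pedl.data


class ListBatchLoader(pedl.data.BatchLoader):
    """Serves records from an in-memory list of (data, label) pairs."""

    def __init__(self, records):
        self.records = records

    def __len__(self):
        # Total number of records in the data set.
        return len(self.records)

    def get_batch(self, start, end):
        # Return a Batch containing the records in [start, end).
        data = np.array([d for d, _ in self.records[start:end]])
        labels = np.array([l for _, l in self.records[start:end]])
        return pedl.data.Batch(data, labels)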

PEDL contains a number of convenience classes to speed up the implementation of BatchLoaders. The ArrayBatchLoader class wraps in-memory numpy data and label arrays in the BatchLoader interface. The SliceBatchLoader class exposes a slice of the data in another BatchLoader.
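
For instance (a minimal sketch with random in-memory data; the SliceBatchLoader arguments mirror the usage shown at the end of this section):

import numpy as np

import pedl.data

# Wrap 100 in-memory records in the BatchLoader interface.
data = np.random.rand(100, 8).astype("float32")
labels = np.random.randint(0, 2, size=100)
loader = pedl.data.ArrayBatchLoader(data, labels)

# Expose the first 80 records as their own loader; the arguments follow the
# start-index-and-record-count usage at the end of this section.
training_slice = pedl.data.SliceBatchLoader(loader, 0, 80)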

The make_data_loaders(experiment_config, hparams) function should return a pair of pedl.data.BatchLoaders, one for the training data and one for the validation data. The experiment_config and hparams arguments are passed to this function by PEDL based on the information in the experiment configuration file; they specify the raw data source and the hyperparameter settings that may influence data preprocessing.
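
In outline, a make_data_loaders() implementation has the following shape (a minimal sketch with synthetic data; the hyperparameter name validation_fraction is hypothetical, and hparams is assumed to behave like a dictionary):

import numpy as np

import pedl.data


def make_data_loaders(experiment_config, hparams):
    # Synthetic stand-in for a real data source; experiment_config["data"]
    # would normally describe where the raw data lives, and the split
    # fraction below is purely illustrative of a hyperparameter that
    # affects preprocessing.
    data = np.random.rand(1000, 8).astype("float32")
    labels = np.random.randint(0, 10, size=1000)
    split = int(len(data) * (1 - hparams.get("validation_fraction", 0.1)))
    training_loader = pedl.data.ArrayBatchLoader(data[:split], labels[:split])
    validation_loader = pedl.data.ArrayBatchLoader(data[split:], labels[split:])
    return training_loader, validation_loader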

Tutorial

In this section we will implement a custom BatchLoader to load CIFAR-10 images, and a make_data_loaders() function to load training and validation data. First download and unpack the data set:

wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar xzf cifar-10-python.tar.gz

This creates a directory named cifar-10-batches-py. The CIFAR-10 data set is contained in six Python pickle files inside that directory, each holding 10,000 records. The first five hold the training data (data_batch_1, data_batch_2, etc.) and the last holds the validation data (test_batch). Our loader should support loading both training and validation data, so its constructor takes the list of pickle files to read from:

import functools
import glob
import os
import pickle

import numpy as np

import pedl.data


IMAGE_SIZE = 32 * 32 * 3


class CIFAR10BatchLoader(pedl.data.BatchLoader):
    def __init__(self, pickle_files, file_length=10000):
        """
        pickle_files is the list of input pickle files. file_length is the
        number of records in each pickle file (defaults to 10,000).
        """
        self.pickle_files = pickle_files
        self.file_length = file_length

The __len__() function is straightforward:

    def __len__(self):
        return len(self.pickle_files) * self.file_length

The only other method that needs to be implemented is get_batch(self, start, end). We compute which pickle files cover the requested range of records and read the needed subset from each.

    @functools.lru_cache(maxsize=2)
    def load_file(self, idx):
        """
        Load one of the pickle files, by index. This is cached (via
        `functools.lru_cache`) to minimize I/O.
        """
        assert 0 <= idx < len(self.pickle_files)
        with open(self.pickle_files[idx], "rb") as f:
            return pickle.load(f, encoding="bytes")

    def get_batch(self, start, end):
        """
        Reads a Batch from the CIFAR-10 pickle files.
        """
        assert 0 <= start <= end <= len(self)
        start_file, start_off = divmod(start, self.file_length)
        end_file, end_off = divmod(end - 1, self.file_length)
        data = np.empty((0, IMAGE_SIZE), dtype="uint8")
        labels = []
        # Walk through the pickle files and accumulate the requested records.
        while start_file <= end_file:
            records = self.load_file(start_file)
            file_end_off = (end_off + 1 if end_file == start_file
                            else self.file_length)
            data = np.concatenate([
                data,
                records[b"data"][start_off:file_end_off, :]])
            labels.extend(records[b"labels"][start_off:file_end_off])
            start_file, start_off = start_file + 1, 0
        return pedl.data.Batch(data, np.array(labels))

Note that we wrap load_file in functools.lru_cache to speed up repeated access, at the cost of some memory. For situations where data loading is dominated by compute time rather than I/O, PEDL offers pedl.util.memo_pickle, which caches the results of computations to disk, allowing them to be shared by multiple trials.
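
As a quick sanity check (run from the directory containing cifar-10-batches-py, continuing with the imports above; this step is not required by PEDL), the loader can be exercised directly:

loader = CIFAR10BatchLoader(
    sorted(glob.glob("cifar-10-batches-py/data_batch_*")))
print(len(loader))                     # 50000
batch = loader.get_batch(9990, 10010)  # a batch spanning two pickle files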

Finally, we can implement make_data_loaders():

def make_data_loaders(experiment_config, hparams):
    """
    Returns training and validation CIFAR10BatchLoaders. The path where the
    CIFAR-10 pickle files are located should be in
    experiment_config["data"]["path"].
    """
    training_loader = CIFAR10BatchLoader(sorted(glob.glob(os.path.join(
        experiment_config["data"]["path"], "data_batch_*"))))
    validation_loader = CIFAR10BatchLoader([os.path.join(
        experiment_config["data"]["path"], "test_batch")])
    return training_loader, validation_loader
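
For a quick local check outside of PEDL, we can call this function with a hand-built dictionary standing in for the real experiment configuration (and an empty hparams, since this implementation does not use it):

training_loader, validation_loader = make_data_loaders(
    {"data": {"path": "cifar-10-batches-py"}}, {})
print(len(training_loader), len(validation_loader))  # 50000 10000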

Using ArrayBatchLoader

If memory is not a concern, you can load your entire data set into memory and use pedl.data.ArrayBatchLoader:

def make_data_loaders(experiment_config, hparams):
    """
    Loads CIFAR-10 training and validation data into in-memory
    ArrayBatchLoaders. The path where the CIFAR-10 pickle files are located
    should be in experiment_config["data"]["path"].
    """
    # Load and concatenate all of the training data.
    training_data = np.empty((0, IMAGE_SIZE), dtype="uint8")
    training_labels = []
    for pkl in sorted(glob.glob(os.path.join(
            experiment_config["data"]["path"], "data_batch_*"))):
        with open(pkl, "rb") as f:
            records = pickle.load(f, encoding="bytes")
        training_data = np.concatenate([
            training_data,
            records[b"data"]])
        training_labels.extend(records[b"labels"])

    training_loader = pedl.data.ArrayBatchLoader(training_data,
                                                 np.array(training_labels))

    # Load validation data.
    validation_pkl = os.path.join(experiment_config["data"]["path"],
                                  "test_batch")
    with open(validation_pkl, "rb") as f:
        records = pickle.load(f, encoding="bytes")

    validation_loader = pedl.data.ArrayBatchLoader(
        records[b"data"], np.array(records[b"labels"]))

    return training_loader, validation_loader

Using SliceBatchLoader

Often one has a single data set and must perform a training-validation split. SliceBatchLoader can take slices of an existing BatchLoader to do this. In the following example, we load the training and validation data into a single ArrayBatchLoader and split it into training and validation data loaders:

def make_data_loaders(experiment_config, hparams):
    """
    Loads all CIFAR-10 data into a single in-memory ArrayBatchLoader and
    returns training and validation slices using SliceBatchLoader. The path
    where the CIFAR-10 pickle files are located should be in
    experiment_config["data"]["path"].
    """
    # Load and concatenate all of the data.
    data = np.empty((0, IMAGE_SIZE), dtype="uint8")
    labels = []
    for pkl in sorted(glob.glob(os.path.join(
            experiment_config["data"]["path"], "*_batch*"))):
        with open(pkl, "rb") as f:
            records = pickle.load(f, encoding="bytes")
        data = np.concatenate([
            data,
            records[b"data"]])
        labels.extend(records[b"labels"])

    data_loader = pedl.data.ArrayBatchLoader(data, np.array(labels))

    # Split into a 50,000-record training slice and a 10,000-record
    # validation slice.
    training_loader = pedl.data.SliceBatchLoader(data_loader, 0, 50000)
    validation_loader = pedl.data.SliceBatchLoader(data_loader, 50000, 10000)

    return training_loader, validation_loader