Data Loaders

To feed data to their models, in most cases users must define a function called make_data_loaders(). This function should return a pair of objects (one for training and one for validation) that implement the BatchLoader interface, i.e., they have a get_batch() method that returns Batches of data and labels. This section covers how to write that function and the basic concepts of data loading in PEDL.

(The exceptions are when using the Simple Model Definition or TensorFlow Estimator interfaces; refer to the linked docs for details.)

Records and Batches

Conceptually, a data set consists of a number of records or examples. A record has a model input (e.g. an image) and a target output (e.g. a label). In some cases a record may have multiple named inputs (e.g. an image with metadata) and/or multiple named outputs (e.g. a collection of binary labels).

A (mini-)batch is a small (e.g. 32 or 64) collection of records. We typically train deep learning models using (mini-)batches. A batch loader is an interface to read batches of records from the data set.
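To make the batching arithmetic concrete, here is a small illustrative sketch (plain NumPy, not the PEDL API) that slices a record array into mini-batches of 32; the array shape is a hypothetical example:

```python
import numpy as np

# Hypothetical data set: 100 records, each a 3072-element image vector.
inputs = np.zeros((100, 3072), dtype="uint8")
batch_size = 32

# Slice the records into mini-batches; the last batch may be smaller.
batches = [inputs[i:i + batch_size] for i in range(0, len(inputs), batch_size)]

print(len(batches))          # 4 batches: 32 + 32 + 32 + 4 records
print(batches[-1].shape[0])  # 4
```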

Batches in PEDL are instances of the Batch class. Internally, the records are stored as a pair of input and output dictionaries, whose keys are the input/output names and whose values are NumPy ndarrays holding the collected input data or output labels. The Batch conveniently packages these arrays together and ensures that their first dimensions match (i.e., that they contain the same number of records). In the usual single-input, single-output case, the input key defaults to input and the output key defaults to output.
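The structure described above can be sketched as a minimal stand-in (illustrative only, not PEDL's actual Batch class) that stores the input/output dictionaries and checks that their first dimensions agree:

```python
import numpy as np

class SimpleBatch:
    """Illustrative stand-in for a Batch: paired input/output dicts of ndarrays."""
    def __init__(self, inputs, outputs):
        # All arrays must contain the same number of records.
        sizes = {v.shape[0] for v in list(inputs.values()) + list(outputs.values())}
        assert len(sizes) == 1, "first dimensions of all arrays must match"
        self.inputs = inputs
        self.outputs = outputs

    def __len__(self):
        return next(iter(self.inputs.values())).shape[0]

# Single-input, single-output case with the default key names.
batch = SimpleBatch({"input": np.zeros((32, 3072), dtype="uint8")},
                    {"output": np.zeros((32,), dtype="int64")})
print(len(batch))  # 32
```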

BatchLoaders and make_data_loaders

Batches are loaded from a data set via the BatchLoader interface. This is a lightweight interface that represents a sequence of records: __len__() reports the total number of records, and batches are loaded via the get_batch(self, start, end) method.
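In outline, implementing the interface amounts to providing those two methods. A minimal in-memory example (illustrative, not one of PEDL's own classes) over integer records:

```python
class RangeBatchLoader:
    """Illustrative loader whose records are simply the integers 0..n-1."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def get_batch(self, start, end):
        # Return the records in the half-open range [start, end).
        assert 0 <= start <= end <= self.n
        return list(range(start, end))

loader = RangeBatchLoader(100)
print(len(loader))               # 100
print(loader.get_batch(10, 14))  # [10, 11, 12, 13]
```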

PEDL contains a number of convenience classes to speed the implementation of BatchLoaders. The ArrayBatchLoader class wraps in-memory NumPy arrays of inputs and outputs in the BatchLoader interface, and the SliceBatchLoader class exposes a slice of the data in another BatchLoader.

The make_data_loaders(experiment_config, hparams) function should return a pair of BatchLoaders: one for the training data and one for the validation data. The arguments experiment_config and hparams are passed to this function directly by PEDL based on the information in the experiment configuration file; they specify the raw data source and the hyperparameter settings that could potentially influence data preprocessing.


In this section we will implement a custom BatchLoader to load CIFAR-10 images, and a make_data_loaders() function that returns training and validation data loaders. First download the data set and unpack it:

tar xzf cifar-10-python.tar.gz

This creates a directory cifar-10-batches-py. The CIFAR-10 data set is contained in six Python pickle files of 10,000 records each. The first five are training data (data_batch_1, data_batch_2, etc.) and the last is the validation data (test_batch). Our loader should support loading both training and validation data, so its constructor takes the list of pickle files to load:

import functools
import glob
import os
import pickle

import numpy as np


IMAGE_SIZE = 32 * 32 * 3

class CIFAR10BatchLoader(BatchLoader):
    def __init__(self, pickle_files, file_length=10000):
        """
        pickle_files is the list of input pickle files.  file_length is the
        number of records in each pickle file (defaults to 10,000).
        """
        self.pickle_files = pickle_files
        self.file_length = file_length

The __len__() function is straightforward:

    def __len__(self):
        return len(self.pickle_files) * self.file_length

The only other method that needs to be implemented is get_batch(self, start, end). We compute which pickle files we need to load, then read the required subset of records from each file.

    @functools.lru_cache(maxsize=None)
    def load_file(self, idx):
        """
        Load one of the pickle files, by index.  This is cached (via
        `functools.lru_cache`) to minimize I/O.
        """
        assert 0 <= idx < len(self.pickle_files)
        return pickle.load(open(self.pickle_files[idx], "rb"),
                           encoding="bytes")

    def get_batch(self, start, end):
        """Reads a Batch from the CIFAR-10 pickle files."""
        assert 0 <= start <= end <= len(self)
        start_file, start_off = divmod(start, self.file_length)
        end_file, end_off = divmod(end - 1, self.file_length)
        inputs = np.empty((0, IMAGE_SIZE), dtype="uint8")
        labels = []
        # Walk through the pickle files and accumulate the requested records.
        while start_file <= end_file:
            records = self.load_file(start_file)
            file_end_off = (end_off + 1 if end_file == start_file
                            else self.file_length)
            inputs = np.concatenate([
                inputs, records[b"data"][start_off:file_end_off, :]])
            labels.extend(records[b"labels"][start_off:file_end_off])
            start_file, start_off = start_file + 1, 0
        return Batch(inputs, np.array(labels))
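The file/offset bookkeeping above relies on divmod: for a request of records [start, end), divmod(start, file_length) gives the first file and the offset within it, and divmod(end - 1, file_length) gives the last. A quick check with the CIFAR-10 file length of 10,000 and a request that straddles two files:

```python
file_length = 10000

# Request records [9990, 10020): the last 10 records of file 0
# and the first 20 records of file 1.
start, end = 9990, 10020
start_file, start_off = divmod(start, file_length)
end_file, end_off = divmod(end - 1, file_length)

print(start_file, start_off)  # 0 9990
print(end_file, end_off)      # 1 19
```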

Note we wrap load_file in functools.lru_cache to speed up access, at the cost of some memory. For situations where the data loading is dominated by compute time, PEDL offers pedl.util.memo_pickle, which caches the result of computations to disk, where they can be shared by multiple trials.
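The effect of functools.lru_cache can be seen directly. A toy example (unrelated to PEDL) that counts how often the underlying function actually runs:

```python
import functools

calls = 0

@functools.lru_cache(maxsize=None)
def load_file(idx):
    global calls
    calls += 1              # stand-in for an expensive pickle load
    return [idx] * 3

load_file(0)
load_file(0)                # served from the cache; no second load
load_file(1)
print(calls)                         # 2
print(load_file.cache_info().hits)   # 1
```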

Finally, we can implement make_data_loaders():

def make_data_loaders(experiment_config, hparams):
    """
    Returns training and validation CIFAR10BatchLoaders.  The path where the
    CIFAR-10 pickle files are located should be in
    experiment_config["data"]["path"].
    """
    training_loader = CIFAR10BatchLoader(sorted(glob.glob(os.path.join(
        experiment_config["data"]["path"], "data_batch_*"))))
    validation_loader = CIFAR10BatchLoader([os.path.join(
        experiment_config["data"]["path"], "test_batch")])
    return training_loader, validation_loader
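The per-file layout these loaders rely on can be checked directly. The snippet below writes a tiny synthetic stand-in for one CIFAR-10 pickle file (only 2 records; the real files hold a (10000, 3072) uint8 array) and reads it back with the same encoding="bytes" access pattern:

```python
import os
import pickle
import tempfile

import numpy as np

# Write a synthetic stand-in for a CIFAR-10 pickle file with 2 records.
fake = {b"data": np.zeros((2, 32 * 32 * 3), dtype="uint8"),
        b"labels": [3, 7]}
path = os.path.join(tempfile.mkdtemp(), "data_batch_1")
with open(path, "wb") as f:
    pickle.dump(fake, f)

# The loading pattern used throughout this page; the dict is keyed by bytes.
records = pickle.load(open(path, "rb"), encoding="bytes")
print(records[b"data"].shape)  # (2, 3072)
print(records[b"labels"])      # [3, 7]
```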

Using ArrayBatchLoader

If memory is not a concern, you can load your entire data set into memory and use an ArrayBatchLoader:

def make_data_loaders(experiment_config, hparams):
    """
    Loads CIFAR-10 training and validation data into in-memory
    ArrayBatchLoaders.  The path where the CIFAR-10 pickle files are located
    should be in experiment_config["data"]["path"].
    """
    # Load and concatenate all of the training data.
    training_inputs = np.empty((0, IMAGE_SIZE), dtype="uint8")
    training_labels = []
    for pkl in sorted(glob.glob(os.path.join(
            experiment_config["data"]["path"], "data_batch_*"))):
        records = pickle.load(open(pkl, "rb"), encoding="bytes")
        training_inputs = np.concatenate([
            training_inputs, records[b"data"]])
        training_labels.extend(records[b"labels"])

    training_loader = ArrayBatchLoader(training_inputs,
                                       np.array(training_labels))

    # Load validation data.
    validation_pkl = os.path.join(experiment_config["data"]["path"],
                                  "test_batch")
    records = pickle.load(open(validation_pkl, "rb"), encoding="bytes")

    validation_loader = ArrayBatchLoader(
        records[b"data"], np.array(records[b"labels"]))

    return training_loader, validation_loader

Using SliceBatchLoader

Often one has a single data set and must perform a training-validation split. SliceBatchLoader can take slices of an existing BatchLoader to do this. In the following example we load the training and validation data into a single ArrayBatchLoader and split it into training and validation data loaders:
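The bookkeeping a slicing loader needs is simple: shift incoming requests by the slice's start offset and report the slice's length. A minimal stand-in (illustrative, not PEDL's SliceBatchLoader, though it mirrors the (loader, start, num_records) call pattern used below) over any object with __len__ and get_batch:

```python
class SimpleSliceLoader:
    """Illustrative slice of another loader: `length` records starting at `start`."""
    def __init__(self, loader, start, length):
        assert start + length <= len(loader)
        self.loader = loader
        self.start = start
        self.length = length

    def __len__(self):
        return self.length

    def get_batch(self, start, end):
        # Translate slice-relative indices into the underlying loader's indices.
        assert 0 <= start <= end <= self.length
        return self.loader.get_batch(self.start + start, self.start + end)

class ListLoader:
    """Trivial in-memory loader backing the example."""
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def get_batch(self, start, end):
        return self.data[start:end]

# Split 10 records into 8 training records and 2 validation records.
full = ListLoader(list(range(10)))
train = SimpleSliceLoader(full, 0, 8)
val = SimpleSliceLoader(full, 8, 2)
print(len(train))           # 8
print(val.get_batch(0, 2))  # [8, 9]
```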

def make_data_loaders(experiment_config, hparams):
    """
    Loads all CIFAR-10 data into a single in-memory ArrayBatchLoader, and
    returns training and validation slices using SliceBatchLoader.  The path
    where the CIFAR-10 pickle files are located should be in
    experiment_config["data"]["path"].
    """
    # Load and concatenate all of the data.
    inputs = np.empty((0, IMAGE_SIZE), dtype="uint8")
    labels = []
    for pkl in sorted(glob.glob(os.path.join(
            experiment_config["data"]["path"], "*_batch*"))):
        records = pickle.load(open(pkl, "rb"), encoding="bytes")
        inputs = np.concatenate([inputs, records[b"data"]])
        labels.extend(records[b"labels"])

    data_loader = ArrayBatchLoader(inputs, np.array(labels))

    # Split into training and validation slices.
    training_loader = SliceBatchLoader(data_loader, 0, 50000)
    validation_loader = SliceBatchLoader(data_loader, 50000, 10000)

    return training_loader, validation_loader