The data layer in Determined enables high-performance data loading for deep learning training. Modern deep learning data-loading APIs do not always provide both a random-access layer and a sequential layer (TF Dataset is one example). This makes them inefficient for many common deep learning workflows.
Determined integrates the data layer with YogaDL to provide a better approach to data-loading. The data layer enables efficient data access for common deep learning workflows such as hyperparameter tuning and distributed training. It also enables dataset versioning. This how-to guide covers:
How the data layer works.
How to use the data layer API in your code.
How the Data Layer Works
The data layer in Determined provides a random-access layer by caching datasets. A dataset is cached by iterating over it before the start of training and storing the output in an LMDB file. Using this random-access layer, Determined accesses training and validation data efficiently and adapts to different experiment types. For example, during distributed training each GPU in Determined only needs access to a portion of the training data.
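As a rough illustration of why random access helps, the sketch below caches a dataset by iterating over it once and then hands each worker only its own slice. It is a simplified stand-in, not Determined's implementation: an in-memory list takes the place of the LMDB file, the strided sharding scheme is an assumption, and the names `build_cache` and `shard_for_worker` are hypothetical.

```python
from typing import Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def build_cache(dataset: Iterable[T]) -> List[T]:
    """Iterate over the dataset once and store every record.

    A plain list stands in for the on-disk LMDB file (assumption).
    """
    return list(dataset)


def shard_for_worker(cache: Sequence[T], rank: int, num_workers: int) -> List[T]:
    """Return the records assigned to one worker using strided indexing.

    With random access, a worker reads only its own indices instead of
    scanning the whole dataset. The strided layout is an illustrative choice.
    """
    if not 0 <= rank < num_workers:
        raise ValueError("rank must be in [0, num_workers)")
    return [cache[i] for i in range(rank, len(cache), num_workers)]


# Cache a 10-record dataset, then split it across 4 hypothetical workers.
cache = build_cache(range(10))
shards = [shard_for_worker(cache, rank, 4) for rank in range(4)]
```

Each record lands in exactly one shard, so no worker has to read data that belongs to another.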
The data layer in Determined uses the dataset_id and dataset_version keys as unique identifiers for dataset caches. When a dataset cache with matching keys is found, it is reused, allowing users to skip the caching process.
Configuring Data Layer Storage
The data layer can be configured to cache datasets in cloud storage such as GCS or on the agent’s file system. The storage medium is set in the experiment config. For GCS, the data layer maintains a cloud copy and a local copy of the cached dataset. If the timestamp of the local cache matches the timestamp of the one in the cloud, the local copy is used; otherwise, the local copy is overwritten.
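The timestamp rule above can be sketched as a small decision function. This is an illustration only: the function names are hypothetical, timestamps are modeled as plain floats, and treating a missing local copy (`None`) as a mismatch is an assumption rather than documented behavior.

```python
from typing import Optional


def should_use_local_copy(
    local_timestamp: Optional[float], cloud_timestamp: float
) -> bool:
    """Return True when the local cache can be used as-is.

    A missing local copy (None) is treated as a mismatch (assumption).
    """
    return local_timestamp is not None and local_timestamp == cloud_timestamp


def sync_local_cache(
    local_timestamp: Optional[float], cloud_timestamp: float
) -> str:
    """Name the action taken for the local copy of a cached dataset."""
    if should_use_local_copy(local_timestamp, cloud_timestamp):
        return "use_local"
    # Timestamps differ, so the local copy is replaced by the cloud copy.
    return "overwrite_local"


matching = sync_local_cache(100.0, 100.0)
stale = sync_local_cache(90.0, 100.0)
missing = sync_local_cache(None, 100.0)
```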
To reuse locally cached datasets across trials and experiments when using GCS, users should configure local_cache_container_path, which bind-mounts the cache directories so that they are reused across the containers running the different trials and experiments. When using shared_fs (a local filesystem) as the storage medium, users should configure container_storage_path to reuse cached datasets across trials and experiments.
Automatically deleting cached datasets is not currently supported in Determined. If users want to delete a cached dataset, they should do so manually. Dataset caches are located under:
Deleting a cache that is in use by an active experiment will result in undefined behavior.
Using the Data Layer API
The data layer API requires users to place their dataset creation code within a function and to decorate that function with Determined-provided decorators that can be accessed via the context object:
self.context.experimental.cache_train_dataset(dataset_id: str, dataset_version: str, shuffle: bool = False, skip_shuffle_end_of_epoch: bool = False) for training data.
self.context.experimental.cache_validation_dataset(dataset_id: str, dataset_version: str, shuffle: bool = False) for validation data.
If the dataset_id and dataset_version don’t match an existing cached dataset, the dataset is written to a new cache. If there is a match, the caching process is skipped. Once the dataset is cached, Determined returns a tf.data.Dataset object containing the required data. By creating the tf.data.Dataset object from the cache, Determined is able to populate it with the appropriate data. For example, if resuming training mid-epoch, the dataset will start from the appropriate point in that epoch.
This is an example of how to use the data layer API:
    def make_train_dataset(self):
        @self.context.experimental.cache_train_dataset("range_dataset", "v1", shuffle=True)
        def make_dataset() -> tf.data.Dataset:
            ds = tf.data.Dataset.range(1000)
            return ds

        # Returns a tf.data.Dataset.
        dataset = make_dataset()

        # Perform batching and run-time augmentation outside the cache.
        dataset = dataset.batch(self.context.get_per_slot_batch_size())
        dataset = dataset.map(lambda x: x + 1)
        return dataset
The first time this code is executed, the dataset is cached. In
subsequent experiments, as long as the cache for this dataset is still
present, the decorated
make_dataset() function will not be executed
again. Instead, the dataset will be read directly from the cache.
Users are encouraged to place experiment-specific dataset operations, such as batching and runtime augmentation, outside the make_dataset() function, as is done in the example above. This allows users to reuse the cached dataset across a wide range of experiments. For instance, with the example above, users can experiment with a wide range of batch sizes. If dataset.batch(32) were included in make_dataset(), users would always have a batch size of 32 when reusing the cached dataset.
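The benefit of batching outside the cache can be shown without TensorFlow. In the sketch below, a plain list stands in for the unbatched cached dataset and `batch` is a hypothetical helper, not a Determined or TensorFlow function; the same cached records are batched two different ways without rebuilding anything.

```python
from typing import List, Sequence


def batch(records: Sequence[int], batch_size: int) -> List[List[int]]:
    """Group records into consecutive batches; the last one may be smaller."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        list(records[i:i + batch_size])
        for i in range(0, len(records), batch_size)
    ]


# Stands in for the unbatched records read back from the cache (assumption).
cached_records = list(range(10))

# Two experiments reuse the same cache with different batch sizes.
small_batches = batch(cached_records, 2)
large_batches = batch(cached_records, 4)
```

Had the batch size been fixed before caching, the second experiment would have needed a new cache under a different version key.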
If dataset.repeat() is called within make_dataset(), the data layer will write records from the dataset until it fills up the entire disk. In fact, users never need to call dataset.repeat().