Data Layer¶
The data layer in Determined enables high-performance data loading for deep learning training. Modern deep learning data-loading APIs do not always provide both a random-access layer and a sequential layer (see: TF Dataset). This makes them inefficient for many common deep learning workflows, including resuming training mid-epoch.
Determined integrates the data layer with YogaDL to provide a better approach to data loading. The data layer enables efficient data access for common deep learning workflows such as hyperparameter tuning and distributed training. It also enables dataset versioning. This how-to guide covers:
How the data layer works.
How to configure a local file-system, AWS S3, or GCS as the storage medium.
How to use the data layer API in your code.
Note
The data layer is an experimental feature and its API is not considered stable. Currently, it supports only tf.data.Dataset inputs, which can be used with Keras and Estimators.
How the Data Layer Works¶
The data layer in Determined provides a random-access layer by caching datasets. A dataset is cached by iterating over it before the start of training and storing the output in an LMDB file. Using this random-access layer, Determined efficiently accesses training and validation data, adapting to different experiment types. For example, during distributed training, each GPU only needs access to a portion of the training data.
The data layer in Determined uses name and version keys as unique identifiers for dataset caches. When a dataset cache with matching keys is found, it is reused, allowing users to skip pre-processing steps.
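For instance, suppose a dataset was cached under the keys "mnist-images" and "v1" (hypothetical names). Re-running an experiment with the same keys reuses that cache, while bumping the version key, say after changing the pre-processing, causes the dataset to be cached anew. A minimal sketch using the decorator API described in Using the Data Layer API below:

# Reuses the existing cache if ("mnist-images", "v1") was cached previously;
# in that case the decorated function is skipped entirely.
@self.context.experimental.cache_train_dataset("mnist-images", "v1", shuffle=True)
def make_dataset() -> tf.data.Dataset:
    return preprocess_images()  # hypothetical pre-processing helper

# After changing the pre-processing, bump the version key so a new cache is built.
@self.context.experimental.cache_train_dataset("mnist-images", "v2", shuffle=True)
def make_dataset_v2() -> tf.data.Dataset:
    return preprocess_images_v2()  # hypothetical updated helper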
Configuring Data Layer Storage¶
The data layer can be configured to cache datasets in S3, GCS, or on the agent’s file system. The storage medium is configured in the experiment config. For S3 and GCS, the data layer maintains a cloud copy and a local copy of the cached dataset. If the timestamp of the local cache matches the timestamp of the copy in the cloud, the local copy is used; otherwise, the local copy is overwritten.
To reuse locally cached datasets across trials and experiments when using S3 or GCS, users should configure local_cache_host_path and local_cache_container_path, which bind-mount the cache directories into the containers running the different trials and experiments so the cache can be shared. When using shared_fs (a local file system) as the storage medium, users should configure host_storage_path and container_storage_path to reuse cached datasets across trials and experiments.
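As a rough sketch, the storage-related section of an experiment config using S3 might look like the following. The path fields are the ones described above; the surrounding layout (the data_layer and type keys) and the placeholder bucket name and paths are assumptions that should be checked against the experiment configuration reference:

data_layer:
  type: s3                                   # assumed key; alternatives: gcs, shared_fs
  bucket: my-dataset-cache-bucket            # placeholder bucket name
  bucket_directory_path: determined-data-layer
  local_cache_host_path: /tmp/determined-cache
  local_cache_container_path: /tmp/determined-cache

For shared_fs, host_storage_path and container_storage_path would appear in place of the S3 fields.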
Automatically deleting cached datasets is not currently supported in Determined. If users want to delete a cached dataset, they should do so manually. Dataset caches are located under:
bucket/bucket_directory_path/dataset_id/dataset_version/ on S3 and GCS
local_cache_host_path/yogadl_local_cache/dataset_id/dataset_version/ locally for S3 and GCS
host_storage_path/dataset_id/dataset_version/ locally for shared_fs
Warning
Deleting a cache that is in use by an active experiment will result in undefined behavior.
Using the Data Layer API¶
The data layer API requires users to place their dataset creation code within a function and to decorate that function with Determined-provided decorators that can be accessed via the context object:
self.context.experimental.cache_train_dataset(dataset_id: str, dataset_version: str, shuffle: bool = False, skip_shuffle_end_of_epoch: bool = False) for training data.
self.context.experimental.cache_validation_dataset(dataset_id: str, dataset_version: str, shuffle: bool = False) for validation data.
If the dataset_id and dataset_version don’t match an existing cached dataset, the dataset is written to a new cache. If there is a match, the caching process is skipped. Once the dataset is cached, Determined returns a tf.data.Dataset object containing the required data. By creating the tf.data.Dataset object from the cache, Determined is able to populate it with the appropriate data. For example, when resuming training mid-epoch, the dataset will start from the appropriate offset.
This is an example of how to use the data layer API:
def make_train_dataset(self):
    @self.context.experimental.cache_train_dataset("range_dataset", "v1", shuffle=True)
    def make_dataset() -> tf.data.Dataset:
        ds = tf.data.Dataset.range(1000)
        return ds

    # Returns a tf.data.Dataset.
    dataset = make_dataset()

    # Perform batching and run-time augmentation outside the cache.
    dataset = dataset.batch(self.context.get_per_slot_batch_size())
    dataset = dataset.map(lambda x: x + 1)
    return dataset
The first time this code is executed, the dataset is cached. In subsequent experiments, as long as the cache for this dataset is still present, the decorated make_dataset() function will not be executed again. Instead, the dataset will be read directly from the cache.
Users are encouraged to place experiment-specific dataset operations, such as batching and run-time augmentation, outside the make_dataset() function, as is done in the example above. This allows the cached dataset to be reused across a wide range of experiments. For instance, with the code above, users can experiment with a wide range of batch sizes; if dataset.batch(32) were included in make_dataset(), reusing the cached dataset would always result in a batch size of 32.
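The same pattern applies to validation data via cache_validation_dataset. A minimal sketch, with the dataset contents and names chosen purely for illustration:

def make_validation_dataset(self):
    @self.context.experimental.cache_validation_dataset("range_dataset_val", "v1")
    def make_dataset() -> tf.data.Dataset:
        return tf.data.Dataset.range(100)

    dataset = make_dataset()
    # As with training data, batching stays outside the cached function.
    return dataset.batch(self.context.get_per_slot_batch_size())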
Warning
If dataset.repeat() is called within make_dataset(), the data layer will write records from the dataset until it fills up the entire disk. In fact, users never need to call dataset.repeat() in Determined.