Data Layer¶
The data layer in Determined enables high-performance data loading for deep learning training. Modern deep learning data-loading APIs do not always provide both a random-access layer and a sequential layer (see: TF Dataset). This makes them an inefficient solution for many common deep learning workflows, such as resuming training mid-epoch.
Determined integrates the data layer with YogaDL to provide a better approach to data loading. The data layer enables efficient data access for common deep learning workflows such as hyperparameter tuning and distributed training, and it also enables dataset versioning. This how-to guide covers:
How the data layer works.
How to configure a local file-system, AWS S3, or GCS as the storage medium.
How to use the data layer API in your code.
Note
The data layer is an experimental feature and its API is not considered stable. Currently, it supports only tf.data.Dataset inputs, which can be used with Keras and Estimators.
How the Data Layer Works¶
The data layer in Determined enables a random-access layer by caching datasets. A dataset is cached by iterating over it before the start of training and storing the output in an LMDB file. Using this random-access layer, Determined efficiently accesses training and validation data and adapts to different experiment types. For example, during distributed training, each GPU in Determined only needs access to a portion of the training data.
The data layer in Determined uses name and version keys as unique identifiers for dataset caches. When a dataset cache with matching keys is found, it is reused, allowing users to skip pre-processing steps.
Configuring Data Layer Storage¶
The data layer can be configured to cache datasets in S3, GCS, or on the agent's file system. The storage medium is configured in the experiment config. For S3 and GCS, the data layer maintains a cloud copy and a local copy of the cached dataset. If the timestamp of the local cache matches the timestamp of the one in the cloud, the local copy is used; otherwise, the local copy is overwritten.
In order to reuse the locally cached datasets across trials and experiments when using S3 or GCS, users should configure local_cache_host_path and local_cache_container_path, which bind-mount the cache directories and reuse them across the containers running the different trials and experiments. When using shared_fs (a local file system) as the storage medium, users should configure host_storage_path and container_storage_path to reuse cached datasets across trials and experiments.
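For illustration, a data_layer section of an experiment config for S3-backed storage might look like the following sketch; the bucket name and paths are placeholders rather than required values, and a shared_fs configuration would use host_storage_path and container_storage_path instead:

data_layer:
  type: s3
  bucket: my-dataset-cache-bucket              # placeholder bucket name
  bucket_directory_path: determined/data-layer # placeholder prefix within the bucket
  local_cache_host_path: /tmp/determined-cache # host directory reused across containers
  local_cache_container_path: /data-layer-cache # mount point inside the trial containers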
Automatically deleting cached datasets is not currently supported in Determined. If users want to delete a cached dataset, they should do so manually. Dataset caches are located under:
bucket/bucket_directory_path/dataset_id/dataset_version/ on S3 and GCS
local_cache_host_path/yogadl_local_cache/dataset_id/dataset_version/ locally for S3 and GCS
host_storage_path/dataset_id/dataset_version/ locally for shared_fs
Warning
Deleting a cache that is in use by an active experiment will result in undefined behavior.
Using the Data Layer API¶
The data layer API requires users to place their dataset creation code within a function and to decorate that function with Determined-provided decorators that can be accessed via the context object:
self.context.experimental.cache_train_dataset(dataset_id: str, dataset_version: str, shuffle: bool = False, skip_shuffle_end_of_epoch: bool = False) for training data.
self.context.experimental.cache_validation_dataset(dataset_id: str, dataset_version: str, shuffle: bool = False) for validation data.
If the dataset_id and dataset_version do not match an existing cached dataset, the dataset is written to a new cache. If there is a match, the caching process is skipped.
Once the dataset is cached, Determined returns a tf.data.Dataset object containing the required data. By creating the tf.data.Dataset object from the cache, Determined is able to populate it with the appropriate data. For example, when resuming training mid-epoch, the dataset will start from the appropriate offset.
This is an example of how to use the data layer API:
def make_train_dataset(self):
    @self.context.experimental.cache_train_dataset("range_dataset", "v1", shuffle=True)
    def make_dataset() -> tf.data.Dataset:
        ds = tf.data.Dataset.range(1000)
        return ds

    # Returns a tf.data.Dataset backed by the cache.
    dataset = make_dataset()
    # Perform batching and run-time augmentation outside the cache.
    dataset = dataset.batch(self.context.get_per_slot_batch_size())
    dataset = dataset.map(lambda x: x + 1)
    return dataset
The first time this code is executed, the dataset is cached. In subsequent experiments, as long as the cache for this dataset is still present, the decorated make_dataset() function will not be executed again. Instead, the dataset will be read directly from the cache.
Users are encouraged to place experiment-specific dataset operations, such as batching and run-time augmentation, outside the make_dataset() function, as is done in the example above. This allows the cached dataset to be reused across a wide range of experiments. With the example above, users can experiment with a wide range of batch sizes; if dataset.batch(32) were included in make_dataset(), users would always have a batch size of 32 when reusing the cached dataset.
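Validation data is cached in the same way via cache_validation_dataset. The following is a minimal sketch under the same assumptions as the example above; the dataset_id "range_dataset_val" and the dataset contents are purely illustrative:

def make_validation_dataset(self):
    # "range_dataset_val" and "v1" are illustrative cache keys, not required names.
    @self.context.experimental.cache_validation_dataset("range_dataset_val", "v1")
    def make_dataset() -> tf.data.Dataset:
        ds = tf.data.Dataset.range(100)
        return ds

    dataset = make_dataset()
    # As with the training data, batching stays outside the cached function.
    dataset = dataset.batch(self.context.get_per_slot_batch_size())
    return dataset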
Warning
If dataset.repeat() is called within make_dataset(), the data layer will write records from the dataset until it fills up the entire disk. In fact, users never need to call dataset.repeat() in Determined.