Data Downloading

When doing Multi-GPU Training, a single process is created for each GPU being used on a given agent. Each of these processes invokes the framework-specific function related to data preparation, and in most cases these calls happen concurrently. If each of these calls downloads the entire data set, this causes two problems:

  1. The data set will be downloaded multiple times.

  2. If the data set is stored on disk, different copies of the download might overwrite or conflict with one another.

PEDL provides an optional API for downloading data as part of the training process. If the developer implements a download_data() API function, this function will be invoked once on each machine, before any data loaders are created. This function can be used to download a single copy of the data set, and should return the path of a directory on disk containing the data set. This path can be fetched by calling pedl.get_download_data_dir(), which is commonly done in make_data_loaders().

Function signature: download_data(experiment_config: Dict[str, Any], hparams: Dict[str, Any]) -> str
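
As an illustration, a minimal sketch of a download_data() implementation might look like the following. The "data" section of the experiment configuration and its "url" key are assumptions made for this example, not a documented PEDL schema, and the standard-library urllib call stands in for whatever download mechanism a project actually uses.

```python
import os
import tempfile
import urllib.request
from typing import Any, Dict


def download_data(experiment_config: Dict[str, Any], hparams: Dict[str, Any]) -> str:
    # Hypothetical config layout: the "data" section and its "url" key are
    # assumptions for this sketch, not part of a documented schema.
    url = experiment_config["data"]["url"]

    # Create a fresh directory so this download cannot collide with
    # files written by another experiment on the same machine.
    download_dir = tempfile.mkdtemp(prefix="dataset-")

    # Save the file under its original name inside that directory.
    target = os.path.join(download_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, target)

    # Return the directory containing the data set, not the file itself.
    return download_dir
```

Because the function returns a directory path, code that builds the data loaders can read that path back with pedl.get_download_data_dir() and open the files inside it, rather than downloading the data again in every GPU process.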