Release Notes

Version 0.11

Version 0.11.2

Release Date: February 19, 2020

  • The “pedl agent list” command now includes a column displaying the agent's label.

  • Add the ability to configure whether to average gradients accumulated across batches by the aggregation_frequency (when aggregation_frequency > 1).
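
    As a rough illustration of the difference between summing and averaging accumulated gradients, here is a minimal framework-free sketch. The function name and the plain-list gradient representation are assumptions made for this example; they are not part of the PEDL API.

```python
from typing import List


def aggregate_gradients(per_batch_grads: List[List[float]], average: bool) -> List[float]:
    """Combine gradients accumulated over several batches into one update.

    Hypothetical helper for illustration only; not a PEDL function.
    """
    if not per_batch_grads:
        raise ValueError("need at least one batch of gradients")
    # Element-wise sum across the accumulated batches.
    summed = [sum(values) for values in zip(*per_batch_grads)]
    if not average:
        return summed
    # Averaging divides the sum by the number of accumulated batches,
    # which corresponds to aggregation_frequency in this sketch.
    n = len(per_batch_grads)
    return [g / n for g in summed]


grads = [[1.0, 2.0], [3.0, 4.0]]  # two batches, two parameters each
print(aggregate_gradients(grads, average=False))  # [4.0, 6.0]
print(aggregate_gradients(grads, average=True))   # [2.0, 3.0]
```

    Summing makes the effective step size grow with the number of accumulated batches, while averaging keeps it comparable to a single batch.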

  • Fix bug when specifying integer hyperparameters with grid search.

  • The PEDL CLI and PEDL harness code have been split up into separate packages.

  • Breaking Change: Require a step_mode argument for LRScheduler.

  • Add “pedl trial download” command to download checkpoints for a given trial.

  • Add “pedl.frameworks.pytorch.checkpoint.load” for loading PyTorch checkpoints into a Python process in downstream applications.

Version 0.11.1

Release Date: February 19, 2020

  • Breaking Change: Remove support for the distributed and optimized_parallel flags in the experiment configuration. For multi-GPU training, you only need to set the number of slots to a number greater than 1.

    The number of slots must be a multiple of the number of GPUs per machine (e.g., 4, 8, or 12 for 4-GPU machines).
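
    The slot rule above can be checked with a few lines of Python. This is a sketch only: the function name is hypothetical, and the real validation performed by PEDL may differ.

```python
def is_valid_multi_gpu_slot_count(slots: int, gpus_per_machine: int) -> bool:
    """Return True if `slots` is greater than 1 and a multiple of the per-machine GPU count.

    Hypothetical helper for illustration; not part of PEDL.
    """
    if gpus_per_machine <= 0:
        raise ValueError("gpus_per_machine must be positive")
    return slots > 1 and slots % gpus_per_machine == 0


# On 4-GPU machines, 4, 8, and 12 slots pass the check; 2 and 6 do not.
for slots in (2, 4, 6, 8, 12):
    print(slots, is_valid_multi_gpu_slot_count(slots, gpus_per_machine=4))
```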

  • Breaking Change: The method of deploying PEDL in cloud settings has changed from using AMIs versioned by the PEDL version to AMIs versioned by the PEDL environments version. Please contact the Determined AI team for help with the migration.

  • Breaking Change: Remove support for environment.debug in the experiment configurations. To enable debug mode, users should specify debug: true as a top-level field in their experiment configuration.
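
    For configurations held as Python dictionaries (for example, after loading a YAML file), the move from environment.debug to a top-level debug field could be sketched as follows. The helper name is an assumption, and keeping an existing top-level value is a design choice of this example, not documented PEDL behavior.

```python
import copy
from typing import Any, Dict


def migrate_debug_flag(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` with environment.debug moved to a top-level debug field.

    Hypothetical migration helper; not shipped with PEDL.
    """
    migrated = copy.deepcopy(config)
    environment = migrated.get("environment")
    if isinstance(environment, dict) and "debug" in environment:
        value = environment.pop("debug")
        # Keep an explicit top-level setting if one is already present.
        migrated.setdefault("debug", value)
    return migrated


old_config = {"description": "mnist", "environment": {"debug": True}}
new_config = migrate_debug_flag(old_config)
print(new_config["debug"])        # True
print(new_config["environment"])  # {}
```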

  • The PEDL CLI and PEDL harness code have been split up into separate packages. Please contact the Determined AI team for help with the new install approach.

  • Training and validation metrics are now distinguished with an identifier in the metrics view.

  • Fix bug that prevented some types of PyTorch metrics from being displayed in TensorBoard.

  • Remove support for generators in TF Keras models.

  • Switch to the MLflow model checkpoint format for PyTorch checkpoints.

  • Add CLI reference page to the documentation.

Version 0.11.0

Release Date: February 6, 2020

  • Breaking Change: Change the way environments are configured. If you need to configure the default environment, you can do so by specifying a script at the root directory of your experiment. If you’d like to use a custom Docker image, please contact us.

    Experiments defined prior to v0.11.0 will require migration to the new schema prior to submission. Please contact us before upgrading.

  • Remove KerasTrial, KerasFunctionalTrial, and KerasSimpleTrial (deprecated in 0.10.10). Please use TFKerasTrial instead.

  • Upgrade PyTorch version to 1.4.

  • WebUI: Add the ability to download full trial logs as a text file.

Version 0.10

Version 0.10.11

Release Date: January 24, 2020

  • Add support for PyTorch 1.3.

  • Add Tensor Fusion optimization for multi-GPU training. For more information, check out our experiment configuration reference guide.

  • Add support for Amazon RDS as a PEDL database.

  • Add support for preemptible instances on GCP.

    Known issue: If max_restarts > 0 while the preemptible instances feature is enabled, the task will hang if the instance is preempted. This will be fixed in a future release.

  • Fix modulo by zero error in PyTorch learning rate scheduler.

  • Check that slots_per_trial > 1 when distributed or optimized_parallel is set to True.

  • Include PEDL CLI in AWS/GCP images.

  • WebUI: Add option to display desired training and evaluation metrics in tabular form.

  • WebUI: Improve checkpoint visibility to show best overall and best per step in experiment and trial details.

  • WebUI: Add ability to copy experiment configuration directly to clipboard.

  • WebUI: Allow sort order of experiments page to be linked via query parameter.

  • WebUI: Show the total storage consumed by a single trial.

  • WebUI: Support shell viewing and management.

Version 0.10.10

Release Date: December 19, 2019

  • Breaking Change: Move aggregation frequency, gradient compression, and mixed precision training configurations from hyperparameters to optimizations in the experiment config.

  • Breaking Change: Move pedl.trial.get_trial_seed() to pedl.get_trial_seed().

  • Add documentation for TFKerasTrial.

  • Deprecation Warning: KerasTrial, KerasFunctionalTrial, and Simple Keras interfaces have been deprecated and will be removed in a future PEDL version. Please use TFKerasTrial.

  • Move the Python API reference documentation pages from /docs/reference to /docs/api.

  • Add the utility class.

  • Add support for TensorFlow 1.15.0. TensorFlow 1.14.0 remains the default because in 1.15.0, TFKerasTrial does not work with data loaders.

  • Add TensorBoard support for experiments using HDFS storage.

  • Support configuring the ports used by GLOO and NCCL during distributed training.

  • Support configuring checkpoint storage in the cluster configuration. This is the default storage used for new experiments. See Checkpoints for details.

  • Support using master.yaml to configure the network interface used for distributed training. Specifying the network interface in this way should reduce the start-up time for distributed training.

  • Update PyTorchTrial documentation to match the new PyTorchTrial API.

  • WebUI: Add logs for commands, notebooks, shells, and TensorBoards.

  • WebUI: Add a “copy to clipboard” button across all log views.

  • WebUI: Persist filter selections on the experiment list page as query parameters.

  • WebUI: Fix bug where the plot view for experiment and trial details would show on a second line in Firefox.

  • WebUI: Fix bug where description fields were being incorrectly sorted in tables.

  • WebUI, CLI: Fix bug where retrieved master logs were missing some logged fields.

  • CLI: Support downloading checkpoints from Google Cloud Storage (GCS).

Version 0.10.9

Release Date: December 6, 2019

  • Breaking Change: Remove custom definition of learning rate scheduling in favor of direct support of PyTorch learning rate schedulers.

  • Breaking Change: The pedl.callback.Callback interface has been deprecated.

  • Add option to set /dev/shm on a per-experiment basis in the experiment config.

  • Support launching an elastic agent with an IAM Instance Profile attached on AWS.

  • Support AWS authentication for checkpoints with IAM roles or environment variables. For more information, see

  • WebUI: Allow control- and command-clicking to open experiments and trial pages in new browser tabs.

  • WebUI: Add the ability to bulk kill experiments from experiment list view.

  • WebUI: Add confirmation prompt to terminal actions and actions affecting more than one entity.

  • WebUI: Add more detailed information to the trial page:

    Show detailed checkpoint information for each trial.

    Show timing information for training, validation, and checkpointing for each trial.

  • WebUI: Display the plotted trial metric in numerical format in the steps table.

Version 0.10.8

Release Date: November 22, 2019

  • Breaking Change: Update PyTorchTrial API in a backward-incompatible way.

The following API changes have been made:

1. Remove the `hparams` function argument; please use `pedl.get_hyperparameter()` instead.
2. Remove `PyTorchTrial.losses()` and `PyTorchTrial.training_metrics()`. They are replaced by `PyTorchTrial.train_batch(self, batch: TorchData, model: nn.Module, epoch_idx: int, batch_idx: int)`, which performs the forward pass and returns the loss and other training metrics.
3. Remove `PyTorchTrial.validation_metrics()`. It is replaced by `PyTorchTrial.evaluate_batch(self, batch: TorchData, model: nn.Module)`, which returns the validation metrics.

Existing PyTorch model definitions will need to be updated to work with this release of PEDL.

  • Breaking Change: Support PyTorch DataLoaders in a PyTorch 1.3 compatible way in preparation for PyTorch 1.3 support. Please see the PyTorch documentation for more details.

  • Breaking Change: Simplify master configuration for dynamic agents.

    The master URL of dynamic agents is now configured using the key master_url, replacing the previous key master_address. A valid master URL is in the format of scheme://hostname:port. The master URL defaults to http as scheme, the local IP address of the master as hostname, and 8080 as port if the master runs on cloud. The hostname can still be configured using alias.
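
    The defaulting rules for master_url can be illustrated with the standard library. This is a sketch under assumptions: the function name is hypothetical, and applying each default independently to a partially specified URL is a guess about the behavior, not something stated in these notes.

```python
from urllib.parse import urlsplit


def resolve_master_url(master_url: str, local_ip: str) -> str:
    """Fill in missing parts of a master URL with http, the local IP, and port 8080.

    Hypothetical helper for illustration; IPv6 literals are not handled.
    """
    if not master_url:
        return f"http://{local_ip}:8080"
    # urlsplit only recognizes the host when a scheme separator is present.
    if "://" not in master_url:
        master_url = "http://" + master_url
    parts = urlsplit(master_url)
    scheme = parts.scheme or "http"
    host = parts.hostname or local_ip
    port = parts.port or 8080
    return f"{scheme}://{host}:{port}"


print(resolve_master_url("", "10.0.0.5"))
print(resolve_master_url("https://master.example.com", "10.0.0.5"))
print(resolve_master_url("master.example.com:9090", "10.0.0.5"))
```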

    For AWS dynamic agents, the region now defaults to the region of the master instance, and support for the ec2.region alias has been removed. tag_key and tag_value now default to managed-by and an identifier determined by whether the master runs on EC2.

    For GCP dynamic agents, the project and zone now default to the project ID and zone of the master instance, and support for the gce.project-id alias has been removed. label_key and label_value now default to managed-by and an identifier determined by whether the master runs on GCP.

    See Dynamic Agents on AWS and Dynamic Agents on GCP for details.

  • New Feature: Support a user-defined startup script for dynamic agents. The startup script runs as root on all dynamic agents during instance startup. See Dynamic Agents on AWS and Dynamic Agents on GCP for details.

  • New Feature: Support GCP native API for dynamic agents. The instance resource base configuration is merged with the other fields in the configuration to construct the instance insert request. See Dynamic Agents on GCP for details.

  • Support attaching a second disk to GCP dynamic agents by configuring the instance base configuration and the user startup script. See the GCP dynamic agents documentation for details.

  • Fix bug that prevented AWS dynamic agents from working if the subnet_id is not specified but security_group_id is specified in the configuration.

  • Add support for in TFKerasTrial.

    Known issue: It currently does not support pause/restart for distributed training.

  • WebUI: Improve “Show Configuration” functionality on the experiment detail page.

  • WebUI: Improve messaging on the Cluster page when there are currently no agents running.

Version 0.10.7

Release Date: November 14, 2019

  • Add support for encrypting communications between the PEDL Master, CLI, and WebUI over HTTPS.

    Refer to Security for configuration instructions.

  • Add documentation for network requirements and recommendations for PEDL clusters.

  • Improve distributed training for EstimatorTrial and add support for optimized_parallel.

  • WebUI: Allow viewing master logs from the Cluster page.

  • Fix bug that prevented pulling images from repositories that are not

Version 0.10.6

Release Date: November 11, 2019

Version 0.10.5

Release Date: November 8, 2019

  • Breaking Change: Add hyperparameters to PyTorch checkpoints and modify the format to comply with best practices. See PyTorch checkpoints for more information.

  • Breaking Change: Support more fine-grained configuration of the VPC networking of Dynamic Agents on AWS. Previously, the desired security group was indicated with the field security_group, and only it was configurable. Now, the network subnet and public IP are also configurable, and the security group is specified with the field security_group_id.

  • Add documentation for tf.keras checkpoints.

  • Bind mounts can be specified with a relative container_path and will be placed in the working directory of the container. See Experiment Configuration for details.

  • CLI: The PEDL CLI now prints a warning if the CLI version differs from the master version.

  • WebUI: Improve the default sorting behavior of validation metrics.

  • Update the PEDL_HPARAMS environment variable with the per-GPU batch size for distributed and parallel training.

  • Fix bug where user-provided conda environments were being ignored.

  • Fix bug where exit codes of pedl cmd run were being ignored.

Version 0.10.4

Release Date: November 1, 2019

  • The Docker network of dynamic agents can be configured. See Dynamic Agents on GCP and Dynamic Agents on AWS for details.

  • WebUI: Reduce the number of trials shown in the experiment detail page.

  • WebUI: CPU-only notebooks can be launched in the Notebooks section of PEDL.

  • Fix bug in adaptive search where the target steps may be greater than step budget.

  • Improve logging of GCP operations with dynamic agents.

Version 0.10.3

Release Date: October 24, 2019

  • New Feature: Distributed training support for PyTorch.

  • Add a new API for downloading data in PyTorch models.

    When doing distributed training of a PyTorch model, a single process is created for each GPU being used on a given agent. Each of these processes will invoke the make_data_loaders() function; in most cases these calls will happen concurrently. If each copy of the training set data loader downloads the entire data set, this causes two problems: (1) the data set will be downloaded multiple times, and (2) if the data set is stored on disk, different copies of the download might overwrite or conflict with one another.

    To address these concerns, this release of PEDL introduces a new optional API for PyTorch models. If the developer implements a download_data() API function, this function will be invoked once per machine, before any data loaders are created. This function can be used to download a single copy of the data set; it should return the path of a directory on disk containing the data set. This path will then be passed when make_data_loaders() is invoked.
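
    The call order described above can be simulated without PEDL or PyTorch. In this sketch the exact signatures, the module-level functions, and the run_on_one_machine driver are all assumptions made for illustration; a real model would return PyTorch data loaders rather than file names.

```python
import os
import tempfile
from typing import List


def download_data() -> str:
    """Fetch the data set once and return the directory that holds it."""
    data_dir = tempfile.mkdtemp(prefix="dataset-")
    # Stand-in for a real download: write a small placeholder file.
    with open(os.path.join(data_dir, "train.txt"), "w") as f:
        f.write("example\n")
    return data_dir


def make_data_loaders(download_dir: str) -> List[str]:
    """Build a per-process 'loader' from the shared download (here: a file listing)."""
    return sorted(os.listdir(download_dir))


def run_on_one_machine(num_gpus: int) -> List[List[str]]:
    """Hypothetical driver: one download per machine, then one loader per GPU process."""
    data_dir = download_data()  # invoked once, before any data loaders exist
    return [make_data_loaders(data_dir) for _ in range(num_gpus)]


print(run_on_one_machine(num_gpus=4))
```

    Because every process reads from the same directory, the data set is fetched once and no process writes over another's copy.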

  • Optimize performance of single-machine, multi-GPU training with PyTorch.

    This new code path can be enabled by specifying optimized_parallel: True in the experiment config. This can result in substantially improved multi-GPU training performance (more than 5x faster in some cases). Optimized parallel performance currently requires full use of all GPUs on an agent. When this option is enabled, slots_per_trial (in the experiment config, under the resources key) must be set equal to the total number of GPUs on an agent.

  • Support default value for service account scopes in GCP dynamic agents configuration.

  • Improve tracking and error reporting for GCP operations when inserting and deleting instances.

    Previous versions of PEDL did not emit error messages when GCP operations (e.g., provisioning new dynamic agents) failed. These operations are now tracked more accurately and errors are reported in the master log.

  • Reduce size of metadata database by cleaning up fault tolerance metadata more aggressively.

  • Fix bug that prevented agents from reconnecting to the master in some situations.

  • CLI: Automatically connect via port 80 when the master address is specified as http://address/.

    Previous releases of PEDL defaulted to port 8080, which is inconsistent with the default port number for HTTP URLs.
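
    A minimal sketch of scheme-based port defaulting, using only the standard library. The function name is hypothetical, and the https entry is the conventional default rather than something stated in these notes.

```python
from urllib.parse import urlsplit

# Conventional default ports; only the http case is described in the notes above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def master_port(master_address: str) -> int:
    """Return the explicit port in the address, or the scheme's default port."""
    parts = urlsplit(master_address)
    if parts.port is not None:
        return parts.port
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"cannot infer a port for {master_address!r}")
    return DEFAULT_PORTS[parts.scheme]


print(master_port("http://address/"))       # 80
print(master_port("http://address:8080/"))  # 8080
```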

  • WebUI: By default, sort tables first by state and then by creation time.

  • WebUI: Add the ability to close “fork experiment” modals with the escape key.

  • WebUI: Add the ability to kill commands.

Version 0.10.2

Release Date: October 18, 2019

  • Breaking Change: Support configuring GCP network and subnetwork to use a different project than the project of the PEDL master instance.

    Previously, the network and subnetwork of the dynamic agents were set to the bare network and subnetwork names. Now, the full resource path is required. A valid full path for a network must include the project ID and be in the format projects/<project>/global/networks/<network>. Likewise, a valid subnetwork must be in the format projects/<project>/regions/<region>/subnetworks/<subnetwork>. See Dynamic Agents on GCP for details.
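
    When migrating existing configurations, the full paths can be assembled from their components. The helper names and the example project, region, and network names below are hypothetical.

```python
def _check(name: str, value: str) -> str:
    """Reject empty components and components that already contain a path separator."""
    if not value or "/" in value:
        raise ValueError(f"{name} must be a non-empty name without '/': {value!r}")
    return value


def network_path(project: str, network: str) -> str:
    """Build projects/<project>/global/networks/<network>."""
    return f"projects/{_check('project', project)}/global/networks/{_check('network', network)}"


def subnetwork_path(project: str, region: str, subnetwork: str) -> str:
    """Build projects/<project>/regions/<region>/subnetworks/<subnetwork>."""
    return (
        f"projects/{_check('project', project)}"
        f"/regions/{_check('region', region)}"
        f"/subnetworks/{_check('subnetwork', subnetwork)}"
    )


print(network_path("my-project", "default"))
print(subnetwork_path("my-project", "us-west1", "default"))
```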

  • New Feature: Introduce a new version of TensorBoard support.

    The architecture of our TensorBoard support has been overhauled. The new TensorBoard support features live updating and automatic serialization of tfevents files for PEDL batch metrics.

    This release changes the format used for storing TensorBoard metrics. Experiments created using previous releases of PEDL use the old metric format, which is not compatible with this new version of TensorBoard support. As such, TensorBoard cannot be launched on old experiments. If you would like to use TensorBoard on experiments created in previous versions of PEDL, please contact the Determined AI team.

  • Support configuring the Docker network used by masters, agents, and tasks.

    By default, PEDL creates a Docker network named pedl on both master and agent machines and uses this network for masters, agents, and task containers. This behavior can be configured via the network.conf config file: the PEDL_NETWORK variable defines the network used by the master and agent containers, while the TRIAL_RUNNER_NETWORK variable controls the network used by task containers. These variables can be set to the name of a Docker network to use (this network will be created automatically); alternatively, the special value host can be used, which causes PEDL to start containers using host-mode networking.

  • Support configuring shared memory size for trial runners and task containers on a per-agent basis.

    In previous releases of PEDL, trial runner containers used 4GB of shared memory (/dev/shm), whereas task containers used the Docker default for /dev/shm (64MB). In this release of PEDL, both trial runners and task containers now default to using 4GB of shared memory. This value can now be configured on a per-agent basis by setting TRIAL_RUNNER_SHM_SIZE in agent.conf.

  • Improve performance of single-machine, multi-GPU training with tf.keras and Tensorpack.

    This new code path can be enabled by specifying optimized_parallel in the experiment configuration. This can result in substantially improved multi-GPU training performance (more than 5x faster in some cases). When this option is enabled, trials from non-distributed, multi-slot experiments must use all the GPUs on the agent. The scheduler will automatically apply this constraint – for example, if the PEDL cluster consists of agents with 4 GPUs and 8 GPUs, experiments that are configured to use 2 slots_per_trial will never be scheduled, and the scheduler will automatically place experiments that use slots_per_trial: 4 and slots_per_trial: 8 on the respective agents.
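
    The placement constraint in the example above can be expressed as a small filter. This is only a sketch of the stated rule, with hypothetical agent names; it is not the scheduler's actual implementation.

```python
from typing import Dict, List


def schedulable_agents(slots_per_trial: int, agent_gpus: Dict[str, int]) -> List[str]:
    """Agents that can host a single-machine trial that must use all of the agent's GPUs."""
    return sorted(name for name, gpus in agent_gpus.items() if gpus == slots_per_trial)


cluster = {"agent-a": 4, "agent-b": 8}  # one 4-GPU agent and one 8-GPU agent
print(schedulable_agents(2, cluster))  # []
print(schedulable_agents(4, cluster))  # ['agent-a']
print(schedulable_agents(8, cluster))  # ['agent-b']
```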

  • Add ability to link PEDL users to a Unix user and group on agents.

    The Unix user/group associated with a PEDL user account can be set via a new CLI command, pedl user link-with-agent-user. When configured, tasks launched by the PEDL user will run as the linked Unix user and group. See Running tasks as particular agent users for more details.

  • Add support for storing checkpoints on Google Cloud Storage (GCS).

  • Add PEDL CLI to notebooks and commands.

  • Add documentation for TensorpackTrial.

  • Do not inject .bashrc when using custom container images.

    This avoids overwriting any .bashrc file that might exist in the custom image.

  • Fix bug in merging behavior for configuration templates for commands, notebooks, and tensorboards.

  • Fix bug in the handling of experiments in a stopping state on master restart.

  • CLI: Fix error when listing trials of an experiment with no trials.

Version 0.10.1

Release Date: October 10, 2019

  • Breaking Change: Simplify the interface of tf.keras trial.

    The interface of TFKerasTrial has been simplified to a single required function: build_model(). Unlike in previous versions of PEDL, the implementation of build_model() is required to compile the tf.keras model object before returning it. In the experiment configuration, the specified searcher.metric key is expected to adopt the naming convention used by tf.keras: val_<function_name>, e.g. val_categorical_accuracy.

  • Support killing pending notebooks, shells, and commands.

    Previously, it was only possible to terminate a workload once that workload had started up successfully.

  • Undefined experiment descriptions will default to a random petname.

  • Upgrade PyTorch support to 1.2.0.

  • Add support for CPU-only notebooks.

    In previous releases of PEDL, notebook tasks were always allocated a GPU. In this release of PEDL, CPU-only notebooks are now supported. This can be done by setting resources.slots to 0 in the configuration when launching the notebook.

  • Improve TensorBoard documentation.

  • Support configuring service account for dynamic agents on GCP.

    You can specify the name and the scopes of the service account in the master configuration. For more details, see the documentation for Dynamic Agents on GCP.

  • Make the master HTTP port configurable.

  • Add a new example with tf.keras: cifar10_cnn_tf_keras

  • CLI: Add a new command “pedl user list”.

  • Reduce size of container images.

  • Improve pre-built task environments.

    The default task environment, which is used for experiment, notebook, and command environments, now includes a larger set of packages by default. It now includes scikit-learn, matplotlib, pandas, OpenCV, pillow, and xgboost. This reduces the need for users to specify additional workload dependencies in configuration files. The default task environment is also pre-built for each release, so if there are no additional packages or commands in the experiment configuration, the task environment image will simply be downloaded directly rather than built from scratch. This should improve the time to launch experiments, notebooks, and commands.

  • Improve progress reporting of adaptive and simple adaptive experiments.

  • Support configuring the master address in Kubernetes Helm charts.

  • Add a persistent ID for the master to the database; add a cluster_id table with a single cluster_id UUID field.

  • Fix incorrect error message in PBT validation.

  • Fix a bug that prevented printing out correct master configuration in the logs.

  • Fix bug in Horovod startup where Horovod may take longer to start than expected.

  • WebUI: Fix an issue where opening a trial with a huge number of log events would result in logs never getting loaded.

  • WebUI: Fix a regression that caused classic Jupyter notebooks to be opened instead of JupyterLab notebooks.

  • WebUI: Fix an issue where the initially selected metric for trial plots could be different from the actual plotted one if there were two “loss” metrics.

  • WebUI: Correct labels for “Best validation” and “Latest validation” columns.

  • WebUI: Display different types of resources (CPU vs GPU) as separate bars in cluster page.

  • WebUI: Add support for launching TensorBoard for a specific trial from the trial detail page.

  • WebUI: Display the plotted validation metric’s name in experiment detail view.

  • WebUI: Add bulk pausing and archiving of experiments to the experiment list page.

    This makes it easier to apply the same operation to multiple experiments at the same time.

Version 0.10.0

Release Date: September 27, 2019

  • New Feature: Introduce a new version of the WebUI.

    The visual design of the WebUI has been overhauled. The new WebUI also features improved performance for large experiments and a refactored internal architecture.

  • New Feature: Add native support for Anaconda environments.

    This makes it easier for users to create PEDL experiments that have Conda-based dependencies. See the Environment Configuration documentation for details.

  • Breaking Change: Improvements to the TensorpackTrial interface.

    Remove support for train_dataflow and validation_dataflow and support make_data_loaders in TensorpackTrial.

  • Breaking Change: Change the way instance providers are configured for dynamic agents.

    Previously, the cloud provider was configured using the cloud configuration field. This field has now been renamed to provider. For more details, see the documentation for Dynamic Agents on AWS and Dynamic Agents on GCP.

  • Fix a bug that caused trials to crash when the master was restarted.

  • Improve distributed training performance for tf.keras models.

  • Improve performance of models trained with EstimatorTrial interface.

    Previously, every training and validation step would pay the overhead cost of initializing a TensorFlow graph. In this release of PEDL, this per-step initialization overhead is reduced to once per trial container.

  • Reduce number of proxy configuration options.

    See Agent Network Proxies for details.

  • CLI: Fix bug when creating experiments on Windows.

  • Upgrade to PyTorch 1.1.0.

  • Fix bug where pausing an experiment that had not been scheduled would erroneously trigger the restart logic for its corresponding trials.

  • Improve validation of PBT configuration parameters.

  • Improve agent logging when pulling images.

  • Fix deprecation warnings when importing pedl.frameworks.tensorflow.

  • Refactor the MNIST PyTorch example to use a directory model definition.

  • Remove calls to super().__init__() from example models.

Version 0.9

Version 0.9.6

Release Date: September 13, 2019

  • Simplify Kubernetes installation by auto-detecting GPUs.

  • Update MNIST Keras example.

  • Fix bug preventing dynamic agents from removing instances after tasks finish.

Version 0.9.5

Release Date: September 12, 2019

  • New Feature: Support the ability to configure scheduling fit policy.

    The scheduler can now be configured to use a worst-fit or best-fit policy when assigning tasks to agents in the cluster. The best-fit policy ensures that tasks will be preferentially “packed” together on the smallest number of agents, rather than be placed on under-utilized agents.
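
    The two policies can be contrasted with a small sketch. It uses the textbook definitions (best-fit picks the agent with the fewest free slots that still fits the task, worst-fit picks the one with the most); the function name, agent names, and tie-breaking by name are assumptions, and the actual scheduler may differ.

```python
from typing import Dict, Optional


def pick_agent(free_slots: Dict[str, int], needed: int, policy: str) -> Optional[str]:
    """Choose an agent for a task that needs `needed` slots, or None if nothing fits."""
    candidates = {agent: free for agent, free in free_slots.items() if free >= needed}
    if not candidates:
        return None
    if policy == "best":
        # Fewest free slots that still fit: packs tasks onto busy agents.
        return min(candidates, key=lambda agent: (candidates[agent], agent))
    if policy == "worst":
        # Most free slots: spreads tasks onto under-utilized agents.
        return min(candidates, key=lambda agent: (-candidates[agent], agent))
    raise ValueError(f"unknown policy: {policy!r}")


free = {"agent-a": 1, "agent-b": 3, "agent-c": 4}
print(pick_agent(free, needed=1, policy="best"))   # agent-a
print(pick_agent(free, needed=1, policy="worst"))  # agent-c
```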

  • New Feature: Support the ability to create commands, shells, and notebooks using configuration templates.

    Configuration templates can now be used to reduce redundancy in configuration files beyond those of experiments. With this feature, users can move settings that are shared by many commands, shells, and notebooks into a single YAML file that can then be referenced by configurations that require those settings. See Configuration Templates for more details.

  • Default to “best-fit” instead of “worst-fit” scheduling fit policy.

  • Update scheduling behavior of trials that use distributed training.

    Pack distributed tasks on the fewest agents possible. For example, suppose your cluster has two 4-slot agents and four 2-slot agents and you need to schedule an 8-slot trial. Previously, if one slot was already in use, PEDL could schedule the distributed trial on some mixture of the remaining 4-slot and 2-slot agents. With this change, PEDL will now wait until both 4-slot agents are free to schedule the trial.
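
    The example above can be reproduced with a short sketch. It assumes that an agent with any slot in use counts as unavailable for the trial and that the scheduler targets the smallest agent count that could cover the request; both are simplifications made for this illustration, and the function names are hypothetical.

```python
from typing import List


def min_agents_needed(capacities: List[int], slots: int) -> int:
    """Fewest agents whose combined capacity covers `slots`, taking the largest first."""
    total = 0
    for count, capacity in enumerate(sorted(capacities, reverse=True), start=1):
        total += capacity
        if total >= slots:
            return count
    raise ValueError("cluster is too small for this trial")


def can_schedule(capacities: List[int], idle: List[bool], slots: int) -> bool:
    """True if the trial fits on the minimum number of agents using only idle agents."""
    target = min_agents_needed(capacities, slots)
    idle_caps = sorted((c for c, is_idle in zip(capacities, idle) if is_idle), reverse=True)
    return sum(idle_caps[:target]) >= slots


capacities = [4, 4, 2, 2, 2, 2]  # two 4-slot agents and four 2-slot agents
one_busy = [False, True, True, True, True, True]  # one 4-slot agent has a slot in use
all_idle = [True] * 6
print(can_schedule(capacities, one_busy, slots=8))  # False: wait for both 4-slot agents
print(can_schedule(capacities, all_idle, slots=8))  # True
```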

  • Improve documentation of Elastic AI infrastructure.

  • Improve documentation of hyperparameter search methods, especially around adaptive_simple.

  • Print more helpful error message when trying to create an experiment with an invalid config.

  • Fix bug that prevents using subnetwork in the network interface of GCP dynamic agents.

  • Fix bug that prevents the master from spawning any dynamic agents if there are only zero-slot tasks running, such as TensorBoard and GC tasks.

  • Improve distributed training performance for TensorpackTrial.

Version 0.9.4

Release Date: September 6, 2019

  • Support HTTP proxy environment variables in agent.

    See agent installation documentation for details.

  • Ignore byte-compiled Python files in model definitions by default.

  • Update creation time of generated files in notebooks.

  • Disallow model definition directories named “pedl”.

  • Skip redundant evaluations during training in EstimatorTrial.

  • Fix bug preventing distributed trials from properly rendezvousing.

Version 0.9.3

Release Date: August 30, 2019

  • New Feature: Support Tensorpack framework.

  • Deprecation Warning: Deprecate calling super().__init__() in Trial subclasses. Also remove support for overwriting train_for_step and compute_validation_metrics in Trial subclasses.

  • Support configuring logging level in the master configuration file (master.yaml).

  • Improve error handling when the master configuration is malformed.

  • Reduce maximum allowable context size by 1MB to 95MB.

  • Fix rare race condition for model definitions using the TFKerasTrial API.

  • WebUI: Default to “All experiments” filter.

  • Improve error handling when keras.utils.Sequence object has zero length.

Version 0.9.2

Release Date: August 23, 2019

  • New Feature: Dynamic Agents on GCP.

    With Dynamic Agents on GCP, the PEDL master automatically provisions and terminates PEDL agent instances based on the number of slots needed by pending tasks and the current utilization of agent instances. See Dynamic Agents on GCP for details.

  • New Feature: Users.

    The new users system allows for assets (e.g., experiments, notebooks, etc.) to be organized by owner. New features are described in Users.

    NOTE: Previous versions of the PEDL CLI are incompatible with the new PEDL master. Consequently, all users must upgrade to the latest version of the PEDL CLI.

    When upgrading from an older version of PEDL, a new user account, pedl, will be automatically created; all existing assets will be assigned to it. By default, all interactions with PEDL (CLI and WebUI) will automatically authenticate as the pedl user, so no action is required of administrators for the system to function. Refer to Users for instructions on changing this behavior, including how to create individual accounts.

  • Breaking Change: Fully remove support for deprecated Trial import paths.

    pedl.frameworks.keras_trial and pedl.frameworks.tensorflow_trial have been fully deprecated. Please update all model definitions to use pedl.frameworks.keras and pedl.frameworks.tensorflow respectively.

  • Improve error handling when reading model definition or context directories that are over 96MB in size.

  • Default to the JupyterLab view when launching a Jupyter notebook.

  • Prevent TensorBoard from starting when specifying more than 100 trials.

  • Improve support for distributed training with TF Keras API.

  • Fix a bug in the handling of hyperparameters in PBT searches.

  • Fix bug that caused killing an experiment to always result in a timeout.

  • WebUI: Add a legend to the cluster allocation chart.

Version 0.9.1

Release Date: August 12, 2019

  • Internal release

Version 0.9.0

Release Date: August 8, 2019

  • New Feature: Dynamic Agents on AWS.

    With Dynamic Agents on AWS, the PEDL master automatically provisions and terminates PEDL agent instances based on the number of slots needed by pending tasks and the current utilization of agent instances. See Dynamic Agents on AWS for details.

  • Breaking Change: Change the default TensorFlow version to 1.14.0 for all configurations. The previous default TensorFlow version was 1.13.1.

  • Use a Docker user-defined network instead of using --link for compatibility with Docker 19.x. Users upgrading PEDL should make sure to run make install to install the new systemd services.

  • Make the pedl trial logs -f command exit after the trial it is logging reaches a terminal state.

  • Download prebuilt images from Docker Hub when possible, or build them from scratch otherwise.

  • Make the CLI print progress while zipping model and context directories.

  • Fix shell mode “Too many authentication failures” error by specifying identities for SSH agent authentication.

  • Limit the step budget for adaptive searches to 50000 and limit the number of trials for adaptive_simple searchers to 2000.

  • Fix bug where adaptive and PBT searchers ignored smaller_is_better.

  • Avoid container name collisions across different agent instances.

Version 0.8

Version 0.8.29

  • Internal release

Version 0.8.28

  • Internal release

Version 0.8.27

  • Internal release

Version 0.8.26

  • Internal release

Version 0.8.25

Release Date: July 9, 2019

  • Breaking Change: Remove support for TensorFlow 1.10 and 1.11.

  • Support logging garbage collection tasks that have failed.

  • Add support for TensorFlow 1.14. The default TensorFlow version remains 1.13.1.

  • Improve support for distributed training with the TensorFlow Estimator API when using TensorFlow 1.14.

  • Fix bug preventing the use of context directories with notebooks, commands, and shell sessions.

Version 0.8.24

  • Internal release

Version 0.8.23

Release Date: July 4, 2019

  • Remove support for Python 3.6.8. All containers will use Python 3.6.9 by default.

Version 0.8.22

Release Date: July 3, 2019

  • Fix bug that prevented Kerberos support without an HDFS checkpoint storage configuration.

Version 0.8.21

Release Date: June 27, 2019

  • Breaking Change: Specify experiment templates via the CLI rather than the experiment configuration.

    Previously, experiment configuration templates were specified via the experiment configuration key template. As of this version of PEDL, the template key in the configuration will be ignored—users should specify a template as follows: pedl experiment create --template <template-name>.

  • Sort task list in CLI by task type and creation time.

  • Permit empty Keras sequences for Keras trials.

  • Fix IMDB Keras adaptive search example.

Version 0.8.20

Release Date: June 13, 2019

  • New Feature: Support for keras.utils.Sequence objects and Python generators in make_data_loaders for KerasTrial and KerasFunctionalTrial. See Data Loading for Keras trials for more details.

  • New Feature: Support for in make_data_loaders for PyTorchTrial. See the Data Loading for PyTorch trials for more details.

  • Breaking Change: BatchLoader interface support in PyTorchTrial has been removed in favor of

Version 0.8.19

Release Date: June 6, 2019

  • Fix IMDB Keras example, which broke when NumPy 1.16.3 changed the default value of the allow_pickle field.

  • TensorBoard commands now verify trial/experiment existence before launching.

  • Fix bug that caused the master to exit when restoring experiments with invalid configurations.

  • Fix bug that prevented pedl experiment list-checkpoints from listing garbage collected checkpoints.

Version 0.8.18

Release Date: May 30, 2019

  • WebUI: Add TensorBoard button for experiments. It launches TensorBoard or opens a preexisting TensorBoard instance for an experiment.

  • Fix error when displaying agents for tasks in the CLI.

Version 0.8.17

Release Date: May 23, 2019

  • Fix regression that led to experiment failure if the default environment Docker images were deleted on the agent node.

  • Stop including tfevent files in checkpoints when experiments use the EstimatorTrial interface.

  • Improve stability and reduce memory footprint during master restarts.

  • WebUI: Fix bug that caused trial logs to always jump to the bottom of the logs, even after manually scrolling up.

Version 0.8.16

Release Date: May 16, 2019

  • New Feature: PEDL now offers the ability to create experiments using configuration templates. Configuration templates can be used to reduce redundancy in experiment configuration files. With this feature, users can move settings that are shared by many experiments into a single YAML file that can then be referenced by configurations that require those settings. See the “Configuration Templates” section of the documentation for more details.

  • Fix bug that caused an agent to pull all tags for a custom image instead of only the latest tag.

  • Improve readability of trial log messages by inserting a delimiter “===” on trial start.

  • Print slot id on trial runner startup.

  • Test for the presence of an Nvidia driver installation by checking for the driver module directly instead of running the nvidia-smi command.

  • Improve the robustness of the pedl-db-backup command.

  • Remove the connection warning in trial logs when a trial runner terminates.
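
The configuration templates described above move shared settings into one file that individual experiment configurations then reference. The notes do not specify how a template and a configuration are combined, so the following is only a minimal sketch under an assumed rule: the experiment's own values win, and nested sections are merged key by key. The function name apply_template and the example keys type and host_path are hypothetical.

```python
from copy import deepcopy


def apply_template(template: dict, config: dict) -> dict:
    """Return a new config: template values, overridden by the experiment's own.

    Assumed merge rule (not taken from the release notes): nested dicts are
    merged recursively; any other value in `config` replaces the template's.
    """
    merged = deepcopy(template)
    for key, value in config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_template(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Settings shared by many experiments (illustrative keys only).
shared = {
    "checkpoint_storage": {"type": "shared_fs", "host_path": "/mnt/ckpt"},
    "max_restarts": 5,
}

# One experiment's configuration, overriding a single nested value.
experiment = {
    "description": "mnist-baseline",
    "checkpoint_storage": {"host_path": "/mnt/other"},
}

print(apply_template(shared, experiment))
```

Neither input is modified, so the same template dict can be reused for many experiments.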

Version 0.8.15

Release Date: May 9, 2019

  • Minor cleanup and bug fixes.

Version 0.8.14

Release Date: May 7, 2019

  • New Feature: Trial logs in the WebUI now update in real time.

    When a trial log is initially opened, the view is scrolled to the bottom of the log (where the most recent log lines are displayed). If the trial is still active, additional log lines will continue to appear at the bottom of the view (scrolling is automatic) unless the user scrolls away from the bottom of the view. If the user scrolls up, automatic updating and scrolling are disabled. To re-enable them, the user can scroll back to the bottom of the view.

  • New Feature: PEDL now offers the ability to launch TensorBoard to view trial metrics. See the “TensorBoard” section of the documentation for more details.

  • WebUI: Change the main page to display experiments, commands, and notebooks in separate tabs.

  • Fix bug causing PEDL Commands to not use prebuilt task environment images.

  • Fix bug causing Docker autoremove to result in failed experiments.

  • Fix bug causing TypeErrors in user code to be silently dropped.

  • The scheduler now uses a worst-fit policy when assigning tasks to agents in the cluster. This ensures that tasks will be placed preferentially on agents that are under-utilized, rather than “packing” tasks together on the smallest number of agents.
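
The worst-fit policy mentioned above can be illustrated with a small sketch. This is not the PEDL scheduler's code: the function name worst_fit, the agent names, and the tie-breaking rule (alphabetical by agent name) are assumptions made for the example. The idea is simply to pick, among the agents that can hold the task, the one with the most free slots.

```python
from typing import Dict, Optional


def worst_fit(free_slots: Dict[str, int], slots_needed: int) -> Optional[str]:
    """Pick the agent with the MOST free slots that can still fit the task.

    Returns None when no single agent has enough free slots.
    Ties are broken by agent name (an assumption for determinism).
    """
    candidates = {
        agent: free for agent, free in free_slots.items() if free >= slots_needed
    }
    if not candidates:
        return None
    # Sort key: most free slots first, then name.
    return min(candidates, key=lambda agent: (-candidates[agent], agent))


free = {"agent-a": 1, "agent-b": 4, "agent-c": 2}
print(worst_fit(free, 1))  # the least-utilized agent is chosen
print(worst_fit(free, 8))  # no agent is large enough
```

A best-fit ("packing") policy would instead choose the agent with the fewest free slots that still fits the task, which is the behavior this release moves away from.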

Version 0.8.13

Release Date: April 22, 2019

  • Fix bug that led to experiment failure if the default environment Docker images were deleted on the agent node.

Version 0.8.12

Release Date: April 19, 2019

  • Fix bug in pedl-pull-images that prevented pre-generated images from being pulled.

Version 0.8.11

Release Date: April 18, 2019

  • Breaking Change: Remove explicit trial runner Docker images.

    The pedl-tr-py3.6-tf and pedl-tr-py3.6-pytorch images are no longer distributed with PEDL because experiments now create their environments on demand, as specified by the environment configuration key.

  • Breaking Change: Remove the trial_environment key from the experiment configuration (deprecated in PEDL 0.8.9).

    Instead of using the trial_environment key, an experiment configuration should use the environment key.

  • Breaking Change: Remove support for old Trial constructor interface (deprecated in PEDL 0.8.5).

  • Update version of TensorFlow.

    Experiments now support TensorFlow 1.13.1 and CUDA 10.0 by default.

Version 0.8.10

Release Date: April 5, 2019

  • New Feature: PEDL now offers the ability to launch Jupyter notebooks attached to one or more slots in the cluster. See the “Jupyter Notebooks” section of the documentation for more details.

  • Fix bug that crashed any model definitions that imported the cProfile standard Python library.

  • Breaking Change: Simplify PyTorch API.

    Within the PyTorchTrial class, the expected behavior of a model’s forward() has been altered to be more consistent with native PyTorch models. Additionally, the signatures of the training_metrics() and validation_metrics() methods have been modified, and a new losses() method has been added. See PyTorchTrial Interface for more details. PyTorch examples have been updated to use the new API.

  • Multi-GPU support for training PyTorch models.

    PEDL can now transparently train PyTorch models using multiple GPUs if an experiment is configured to use multiple slots per trial. See the experiment configuration documentation for more details.

Version 0.8.9

Release Date: March 19, 2019

  • New Feature: Configure experiment containers with the same method as PEDL commands.

    It is now possible to configure an experiment’s environment using the environment key, following the same semantics as with configuring PEDL commands. Additionally, the environment key for both experiments and PEDL commands now supports GPU- and CPU-specific tags under runtime_packages and runtime_commands.

  • New Feature: Running and pending commands now listed in the WebUI.

    The experiments overview page now lists running and pending commands in the “Commands” section right under the “Finished Experiments” section.

  • Deprecation Warning: Experiment configuration key trial_environment has been deprecated in favor of environment.

    For backwards compatibility, trial_environment is still supported in this version of PEDL. It will be removed in a future version.

Version 0.8.8

Release Date: March 14, 2019

  • Documentation: Add data loaders tutorial and example data loaders.

  • New Feature: Add listing of running or pending commands.

    The PEDL CLI command cmd can now take the list argument which displays a list of running and pending commands.

  • Update version of PyTorch.

    The trial runners now include PyTorch 1.1.0.

Version 0.8.7

Release Date: March 7, 2019

  • New Feature: Support for grid search.

    There is a new grid option for hyperparameter search. The MNIST examples have been updated with sample grid search experiment configuration files. Please see the Hyperparameter Search: Grid for more details.

  • New Feature: Quick start guide in documentation.

    Check out the quick start guide!

  • New Feature: Support for per-experiment weights in scheduler.

    It is now possible to specify a weight for each experiment using the resources.weights field, defaulting to 1; each active experiment will be allocated a number of slots that is approximately proportional to its weight. The weight of an existing experiment can be set via the CLI (pedl experiment set weight <id> <weight>).

  • Upgrade to TensorFlow 1.12 in default TF trial runner image.
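
The per-experiment weights described above give each active experiment a number of slots approximately proportional to its weight. The notes do not say how fractional shares are rounded, so the sketch below assumes largest-remainder rounding. The function name allocate_slots and the experiment IDs are hypothetical.

```python
from typing import Dict


def allocate_slots(weights: Dict[int, float], total_slots: int) -> Dict[int, int]:
    """Split total_slots across experiments in proportion to their weights.

    Rounding uses the largest-remainder method (an assumption; the real
    scheduler is only documented as "approximately proportional").
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        return {exp_id: 0 for exp_id in weights}

    # Exact (fractional) share of each experiment.
    shares = {
        exp_id: total_slots * weight / total_weight
        for exp_id, weight in weights.items()
    }
    # Start from the floor of each share.
    allocation = {exp_id: int(share) for exp_id, share in shares.items()}

    # Hand out the leftover slots to the largest fractional remainders.
    leftover = total_slots - sum(allocation.values())
    by_remainder = sorted(
        weights, key=lambda exp_id: (-(shares[exp_id] - allocation[exp_id]), exp_id)
    )
    for exp_id in by_remainder[:leftover]:
        allocation[exp_id] += 1
    return allocation


# Experiment 3 has twice the weight of experiments 1 and 2.
print(allocate_slots({1: 1, 2: 1, 3: 2}, total_slots=8))
```

With the default weight of 1 for every experiment, this reduces to an even split of the cluster.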

Version 0.8.6

Release Date: February 22, 2019

  • New Feature: Support specifying bind mounts for PEDL commands.

    PEDL commands now take a --volume <host path>:<container path> argument that mounts a path on the agent machine as a path in the command container (e.g., --volume /shared-fs:/shared-fs). Multiple mounts can be indicated with multiple --volume arguments.

  • New Feature: Support the ability to maintain callback state.

    To use callbacks that maintain state, please implement the save() and load() functions in the pedl.callback.Callback interface.

  • Support the ReduceLROnPlateau callback when used with Keras simple model definitions.

    The semantics of the patience and cooldown arguments to ReduceLROnPlateau are slightly modified when used in PEDL. Please see the Keras simple model definition documentation for more details.

  • Breaking Change: Support multi-input multi-output PyTorch models.

    Within the PyTorchTrial class, the loss() method has been removed and the signatures of the training_metrics() and validation_metrics() methods have been modified. PyTorch examples have been updated to use the new API. The MNIST example now contains a multi-output example as well.

  • Breaking Change: Update the CLI.

    The pedl trial list and pedl checkpoint list commands have been moved to pedl experiment list-trials and pedl experiment list-checkpoints, respectively. The new names may be abbreviated pedl e lt and pedl e lc.

  • Fix bug that prevented creating experiments with a security configuration.

  • Re-enable support for the pbt search method.

  • Update default trial runner images to use SciPy 1.2.1 and Keras-Preprocessing 1.0.9.

  • Upgrade to Postgres 10.7.

Version 0.8.5

Release Date: February 14, 2019 💕

  • New Feature: Support for non-graceful termination of active PEDL commands.

    PEDL commands can now be terminated immediately with pedl cmd kill <task_id>. The task ID can be found by either listing slots (pedl slot list) or listing tasks (pedl task list).

    Limitation: pedl cmd kill does not currently support killing commands that are still pulling or building images; i.e., killing a command may be postponed until the Docker build steps have finished and the task starts running.

  • Deprecation Warning: Data loaders have been decoupled from the Trial interface.

    The training_loader and validation_loader arguments have been removed from the constructors to EstimatorTrial, KerasTrial, KerasFunctionalTrial, and TensorFlowTrial. Previously, the constructors for these classes were def __init__(self, training_loader, validation_loader, hparams). Now, they should be def __init__(self, hparams).

    For backwards compatibility, the old interface is still supported in this version of PEDL. It will be removed in a future version.

  • Deprecation Warning: Experiment configuration key checkpoint_storage.checkpoint_path has been deprecated in favor of checkpoint_storage.storage_path.

  • Fix bug that prevented use of quoted commands with pedl cmd run.

    For example, pedl cmd run "echo hello && echo world" should now work as intended.

  • Update the default trial runner images to use TensorFlow 1.11.0 and CuDNN 7.4.

  • WebUI: Make pressing the escape key close any open modal.

  • WebUI: Remove support for creating experiments; use the CLI (pedl experiment create) to create experiments.

  • WebUI: Replace Experiments dropdown with Active Experiments.

  • Documentation: Add an FAQ section and an overview page on hyperparameter search methods.

  • Fix bug when using a non-default WORKDIR in custom Docker images with PEDL commands.

Version 0.8.4

Release Date: February 5, 2019

  • Fix scheduler bug that led to an indefinite hang with pedl slot list or pedl agent list.

  • Display PEDL command description when starting a command with pedl cmd run.

  • Improve organization of documentation by splitting “PEDL Overview” into multiple pages.

  • cli: Return an error message from pedl experiment kill if the experiment is not active.

Version 0.8.3

Release Date: January 30, 2019

  • Breaking Change: Move namespace of TensorBoard callback to pedl.frameworks.tensorflow.TensorBoard.

  • Modify dependency installation order of operations when a custom Docker base image is specified.

    When a custom Docker base image is specified, runtime_packages and/or runtime_commands are now installed after injecting PEDL harness code and installing PEDL harness dependencies. These configurations can be used to override PEDL harness dependencies, if needed.

  • Downgrade to h5py version 2.7.1.

Version 0.8.2

Release Date: January 29, 2019

  • Fix a bug in a database migration when converting checkpoints to a new internal format.

Version 0.8.1

Release Date: January 29, 2019

  • New Feature: Track file sizes of checkpoints.

    pedl checkpoint list will now display the size of each checkpoint. Sizes for checkpoints created before this version of PEDL are not computed retroactively and default to 0.

  • New Feature: Add ability to non-gracefully terminate an experiment with pedl experiment kill.

    Killing an experiment will immediately terminate an experiment by killing all of its associated trials. pedl experiment kill does not checkpoint each trial before terminating it, so this command should be used with care. To gracefully terminate an experiment, please use pedl experiment cancel.

  • cli: Remove device UUID from display name when showing slots via pedl slot list.

  • cli: Fix bug that displayed an incorrect response message when killing a trial via pedl trial kill.

  • cli: Fix bug in pedl slot disable.

  • Fix bug in setting a default value for checkpoint storage configuration checkpoint_path on experiment creation.

Version 0.8.0

Release Date: January 28, 2019

NOTE: This release changes the command-line syntax of the PEDL CLI. See below for details.

NOTE: This release includes significant changes to the internals of the PEDL master. As a result, running experiments cannot be upgraded from previous versions of PEDL. Before upgrading to PEDL 0.8.0, please cancel all running experiments. (Warm-starting of old experiments with the upgraded PEDL master should continue to work.)

  • Breaking Change: Port PEDL master to Go, improve scalability.

    The PEDL master has been reimplemented in Go. In addition, the master uses a new approach to managing concurrent operations. In concert, these changes should result in substantial improvements to the master’s performance and its robustness under heavy load. As noted above, running experiments cannot be upgraded from previous versions of PEDL.

  • Breaking Change: Change command-line syntax for PEDL CLI.

    The CLI has been changed to use a consistent pedl <noun> <verb> syntax. For example, creating an experiment was previously done via pedl create; the new syntax is pedl experiment create, which can be shortened to pedl e create. The previous command-line syntax is no longer supported. The new CLI also supports tab-completion. To enable it, run eval "$(register-python-argcomplete pedl)".

  • New Feature: Add support for executing arbitrary commands.

    PEDL now supports running arbitrary commands on agent machines. This feature is intended to support workflows that do not easily fit into the standard experiment workflow. Commands can be started using pedl cmd run. For more information, see the documentation.

  • Breaking Change: Reject experiment config files with unrecognized keys.

    Previously, PEDL would accept experiment configuration files with unrecognized keys. Such keys were ignored, so typos in the config file could result in confusing behavior. In this release of PEDL, unrecognized keys in configuration files are rejected. As a special case, arbitrary keys are allowed under the data top-level key. Users who wish to include custom directives in their experiment configuration files should move those directives to the data section.

  • Breaking Change: Adopt Keras naming convention for validation metric names in the Simple Keras API.

    Validation metrics will automatically be prefixed with val_, for consistency with the naming convention for validation metrics used by Keras itself.

  • Add support for exporting TensorFlow Estimator trials to the SavedModel format.

    Model definitions that use the TensorFlow Estimator API can now implement an optional API, build_serving_input_receiver_fns, to support exporting the model to the SavedModel format.

  • Disable support for the pbt search method.

    Support for PBT will be reintroduced in a future release of PEDL.

  • Remove support for “system dump”.

  • Shrink size of PEDL agent container image.

  • Upgrade to Postgres 10.6.

  • Update the agent and trial runner container images to use Python 3.6.8.
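
The stricter configuration validation introduced in this release (unrecognized keys are rejected, except under the data top-level key) can be sketched as follows. This is an illustration only: the SCHEMA below is a small hypothetical subset of keys, not the real experiment configuration schema, and find_unknown_keys is not a PEDL function.

```python
ANY = object()  # marker: arbitrary nested keys are allowed here

# Hypothetical, heavily abbreviated schema. None marks a leaf value.
SCHEMA = {
    "description": None,
    "searcher": {"name": None, "metric": None, "smaller_is_better": None},
    "data": ANY,
}


def find_unknown_keys(config: dict, schema: dict, prefix: str = "") -> list:
    """Return dotted paths of all keys in config that the schema does not know."""
    unknown = []
    for key, value in config.items():
        path = prefix + key
        if key not in schema:
            unknown.append(path)
            continue
        sub_schema = schema[key]
        if sub_schema is ANY or sub_schema is None:
            continue  # free-form section or leaf: nothing more to check
        if isinstance(value, dict):
            unknown.extend(find_unknown_keys(value, sub_schema, path + "."))
    return unknown


config = {
    "description": "typo demo",
    "searcher": {"name": "random", "metirc": "loss"},  # typo: "metirc"
    "data": {"url": "s3://bucket/train", "anything": {"goes": True}},
    "custom_directive": 1,  # should be moved under "data"
}

problems = find_unknown_keys(config, SCHEMA)
if problems:
    print("rejected, unrecognized keys:", ", ".join(problems))
```

Under the old behavior the misspelled metirc key would have been silently ignored; with strict validation the typo is reported before the experiment starts.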

Version 0.7

Version 0.7.14

Release Date: December 13, 2018

  • Make the HDFS checkpointing logic more robust to retryable failures.

Version 0.7.13

Release Date: December 12, 2018

  • Breaking Change: New pedl.callback.Callback interface.

    The training_step_callbacks() and validation_step_callbacks() interfaces for standard trial definitions have been removed in this PEDL version. In their place, the pedl.callback.Callback() API can be used to execute Python functions at the beginning and/or end of training and/or validation steps. See the “Callbacks” section in the PEDL overview documentation for more details.

  • Add pedl.callback.TensorBoard to simplify TensorBoard integration.

    See “TensorBoard Integration” in the PEDL overview documentation for an example of how to integrate TensorBoard into your workflow.

  • Remove support for pedl system-dump.

  • Fix bug in EstimatorTrial that caused long-running trial runner containers to consume unbounded disk space.

Version 0.7.12

Release Date: December 3, 2018

  • Add a workaround for a TensorFlow memory leak bug when using EstimatorTrial.

    Previously, the physical memory of a PEDL trial runner could grow without bound when using the EstimatorTrial API with certain types of tf.train.Optimizer instances. This version of PEDL includes a monkey-patched version of TensorFlow to address this issue until an upstream fix is merged by the TensorFlow team. Please see the corresponding bug report for details.

Version 0.7.11

Release Date: November 29, 2018

  • Add scripts to simplify backing up and restoring PEDL’s metadata database.

    These scripts are named pedl-db-backup and pedl-db-restore, respectively.

  • Upgrade to Keras 2.2.4 in the default trial runner base image.

  • Work around bug in Keras when using multiprocessing.

    A bug in Python’s multiprocessing module resulted in hangs when used with Keras simple model definitions in some situations. This release of PEDL includes a workaround for the underlying multiprocessing bug.

  • Fix error when garbage collecting checkpoints stored on Kerberos-enabled HDFS file systems.

Version 0.7.10

Release Date: November 15, 2018

  • Prevent swallowing of the full traceback in trial logs when model definition code raises a StopIteration exception.

Version 0.7.9

Release Date: November 10, 2018

  • New Feature: Add support for TensorFlow’s tf.estimator.Estimator API.

    Users can now use the EstimatorTrial interface to train Premade or Custom tf.estimator.Estimators with PEDL. A new example model definition using this interface (mnist_estimator) has been added to the examples page. Please see the documentation for a full description of the API.

  • New Feature: Add support for HDFS checkpointing with Kerberos enabled.

    Users can add the kerberos: true configuration to the checkpoint_storage section when type is "hdfs" to enable Kerberos mode. When using this feature, users may also need to configure the security/kerberos/config_file to point to a valid Kerberos configuration file location for each agent.

  • New Feature: Add support for preconfiguring the trial runner environment with a bash script.

    When PEDL detects a file named `` at the top level of a model definition directory, it will execute this script during startup of the trial runner container. Note that this script is executed before the trial runner executes any model definition code with the Python interpreter.

  • WebUI: Fix bug that prevented the trial detail modal from appearing when a metric name had certain special characters (e.g. /).

  • Add Python 3.7 compatibility to the PEDL CLI.

Version 0.7.8

Release Date: November 5, 2018

  • WebUI: Add support for filtering experiments with canceled or errored states.

  • New Feature: Ensure that the Keras TensorBoard callback serializes validation metrics when using Simple Keras Model Definitions.

    When used in previous versions of PEDL, the Keras TensorBoard callback would only serialize training metrics.

  • New Feature: Add support for validation callbacks when using Simple Keras Model Definitions.

    Previous versions of PEDL would only execute callbacks during training steps. See documentation on the KerasValidationCallback class for more details.

  • Add utility functions for referencing the current PEDL context.

    pedl.get_experiment_config(), pedl.get_trial_id(), and pedl.get_experiment_id() have been added as utility functions to be used anywhere in model code.

  • Experimental: Initial support for HDFS checkpointing.

    HDFS support is undocumented in this release of PEDL—please consult with the Determined AI team before using.

  • Update the master, agent, and trial runners to use Python 3.6.7.

  • WebUI: Simplify the “Create New Experiment” and “Continue Training Workflow” modals.

    Previous versions of PEDL displayed richly formatted fields for each experiment configuration option, but only supported a subset of the available top-level options. This release of PEDL moves to a single large text area in which the raw experiment configuration YAML can be edited directly.

Version 0.7.7

Release Date: October 11, 2018

  • New Feature: Add support for custom base Docker images.

    This release of PEDL introduces support for specifying a custom Docker base_image in the experiment configuration. The base_image should be accessible to all agent nodes via docker pull. If a private image is used, Docker Registry credentials must be specified in the registry_auth section in the experiment configuration. The maintainer of the custom base image is responsible for installing PEDL dependencies—see the Custom Docker Base Images section in documentation for a full list of dependency requirements.

  • New Feature: Add --download-to flag to pedl list-checkpoints.

    This flag allows users to download the listed checkpoints for any experiment configured with S3 checkpoint storage. This flag can be used in tandem with the --best flag to download the top N checkpoints for an experiment.

  • WebUI: Display the best validation metric in addition to the latest validation metric for all trials.

Version 0.7.6

Release Date: October 2, 2018

  • New Feature: Add support for optionally associating Git metadata with an experiment.

    pedl create --git will look for a Git repository in the model definition directory to save metadata associated with the current Git commit and the remote URL of the current upstream branch. If an experiment is created with the --git flag, the Web UI will display the Git commit, committer, commit date, and link to the upstream remote URL. This feature assumes that any commits in the local repository also exist in the upstream remote repository.

Version 0.7.5

Release Date: September 27, 2018

  • New Feature: Add support for automatically taking checkpoints when the validation performance of an experiment improves.

    This release of PEDL introduces a new experiment config option, checkpoint_policy. Using the default policy (best), PEDL will checkpoint any trial whenever its validation performance exceeds the previous best validation performance for this experiment. The all checkpoint policy causes PEDL to take a checkpoint after every validation operation; the none policy results in no additional checkpoints being taken. Note that checkpoints might still be taken for other reasons: for example, if the min_checkpoint_period option is enabled, or if a trial is moved from one slot to another by the scheduler.

  • Change scheduler to favor spreading tasks around the cluster.

    In previous versions of PEDL, the scheduler attempted to pack tasks on a subset of the cluster. This policy has some advantages: for example, it can result in leaving entire agent machines idle, which then allows those machines to be deactivated or used for a future multi-GPU job. However, this packing behavior can also be problematic: placing additional jobs on the same machine can result in contention for other resources on that host (e.g., CPU or I/O). This release of PEDL changes the scheduler to spread tasks around the cluster when possible; two tasks will only be placed on the same machine if there are no agents that are completely idle.

  • Add support for --best flag to pedl list-checkpoints.

    If the --best N flag is specified, pedl list-checkpoints will return the “best” N checkpoints, according to the experiment’s configured validation metric. Checkpoints that do not have an associated validation operation will be omitted.

  • Improve compatibility for Keras callbacks when using the simple model API.
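
The three checkpoint_policy values described above (best, all, none) can be summarized as a small decision function. This is a sketch of the documented behavior, not PEDL's implementation: the function name should_checkpoint is hypothetical, and two details are assumptions, namely that the first validation of an experiment counts as a new best and that smaller_is_better determines what "exceeds" means.

```python
from typing import Optional


def should_checkpoint(
    policy: str,
    metric: float,
    best_so_far: Optional[float],
    smaller_is_better: bool = True,
) -> bool:
    """Decide whether a validation result triggers a checkpoint.

    best_so_far is the experiment's previous best validation metric,
    or None if no validation has completed yet.
    """
    if policy == "all":
        return True  # checkpoint after every validation operation
    if policy == "none":
        return False  # no additional checkpoints from this policy
    if policy == "best":
        if best_so_far is None:
            return True  # assumption: the first result is a new best
        if smaller_is_better:
            return metric < best_so_far
        return metric > best_so_far
    raise ValueError(f"unknown checkpoint_policy: {policy!r}")


# A validation loss of 0.21 improves on the previous best of 0.25.
print(should_checkpoint("best", metric=0.21, best_so_far=0.25))
```

As the notes point out, a False result here does not mean a checkpoint is never taken: min_checkpoint_period or a scheduler migration can still trigger one.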

Version 0.7.4

Release Date: September 18, 2018

  • WebUI: Fix bug in the Continue Training workflow when using Keras simple model definitions.

  • WebUI: Fix bug in the Continue Training workflow when using nested hyperparameters.

Version 0.7.3

Release Date: September 17, 2018

  • Fix bug in Keras simple model definitions when no user-defined metrics are passed to model.compile().

Version 0.7.2

Release Date: September 13, 2018

  • New Feature: Support for population-based training (PBT).

    Refer to the Hyperparameter Search: Population-based training to see how to use PBT with PEDL.

  • Breaking Change: Validation functions for Keras models should now operate on tensors, rather than NumPy arrays.

    For trials using the KerasTrial and KerasFunctionalTrial classes, validation functions should now have TensorFlow tensors for their arguments and return types, as with the current version of TensorFlowTrial. The new API is not backward-compatible with the old API: any PEDL models that use either Keras trial class will need to be updated. The cifar10_cnn_keras and mnist_keras_functional examples demonstrate how to use the new API.

Version 0.7.1

Release Date: September 6, 2018

  • New Feature: Support for filtering experiments by multiple labels.

    On the experiment list page, it is now possible to enter multiple experiment labels at the same time; only experiments that have all of the labels will be shown. Type in a label and press ‘enter’ to add it to the list of labels to filter by; when the text input is empty, press the left and right arrow keys to select an existing label and ‘backspace’ to remove it from the list.

  • WebUI: Add API reference documentation.

    This documentation is available via the “API Reference” link at the top of any page in PEDL.

  • WebUI: Fix ability to specify source trial ID in create experiment modal.

  • WebUI: Fix links to examples in the main documentation.

Version 0.7.0

Release Date: August 23, 2018

  • New Feature: Persist experiment state across master crashes.

    In previous releases of PEDL, a crash in the master would cause all running or paused experiments to enter an error state; now, trials can resume from their last checkpoints after a crash.

  • New Feature: Support for disabling and enabling agents to allow seamless cluster upgrades.

    This release adds the pedl disable-agent and pedl enable-agent CLI commands, which disable and enable scheduling of tasks on agents. Disabling all agents and waiting for existing jobs to finish allows the cluster to be restarted without losing any work.

  • New Feature: Support for previewing hyperparameter searches.

    This release adds the pedl preview-search CLI command, which simulates a run of the given searcher configuration and prints a summary of the training steps that it schedules.

  • New Feature: Support for .pedlignore files.

    If there is a file called .pedlignore in the top level of a model definition directory passed to pedl create, it is now treated as a list of patterns (in the same style as .gitignore) to exclude from the upload to the master.

  • Revamp documentation.

    We have changed how we generate our documentation to improve styling, navigation, and search.

  • Show experiment and trial IDs in the trial detail modal.
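
The .pedlignore feature above treats the file as a list of .gitignore-style patterns that exclude files from the upload. The following is a deliberately simplified sketch of that kind of filtering using Python's fnmatch module. It is not PEDL's matcher and does not implement full .gitignore semantics (no negation with "!", no anchoring, no "**"). The function names load_patterns and is_ignored are hypothetical.

```python
import fnmatch
from pathlib import PurePosixPath
from typing import List


def load_patterns(text: str) -> List[str]:
    """Parse ignore-file text: skip blank lines and '#' comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(rel_path: str, patterns: List[str]) -> bool:
    """Return True if rel_path (POSIX-style, relative) matches any pattern.

    A pattern matches if it matches the whole path or any single path
    component, so "data/" excludes everything under a "data" directory.
    """
    parts = PurePosixPath(rel_path).parts
    for pattern in patterns:
        pattern = pattern.rstrip("/")
        if fnmatch.fnmatchcase(rel_path, pattern):
            return True
        if any(fnmatch.fnmatchcase(part, pattern) for part in parts):
            return True
    return False


patterns = load_patterns("# build artifacts\n*.pyc\n\ndata/\n")
files = ["model_def.py", "model/cache.pyc", "data/train.csv"]
to_upload = [f for f in files if not is_ignored(f, patterns)]
print(to_upload)
```

Excluding large datasets and build artifacts this way keeps the model definition upload to the master small.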

Version 0.6

Version 0.6.7

Release Date: August 9, 2018

  • New Feature, Breaking Change: New API for writing TensorFlow models.

    This version of PEDL introduces a rewrite of TensorFlowTrial, the base class for PEDL models that use TensorFlow. The new TensorFlowTrial supports models with multiple inputs and outputs, supports validation functions on tensors (improving performance), and fixes other limitations of the previous TensorFlowTrial API. The new API is not backward compatible with the old API: any PEDL models that use TensorFlow will need to be updated. The mnist_tf example distributed with PEDL has been updated to use the new API.

  • New Feature: Support for experiment labels.

    A label is an arbitrary string that can be associated with an experiment; each experiment can have a set of labels. Labels can be used to organize experiments and identify groups of experiments that have similar properties. Labels can be added and removed via the CLI (pedl label) or the Web UI.

  • Improve compatibility with recent versions of Kubernetes.

  • Cleanup and refactoring of PEDL fault tolerance logic.

Version 0.6.6

Release Date: August 6, 2018

  • Fix incompatibility in the aiodocker library to support Docker >= 18.06.0-ce.

Version 0.6.5

Release Date: August 2, 2018

  • Experimental: Support for “simple” model definitions.

    In previous releases of PEDL, model definitions were required to implement a custom Trial API. This API is how PEDL implements support for hyperparameter searches, automatic checkpointing, workload migration between agents, and metadata capture. However, this approach requires modifying model code to implement this API, which can be inconvenient when running “off-the-shelf” models.

    This release of PEDL introduces experimental support for “simple” model definitions. This feature allows PEDL to run unmodified model code: features like automatic checkpointing are supported by intercepting calls to certain framework APIs. This feature is currently only supported for models written with Keras that use the fit_generator API. To access hyperparameters, a new optional API has been introduced, pedl.get_hyperparameter(). For more information, see the documentation and the mnist_keras_simple example.

  • New Feature: Improved trial fault tolerance.

    PEDL’s support for handling trial failures has been substantially refactored. The main user-visible change is that when a trial fails, only that trial will need to be restarted; other trials in the same experiment will continue running without interruption. This change also fixes several corner-case bugs and lays the groundwork for supporting master fault tolerance in a future release of PEDL.

    This release also changes the semantics of the max_restarts configuration parameter: previously, this parameter defined the number of times that an experiment would be restarted after a failure of any one of the experiment’s trials. It now defines the maximum number of times that any one trial can fail before the entire experiment is aborted (i.e., it is now a per-trial counter, not a per-experiment counter).
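
    The difference between the two counters can be sketched as follows. This is an illustrative model, not PEDL's implementation: the function name is hypothetical, and it assumes the experiment is aborted once the relevant failure count exceeds max_restarts.

```python
def experiment_aborted(failures_per_trial, max_restarts, per_trial=True):
    """Decide whether an experiment is aborted, given each trial's failure count.

    per_trial=True  -> new semantics: abort once any single trial has failed
                       more than max_restarts times.
    per_trial=False -> earlier semantics: abort once the failures of all
                       trials combined exceed max_restarts.
    """
    counts = failures_per_trial.values()
    worst = max(counts, default=0) if per_trial else sum(counts)
    return worst > max_restarts

failures = {"trial-1": 3, "trial-2": 3}  # hypothetical failure counts
print(experiment_aborted(failures, max_restarts=5, per_trial=False))  # old: 6 > 5
print(experiment_aborted(failures, max_restarts=5, per_trial=True))   # new: 3 <= 5
```

    Under the per-experiment counter, the six combined failures abort the experiment; under the per-trial counter, neither trial has exceeded the limit, so the experiment keeps running.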

  • New Feature: Add default checkpoint GC policy.

    In previous releases of PEDL, checkpoint GC was not performed by default. In this release, all experiments will have a checkpoint GC policy by default (save_experiment_best: 0, save_trial_best: 1, save_trial_latest: 1).
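
    To make the default policy concrete, here is a small sketch of which checkpoints it would retain. It rests on assumptions not stated in these notes: that each field keeps the N best (or latest) checkpoints, that a smaller validation metric is better, and that checkpoints can be represented as (trial_id, step, metric) tuples. The function name is hypothetical.

```python
from collections import defaultdict

def retained_checkpoints(checkpoints, save_experiment_best=0,
                         save_trial_best=1, save_trial_latest=1):
    """Return the set of (trial_id, step) checkpoints kept by the policy.

    checkpoints: list of (trial_id, step, validation_metric) tuples.
    Assumes a smaller validation metric is better.
    """
    keep = set()
    # Best checkpoints across the whole experiment.
    by_metric = sorted(checkpoints, key=lambda c: c[2])
    keep.update((t, s) for t, s, _ in by_metric[:save_experiment_best])

    # Best and latest checkpoints within each trial.
    per_trial = defaultdict(list)
    for ckpt in checkpoints:
        per_trial[ckpt[0]].append(ckpt)
    for ckpts in per_trial.values():
        best = sorted(ckpts, key=lambda c: c[2])[:save_trial_best]
        latest = sorted(ckpts, key=lambda c: c[1], reverse=True)[:save_trial_latest]
        keep.update((t, s) for t, s, _ in best + latest)
    return keep

ckpts = [(1, 100, 0.9), (1, 200, 0.4), (1, 300, 0.6), (2, 100, 0.7), (2, 200, 0.8)]
print(sorted(retained_checkpoints(ckpts)))
```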

  • Update versions of several dependencies.

    The trial runners now include Keras 2.2.2, PyTorch 0.4.1, and NumPy 1.15.0.

Version 0.6.4

Release Date: July 26, 2018

  • Experimental: Support for PyTorch models.

    PyTorch models are written by subclassing the abstract class PyTorchTrial and specifying a base_image of determinedai/pedl-tr-py3.6-pytorch in the experiment config file. See examples/mnist_pytorch for a complete example.

    PyTorch models in PEDL currently do not support multi-GPU training.

  • New Feature: Support for abruptly killing trials.

    A new CLI sub-command, pedl kill-trial, has been added. This immediately terminates the container associated with the specified trial ID. Note that once the trial’s current container has been terminated, the trial will typically be restarted in a different container (due to PEDL’s support for automatic experiment fault tolerance).

  • In pedl describe --metrics, display all validation metrics of an experiment.

    Previously, only the metric used by the experiment’s search method was displayed.

Version 0.6.3

Release Date: July 19, 2018

  • Upgrade to TensorFlow 1.9.0 in the default trial runner.

    Note that as a result of this change, the version of tf.keras has been upgraded from 2.1.2 to 2.1.6.

  • Make base_image optional in experiment configurations.

    If not specified, base_image defaults to determinedai/pedl-tr-py3.6-tf, which is currently the only legal value for base_image.

  • Improve Python 3.5 compatibility.

Version 0.6.2

Release Date: July 13, 2018

  • Upgrade to Keras 2.2.0 in the default trial runner.

  • Fix bug in fault tolerance logic when an experiment that is being canceled encounters an error.

  • More aggressively schedule new work when an experiment’s max_slots limit is changed.

Version 0.6.1

Release Date: July 12, 2018

  • Breaking Change: When using bind mounts or shared_fs checkpoints, the specified host_path must already exist. In previous versions of PEDL, bind mounts could use host_paths that did not previously exist on the host file system.

  • Breaking Change: The mechanism for specifying read-only bind mounts has changed. In previous versions of PEDL, the mode parameter was used; in this release of PEDL, a new parameter read_only should be used instead.

  • WebUI: Support for plotting multiple training metrics in trial “detail” view.

    In previous releases of PEDL, the trial detail view only supported displaying validation metrics and training loss. As with the plot of training loss, the plot of other training metrics displays the mean value of the training metric for each step.

  • cli: Support for “test mode” when creating experiments.

    When an experiment is created using pedl create --test_mode, PEDL will run only a single trial of the experiment, and this trial will only be trained for a single step. Then validation metrics will be computed, and a checkpoint of the trial will be taken. Finally, the experiment will be archived, and the experiment’s checkpoint will be garbage collected. This feature is intended to support rapid iteration during the initial phase of developing a new model.

  • cli: Support for changing an experiment’s max_slots limit on-the-fly.

  • cli: Report multiple training metrics in pedl describe.

    In previous releases of PEDL, pedl describe --metrics only reported training loss. As with training loss, the CLI will report the mean value of the training metric for each step.

  • Support saving checkpoints to arbitrary subdirectories of a shared file system.

    When saving checkpoints to a shared_fs, the new configuration parameter checkpoint_path can be used to control the subdirectory on the shared file system where checkpoints will be placed.

  • Support for configuring bind propagation for bind mounts via a new propagation configuration parameter.

Version 0.6.0

Release Date: July 3, 2018

  • New Feature: Support for recovering from trial and agent failures.

    In previous versions of PEDL, the failure of any trial within an experiment would cause the entire experiment to fail. Similarly, if an agent crashed, all experiments that were running any trials on that agent would be marked as failed.

    PEDL now supports recovering from trial and agent failures by automatically re-running failed workloads. This improves PEDL’s tolerance of transient faults, such as network failures or out-of-memory errors. If an error occurs while running a trial, PEDL will restart the execution of that trial from its most recent checkpoint (if any). Note that since deep learning workloads are not deterministic in general (see Reproducibility for more details), any metadata that was recorded after the last checkpoint will be deleted (and subsequently recomputed). The maximum number of times an experiment will be restarted to recover from failures is controlled by max_restarts, which defaults to 5. This parameter ensures that PEDL does not go into an infinite loop if an experiment encounters the same error repeatedly.
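
    The rollback behavior described above can be pictured with a minimal sketch. The function and its data model (plain step numbers) are hypothetical; it only illustrates that work recorded after the most recent checkpoint is discarded and recomputed.

```python
def resume_point(recorded_steps, checkpoint_steps):
    """Return (step to resume from, steps whose metadata is discarded).

    recorded_steps: step numbers with metadata recorded before the failure.
    checkpoint_steps: step numbers at which a checkpoint was taken.
    With no checkpoint, the trial starts over from step 0.
    """
    last_checkpoint = max(checkpoint_steps, default=0)
    discarded = [s for s in recorded_steps if s > last_checkpoint]
    return last_checkpoint, discarded

print(resume_point([1, 2, 3, 4, 5], [2, 4]))
print(resume_point([1, 2, 3], []))
```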

  • cli: Add pedl list-tasks subcommand.

    This is useful for examining the state of the task scheduler.

  • cli: Fix error in pedl describe --metrics.

  • WebUI: Round validation metrics to at most five decimal digits in the list of trials for an experiment.

  • WebUI: Improve scaling of the X-Axis in the experiment-level validation metric plot to match the length of the experiment.

  • WebUI: Improve responsiveness when activating, pausing, archiving, and unarchiving experiments.

  • Improve error handling when a trial returns an invalid validation metric value (e.g., math.nan).

  • Improve reporting of training metrics in KerasFunctionalTrial.

    Previously, only the weighted sum loss of a multi-loss model was reported as a training metric “loss”. Starting in this version of PEDL, the values of each loss function in addition to the weighted sum loss will be reported as training metrics. Training metric history is accessible via pedl describe --metrics. However, the Web UI trial detail view graph will continue to only display a single training metric “loss” (the weighted sum loss).
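
    As an illustration of the reported values, the sketch below builds a metrics dictionary containing each individual loss plus their weighted sum under the name “loss”. The function name and the example loss names are hypothetical; only the idea of reporting both is taken from the text above.

```python
def training_metrics(losses, loss_weights):
    """Report every individual loss plus their weighted sum as "loss"."""
    metrics = dict(losses)
    metrics["loss"] = sum(loss_weights[name] * value for name, value in losses.items())
    return metrics

# Hypothetical two-output model with one loss function per output.
losses = {"digit_loss": 0.5, "parity_loss": 0.25}
weights = {"digit_loss": 1.0, "parity_loss": 2.0}
print(training_metrics(losses, weights))
```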

Version 0.5

Version 0.5.12

Release Date: June 21, 2018

  • New Feature, Breaking Change: Support for multiple inputs and multiple loss functions in KerasFunctionalTrial.

    In previous versions of PEDL, models using KerasFunctionalTrial could specify multiple outputs, but only a single input and a single loss function. This limitation has been lifted.

    As a result, the API for validation metric functions in KerasFunctionalTrial has changed: previously, the second argument to a metric function was an np.ndarray containing the true labels for the validation set. In this release of PEDL, the second argument is now a dictionary that maps layer names to np.ndarray values.

  • New Feature: Support for archiving experiments.

    PEDL now supports archival of completed experiments. When an experiment is archived, all experiment metadata is preserved, but the experiment is hidden by default from the WebUI and from the list of experiments returned by pedl list. Both the PEDL CLI and WebUI now include options to display archived experiments.

  • New Feature: Support for multiple Python packages when creating experiments.

    When creating an experiment using pedl create, users can now specify one or more additional Python packages via the --package flag. Packages should be provided as source distributions (e.g., a ZIP or TAR archive created by python setup.py sdist in a Python project that uses setuptools).

    This feature allows models to use dependencies on the user’s local file system; network-accessible dependencies can also be downloaded using the runtime_packages feature supported by previous versions of PEDL.

  • New Feature: Support for plotting per-trial validation metrics in Web UI.

    In previous versions of PEDL, the “Details” modal contained a plot of per-step average training loss; this dialog has been improved to also support plotting any of the experiment’s validation metrics.

  • Fix rounding error in fair-share scheduling logic.

Version 0.5.11

Release Date: June 14, 2018

  • Add support for warm-starting from arbitrary checkpoints.

    Previously, it was only possible to warm start from the latest checkpoint associated with a particular source trial ID. In this release of PEDL, experiments can also be warm started from a specific checkpoint using the source_checkpoint_uuid configuration parameter.

  • cli: Add validation metric to pedl list-checkpoints.

    This is useful because checkpoints with an associated validation metric are treated differently by the garbage collector.

  • cli: Add pedl config subcommand.

    This displays the configuration of an experiment in YAML format.

  • WebUI: After creating a new experiment, navigate to it.

  • Experiments now default to the normal priority level.

    There was previously no default value for this configuration parameter.

  • Improve scalability of pedl system-dump.

  • Improve validation of priorities in experiment configuration.

  • Improve logging when processing changes to experiment priority, GC policy, and description.

Version 0.5.10

Release Date: June 8, 2018

  • Fix bug when garbage-collecting shared-fs checkpoints.

    Because of this bug, garbage-collecting a shared-fs checkpoint marked the checkpoint as DELETED in the PEDL database, but the actual checkpoint storage was not removed correctly.

Version 0.5.9

Release Date: June 7, 2018

  • New Feature: Add a CLI command to update the checkpoint GC policy of an experiment.

    pedl set-gc-policy can be used to update the checkpoint GC policy of running or finished experiments. For example, this can be used to reduce the storage consumed by historical experiments.

  • New Feature: Add support for changing the description of an existing experiment.

    Experiment descriptions can be changed via a new CLI command, pedl set-description.

  • Improve performance of pedl list-trials CLI command.

  • Improve error handling of experiment decoding errors in the Web UI.

    In previous versions of PEDL, the WebUI failed to display if any of the experiments in the database contained an invalid configuration. In PEDL >= 0.5.9, this error handling is improved—experiments with invalid configurations will be omitted (with a user-facing error prompt) from the Web UI instead of causing a fatal error.

  • Fix rare race condition on experiment shutdown.

  • Fix bug when using multi-GPU models instantiated with the tensorflow.python.keras library.

Version 0.5.8

Release Date: June 5, 2018

NOTE: Temporarily disable experiment priorities. This release of PEDL ignores the priority field in experiment configurations. This is a temporary change; support for experiment priorities will be restored shortly.

  • Reject experiment configurations with non-default base_image.

    PEDL does not currently support experiments that use custom Docker base images.

  • Improve scalability of pedl system-dump.

  • Improve logging for training and validation data loaders.

Version 0.5.7

Release Date: May 31, 2018

  • Properly handle containers that crash without an active workload.

  • Fix bug in master experiment shutdown logic.

  • Fix race condition in agent during container exit.

  • Improve performance when retrieving trial logs.

  • WebUI: Add metric value to tooltip in experiment validation plot.

  • Fix bug in using an adaptive search with a min_validation_period specified.

  • Fix bug when garbage collecting experiments with failed validations.

  • Avoid low-probability agent crash due to container name collision.

  • Upgrade websockets library to version 5.0.1 in trial runners.

Version 0.5.6

Release Date: May 29, 2018

  • Fix error in scheduler that occurs if the total number of cluster slots changes.

  • Upgrade websockets library to version 5.0.1.

Version 0.5.5

Release Date: May 24, 2018

  • New Feature: Add support for garbage collecting checkpoints.

    When an experiment finishes, the system may optionally delete some checkpoints to reclaim space. See save_trial_latest, save_trial_best, and save_experiment_best under checkpoint_storage in the experiment configuration documentation for details on how to configure an experiment with this feature enabled. By default, all checkpoints are saved.

  • Add support for absolute imports in multiple file model definitions.

  • Improve performance of scheduler when scheduling many large experiments with large numbers of trials.

Version 0.5.4

Release Date: May 23, 2018

  • New Feature: Add support for a system dump command in CLI.

    The new pedl system-dump command generates a large zip file with cluster logs, data, and statistics.

  • Significantly improve performance of scheduler when scheduling experiments with large numbers of trials.

  • Fix bug in using subdirectories with multi-file model definitions.

  • Fix bug in pedl describe with terminal experiments.

  • Upgrade the websockets Python library to version 5.0.0.

Version 0.5.3

Release Date: May 17, 2018

  • New Feature: Rewrite PEDL task scheduler.

    This release includes a new scheduler implementation. The major user-visible change is that experiments within a priority tier will now be fair-shared: e.g., if there are two high-priority experiments, each experiment will have the opportunity to consume half the cluster’s resources. (The previous scheduler allocated resources to experiments in the same priority tier according to FIFO order). Note that scheduler-related configuration options (e.g., max_slots, priority) have not changed.
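
    The fair-share idea can be sketched with a simple round-robin allocation. This is not PEDL's scheduler: the function name is hypothetical, and it assumes each experiment's demand is capped by a single number such as its max_slots limit.

```python
def fair_share(total_slots, demands):
    """Split total_slots among experiments in one priority tier.

    demands maps experiment name -> maximum slots it can use (e.g. its
    max_slots limit). Slots are handed out one at a time in round-robin
    order, so no experiment gets more than one slot ahead of another
    that still wants more.
    """
    allocation = {name: 0 for name in demands}
    remaining = total_slots
    while remaining > 0:
        hungry = [n for n in demands if allocation[n] < demands[n]]
        if not hungry:
            break  # every experiment is satisfied; leave the rest idle
        for name in hungry:
            if remaining == 0:
                break
            allocation[name] += 1
            remaining -= 1
    return allocation

print(fair_share(8, {"exp-a": 8, "exp-b": 8}))
print(fair_share(8, {"exp-a": 2, "exp-b": 8}))
```

    With two experiments that can each use the whole cluster, each receives half. When one experiment needs fewer slots than its share, the unused slots go to the other.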

  • Add support for using non-AWS S3 implementations to store model checkpoints.

    A new experiment configuration option, endpoint_url, has been added to allow specifying this.

  • The default trial runner base image has been upgraded to include Keras 2.1.6, NumPy 1.14.3, and SciPy 1.1.0.

  • Fix bug in handling HTTP requests with no Content-Type.

Version 0.5.2

Release Date: May 15, 2018

  • Trial runners can now be started under a non-root user ID.

    In local deployments, use TRIAL_RUNNER_UID and TRIAL_RUNNER_GID under dist/etc/agent.conf to configure a non-root user ID and optional group ID. In Kubernetes deployments, use the trialRunner.uid and trialRunner.gid configuration parameters. If unspecified, the uid defaults to 0 (the root user) and the gid defaults to the root group.

  • WebUI: Incorporate PEDL documentation into the web server.

    Users can now access the full PEDL documentation on the web UI via the “Documentation” button in the navigation bar.

  • WebUI: Improve formatting of Y-Axis labels in trial “Details” visualization.

    Graphs with very large or very small training loss values will now use scientific notation for labels on the Y-Axis.

  • Fix bug in handling image build errors in the agent.

Version 0.5.1

Release Date: May 10, 2018

  • New Feature: Support viewing per-trial training loss history in WebUI.

    This release adds a “Details” button that displays a plot of the training loss for a given trial. The plot shows the mean per-step training loss.

  • WebUI: Don’t refresh experiment details for terminal experiments.

  • WebUI: Fix broken “Continue Training” button (0.4.9 regression).

  • Fix rare error when launching trials due to container name conflict (0.4.9 regression).

  • Improve error handling for experiments with misconfigured searcher metric.

    The metric field in the searcher section of the experiment config must correspond to the name of a validation metric produced by the model. When this is not the case, PEDL now detects this situation and reports an error.

  • cli: Improve error reporting for pedl logs.

Version 0.5.0

Release Date: May 5, 2018

  • Avoid WebUI error when displaying experiments with misconfigured searcher metric name.

Version 0.4

Version 0.4.9

Release Date: May 4, 2018

  • New Feature: Support for incremental computation of validation metrics.

    Previously, the API for computing validation metrics required the entire validation set to be loaded into memory. For experiments with large validation sets, this might be very expensive.

    This release of PEDL introduces a new API that splits the computation of a validation metric into a “batch validation function”, which computes an intermediate result for a single batch, and a “reducer”, which combines all the intermediate results into a final metric value. Not all validation metrics can be expressed in this way, but for those that can, using this new (optional) API can result in reduced memory consumption.
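
    The batch-function-plus-reducer pattern can be illustrated with accuracy as the metric. The function names and signatures below are hypothetical and do not reproduce the PEDL API; the sketch only shows how a metric decomposes into per-batch intermediate results and a final reduction.

```python
def batch_validation(predictions, labels):
    """Intermediate result for one batch: (number correct, batch size)."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct, len(labels)

def reducer(batch_results):
    """Combine per-batch (correct, total) pairs into overall accuracy."""
    correct = sum(c for c, _ in batch_results)
    total = sum(n for _, n in batch_results)
    return correct / total

# Two small batches of (predictions, labels); only one batch is needed at a time.
batches = [([1, 0, 1], [1, 1, 1]), ([0, 0], [0, 0])]
results = [batch_validation(p, y) for p, y in batches]
print(reducer(results))
```

    A metric such as the median is an example that does not decompose this way, since it cannot be recovered from small per-batch summaries.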

  • New Feature: Support for warm-starting experiments that use random and adaptive search methods.

    Previously, warm-starting was only supported for single experiments. When warm starting random and adaptive experiments, the source_trial_id is used to set the initial weights for all of the trials in the experiment.

  • Introduce a more concise format for specifying constant hyperparameters.

    Example of the new format:

    batch_size: 32

    This is equivalent to the old syntax, which is still supported:

    batch_size:
      type: const
      val: 32

  • Upgrade to YAML 1.2 format for experiment configurations.

    Notably, this allows scientific notation (e.g., 1e-4) to be used when specifying hyperparameters.

  • Fix pedl logs -f to handle Ctrl+C (KeyboardInterrupt) more cleanly.

  • Fix authentication bug when fetching trial runner images from a remote Docker registry.

  • Upgrade to Python 3.6.5 in the agent and trial-runner containers.

    The master container already used Python 3.6.5.

Version 0.4.8

Release Date: April 30, 2018

  • Fix missing dependencies in the CLI.

Version 0.4.7

Release Date: April 26, 2018

  • New Feature: Support for periodic validation computation.

    In previous versions of PEDL, validation metrics were only calculated after the final step of a trial, or after the final step of each rung when using the adaptive search method. This release of PEDL adds support for periodically computing validations in addition to those mentioned previously. A new configuration parameter, min_validation_period, specifies the maximum number of training steps that will be run since the last validation computation before a new validation computation will be initiated.

    Users should note that enabling periodic validation could slow experiment progress, depending on the cost of a validation computation. Due to this, periodic validations are not enabled for an experiment by default.
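
    The effect of min_validation_period can be sketched with a small simulation. It is a simplified model under stated assumptions: the function name is hypothetical, steps are numbered from 1, and the additional validations performed at the end of each rung in adaptive search are ignored.

```python
def validation_steps(total_steps, min_validation_period=None):
    """Return the 1-based step numbers after which a validation is computed.

    A validation always follows the final step. If min_validation_period is
    set, a validation is also triggered whenever that many training steps
    have run since the previous validation.
    """
    steps = []
    since_last = 0
    for step in range(1, total_steps + 1):
        since_last += 1
        periodic = (min_validation_period is not None
                    and since_last >= min_validation_period)
        if periodic or step == total_steps:
            steps.append(step)
            since_last = 0
    return steps

print(validation_steps(10))
print(validation_steps(10, min_validation_period=4))
```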

  • Fix bug with experiments that use the tensorflow.python.keras package.

    In previous versions of PEDL, experiments using the tensorflow.python.keras package would crash when attempting to save a checkpoint.

  • Fix an off-by-one error that slightly limited the integer range of trial seeds.

    Trial seeds are now randomly selected from the [0, 2^31) integer range, whereas in previous PEDL versions they were randomly selected from the [0, 2^31 - 1) integer range.
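
    A minimal sketch of drawing a seed from the corrected half-open range, using Python's standard random module. How PEDL actually generates seeds is not described here, so the function and constant names are hypothetical.

```python
import random

SEED_UPPER_BOUND = 2 ** 31  # exclusive upper bound of the seed range

def new_trial_seed(rng=random):
    """Pick a seed uniformly from the half-open range [0, 2**31)."""
    return rng.randrange(SEED_UPPER_BOUND)

seed = new_trial_seed()
print(0 <= seed < SEED_UPPER_BOUND)
```

    With an exclusive bound of 2**31 - 1 instead, the largest 31-bit value, 2147483647, could never be selected, which is the off-by-one described above.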

  • Improvements to trial logging.

    The agent ID and initial workload are now logged on trial runner startup.

  • Add --tail support to pedl logs.

    This flag specifies the number of lines of log output to show, counting from the end of the log (analogous to tail -n).
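
    The behavior is the usual “last N lines” selection, which can be sketched in Python with a bounded deque. This is illustrative only and says nothing about how pedl logs is implemented; the function name is hypothetical.

```python
from collections import deque

def tail(lines, n):
    """Return the last n items of an iterable, like `tail -n`."""
    # A deque with maxlen=n discards older items as newer ones arrive.
    return list(deque(lines, maxlen=n))

log = ["step 1", "step 2", "step 3", "step 4"]
print(tail(log, 2))
```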

  • Improve logging of WebSocket errors.

  • Improve error logging for CLI commands enable-slot and disable-slot.

Version 0.4.6

Release Date: April 19, 2018

  • Breaking Change: Switch to a new, backwards-incompatible checkpoint format for Keras trials.

    Previous versions of PEDL used the default Keras serialization format. Unfortunately, this format is problematic for models that use the Keras multi_gpu_model() API.

    This release of PEDL switches to a new custom checkpoint format for Keras models. This change works around the shortcomings of the default Keras format and allows multi-GPU models to be restored from checkpoints, but the new checkpoint format is backwards-incompatible: PEDL >= 0.4.6 cannot use Keras model checkpoints (e.g., for experiment warm starts) created by PEDL < 0.4.6.

    One consequence of this change is that Keras model definitions that use custom objects no longer need to implement the custom_objects API method. As a result, this method has been removed from KerasTrial and KerasFunctionalTrial.

  • Support changing the priority of experiments on-the-fly.

    This is done using a new CLI sub-command, pedl set-priority.

  • Add container launch errors to the per-trial log.

    In previous versions of PEDL, if an error occurred when launching a container for a trial, that error was only visible in the PEDL agent log. Container launch errors are now also visible in the per-trial log (e.g., pedl logs).

  • Enforce a maximum size on model definitions.

    PEDL now rejects model definitions that are greater than 96MB in total size.

  • Display experiment progress as part of the experiment list in CLI and Web UI.

  • Fix PEDL agent crash with large model definitions.

  • Fix bug that caused the pedl-agent-stop script to hang for a long time.

  • Tweak display of experiment states in Web UI.

    Completed, active, and failed experiments are now shown in different colors.

Version 0.4.5

Release Date: April 12, 2018

  • New Feature: Support for model definitions consisting of multiple files.

    In previous versions of PEDL, experiments could only use a single model definition file. This restriction has been lifted; an experiment can now consist of a directory of files. When creating multi-file experiments, users should ensure the top-level directory is a well-formed Python package (e.g., it should contain a file). Multi-file experiments can be created via both the CLI (pedl create <experiment-config> <dir>) and the Web UI.

  • New Feature: Support for periodic trial checkpoints.

    In previous versions of PEDL, trials were only checkpointed when the trial was moved to another agent or when the experiment finished. This release of PEDL adds support for periodically checkpointing each trial of an experiment. A new configuration parameter, min_checkpoint_period, specifies the maximum number of training steps that will be run since the last checkpoint before a new checkpoint of the trial will be taken. Periodic checkpoints are not enabled for an experiment by default.

  • New Feature: Initial support for reproducible experiments.

    PEDL includes limited support for improving the reproducibility of deep learning experiments. See Reproducibility for more details.

  • Significantly improve Web UI performance.

    The WebUI should now place much less load on the master when viewing experiments with many steps and/or trials.

  • Allow TensorFlow trials to specify a custom session configuration tf.ConfigProto.

  • Add new CLI sub-command, download-s3-checkpoint.

    This makes it easier to download trial checkpoints that are stored in S3.

  • Improve Web UI display of trials with in-progress validation operations.

    When displaying trials with in-progress validation operations, the Web UI previously displayed a blank validation metric; it will now display the last successfully computed validation metric.

  • Tweak display of experiment states in Web UI.

    These were previously displayed in red text (even for successfully completed experiments), which was confusing. All experiment states are now displayed using the same color as normal text.

  • Fix bug in KerasFunctionalTrial when multiple training metrics specified the same output layer.

  • Fix error when warm-starting from a trial with multiple checkpoints.

  • Fix JavaScript error when activating or pausing experiments in the Web UI.

  • Raise maximum WebSocket packet length to 4MB.

Version 0.4.4

Release Date: April 6, 2018

  • Fix a Web UI crash with experiments that have misconfigured bind_mounts.

  • Fix a Web UI error that was caused by stale code for editing model definitions.

  • Update to Python 3.6.5 in the PEDL master container.

  • Print NVIDIA driver version number during PEDL agent startup.

Version 0.4.3

Release Date: April 5, 2018

  • Breaking Change: Remove support for editing model definitions via the built-in editor in the Web UI.

    Previous versions of PEDL supported editing model definitions directly in the Web UI. This feature has been removed, in anticipation of support for model definitions that consist of multiple files.

  • Breaking Change: Remove support for displaying a histogram of predicted validation labels in the Web UI.

    This feature was not broadly useful and the implementation was fragile. A future version of PEDL will introduce support for custom plots as a fully supported feature.

  • Support for disabling GPUs dynamically.

    PEDL now supports two new CLI commands, pedl disable-slot and pedl enable-slot. These commands allow GPUs on an agent to be disabled and enabled, respectively. When a slot is disabled, any workload that is currently running in the slot is allowed to finish its current step; it will then be checkpointed and migrated to a different slot.

    Note that these settings are not persisted: if an agent disconnects from PEDL and reconnects, all of its GPUs / slots will be enabled. GPUs can be disabled in a persistent way by editing GPU_LIST in agent.conf, but changing GPU_LIST requires restarting the agent.

  • Increase width of log modal in Web UI.

    This makes it easier to view trial logs in the Web UI.

  • Add an “Experiment ID” column to pedl list-slots.

    This makes it easier to identify all the slots currently used by a particular experiment.

  • Reduce the number of intermediate Docker layers created for runtime_packages.

  • Fix bugs in the “Continue Training” feature in the Web UI.

    The previous implementation neglected to correctly preserve some properties of the experiment being continued from (e.g., bind_mounts).

  • Fix crash in pedl describe when the described experiment was in the midst of computing validation metrics.

  • Experimental: Reproducibility in single and random experiments.

    PEDL now supports near-reproducible experiments when using the above search methods. There may still be some limitations around achieving perfect reproducibility to floating point precision during optimization, depending on model choice and/or underlying hardware. See Reproducibility for more details.

Version 0.4.2

Release Date: March 28, 2018

  • Experimental: Support for synchronous data-parallel training using multiple GPUs.

    PEDL now supports trials that use multiple GPUs on a single agent. This feature allows multiple GPUs to be used to train a single experiment to convergence more quickly. To enable parallel training, set the slots_per_trial field in the experiment configuration to be the number of parallel GPUs to use for each trial in the experiment. Note that enabling parallel training does not require changing your model code.

    The current implementation has a few shortcomings:

    - the user must manually configure the desired degree of parallelism
    - all trials in the experiment must use the same degree of parallelism
    - a naive communication strategy is used to share gradients between GPUs, which can result in poor performance for some models
    - multi-slot experiments are scheduled using a simplistic algorithm that can sometimes result in underutilization

    These shortcomings will be addressed in future releases of PEDL.

  • Breaking Change: The TensorFlowTrial API has changed.

    Model definitions that use TensorFlow will need to be updated: several TensorFlowTrial interface methods have been renamed and a new required interface method has been added. The examples and API docs have been updated to describe the new API.

  • Add progress reporting during training and validation of Keras trials.

    This makes it easier to observe the rate at which a Keras trial is making progress.

  • Correctly handle errors when the agent fails to launch a container.

  • Improve reporting of errors and assertion failures.

  • Fix bug in pedl list-checkpoints for in-progress experiments.

  • Rename the WAITING task state to IDLE.

    This more accurately describes what containers in this state are doing.

Version 0.4.1

Release Date: March 22, 2018

  • Add initial support for “warm starting” of experiments.

    This allows a new experiment to be created that uses the weights from a particular trial of a previous experiment. For example, this feature can be used to continue training promising trials from previous experiments for a longer period of time. Note that the new and old experiments must use the same model architecture; however, hyperparameters that don’t influence the model architecture can safely be changed.

  • Add support for checkpointing Keras trials that use custom layers and other custom objects.

    This requires KerasTrial subclasses to implement a new interface method, custom_objects().

  • Fix bug in KerasFunctionalTrial with single-output models.

  • Improve error checking for experiment configurations.

  • Improve accuracy of experiment progress indicator.

  • Fix 0.4.0 regression in WebUI: when viewing an experiment, an error occurred if the experiment changed from “active” to “completed”.

Version 0.4.0

Release Date: March 19, 2018

  • Breaking Change: The checkpoint format for TensorFlow experiments is now SavedModel.

    In previous versions of PEDL, the tf.train.Saver format was used.

  • Add support for single trial search method.

    The random searcher can be used to achieve a single trial experiment, but this new search method provides first-class support for an experiment that consists of a single trial.

  • Improve WebUI for PEDL deployments with many experiments.

  • Support filtering experiments by date range and description.

  • Fix bug with experiments that used categorical hyperparameters with numerical values.

  • Add CentOS 7 as a supported platform.

  • Upgrade to Keras 2.1.5 in the default trial runner base image.

  • Upgrade to Postgres 10.3.

Version 0.3

Version 0.3.2

Release Date: March 8, 2018

  • Breaking Change: Rename the trial_runner field in the experiment configuration file.

    The name of the base Docker container for running trials is now specified by the subfield base_image of the top-level trial_environment field. For example:

      trial_environment:
        base_image: determinedai/pedl-tr-py3.6-tf

  • Breaking Change: Remove extra Python packages from the base trial runner image.

    The image previously contained several commonly used Python libraries (e.g., joblib, pandas, zarr). These packages have been removed; the base trial runner image now contains only TensorFlow, Keras, and their dependencies. The runtime_packages feature (described below) has been added to support installing custom dependencies.

  • Add support for customizing the trial runner container.

    The experiment configuration file supports two new subfields of trial_environment: runtime_commands and runtime_packages. These specify a list of commands to be executed and a list of Python packages to be installed into the trial runner, respectively. These customizations are applied before any workloads are run in the trial container.

  • Add support for training and validation callbacks.

    Model definitions can now define callbacks that will be executed after training or validation operations. For example, this feature can be used to record training and validation metrics as TensorBoard event files, which can then be visualized using TensorBoard. A complete example of TensorBoard integration is included in the documentation.

  • In adaptive search, mark trials as “completed” more aggressively when possible.

  • Improve reliability of starting the PEDL master via systemd.

  • WebUI: In the experiment detail page, support changing the sort order of the trial tables.

    For example, this makes it easier to see which completed or active trials have the best validation metric.

  • WebUI: In the experiment list page, support changing the sort order of the experiment tables.
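
To make the trial_environment changes in this release concrete, the sketch below builds that part of an experiment configuration as a Python dictionary and prints it for inspection. The field names (trial_environment, base_image, runtime_commands, runtime_packages) come from the notes above. The command and package values are hypothetical, and the layout assumes that both runtime fields are plain lists of strings.

```python
import json

experiment_config = {
    "trial_environment": {
        # Base Docker image for the trial runner (value from the base_image note).
        "base_image": "determinedai/pedl-tr-py3.6-tf",
        # Hypothetical shell command, executed before any workload runs.
        "runtime_commands": ["echo 'preparing trial container'"],
        # Packages no longer bundled in the base image, installed at startup.
        "runtime_packages": ["joblib", "pandas"],
    }
}

# Render the structure so the nesting of the three subfields is easy to see.
rendered = json.dumps(experiment_config, indent=2)
print(rendered)
```

Listing joblib and pandas under runtime_packages is one way an experiment that relied on the previously bundled libraries might restore them.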

Version 0.3.1

Release Date: February 27, 2018

  • New Feature: The KerasFunctionalTrial interface has been added to support the Keras Functional API.

    See the documentation for usage instructions and current limitations.

  • Breaking Change: The trial API function make_training_and_validation_loaders has been renamed to make_data_loaders.

    The old name is still supported but is deprecated, and will be removed in a future release of PEDL.

  • Add an example of how to plot PEDL experiment metadata using a Jupyter notebook (see examples/notebooks).

  • PEDL now includes support for experiment progress estimation.

    In both the CLI and the WebUI, users can view the fraction of total work for a given experiment that has been completed.

  • Improve error handling for experiments with bad hyperparameter settings.

  • Optimize training performance for TensorFlow-based experiments.

  • Change the master to reject connection attempts from agents running a different version of PEDL.

  • Fix 0.3.0 regression in WebUI: per-trial “Logs” button stopped working.

  • Upgrade to TensorFlow 1.5.0, Keras 2.1.4, and NumPy 1.14.0 in the default trial runner.
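
The rename of make_training_and_validation_loaders to make_data_loaders follows a common deprecation pattern: prefer the new name, fall back to the old one, and warn when the fallback is used. The sketch below shows that pattern in general terms. It assumes the function is supplied by a model definition and looked up by name, and resolve_data_loader_fn is a hypothetical helper, not part of PEDL.

```python
import types
import warnings

NEW_NAME = "make_data_loaders"
OLD_NAME = "make_training_and_validation_loaders"


def resolve_data_loader_fn(model_def):
    """Return the loader function, preferring the new name over the old one."""
    fn = getattr(model_def, NEW_NAME, None)
    if fn is not None:
        return fn
    fn = getattr(model_def, OLD_NAME, None)
    if fn is not None:
        # The old name still works, but callers are told to migrate.
        warnings.warn(
            f"{OLD_NAME} is deprecated; rename it to {NEW_NAME}",
            DeprecationWarning,
            stacklevel=2,
        )
        return fn
    raise AttributeError(f"model definition must define {NEW_NAME}")


# A stand-in model definition that still uses the deprecated name.
legacy_def = types.SimpleNamespace(
    make_training_and_validation_loaders=lambda: ("train", "validation")
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loaders = resolve_data_loader_fn(legacy_def)()

print(loaders, [str(w.message) for w in caught])
```

For model code, the migration itself is only a rename; no change in behavior is implied by the note above.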

Version 0.3.0

Release Date: February 12, 2018

  • Support per-experiment resource limits.

    Users can now specify a max_slots setting in the resources section of the experiment config file.

  • Support configuring the agent to use a subset of the GPUs on a host.

    This is done via the GPU_LIST parameter in agent.conf.

  • Adopt a friendlier scheme for agent IDs.

    Agent IDs are no longer UUIDs; instead, they are user-configured strings that default to the hostname of the agent machine.

  • Bundle the API docs with the PEDL package.

  • cli: Add support for --follow / -f to pedl logs, similar to tail -f.

  • cli: Add support for --follow-first-trial to pedl create.

    Users can now start an experiment and follow the logs of the experiment’s first trial using a single command. This simplifies a common model development workflow.

  • cli: Allow the master address to be set via environment variable.

  • Fix master hang on shutdown.

  • Upgrade Postgres to 10.2.
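
The new agent ID scheme described above (a user-configured string that defaults to the hostname of the agent machine) can be sketched in a few lines. The function name resolve_agent_id and its configured_id parameter are hypothetical; how the setting is actually named in agent.conf is not covered by these notes.

```python
import socket


def resolve_agent_id(configured_id=None):
    """Return the configured ID, or fall back to this machine's hostname."""
    # An empty or missing setting falls through to the hostname default.
    return configured_id or socket.gethostname()


print(resolve_agent_id("gpu-node-1"))  # explicit ID is used as-is
print(resolve_agent_id())              # no ID configured: hostname is used
```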