Release Notes

Version 0.8.23

Release Date: July 4, 2019

  • Remove support for Python 3.6.8. All containers will use Python 3.6.9 by default.

Version 0.8.22

Release Date: July 3, 2019

  • Fix bug that prevented Kerberos support without an HDFS checkpoint storage configuration.

Version 0.8.21

Release Date: June 27, 2019

  • Breaking Change: Specify experiment templates via the CLI rather than the experiment configuration.

    Previously, experiment configuration templates were specified via the experiment configuration key template. As of this version of PEDL, the template key in the configuration will be ignored—users should specify a template as follows: pedl experiment create --template <template-name>.

  • Sort task list in CLI by task type and creation time.

  • Permit empty Keras sequences for Keras trials.

  • Fix IMDB Keras adaptive search example.

Version 0.8.20

Release Date: June 13, 2019

  • New Feature: Support for keras.utils.Sequence objects and Python generators in make_data_loaders for KerasTrial and KerasFunctionalTrial. See the data loading documentation for Keras trials for more details.

  • New Feature: Support for torch.utils.data.DataLoader in make_data_loaders for PyTorchTrial. See the data loading documentation for PyTorch trials for more details.

  • Breaking Change: BatchLoader interface support in PyTorchTrial has been removed in favor of torch.utils.data.DataLoader.
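
As a rough illustration of the Python generator support for Keras trials added in this release, the sketch below shows a plain generator that yields (inputs, labels) batches indefinitely. The name batch_generator and the list-based batch layout are assumptions made for illustration and are not part of the PEDL API; the data loading documentation for Keras trials describes what make_data_loaders actually expects.

```python
from itertools import cycle, islice


def batch_generator(samples, labels, batch_size):
    """Yield (inputs, labels) batches forever, wrapping around the dataset.

    Hypothetical helper for illustration only; not a PEDL function.
    """
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    if not samples:
        raise ValueError("dataset must not be empty")
    pairs = cycle(zip(samples, labels))
    while True:
        # Take the next batch_size (sample, label) pairs from the endless stream.
        batch = list(islice(pairs, batch_size))
        xs = [x for x, _ in batch]
        ys = [y for _, y in batch]
        yield xs, ys
```

Because the generator wraps around the dataset, a batch may mix samples from the end and the start of the data; a real loader would typically also shuffle between passes.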

Version 0.8.19

Release Date: June 6, 2019

  • Fix IMDB Keras example, which broke when NumPy 1.16.3 changed the default value of the allow_pickle field.

  • TensorBoard commands now verify trial/experiment existence before launching.

  • Fix bug that caused the master to exit when restoring experiments with invalid configurations.

  • Fix bug that prevented pedl experiment list-checkpoints from listing garbage collected checkpoints.

Version 0.8.18

Release Date: May 30, 2019

  • WebUI: Add TensorBoard button for experiments. It launches TensorBoard or opens a preexisting TensorBoard instance for an experiment.

  • Fix error when displaying agents for tasks in the CLI.

Version 0.8.17

Release Date: May 23, 2019

  • Fix regression that led to experiment failure if the default environment Docker images were deleted on the agent node.

  • Stop including tfevent files in checkpoints when experiments use the EstimatorTrial interface.

  • Improve stability and reduce memory footprint during master restarts.

  • WebUI: Fix bug that caused trial logs to always jump to the bottom of the logs, even after manually scrolling up.

Version 0.8.16

Release Date: May 16, 2019

  • New Feature: PEDL now offers the ability to create experiments using configuration templates. Configuration templates can be used to reduce redundancy in experiment configuration files. With this feature, users can move settings that are shared by many experiments into a single YAML file that can then be referenced by configurations that require those settings. See the “Configuration Templates” section of the documentation for more details.

  • Fix bug where an agent would pull all tags for a custom image instead of only the latest tag.

  • Improve readability of trial log messages by inserting a delimiter "===" on trial start.

  • Print slot id on trial runner startup.

  • Test for the presence of an Nvidia driver installation by checking for the driver module directly instead of running the nvidia-smi command.

  • Improve the robustness of the pedl-db-backup command.

  • Remove the connection warning in trial logs when a trial runner terminates.
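
To make the configuration template idea in this release concrete, here is a minimal sketch of how shared settings might be combined with an experiment-specific configuration. The merge rule shown (a recursive merge in which the experiment's values win) is an assumption, as are the function name merge_config and the bucket and description keys; the "Configuration Templates" section of the documentation describes PEDL's actual behavior.

```python
import copy


def merge_config(template, overrides):
    """Return a new dict: template values, recursively overridden by overrides.

    Hypothetical merge rule for illustration; not PEDL's implementation.
    """
    merged = copy.deepcopy(template)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Otherwise the experiment's value replaces the template's.
            merged[key] = copy.deepcopy(value)
    return merged


# Settings shared by many experiments (illustrative keys).
template = {
    "checkpoint_storage": {"type": "s3", "bucket": "shared-bucket"},
    "max_restarts": 5,
}
# Settings specific to one experiment.
experiment = {
    "description": "mnist-run",
    "checkpoint_storage": {"bucket": "team-bucket"},
}
config = merge_config(template, experiment)
```

The template itself is left unmodified, so the same template can be reused for any number of experiments.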

Version 0.8.15

Release Date: May 9, 2019

  • Minor cleanup and bug fixes.

Version 0.8.14

Release Date: May 7, 2019

  • New Feature: Trial logs in the WebUI now update in real time.

    When a trial log is initially opened, the view is scrolled to the bottom of the log (where the most recent log lines are displayed). If the trial is still active, additional log lines will continue to appear at the bottom of the view (scrolling is automatic) unless the user scrolls away from the bottom of the view. If the user scrolls up, automatic updating and scrolling are disabled. To re-enable them, the user can scroll back to the bottom of the view.

  • New Feature: PEDL now offers the ability to launch TensorBoard to view trial metrics. See the "TensorBoard" section of the documentation for more details.

  • WebUI: Change the main page to display experiments, commands, and notebooks in separate tabs.

  • Fix bug causing PEDL Commands to not use prebuilt task environment images.

  • Fix bug causing Docker autoremove to result in failed experiments.

  • Fix bug causing TypeErrors in user code to be silently dropped.

  • The scheduler now uses a worst-fit policy when assigning tasks to agents in the cluster. This ensures that tasks will be placed preferentially on agents that are under-utilized, rather than "packing" tasks together on the smallest number of agents.
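
The worst-fit policy described in this release can be sketched as follows. This is an illustrative model rather than the scheduler's actual code: the function name, the dict of free slots per agent, and the tie-breaking rule are all assumptions.

```python
def pick_agent_worst_fit(free_slots, slots_needed):
    """Return the agent with the most free slots that can fit the task, or None.

    free_slots maps an agent name to its number of free slots (hypothetical
    representation for illustration).
    """
    candidates = {a: f for a, f in free_slots.items() if f >= slots_needed}
    if not candidates:
        return None
    # Worst fit: prefer the agent with the MOST free capacity, which spreads
    # tasks across under-utilized agents. Ties are broken by agent name.
    return min(candidates, key=lambda a: (-candidates[a], a))
```

For example, with agents holding 1, 4, and 2 free slots, a one-slot task goes to the agent with 4 free slots, whereas a best-fit ("packing") policy would have chosen the agent with 1.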

Version 0.8.13

Release Date: April 22, 2019

  • Fix bug that led to experiment failure if the default environment Docker images were deleted on the agent node.

Version 0.8.12

Release Date: April 19, 2019

  • Fix bug in pedl-pull-images that prevented pre-generated images from being pulled.

Version 0.8.11

Release Date: April 18, 2019

  • Breaking Change: Remove explicit trial runner Docker images.

    The pedl-tr-py3.6-tf and pedl-tr-py3.6-pytorch images are no longer distributed with PEDL because experiments now create their environments on demand, as configured by the environment key.

  • Breaking Change: Remove the trial_environment key from the experiment configuration (deprecated in PEDL 0.8.9).

    Instead of using the trial_environment key, an experiment configuration should use the environment key.

  • Breaking Change: Remove support for old Trial constructor interface (deprecated in PEDL 0.8.5).

  • Update version of TensorFlow.

    Experiments now support TensorFlow 1.13.1 and CUDA 10.0 by default.

Version 0.8.10

Release Date: April 5, 2019

  • New Feature: PEDL now offers the ability to launch Jupyter notebooks attached to one or more slots in the cluster. See the "Jupyter Notebooks" section of the documentation for more details.

  • Fix bug that crashed any model definitions that imported the cProfile standard Python library.

  • Breaking Change: Simplify PyTorch API.

    Within the PyTorchTrial class, the expected behavior of a model's forward() has been altered to be more consistent with native PyTorch models. Additionally, the signatures of the training_metrics() and validation_metrics() methods have been modified, and a new losses() method has been added. See the model definition documentation for more details. PyTorch examples have been updated to use the new API.

  • Multi-GPU support for training PyTorch models.

    PEDL can now transparently train PyTorch models using multiple GPUs if an experiment is configured to use multiple slots per trial. See the experiment configuration documentation for more details.

Version 0.8.9

Release Date: March 19, 2019

  • New Feature: Configure experiment containers with the same method as PEDL commands.

    It is now possible to configure an experiment's environment using the environment key, following the same semantics as with configuring PEDL commands. Additionally, the environment key for both experiments and PEDL commands now supports GPU- and CPU-specific tags under runtime_packages and runtime_commands.

  • New Feature: Running and pending commands now listed in the WebUI.

    The experiments overview page now lists running and pending commands in the "Commands" section right under the "Finished Experiments" section.

  • Deprecation Warning: Experiment configuration key trial_environment has been deprecated in favor of environment.

    For backwards compatibility support, trial_environment is still supported in this version of PEDL. It will be removed in a future version.

Version 0.8.8

Release Date: March 14, 2019

  • Documentation: Add data loaders tutorial and example data loaders.

  • New Feature: Add listing of running or pending commands.

    The PEDL CLI command cmd can now take the list argument which displays a list of running and pending commands.

  • Update version of PyTorch.

    The trial runners now include PyTorch 1.0.1.post2.

Version 0.8.7

Release Date: March 7, 2019

  • New Feature: Support for grid search.

    There is a new grid option for hyperparameter search. The MNIST examples have been updated with sample grid search experiment configuration files. Please see the grid search documentation for more details.

  • New Feature: Quick start guide in documentation.

    Check out the quick start guide!

  • New Feature: Support for per-experiment weights in scheduler.

    It is now possible to specify a weight for each experiment using the resources.weights field, defaulting to 1; each active experiment will be allocated a number of slots that is approximately proportional to its weight. The weight of an existing experiment can be set via the CLI (pedl experiment set weight <id> <weight>).

  • Upgrade to TensorFlow 1.12 in default TF trial runner image.
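
As a sketch of what "approximately proportional to its weight" could mean for the per-experiment weights introduced in this release, the example below divides a fixed number of slots using a largest-remainder rule. The rounding rule and the function name allocate_slots are assumptions for illustration; the scheduler's actual allocation logic may differ.

```python
def allocate_slots(weights, total_slots):
    """Split total_slots across experiments roughly in proportion to weight.

    weights maps an experiment name to its weight (hypothetical input shape).
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one experiment must have a positive weight")
    # Exact (fractional) share of slots for each experiment.
    exact = {e: total_slots * w / total_weight for e, w in weights.items()}
    # Start from the rounded-down share.
    alloc = {e: int(share) for e, share in exact.items()}
    leftover = total_slots - sum(alloc.values())
    # Give the remaining slots to the largest fractional remainders first.
    by_remainder = sorted(weights, key=lambda e: (alloc[e] - exact[e], e))
    for e in by_remainder[:leftover]:
        alloc[e] += 1
    return alloc
```

With weights of 1 and 2 and four slots available, the experiments receive 1 and 3 slots respectively, which is as close to a 1:2 split as whole slots allow.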

Version 0.8.6

Release Date: February 22, 2019

  • New Feature: Support specifying bind mounts for PEDL commands.

    PEDL commands now take a --volume <host path>:<container path> argument that mounts a path on the agent machine as a path in the command container (e.g., --volume /shared-fs:/shared-fs). Multiple mounts can be indicated with multiple --volume arguments.

  • New Feature: Support the ability to maintain callback state.

    To use callbacks that maintain state, please implement the save() and load() functions in the pedl.callback.Callback interface.

  • Support the ReduceLROnPlateau callback when used with Keras simple model definitions.

    The semantics of the patience and cooldown arguments to ReduceLROnPlateau are slightly modified when used in PEDL. Please see the Keras simple model definition documentation for more details.

  • Breaking Change: Support multi-input multi-output PyTorch models.

    Within the PyTorchTrial class, the loss() method has been removed and the signatures of the training_metrics() and validation_metrics() methods have been modified. See the model definition overview for more information. PyTorch examples have been updated to use the new API. The MNIST example now contains a multi-output example as well.

  • Breaking Change: Update the CLI.

    The pedl trial list and pedl checkpoint list commands have been moved to pedl experiment list-trials and pedl experiment list-checkpoints, respectively. The new names may be abbreviated pedl e lt and pedl e lc.

  • Fix bug that prevented creating experiments with a security configuration.

  • Re-enable support for the pbt search method.

  • Update default trial runner images to use SciPy 1.2.1 and Keras-Preprocessing 1.0.9.

  • Upgrade to Postgres 10.7.

Version 0.8.5

Release Date: February 14, 2019 💕

  • New Feature: Support for non-graceful termination of active PEDL commands.

    PEDL commands can now be terminated immediately with pedl cmd kill <task_id>. The task id can be found by either listing slots (pedl slot list) or listing tasks (pedl task list).

  • Deprecation Warning: Data loaders have been decoupled from the Trial interface.

    The training_loader and validation_loader arguments have been removed from the constructors to EstimatorTrial, KerasTrial, KerasFunctionalTrial, and TensorFlowTrial. Previously, the constructors for these classes were def __init__(self, training_loader, validation_loader, hparams). Now, they should be def __init__(self, hparams).

    For backwards compatibility support, the old interface will still be supported in this version of PEDL. It will be removed in a future version.

  • Deprecation Warning: Experiment configuration key checkpoint_storage.checkpoint_path has been deprecated in favor of checkpoint_storage.storage_path.

    Limitation: pedl cmd kill does not currently support killing commands that are still pulling or building; in that case, the kill may be postponed until the Docker build steps have finished and the task has started running.

  • Fix bug that prevented use of quoted commands with pedl cmd run.

    For example, pedl cmd run "echo hello && echo world" should now work as intended.

  • Update the default trial runner images to use TensorFlow 1.11.0 and CuDNN 7.4.

  • WebUI: Make pressing the escape key close any open modal.

  • WebUI: Remove support for creating experiments; use the CLI (pedl experiment create) to create experiments.

  • WebUI: Replace Experiments dropdown with Active Experiments.

  • Documentation: Add an FAQ section and an overview page on hyperparameter search methods.

  • Fix bug when using a non-default WORKDIR in custom Docker images with PEDL commands.

Version 0.8.4

Release Date: February 5, 2019

  • Fix scheduler bug that led to an indefinite hang with pedl slot list or pedl agent list.

  • Display PEDL command description when starting a command with pedl cmd run.

  • Improve organization of documentation by splitting "PEDL Overview" into multiple pages.

  • CLI: Return an error message from pedl experiment kill if the experiment is not active.

Version 0.8.3

Release Date: January 30, 2019

  • Breaking Change: Move namespace of TensorBoard callback to pedl.frameworks.tensorflow.TensorBoard.

  • Modify dependency installation order of operations when a custom Docker base image is specified.

    When a custom Docker base image is specified, runtime_packages and/or runtime_commands are now installed after injecting PEDL harness code and installing PEDL harness dependencies. These configurations can be used to override PEDL harness dependencies, if needed.

  • Downgrade to h5py version 2.7.1.

Version 0.8.2

Release Date: January 29, 2019

  • Fix a bug in a database migration when converting checkpoints to a new internal format.

Version 0.8.1

Release Date: January 29, 2019

  • New Feature: Track file sizes of checkpoints.

    pedl checkpoint list will now display the size of each checkpoint. Sizes of checkpoints created before this version of PEDL are not computed retroactively and default to 0.

  • New Feature: Add ability to non-gracefully terminate an experiment with pedl experiment kill.

    Killing an experiment will immediately terminate an experiment by killing all of its associated trials. pedl experiment kill does not checkpoint each trial before terminating it, so this command should be used with care. To gracefully terminate an experiment, please use pedl experiment cancel.

  • CLI: Remove device UUID from display name when showing slots via pedl slot list.

  • CLI: Fix bug that displayed an incorrect response message when killing a trial via pedl trial kill.

  • CLI: Fix bug in pedl slot disable.

  • Fix bug in setting a default value for checkpoint storage configuration checkpoint_path on experiment creation.

Version 0.8.0

Release Date: January 28, 2019

NOTE: This release changes the command-line syntax of the PEDL CLI. See below for details.

NOTE: This release includes significant changes to the internals of the PEDL master. As a result, running experiments cannot be upgraded from previous versions of PEDL. Before upgrading to PEDL 0.8.0, please cancel all running experiments. (Warm-starting of old experiments with the upgraded PEDL master should continue to work.)

  • Breaking Change: Port PEDL master to Go, improve scalability.

    The PEDL master has been reimplemented in Go. In addition, the master uses a new approach to managing concurrent operations. In concert, these changes should result in substantial improvements to the master's performance and its robustness under heavy load. As noted above, running experiments cannot be upgraded from previous versions of PEDL.

  • Breaking Change: Change command-line syntax for PEDL CLI.

    The CLI has been changed to use a consistent pedl <noun> <verb> syntax. For example, creating an experiment was previously done via pedl create; the new syntax is pedl experiment create, which can be shortened to pedl e create. The previous command-line syntax is no longer supported. The new CLI also supports tab-completion. To enable it, run eval "$(register-python-argcomplete pedl)".

  • New Feature: Add support for executing arbitrary commands.

    PEDL now supports running arbitrary commands on agent machines. This feature is intended to support workflows that do not easily fit into the standard experiment workflow. Commands can be started using pedl cmd run. For more information, see the documentation.

  • Breaking Change: Reject experiment config files with unrecognized keys.

    Previously, PEDL would accept experiment configuration files with unrecognized keys. Such keys were ignored, so typos in the config file could result in confusing behavior. In this release of PEDL, unrecognized keys in configuration files are rejected. As a special case, arbitrary keys are allowed under the data top-level key. Users who wish to include custom directives in their experiment configuration files should move those directives to the data section.

  • Breaking Change: Adopt Keras naming convention for validation metric names in the Simple Keras API.

    Validation metrics will automatically be prefixed with val_, for consistency with the naming convention for validation metrics used by Keras itself.

  • Add support for exporting TensorFlow Estimator trials to the SavedModel format.

    Model definitions that use the TensorFlow Estimator API can now implement an optional API, build_serving_input_receiver_fns, to support exporting the model to the SavedModel format.

  • Disable support for the pbt search method.

    Support for PBT will be reintroduced in a future release of PEDL.

  • Remove support for "system dump".

  • Shrink size of PEDL agent container image.

  • Upgrade to Postgres 10.6.

  • Update the agent and trial runner container images to use Python 3.6.8.
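
The stricter configuration validation introduced in this release can be pictured with the sketch below. It is illustrative only: the allowed-key set is a small assumed sample rather than PEDL's real schema, only top-level keys are checked, and the function names are hypothetical.

```python
# Illustrative sample of top-level keys; NOT the real PEDL schema.
ALLOWED_TOP_LEVEL_KEYS = {
    "description",
    "data",
    "hyperparameters",
    "searcher",
    "checkpoint_storage",
}


def find_unrecognized_keys(config, allowed=ALLOWED_TOP_LEVEL_KEYS):
    """Return the sorted top-level keys of config that are not allowed."""
    return sorted(key for key in config if key not in allowed)


def validate_config(config):
    """Raise ValueError if config contains unrecognized top-level keys.

    Nested keys are not inspected, so anything under "data" is accepted.
    """
    unknown = find_unrecognized_keys(config)
    if unknown:
        raise ValueError("unrecognized configuration keys: " + ", ".join(unknown))
    return config
```

Under this model, a typo such as sercher is reported instead of being silently ignored, while custom directives placed under data pass validation.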

Version 0.7.14

Release Date: December 13, 2018

  • Improve robustness of HDFS checkpointing logic to retry-able failures.

Version 0.7.13

Release Date: December 12, 2018

  • Breaking Change: New pedl.callback.Callback interface.

    The training_step_callbacks() and validation_step_callbacks() interfaces for standard trial definitions have been removed in this PEDL version. In their place, the pedl.callback.Callback() API can be used to execute Python functions at the beginning and/or end of training and/or validation steps. See the "Callbacks" section in the PEDL overview documentation for more details.

  • Add pedl.callback.TensorBoard to simplify TensorBoard integration.

    See "TensorBoard Integration" in the PEDL overview documentation for an example of how to integrate TensorBoard into your workflow.

  • Remove support for pedl system-dump.

  • Fix bug in EstimatorTrial that caused long-running trial runner containers to consume unbounded disk space.

Version 0.7.12

Release Date: December 3, 2018

  • Add a workaround for a TensorFlow memory leak bug when using EstimatorTrial.

    Previously, the physical memory of a PEDL trial runner could grow unboundedly when using the EstimatorTrial API with certain types of tf.train.Optimizer instances. This version of PEDL includes a monkey-patched version of TensorFlow to address this issue until an upstream fix is merged by the TensorFlow team. Please see https://github.com/tensorflow/tensorflow/issues/24047 for a full bug report.

Version 0.7.11

Release Date: November 29, 2018

  • Add scripts to simplify backing up and restoring PEDL's metadata database.

    These scripts are named pedl-db-backup and pedl-db-restore, respectively.

  • Upgrade to Keras 2.2.4 in the default trial runner base image.

  • Work around bug in Keras when using multiprocessing.

    A bug in Python's multiprocessing module resulted in hangs when used with Keras simple model definitions in some situations. This release of PEDL includes a workaround for the underlying multiprocessing bug.

  • Fix error when garbage collecting checkpoints stored on Kerberos-enabled HDFS file systems.

Version 0.7.10

Release Date: November 15, 2018

  • Prevent swallowing of the full traceback in trial logs when model definition code raises a StopIteration exception.

Version 0.7.9

Release Date: November 10, 2018

  • New Feature: Add support for TensorFlow's tf.estimator.Estimator API.

    Users can now use the EstimatorTrial interface to train Premade (https://www.tensorflow.org/guide/premade_estimators) or Custom (https://www.tensorflow.org/guide/custom_estimators) tf.estimator.Estimator models with PEDL. A new example model definition using this interface (mnist_estimator) has been added to the examples page. Please see the documentation for a full description of the API.

  • New Feature: Add support for HDFS checkpointing with Kerberos enabled.

    Users can add the kerberos: true configuration to the checkpoint_storage section when type is "hdfs" to enable Kerberos mode. When using this feature, users may also need to configure the security/kerberos/config_file to point to a valid Kerberos configuration file location for each agent.

  • New Feature: Add support for preconfiguring the trial runner environment with a bash script.

    When PEDL detects a file named pedl-prepare-env.sh at the top level of a model definition directory, it will execute this script during startup of the trial runner container. Note that this script is executed before the trial runner executes any model definition code with the Python interpreter.

  • WebUI: Fix bug that prevented the trial detail modal from appearing when a metric name had certain special characters (e.g. /).

  • Support for Python 3.7 compatibility with the PEDL CLI.

Version 0.7.8

Release Date: November 5, 2018

  • WebUI: Add support for filtering experiments with canceled or errored states.

  • New Feature: Ensure that the Keras TensorBoard callback serializes validation metrics when using Simple Keras Model Definitions.

    When used in previous versions of PEDL, the Keras TensorBoard callback would only serialize training metrics.

  • New Feature: Add support for validation callbacks when using Simple Keras Model Definitions.

    Previous versions of PEDL would only execute callbacks during training steps. See documentation on the KerasValidationCallback class for more details.

  • Add utility functions for referencing the current PEDL context.

    pedl.get_experiment_config(), pedl.get_trial_id(), and pedl.get_experiment_id() have been added as utility functions to be used anywhere in model code.

  • Experimental: Initial support for HDFS checkpointing.

    HDFS support is undocumented in this release of PEDL—please consult with the Determined AI team before using.

  • Update the master, agent, and trial runners to use Python 3.6.7.

  • WebUI: Simplify the "Create New Experiment" and "Continue Training Workflow" modals.

    Previous versions of PEDL displayed richly formatted fields for each experiment configuration option, but only supported a subset of the available top-level options. This release of PEDL moves to a single large text area in which the raw experiment configuration YAML can be edited directly.

Version 0.7.7

Release Date: October 11, 2018

  • New Feature: Add support for custom base Docker images.

    This release of PEDL introduces support for specifying a custom Docker base_image in the experiment configuration. The base_image should be accessible to all agent nodes via docker pull. If a private image is used, Docker Registry credentials must be specified in the registry_auth section in the experiment configuration. The maintainer of the custom base image is responsible for installing PEDL dependencies—see the Custom Docker Base Images section in documentation for a full list of dependency requirements.

  • New Feature: Add --download-to flag to pedl list-checkpoints.

    This flag allows users to download the listed checkpoints for any experiment configured with S3 checkpoint storage. This flag can be used in tandem with the --best flag to download the top N checkpoints for an experiment.

  • WebUI: Display the best validation metric in addition to the latest validation metric for all trials.

Version 0.7.6

Release Date: October 2, 2018

  • New Feature: Add support for optionally associating Git metadata with an experiment.

    pedl create --git will look for a Git repository in the model definition directory to save metadata associated with the current Git commit and the remote URL of the current upstream branch. If an experiment is created with the --git flag, the Web UI will display the Git commit, committer, commit date, and link to the upstream remote URL. This feature assumes that any commits in the local repository also exist in the upstream remote repository.

Version 0.7.5

Release Date: September 27, 2018

  • New Feature: Add support for automatically taking checkpoints when the validation performance of an experiment improves.

    This release of PEDL introduces a new experiment config option, checkpoint_policy. Using the default policy (best), PEDL will checkpoint any trial whenever its validation performance exceeds the previous best validation performance for this experiment. The all checkpoint policy causes PEDL to take a checkpoint after every validation operation; policy none results in no additional checkpoints being taken. Note that checkpoints might still be taken for other reasons: for example, if the min_checkpoint_period option is enabled, or if a trial is moved from one slot to another by the scheduler.

  • Change scheduler to favor spreading tasks around the cluster.

    In previous versions of PEDL, the scheduler attempted to pack tasks on a subset of the cluster. This policy has some advantages: for example, it can result in leaving entire agent machines idle, which then allows those machines to be deactivated or used for a future multi-GPU job. However, this packing behavior can also be problematic: placing additional jobs on the same machine can result in contention for other resources on that host (e.g., CPU or I/O). This release of PEDL changes the scheduler to spread tasks around the cluster when possible; two tasks will only be placed on the same machine if there are no agents that are completely idle.

  • Add support for --best flag to pedl list-checkpoints.

    If the --best N flag is specified, pedl list-checkpoints will return the "best" N checkpoints, according to the experiment's configured validation metric. Checkpoints that do not have an associated validation operation will be omitted.

  • Improve compatibility for Keras callbacks when using the simple model API.
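
The three checkpoint_policy values introduced in this release (best, all, none) can be summarized with the decision function sketched below. The function name, the smaller_is_better parameter, and the choice to treat the first validation as an improvement are assumptions for illustration; only the policy names and their general meaning come from the notes above.

```python
def should_checkpoint(policy, validation_metric, previous_best, smaller_is_better=True):
    """Decide whether to take a checkpoint after a validation operation.

    previous_best is the best metric seen so far in the experiment, or None
    if no validation has completed yet (hypothetical convention).
    """
    if policy == "all":
        return True
    if policy == "none":
        return False
    if policy != "best":
        raise ValueError(f"unknown checkpoint_policy: {policy!r}")
    if previous_best is None:
        # First validation: there is nothing to beat yet.
        return True
    if smaller_is_better:
        return validation_metric < previous_best
    return validation_metric > previous_best
```

This covers only the policy-driven checkpoints; as noted above, checkpoints may still be taken for other reasons, such as min_checkpoint_period or a trial being moved between slots.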

Version 0.7.4

Release Date: September 18, 2018

  • WebUI: Fix bug in the Continue Training workflow when using Keras simple model definitions.

  • WebUI: Fix bug in the Continue Training workflow when using nested hyperparameters.

Version 0.7.3

Release Date: September 17, 2018

  • Fix bug in Keras simple model definitions when no user-defined metrics are passed to model.compile().

Version 0.7.2

Release Date: September 13, 2018

  • New Feature: Support for population-based training (PBT).

    Refer to the documentation to see how to use PBT with PEDL.

  • Breaking Change: Validation functions for Keras models should now operate on tensors, rather than NumPy arrays.

    For trials using the KerasTrial and KerasFunctionalTrial classes, validation functions should now have TensorFlow tensors for their arguments and return types, as with the current version of TensorFlowTrial. The new API is not backward-compatible with the old API: any PEDL models that use either Keras trial class will need to be updated. The cifar10_cnn_keras and mnist_keras_functional examples demonstrate how to use the new API.

Version 0.7.1

Release Date: September 6, 2018

  • New Feature: Support for filtering experiments by multiple labels.

    On the experiment list page, it is now possible to enter multiple experiment labels at the same time; only experiments that have all of the labels will be shown. Type in a label and press 'enter' to add it to the list of labels to filter by; when the text input is empty, press the left and right arrow keys to select an existing label and 'backspace' to remove it from the list.

  • WebUI: Add API reference documentation.

    This documentation is available via the "API Reference" link at the top of any page in PEDL.

  • WebUI: Fix ability to specify source trial ID in create experiment modal.

  • WebUI: Fix links to examples in the main documentation.
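
The multi-label filter in this release shows only experiments that carry every selected label. A minimal sketch of that rule, using a hypothetical mapping from experiment ID to its set of labels:

```python
def filter_by_labels(experiments, required_labels):
    """Return IDs of experiments that have ALL of the required labels.

    experiments maps an experiment ID to an iterable of its labels
    (hypothetical data shape for illustration).
    """
    required = set(required_labels)
    return [
        exp_id
        for exp_id, labels in experiments.items()
        if required <= set(labels)  # subset test: every required label present
    ]
```

With no labels selected, the subset test is trivially true and every experiment is returned, which matches an empty filter showing the full list.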

Version 0.7.0

Release Date: August 23, 2018

  • New Feature: Persist experiment state across master crashes.

    In previous releases of PEDL, a crash in the master would cause all running or paused experiments to enter an error state; now, trials can resume from their last checkpoints after a crash.

  • New Feature: Support for disabling and enabling agents to allow seamless cluster upgrades.

    This release adds the pedl disable-agent and pedl enable-agent CLI commands, which disable and enable scheduling of tasks on agents. Disabling all agents and waiting for existing jobs to finish allows the cluster to be restarted without losing any work.

  • New Feature: Support for previewing hyperparameter searches.

    This release adds the pedl preview-search CLI command, which simulates a run of the given searcher configuration and prints a summary of the training steps that it schedules.

  • New Feature: Support for .pedlignore files.

    If there is a file called .pedlignore in the top level of a model definition directory passed to pedl create, it is now treated as a list of patterns (in the same style as .gitignore) to exclude from the upload to the master.

  • Revamp documentation.

    We have changed how we generate our documentation to improve styling, navigation, and search.

  • Show experiment and trial IDs in the trial detail modal.
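
To illustrate how a .pedlignore file might exclude files from the upload, the sketch below applies shell-style patterns with Python's fnmatch module. This is a deliberate simplification: real .gitignore semantics (negation with !, anchored patterns, **) are richer, and the function names here are hypothetical rather than part of PEDL.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def parse_ignore_file(text):
    """Return the non-empty, non-comment patterns from ignore-file text."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(path, patterns):
    """True if the whole path, or any single path component, matches a pattern."""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        pattern = pattern.rstrip("/")  # treat "dir/" like "dir"
        if fnmatchcase(path, pattern):
            return True
        if any(fnmatchcase(part, pattern) for part in parts):
            return True
    return False
```

A file listing such as `*.ckpt` and `__pycache__/` would then exclude checkpoint files and cached bytecode while leaving model code in the upload.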

Version 0.6.7

Release Date: August 9, 2018

  • New Feature, Breaking Change: New API for writing TensorFlow models.

    This version of PEDL introduces a rewrite of TensorFlowTrial, the base class for PEDL models that use TensorFlow. The new TensorFlowTrial supports models with multiple inputs and outputs, supports validation functions on tensors (improving performance), and fixes other limitations of the previous TensorFlowTrial API. The new API is not backward compatible with the old API: any PEDL models that use TensorFlow will need to be updated. The mnist_tf example distributed with PEDL has been updated to use the new API.

  • New Feature: Support for experiment labels.

    A label is an arbitrary string that can be associated with an experiment; each experiment can have a set of labels. Labels can be used to organize experiments and identify groups of experiments that have similar properties. Labels can be added and removed via the CLI (pedl label) or the Web UI.

  • Improve compatibility with recent versions of Kubernetes.

  • Cleanup and refactoring of PEDL fault tolerance logic.

Version 0.6.6

Release Date: August 6, 2018

  • Fix incompatibility in the aiodocker library to support Docker >= 18.06.0-ce.

Version 0.6.5

Release Date: August 2, 2018

  • Experimental: Support for "simple" model definitions.

    In previous releases of PEDL, model definitions were required to implement a custom Trial API. This API is how PEDL implements support for hyperparameter searches, automatic checkpointing, workload migration between agents, and metadata capture. However, this approach requires modifying model code to implement this API, which can be inconvenient when running "off-the-shelf" models.

    This release of PEDL introduces experimental support for "simple" model definitions. This feature allows PEDL to run unmodified model code: features like automatic checkpointing are supported by intercepting calls to certain framework APIs. This feature is currently only supported for models written with Keras that use the fit_generator API. To access hyperparameters, a new optional API has been introduced, pedl.get_hyperparameter(). For more information, see the documentation and the mnist_keras_simple example.

  • New Feature: Improved trial fault tolerance.

    PEDL's support for handling trial failures has been substantially refactored. The main user-visible change is that when a trial fails, only that trial will need to be restarted; other trials in the same experiment will continue running without interruption. This change also fixes several corner-case bugs and lays the groundwork for supporting master fault tolerance in a future release of PEDL.

    This release also changes the semantics of the max_restarts configuration parameter: previously, this parameter defined the number of times that an experiment would be restarted after a failure of any one of the experiment's trials. It now defines the maximum number of times that any one trial can fail before the entire experiment is aborted (i.e., it is now a per-trial counter, not a per-experiment counter).
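
    The per-trial semantics can be pictured with the short Python sketch below. It is an illustrative model, not PEDL's implementation: RestartTracker and record_failure are hypothetical names, and the sketch assumes the experiment is aborted once a single trial has failed more than max_restarts times.

```python
from collections import Counter


class RestartTracker:
    """Hypothetical per-trial failure counter (not PEDL's actual code)."""

    def __init__(self, max_restarts):
        self.max_restarts = max_restarts
        self.failures = Counter()  # trial ID -> number of failures so far

    def record_failure(self, trial_id):
        """Record a failure; return "restart" or "abort_experiment"."""
        self.failures[trial_id] += 1
        # Assumption: the experiment is aborted once a single trial has
        # failed more than max_restarts times.
        if self.failures[trial_id] > self.max_restarts:
            return "abort_experiment"
        return "restart"


tracker = RestartTracker(max_restarts=2)
# Failures of different trials are counted independently.
outcomes = [tracker.record_failure(t) for t in (1, 2, 1, 2, 1)]
print(outcomes)
```

    In this run, four failures spread across two trials each lead to a restart; only the third failure of trial 1 aborts the experiment. Under a per-experiment counter, the third failure overall would already have exceeded the limit.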

  • New Feature: Add default checkpoint GC policy.

    In previous releases of PEDL, checkpoint GC was not performed by default. In this release, all experiments will have a checkpoint GC policy by default (save_experiment_best: 0, save_trial_best: 1, save_trial_latest: 1).
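
    To make the default policy concrete, here is a rough Python sketch of which checkpoints such a policy would retain. The function checkpoints_to_keep and the checkpoint dictionaries are hypothetical. The sketch assumes that each parameter gives a number of checkpoints to keep (per trial or per experiment, as its name suggests), that a smaller validation metric is better, and that only checkpoints with a validation metric can count as "best". The experiment configuration documentation has the authoritative semantics.

```python
def checkpoints_to_keep(checkpoints, save_experiment_best=0,
                        save_trial_best=1, save_trial_latest=1):
    """Return the set of checkpoint IDs retained under a GC policy.

    Each checkpoint is a dict with keys "id", "trial", "step", and
    "metric" (None if no validation metric was computed). Assumption:
    a smaller metric is better.
    """
    keep = set()
    validated = [c for c in checkpoints if c["metric"] is not None]

    # Best checkpoints across the whole experiment.
    by_metric = sorted(validated, key=lambda c: c["metric"])
    keep.update(c["id"] for c in by_metric[:save_experiment_best])

    for trial in {c["trial"] for c in checkpoints}:
        own = [c for c in checkpoints if c["trial"] == trial]
        # Most recent checkpoints of this trial.
        latest = sorted(own, key=lambda c: c["step"], reverse=True)
        keep.update(c["id"] for c in latest[:save_trial_latest])
        # Best validated checkpoints of this trial.
        best = sorted((c for c in own if c["metric"] is not None),
                      key=lambda c: c["metric"])
        keep.update(c["id"] for c in best[:save_trial_best])
    return keep


checkpoints = [
    {"id": "a", "trial": 1, "step": 1, "metric": 0.30},
    {"id": "b", "trial": 1, "step": 2, "metric": 0.20},
    {"id": "c", "trial": 1, "step": 3, "metric": 0.25},
    {"id": "d", "trial": 2, "step": 1, "metric": 0.40},
    {"id": "e", "trial": 2, "step": 2, "metric": None},
]
kept = checkpoints_to_keep(checkpoints)
print(sorted(kept))
```

    With the default values, only checkpoint "a" would be garbage collected in this example: it is neither the latest nor the best checkpoint of its trial.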

  • Update versions of several dependencies.

    The trial runners now include Keras 2.2.2, PyTorch 0.4.1, and NumPy 1.15.0.

Version 0.6.4

Release Date: July 26, 2018

  • Experimental: Support for PyTorch models.

    PyTorch models are written by subclassing the abstract class PyTorchTrial and specifying a base_image of determinedai/pedl-tr-py3.6-pytorch in the experiment config file. See examples/mnist_pytorch for a complete example.

    PyTorch models in PEDL currently do not support multi-GPU training.

  • New Feature: Support for abruptly killing trials.

    A new CLI sub-command, pedl kill-trial, has been added. This immediately terminates the container associated with the specified trial ID. Note that once the trial's current container has been terminated, the trial will typically be restarted in a different container (due to PEDL's support for automatic experiment fault tolerance).

  • In pedl describe --metrics, display all validation metrics of an experiment.

    Previously, only the metric used by the experiment's search method was displayed.

Version 0.6.3

Release Date: July 19, 2018

  • Upgrade to TensorFlow 1.9.0 in the default trial runner.

    Note that as a result of this change, the version of tf.keras has been upgraded from 2.1.2 to 2.1.6.

  • Make base_image optional in experiment configurations.

    If not specified, the base_image defaults to determinedai/pedl-tr-py3.6-tf, which is currently the only legal value for base_image.

  • Improve Python 3.5 compatibility.

Version 0.6.2

Release Date: July 13, 2018

  • Upgrade to Keras 2.2.0 in the default trial runner.

  • Fix bug in fault tolerance logic when an experiment that is being canceled encounters an error.

  • More aggressively schedule new work when an experiment's max_slots limit is changed.

Version 0.6.1

Release Date: July 12, 2018

  • Breaking Change: When using bind mounts or shared_fs checkpoints, the specified host_path must already exist. In previous versions of PEDL, bind mounts could use host_paths that did not previously exist on the host file system.

  • Breaking Change: The mechanism for specifying read-only bind mounts has changed. In previous versions of PEDL, the mode parameter was used; in this release of PEDL, a new parameter read_only should be used instead.

  • WebUI: Support for plotting multiple training metrics in trial "detail" view.

    In previous releases of PEDL, the trial detail view only supported displaying validation metrics and training loss. As with the plot of training loss, the plot of other training metrics displays the mean value of the training metric for each step.

  • cli: Support for "test mode" when creating experiments.

    When an experiment is created using pedl create --test_mode, PEDL will run only a single trial of the experiment, and this trial will only be trained for a single step. Then validation metrics will be computed, and a checkpoint of the trial will be taken. Finally, the experiment will be archived, and the experiment's checkpoint will be garbage collected. This feature is intended to support rapid iteration during the initial phase of developing a new model.

  • cli: Support for changing an experiment's max_slots limit on-the-fly.

  • cli: Report multiple training metrics in pedl describe.

    In previous releases of PEDL, pedl describe --metrics only reported training loss. As with training loss, the CLI will report the mean value of the training metric for each step.

  • Support saving checkpoints to arbitrary subdirectories of a shared file system.

    When saving checkpoints to a shared_fs, the new configuration parameter checkpoint_path can be used to control the subdirectory on the shared file system where checkpoints will be placed.

  • Support for configuring bind propagation for bind mounts via a new propagation configuration parameter.

Version 0.6.0

Release Date: July 3, 2018

  • New Feature: Support for recovering from trial and agent failures.

    In previous versions of PEDL, the failure of any trial within an experiment would cause the entire experiment to fail. Similarly, if an agent crashed, all experiments that were running any trials on that agent would be marked as failed.

    PEDL now supports recovering from trial and agent failures by automatically re-running failed workloads. This improves PEDL's tolerance of transient faults, such as network failures or out-of-memory errors. If an error occurs while running a trial, PEDL will restart the execution of that trial from its most recent checkpoint (if any). Note that since deep learning workloads are not deterministic in general (see the discussion of reproducibility for more details), any metadata that was recorded after the last checkpoint will be deleted (and subsequently recomputed). The maximum number of times an experiment will be restarted to recover from failures is controlled by max_restarts, which defaults to 5. This parameter ensures that PEDL does not go into an infinite loop if an experiment encounters the same error repeatedly.

  • cli: Add pedl list-tasks subcommand.

    This is useful for examining the state of the task scheduler.

  • cli: Fix error in pedl describe --metrics.

  • WebUI: Round validation metrics to at most five decimal digits in the list of trials for an experiment.

  • WebUI: Improve scaling of the X-Axis in the experiment-level validation metric plot to match the length of the experiment.

  • WebUI: Improve responsiveness when activating, pausing, archiving, and unarchiving experiments.

  • Improve error handling when a trial returns an invalid validation metric value (e.g., math.nan).

  • Improve reporting of training metrics in KerasFunctionalTrial.

    Previously, only the weighted sum loss of a multi-loss model was reported as a training metric "loss". Starting in this version of PEDL, the values of each loss function in addition to the weighted sum loss will be reported as training metrics. Training metric history is accessible via pedl describe --metrics. However, the Web UI trial detail view graph will continue to only display a single training metric "loss" (the weighted sum loss).
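
    As an illustration of what is reported, the Python sketch below computes the metrics for a model with two loss functions. The function training_metrics, the output names, and the metric names are hypothetical; only the idea of reporting each loss alongside the weighted sum "loss" comes from the description above.

```python
def training_metrics(losses, loss_weights):
    """Report each individual loss plus the weighted sum as "loss"."""
    metrics = dict(losses)  # one entry per loss function
    metrics["loss"] = sum(loss_weights[name] * value
                          for name, value in losses.items())
    return metrics


losses = {"output_a_loss": 0.5, "output_b_loss": 0.25}
weights = {"output_a_loss": 1.0, "output_b_loss": 2.0}
metrics = training_metrics(losses, weights)
print(metrics)
```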

Version 0.5.12

Release Date: June 21, 2018

  • New Feature, Breaking Change: Support for multiple inputs and multiple loss functions in KerasFunctionalTrial.

    In previous versions of PEDL, models using KerasFunctionalTrial could specify multiple outputs, but only a single input and a single loss function. This limitation has been lifted.

    As a result, the API for validation metric functions in KerasFunctionalTrial has changed: previously, the second argument to a metric function was an np.ndarray containing the true labels for the validation set. In this release of PEDL, the second argument is now a dictionary that maps layer names to np.ndarray values.

  • New Feature: Support for archiving experiments.

    PEDL now supports archival of completed experiments. When an experiment is archived, all experiment metadata is preserved but the experiment is hidden by default from the WebUI and the list of experiments returned by pedl list. Both the PEDL CLI and WebUI now include options to enable display of archived experiments.

  • New Feature: Support for multiple Python packages when creating experiments.

    When creating an experiment using pedl create, users can now specify one or more additional Python packages via the --package flag. Packages should be provided as source distributions—e.g., a ZIP or TAR archive created by python setup.py sdist in a Python project that uses setuptools.

    This feature allows models to use dependencies on the user's local file system; network-accessible dependencies can also be downloaded using the runtime_packages feature supported by previous versions of PEDL.

  • New Feature: Support for plotting per-trial validation metrics in Web UI.

    In previous versions of PEDL, the "Details" modal contained a plot of per-step average training loss; this dialog has been improved to also support plotting any of the experiment's validation metrics.

  • Fix rounding error in fair-share scheduling logic.

Version 0.5.11

Release Date: June 14, 2018

  • Add support for warm-starting from arbitrary checkpoints.

    Previously, it was only possible to warm start from the latest checkpoint associated with a particular source trial ID. In this release of PEDL, experiments can also be warm started from a specific checkpoint using the source_checkpoint_uuid configuration parameter.

  • cli: Add validation metric to pedl list-checkpoints.

    This is useful because checkpoints with an associated validation metric are treated differently by the garbage collector.

  • cli: Add pedl config subcommand.

    This displays the configuration of an experiment in YAML format.

  • webui: After creating a new experiment, navigate to it.

  • Experiments now default to the normal priority level.

    There was previously no default value for this configuration parameter.

  • Improve scalability of pedl system-dump.

  • Improve validation of priorities in experiment configuration.

  • Improve logging when processing changes to experiment priority, GC policy, and description.

Version 0.5.10

Release Date: June 8, 2018

  • Fix bug when garbage-collecting shared-fs checkpoints.

    As a result of this bug, garbage-collecting a shared-fs checkpoint marked the checkpoint as DELETED in the PEDL database but did not correctly remove the underlying checkpoint storage.

Version 0.5.9

Release Date: June 7, 2018

  • New Feature: Add a CLI command to update the checkpoint GC policy of an experiment.

    pedl set-gc-policy can be used to update the checkpoint GC policy of running or finished experiments. For example, this can be used to reduce the storage consumed by historical experiments.

  • New Feature: Add support for changing the description of an existing experiment.

    Experiment descriptions can be changed via a new CLI command, pedl set-description.

  • Improve performance of pedl list-trials CLI command.

  • Improve error handling of experiment decoding errors in the Web UI.

    In previous versions of PEDL, the WebUI failed to display if any of the experiments in the database contained an invalid configuration. In PEDL >= 0.5.9, this error handling is improved—experiments with invalid configurations will be omitted (with a user-facing error prompt) from the Web UI instead of causing a fatal error.

  • Fix rare race condition on experiment shutdown.

  • Fix bug when using multi-GPU models instantiated with the tensorflow.python.keras library.

Version 0.5.8

Release Date: June 5, 2018

  • Major Change: Temporarily disable experiment priorities.

    This release of PEDL ignores the priority field in experiment configurations. This is a temporary change; support for experiment priorities will be restored shortly.

  • Reject experiment configurations with non-default base_image.

    PEDL does not currently support experiments that use custom Docker base images.

  • Improve scalability of pedl system-dump.

  • Improve logging for training and validation data loaders.

Version 0.5.7

Release Date: May 31, 2018

  • Properly handle containers that crash without an active workload.

  • Fix bug in master experiment shutdown logic.

  • Fix race condition in agent during container exit.

  • Improve performance when retrieving trial logs.

  • WebUI: Add metric value to tooltip in experiment validation plot.

  • Fix bug in using an adaptive search with a min_validation_period specified.

  • Fix bug when garbage collecting experiments with failed validations.

  • Avoid low-probability agent crash due to container name collision.

  • Upgrade websockets library to version 5.0.1 in trial runners.

Version 0.5.6

Release Date: May 29, 2018

  • Fix error in scheduler that occurs if the total number of cluster slots changes.

  • Upgrade websockets library to version 5.0.1.

Version 0.5.5

Release Date: May 24, 2018

  • New Feature: Add support for garbage collecting checkpoints.

    When an experiment finishes, the system may optionally delete some checkpoints to reclaim space. See save_trial_latest, save_trial_best, and save_experiment_best under checkpoint_storage in the experiment configuration documentation for details on how to configure an experiment with this feature enabled. By default, all checkpoints are saved.

  • Add support for absolute imports in multiple file model definitions.

  • Improve performance of scheduler when scheduling many large experiments with large numbers of trials.

Version 0.5.4

Release Date: May 23, 2018

  • New Feature: Add support for a system dump command in CLI.

    The new pedl system-dump command generates a large zip file with cluster logs, data, and statistics.

  • Significantly improve performance of scheduler when scheduling experiments with large numbers of trials.

  • Fix bug in using subdirectories with multi-file model definitions.

  • Fix bug in pedl describe with terminal experiments.

  • Upgrade the websockets Python library to version 5.0.0.

Version 0.5.3

Release Date: May 17, 2018

  • New Feature: Rewrite PEDL task scheduler.

    This release includes a new scheduler implementation. The major user-visible change is that experiments within a priority tier will now be fair-shared: e.g., if there are two high-priority experiments, each experiment will have the opportunity to consume half the cluster's resources. (The previous scheduler allocated resources to experiments in the same priority tier according to FIFO order). See the documentation for more information about the new scheduler. Note that scheduler-related configuration options (e.g., max_slots, priority) have not changed.
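
    One way to picture fair sharing within a single priority tier is the Python sketch below. It is not the scheduler's actual algorithm: fair_share is a hypothetical helper, and the sketch assumes that slots are handed out one at a time in round-robin order and that an experiment's max_slots acts as a cap on its allocation.

```python
def fair_share(total_slots, max_slots):
    """Split total_slots evenly among experiments in one priority tier.

    max_slots maps an experiment name to its slot cap (None = no cap).
    Slots are handed out one at a time in round-robin order, so no
    experiment gets more than one slot ahead of another unless the
    other has reached its cap.
    """
    allocation = {name: 0 for name in max_slots}
    remaining = total_slots
    while remaining > 0:
        progressed = False
        for name, cap in max_slots.items():
            if remaining == 0:
                break
            if cap is None or allocation[name] < cap:
                allocation[name] += 1
                remaining -= 1
                progressed = True
        if not progressed:  # every experiment is at its cap
            break
    return allocation


print(fair_share(8, {"exp_a": None, "exp_b": None}))
print(fair_share(8, {"exp_a": None, "exp_b": 2}))
```

    With eight slots and two uncapped experiments, each receives four slots. If exp_b is capped at two slots, the slots it cannot use go to exp_a instead of sitting idle.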

  • Add support for using non-AWS S3 implementations to store model checkpoints.

    A new experiment configuration option, endpoint_url, has been added to allow specifying this.

  • The default trial runner base image has been upgraded to include Keras 2.1.6, NumPy 1.14.3, and SciPy 1.1.0.

  • Fix bug in handling HTTP requests with no Content-Type.

Version 0.5.2

Release Date: May 15, 2018

  • Trial runners can now be started under a non-root user ID.

    In local deployments, use TRIAL_RUNNER_UID and TRIAL_RUNNER_GID under dist/etc/agent.conf to configure a non-root user ID and optional group ID. In Kubernetes deployments, use the trialRunner.uid and trialRunner.gid configuration parameters. If unspecified, the uid defaults to 0 (the root user) and the gid defaults to the root group.

  • WebUI: Incorporate PEDL documentation into the web server.

    Users can now access the full PEDL documentation on the web UI via the "Documentation" button in the navigation bar.

  • WebUI: Improve formatting of Y-Axis labels in trial "Details" visualization.

    Graphs with very large or very small training loss values will now use scientific notation for labels on the Y-Axis.

  • Fix bug in handling image build errors in the agent.

Version 0.5.1

Release Date: May 10, 2018

  • New Feature: Support viewing per-trial training loss history in WebUI.

    This release adds a "Details" button that displays a plot of the training loss for a given trial. The plot shows the mean per-step training loss.

  • WebUI: Don't refresh experiment details for terminal experiments.

  • WebUI: Fix broken "Continue Training" button (0.4.9 regression).

  • Fix rare error when launching trials due to container name conflict (0.4.9 regression).

  • Improve error handling for experiments with misconfigured searcher metric.

    The metric field in the searcher section of the experiment config must correspond to the name of a validation metric produced by the model. When this is not the case, PEDL now detects this situation and reports an error.

  • cli: Improve error reporting for pedl logs.

Version 0.5.0

Release Date: May 5, 2018

  • Avoid WebUI error when displaying experiments with misconfigured searcher metric name.

Version 0.4.9

Release Date: May 4, 2018

  • New Feature: Support for incremental computation of validation metrics.

    Previously, the API for computing validation metrics required the entire validation set to be loaded into memory. For experiments with large validation sets, this might be very expensive.

    This release of PEDL introduces a new API that splits the computation of a validation metric into a "batch validation function", which computes an intermediate result for a single batch, and a "reducer", which combines all the intermediate results into a final metric value. Not all validation metrics can be expressed in this way, but for those that can, using this new (optional) API can result in reduced memory consumption.
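
    The split between a batch validation function and a reducer can be illustrated in plain Python with classification accuracy. The function names and signatures below are hypothetical and do not reflect the PEDL API; they only show the shape of the computation.

```python
def batch_validation(predictions, labels):
    """Intermediate result for one batch: (number correct, batch size)."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct, len(labels)


def reducer(batch_results):
    """Combine per-batch results into the final accuracy."""
    correct = sum(c for c, _ in batch_results)
    total = sum(n for _, n in batch_results)
    return correct / total


batches = [
    ([1, 0, 1], [1, 1, 1]),  # 2 of 3 correct
    ([0, 0], [0, 0]),        # 2 of 2 correct
]
results = [batch_validation(preds, labels) for preds, labels in batches]
accuracy = reducer(results)
print(accuracy)
```

    Only one batch needs to be in memory at a time. Returning counts rather than a per-batch accuracy matters here: averaging the per-batch accuracies would give a different answer whenever batches differ in size.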

  • New Feature: Support for warm-starting experiments that use random and adaptive search methods.

    Previously, warm-starting was only supported for single experiments. When warm starting random and adaptive experiments, the source_trial_id is used to set the initial weights for all of the trials in the experiment.

  • Introduce a more concise format for specifying constant hyperparameters.

    Example of the new format:

    batch_size: 32
    

    This is equivalent to the old syntax, which is still supported:

    batch_size:
      type: const
      val: 32
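
    The equivalence of the two forms can be sketched in Python as a normalization step. The helper normalize_hyperparameter is hypothetical, and the sketch assumes that any value that is not a mapping with a type key is treated as a constant.

```python
def normalize_hyperparameter(value):
    """Expand the concise constant form into the explicit form."""
    if isinstance(value, dict) and "type" in value:
        return value  # already in the explicit form
    return {"type": "const", "val": value}


concise = {"batch_size": 32}
explicit = {"batch_size": {"type": "const", "val": 32}}
normalized = {k: normalize_hyperparameter(v) for k, v in concise.items()}
print(normalized == explicit)
```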
    
  • Upgrade to YAML 1.2 format for experiment configurations.

    Notably, this allows scientific notation (e.g., 1e-4) to be used when specifying hyperparameters.

  • Fix pedl logs -f to handle Ctrl+C (KeyboardInterrupt) more cleanly.

  • Fix authentication bug when fetching trial runner images from a remote Docker registry.

  • Upgrade to Python 3.6.5 in the agent and trial-runner containers.

    The master container already used Python 3.6.5.

Version 0.4.8

Release Date: April 30, 2018

  • Fix missing dependencies in the CLI.

Version 0.4.7

Release Date: April 26, 2018

  • New Feature: Support for periodic validation computation.

    In previous versions of PEDL, validation metrics were only calculated after the final step of a trial, or after the final step of each rung when using the adaptive search method. This release of PEDL adds support for periodically computing validations in addition to those mentioned previously. A new configuration parameter, min_validation_period, specifies the maximum number of training steps that can run after one validation computation before the next one is initiated.

    Users should note that enabling periodic validation could slow experiment progress, depending on the cost of a validation computation. Due to this, periodic validations are not enabled for an experiment by default.
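
    The effect of min_validation_period on a single trial can be sketched as follows. This is a simplified model rather than PEDL's scheduling code: validation_steps is a hypothetical helper, the validations triggered by adaptive search rungs are left out, and the sketch assumes a validation is started as soon as min_validation_period steps have run since the previous one.

```python
def validation_steps(total_steps, min_validation_period=None):
    """Return the training steps after which a validation is computed."""
    steps = []
    since_last = 0
    for step in range(1, total_steps + 1):
        since_last += 1
        is_final = step == total_steps
        period_reached = (min_validation_period is not None
                          and since_last >= min_validation_period)
        if is_final or period_reached:
            steps.append(step)
            since_last = 0
    return steps


print(validation_steps(10))
print(validation_steps(10, min_validation_period=4))
```

    Without the parameter, a 10-step trial is validated only after step 10. With a period of 4, it is also validated after steps 4 and 8, which is where the extra cost mentioned above comes from.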

  • Fix bug with experiments that use the tensorflow.python.keras package.

    In previous versions of PEDL, experiments using the tensorflow.python.keras package would crash when attempting to save a checkpoint.

  • Fix an off-by-one error that slightly limited the integer range of trial seeds.

    Trial seeds are now randomly selected from the integer range [0, 2^31), whereas in previous PEDL versions they were randomly selected from the integer range [0, 2^31 - 1).
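
    For reference, drawing from the corrected half-open range looks like this in Python. The helper pick_trial_seed is hypothetical and is not taken from the PEDL source; it only demonstrates the range.

```python
import random

SEED_UPPER_BOUND = 2 ** 31  # exclusive upper bound


def pick_trial_seed(rng=random):
    """Draw a seed uniformly from the half-open range [0, 2^31)."""
    return rng.randrange(SEED_UPPER_BOUND)


seed = pick_trial_seed(random.Random(0))
print(0 <= seed < SEED_UPPER_BOUND)
```

    The largest possible seed is now 2^31 - 1 (2147483647), a value the old range excluded.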

  • Improvements to trial logging.

    The agent ID and initial workload are now logged on trial runner startup.

  • Add --tail support to pedl logs.

    This flag specifies the number of lines of log output to show, counting from the end of the log (analogous to tail -n).

  • Improve logging of WebSocket errors.

  • Improve error logging for CLI commands enable-slot and disable-slot.

Version 0.4.6

Release Date: April 19, 2018

  • Breaking Change: Switch to a new, backwards-incompatible checkpoint format for Keras trials.

    Previous versions of PEDL used the default Keras serialization format (model.save()). Unfortunately, this format is problematic for models that use the Keras multi_gpu_model() API.

    This release of PEDL switches to a new custom checkpoint format for Keras models. This change works around the shortcomings of the default Keras format and allows multi-GPU models to be restored from checkpoints, but the new checkpoint format is backwards-incompatible: PEDL >= 0.4.6 cannot use Keras model checkpoints (e.g., for experiment warm starts) created by PEDL < 0.4.6.

    One consequence of this change is that Keras model definitions that use custom objects no longer need to implement the custom_objects API method. As a result, this method has been removed from KerasTrial and KerasFunctionalTrial.

  • Support changing the priority of experiments on-the-fly.

    This is done using a new CLI sub-command, pedl set-priority.

  • Add container launch errors to the per-trial log.

    In previous versions of PEDL, if an error occurred when launching a container for a trial, that error was only visible in the PEDL agent log. Container launch errors are now also visible in the per-trial log (e.g., pedl logs).

  • Enforce a maximum size on model definitions.

    PEDL now rejects model definitions that are greater than 96MB in total size.
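
    Users who want to check a model definition directory before submitting it could use a sketch like the one below. The helper names are hypothetical, and the sketch assumes that "96MB" means 96 * 1024 * 1024 bytes and that the limit applies to the combined size of all files in the directory; PEDL's own check may measure size differently.

```python
import os

MAX_MODEL_DEF_BYTES = 96 * 1024 * 1024  # assumption: "96MB" means 96 MiB


def directory_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def check_model_definition(path):
    """Raise ValueError if the model definition exceeds the limit."""
    size = directory_size(path)
    if size > MAX_MODEL_DEF_BYTES:
        raise ValueError(f"model definition is {size} bytes; "
                         f"limit is {MAX_MODEL_DEF_BYTES}")
    return size
```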

  • Display experiment progress as part of the experiment list in CLI and Web UI.

  • Fix PEDL agent crash with large model definitions.

  • Fix bug that caused the pedl-agent-stop script to hang for a long time.

  • Tweak display of experiment states in Web UI.

    Completed, active, and failed experiments are now shown in different colors.

Version 0.4.5

Release Date: April 12, 2018

  • New Feature: Support for model definitions consisting of multiple files.

    In previous versions of PEDL, experiments could only use a single model definition file. This restriction has been lifted; an experiment can now consist of a directory of files. When creating multi-file experiments, users should ensure the top-level directory is a well-formed Python package (e.g., it should contain a __init__.py file). Multi-file experiments can be created via both the CLI (pedl create <experiment-config> <dir>) and the Web UI.

  • New Feature: Support for periodic trial checkpoints.

    In previous versions of PEDL, trials were only checkpointed when the trial was moved to another agent or when the experiment finished. This release of PEDL adds support for periodically checkpointing each trial of an experiment. A new configuration parameter, min_checkpoint_period, specifies the maximum number of training steps that can run after one checkpoint before the next checkpoint of the trial is taken. Periodic checkpoints are not enabled for an experiment by default.

  • New Feature: Initial support for reproducible experiments.

    PEDL includes limited support for improving the reproducibility of deep learning experiments. See the documentation for more details.

  • Significantly improve Web UI performance.

    The Web UI should now place much less load on the master when viewing experiments with many steps and/or trials.

  • Allow TensorFlow trials to specify a custom session configuration tf.ConfigProto.

  • Add new CLI sub-command, download-s3-checkpoint.

    This makes it easier to download trial checkpoints that are stored in S3.

  • Improve Web UI display of trials with in-progress validation operations.

    When displaying trials with in-progress validation operations, the Web UI previously displayed a blank validation metric; it will now display the last successfully computed validation metric.

  • Tweak display of experiment states in Web UI.

    These were previously displayed in red text (even for successfully completed experiments), which was confusing. All experiment states are now displayed using the same color as normal text.

  • Fix bug in KerasFunctionalTrial when multiple training metrics specified the same output layer.

  • Fix error when warm-starting from a trial with multiple checkpoints.

  • Fix JavaScript error when activating or pausing experiments in the Web UI.

  • Raise maximum WebSocket packet length to 4MB.

Version 0.4.4

Release Date: April 6, 2018

  • Fix a Web UI crash with experiments that have misconfigured bind_mounts.

  • Fix a Web UI error that was caused by stale code for editing model definitions.

  • Update to Python 3.6.5 in the PEDL master container.

  • Print Nvidia driver version number during PEDL agent startup.

Version 0.4.3

Release Date: April 5, 2018

  • Breaking Change: Remove support for editing model definitions via the builtin editor in the Web UI.

    Previous versions of PEDL supported editing model definitions directly in the Web UI. This feature has been removed, in anticipation of support for model definitions that consist of multiple files.

  • Breaking Change: Remove support for displaying a histogram of predicted validation labels in the Web UI.

    This feature was not broadly useful and the implementation was fragile. A future version of PEDL will introduce support for custom plots as a fully supported feature.

  • Support for disabling GPUs dynamically.

    PEDL now supports two new CLI commands, pedl disable-slot and pedl enable-slot. These commands allow GPUs at an agent to be disabled and enabled, respectively. When a slot is disabled, any workload that is currently running in the slot is allowed to finish its current step; it will then be checkpointed and migrated to a different slot.

    Note that these settings are not persisted: if an agent disconnects from PEDL and reconnects, all of its GPUs / slots will be enabled. GPUs can be disabled in a persistent way by editing GPU_LIST in agent.conf, but changing GPU_LIST requires restarting the agent.

  • Increase width of log modal in Web UI.

    This makes it easier to view trial logs in the Web UI.

  • Add an "Experiment ID" column to pedl list-slots.

    This makes it easier to identify all the slots currently used by a particular experiment.

  • Reduce the number of intermediate Docker layers created for runtime_packages.

  • Fix bugs in the "Continue Training" feature in the Web UI.

    The previous implementation neglected to correctly preserve some properties of the experiment being continued from (e.g., bind_mounts).

  • Fix crash in pedl describe when the described experiment was in the midst of computing validation metrics.

  • Experimental: Reproducibility in single and random Experiments.

    PEDL now supports near-reproducible experiments when using the above search methods. There may still be some limitations around achieving perfect reproducibility to floating point precision during optimization, depending on model choice and/or underlying hardware. See the documentation for more details.

Version 0.4.2

Release Date: March 28, 2018

  • Experimental: Support for synchronous data-parallel training using multiple GPUs.

    PEDL now supports trials that use multiple GPUs on a single agent. This feature allows multiple GPUs to be used to train a single experiment to convergence more quickly. To enable parallel training, set the slots_per_trial field in the experiment configuration to be the number of parallel GPUs to use for each trial in the experiment. Note that enabling parallel training does not require changing your model code.

    The current implementation has a few shortcomings:

    • the user must manually configure the desired degree of parallelism
    • all trials in the experiment must use the same degree of parallelism
    • a naive communication strategy is used to share gradients between GPUs, which can result in poor performance for some models
    • multi-slot experiments are scheduled using a simplistic algorithm that can sometimes result in underutilization

    These shortcomings will be addressed in future releases of PEDL.

  • Breaking Change: The TensorFlowTrial API has changed.

    Model definitions that use TensorFlow will need to be updated: several TensorFlowTrial interface methods have been renamed and a new required interface method has been added. The examples and API docs have been updated to describe the new API.

  • Add progress reporting during training and validation of Keras trials.

    This makes it easier to observe the rate at which a Keras trial is making progress.

  • Correctly handle errors when the agent fails to launch a container.

  • Improve reporting of errors and assertion failures.

  • Fix bug in pedl list-checkpoints for in-progress experiments.

  • Rename the WAITING task state to IDLE.

    This more accurately describes what containers in this state are doing.

Version 0.4.1

Release Date: March 22, 2018

  • Add initial support for "warm starting" of experiments.

    This allows a new experiment to be created that uses the weights from a particular trial of a previous experiment. For example, this feature can be used to continue training promising trials from previous experiments for a longer period of time. Note that the new and old experiments must use the same model architecture; however, hyperparameters that don't influence the model architecture can safely be changed.

  • Add support for checkpointing Keras trials that use custom layers and other custom objects.

    This requires KerasTrial subclasses to implement a new interface method, custom_objects().

  • Fix bug in KerasFunctionalTrial with single-output models.

  • Improve error checking for experiment configurations.

  • Improve accuracy of experiment progress indicator.

  • Fix 0.4.0 regression in webui: when viewing an experiment, an error occurred if the experiment changed from "active" to "completed".

Version 0.4.0

Release Date: March 19, 2018

  • Breaking Change: The checkpoint format for TensorFlow experiments is now SavedModel.

    In previous versions of PEDL, the tf.train.Saver format was used.

  • Add support for single trial search method.

    The random searcher can be used to achieve a single trial experiment, but this new search method provides first-class support for an experiment that consists of a single trial.

  • Improve webui for PEDL deployments with many experiments.

  • Support filtering experiments by date range and description.

  • Fix bug with experiments that used categorical hyperparameters with numerical values.

  • Add CentOS 7 as a supported platform.

  • Upgrade to Keras 2.1.5 in the default trial runner base image.

  • Upgrade to Postgres 10.3.

Version 0.3.2

Release Date: March 8, 2018

  • Breaking Change: Rename the trial_runner field in the experiment configuration file.

    The name of the base Docker container for running trials is now specified by the subfield base_image of the top-level trial_environment field. For example:

    trial_environment:
      base_image: determinedai/pedl-tr-py3.6-tf
    
  • Breaking Change: Remove extra Python packages from the base trial runner image.

    The image previously contained several commonly used Python libraries (e.g., joblib, pandas, zarr). These packages have been removed; the base trial runner image now contains only TensorFlow, Keras, and their dependencies. The runtime_packages feature (described below) has been added to support installing custom dependencies.

  • Support for customizing the trial runner container.

    The experiment configuration file supports two new subfields of trial_environment: runtime_commands and runtime_packages. These specify a list of commands to be executed and a list of Python packages to be installed into the trial runner, respectively. These customizations are applied before any workloads are run in the trial container.

  • Support for training and validation callbacks.

    Model definitions can now define callbacks that will be executed after training or validation operations. For example, this feature can be used to record training and validation metrics as TensorBoard event files, which can then be visualized using TensorBoard. A complete example of TensorBoard integration is included in the documentation.

  • In adaptive search, mark trials as "completed" more aggressively when possible.

  • Improve reliability of starting the PEDL master via systemd.

  • webui: In the experiment detail page, support changing the sort order of the trial tables.

    For example, this makes it easier to see which completed or active trials have the best validation metric.

  • webui: In the experiment list page, support changing the sort order of the experiment tables.

Version 0.3.1

Release Date: February 27, 2018

  • New Interface: The KerasFunctionalTrial interface has been added to support the Keras Functional API.

    See the documentation for usage instructions and current limitations.

  • API Change: The trial API function make_training_and_validation_loaders has been renamed to make_data_loaders.

    The old name is still supported but is deprecated, and will be removed in a future release of PEDL.

  • Add an example of how to plot PEDL experiment metadata using a Jupyter notebook (see examples/notebooks).

  • PEDL now includes support for experiment progress estimation.

    In both the CLI and the WebUI, users can view the fraction of total work for a given experiment that has been completed.

  • Improve error handling for experiments with bad hyperparameter settings.

  • Optimize training performance for TensorFlow-based experiments.

  • Change the master to reject connection attempts from agents running a different version of PEDL.

  • Fix 0.3.0 regression in webui: per-trial "Logs" button stopped working.

  • Upgrade to TensorFlow 1.5.0, Keras 2.1.4, and NumPy 1.14.0 in the default trial runner.
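
The rename of make_training_and_validation_loaders to make_data_loaders noted above keeps the old name working for now. One common way a framework can support both names is sketched below; this is only an assumption about the general technique, not PEDL's actual implementation. The helper resolve_data_loader_fn and both trial classes are hypothetical, and the string return values are placeholders for real data loaders.

```python
import warnings


def resolve_data_loader_fn(trial):
    """Return the trial's data-loading method, preferring the new name."""
    new = getattr(trial, "make_data_loaders", None)
    if callable(new):
        return new
    old = getattr(trial, "make_training_and_validation_loaders", None)
    if callable(old):
        # The old name still works, but the caller is told to migrate.
        warnings.warn(
            "make_training_and_validation_loaders is deprecated; "
            "rename it to make_data_loaders",
            DeprecationWarning,
            stacklevel=2,
        )
        return old
    raise AttributeError("trial does not define make_data_loaders")


class LegacyTrial:  # hypothetical trial still using the old name
    def make_training_and_validation_loaders(self):
        return "train-loader", "validation-loader"  # placeholder values


class UpdatedTrial:  # hypothetical trial using the new name
    def make_data_loaders(self):
        return "train-loader", "validation-loader"  # placeholder values
```

For model authors, migrating should only require renaming the method in the trial class.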

Version 0.3.0

Release Date: February 12, 2018

  • Support per-experiment resource limits.

    Users can now specify a max_slots setting in the resources section of the experiment config file.

  • Support configuring the agent to use a subset of the GPUs on a host.

    This is done via the GPU_LIST parameter in agent.conf.

  • Adopt a more friendly scheme for agent IDs.

    Agent IDs are no longer UUIDs; instead, they are user-configured strings that default to the hostname of the agent machine.

  • Bundle the API docs with the PEDL package.

  • cli: Add support for --follow / -f to pedl logs, similar to tail -f.

  • cli: Add support for --follow-first-trial to pedl create.

    Users can now start an experiment and follow the logs of the experiment's first trial using a single command. This simplifies a common model development workflow.

  • cli: Allow the master address to be set via environment variable.

  • Fix master hang on shutdown.

  • Upgrade Postgres to 10.2.
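
The agent ID scheme described above (a user-configured string that defaults to the hostname) can be pictured with the short sketch below. The function resolve_agent_id and its parameter are hypothetical and only illustrate the fallback behavior; how the agent actually reads its configuration is not covered in these notes.

```python
import socket


def resolve_agent_id(configured_id=None):
    """Use the configured ID if one is given, else fall back to the hostname."""
    if configured_id:
        return configured_id
    # No ID configured: default to the hostname of the agent machine.
    return socket.gethostname()
```

With this scheme, an agent started without an explicit ID appears under its machine's hostname rather than an opaque UUID.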