Release Date: December 19, 2019
Breaking Change: Move aggregation frequency, gradient compression, and mixed precision training configurations from hyperparameters to optimizations in the experiment config.
Breaking Change: Move pedl.trial.get_trial_seed() to pedl.get_trial_seed().
Add documentation for TFKerasTrial.
Deprecation Warning: KerasTrial, KerasFunctionalTrial, and Simple Keras interfaces have been deprecated and will be removed in a future PEDL version. Please use TFKerasTrial.
Move the Python API reference documentation pages from
Add the pedl.frameworks.keras.data.InMemorySequence utility class.
Add support for TensorFlow 1.15.0. TensorFlow 1.14.0 remains the default because in 1.15.0, TFKerasTrial does not work with tf.data.Dataset-based data loaders.
Add TensorBoard support for experiments using HDFS storage.
Support configuring the ports used by GLOO and NCCL during distributed training.
Support configuring checkpoint storage in the cluster configuration. This is the default storage used for new experiments. See Checkpoints for details.
master.yaml to configure the network interface used for distributed training. Specifying the network interface in this way should reduce the start-up time for distributed training.
Update PyTorchTrial documentation to match the new PyTorchTrial API.
Web UI: Add logs for commands, notebooks, shells, and TensorBoards.
Web UI: Add a "copy to clipboard" button across all log views.
Web UI: Persist filter selections on the experiment list page as query parameters.
Web UI: Fix bug where the plot view for experiment and trial details would show on a second line in Firefox.
Web UI: Fix bug where description fields were being incorrectly sorted in tables.
Web UI, CLI: Fix bug where retrieved master logs were missing some logged fields.
CLI: Support downloading checkpoints from Google Cloud Storage (GCS).
Release Date: December 6, 2019
Breaking Change: Remove custom definition of learning rate scheduling in favor of direct support of PyTorch learning rate schedulers.
Breaking Change: The pedl.callback.Callback interface has been deprecated.
Add option to set /dev/shm on a per-experiment basis in the experiment config.
Support launching an elastic agent with an IAM Instance Profile attached on AWS.
Support AWS authentication for checkpoints with IAM roles or environment variables. For more information, see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html.
WebUI: Allow control- and command-clicking to open experiments and trial pages in new browser tabs.
WebUI: Add the ability to bulk kill experiments from experiment list view.
WebUI: Add confirmation prompt to terminal actions and actions affecting more than one entity.
WebUI: Add more detailed information to the trial page:
Show detailed checkpoint information for each trial.
Show timing information for training, validation, and checkpointing for each trial.
WebUI: Display the plotted trial metric in numerical format in the steps table.
Release Date: November 22, 2019
- Breaking Change: Update PyTorchTrial API in a backward-incompatible way.
The following API changes have been made:
1. Remove `hparams` from function args; please use `pedl.get_hyperparameter()` instead.
2. Remove `PyTorchTrial.losses()` and `PyTorchTrial.training_metrics()`, which are replaced by `PyTorchTrial.train_batch(self, batch: TorchData, model: nn.Module, epoch_idx: int, batch_idx: int)`, which performs the forward pass and returns the loss and other training metrics.
3. Remove `PyTorchTrial.validation_metrics()`, which is replaced by `PyTorchTrial.evaluate_batch(self, batch: TorchData, model: nn.Module)`, which returns the validation metrics.
Existing PyTorch model definitions will need to be updated to work with this release of PEDL.
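As a rough illustration of the new method shapes, the sketch below mirrors only the two signatures quoted above. It is a framework-free stand-in: it does not import PEDL or PyTorch, the class name `SketchTrial` is hypothetical, plain Python lists and callables stand in for `TorchData` and `nn.Module`, and returning a dictionary keyed by `"loss"` is an assumption rather than documented behavior.

```python
from typing import Any, Callable, Dict, List, Tuple

# Stand-ins for TorchData and nn.Module (assumptions of this sketch).
Batch = Tuple[List[float], List[float]]  # (inputs, targets)
Model = Callable[[float], float]


class SketchTrial:
    """Hypothetical stand-in; a real trial would subclass PyTorchTrial."""

    def train_batch(self, batch: Batch, model: Model,
                    epoch_idx: int, batch_idx: int) -> Dict[str, Any]:
        # Forward pass and loss in one method, replacing the removed
        # losses() and training_metrics() methods.
        inputs, targets = batch
        preds = [model(x) for x in inputs]
        loss = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)
        # The dict return shape is an assumption, not taken from the notes.
        return {"loss": loss}

    def evaluate_batch(self, batch: Batch, model: Model) -> Dict[str, Any]:
        # Replaces the removed validation_metrics() method.
        inputs, targets = batch
        preds = [model(x) for x in inputs]
        mae = sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)
        return {"validation_error": mae}
```

Refer to the PyTorchTrial documentation for the actual base class, data types, and return conventions.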
Breaking Change: Support PyTorch DataLoaders in a PyTorch 1.3 compatible way in preparation for PyTorch 1.3 support. Please see the PyTorch documentation for more details.
Breaking Change: Simplify master configuration for dynamic agents.
The master URL of dynamic agents is now configured using the key master_url, replacing the previous key master_address. A valid master URL is in the format of scheme://hostname:port. The master URL defaults to http as scheme, the local IP address of the master as
port if the master runs on cloud. The hostname can still be configured using alias.
For AWS dynamic agents, the region now defaults to the region of the master instance, and support for the ec2.region alias has been removed.
tag_value now default to managed-by and an identifier determined by whether the master runs on EC2.
For GCP dynamic agents, the zone now defaults to the project ID and zone of the master instance, and support for the gce.project-id alias has been removed.
label_value now default to managed-by and an identifier determined by whether the master runs on GCP.
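The scheme://hostname:port format required for master_url can be checked with the Python standard library. This is a minimal sketch, not PEDL's own validation: the function name `parse_master_url` and the example address are hypothetical.

```python
from urllib.parse import urlsplit


def parse_master_url(master_url: str):
    """Split a scheme://hostname:port URL into (scheme, hostname, port)."""
    parts = urlsplit(master_url)
    # All three components must be present for the URL to be usable.
    if not parts.scheme or not parts.hostname or parts.port is None:
        raise ValueError(f"expected scheme://hostname:port, got {master_url!r}")
    return parts.scheme, parts.hostname, parts.port


# Hypothetical address, used only for illustration.
print(parse_master_url("http://10.0.0.5:8080"))
```

A value without a scheme, such as `10.0.0.5:8080`, is rejected with a `ValueError` in this sketch.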
New Feature: Support a user-defined startup script for dynamic agents. The startup script runs as root on all dynamic agents during instance startup. See Dynamic Agents on AWS and Dynamic Agents on GCP for details.
New Feature: Support GCP native API for dynamic agents. The instance resource base configuration is merged with the other fields in the configuration to construct the instance insertion request. See Dynamic Agents on GCP for details.
Support attaching a second disk to GCP dynamic agents by configuring the instance base configuration and the user startup script. See the GCP dynamic agents documentation for details.
Fix bug that prevented AWS dynamic agents from working if the subnet_id is not specified but security_group_id is specified in the configuration.
Add support for
Known issue: It currently does not support pause/restart for distributed training.
WebUI: Improve "Show Configuration" functionality on the experiment detail page.
WebUI: Improve messaging on the Cluster page when there are currently no agents running.
Release Date: November 14, 2019
Add support for encrypting communications between the PEDL Master, CLI, and WebUI over HTTPS.
Refer to the documentation for configuration instructions.
Add documentation for network requirements and recommendations for PEDL clusters.
Improve distributed training for EstimatorTrial and add support for
WebUI: Allow viewing master logs from the Cluster page.
Fix bug that prevented pulling images from repositories that are not docker.io.
Release Date: November 11, 2019
Fix Docker bug that led to a build-time container hang when the specified Linux user ID was very large.
See https://github.com/moby/moby/issues/5419 for more details.
Release Date: November 8, 2019
BREAKING CHANGE: Add hyperparameters to PyTorch checkpoints and modify the format to comply with best practices. See the checkpoints documentation for more information.
BREAKING CHANGE: Support finer-grained configuration of the VPC networking of Dynamic Agents on AWS. Previously, the desired security group was indicated with the field security_group, and only it was configurable. Now, the network subnet and public IP are also configurable, and the security group is specified with the field
Add documentation for tf.keras checkpoints.
Bind mounts can be specified with a relative container_path and will be placed in the working directory of the container. See experiment configuration documentation for details.
CLI: The PEDL CLI now prints a warning if the CLI version differs from the master version.
WebUI: Improve the default sorting behavior of validation metrics.
PEDL_HPARAMS environment variable with the per-GPU batch size for distributed and parallel training.
Fix bug where user-provided conda environments were being ignored.
Fix bug where exit codes of pedl cmd run were being ignored.
Release Date: November 1, 2019
WebUI: Reduce the number of trials shown on the experiment detail page.
WebUI: CPU-only notebooks can be launched in the Notebooks section of PEDL.
Fix bug in adaptive search where the target steps may be greater than step budget.
Improve logging of GCP operations with dynamic agents.
Release Date: October 24, 2019
New Feature: Distributed training support for PyTorch.
Add a new API for downloading data in PyTorch models.
When doing distributed training of a PyTorch model, a single process is created for each GPU being used on a given agent. Each of these processes will invoke the make_data_loaders() function; in most cases these calls will happen concurrently. If each copy of the training set data loader downloads the entire data set, this causes two problems: (1) the data set will be downloaded multiple times, and (2) if the data set is stored on disk, different copies of the download might overwrite or conflict with one another.
To address these concerns, this release of PEDL introduces a new optional API for PyTorch models. If the developer implements a download_data() API function, this function will be invoked once per machine, before any data loaders are created. This function can be used to download a single copy of the data set; it should return the path of a directory on disk containing the data set. This path will then be passed when
Optimize performance of single-machine, multi-GPU training with PyTorch.
This new code path can be enabled by specifying optimized_parallel: True in the experiment config. This can result in substantially improved multi-GPU training performance (more than 5x faster in some cases). Optimized parallel performance currently requires full use of all GPUs on an agent. When this option is enabled, slots_per_trial (in the experiment config, under the resources key) must be set equal to the total number of GPUs on an agent.
Support default value for service account scopes in GCP dynamic agents configuration.
Improve tracking and error reporting for GCP operations when inserting and deleting instances.
Previous versions of PEDL did not emit error messages when GCP operations (e.g., provisioning new dynamic agents) failed. These operations are now tracked more accurately and errors are reported in the master log.
Reduce size of metadata database by cleaning up fault tolerance metadata more aggressively.
Fix bug that prevented agents from reconnecting to the master in some situations.
CLI: Automatically connect via port 80 when the master address is specified as
Previous releases of PEDL defaulted to port 8080, which is inconsistent with the default port number for HTTP URLs.
WebUI: By default, sort tables first by state and then by creation time.
WebUI: Add the ability to close "fork experiment" modals with the escape key.
WebUI: Add the ability to kill commands.
Release Date: October 18, 2019
Breaking Change: Support configuring GCP network and subnetwork to use a different project than the project of the PEDL master instance.
Previously, the network and the subnetwork of the dynamic agents were set using only the names of the network and the subnetwork. Now, the full path of each must be specified. A valid full path for a network should include the project ID and be in the format of projects/<project>/global/networks/<network>. Likewise, a valid subnetwork should be in the format of projects/<project>/regions/<region>/subnetworks/<subnetwork>. See Dynamic Agents on GCP for details.
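The two path formats above can be assembled and checked mechanically. The following is a small sketch using only the formats quoted in this entry; the helper names and the example project, region, and network names are hypothetical.

```python
import re

# Patterns taken directly from the two formats described above.
NETWORK_RE = re.compile(r"projects/[^/]+/global/networks/[^/]+")
SUBNETWORK_RE = re.compile(r"projects/[^/]+/regions/[^/]+/subnetworks/[^/]+")


def network_path(project: str, network: str) -> str:
    """Build the full path of a network."""
    return f"projects/{project}/global/networks/{network}"


def subnetwork_path(project: str, region: str, subnetwork: str) -> str:
    """Build the full path of a subnetwork."""
    return f"projects/{project}/regions/{region}/subnetworks/{subnetwork}"


def is_full_path(value: str, kind: str) -> bool:
    """Return True if value is a full path of the given kind."""
    pattern = NETWORK_RE if kind == "network" else SUBNETWORK_RE
    return pattern.fullmatch(value) is not None


# Hypothetical names, used only for illustration.
print(network_path("my-project", "default"))
print(is_full_path("default", "network"))  # a bare name is no longer enough
```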
New Feature: Introduce a new version of TensorBoard support.
The architecture of our TensorBoard support has been overhauled. The new TensorBoard support features live updating and automatic serialization of tfevents files for PEDL batch metrics.
This release changes the format used for storing TensorBoard metrics. Experiments created using previous releases of PEDL use the old metric format, which is not compatible with this new version of TensorBoard support. As such, TensorBoard cannot be launched on old experiments. If you would like to use TensorBoard on experiments created in previous versions of PEDL, please contact the Determined AI team.
Support configuring the Docker network used by masters, agents, and tasks.
By default, PEDL creates a Docker network named pedl on both master and agent machines and uses this network for masters, agents, and task containers. This behavior can be configured via the network.conf config file: the PEDL_NETWORK variable defines the network used by the master and agent containers, while the TRIAL_RUNNER_NETWORK variable controls the network used by task containers. These variables can be set to the name of a Docker network to use (this network will be created automatically); alternatively, the special value host can be used, which causes PEDL to start containers using host-mode networking.
Support configuring shared memory size for trial runners and task containers on a per-agent basis.
In previous releases of PEDL, trial runner containers used 4GB of shared memory (/dev/shm), whereas task containers used the Docker default for /dev/shm (64MB). In this release of PEDL, both trial runners and task containers now default to using 4GB of shared memory. This value can now be configured on a per-agent basis by setting
Improve performance of single-machine, multi-GPU training with tf.keras and Tensorpack.
This new code path can be enabled by specifying optimized_parallel in the experiment configuration. This can result in substantially improved multi-GPU training performance (more than 5x faster in some cases). When this option is enabled, trials from non-distributed, multi-slot experiments must use all the GPUs on the agent. The scheduler will automatically apply this constraint. For example, if the PEDL cluster consists of agents with 4 GPUs and 8 GPUs, experiments that are configured to use 2 slots_per_trial will never be scheduled, and the scheduler will automatically place experiments that use slots_per_trial: 8 on the respective agents.
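The constraint in the example above amounts to matching slots_per_trial against each agent's total GPU count. The sketch below illustrates that rule only; it is not the scheduler's implementation, and the function and agent names are hypothetical.

```python
from typing import Dict, List


def eligible_agents(agent_gpus: Dict[str, int], slots_per_trial: int) -> List[str]:
    """Agents whose total GPU count equals slots_per_trial exactly."""
    return sorted(name for name, gpus in agent_gpus.items()
                  if gpus == slots_per_trial)


# Cluster from the example above: one 4-GPU agent and one 8-GPU agent.
cluster = {"agent-a": 4, "agent-b": 8}
print(eligible_agents(cluster, 2))  # no agent matches, so never scheduled
print(eligible_agents(cluster, 8))  # only the 8-GPU agent qualifies
```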
Add ability to link PEDL users to a Unix user and group on agents.
The Unix user/group associated with a PEDL user account can be set via a new CLI command, pedl user link-with-agent-user. When configured, tasks launched by the PEDL user will run as the linked Unix user and group. See the documentation on users for more details.
Add support for storing checkpoints on Google Cloud Storage (GCS).
Add PEDL CLI to notebooks and commands.
Add documentation for TensorpackTrial.
Do not inject .bashrc when using custom container images.
This avoids overwriting any .bashrc file that might exist in the custom image.
Fix bug in merging behavior for configuration templates for commands, notebooks, and tensorboards.
Fix bug in the handling of experiments in a stopping state on master restart.
CLI: Fix error when listing trials of an experiment with no trials.
Release Date: October 10, 2019
Breaking Change: Simplify the interface of tf.keras trial.
The interface of TFKerasTrial has been simplified to a single required function: build_model(). Unlike in previous versions of PEDL, the implementation of build_model() is required to compile the tf.keras model object before returning it. In the experiment configuration, the specified searcher.metric key is expected to adopt the naming convention used by tf.keras:
Support killing pending notebooks, shells, and commands.
Previously, it was only possible to terminate a workload once that workload had started up successfully.
Undefined experiment descriptions will default to a random petname.
Upgrade PyTorch support to 1.2.0.
Add support for CPU-only notebooks.
In previous releases of PEDL, notebook tasks were always allocated a GPU. In this release of PEDL, CPU-only notebooks are now supported. This can be done by setting
0 in the configuration when launching the notebook.
Improve TensorBoard documentation.
Support configuring service account for dynamic agents on GCP.
You can specify the name and the scopes of the service account in the master configuration. For more details, see the documentation for Dynamic Agents on GCP.
Make the master HTTP port configurable.
Add a new example with tf.keras:
CLI: Add a new command, pedl user list.
Reduce size of container images.
Improve pre-built task environments.
The default task environment, which is used for experiment, notebook, and command environments, now includes a larger set of packages by default. It now includes scikit-learn, matplotlib, pandas, OpenCV, pillow, and xgboost. This reduces the need for users to specify additional workload dependencies in configuration files. The default task environment also is pre-built each release, so if there are no additional packages or commands in the experiment configuration, the task environment image will simply be downloaded directly rather than built from scratch. This should improve the time to launch experiments, notebooks, and commands.
Improve progress reporting of adaptive and simple adaptive experiments.
Support configuring the master address in Kubernetes Helm charts.
Add a persistent ID for the master to the database; add a cluster_id table with a single cluster_id UUID field.
Fix incorrect error message in PBT validation.
Fix a bug that prevented printing out correct master configuration in the logs.
Fix bug in Horovod startup where Horovod may take longer to start than expected.
WebUI: Fix an issue where opening a trial with a huge number of log events would result in logs never getting loaded.
WebUI: Fix a regression that caused classic Jupyter notebooks to be opened instead of JupyterLab notebooks.
WebUI: Fix an issue where the initially selected metric for trial plots could be different from the actual plotted one if there were two "loss" metrics.
WebUI: Correct labels for "Best validation" and "Latest validation" columns.
WebUI: Display different types of resources (CPU vs GPU) as separate bars on the cluster page.
WebUI: Add support for launching TensorBoard for a specific trial from the trial detail page.
WebUI: Display the plotted validation metric's name in experiment detail view.
WebUI: Add bulk pausing and archiving of experiments to the experiment list page.
This makes it easier to apply the same operation to multiple experiments at the same time.
Release Date: September 27, 2019
New Feature: Introduce a new version of the WebUI.
The visual design of the WebUI has been overhauled. The new WebUI also features improved performance for large experiments and a refactored internal architecture.
New Feature: Add native support for Anaconda environments.
This makes it easier for users to create PEDL experiments that have Conda-based dependencies. See the Custom Environment documentation for details.
Breaking Change: Improvements to the TensorpackTrial interface.
Remove support for
Breaking Change: Change the way instance providers are configured for dynamic agents.
Previously, the cloud provider was configured using the cloud configuration field. This field has now been renamed to provider. For more details, see the documentation for Dynamic Agents on AWS and Dynamic Agents on GCP.
Fix a bug that caused trials to crash when the master was restarted.
Improved distributed training performance for
Improve performance of models trained with EstimatorTrial interface.
Previously, every training and validation step would pay the overhead cost of initializing a TensorFlow graph. In this release of PEDL, this per-step initialization overhead is reduced to once per trial container.
Reduce number of proxy configuration options.
See the documentation for details.
CLI: Fix bug when creating experiments on Windows.
Upgrade to PyTorch 1.1.0.
Fix bug where pausing an experiment that had not been scheduled would erroneously trigger the restart logic for its corresponding trials.
Improve validation of PBT configuration parameters.
Improve agent logging when pulling images.
Fix deprecation warnings when importing
Refactor the MNIST PyTorch example to use a directory model definition.
Remove calls to super().__init__() from example models.
Release Date: September 13, 2019
Simplify Kubernetes installation by auto-detecting GPUs. See the Kubernetes documentation for more details.
Update MNIST Keras example.
Fix bug preventing dynamic agents from removing instances after tasks finish.
Release Date: September 12, 2019
New Feature: Support the ability to configure scheduling fit policy.
The scheduler can now be configured to use a worst-fit or best-fit policy when assigning tasks to agents in the cluster. The best-fit policy ensures that tasks will be preferentially "packed" together on the smallest number of agents, rather than be placed on under-utilized agents.
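The difference between the two policies can be sketched as a choice among agents with enough free slots. This is an illustrative model, not the scheduler's code: measuring fit by free slot count, the function name `pick_agent`, and the policy strings `"best"` and `"worst"` are all assumptions.

```python
from typing import Dict, Optional


def pick_agent(free_slots: Dict[str, int], needed: int, policy: str) -> Optional[str]:
    """Choose an agent with at least `needed` free slots under a fit policy."""
    candidates = {a: f for a, f in free_slots.items() if f >= needed}
    if not candidates:
        return None  # nothing fits right now
    if policy == "best":
        # Best fit: the tightest agent that still fits, which packs tasks together.
        return min(candidates, key=lambda a: (candidates[a], a))
    if policy == "worst":
        # Worst fit: the emptiest agent, which spreads tasks out.
        return min(candidates, key=lambda a: (-candidates[a], a))
    raise ValueError(f"unknown policy: {policy!r}")


free = {"agent-a": 1, "agent-b": 3, "agent-c": 4}
print(pick_agent(free, 1, "best"))   # agent-a
print(pick_agent(free, 1, "worst"))  # agent-c
```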
New Feature: Support the ability to create commands, shells, and notebooks using configuration templates.
Configuration templates can now be used to reduce redundancy in more than just experiment configuration files. With this feature, users can move settings that are shared by many commands, shells, and notebooks into a single YAML file that can then be referenced by configurations that require those settings. See the configuration templates documentation for more details.
Default to "best-fit" instead of "worst-fit" scheduling fit policy.
Update scheduling behavior of trials that use distributed training.
Pack distributed tasks onto the fewest agents possible. For example, suppose your cluster has two 4-slot agents and four 2-slot agents, and you need to schedule an 8-slot trial. Previously, if one slot was already in use, PEDL could schedule the distributed trial across some mixture of the remaining 4-slot and 2-slot agents. With this change, PEDL now waits until both 4-slot agents are free before scheduling the trial.
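One way to reason about this behavior is to compare the fewest agents that could ever hold the trial with what the currently idle agents can offer. The sketch below reproduces the example above under two assumptions that are not stated in these notes: only fully idle agents are considered, and the trial waits unless the idle agents can match the cluster-wide minimum agent count. All names are hypothetical.

```python
from typing import Dict, List, Optional


def min_agents_needed(capacities: List[int], slots: int) -> Optional[int]:
    """Fewest agents whose combined capacity covers `slots`, or None."""
    total, count = 0, 0
    for cap in sorted(capacities, reverse=True):
        if total >= slots:
            break
        total += cap
        count += 1
    return count if total >= slots else None


def place_distributed(agents: Dict[str, int], used: Dict[str, int],
                      slots: int) -> Optional[List[str]]:
    """Return the agents to use, or None to keep waiting."""
    best = min_agents_needed(list(agents.values()), slots)
    # Assumption: a distributed trial only uses agents with no slots in use.
    idle = {a: cap for a, cap in agents.items() if used.get(a, 0) == 0}
    chosen, total = [], 0
    for name in sorted(idle, key=lambda a: (-idle[a], a)):
        if total >= slots:
            break
        chosen.append(name)
        total += idle[name]
    if total >= slots and len(chosen) == best:
        return chosen
    return None  # wait until a tighter packing becomes possible


cluster = {"big-1": 4, "big-2": 4, "small-1": 2, "small-2": 2,
           "small-3": 2, "small-4": 2}
print(place_distributed(cluster, {"big-1": 1}, 8))  # None: keep waiting
print(place_distributed(cluster, {}, 8))            # ['big-1', 'big-2']
```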
Improve documentation of Elastic AI infrastructure.
Improve documentation of hyperparameter search methods, especially around
Print more helpful error message when trying to create an experiment with an invalid config.
Fix bug that prevents using subnetwork in the network interface of GCP dynamic agents.
Fix bug that prevents the master from spawning any dynamic agents if there are only zero-slot tasks running, such as TensorBoard and GC tasks.
Improved distributed training performance for TensorpackTrial.
Release Date: September 6, 2019
Support HTTP proxy environment variables in agent.
See agent installation documentation for details.
Ignore byte-compiled Python files in model definitions by default.
Update creation time of generated files in notebooks.
Disallow model definition directories named "pedl".
Skip redundant evaluations during training in EstimatorTrial.
Fix bug preventing distributed trials from properly rendezvousing.
Release Date: August 30, 2019
New Feature: Support Tensorpack framework.
Deprecation Warning: Deprecate calling
Trial subclasses. Also remove support for overwriting
Support configuring logging level in the master configuration file (
Improve error handling when the master configuration is malformed.
Reduce maximum allowable context size by 1MB to 95MB.
Fix rare race condition for model definitions using the TFKerasTrial API.
WebUI: Default to "All experiments" filter.
Improve error handling when a keras.utils.Sequence object has zero length.
Release Date: August 23, 2019
New Feature: Dynamic Agents on GCP.
With Dynamic Agents on GCP, the PEDL master automatically provisions and terminates PEDL agent instances based on the number of slots needed by pending tasks and the current utilization of agent instances. See Dynamic Agents on GCP for details.
New Feature: Users.
The new users system allows for assets (e.g., experiments, notebooks, etc.) to be organized by owner. New features are described in the users documentation.
NOTE: Previous versions of the PEDL CLI are incompatible with the new PEDL master. Consequently, all users must upgrade to the latest version of the PEDL CLI.
When upgrading from an older version of PEDL, a new user account, pedl, will be automatically created; all existing assets will be assigned to it. By default, all interactions with PEDL (CLI and WebUI) will automatically authenticate as the pedl user, so no action is required of administrators for the system to function. Refer to the users documentation for instructions on changing this behavior, including how to create individual accounts.
Breaking Change: Fully remove support for deprecated Trial import paths.
pedl.frameworks.tensorflow_trial have been fully deprecated. Please update all model definitions to use
Improve error handling when reading model definition or context directories that are over 96MB in size.
Default to the JupyterLab view when launching a Jupyter notebook.
Prevent TensorBoard from starting when specifying more than 100 trials.
Improve support for distributed training with TF Keras API.
Fix a bug in the handling of hyperparameters in PBT searches.
Fix bug that caused killing an experiment to always result in a timeout.
WebUI: Add a legend to the cluster allocation chart.
Release Date: August 12, 2019
- Internal release
Release Date: August 8, 2019
New Feature: Dynamic Agents on AWS.
With Dynamic Agents on AWS, the PEDL master automatically provisions and terminates PEDL agent instances based on the number of slots needed by pending tasks and the current utilization of agent instances. See Dynamic Agents on AWS for details.
Breaking Change: Change the default TensorFlow version to 1.14.0 for all configurations. The previous default TensorFlow version was 1.13.1.
Use a Docker user-defined network instead of using --link for compatibility with Docker 19.x. Users upgrading PEDL should make sure to run make install to install the new systemd services.
pedl trial logs -f command exit after the trial it is logging reaches a terminal state.
Download prebuilt images from Docker Hub when possible, or build them from scratch otherwise.
Make the CLI print progress while zipping model and context directories.
Fix shell mode "Too many authentication failures" error by specifying identities for SSH agent authentication.
Limit the step budget for adaptive searches to 50000 and limit the number of trials for adaptive_simple searchers to 2000.
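The two limits can be expressed as a simple pre-flight check on a searcher configuration. In this sketch the searcher names and limit values come from the entry above, while the configuration keys `name`, `step_budget`, and `max_trials` and the function name are assumptions made for illustration.

```python
MAX_ADAPTIVE_STEP_BUDGET = 50000
MAX_ADAPTIVE_SIMPLE_TRIALS = 2000


def check_searcher_limits(searcher: dict) -> None:
    """Raise ValueError if a searcher config exceeds the limits above."""
    name = searcher.get("name")
    # Key names below are assumed, not taken from the release notes.
    if name == "adaptive" and searcher.get("step_budget", 0) > MAX_ADAPTIVE_STEP_BUDGET:
        raise ValueError("adaptive step budget exceeds 50000")
    if name == "adaptive_simple" and searcher.get("max_trials", 0) > MAX_ADAPTIVE_SIMPLE_TRIALS:
        raise ValueError("adaptive_simple trial count exceeds 2000")


check_searcher_limits({"name": "adaptive", "step_budget": 50000})  # passes
```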
Fix that adaptive and PBT searchers ignored
Avoid container name collisions across different agent instances.
- Internal release
- Internal release
- Internal release
- Internal release
Release Date: July 9, 2019
Breaking Change: Remove support for TensorFlow 1.10 and 1.11.
Support logging garbage collection tasks that have failed.
Add support for TensorFlow 1.14. The default TensorFlow version remains 1.13.1.
Improve support for distributed training with the TensorFlow Estimator API when using TensorFlow 1.14.
Fix bug preventing the use of context directories with notebooks, commands, and shell sessions.
- Internal release
Release Date: July 4, 2019
- Remove support for Python 3.6.8. All containers will use Python 3.6.9 by default.
Release Date: July 3, 2019
- Fix bug that prevented Kerberos support without an HDFS checkpoint storage configuration.
Release Date: June 27, 2019
Breaking Change: Specify experiment templates via the CLI rather than the experiment configuration.
Previously, experiment configuration templates were specified via the experiment configuration key template. As of this version of PEDL, the template key in the configuration will be ignored; users should specify a template as follows: pedl experiment create --template <template-name>.
Sort task list in CLI by task type and creation time.
Permit empty Keras sequences for Keras trials.
Fix IMDB Keras adaptive search example.
Release Date: June 13, 2019
New Feature: Support for keras.utils.Sequence objects and Python generators in KerasFunctionalTrial. See the data loading documentation for Keras trials for more details.
New Feature: Support for
PyTorchTrial. See the data loading documentation for PyTorch trials for more details.
BatchLoader interface support in PyTorchTrial has been removed in favor of
Release Date: June 6, 2019
Fix IMDB Keras example, which broke when NumPy 1.16.3 changed the default value of the allow_pickle field.
TensorBoard commands now verify trial/experiment existence before launching.
Fix bug that caused the master to exit when restoring experiments with invalid configurations.
Fix bug that prevented pedl experiment list-checkpoints from listing garbage collected checkpoints.
Release Date: May 30, 2019
WebUI: Add TensorBoard button for experiments. It launches TensorBoard or opens a preexisting TensorBoard instance for an experiment.
Fix error when displaying agents for tasks in the CLI.
Release Date: May 23, 2019
Fix regression that led to experiment failure if the default environment Docker images were deleted on the agent node.
Stop including tfevent files in checkpoints when experiments use the
Improve stability and reduce memory footprint during master restarts.
WebUI: Fix bug that caused trial logs to always jump to the bottom of the logs, even after manually scrolling up.
Release Date: May 16, 2019
New Feature: PEDL now offers the ability to create experiments using configuration templates. Configuration templates can be used to reduce redundancy in experiment configuration files. With this feature, users can move settings that are shared by many experiments into a single YAML file that can then be referenced by configurations that require those settings. See the “Configuration Templates” section of the documentation for more details.
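Conceptually, a template supplies shared settings and each experiment configuration adds or changes its own. The sketch below shows one plausible merge rule, in which nested mappings are merged recursively and the experiment's values win on conflict; the actual merge semantics are described in the Configuration Templates documentation and may differ. The function name and all keys and values in the example are hypothetical.

```python
from typing import Any, Dict


def merge_configs(template: Dict[str, Any], config: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay config on template; nested dicts merge, other values override."""
    merged = dict(template)
    for key, value in config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical settings, used only for illustration.
template = {"checkpoint_storage": {"type": "shared_fs", "path": "/mnt/ckpt"}}
experiment = {"description": "mnist", "checkpoint_storage": {"path": "/mnt/other"}}
print(merge_configs(template, experiment))
```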
Fix bug where an agent would pull all tags for a custom image instead of the latest tag.
Improve readability of trial log messages by inserting a delimiter "===" on trial start.
Print slot id on trial runner startup.
Test for the presence of an Nvidia driver installation by checking for the driver module directly instead of running the nvidia-smi command.
Improve the robustness of the pedl-db-backup command.
Remove the connection warning in trial logs when a trial runner terminates.
Release Date: May 9, 2019
- Minor cleanup and bug fixes.
Release Date: May 7, 2019
New Feature: Trial logs in the WebUI now update in real time.
When a trial log is initially opened, the view is scrolled to the bottom of the log (where the most recent log lines are displayed). If the trial is still active, additional log lines will continue to appear at the bottom of the view (scrolling is automatic) unless the user scrolls away from the bottom of the view. Should the user scroll up, automatic updating/scrolling will be disabled. To enable once again, a user can scroll to the bottom of the view.
New Feature: PEDL now offers the ability to launch TensorBoard to view trial metrics. See the "TensorBoard" section of the documentation for more details.
WebUI: Change the main page to display experiments, commands, and notebooks in separate tabs.
Fix bug causing PEDL Commands to not use prebuilt task environment images.
Fix bug causing Docker autoremove to result in failed experiments.
Fix bug causing TypeErrors in user code to be silently dropped.
The scheduler now uses a worst-fit policy when assigning tasks to agents in the cluster. This ensures that tasks will be placed preferentially on agents that are under-utilized, rather than "packing" tasks together on the smallest number of agents.
Release Date: April 22, 2019
- Fix bug that led to experiment failure if the default environment Docker images were deleted on the agent node.
Release Date: April 19, 2019
- Fix bug in pedl-pull-images that prevented pre-generated images from being pulled.
Release Date: April 18, 2019
Breaking Change: Remove explicit trial runner Docker images.
pedl-tr-py3.6-pytorch images are no longer distributed with PEDL because experiments create their environments on-demand with
Breaking Change: Remove the trial_environment key from the experiment configuration (deprecated in PEDL 0.8.9).
Instead of using the trial_environment key, an experiment configuration should use the
Breaking Change: Remove support for old Trial constructor interface (deprecated in PEDL 0.8.5).
Update version of TensorFlow.
Experiments now support TensorFlow 1.13.1 and CUDA 10.0 by default.
Release Date: April 5, 2019
New Feature: PEDL now offers the ability to launch Jupyter notebooks attached to one or more slots in the cluster. See the "Jupyter Notebooks" section of the documentation for more details.
Fix bug that crashed any model definitions that imported the cProfile standard Python library.
Breaking Change: Simplify PyTorch API.
PyTorchTrial class, the expected behavior of a model's forward() has been altered to be more consistent with native PyTorch models. Additionally, the signatures of the validation_metrics() methods have been modified, and a new losses() method has been added. See the model definition documentation for more details. PyTorch examples have been updated to use the new API.
Multi-GPU support for training PyTorch models.
PEDL can now transparently train PyTorch models using multiple GPUs if an experiment is configured to use multiple slots per trial. See the experiment configuration documentation for more details.
Release Date: March 19, 2019
New Feature: Configure experiment containers with the same method as PEDL commands.
It is now possible to configure an experiment's environment using the environment key, following the same semantics as with configuring PEDL commands. Additionally, the environment key for both experiments and PEDL commands now supports GPU- and CPU-specific tags under
New Feature: Running and pending commands now listed in the WebUI.
The experiments overview page now lists running and pending commands in the "Commands" section right under the "Finished Experiments" section.
Deprecation Warning: Experiment configuration key trial_environment has been deprecated in favor of
For backwards compatibility, trial_environment is still supported in this version of PEDL. It will be removed in a future version.
Release Date: March 14, 2019
Documentation: Add data loaders tutorial and example data loaders.
New Feature: Add listing of running or pending commands.
The PEDL CLI command cmd can now take the list argument, which displays a list of running and pending commands.
Update version of PyTorch.
The trial runners now include PyTorch 1.1.0.
Release Date: March 7, 2019
New Feature: Support for grid search.
There is a new grid option for hyperparameter search. The MNIST examples have been updated with sample grid search experiment configuration files. Please see the grid search documentation for more details.
New Feature: Quick start guide in documentation.
Check out the quick start guide!
New Feature: Support for per-experiment weights in scheduler.
It is now possible to specify a weight for each experiment using the resources.weights field, defaulting to 1; each active experiment will be allocated a number of slots that is approximately proportional to its weight. The weight of an existing experiment can be set via the CLI (pedl experiment set weight <id> <weight>).
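To illustrate what "approximately proportional" can mean, here is a minimal sketch that splits a fixed number of slots among experiments by weight. The function name and the largest-remainder rounding rule are assumptions for illustration; the scheduler's actual rounding behavior is not described in these notes.

```python
def proportional_slots(weights, total_slots):
    """Split total_slots among experiments in proportion to their weights.

    Fractional shares are rounded with the largest-remainder method so that
    the shares always add up to total_slots.
    """
    total_weight = sum(weights.values())
    exact = {exp: total_slots * w / total_weight for exp, w in weights.items()}
    shares = {exp: int(x) for exp, x in exact.items()}
    leftover = total_slots - sum(shares.values())
    # Hand out the remaining slots to the largest fractional remainders.
    by_remainder = sorted(exact, key=lambda e: exact[e] - shares[e], reverse=True)
    for exp in by_remainder[:leftover]:
        shares[exp] += 1
    return shares


even = proportional_slots({"exp-1": 1, "exp-2": 1, "exp-3": 2}, total_slots=8)
uneven = proportional_slots({"exp-1": 1, "exp-2": 2}, total_slots=8)
print(even, uneven)
```

With weights 1, 1, and 2 on an 8-slot cluster, the experiment with weight 2 receives half of the slots.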
Upgrade to TensorFlow 1.12 in default TF trial runner image.
Release Date: February 22, 2019
New Feature: Support specifying bind mounts for PEDL commands.
PEDL commands now take a --volume <host path>:<container path> argument that mounts a path on the agent machine as a path in the command container (e.g., --volume /shared-fs:/shared-fs). Multiple mounts can be indicated with multiple
New Feature: Support the ability to maintain callback state.
To use callbacks that maintain state, please implement the load() functions in the
ReduceLROnPlateau callback when used with Keras simple model definitions.
The semantics of ReduceLROnPlateau are slightly modified when used in PEDL. Please see the Keras simple model definition documentation for more details.
Breaking Change: Support multi-input multi-output PyTorch models.
loss() method has been removed and the signatures of the validation_metrics() methods have been modified. See the model definition overview for more information. PyTorch examples have been updated to use the new API. The MNIST example now contains a multi-output example as well.
Breaking Change: Update the CLI.
The pedl trial list and pedl checkpoint list commands have been moved to pedl experiment list-trials and pedl experiment list-checkpoints, respectively. The new names may be abbreviated pedl e lt and pedl e lc.
Fix bug that prevented creating experiments with a
Re-enable support for the
Update default trial runner images to use SciPy 1.2.1 and Keras-Preprocessing 1.0.9.
Upgrade to Postgres 10.7.
Release Date: February 14, 2019 💕
New Feature: Support for non-graceful termination of active PEDL commands.
PEDL commands can now be terminated immediately with pedl cmd kill <task_id>. The task id can be found by either listing slots (pedl slot list) or listing tasks (pedl task list).
Deprecation Warning: Data loaders have been decoupled from the
validation_loader arguments have been removed from the constructors to TensorFlowTrial. Previously, the constructors for these classes were def __init__(self, training_loader, validation_loader, hparams). Now, they should be def __init__(self, hparams).
For backwards compatibility, the old interface is still supported in this version of PEDL. It will be removed in a future version.
Deprecation Warning: Experiment configuration key checkpoint_storage.checkpoint_path has been deprecated in favor of checkpoint_storage.storage_path.
pedl cmd kill currently does not support killing commands that are still pulling or building; in that case, the kill may be postponed until the Docker build steps have finished and the task starts running.
Fix bug that prevented use of quoted commands with pedl cmd run.
pedl cmd run "echo hello && echo world" should now work as intended.
Update the default trial runner images to use TensorFlow 1.11.0 and CuDNN 7.4.
WebUI: Make pressing the escape key close any open modal.
WebUI: Remove support for creating experiments; use the CLI (pedl experiment create) to create experiments.
Documentation: Add an FAQ section and an overview page on hyperparameter search methods.
Fix bug when using a non-default WORKDIR in custom Docker images with PEDL commands.
Release Date: February 5, 2019
Fix scheduler bug that led to an indefinite hang with pedl slot list or pedl agent list.
Display PEDL command description when starting a command with pedl cmd run.
Improve organization of documentation by splitting "PEDL Overview" into multiple pages.
cli: Return an error message from pedl experiment kill if the experiment is not active.
Release Date: January 30, 2019
Breaking Change: Move namespace of TensorBoard callback to
Modify dependency installation order of operations when a custom Docker base image is specified.
When a custom Docker base image is specified, runtime_commands are now installed after injecting PEDL harness code and installing PEDL harness dependencies. These configurations can be used to override PEDL harness dependencies, if needed.
Release Date: January 29, 2019
- Fix a bug in a database migration when converting checkpoints to a new internal format.
Release Date: January 29, 2019
New Feature: Track file sizes of checkpoints.
pedl checkpoint list will now display the size of each checkpoint. The sizes for checkpoints computed before this version of PEDL are not computed retroactively and default to 0.
New Feature: Add ability to non-gracefully terminate an experiment with pedl experiment kill.
Killing an experiment terminates it immediately by killing all of its associated trials. pedl experiment kill does not checkpoint each trial before terminating it, so this command should be used with care. To gracefully terminate an experiment, please use pedl experiment cancel.
cli: Remove device UUID from display name when showing slots via pedl slot list.
cli: Fix bug that displayed an incorrect response message when killing a trial via pedl trial kill.
cli: Fix bug in pedl slot disable.
Fix bug in setting a default value for checkpoint storage configuration checkpoint_path on experiment creation.
Release Date: January 28, 2019
NOTE: This release changes the command-line syntax of the PEDL CLI. See below for details.
NOTE: This release includes significant changes to the internals of the PEDL master. As a result, running experiments cannot be upgraded from previous versions of PEDL. Before upgrading to PEDL 0.8.0, please cancel all running experiments. (Warm-starting of old experiments with the upgraded PEDL master should continue to work.)
Breaking Change: Port PEDL master to Go, improve scalability.
The PEDL master has been reimplemented in Go. In addition, the master uses a new approach to managing concurrent operations. In concert, these changes should result in substantial improvements to the master's performance and its robustness under heavy load. As noted above, running experiments cannot be upgraded from previous versions of PEDL.
Breaking Change: Change command-line syntax for PEDL CLI.
The CLI has been changed to use a consistent pedl <noun> <verb> syntax. For example, creating an experiment was previously done via pedl create; the new syntax is pedl experiment create, which can be shortened to pedl e create. The previous command-line syntax is no longer supported. The new CLI also supports tab-completion. To enable it, run eval "$(register-python-argcomplete pedl)".
New Feature: Add support for executing arbitrary commands.
PEDL now supports running arbitrary commands on agent machines. This feature is intended to support workflows that do not easily fit into the standard experiment workflow. Commands can be started using pedl cmd run. For more information, see the documentation.
Breaking Change: Reject experiment config files with unrecognized keys.
Previously, PEDL would accept experiment configuration files with unrecognized keys. Such keys were ignored, so typos in the config file could result in confusing behavior. In this release of PEDL, unrecognized keys in configuration files are rejected. As a special case, arbitrary keys are allowed under the data top-level key. Users that wish to include custom directives in their experiment configuration files should move those directives to the
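The strict-key behavior described in this entry can be sketched as follows. This is only an illustration: KNOWN_TOP_LEVEL_KEYS is a hypothetical subset of configuration keys, and the validate function is not the master's actual validation code.

```python
# Hypothetical subset of recognized top-level keys, for illustration only.
KNOWN_TOP_LEVEL_KEYS = {
    "description",
    "data",
    "hyperparameters",
    "searcher",
    "checkpoint_storage",
}


def unrecognized_keys(config):
    """Return the top-level keys that are not recognized, in sorted order."""
    return sorted(key for key in config if key not in KNOWN_TOP_LEVEL_KEYS)


def validate(config):
    """Reject configs with unknown top-level keys.

    The contents of the "data" key are never inspected, so arbitrary
    user-defined directives can be placed there.
    """
    bad = unrecognized_keys(config)
    if bad:
        raise ValueError(f"unrecognized keys in experiment config: {bad}")


good_config = {
    "description": "mnist",
    "data": {"my_custom_directive": 42, "download_dir": "/tmp/data"},
    "hyperparameters": {"learning_rate": 0.01},
}
typo_config = {"description": "mnist", "hyperparamters": {"learning_rate": 0.01}}

validate(good_config)  # passes: custom directives live under "data"
print(unrecognized_keys(typo_config))
```

Rejecting the misspelled key up front surfaces the typo at experiment creation time instead of silently ignoring the hyperparameters.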
Breaking Change: Adopt Keras naming convention for validation metric names in the Simple Keras API.
Validation metrics will automatically be prefixed with val_, for consistency with the naming convention for validation metrics used by Keras itself.
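As a small illustration of the naming convention, the sketch below applies the val_ prefix to a dictionary of metrics. The helper name is hypothetical; PEDL applies the prefix internally.

```python
def prefix_validation_metrics(metrics):
    """Return a copy of metrics with every name prefixed by "val_"."""
    return {f"val_{name}": value for name, value in metrics.items()}


reported = prefix_validation_metrics({"loss": 0.25, "acc": 0.9})
print(reported)
```

Code that refers to a validation metric by name (for example, a searcher metric) would then use the prefixed name such as val_loss.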
Add support for exporting TensorFlow Estimator trials to the
Model definitions that use the TensorFlow Estimator API can now implement an optional API, build_serving_input_receiver_fns, to support exporting the model to the
Disable support for the
Support for PBT will be reintroduced in a future release of PEDL.
Remove support for "system dump".
Shrink size of PEDL agent container image.
Upgrade to Postgres 10.6.
Update the agent and trial runner container images to use Python 3.6.8.
Release Date: December 13, 2018
- Improve robustness of HDFS checkpointing logic to retry-able failures.
Release Date: December 12, 2018
Breaking Change: New
validation_step_callbacks() interface for standard trial definitions have been removed with this PEDL version. In its place, the pedl.callback.Callback() API can be used to execute Python functions at the beginning and/or end of training and/or validation steps. See the "Callbacks" section in the PEDL overview documentation for more details.
pedl.callback.TensorBoard to simplify TensorBoard integration.
See "TensorBoard Integration" in the PEDL overview documentation for an example of how to integrate TensorBoard into your workflow.
Remove support for
Fix bug in EstimatorTrial that caused long-running trial runner containers to consume unbounded disk space.
Release Date: December 3, 2018
Add a workaround for a TensorFlow memory leak bug when using EstimatorTrial.
Previously, the physical memory of a PEDL trial runner could grow unboundedly when using the EstimatorTrial API with certain types of tf.train.Optimizer instances. This version of PEDL includes a monkey-patched version of TensorFlow to address this issue until an upstream fix is merged by the TensorFlow team. Please see https://github.com/tensorflow/tensorflow/issues/24047 for a full bug report.
Release Date: November 29, 2018
Add scripts to simplify backing up and restoring PEDL's metadata database.
These scripts are named
Upgrade to Keras 2.2.4 in the default trial runner base image.
Workaround bug in Keras when using
A bug in Python's multiprocessing module resulted in hangs when used with Keras simple model definitions in some situations. This release of PEDL includes a workaround for the underlying
Fix error when garbage collecting checkpoints stored on Kerberos-enabled HDFS file systems.
Release Date: November 15, 2018
- Prevent swallowing of the full traceback in trial logs when model definition code raises a
Release Date: November 10, 2018
New Feature: Add support for TensorFlow's
Users can now use the `EstimatorTrial` interface to train [Premade](https://www.tensorflow.org/guide/premade_estimators) or [Custom](https://www.tensorflow.org/guide/custom_estimators) `tf.estimator.Estimator`s with PEDL. A new example model definition using this interface (`mnist_estimator`) has been added to the [examples](examples) page. Please see documentation for a full description of the API.
New Feature: Add support for HDFS checkpointing with Kerberos enabled.
Users can add the `kerberos: true` configuration to the `checkpoint_storage` section when `type` is `"hdfs"` to enable Kerberos mode. When using this feature, users may also need to configure the `security/kerberos/config_file` to point to a valid Kerberos configuration file location for each agent.
New Feature: Add support for preconfiguring the trial runner environment with a bash script.
When PEDL detects a file named `pedl-prepare-env.sh` at the top-level of a model definition directory, it will execute this script during startup of the trial runner container. Note that this script is executed _before_ the trial runner executes any model definition code with the Python interpreter.
Web UI: Fix bug that prevented the trial detail modal from appearing when a metric name had certain special characters (e.g.
Support for Python 3.7 compatibility with the PEDL CLI.
Release Date: November 5, 2018
WebUI: Add support for filtering experiments with canceled or errored states.
New Feature: Ensure that the Keras TensorBoard callback serializes validation metrics when using Simple Keras Model Definitions.
When used in previous versions of PEDL, the Keras TensorBoard callback would only serialize training metrics.
New Feature: Add support for validation callbacks when using Simple Keras Model Definitions.
Previous versions of PEDL would only execute callbacks during training steps. See documentation on the KerasValidationCallback class for more details.
Add utility functions for referencing the current PEDL context.
pedl.get_experiment_id() have been added as utility functions to be used anywhere in model code.
Experimental: Initial support for HDFS checkpointing.
HDFS support is undocumented in this release of PEDL—please consult with the Determined AI team before using.
Update the master, agent, and trial runners to use Python 3.6.7.
WebUI: Simplify the "Create New Experiment" and "Continue Training Workflow" modals.
Previous versions of PEDL displayed richly formatted fields for each experiment configuration option, but only supported a subset of the available top-level options. This release of PEDL instead uses a single large text area in which the raw experiment configuration YAML can be edited directly.
Release Date: October 11, 2018
New Feature: Add support for custom base Docker images.
This release of PEDL introduces support for specifying a custom Docker base_image in the experiment configuration. The base_image should be accessible to all agent nodes via docker pull. If a private image is used, Docker Registry credentials must be specified in the registry_auth section in the experiment configuration. The maintainer of the custom base image is responsible for installing PEDL dependencies; see the Custom Docker Base Images section in documentation for a full list of dependency requirements.
New Feature: Add
This flag allows users to download the listed checkpoints for any experiment configured with S3 checkpoint storage. This flag can be used in tandem with the --best flag to download the top N checkpoints for an experiment.
WebUI: Display the best validation metric in addition to the latest validation metric for all trials.
Release Date: October 2, 2018
New Feature: Add support for optionally associating Git metadata with an experiment.
pedl create --git will look for a Git repository in the model definition directory to save metadata associated with the current Git commit and the remote URL of the current upstream branch. If an experiment is created with the --git flag, the Web UI will display the Git commit, committer, commit date, and link to the upstream remote URL. This feature assumes that any commits in the local repository also exist in the upstream remote repository.
Release Date: September 27, 2018
New Feature: Add support for automatically taking checkpoints when the validation performance of an experiment improves.
This release of PEDL introduces a new experiment config option, checkpoint_policy. Using the default policy (best), PEDL will checkpoint any trial whenever its validation performance exceeds the previous best validation performance for this experiment. The all checkpoint policy causes PEDL to take a checkpoint after every validation operation; the none policy results in no additional checkpoints being taken. Note that checkpoints might still be taken for other reasons: for example, if the min_checkpoint_period option is enabled, or if a trial is moved from one slot to another by the scheduler.
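The three policies can be sketched as a simple decision function. The function names and the smaller_is_better parameter are hypothetical, and the sketch ignores the other reasons a checkpoint might be taken (such as min_checkpoint_period or task migration).

```python
def should_checkpoint(policy, metric, best_so_far, smaller_is_better=True):
    """Decide whether a validation result triggers a checkpoint."""
    if policy == "all":
        return True
    if policy == "none":
        return False
    if policy == "best":
        if best_so_far is None:
            return True  # the first validation is the best seen so far
        if smaller_is_better:
            return metric < best_so_far
        return metric > best_so_far
    raise ValueError(f"unknown checkpoint policy: {policy}")


def checkpoints_taken(policy, metrics, smaller_is_better=True):
    """Replay a sequence of validation metrics and record each decision."""
    best = None
    decisions = []
    for metric in metrics:
        decisions.append(should_checkpoint(policy, metric, best, smaller_is_better))
        improved = best is None or (metric < best if smaller_is_better else metric > best)
        if improved:
            best = metric
    return decisions


validation_losses = [0.9, 0.7, 0.8, 0.6]
print(checkpoints_taken("best", validation_losses))
```

Under the best policy, the third validation (0.8) does not beat the previous best (0.7), so it does not trigger a checkpoint.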
Change scheduler to favor spreading tasks around the cluster.
In previous versions of PEDL, the scheduler attempted to pack tasks on a subset of the cluster. This policy has some advantages: for example, it can result in leaving entire agent machines idle, which then allows those machines to be deactivated or used for a future multi-GPU job. However, this packing behavior can also be problematic: placing additional jobs on the same machine can result in contention for other resources on that host (e.g., CPU or I/O). This release of PEDL changes the scheduler to spread tasks around the cluster when possible; two tasks will only be placed on the same machine if there are no agents that are completely idle.
Add support for
--best N flag is specified, pedl list-checkpoints will return the "best" N checkpoints, according to the experiment's configured validation metric. Checkpoints that do not have an associated validation operation will be omitted.
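A minimal sketch of the selection rule described above. The dictionary field names (uuid, validation_metric) and the smaller_is_better parameter are assumptions for illustration, not the CLI's actual data model.

```python
def best_checkpoints(checkpoints, n, smaller_is_better=True):
    """Return the n best checkpoints by validation metric.

    Checkpoints without an associated validation metric are omitted.
    """
    validated = [c for c in checkpoints if c.get("validation_metric") is not None]
    validated.sort(key=lambda c: c["validation_metric"], reverse=not smaller_is_better)
    return validated[:n]


checkpoints = [
    {"uuid": "a", "validation_metric": 0.30},
    {"uuid": "b", "validation_metric": None},  # no validation was run
    {"uuid": "c", "validation_metric": 0.10},
    {"uuid": "d", "validation_metric": 0.20},
]
top_two = [c["uuid"] for c in best_checkpoints(checkpoints, n=2)]
print(top_two)
```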
Improve compatibility for Keras callbacks when using the simple model API.
Release Date: September 18, 2018
WebUI: Fix bug in the Continue Training workflow when using Keras simple model definitions.
WebUI: Fix bug in the Continue Training workflow when using nested hyperparameters.
Release Date: September 17, 2018
- Fix bug in Keras simple model definitions when no user-defined metrics are passed to
Release Date: September 13, 2018
New Feature: Support for population-based training (PBT).
Refer to the documentation to see how to use PBT with PEDL.
Breaking Change: Validation functions for Keras models should now operate on tensors, rather than NumPy arrays.
For trials using the KerasFunctionalTrial classes, validation functions should now have TensorFlow tensors for their arguments and return types, as with the current version of TensorFlowTrial. The new API is not backward-compatible with the old API: any PEDL models that use either Keras trial class will need to be updated. The mnist_keras_functional examples demonstrate how to use the new API.
Release Date: September 6, 2018
New Feature: Support for filtering experiments by multiple labels.
On the experiment list page, it is now possible to enter multiple experiment labels at the same time; only experiments that have all of the labels will be shown. Type in a label and press 'enter' to add it to the list of labels to filter by; when the text input is empty, press the left and right arrow keys to select an existing label and 'backspace' to remove it from the list.
WebUI: Add API reference documentation.
This documentation is available via the "API Reference" link at the top of any page in PEDL.
WebUI: Fix ability to specify source trial ID in create experiment modal.
WebUI: Fix links to examples in the main documentation.
Release Date: August 23, 2018
New Feature: Persist experiment state across master crashes.
In previous releases of PEDL, a crash in the master would cause all running or paused experiments to enter an error state; now, trials can resume from their last checkpoints after a crash.
New Feature: Support for disabling and enabling agents to allow seamless cluster upgrades.
This release adds the pedl enable-agent CLI commands, which disable and enable scheduling of tasks on agents. Disabling all agents and waiting for existing jobs to finish allows the cluster to be restarted without losing any work.
New Feature: Support for previewing hyperparameter searches.
This release adds the pedl preview-search CLI command, which simulates a run of the given searcher configuration and prints a summary of the training steps that it schedules.
New Feature: Support for
If there is a file called .pedlignore in the top level of a model definition directory passed to pedl create, it is now treated as a list of patterns (in the same style as .gitignore) to exclude from the upload to the master.
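The sketch below shows how a pattern list of this kind can be used to filter files before upload. It is a simplified approximation built on Python's fnmatch module: it handles only plain glob patterns, comments, and trailing-slash directory patterns, not the full .gitignore syntax (for example, negation with "!"). The function names are hypothetical and this is not PEDL's actual matching code.

```python
import fnmatch


def load_patterns(text):
    """Parse ignore-file text into patterns, skipping blanks and comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def is_ignored(rel_path, patterns):
    """Return True if a '/'-separated relative path matches any pattern."""
    parts = rel_path.split("/")
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: ignore everything under a directory of that name.
            if pattern.rstrip("/") in parts[:-1]:
                return True
        elif fnmatch.fnmatch(rel_path, pattern) or fnmatch.fnmatch(parts[-1], pattern):
            return True
    return False


patterns = load_patterns("*.pyc\n# local datasets\ndata/\n")
files = ["model_def.py", "cache/x.pyc", "data/train.csv", "README.md"]
uploaded = [f for f in files if not is_ignored(f, patterns)]
print(uploaded)
```

Excluding large local files such as datasets keeps the upload to the master small.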
We have changed how we generate our documentation to improve styling, navigation, and search.
Show experiment and trial IDs in the trial detail modal.
Release Date: August 9, 2018
New Feature, Breaking Change: New API for writing TensorFlow models.
This version of PEDL introduces a rewrite of TensorFlowTrial, the base class for PEDL models that use TensorFlow. The new TensorFlowTrial supports models with multiple inputs and outputs, supports validation functions on tensors (improving performance), and fixes other limitations of the previous TensorFlowTrial API. The new API is not backward compatible with the old API: any PEDL models that use TensorFlow will need to be updated. The mnist_tf example distributed with PEDL has been updated to use the new API.
New Feature: Support for experiment labels.
A label is an arbitrary string that can be associated with an experiment; each experiment can have a set of labels. Labels can be used to organize experiments and identify groups of experiments that have similar properties. Labels can be added and removed via the CLI (pedl label) or the Web UI.
Improve compatibility with recent versions of Kubernetes.
Cleanup and refactoring of PEDL fault tolerance logic.
Release Date: August 6, 2018
- Fix incompatibility in the aiodocker library to support Docker >=
Release Date: August 2, 2018
Experimental: Support for "simple" model definitions.
In previous releases of PEDL, model definitions were required to implement a custom Trial API. This API is how PEDL implements support for hyperparameter searches, automatic checkpointing, workload migration between agents, and metadata capture. However, this approach requires modifying model code to implement this API, which can be inconvenient when running "off-the-shelf" models.
This release of PEDL introduces experimental support for "simple" model definitions. This feature allows PEDL to run unmodified model code: features like automatic checkpointing are supported by intercepting calls to certain framework APIs. This feature is currently only supported for models written with Keras that use the fit_generator API. To access hyperparameters, a new optional API has been introduced, pedl.get_hyperparameter(). For more information, see the documentation and the
New Feature: Improved trial fault tolerance.
PEDL's support for handling trial failures has been substantially refactored. The main user-visible change is that when a trial fails, only that trial will need to be restarted; other trials in the same experiment will continue running without interruption. This change also fixes several corner-case bugs and lays the groundwork for supporting master fault tolerance in a future release of PEDL.
This release also changes the semantics of the max_restarts configuration parameter: previously, this parameter defined the number of times that an experiment would be restarted after a failure of any one of the experiment's trials. It now defines the maximum number of times that any one trial can fail before the entire experiment is aborted (i.e., it is now a per-trial counter, not a per-experiment counter).
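The per-trial counting can be sketched as below. The RestartTracker class is hypothetical, and the exact boundary is an assumption: here a trial may be restarted up to max_restarts times, and the next failure of that same trial aborts the experiment.

```python
from collections import Counter


class RestartTracker:
    """Track failures per trial rather than per experiment."""

    def __init__(self, max_restarts):
        self.max_restarts = max_restarts
        self.failures = Counter()

    def record_failure(self, trial_id):
        """Record a failure of one trial.

        Returns True if the trial may be restarted, or False if the trial
        has exhausted its restarts and the experiment should be aborted.
        """
        self.failures[trial_id] += 1
        return self.failures[trial_id] <= self.max_restarts


tracker = RestartTracker(max_restarts=2)
outcomes = [
    tracker.record_failure(trial_id=1),
    tracker.record_failure(trial_id=1),
    tracker.record_failure(trial_id=2),  # counted separately from trial 1
    tracker.record_failure(trial_id=2),
    tracker.record_failure(trial_id=1),  # third failure of trial 1
]
print(outcomes)
```

With a per-experiment counter, the third failure overall would already have exceeded the limit; with a per-trial counter, only the third failure of the same trial does.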
New Feature: Add default checkpoint GC policy.
In previous releases of PEDL, checkpoint GC was not performed by default. In this release, all experiments will have a checkpoint GC policy by default (
Update versions of several dependencies.
The trial runners now include Keras 2.2.2, PyTorch 0.4.1, and NumPy 1.15.0.
Release Date: July 26, 2018
Experimental: Support for PyTorch models.
PyTorch models are written by subclassing the abstract class PyTorchTrial and specifying a
determinedai/pedl-tr-py3.6-pytorch in the experiment config file. See examples/mnist_pytorch for a complete example.
PyTorch models in PEDL currently do not support multi-GPU training.
New Feature: Support for abruptly killing trials.
A new CLI sub-command,
pedl kill-trial, has been added. This immediately terminates the container associated with the specified trial ID. Note that once the trial's current container has been terminated, the trial will typically be restarted in a different container (due to PEDL's support for automatic experiment fault tolerance).
pedl describe --metrics, display all validation metrics of an experiment.
Previously, only the metric used by the experiment's search method was displayed.
Release Date: July 19, 2018
Upgrade to TensorFlow 1.9.0 in the default trial runner.
Note that as a result of this change, the version of tf.keras has been upgraded from 2.1.2 to 2.1.6.
Make base_image optional in experiment configurations.
If not specified, the
determinedai/pedl-tr-py3.6-tf. Coincidentally, that is currently the only legal value for
Improve Python 3.5 compatibility.
Release Date: July 13, 2018
Upgrade to Keras 2.2.0 in the default trial runner.
Fix bug in fault tolerance logic when an experiment that is being canceled encounters an error.
More aggressively schedule new work when an experiment's max_slots limit is changed.
Release Date: July 12, 2018
Breaking Change: When using bind mounts or shared_fs checkpoints, the specified host_path must already exist. In previous versions of PEDL, bind mounts could use host_paths that did not previously exist on the host file system.
Breaking Change: The mechanism for specifying read-only bind mounts has changed. In previous versions of PEDL, the mode parameter was used; in this release of PEDL, a new parameter read_only should be used instead.
WebUI: Support for plotting multiple training metrics in trial "detail" view.
In previous releases of PEDL, the trial detail view only supported displaying validation metrics and training loss. As with the plot of training loss, the plot of other training metrics displays the mean value of the training metric for each step.
cli: Support for "test mode" when creating experiments.
When an experiment is created using pedl create --test_mode, PEDL will run only a single trial of the experiment, and this trial will only be trained for a single step. Then validation metrics will be computed, and a checkpoint of the trial will be taken. Finally, the experiment will be archived, and the experiment's checkpoint will be garbage collected. This feature is intended to support rapid iteration during the initial phase of developing a new model.
cli: Support for changing an experiment's
cli: Report multiple training metrics in
In previous releases of PEDL, pedl describe --metrics only reported training loss. As with training loss, the CLI will report the mean value of the training metric for each step.
Support saving checkpoints to arbitrary subdirectories of a shared file system.
When saving checkpoints to a shared_fs, the new configuration parameter checkpoint_path can be used to control the subdirectory on the shared file system where checkpoints will be placed.
Support for configuring bind propagation for bind mounts via a new
Release Date: July 3, 2018
New Feature: Support for recovering from trial and agent failures.
In previous versions of PEDL, the failure of any trial within an experiment would cause the entire experiment to fail. Similarly, if an agent crashed, all experiments that were running any trials on that agent would be marked as failed.
PEDL now supports recovering from trial and agent failures by automatically re-running failed workloads. This improves PEDL's tolerance of transient faults, such as network failures or out-of-memory errors. If an error occurs while running a trial, PEDL will restart the execution of that trial from its most recent checkpoint (if any). Note that since deep learning workloads are not deterministic in general (see the discussion of reproducibility for more details), any metadata that was recorded after the last checkpoint will be deleted (and subsequently recomputed). The maximum number of times an experiment will be restarted to recover from failures is controlled by max_restarts, which defaults to 5. This parameter ensures that PEDL does not go into an infinite loop if an experiment encounters the same error repeatedly.
This is useful for examining the state of the task scheduler.
cli: Fix error in pedl describe --metrics.
WebUI: Round validation metrics to at most five decimal digits in the list of trials for an experiment.
WebUI: Improve scaling of the X-Axis in the experiment-level validation metric plot to match the length of the experiment.
WebUI: Improve responsiveness when activating, pausing, archiving, and unarchiving experiments.
Improve error handling when a trial returns an invalid validation metric value (e.g.,
Improve reporting of training metrics in
Previously, only the weighted sum loss of a multi-loss model was reported as a training metric "loss". Starting in this version of PEDL, the values of each loss function in addition to the weighted sum loss will be reported as training metrics. Training metric history is accessible via pedl describe --metrics. However, the Web UI trial detail view graph will continue to only display a single training metric "loss" (the weighted sum loss).
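As an illustration of the reported values, the sketch below builds a training-metrics dictionary from per-output losses and their weights. The function name, the loss names, and the weights are hypothetical.

```python
def training_metrics(losses, loss_weights):
    """Report each individual loss plus their weighted sum under "loss"."""
    metrics = dict(losses)
    metrics["loss"] = sum(loss_weights[name] * value for name, value in losses.items())
    return metrics


# Hypothetical two-output model with one loss per output.
reported = training_metrics(
    losses={"digit_loss": 0.5, "parity_loss": 0.2},
    loss_weights={"digit_loss": 1.0, "parity_loss": 0.5},
)
print(reported)
```

Here the weighted sum is 1.0 * 0.5 + 0.5 * 0.2 = 0.6, reported alongside the two individual losses.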
Release Date: June 21, 2018
New Feature, Breaking Change: Support for multiple inputs and multiple loss functions in
In previous versions of PEDL, models using KerasFunctionalTrial could specify multiple outputs, but only a single input and a single loss function. This limitation has been lifted.
As a result, the API for validation metric functions in KerasFunctionalTrial has changed: previously, the second argument to a metric function was an np.ndarray containing the true labels for the validation set. In this release of PEDL, the second argument is now a dictionary that maps layer names to
New Feature: Support for archiving experiments.
PEDL now supports archival of completed experiments. When an experiment is archived, all experiment metadata is preserved but the experiment is hidden by default from the WebUI and the list of experiments returned by pedl list. Both the PEDL CLI and WebUI now include options to enable display of archived experiments.
New Feature: Support for multiple Python packages when creating experiments.
When creating an experiment using pedl create, users can now specify one or more additional Python packages via the --package flag. Packages should be provided as source distributions, e.g., a ZIP or TAR archive created by python setup.py sdist in a Python project that uses
This feature allows models to use dependencies on the user's local file system; network-accessible dependencies can also be downloaded using the runtime_packages feature supported by previous versions of PEDL.
New Feature: Support for plotting per-trial validation metrics in Web UI.
In previous versions of PEDL, the "Details" modal contained a plot of per-step average training loss; this dialog has been improved to also support plotting any of the experiment's validation metrics.
Fix rounding error in fair-share scheduling logic.
Release Date: June 14, 2018
Add support for warm-starting from arbitrary checkpoints.
Previously, it was only possible to warm start from the latest checkpoint associated with a particular source trial ID. In this release of PEDL, experiments can also be warm started from a specific checkpoint using the
cli: Add validation metric to
This is useful because checkpoints with an associated validation are treated differently by the garbage collector.
This displays the configuration of an experiment in YAML format.
WebUI: After creating a new experiment, navigate to it.
Experiments now default to the
There was previously no default value for this configuration parameter.
Improve scalability of
Improve validation of priorities in experiment configuration.
Improve logging when processing changes to experiment priority, GC policy, and description.
Release Date: June 8, 2018
Fix bug when garbage-collecting
The result of this bug is that garbage-collecting
shared-fs checkpoints resulted in marking the checkpoint as
DELETED in the PEDL database, but the actual checkpoint storage would not be removed correctly.
Release Date: June 7, 2018
New Feature: Add a CLI command to update the checkpoint GC policy of an experiment.
pedl set-gc-policy can be used to update the checkpoint GC policy of running or finished experiments. For example, this can be used to reduce the storage consumed by historical experiments.
New Feature: Add support for changing the description of an existing experiment.
Experiment descriptions can be changed via a new CLI command,
Improve performance of
pedl list-trials CLI command.
Improve error handling of experiment decoding errors in the Web UI.
In previous versions of PEDL, the WebUI failed to display if any of the experiments in the database contained an invalid configuration. In PEDL >=
0.5.9, this error handling is improved—experiments with invalid configurations will be omitted (with a user-facing error prompt) from the Web UI instead of causing a fatal error.
Fix rare race condition on experiment shutdown.
Fix bug when using multi-GPU models instantiated with the
Release Date: June 5, 2018
NOTE: Temporarily disable experiment priorities. This release of PEDL ignores the
priority field in experiment configurations. This is a temporary change; support for experiment priorities will be restored shortly.
Reject experiment configurations with non-default
PEDL does not currently support experiments that use custom Docker base images.
Improve scalability of
Improve logging for training and validation data loaders.
Release Date: May 31st, 2018
Properly handle containers that crash without an active workload.
Fix bug in master experiment shutdown logic.
Fix race condition in agent during container exit.
Improve performance when retrieving trial logs.
WebUI: Add metric value to tooltip in experiment validation plot.
Fix bug in using an
adaptive search with a
Fix bug when garbage collecting experiments with failed validations.
Avoid low-probability agent crash due to container name collision.
websockets library to version 5.0.1 in trial runners.
Release Date: May 29, 2018
Fix error in scheduler that occurs if the total number of cluster slots changes.
websockets library to version 5.0.1.
Release Date: May 24, 2018
New Feature: Add support for garbage collecting checkpoints.
When an experiment finishes, the system may optionally delete some checkpoints to reclaim space. See
checkpoint_storage in the experiment configuration documentation for details on how to configure an experiment with this feature enabled. By default, all checkpoints are saved.
Add support for absolute imports in multiple file model definitions.
Improve performance of scheduler when scheduling many large experiments with large numbers of trials.
Release Date: May 23, 2018
New Feature: Add support for a system dump command in CLI.
The pedl system-dump command generates a large zip file with cluster logs, data, and statistics.
Significantly improve performance of scheduler when scheduling experiments with large numbers of trials.
Fix bug in using subdirectories with multi-file model definitions.
Fix bug in
pedl describe with terminal experiments.
websockets Python library to version
Release Date: May 17, 2018
New Feature: Rewrite PEDL task scheduler.
This release includes a new scheduler implementation. The major user-visible change is that experiments within a priority tier will now be fair-shared: e.g., if there are two high-priority experiments, each experiment will have the opportunity to consume half the cluster's resources. (The previous scheduler allocated resources to experiments in the same priority tier according to FIFO order.) See the documentation for more information about the new scheduler. Note that scheduler-related configuration options (e.g.,
priority) have not changed.
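As a rough illustration of fair sharing within one priority tier, the sketch below splits a number of cluster slots evenly among the experiments in that tier. The function name `fair_share` and the rule for handing out leftover slots are assumptions made for this example; the behavior of the actual scheduler is described in the documentation.

```python
from typing import Dict, List


def fair_share(total_slots: int, experiment_ids: List[int]) -> Dict[int, int]:
    """Split slots evenly among the experiments in a single priority tier.

    Hypothetical helper: any leftover slots are handed out one at a time
    to the experiments that appear first in the list.
    """
    if not experiment_ids:
        return {}
    base, extra = divmod(total_slots, len(experiment_ids))
    return {
        exp_id: base + (1 if index < extra else 0)
        for index, exp_id in enumerate(experiment_ids)
    }


# Two high-priority experiments on an 8-slot cluster each get half.
print(fair_share(8, [1, 2]))  # {1: 4, 2: 4}
```

Using integer division with an explicit remainder keeps the total number of assigned slots equal to the number available.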
Add support for using non-AWS S3 implementations to store model checkpoints.
A new experiment configuration option,
endpoint_url, has been added to allow specifying this.
The default trial runner base image has been upgraded to include Keras 2.1.6, NumPy 1.14.3, and SciPy 1.1.0.
Fix bug in handling HTTP requests with no
Release Date: May 15, 2018
Trial runners can now be started under a non-root user ID.
In local deployments, use
dist/etc/agent.conf to configure a non-root user ID and optional group ID. In Kubernetes deployments, use the
trialRunner.gid configuration parameters. If unspecified, the uid defaults to 0 (the root user) and the gid defaults to the root group.
WebUI: Incorporate PEDL documentation into the web server.
Users can now access the full PEDL documentation on the web UI via the "Documentation" button in the navigation bar.
WebUI: Improve formatting of Y-Axis labels in trial "Details" visualization.
Graphs with very large or very small training loss values will now use scientific notation for labels on the Y-Axis.
Fix bug in handling image build errors in the agent.
Release Date: May 10, 2018
New Feature: Support viewing per-trial training loss history in WebUI.
This release adds a "Details" button that displays a plot of the training loss for a given trial. The plot shows the mean per-step training loss.
WebUI: Don't refresh experiment details for terminal experiments.
WebUI: Fix broken "Continue Training" button (0.4.9 regression).
Fix rare error when launching trials due to container name conflict (0.4.9 regression).
Improve error handling for experiments with misconfigured searcher metric.
The metric field in the
searcher section of the experiment config must correspond to the name of a validation metric produced by the model. When this is not the case, PEDL now detects this situation and reports an error.
cli: Improve error reporting for
Release Date: May 5, 2018
- Avoid WebUI error when displaying experiments with misconfigured searcher metric name.
Release Date: May 4, 2018
New Feature: Support for incremental computation of validation metrics.
Previously, the API for computing validation metrics required the entire validation set to be loaded into memory. For experiments with large validation sets, this might be very expensive.
This release of PEDL introduces a new API that splits the computation of a validation metric into a "batch validation function", which computes an intermediate result for a single batch, and a "reducer", which combines all the intermediate results into a final metric value. Not all validation metrics can be expressed in this way, but for those that can, using this new (optional) API can result in reduced memory consumption.
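The split into a batch validation function and a reducer can be sketched for classification accuracy as follows. The function names and signatures here are illustrative assumptions, not the PEDL API; the point is only that each batch is reduced to a small intermediate result, so the full validation set never has to be held in memory.

```python
from typing import Iterable, Sequence, Tuple


def batch_validation(predictions: Sequence[int], labels: Sequence[int]) -> Tuple[int, int]:
    """Intermediate result for one batch: (number correct, batch size)."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct, len(labels)


def reduce_accuracy(partials: Iterable[Tuple[int, int]]) -> float:
    """Combine the per-batch results into a single accuracy value."""
    total_correct = 0
    total_examples = 0
    for correct, count in partials:
        total_correct += correct
        total_examples += count
    return total_correct / total_examples if total_examples else 0.0


# Two batches of (predictions, labels); only the small tuples are retained.
batches = [([1, 0, 1], [1, 1, 1]), ([0, 0], [0, 0])]
partials = [batch_validation(p, y) for p, y in batches]
print(reduce_accuracy(partials))  # 0.8
```

Accuracy fits this pattern because counts can be summed across batches. A metric that needs to see all examples at once does not decompose this way.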
New Feature: Support for warm-starting experiments that use
Previously, warm-starting was only supported for
single experiments. When warm starting
source_trial_id is used to set the initial weights for all of the trials in the experiment.
Introduce a more concise format for specifying constant hyperparameters.
Example of the new format:
This is equivalent to the old syntax, which is still supported:
batch_size:
  type: const
  val: 32
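To illustrate the equivalence between the two forms, the sketch below expands a bare hyperparameter value into the explicit `type: const` form. It assumes the concise format is the value written directly under the hyperparameter name; `expand_const` is a hypothetical helper and not part of PEDL.

```python
from typing import Any, Dict


def expand_const(hyperparameters: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite bare values as explicit constant hyperparameters.

    Assumption: any entry that is not already a mapping with a "type"
    key is treated as a constant.
    """
    expanded: Dict[str, Any] = {}
    for name, spec in hyperparameters.items():
        if isinstance(spec, dict) and "type" in spec:
            expanded[name] = spec  # already in the explicit form
        else:
            expanded[name] = {"type": "const", "val": spec}
    return expanded


print(expand_const({"batch_size": 32}))
# {'batch_size': {'type': 'const', 'val': 32}}
```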
Upgrade to YAML 1.2 format for experiment configurations.
Notably, this allows scientific notation (e.g.,
1e-4) to be used when specifying hyperparameters.
pedl logs -f to handle Ctrl+C (
KeyboardInterrupt) more cleanly.
Fix authentication bug when fetching trial runner images from a remote Docker registry.
Upgrade to Python 3.6.5 in the agent and trial-runner containers.
The master container already used Python 3.6.5.
Release Date: April 30, 2018
- Fix missing dependencies in the CLI.
Release Date: April 26, 2018
New Feature: Support for periodic validation computation.
In previous versions of PEDL, validation metrics were only calculated after the final step of a trial, or after the final step of each rung when using the
adaptive search method. This release of PEDL adds support for periodically computing validations in addition to those mentioned previously. A new configuration parameter,
min_validation_period, specifies the maximum number of training steps that will be run since the last validation computation before a new validation computation will be initiated.
Users should note that enabling periodic validation could slow experiment progress, depending on the cost of a validation computation. Due to this, periodic validations are not enabled for an experiment by default.
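The effect of the period setting can be sketched as a simple counter over training steps. This is a minimal model under stated assumptions: the function name is hypothetical, an unset period is taken to mean that periodic validation is disabled, and the validations that always happen at the end of a trial or rung are not modeled.

```python
from typing import List, Optional


def validation_steps(total_steps: int, min_validation_period: Optional[int]) -> List[int]:
    """Return the training steps after which a periodic validation would run.

    Assumption: a period of None (or 0) disables periodic validation.
    """
    if not min_validation_period:
        return []
    scheduled = []
    steps_since_validation = 0
    for step in range(1, total_steps + 1):
        steps_since_validation += 1
        if steps_since_validation >= min_validation_period:
            scheduled.append(step)
            steps_since_validation = 0  # the counter restarts after each validation
    return scheduled


print(validation_steps(10, 3))  # [3, 6, 9]
```

A smaller period gives more frequent feedback at the cost of more validation computations, which is the trade-off noted above.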
Fix bug with experiments that use the
In previous versions of PEDL, experiments using the
tensorflow.python.keras package would crash when attempting to save a checkpoint.
Fix an off-by-one error that slightly limited the integer range of trial seeds.
Trial seeds are now randomly selected from the [0, 2^31) integer range, whereas in previous PEDL versions they were randomly selected from the [0, 2^31 - 1) integer range.
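The difference between the two ranges comes down to the exclusive upper bound. The sketch below uses Python's standard `random` module to show a draw from the half-open range with upper bound 2**31; the function name is an assumption for illustration and this is not the code PEDL uses.

```python
import random

SEED_UPPER_BOUND = 2 ** 31  # exclusive, so the largest possible seed is 2**31 - 1


def new_trial_seed(rng: random.Random) -> int:
    """Draw a seed uniformly from [0, 2**31)."""
    # randrange(n) excludes n itself. Passing 2**31 - 1 instead would make
    # the value 2**31 - 1 unreachable, which is the off-by-one described above.
    return rng.randrange(SEED_UPPER_BOUND)


seed = new_trial_seed(random.Random())
print(0 <= seed < SEED_UPPER_BOUND)  # True
```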
Improvements to trial logging.
The agent ID and initial workload are now logged on trial runner startup.
This flag specifies the number of lines of log output to show, counting from the end of the log (analogous to
Improve logging of WebSocket errors.
Improve error logging for CLI commands
Release Date: April 19, 2018
Breaking Change: Switch to a new, backwards-incompatible checkpoint format for Keras trials.
Previous versions of PEDL used the default Keras serialization format (
model.save()). Unfortunately, this format is problematic for models that use the Keras
This release of PEDL switches to a new custom checkpoint format for Keras models. This change works around the shortcomings of the default Keras format and allows multi-GPU models to be restored from checkpoints, but the new checkpoint format is backwards-incompatible: PEDL >= 0.4.6 cannot use Keras model checkpoints (e.g., for experiment warm starts) created by PEDL < 0.4.6.
One consequence of this change is that Keras model definitions that use custom objects no longer need to implement the
custom_objects API method. As a result, this method has been removed from
Support changing the priority of experiments on-the-fly.
This is done using a new CLI sub-command,
Add container launch errors to the per-trial log.
In previous versions of PEDL, if an error occurred when launching a container for a trial, that error was only visible in the PEDL agent log. Container launch errors are now also visible in the per-trial log (e.g.,
Enforce a maximum size on model definitions.
PEDL now rejects model definitions that are greater than 96MB in total size.
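To check a model definition directory against the limit before submitting it, the total size can be computed locally. This is a sketch under assumptions: the helper names are hypothetical, "96MB" is interpreted as 96 * 1024 * 1024 bytes, and the size is taken as the sum of the file sizes on disk, which may differ from how PEDL measures it.

```python
import os

# Assumption: 1 MB is taken to be 2**20 bytes.
MAX_MODEL_DEF_BYTES = 96 * 1024 * 1024


def model_definition_size(root: str) -> int:
    """Sum the sizes in bytes of all files under the given directory."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def is_within_limit(root: str) -> bool:
    """Return True if the directory is not greater than the size limit."""
    return model_definition_size(root) <= MAX_MODEL_DEF_BYTES
```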
Display experiment progress as part of the experiment list in CLI and Web UI.
Fix PEDL agent crash with large model definitions.
Fix bug that caused the
pedl-agent-stop script to hang for a long time.
Tweak display of experiment states in Web UI.
Completed, active, and failed experiments are now shown in different colors.
Release Date: April 12, 2018
New Feature: Support for model definitions consisting of multiple files.
In previous versions of PEDL, experiments could only use a single model definition file. This restriction has been lifted; an experiment can now consist of a directory of files. When creating multi-file experiments, users should ensure the top-level directory is a well-formed Python package (e.g., it should contain an
__init__.py file). Multi-file experiments can be created via both the CLI (
pedl create <experiment-config> <dir>) and the Web UI.
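A quick local check for the package requirement mentioned above might look like the following. The helper name is hypothetical and the check only covers the presence of `__init__.py` in the top-level directory; it does not validate anything else about the model definition.

```python
import os


def is_python_package(directory: str) -> bool:
    """Return True if the directory contains a top-level __init__.py file."""
    return os.path.isfile(os.path.join(directory, "__init__.py"))
```

Running such a check before `pedl create` can catch a missing `__init__.py` without a round trip to the cluster.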
New Feature: Support for periodic trial checkpoints.
In previous versions of PEDL, trials were only checkpointed when the trial was moved to another agent or when the experiment finished. This release of PEDL adds support for periodically checkpointing each trial of an experiment. A new configuration parameter,
min_checkpoint_period, specifies the maximum number of training steps that will be run since the last checkpoint before a new checkpoint of the trial will be taken. Periodic checkpoints are not enabled for an experiment by default.
New Feature: Initial support for reproducible experiments.
PEDL includes limited support for improving the reproducibility of deep learning experiments. See the documentation for more details.
Significantly improve Web UI performance.
The WebUI should now place much less load on the master when viewing experiments with many steps and/or trials.
Allow TensorFlow trials to specify a custom session configuration.
Add new CLI sub-command,
This makes it easier to download trial checkpoints that are stored in S3.
Improve Web UI display of trials with in-progress validation operations.
When displaying trials with in-progress validation operations, the Web UI previously displayed a blank validation metric; it will now display the last successfully computed validation metric.
Tweak display of experiment states in Web UI.
These were previously displayed in red text (even for successfully completed experiments), which was confusing. All experiment states are now displayed using the same color as normal text.
Fix bug in
KerasFunctionalTrial when multiple training metrics specified the same output layer.
Fix error when warm-starting from a trial with multiple checkpoints.
Raise maximum WebSocket packet length to 4MB.
Release Date: April 6, 2018
Fix a Web UI crash with experiments that have misconfigured
Fix a Web UI error that was caused by stale code for editing model definitions.
Update to Python 3.6.5 in the PEDL master container.
Print Nvidia driver version number during PEDL agent startup.
Release Date: April 5, 2018
Breaking Change: Remove support for editing model definitions via the builtin editor in the Web UI.
Previous versions of PEDL supported editing model definitions directly in the Web UI. This feature has been removed, in anticipation of support for model definitions that consist of multiple files.
Breaking Change: Remove support for displaying a histogram of predicted validation labels in the Web UI.
This feature was not broadly useful and the implementation was fragile. A future version of PEDL will introduce support for custom plots as a fully supported feature.
Support for disabling GPUs dynamically.
PEDL now supports two new CLI commands,
pedl enable-slot. These commands allow GPUs at an agent to be disabled and enabled, respectively. When a slot is disabled, any workload that is currently running in the slot is allowed to finish its current step; it will then be checkpointed and migrated to a different slot.
Note that these settings are not persisted: if an agent disconnects from PEDL and reconnects, all of its GPUs / slots will be enabled. GPUs can be disabled in a persistent way by editing
agent.conf, but changing
GPU_LIST requires restarting the agent.
Increase width of log modal in Web UI.
This makes it easier to view trial logs in the Web UI.
Add an "Experiment ID" column to
This makes it easier to identify all the slots currently used by a particular experiment.
Reduce the number of intermediate Docker layers created for
Fix bugs in the "Continue Training" feature in the Web UI.
The previous implementation neglected to correctly preserve some properties of the experiment being continued from (e.g.,
Fix crash in
pedl describe when the described experiment was in the midst of computing validation metrics.
Experimental: Reproducibility in
PEDL now supports near-reproducible experiments when using the above search methods. There may still be some limitations around achieving perfect reproducibility to floating point precision during optimization, depending on model choice and/or underlying hardware. See the documentation for more details.
Release Date: March 28, 2018
Experimental: Support for synchronous data-parallel training using multiple GPUs.
PEDL now supports trials that use multiple GPUs on a single agent. This feature allows multiple GPUs to be used to train a single experiment to convergence more quickly. To enable parallel training, set the
slots_per_trial field in the experiment configuration to be the number of parallel GPUs to use for each trial in the experiment. Note that enabling parallel training does not require changing your model code.
The current implementation has a few shortcomings:
- the user must manually configure the desired degree of parallelism
- all trials in the experiment must use the same degree of parallelism
- a naive communication strategy is used to share gradients between GPUs, which can result in poor performance for some models
- multi-slot experiments are scheduled using a simplistic algorithm that can sometimes result in underutilization
These shortcomings will be addressed in future releases of PEDL.
Breaking Change: The
TensorFlowTrial API has changed.
Model definitions that use TensorFlow will need to be updated: several
TensorFlowTrial interface methods have been renamed and a new required interface method has been added. The examples and API docs have been updated to describe the new API.
Add progress reporting during training and validation of Keras trials.
This makes it easier to observe the rate at which a Keras trial is making progress.
Correctly handle errors when the agent fails to launch a container.
Improve reporting of errors and assertion failures.
Fix bug in
pedl list-checkpoints for in-progress experiments.
WAITING task state to
This more accurately describes what containers in this state are doing.
Release Date: March 22, 2018
Add initial support for "warm starting" of experiments.
This allows a new experiment to be created that uses the weights from a particular trial of a previous experiment. For example, this feature can be used to continue training promising trials from previous experiments for a longer period of time. Note that the new and old experiments must use the same model architecture; however, hyperparameters that don't influence the model architecture can safely be changed.
Add support for checkpointing Keras trials that use custom layers and other custom objects.
KerasTrial subclasses to implement a new interface method,
Fix bug in
KerasFunctionalTrial with single-output models.
Improve error checking for experiment configurations.
Improve accuracy of experiment progress indicator.
Fix 0.4.0 regression in WebUI: when viewing an experiment, an error occurred if the experiment changed from "active" to "completed".
Release Date: March 19, 2018
Breaking Change: The checkpoint format for TensorFlow experiments is now
In previous versions of PEDL, the
tf.train.Saver format was used.
Add support for
single trial search method.
random searcher can be used to achieve a single trial experiment, but this new search method provides first-class support for an experiment that consists of a single trial.
Improve WebUI for PEDL deployments with many experiments.
Support filtering experiments by date range and description.
Fix bug with experiments that used categorical hyperparameters with numerical values.
Add CentOS 7 as a supported platform.
Upgrade to Keras 2.1.5 in the default trial runner base image.
Upgrade to Postgres 10.3.
Release Date: March 8, 2018
Breaking Change: Rename the
trial_runner field in the experiment configuration file.
The name of the base Docker container for running trials is now specified by the subfield
base_image of the top-level
trial_environment field. For example:
trial_environment:
  base_image: determinedai/pedl-tr-py3.6-tf
Breaking Change: Remove extra Python packages from the base trial runner image.
The image previously contained several commonly used Python libraries (e.g.,
zarr). These packages have been removed; the base trial runner only contains TensorFlow, Keras, and their dependencies. The
runtime_packages feature (described below) has been added to support installing custom dependencies.
Support for customizing the trial runner container.
The experiment configuration file supports two new subfields of
runtime_packages. These specify a list of commands to be executed and a list of Python packages to be installed into the trial runner, respectively. These customizations are applied before any workloads are run in the trial container.
Support for training and validation callbacks.
Model definitions can now define callbacks that will be executed after training or validation operations. For example, this feature can be used to record training and validation metrics as TensorBoard event files, which can then be visualized using TensorBoard. A complete example of TensorBoard integration is included in the documentation.
In adaptive search, more aggressively mark trials as "completed", when possible.
Improve reliability of starting the PEDL master via systemd.
WebUI: In the experiment detail page, support changing the sort order of the trial tables.
For example, this makes it easier to see which completed or active trials have the best validation metric.
WebUI: In the experiment list page, support changing the sort order of the experiment tables.
Release Date: February 27, 2018
New Feature: The
KerasFunctionalTrial interface has been added to support the Keras Functional API.
See the documentation for usage instructions and current limitations.
Breaking Change: The trial API function
make_training_and_validation_loaders has been renamed to
The old name is still supported but is deprecated, and will be removed in a future release of PEDL.
Add an example of how to plot PEDL experiment metadata using a Jupyter notebook (see
PEDL now includes support for experiment progress estimation.
In both the CLI and the WebUI, users can view the fraction of total work for a given experiment that has been completed.
Improve error handling for experiments with bad hyperparameter settings.
Optimize training performance for TensorFlow-based experiments.
Change the master to reject connection attempts from agents running a different version of PEDL.
Fix 0.3.0 regression in WebUI: per-trial "Logs" button stopped working.
Upgrade to TensorFlow 1.5.0, Keras 2.1.4, and NumPy 1.14.0 in the default trial runner.
Release Date: February 12, 2018
Support per-experiment resource limits.
Users can now specify a
max_slots setting in the
resources section of the experiment config file.
Support configuring the agent to use a subset of the GPUs on a host.
This is done via the
Adopt a more friendly scheme for agent IDs.
Agent IDs are no longer UUIDs; instead, they are user-configured strings that default to the hostname of the agent machine.
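The defaulting behavior described above can be sketched with the standard `socket` module. The function name and the way the configured value is passed in are assumptions for illustration; the actual agent reads its ID from its own configuration.

```python
import socket
from typing import Optional


def resolve_agent_id(configured_id: Optional[str] = None) -> str:
    """Use the configured agent ID if given, else fall back to the hostname."""
    return configured_id if configured_id else socket.gethostname()


print(resolve_agent_id("gpu-box-1"))  # gpu-box-1
```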
Bundle the API docs with the PEDL package.
cli: Add support for
pedl logs, similar to
cli: Add support for
Users can now start an experiment and follow the logs of the experiment's first trial using a single command. This simplifies a common model development workflow.
cli: Allow the master address to be set via environment variable.
Fix master hang on shutdown.
Upgrade Postgres to 10.2.