Release Notes
Version 0.13
Version 0.13.7
Release Date: October 29, 2020
New Features
Add support for running workloads on spot instances on AWS. Spot instances can be up to 70% cheaper than on-demand instances. If a spot instance is terminated, Determined’s built-in fault tolerance means that model training will continue on a different agent automatically. Spot instances can be enabled by setting spot: true in the Cluster Configuration. For more details, see the guide on using AWS Spot Instances.
Support MMDetection, a popular library for object detection, in Determined. MMDetection allows users to easily train state-of-the-art object detection models; with Determined, users can take things one step further with cutting-edge distributed training and hyperparameter tuning to further boost performance. See the Determined implementation of MMDetection for more information on how to get started.
WebUI: Allow the experiments list page to be filtered by labels. Selecting more than one label will filter experiments by the intersection of the selected labels.
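The label filter described above keeps only experiments that carry every selected label. A minimal sketch of that intersection behavior follows; the filter_by_labels helper and the sample data are hypothetical and are not part of the WebUI code.

```python
from typing import Dict, Iterable, List, Set


def filter_by_labels(experiments: Dict[int, Set[str]], selected: Iterable[str]) -> List[int]:
    """Return the IDs of experiments that carry every selected label."""
    wanted = set(selected)
    # An experiment matches when the selected labels are a subset of its labels.
    return sorted(exp_id for exp_id, labels in experiments.items() if wanted <= labels)


# Hypothetical experiments keyed by ID, each with a set of labels.
experiments = {
    1: {"resnet", "baseline"},
    2: {"resnet", "tuned"},
    3: {"bert"},
}

print(filter_by_labels(experiments, ["resnet"]))
print(filter_by_labels(experiments, ["resnet", "tuned"]))
```

Selecting no labels leaves the list unfiltered in this sketch, because the empty set is a subset of every label set.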
Deprecated Features
Deprecate the simple and advanced adaptive hyperparameter search algorithms. They will be removed in a future release. Both algorithms have been replaced with Hyperparameter Search: Adaptive (Asynchronous), which has state-of-the-art performance, as well as better scalability and resource-efficiency.
Improvements
Documentation: Add a guide for Setting up an AWS Kubernetes (EKS) Cluster.
Master: Support a minimum instance count for dynamic agents. The master will attempt to scale the cluster to at least the configured value at all times. This is configurable via provisioner.min_instances in the Cluster Configuration. This will increase responsiveness to workload demand because agent(s) will be ready even when the cluster is idle.
Kubernetes: Improve the performance of the /agents endpoint for Kubernetes deployments. This will improve the performance of the cluster page in the WebUI, as well as when using det slot list and det task list via the CLI.
Kubernetes: Release version 0.3.0 of the Determined Helm chart.
WebUI: Improve metric selection on the trial detail page. This should improve filtering for trials with many metrics.
WebUI: Use scientific notation when appropriate for floating point metric values.
WebUI: Show both experiment and trial TensorBoard sources when applicable.
Bug Fixes
WebUI: Fix an issue where TensorBoard sources did not display properly for TensorBoards started via the CLI.
WebUI: Fix an issue with rendering boolean hyperparameters in the WebUI.
CLI: Fix an issue where trial IDs were occasionally not displayed when running det task list or det slot list in the CLI.
Master: Fix the default value for the fit field if the scheduler is set in the Cluster Configuration.
Version 0.13.6
Release Date: October 14, 2020
Improvements
Agent: The boot_disk_source_image field for GCP dynamic agents and image_id field for AWS dynamic agents are now optional. If omitted, the default value is the Determined agent image that matches the Determined master being used.
Documentation: Ship Swagger UI with Determined documentation. The /swagger-ui endpoint has been renamed to /docs/rest-api.
Documentation: Add a guide on configuring TLS in Determined.
Kubernetes: Add support for configuring memory and CPU requirements for the Determined database when installing via the Determined Helm Chart.
Kubernetes: Add support for configuring the storageClass that is used when deploying a database using the Determined Helm Chart.
Bug Fixes
Harness: Do not require the master to present a full TLS certificate chain when the certificate is signed by a well-known Certificate Authority.
Harness: Fix a bug which affected TFKerasTrial using TensorFlow 2 with gradient_aggregation > 1.
Master: Fix a bug where the master instance would fail if an experiment could not be read from the database.
WebUI: Preserve the colors used for multiple metrics on the metric chart.
WebUI: Fix the ability to cancel a batch of experiments.
WebUI: Fix a bug which caused the Experiment Details page to not render when the latest validation metric is not available.
Version 0.13.5
Release Date: September 30, 2020
Improvements
Security: Use one TCP port for all incoming connections to the master and use TLS for all connections if configured.
Breaking Change: The http_port and https_port options in the master configuration have been replaced by the single port option. The security.http option is no longer accepted; the master can no longer be configured to listen over HTTP and HTTPS simultaneously.
Security: Support configuring TLS encryption when deploying Determined on Kubernetes. For more details please see Install Determined on Kubernetes.
Agent: Increase default max agent starting and idle timeouts to 20 minutes and increase max disconnected period from 5 to 10 minutes.
Deployment: Add support for det-deploy aws in the following new regions: ap-northeast-1, eu-central-1, eu-west-1, us-east-2.
Docker: Publish new Docker task containers that upgrade TensorFlow versions from 1.15.0 to 1.15.4, and 2.2.0 to 2.2.1.
Documentation: Add extra documentation and reorganize examples by use case.
Documentation: Add a tf.layers-in-Estimator example.
Kubernetes: Add support for users to specify initContainers and containers as part of their custom pod specs. Please see Specifying Custom Pod Specs for details.
Kubernetes: Publish version 0.2.0 of the Determined Helm chart.
Native API: Deprecate the Native API and remove the related examples and docs.
Trials: Remove support for TensorpackTrial.
WebUI: Improve polling behavior for experiment and trial details pages to avoid hanging indefinitely for very large experiments/trials.
Bug Fixes
Trials: Fix a bug where if only a subset of workers on a machine executed the on_trial_close() EstimatorTrial callback, the container would terminate as soon as one worker exited.
Trials: Fix a bug where det e create --test would succeed when there were checkpointing failures.
WebUI: Fix the issue of multiple selected rows disappearing after a successful table batch action.
WebUI: Remove unused TensorBoard sources column from the task list page.
WebUI: Fix rendering metrics with the same name on the metric chart.
WebUI: Make several fixes to improve select appearance and user experience.
WebUI: Fix the issue of agent and cluster info not loading on slow connections.
WebUI: Fix the issue where the chart in the Experiment page does not have the metric name in the legend.
Version 0.13.4
Release Date: September 16, 2020
Improvements
Support configuring default values for the task image, Docker pull policy, and Docker registry credentials via the Master Configuration and the Helm Chart Configuration. In previous versions of Determined, these values had to be specified on a per-task basis (e.g., in the experiment configuration). Per-task configuration is still supported and will override the default value (if any).
Add connection checks for dynamic agents. A dynamically provisioned agent will be terminated if it is not actively connected to the master for at least five minutes.
Emit a warning if DistributeConfig is specified for an Estimator. Configuring an Estimator via tf.distribute.Strategy can conflict with how Determined performs distributed training. With this change, Determined will attempt to catch this problem and surface an error message in the experiment logs. An Estimator can still be configured with an empty DistributeConfig without issue.
Remove support for dataflow_to_tf_dataset in EstimatorTrial. Dataflows should be wrapped using wrap_dataset(shard=False) instead.
WebUI: Add middle mouse button click detection on tables to open in a new tab/page.
WebUI: Improve the trial detail metrics view.
Support metrics with non-numeric values.
Default to showing only the searcher metric on initial page load.
Add search capability to the metric select filter. This should improve the experience when there are many metrics.
Add support for displaying multiple metrics on the metric chart.
WebUI: Move TensorBoard sources from a table column into a separate modal.
WebUI: Optimize loading of active TensorBoards and notebooks.
Bug Fixes
Improve handling of certain corner cases where distributed training jobs could hang indefinitely.
Fix an issue where detecting GPU availability in TensorFlow code would cause EstimatorTrial models to OOM.
Fix an issue where accessing logs could create a memory leak.
Fix an issue that prevents resuming from checkpoints that contain a large number of files.
WebUI: Fix an issue where table page sizes were not saved between page loads.
WebUI: Fix an issue where opening a TensorBoard on an experiment would not direct the user to an already running TensorBoard, but instead create a new one.
WebUI: Fix an issue where batch actions on the experiments table would cause rows to disappear.
Known Issues
WebUI: In the trial detail metrics view, experiments that have both a training metric and a validation metric of the same name will not be displayed correctly on the metrics chart.
Version 0.13.3
Release Date: September 8, 2020
Bug Fixes
Deployment: Fix a bug where det-deploy local cluster-up was failing.
WebUI: Fix a bug where experiment labels were not displayed on the experiment list page.
WebUI: Fix a bug with decoding API responses because of unexpected non-numeric metric values.
Version 0.13.2
Release Date: September 3, 2020
New Features
Support deploying Determined on Kubernetes.
Determined workloads run as a collection of pods, which allows standard Kubernetes tools for logging, metrics, and tracing to be used. Determined is compatible with Kubernetes >= 1.15, including managed Kubernetes services such as Google Kubernetes Engine (GKE) and AWS Elastic Kubernetes Service (EKS).
When using Determined with Kubernetes, we currently do not support fair-share scheduling, priority scheduling, per-experiment weights, or gang-scheduling for distributed training experiments; workloads will be scheduled according to the behavior of the default Kubernetes scheduler.
Users can configure the behavior of the pods that are launched for Determined workloads by specifying a custom pod spec. A default pod spec can be configured when installing Kubernetes, but a custom pod spec can also be specified on a per-task basis (e.g., via the environment.pod_spec field in the experiment configuration file).
For more information on using Determined with Kubernetes, see the documentation.
Support running multiple distributed training jobs on a single agent.
In previous versions of Determined, a distributed training job could only be scheduled on an agent if it was configured to use all of the GPUs on that agent. In this release, that restriction has been lifted: for example, an agent with 8 GPUs can now be used to run two 4-GPU distributed training jobs. This feature is particularly useful as a way to improve utilization and fair resource allocation for smaller clusters.
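The change in placement rules described above can be illustrated with a small sketch. It only models the rule as stated in these notes for a single agent; it is not Determined's scheduler, and the function names are hypothetical.

```python
from typing import List, Tuple


def fits_before_0_13_2(job_slots: int, agent_total: int, agent_free: int) -> bool:
    """Old rule as described above: the job must use every GPU on the agent."""
    return agent_free == agent_total and job_slots == agent_total


def fits_since_0_13_2(job_slots: int, agent_free: int) -> bool:
    """New rule: the job only needs enough free GPUs on the agent."""
    return job_slots <= agent_free


def place_jobs(job_sizes: List[int], agent_total: int) -> Tuple[List[int], int]:
    """Place jobs on one agent under the new rule; return placed jobs and free GPUs."""
    free = agent_total
    placed = []
    for slots in job_sizes:
        if fits_since_0_13_2(slots, free):
            placed.append(slots)
            free -= slots
    return placed, free


# Two 4-GPU jobs on one 8-GPU agent, as in the example above.
placed, free = place_jobs([4, 4], agent_total=8)
print("placed:", placed, "free GPUs:", free)
print("4-GPU job allowed under old rule:", fits_before_0_13_2(4, agent_total=8, agent_free=8))
```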
Improvements
WebUI: Update primary navigation. The primary navigation now sits entirely on one side and is collapsible to maximize content space.
WebUI: Trial details improvements:
Update metrics selector to show the number of metrics selected to improve readability.
Add the “Has Checkpoint or Validation” filter.
Persist the “Has Checkpoint or Validation” filter setting across all trials, and persist the “Metrics” filter on trials of the same experiment.
WebUI: Improve table pagination behavior. This will improve performance on Determined instances with many experiments.
WebUI: Persist the sort order and sort column for the experiments, tasks, and trials tables to local storage.
WebUI: Improve the default axes’ ranges for metrics charts. Also, update the range as new data points arrive.
Add a warning when the PyTorch LR scheduler incorrectly uses an unwrapped optimizer. When using PyTorch with Determined, LR schedulers should be constructed using an optimizer that has been wrapped via the wrap_optimizer() method.
Add a reminder to remove sys.exit() if a SystemExit exception is caught.
Bug Fixes
WebUI: Fix an issue where the recent task list did not apply the limit filter properly.
Fix Keras and Estimator wrapping functions not returning the original objects when exporting checkpoints.
Fix progress reporting for adaptive_asha searches that contain failed trials.
Fix an issue that was causing OOM errors for some distributed EstimatorTrial experiments.
Version 0.13.1
Release Date: August 31, 2020
Bug Fixes
Database migration: Fix a bug with a database migration in Determined version 0.13.0 which caused it to run slowly and backfill incorrect values. Users on Determined versions 0.12.13 or earlier are advised to upgrade to version 0.13.1. Users already on version 0.13.0 should upgrade to version 0.13.1 as usual.
TensorBoard: Fix a bug that prevents TensorBoards from experiments with old experiment configuration versions from being loaded.
WebUI: Fix an API response decoding issue on React where a null checkpoint resource was unhandled and could prevent the trial detail page from rendering.
WebUI: Fix an issue where terminated TensorBoard and notebook tasks were rendered as openable.
Version 0.13.0
Release Date: August 20, 2020
This release of Determined introduces several significant new features and modifications to existing features. When upgrading from a prior release of Determined, users should pay particular attention to the following changes:
The concept of “steps” has been removed from the CLI, WebUI, APIs, and configuration files. Before upgrading, terminate all active and paused experiments (e.g., via det experiment cancel or det experiment kill). The format of the experiment config file has changed: configuration files that worked with previous versions of Determined will need to be updated to work with Determined >= 0.13.0. For more details, see the notes below or the migration guide.
The WebUI has been partially rewritten, moving several components that were implemented in Elm to now being written in React and TypeScript. As part of this change, many improvements to the performance, appearance, and usability of the WebUI have been made. For more details, see the list of changes below. Please notify the Determined team of any regressions in functionality.
The usability of the det shell feature has been significantly enhanced. As part of this change, the way in which arguments to det shell are parsed has changed; see details below.
We recommend taking a backup of the database before upgrading Determined.
New Features
Allow trial containers to connect to the master using TLS.
Allow the agent to skip TLS verification or to use a custom certificate when connecting to the master.
For TFKerasTrial and EstimatorTrial, add support for disabling automatic sharding of the training dataset when doing distributed training. When wrapping a dataset via context.wrap_dataset, users can now pass shard_dataset=False. If this is done, users are responsible for splitting their dataset in such a manner that every GPU (rank) sees unique data.
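When automatic sharding is disabled, each rank has to pick its own disjoint slice of the data. One common way to do that is strided selection by rank, sketched below in plain Python. The shard_for_rank helper is hypothetical, and how a trial obtains its rank and the total number of ranks is outside the scope of this sketch.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_rank(records: Sequence[T], rank: int, num_ranks: int) -> List[T]:
    """Return the records that a given rank should read (every num_ranks-th item)."""
    if not 0 <= rank < num_ranks:
        raise ValueError("rank must be in [0, num_ranks)")
    return list(records[rank::num_ranks])


# Hypothetical dataset of 10 records split across 4 GPUs (ranks).
records = list(range(10))
shards = [shard_for_rank(records, rank, 4) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

Every record lands in exactly one shard, so no two ranks see the same data; shards may differ in size by one record when the dataset size is not a multiple of the number of ranks.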
Improvements
Remove Steps from the UX: Remove the concept of a “step” from the CLI, WebUI, and configuration files. Add new configuration settings so that quantities previously expressed in steps can instead be configured in terms of records, batches, or epochs. See the migration guide for details on migrating from the old configuration to the new configuration.
Many configuration settings can now be set in terms of records, batches or epochs. For example, a single searcher can be configured to run for 100 records by setting max_length: {records: 100}, 100 batches by setting max_length: {batches: 100}, or 100 epochs by setting records_per_epoch at the root of the config and max_length: {epochs: 100}.
A new configuration setting, records_per_epoch, is added that must be specified when any quantity is configured in terms of epochs.
Breaking Change: For single, random and grid searchers, searcher.max_steps has been replaced by searcher.max_length.
Breaking Change: For ASHA based searchers, searcher.target_trial_steps and searcher.step_budget have been replaced by searcher.max_length and searcher.budget, respectively.
Breaking Change: For PBT, searcher.steps_per_round has been replaced by searcher.length_per_round.
Breaking Change: For all experiments, the names for min_validation_period and min_checkpoint_period are unchanged but they are now configured in terms of records, batches or epochs.
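To make the relationship between the three units concrete, the sketch below converts a length such as {records: 100} or {epochs: 100} into a number of batches. It is an illustration only: the length_in_batches helper is hypothetical, the batch size is passed in explicitly, and rounding partial batches up is an assumption rather than documented Determined behavior.

```python
from typing import Dict, Optional


def ceil_div(numerator: int, denominator: int) -> int:
    """Integer division that rounds up."""
    return -(-numerator // denominator)


def length_in_batches(
    length: Dict[str, int],
    global_batch_size: int,
    records_per_epoch: Optional[int] = None,
) -> int:
    """Convert a length such as {"epochs": 100} into a number of batches."""
    if len(length) != 1:
        raise ValueError("a length must name exactly one unit")
    unit, value = next(iter(length.items()))
    if unit == "batches":
        return value
    if unit == "records":
        # Assumption: a partial final batch counts as one batch.
        return ceil_div(value, global_batch_size)
    if unit == "epochs":
        if records_per_epoch is None:
            raise ValueError("records_per_epoch is required when using epochs")
        return ceil_div(value * records_per_epoch, global_batch_size)
    raise ValueError(f"unknown unit: {unit}")


# Hypothetical values: 32 records per batch, 640 records per epoch.
for max_length in ({"records": 100}, {"batches": 100}, {"epochs": 100}):
    print(max_length, "->", length_in_batches(max_length, 32, records_per_epoch=640))
```

The ValueError for a missing records_per_epoch mirrors the requirement stated above that it must be set whenever a quantity is configured in epochs.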
Shell Mode Improvements: Determined supports launching GPU-attached terminal sessions via det shell. This release includes several changes to improve the usability of this feature, including:
The determined and determined-cli Python packages are now automatically installed inside containers launched by det shell.
Any user-defined environment variables for the task image will be passed into the ssh sessions opened via det shell start or det shell open.
det shell should now work correctly in “host” networking mode.
det shell should now work correctly with dynamic agents and in cloud environments.
Breaking Change: Change how additional arguments to ssh are passed through det shell start and det shell open. Previously they were passed as a single string, like det shell open SHELL_ID --ssh-opt '-X -Y -o SomeSetting="some string"', but now the --ssh-opt has been removed and all extra positional arguments are passed through without requiring double-layers of quoting, like det shell open SHELL_ID -- -X -Y -o SomeSetting="some string" (note the use of -- to indicate all following arguments are positional arguments).
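Scripts that still build the old single-string --ssh-opt value can be adapted by splitting that string into separate arguments and placing them after --. The sketch below does this with Python's standard shlex module; the new_shell_open_args helper is hypothetical and only constructs the argument list, it does not run det.

```python
import shlex
from typing import List


def new_shell_open_args(shell_id: str, old_ssh_opt: str) -> List[str]:
    """Build the new-style argument list from an old --ssh-opt string."""
    # shlex.split applies shell-like quoting rules, so the inner quotes
    # around "some string" are consumed and the value stays one argument.
    ssh_args = shlex.split(old_ssh_opt)
    return ["det", "shell", "open", shell_id, "--", *ssh_args]


argv = new_shell_open_args("SHELL_ID", '-X -Y -o SomeSetting="some string"')
print(argv)
# shlex.join (Python 3.8+) renders the list as a shell-safe command line.
print(shlex.join(argv))
```

Passing the list directly to a process launcher such as subprocess.run avoids a second layer of quoting altogether.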
WebUI changes
Tasks List: /det/tasks
Consolidate notebooks, tensorboards, shells, commands into single list page.
Add type filter to control which task types to display. By default all task types are shown when none of the types are selected.
Add a type column with icons so that users can learn to recognize task types by their visual indicators.
Convert State filter from multi-select to single-select.
Convert actions from expanded buttons to overflow menu (triple vertical dots).
Move notebook launch buttons to task list from notebook list page.
Add pagination support that turns on automatically when there are more than 10 entries.
Add list of TensorBoard sources in a table Source column.
Experiment List: /det/experiments
Convert State filter from multi-select to single-select.
Convert actions from expanded buttons to overflow menu (triple vertical dots).
Change batch operation logic so that an action is available if it can be applied to any of the selected experiments.
Add pagination support that turns on automatically when there are more than 10 entries.
Experiment Detail: /det/experiments/<id>
Implement charting with Plotly with zooming capability.
Trial table paginates on the WebUI side in preparation for API pagination in the near future.
Convert steps to batches in trials table and metric chart.
Update continue trial flow to use batches, epochs or records.
Use Monaco editor for the experiment config with YAML syntax highlighting.
Add links to source for Checkpoint modal view, allowing users to navigate to the corresponding experiment or trial for the checkpoint.
Trial Detail: /det/trials/<id>
Add trial information table.
Add trial metrics chart.
Implement charting with Plotly with zooming capability.
Trial info table paginates on the WebUI side in preparation for API pagination in the near future.
Add support for batches, records and epochs for experiment config.
Convert metric chart to show batches.
Convert steps table to batches table.
Master Logs: /det/logs, Trial Logs: /det/trials/<id>/logs, Task Logs: /det/<tasktype>/<id>/logs
Limit logs to 1000 lines for initial load and load an additional 1000 for each subsequent fetch of older logs.
Use new log viewer optimized for efficient rendering.
Introduce log line numbers.
Add ANSI color support.
Add error, warning, and debug visual icons and colors.
Add tailing button to enable tailing log behavior.
Add scroll to top button to load older logs out
Fix back and forth scrolling behavior on log viewer.
Cluster: /det/cluster
Separate out GPU from CPU resources.
Show resource availability and resource count (per type).
Render each resource as a donut chart.
Navigation
Update sidebar navigation for new task and experiment list pages.
Add link to new swagger API documentation.
Hide pagination controls for tables with fewer than 10 entries.
Bug Fixes
Configuration: Do not load the entire experiment configuration when trying to check if an experiment is valid to be archived or unarchived.
Configuration: Improve the master's validation of hyperparameter configurations when experiments are submitted. Currently, the master checks whether global_batch_size has been specified and if it is numeric.
Logs: Fix issue of not detecting newlines in the log messages, particularly Kubernetes log messages.
Logs: Add intermediate step to trial log download to alert user that the CLI is the recommended action, especially for large logs.
Searchers: Fix a bug in the SHA searcher caused by the promotion of already-exited trials.
Security: Apply user authentication to streaming endpoints.
Tasks: Allow the master certificate file to be readable even for a non-root task.
TensorBoard: Fix issue affecting TensorBoards on AWS in us-east-1 region.
TensorBoard: Recursively search for tfevents files in subdirectories, not just the top level log directory.
WebUI: Fix scrolling issue that occurs when older logs are loaded, the tailing behavior is enabled, and the view is scrolled up.
WebUI: Fix colors used for different states in the cluster resources chart.
WebUI: Correct the numbers in the Batches column on the experiment list page.
WebUI: Fix cluster and dashboard reporting for disabled slots.
WebUI: Fix issue of archive/unarchive not showing up properly under the task actions.
Version 0.12
Version 0.12.13
Release Date: August 6, 2020
New Features
Model Registry: Determined now includes a built-in model registry, which makes it easy to organize trained models by providing versioning and labeling tools. See Organizing Models in the Model Registry to get started.
New PyTorch API: Add a new version of the PyTorch API that is more flexible and supports deep learning experiments that use multiple models, optimizers, and LR schedulers. The old API is still supported but is now deprecated and will be removed in a future release. See the migration guide for details on updating your PyTorch model code. Deprecated methods will be supported until at least the next minor release.
The new API supports PyTorch code that uses multiple models, optimizers, and LR schedulers. In your trial class, you should instantiate those objects and wrap them with wrap_model(), wrap_optimizer(), and wrap_lr_scheduler() in the constructor of your PyTorch trial class. The previous API methods build_model, optimizer, and create_lr_scheduler in PyTorchTrial are now deprecated.
Support customizing forward and backward passes in train_batch(). Gradient clipping should now be done by passing a function to the clip_grads argument of step_optimizer(). The callback on_before_optimizer_step is now deprecated.
Configuring automatic mixed precision (AMP) in PyTorch should now be done by calling configure_apex_amp() in the constructor of your PyTorch trial class. The optimizations.mixed_precision experiment configuration key is now deprecated.
The model arguments to train_batch(), evaluate_batch(), and evaluate_full_dataset() are now deprecated.
More Efficient Hyperparameter Search: This release introduces a new hyperparameter search method, adaptive_asha. This is based on an asynchronous version of the adaptive algorithm, and should enable large searches to find high-quality hyperparameter configurations more quickly. See the documentation and the associated paper for more information.
Improvements
Allow proxy environment variables to be set in the agent config. See Environment Variables for more information.
Preserve random state for PyTorch experiments when checkpointing and restoring.
Remove determined.pytorch.reset_parameters(). This should have no effect except when using highly customized nn.Module implementations.
WebUI: Show total number of resources in the cluster resource charts.
Add support for Nvidia T4 GPUs.
det-deploy: Add support for g4 instance types on AWS.
Upgrade Nvidia drivers on the default AWS and GCP images from 410.104 to 450.51.05.
Bug Fixes
Fix an issue with the SHA searcher that could cause searches to stop making progress without finishing.
Fix an issue where $HOME was not properly set in notebooks running in nonroot containers.
Fix an issue where killed experiments had their state reset to the latest checkpoint.
Randomize the notebook listening port to avoid port binding issues in host mode.
Version 0.12.12
Release Date: July 22, 2020
Improvements
Remove support for on_train_step_begin and on_train_step_end, deprecate on_validation_step_end, and introduce a new callback, on_validation_end, with the same functionality. Add helper methods is_epoch_start and is_epoch_end to the PyTorch context.
Add a new API to support custom reducers in EstimatorTrial. See the EstimatorTrial documentation for details.
CLI: Add the register_version command for registering a new version of a model.
CLI: Add a --head option when printing trial logs.
WebUI: Make it possible to launch TensorBoard from experiment dashboard cards.
Bug Fixes
Fix distributed training and Determined shell with non-root containers. The default task environments now include a user plugin to support running containers with arbitrary non-root users. Custom images based on the latest default task environments should also work.
Fix convergence issue for TF 2 multi-GPU models. Change default TF1 version from 1.14 to 1.15.
Fix issue affecting TensorFlow TensorBoard outputs.
Use local log line IDs for trial logs.
CLI: Improve the CLI’s custom TLS certificate handling with non-self-signed certs.
WebUI: Fix a parsing problem with task start times.
WebUI: Fix log viewer timestamp copy/paste.
Known Issues
WebUI: Older trial logs are not loaded by scrolling to the top of the page.
Version 0.12.11
Release Date: July 8, 2020
Add logging to console in test mode for the Native API when using determined.experimental.create.
Improve reliability of saving checkpoints to GCS in the presence of transient network errors.
Add an example using TensorFlow’s Image Segmentation via UNet tutorial.
WebUI: Improve trial log rendering performance.
WebUI: Fix an issue where cluster utilization was displayed incorrectly.
WebUI: Fix an issue where active experiments and commands would not appear on the dashboard.
WebUI: Fix an issue where having telemetry enabled with an invalid key would cause the WebUI to render incorrectly.
Version 0.12.10
Release Date: June 26, 2020
Improvements
WebUI: Add a dedicated page for master logs at /det/logs.
WebUI: Provide a Swagger UI for exploring the Determined REST API. This can be accessed via the API link on the WebUI.
WebUI: Default the Experiments view list length to 25 entries. More entries can be shown as needed.
WebUI: Improve detection of situations where the WebUI version doesn’t match the master version as a result of browser caching.
CLI: Improve performance when retrieving trial logs.
CLI: Add the det user rename command for administrators to change the username of existing users.
Expand documentation on Using Checkpoints by including checkpoint metadata management.
Reorganize examples by splitting Trial API examples into separate folders.
Bug Fixes
Allow det-deploy local agent-up to work with remote masters.
Ensure network failures during checkpoint upload do not unrecoverably break the associated trial.
Ensure shared_fs checkpoint storage is usable for non-root containers for some host_path values.
Fix a timeout issue that affected large (40+ machines) distributed experiments.
Ensure the CLI can make secure connections to the master.
Fix an issue that affected multi-GPU in PyTorchTrial with mixed precision enabled.
Add a timeout to trial containers to ensure they are terminated promptly.
Version 0.12.7
Release Date: June 11, 2020
Breaking Change: Gradient clipping for PyTorchTrial should now be specified with a determined.pytorch.PyTorchCallback, through its on_before_optimizer_step() method, instead of in the experiment configuration. Determined provides two built-in callbacks for gradient clipping: determined.pytorch.ClipGradsL2Norm and determined.pytorch.ClipGradsL2Value.
Add a metadata field to checkpoints. Checkpoints can now have arbitrary key-value pairs associated with them. Metadata can be added, queried, and removed via a Python API. See the documentation for details.
Add support for Keras callbacks that stop training early, including the official EarlyStopping callback. When a stop is requested, Determined will finish the current training (or validation) step, checkpoint, and terminate the trial.
Add support for Estimator callbacks that stop training early, including the official stop_if_no_decrease_hook. When a stop is requested, Determined will finish the current training (or validation) step, checkpoint, and terminate the trial.
Add support for model code that stops training of a trial programmatically.
We recommend using the official Keras callbacks or Estimator hooks if you are using those frameworks. For PyTorch, you can request that training be stopped by calling
set_stop_requested()
from a PyTorch callback. When a stop is requested, Determined will finish the current training or validation step, checkpoint, and terminate the trial. Trials that are stopped early are considered to be “completed” (e.g., in the WebUI and CLI).
More robust error handling for hyperparameter searches where one of the trials in the search encounters a persistent error.
Determined will automatically restart the execution of trials that fail within an experiment, up to max_restarts failures. After this point, any trials that fail are marked as “errored” but the hyperparameter search itself is allowed to continue running. This is particularly useful when some parts of the hyperparameter space result in models that cannot be trained successfully (e.g., the search explores a range of batch sizes and some of those batch sizes cause GPU OOM errors). An experiment can complete successfully as long as at least one of the trials within it completes successfully.
Support multi-GPU training for TensorFlow 2 models that use IndexedSlices for model parameters.
NaN values in training and validation metrics are now treated as errors.
This will result in restarting the trial from the most recent checkpoint if it has been restarted fewer than max_restarts times. Previously, NaN values were converted to the maximum floating point value.
Preserve the last used user name on the log-in page.
Add on_trial_close method to determined.estimator.RunHook. Use this for post-trial cleanup.
Finalize gradient communication prior to applying gradient clipping in PyTorchTrial when performing multi-GPU training.
WebUI: Add pause, activate, and cancel actions to dashboard tasks.
Add a det-nobody user (with UID 65533) to default images. This provides an out-of-the-box option for running non-privileged containers with a working home directory.
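The restart and NaN handling described in the notes for this version can be summarized in a short simulation. This is a simplified sketch, not Determined's implementation: run_trial and experiment_succeeded are hypothetical helpers, each attempt is modeled as a function call rather than a resume from checkpoint, and any exception is treated as a trial failure.

```python
import math
from typing import Callable, List


def run_trial(attempt_fn: Callable[[int], float], max_restarts: int) -> str:
    """Run one trial, restarting after failures; return 'completed' or 'errored'."""
    restarts = 0
    while True:
        try:
            metric = attempt_fn(restarts)
            if math.isnan(metric):
                # NaN metrics are treated as errors rather than clamped.
                raise ValueError("NaN metric")
            return "completed"
        except Exception:
            # Broad catch for illustration: any failure counts against the trial.
            if restarts >= max_restarts:
                return "errored"
            restarts += 1


def experiment_succeeded(trial_states: List[str]) -> bool:
    """An experiment succeeds if at least one trial completed."""
    return any(state == "completed" for state in trial_states)


def always_oom(attempt: int) -> float:
    raise RuntimeError("GPU out of memory")


def flaky(attempt: int) -> float:
    if attempt == 0:
        raise RuntimeError("transient failure")
    return 0.25


def always_nan(attempt: int) -> float:
    return float("nan")


states = [run_trial(fn, max_restarts=2) for fn in (always_oom, flaky, always_nan)]
print(states)
print("experiment succeeded:", experiment_succeeded(states))
```

The search keeps going after the first and third trials are marked errored, which is what lets it explore hyperparameter regions that cannot be trained.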
Version 0.12.5
Release Date: May 27, 2020
Breaking Change: Alter command-line options for controlling test mode and local mode. Test experiments on the cluster were previously created with det e create --test-mode ... but should now be created with det e create --test ... instead. Local testing is started with det e create --test --local ... and fully local training (meaning --local without --test) is not yet supported.
Add support for TensorFlow 2.2.
Add support for post-checkpoint callbacks in PyTorchTrial.
Add support for checkpoint hooks in EstimatorTrial.
Add support for TensorBoard backed by S3-compliant APIs that are not AWS S3.
Add generic callback support for PyTorch.
TensorBoards now shut down after 10 minutes if metrics are unavailable.
Update to NCCL 2.6.4 for distributed training.
Update minimum required task environment version to 0.4.0.
Fix Native API training one step rather than one batch when using TensorFlow Keras and Estimator.
CLI: Add support for producing CSV and JSON output to det slot list and det agent list.
CLI: Include the number of containers on each agent in the output of det agent list.
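The renamed test-mode flags in this release can be summarized with a short sketch. `create_experiment_argv` is a hypothetical helper, not part of the Determined CLI; it assumes that the elided arguments in the commands above are an experiment configuration file followed by a model definition directory.

```python
def create_experiment_argv(config_path, model_dir, test=False, local=False):
    """Build an argument list for `det e create` using the 0.12.5 flag names."""
    if local and not test:
        # Per the release note, fully local training is not yet supported.
        raise ValueError("--local without --test is not supported in this release")
    argv = ["det", "e", "create"]
    if test:
        argv.append("--test")  # replaces the old --test-mode flag
    if local:
        argv.append("--local")
    # Assumption: the trailing positional arguments are config file and model dir.
    argv += [config_path, model_dir]
    return argv
```

The resulting list could be handed to `subprocess.run(argv, check=True)` on a machine where the `det` CLI is installed.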
Version 0.12.4¶
Release Date: May 14, 2020
Breaking Change: Users are no longer automatically logged in as the “determined” user. Refer to Users for more details.
Support multi-slot notebooks. The number of slots per notebook cannot exceed the size of the largest available agent. The number of slots to use for a notebook task can be configured when the notebook is launched:
det notebook start --config resources.slots=2
Support fetching the configuration of a running master via the CLI (det master config).
Authentication sessions now expire after 7 days.
Improve log messages for tf.keras trial callbacks.
Add nvidia-container-toolkit support.
Fix an error in the experimental bert_glue_pytorch example.
The tf.keras examples for the Native and Trial APIs now refer to the same model.
Add a topic guide explaining Determined’s approach to Elastic Infrastructure.
Add a topic guide explaining the Native API (since deprecated).
UI: The Determined favicon acquires a small dot when any slots are in use.
UI: Fix an issue with command sorting in the WebUI.
UI: Fix an issue with badges appearing as the wrong color.
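The slot limit for multi-slot notebooks noted in this release can be checked before launching. The sketch below is illustrative only: `notebook_start_argv` is a hypothetical helper, and the list of per-agent slot counts is assumed to come from wherever the caller tracks cluster capacity.

```python
def notebook_start_argv(slots, agent_slot_counts):
    """Build a `det notebook start` command for a multi-slot notebook.

    Rejects requests larger than the largest available agent, since a
    notebook cannot span more slots than a single agent provides.
    """
    largest = max(agent_slot_counts, default=0)
    if slots > largest:
        raise ValueError(
            f"requested {slots} slots, but the largest agent has only {largest}"
        )
    return ["det", "notebook", "start", "--config", f"resources.slots={slots}"]
```

With agents of 4 and 8 slots, a request for 2 slots yields the same command shown in the note above, while a request for 16 slots is rejected.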
Version 0.12.3¶
Release Date: April 27, 2020
Add a tutorial for the new (experimental) Native API.
Add support for locally testing experiments via det e create --local.
Add determined.experimental.Determined class for accessing ExperimentReference, TrialReference, and Checkpoint objects.
TensorBoard logs now appear under the storage_path for shared_fs checkpoint configurations.
Allow commands, notebooks, shells, and TensorBoards to be killed before they are scheduled.
Print container exit reason in trial logs.
Choose a better default for the --tail option of command logs.
Add REST API endpoints for trials.
Support the execution of a startup script inside the agent Docker container.
Master and agent Docker containers will have the ‘unless-stopped’ restart policy by default when using det-deploy local.
Prevent the det trial logs -f command from waiting for too long after the trial being watched reaches a terminal state.
Fix bug where logs disappear when an image is pulled.
Fix bug that affected the use of LRScheduler in PyTorchTrial for multi-GPU training.
Fix bug after master restart where some errored experiments would show progress indicators.
Fix ordering of steps from det trial describe --json.
Docs: Add topic guide for effective distributed training.
Docs: Reorganize install documentation.
UI: Move the authenticated user to the top of the users list filter on the dashboard, right after “All”.
Version 0.12.2¶
Release Date: April 21, 2020
Breaking Changes
Rename PEDL to Determined. The canonical way to import it is via import determined as det.
Reorganize source code. The frameworks module was removed, and each framework’s submodules were collapsed into the main framework module. For example:
det.frameworks.pytorch.pytorch_trial.PyTorchTrial is now det.pytorch.PyTorchTrial
det.frameworks.pytorch.data.DataLoader is now det.pytorch.DataLoader
det.frameworks.pytorch.checkpoint.load is now det.pytorch.load
det.frameworks.pytorch.util.reset_parameters is now det.pytorch.reset_parameters
det.frameworks.keras.tf_keras_trial.TFKerasTrial is now det.keras.TFKerasTrial
det.frameworks.tensorflow.estimator_trial.EstimatorTrial is now det.estimator.EstimatorTrial
det.frameworks.tensorpack.tensorpack_trial is now det.tensorpack.TensorpackTrial
det.frameworks.util and det.frameworks.pytorch.util have been removed entirely
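When migrating existing model code, the renames listed above can be applied mechanically. The following is a rough sketch, not an official migration tool: `RENAMES` copies only the entries from the list whose old and new names are both fully spelled out, `migrate_source` is a hypothetical helper, and it assumes the code refers to the package through the `det` alias.

```python
import re

# Old dotted path -> new dotted path, copied from the list above.
RENAMES = {
    "det.frameworks.pytorch.pytorch_trial.PyTorchTrial": "det.pytorch.PyTorchTrial",
    "det.frameworks.pytorch.data.DataLoader": "det.pytorch.DataLoader",
    "det.frameworks.pytorch.checkpoint.load": "det.pytorch.load",
    "det.frameworks.pytorch.util.reset_parameters": "det.pytorch.reset_parameters",
    "det.frameworks.keras.tf_keras_trial.TFKerasTrial": "det.keras.TFKerasTrial",
    "det.frameworks.tensorflow.estimator_trial.EstimatorTrial": "det.estimator.EstimatorTrial",
}


def migrate_source(text):
    """Rewrite old dotted references in source text to their new locations."""
    # Handle longer names first so a shorter name never clobbers a longer one.
    for old in sorted(RENAMES, key=len, reverse=True):
        # Only match whole dotted names, not fragments of longer identifiers.
        pattern = r"(?<![\w.])" + re.escape(old) + r"(?!\w)"
        text = re.sub(pattern, RENAMES[old], text)
    return text
```

References to the removed `det.frameworks.util` and `det.frameworks.pytorch.util` modules have no replacement in the list and would still need to be handled by hand.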
Unify all plugin functions under the Trial class. make_data_loaders has been split into two functions that should be implemented as part of the Trial class. For example, PyTorchTrial data loaders should now be implemented in build_training_data_loader() and build_validation_data_loader() in the trial definition. Please see updated examples and documentation for changes in each framework.
Trial classes are now required to define a constructor function. The signature of the constructor function is:
def __init__(self, context) -> None: ...
where context is an instance of the new det.TrialContext class. This new object is the primary mechanism for querying information about the system. Some of its methods include:
get_hparam(name): get a hyperparameter by name
get_trial_id(): get the trial ID being trained
get_experiment_config(): get the experiment config for this experiment
get_per_slot_batch_size(): get the batch size appropriate for training (which will be adjusted from the global_batch_size hyperparameter in distributed training experiments)
get_global_batch_size(): get the effective batch size (which differs from per-slot batch size in distributed training experiments)
distributed.get_rank(): get the unique process rank (one process per slot)
distributed.get_local_rank(): get a unique process rank within the agent
distributed.get_size(): get the number of slots
distributed.get_num_agents(): get the number of agents (machines) being used
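To make the constructor requirement concrete, here is a minimal sketch that runs without Determined installed. `FakeTrialContext` and `FakeDistributed` are hypothetical stand-ins that imitate a few of the methods listed above; they are not the real det.TrialContext. The even split of the global batch across slots is an assumption made for the example.

```python
class FakeDistributed:
    """Hypothetical stand-in for the context's distributed helper."""

    def __init__(self, rank, size):
        self._rank = rank
        self._size = size

    def get_rank(self):
        return self._rank

    def get_size(self):
        return self._size


class FakeTrialContext:
    """Hypothetical stand-in exposing a few of the methods listed above."""

    def __init__(self, hparams, rank=0, size=1):
        self._hparams = hparams
        self.distributed = FakeDistributed(rank, size)

    def get_hparam(self, name):
        return self._hparams[name]

    def get_global_batch_size(self):
        return self._hparams["global_batch_size"]

    def get_per_slot_batch_size(self):
        # Assumption: the global batch is divided evenly across slots.
        return self.get_global_batch_size() // self.distributed.get_size()


class MyTrial:
    """A trial-like class using the required constructor signature."""

    def __init__(self, context) -> None:
        self.context = context
        self.learning_rate = context.get_hparam("learning_rate")
        # Size data loaders with the per-slot value, not global_batch_size.
        self.batch_size = context.get_per_slot_batch_size()
```

Under these assumptions, a `global_batch_size` of 64 on 8 slots gives each process a batch size of 8, which is the value a data loader in the trial would use.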
The global_batch_size hyperparameter is required (that is, a hyperparameter with this name must be specified in the configuration of every experiment). Previously, the hyperparameter batch_size was required and was manipulated automatically for distributed training. Now global_batch_size will not be manipulated; users should train based on context.get_per_slot_batch_size(). See Distributed Training for more context.
Remove download_data(). If users wish to download data at runtime, they should make sure that each process (one process per slot) downloads to a unique location. This can be accomplished by appending context.distributed.get_rank() to the download path.
Remove det.trial_controller.util.get_rank() and det.trial_controller.util.get_container_gpus(). Use context.distributed.get_rank() and context.distributed.get_num_agents() instead.
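The advice on downloading data at runtime, with each process writing to its own location, might look like the following sketch. `per_rank_download_dir` is a hypothetical helper and the `rank-N` directory naming is an arbitrary choice; only the idea of appending the process rank to the download path comes from the note above.

```python
import os


def per_rank_download_dir(base_dir, rank):
    """Return (and create) a download directory unique to one process.

    With one process per slot, appending the rank keeps concurrent
    downloads on the same machine from overwriting each other.
    """
    path = os.path.join(base_dir, f"rank-{rank}")
    os.makedirs(path, exist_ok=True)
    return path
```

Inside a trial, the rank argument would typically be the value returned by the context's distributed rank method, read once in the constructor.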
General Improvements
tf.data.Dataset is now supported as input for all versions of TensorFlow (1.14, 1.15, 2.0, 2.1) for TFKerasTrial and EstimatorTrial. Please note that Determined currently does not support checkpointing tf.data.Dataset inputs. Therefore, when resuming training, it resumes from the start of the dataset. Model weights are loaded correctly as always.
TFKerasTrial now supports five different types of inputs:
A tuple (x_train, y_train) of NumPy arrays. x_train must be a NumPy array (or array-like), a list of arrays (in case the model has multiple inputs), or a dict mapping input names to the corresponding array, if the model has named inputs. y_train should be a NumPy array.
A tuple (x_train, y_train, sample_weights) of NumPy arrays.
A tf.data.Dataset returning a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
A keras.utils.Sequence returning a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
A det.keras.SequenceAdapter returning a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
PyTorch trial checkpoints no longer save in MLflow’s MLmodel format.
The det trial download command now accepts -o to save a checkpoint to a specific path. PyTorch checkpoints can then be loaded from a specified local filesystem path.
Allow the agent to read configuration values from a YAML file.
Include experiment ID in the downloaded trial logs.
Display checkpoint storage location in the checkpoint info modal for trials and experiments.
Preserve recent tasks’ filter preferences in the WebUI.
Add task name to det slot list command output.
Model definitions are now downloaded as compressed tarfiles (.tar.gz) instead of zipfiles (.zip).
startup-hook.sh is now executed in the same directory as the model definition.
Rename projects to examples in the Determined repository.
Improve documentation:
Add documentation page on the lifecycle of an experiment.
Add how-to and topic guides for multi-GPU (both for single-machine parallel and multi-machine) training.
Add a topic guide on best practices for writing model definitions.
Fix bug that occasionally caused multi-machine training to hang on initialization.
Fix bug that prevented TensorpackTrial from successfully loading checkpoints.
Fix a bug in TFKerasTrial where runtime errors could cause the trial to hang or would silently drop the stack trace produced by Keras.
Fix trial lifecycle bugs for containers that exit during the pulling phase.
Fix bug that led to some distributed trials timing out.
Fix bug that caused tf.keras trials to fail in the multi-GPU setting when using an optimizer specified by its name.
Fix bug in the CLI for downloading model definitions.
Fix performance issues for experiments with very large numbers of trials.
Optimize performance for scheduling large hyperparameter searches.
Add configuration for telemetry in master.yaml.
Add a utility function for initializing a trial class for development (det.create_trial_instance).
Add security.txt.
Add det.estimator.load() to load TensorFlow Estimator saved_model checkpoints into memory.
Ensure AWS EC2 keypair exists in account before creating the CloudFormation stack.
Add support for gradient aggregation in Keras trials for TensorFlow 2.1.
Add TrialReference and Checkpoint experimental APIs for exporting and loading checkpoints.
Improve performance when starting many tasks simultaneously.
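Since model definitions are now downloaded as .tar.gz archives rather than zipfiles, scripts that unpacked them with a zip tool need a small change. The sketch below uses only Python's standard tarfile module; `unpack_model_definition` is a hypothetical helper, and the archive path is whatever location the download was saved to.

```python
import tarfile


def unpack_model_definition(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive and return its member names."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        # Only extract archives obtained from a trusted source.
        archive.extractall(dest_dir)
    return names
```

The returned names give a quick way to confirm that the expected model definition files are present after extraction.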
Web Improvements
Improve discoverability of dashboard actions.
Add dropdown action menu for killing and archiving recent tasks on the dashboard.
Add telemetry for web interactions.
Fix an issue around cluster utilization status showing as “No Agent” for a brief moment during initial load.
Add Ace editor to attributions list.
Set UI preferences based on the logged-in user.
Fix an issue where the indicated user filter was not applied to the displayed tasks.
Improve error messaging for failed actions.