Release Notes¶
Version 0.12¶
Version 0.12.12¶
Release Date: July 22, 2020
Improvements
Remove support for
on_train_step_begin
andon_train_step_end
, deprecateon_validation_step_end
, and introduce new callbackon_validation_end
with same functionality. Add helper methodsis_epoch_start
andis_epoch_end
to PyTorch context.Add a new API to support custom reducers in
EstimatorTrial
. See :ref:estimator-trial
for details.CLI: Add the
register_version
command for registering a new version of a model.CLI: Add a
--head
option when printing trial logs.WebUI: Make it possible to launch TensorBoard from experiment dashboard cards.
Bug Fixes
Fix distributed training and Determined shell with non-root containers. The default task environments now include a user plugin to support running containers with arbitrary non-root users. Custom images based on the latest default task environments should also work.
Fix convergence issue for TF 2 multi-GPU models. Change default TF1 version from 1.14 to 1.15.
Fix issue affecting TensorFlow TensorBoard outputs.
Use local log line IDs for trial logs.
CLI: Improve the CLI’s custom TLS certificate handling with non-self-signed certs.
WebUI: Fix a parsing problem with task start times.
WebUI: Fix log viewer timestamp copy/paste.
Known Issues
WebUI: Older trial logs are not loaded by scrolling to the top of the page.
Version 0.12.11¶
Release Date: July 8, 2020
Add logging to console in test mode for the Native API when using
determined.experimental.create
.Improve reliability of saving checkpoints to GCS in the presence of transient network errors.
Add an example using TensorFlow’s Image Segmentation via UNet tutorial.
WebUI: Improve trial log rendering performance.
WebUI: Fix an issue where cluster utilization was displayed incorrectly.
WebUI: Fix an issue where active experiments and commands would not appear on the dashboard.
WebUI: Fix an issue where having telemetry enabled with an invalid key would cause the WebUI to render incorrectly.
Version 0.12.10¶
Release Date: June 26, 2020
Improvements
WebUI: Add a dedicated page for master logs at
/det/logs
.WebUI: Provide a Swagger UI for exploring the Determined REST API. This can be accessed via the API link on the WebUI.
WebUI: Default the Experiments view list length to 25 entries. More entries can be shown as needed.
WebUI: Improve detection of situations where the WebUI version doesn’t match the master version as a result of browser caching.
CLI: Improve performance when retrieving trial logs.
CLI: Add the
det user rename
command for administrators to change the username of existing users.Expand documentation on Using Checkpoints by including checkpoint metadata management.
Reorganize examples by splitting Native API and Trial API examples into separate folders.
Bug Fixes
Allow
det-deploy local agent-up
to work with remote masters.Ensure network failures during checkpoint upload do not unrecoverably break the associated trial.
Ensure
shared_fs
checkpoint storage is usable for non-root containers for somehost_path
values.Fix a timeout issue that affected large (40+ machines) distributed experiments.
Ensure the CLI can make secure connections to the master.
Fix an issue that affected multi-GPU in
PyTorchTrial
with mixed precision enabled.Add a timeout to trial containers to ensure they are terminated promptly.
Version 0.12.7¶
Release Date: June 11, 2020
Breaking Change: Gradient clipping for PyTorchTrial should now be specified via
determined.pytorch.PyTorchCallback
via theon_before_optimizer_step()
method instead of being specified via the experiment configuration. Determined provides two built-in callbacks for gradient clipping:determined.pytorch.ClipGradsL2Norm
anddetermined.pytorch.ClipGradsL2Value
.Add a
metadata
field to checkpoints. Checkpoints can now have arbitrary key-value pairs associated with them. Metadata can be added, queried, and removed via aPython API
. See the documentation for details.Add support for Keras callbacks that stop training early, including the official EarlyStopping callback. When a stop is requested, Determined will finish the training (or validation) step we are in, checkpoint, and terminate the trial.
Add support for Estimator callbacks that stop training early, including the official stop_if_no_decrease_hook. When a stop is requested, Determined will finish the training (or validation) step we are in, checkpoint, and terminate the trial.
Add support for model code that stops training of a trial programmatically.
We recommend using the official Keras callbacks or Estimator hooks if you are using those frameworks. For PyTorch, you can request that training be stopped by calling
set_stop_requested()
from a PyTorch callback. When a stop is requested, Determined will finish the current training or validation step, checkpoint, and terminate the trial. Trials that are stopped early are considered to be “completed” (e.g., in the WebUI and CLI).
More robust error handling for hyperparameter searches where one of the trials in the search encounters a persistent error.
Determined will automatically restart the execution of trials that fail within an experiment, up to
max_restart
failures. After this point, any trials that fail are marked as “errored” but the hyperparameter search itself is allowed to continue running. This is particularly useful when some parts of the hyperparameter space result in models that cannot be trained successfully (e.g., the search explores a range of batch sizes and some of those batch sizes cause GPU OOM errors). An experiment can complete successfully as long as at least one of the trials within it completes successfully.
Support multi-GPU training for TensorFlow 2 models that use
IndexedSlices
for model parameters.NaN
values in training and validation metrics are now treated as errors.This will result in restarting the trial from the most recently checkpoint if it has been restarted fewer than
max_restarts
times. Previously,NaN
values were converted to the maximum floating point value.
Preserve the last used user name on the log-in page.
Add
on_trial_close
method todetermined.estimator.RunHook
. Use this for post-trial cleanup.Finalize gradient communication prior to applying gradient clipping in PyTorchTrial when perfoming multi-GPU training.
WebUI: Add pause, activate, and cancel actions to dashboard tasks.
Add a
det-nobody
user (with UID 65533) to default images. This provides an out-of-the-box option for running non-privileged containers with a working home directory.
Version 0.12.5¶
Release Date: May 27, 2020
Breaking Change: Alter command-line options for controlling test mode and local mode. Test experiments on the cluster were previously created with
det e create --test-mode ...
but now should be created withdet e create --test ...
. Local testing is started withdet e create --test --local ...
. Fully local training (meaning--local
without--test
) is not yet supported.Add support for TensorFlow 2.2.
Add support for post-checkpoint callbacks in
PyTorchTrial
.Add support for checkpoint hooks in
EstimatorTrial
.Add support for TensorBoard backed by S3-compliant APIs that are not AWS S3.
Add generic callback support for PyTorch.
TensorBoards now shut down after 10 minutes if metrics are unavailable.
Update to NCCL 2.6.4 for distributed training.
Update minimum required task environment version to 0.4.0.
Fix Native API training one step rather than one batch when using TensorFlow Keras and Estimator.
CLI: Add support for producing CSV and JSON output to
det slot list
anddet agent list
.CLI: Include the number of containers on each agent in the output of
det agent list
.
Version 0.12.4¶
Release Date: May 14, 2020
Breaking Change: Users are no longer automatically logged in as the “determined” user. Refer to Users for more details.
Support multi-slot notebooks. The number of slots per notebook cannot exceed the size of the largest available agent. The number of slots to use for a notebook task can be configured when the notebook is launched:
det notebook start --config resources.slots=2
Support fetching the configuration of a running master via the CLI (
det master config
).Authentication sessions now expire after 7 days.
Improve log messages for
tf.keras
trial callbacks.Add
nvidia-container-toolkit
support.Fix an error in the experimental
bert_glue_pytorch
example.The
tf.keras
examples for the Native and Trial APIs now refer to the same model.Add a topic guide explaining Determined’s approach to Elastic Infrastructure.
Add a topic guide explaining the Native API.
UI: The Determined favicon acquires a small dot when any slots are in use.
UI: Fix an issue with command sorting in the WebUI.
UI: Fix an issue with badges appearing as the wrong color.
Version 0.12.3¶
Release Date: April 27, 2020
Add a tutorial for the new (experimental) Native API.
Add support for locally testing experiments via
det e create --local
.Add
determined.experimental.Determined
class for accessingExperimentReference
,TrialReference
, andCheckpoint
objects.TensorBoard logs now appear under the
storage_path
forshared_fs
checkpoint configurations.Allow commands, notebooks, shells, and TensorBoards to be killed before they are scheduled.
Print container exit reason in trial logs.
Choose a better default for the
--tail
option of command logs.Add REST API endpoints for trials.
Support the execution of a startup script inside the agent docker container
Master and agent Docker containers will have the ‘unless-stopped’ restart policy by default when using
det-deploy local
.Prevent the
det trial logs -f
command from waiting for too long after the trial being watched reaches a terminal state.Fix bug where logs disappear when an image is pulled.
Fix bug that affected the use of
LRScheduler
inPyTorchTrial
for multi-GPU training.Fix bug after master restart where some errored experiments would show progress indicators.
Fix ordering of steps from
det trial describe --json
.Docs: Added topic guide for effective distributed training.
Docs: Reorganize install documentation.
UI: Move the authenticated user to the top of the users list filter on the dashboard, right after “All”.
Version 0.12.2¶
Release Date: April 21, 2020
Breaking Changes
Rename PEDL to Determined. The canonical way to import it is via
import determined as det
.Reorganize source code. The frameworks module was removed, and each framework’s submodules were collapsed into the main framework module. For example:
det.frameworks.pytorch.pytorch_trial.PyTorchTrial
is nowdet.pytorch.PyTorchTrial
det.frameworks.pytorch.data.DataLoader
is nowdet.pytorch.DataLoader
det.frameworks.pytorch.checkpoint.load
is nowdet.pytorch.load
det.frameworks.pytorch.util.reset_parameters
is nowdet.pytorch.reset_parameters
det.frameworks.keras.tf_keras_trial.TFKerasTrial
is nowdet.keras.TFKerasTrial
det.frameworks.tensorflow.estimator_trial.EstimatorTrial
is nowdet.estimator.EstimatorTrial
det.frameworks.tensorpack.tensorpack_trial
is nowdet.tensorpack.TensorpackTrial
det.frameworks.util
anddet.frameworks.pytorch.util
have been removed entirely
Unify all plugin functions under the Trial class.
make_data_loaders
has been moved to two functions that should be implemented as part of the Trial class. For example,PyTorchTrial
data loaders should now be implemented inbuild_training_data_loader()
andbuild_validation_data_loader()
in the trial definition. Please see updated examples and documentation for changes in each framework.Trial classes are now required to define a constructor function. The signature of the constructor function is:
def __init__(self, context) -> None:
where
context
is an instance of the newdet.TrialContext
class. This new object is the primary mechanism for querying information about the system. Some of its methods include:get_hparam(name)
: get a hyperparameter by nameget_trial_id()
: get the trial ID being trainedget_experiment_config()
: get the experiment config for this experimentget_per_slot_batch_size()
: get the batch size appropriate for training (which will be adjusted from theglobal_batch_size
hyperparameter in distributed training experiments)get_global_batch_size()
: get the effective batch size (which differs from per-slot batch size in distributed training experiments)distributed.get_rank()
: get the unique process rank (one process per slot)distributed.get_local_rank()
: get a unique process rank within the agentdistributed.get_size()
: get the number of slotsdistributed.get_num_agents
: get the number of agents (machines) being used
The
global_batch_size
hyperparameter is required (that is, a hyperparameter with this name must be specified in the configuration of every experiment). Previously, the hyperparameterbatch_size
was required and was manipulated automatically for distributed training. Nowglobal_batch_size
will not be manipulated; users should train based oncontext.get_per_slot_batch_size()
. See Distributed Training for more context.Remove
download_data()
. If users wish to download data at runtime, they should make sure that each process (one process per slot) downloads to a unique location. This can be accomplished by appendingcontext.get_rank()
to the download path.Remove
det.trial_controller.util.get_rank()
anddet.trial_controller.util.get_container_gpus()
. Usecontext.distributed.get_rank()
andcontext.distributed.get_num_agents()
instead.
General Improvements
tf.data.Dataset
is now supported as input for all versions of TensorFlow (1.14, 1.15, 2.0, 2.1) for TFKerasTrial and EstimatorTrial. Please note that Determined currently does not support checkpointingtf.data.Dataset
inputs. Therefore, when resuming training, it resumes from the start of the dataset. Model weights are loaded correctly as always.TFKerasTrial
now supports five different types of inputs:A tuple
(x_train, y_train)
of NumPy arrays.x_train
must be a NumPy array (or array-like), a list of arrays (in case the model has multiple inputs), or a dict mapping input names to the corresponding array, if the model has named inputs.y_train
should be a NumPy array.A tuple
(x_train, y_train, sample_weights)
of NumPy arrays.A tf.data.Dataset returning a tuple of either
(inputs, targets)
or(inputs, targets, sample_weights)
.A keras.utils.Sequence returning a tuple of either
(inputs, targets)
or(inputs, targets, sample weights)
.A
det.keras.SequenceAdapter
returning a tuple of either(inputs, targets)
or(inputs, targets, sample weights)
.
PyTorch trial checkpoints no longer save in MLflow’s MLmodel format.
The
det trial download
command now accepts-o
to save a checkpoint to a specific path. PyTorch checkpoints can then be loaded from a specified local filesystem path.Allow the agent to read configuration values from a YAML file.
Include experiment ID in the downloaded trial logs.
Display checkpoint storage location in the checkpoint info modal for trials and experiments.
Preserve recent tasks’ filter preferences in the WebUI.
Add task name to
det slot list
command output.Model definitions are now downloaded as compressed tarfiles (.tar.gz) instead of zipfiles (.zip).
startup-hook.sh
is now executed in the same directory as the model definition.Rename
projects
toexamples
in the Determined repository.Improve documentation:
Add documentation page on the lifecycle of an experiment.
Add how-to and topic guides for multi-GPU (both for single-machine parallel and multi-machine) training.
Add a topic guide on best practices for writing model definitions.
Fix bug that occasionally caused multi-machine training to hang on initialization.
Fix bug that prevented
TensorpackTrial
from successfully loading checkpoints.Fix a bug in
TFKerasTrial
where runtime errors could cause the trial to hang or would silently drop the stack trace produced by Keras.Fix trial lifecycle bugs for containers that exit during the pulling phase.
Fix bug that led to some distributed trials timing out.
Fix bug that caused
tf.keras
trials to fail in the multi-GPU setting when using an optimizer specified by its name.Fix bug in the CLI for downloading model definitions.
Fix performance issues for experiments with very large numbers of trials.
Optimize performance for scheduling large hyperparameter searches.
Add configuration for telemetry in
master.yaml
.Add a utility function for initializing a trial class for development (det.create_trial_instance)
Add security.txt.
Add
det.estimator.load()
to load TensorFlow Estimatorsaved_model
checkpoints into memory.Ensure AWS EC2 keypair exists in account before creating the CloudFormation stack.
Add support for gradient aggregation in Keras trials for TensorFlow 2.1.
Add TrialReference and Checkpoint experimental APIs for exporting and loading checkpoints.
Improve performance when starting many tasks simultaneously.
Web Improvements
Improve discoverability of dashboard actions.
Add dropdown action menu for killing and archiving recent tasks on the dashboard.
Add telemetry for web interactions.
Fix an issue around cluster utilization status showing as “No Agent” for a brief moment during initial load.
Add Ace editor to attributions list.
Set UI preferences based on the logged-in user.
Fix an issue where the indicated user filter was not applied to the displayed tasks.
Improve error messaging for failed actions.