Release Notes
Version 0.12
Version 0.12.2
Release Date: April 21, 2020
Breaking Changes
Rename PEDL to Determined. The canonical way to import it is via import determined as det.
Reorganize source code. The frameworks module was removed, and each framework’s submodules were collapsed into the main framework module. For example:
det.frameworks.pytorch.pytorch_trial.PyTorchTrial is now det.pytorch.PyTorchTrial
det.frameworks.pytorch.data.DataLoader is now det.pytorch.DataLoader
det.frameworks.pytorch.checkpoint.load is now det.pytorch.load
det.frameworks.pytorch.util.reset_parameters is now det.pytorch.reset_parameters
det.frameworks.keras.tf_keras_trial.TFKerasTrial is now det.keras.TFKerasTrial
det.frameworks.tensorflow.estimator_trial.EstimatorTrial is now det.estimator.EstimatorTrial
det.frameworks.tensorpack.tensorpack_trial is now det.tensorpack.TensorPackTrial
det.frameworks.util and det.frameworks.pytorch.util have been removed entirely
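The renames above can be applied mechanically when updating older model code. The following is a minimal sketch and not part of Determined: the names RENAMES and migrate_path are hypothetical, and the table holds only a subset of the old-to-new pairs listed in these notes.

```python
# Hypothetical migration helper (not a Determined API).
# Only the old -> new pairs themselves are taken from the release notes.
RENAMES = {
    "det.frameworks.pytorch.pytorch_trial.PyTorchTrial": "det.pytorch.PyTorchTrial",
    "det.frameworks.pytorch.data.DataLoader": "det.pytorch.DataLoader",
    "det.frameworks.pytorch.checkpoint.load": "det.pytorch.load",
    "det.frameworks.pytorch.util.reset_parameters": "det.pytorch.reset_parameters",
    "det.frameworks.keras.tf_keras_trial.TFKerasTrial": "det.keras.TFKerasTrial",
    "det.frameworks.tensorflow.estimator_trial.EstimatorTrial": "det.estimator.EstimatorTrial",
}


def migrate_path(old_path: str) -> str:
    """Return the new dotted path for a name that predates the reorganization."""
    if old_path in RENAMES:
        return RENAMES[old_path]
    if old_path.startswith("det.frameworks."):
        # The frameworks module no longer exists, so an unlisted name under it
        # needs manual attention (it may have been removed entirely).
        raise KeyError(f"no listed replacement for {old_path}")
    # Names outside det.frameworks are returned unchanged.
    return old_path


print(migrate_path("det.frameworks.pytorch.data.DataLoader"))
```

Raising on unlisted det.frameworks names, rather than guessing a new location, keeps removed modules such as det.frameworks.util from being silently rewritten.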
Unify all plugin functions under the Trial class. make_data_loaders has been moved to two functions that should be implemented as part of the Trial class. For example, PyTorchTrial data loaders should now be implemented in build_training_data_loader() and build_validation_data_loader() in the trial definition. Please see updated examples and documentation for changes in each framework.
Trial classes are now required to define a constructor function. The signature of the constructor function is:
def __init__(self, context) -> None:
where context is an instance of the new det.TrialContext class. This new object is the primary mechanism for querying information about the system. Some of its methods include:
get_hparam(name): get a hyperparameter by name
get_trial_id(): get the trial ID being trained
get_experiment_config(): get the experiment config for this experiment
get_per_slot_batch_size(): get the batch size appropriate for training (which will be adjusted from the global_batch_size hyperparameter in distributed training experiments)
get_global_batch_size(): get the effective batch size (which differs from per-slot batch size in distributed training experiments)
distributed.get_rank(): get the unique process rank (one process per slot)
distributed.get_local_rank(): get a unique process rank within the agent
distributed.get_size(): get the number of slots
distributed.get_num_agents(): get the number of agents (machines) being used
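A trial constructor written against this interface might look like the sketch below. To stay runnable without a cluster it uses FakeContext, a hypothetical stand-in that mimics a few of the methods listed above; it is not det.TrialContext. The class name MyTrial, the hyperparameter learning_rate, and the even split of the global batch size across slots are likewise assumptions made for illustration. A real trial would subclass a framework trial class such as det.pytorch.PyTorchTrial.

```python
from typing import Any, Dict


class FakeContext:
    """Hypothetical stand-in for det.TrialContext, for illustration only."""

    def __init__(self, hparams: Dict[str, Any], num_slots: int = 1) -> None:
        self._hparams = hparams
        self._num_slots = num_slots

    def get_hparam(self, name: str) -> Any:
        return self._hparams[name]

    def get_global_batch_size(self) -> int:
        return self._hparams["global_batch_size"]

    def get_per_slot_batch_size(self) -> int:
        # Assumption: the global batch size is divided evenly across slots.
        return self._hparams["global_batch_size"] // self._num_slots


class MyTrial:
    """Illustrative trial with the constructor signature required by 0.12.2."""

    def __init__(self, context) -> None:
        # Keep the context so later methods can query it as well.
        self.context = context
        self.learning_rate = context.get_hparam("learning_rate")
        self.batch_size = context.get_per_slot_batch_size()


hparams = {"learning_rate": 0.01, "global_batch_size": 64}
trial = MyTrial(FakeContext(hparams, num_slots=4))
print(trial.learning_rate, trial.batch_size)
```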
The global_batch_size hyperparameter is required (that is, a hyperparameter with this name must be specified in the configuration of every experiment). Previously, the hyperparameter batch_size was required and was manipulated automatically for distributed training. Now global_batch_size will not be manipulated; users should train based on context.get_per_slot_batch_size(). See Distributed and Parallel Training for more context.
Remove download_data(). If users wish to download data at runtime, they should make sure that each process (one process per slot) downloads to a unique location. This can be accomplished by appending context.distributed.get_rank() to the download path.
Remove det.trial_controller.util.get_rank() and det.trial_controller.util.get_container_gpus(). Use context.distributed.get_rank() and context.distributed.get_num_agents() instead.
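One way to give each process its own download location is to derive the path from the process rank. The sketch below is an illustration under stated assumptions: per_rank_download_dir is a hypothetical helper, the rank-N directory naming is arbitrary, and the rank argument stands for the value a trial would obtain from context.distributed.get_rank().

```python
import os
import tempfile


def per_rank_download_dir(base_dir: str, rank: int) -> str:
    """Create and return a download directory unique to one process rank."""
    path = os.path.join(base_dir, f"rank-{rank}")
    # exist_ok=True makes the call safe to repeat, e.g. after a trial restart.
    os.makedirs(path, exist_ok=True)
    return path


# Simulate four processes (one per slot) sharing the same base directory.
base = tempfile.mkdtemp()
dirs = [per_rank_download_dir(base, rank) for rank in range(4)]
print(len(set(dirs)))
```

Because every rank writes under a different subdirectory, concurrent downloads on the same machine do not overwrite each other's files.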
General Improvements
tf.data.Dataset is now supported as input for all versions of TensorFlow (1.14, 1.15, 2.0, 2.1) for TFKerasTrial and EstimatorTrial. Please note that Determined currently does not support checkpointing tf.data.Dataset inputs. Therefore, when training is resumed, it restarts from the beginning of the dataset. Model weights are loaded correctly as always.
TFKerasTrial now supports five different types of inputs:
A tuple (x_train, y_train) of NumPy arrays. x_train must be a NumPy array (or array-like), a list of arrays (in case the model has multiple inputs), or a dict mapping input names to the corresponding array, if the model has named inputs. y_train should be a NumPy array.
A tuple (x_train, y_train, sample_weights) of NumPy arrays.
A tf.data.Dataset returning a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
A keras.utils.Sequence returning a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
A det.keras.SequenceAdapter returning a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
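Several of the input types above yield tuples with either two or three elements. Code that has to handle both shapes could normalize them as in this sketch. unpack_batch is a hypothetical helper, not a Determined or Keras function, and plain Python lists stand in for arrays.

```python
from typing import Any, Optional, Tuple


def unpack_batch(batch: Tuple[Any, ...]) -> Tuple[Any, Any, Optional[Any]]:
    """Normalize a batch to (inputs, targets, sample_weights).

    sample_weights is None when the batch carries only inputs and targets.
    """
    if len(batch) == 2:
        inputs, targets = batch
        return inputs, targets, None
    if len(batch) == 3:
        inputs, targets, sample_weights = batch
        return inputs, targets, sample_weights
    raise ValueError(f"expected a tuple of 2 or 3 elements, got {len(batch)}")


# A two-element batch and a three-element batch, using lists as placeholders.
plain = unpack_batch(([1.0, 2.0], [0, 1]))
weighted = unpack_batch(([1.0, 2.0], [0, 1], [0.5, 1.5]))
```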
PyTorch trial checkpoints no longer save in MLflow’s MLmodel format.
The det trial download command now accepts -o to save a checkpoint to a specific path. PyTorch checkpoints can then be loaded from a specified local filesystem path.
Allow the agent to read configuration values from a YAML file.
Include experiment ID in the downloaded trial logs.
Display checkpoint storage location in the checkpoint info modal for trials and experiments.
Preserve recent tasks’ filter preferences in the WebUI.
Add task name to det slot list command output.
Model definitions are now downloaded as compressed tarfiles (.tar.gz) instead of zipfiles (.zip).
startup-hook.sh is now executed in the same directory as the model definition.
Rename projects to examples in the Determined repository.
Improve documentation:
Add documentation page on the lifecycle of an experiment.
Add how-to and topic guides for multi-GPU (both for single-machine parallel and multi-machine) training.
Add a topic guide on best practices for writing model definitions.
Fix bug that occasionally caused multi-machine training to hang on initialization.
Fix bug that prevented TensorpackTrial from successfully loading checkpoints.
Fix a bug in TFKerasTrial where runtime errors could cause the trial to hang or would silently drop the stack trace produced by Keras.
Fix trial lifecycle bugs for containers that exit during the pulling phase.
Fix bug that led to some distributed trials timing out.
Fix bug that caused tf.keras trials to fail in the multi-GPU setting when using an optimizer specified by its name.
Fix bug in the CLI for downloading model definitions.
Fix performance issues for experiments with very large numbers of trials.
Optimize performance for scheduling large hyperparameter searches.
Add configuration for telemetry in master.yaml.
Add a utility function for initializing a trial class for development (det.create_trial_instance).
Add security.txt.
Add det.estimator.load() to load TensorFlow Estimator saved_model checkpoints into memory.
Ensure AWS EC2 keypair exists in account before creating the CloudFormation stack.
Add support for gradient aggregation in Keras trials for TensorFlow 2.1.
Add TrialReference and Checkpoint experimental APIs for exporting and loading checkpoints.
Improve performance when starting many tasks simultaneously.
Web Improvements
Improve discoverability of dashboard actions.
Add dropdown action menu for killing and archiving recent tasks on the dashboard.
Add telemetry for web interactions.
Fix an issue around cluster utilization status showing as “No Agent” for a brief moment during initial load.
Add Ace editor to attributions list.
Set UI preferences based on the logged-in user.
Fix an issue where the indicated user filter was not applied to the displayed tasks.
Improve error messaging for failed actions.