determined.NativeContext

The NativeContext provides useful methods for writing tf.keras and tf.estimator experiments using the Native API. Every init() function supported by the Native API returns a subclass of NativeContext:

determined.NativeContext

class determined.NativeContext(env: determined._env_context.EnvContext, hvd_config: determined.horovod.HorovodContext)

The base class that all NativeContexts inherit from when using the Native API.

The context returned by the init() function must inherit from this class.

NativeContext always has a DistributedContext accessible via context.distributed for information related to distributed training.

get_data_config() → Dict[str, Any]

Return the data configuration.

get_experiment_config() → Dict[str, Any]

Return the experiment configuration.

get_experiment_id() → int

Return the experiment ID of the current trial.

get_global_batch_size() → int

Return the global batch size.

get_hparam(name: str) → Any

Return the current value of the hyperparameter with the given name.

get_hparams() → Dict[str, Any]

Return a dictionary of hyperparameter names to values.
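As a rough illustration of how the two hyperparameter accessors relate, the sketch below uses a hypothetical StubContext class (not part of Determined) that mimics only get_hparam() and get_hparams(). The hyperparameter names and values are invented, and the behavior for an unknown name is an assumption of the stub.

```python
from typing import Any, Dict


class StubContext:
    """Hypothetical stand-in that mimics only the two hyperparameter accessors."""

    def __init__(self, hparams: Dict[str, Any]) -> None:
        self._hparams = dict(hparams)

    def get_hparams(self) -> Dict[str, Any]:
        # Return a copy so callers cannot mutate the stub's state.
        return dict(self._hparams)

    def get_hparam(self, name: str) -> Any:
        # Assumption: the stub raises KeyError for an unknown name.
        return self._hparams[name]


# Invented hyperparameters, for illustration only.
context = StubContext({"learning_rate": 0.001, "hidden_units": 128})

learning_rate = context.get_hparam("learning_rate")
all_hparams = context.get_hparams()
```

In a real experiment the context would come from init() rather than being constructed by hand.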

get_per_slot_batch_size() → int

Return the per-slot batch size. When a model is trained with a single GPU, this is equal to the global batch size. When multi-GPU training is used, this is equal to the global batch size divided by the number of GPUs used to train the model.
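The relationship between the two batch sizes can be sketched as plain arithmetic. The helper below is hypothetical and is not Determined's implementation; it assumes the global batch size divides evenly across slots, since the handling of remainders is not described here.

```python
def per_slot_batch_size(global_batch_size: int, num_slots: int) -> int:
    """Illustrative arithmetic only; not Determined's implementation."""
    if num_slots < 1:
        raise ValueError("num_slots must be at least 1")
    if global_batch_size % num_slots != 0:
        # Assumption: this sketch rejects uneven splits instead of guessing at rounding.
        raise ValueError("global batch size must divide evenly across slots")
    return global_batch_size // num_slots


# One GPU: the per-slot batch size equals the global batch size.
single_gpu = per_slot_batch_size(64, 1)

# Eight GPUs: each slot processes one eighth of the global batch.
eight_gpus = per_slot_batch_size(64, 8)
```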

get_stop_requested() → bool

Return whether a trial stoppage has been requested.

get_trial_id() → int

Return the trial ID of the current trial.

set_stop_requested(stop_requested: bool) → None

Set a flag to request a trial stoppage. When this flag is set to True, the trial finishes the current step, saves a checkpoint, and then exits.
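The stop flag can be pictured with a small simulation. StubContext and run_steps below are hypothetical stand-ins written for this sketch; in a real experiment the framework, not user code, acts on the flag, and the loss values are invented.

```python
from typing import Iterable


class StubContext:
    """Hypothetical stand-in exposing only the two stop-flag methods."""

    def __init__(self) -> None:
        self._stop_requested = False

    def get_stop_requested(self) -> bool:
        return self._stop_requested

    def set_stop_requested(self, stop_requested: bool) -> None:
        self._stop_requested = stop_requested


def run_steps(context: StubContext, losses: Iterable[float], target_loss: float) -> int:
    """Simulate steps and request a stop once the loss reaches the target."""
    completed = 0
    for loss in losses:
        if context.get_stop_requested():
            # A stop was requested during an earlier step, so no new step starts.
            break
        completed += 1  # the step in progress always finishes
        if loss <= target_loss:
            context.set_stop_requested(True)
    return completed


context = StubContext()
steps_run = run_steps(context, [0.9, 0.5, 0.2, 0.1], target_loss=0.25)
```

The step that sets the flag still completes; only later steps are skipped, which mirrors the "finish the step, then exit" behavior described above.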

determined.TrialContext.distributed

class determined._train_context.DistributedContext(env: determined._env_context.EnvContext, hvd_config: determined.horovod.HorovodContext)

DistributedContext extends all TrialContexts and NativeContexts under the context.distributed namespace. It provides useful methods for effective distributed training.

get_rank() → int

Return the rank of the process in the trial. The rank of a process is a unique ID within the trial; that is, no two processes in the same trial will be assigned the same rank.

get_local_rank() → int

Return the rank of the process on the agent. The local rank of a process is a unique ID within a given agent and trial; that is, no two processes in the same trial that are executing on the same agent will be assigned the same rank.

get_size() → int

Return the number of slots this trial is running on.

get_num_agents() → int

Return the number of agents this trial is running on.
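To make the difference between rank and local rank concrete, the sketch below enumerates the processes of a hypothetical trial running on 2 agents with 4 slots each. The formula used to assign ranks is an assumption chosen for illustration; the text above only guarantees uniqueness, not a particular numbering.

```python
from typing import List, NamedTuple


class ProcessInfo(NamedTuple):
    agent: int
    local_rank: int
    rank: int


def layout(num_agents: int, slots_per_agent: int) -> List[ProcessInfo]:
    """Enumerate one process per slot. The rank formula is an assumption."""
    return [
        ProcessInfo(agent, local_rank, agent * slots_per_agent + local_rank)
        for agent in range(num_agents)
        for local_rank in range(slots_per_agent)
    ]


processes = layout(num_agents=2, slots_per_agent=4)

# What get_size() would report for this hypothetical trial.
size = len(processes)

# Ranks are unique across the whole trial.
ranks = [p.rank for p in processes]

# Local ranks are unique only within one agent, so they repeat across agents.
local_ranks_on_agent_1 = [p.local_rank for p in processes if p.agent == 1]
```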

determined.keras.TFKerasNativeContext

class determined.keras.TFKerasNativeContext(env: determined._env_context.EnvContext, hvd_config: determined.horovod.HorovodContext)

TFKerasNativeContext always has a DistributedContext accessible via context.distributed for information related to distributed training.

get_data_config() → Dict[str, Any]

Return the data configuration.

get_experiment_config() → Dict[str, Any]

Return the experiment configuration.

get_experiment_id() → int

Return the experiment ID of the current trial.

get_global_batch_size() → int

Return the global batch size.

get_hparam(name: str) → Any

Return the current value of the hyperparameter with the given name.

get_hparams() → Dict[str, Any]

Return a dictionary of hyperparameter names to values.

get_per_slot_batch_size() → int

Return the per-slot batch size. When a model is trained with a single GPU, this is equal to the global batch size. When multi-GPU training is used, this is equal to the global batch size divided by the number of GPUs used to train the model.

get_stop_requested() → bool

Return whether a trial stoppage has been requested.

get_trial_id() → int

Return the trial ID of the current trial.

set_stop_requested(stop_requested: bool) → None

Set a flag to request a trial stoppage. When this flag is set to True, the trial finishes the current step, saves a checkpoint, and then exits.

wrap_dataset(dataset: Any, shard_dataset: bool = True) → Any

This should be used to wrap tf.data.Dataset objects immediately after they have been created. Users should use the output of this wrapper as the new instance of their dataset. If users create multiple datasets (e.g., one for training and one for validation), users should wrap each dataset independently.

Parameters
  • dataset – The tf.data.Dataset to wrap.

  • shard_dataset – When performing multi-slot (distributed) training, this controls whether the dataset is sharded so that each training process (one per slot) sees unique data. If set to False, users must manually configure each process to use unique data.
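When shard_dataset is set to False, each process has to select its own slice of the data. One possible scheme, sketched here on a plain Python list rather than a real tf.data.Dataset, is to keep every size-th example starting at the process's rank. The manual_shard helper is hypothetical; with an actual tf.data.Dataset, the built-in shard(num_shards, index) method serves a comparable purpose.

```python
from typing import List, Sequence


def manual_shard(examples: Sequence[int], rank: int, size: int) -> List[int]:
    """Keep every size-th example, starting at this process's rank."""
    if not 0 <= rank < size:
        raise ValueError("rank must be in [0, size)")
    return list(examples[rank::size])


examples = list(range(10))

# One shard per process in a hypothetical 4-slot trial.
shards = [manual_shard(examples, rank, size=4) for rank in range(4)]
```

Each example lands in exactly one shard, so no two processes see the same data, although shards may differ in length by one when the dataset size is not a multiple of the number of slots.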

determined.estimator.EstimatorNativeContext

class determined.estimator.EstimatorNativeContext(env: determined._env_context.EnvContext, hvd_config: determined.horovod.HorovodContext)

EstimatorNativeContext always has a DistributedContext accessible via context.distributed for information related to distributed training.

get_data_config() → Dict[str, Any]

Return the data configuration.

get_experiment_config() → Dict[str, Any]

Return the experiment configuration.

get_experiment_id() → int

Return the experiment ID of the current trial.

get_global_batch_size() → int

Return the global batch size.

get_hparam(name: str) → Any

Return the current value of the hyperparameter with the given name.

get_hparams() → Dict[str, Any]

Return a dictionary of hyperparameter names to values.

get_per_slot_batch_size() → int

Return the per-slot batch size. When a model is trained with a single GPU, this is equal to the global batch size. When multi-GPU training is used, this is equal to the global batch size divided by the number of GPUs used to train the model.

get_stop_requested() → bool

Return whether a trial stoppage has been requested.

get_trial_id() → int

Return the trial ID of the current trial.

set_stop_requested(stop_requested: bool) → None

Set a flag to request a trial stoppage. When this flag is set to True, the trial finishes the current step, saves a checkpoint, and then exits.

wrap_dataset(dataset: Any, shard_dataset: bool = True) → Any

This should be used to wrap tf.data.Dataset objects immediately after they have been created. Users should use the output of this wrapper as the new instance of their dataset. If users create multiple datasets (e.g., one for training and one for testing), users should wrap each dataset independently. For example, if users instantiate their training dataset within build_train_spec(), they should call dataset = wrap_dataset(dataset) prior to passing it into tf.estimator.TrainSpec.

Parameters
  • dataset – The tf.data.Dataset to wrap.

  • shard_dataset – When performing multi-slot (distributed) training, this controls whether the dataset is sharded so that each training process (one per slot) sees unique data. If set to False, users must manually configure each process to use unique data.

wrap_optimizer(optimizer: Any) → Any

This should be used to wrap optimizer objects immediately after they have been created. Users should use the output of this wrapper as the new instance of their optimizer. For example, if users create their optimizer within build_estimator(), they should call optimizer = wrap_optimizer(optimizer) prior to passing the optimizer into their Estimator.
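The "wrap immediately, then use the wrapped object" pattern might look like the sketch below. Every class here (StubContext, WrappedOptimizer, PlainOptimizer) is a hypothetical stand-in written for this example, and what the real wrapper does internally is not shown; only the rebinding pattern is the point.

```python
class PlainOptimizer:
    """Hypothetical placeholder for an optimizer created by user code."""


class WrappedOptimizer:
    """Hypothetical placeholder for whatever the wrapper returns."""

    def __init__(self, inner: PlainOptimizer) -> None:
        self.inner = inner


class StubContext:
    """Hypothetical stand-in exposing only wrap_optimizer()."""

    def wrap_optimizer(self, optimizer: PlainOptimizer) -> WrappedOptimizer:
        return WrappedOptimizer(optimizer)


def build_optimizer(context: StubContext) -> WrappedOptimizer:
    optimizer = PlainOptimizer()
    # Rebind the name right away so the unwrapped object is never used afterwards.
    optimizer = context.wrap_optimizer(optimizer)
    return optimizer


optimizer = build_optimizer(StubContext())
```

Rebinding the same variable name makes it hard to pass the unwrapped optimizer into the Estimator by accident.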