Python SDK
You can interact with a Determined cluster with the Python SDK.
The client module exposes many of the same capabilities as the det CLI tool directly to Python code with an object-oriented interface.
Experiment Workflow Example
As a simple example, let’s walk through the most basic workflow for creating an experiment, waiting for it to complete, and finding the top-performing checkpoint.
The first step is to import the client module and possibly to call login():
from determined.experimental import client
# We will assume that you have called `det user login`, so this is unnecessary:
# client.login(master=..., user=..., password=...)
The next step is to call create_experiment():
# config can be a path to a config file or a python dict of the config.
exp = client.create_experiment(config="my_config.yaml", model_dir=".")
print(f"started experiment {exp.id}")
The returned object will be an ExperimentReference which has methods for controlling the lifetime of the experiment running on the cluster. In this example, we will just wait for the experiment to complete.
exit_status = exp.wait()
print(f"experiment completed with status {exit_status}")
Now that the experiment has completed, you can grab the top-performing checkpoint from training:
best_checkpoint = exp.top_checkpoint()
print(f"best checkpoint was {best_checkpoint.uuid}")
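The steps above can be collected into one function. The following is a minimal sketch, not part of the SDK: the name run_and_pick_best is hypothetical, and the sdk argument is assumed to behave like the determined.experimental.client module (it only needs create_experiment(), and the returned object only needs wait() and top_checkpoint()). Passing the module in as an argument keeps the sketch importable on a machine without a cluster.

```python
from typing import Any, Tuple


def run_and_pick_best(sdk: Any, config: Any, model_dir: str) -> Tuple[Any, str]:
    """Create an experiment, block until it finishes, and return
    (exit status, UUID of the top-performing checkpoint).

    `sdk` is assumed to expose create_experiment() the way
    determined.experimental.client does; this helper is hypothetical.
    """
    # config may be a path to a config file or a dict, as in the walkthrough.
    exp = sdk.create_experiment(config=config, model_dir=model_dir)
    # Block until the experiment reaches a terminal state.
    exit_status = exp.wait()
    # Ask the experiment for its best checkpoint.
    best_checkpoint = exp.top_checkpoint()
    return exit_status, best_checkpoint.uuid
```

With a reachable cluster, the call would look like run_and_pick_best(client, "my_config.yaml", ".") after from determined.experimental import client.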
Python SDK Reference
Client
See Checkpoints for more ideas on what to do next.
- determined.experimental.client.login(master: Optional[str] = None, user: Optional[str] = None, password: Optional[str] = None, cert_path: Optional[str] = None, cert_name: Optional[str] = None, noverify: bool = False) -> None
login() configures the default Determined() singleton used by all of the other functions in the client module.
It is often unnecessary to call login(). If you have configured your environment so that the Determined CLI works without any extra arguments or environment variables, you should not have to call login() at all.
If you do need to call login(), it must be called before calling any other functions from this module; otherwise it will fail.
If you have reason to connect to multiple masters, use explicit Determined objects instead. Each explicit Determined object accepts the same parameters as login() and offers the same functions as this module.
Note
Try to avoid having your password in your Python code. If you are running on your local machine, you should always be able to use det user login on the CLI, and login() will not need either a user or a password. If you have run det user login with multiple users (and you have not run det user logout), then you should be able to run login(user=...) for any of those users without putting your password in your code.
- Parameters
master (string, optional) – The URL of the Determined master. If this argument is not specified, the environment variables DET_MASTER and DET_MASTER_ADDR will be checked for the master URL in that order.
user (string, optional) – The Determined username used for authentication. (default: determined)
password (string, optional) – The password associated with the user.
cert_path (string, optional) – A path to a custom PEM-encoded certificate, against which to validate the master. (default: None)
cert_name (string, optional) – The name of the master hostname to use during certificate validation. Normally this is taken from the master URL, but if the master is exposed on multiple networks, this value may need to be overridden. (default: None)
noverify (boolean, optional) – Disable all TLS verification entirely. (default: False)
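The lookup order described for the master parameter can be sketched as below. This is an illustration only: resolve_master is a hypothetical helper, not the SDK's implementation, and it returns None when nothing is configured without making any claim about what fallback the SDK itself applies.

```python
import os
from typing import Mapping, Optional


def resolve_master(master: Optional[str] = None,
                   env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Hypothetical helper mirroring the documented lookup order:
    explicit argument first, then DET_MASTER, then DET_MASTER_ADDR."""
    if env is None:
        env = os.environ
    if master:
        # An explicit argument takes precedence over the environment.
        return master
    for var in ("DET_MASTER", "DET_MASTER_ADDR"):
        value = env.get(var)
        if value:
            return value
    # Nothing configured; the real SDK's fallback is not modeled here.
    return None
```

The env parameter exists only so the lookup can be exercised with a plain dict instead of the process environment.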
- determined.experimental.client.create_experiment(config: Union[str, pathlib.Path, Dict], model_dir: str, includes: Optional[Iterable[Union[str, pathlib.Path]]] = None) -> determined.common.experimental.experiment.ExperimentReference
Creates an experiment with config parameters and model directory. The function returns an ExperimentReference of the experiment.
- Parameters
config (str, pathlib.Path, dictionary) – Experiment config filename (.yaml) or a dict.
model_dir (str) – Directory containing model definition.
includes (Iterable[Union[str, pathlib.Path]], optional) – Additional files or directories to include in the model definition. (default: None)
- determined.experimental.client.get_experiment(experiment_id: int) -> determined.common.experimental.experiment.ExperimentReference
Get the ExperimentReference representing the experiment with the provided experiment ID.
- Parameters
experiment_id (int) – The experiment ID.
- determined.experimental.client.get_trial(trial_id: int) -> determined.common.experimental.trial.TrialReference
Get the TrialReference representing the trial with the provided trial ID.
- Parameters
trial_id (int) – The trial ID.
- determined.experimental.client.get_checkpoint(uuid: str) -> determined.common.experimental.checkpoint._checkpoint.Checkpoint
Get the Checkpoint representing the checkpoint with the provided UUID.
- Parameters
uuid (string) – The checkpoint UUID.
- determined.experimental.client.create_model(name: str, description: Optional[str] = '', metadata: Optional[Dict[str, Any]] = None) -> determined.common.experimental.model.Model
Add a Model to the model registry. This function returns a Model.
- Parameters
name (string) – The name of the model. This name must be unique.
description (string, optional) – A description of the model.
metadata (dict, optional) – Dictionary of metadata to add to the model.
- determined.experimental.client.get_model(identifier: Union[str, int]) -> determined.common.experimental.model.Model
Get the Model from the model registry with the provided identifier, which is either a string-type name or an integer-type model ID. If no model with that name is found in the registry, an exception is raised.
- Parameters
identifier (string, int) – The unique name or ID of the model.
- determined.experimental.client.get_models(sort_by: determined.common.experimental.model.ModelSortBy = ModelSortBy.NAME, order_by: determined.common.experimental.model.ModelOrderBy = ModelOrderBy.ASCENDING, name: str = '', description: str = '') -> List[determined.common.experimental.model.Model]
Get a list of all models in the model registry.
- Parameters
sort_by – Which field to sort by. See ModelSortBy.
order_by – Whether to sort in ascending or descending order. See ModelOrderBy.
name – If this parameter is set, models will be filtered to only include models with names matching this parameter.
description – If this parameter is set, models will be filtered to only include models with descriptions matching this parameter.
Checkpoint
- class determined.experimental.client.Checkpoint(session: determined.common.api._session.Session, task_id: Optional[str], allocation_id: Optional[str], uuid: str, report_time: Optional[str], resources: Dict[str, Any], metadata: Dict[str, Any], state: determined.common.experimental.checkpoint._checkpoint.CheckpointState, training: Optional[determined.common.experimental.checkpoint._checkpoint.CheckpointTrainingMetadata] = None)
A Checkpoint object is usually obtained from determined.experimental.client.get_checkpoint().
A Checkpoint represents a trained model.
This class provides helper functionality for downloading checkpoints to local storage and loading checkpoints into memory.
The TrialReference class contains methods that return instances of this class.
- download(path: Optional[str] = None, mode: determined.common.experimental.checkpoint._checkpoint.DownloadMode = DownloadMode.AUTO) -> str
Download checkpoint to local storage.
- Parameters
path (string, optional) – Top level directory to place the checkpoint under. If this parameter is not set, the checkpoint will be downloaded to checkpoints/<checkpoint_uuid> relative to the current working directory.
mode (DownloadMode) – Governs how a checkpoint is downloaded. Refer to the definition of DownloadMode for details.
- write_metadata_file(path: str) -> None
Write a file with this Checkpoint’s metadata inside of it.
This is normally executed as part of Checkpoint.download(). However, in the special case where you are accessing the checkpoint files directly (not via Checkpoint.download()), you may use this method directly to obtain the latest metadata.
- load(path: Optional[str] = None, tags: Optional[List[str]] = None, **kwargs: Any) -> Any
Loads a Determined checkpoint into memory.
If the checkpoint is not present on disk, it will be downloaded from persistent storage. The behavior here is different for TensorFlow and PyTorch checkpoints.
For PyTorch checkpoints, the return type is an object that inherits from determined.pytorch.PyTorchTrial as defined by the entrypoint field in the experiment config.
For TensorFlow checkpoints, the return type is a TensorFlow autotrackable object.
- Parameters
path (string, optional) – Top level directory to load the checkpoint from. (default: checkpoints/<UUID>)
tags (list string, optional) – Only relevant for TensorFlow SavedModel checkpoints. Specifies which tags are loaded from the TensorFlow SavedModel. See documentation for tf.compat.v1.saved_model.load_v2.
kwargs – Only relevant for PyTorch checkpoints. The keyword arguments will be applied to torch.load. See documentation for torch.load.
Warning
Checkpoint.load() has been deprecated and will be removed in a future version.
Please combine Checkpoint.download() with one of the following instead:
- det.pytorch.load_trial_from_checkpoint()
- det.keras.load_model_from_checkpoint()
- det.estimator.load_estimator_from_checkpoint_path()
- add_metadata(metadata: Dict[str, Any]) -> None
Adds user-defined metadata to the checkpoint. The metadata argument must be a JSON-serializable dictionary. If any keys from this dictionary already appear in the checkpoint metadata, the corresponding dictionary entries in the checkpoint are replaced by the passed-in dictionary values.
Warning: this metadata change is not propagated to the checkpoint storage.
- Parameters
metadata (dict) – Dictionary of metadata to add to the checkpoint.
- remove_metadata(keys: List[str]) -> None
Removes user-defined metadata from the checkpoint. Any top-level keys that appear in the keys list are removed from the checkpoint.
Warning: this metadata change is not propagated to the checkpoint storage.
- Parameters
keys (List[string]) – Top-level keys to remove from the checkpoint metadata.
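The replace-on-add and remove-by-top-level-key behavior described for add_metadata() and remove_metadata() can be pictured with plain dictionaries. This is a sketch of one reading of that description, using hypothetical helper names; it does not call the SDK, and it assumes replacement happens at the top level only (no deep merge).

```python
from typing import Any, Dict, List


def merged_metadata(current: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Return `current` with every top-level key in `new` added or replaced."""
    result = dict(current)  # copy so the caller's dict is not mutated
    result.update(new)      # an existing top-level entry is replaced wholesale
    return result


def without_keys(current: Dict[str, Any], keys: List[str]) -> Dict[str, Any]:
    """Return `current` minus any top-level keys listed in `keys`."""
    drop = set(keys)
    return {k: v for k, v in current.items() if k not in drop}
```

Under this reading, adding {"a": {"y": 3}} to metadata that already holds {"a": {"x": 1}} leaves "a" equal to {"y": 3}, not a combination of both.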
- static load_from_path(path: str, tags: Optional[List[str]] = None, **kwargs: Any) -> Any
Loads a Determined checkpoint from a local file system path into memory.
For PyTorch checkpoints, the return type is an object that inherits from determined.pytorch.PyTorchTrial as defined by the entrypoint field in the experiment config.
For TensorFlow checkpoints, the return type is a TensorFlow autotrackable object.
- Parameters
path (string) – Local path to the checkpoint directory.
tags (list string, optional) – Only relevant for TensorFlow SavedModel checkpoints. Specifies which tags are loaded from the TensorFlow SavedModel. See documentation for tf.compat.v1.saved_model.load_v2.
Warning
Checkpoint.load_from_path() has been deprecated and will be removed in a future version.
Please use one of the following instead to load your checkpoint:
- det.pytorch.load_trial_from_checkpoint_path()
- det.keras.load_model_from_checkpoint_path()
- det.estimator.load_estimator_from_checkpoint_path()
Determined
- class determined.experimental.client.Determined(master: Optional[str] = None, user: Optional[str] = None, password: Optional[str] = None, cert_path: Optional[str] = None, cert_name: Optional[str] = None, noverify: bool = False)
Determined gives access to Determined API objects.
- Parameters
master (string, optional) – The URL of the Determined master. If this argument is not specified, the environment variables DET_MASTER and DET_MASTER_ADDR will be checked for the master URL in that order.
user (string, optional) – The Determined username used for authentication. (default: determined)
- create_experiment(config: Union[str, pathlib.Path, Dict], model_dir: Union[str, pathlib.Path], includes: Optional[Iterable[Union[str, pathlib.Path]]] = None) -> determined.common.experimental.experiment.ExperimentReference
Create an experiment with config parameters and model directory. The function returns an ExperimentReference of the experiment.
- Parameters
config (string, pathlib.Path, dictionary) – Experiment config filename (.yaml) or a dict.
model_dir (string, pathlib.Path) – Directory containing model definition.
includes (Iterable[Union[str, pathlib.Path]], optional) – Additional files or directories to include in the model definition. (default: None)
- get_experiment(experiment_id: int) -> determined.common.experimental.experiment.ExperimentReference
Get the ExperimentReference representing the experiment with the provided experiment ID.
- get_trial(trial_id: int) -> determined.common.experimental.trial.TrialReference
Get the TrialReference representing the trial with the provided trial ID.
- get_checkpoint(uuid: str) -> determined.common.experimental.checkpoint._checkpoint.Checkpoint
Get the Checkpoint representing the checkpoint with the provided UUID.
- create_model(name: str, description: Optional[str] = '', metadata: Optional[Dict[str, Any]] = None, labels: Optional[List[str]] = None) -> determined.common.experimental.model.Model
Add a model to the model registry.
- Parameters
name (string) – The name of the model. This name must be unique.
description (string, optional) – A description of the model.
metadata (dict, optional) – Dictionary of metadata to add to the model.
- get_model(identifier: Union[str, int]) -> determined.common.experimental.model.Model
Get the Model from the model registry with the provided identifier, which is either a string-type name or an integer-type model ID. If no corresponding model is found in the registry, an exception is raised.
- Parameters
identifier (string, int) – The unique name or ID of the model.
- get_model_by_id(model_id: int) -> determined.common.experimental.model.Model
Get the Model from the model registry with the provided id. If no model with that id is found in the registry, an exception is raised.
Warning
Determined.get_model_by_id() has been deprecated and will be removed in a future version. Please call Determined.get_model() with either a string-type name or an integer-type model ID.
- get_models(sort_by: determined.common.experimental.model.ModelSortBy = ModelSortBy.NAME, order_by: determined.common.experimental.model.ModelOrderBy = ModelOrderBy.ASCENDING, name: Optional[str] = None, description: Optional[str] = None, model_id: Optional[int] = None) -> List[determined.common.experimental.model.Model]
Get a list of all models in the model registry.
- Parameters
sort_by – Which field to sort by. See ModelSortBy.
order_by – Whether to sort in ascending or descending order. See ModelOrderBy.
name – If this parameter is set, models will be filtered to only include models with names matching this parameter.
description – If this parameter is set, models will be filtered to only include models with descriptions matching this parameter.
model_id – If this parameter is set, models will be filtered to only include the model with this unique numeric id.
- get_model_labels() -> List[str]
Get a list of labels used on any models, sorted from most-popular to least-popular.
ExperimentReference
- class determined.experimental.client.ExperimentReference(experiment_id: int, session: determined.common.api._session.Session)
An ExperimentReference object is usually obtained from determined.experimental.client.create_experiment() or determined.experimental.client.get_experiment().
Helper class that supports querying the set of checkpoints associated with an experiment.
- delete() -> None
Delete an experiment and all its artifacts from persistent storage.
You must be authenticated as admin to delete an experiment.
- get_trials(sort_by: determined.common.experimental.trial.TrialSortBy = TrialSortBy.ID, order_by: determined.common.experimental.trial.TrialOrderBy = TrialOrderBy.ASCENDING) -> List[determined.common.experimental.trial.TrialReference]
Get the list of TrialReference instances representing trials for an experiment.
- Parameters
sort_by – Which field to sort by. See TrialSortBy.
order_by – Whether to sort in ascending or descending order. See TrialOrderBy.
- await_first_trial(interval: float = 0.1) -> determined.common.experimental.trial.TrialReference
Wait for the first trial to be started for this experiment.
- wait(interval: float = 5.0) -> determined.common.experimental.experiment.ExperimentState
Wait for the experiment to reach a complete or terminal state.
- Parameters
interval (float, optional) – The time in seconds to wait between checks of the experiment state. (default: 5.0)
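The polling behavior that wait() describes can be sketched as a small loop. This is an illustration, not the SDK's implementation: wait_for_terminal and get_state are hypothetical names, and the set of terminal state names below is an assumption made for the example.

```python
import time
from typing import Callable, FrozenSet

# Assumed state names for illustration; not taken from the SDK.
TERMINAL_STATES: FrozenSet[str] = frozenset({"COMPLETED", "CANCELED", "ERROR"})


def wait_for_terminal(get_state: Callable[[], str],
                      interval: float = 5.0,
                      terminal: FrozenSet[str] = TERMINAL_STATES) -> str:
    """Call `get_state` repeatedly, sleeping `interval` seconds between
    checks, until it reports a state contained in `terminal`."""
    while True:
        state = get_state()
        if state in terminal:
            return state
        time.sleep(interval)
```

A shorter interval notices completion sooner at the cost of more frequent checks.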
- top_checkpoint(sort_by: Optional[str] = None, smaller_is_better: Optional[bool] = None) -> determined.common.experimental.checkpoint._checkpoint.Checkpoint
Return the Checkpoint for this experiment that has the best validation metric, as defined by the sort_by and smaller_is_better arguments.
- Parameters
sort_by (string, optional) – The name of the validation metric to order checkpoints by. If this parameter is not specified, the metric defined in the experiment configuration searcher field will be used.
smaller_is_better (bool, optional) – Specifies whether to sort the metric above in ascending or descending order. If sort_by is unset, this parameter is ignored. By default, the value of smaller_is_better from the experiment’s configuration is used.
- top_n_checkpoints(limit: int, sort_by: Optional[str] = None, smaller_is_better: Optional[bool] = None) -> List[determined.common.experimental.checkpoint._checkpoint.Checkpoint]
Return the N Checkpoint instances with the best validation metrics, as defined by the sort_by and smaller_is_better arguments. This method will return the best checkpoint from the top N best-performing distinct trials of the experiment. Only checkpoints in a COMPLETED state with a matching COMPLETED validation are considered.
- Parameters
limit (int) – The maximum number of checkpoints to return.
sort_by (string, optional) – The name of the validation metric to use for sorting checkpoints. If this parameter is unset, the metric defined in the experiment configuration searcher field will be used.
smaller_is_better (bool, optional) – Specifies whether to sort the metric above in ascending or descending order. If sort_by is unset, this parameter is ignored. By default, the value of smaller_is_better from the experiment’s configuration is used.
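The selection rule described for top_n_checkpoints() (best checkpoint per distinct trial, then the N best of those) can be sketched over plain dictionaries. The record layout here (uuid, trial_id, state, metrics) and the function name are hypothetical, and the "matching COMPLETED validation" condition is simplified to a single state field.

```python
from typing import Any, Dict, List


def top_n_sketch(checkpoints: List[Dict[str, Any]], limit: int, sort_by: str,
                 smaller_is_better: bool) -> List[Dict[str, Any]]:
    """Keep the best checkpoint of each trial, then return the `limit` best."""
    best_per_trial: Dict[int, Dict[str, Any]] = {}
    for ckpt in checkpoints:
        if ckpt["state"] != "COMPLETED":
            continue  # only COMPLETED checkpoints are considered
        incumbent = best_per_trial.get(ckpt["trial_id"])
        if incumbent is None:
            best_per_trial[ckpt["trial_id"]] = ckpt
            continue
        value = ckpt["metrics"][sort_by]
        best = incumbent["metrics"][sort_by]
        improved = value < best if smaller_is_better else value > best
        if improved:
            best_per_trial[ckpt["trial_id"]] = ckpt
    # Rank the per-trial winners; best metric first.
    ranked = sorted(best_per_trial.values(),
                    key=lambda c: c["metrics"][sort_by],
                    reverse=not smaller_is_better)
    return ranked[:limit]
```

Because each trial contributes at most one checkpoint, the result can hold fewer than limit entries when the experiment has fewer qualifying trials.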
Model
- class determined.experimental.client.Model(session: determined.common.api._session.Session, model_id: int, name: str, description: str, creation_time: datetime.datetime, last_updated_time: datetime.datetime, metadata: Dict[str, Any], labels: List[str], username: str, archived: bool)
A Model object is usually obtained from determined.experimental.client.create_model() or determined.experimental.client.get_model().
Class representing a model in the model registry. It contains methods for model versions and metadata.
- Parameters
model_id (int) – The unique id of this model.
name (string) – The name of the model.
description (string, optional) – The description of the model.
creation_time (datetime) – The time the model was created.
last_updated_time (datetime) – The time the model was most recently updated.
metadata (dict, optional) – User-defined metadata associated with the model.
labels ([string]) – User-defined text labels associated with the model.
username (string) – The user who initially created this model.
archived (boolean) – The status (archived or not) for this model.
- get_version(version: int = -1) -> Optional[determined.common.experimental.model.ModelVersion]
Retrieve the checkpoint corresponding to the specified id of the model version. If the specified version of the model does not exist, an exception is raised.
If no version is specified, the latest version of the model is returned. In this case, if there are no registered versions of the model, None is returned.
- Parameters
version (int, optional) – The model version ID requested.
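The three outcomes described for get_version() (a specific version, the latest version, or None) can be sketched over a plain mapping from version ID to checkpoint UUID. This does not call the SDK: pick_version is a hypothetical helper, "latest" is assumed to mean the highest version ID, and KeyError stands in for whatever exception the SDK raises.

```python
from typing import Dict, Optional


def pick_version(versions: Dict[int, str], version: int = -1) -> Optional[str]:
    """Mimic the documented lookup rules on a {version_id: uuid} mapping."""
    if version == -1:
        if not versions:
            return None  # no registered versions and none requested
        return versions[max(versions)]  # assumed: latest == highest ID
    if version not in versions:
        # Stand-in for the exception raised when the version is missing.
        raise KeyError(f"model version {version} does not exist")
    return versions[version]
```

Note the asymmetry: a missing specific version is an error, while an empty registry with no version requested yields None.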
- get_versions(order_by: determined.common.experimental.model.ModelOrderBy = ModelOrderBy.DESCENDING) -> List[determined.common.experimental.model.ModelVersion]
Get a list of ModelVersions with checkpoints of this model. The model versions are sorted by model version ID and are returned in descending order by default.
- Parameters
order_by (enum) – A member of the ModelOrderBy enum.
- register_version(checkpoint_uuid: str) -> determined.common.experimental.model.ModelVersion
Creates a new model version and returns the ModelVersion corresponding to the version.
- Parameters
checkpoint_uuid – The UUID of the checkpoint to register.
- add_metadata(metadata: Dict[str, Any]) -> None
Adds user-defined metadata to the model. The metadata argument must be a JSON-serializable dictionary. If any keys from this dictionary already appear in the model’s metadata, the previous dictionary entries are replaced.
- Parameters
metadata (dict) – Dictionary of metadata to add to the model.
- remove_metadata(keys: List[str]) -> None
Removes user-defined metadata from the model. Any top-level keys that appear in the keys list are removed from the model.
- Parameters
keys (List[str]) – Top-level keys to remove from the model metadata.
- set_labels(labels: List[str]) -> None
Sets user-defined labels for the model. The labels argument must be an array of strings. If the model previously had labels, they are replaced.
- Parameters
labels (List[str]) – All labels to set on the model.
- archive() -> None
Sets the model’s state to archived.
- unarchive() -> None
Removes the model’s archived state.
- delete() -> None
Deletes the model in the registry.
ModelOrderBy
- class determined.experimental.client.ModelOrderBy(value)
Specifies whether a sorted list of models should be in ascending or descending order.
ModelSortBy
TrialReference
- class determined.experimental.client.TrialReference(trial_id: int, session: determined.common.api._session.Session)
A TrialReference object is usually obtained from determined.experimental.client.get_trial().
Trial reference class used for querying relevant Checkpoint instances.
- logs(follow: bool = False, *, head: Optional[int] = None, tail: Optional[int] = None, container_ids: Optional[List[str]] = None, rank_ids: Optional[List[int]] = None, stdtypes: Optional[List[str]] = None, min_level: Optional[determined.common.experimental.trial.LogLevel] = None) -> Iterable[str]
Return an iterable of log lines from this trial meeting the specified criteria.
- Parameters
follow (bool, optional) – Whether the iterable should block waiting for new logs to arrive. Mutually exclusive with head and tail. Defaults to False.
head (int, optional) – When set, only fetches the first head lines. Mutually exclusive with follow and tail. Defaults to None.
tail (int, optional) – When set, only fetches the last tail lines. Mutually exclusive with follow and head. Defaults to None.
container_ids (List[str], optional) – When set, only fetch log lines from specific containers. Defaults to None.
rank_ids (List[int], optional) – When set, only fetch log lines from specific ranks. Defaults to None.
stdtypes (List[str], optional) – When set, only fetch log lines from the given stdio outputs. Defaults to None (same as ["stdout", "stderr"]).
min_level (LogLevel, optional) – When set, defines the minimum log priority for lines that will be returned. Defaults to None (all logs returned).
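The mutual exclusivity of follow, head, and tail can be sketched over an in-memory list of lines. This is a hypothetical helper rather than the SDK's code: it assumes tail selects the trailing lines, raises ValueError as a stand-in error, and does not model the blocking behavior of follow.

```python
from typing import List, Optional


def select_lines(lines: List[str], follow: bool = False,
                 head: Optional[int] = None,
                 tail: Optional[int] = None) -> List[str]:
    """Apply head/tail selection to already-fetched log lines."""
    chosen = sum([bool(follow), head is not None, tail is not None])
    if chosen > 1:
        raise ValueError("follow, head, and tail are mutually exclusive")
    if head is not None:
        return lines[:head]
    if tail is not None:
        # lines[-0:] would return everything, so handle zero explicitly.
        return lines[-tail:] if tail > 0 else []
    # follow (blocking for new lines) is not modeled in this sketch.
    return list(lines)
```

The explicit check for tail equal to zero is needed because a slice starting at -0 is the same as a slice starting at 0.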
- top_checkpoint(sort_by: Optional[str] = None, smaller_is_better: Optional[bool] = None) -> determined.common.experimental.checkpoint._checkpoint.Checkpoint
Return the Checkpoint instance with the best validation metric as defined by the sort_by and smaller_is_better arguments.
- Parameters
sort_by (string, optional) – The name of the validation metric to order checkpoints by. If this parameter is unset, the metric defined in the related experiment configuration searcher field will be used.
smaller_is_better (bool, optional) – Whether to sort the metric above in ascending or descending order. If sort_by is unset, this parameter is ignored. By default, the value of smaller_is_better from the experiment’s configuration is used.
- select_checkpoint(latest: bool = False, best: bool = False, uuid: Optional[str] = None, sort_by: Optional[str] = None, smaller_is_better: Optional[bool] = None) -> determined.common.experimental.checkpoint._checkpoint.Checkpoint
Return the Checkpoint instance with the best validation metric as defined by the sort_by and smaller_is_better arguments.
Exactly one of the best, latest, or uuid parameters must be set.
- Parameters
latest (bool, optional) – Return the most recent checkpoint.
best (bool, optional) – Return the checkpoint with the best validation metric as defined by the sort_by and smaller_is_better arguments. If sort_by and smaller_is_better are not specified, the values from the associated experiment configuration will be used.
uuid (string, optional) – Return the checkpoint for the specified UUID.
sort_by (string, optional) – The name of the validation metric to order checkpoints by. If this parameter is unset, the metric defined in the related experiment configuration searcher field will be used.
smaller_is_better (bool, optional) – Whether to sort the metric above in ascending or descending order. If sort_by is unset, this parameter is ignored. By default, the value of smaller_is_better from the experiment’s configuration is used.
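The rule that exactly one of best, latest, or uuid must be set for select_checkpoint() can be sketched as a small validation step. The helper name is hypothetical and ValueError is an assumed stand-in; the SDK may signal the error differently.

```python
from typing import Optional


def check_selector(latest: bool = False, best: bool = False,
                   uuid: Optional[str] = None) -> str:
    """Return which selector was chosen, or raise if zero or several were."""
    chosen = [
        name
        for name, is_set in (("latest", latest), ("best", best),
                             ("uuid", uuid is not None))
        if is_set
    ]
    if len(chosen) != 1:
        raise ValueError(
            f"exactly one of best, latest, or uuid must be set; got {chosen}"
        )
    return chosen[0]
```

Validating up front like this lets a caller fail fast before making any request to the master.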