`det` API Reference

`determined.ClusterInfo`

determined.get_cluster_info() → Optional[ClusterInfo]: Returns the ClusterInfo object for the current task, or None if not running in a task.

class determined.ClusterInfo(master_url: str, cluster_id: str, agent_id: str, slot_ids: List[int], task_id: str, allocation_id: str, session_token: str, task_type: str, master_cert_name: Optional[str] = None, master_cert_file: Optional[str] = None, latest_checkpoint: Optional[str] = None, trial_info: Optional[TrialInfo] = None, rendezvous_info: Optional[RendezvousInfo] = None, resources_info: Optional[ResourcesInfo] = None)

ClusterInfo exposes various properties that are set for tasks while running on the cluster.

Examples:

import determined as det

# Returns None when the code is not running inside a Determined task.
info = det.get_cluster_info()
assert info is not None, "this code only runs on-cluster!"

print("master_url", info.master_url)
print("task_id", info.task_id)
print("allocation_id", info.allocation_id)
print("session_token", info.session_token)

print("container_addrs", info.container_addrs)
print("container_rank", info.container_rank)

# .trial is only readable for trial tasks.
if info.task_type == "TRIAL":
    print("trial.id", info.trial.trial_id)
    print("trial.hparams", info.trial.hparams)

Warning

Be careful with this object! If you depend on a ClusterInfo object during training for anything more than e.g. informational logging, you run the risk of making your training code unable to run outside of Determined. ClusterInfo is meant to be most useful to custom launch layers, which likely are not able to run outside of Determined anyway.
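
One way to follow this advice is to treat the ClusterInfo object as optional and use it only for logging. The sketch below is a minimal illustration under stated assumptions: `describe_task` is a hypothetical helper (not part of the Determined API), and a `SimpleNamespace` stands in for a real ClusterInfo so the snippet runs without a cluster.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional


def describe_task(info: Optional[Any]) -> Dict[str, Any]:
    """Build a log-friendly summary. Hypothetical helper, not a Determined API."""
    if info is None:
        # get_cluster_info() returns None when not running in a task.
        return {"on_cluster": False}
    return {
        "on_cluster": True,
        "task_type": info.task_type,
        "task_id": info.task_id,
        "allocation_id": info.allocation_id,
    }


# Stand-in with the same attribute names as ClusterInfo (illustrative values).
fake_info = SimpleNamespace(
    task_type="COMMAND", task_id="task-1", allocation_id="task-1.0"
)

off_cluster = describe_task(None)
on_cluster = describe_task(fake_info)
print(off_cluster)
print(on_cluster)
```

On a cluster, the same helper could be called as `describe_task(det.get_cluster_info())` after `import determined as det`; the training logic itself never depends on the result.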

agent_id: The identifier of the Determined agent this container is running on.

allocation_id: The unique identifier for the current allocation.

cluster_id: The unique identifier for this cluster.

property container_addrs: List[str]: A list of addresses for all containers in the allocation, ordered by rank.

property container_rank: int

The rank assigned to this container.

When using a distributed training framework, the framework may choose a different rank for this container.

property container_slot_counts: List[int]: A list of slot counts for all containers in the allocation, ordered by rank.
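
A custom launch layer might combine container_rank and container_slot_counts to work out which global worker ranks belong to each container. The sketch below assumes one worker process per slot and ranks assigned in container order; that convention and the helper name `global_rank_range` are assumptions made for illustration, and a distributed training framework may assign ranks differently.

```python
from typing import List, Tuple


def global_rank_range(
    container_rank: int, container_slot_counts: List[int]
) -> Tuple[int, int, int]:
    """Return (first_rank, end_rank_exclusive, world_size) for one container.

    Assumes one worker per slot, with ranks handed out in container order.
    """
    if not 0 <= container_rank < len(container_slot_counts):
        raise ValueError("container_rank is out of range")
    first = sum(container_slot_counts[:container_rank])
    end = first + container_slot_counts[container_rank]
    return first, end, sum(container_slot_counts)


# Example: three containers with 4, 4, and 2 slots; this is container 1.
layout = global_rank_range(1, [4, 4, 2])
print(layout)  # (4, 8, 10)
```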

property gpu_uuids: List[str]: The UUIDs of the GPUs assigned to this container.

property latest_checkpoint: Optional[str]

The checkpoint ID of the most recent checkpoint that should be loaded.

Since non-trial-type tasks cannot currently save checkpoints, .latest_checkpoint is currently always None for non-trial-type tasks.

master_cert_file: The file location of the master certificate, if present, or "noverify" if the task has been configured not to verify the master certificate.

master_cert_name: The name on the master certificate, when using TLS.
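
Code that talks to the master directly has to turn master_cert_file into a TLS setting. The sketch below maps it to the kind of value that HTTP clients such as `requests` accept for their `verify` argument. The mapping, including treating None as "use the default CA bundle", is an assumption for illustration, and `tls_verify_setting` is a hypothetical helper rather than a Determined API.

```python
from typing import Optional, Union


def tls_verify_setting(master_cert_file: Optional[str]) -> Union[bool, str]:
    """Map master_cert_file to a requests-style `verify` value (assumed mapping)."""
    if master_cert_file is None:
        # No custom certificate configured: rely on the default CA bundle.
        return True
    if master_cert_file == "noverify":
        # Verification was explicitly disabled for this task.
        return False
    # Otherwise treat the value as a path to the certificate file to trust.
    return master_cert_file


print(tls_verify_setting(None))
print(tls_verify_setting("noverify"))
print(tls_verify_setting("/path/to/master.crt"))  # placeholder path
```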

master_url: The URL for reaching the master.

session_token: The Determined login session token created for the current task.

slot_ids: The slot IDs assigned to this container.

task_id: The unique identifier for the current task.

task_type

The type of task. Currently one of the following string literals:

"TRIAL"
"NOTEBOOK"
"SHELL"
"COMMAND"
"TENSORBOARD"
"CHECKPOINT_GC"

Additional values may be added in the future.
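
Because the list may grow, code that branches on task_type should tolerate values it does not recognize. A minimal sketch of such a check follows; the helper name `classify_task_type` and its three return labels are hypothetical.

```python
# The task types listed above; newer releases may add more.
KNOWN_TASK_TYPES = frozenset(
    {"TRIAL", "NOTEBOOK", "SHELL", "COMMAND", "TENSORBOARD", "CHECKPOINT_GC"}
)


def classify_task_type(task_type: str) -> str:
    """Return "trial", "non-trial", or "unknown" without failing on new values."""
    if task_type == "TRIAL":
        return "trial"
    if task_type in KNOWN_TASK_TYPES:
        return "non-trial"
    # A value added after this code was written: handle it gracefully.
    return "unknown"


print(classify_task_type("TRIAL"))
print(classify_task_type("SHELL"))
print(classify_task_type("SOME_FUTURE_TYPE"))
```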

property trial: TrialInfo

The TrialInfo sub-info object for the current trial task.

Attempting to read .trial in a non-trial task type will raise a RuntimeError.
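
To avoid that RuntimeError, check task_type before touching .trial. The sketch below uses stand-in objects so it runs off-cluster; `trial_id_or_none` and `FakeNotebookInfo` are hypothetical names, and the fake class only mimics the documented behavior.

```python
from types import SimpleNamespace
from typing import Any, Optional


def trial_id_or_none(info: Any) -> Optional[int]:
    """Return the trial ID for trial tasks, or None for any other task type."""
    if info.task_type != "TRIAL":
        # Do not read .trial here; it raises RuntimeError for non-trial tasks.
        return None
    return info.trial.trial_id


class FakeNotebookInfo:
    """Stand-in that mimics the documented RuntimeError on .trial."""

    task_type = "NOTEBOOK"

    @property
    def trial(self) -> Any:
        raise RuntimeError("not a trial task")


fake_trial_info = SimpleNamespace(
    task_type="TRIAL", trial=SimpleNamespace(trial_id=7)
)
print(trial_id_or_none(fake_trial_info))
print(trial_id_or_none(FakeNotebookInfo()))
```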

property user_data: Dict[str, Any]

The content of the data field of the experiment configuration.

Since other types of configuration files don’t allow a data field, accessing user_data from non-trial-type tasks will always return an empty dictionary.
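
Since user_data can be empty, code that reads it is simpler if it overlays the dictionary on a set of defaults. This is a minimal sketch; the keys in `DEFAULT_DATA` and the helper `resolve_data_settings` are hypothetical and not defined by Determined.

```python
from typing import Any, Dict

# Hypothetical defaults for keys an experiment's data field might set.
DEFAULT_DATA: Dict[str, Any] = {"dataset_url": None, "num_workers": 2}


def resolve_data_settings(user_data: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay user_data on the defaults; an empty dict leaves them intact."""
    return {**DEFAULT_DATA, **user_data}


print(resolve_data_settings({}))
print(resolve_data_settings({"num_workers": 8}))
```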

`determined.TrialInfo`

class determined.TrialInfo(trial_id: int, experiment_id: int, trial_seed: int, hparams: Dict[str, Any], config: Dict[str, Any], steps_completed: int, trial_run_id: int, debug: bool, inter_node_network_interface: Optional[str])

experiment_id: The Experiment ID for the current task.

hparams: The hyperparameter values selected for the current Trial.

trial_id: The Trial ID for the current task.

trial_seed: The random seed for the current Trial.
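
One plausible use of trial_seed is to seed a random number generator so that a trial's randomness is repeatable. The sketch below uses only the standard library; `seeded_rng` is a hypothetical helper, and the seed value is a placeholder for `info.trial.trial_seed`.

```python
import random


def seeded_rng(trial_seed: int) -> random.Random:
    """Return a private generator seeded from the trial seed."""
    return random.Random(trial_seed)


# Placeholder seed; on-cluster this would come from info.trial.trial_seed.
first = seeded_rng(1234)
second = seeded_rng(1234)

# Two generators built from the same seed produce the same sequence.
repeatable = [first.random() for _ in range(3)] == [
    second.random() for _ in range(3)
]
print(repeatable)
```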

`determined.import_from_path`

determined.import_from_path(path: PathLike) → Iterator

import_from_path allows you to import from a specific directory and cleans up afterwards.

Even if you are importing identically-named files, you can import them as separate modules. This is intended to help when you have, for example, a current model_def.py, but also import an older model_def.py from a checkpoint into the same interpreter, without conflicts (so long as you import them as different names, of course).

Example:

import determined as det

import model_def as new_model_def

# checkpoint_dir is the path to a checkpoint directory, defined elsewhere.
with det.import_from_path(checkpoint_dir):
    import model_def as old_model_def

    old_model = old_model_def.my_build_model()
    old_model.my_load_weights(checkpoint_dir)

current_model = new_model_def.my_build_model(
    base_layers=old_model.base_layers
)

Without import_from_path, the snippet above would run into trouble: model_def would already have been imported, so the second import would be a no-op, and both new_model_def and old_model_def would refer to the same underlying module.
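
The caching behavior behind this problem can be reproduced with the standard library alone. The sketch below does not use or reimplement import_from_path; it only shows that Python serves a repeated import from `sys.modules`, even when a different directory is now first on `sys.path`. The module name `model_def_demo` and both helper functions are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

MODULE = "model_def_demo"  # hypothetical module name used only in this demo


def write_module(directory: str, version: str) -> None:
    """Create <directory>/model_def_demo.py exposing a VERSION constant."""
    Path(directory, f"{MODULE}.py").write_text(f"VERSION = {version!r}\n")


def import_from(directory: str):
    """Import MODULE with `directory` temporarily first on sys.path."""
    sys.path.insert(0, directory)
    try:
        importlib.invalidate_caches()
        return importlib.import_module(MODULE)
    finally:
        sys.path.remove(directory)


with tempfile.TemporaryDirectory() as new_dir, tempfile.TemporaryDirectory() as old_dir:
    write_module(new_dir, "new")
    write_module(old_dir, "old")

    first = import_from(new_dir)
    # Same module name, different directory: served from the sys.modules cache.
    second = import_from(old_dir)
    same_module = first is second

    # Dropping the cache entry forces a real import from old_dir.
    del sys.modules[MODULE]
    third = import_from(old_dir)
    versions = (first.VERSION, second.VERSION, third.VERSION)

sys.modules.pop(MODULE, None)  # leave the interpreter as we found it
print(same_module, versions)
```

Managing `sys.path` and `sys.modules` by hand like this is error-prone, which is the bookkeeping import_from_path is described as handling for you.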
