det API Reference

determined.ClusterInfo

determined.get_cluster_info() -> Optional[determined._info.ClusterInfo]

Returns either the ClusterInfo object for the current task, or None if not running in a task.

class determined.ClusterInfo(master_url: str, cluster_id: str, agent_id: str, slot_ids: List[int], task_id: str, allocation_id: str, session_token: str, task_type: str, master_cert_name: Optional[str] = None, master_cert_file: Optional[str] = None, latest_checkpoint: Optional[str] = None, trial_info: Optional[determined._info.TrialInfo] = None, rendezvous_info: Optional[determined._info.RendezvousInfo] = None, resources_info: Optional[determined._info.ResourcesInfo] = None)

ClusterInfo exposes various properties that are set for tasks while running on the cluster.

Examples:

import determined as det

info = det.get_cluster_info()
assert info is not None, "this code only runs on-cluster!"

print("master_url", info.master_url)
print("task_id", info.task_id)
print("allocation_id", info.allocation_id)
print("session_token", info.session_token)

print("container_addrs", info.container_addrs)
print("container_rank", info.container_rank)

if info.task_type == "TRIAL":
    print("trial.id", info.trial.id)
    print("trial.hparams", info.trial.hparams)

Warning

Be careful with this object! If your training code depends on a ClusterInfo object for anything beyond informational logging, it may no longer run outside of Determined. ClusterInfo is most useful to custom launch layers, which are unlikely to run outside of Determined anyway.
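One way to follow the warning above is to treat ClusterInfo as optional and use it only for logging. The sketch below is a minimal illustration, not a recommended pattern from the Determined authors: the helper name describe_task is hypothetical, and the try/except import guard is an assumption about how you might keep the script runnable where Determined is not installed.

```python
from typing import Any, Dict, Optional

try:
    import determined as det  # present on-cluster; may be missing elsewhere
except ImportError:
    det = None


def describe_task(info: Optional[Any]) -> Dict[str, Any]:
    """Hypothetical helper: collect loggable fields, or {} when off-cluster."""
    if info is None:
        return {}
    return {
        "task_id": info.task_id,
        "task_type": info.task_type,
        "container_rank": info.container_rank,
    }


# get_cluster_info() returns None when not running in a task.
info = det.get_cluster_info() if det is not None else None
print("task info:", describe_task(info) or "not running in a Determined task")
```

Because the training logic never branches on the returned fields, the same script behaves identically on and off the cluster, apart from the log line.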

agent_id

The identifier of the Determined agent this container is running on.

allocation_id

The unique identifier for the current allocation.

cluster_id

The unique identifier for this cluster.

property container_addrs: List[str]

A list of addresses for all containers in the allocation, ordered by rank.

property container_rank: int

The rank assigned to this container.

When using a distributed training framework, the framework may choose a different rank for this container.

property container_slot_counts: List[int]

A list of slot counts, one per container in the allocation, ordered by rank.
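A custom launch layer might combine container_rank and container_slot_counts to derive worker ranks. The sketch below assumes one worker process per slot and contiguous rank numbering across containers; both are assumptions for illustration, and a distributed training framework may assign ranks differently. The function name global_rank_range is hypothetical.

```python
from typing import List, Tuple


def global_rank_range(
    container_rank: int, container_slot_counts: List[int]
) -> Tuple[int, int, int]:
    """Return (first_rank, end_rank_exclusive, world_size).

    Assumes one worker per slot, numbered contiguously in container order.
    """
    if not 0 <= container_rank < len(container_slot_counts):
        raise ValueError("container_rank is out of range")
    # Workers in lower-ranked containers come first.
    first = sum(container_slot_counts[:container_rank])
    end = first + container_slot_counts[container_rank]
    return first, end, sum(container_slot_counts)


# Three containers with 4, 4, and 2 slots; this is the container of rank 1.
print(global_rank_range(1, [4, 4, 2]))  # (4, 8, 10)
```

On-cluster, the arguments would come from info.container_rank and info.container_slot_counts.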

property gpu_uuids: List[str]

The UUIDs of the GPUs assigned to this container.

property latest_checkpoint: Optional[str]

The checkpoint ID of the most recent checkpoint that should be loaded.

Since non-trial-type tasks cannot currently save checkpoints, .latest_checkpoint is currently always None for non-trial-type tasks.

master_cert_file

The file location for the master certificate, if present, or “noverify” if it has been configured not to verify the master cert.
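If you make your own HTTPS calls to the master, the three possible states of master_cert_file need to be told apart. The sketch below maps them onto the bool-or-path convention used by the verify argument of the requests library. The helper name verify_option is hypothetical, treating a missing value as "use default verification" is an assumption, the certificate path is a placeholder, and master_cert_name is not handled here.

```python
from typing import Optional, Union


def verify_option(master_cert_file: Optional[str]) -> Union[bool, str]:
    """Hypothetical helper: translate master_cert_file into a verify setting."""
    if master_cert_file is None:
        # Assumption: no custom certificate means default CA verification.
        return True
    if master_cert_file == "noverify":
        # The master was configured not to verify its certificate.
        return False
    # Otherwise the value is the path to the master certificate file.
    return master_cert_file


print(verify_option(None))                   # True
print(verify_option("noverify"))             # False
print(verify_option("/path/to/master.crt"))  # placeholder path, returned as-is
```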

master_cert_name

The name on the master certificate, when using TLS.

master_url

The URL for reaching the master.

session_token

The Determined login session token created for the current task.

slot_ids

The slot IDs assigned to this container.

task_id

The unique identifier for the current task.

task_type

The type of task. Currently one of the following string literals:

  • "TRIAL"
  • "NOTEBOOK"
  • "SHELL"
  • "COMMAND"
  • "TENSORBOARD"
  • "CHECKPOINT_GC"

Additional values may be added in the future.

property trial: determined._info.TrialInfo

The TrialInfo sub-info object for the current trial task.

Attempting to read .trial in a non-trial task type will raise a RuntimeError.
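Because reading .trial raises in non-trial tasks, code shared between task types can check task_type first. The sketch below uses stand-in objects in place of a real ClusterInfo so that it runs anywhere; the helper name trial_hparams_or_empty and the fake classes are hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Dict


def trial_hparams_or_empty(info: Any) -> Dict[str, Any]:
    """Hypothetical helper: return trial hparams, or {} outside a trial task."""
    if info is None or info.task_type != "TRIAL":
        return {}
    return dict(info.trial.hparams)


class FakeNotebookInfo:
    """Stand-in that mimics the documented RuntimeError for non-trial tasks."""

    task_type = "NOTEBOOK"

    @property
    def trial(self) -> Any:
        raise RuntimeError("not a trial task")


fake_trial_info = SimpleNamespace(
    task_type="TRIAL", trial=SimpleNamespace(hparams={"lr": 0.1})
)

print(trial_hparams_or_empty(fake_trial_info))     # {'lr': 0.1}
print(trial_hparams_or_empty(FakeNotebookInfo()))  # {}
```

Checking task_type rather than catching RuntimeError keeps the guard explicit and avoids masking unrelated errors.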

property user_data: Dict[str, Any]

The content of the data field of the experiment configuration.

Since other types of configuration files don’t allow a data field, accessing user_data from non-trial-type tasks will always return an empty dictionary.

determined.import_from_path

determined.import_from_path(path: os.PathLike) -> Iterator

import_from_path allows you to import from a specific directory and cleans up afterwards.

Even if you are importing identically-named files, you can import them as separate modules. This is intended to help when you have, for example, a current model_def.py, but also import an older model_def.py from a checkpoint into the same interpreter, without conflicts (so long as you import them as different names, of course).

Example:

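import determined as det
import model_def as new_model_def

# checkpoint_dir: path to a checkpoint that contains an older model_def.py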

with det.import_from_path(checkpoint_dir):
    import model_def as old_model_def

    old_model = old_model_def.my_build_model()
    old_model.my_load_weights(checkpoint_dir)

current_model = new_model_def.my_build_model(
    base_layers=old_model.base_layers
)

Without import_from_path, the above code snippet would run into trouble: model_def would already have been imported, so the second import would be a no-op, and both new_model_def and old_model_def would refer to the same underlying module.
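The caching behavior described above can be reproduced with the standard library alone. The sketch below first shows the naive approach returning the same module twice, then uses a hypothetical context manager, isolated_import_path, that adds a directory to sys.path and forgets newly imported modules on exit. It is an illustration of the general idea only and is not Determined's implementation of import_from_path.

```python
import contextlib
import importlib
import os
import sys
import tempfile


@contextlib.contextmanager
def isolated_import_path(path):
    """Hypothetical stand-in, NOT Determined's implementation."""
    entry = os.fspath(path)
    before = set(sys.modules)
    sys.path.insert(0, entry)
    importlib.invalidate_caches()
    try:
        yield
    finally:
        sys.path.remove(entry)
        # Forget modules first imported inside the block, so the same
        # module name can later be resolved from a different directory.
        for name in set(sys.modules) - before:
            del sys.modules[name]


def write_model_def(directory, version):
    with open(os.path.join(directory, "model_def.py"), "w") as f:
        f.write(f"VERSION = {version!r}\n")


def demo():
    with tempfile.TemporaryDirectory() as old_dir, \
            tempfile.TemporaryDirectory() as new_dir:
        write_model_def(old_dir, "old")
        write_model_def(new_dir, "new")

        # Naive approach: the second import is served from sys.modules.
        sys.path.insert(0, old_dir)
        import model_def as naive_old
        sys.path[0] = new_dir
        import model_def as naive_new
        sys.path.pop(0)
        naive_same = naive_old is naive_new
        sys.modules.pop("model_def", None)

        # Isolated approach: each import resolves against its own directory.
        with isolated_import_path(old_dir):
            import model_def as old_model_def
        with isolated_import_path(new_dir):
            import model_def as new_model_def
        return naive_same, (old_model_def.VERSION, new_model_def.VERSION)


print(demo())  # (True, ('old', 'new'))
```

The module objects stay usable after the block exits because the local names still reference them; only the sys.modules entries are dropped.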