det API Reference
determined.ClusterInfo
- determined.get_cluster_info() -> Optional[ClusterInfo]
Returns either the ClusterInfo object for the current task, or None if not running in a task.
- class determined.ClusterInfo(master_url: str, cluster_id: str, agent_id: str, slot_ids: List[int], task_id: str, allocation_id: str, session_token: str, task_type: str, master_cert_name: Optional[str] = None, master_cert_file: Optional[str] = None, latest_checkpoint: Optional[str] = None, trial_info: Optional[TrialInfo] = None, rendezvous_info: Optional[RendezvousInfo] = None, resources_info: Optional[ResourcesInfo] = None)
ClusterInfo exposes various properties that are set for tasks while running on the cluster.
Examples:
    import determined as det

    info = det.get_cluster_info()
    assert info is not None, "this code only runs on-cluster!"

    print("master_url", info.master_url)
    print("task_id", info.task_id)
    print("allocation_id", info.allocation_id)
    print("session_token", info.session_token)
    print("container_addrs", info.container_addrs)
    print("container_rank", info.container_rank)

    # .trial is only available when the task is a trial.
    if info.task_type == "TRIAL":
        print("trial.id", info.trial.trial_id)
        print("trial.hparams", info.trial.hparams)
Warning
Be careful with this object! If you depend on a ClusterInfo object during training for anything more than e.g. informational logging, you run the risk of making your training code unable to run outside of Determined. ClusterInfo is meant to be most useful to custom launch layers, which likely are not able to run outside of Determined anyway.
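One way to follow this warning is to treat ClusterInfo as optional logging context and fall back cleanly when it is absent. The sketch below is illustrative only: the helpers current_cluster_info and describe_task are hypothetical names, not part of Determined, and the code relies only on get_cluster_info() and the attributes documented on this page.

```python
from typing import Any, Dict, Optional


def current_cluster_info() -> Optional[Any]:
    """Return the ClusterInfo for this task, or None when it is unavailable."""
    try:
        import determined as det  # may not be installed outside the cluster
    except ImportError:
        return None
    return det.get_cluster_info()


def describe_task(info: Optional[Any]) -> Dict[str, Any]:
    """Build informational logging context without requiring a cluster."""
    if info is None:
        return {"on_cluster": False}
    context: Dict[str, Any] = {
        "on_cluster": True,
        "task_id": info.task_id,
        "task_type": info.task_type,
        "allocation_id": info.allocation_id,
    }
    # Reading .trial in a non-trial task raises RuntimeError, so check first.
    if info.task_type == "TRIAL":
        context["trial_id"] = info.trial.trial_id
    return context


if __name__ == "__main__":
    print(describe_task(current_cluster_info()))
```

Because describe_task accepts None, the same training script can run on a laptop and on the cluster; only the logged context differs.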
- agent_id
The identifier of the Determined agent this container is running on.
- allocation_id
The unique identifier for the current allocation.
- cluster_id
The unique identifier for this cluster.
- property container_addrs: List[str]
A list of addresses for all containers in the allocation, ordered by rank.
- property container_rank: int
The rank assigned to this container.
When using a distributed training framework, the framework may choose a different rank for this container.
- property container_slot_counts: List[int]
A list of slot counts, one per container in the allocation, ordered by rank.
- property gpu_uuids: List[str]
The UUIDs of the GPUs assigned to this container.
- property latest_checkpoint: Optional[str]
The checkpoint ID of the most recent checkpoint that should be loaded.
Because non-trial-type tasks cannot currently save checkpoints, .latest_checkpoint is always None for them.
- master_cert_file
The file location for the master certificate, if present, or "noverify" if it has been configured not to verify the master cert.
- master_cert_name
The name on the master certificate, when using TLS.
- master_url
The URL for reaching the master.
- session_token
The Determined login session token created for the current task.
- slot_ids
The slot IDs assigned to this container.
- task_id
The unique identifier for the current task.
- task_type
The type of task. Currently one of the following string literals:
"TRIAL"
"NOTEBOOK"
"SHELL"
"COMMAND"
"TENSORBOARD"
"CHECKPOINT_GC"
Additional values may be added in the future.
- property trial: TrialInfo
The TrialInfo sub-info object for the current trial task.
Attempting to read .trial in a non-trial task type will raise a RuntimeError.
- property user_data: Dict[str, Any]
The content of the data field of the experiment configuration.
Since other types of configuration files don’t allow a data field, accessing user_data from non-trial-type tasks will always return an empty dictionary.
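As a rough illustration of how a custom launch layer might combine container_addrs, container_rank, and container_slot_counts, the sketch below derives a rendezvous address and rank range for one container. The launch_plan helper is a hypothetical name, and "one worker process per slot" is an assumption made for the example; as noted above, a distributed training framework may assign ranks differently.

```python
from typing import Dict, List, Union


def launch_plan(
    container_addrs: List[str],
    container_rank: int,
    container_slot_counts: List[int],
) -> Dict[str, Union[str, int]]:
    """Derive launch parameters for one container (assumes one worker per slot)."""
    if len(container_addrs) != len(container_slot_counts):
        raise ValueError("expected one slot count per container address")
    if not 0 <= container_rank < len(container_addrs):
        raise ValueError("container_rank is out of range")
    return {
        # Both lists are ordered by rank, so index 0 is the rank-0 container.
        "chief_addr": container_addrs[0],
        # Total workers across the allocation.
        "world_size": sum(container_slot_counts),
        # Workers in lower-ranked containers come first.
        "first_worker_rank": sum(container_slot_counts[:container_rank]),
        # Workers to start in this container.
        "local_workers": container_slot_counts[container_rank],
    }


if __name__ == "__main__":
    # On-cluster, the arguments would come from the ClusterInfo properties.
    print(launch_plan(["10.0.0.1", "10.0.0.2"], 1, [4, 4]))
```

Taking plain lists rather than a ClusterInfo object keeps the helper testable outside of Determined.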
determined.import_from_path
- determined.import_from_path(path: PathLike) -> Iterator
import_from_path allows you to import from a specific directory and cleans up afterwards.
Even if you are importing identically-named files, you can import them as separate modules. This is intended to help when you have, for example, a current model_def.py, but also import an older model_def.py from a checkpoint into the same interpreter, without conflicts (so long as you import them as different names, of course).
Example:
    import model_def as new_model_def

    with det.import_from_path(checkpoint_dir):
        import model_def as old_model_def

        old_model = old_model_def.my_build_model()
        old_model.my_load_weights(checkpoint_dir)

    current_model = new_model_def.my_build_model(
        base_layers=old_model.base_layers
    )
Without import_from_path, the above code snippet would hit issues where model_def had already been imported, so the second import would have been a no-op and both new_model_def and old_model_def would represent the same underlying module.
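The caching behavior described above can be reproduced with the standard library alone. The sketch below does not use Determined; it creates two throwaway directories that each contain a model_def.py and shows that a second plain import is served from sys.modules. The function name demonstrate_import_caching and the VERSION attribute are made up for the illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def demonstrate_import_caching() -> Tuple[bool, str]:
    """Import model_def from two directories using only sys.path changes."""
    with tempfile.TemporaryDirectory() as new_dir, \
            tempfile.TemporaryDirectory() as old_dir:
        Path(new_dir, "model_def.py").write_text("VERSION = 'new'\n")
        Path(old_dir, "model_def.py").write_text("VERSION = 'old'\n")
        importlib.invalidate_caches()
        try:
            sys.path.insert(0, new_dir)
            new_model_def = importlib.import_module("model_def")
            sys.path.remove(new_dir)

            # Only old_dir is on the path now, but the cached module wins.
            sys.path.insert(0, old_dir)
            old_model_def = importlib.import_module("model_def")
            sys.path.remove(old_dir)

            return old_model_def is new_model_def, old_model_def.VERSION
        finally:
            # Leave the interpreter as we found it.
            sys.modules.pop("model_def", None)
            for directory in (new_dir, old_dir):
                if directory in sys.path:
                    sys.path.remove(directory)


if __name__ == "__main__":
    same_module, version = demonstrate_import_caching()
    print("same module object:", same_module)
    print("version seen by the second import:", version)
```

Both names end up bound to the module loaded from the first directory, which is the conflict import_from_path is meant to avoid.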