det API Reference
- determined.get_cluster_info() -> Optional[determined._info.ClusterInfo]
Returns either the ClusterInfo object for the current task, or None if not running in a task.
- class determined.ClusterInfo(master_url: str, cluster_id: str, agent_id: str, slot_ids: List[int], task_id: str, allocation_id: str, session_token: str, task_type: str, master_cert_name: Optional[str] = None, master_cert_file: Optional[str] = None, latest_checkpoint: Optional[str] = None, trial_info: Optional[determined._info.TrialInfo] = None, rendezvous_info: Optional[determined._info.RendezvousInfo] = None, resources_info: Optional[determined._info.ResourcesInfo] = None)
ClusterInfo exposes various properties that are set for tasks while running on the cluster.
    import determined as det

    info = det.get_cluster_info()
    assert info is not None, "this code only runs on-cluster!"

    print("master_url", info.master_url)
    print("task_id", info.task_id)
    print("allocation_id", info.allocation_id)
    print("session_token", info.session_token)
    print("container_addrs", info.container_addrs)
    print("container_rank", info.container_rank)

    # .trial is only readable for trial-type tasks.
    if info.task_type == "TRIAL":
        print("trial.id", info.trial.trial_id)
        print("trial.hparams", info.trial.hparams)
Be careful with this object! If you depend on a ClusterInfo object during training for anything more than e.g. informational logging, you run the risk of making your training code unable to run outside of Determined. ClusterInfo is meant to be most useful to custom launch layers, which likely are not able to run outside of Determined anyway.
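One way to follow this advice is to treat the ClusterInfo object as optional and use it only for logging. The sketch below is a minimal illustration, not part of the Determined API: `task_log_fields` is a hypothetical helper, and a `SimpleNamespace` stands in for a real ClusterInfo so the snippet runs without a cluster. Only the attribute names `task_id` and `allocation_id` come from this reference.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional


def task_log_fields(info: Optional[Any]) -> Dict[str, str]:
    """Build log fields from a ClusterInfo-like object, or from None off-cluster.

    Hypothetical helper: only the attribute names come from the reference.
    """
    if info is None:
        # Not running in a Determined task: keep working, just log less.
        return {"environment": "local"}
    return {
        "environment": "determined",
        "task_id": info.task_id,
        "allocation_id": info.allocation_id,
    }


# Stand-in object so the sketch runs without a cluster; on-cluster code would
# pass the result of det.get_cluster_info() instead.
fake_info = SimpleNamespace(task_id="task-1", allocation_id="alloc-1")
local_fields = task_log_fields(None)
cluster_fields = task_log_fields(fake_info)
```

Because the helper accepts None, the same training script can run both inside and outside Determined.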
- property agent_id: str
The identifier of the Determined agent this container is running on.
- property allocation_id: str
The unique identifier for the current allocation.
- property cluster_id: str
The unique identifier for this cluster.
- property container_addrs: List[str]
A list of addresses for all containers in the allocation, ordered by rank.
- property container_rank: int
The rank assigned to this container.
When using a distributed training framework, the framework may choose a different rank for this container.
- property container_slot_counts: List[int]
A list of slot counts for all containers in the allocation, ordered by rank.
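As an illustration of how a custom launch layer might combine `container_rank` and `container_slot_counts`, the sketch below derives a range of global worker ranks for one container. It assumes one worker per slot, with ranks assigned in container-rank order. That is an assumption of this example, not something Determined guarantees, and a distributed training framework may choose ranks differently. `global_rank_range` is a hypothetical helper name.

```python
from typing import List, Tuple


def global_rank_range(
    container_rank: int, container_slot_counts: List[int]
) -> Tuple[int, int, int]:
    """Return (first_rank, end_rank_exclusive, world_size) for one container.

    Assumes one worker per slot, numbered in container-rank order.
    """
    world_size = sum(container_slot_counts)
    # Workers in lower-ranked containers come first.
    first_rank = sum(container_slot_counts[:container_rank])
    end_rank = first_rank + container_slot_counts[container_rank]
    return first_rank, end_rank, world_size


# Three containers with 4, 4, and 2 slots; the middle container (rank 1)
# would host global ranks 4 through 7 out of 10 workers.
middle_container_ranks = global_rank_range(1, [4, 4, 2])
```

On-cluster, the two arguments would come from `info.container_rank` and `info.container_slot_counts`.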
- property gpu_uuids: List[str]
The UUIDs of the GPUs assigned to this container.
- property latest_checkpoint: Optional[str]
The checkpoint ID of the most recent checkpoint that should be loaded.
Since non-trial-type tasks cannot currently save checkpoints, .latest_checkpoint is currently always None for non-trial-type tasks.
- property master_cert_file: Optional[str]
The file location for the master certificate, if present, or “noverify” if it has been configured not to verify the master cert.
- property master_cert_name: Optional[str]
The name on the master certificate, when using TLS.
- property master_url: str
The URL for reaching the master.
- property session_token: str
The Determined login session token created for the current task.
- property slot_ids: List[int]
The slot IDs assigned to this container.
- property task_id: str
The unique identifier for the current task.
- property task_type: str
The type of task, given as a string literal such as "TRIAL". Additional values may be added in the future.
- property trial: determined._info.TrialInfo
The TrialInfo sub-info object for the current trial task.
Attempting to read .trial in a non-trial task type will raise a RuntimeError.
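Because reading `.trial` raises in non-trial tasks, code that may run in several task types can check `task_type` first. The following is a minimal sketch using stand-in classes rather than a real ClusterInfo: `trial_id_or_none`, `FakeTrialTask`, and `FakeOtherTask` are hypothetical names, and `"OTHER"` is a placeholder value, not a documented task type.

```python
from types import SimpleNamespace
from typing import Any, Optional


class FakeTrialTask:
    """Stand-in for a ClusterInfo in a trial task."""

    task_type = "TRIAL"
    trial = SimpleNamespace(trial_id=7)


class FakeOtherTask:
    """Stand-in for a non-trial task; mimics the documented RuntimeError."""

    task_type = "OTHER"  # placeholder, not a documented task type

    @property
    def trial(self) -> Any:
        raise RuntimeError("not a trial task")


def trial_id_or_none(info: Any) -> Optional[Any]:
    """Return the trial ID for trial tasks, or None for any other task type."""
    if info.task_type != "TRIAL":
        # Never touch .trial here, since that would raise RuntimeError.
        return None
    return info.trial.trial_id
```

Checking `task_type` up front avoids relying on exception handling for ordinary control flow.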
- property user_data: Dict[str, Any]
The content of the data field of the experiment configuration.
Since other types of configuration files don’t allow a data field, accessing user_data from non-trial-type tasks will always return an empty dictionary.
- determined.import_from_path(path: os.PathLike) -> Iterator
import_from_path allows you to import from a specific directory and cleans up afterwards.
Even if you are importing identically-named files, you can import them as separate modules. This is intended to help when you have, for example, a current model_def.py, but also import an older model_def.py from a checkpoint into the same interpreter, without conflicts (so long as you import them as different names, of course).
    import determined as det
    import model_def as new_model_def

    # Import the checkpoint's copy of model_def under a different name.
    with det.import_from_path(checkpoint_dir):
        import model_def as old_model_def

        old_model = old_model_def.my_build_model()
        old_model.my_load_weights(checkpoint_dir)

    current_model = new_model_def.my_build_model(
        base_layers=old_model.base_layers
    )
Without import_from_path, the above code snippet would hit issues where model_def had already been imported, so the second import would have been a no-op and both new_model_def and old_model_def would represent the same underlying module.
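The caching behavior described here comes from Python's `sys.modules` table. The sketch below is a simplified stand-in written only with the standard library, not Determined's implementation: `isolated_import` and `load_versions` are hypothetical names, and the cleanup strategy (removing the path entry and forgetting modules first imported inside the block) is an assumption about what "cleans up afterwards" could look like.

```python
import contextlib
import importlib
import pathlib
import sys
import tempfile


@contextlib.contextmanager
def isolated_import(path):
    """Temporarily import from `path`, then undo the import-system changes."""
    path = str(path)
    before = set(sys.modules)
    sys.path.insert(0, path)
    importlib.invalidate_caches()
    try:
        yield
    finally:
        sys.path.remove(path)
        # Forget modules first imported inside the block, so a later import
        # of the same name is resolved from scratch.
        for name in set(sys.modules) - before:
            del sys.modules[name]


def load_versions():
    """Import two identically named modules and return their VERSION values."""
    with tempfile.TemporaryDirectory() as new_dir, \
            tempfile.TemporaryDirectory() as old_dir:
        pathlib.Path(new_dir, "demo_model_def.py").write_text("VERSION = 'new'\n")
        pathlib.Path(old_dir, "demo_model_def.py").write_text("VERSION = 'old'\n")
        with isolated_import(new_dir):
            import demo_model_def as new_def
        with isolated_import(old_dir):
            import demo_model_def as old_def
        # The two names are bound to distinct module objects.
        return new_def.VERSION, old_def.VERSION
```

Calling `load_versions()` returns `("new", "old")`, because each `import` statement runs against a clean `sys.modules` entry for `demo_model_def`.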