Frequently Asked Questions¶
How do I install the CLI?¶
pip install determined-cli
For more details, see the installation instructions.
When trying to install the Determined command-line interface, I encounter this error:
Uninstalling a distutils installed project (...) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
If a Python library was previously installed in your environment
with distutils or another tool, pip may not be able to
upgrade or downgrade the library to the version required by Determined.
There are two recommended solutions:
Install the Determined command-line interface into a fresh virtualenv with no previous Python packages installed.
Use the --ignore-installed option with pip to force overwriting the library version(s).
pip install --ignore-installed determined-cli
Is it possible to install Determined on Kubernetes?¶
Yes; please see Installing Determined on Kubernetes for details.
After installing Determined on Kubernetes, I can’t reach the Determined master¶
Useful steps for debugging this include:
# Get the name of the Helm deployment.
helm list
# Double check the IP address and port assigned to the Determined master by looking up the master service.
kubectl get service determined-master-service-development-<helm deployment name>
# Check the status of master deployment.
kubectl describe deployment determined-master-deployment-<helm deployment name>
# Check the logs of master pod.
kubectl logs <determined-master-pod-name>
Please see Useful Helm and Kubectl Commands for more debugging tips.
When installing Determined on Kubernetes, I get an image pull error
You may be trying to install a non-released version of Determined or a version in a private registry without the right secret. Please see the documentation on how to configure which version of Determined to install on Kubernetes.
Packages and Containers¶
How do I install Python packages that my model code depends on?¶
By default, workloads execute inside a Determined-provided container that includes common deep learning libraries and frameworks. If your model code has additional dependencies, the easiest way to install them is to specify a container startup hook. For more complex dependencies, you can also use a custom Docker image.
Can I use a custom container image?¶
Yes; see the documentation on custom Docker images for details.
What happens when an experiment is archived?¶
Archiving is designed to make it easier to organize experiments by omitting information about experiment runs that are no longer relevant (e.g., training jobs that failed with an error or jobs submitted as part of the model development process).
When an experiment is archived, it is hidden from the default view in both the WebUI and the CLI, but all of the metadata associated with the experiment (including checkpoints) is preserved. An experiment can subsequently be unarchived if desired, without losing any of the experiment’s metadata.
How can I delete model checkpoints that are no longer useful?¶
The best way to delete a checkpoint is to modify the garbage collection policy of the experiment that created the checkpoint. For example, to delete all of the checkpoints associated with an experiment, run:
det experiment set gc-policy --save-experiment-best 0 --save-trial-best 0 --save-trial-latest 0 <experiment-id>
Why do my distributed training experiments never start?¶
If slots_per_trial is greater than the number of slots on a single
agent, Determined will schedule the trial over multiple machines. When
scheduling a multi-machine distributed training job, Determined requires
that the job use all of the slots (GPUs) on each agent it runs on. For
example, in a cluster that consists of 8-GPU agents, an experiment with
slots_per_trial set to 12 will never be scheduled and will instead wait
indefinitely, because 12 slots cannot be split into whole 8-GPU agents.
The distributed training documentation describes this scheduling
behavior in more detail.
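The scheduling rule described above can be sketched as a small check. This is only an illustration of the rule as stated in this answer, not Determined's scheduler: the function name fits_scheduling_rule is hypothetical, and the sketch assumes that every agent has the same number of slots and ignores how many agents are actually free.

```python
def fits_scheduling_rule(slots_per_trial: int, slots_per_agent: int) -> bool:
    """Return True if a trial of this size can ever be placed on identical agents.

    Hypothetical helper: it encodes only the rule described in this FAQ.
    """
    if slots_per_trial <= 0 or slots_per_agent <= 0:
        raise ValueError("slot counts must be positive")
    if slots_per_trial <= slots_per_agent:
        # The trial fits on a single agent.
        return True
    # A multi-machine trial must use all of the slots on each agent it runs on,
    # so its size has to be a whole multiple of the agent size.
    return slots_per_trial % slots_per_agent == 0


for requested in (4, 8, 12, 16):
    print(requested, fits_scheduling_rule(requested, slots_per_agent=8))
```

With 8-GPU agents, the check rejects 12 and accepts 16, which matches the example above.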
There may also be running tasks preventing your multi-GPU trials from
acquiring enough GPUs on a single machine. Consider adjusting
slots_per_trial or terminating existing tasks to free up slots.
Why do my multi-machine training experiments appear to be stuck?¶
Multi-machine training requires that all machines be able to connect to
each other directly. Firewall rules or network configuration may
prevent machines in your cluster from communicating.
Please check whether agent machines can reach each other outside of
Determined (e.g., using standard network diagnostic tools).
More rarely, if agents have multiple network interfaces and some of them are not routable, Determined may pick one of those interfaces rather than one that allows one agent to contact another. In this case, it is possible to set the network interface used for distributed training explicitly in the Cluster Configuration.
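One way to run the connectivity check suggested above is a short script executed on one agent against another. This is a minimal sketch using only the Python standard library. The function name can_connect is hypothetical, and it assumes that some service is listening on the chosen port of the peer machine; the host name and port in the usage comment are placeholders, not values that Determined defines.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example usage (placeholder host and port; substitute values from your cluster):
#   print(can_connect("agent-2.example.internal", 22))
```

A False result between two agents points to a firewall or routing problem that needs to be resolved outside of Determined.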
Can I use TensorFlow Core models with Determined?¶
Determined supports TensorFlow models that use the tf.keras or Estimator APIs. For models that use the low-level TensorFlow Core APIs, we recommend porting your model to use Estimator Trial; see the example of converting a TensorFlow graph into an Estimator.
Can I use TensorFlow 2 with Determined?¶
Yes; Determined supports both TensorFlow 1 and 2. The version of TensorFlow that is used for a particular experiment is controlled by the container image that has been configured for that experiment. Determined provides prebuilt Docker images that include TensorFlow 1.15, 2.2, and 2.4.
To change the container image used for an experiment, specify environment.image in the experiment configuration file. Please see Container Images for more details about configuring training environments and a more complete list of prebuilt Docker images.
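The dotted name environment.image refers to an image key nested under environment in the experiment configuration. The sketch below illustrates that nesting with a plain Python dictionary. The helper set_dotted and the image name are hypothetical and used only for illustration; in practice the setting is written directly in the experiment configuration file.

```python
import json


def set_dotted(config: dict, dotted_key: str, value) -> dict:
    """Set a nested key such as "environment.image" in a config dictionary."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


# Hypothetical image name; replace it with the image you want to use.
config = set_dotted({}, "environment.image", "registry.example.com/team/custom-image:tag")
print(json.dumps(config, indent=2))
```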
Why am I seeing significantly different metrics for trials which are paused and later continued than trials which aren’t paused?¶
When a trial is paused, the current state of the trial is saved to a checkpoint. When the trial later resumes training, Determined will reload the state of the model from the most recent checkpoint. If you observe that this process degrades the model’s training or validation metrics (compared to a model trained on the same data without interruption), one explanation is that the model’s state might not be restored accurately or completely from the checkpoint. When using PyTorch, this can sometimes happen if the PyTorch API is not used correctly.
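The effect of an incomplete restore can be seen in a toy example that does not use Determined or PyTorch at all. The sketch below runs SGD with momentum on the loss f(w) = w**2 and compares an uninterrupted run with two paused runs: one that restores both the weight and the optimizer's momentum, and one that restores only the weight. All names are illustrative assumptions. In PyTorch, the analogous state is typically obtained from the state_dict() of the model and of the optimizer.

```python
def sgd_momentum_step(w, v, lr=0.1, mu=0.9):
    """One step of SGD with momentum on the toy loss f(w) = w**2."""
    grad = 2.0 * w
    v = mu * v + grad
    w = w - lr * v
    return w, v


def train(w, v, steps):
    """Run the given number of optimizer steps from weight w and momentum v."""
    for _ in range(steps):
        w, v = sgd_momentum_step(w, v)
    return w, v


# Uninterrupted run: 10 steps.
w_ref, _ = train(1.0, 0.0, 10)

# Paused after 5 steps; the checkpoint holds the weight and the optimizer state.
w_mid, v_mid = train(1.0, 0.0, 5)
checkpoint = {"w": w_mid, "v": v_mid}
w_full, _ = train(checkpoint["w"], checkpoint["v"], 5)

# Same pause, but the momentum is reset to zero instead of being restored.
w_partial, _ = train(checkpoint["w"], 0.0, 5)

print("uninterrupted:  ", w_ref)
print("full restore:   ", w_full)
print("partial restore:", w_partial)
```

The fully restored run reproduces the uninterrupted result, while the run that drops the optimizer state ends at a different weight. This is the kind of divergence that an incomplete checkpoint can cause.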
Please verify the following in the model code:
Can I log additional TensorBoard events beyond what Determined logs automatically?¶
Yes; any additional TFEvent files that are written to
/tmp/tensorboard inside a trial container will be accessible via
TensorBoard. For example, to log a custom TensorBoard event using PyTorch:
import numpy as np
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="/tmp/tensorboard")
# batch_idx is the index of the current batch, supplied by the training loop.
writer.add_scalar("my_metric", np.random.random(), batch_idx)
For more details, as well as examples of how to do this with TF Estimator and TF Keras models, refer to the TensorBoard How-To Guide.
Can I use TensorBoard with PyTorch?¶
Yes! For an example, check out the
mnist-GAN example. This model uses a
class that automatically configures the location for writing
TensorBoard event files. Users can also directly use
torch.utils.tensorboard.SummaryWriter as shown in the snippet above.
How can I use Determined with Spark, Airflow, or Pachyderm?¶
Determined is focused on helping teams of deep learning engineers train better models more quickly. However, Determined is also designed to easily integrate with other popular ML ecosystem tools for tasks that are related to model training, such as ETL, ML pipelines, and model serving. The Works with Determined repository includes examples of how to use Determined with a variety of ML ecosystem tools, including Pachyderm, DVC, Delta Lake, Seldon, Spark, Argo, Airflow, and Kubeflow.