
Frequently Asked Questions

Installation

How do I install the CLI?

pip install determined

For more details, see the installation instructions.

When trying to install the Determined command-line interface, I encounter this distutils error:

Uninstalling a distutils installed project (...) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.

If a Python library has previously been installed in your environment with distutils or conda, pip may not be able to upgrade or downgrade the library to the version required by Determined. There are two recommended solutions:

  1. Install the Determined command-line interface into a fresh virtualenv with no previous Python packages installed.

  2. Use --ignore-installed with pip to force overwriting the library version(s).

    pip install --ignore-installed determined
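For option 1, a minimal sketch (the virtualenv path is an example):

```shell
# Create a clean virtualenv so pip resolves Determined's dependencies
# without clashing with distutils- or conda-installed packages.
python3 -m venv /tmp/det-cli-env
. /tmp/det-cli-env/bin/activate
pip install determined
```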
    

Is it possible to install Determined on Kubernetes?

Yes; please see Installing Determined on Kubernetes for details.

After installing Determined on Kubernetes, I can’t reach the Determined master

Useful steps for debugging this include:

# Get the name of the Helm deployment.
helm list

# Double-check the IP address and port assigned to the Determined master by looking up the master service.
kubectl get service determined-master-service-development-<helm deployment name>

# Check the status of the master deployment.
kubectl describe deployment determined-master-deployment-<helm deployment name>

# Check the logs of the master pod.
kubectl logs <determined-master-pod-name>

Please see Useful Helm and Kubectl Commands for more debugging tips.

When installing Determined on Kubernetes, I get an ImagePullBackOff error

You may be trying to install a non-released version of Determined or a version in a private registry without the right secret. Please see the documentation on how to configure which version of Determined to install on Kubernetes.

After upgrading Determined and trying to use the CLI, I get a det: command not found error

When upgrading from a version earlier than 0.15.0, some users may need to uninstall Determined and then install it again by running:

pip uninstall -y determined-common determined determined-cli determined-deploy
pip install determined

Packages and Containers

How do I install Python packages that my model code depends on?

By default, workloads execute inside a Determined-provided container that includes common deep learning libraries and frameworks. If your model code has additional dependencies, the easiest way to install them is to specify a container startup hook. For more complex dependencies, you can also use a custom Docker image.
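As a sketch, lightweight dependencies can be installed via a startup-hook.sh script placed at the top level of the model definition directory; Determined runs it inside the container before the training code starts (the package names below are examples):

```shell
# startup-hook.sh -- executed inside the task container before training begins.
pip install pandas==1.3.5 wandb
```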

Can I use a custom container image?

Yes; see the documentation on custom Docker images for details.

Can I use Determined with a private Docker Registry?

Yes: specify the registry path as part of the custom image name. See the documentation on custom Docker images for more details.
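For instance, an experiment configuration might reference an image in a private registry like this (the registry host and image name are hypothetical):

```yaml
environment:
  image: registry.example.com/my-team/my-image:v1.0
```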

Experiment Management

What happens when an experiment is archived?

Archiving is designed to make it easier to organize experiments by omitting information about experiment runs that are no longer relevant (e.g., training jobs that failed with an error or jobs submitted as part of the model development process).

When an experiment is archived, it is hidden from the default view in both the WebUI and the CLI, but all of the metadata associated with the experiment (including checkpoints) is preserved. An experiment can subsequently be unarchived if desired, without losing any of the experiment’s metadata.

How can I delete model checkpoints that are no longer useful?

The best way to delete a checkpoint is to modify the garbage collection policy of the experiment that created the checkpoint. For example, to delete all of the checkpoints associated with an experiment, run:

det experiment set gc-policy --save-experiment-best 0 --save-trial-best 0 --save-trial-latest 0 <experiment-id>

Distributed Training

Why do my distributed training experiments never start?

If slots_per_trial is greater than the number of slots on a single agent, Determined will schedule it over multiple machines. When scheduling a multi-machine distributed training job, Determined requires that the job uses all of the slots (GPUs) on an agent. For example, in a cluster that consists of 8-GPU agents, an experiment with slots_per_trial set to 12 will never be scheduled and will instead wait indefinitely. The distributed training documentation describes this scheduling behavior in more detail.

There may also be running tasks preventing your multi-GPU trials from acquiring enough GPUs on a single machine. Consider adjusting slots_per_trial or terminating existing tasks to free up slots in your cluster.
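The scheduling rule described above can be illustrated with a toy check (this is not Determined's scheduler, just a sketch of the constraint):

```python
def fits_cluster(slots_per_trial: int, gpus_per_agent: int) -> bool:
    """Toy model: once a job spans more than one machine, it must occupy
    every GPU on each agent it uses, so the request must be a whole
    multiple of the agent size."""
    if slots_per_trial <= gpus_per_agent:
        return True  # fits on a single agent
    return slots_per_trial % gpus_per_agent == 0

# On a cluster of 8-GPU agents:
print(fits_cluster(12, 8))  # False: the experiment waits indefinitely
print(fits_cluster(16, 8))  # True: two full agents
```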

Why do my multi-machine training experiments appear to be stuck?

Multi-machine training requires that all machines are able to connect to each other directly. There may be firewall rules or network configuration that prevent machines in your cluster from communicating. Please check if agent machines can access each other outside of Determined (e.g., using the ping or netcat tools).

More rarely, if agents have multiple network interfaces and some of them are not routable, Determined may pick one of those interfaces rather than one that allows one agent to contact another. In this case, it is possible to set the network interface used for distributed training explicitly in the Cluster Configuration.
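For example, the interface can be pinned in the master configuration (the field shown is from the cluster configuration reference; the interface name ens5 is an example — check ip addr on your agents):

```yaml
task_container_defaults:
  dtrain_network_interface: ens5
```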

Scheduling

How can I control which agents a task is scheduled on?

Agents can be grouped into multiple resource pools, and tasks can be assigned to specific resource pools. For more information, please see Resource Pools.
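For example, a task can be pinned to a pool in its configuration (the pool name is an example):

```yaml
resources:
  resource_pool: prod-a100-pool
```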

TensorFlow Support

Can I use TensorFlow Core models with Determined?

Determined supports TensorFlow models that use the tf.keras or Estimator APIs. For models that use the low-level TensorFlow Core APIs, we recommend porting your model to the Estimator API and using an Estimator Trial; see the example of converting a TensorFlow graph into an Estimator.

Can I use TensorFlow 2 with Determined?

Yes; Determined supports both TensorFlow 1 and 2. The version of TensorFlow that is used for a particular experiment is controlled by the container image that has been configured for that experiment. Determined provides prebuilt Docker images that include TensorFlow 2.4, 1.15, 2.5, and 2.6, respectively:

  • determinedai/environments:cuda-11.1-pytorch-1.9-lightning-1.3-tf-2.4-gpu-0.16.4 (default)

  • determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-0.16.4

  • determinedai/environments:cuda-11.2-pytorch-1.7-lightning-1.2-tf-2.5-gpu-0.16.4

  • determinedai/environments:cuda-11.2-pytorch-1.7-lightning-1.2-tf-2.6-gpu-0.16.4

To change the container image used for an experiment, specify environment.image in the experiment configuration file. Please see Container Images for more details about configuring training environments and a more complete list of prebuilt Docker images.
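For example, to run an experiment against the TensorFlow 1.15 image from the list above, the experiment configuration would include:

```yaml
environment:
  image: determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-0.16.4
```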

How do I debug a TensorFlow model inside Determined?

Please see Model Debugging in Determined.

PyTorch Support

How do I debug a PyTorch model inside Determined?

Please see Model Debugging in Determined.

Why do I see significantly different metrics for trials that are paused and later continued than for trials that are never paused?

When a trial is paused, the current state of the trial is saved to a checkpoint. When the trial later resumes training, Determined will reload the state of the model from the most recent checkpoint. If you observe that this process degrades the model’s training or validation metrics (compared to a model trained on the same data without interruption), one explanation is that the model’s state might not be restored accurately or completely from the checkpoint. When using PyTorch, this can sometimes happen if the PyTorch API is not used correctly.

Please verify the following in the model code:

  • The model is wrapped with wrap_model.

  • The optimizer is wrapped with wrap_optimizer and based on the output of wrap_model, not the original unwrapped model.

  • The LR scheduler is wrapped with wrap_lr_scheduler and based on the output of wrap_optimizer, not the original unwrapped optimizer.
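In a PyTorchTrial, the expected wiring looks roughly like this (a sketch assuming the standard PyTorchTrial APIs; MyModel and the hyperparameters are placeholders):

```python
import torch
from determined.pytorch import LRScheduler, PyTorchTrial, PyTorchTrialContext


class MyTrial(PyTorchTrial):
    def __init__(self, context: PyTorchTrialContext) -> None:
        self.context = context

        # 1. Wrap the model first ...
        self.model = self.context.wrap_model(MyModel())

        # 2. ... build the optimizer from the *wrapped* model's parameters ...
        self.optimizer = self.context.wrap_optimizer(
            torch.optim.SGD(self.model.parameters(), lr=0.1)
        )

        # 3. ... and build the LR scheduler from the *wrapped* optimizer.
        self.lr_scheduler = self.context.wrap_lr_scheduler(
            torch.optim.lr_scheduler.StepLR(self.optimizer, step_size=10),
            step_mode=LRScheduler.StepMode.STEP_EVERY_EPOCH,
        )
```

Basing the optimizer on the unwrapped model (or the scheduler on the unwrapped optimizer) means their state is not captured in checkpoints, which produces exactly the metric divergence described above.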

TensorBoard Support

Can I log additional TensorBoard events beyond what Determined logs automatically?

Yes; any additional TFEvent files that are written to /tmp/tensorboard inside a trial container will be accessible via TensorBoard. For example, to log a custom TensorBoard event using PyTorch:

import numpy as np
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="/tmp/tensorboard")
# batch_idx is the global step from the surrounding training loop.
writer.add_scalar("my_metric", np.random.random(), batch_idx)

For more details, as well as examples of how to do this with TF Estimator and TF Keras models, refer to the TensorBoard How-To Guide.

Can I use TensorBoard with PyTorch?

Yes! For an example, check out the mnist-GAN example. That model uses the TorchWriter class, which automatically configures the location for writing TensorBoard event files. You can also use torch.utils.tensorboard.SummaryWriter directly, as shown in the snippet above.

Ecosystem Integrations

How can I use Determined with Spark, Airflow, or Pachyderm?

Determined is focused on helping teams of deep learning engineers train better models more quickly. However, Determined is also designed to easily integrate with other popular ML ecosystem tools for tasks that are related to model training, such as ETL, ML pipelines, and model serving. The Works with Determined repository includes examples of how to use Determined with a variety of ML ecosystem tools, including Pachyderm, DVC, Delta Lake, Seldon, Spark, Argo, Airflow, and Kubeflow.