Frequently Asked Questions
Installation
How do I install the CLI?
pip install determined-cli
For more details, see the installation instructions.
When trying to install the Determined command-line interface, I encounter this distutils error
Uninstalling a distutils installed project (...) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
If a Python library has previously been installed in your environment with distutils or conda, pip may not be able to upgrade or downgrade the library to the version required by Determined. There are two recommended solutions:
Install the Determined command-line interface into a fresh virtualenv with no previous Python packages installed.
Use --ignore-installed with pip to force overwriting the library version(s):
pip install --ignore-installed determined-cli
Packages and Containers
How do I install Python packages that my model code depends on?
By default, workloads execute inside a Determined-provided container that includes common deep learning libraries and frameworks. If your model code has additional dependencies, the easiest way to install them is to specify a container startup hook. For more complex dependencies, you can also use a custom Docker image.
Can I use a custom container image?
Yes; see the documentation on custom Docker images for details.
Can I use Determined with a private Docker Registry?
Yes: specify the registry path as part of the custom image name. See the documentation on custom Docker images for more details.
Multi-GPU Training
Why do my multi-GPU training experiments never start?
It might be that slots_per_trial in the experiment configuration is not a multiple of the number of GPUs on a machine, or that there are running tasks preventing your multi-GPU trials from acquiring all the GPUs on a single machine. Consider adjusting slots_per_trial or terminating existing tasks to free up slots in your cluster.
See Distributed Training for more details.
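The first condition above comes down to simple arithmetic. The following is a minimal sketch that encodes only that stated condition; the function name is hypothetical (it is not part of Determined), the 8-GPU machine size is an assumed example, and the sketch does not query the cluster or account for slots held by other tasks.

```python
def is_slot_multiple(slots_per_trial: int, gpus_per_machine: int) -> bool:
    """Return True when slots_per_trial is a multiple of gpus_per_machine.

    Hypothetical helper for illustration; not a Determined API.
    """
    if slots_per_trial <= 0 or gpus_per_machine <= 0:
        raise ValueError("both values must be positive integers")
    return slots_per_trial % gpus_per_machine == 0


# Assumed example: a cluster made up of 8-GPU machines.
for requested in (8, 12, 16):
    verdict = "ok" if is_slot_multiple(requested, 8) else "not a multiple of 8"
    print(f"slots_per_trial={requested}: {verdict}")
```

With these assumed values, a request for 12 slots is flagged, while 8 and 16 pass the check.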
Why do my multi-machine training experiments appear to be stuck?
Multi-machine training requires that all machines be able to connect to each other directly. There may be firewall rules or network configuration that prevent machines in your cluster from communicating. Please check if agent machines can access each other outside of Determined (e.g., using the ping or netcat tools).
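If ping or netcat is not available on an agent, a similar reachability check can be sketched with the Python standard library. This is only an illustration: the function names are hypothetical, and the host addresses and port you pass in are placeholders you must choose for your own cluster (the sketch makes no assumption about which ports Determined uses).

```python
import socket
from typing import Dict, Iterable


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_peers(hosts: Iterable[str], port: int, timeout: float = 3.0) -> Dict[str, bool]:
    """Map each host to whether a TCP connection to it on the given port succeeded."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Run from each agent machine in turn, for example check_peers(["<agent-address>"], <port>), a False entry suggests that a firewall rule or routing problem is blocking traffic in that direction.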
More rarely, if agents have multiple network interfaces and some of them are not routable, Determined may pick one of those interfaces rather than one that allows one agent to contact another. In this case, it is possible to set the network interface used for multi-GPU training explicitly in the Cluster Configuration.
TensorFlow Support
Can I train a TensorFlow Core model in Determined?
Determined supports TensorFlow models that use the tf.keras or Estimator APIs. For models that use the low-level TensorFlow Core APIs, we recommend porting your model to use Estimator Trial; see the example of converting a TensorFlow graph into an Estimator.