Training: Debug¶
This document aims to provide useful guidelines for debugging models with Determined. Hopefully, it will help you become a power user of Determined, but please don’t hesitate to contact us on Slack if you get stuck!
This document focuses on model debugging, not cluster debugging, so it is assumed that you have already successfully installed Determined.
Successfully running code on a Determined cluster differs from running a normal training script in the following ways:

- Your code will conform to Determined's Trial API by being organized into a subclass of Determined's Trial class (indirectly, through one of its concrete subclasses such as PyTorchTrial).
- The code will run in a Docker container on another machine.
- Your model may be run many times in a hyperparameter search.
- Your model may be run distributed across multiple GPUs or machines.
This guide will introduce each change incrementally as we work towards achieving a fully functioning model in Determined.
The basic steps for debugging are:

Model-related issues:
1. Does the original code run locally?
2. Does each method of your Trial subclass work locally?
3. Does local test mode work?

Docker- or cluster-related issues:
4. Does the original code run in a notebook or shell?
5. Does each method of your Trial subclass work in a notebook or shell?
6. Does local test mode work in a notebook or shell?

Higher-level issues:
7. Does cluster test mode work with slots_per_trial set to 1?
8. Does a single-GPU experiment work?
9. Does a multi-GPU experiment work?
Higher-level Issues¶
7. Does cluster test mode work with slots_per_trial set to 1?¶
This step is conceptually similar to Step 6, except instead of launching the command from an interactive environment, we will submit it to the cluster and let Determined manage everything.
How to test: If you had to make any customizations to your command environment while testing Steps 3, 4, or 5, make sure that you have made the same customizations in your experiment config. Then also confirm that your experiment config either does not specify resources.slots_per_trial at all, or that it is set to 1, like:
resources:
slots_per_trial: 1
Then create an experiment with the --test flag (but not the --local flag):
det experiment create myconfig.yaml my_model_dir --test
How to diagnose failures: If you were able to run local test mode inside a notebook or shell, but you are unable to successfully submit an experiment, you should focus on making sure that any customizations you made to get it to work in the notebook or shell have been properly replicated in your experiment config:
- A custom Docker image (if required) is set in the experiment config.
- Any pip install or apt install commands needed in the interactive environment are either built into a custom Docker image or written into a file called startup-hook.sh in the root of the model definition directory. See Startup Hooks for more details.
- Any custom bind mounts that were required in the interactive environment are also specified in the experiment config.
- Environment variables are set properly in the experiment config.
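For example, a startup-hook.sh that reproduces interactive-environment customizations might look like the following sketch. The specific packages shown here are placeholders, not packages this guide requires; substitute whatever your notebook or shell session actually needed:

```shell
#!/bin/bash
# startup-hook.sh — placed in the root of the model definition directory,
# this runs inside the container before training starts.
# Package names below are illustrative placeholders only.

# System-level dependency (example only):
apt-get update && apt-get install -y --no-install-recommends libsndfile1

# Python-level dependencies (example only):
pip install --no-cache-dir einops
```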
If no missing customizations are to blame, a cluster-managed experiment still introduces several new layers that could not have caused issues in local training mode:
- The checkpoint_storage settings are used for cluster-managed training. If checkpoint_storage was configured in neither the experiment config nor the master config, you will see an error message during experiment config validation, before the experiment or any trials are created. To correct it, simply provide a checkpoint_storage configuration in one of those locations (Master Configuration or Experiment Configuration).
- The configured checkpoint_storage settings are validated before training starts for an experiment on the cluster. If you get a message saying "Checkpoint storage validation failed", please review the correctness of the values in your checkpoint_storage settings.
- The experiment config is fully validated for cluster-managed experiments, more strictly than it is for --local --test mode. If you get errors related to unmarshaling JSON when trying to submit the experiment to the cluster, that is an indication that the experiment config has errors. Please review the experiment configuration.
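As a sketch, a minimal checkpoint_storage section using a shared filesystem could look like this (the host_path shown is an assumed example, not a required location):

```yaml
checkpoint_storage:
  type: shared_fs
  host_path: /mnt/checkpoints
```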
Again, if you are unable to identify the root cause of the issue yourself, please do not hesitate to contact Determined through our community support!
8. Does a single-GPU experiment work?¶
This step is just like Step 7, except it introduces hyperparameter search and will execute full training for each trial.
How to test: Configuration should be identical to Step 7. Again, confirm that your experiment config either does not specify resources.slots_per_trial at all, or that it is set to 1, like:
resources:
slots_per_trial: 1
Then create an experiment without the --test or --local flags (you will probably find the --follow or -f flag to be helpful):
det experiment create myconfig.yaml my_model_dir -f
How to diagnose failures: If Step 7 worked but Step 8 does not, there are a few high-level categories of issues to check for:
- Does the error happen when the experiment config has searcher.source_trial_id set? One thing that can occur in a real experiment but not in a --test experiment is the loading of a previous checkpoint. Errors when loading from a checkpoint can be caused by an architecture change, where the new model code is not architecture-compatible with the old model code.
- Generally, issues at this step are caused by doing training and evaluation continuously, so focus on how that change may cause issues with your code.
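To see why an architecture change breaks checkpoint loading, consider this simplified sketch. Plain dictionaries stand in for a framework's saved parameter state, and the layer names and shapes are invented purely for illustration:

```python
def find_incompatibilities(saved_state, model_state):
    """Compare a saved checkpoint's parameter shapes against the shapes
    the current model code expects; return a list of human-readable
    problems (empty if the checkpoint is loadable)."""
    problems = []
    for name, shape in saved_state.items():
        if name not in model_state:
            problems.append(f"saved parameter {name!r} no longer exists")
        elif model_state[name] != shape:
            problems.append(
                f"shape mismatch for {name!r}: saved {shape}, "
                f"model expects {model_state[name]}"
            )
    for name in model_state:
        if name not in saved_state:
            problems.append(f"model parameter {name!r} missing from checkpoint")
    return problems

# Checkpoint written by the old model code (hidden size 128)...
old = {"fc1.weight": (128, 784), "fc1.bias": (128,)}
# ...loaded by new model code that widened the layer to 256 units.
new = {"fc1.weight": (256, 784), "fc1.bias": (256,)}

for problem in find_incompatibilities(old, new):
    print(problem)
```

A real framework surfaces the same class of mismatch as a load-time error; the fix is to keep the model code architecture-compatible with the checkpoint you resume from.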
9. Does a multi-GPU experiment work?¶
This step is like Step 8 except that it introduces distributed training. Naturally, this step is only relevant if you have multiple GPUs and you wish to use distributed training.
How to test: Configuration should be like Step 7, except you will now set resources.slots_per_trial to some number greater than 1:
resources:
slots_per_trial: 2
Then create your experiment:
det experiment create myconfig.yaml my_model_dir -f
How to diagnose failures: If you are using the determined library APIs correctly, then theoretically distributed training should “just work”. However, you should be aware of some common pitfalls:
- If your experiment is not being scheduled on the cluster, ensure that your slots_per_trial setting is valid for your cluster. For example, if you have 4 Determined agents running with 4 GPUs each, your slots_per_trial could be 1, 2, 3, or 4 (which would all fit on a single machine), or it could be 8, 12, or 16 (which would take up some number of complete agent machines), but it couldn't be 5 (more than one agent but not a multiple of agent size) and it couldn't be 32 (too big for the cluster). Also ensure that there are no other notebooks, shells, or experiments on the cluster that may be consuming too many resources and preventing the experiment from starting.
- Determined normally controls the details of distributed training for you. Attempting to also control those details yourself, such as by calling tf.config.set_visible_devices() in a TFKerasTrial or EstimatorTrial, will very likely cause issues.
- Some classes of metrics must be calculated specially during distributed training. Most metrics, like loss or accuracy, can be calculated piecemeal on each worker in a distributed training job and averaged afterwards. Those metrics are handled automatically by Determined and need no special handling. Other metrics, like F1 score, cannot be averaged from individual workers' F1 scores. Determined has tooling for handling these metrics; see the docs for using custom metric reducers with PyTorch and TensorFlow Estimator.
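As a concrete illustration of the last point, the sketch below (with invented counts) shows why averaging per-worker F1 scores gives a different answer than computing F1 once from the pooled counts. Count-based quantities like true positives can simply be summed across workers, but the F1 ratio itself cannot be averaged:

```python
def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the
    denominator is zero (no positives anywhere)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Per-worker (true positive, false positive, false negative) counts
# from one evaluation pass — illustrative numbers only.
workers = [(1, 0, 9), (9, 9, 0)]

# Wrong: average each worker's locally computed F1 score.
avg_of_f1s = sum(f1(*w) for w in workers) / len(workers)

# Right: sum the raw counts across workers, then compute F1 once.
tp = sum(w[0] for w in workers)
fp = sum(w[1] for w in workers)
fn = sum(w[2] for w in workers)
global_f1 = f1(tp, fp, fn)

print(round(avg_of_f1s, 4))   # averaged per-worker F1
print(round(global_f1, 4))    # F1 of the pooled counts — different value
```

The two results disagree, which is exactly why such metrics need a custom reducer (gather the raw counts from every worker, then compute the metric once) rather than naive averaging.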