Training: Debug

This document aims to provide useful guidelines for debugging models with Determined. Hopefully, it will help you become a power user of Determined, but please don’t hesitate to contact us on Slack if you get stuck!

This document focuses on model debugging, not cluster debugging, so it is assumed that you have already successfully installed Determined.

Running code on a Determined cluster differs from running a normal training script in the following ways:

  • Your code will conform to Determined’s Trial API by being organized into a subclass of Determined’s Trial class (indirectly through one of its concrete subclasses, such as PyTorchTrial).

  • The code will run in a Docker container on another machine.

  • Your model may be run many times in a hyperparameter search.

  • Your model may be run distributed across multiple GPUs or machines.

This guide will introduce each change incrementally as we work towards achieving a fully functioning model in Determined.

The basic steps for debugging are:

Model-related issues:

  1. Does the original code run locally?

  2. Does each method of your Trial subclass work locally?

  3. Does local test mode work?

Docker- or cluster-related issues:

  4. Does the original code run in a notebook or shell?

  5. Does each method of your Trial subclass work in a notebook or shell?

  6. Does local test mode work in a notebook or shell?

Higher-level issues:

  7. Does cluster test mode work with slots_per_trial set to 1?

  8. Does a single-GPU experiment work?

  9. Does a multi-GPU experiment work?

Higher-level Issues

7. Does cluster test mode work with slots_per_trial set to 1?

This step is conceptually similar to Step 6, except instead of launching the command from an interactive environment, we will submit it to the cluster and let Determined manage everything.

How to test: If you had to make any customizations to your command environment while testing Steps 3, 4, or 5, make sure that you have made the same customizations in your experiment config. Then also confirm that your experiment config either does not specify resources.slots_per_trial at all, or that it is set to 1, like:

resources:
  slots_per_trial: 1

Then create an experiment with the --test flag (but not the --local flag):

det experiment create myconfig.yaml my_model_dir --test

How to diagnose failures: If you were able to run local test mode inside a notebook or shell, but you are unable to successfully submit an experiment, you should focus on making sure that any customizations you made to get it to work in the notebook or shell have been properly replicated in your experiment config:

  • A custom Docker image (if required) is set in the experiment config.

  • Any pip install or apt install commands needed in the interactive environment are either built into a custom Docker image or written into a file called startup-hook.sh in the root of the model definition directory (see the sketch after this list). See Startup Hooks for more details.

  • Any custom bind mounts that were required in the interactive environment are also specified in the experiment config.

  • Environment variables are set properly in the experiment config.
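
If any of the above customizations were needed, a hedged sketch of where they typically live is shown below: an experiment config excerpt followed by a startup-hook.sh. The image name, package names, paths, and environment variable are placeholders, not values required by Determined:

environment:
  image: my-registry/my-custom-image:latest     # hypothetical custom Docker image
  environment_variables:
    - DATA_DIR=/data                            # hypothetical variable read by the model code
bind_mounts:
  - host_path: /mnt/shared/datasets             # hypothetical directory on the agent machines
    container_path: /data

# startup-hook.sh, placed in the root of the model definition directory
pip install some-extra-package                  # placeholder Python dependency
apt-get update && apt-get install -y ffmpeg     # placeholder system dependency

If the notebook or shell only worked after changes like these, the experiment config must reproduce every one of them.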

If no missing customization is to blame, a cluster-managed experiment still introduces several new layers that local training mode does not exercise:

  • The checkpoint_storage settings are used for cluster-managed training. If checkpoint_storage was configured in neither the experiment config nor the master config, you will see an error message during experiment config validation, before the experiment or any trials are created. To correct it, provide a checkpoint_storage configuration in one of those locations (Master Configuration or Experiment Configuration); an example sketch follows this list.

  • The configured checkpoint_storage settings are validated before training starts for an experiment on the cluster. If you get a message saying Checkpoint storage validation failed, please review the correctness of the values in your checkpoint_storage settings.

  • The experiment config is fully validated for cluster-managed experiments, more strictly than it is for --local --test mode. If you get errors related to unmarshaling JSON when trying to submit the experiment to the cluster, that is an indication that the experiment config has errors. Please review the experiment configuration.
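
For reference, the simplest checkpoint_storage configuration in an experiment config uses a shared filesystem; the path below is a placeholder, and other backends (such as s3 with a bucket name) are configured analogously:

checkpoint_storage:
  type: shared_fs
  host_path: /mnt/checkpoints    # hypothetical directory that exists on every agent machine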

Again, if you are unable to identify the root cause of the issue yourself, please do not hesitate to contact Determined through our community support!

8. Does a single-GPU experiment work?

This step is just like Step 7, except that it introduces hyperparameter search and executes full training for each trial.

How to test: Configuration should be identical to Step 7. Again, confirm that your experiment config either does not specify resources.slots_per_trial at all, or that it is set to 1, like:

resources:
  slots_per_trial: 1

Then create an experiment without the --test or --local flags (you will probably find the --follow or -f flag to be helpful):

det experiment create myconfig.yaml my_model_dir -f
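
This is the first step at which the searcher in your experiment config actually drives full training for each trial it creates. For reference, a minimal sketch of the relevant config sections is shown below; the searcher type, metric name, trial count, and hyperparameter range are placeholders, and the exact required fields (for example, how trial length is specified) depend on your Determined version:

searcher:
  name: random
  metric: validation_loss      # must match a metric your Trial reports
  smaller_is_better: true
  max_trials: 4                # placeholder trial count
hyperparameters:
  learning_rate:
    type: double
    minval: 0.0001             # placeholder range
    maxval: 0.1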

How to diagnose failures: If Step 7 worked but Step 8 does not, there are a few high-level categories of issues to check for:

  • Does the error happen when the experiment config has searcher.source_trial_id set (see the snippet after this list)? One thing that can occur in a real experiment but not in a --test experiment is the loading of a previous checkpoint. Errors when loading from a checkpoint can be caused by an architecture change, where the new model code is no longer compatible with the checkpoint saved by the old model code.

  • More generally, issues at this step tend to arise because training and evaluation now run continuously and at full length, rather than in the abbreviated form used by test mode, so focus on how that change may affect your code.
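
For reference, warm-starting from a previous trial's checkpoint is configured with a setting like the following; the trial ID is a hypothetical placeholder:

searcher:
  source_trial_id: 123    # hypothetical ID of a previously trained trial

If this setting is present and checkpoint loading fails, try removing it (or pointing it at a compatible trial) to confirm that the checkpoint is the source of the error.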

9. Does a multi-GPU experiment work?

This step is like Step 8 except that it introduces distributed training. Naturally, this step is only relevant if you have multiple GPUs and you wish to use distributed training.

How to test: Configuration should match Step 7, except that resources.slots_per_trial is now set to a number greater than 1:

resources:
  slots_per_trial: 2

Then create your experiment:

det experiment create myconfig.yaml my_model_dir -f

How to diagnose failures: If you are using the determined library APIs correctly, then theoretically distributed training should “just work”. However, you should be aware of some common pitfalls:

  • If your experiment is not being scheduled on the cluster, ensure that your slots_per_trial setting is valid for your cluster. For example, if you have 4 Determined agents running with 4 GPUs each, your slots_per_trial could be 1, 2, 3, or 4 (which would all fit on a single machine), or it could be 8, 12, or 16 (which would take up some number of complete agent machines), but it couldn’t be 5 (more than one agent but not a multiple of agent size) and it couldn’t be 32 (too big for the cluster). Also ensure that there are no other notebooks, shells, or experiments on the cluster that may be consuming too many resources and preventing the experiment from starting.

  • Determined normally controls the details of distributed training for you. Attempting to also control those details yourself, such as by calling tf.config.set_visible_devices() in a TFKerasTrial or EstimatorTrial, will very likely cause issues.

  • Some classes of metrics must be calculated specially during distributed training. Most metrics, like loss or accuracy, can be calculated piecemeal on each worker in a distributed training job and averaged afterwards. Those metrics are handled automatically by Determined and need no special handling. Other metrics, like F1 score, cannot be averaged from individual workers’ F1 scores. Determined has tooling for handling these metrics; see the docs for using custom metric reducers with PyTorch and TensorFlow Estimator.
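
To make the F1 example concrete, here is a small framework-agnostic sketch (the per-worker counts are invented) showing why averaging per-worker F1 scores gives a different, incorrect answer than pooling the raw counts first:

# Why F1 cannot be averaged across workers: pool the raw counts instead.
# The per-worker confusion counts below are invented for illustration.
worker_counts = [
    {"tp": 90, "fp": 10, "fn": 0},   # worker 0: many positives, mostly correct
    {"tp": 1,  "fp": 0,  "fn": 99},  # worker 1: misses almost every positive
]

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Wrong: compute F1 per worker, then average the scores.
naive = sum(f1(**c) for c in worker_counts) / len(worker_counts)          # ~0.484

# Right: sum the raw counts across workers, then compute F1 once.
totals = {k: sum(c[k] for c in worker_counts) for k in ("tp", "fp", "fn")}
correct = f1(**totals)                                                    # ~0.625

print(f"averaged per-worker F1: {naive:.3f}")
print(f"F1 from pooled counts:  {correct:.3f}")

Determined's custom reducer APIs exist so that this "pool the raw values, then reduce once" pattern can be expressed during distributed training; see the PyTorch and TensorFlow Estimator documentation referenced above for the supported mechanisms.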