TensorBoard is a popular tool for visualizing and inspecting deep learning models. In Determined, you can use TensorBoard to examine individual experiments or compare multiple experiments.
Launch TensorBoard instances via the WebUI or the Determined CLI. Before launching TensorBoard instances from the CLI, install the CLI on your development machine.
Single Experiment Analysis#
To analyze a single Determined experiment using TensorBoard, use det tensorboard start <experiment-id>:
$ det tensorboard start 7
Scheduling TensorBoard (rarely-cute-man) (id: aab49ba5-3357-4145-861c-7e6ff2d702c5)...
TensorBoard (rarely-cute-man) was assigned to an agent...
Scheduling tensorboard tensorboard (id: c68c9fc9-7eed-475b-a50f-fd78406d7c83)...
TensorBoard is running at: http://localhost:8080/proxy/c68c9fc9-7eed-475b-a50f-fd78406d7c83/
disconnecting websocket
The Determined master schedules a TensorBoard instance within the cluster. Once the TensorBoard instance is running, the Determined CLI opens the TensorBoard web interface in your local browser.
To view information about scheduled and running TensorBoard instances, use:
$ det tensorboard list
 Id                                   | Owner      | Description                   | State   | Experiment Id | Trial Ids | Exit Status
--------------------------------------+------------+-------------------------------+---------+---------------+-----------+-------------
 aab49ba5-3357-4145-861c-7e6ff2d702c5 | determined | TensorBoard (rarely-cute-man) | RUNNING | 7             | N/A       | N/A
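If you want to act on this listing from a script, the table can be split on its pipe delimiters. The following is a minimal sketch, assuming the output keeps the layout shown in the sample above; the function name parse_tensorboard_list is hypothetical and is not part of the Determined CLI or SDK.

```python
from __future__ import annotations


def parse_tensorboard_list(output: str) -> list[dict[str, str]]:
    """Parse pipe-delimited table text into one dict per data row."""
    lines = [line for line in output.splitlines() if line.strip()]
    # Drop an echoed shell prompt line if the text was copied from a terminal.
    lines = [line for line in lines if not line.lstrip().startswith("$")]
    if not lines:
        return []
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        # Skip the separator row, which contains only dashes and plus signs.
        if set(line.strip()) <= set("-+"):
            continue
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


# Sample text copied from the listing above (assumed layout).
SAMPLE = """\
 Id                                   | Owner      | Description                   | State   | Experiment Id | Trial Ids | Exit Status
--------------------------------------+------------+-------------------------------+---------+---------------+-----------+-------------
 aab49ba5-3357-4145-861c-7e6ff2d702c5 | determined | TensorBoard (rarely-cute-man) | RUNNING | 7             | N/A       | N/A
"""

running = [row["Id"] for row in parse_tensorboard_list(SAMPLE) if row["State"] == "RUNNING"]
print(running)
```

Parsing human-readable output is fragile; if the column layout differs in your version, adjust the sketch accordingly.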
Multiple Experiment Analysis#
To analyze multiple experiments, use
det tensorboard start <experiment-id> <experiment-id> ....
Metrics might not be immediately available in TensorBoard when the browser window opens. It can take up to five minutes for TensorBoard to receive data and display visualizations.
Customizing TensorBoard Instances#
Determined allows you to initialize TensorBoard with an experiment configuration (YAML) file. This can be useful for running TensorBoard with a specific container image or for enabling access to additional data through a bind-mount.
Example experiment configuration file:
environment:
  image: determinedai/environments:cuda-11.3-pytorch-1.12-tf-2.8-gpu-0.20.1
bind_mounts:
  - host_path: /my/agent/path
    container_path: /my/container/path
    read_only: true
For detailed configuration settings, refer to the Job Configuration Reference.
To launch TensorBoard with an experiment configuration file, use
det tensorboard start
To view the configuration of a running TensorBoard instance, use
det tensorboard config
Analyzing Specific Trials#
Determined also supports analyzing specific trials from one or more experiments. This can be useful for comparing a small number of trials from an experiment with many trials, or for comparing trials from different experiments.
To analyze specific trials, use
det tensorboard start --trial-ids <trial_id 1> <trial_id 2> ....
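When launching TensorBoard from automation, it can help to assemble the command line in one place. The sketch below builds the argument list for the two forms shown in this guide (experiment IDs, or --trial-ids followed by trial IDs). The helper build_start_command is hypothetical, and accepting only one form per call is a simplification of this sketch, not a statement about what the CLI accepts.

```python
from __future__ import annotations

import shlex
from typing import Sequence


def build_start_command(
    experiment_ids: Sequence[int] = (),
    trial_ids: Sequence[int] = (),
) -> list[str]:
    """Build the argv list for `det tensorboard start` (hypothetical helper)."""
    # This sketch covers exactly one of the two documented forms per call.
    if bool(experiment_ids) == bool(trial_ids):
        raise ValueError("pass either experiment_ids or trial_ids")
    cmd = ["det", "tensorboard", "start"]
    if trial_ids:
        cmd.append("--trial-ids")
        cmd.extend(str(t) for t in trial_ids)
    else:
        cmd.extend(str(e) for e in experiment_ids)
    return cmd


# Print the commands instead of running them, so the sketch works without a cluster.
print(shlex.join(build_start_command(experiment_ids=[7, 8])))
print(shlex.join(build_start_command(trial_ids=[12, 34])))
```

To actually launch TensorBoard, pass the returned list to subprocess.run on a machine where the Determined CLI is installed and configured.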
Data in TensorBoard#
This section provides a brief overview of how Determined captures data from TensorFlow models. For a more in-depth discussion of how TensorBoard visualizes data, consult the TensorBoard documentation.
TensorBoard visualizes data captured during model training and validation, which is stored in tfevent files. These files are generated by writing TensorFlow summary operations to disk using a tf.summary.FileWriter. Each deep learning framework supports writing and uploading metrics as tfevent files.
FileWriters are configured to write log files, called tfevent files, to a directory known as the logdir. TensorBoard monitors this directory for changes and updates accordingly. The logdir supported by Determined is /tmp/tensorboard. All tfevent files written to this directory in a trial are uploaded to persistent storage when a trial is configured with Determined TensorBoard support.
Determined Batch Metrics#
At the end of every training workload, batch metrics are collected and stored in the database, providing a granular view of model metrics over time. Batch metrics will appear in TensorBoard under the Determined group. The x-axis of each plot corresponds to the batch number.
For example, a point at step 5 of the plot is the metric associated with the fifth batch seen.
To configure TensorBoard for a specific framework, follow the examples below:
For models using TFKerasTrial, add a determined.keras.callbacks.TensorBoard callback to your trial class:
from determined.keras import TFKerasTrial
from determined.keras.callbacks import TensorBoard


class MyModel(TFKerasTrial):
    ...

    def keras_callbacks(self):
        return [TensorBoard()]
There is no configuration necessary for trials using
By default, Estimators automatically log TensorBoard events to the
model_dir, which Determined
then moves to
For a full-length example of using TensorBoard with PyTorch, check out the
TensorBoard Lifecycle Management#
Determined automatically terminates idle TensorBoard instances. A TensorBoard instance is considered
idle if it does not receive HTTP traffic (a TensorBoard that is still being viewed by a web browser
is not considered idle). TensorBoards are terminated after 5 minutes by default; however, you can
change the timeout duration by editing
tensorboard_timeout in the master config file.
You can also terminate TensorBoard instances manually by using
det tensorboard kill
$ det tensorboard kill aab49ba5-3357-4145-861c-7e6ff2d702c5
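The idle rule described above can be illustrated with a small check: an instance counts as idle once no HTTP request has reached it for at least the timeout. This is only a sketch of the rule as stated in this guide, not Determined's implementation; the function is_idle, its parameters, and the DEFAULT_TIMEOUT constant are hypothetical names, with the five-minute default taken from the text.

```python
from datetime import datetime, timedelta

# Default idle timeout described in this guide (5 minutes).
DEFAULT_TIMEOUT = timedelta(minutes=5)


def is_idle(
    last_request_at: datetime,
    now: datetime,
    timeout: timedelta = DEFAULT_TIMEOUT,
) -> bool:
    """Return True if no HTTP request was seen for at least `timeout`."""
    return now - last_request_at >= timeout


now = datetime(2024, 1, 1, 12, 0, 0)
# A browser tab polled the instance one minute ago: not idle.
print(is_idle(now - timedelta(minutes=1), now))
# No traffic for six minutes: idle under the default timeout.
print(is_idle(now - timedelta(minutes=6), now))
```

Raising the timeout keeps instances alive longer between visits, at the cost of holding cluster resources for TensorBoards nobody is viewing.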
To open a web browser window connected to a previously launched TensorBoard instance, use
tensorboard open. To view the logs of an existing TensorBoard instance, use
Determined schedules TensorBoard instances in containers that run on agent machines. The Determined master will proxy HTTP requests to and from the TensorBoard container. TensorBoard instances are hosted on agent machines but they do not occupy GPUs.
Logging Additional TensorBoard Events#
Any additional TFEvent files that are written to the appropriate path during training are accessible to TensorBoard. The appropriate path varies by worker rank and can be obtained by one of the following functions:
For CoreAPI users:
For PyTorchTrial users:
For DeepSpeedTrial users:
For TFKerasTrial users:
For EstimatorTrial users:
For more details and examples, refer to the TensorBoard How-To Guide.