Quick Start Guide

This guide demonstrates how to train a model, perform a hyperparameter search, and run a distributed training job, all in Determined. This guide is based on the official PyTorch MNIST example and TensorFlow Fashion MNIST Tutorial.

This guide focuses on demonstrating Determined’s features at a high level. We will lightly touch on major concepts and terminology. For a slower-paced introduction to developing models with Determined, check out the PyTorch MNIST Tutorial or the TensorFlow Keras Fashion MNIST Tutorial.

Prerequisites

Access to an installation of Determined, either on your own machine or the public cloud. If you have not yet installed Determined, refer to the installation instructions.
The Determined CLI should be installed on your local machine. For installation instructions, see here. After installing the CLI, configure it to connect to your Determined cluster by setting the DET_MASTER environment variable to the hostname or IP address where Determined is running.
For Determined clusters running through det-deploy local: When you run your first experiment, Determined needs to pull down Docker images that contain environment information. The Docker images are then cached for future experiments. We suggest running the following commands to help speed up this process.

# For CPU computations
docker pull determinedai/environments:py-3.6.9-pytorch-1.4-tf-1.14-cpu-90bf50b

# For GPU computations
docker pull determinedai/environments:cuda-10.0-pytorch-1.4-tf-1.14-gpu-90bf50b

Preparing Your First Job

In this guide, we will build an image classification model for the MNIST dataset. MNIST is a dataset of grayscale images of handwritten digits, as seen below, and is commonly used to test image classification models.

The code for this guide can be downloaded here: mnist_pytorch.tgz.

Next, open a terminal window and cd into the extracted mnist_pytorch directory. The directory should contain the following files:

├── mnist_pytorch
   ├── adaptive.yaml
   ├── const.yaml
   ├── data.py
   ├── distributed.yaml
   ├── layers.py
   ├── model_def.py

The directory contains Python and YAML files. The Python files contain the model and data pipeline definitions. The .yaml files are configuration files that specify the dataset location, hyperparameters, and the number of batches for training. Each configuration file also tells Determined the entry point, that is, where the model class is located. For example, below is the const.yaml file:

description: mnist_pytorch_const
data:
    url: https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz
hyperparameters:
    learning_rate: 1.0
    global_batch_size: 64
    n_filters1: 32
    n_filters2: 64
    dropout1: 0.25
    dropout2: 0.5
searcher:
    name: single
    metric: validation_loss
    max_steps: 9
    smaller_is_better: true
entrypoint: model_def:MNistTrial
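
The entrypoint value has the form module:class. As a rough illustration of how a string in that form can be resolved to a Python class, here is a minimal sketch using only the standard library. The helper names split_entrypoint and load_entrypoint are hypothetical, and this is not Determined's own loading code; it only shows what the two halves of the string refer to.

```python
import importlib
from typing import Tuple


def split_entrypoint(entrypoint: str) -> Tuple[str, str]:
    """Split a 'module:ClassName' string into its two parts."""
    module_name, sep, class_name = entrypoint.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected 'module:ClassName', got {entrypoint!r}")
    return module_name, class_name


def load_entrypoint(entrypoint: str) -> type:
    """Import the module and return the named attribute (hypothetical helper)."""
    module_name, class_name = split_entrypoint(entrypoint)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# The value from const.yaml names the MNistTrial class in model_def.py.
print(split_entrypoint("model_def:MNistTrial"))

# The same mechanism works for any importable module, for example the stdlib.
print(load_entrypoint("collections:OrderedDict"))
```

Calling load_entrypoint("model_def:MNistTrial") would only succeed when run from a directory where model_def.py and its dependencies are importable.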

Each YAML file is specific to the type of experiment we will run during this guide:

const.yaml: train a single model on a single GPU or CPU.
distributed.yaml: train a single model using multiple GPUs (distributed training).
adaptive.yaml: train multiple models as part of a hyperparameter search, leveraging Determined’s adaptive hyperparameter search functionality.

Running Your First Job

The Determined CLI can be used to submit an experiment to the Determined cluster. An experiment is a collection of one or more trials. A trial is a training task that consists of a dataset, a deep learning model, and values for all of the model’s hyperparameters. An experiment can either train a single model (with a single trial) or perform a search over a user-defined hyperparameter space.
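
To make the terminology concrete, the sketch below models an experiment as a list of trials, where each trial is one assignment of values to the hyperparameters. The Trial class and the expand_trials function are hypothetical names used for illustration, and the exhaustive grid expansion is only a stand-in: it is not how Determined's searchers choose trials.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Sequence


@dataclass
class Trial:
    """One training task: a fixed value for every hyperparameter."""
    hyperparameters: Dict[str, float]


def expand_trials(space: Dict[str, Sequence[float]]) -> List[Trial]:
    """Build one trial per combination of candidate values (toy grid expansion)."""
    names = list(space)
    return [
        Trial(dict(zip(names, values)))
        for values in product(*(space[name] for name in names))
    ]


# Constant hyperparameters (as in const.yaml) give an experiment with one trial.
single = expand_trials({"learning_rate": [1.0], "dropout1": [0.25]})
print(len(single))

# Several candidate values per hyperparameter give one trial per combination.
search = expand_trials({"learning_rate": [0.1, 1.0], "dropout1": [0.25, 0.5, 0.75]})
print(len(search))
```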

We will start by training a single model for a fixed number of batches and with constant values for all of the hyperparameters. Run the following command in the mnist_pytorch directory:

det experiment create const.yaml .

This command tells Determined to create a new experiment using the const.yaml configuration file. Determined also needs the directory containing the model code so that it can upload the code to the cluster. Because we run the command from inside the mnist_pytorch directory, we pass . as the final argument, which tells Determined to upload all of the files in the current directory.

Once the experiment has been submitted, you should see the following output:

Preparing files (../mnist_pytorch) to send to master... 2.5KB and 4 files
Created experiment 1

We can view the experiment status and results in the Web UI. In a browser, go to http://DET_MASTER/, where DET_MASTER is the hostname or IP address of your Determined cluster. If you installed locally via det-deploy local, this will likely be http://localhost:8080/. A Determined dashboard similar to the one below will be displayed:

Here, you can see recent tasks, which include experiments, notebooks, and tensorboards. Currently, the only entry is the experiment we just submitted.

Clicking on the experiment takes you to an experiment page similar to the one below.

Experiment Page

Determined automatically tracks the metadata associated with all experiments including the hyperparameters, training and validation metrics for each model, and environment configuration. Determined is designed to foster reproducibility and collaboration among your team (or even your future self!).

For this experiment, we have one trial because all of the hyperparameters are set to constant values in const.yaml. We can drill down into a trial to view more information by clicking on it.

A trial page contains detailed information about the model, the configuration, the output logs, and the training metrics. Typically, you would have to write code for metric reporting, plots, and checkpointing while also managing the configuration for each model; by integrating with Determined’s API, every experiment gets these capabilities automatically, without any extra code.

[Image: trial page for a completed PyTorch trial]

During training, the graph on the right will update with the most current metrics you have defined. In this case, the graph displays the loss and error rate per step. A step is a workload consisting of training a model on a certain number of batches of data, where a batch is defined by the model definition’s data loader. The default is 100 batches per step, but the number can be overridden in the configuration file.
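
As a worked example of this terminology, the sketch below computes how much data the const.yaml experiment processes, using max_steps: 9 and global_batch_size: 64 from the configuration together with the default of 100 batches per step. The 60,000 figure is the size of the standard MNIST training set, which this example is assumed to use; the variable names are illustrative only.

```python
# Values taken from const.yaml and the default step size described above.
MAX_STEPS = 9
BATCHES_PER_STEP = 100
GLOBAL_BATCH_SIZE = 64

# Size of the standard MNIST training set (assumed to be what the example uses).
MNIST_TRAIN_IMAGES = 60_000

total_batches = MAX_STEPS * BATCHES_PER_STEP
total_images = total_batches * GLOBAL_BATCH_SIZE
epochs = total_images / MNIST_TRAIN_IMAGES

# prints: 900 batches, 57600 images, 0.96 epochs
print(f"{total_batches} batches, {total_images} images, {epochs:.2f} epochs")
```

Under these assumptions, the configuration trains for slightly less than one full pass over the training data.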

The panel to the left of the graph displays timing information, the hyperparameter configuration, and the best validation metric along with its corresponding checkpoint.
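
What counts as the best validation metric depends on the searcher's smaller_is_better setting, which is true for validation_loss in const.yaml. The sketch below shows that selection rule in isolation. The best_validation helper is hypothetical and the loss values are made up for illustration; they are not results from this experiment, and this is not Determined's internal code.

```python
from typing import List, Tuple


def best_validation(
    history: List[Tuple[int, float]], smaller_is_better: bool = True
) -> Tuple[int, float]:
    """Return the (step, metric) pair with the best metric value."""
    if not history:
        raise ValueError("history is empty")
    pick = min if smaller_is_better else max
    return pick(history, key=lambda item: item[1])


# Made-up validation losses for illustration only, not real results.
example_history = [(1, 0.31), (2, 0.12), (3, 0.15)]
print(best_validation(example_history, smaller_is_better=True))
```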

Again, model_def.py does not contain any code to manage checkpoints; Determined automatically checkpoints the model after calculating metrics on the validation dataset. Once training has completed, you can see the total time it took and the average batch speed. On a typical laptop, it should take about 5 minutes to train the model to reach 98% accuracy and 0.05 validation loss.

Next Steps

To begin implementing your first model using PyTorch, go to the PyTorch MNIST Tutorial.

If you prefer using TensorFlow, use the TensorFlow Keras Fashion MNIST Tutorial instead.