Determined AI Documentation
version 0.18.1
Hyperparameter Search

With the Core API you can run advanced hyperparameter searches with arbitrary training code. The search logic lives in the master, which coordinates many trials; each trial runs a train-validate-report loop:

Train

Train up to the target length chosen by the hyperparameter search algorithm and obtained via the Core API. Target lengths are absolute, not incremental, so you must track how much you have already trained in order to know how much more to train.

Validate

Validate your model to obtain the metric you configure in the searcher.metric field of your experiment config.

Report

Use the Core API to report results to the master.
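The three phases above can be sketched without any Determined machinery. In this outline, FakeOp and the metric formula are hypothetical stand-ins for the real SearcherOperation objects yielded by core_context.searcher.operations() and for your own training and validation code:

```python
# Hypothetical stand-in for Determined's SearcherOperation; only the loop
# shape is the point here, not the real API.
class FakeOp:
    def __init__(self, length):
        self.length = length  # absolute training target (e.g. total batches)

    def report_completed(self, metric):
        self.metric = metric  # the real op sends this to the master


def run_trial(ops):
    batches_trained = 0
    for op in ops:
        # Train: advance to the op's absolute target length.
        batches_trained = max(batches_trained, op.length)
        # Validate: compute the searcher metric (a made-up formula here).
        metric = 1.0 / (1 + batches_trained)
        # Report: hand the metric back to the searcher.
        op.report_completed(metric)
    return batches_trained


total = run_trial([FakeOp(10), FakeOp(20), FakeOp(40)])
```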

  1. Create a 3_hpsearch.py training script by copying the 2_checkpoints.py script you created in Report Checkpoints.

  2. In your if __name__ == "__main__" block, access the hyperparameter values chosen for this trial using the ClusterInfo API and configure the training loop accordingly:

    # NEW: read the hyperparameters chosen for this trial from the
    # ClusterInfo object returned by det.get_cluster_info().
    hparams = info.trial.hparams
    
    with det.core.init() as core_context:
        main(
            core_context=core_context,
            latest_checkpoint=latest_checkpoint,
            trial_id=trial_id,
            # NEW: configure the "model" using hparams.
            increment_by=hparams["increment_by"],
        )
    
  3. Modify main() to run the train-validate-report loop mentioned above by iterating through core_context.searcher.operations(). Each SearcherOperation from operations() has a .length attribute that specifies the absolute length of training to complete. After validating, report the searcher metric value using op.report_completed().

    batch = starting_batch
    last_checkpoint_batch = None
    for op in core_context.searcher.operations():
        # NEW: Use a while loop for easier accounting of absolute lengths.
        while batch < op.length:
            x += increment_by
            steps_completed = batch + 1
            time.sleep(.1)
            logging.info(f"x is now {x}")
            if steps_completed % 10 == 0:
                core_context.train.report_training_metrics(
                    steps_completed=steps_completed, metrics={"x": x}
                )
    
                # NEW: report progress once in a while.
                op.report_progress(batch)
    
                checkpoint_metadata = {"steps_completed": steps_completed}
                with core_context.checkpoint.store_path(checkpoint_metadata) as (path, uuid):
                    save_state(x, steps_completed, trial_id, path)
                last_checkpoint_batch = steps_completed
                if core_context.preempt.should_preempt():
                    return
            batch += 1
        # NEW: After training for each op, you typically validate and report the
        # searcher metric to the master.
        core_context.train.report_validation_metrics(
            steps_completed=steps_completed, metrics={"x": x}
        )
        op.report_completed(x)
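
Because each op.length is absolute rather than incremental, the while loop trains only the batches that remain below the target, which also makes resuming from a checkpoint work without extra bookkeeping. A framework-free sketch of that accounting (FakeOp is a hypothetical stand-in for SearcherOperation):

```python
class FakeOp:
    def __init__(self, length):
        self.length = length  # absolute number of batches to reach


def batches_per_op(ops, starting_batch=0):
    # Count how many NEW batches each op trains, given absolute targets.
    batch = starting_batch
    trained = []
    for op in ops:
        new_batches = 0
        while batch < op.length:
            batch += 1
            new_batches += 1
        trained.append(new_batches)
    return trained


# Fresh start: absolute targets 100, 200, 400 mean 100, 100, and 200
# new batches of training.
print(batches_per_op([FakeOp(100), FakeOp(200), FakeOp(400)]))  # → [100, 100, 200]

# Resumed from batch 150: the first op is already satisfied.
print(batches_per_op([FakeOp(100), FakeOp(200), FakeOp(400)], starting_batch=150))  # → [0, 50, 200]
```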
    
  4. Because the training length can vary, you might exit the train-validate-report loop before saving the last of your progress. To handle this, add a conditional save after the loop ends:

    if last_checkpoint_batch != steps_completed:
        checkpoint_metadata = {"steps_completed": steps_completed}
        with core_context.checkpoint.store_path(checkpoint_metadata) as (path, uuid):
            save_state(x, steps_completed, trial_id, path)
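
The final conditional save matters whenever the training length reached is not a multiple of the checkpoint period. A toy illustration with made-up numbers (no Determined APIs involved):

```python
def last_inloop_save(total_batches, save_every=10):
    """Return the last batch saved by a save-every-N-batches loop."""
    saved = None
    for steps_completed in range(1, total_batches + 1):
        if steps_completed % save_every == 0:
            saved = steps_completed
    return saved


total = 47
saved = last_inloop_save(total)  # in-loop saves stop at batch 40
if saved != total:               # the post-loop conditional save
    saved = total                # captures batches 41-47
```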
    
  5. Create a new 3_hpsearch.yaml file and add an entrypoint that invokes 3_hpsearch.py:

    name: core-api-stage-3
    entrypoint: ./3_hpsearch.py
    

    Add a hyperparameters section with the integer-type increment_by hyperparameter referenced in the training script:

    hyperparameters:
      increment_by:
        type: int
        minval: 1
        maxval: 8
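
To drive the search itself, 3_hpsearch.yaml also needs a searcher section naming the search algorithm and the metric to optimize. The values below are illustrative rather than taken from the tutorial listing; the metric name x matches what the training script reports:

```yaml
searcher:
  name: adaptive_asha
  metric: x
  smaller_is_better: true
  max_length:
    batches: 1000
  max_trials: 10
```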
    
  6. Run the code using the command:

    det e create 3_hpsearch.yaml . -f
    

The complete 3_hpsearch.py and 3_hpsearch.yaml listings used in this example can be found in the core_api.tgz download or in the GitHub repository.

Copyright © 2022, Determined AI