Determined AI Documentation
version 0.19.10
Hyperparameter Search

With the Core API you can run advanced hyperparameter searches with arbitrary training code. The hyperparameter search logic lives in the master, which coordinates many different trials. Each trial runs a train-validate-report loop:

Train

Train until the point chosen by the hyperparameter search algorithm, which you obtain via the Core API. Training lengths are specified as absolute values, so you must keep track of how much you have already trained to know how much more to train.

Validate

Validate your model to obtain the metric you configured in the searcher.metric field of your experiment config.

Report

Use the Core API to report results to the master.
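In outline, that loop can be sketched with plain Python stand-ins. `FakeOp` below is a hypothetical substitute for the `SearcherOperation` objects the Core API yields; the real calls appear in the steps that follow:

```python
# Sketch of the train-validate-report loop with absolute operation lengths.
# FakeOp is a stand-in for Determined's SearcherOperation, not the real class.

class FakeOp:
    def __init__(self, length):
        self.length = length  # absolute batch count to have trained by

def run_trial(ops, starting_batch=0):
    batch = starting_batch
    reported = []
    for op in ops:
        # Train only the increment between the previous op and this one,
        # because lengths are absolute, not relative.
        while batch < op.length:
            batch += 1  # one "batch" of training
        metric = batch  # placeholder for validating and computing searcher.metric
        reported.append(metric)  # stands in for op.report_completed(metric)
    return reported

# Lengths 100, 200, 300 mean each op trains only 100 additional batches.
print(run_trial([FakeOp(100), FakeOp(200), FakeOp(300)]))  # [100, 200, 300]
```

Note that a trial resuming at `starting_batch=50` against an op of length 50 trains zero additional batches but still validates and reports, which is exactly why the while-loop accounting in step 3 below works.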

  1. Create a 3_hpsearch.py training script by copying the 2_checkpoints.py script you created in Report Checkpoints.

  2. In your if __name__ == "__main__" block, access the hyperparameter values chosen for this trial using the ClusterInfo API and configure the training loop accordingly:

    hparams = info.trial.hparams
    
    with det.core.init() as core_context:
        main(
            core_context=core_context,
            latest_checkpoint=latest_checkpoint,
            trial_id=trial_id,
            # NEW: configure the "model" using hparams.
            increment_by=hparams["increment_by"],
        )
    
  3. Modify main() to run the train-validate-report loop mentioned above by iterating through core_context.searcher.operations(). Each SearcherOperation from operations() has a length attribute that specifies the absolute length of training to complete. After validating, report the searcher metric value using op.report_completed().

    batch = starting_batch
    last_checkpoint_batch = None
    for op in core_context.searcher.operations():
        # NEW: Use a while loop for easier accounting of absolute lengths.
        while batch < op.length:
            x += increment_by
            steps_completed = batch + 1
            time.sleep(0.1)
            logging.info(f"x is now {x}")
            if steps_completed % 10 == 0:
                core_context.train.report_training_metrics(
                    steps_completed=steps_completed, metrics={"x": x}
                )
    
                # NEW: report progress once in a while.
                op.report_progress(batch)
    
                checkpoint_metadata = {"steps_completed": steps_completed}
                with core_context.checkpoint.store_path(checkpoint_metadata) as (path, uuid):
                    save_state(x, steps_completed, trial_id, path)
                last_checkpoint_batch = steps_completed
                if core_context.preempt.should_preempt():
                    return
            batch += 1
        # NEW: After training for each op, you typically validate and report the
        # searcher metric to the master.
        core_context.train.report_validation_metrics(
            steps_completed=steps_completed, metrics={"x": x}
        )
        op.report_completed(x)
    
  4. Because the training length can vary, you might exit the train-validate-report loop before you have saved your latest progress. To handle this, add a conditional save after the loop ends:

    if last_checkpoint_batch != steps_completed:
        checkpoint_metadata = {"steps_completed": steps_completed}
        with core_context.checkpoint.store_path(checkpoint_metadata) as (path, uuid):
            save_state(x, steps_completed, trial_id, path)
    
  5. Create a new 3_hpsearch.yaml file and add an entrypoint that invokes 3_hpsearch.py:

    name: core-api-stage-3
    entrypoint: python3 3_hpsearch.py
    

    Add a hyperparameters section with the integer-type increment_by hyperparameter referenced in the training script:

    hyperparameters:
      increment_by:
        type: int
        minval: 1
        maxval: 8
    
  6. Run the code using the command:

    det e create 3_hpsearch.yaml . -f
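
    A runnable experiment config also needs a searcher section whose metric field matches what the script reports. The exact values are not part of the excerpts above; an adaptive ASHA search over the toy metric x might look like the following, where every field value is illustrative:

```yaml
searcher:
  name: adaptive_asha
  metric: x
  smaller_is_better: false
  max_length:
    batches: 1000
  max_trials: 16
```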
    

The complete 3_hpsearch.py and 3_hpsearch.yaml listings used in this example can be found in the core_api.tgz download or in the GitHub repository.
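Since the toy "model" in this series just adds increment_by to x on every batch, the metric each trial reports is its starting x plus batches times increment_by. A quick sketch of that relationship (not part of the original listings):

```python
def final_x(increment_by, batches, x0=0):
    # Mirror the toy update from the training loop: x += increment_by per batch.
    x = x0
    for _ in range(batches):
        x += increment_by
    return x

# A trial that draws increment_by=8 reaches a higher metric than one that
# draws increment_by=1, so a searcher configured to maximize x will tend
# to favor larger values of this hyperparameter.
print(final_x(8, 100), final_x(1, 100))  # 800 100
```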

Copyright © 2023, Determined AI