Determined AI Documentation
version 0.19.10
Hyperparameter Search

With the Core API you can run advanced hyperparameter searches with arbitrary training code. The hyperparameter search logic lives in the master, which coordinates many different trials. Each trial runs a train-validate-report loop:

Train

Train until the point chosen by the hyperparameter search algorithm, which you obtain via the Core API. Training lengths are specified as absolute values, so you must keep track of how much you have already trained to know how much more to train.

Validate

Validate your model to obtain the metric you configured in the searcher.metric field of your experiment config.

Report

Use the Core API to report results to the master.
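In outline, that loop can be sketched with plain Python stand-ins. `FakeOp` below is a hypothetical substitute for the `SearcherOperation` objects the Core API yields; the real calls appear in the steps that follow:

```python
# Sketch of the train-validate-report loop with absolute operation lengths.
# FakeOp is a stand-in for Determined's SearcherOperation, not the real class.

class FakeOp:
    def __init__(self, length):
        self.length = length  # absolute batch count to have trained by

def run_trial(ops, starting_batch=0):
    batch = starting_batch
    reported = []
    for op in ops:
        # Train only the increment between the previous op and this one,
        # because lengths are absolute, not relative.
        while batch < op.length:
            batch += 1  # one "batch" of training
        metric = batch  # placeholder for validating and computing searcher.metric
        reported.append(metric)  # stands in for op.report_completed(metric)
    return reported

# Lengths 100, 200, 300 mean each op trains only 100 additional batches.
print(run_trial([FakeOp(100), FakeOp(200), FakeOp(300)]))  # [100, 200, 300]
```

Note that a trial resuming at `starting_batch=50` against an op of length 50 trains zero additional batches but still validates and reports, which is exactly why the while-loop accounting in step 3 below works.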

  1. Create a 3_hpsearch.py training script by copying the 2_checkpoints.py script you created in Report Checkpoints.

  2. In your if __name__ == "__main__" block, access the hyperparameter values chosen for this trial using the ClusterInfo API and configure the training loop accordingly:

    hparams = info.trial.hparams
    
    with det.core.init() as core_context:
        main(
            core_context=core_context,
            latest_checkpoint=latest_checkpoint,
            trial_id=trial_id,
            # NEW: configure the "model" using hparams.
            increment_by=hparams["increment_by"],
        )
    
  3. Modify main() to run the train-validate-report loop mentioned above by iterating through core_context.searcher.operations(). Each SearcherOperation from operations() has a length attribute that specifies the absolute length of training to complete. After validating, report the searcher metric value using op.report_completed().

    batch = starting_batch
    last_checkpoint_batch = None
    for op in core_context.searcher.operations():
        # NEW: Use a while loop for easier accounting of absolute lengths.
        while batch < op.length:
            x += increment_by
            steps_completed = batch + 1
            time.sleep(0.1)
            logging.info(f"x is now {x}")
            if steps_completed % 10 == 0:
                core_context.train.report_training_metrics(
                    steps_completed=steps_completed, metrics={"x": x}
                )
    
                # NEW: report progress once in a while.
                op.report_progress(batch)
    
                checkpoint_metadata = {"steps_completed": steps_completed}
                with core_context.checkpoint.store_path(checkpoint_metadata) as (path, uuid):
                    save_state(x, steps_completed, trial_id, path)
                last_checkpoint_batch = steps_completed
                if core_context.preempt.should_preempt():
                    return
            batch += 1
        # NEW: After training for each op, you typically validate and report the
        # searcher metric to the master.
        core_context.train.report_validation_metrics(
            steps_completed=steps_completed, metrics={"x": x}
        )
        op.report_completed(x)
    
  4. Because the training length can vary, you might exit the train-validate-report loop before you have saved your latest progress. To handle this, add a conditional save after the loop ends:

    if last_checkpoint_batch != steps_completed:
        checkpoint_metadata = {"steps_completed": steps_completed}
        with core_context.checkpoint.store_path(checkpoint_metadata) as (path, uuid):
            save_state(x, steps_completed, trial_id, path)
    
  5. Create a new 3_hpsearch.yaml file and add an entrypoint that invokes 3_hpsearch.py:

    name: core-api-stage-3
    entrypoint: python3 3_hpsearch.py
    

    Add a hyperparameters section with the integer-type increment_by hyperparameter referenced in the training script:

    hyperparameters:
      increment_by:
        type: int
        minval: 1
        maxval: 8
    
  6. Run the code using the command:

    det e create 3_hpsearch.yaml . -f
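
    A runnable experiment config also needs a searcher section whose metric field matches what the script reports. The exact values are not part of the excerpts above; an adaptive ASHA search over the toy metric x might look like the following, where every field value is illustrative:

```yaml
searcher:
  name: adaptive_asha
  metric: x
  smaller_is_better: false
  max_length:
    batches: 1000
  max_trials: 16
```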
    

The complete 3_hpsearch.py and 3_hpsearch.yaml listings used in this example can be found in the core_api.tgz download or in the GitHub repository.
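Since the toy "model" in this series just adds increment_by to x on every batch, the metric each trial reports is its starting x plus batches times increment_by. A quick sketch of that relationship (not part of the original listings):

```python
def final_x(increment_by, batches, x0=0):
    # Mirror the toy update from the training loop: x += increment_by per batch.
    x = x0
    for _ in range(batches):
        x += increment_by
    return x

# A trial that draws increment_by=8 reaches a higher metric than one that
# draws increment_by=1, so a searcher configured to maximize x will tend
# to favor larger values of this hyperparameter.
print(final_x(8, 100), final_x(1, 100))  # 800 100
```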

Copyright © 2023, Determined AI