Hyperparameter Search
With the Core API you can run advanced hyperparameter searches with arbitrary training code. The hyperparameter search logic lives in the master, which coordinates many different trials. Each trial runs a train-validate-report loop, sketched in code after the list below:
- **Train**: Train until a point chosen by the hyperparameter search algorithm and obtained via the Core API. The length of training is absolute, so you have to keep track of how much you have already trained to know how much more to train.
- **Validate**: Validate your model to obtain the metric you configured in the `searcher.metric` field of your experiment config.
- **Report**: Use the Core API to report results to the master.
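In code, the loop has roughly the following shape; `train_one_batch()` and `validate()` here are hypothetical placeholders for your own training and evaluation code, not part of the Core API:

```python
# Rough shape of the train-validate-report loop; train_one_batch() and
# validate() are hypothetical placeholders for your own code.
for op in core_context.searcher.operations():
    while batch < op.length:  # op.length is absolute, not an increment
        train_one_batch(batch)
        batch += 1
    metric = validate()
    op.report_completed(metric)  # report the searcher metric to the master
```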
Create a `3_hpsearch.py` training script by copying the `2_checkpoints.py` script you created in Report Checkpoints. In your `if __name__ == "__main__"` block, access the hyperparameter values chosen for this trial using the ClusterInfo API and configure the training loop accordingly:

```python
hparams = info.trial.hparams

with det.core.init() as core_context:
    main(
        core_context=core_context,
        latest_checkpoint=latest_checkpoint,
        trial_id=trial_id,
        # NEW: configure the "model" using hparams.
        increment_by=hparams["increment_by"],
    )
```
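For context, the surrounding block that defines `info`, `latest_checkpoint`, and `trial_id` carries over from `2_checkpoints.py`; it might look roughly like this:

```python
import logging
import determined as det

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format=det.LOG_FORMAT)

    info = det.get_cluster_info()
    assert info is not None, "this example only runs on-cluster"
    latest_checkpoint = info.latest_checkpoint  # None on the first run
    trial_id = info.trial.trial_id
```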
Modify `main()` to run the train-validate-report loop mentioned above by iterating through `core_context.searcher.operations()`. Each `SearcherOperation` from `operations()` has a `length` attribute that specifies the absolute length of training to complete. After validating, report the searcher metric value using `op.report_completed()`:

```python
batch = starting_batch
last_checkpoint_batch = None
for op in core_context.searcher.operations():
    # NEW: Use a while loop for easier accounting of absolute lengths.
    while batch < op.length:
        x += increment_by
        steps_completed = batch + 1
        time.sleep(0.1)
        logging.info(f"x is now {x}")
        if steps_completed % 10 == 0:
            core_context.train.report_training_metrics(
                steps_completed=steps_completed, metrics={"x": x}
            )
            # NEW: report progress once in a while.
            op.report_progress(batch)

            checkpoint_metadata = {"steps_completed": steps_completed}
            with core_context.checkpoint.store_path(checkpoint_metadata) as (path, uuid):
                save_state(x, steps_completed, trial_id, path)
            last_checkpoint_batch = steps_completed
            if core_context.preempt.should_preempt():
                return
        batch += 1
    # NEW: After training for each op, you typically validate and report the
    # searcher metric to the master.
    core_context.train.report_validation_metrics(
        steps_completed=steps_completed, metrics={"x": x}
    )
    op.report_completed(x)
```
Because the training length can vary, you might exit the train-validate-report loop before saving the last of your progress. To handle this, add a conditional save after the loop ends:

```python
if last_checkpoint_batch != steps_completed:
    checkpoint_metadata = {"steps_completed": steps_completed}
    with core_context.checkpoint.store_path(checkpoint_metadata) as (path, uuid):
        save_state(x, steps_completed, trial_id, path)
```
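These snippets assume the `save_state()` helper carried over from the Report Checkpoints step. If you don't have it handy, a minimal sketch (not the canonical implementation) could simply serialize the state as text:

```python
import pathlib

def save_state(x, steps_completed, trial_id, checkpoint_directory):
    # Persist the toy "model" (x) along with the batch count and trial ID,
    # so a resumed trial knows where to pick up.
    path = pathlib.Path(checkpoint_directory) / "state"
    path.write_text(f"{x},{steps_completed},{trial_id}")
```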
Create a new `3_hpsearch.yaml` file and add an `entrypoint` that invokes `3_hpsearch.py`:

```yaml
name: core-api-stage-3
entrypoint: python3 3_hpsearch.py
```
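A hyperparameter search also needs a `searcher` section in the same file. A sketch using the `adaptive_asha` searcher with the `x` metric reported by the script (the budget values here are illustrative):

```yaml
searcher:
  name: adaptive_asha
  metric: x
  max_length: 100
  max_trials: 10
```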
Add a `hyperparameters` section with the integer-type `increment_by` hyperparameter referenced in the training script:

```yaml
hyperparameters:
  increment_by:
    type: int
    minval: 1
    maxval: 8
```
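The same section supports other hyperparameter types. For example, a sketch with hypothetical `double`, `log`, and `categorical` hyperparameters (names are illustrative and not used by this script):

```yaml
hyperparameters:
  dropout:
    type: double
    minval: 0.1
    maxval: 0.5
  learning_rate:
    type: log        # samples base ** x for x in [minval, maxval]
    base: 10
    minval: -5
    maxval: -1
  activation:
    type: categorical
    vals: ["relu", "tanh"]
```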
Run the code using the following command; the `-f` flag follows (streams) the logs of the experiment's first trial:

```bash
det e create 3_hpsearch.yaml . -f
```
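Once the experiment is created, you can also check on the search from the CLI (experiment ID 42 here stands in for the ID printed by `det e create`):

```bash
det experiment list           # shows experiment IDs and states
det experiment describe 42    # details for a single experiment
```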
The complete `3_hpsearch.py` and `3_hpsearch.yaml` listings used in this example can be found in the `core_api.tgz` download or in the GitHub repository.