hottune -- Online Hyperparameter Adaptation

hottune is an optional hotcb module that performs online, constrained, Bayesian-guided hyperparameter adaptation during training. It observes metrics, proposes bounded mutations, applies them at safe points, evaluates over a short horizon, and accepts or rolls back.

When to use it

hottune covers the gap between:

Static recipes (set once, hope for the best)
Full-run sweeps (expensive, offline)
Manual mid-run tweaking (error-prone, unrecorded)

The core loop: observe -> propose bounded mutation -> apply at safe point -> evaluate over horizon -> accept/reject -> persist learning
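
The loop can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not hotcb code: `toy_val_loss`, `propose`, and the `state` dict are hypothetical stand-ins for a real validation metric, the search layer, and live training state.

```python
import random

def toy_val_loss(lr: float) -> float:
    """Hypothetical stand-in for a validation metric; lowest at lr = 0.01."""
    return (lr - 0.01) ** 2

def propose(rng: random.Random, bounds=(0.7, 1.2)) -> float:
    """Draw a bounded multiplicative mutation for the learning rate."""
    return rng.uniform(*bounds)

def run_loop(steps: int = 20, epsilon: float = 0.0, seed: int = 0):
    rng = random.Random(seed)
    state = {"lr": 0.05}
    history = []
    for _ in range(steps):
        before = toy_val_loss(state["lr"])    # observe
        snapshot = dict(state)                # remember how to roll back
        state["lr"] *= propose(rng)           # propose and apply
        after = toy_val_loss(state["lr"])     # evaluate over the horizon
        accepted = (before - after) > epsilon # accept or reject
        if not accepted:
            state = snapshot                  # roll back
        history.append(accepted)              # record the decision
    return state, history
```

Because a mutation is kept only when it improves the metric by more than `epsilon`, the toy loss never gets worse than where it started.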

Installation

pip install "hotcb[tune]"   # installs optuna + pyyaml

hottune works without optuna (falls back to random proposals), but TPE-based search requires it.
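
The fallback can be pictured with the usual optional-import pattern. This is a minimal sketch, not how hotcb detects optuna internally (the source does not show that); `propose_multiplier` is a hypothetical helper, while the optuna calls it uses (`create_study`, `TPESampler`, `study.ask`, `suggest_float`) are part of optuna's public API.

```python
import random

try:
    import optuna
except ImportError:  # optuna is an optional dependency
    optuna = None

def propose_multiplier(low: float, high: float, seed: int = 0) -> float:
    """Return one proposal in [low, high], using TPE when optuna is present."""
    if optuna is None:
        # Random fallback: uniform draw inside the bounds.
        return random.Random(seed).uniform(low, high)
    optuna.logging.set_verbosity(optuna.logging.WARNING)
    study = optuna.create_study(
        direction="minimize",
        sampler=optuna.samplers.TPESampler(seed=seed),
    )
    trial = study.ask()  # ask-and-tell interface: no objective function needed
    return trial.suggest_float("lr_mult", low, high)
```

Either branch returns a value inside the requested bounds, so the caller does not need to know which sampler produced it.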

Architecture

hottune consists of five layers:

Layer	Responsibility
Metric access	Adapter-provided `env["metric"]` for framework-agnostic metric reading
Actuation	Kernel-owned actuators that can snapshot/validate/apply/restore parameters
Search	TPE or random proposal over bounded mutation space
Evaluation	Short-horizon scoring with accept/reject/rollback
Recipe evolution	Cross-run learning persisted as evolved priors

Quick start

Lightning

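import lightning.pytorch as pl  # older installs: import pytorch_lightning as pl

from hotcb import HotKernel
from hotcb.adapters.lightning import HotCBLightning
from hotcb.actuators import optimizer_actuators, loss_actuators, mutable_state

# Create MutableState with all controllable params.
# `optimizer`, `model`, and `dm` are assumed to be defined by your own code.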
loss_weights = {"weight_a": 1.0, "weight_b": 0.5}
ms = mutable_state(
    optimizer_actuators(optimizer) + loss_actuators(loss_weights)
)

kernel = HotKernel(
    run_dir="runs/exp1",
    tune_recipe_path="tune_recipe.yaml",  # optional
    mutable_state=ms,
)

# Adapter auto-discovers optimizer actuators if not already in MutableState
trainer = pl.Trainer(callbacks=[HotCBLightning(kernel)])
trainer.fit(model, datamodule=dm)

Then enable tuning from another terminal:

hotcb --dir runs/exp1 tune enable --mode active

Bare PyTorch

from hotcb import HotKernel
from hotcb.actuators import optimizer_actuators, mutable_state

# `optimizer`, `dl`, `num_epochs`, and the `metrics` dict come from your training code.
ms = mutable_state(optimizer_actuators(optimizer))
kernel = HotKernel(run_dir="runs/exp1", mutable_state=ms)

for epoch in range(num_epochs):
    for step, batch in enumerate(dl):
        # ... train step ...
        kernel.apply(env={...}, events=["train_batch_end"])

    # Decision point for tune
    kernel.apply(
        env={"step": step, "epoch": epoch, "optimizer": optimizer,
             "metric": lambda name, default=None: metrics.get(name, default)},
        events=["val_epoch_end"],
    )

kernel.close(env={"step": step, "epoch": epoch})

Runtime modes

Mode	Behavior
`off`	Module stays loaded but idle; negligible overhead
`observe`	No mutations; collects windows, evaluates pending segments
`suggest`	Writes proposals to logs but does not apply
`active`	Proposes and applies bounded mutations
`replay`	Replays prior tune mutations from a mutations log

hotcb --dir runs/exp1 tune enable --mode active
hotcb --dir runs/exp1 tune enable --mode observe
hotcb --dir runs/exp1 tune disable
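
The mode table can be read as two yes/no questions per mode: does it generate proposals, and does it change training state? The sketch below encodes that reading. `TuneMode`, `proposes`, and `mutates_training` are hypothetical names for illustration, not hotcb identifiers.

```python
from enum import Enum

class TuneMode(Enum):
    OFF = "off"
    OBSERVE = "observe"
    SUGGEST = "suggest"
    ACTIVE = "active"
    REPLAY = "replay"

def proposes(mode: TuneMode) -> bool:
    """suggest and active generate new proposals; the other modes do not."""
    return mode in (TuneMode.SUGGEST, TuneMode.ACTIVE)

def mutates_training(mode: TuneMode) -> bool:
    """active applies fresh mutations; replay re-applies logged ones."""
    return mode in (TuneMode.ACTIVE, TuneMode.REPLAY)
```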

Actuators

Actuators are the bridge between the tuner and live training state.

Built-in actuators

optimizer_actuators(optimizer) -- creates one actuator per optimizer parameter:

Param	Type	Description
`lr`	`LOG_FLOAT`	Learning rate (log-scale)
`weight_decay`	`LOG_FLOAT`	Weight decay (log-scale)
`betas`	`TUPLE`	Adam betas

loss_actuators(weights_dict) -- creates one actuator per loss-weight key:

Param	Type	Description
Each key	`FLOAT`	Set weight to absolute value

Each HotcbActuator has a state machine: INIT -> UNTOUCHED -> UNVERIFIED -> VERIFIED (or DISABLED).
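
A small sketch of that state machine follows. Only the state names and their order come from the text. What triggers each transition is not documented here, so the `ok` flag and the rule that any failure leads to DISABLED are assumptions, and `ActuatorState` and `advance` are hypothetical names.

```python
from enum import Enum

class ActuatorState(Enum):
    INIT = "init"
    UNTOUCHED = "untouched"
    UNVERIFIED = "unverified"
    VERIFIED = "verified"
    DISABLED = "disabled"

# Forward path taken from the text.
_NEXT = {
    ActuatorState.INIT: ActuatorState.UNTOUCHED,
    ActuatorState.UNTOUCHED: ActuatorState.UNVERIFIED,
    ActuatorState.UNVERIFIED: ActuatorState.VERIFIED,
}

def advance(state: ActuatorState, ok: bool = True) -> ActuatorState:
    """Move one step forward on success; a failure disables the actuator (assumed)."""
    if state is ActuatorState.DISABLED:
        return state                    # terminal in this sketch
    if not ok:
        return ActuatorState.DISABLED
    return _NEXT.get(state, state)      # VERIFIED stays VERIFIED
```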

Custom actuators

Implement the BaseActuator protocol:

class MyActuator:
    name = "my_knob"

    def snapshot(self, env: dict) -> dict: ...
    def validate(self, patch: dict, env: dict) -> ValidationResult: ...
    def apply(self, patch: dict, env: dict) -> ApplyResult: ...
    def restore(self, snapshot: dict, env: dict) -> ApplyResult: ...
    def describe_space(self) -> dict: ...

kernel.register_actuator("my_knob", MyActuator())
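
To make the protocol concrete, here is a self-contained sketch of an actuator that controls a dropout probability stored in a plain dict. The `ValidationResult` and `ApplyResult` classes below are hypothetical stand-ins with assumed fields so the example runs on its own; in real code use the hotcb types instead. The shape of the dict returned by `describe_space` is also an assumption.

```python
from dataclasses import dataclass

@dataclass
class ValidationResult:   # hypothetical stand-in, fields assumed
    ok: bool
    reason: str = ""

@dataclass
class ApplyResult:        # hypothetical stand-in, fields assumed
    ok: bool
    detail: str = ""

class DropoutActuator:
    """Controls a dropout probability held in a config dict."""
    name = "dropout_p"

    def __init__(self, config: dict, low: float = 0.0, high: float = 0.5):
        self.config, self.low, self.high = config, low, high

    def snapshot(self, env: dict) -> dict:
        # Capture just enough state to undo a later apply().
        return {"p": self.config["p"]}

    def validate(self, patch: dict, env: dict) -> ValidationResult:
        p = patch.get("p")
        if not isinstance(p, (int, float)) or not self.low <= p <= self.high:
            return ValidationResult(False, f"p must be in [{self.low}, {self.high}]")
        return ValidationResult(True)

    def apply(self, patch: dict, env: dict) -> ApplyResult:
        check = self.validate(patch, env)
        if not check.ok:
            return ApplyResult(False, check.reason)  # leave state untouched
        self.config["p"] = float(patch["p"])
        return ApplyResult(True)

    def restore(self, snapshot: dict, env: dict) -> ApplyResult:
        self.config["p"] = snapshot["p"]
        return ApplyResult(True)

    def describe_space(self) -> dict:
        return {"p": {"type": "float", "bounds": [self.low, self.high]}}
```

Validating inside `apply` means an out-of-bounds patch can never reach the live config, even if a caller skips `validate`.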

Tune recipe

The recipe defines the search space, objective, phases, acceptance criteria, and safety constraints.

version: 1
objective:
  primary: val/loss
  mode: min
  backup_metrics:
    - grad/norm
phases:
  early: {start_frac: 0.0, end_frac: 0.2}
  mid:   {start_frac: 0.2, end_frac: 0.7}
  late:  {start_frac: 0.7, end_frac: 1.0}
actuators:
  opt:
    enabled: true
    mutations:
      lr_mult:
        bounds: [0.7, 1.2]
        prior_center: 0.95
        cooldown: 2
        risk: low
  loss:
    enabled: true
    keys:
      sp_mse_w:
        mode: mult
        bounds: [0.5, 2.0]
        cooldown: 1
search:
  algorithm: tpe
  startup_trials: 8
acceptance:
  epsilon: 0.001
  horizon: next_val_epoch_end
  rollback_on_reject: true
safety:
  block_on_nan: true
  block_on_anomaly: true
  max_global_reject_streak: 4

Save the recipe as hotcb.tune.recipe.yaml in the run directory, or pass its path to the kernel via tune_recipe_path.
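
The phases block maps training progress to a named phase. A minimal sketch of that lookup, assuming progress is `step / max_steps` and each phase covers the half-open interval [start_frac, end_frac); `phase_for` is a hypothetical helper, not a hotcb function.

```python
PHASES = {
    "early": {"start_frac": 0.0, "end_frac": 0.2},
    "mid":   {"start_frac": 0.2, "end_frac": 0.7},
    "late":  {"start_frac": 0.7, "end_frac": 1.0},
}

def phase_for(step: int, max_steps: int, phases: dict = PHASES) -> str:
    """Return the name of the phase whose [start_frac, end_frac) contains step."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    frac = min(max(step / max_steps, 0.0), 1.0)
    for name, span in phases.items():
        if span["start_frac"] <= frac < span["end_frac"]:
            return name
    return list(phases)[-1]  # frac == 1.0 falls into the last phase
```

This is also why an adapter needs to know the total number of training steps: without it, the fraction cannot be computed.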

Safety model

Every mutation must pass:

Bounds check: value within recipe-declared bounds
Cooldown: per-mutation-family cooldown prevents thrashing
Risk class: v1 supports low and medium only
Stability blockers: NaN/inf loss, anomaly flags block all mutations
Reject streak limit: too many consecutive rejections pauses tuning
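
The five checks can be sketched as a single gate function. This illustrates the list above and is not hotcb's implementation: `SafetyState` and `check_mutation` are hypothetical names, cooldown is assumed to be counted in decision points, and the streak limit is assumed to trigger at `>=`.

```python
import math
from dataclasses import dataclass, field

@dataclass
class SafetyState:  # hypothetical bookkeeping, not a hotcb type
    # mutation family -> decision points since that family was last used
    points_since_last: dict = field(default_factory=dict)
    reject_streak: int = 0
    anomaly: bool = False

def check_mutation(family, value, bounds, cooldown, risk, last_loss, state,
                   max_reject_streak=4):
    """Return (allowed, reason) after running the five checks in order."""
    low, high = bounds
    if not low <= value <= high:
        return False, "out of bounds"
    # A family that was never used counts as fully cooled down.
    if state.points_since_last.get(family, cooldown) < cooldown:
        return False, "cooling down"
    if risk not in ("low", "medium"):
        return False, "unsupported risk class"
    if not math.isfinite(last_loss) or state.anomaly:
        return False, "stability blocker"
    if state.reject_streak >= max_reject_streak:
        return False, "reject streak limit reached"
    return True, "ok"
```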

Kernel freeze/replay semantics apply to tune the same way as cb/opt/loss -- prod mode blocks all tune commands.

Evaluation and acceptance

At each val_epoch_end:

If there's an active segment, evaluate it (read post-metrics, score, accept/reject)
If accepted: reset reject streak, keep mutation applied
If rejected: rollback to snapshot, increment reject streak, enter cooldown
Propose next mutation if feasible

Scoring: score_delta = primary_metric_gain - instability_penalty

Accept if score_delta > epsilon and no stability blockers.
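
The acceptance rule can be written out directly. In this sketch the gain is `pre - post` for a `min` objective and `post - pre` for `max`, which is an assumption about sign convention. How hotcb computes the instability penalty is not described here, so it is simply passed in. `primary_gain` and `decide` are hypothetical helper names.

```python
def primary_gain(pre: float, post: float, mode: str = "min") -> float:
    """Improvement in the primary metric; positive when things got better."""
    if mode not in ("min", "max"):
        raise ValueError(f"unknown mode: {mode!r}")
    return pre - post if mode == "min" else post - pre

def decide(pre, post, *, mode="min", instability_penalty=0.0,
           epsilon=0.001, blocked=False):
    """Return (accepted, score_delta) following the rule in the text."""
    score_delta = primary_gain(pre, post, mode) - instability_penalty
    accepted = score_delta > epsilon and not blocked
    return accepted, score_delta
```

With the recipe's `epsilon: 0.001`, a drop in val/loss from 0.50 to 0.48 is accepted, while a drop to 0.4995 is too small to count as a win.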

Run artifacts

File	Purpose
`hotcb.tune.recipe.yaml`	Tune recipe (search space, objective, constraints)
`hotcb.tune.mutations.jsonl`	Log of all proposed/applied/rejected mutations
`hotcb.tune.segments.jsonl`	Evaluation segments with pre/post metrics and decisions
`hotcb.tune.summary.json`	Run summary with accept rate, win rates per family
`hotcb.tune.study.sqlite`	Optional Optuna study state for TPE
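
The two .jsonl artifacts hold one JSON object per line, so they are easy to inspect with the standard library. The record schema is not documented in this section, so the `decision` field name below is purely hypothetical; pass whichever key your logs actually contain.

```python
import json
from collections import Counter
from pathlib import Path

def tally(path, field="decision"):
    """Count JSONL records by the value of `field`; missing values count as 'unknown'."""
    counts = Counter()
    p = Path(path)
    if not p.exists():
        return counts  # no log yet, e.g. tuning was never enabled
    with p.open() as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                counts[json.loads(line).get(field, "unknown")] += 1
    return counts
```

For example, `tally("runs/exp1/hotcb.tune.segments.jsonl")` returns a `Counter` keyed by whatever values that field takes.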

CLI commands

hotcb --dir runs/exp1 tune enable                     # Enable active tuning
hotcb --dir runs/exp1 tune enable --mode observe      # Observe-only mode
hotcb --dir runs/exp1 tune disable                    # Disable tuning
hotcb --dir runs/exp1 tune status                     # Show tune status/summary
hotcb --dir runs/exp1 tune set acceptance.epsilon=0.002
hotcb --dir runs/exp1 tune export-recipe --out recipe_evolved.json

YAML config

Add a tune section to hotcb.yaml:

tune:
  enabled: true
  mode: active   # active, observe, suggest

Recipe evolution

After multiple runs, evolve the recipe to incorporate learned priors:

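import json
from pathlib import Path

import yaml  # pyyaml, installed by hotcb[tune]

from hotcb.modules.tune.recipe import evolve_recipe
from hotcb.modules.tune.schemas import TuneRecipe

# Load the base recipe and the summaries of five earlier runs (run0 .. run4).
base = TuneRecipe.from_dict(yaml.safe_load(Path("tune_recipe.yaml").read_text()))
summaries = [json.loads(Path(f"run{i}/hotcb.tune.summary.json").read_text()) for i in range(5)]
evolved = evolve_recipe(base, summaries, alpha=0.3)
Path("tune_recipe_evolved.yaml").write_text(yaml.safe_dump(evolved.to_dict()))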
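
This section does not say what alpha controls. One plausible reading, shown here only as an assumption, is a blend factor that moves each prior a fraction alpha toward the values that won in earlier runs. `blend_prior` is a hypothetical helper and does not describe what evolve_recipe actually does.

```python
def blend_prior(prior_center, observed, alpha=0.3, bounds=None):
    """Move prior_center a fraction alpha toward the mean of observed values."""
    if not observed:
        return prior_center  # nothing learned, keep the old prior
    target = sum(observed) / len(observed)
    new_center = (1 - alpha) * prior_center + alpha * target
    if bounds is not None:
        # Keep the evolved prior inside the recipe's declared bounds.
        new_center = min(max(new_center, bounds[0]), bounds[1])
    return new_center
```

Under this reading, a small alpha makes the recipe change slowly across runs, and alpha = 1 would replace the prior with the observed mean outright.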

Adapter contract

Every supported adapter must expose:

env["metric"]: (name: str, default=None) -> Any -- framework-agnostic metric accessor
env["kernel"]: reference to the HotKernel instance
env["max_steps"]: total training steps for phase binning

The Lightning and Hugging Face (HF) adapters provide all three automatically.
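
For a custom training loop, the contract can be met with a small helper that builds the env dict. This is a minimal sketch: `make_metric_accessor` and `build_env` are hypothetical names, and `kernel` can be any object here so the example runs without hotcb installed.

```python
from typing import Any, Callable, Mapping

def make_metric_accessor(metrics: Mapping[str, Any]) -> Callable[..., Any]:
    """Wrap a metrics mapping in the (name, default=None) -> Any accessor."""
    def metric(name: str, default: Any = None) -> Any:
        return metrics.get(name, default)
    return metric

def build_env(kernel, metrics, *, step: int, epoch: int, max_steps: int) -> dict:
    """Assemble an env dict carrying the three keys the contract requires."""
    return {
        "step": step,
        "epoch": epoch,
        "metric": make_metric_accessor(metrics),
        "kernel": kernel,
        "max_steps": max_steps,
    }
```

The accessor closes over the mapping rather than copying it, so values written into `metrics` after `build_env` is called are still visible through `env["metric"]`.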

Standard metric names

Adapters normalize toward these where practical:

train/loss, val/loss, val/score
lr, grad/norm
time/step_sec, system/gpu_mem_mb

Error handling

If tune dependencies are missing: the module disables itself and logs an install hint
If no metric accessor is available: tuning runs observe-only and logs a warning
If no actuators are registered: no mutations are proposed
If applying a mutation fails: the failure is logged and training continues
If a rollback fails: the failure is logged and training continues
In all cases, the kernel's auto_disable_on_error setting is respected
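
The "log and keep training" behavior amounts to wrapping each risky action in a guard. A minimal sketch using only the standard library; `guarded` and the logger name are hypothetical, and the interaction with auto_disable_on_error is not modeled.

```python
import logging

log = logging.getLogger("hottune.sketch")

def guarded(action, what: str) -> bool:
    """Run action(); on failure log the error and return False instead of raising."""
    try:
        action()
        return True
    except Exception:
        # log.exception records the traceback along with the message.
        log.exception("%s failed; training continues", what)
        return False
```

A caller can use the boolean result to decide whether to count the attempt, without ever letting the exception reach the training loop.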
