hottune is an optional hotcb module that performs online, constrained, Bayesian-guided hyperparameter adaptation during training. It observes metrics, proposes bounded mutations, applies them at safe points, evaluates over a short horizon, and accepts or rolls back.
hottune covers the gap between:
- Static recipes (set once, hope for the best)
- Full-run sweeps (expensive, offline)
- Manual mid-run tweaking (error-prone, unrecorded)
The core loop: observe -> propose bounded mutation -> apply at safe point -> evaluate over horizon -> accept/reject -> persist learning
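In skeleton form, the control loop looks roughly like the following. This is an illustrative sketch, not hotcb's real API; every name here (`tune_loop` and its callback parameters) is hypothetical:

```python
def tune_loop(observe, propose, apply_patch, rollback, evaluate, epsilon=0.001):
    """Illustrative skeleton of the hottune loop (hypothetical names).

    observe()            -> dict of current metrics
    propose()            -> bounded mutation patch, or None (e.g. cooldown)
    apply_patch(patch)   -> snapshot used for rollback
    rollback(snapshot)   -> restore pre-mutation state
    evaluate(pre, post)  -> score delta over the evaluation horizon
    """
    pre = observe()                      # observe
    patch = propose()                    # propose bounded mutation
    if patch is None:
        return "skip"
    snapshot = apply_patch(patch)        # apply at a safe point
    post = observe()                     # ...after the evaluation horizon
    if evaluate(pre, post) > epsilon:    # accept/reject
        return "accept"
    rollback(snapshot)                   # roll back on reject
    return "reject"
```

The real module also persists each decision (see the artifact files below), but the accept-or-rollback shape is the same.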
```shell
pip install "hotcb[tune]"   # installs optuna + pyyaml
```

hottune works without optuna (falls back to random proposals), but TPE-based search requires it.
hottune consists of five layers:
| Layer | Responsibility |
|---|---|
| Metric access | Adapter-provided env["metric"] for framework-agnostic metric reading |
| Actuation | Kernel-owned actuators that can snapshot/validate/apply/restore parameters |
| Search | TPE or random proposal over bounded mutation space |
| Evaluation | Short-horizon scoring with accept/reject/rollback |
| Recipe evolution | Cross-run learning persisted as evolved priors |
```python
from hotcb import HotKernel
from hotcb.adapters.lightning import HotCBLightning
from hotcb.actuators import optimizer_actuators, loss_actuators, mutable_state

# Create MutableState with all controllable params
loss_weights = {"weight_a": 1.0, "weight_b": 0.5}
ms = mutable_state(
    optimizer_actuators(optimizer) + loss_actuators(loss_weights)
)

kernel = HotKernel(
    run_dir="runs/exp1",
    tune_recipe_path="tune_recipe.yaml",  # optional
    mutable_state=ms,
)

# Adapter auto-discovers optimizer actuators if not already in MutableState
trainer = pl.Trainer(callbacks=[HotCBLightning(kernel)])
trainer.fit(model, datamodule=dm)
```

Then enable tuning from another terminal:
```shell
hotcb --dir runs/exp1 tune enable --mode active
```

```python
from hotcb import HotKernel
from hotcb.actuators import optimizer_actuators, mutable_state

ms = mutable_state(optimizer_actuators(optimizer))
kernel = HotKernel(run_dir="runs/exp1", mutable_state=ms)

for epoch in range(num_epochs):
    for step, batch in enumerate(dl):
        # ... train step ...
        kernel.apply(env={...}, events=["train_batch_end"])

    # Decision point for tune
    kernel.apply(
        env={"step": step, "epoch": epoch, "optimizer": optimizer,
             "metric": lambda name, default=None: metrics.get(name, default)},
        events=["val_epoch_end"],
    )

kernel.close(env={"step": step, "epoch": epoch})
```

| Mode | Behavior |
|---|---|
| `off` | No overhead beyond tiny module existence |
| `observe` | No mutations; collects windows, evaluates pending segments |
| `suggest` | Writes proposals to logs but does not apply |
| `active` | Proposes and applies bounded mutations |
| `replay` | Replays prior tune mutations from a mutations log |
```shell
hotcb --dir runs/exp1 tune enable --mode active
hotcb --dir runs/exp1 tune enable --mode observe
hotcb --dir runs/exp1 tune disable
```

Actuators are the bridge between the tuner and live training state.
`optimizer_actuators(optimizer)` -- creates per-param actuators for the optimizer:
| Param | Type | Description |
|---|---|---|
| `lr` | LOG_FLOAT | Learning rate (log-scale) |
| `weight_decay` | LOG_FLOAT | Weight decay (log-scale) |
| `betas` | TUPLE | Adam betas |
`loss_actuators(weights_dict)` -- creates per-key actuators for loss weights:
| Param | Type | Description |
|---|---|---|
| Each key | FLOAT | Set weight to absolute value |
Each HotcbActuator has a state machine: INIT -> UNTOUCHED -> UNVERIFIED -> VERIFIED (or DISABLED).
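One plausible reading of that lifecycle, expressed as a transition table (the `Enum` and `step` helper below are illustrative, not hotcb's real API; the source does not spell out which states may move to `DISABLED`, so here it is assumed reachable from every pre-`VERIFIED` state):

```python
from enum import Enum, auto

class ActuatorState(Enum):
    INIT = auto()
    UNTOUCHED = auto()
    UNVERIFIED = auto()
    VERIFIED = auto()
    DISABLED = auto()

# Legal forward transitions (assumption: DISABLED on error from any
# non-terminal state; VERIFIED and DISABLED are terminal).
TRANSITIONS = {
    ActuatorState.INIT: {ActuatorState.UNTOUCHED, ActuatorState.DISABLED},
    ActuatorState.UNTOUCHED: {ActuatorState.UNVERIFIED, ActuatorState.DISABLED},
    ActuatorState.UNVERIFIED: {ActuatorState.VERIFIED, ActuatorState.DISABLED},
    ActuatorState.VERIFIED: set(),
    ActuatorState.DISABLED: set(),
}

def step(state, target):
    """Advance the state machine, rejecting illegal jumps."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state.name} -> {target.name}")
    return target
```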
Implement the `BaseActuator` protocol:

```python
class MyActuator:
    name = "my_knob"

    def snapshot(self, env: dict) -> dict: ...
    def validate(self, patch: dict, env: dict) -> ValidationResult: ...
    def apply(self, patch: dict, env: dict) -> ApplyResult: ...
    def restore(self, snapshot: dict, env: dict) -> ApplyResult: ...
    def describe_space(self) -> dict: ...

kernel.register_actuator("my_knob", MyActuator())
```

The recipe defines the search space, objective, phases, acceptance criteria, and safety constraints.
```yaml
version: 1

objective:
  primary: val/loss
  mode: min
  backup_metrics:
    - grad/norm

phases:
  early: {start_frac: 0.0, end_frac: 0.2}
  mid: {start_frac: 0.2, end_frac: 0.7}
  late: {start_frac: 0.7, end_frac: 1.0}

actuators:
  opt:
    enabled: true
    mutations:
      lr_mult:
        bounds: [0.7, 1.2]
        prior_center: 0.95
        cooldown: 2
        risk: low
  loss:
    enabled: true
    keys:
      sp_mse_w:
        mode: mult
        bounds: [0.5, 2.0]
        cooldown: 1

search:
  algorithm: tpe
  startup_trials: 8

acceptance:
  epsilon: 0.001
  horizon: next_val_epoch_end
  rollback_on_reject: true

safety:
  block_on_nan: true
  block_on_anomaly: true
  max_global_reject_streak: 4
```

Place the recipe at `hotcb.tune.recipe.yaml` in the run directory, or pass it via `tune_recipe_path` to the kernel.
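The `phases` block bins training progress by step fraction, using the adapter-provided `env["max_steps"]`. A minimal sketch of that binning, assuming a hypothetical helper name (`phase_for` is not hotcb's real API):

```python
def phase_for(step, max_steps, phases):
    """Return the name of the phase whose [start_frac, end_frac) interval
    contains step / max_steps; the final phase is closed at 1.0."""
    frac = step / max_steps
    for name, p in phases.items():
        if p["start_frac"] <= frac < p["end_frac"]:
            return name
        if frac == 1.0 and p["end_frac"] == 1.0:
            return name
    return None

# Phase boundaries copied from the recipe above
phases = {
    "early": {"start_frac": 0.0, "end_frac": 0.2},
    "mid":   {"start_frac": 0.2, "end_frac": 0.7},
    "late":  {"start_frac": 0.7, "end_frac": 1.0},
}
```

Phase membership lets the tuner condition proposals on where the run is (e.g. larger lr mutations early, conservative ones late).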
Every mutation must pass:
- Bounds check: value within recipe-declared bounds
- Cooldown: per-mutation-family cooldown prevents thrashing
- Risk class: v1 supports `low` and `medium` only
- Stability blockers: NaN/inf loss or anomaly flags block all mutations
- Reject streak limit: too many consecutive rejections pauses tuning
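These gates compose as a short-circuit chain: the first one that fails vetoes the mutation. An illustrative sketch (the function and its signature are hypothetical, not hotcb's real API):

```python
import math

def gate_mutation(value, bounds, steps_since_last, cooldown,
                  risk, loss, reject_streak, max_reject_streak=4):
    """Return (ok, reason); a mutation applies only if every gate passes."""
    if not (bounds[0] <= value <= bounds[1]):
        return False, "out_of_bounds"       # bounds check
    if steps_since_last < cooldown:
        return False, "cooldown"            # per-family cooldown
    if risk not in ("low", "medium"):
        return False, "risk_class"          # v1 risk classes
    if math.isnan(loss) or math.isinf(loss):
        return False, "unstable"            # stability blocker
    if reject_streak >= max_reject_streak:
        return False, "reject_streak"       # pause tuning
    return True, "ok"
```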
Kernel freeze/replay semantics apply to tune the same way as cb/opt/loss -- prod mode blocks all tune commands.
At each val_epoch_end:
- If there's an active segment, evaluate it (read post-metrics, score, accept/reject)
- If accepted: reset reject streak, keep mutation applied
- If rejected: rollback to snapshot, increment reject streak, enter cooldown
- Propose next mutation if feasible
Scoring: `score_delta = primary_metric_gain - instability_penalty`

Accept if `score_delta > epsilon` and no stability blockers are active.
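That decision rule fits in a few lines. A sketch under stated assumptions: for a `mode: min` objective the metric gain is taken as `pre - post` (loss going down is a gain), and `decide` is a hypothetical name:

```python
def decide(pre, post, epsilon=0.001, instability_penalty=0.0, blocked=False):
    """score_delta = primary_metric_gain - instability_penalty;
    accept only if score_delta > epsilon and no stability blocker fired."""
    score_delta = (pre - post) - instability_penalty
    return (score_delta > epsilon) and not blocked
```

Note that a flat segment (gain below `epsilon`) is rejected and rolled back, which biases the tuner toward keeping only mutations with a measurable win.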
| File | Purpose |
|---|---|
| `hotcb.tune.recipe.yaml` | Tune recipe (search space, objective, constraints) |
| `hotcb.tune.mutations.jsonl` | Log of all proposed/applied/rejected mutations |
| `hotcb.tune.segments.jsonl` | Evaluation segments with pre/post metrics and decisions |
| `hotcb.tune.summary.json` | Run summary with accept rate, win rates per family |
| `hotcb.tune.study.sqlite` | Optional Optuna study state for TPE |
```shell
hotcb --dir runs/exp1 tune enable                    # Enable active tuning
hotcb --dir runs/exp1 tune enable --mode observe     # Observe-only mode
hotcb --dir runs/exp1 tune disable                   # Disable tuning
hotcb --dir runs/exp1 tune status                    # Show tune status/summary
hotcb --dir runs/exp1 tune set acceptance.epsilon=0.002
hotcb --dir runs/exp1 tune export-recipe --out recipe_evolved.json
```

Add a `tune` section to `hotcb.yaml`:
```yaml
tune:
  enabled: true
  mode: active  # active, observe, suggest
```

After multiple runs, evolve the recipe to incorporate learned priors:
```python
from hotcb.modules.tune.recipe import evolve_recipe
from hotcb.modules.tune.schemas import TuneRecipe

base = TuneRecipe.from_dict(load_yaml("tune_recipe.yaml"))
summaries = [load_json(f"run{i}/hotcb.tune.summary.json") for i in range(5)]

evolved = evolve_recipe(base, summaries, alpha=0.3)
save_yaml("tune_recipe_evolved.yaml", evolved.to_dict())
```

Every supported adapter must expose:
- `env["metric"]`: `(name: str, default=None) -> Any` -- framework-agnostic metric accessor
- `env["kernel"]`: reference to the HotKernel instance
- `env["max_steps"]`: total training steps for phase binning
The Lightning and HF adapters provide all three automatically.
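For a custom loop with no adapter, the same contract amounts to assembling the `env` dict yourself, as in the manual-loop example earlier. A minimal sketch (`build_env` is a hypothetical helper; `kernel` stays `None` here purely for illustration):

```python
def build_env(step, max_steps, metrics, kernel=None):
    """Assemble the env dict a custom training loop would pass to
    kernel.apply(); the metric accessor closes over a plain metrics dict."""
    return {
        "step": step,
        "max_steps": max_steps,
        "kernel": kernel,
        "metric": lambda name, default=None: metrics.get(name, default),
    }
```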
Adapters normalize toward these where practical:
- `train/loss`, `val/loss`, `val/score`
- `lr`, `grad/norm`
- `time/step_sec`, `system/gpu_mem_mb`
- If tune deps missing: module self-disables, logs install hint
- If no metric accessor: observe-only with warning
- If no actuators registered: no mutations proposed
- If mutation apply fails: logged, training continues
- If rollback fails: logged, training continues
- Respects the `auto_disable_on_error` kernel setting