Live Training Control Plane for PyTorch
hotcb lets you modify training behavior while your run is active: no restart, no lost progress. Every change is recorded, exportable, and replayable.
Version 2.0 expands the original live-callback system into a full control plane: you can now swap callbacks, tune optimizer parameters, and adjust loss weights, all from another terminal while the model trains.
Tune hyperparameters in 2 runs, not 2,000.
final_demo_ai_autopilot.mp4
The video shows a model distillation run being balanced between classification feature tokens and spatial maps, in terms of both direction (cosine) and structure (MSE + gradient). The mutations applied here are taken autonomously by the hotcb AI-Autopilot mode; their impact (features vs. spatial maps) is shown quantitatively (impact analysis) and qualitatively (curve steps).
| Module | What you can change live |
|---|---|
| cb | Load/unload/enable/disable/reconfigure callbacks |
| opt | Learning rate, weight decay, gradient clipping, per-group overrides |
| loss | Loss weights, term toggles, ramp configs |
| tune | Online constrained hyperparameter adaptation (optional, hotcb[tune]) |
Plus:
- Dashboard (`hotcb serve`): live metric charts, command panel, recipe editor, autopilot controls
- Autopilot: rule-based and AI-driven training optimization; plateau/divergence/overfitting detection with automatic or LLM-guided intervention
- AI Autopilot (`hotcb[ai]`): LLM reads compressed metric trends, analyzes alerts, and proposes/applies hotcb commands, with budget caps, safety guards, and multi-run memory
- Programmatic launch API (`hotcb.launch`): start training + dashboard + autopilot in one call from notebooks/scripts
- Applied ledger (`hotcb.applied.jsonl`): step-indexed, authoritative record of what actually happened
- Recipe export + replay: export a run's changes as a portable plan, then replay them deterministically in future runs
- Freeze modes: production lock, deterministic replay, replay-with-adjustments
Live mutation scheduling and recipe editor:

Rule-based Autopilot (AI mode in video above)

Mutation Recipe impact comparison:

Dynamic metric space manifold update:

```
pip install hotcb
```

With YAML config support:

```
pip install "hotcb[yaml]"
```

With dashboard:

```
pip install "hotcb[dashboard]"
```

With AI autopilot (optional; LLM- or rule-driven live optimization):

```
pip install "hotcb[dashboard,ai]"
```

Full extras (YAML + Lightning + HF + dashboard + AI + tune):

```
pip install "hotcb[all]"
```

With online tuning (Bayesian HPO):

```
pip install "hotcb[tune]"
```

Initialize a run directory:

```
hotcb --dir runs/exp1 init
```

PyTorch Lightning:

```python
from hotcb import HotKernel
from hotcb.adapters.lightning import HotCBLightning

kernel = HotKernel(run_dir="runs/exp1", debounce_steps=10)
trainer = pl.Trainer(callbacks=[HotCBLightning(kernel)])
trainer.fit(model, datamodule=dm)
```

Hugging Face Transformers:

```python
from hotcb import HotKernel
from hotcb.adapters.hf import HotCBHFCallback

kernel = HotKernel(run_dir="runs/exp1", debounce_steps=10)
trainer = Trainer(..., callbacks=[HotCBHFCallback(kernel)])
trainer.train()
```

Bare PyTorch:

```python
from hotcb import HotKernel

kernel = HotKernel(run_dir="runs/exp1", debounce_steps=10)

for step, batch in enumerate(dl):
    # ... forward, backward, optimizer step ...
    kernel.apply(
        env={
            "framework": "torch",
            "phase": "train",
            "step": step,
            "optimizer": optimizer,
            "mutable_state": model.mutable_state,
            "log": print,
        },
        events=["train_step_end"],
    )
```

From another terminal:

```
# Load a diagnostic callback
hotcb --dir runs/exp1 cb load feat_viz \
  --file /tmp/feat_viz.py \
  --symbol FeatureVizCallback \
  --enabled --init every=50

# Change learning rate
hotcb --dir runs/exp1 opt set_params lr=1e-4 weight_decay=0.02

# Change loss weights
hotcb --dir runs/exp1 loss set_params distill_w=0.2 depth_w=1.5

# Toggle a loss term off
hotcb --dir runs/exp1 loss set_params terms.aux_depth=false
```

Shortcuts:

```
# enable/disable default to the cb module
hotcb --dir runs/exp1 enable timing
hotcb --dir runs/exp1 disable timing

# set auto-routes based on key patterns
hotcb --dir runs/exp1 set lr=5e-5        # → opt
hotcb --dir runs/exp1 set distill_w=0.25 # → loss
```

Dashboard:

```
# Dashboard only (attach to existing run)
hotcb serve --dir runs/exp1

# Dashboard + synthetic training demo
hotcb demo
hotcb demo --golden  # multi-task demo with rich metrics

# Dashboard + autopilot (rule-based or AI-driven)
hotcb demo --autopilot suggest     # rule-based, proposals shown in UI
hotcb demo --autopilot ai_suggest  # LLM-driven, proposals shown in UI
hotcb demo --autopilot ai_auto     # LLM-driven, auto-applies with safety guards
```

Open http://localhost:8421 to see live charts, send commands, and monitor autopilot decisions.

Programmatic launch:

```python
from hotcb.launch import launch

handle = launch(
    train_fn="my_module:train",  # or a callable
    autopilot="ai_suggest",
    key_metric="val_loss",
    max_steps=1000,
    serve=True,  # start dashboard
)
handle.wait()                # block until done
handle.metrics()             # read latest metrics
handle.set_param(lr=0.0005)  # send live commands
handle.stop()                # stop early
```

Or from the CLI:

```
hotcb launch --config multitask --autopilot ai_suggest --key-metric val_loss --max-steps 1000
hotcb launch --config multitask --autopilot ai_suggest --max-time 300  # 5-minute run
hotcb launch --train-fn my_module:train --autopilot ai_auto --ai-budget 2.0
```

Online tuning:

```
# Register actuators in your training script (see docs/modules/hottune.md)
# Then enable from another terminal:
hotcb --dir runs/exp1 tune enable --mode active

# Or observe-only (no mutations, just proposals):
hotcb --dir runs/exp1 tune enable --mode observe

# Check tune status:
hotcb --dir runs/exp1 tune status
```

| File | Purpose |
|---|---|
| `hotcb.commands.jsonl` | What you asked for (incoming commands) |
| `hotcb.applied.jsonl` | What actually happened (step-indexed, authoritative) |
| `hotcb.recipe.jsonl` | Portable replay plan exported from the ledger |
| `hotcb.sources/` | Captured callback source files for deterministic replay |
| `hotcb.freeze.json` | Current freeze mode state |
| `hotcb.tune.mutations.jsonl` | Tune mutation log (if tune enabled) |
| `hotcb.tune.segments.jsonl` | Tune evaluation segments (if tune enabled) |
| `hotcb.tune.summary.json` | Tune run summary (if tune enabled) |
| `hotcb.metrics.jsonl` | Training metrics stream (step, metrics dict) |
| `hotcb.features.jsonl` | Activation capture data (optional) |
| `hotcb.run.json` | Run metadata (config, seed, timestamps) |
| `hotcb.ai.state.json` | AI autopilot state (key metric, run history, learnings) |
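All of these artifacts are append-only JSONL or plain JSON, so they can be inspected with a few lines of stdlib Python. A minimal sketch of parsing an applied ledger (the entry fields shown here are illustrative, not hotcb's exact schema):

```python
import json
import tempfile
from pathlib import Path

def read_jsonl(path):
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    return [
        json.loads(line)
        for line in Path(path).read_text().splitlines()
        if line.strip()
    ]

# Illustrative entries; the real hotcb.applied.jsonl fields may differ.
ledger = Path(tempfile.mkdtemp()) / "hotcb.applied.jsonl"
ledger.write_text(
    '{"step": 120, "module": "opt", "op": "set_params", "params": {"lr": 0.0001}}\n'
    '{"step": 300, "module": "loss", "op": "set_params", "params": {"distill_w": 0.2}}\n'
)

entries = read_jsonl(ledger)
opt_changes = [e for e in entries if e["module"] == "opt"]
print(len(entries), opt_changes[0]["step"])  # 2 120
```

Because each line is an independent JSON object, the same file can be tailed while training is still appending to it.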
| Mode | Behavior |
|---|---|
| `off` | Normal: all live commands accepted |
| `prod` | Ignore all external commands (production lock) |
| `replay` | Ignore external commands, replay recipe deterministically |
| `replay_adjusted` | Replay recipe with YAML overlay patches |
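The gating logic implied by the mode table can be sketched as follows (illustrative only, not hotcb's actual code; `source` distinguishes live commands from recipe entries):

```python
def accept_command(freeze_mode: str, source: str) -> bool:
    """Decide whether an incoming command is honored under a freeze mode.

    source is "external" for live CLI/dashboard commands,
    "recipe" for entries replayed from an exported recipe.
    """
    if freeze_mode == "off":
        return True   # everything accepted
    if freeze_mode == "prod":
        return False  # production lock: ignore all external commands
    # replay / replay_adjusted: only recipe-driven commands are honored
    return source == "recipe"

print(accept_command("off", "external"))     # True
print(accept_command("prod", "external"))    # False
print(accept_command("replay", "recipe"))    # True
print(accept_command("replay", "external"))  # False
```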
```
# Lock a production run
hotcb --dir runs/exp1 freeze --mode prod

# Replay a previous run exactly
hotcb --dir runs/exp1 recipe export --out runs/exp1/hotcb.recipe.jsonl
hotcb --dir runs/exp2 freeze --mode replay --recipe runs/exp1/hotcb.recipe.jsonl

# Replay with adjustments
hotcb --dir runs/exp2 freeze --mode replay_adjusted \
  --recipe runs/exp1/hotcb.recipe.jsonl \
  --adjust runs/exp2/hotcb.adjust.yaml

# Unlock
hotcb --dir runs/exp1 freeze --mode off
```

hotcb never monkeypatches the trainer. It mutates only what you pass via `env`.
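Because mutation happens only through objects you hand over, the mechanism is ordinary Python reference semantics. A tiny self-contained sketch (the dict stands in for an optimizer's param groups; the helper is illustrative, not hotcb's API):

```python
# Stand-in for an optimizer; hotcb mutates objects like this in place.
optimizer_state = {"param_groups": [{"lr": 1e-3, "weight_decay": 0.0}]}

env = {"optimizer": optimizer_state}

def apply_set_params(env, **params):
    """Illustrative: update every param group in place via the env reference."""
    for group in env["optimizer"]["param_groups"]:
        group.update(params)

apply_set_params(env, lr=1e-4, weight_decay=0.02)
print(optimizer_state["param_groups"][0]["lr"])  # 0.0001
```

The caller's object is updated because `env` holds a reference, not a copy; nothing on the trainer itself is patched.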
Pass `env["optimizer"]` (or `env["resolve_optimizer"]` as a callable). The Lightning and HF adapters handle this automatically.
Keep a mutable dict on your model:
```python
self.mutable_state = {
    "weights": {"distill": 0.2, "depth": 1.5},
    "terms": {"aux_depth": True, "aux_heatmap": False},
    "ramps": {"depth": {"type": "linear", "warmup_frac": 0.2, "end": 2.0}},
}
```

Set `env["mutable_state"] = self.mutable_state`; the adapters do this automatically if the attribute exists on your LightningModule or HF model.
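hotcb only mutates this dict; how the values are consumed is up to your loss code. One possible sketch (the term names, ramp shape, and `total_steps` here are illustrative):

```python
def combined_loss(raw_losses, state, step, total_steps):
    """Weight each enabled loss term using the live-mutable state dict."""
    total = 0.0
    for name, value in raw_losses.items():
        if not state.get("terms", {}).get(name, True):
            continue  # term toggled off live
        weight = state.get("weights", {}).get(name, 1.0)
        ramp = state.get("ramps", {}).get(name)
        if ramp and ramp.get("type") == "linear":
            # linearly ramp the weight toward ramp["end"] over the warmup window
            warmup_steps = max(1, int(ramp["warmup_frac"] * total_steps))
            frac = min(1.0, step / warmup_steps)
            weight = weight + frac * (ramp["end"] - weight)
        total += weight * value
    return total

state = {
    "weights": {"distill": 0.2, "depth": 1.5},
    "terms": {"distill": True, "depth": False},
    "ramps": {},
}
print(combined_loss({"distill": 2.0, "depth": 3.0}, state, step=0, total_steps=1000))
# 0.4 (depth is toggled off, distill weighted at 0.2)
```

When hotcb applies `loss set_params` at a safe point, the next step simply reads the updated dict; no re-entry into the training loop is needed.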
When you load a callback from a Python file, hotcb captures its source:
```
hotcb --dir runs/exp1 cb load feat_viz \
  --file /tmp/feat_viz.py --symbol FeatureVizCallback
```

hotcb computes the SHA-256, copies the file to `hotcb.sources/`, and records the version in the ledger. Replay mode uses the captured version, even if the original file has since changed.
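The same capture pattern can be sketched with the stdlib (illustrative; hotcb's actual file layout inside `hotcb.sources/` may differ):

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def capture_source(src_path, sources_dir):
    """Hash a callback source file and store an immutable, content-addressed copy."""
    data = Path(src_path).read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    dest = Path(sources_dir) / f"{digest[:12]}_{Path(src_path).name}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src_path, dest)
    return digest, dest

tmp = Path(tempfile.mkdtemp())
src = tmp / "feat_viz.py"
src.write_text("class FeatureVizCallback: ...\n")

digest, stored = capture_source(src, tmp / "hotcb.sources")

# Later edits to the original do not change the captured copy:
src.write_text("class FeatureVizCallback:  # changed\n")
print(stored.read_text().strip())  # class FeatureVizCallback: ...
```

Content-addressing is what makes replay deterministic: the recorded hash pins exactly which version of the callback ran.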
```
# Show current freeze mode and recent applied entries
hotcb --dir runs/exp1 status

# Validate a recipe file
hotcb --dir runs/exp1 recipe validate --recipe runs/exp1/hotcb.recipe.jsonl

# Inspect the ledger directly
tail -n 20 runs/exp1/hotcb.applied.jsonl
```

hotcb ships with ready-to-use callbacks:
| Callback | What it does |
|---|---|
| `HeartbeatCallback` | Periodic "I'm alive" log signal |
| `TimingCallback` | Step timing and throughput |
| `SystemStatsCallback` | CPU / RAM / GPU utilization |
| `TensorStatsCallback` | Tensor mean / std / min / max |
| `GradStatsCallback` | Gradient norm and stability |
| `AnomalyGuardCallback` | NaN/Inf detection with auto-disable |
| `JSONLLoggerCallback` | Structured append-only JSONL metrics log |
```
hotcb --dir runs/exp1 cb load heartbeat \
  --path hotcb.modules.cb.callbacks.heartbeat \
  --symbol HeartbeatCallback \
  --enabled --init every=100
```

Minimal contract, no base class required:
```python
class MyCallback:
    def __init__(self, id: str, every: int = 50):
        self.id = id
        self.every = every

    def set_params(self, **kwargs):
        if "every" in kwargs:
            self.every = int(kwargs["every"])

    def handle(self, event: str, env: dict):
        step = env.get("step", 0)
        if step % self.every == 0:
            env.get("log", print)(f"[{self.id}] step={step}")
```

hotcb includes a multi-level autopilot system:
| Mode | Behavior |
|---|---|
| `off` | No autopilot, manual control only |
| `suggest` | Rule-based: detects plateau/divergence/overfitting, proposes actions in dashboard |
| `auto` | Rule-based: detects and auto-applies corrective actions |
| `ai_suggest` | LLM-driven: reads metric trends + alerts, proposes actions for human review |
| `ai_auto` | LLM-driven: proposes and auto-applies with safety guards |
AI autopilot features:
- Compressed trend context: sends slope/volatility/direction summaries (not raw values) to the LLM for token efficiency
- Key metric system: primary optimization target, changeable by AI or human mid-run
- Multi-run memory: carries learnings across 2-3 runs via `hotcb.ai.state.json`
- Budget cap: configurable USD limit; falls back to rule-based when exhausted
- Safety guards: action bounds, cooldown between interventions, noop bias, auto-disable on divergence
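The kind of trend compression involved can be sketched as follows (illustrative; not hotcb's actual summarizer):

```python
def trend_summary(values):
    """Compress a metric window into slope / volatility / direction."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    var_x = sum((i - mean_x) ** 2 for i in range(n))
    # least-squares slope of the metric over the window
    slope = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values)) / var_x
    # RMS residual around the fitted line, as a volatility proxy
    residuals = [v - (mean_y + slope * (i - mean_x)) for i, v in enumerate(values)]
    volatility = (sum(r * r for r in residuals) / n) ** 0.5
    direction = "down" if slope < -1e-12 else "up" if slope > 1e-12 else "flat"
    return {"slope": round(slope, 6), "volatility": round(volatility, 6), "direction": direction}

print(trend_summary([2.0, 1.8, 1.7, 1.5]))
# {'slope': -0.16, 'volatility': 0.022361, 'direction': 'down'}
```

Sending three numbers per metric instead of the raw series keeps the LLM prompt small while preserving the information the autopilot actually acts on.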
Set the `HOTCB_AI_KEY` environment variable to your LLM API key (any OpenAI-compatible provider).
- No training loop mutation: hotcb never touches the trainer internals
- Safe-point updates only: changes are applied at batch/step boundaries
- Fail-safe: crashing callbacks and modules auto-disable, and training continues
- Full audit trail: every mutation is written to the applied ledger
- AI actions bounded: hard min/max on all parameter changes, minimum 10-step cooldown
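A sketch of the bounds-plus-cooldown idea (illustrative; hotcb's real guard logic is internal, and the class name here is hypothetical):

```python
class ActionGuard:
    """Reject out-of-bounds parameter changes and enforce a cooldown."""

    def __init__(self, bounds, cooldown_steps=10):
        self.bounds = bounds  # param -> (min, max)
        self.cooldown_steps = cooldown_steps
        self.last_applied_step = None

    def check(self, step, param, value):
        lo, hi = self.bounds.get(param, (float("-inf"), float("inf")))
        if not lo <= value <= hi:
            return False, "out of bounds"
        if (self.last_applied_step is not None
                and step - self.last_applied_step < self.cooldown_steps):
            return False, "cooldown"
        self.last_applied_step = step
        return True, "ok"

guard = ActionGuard({"lr": (1e-6, 1e-2)})
print(guard.check(100, "lr", 5e-4))  # (True, 'ok')
print(guard.check(105, "lr", 1e-4))  # (False, 'cooldown')
print(guard.check(200, "lr", 0.5))   # (False, 'out of bounds')
```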
- Integration Guide: minimal reference for wiring hotcb into any training project (also useful for AI agents)
- Concepts: HotKernel, ops, ledger, recipe, freeze modes, dashboard, autopilot
- CLI Reference: all commands, including serve, demo, launch
- Replay: recipe export, replay modes, overlays
- Formats: JSONL, JSON, and YAML schemas
- Modules: cb | opt | loss | tune
- Examples: Lightning | HF | Bare PyTorch | Custom callback | Adjust overlay
- CLI Walkthrough: full live-control session from init to replay
MIT License (see LICENSE file).

