Target-penalty annealing: a constant slack-penalty weight either hides target violations or degrades the learned policy

## Summary

Training with a **constant** target-slack penalty weight puts the user in a lose-lose
situation: a low or moderate constant weight produces policies whose targets are not
actually followable stage by stage (a defect hidden by deterministic-equivalent
evaluation), while a high constant weight from epoch 1 keeps the two evaluations
consistent but leaves the policy about 21% costlier. **Annealing the penalty
multiplier during training fixes both.** Independent of model and example details, we
believe the training loop (or at least the examples) should support and default to an
annealed penalty schedule.

## Setup

- `examples/HydroPowerModels`, Bolivia ACP case, 96 stages, LSTM policy (2×128).
- Penalty expressed as `multiplier × auto_base`, where `auto_base` = largest absolute
  objective coefficient of the formulation (≈ 6000 for this case); applied to both the
  L1 and L2 slack terms of the target constraints.
- Each checkpoint evaluated two ways on 40 fixed scenarios (see companion issue):
  **Eval-A** = full-horizon deterministic equivalent with the policy's targets;
  **Eval-B** = stage-wise subproblem rollout (deployment semantics).
- Metric: objective **excluding** the target-slack penalty term ("operational cost"),
  plus a target-violation share.
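
As a minimal sketch of the quantities defined above (not the package's API): `auto_base`, `penalty_weight`, and `operational_cost` are hypothetical helper names introduced only for illustration, and the coefficient vector is made up so that its largest absolute entry matches the ≈ 6000 quoted for this case.

```julia
# Hypothetical helpers for illustration; none of these names come from the package.

# auto_base: largest absolute objective coefficient of the formulation.
auto_base(coeffs::AbstractVector{<:Real}) = maximum(abs, coeffs)

# Penalty weight = multiplier × auto_base; the same weight is assumed to be
# applied to both the L1 and L2 slack terms of the target constraints.
penalty_weight(multiplier::Real, coeffs::AbstractVector{<:Real}) =
    multiplier * auto_base(coeffs)

# Operational cost: total objective with the target-slack penalty term removed.
operational_cost(objective::Real, slack_penalty::Real) = objective - slack_penalty

coeffs = [12.5, -6000.0, 340.0]   # made-up coefficients, largest magnitude 6000
for m in (0.1, 1.0, 10.0, 30.0)
    println("multiplier ×", m, " -> weight ", penalty_weight(m, coeffs))
end
```

Reporting `operational_cost` rather than the raw objective keeps the numbers comparable across schedules, since the penalty term itself scales with the multiplier.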

## Symptom (single runs; det-eq training unless noted)

| penalty schedule (× auto_base) | Eval-A cost (violation share) | Eval-B cost (violation share) | B − A (relative unless marked absolute) |
|---|---|---|---|
| constant ×1 (`:auto`) | 306,556 (0.046) | **324,437 (0.408)** | **+5.8%** |
| constant ×0.1 | 302,865 (0.005) | 320,449 (0.080) | +5.8% |
| constant ×30 | 370,464 (0.093) | 370,463 (0.093) | ≈0, but cost **+20.7%** |
| **anneal ×0.1→1→10→30** | **306,912.35 (0.0008)** | **306,912.33 (0.0008)** | **0.02 (absolute)** |
| constant ×1, stage-wise training | 356,186 (0.025) | 366,541 (0.010) | +2.9%, cost ~+19% |
| **anneal, stage-wise training** | **310,940.8 (0.0035)** | **310,941.5 (0.0035)** | **0.69 (absolute)** |

With a constant moderate weight (×1), the policy looks excellent under the
deterministic-equivalent view (Eval-A) but costs 5.8% more and shows a target-violation
share of ~0.41 when executed stage by stage (Eval-B). With a constant ×30 the two views
agree, but the operational cost is ~21% above the annealed result. Annealing
×0.1→×1→×10→×30 (we used phase lengths proportional to 2/2/4/16 over 24 epochs; a
compressed 2/2/2/2 over 8 epochs behaves the same) yields policies that are
simultaneously consistent (A and B agree to roughly 1e-7 relative for det-eq training
and about 2e-6 for stage-wise training) and at the good cost level, for **both** det-eq
and stage-wise training.
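
The gaps quoted for the table can be rechecked with a few lines of arithmetic. The sketch below only recomputes `(B − A) / A` from the costs listed in the table; `relative_gap` is a hypothetical helper name, and the row labels are shortened.

```julia
# Relative gap between the stage-wise rollout (B) and the det-eq evaluation (A).
relative_gap(a::Real, b::Real) = (b - a) / a

# (label, Eval-A cost, Eval-B cost), copied from the table above.
rows = [
    ("constant x1",            306_556.0,  324_437.0),
    ("constant x0.1",          302_865.0,  320_449.0),
    ("constant x30",           370_464.0,  370_463.0),
    ("anneal",                 306_912.35, 306_912.33),
    ("constant x1, stagewise", 356_186.0,  366_541.0),
    ("anneal, stagewise",      310_940.8,  310_941.5),
]

for (label, a, b) in rows
    println(rpad(label, 26), "relative gap = ", relative_gap(a, b))
end

# Cost of constant x30 relative to the annealed det-eq result (Eval-A costs).
println("x30 vs anneal: ", relative_gap(306_912.35, 370_464.0))
```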

## Why it happens

The full-horizon coupled solve can re-time and absorb unfollowable targets via slack,
so under a weak penalty the training signal never forces stage-wise-followable
targets; under a strong penalty from scratch, followability dominates the loss before
the policy has learned anything about cost.

## Proposed fix

Add a penalty-schedule option to `train_multistage` (and `train_multiple_shooting`) —
e.g., a vector of `(epoch_range, multiplier)` applied to an auto-calibrated base —
defaulting to the annealed ladder above, with constant weights still available
explicitly.
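
One possible shape for such an option, as a sketch only: the schedule is a vector of `(epoch_range, multiplier)` pairs, and a lookup returns the multiplier for the current epoch. `ANNEAL_LADDER`, `multiplier_at`, and `penalty_at` are hypothetical names, not existing arguments or functions of `train_multistage`; the epoch ranges assume the 2/2/4/16 split over 24 epochs described above, and holding the last multiplier past the final phase is an assumption.

```julia
# Hypothetical schedule representation: (epoch_range, multiplier) pairs.
# Ranges follow the 2/2/4/16 phase lengths over 24 epochs.
const ANNEAL_LADDER = [(1:2, 0.1), (3:4, 1.0), (5:8, 10.0), (9:24, 30.0)]

# A constant weight stays expressible as a single-phase schedule.
const CONSTANT_X1 = [(1:typemax(Int), 1.0)]

# Return the multiplier for `epoch`; past the last phase, hold the final value.
function multiplier_at(schedule, epoch::Integer)
    first_epoch = first(first(schedule)[1])
    epoch < first_epoch &&
        throw(ArgumentError("epoch $epoch precedes the schedule start $first_epoch"))
    for (epochs, multiplier) in schedule
        epoch in epochs && return multiplier
    end
    return last(schedule)[2]
end

# Effective penalty weight for an epoch, given an auto-calibrated base.
penalty_at(schedule, epoch::Integer, base::Real) = multiplier_at(schedule, epoch) * base

for epoch in (1, 3, 5, 9, 24)
    println("epoch ", epoch, ": multiplier ", multiplier_at(ANNEAL_LADDER, epoch))
end
```

Keeping the schedule as plain data means a training loop would only need to call something like `penalty_at` once per epoch and pass the result to whatever builds the slack penalty.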
