Summary
Training with a constant target-slack penalty weight puts the user in a lose-lose
situation: a low or moderate constant weight produces policies whose targets are not
actually followable stage by stage (a problem hidden by deterministic-equivalent
evaluation), while a high constant weight from epoch 1 prevents the policy from
reaching a good cost level. Annealing the penalty multiplier during training fixes
both. Independent of model and example details, we believe the training loop (or at
least the examples) should support an annealed penalty schedule and use it by default.
Setup
- examples/HydroPowerModels, Bolivia ACP case, 96 stages, LSTM policy (2×128).
- Penalty expressed as multiplier × auto_base, where auto_base = largest absolute
  objective coefficient of the formulation (≈ 6000 for this case); applied to both the
  L1 and L2 slack terms of the target constraints.
- Each checkpoint evaluated two ways on 40 fixed scenarios (see companion issue):
  Eval-A = full-horizon deterministic equivalent with the policy's targets;
  Eval-B = stage-wise subproblem rollout (deployment semantics).
- Metric: objective excluding the target-slack penalty term ("operational cost"),
  plus a target-violation share.
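
The penalty calibration described in the list above can be sketched as follows. This is a minimal illustration, not the package's implementation: `auto_base`, `penalty_weight`, and `slack_penalty` are hypothetical names, the coefficients are made up, and the exact form of the L2 term (squared here) is an assumption.

```julia
# Hypothetical helper: the auto-calibrated base is the largest absolute
# objective coefficient of the formulation.
auto_base(obj_coeffs::AbstractVector{<:Real}) = maximum(abs, obj_coeffs)

# Penalty weight = multiplier × auto_base.
penalty_weight(multiplier::Real, base::Real) = multiplier * base

# Assumed penalty form: the same weight on the L1 term and on a squared L2 term.
slack_penalty(slack::AbstractVector{<:Real}, w::Real) =
    w * (sum(abs, slack) + sum(abs2, slack))

coeffs = [-6000.0, 12.5, 300.0]   # made-up objective coefficients
base = auto_base(coeffs)          # 6000.0
w = penalty_weight(0.1, base)     # weight for the ×0.1 setting
println(slack_penalty([0.5, -0.25], w))
```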
Symptom (single runs; det-eq training unless noted)
| penalty schedule (× auto_base) | Eval-A cost (violation) | Eval-B cost (violation) | B − A |
|---|---|---|---|
| constant ×1 (:auto) | 306,556 (0.046) | 324,437 (0.408) | +5.8% |
| constant ×0.1 | 302,865 (0.005) | 320,449 (0.080) | +5.8% |
| constant ×30 | 370,464 (0.093) | 370,463 (0.093) | ≈0, but cost +20.7% |
| anneal ×0.1→1→10→30 | 306,912.35 (0.0008) | 306,912.33 (0.0008) | 0.02 (absolute) |
| constant ×1, stage-wise training | 356,186 (0.025) | 366,541 (0.010) | +2.9%, cost ~+19% |
| anneal, stage-wise training | 310,940.8 (0.0035) | 310,941.5 (0.0035) | 0.69 (absolute) |
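
The B − A figures can be checked directly from the two cost columns. The snippet below is only a small arithmetic sketch; `relative_gap` and `pct` are hypothetical helper names, and the numbers are copied from the table.

```julia
# Relative gap between the stage-wise rollout cost (Eval-B) and the
# deterministic-equivalent cost (Eval-A).
relative_gap(cost_a, cost_b) = (cost_b - cost_a) / cost_a

# Express a ratio as a percentage rounded to one decimal.
pct(x) = round(100 * x; digits = 1)

gap_const1 = pct(relative_gap(306_556, 324_437))      # constant ×1
gap_const01 = pct(relative_gap(302_865, 320_449))     # constant ×0.1
gap_anneal = relative_gap(306_912.35, 306_912.33)     # annealed, det-eq training

# Cost of the constant ×30 run relative to the annealed run (both Eval-A).
excess_const30 = pct(relative_gap(306_912.35, 370_464))

println((gap_const1, gap_const01, gap_anneal, excess_const30))
```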
With a constant moderate weight (×1), the policy looks excellent under the
deterministic-equivalent view but costs 5.8% more and has a target-violation share of
~41% when executed stage by stage. With a constant ×30 the two views agree, but the
cost is ~21% higher than the annealed run achieves. Annealing ×0.1→×1→×10→×30 (we used
phase lengths proportional to 2/2/4/16 over 24 epochs; a compressed 2/2/2/2 over 8
epochs behaves the same) yields policies that are both consistent (A and B agree to
~1e-7 relative for det-eq training and ~2e-6 for stage-wise training) and close to the
best Eval-A cost levels.
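
For concreteness, the two ladders mentioned above can be written out as per-epoch multipliers. This is a sketch under assumptions: `build_schedule` and `expand` are hypothetical helpers, and the (epoch_range, multiplier) pair layout is one possible representation, not an existing API.

```julia
# Hypothetical helper: turn per-phase multipliers and phase lengths (in epochs)
# into (epoch_range, multiplier) pairs.
function build_schedule(multipliers::AbstractVector{<:Real},
                        phase_lengths::AbstractVector{<:Integer})
    length(multipliers) == length(phase_lengths) ||
        throw(ArgumentError("need one phase length per multiplier"))
    stops = cumsum(phase_lengths)
    starts = stops .- phase_lengths .+ 1
    return [(s:e, m) for (s, e, m) in zip(starts, stops, multipliers)]
end

# Expand a schedule into one multiplier per epoch.
expand(schedule) = [m for (epochs, m) in schedule for _ in epochs]

ladder = [0.1, 1.0, 10.0, 30.0]
full = build_schedule(ladder, [2, 2, 4, 16])   # 24-epoch ladder
short = build_schedule(ladder, [2, 2, 2, 2])   # compressed 8-epoch ladder

println(full)
println(expand(short))
```

In the 24-epoch ladder, two thirds of the epochs (16 of 24) run at the final ×30 multiplier; the compressed ladder spends only a quarter of its epochs there.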
Why it happens
The full-horizon coupled solve can re-time and absorb unfollowable targets via slack,
so under a weak penalty the training signal never forces stage-wise-followable
targets; under a strong penalty from scratch, followability dominates the loss before
the policy has learned anything about cost.
Proposed fix
Add a penalty-schedule option to train_multistage (and train_multiple_shooting) —
e.g., a vector of (epoch_range, multiplier) applied to an auto-calibrated base —
defaulting to the annealed ladder above, with constant weights still available
explicitly.
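
One possible shape for such an option is sketched below. Everything here is hypothetical: `train_with_schedule`, `resolve_multiplier`, `default_ladder`, and the `penalty_schedule` keyword are illustrative names, and the wrapper is a stand-in rather than the actual `train_multistage` signature. The per-epoch training step is passed in as a callback so the sketch stays self-contained.

```julia
# Assumed default: the annealed ladder from the experiments,
# stored as (epoch_range, multiplier) pairs.
default_ladder() = [(1:2, 0.1), (3:4, 1.0), (5:8, 10.0), (9:24, 30.0)]

# A plain number means a constant multiplier for every epoch.
resolve_multiplier(m::Real, epoch::Integer) = m

# A schedule is searched for the phase containing the epoch; epochs not
# covered by any phase fall back to the final multiplier.
function resolve_multiplier(schedule::AbstractVector, epoch::Integer)
    for (epochs, m) in schedule
        epoch in epochs && return m
    end
    return last(schedule)[2]
end

# Hypothetical training wrapper, NOT the real train_multistage signature.
# `train_epoch!(epoch, weight)` stands in for one epoch of training.
function train_with_schedule(train_epoch!; num_epochs::Integer, auto_base::Real,
                             penalty_schedule = default_ladder())
    for epoch in 1:num_epochs
        weight = resolve_multiplier(penalty_schedule, epoch) * auto_base
        train_epoch!(epoch, weight)
    end
    return nothing
end

# Default annealed ladder: print the weight used in each epoch.
train_with_schedule((epoch, w) -> println("epoch $epoch: weight $w");
                    num_epochs = 24, auto_base = 6000.0)

# Constant weight, requested explicitly.
train_with_schedule((epoch, w) -> nothing;
                    num_epochs = 3, auto_base = 6000.0, penalty_schedule = 1.0)
```

Dispatching on the type of `penalty_schedule` keeps a constant weight a one-argument change, which matches the request that constant weights stay available explicitly.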