Target-penalty annealing: a constant slack-penalty weight either hides target violations or degrades the learned policy #41

Description

@DJLee1208

Summary

Training with a constant target-slack penalty weight forces a lose-lose choice. A low
or moderate constant weight produces policies whose targets cannot actually be
followed stage by stage, and deterministic-equivalent evaluation hides this. A high
constant weight from epoch 1 makes the targets followable but leaves the policy about
21% more expensive. Annealing the penalty multiplier during training avoids both
problems. Independent of model or example details, we believe the training loop (or at
least the examples) should support, and default to, an annealed penalty schedule.

Setup

  • examples/HydroPowerModels, Bolivia ACP case, 96 stages, LSTM policy (2×128).
  • Penalty expressed as multiplier × auto_base, where auto_base = largest absolute
    objective coefficient of the formulation (≈ 6000 for this case); applied to both the
    L1 and L2 slack terms of the target constraints.
  • Each checkpoint evaluated two ways on 40 fixed scenarios (see companion issue):
    Eval-A = full-horizon deterministic equivalent with the policy's targets;
    Eval-B = stage-wise subproblem rollout (deployment semantics).
  • Metric: objective excluding the target-slack penalty term ("operational cost"),
    plus a target-violation share.
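
As a minimal sketch of the penalty calibration and cost metric described in this list: the helper names below (`auto_base`, `penalty_weight`, `operational_cost`) and the coefficient values are hypothetical and chosen for illustration only. They are not the package API, and the real formulation reads its coefficients from the optimization model rather than from a plain vector.

```julia
# Hypothetical helpers for illustration; not part of the package API.

# auto_base: largest absolute objective coefficient of the formulation.
auto_base(coeffs::AbstractVector{<:Real}) = maximum(abs, coeffs)

# Penalty weight = multiplier × auto_base. Per the setup, the same weight is
# applied to both the L1 and L2 slack terms of the target constraints.
penalty_weight(multiplier::Real, coeffs::AbstractVector{<:Real}) =
    multiplier * auto_base(coeffs)

# Operational cost: objective value with the target-slack penalty term removed.
operational_cost(total_objective::Real, slack_penalty::Real) =
    total_objective - slack_penalty

# Made-up coefficients whose largest magnitude matches the ≈ 6000 quoted above.
coeffs = [12.5, -6000.0, 300.0]
base = auto_base(coeffs)
weight = penalty_weight(0.1, coeffs)
```

Taking the absolute value before the maximum matters here: a large negative coefficient should still set the scale of the penalty.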

Symptom (single runs; det-eq training unless noted)

| Penalty schedule (× auto_base) | Eval-A cost (violation) | Eval-B cost (violation) | B − A |
| --- | --- | --- | --- |
| constant ×1 (:auto) | 306,556 (0.046) | 324,437 (0.408) | +5.8% |
| constant ×0.1 | 302,865 (0.005) | 320,449 (0.080) | +5.8% |
| constant ×30 | 370,464 (0.093) | 370,463 (0.093) | ≈0, but cost +20.7% |
| anneal ×0.1→1→10→30 | 306,912.35 (0.0008) | 306,912.33 (0.0008) | 0.02 (absolute) |
| constant ×1, stage-wise training | 356,186 (0.025) | 366,541 (0.010) | +2.9%, cost ~+19% |
| anneal, stage-wise training | 310,940.8 (0.0035) | 310,941.5 (0.0035) | 0.69 (absolute) |
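
The gap between the two evaluations can be re-derived from the cost columns. The sketch below is only arithmetic on the numbers reported above; `relative_gap` is a hypothetical helper name, and the row labels are shortened for display.

```julia
# Relative gap between the stage-wise rollout (Eval-B) and the det-eq view (Eval-A).
relative_gap(a, b) = (b - a) / a

# (schedule, Eval-A cost, Eval-B cost), copied from the results above.
rows = [
    ("constant ×1",             306_556.0,  324_437.0),
    ("constant ×0.1",           302_865.0,  320_449.0),
    ("anneal, det-eq training", 306_912.35, 306_912.33),
    ("anneal, stage-wise",      310_940.8,  310_941.5),
]

# Print one relative gap per row.
for (name, a, b) in rows
    println(rpad(name, 26), relative_gap(a, b))
end
```

The constant-weight rows come out near 0.058, while the annealed rows are many orders of magnitude smaller, which is why the last column switches from percentages to absolute differences.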

With a constant moderate weight (×1), the policy looks excellent under the
deterministic-equivalent view but costs 5.8% more and has a target-violation share of
~0.41 when executed stage by stage. With a constant ×30 the two views agree, but the
policy is ~21% more expensive than the annealed one. Annealing ×0.1→×1→×10→×30 (we
used phase lengths proportional to 2/2/4/16 over 24 epochs; a compressed 2/2/2/2 over
8 epochs behaves the same) yields policies that are both consistent (A and B agree to
~1e-7 relative for det-eq training and ~2e-6 for stage-wise training) and at the good
cost level.
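
To make the ladder concrete, here is a minimal sketch that expands phase multipliers and phase lengths into one multiplier per epoch. `anneal_ladder` is a hypothetical helper written for this issue, not an existing function in the package.

```julia
# Expand phase multipliers and phase lengths (in epochs) into one multiplier
# per epoch. Hypothetical helper, for illustration only.
function anneal_ladder(multipliers::AbstractVector{<:Real},
                       lengths::AbstractVector{<:Integer})
    length(multipliers) == length(lengths) ||
        throw(ArgumentError("need exactly one phase length per multiplier"))
    # Repeat each multiplier for the length of its phase, then concatenate.
    return reduce(vcat, [fill(float(m), n) for (m, n) in zip(multipliers, lengths)])
end

multipliers = [0.1, 1, 10, 30]

full       = anneal_ladder(multipliers, [2, 2, 4, 16])  # 24 epochs
compressed = anneal_ladder(multipliers, [2, 2, 2, 2])   # 8 epochs
```

Both variants visit the same four multipliers in the same order; only the time spent in each phase differs.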

Why it happens

The full-horizon coupled solve can re-time and absorb unfollowable targets via slack,
so under a weak penalty the training signal never forces stage-wise-followable
targets; under a strong penalty from scratch, followability dominates the loss before
the policy has learned anything about cost.

Proposed fix

Add a penalty-schedule option to train_multistage (and train_multiple_shooting) —
e.g., a vector of (epoch_range, multiplier) applied to an auto-calibrated base —
defaulting to the annealed ladder above, with constant weights still available
explicitly.
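
One possible shape for such an option, as a sketch only: the schedule format follows the (epoch_range, multiplier) idea above, but `default_schedule`, `penalty_at`, and `constant_schedule` are hypothetical names. How the resulting weight would be passed into train_multistage is deliberately left out, since that interface does not exist yet.

```julia
# Hypothetical schedule format: a vector of (epoch_range, multiplier) pairs.
# This is the 2/2/4/16 ladder over 24 epochs.
default_schedule = [(1:2, 0.1), (3:4, 1.0), (5:8, 10.0), (9:24, 30.0)]

# Penalty weight (multiplier × base) for a given epoch. Epochs past the last
# range keep the final multiplier; epochs not covered otherwise are an error.
function penalty_at(schedule, epoch::Integer, base::Real)
    for (epochs, multiplier) in schedule
        epoch in epochs && return multiplier * base
    end
    final_epochs, final_multiplier = last(schedule)
    epoch > last(final_epochs) ||
        throw(ArgumentError("epoch $epoch is not covered by the schedule"))
    return final_multiplier * base
end

# A constant weight stays available as a one-phase schedule.
constant_schedule(multiplier::Real, n_epochs::Integer) = [(1:n_epochs, multiplier)]

base = 6000.0  # approximate auto_base reported for this case
weights = [penalty_at(default_schedule, epoch, base) for epoch in 1:24]
```

Expressing the constant case as a one-phase schedule keeps a single code path in the training loop, so the default can change without removing the old behaviour.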

Metadata


Labels: documentation, enhancement, good first issue
