Deterministic-equivalent evaluation can overstate policy quality — add a stage-wise rollout evaluation and a target-violation metric #42

Description

@DJLee1208

Summary

Evaluating a trained TS-DDR policy by solving the full-horizon deterministic
equivalent with the policy's targets can silently overstate its quality. The
coupled solve re-optimizes all stages jointly and compensates for targets that
cannot be followed stage by stage, which is exactly what deployment cannot do.
A stage-wise subproblem rollout evaluation (the package already has the machinery
via the stage-wise / simulate_multistage path) should be the headline metric,
together with a target-violation measure.
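
The compensation effect can be reproduced on a toy problem. The sketch below is not the package's API and does not use the Bolivia example: it assumes a hypothetical one-reservoir system with no inflow, four stages, and made-up prices and targets (`PRICES`, `TARGETS`, `RHO` and all function names are illustrative). The joint solve is done by brute-force enumeration rather than an optimizer, and the rollout solves each stage myopically against its own target.

```python
from itertools import product

PRICES = [1.0, 1.0, 1.0, 5.0]  # thermal price per stage (toy numbers)
TARGETS = [0, 1, 1, 0]         # storage targets from a hypothetical policy
DEMAND = 1                     # demand per stage
X0 = 1                         # initial storage; no inflow, so storage never rises
RHO = 10.0                     # target-slack penalty weight


def stage_cost(t, x_prev, x_next):
    """Return (operational cost, target slack) for moving x_prev -> x_next."""
    release = x_prev - x_next
    thermal = DEMAND - release  # demand not covered by release is bought
    return PRICES[t] * thermal, abs(x_next - TARGETS[t])


def feasible(x_prev, x_next):
    """Release must be non-negative and cannot exceed demand."""
    return 0 <= x_prev - x_next <= DEMAND


def penalized(t, x_prev, x_next):
    """Stage objective including the slack penalty."""
    cost, slack = stage_cost(t, x_prev, x_next)
    return cost + RHO * slack


def evaluate(path):
    """Sum operational cost (penalty excluded) and slack along x_1..x_T."""
    op, slack, x_prev = 0.0, 0, X0
    for t, x_next in enumerate(path):
        cost, s = stage_cost(t, x_prev, x_next)
        op, slack, x_prev = op + cost, slack + s, x_next
    return op, slack


def rollout():
    """Stage-wise execution: each stage sees only its own target."""
    path, x_prev = [], X0
    for t in range(len(PRICES)):
        options = [x for x in (0, 1) if feasible(x_prev, x)]
        x_next = min(options, key=lambda x: penalized(t, x_prev, x))
        path.append(x_next)
        x_prev = x_next
    return tuple(path)


def joint():
    """Coupled solve: enumerate every feasible path over the full horizon."""
    best_value, best_path = None, None
    for path in product((0, 1), repeat=len(PRICES)):
        prevs = (X0,) + path[:-1]
        if not all(feasible(a, b) for a, b in zip(prevs, path)):
            continue
        op, slack = evaluate(path)
        value = op + RHO * slack
        if best_value is None or value < best_value:
            best_value, best_path = value, path
    return best_path


print(rollout(), evaluate(rollout()))  # (0, 0, 0, 0) (7.0, 2)
print(joint(), evaluate(joint()))      # (1, 1, 1, 0) (3.0, 1)
```

In this toy, the rollout follows the first target exactly, empties the reservoir, and then cannot reach the two later targets. The joint solve deviates once at stage 1 so that the later targets become reachable, and it reports both a lower operational cost and a lower total slack than stage-wise execution achieves.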

Symptom

Same checkpoint (Bolivia ACP example, 96 stages, LSTM 2×128, constant ×1 auto-base
slack penalty), 40 fixed evaluation scenarios, objective excluding the target-slack
penalty term:

evaluation                                   cost               target-violation share
deterministic equivalent (full horizon)      306,556            0.046
stage-wise rollout (deployment semantics)    324,437 (+5.8%)    0.408

The deterministic-equivalent view reports a policy that looks near-optimal and
target-consistent; executed stage by stage, the same policy is 5.8% more expensive
and its target-violation share rises from 0.046 to 0.408.

Why it matters

Stage-wise execution is the deployment semantics of a target-trajectory policy.
Reporting only deterministic-equivalent numbers can make results look better than
the policy would actually perform in operation, and it hides target-followability
failures entirely.

Proposed fix

  1. Provide/document an evaluation helper based on the existing stage-wise rollout
    path, and recommend it as the default reported metric in the examples.
  2. Report two numbers per evaluation: (a) operational objective excluding the
    target-slack penalty term, and (b) a target-violation share (e.g., the realized
    slack-penalty value normalized by the objective), with the recommendation that
    policy comparisons are only trusted when the violation share is small (≤ ~0.05).
  3. Note the interaction with the penalty-annealing issue (#XX): with an annealed
    schedule the two evaluations agree at convergence to ~1e-7 relative — the rollout
    metric is the guard that detects when they don't.
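
The three points above could be combined in a helper along the following lines. This is a minimal sketch under stated assumptions, not the package's interface: `rollout_evaluate`, `EvalReport`, `evaluations_agree`, and the callable signatures are hypothetical, the violation share is assumed to be the slack penalty divided by the objective including that penalty, and the default `rtol` is an arbitrary guard threshold. The toy policy and stage solver at the end exist only to make the block runnable.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical signatures (assumptions, not an existing API):
# policy(stage, state, noise) -> target
Policy = Callable[[int, float, float], float]
# solve_stage(stage, state, noise, target)
#   -> (next_state, operational_cost, slack_penalty)
StageSolver = Callable[[int, float, float, float], Tuple[float, float, float]]


@dataclass(frozen=True)
class EvalReport:
    operational_cost: float  # mean over scenarios, slack penalty excluded
    slack_penalty: float     # mean realized target-slack penalty

    @property
    def violation_share(self) -> float:
        # Assumption: normalize by the objective including the penalty term.
        total = self.operational_cost + self.slack_penalty
        return self.slack_penalty / total if total > 0 else 0.0

    def trusted(self, max_share: float = 0.05) -> bool:
        return self.violation_share <= max_share


def rollout_evaluate(policy: Policy, solve_stage: StageSolver,
                     scenarios: Sequence[Sequence[float]],
                     x0: float) -> EvalReport:
    """Execute the policy stage by stage on each scenario and average."""
    op_total = pen_total = 0.0
    for noises in scenarios:
        state = x0
        for t, noise in enumerate(noises):
            target = policy(t, state, noise)
            state, op, pen = solve_stage(t, state, noise, target)
            op_total += op
            pen_total += pen
    n = len(scenarios)
    return EvalReport(op_total / n, pen_total / n)


def evaluations_agree(a: EvalReport, b: EvalReport,
                      rtol: float = 1e-6) -> bool:
    """Guard: do two evaluations report the same operational cost?"""
    scale = max(abs(a.operational_cost), abs(b.operational_cost), 1e-12)
    return abs(a.operational_cost - b.operational_cost) / scale <= rtol


# Toy usage with made-up dynamics: the state can rise by at most `noise`.
def toy_policy(t, state, noise):
    return 1.0


def toy_stage(t, state, noise, target):
    nxt = min(target, state + noise)
    return nxt, 2.0, 10.0 * (target - nxt)


report = rollout_evaluate(toy_policy, toy_stage,
                          [[0.5, 0.5], [1.0, 0.0]], x0=0.0)
print(report, report.violation_share, report.trusted())
```

Passing the stage solver and policy as callables keeps the sketch independent of any particular modeling layer; in practice they would wrap the existing stage-wise rollout path. The same `EvalReport` could be filled from a deterministic-equivalent solve so that `evaluations_agree` compares the two views directly.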
