Deterministic-equivalent evaluation can overstate policy quality — add a stage-wise rollout evaluation and a target-violation metric

## Summary

Evaluating a trained TS-DDR policy by solving the **full-horizon deterministic
equivalent** with the policy's targets can silently overstate its quality. The coupled
solve optimizes all stages jointly, so it can compensate for targets that cannot be
followed stage by stage, which is exactly what deployment cannot do. A
**stage-wise subproblem rollout** evaluation (the package already has the machinery
via the stage-wise / `simulate_multistage` path) should be the headline metric,
reported together with a target-violation measure.
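A toy illustration of the mechanism, not taken from the package or the Bolivia example: one reservoir, two stages, integer releases, all numbers hypothetical. The sketch assumes that the stage-wise subproblem is myopic (it sees only the current stage cost plus the slack penalty) and that the slack penalty is weak relative to thermal cost. Both are assumptions made for illustration. Under them, the joint solve keeps the water for the expensive second stage, while the rollout spends it immediately and misses the first target.

```python
from itertools import product

# Hypothetical toy data: one reservoir, two stages, no inflows.
DEMAND = 1                   # energy demand per stage
THERMAL_COST = [2.0, 10.0]   # cost per unit of thermal generation, by stage
TARGETS = [1, 0]             # policy's end-of-stage storage targets
PENALTY = 1.0                # slack penalty per unit of target deviation
X0 = 1                       # initial storage


def stage(x_prev, release, t):
    """Return (next_storage, operational_cost, slack_penalty) for stage t."""
    x_next = x_prev - release
    cost = THERMAL_COST[t] * (DEMAND - release)   # thermal covers the rest
    penalty = PENALTY * abs(x_next - TARGETS[t])
    return x_next, cost, penalty


def deterministic_equivalent():
    """Choose all releases jointly to minimize total cost plus penalty."""
    best = None
    for releases in product(range(DEMAND + 1), repeat=len(TARGETS)):
        x, cost, pen, feasible = X0, 0.0, 0.0, True
        for t, u in enumerate(releases):
            if u > x:                 # cannot release more than is stored
                feasible = False
                break
            x, c, p = stage(x, u, t)
            cost += c
            pen += p
        if feasible and (best is None or cost + pen < sum(best)):
            best = (cost, pen)
    return best


def stagewise_rollout():
    """Solve one stage at a time, seeing only that stage's cost and penalty."""
    x, cost, pen = X0, 0.0, 0.0
    for t in range(len(TARGETS)):
        options = range(min(x, DEMAND) + 1)
        u = min(options, key=lambda r: sum(stage(x, r, t)[1:]))
        x, c, p = stage(x, u, t)
        cost += c
        pen += p
    return cost, pen
```

With these numbers, `deterministic_equivalent()` returns `(2.0, 0.0)` and `stagewise_rollout()` returns `(10.0, 1.0)`: same targets and same penalty, but a different operational cost and a nonzero violation once the stages are executed one at a time.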

## Symptom

Same checkpoint (Bolivia ACP example, 96 stages, LSTM 2×128, constant ×1 auto-base
slack penalty), 40 fixed evaluation scenarios, objective excluding the target-slack
penalty term:

| evaluation | cost | target-violation share |
|---|---|---|
| deterministic equivalent (full horizon) | 306,556 | 0.046 |
| stage-wise rollout (deployment semantics) | **324,437 (+5.8%)** | **0.408** |

The deterministic-equivalent view reports a policy that looks near-optimal and
target-consistent. Executed stage by stage, the same policy costs 5.8% more and its
target-violation share rises from 0.046 to 0.408.
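For reference, the +5.8% figure is the relative gap between the two cost entries in the table. A minimal check (the function name is illustrative, not part of the package):

```python
def relative_gap(rollout_cost, reference_cost):
    """Relative increase of the rollout cost over the reference cost."""
    return (rollout_cost - reference_cost) / reference_cost


# Costs copied from the table above.
gap = relative_gap(324_437, 306_556)
print(f"{gap:+.1%}")  # +5.8%
```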

## Why it matters

Stage-wise execution is the deployment semantics of a target-trajectory policy.
Reporting only deterministic-equivalent numbers can make a policy look better than it
would perform in operation, and it hides target-followability failures entirely.

## Proposed fix

1. Provide/document an evaluation helper based on the existing stage-wise rollout
   path, and recommend it as the default reported metric in the examples.
2. Report two numbers per evaluation: (a) the operational objective **excluding** the
   target-slack penalty term, and (b) a target-violation share (e.g., the realized
   slack-penalty value normalized by the objective), with the recommendation that
   policy comparisons be trusted only when the violation share is small (≤ ~0.05).
3. Note the interaction with the penalty-annealing issue (#XX): with an annealed
   schedule, the two evaluations agree at convergence to ~1e-7 relative. The rollout
   metric is the guard that detects when they do not.
