Summary
Evaluating a trained TS-DDR policy by solving the full-horizon deterministic
equivalent with the policy's targets can silently overstate its quality. The coupled
solve re-optimizes all stages jointly and compensates for targets that are not
followable stage by stage — which is exactly what deployment cannot do. A
stage-wise subproblem rollout evaluation (the package already has the machinery
via the stage-wise / simulate_multistage path) should be the headline metric,
together with a target-violation measure.
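One way such a gap can arise is sketched below on a toy two-stage problem. This is not the Bolivia case and does not use the package's API. Every name and number is hypothetical: a single reservoir, a thermal cost that rises in stage 2, storage targets of `[4, 0]`, and a weak slack penalty `RHO`. The stage-wise rollout solves each stage seeing only its own cost and its own target, while the deterministic equivalent picks both releases jointly. Both use brute-force enumeration over integer releases, which is an assumption made for brevity.

```python
from itertools import product

COSTS = [1.0, 3.0]   # thermal cost per unit in stage 1 and stage 2 (hypothetical)
DEMAND = 4           # demand per stage
X0 = 4               # initial storage, no inflows
TARGETS = [4, 0]     # hypothetical policy targets for end-of-stage storage
RHO = 0.5            # slack penalty per unit of target deviation (weak on purpose)


def stage_cost(t, release):
    """Thermal cost of covering the demand not met by hydro release."""
    return COSTS[t] * (DEMAND - release)


def rollout():
    """Stage-wise execution: each stage sees only its own cost and target."""
    x, op, pen = X0, 0.0, 0.0
    for t in range(2):
        feasible = range(min(x, DEMAND) + 1)
        u = min(feasible,
                key=lambda r: stage_cost(t, r) + RHO * abs((x - r) - TARGETS[t]))
        op += stage_cost(t, u)
        pen += RHO * abs((x - u) - TARGETS[t])
        x -= u
    return op, pen


def deterministic_equivalent():
    """Coupled solve: both releases are chosen jointly over the full horizon."""
    best = None
    for u1, u2 in product(range(DEMAND + 1), repeat=2):
        if u1 > X0 or u2 > X0 - u1:
            continue  # cannot release more water than is stored
        x1, x2 = X0 - u1, X0 - u1 - u2
        op = stage_cost(0, u1) + stage_cost(1, u2)
        pen = RHO * (abs(x1 - TARGETS[0]) + abs(x2 - TARGETS[1]))
        if best is None or op + pen < best[0] + best[1]:
            best = (op, pen)
    return best


if __name__ == "__main__":
    print("rollout (operational cost, penalty):", rollout())
    print("deterministic equivalent (operational cost, penalty):",
          deterministic_equivalent())
```

In this toy the coupled solve follows the targets exactly (operational cost 4.0, penalty 0.0), while the stage-wise rollout spends the water in stage 1 because the penalty is cheaper than thermal generation, and ends at operational cost 12.0 with penalty 2.0. The targets are the same in both runs; only the evaluation semantics differ.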
Symptom
Same checkpoint (Bolivia ACP example, 96 stages, LSTM 2×128, constant ×1 auto-base
slack penalty), 40 fixed evaluation scenarios, objective excluding the target-slack
penalty term:
| evaluation | cost | target-violation share |
|---|---|---|
| deterministic equivalent (full horizon) | 306,556 | 0.046 |
| stage-wise rollout (deployment semantics) | 324,437 (+5.8%) | 0.408 |
The deterministic-equivalent view reports a policy that looks near-optimal and
target-consistent; executed stage by stage, the same policy is 5.8% more expensive
and its target-violation share rises from 0.046 to 0.408.
Why it matters
Stage-wise execution is the deployment semantics of a target-trajectory policy.
Reporting only deterministic-equivalent numbers can make results look better than the
policy actually performs in operation, and it hides target-followability failures entirely.
Proposed fix
- Provide/document an evaluation helper based on the existing stage-wise rollout
path, and recommend it as the default reported metric in the examples.
- Report two numbers per evaluation: (a) operational objective excluding the
target-slack penalty term, and (b) a target-violation share (e.g., the realized
slack-penalty value normalized by the objective), with the recommendation that
policy comparisons are only trusted when the violation share is small (≤ ~0.05).
- Note the interaction with the penalty-annealing issue (#XX): with an annealed
schedule the two evaluations agree at convergence to ~1e-7 relative — the rollout
metric is the guard that detects when they don't.
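A minimal sketch of the two-number report proposed above, assuming the stage-wise rollout already yields, per scenario, the operational objective and the realized slack penalty. `ScenarioResult` and `summarize` are hypothetical names, not part of the package. Normalizing the penalty by the operational objective (rather than by the total including the penalty) is one possible reading of "normalized by the objective" and is an assumption here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScenarioResult:
    """Per-scenario output of a stage-wise rollout (hypothetical container)."""
    operational_cost: float  # objective excluding the target-slack penalty term
    slack_penalty: float     # realized target-slack penalty value


def summarize(results, max_share=0.05):
    """Return the mean operational cost and the target-violation share.

    The share is the mean slack penalty divided by the mean operational cost
    (an assumed normalization). `trusted` flags whether the share is small
    enough for policy comparisons, using the suggested ~0.05 threshold.
    """
    if not results:
        raise ValueError("need at least one scenario result")
    cost = mean(r.operational_cost for r in results)
    penalty = mean(r.slack_penalty for r in results)
    if cost <= 0:
        raise ValueError("operational cost must be positive to normalize")
    share = penalty / cost
    return {"cost": cost, "violation_share": share, "trusted": share <= max_share}


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    demo = [ScenarioResult(100.0, 4.0), ScenarioResult(300.0, 4.0)]
    print(summarize(demo))
```

Reporting the `trusted` flag next to the cost keeps a low cost from being read in isolation when the policy is not actually following its targets.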