Summary
foreman retro over the entire campaign drafted 0 proposals — not because runs were clean, but because the outcome taxonomy labelled 38 of 43 runs legacy/blank, including every one of the 21 killed_turns failures. Retro only clusters escalated / evaluator_bounce / human_rejected, so it saw just two single-instance escalation clusters and reasonably declined to patch. The flywheel is blind to the dominant failure.
Evidence (dogfood campaign)
- Outcome distribution:
success_first_try 5, blank/legacy (unlabelled) 38 (88%).
killed_turns (21 runs, 49% of all runs) is never assigned a failure outcome.
foreman retro log: "No patch proposals were drafted." It only clustered [1×] escalated:(unspecified) and [1×] escalated:handoff.
Why it matters
This is the central Phase-2 finding. The flywheel can only improve what it can measure, and it isn't measuring the dominant failure. An operator would conclude "the system is fine, retro found nothing" — which is false.
Proposed fix
- (a) Stamp an outcome on every terminal run, including phase agents (planner / grill / slicer) and the kill reasons (
killed_turns / killed_cost / killed_timeout / error). No more blank/legacy.
- (b) Make
retro.cluster_failures cluster on terminal_reason kills, not just escalations.
- (c) Treat a high
killed_turns rate as a first-class proposal trigger (it would have proposed the turn-budget fix directly).
Acceptance criteria
Source: dogfood/ITERATION_REPORT.md BLOCKER-2; dogfood/FLYWHEEL.md; dogfood/METRICS.md.
Summary
foreman retroover the entire campaign drafted 0 proposals — not because runs were clean, but because the outcome taxonomy labelled 38 of 43 runslegacy/blank, including every one of the 21killed_turnsfailures. Retro only clustersescalated/evaluator_bounce/human_rejected, so it saw just two single-instance escalation clusters and reasonably declined to patch. The flywheel is blind to the dominant failure.Evidence (dogfood campaign)
success_first_try5, blank/legacy(unlabelled) 38 (88%).killed_turns(21 runs, 49% of all runs) is never assigned a failure outcome.foreman retrolog: "No patch proposals were drafted." It only clustered[1×] escalated:(unspecified)and[1×] escalated:handoff.Why it matters
This is the central Phase-2 finding. The flywheel can only improve what it can measure, and it isn't measuring the dominant failure. An operator would conclude "the system is fine, retro found nothing" — which is false.
Proposed fix
killed_turns/killed_cost/killed_timeout/error). No more blank/legacy.retro.cluster_failurescluster onterminal_reasonkills, not just escalations.killed_turnsrate as a first-class proposal trigger (it would have proposed the turn-budget fix directly).Acceptance criteria
retro.cluster_failuresclusters on kill reasons and surfaces a cluster forkilled_turns.Source:
dogfood/ITERATION_REPORT.mdBLOCKER-2;dogfood/FLYWHEEL.md;dogfood/METRICS.md.