Summary
The queue reviewer and plan reviewer have calibration gaps. The queue reviewer judges each slice's coherence but not aggregate over-decomposition, and plan-reviewer strictness is load-bearing but not exposed as a tunable.
⚠️ The reviewer in the dogfood run was a synthetic LLM judge, not a human. These are real calibration signals, but human review DX is deferred to an attended run.
Evidence (dogfood campaign)
- The queue reviewer approved the over-sliced F5 queue (5 issues + scope creep) — it never asks "is this the right number of slices for the request?"
- The plan reviewer, at full strictness, bounced a sound trivial plan 3× on style/truncation nits; recalibrating to "approve sound, request-changes only for substantive defects" then approved correctly. Calibration is load-bearing.
- F2 failed
doc_review because a mandated request-changes + a demanding judge never converged in 2 cycles — the revise→review loop can diverge with a cheap grill model.
Proposed fix
- Add an aggregate over-decomposition check to the queue rubric ("right number of slices for the request?").
- Expose reviewer strictness (demanding vs demanding-but-fair) as an operator-tunable.
- Guard the revise→review loop against divergence (cap cycles / detect non-convergence with cheap models).
Acceptance criteria
Source: dogfood/ITERATION_REPORT.md MINOR-7; dogfood/AUTOREVIEW_LOG.md.
Summary
The queue reviewer and plan reviewer have calibration gaps. The queue reviewer judges each slice's coherence but not aggregate over-decomposition, and plan-reviewer strictness is load-bearing but not exposed as a tunable.
Evidence (dogfood campaign)
doc_reviewbecause a mandated request-changes + a demanding judge never converged in 2 cycles — the revise→review loop can diverge with a cheap grill model.Proposed fix
Acceptance criteria
Source:
dogfood/ITERATION_REPORT.mdMINOR-7;dogfood/AUTOREVIEW_LOG.md.