Skip to content

feat(tasks): reviewer loop — independent verification on in_review (status-as-role)#206

Merged
Fullstop000 merged 16 commits into
mainfrom
feat/reviewer-loop
May 23, 2026
Merged

feat(tasks): reviewer loop — independent verification on in_review (status-as-role)#206
Fullstop000 merged 16 commits into
mainfrom
feat/reviewer-loop

Conversation

@Fullstop000
Copy link
Copy Markdown
Owner

What

Adds independent verification to the autonomous task loop: a second agent (the reviewer) consumes a task in in_review and drives it to done (approve) or back to in_progress (send back, with notes), escalating to the human via dispatch_decision when it can't cleanly decide. This makes the single-agent loop trustworthy — agents are structurally bad at knowing when they've failed, so the reviewer is the second identity whose only power is judgment.

Design — status-as-role (no schema migration, no new route, no new MCP tool)

A task in in_review whose claim (claimed_by_id) is held by an agent other than its durable assignee_agent_id is under review. The role is the status. Assigning an agent to an in_review task (the existing assign route) = a review dispatch; the reviewer drives the existing claimer-gated update_task_status.

Three live states, distinguished everywhere (claimed_by_id vs assignee_agent_id):

  • workingin_progress, claimer == assignee
  • awaiting reviewerin_review, claimer == assignee (no actor expected; lease must not fire)
  • reviewingin_review, claimer != assignee (reviewer; lease applies)

Mechanics: ClaimMode::Review (claim without status change) · self-review rejected, review-assignment human-only · send-back rebounds the claim + sole membership to the worker inline (status event built from the final row) · lease generalized from assignee → claimer · reconcile re-pokes the claimer with role-appropriate text and is snapshot-guarded against list-then-act races · InReview → Blocked edge for reviewer-stall escalation · TaskInfo.reviewer_name + UI "in review by @Reviewer" shown only in the reviewing state.

Review

Three independent codex passes caught 5 real bugs, all fixed + regression-tested:

  • self-claimed tasks broke review (orphaned claim on send-back)
  • reviewer stalls blamed the worker
  • a human-claimer 500 in update_task_status (caught by the e2e suite)
  • changing reviewers corrupted the durable worker
  • (+ two stale LSP false-alarms correctly dismissed against cargo)

Dogfood (the real bar) — verified against the DB, not self-reported

Live run with real LLM agents (claude/sonnet; kimi token had expired). Every status_changed was authored by an agent (update_task_status is claimer-gated, so the human cannot produce them); the only human-authored events are review_requested (the assign action). Confirmed in SQLite:

  • Full send-back loop: worker→in_review, reviewer→in_review→in_progress (claim rebounds to worker), worker re-submits, reviewer→done.
  • Reviewer-stall → blocked correctly blamed the reviewer; awaiting-reviewer tasks were NOT escalated (three-state lease); dead-worker task also auto-blocked.

Gates

cargo fmt · cargo clippy --all-targets -D warnings · cargo test (837 passed, incl. e2e) · cd ui && npx tsc --noEmit · vitest (139 passed). 15 commits, +~1450/−46, 13 files.

🤖 Generated with Claude Code

Fullstop000 and others added 16 commits May 23, 2026 14:55
…alation)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… caller arrives next)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When assign_task is called on a task that is already in_review, instead
of forcing in_progress it dispatches the target agent as the REVIEWER:
uses ClaimMode::Review (no status change), keeps the original worker as
durable assignee, makes the reviewer sole agent sub-channel member, posts
a review directive, and emits a ReviewRequested event. Human-only; worker
cannot self-review. Removes #[allow(dead_code)] from ClaimMode::Review
now that it has a caller. Adds three covering tests (happy path,
self-review rejection, agent-assigner rejection).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…vent)

When a reviewer calls update_task_status in_review→in_progress, the claim
is handed back to the durable worker (set_claim_inner Assign), the reviewer
is evicted from sub-channel membership, a send-back directive is posted into
the sub-channel to wake the worker, and the StatusChanged event reflects the
FINAL row claimedBy (the worker name, not the reviewer's). Existing Done
archive path is unchanged. Two new tests cover approve and send-back.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the assignee-keyed lease renewal UPDATE with a claimer-keyed one
(claimed_by_id + claimed_by_type). During in_progress the claimer IS the
assignee so behavior is unchanged; during in_review the claimer is the
reviewer, so the reviewer's posts now renew the lease while the worker's
posts no longer do. Test: reviewer_post_renews_lease_nonclaimer_does_not.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
update_task_status resolved the post-rebound claimer name always as an
agent, which 500'd when the claimer is a human (operator-driven status
changes). Track final_claimer_type and resolve with it. Caught by the
e2e suite (test_task_board_e2e, task_lifecycle_*), which the reviewer
unit tests missed because they only use agent claimers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ded)

StaleTask gains status/claimed_by_id/claimed_by_type fields.
list_stale_assigned_tasks now includes reviewing tasks (in_review,
claimer != assignee) while excluding awaiting-reviewer tasks (in_review,
claimer == assignee). redispatch_task is snapshot-guarded and branches
directive text by state, re-poking the claimer. escalate_task is
snapshot-guarded and derives prevStatus from the snapshot. Four tests
cover the three-state predicate, branched directive, prevStatus, and the
no-op guard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds `reviewer_name: Option<String>` to `TaskInfo`, populated via a
trailing SQL CASE column in all three task-read queries. Returns the
reviewer agent's display name ONLY when status = in_review AND
claimed_by_id != assignee_agent_id (the "reviewing" state); NULL in
every other state including "awaiting reviewer". Column appended last
so no existing index shifts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents and locks the existing update_task_status claimer gate: after
a reviewer is dispatched (assign_task in_review branch), neither the
human assigner nor the original worker may drive in_review → in_progress
because neither holds the claim. No production change — the gate was
already in place; this test makes the invariant explicit and regression-proof.

Audit finding: the only SQL paths that write in_progress are
set_claim_inner (Claim mode, CAS-guarded todo→in_progress only) and
update_task_status (claimer-gated). escalate_task only goes to blocked;
redispatch_task only bumps attempts. No admin/CLI/UI bypass exists.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The assign handler's assign_task call was using internal_err (500) for
all unexpected errors. Three review-branch violations — self-review,
human-only assigner, and review-claim-did-not-apply — are client errors
(caller got the precondition wrong), so they map to 409 Conflict.

Matches the existing done-task 409 guard's style: check the error string
for the known substrings and route to app_err!(CONFLICT) before falling
through to internal_err for genuinely unexpected store errors.

Route: POST /api/conversations/{channel_id}/tasks/{task_number}/assign
Body field: agentId

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add `review_requested` to `TaskEventAction` union and VALID_ACTIONS.
Add `reviewerName?: string` to `TaskInfo`. Show 'in review by @name'
in TaskDetailView and TaskCard when reviewerName is set; omit when absent.
Add exhaustive switch arm in `formatEvent` ("requested review"). Tests
cover parse, rendering present/absent reviewerName, and the new event label.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…(codex review)

Two P2s from the end-of-branch codex review:
- Self-claimed tasks have assignee_agent_id=NULL; the review branch keyed
  the worker off the durable assignee, so self-review wasn't caught and a
  send-back stranded the task on the reviewer (orphaned claim). Derive the
  worker from the current claimer (covers assigned + self-claimed) and
  promote it to the durable assignee on review dispatch.
- escalate_task blamed the durable worker for a reviewer stall. Blame the
  actual staller (reviewer when reviewing, worker otherwise).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-review)

The prior fix derived the worker from the current claimer, which on a
reviewer change is the OUTGOING reviewer, not the worker — corrupting the
durable assignee and breaking send-back/self-review. Prefer the existing
durable assignee when present; fall back to the claimer only for a
self-claimed task's first review dispatch. Regression test added.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- CHANGELOG: reviewer loop (#206)
- VERSION: 0.17.0 -> 0.18.0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Fullstop000 Fullstop000 merged commit d0cf64c into main May 23, 2026
3 checks passed
Fullstop000 added a commit that referenced this pull request May 23, 2026
…ference) (#207)

* fix(decisions): resolve dispatch_decision agent by id, not name

handle_create_decision looked up the agent with get_agent(name) while the
bridge addresses the callback as /internal/agent/{agent_id}/decisions with
the agent's UUID id (agents are keyed by id since #142). The id never
matched the name column, so every real agent's dispatch_decision 500'd with
"agent not found" — the resume path in the same file already used
get_agent_by_id. Switch the create path to get_agent_by_id.

The existing test masked it: its fixture seeds an agent whose name == id
("bot1"), so the by-name lookup happened to match. Added a regression test
with a real UUID id != name (verified it 404s without the fix).

Found by dogfooding the reviewer loop's escalation path (#206): a reviewer
agent correctly chose to escalate via dispatch_decision and hit this.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(decisions): infer decision channel from durable state, not dead trace lookup

dispatch_decision inferred its inbox channel from `lifecycle.run_channel_id`,
backed by AgentTraceStore's in-memory agent-scoped channel. But `set_run_channel`
had no non-test caller, so the lookup always returned None and every escalation
400'd ("no active-run channel"). The Decision Inbox never worked in production;
tests passed only because the mock lifecycle set the channel by hand. Bug #1
(agent-by-name) failed earlier in the same handler and masked this.

The channel is display context (inbox label) only — delivery is agent-scoped
(resume_with_prompt), scoping is by workspace_id — so resolve it from durable
reads: claimed-task sub-channel, else current-run channel, else last chat
channel, else an honest 400. Task work drives status without a run-stamped
chat, so the claimed task's sub-channel is the reliable origin for a task
agent that escalates before chatting (the reviewer-loop case the dogfood hit).

Delete the dead set_run_channel / run_channel_id from the AgentLifecycle trait,
AgentManager, and the test mock — the trap that hid this. Add a regression test
(decision_channel_resolves_from_claimed_task) exercising the production path
with no mock channel; rework the existing decision tests off set_run_channel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: prep for v0.18.1 preview ship

- CHANGELOG: dispatch_decision now reaches the Decision Inbox (#207)
- VERSION: 0.18.0 -> 0.18.1

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant