fix: state-based session routing to eliminate double-message bug#362
Conversation
Replace the staleInMemory check (isActive boolean) with Phase 3 state-based routing using the durable getSessionState() column as the single source of truth. The old check could incorrectly kill a running query loop when EventStore and SessionRegistry diverged after navigation, causing the first message to be swallowed. Changes: - handleSwitchSession: add watch() so events reach the connection immediately; return `running` state in session_switched response - handleSendV2: reorder watch before send; replace staleInMemory with state-based routing (ACTIVE/DETACHED → route, ENDED → abort + resume) - handleInterruptV2: same Phase 3 fix as handleSendV2 - protocol-parser: dispatch SET_RUNNING on session_switched - Use stopChat() (abort signal) instead of registry.remove() for zombie sessions to prevent leaked query loops Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ac5e4f8 to
bd7162f
Compare
Centaur ReviewFound 5 issue(s) (2 warning).
|
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 7 issue(s) (3 warning).
server/ws-handler-v2.ts
Solid fix for the double-message bug with clean state-machine routing, but the CLOSING state falls through to the active path unintentionally, and the new handleSwitchSession changes (running field, early watch) lack dedicated test coverage.
- 🟡 bugs (L510): Comment lists states routed to Case 1 as "ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED" but the condition
storeState !== 'ENDED' && storeState !== nullalso admits CLOSING. A session in CLOSING state is winding down (CLOSING can only transition to ENDED per VALID_TRANSITIONS). Routing a new message via sendToChat to a closing session may race with the closeout — the agent could receive a new prompt while it's trying to commit/summarize. Either add CLOSING to the exclusion list alongside ENDED (treating it as a zombie) or document in the comment that CLOSING is intentionally treated as active.[fixable] - 🟡 bugs (L672): Same CLOSING-state gap in handleInterruptV2: the comment says "ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED" but the condition also admits CLOSING. Should be consistent with whatever decision is made for handleSendV2.
[fixable] - 🔵 style (L497): The block comments explaining the state-based routing (lines 497-511, 563-565, 576, and similarly in handleInterruptV2) are quite verbose for this codebase's conventions. CLAUDE.md says "default to writing no comments" and "only add one when the WHY is non-obvious." The initial Phase 3 header comment is useful context, but the per-case comments restate what the code already expresses through the condition structure. Consider trimming to one-line comments per case.
[fixable] - 🔵 unsafe_assumptions (L377): handleSwitchSession determines
runningviactx.sessionRegistry.isActive(found.clientId)— the in-memory registry. But the state-based routing in handleSendV2/handleInterruptV2 explicitly avoids trusting the in-memory registry alone (the whole point of this PR). A zombie session would showrunning: truehere even though the EventStore state is ENDED. Consider cross-referencingeventStore.getSessionState()for consistency, similar to how handleReconnect already does zombie detection (test at line 896).[fixable]
server/__tests__/ws-handler-v2.test.ts
Solid fix for the double-message bug with clean state-machine routing, but the CLOSING state falls through to the active path unintentionally, and the new handleSwitchSession changes (running field, early watch) lack dedicated test coverage.
- 🟡 missing_tests: No test verifies that the
session_switchedresponse includes therunningfield. The existing handleSwitchSession tests (lines 362-426) assert on mode/cwd/branch/wtId/tokens but notrunning. A test with an active session (isActive → true) should assertrunning: true, and a test with an ended session should assertrunning: false.[fixable] - 🔵 missing_tests: No test covers the early
watch()call added in handleSwitchSession (line 368). The intent — preventing a blind spot between switch and first send — is important enough to verify thatconnRegistry.watchis called before the response is sent. A spy-ordering test or at minimum an assertion that watch was called with the correct args would prevent a regression.[fixable] - 🔵 missing_tests: No test covers the CLOSING state for either handleSendV2 or handleInterruptV2. Given the ambiguity noted above, a test that sets
getSessionStateto 'CLOSING' would document the intended behavior.[fixable]
| // bug on reattach. | ||
|
|
||
| // Case 1: Registry has the session AND state says it should | ||
| // be running (ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED). |
There was a problem hiding this comment.
🟡 bugs: Comment lists states routed to Case 1 as "ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED" but the condition storeState !== 'ENDED' && storeState !== null also admits CLOSING. A session in CLOSING state is winding down (CLOSING can only transition to ENDED per VALID_TRANSITIONS). Routing a new message via sendToChat to a closing session may race with the closeout — the agent could receive a new prompt while it's trying to commit/summarize. Either add CLOSING to the exclusion list alongside ENDED (treating it as a zombie) or document in the comment that CLOSING is intentionally treated as active. [fixable]
| // | ||
| // Same logic as handleSendV2: use the durable state column as | ||
| // the single source of truth for routing decisions. | ||
|
|
There was a problem hiding this comment.
🟡 bugs: Same CLOSING-state gap in handleInterruptV2: the comment says "ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED" but the condition also admits CLOSING. Should be consistent with whatever decision is made for handleSendV2. [fixable]
| @@ -484,62 +497,83 @@ export function handleSendV2( | |||
| span.setAttribute('session.state_mismatch', mismatch.details ?? 'unknown'); | |||
There was a problem hiding this comment.
🔵 style: The block comments explaining the state-based routing (lines 497-511, 563-565, 576, and similarly in handleInterruptV2) are quite verbose for this codebase's conventions. CLAUDE.md says "default to writing no comments" and "only add one when the WHY is non-obvious." The initial Phase 3 header comment is useful context, but the per-case comments restate what the code already expresses through the condition structure. Consider trimming to one-line comments per case. [fixable]
| // "double message" bug where the first send queues behind a stale | ||
| // running=false state. | ||
| const found = ctx.sessionRegistry.findBySessionId(msg.sessionId); | ||
| const running = found ? ctx.sessionRegistry.isActive(found.clientId) : false; |
There was a problem hiding this comment.
🔵 unsafe_assumptions: handleSwitchSession determines running via ctx.sessionRegistry.isActive(found.clientId) — the in-memory registry. But the state-based routing in handleSendV2/handleInterruptV2 explicitly avoids trusting the in-memory registry alone (the whole point of this PR). A zombie session would show running: true here even though the EventStore state is ENDED. Consider cross-referencing eventStore.getSessionState() for consistency, similar to how handleReconnect already does zombie detection (test at line 896). [fixable]
- Exclude CLOSING from active routing (treats as zombie alongside ENDED) - Cross-reference eventStore.getSessionState() in handleSwitchSession running detection to avoid reporting zombie as running - Trim verbose per-case comments to one-liners - Add tests: watch() in handleSwitchSession, running field (true/false), CLOSING state routing for both send and interrupt handlers Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review (R2)
Clean fix following the Phase 3 design doc. State-based routing logic is sound, CLOSING exclusion is correct, zombie abort via stopChat() prevents leaked query loops.
No blocking issues.
Minor notes for follow-up:
handleReconnectstill uses oldgetSession().isActivepattern (different code path, not a regression)- No protocol-parser test for
SET_RUNNINGdispatch (low risk)
130/130 tests passing. Approved.
Summary
handleSwitchSessiondidn't callwatch(), so events from the query loop never reached the reconnected client. Combined with astaleInMemorycheck that could incorrectly kill a running query loop when EventStore and SessionRegistry diverged, the first message after navigating back to a session was silently swallowed.staleInMemoryboolean check with Phase 3 state-based routing usinggetSessionState()as the single source of truth in bothhandleSendV2andhandleInterruptV2watch()call tohandleSwitchSession+ returnrunningstate to clientstopChat()(abort signal) instead ofregistry.remove()for zombie sessionsTest plan
🤖 Generated with Claude Code