fix: state-based session routing to eliminate double-message bug by dimakis · Pull Request #362 · dimakis/mitzo

dimakis · 2026-05-30T10:42:10Z

Summary

Root cause: handleSwitchSession didn't call watch(), so events from the query loop never reached the reconnected client. Combined with a staleInMemory check that could incorrectly kill a running query loop when EventStore and SessionRegistry diverged, the first message after navigating back to a session was silently swallowed.
Replace the staleInMemory boolean check with Phase 3 state-based routing using getSessionState() as the single source of truth in both handleSendV2 and handleInterruptV2
Add watch() call to handleSwitchSession + return running state to client
Use stopChat() (abort signal) instead of registry.remove() for zombie sessions
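To illustrate why the last point matters, here is a minimal sketch rather than the project's code. `QueryLoop`, `removeOnly`, and `stopThenForget` are hypothetical stand-ins for the real query loop, `registry.remove()`, and `stopChat()`. The sketch assumes the loop checks an `AbortSignal` between iterations.

```typescript
// Hypothetical stand-in for a long-running query loop.
class QueryLoop {
  ticks = 0;
  private readonly controller = new AbortController();

  /** Simulates one iteration; returns false once the loop has been aborted. */
  step(): boolean {
    if (this.controller.signal.aborted) return false;
    this.ticks += 1;
    return true;
  }

  abort(): void {
    this.controller.abort();
  }
}

// Hypothetical in-memory registry keyed by client id.
const registry = new Map<string, QueryLoop>();

// Dropping the entry only forgets the loop; nothing tells it to stop.
function removeOnly(id: string): void {
  registry.delete(id);
}

// Signalling abort first lets the loop observe the signal and exit.
function stopThenForget(id: string): void {
  registry.get(id)?.abort();
  registry.delete(id);
}

const leaked = new QueryLoop();
registry.set('a', leaked);
removeOnly('a');
const leakedStillRuns = leaked.step(); // the loop keeps iterating

const stopped = new QueryLoop();
registry.set('b', stopped);
stopThenForget('b');
const stoppedStillRuns = stopped.step(); // the loop has exited
```

Under these assumptions, removing the entry alone leaves a loop that nobody can reach any more, which is the leak the PR describes.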

Test plan

125/125 ws-handler-v2 tests pass (verified)
Navigate away from active session and back — first message should get a response immediately
Repeat with session that was idle >48h (ENDED state) — should resume cleanly
Verify interrupt works after session switch without double-send

🤖 Generated with Claude Code

Replace the staleInMemory check (isActive boolean) with Phase 3 state-based routing using the durable getSessionState() column as the single source of truth. The old check could incorrectly kill a running query loop when EventStore and SessionRegistry diverged after navigation, causing the first message to be swallowed.

Changes:
- handleSwitchSession: add watch() so events reach the connection immediately; return `running` state in session_switched response
- handleSendV2: reorder watch before send; replace staleInMemory with state-based routing (ACTIVE/DETACHED → route, ENDED → abort + resume)
- handleInterruptV2: same Phase 3 fix as handleSendV2
- protocol-parser: dispatch SET_RUNNING on session_switched
- Use stopChat() (abort signal) instead of registry.remove() for zombie sessions to prevent leaked query loops

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
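The routing described in this commit message can be sketched roughly as follows. This is an illustration only: `decideRoute` and the `Route` labels are hypothetical, and the handling of a missing registry entry or a `null` state is an assumption, since the message only spells out the ACTIVE/DETACHED and ENDED cases.

```typescript
// State names are the ones mentioned in this PR; the real enum may differ.
type SessionState =
  | 'CREATED' | 'STARTING' | 'ACTIVE' | 'DETACHED' | 'SUSPENDED' | 'CLOSING' | 'ENDED';

// Hypothetical labels for the three outcomes.
type Route = 'route' | 'abort_and_resume' | 'resume';

function decideRoute(inRegistry: boolean, storeState: SessionState | null): Route {
  // Assumption: with nothing in memory there is no loop to route to or abort.
  if (!inRegistry) return 'resume';
  // Durable state says the session should be running: deliver to the existing loop.
  if (storeState !== null && storeState !== 'ENDED') return 'route';
  // Registry entry exists but durable state is ENDED (or missing): treat as a zombie.
  return 'abort_and_resume';
}

const activeRoute = decideRoute(true, 'ACTIVE');
const detachedRoute = decideRoute(true, 'DETACHED');
const endedRoute = decideRoute(true, 'ENDED');
const coldRoute = decideRoute(false, 'ENDED');
```

The point of the design is that the decision reads the durable state rather than an in-memory boolean, so a divergence between the two stores cannot silently kill a healthy loop.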

dimakis · 2026-05-30T10:46:19Z

Centaur Review

Found 5 issue(s) (2 warning).

`server/ws-handler-v2.ts`

Solid fix for the double-message bug via state-based routing. The main concern is that handleReconnect still uses the old isActive boolean pattern, leaving a consistency gap. Missing test coverage for the new running field in handleSwitchSession.

🟡 regressions (L215): handleReconnect still uses the old storeMeta.isActive boolean + sessionRegistry.remove() pattern (lines 215-223) while handleSendV2/handleInterruptV2 have been migrated to the state-based storeState + stopChat() approach. This inconsistency means the same stale-state divergence that caused the double-message bug in send/interrupt could still manifest during reconnect. Consider migrating handleReconnect to the same state-based routing for consistency. [fixable]
🔵 style (L482): Case 1 comment lists states "ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED" but the condition storeState !== 'ENDED' also matches CLOSING. The comment should include CLOSING for accuracy, since CLOSING sessions are routable (they accept messages during graceful closeout). [fixable]

`server/__tests__/ws-handler-v2.test.ts`

🟡 missing_tests (L362): handleSwitchSession tests don't assert the new running field in the session_switched response. The existing test uses expect.objectContaining which silently ignores unasserted fields. Should add test cases for: (1) running=true when session is active in registry, (2) running=false when session is not in registry. [fixable]
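The gap the reviewer describes can be illustrated with a simplified sketch. `containsSubset` is a hypothetical, shallow stand-in for partial matching and is not how `expect.objectContaining` is implemented. The response objects and their field values are invented for illustration.

```typescript
// Simplified, shallow stand-in for partial matching; not the test framework's implementation.
function containsSubset(
  actual: Record<string, unknown>,
  expected: Record<string, unknown>,
): boolean {
  return Object.entries(expected).every(([key, value]) => Object.is(actual[key], value));
}

// Hypothetical responses; field values are invented.
const withRunning = { type: 'session_switched', mode: 'chat', running: true };
const withoutRunning = { type: 'session_switched', mode: 'chat' };

// A partial expectation that never mentions `running` passes either way.
const looseExpectation = { type: 'session_switched', mode: 'chat' };
const loosePassesWith = containsSubset(withRunning, looseExpectation);
const loosePassesWithout = containsSubset(withoutRunning, looseExpectation);

// Naming the field in the expectation makes its absence a failure.
const strictExpectation = { ...looseExpectation, running: true };
const strictPassesWith = containsSubset(withRunning, strictExpectation);
const strictPassesWithout = containsSubset(withoutRunning, strictExpectation);
```

In this sketch the loose expectation cannot tell the two responses apart, which is why the review asks for explicit `running` assertions.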

`server/chat.ts`

🔵 style (L898): Double cast msg as unknown as Record<string, unknown> is unnecessarily permissive. BootContextMessage is an object type with string keys, so a single as Record<string, unknown> would suffice and preserve more type safety. The as unknown intermediate effectively turns off the type checker. [fixable]
🔵 unsafe_assumptions (L142): localBootContextFallback hardcodes a list of 5 source paths (lines 142-148) with the comment "coupled to SOURCE_PATHS in scripts/build_boot_context.py". If the Python script's source list changes, these will silently diverge. Consider extracting to a shared constant or reading the script output's actual source list. [fixable]

dimakis

Centaur Review

Found 7 issue(s) (3 warning).

`server/ws-handler-v2.ts`

Solid fix for the double-message bug with clean state-machine routing, but the CLOSING state falls through to the active path unintentionally, and the new handleSwitchSession changes (running field, early watch) lack dedicated test coverage.

🟡 bugs (L510): Comment lists states routed to Case 1 as "ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED" but the condition storeState !== 'ENDED' && storeState !== null also admits CLOSING. A session in CLOSING state is winding down (CLOSING can only transition to ENDED per VALID_TRANSITIONS). Routing a new message via sendToChat to a closing session may race with the closeout — the agent could receive a new prompt while it's trying to commit/summarize. Either add CLOSING to the exclusion list alongside ENDED (treating it as a zombie) or document in the comment that CLOSING is intentionally treated as active. [fixable]
🟡 bugs (L672): Same CLOSING-state gap in handleInterruptV2: the comment says "ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED" but the condition also admits CLOSING. Should be consistent with whatever decision is made for handleSendV2. [fixable]
🔵 style (L497): The block comments explaining the state-based routing (lines 497-511, 563-565, 576, and similarly in handleInterruptV2) are quite verbose for this codebase's conventions. CLAUDE.md says "default to writing no comments" and "only add one when the WHY is non-obvious." The initial Phase 3 header comment is useful context, but the per-case comments restate what the code already expresses through the condition structure. Consider trimming to one-line comments per case. [fixable]
🔵 unsafe_assumptions (L377): handleSwitchSession determines running via ctx.sessionRegistry.isActive(found.clientId) — the in-memory registry. But the state-based routing in handleSendV2/handleInterruptV2 explicitly avoids trusting the in-memory registry alone (the whole point of this PR). A zombie session would show running: true here even though the EventStore state is ENDED. Consider cross-referencing eventStore.getSessionState() for consistency, similar to how handleReconnect already does zombie detection (test at line 896). [fixable]

`server/__tests__/ws-handler-v2.test.ts`

🟡 missing_tests: No test verifies that the session_switched response includes the running field. The existing handleSwitchSession tests (lines 362-426) assert on mode/cwd/branch/wtId/tokens but not running. A test with an active session (isActive → true) should assert running: true, and a test with an ended session should assert running: false. [fixable]
🔵 missing_tests: No test covers the early watch() call added in handleSwitchSession (line 368). The intent — preventing a blind spot between switch and first send — is important enough to verify that connRegistry.watch is called before the response is sent. A spy-ordering test or at minimum an assertion that watch was called with the correct args would prevent a regression. [fixable]
🔵 missing_tests: No test covers the CLOSING state for either handleSendV2 or handleInterruptV2. Given the ambiguity noted above, a test that sets getSessionState to 'CLOSING' would document the intended behavior. [fixable]
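A spy-ordering check of the kind suggested above might look like the sketch below. It uses a hand-rolled call log instead of a test framework, and `connRegistry.watch(connId, sessionId)`, `send`, and `switchSession` are hypothetical shapes, not the handler's real signatures.

```typescript
// Shared call log used as a minimal spy.
const calls: string[] = [];

// Hypothetical connection registry; the real watch() signature is not shown in this PR.
const connRegistry = {
  watch(connId: string, sessionId: string): void {
    calls.push(`watch:${connId}:${sessionId}`);
  },
};

// Hypothetical transport that records what would be sent to the client.
function send(payload: { type: string }): void {
  calls.push(`send:${payload.type}`);
}

// Stand-in for the handler: subscribe first, then acknowledge the switch.
function switchSession(connId: string, sessionId: string): void {
  connRegistry.watch(connId, sessionId);
  send({ type: 'session_switched' });
}

switchSession('conn-1', 'sess-1');

const watchIndex = calls.indexOf('watch:conn-1:sess-1');
const sendIndex = calls.indexOf('send:session_switched');
```

Asserting that `watchIndex` is smaller than `sendIndex` would catch a regression where the response goes out before the connection is subscribed.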

dimakis · 2026-05-30T10:49:27Z

+          // bug on reattach.
+
+          // Case 1: Registry has the session AND state says it should
+          // be running (ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED).


🟡 bugs: Comment lists states routed to Case 1 as "ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED" but the condition storeState !== 'ENDED' && storeState !== null also admits CLOSING. A session in CLOSING state is winding down (CLOSING can only transition to ENDED per VALID_TRANSITIONS). Routing a new message via sendToChat to a closing session may race with the closeout — the agent could receive a new prompt while it's trying to commit/summarize. Either add CLOSING to the exclusion list alongside ENDED (treating it as a zombie) or document in the comment that CLOSING is intentionally treated as active. [fixable]
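The mismatch between the comment and the condition is easy to see by enumerating states. The sketch below uses the seven state names mentioned in this review (the real enum may contain others), and both predicate names are hypothetical.

```typescript
// State names taken from this review; the project's actual list may differ.
const STATES = [
  'CREATED', 'STARTING', 'ACTIVE', 'DETACHED', 'SUSPENDED', 'CLOSING', 'ENDED',
] as const;
type SessionState = (typeof STATES)[number];

// The condition quoted in the review comment.
const routableAsWritten = (s: SessionState | null): boolean =>
  s !== 'ENDED' && s !== null;

// One of the two suggested resolutions: exclude CLOSING alongside ENDED.
const routableExcludingClosing = (s: SessionState | null): boolean =>
  s !== null && s !== 'ENDED' && s !== 'CLOSING';

const admittedAsWritten = STATES.filter(routableAsWritten);
const admittedAfterFix = STATES.filter(routableExcludingClosing);
```

Here `admittedAsWritten` contains CLOSING while `admittedAfterFix` matches the five states the code comment lists.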

dimakis · 2026-05-30T10:49:27Z

+      //
+      // Same logic as handleSendV2: use the durable state column as
+      // the single source of truth for routing decisions.
+


🟡 bugs: Same CLOSING-state gap in handleInterruptV2: the comment says "ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED" but the condition also admits CLOSING. Should be consistent with whatever decision is made for handleSendV2. [fixable]

dimakis · 2026-05-30T10:49:27Z

@@ -484,62 +497,83 @@ export function handleSendV2(
            span.setAttribute('session.state_mismatch', mismatch.details ?? 'unknown');


🔵 style: The block comments explaining the state-based routing (lines 497-511, 563-565, 576, and similarly in handleInterruptV2) are quite verbose for this codebase's conventions. CLAUDE.md says "default to writing no comments" and "only add one when the WHY is non-obvious." The initial Phase 3 header comment is useful context, but the per-case comments restate what the code already expresses through the condition structure. Consider trimming to one-line comments per case. [fixable]

dimakis · 2026-05-30T10:49:27Z

+      // "double message" bug where the first send queues behind a stale
+      // running=false state.
+      const found = ctx.sessionRegistry.findBySessionId(msg.sessionId);
+      const running = found ? ctx.sessionRegistry.isActive(found.clientId) : false;


🔵 unsafe_assumptions: handleSwitchSession determines running via ctx.sessionRegistry.isActive(found.clientId) — the in-memory registry. But the state-based routing in handleSendV2/handleInterruptV2 explicitly avoids trusting the in-memory registry alone (the whole point of this PR). A zombie session would show running: true here even though the EventStore state is ENDED. Consider cross-referencing eventStore.getSessionState() for consistency, similar to how handleReconnect already does zombie detection (test at line 896). [fixable]
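A cross-referenced version of the `running` computation could look like the sketch below. `computeRunning` is a hypothetical helper; in the handler its inputs would presumably come from `sessionRegistry.isActive(...)` and `eventStore.getSessionState(...)`. Treating a `null` state as not running is an assumption.

```typescript
// Hypothetical helper combining the in-memory flag with the durable state.
function computeRunning(registryActive: boolean, storeState: string | null): boolean {
  // Assumption: a missing or ENDED durable state means the in-memory entry is a zombie.
  const durableSaysAlive = storeState !== null && storeState !== 'ENDED';
  return registryActive && durableSaysAlive;
}

const healthy = computeRunning(true, 'ACTIVE');    // registry and store agree
const zombie = computeRunning(true, 'ENDED');      // registry is stale
const notLoaded = computeRunning(false, 'ACTIVE'); // nothing in memory
```

Whether CLOSING should also count as not running depends on the decision taken for the send and interrupt handlers.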

- Exclude CLOSING from active routing (treats as zombie alongside ENDED)
- Cross-reference eventStore.getSessionState() in handleSwitchSession running detection to avoid reporting zombie as running
- Trim verbose per-case comments to one-liners
- Add tests: watch() in handleSwitchSession, running field (true/false), CLOSING state routing for both send and interrupt handlers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dimakis

Centaur Review (R2)

Clean fix following the Phase 3 design doc. State-based routing logic is sound, CLOSING exclusion is correct, zombie abort via stopChat() prevents leaked query loops.

No blocking issues.

Minor notes for follow-up:

handleReconnect still uses old getSession().isActive pattern (different code path, not a regression)
No protocol-parser test for SET_RUNNING dispatch (low risk)
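For the second note, a follow-up test could be shaped along these lines. Everything here is hypothetical: the PR does not show the protocol-parser types, so `ServerMessage`, `SetRunningAction`, and `parseMessage` are invented stand-ins, and treating a missing `running` field as false is an assumption.

```typescript
// Hypothetical shapes; the real protocol-parser types are not shown in this PR.
interface ServerMessage {
  type: string;
  running?: boolean;
}
interface SetRunningAction {
  type: 'SET_RUNNING';
  running: boolean;
}

// Stand-in parser: only session_switched produces a SET_RUNNING action.
function parseMessage(
  msg: ServerMessage,
  dispatch: (action: SetRunningAction) => void,
): void {
  if (msg.type === 'session_switched') {
    // Assumption: a missing `running` field is treated as false.
    dispatch({ type: 'SET_RUNNING', running: msg.running === true });
  }
}

const dispatched: SetRunningAction[] = [];
parseMessage({ type: 'session_switched', running: true }, (a) => dispatched.push(a));
parseMessage({ type: 'other_event' }, (a) => dispatched.push(a));
```

A real test would replace the array with the framework's mock function and assert on the dispatched action.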

130/130 tests passing. Approved.

dimakis force-pushed the session/2026-05-30-aed44118ad7b branch from ac5e4f8 to bd7162f Compare May 30, 2026 10:44


dimakis merged commit 91a652f into main Jun 1, 2026
1 check passed

dimakis merged 2 commits into main from session/2026-05-30-aed44118ad7b
