
fix: state-based session routing to eliminate double-message bug #362

Merged
dimakis merged 2 commits into main from session/2026-05-30-aed44118ad7b
Jun 1, 2026

Conversation

@dimakis
Owner

@dimakis dimakis commented May 30, 2026

Summary

  • Root cause: handleSwitchSession didn't call watch(), so events from the query loop never reached the reconnected client. Combined with a staleInMemory check that could incorrectly kill a running query loop when EventStore and SessionRegistry diverged, the first message after navigating back to a session was silently swallowed.
  • Replace the staleInMemory boolean check with Phase 3 state-based routing using getSessionState() as the single source of truth in both handleSendV2 and handleInterruptV2
  • Add watch() call to handleSwitchSession + return running state to client
  • Use stopChat() (abort signal) instead of registry.remove() for zombie sessions

Test plan

  • 125/125 ws-handler-v2 tests pass (verified)
  • Navigate away from active session and back — first message should get a response immediately
  • Repeat with session that was idle >48h (ENDED state) — should resume cleanly
  • Verify interrupt works after session switch without double-send

🤖 Generated with Claude Code

Replace the staleInMemory check (isActive boolean) with Phase 3
state-based routing using the durable getSessionState() column as the
single source of truth. The old check could incorrectly kill a running
query loop when EventStore and SessionRegistry diverged after
navigation, causing the first message to be swallowed.

Changes:
- handleSwitchSession: add watch() so events reach the connection
  immediately; return `running` state in session_switched response
- handleSendV2: reorder watch before send; replace staleInMemory with
  state-based routing (ACTIVE/DETACHED → route, ENDED → abort + resume)
- handleInterruptV2: same Phase 3 fix as handleSendV2
- protocol-parser: dispatch SET_RUNNING on session_switched
- Use stopChat() (abort signal) instead of registry.remove() for zombie
  sessions to prevent leaked query loops

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
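
The routing described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from server/ws-handler-v2.ts: `decideRoute`, the `Route` union, and its labels are hypothetical names, and the handling of a `null` state for a session that is still in the registry is assumed. Only the state names come from this PR's discussion; `state` stands in for the value returned by getSessionState().

```typescript
// Session states named in this PR's discussion.
type SessionState =
  | 'CREATED' | 'STARTING' | 'ACTIVE' | 'DETACHED'
  | 'SUSPENDED' | 'CLOSING' | 'ENDED';

// Hypothetical labels for the routing outcomes.
type Route = 'route_to_running' | 'abort_and_resume' | 'start_or_resume';

// inRegistry: whether the in-memory registry still holds the session.
// state: the durable state column, treated as the source of truth.
function decideRoute(inRegistry: boolean, state: SessionState | null): Route {
  // Registry has the session and the durable state says it is still live.
  if (inRegistry && state !== null && state !== 'ENDED') {
    return 'route_to_running';
  }
  // Registry has the session but the durable state disagrees: a zombie.
  // The commit aborts it via stopChat() before resuming.
  // (Assumption: a null state is treated the same way here.)
  if (inRegistry) {
    return 'abort_and_resume';
  }
  // Nothing in memory: start or resume a query loop.
  return 'start_or_resume';
}
```

The point of the shape is that the in-memory registry alone never decides the outcome; it is always checked against the durable state.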
@dimakis dimakis force-pushed the session/2026-05-30-aed44118ad7b branch from ac5e4f8 to bd7162f May 30, 2026 10:44
@dimakis
Owner Author

dimakis commented May 30, 2026

Centaur Review

Found 5 issues (2 warnings).

server/ws-handler-v2.ts

Solid fix for the double-message bug via state-based routing. The main concern is that handleReconnect still uses the old isActive boolean pattern, leaving a consistency gap. Missing test coverage for the new running field in handleSwitchSession.

  • 🟡 regressions (L215): handleReconnect still uses the old storeMeta.isActive boolean + sessionRegistry.remove() pattern (lines 215-223) while handleSendV2/handleInterruptV2 have been migrated to the state-based storeState + stopChat() approach. This inconsistency means the same stale-state divergence that caused the double-message bug in send/interrupt could still manifest during reconnect. Consider migrating handleReconnect to the same state-based routing for consistency. [fixable]
  • 🔵 style (L482): Case 1 comment lists states "ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED" but the condition storeState !== 'ENDED' also matches CLOSING. The comment should include CLOSING for accuracy, since CLOSING sessions are routable (they accept messages during graceful closeout). [fixable]

server/__tests__/ws-handler-v2.test.ts

  • 🟡 missing_tests (L362): handleSwitchSession tests don't assert the new running field in the session_switched response. The existing test uses expect.objectContaining which silently ignores unasserted fields. Should add test cases for: (1) running=true when session is active in registry, (2) running=false when session is not in registry. [fixable]
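
To illustrate why a partial match lets a missing field slip through, here is a small self-contained sketch. `containsAll` is a hand-rolled stand-in for a partial matcher, not the test framework's implementation, and the response object with its `mode` value is hypothetical.

```typescript
// Hand-rolled stand-in for a partial matcher: passes when every expected
// key is present on the actual object with a strictly equal value.
function containsAll(
  actual: Record<string, unknown>,
  expected: Record<string, unknown>,
): boolean {
  return Object.entries(expected).every(([key, value]) => actual[key] === value);
}

// Hypothetical response that is missing the `running` field entirely.
const response: Record<string, unknown> = { type: 'session_switched', mode: 'chat' };

// Passes even though `running` is absent, because it was never asserted.
const looseCheck = containsAll(response, { type: 'session_switched' });

// Fails once `running` is part of the expectation.
const strictCheck = containsAll(response, { type: 'session_switched', running: false });
```

Adding `running` to the expected object in each test case is what turns the silent pass into a real check.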

server/chat.ts

  • 🔵 style (L898): Double cast msg as unknown as Record<string, unknown> is unnecessarily permissive. BootContextMessage is an object type with string keys, so a single as Record<string, unknown> would suffice and preserve more type safety. The as unknown intermediate effectively turns off the type checker. [fixable]
  • 🔵 unsafe_assumptions (L142): localBootContextFallback hardcodes a list of 5 source paths (lines 142-148) with the comment "coupled to SOURCE_PATHS in scripts/build_boot_context.py". If the Python script's source list changes, these will silently diverge. Consider extracting to a shared constant or reading the script output's actual source list. [fixable]

Owner Author

@dimakis dimakis left a comment

Centaur Review

Found 7 issues (3 warnings).

server/ws-handler-v2.ts

Solid fix for the double-message bug with clean state-machine routing, but the CLOSING state falls through to the active path unintentionally, and the new handleSwitchSession changes (running field, early watch) lack dedicated test coverage.

  • 🟡 bugs (L510): Comment lists states routed to Case 1 as "ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED" but the condition storeState !== 'ENDED' && storeState !== null also admits CLOSING. A session in CLOSING state is winding down (CLOSING can only transition to ENDED per VALID_TRANSITIONS). Routing a new message via sendToChat to a closing session may race with the closeout — the agent could receive a new prompt while it's trying to commit/summarize. Either add CLOSING to the exclusion list alongside ENDED (treating it as a zombie) or document in the comment that CLOSING is intentionally treated as active. [fixable]
  • 🟡 bugs (L672): Same CLOSING-state gap in handleInterruptV2: the comment says "ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED" but the condition also admits CLOSING. Should be consistent with whatever decision is made for handleSendV2. [fixable]
  • 🔵 style (L497): The block comments explaining the state-based routing (lines 497-511, 563-565, 576, and similarly in handleInterruptV2) are quite verbose for this codebase's conventions. CLAUDE.md says "default to writing no comments" and "only add one when the WHY is non-obvious." The initial Phase 3 header comment is useful context, but the per-case comments restate what the code already expresses through the condition structure. Consider trimming to one-line comments per case. [fixable]
  • 🔵 unsafe_assumptions (L377): handleSwitchSession determines running via ctx.sessionRegistry.isActive(found.clientId) — the in-memory registry. But the state-based routing in handleSendV2/handleInterruptV2 explicitly avoids trusting the in-memory registry alone (the whole point of this PR). A zombie session would show running: true here even though the EventStore state is ENDED. Consider cross-referencing eventStore.getSessionState() for consistency, similar to how handleReconnect already does zombie detection (test at line 896). [fixable]
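
The CLOSING gap raised above can be made concrete with a short sketch. The first predicate mirrors the condition quoted in the review; the second is one possible stricter variant that treats CLOSING like ENDED. Both function names and the `NOT_ROUTABLE` set are hypothetical and are not taken from server/ws-handler-v2.ts.

```typescript
// Session states named in this PR's discussion.
type SessionState =
  | 'CREATED' | 'STARTING' | 'ACTIVE' | 'DETACHED'
  | 'SUSPENDED' | 'CLOSING' | 'ENDED';

// Condition as quoted in the review: anything except ENDED or null routes,
// which means CLOSING is admitted as well.
function routableAsReviewed(state: SessionState | null): boolean {
  return state !== 'ENDED' && state !== null;
}

// Stricter variant: a session that is winding down is not routable.
const NOT_ROUTABLE: ReadonlySet<SessionState> = new Set<SessionState>(['CLOSING', 'ENDED']);

function routableExcludingClosing(state: SessionState | null): boolean {
  return state !== null && !NOT_ROUTABLE.has(state);
}
```

Whichever variant is chosen, using the same predicate in both handleSendV2 and handleInterruptV2 keeps the two handlers consistent.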

server/__tests__/ws-handler-v2.test.ts

  • 🟡 missing_tests: No test verifies that the session_switched response includes the running field. The existing handleSwitchSession tests (lines 362-426) assert on mode/cwd/branch/wtId/tokens but not running. A test with an active session (isActive → true) should assert running: true, and a test with an ended session should assert running: false. [fixable]
  • 🔵 missing_tests: No test covers the early watch() call added in handleSwitchSession (line 368). The intent — preventing a blind spot between switch and first send — is important enough to verify that connRegistry.watch is called before the response is sent. A spy-ordering test or at minimum an assertion that watch was called with the correct args would prevent a regression. [fixable]
  • 🔵 missing_tests: No test covers the CLOSING state for either handleSendV2 or handleInterruptV2. Given the ambiguity noted above, a test that sets getSessionState to 'CLOSING' would document the intended behavior. [fixable]
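
A spy-ordering check of the kind suggested above could look roughly like this. It is a framework-free sketch: the `Conn` interface, the `switchSession` function, and the session id are hypothetical stand-ins, and the real connRegistry.watch signature may differ.

```typescript
// Hypothetical connection surface: subscribe to a session, send a response.
interface Conn {
  watch(sessionId: string): void;
  send(msg: { type: string; running: boolean }): void;
}

// Stand-in for the handler under test: watch first, then respond.
function switchSession(conn: Conn, sessionId: string, running: boolean): void {
  conn.watch(sessionId);
  conn.send({ type: 'session_switched', running });
}

// Spy that records call order and the payloads that were sent.
const calls: string[] = [];
const sent: Array<{ type: string; running: boolean }> = [];
const spyConn: Conn = {
  watch: (sessionId) => { calls.push(`watch:${sessionId}`); },
  send: (msg) => { calls.push(`send:${msg.type}`); sent.push(msg); },
};

switchSession(spyConn, 's1', true);

// Ordering check: the subscription must be in place before the response goes out.
const watchedBeforeSend = calls.indexOf('watch:s1') < calls.indexOf('send:session_switched');
```

In a real test suite the same idea would be expressed with the framework's mock call-order facilities; the recorded `sent` payload also gives a place to assert the `running` field.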

Comment thread server/ws-handler-v2.ts Outdated
// bug on reattach.

// Case 1: Registry has the session AND state says it should
// be running (ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED).
Owner Author

🟡 bugs: Comment lists states routed to Case 1 as "ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED" but the condition storeState !== 'ENDED' && storeState !== null also admits CLOSING. A session in CLOSING state is winding down (CLOSING can only transition to ENDED per VALID_TRANSITIONS). Routing a new message via sendToChat to a closing session may race with the closeout — the agent could receive a new prompt while it's trying to commit/summarize. Either add CLOSING to the exclusion list alongside ENDED (treating it as a zombie) or document in the comment that CLOSING is intentionally treated as active. [fixable]

Comment thread server/ws-handler-v2.ts Outdated
//
// Same logic as handleSendV2: use the durable state column as
// the single source of truth for routing decisions.

Owner Author

🟡 bugs: Same CLOSING-state gap in handleInterruptV2: the comment says "ACTIVE, DETACHED, SUSPENDED, STARTING, CREATED" but the condition also admits CLOSING. Should be consistent with whatever decision is made for handleSendV2. [fixable]

Comment thread server/ws-handler-v2.ts
@@ -484,62 +497,83 @@ export function handleSendV2(
span.setAttribute('session.state_mismatch', mismatch.details ?? 'unknown');
Owner Author

🔵 style: The block comments explaining the state-based routing (lines 497-511, 563-565, 576, and similarly in handleInterruptV2) are quite verbose for this codebase's conventions. CLAUDE.md says "default to writing no comments" and "only add one when the WHY is non-obvious." The initial Phase 3 header comment is useful context, but the per-case comments restate what the code already expresses through the condition structure. Consider trimming to one-line comments per case. [fixable]

Comment thread server/ws-handler-v2.ts Outdated
// "double message" bug where the first send queues behind a stale
// running=false state.
const found = ctx.sessionRegistry.findBySessionId(msg.sessionId);
const running = found ? ctx.sessionRegistry.isActive(found.clientId) : false;
Owner Author

🔵 unsafe_assumptions: handleSwitchSession determines running via ctx.sessionRegistry.isActive(found.clientId) — the in-memory registry. But the state-based routing in handleSendV2/handleInterruptV2 explicitly avoids trusting the in-memory registry alone (the whole point of this PR). A zombie session would show running: true here even though the EventStore state is ENDED. Consider cross-referencing eventStore.getSessionState() for consistency, similar to how handleReconnect already does zombie detection (test at line 896). [fixable]

- Exclude CLOSING from active routing (treats as zombie alongside ENDED)
- Cross-reference eventStore.getSessionState() in handleSwitchSession
  running detection to avoid reporting zombie as running
- Trim verbose per-case comments to one-liners
- Add tests: watch() in handleSwitchSession, running field (true/false),
  CLOSING state routing for both send and interrupt handlers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
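
The cross-referenced `running` detection described in this commit message might look something like the sketch below. `computeRunning` is a hypothetical helper name, and treating a `null` store state as not running is an assumption; the actual handleSwitchSession code may differ.

```typescript
// Session states named in this PR's discussion.
type SessionState =
  | 'CREATED' | 'STARTING' | 'ACTIVE' | 'DETACHED'
  | 'SUSPENDED' | 'CLOSING' | 'ENDED';

// activeInRegistry: what the in-memory registry reports for the session.
// storeState: the durable state, standing in for getSessionState().
function computeRunning(activeInRegistry: boolean, storeState: SessionState | null): boolean {
  // Not active in memory: nothing can be running.
  if (!activeInRegistry) return false;
  // Active in memory but the durable state says closing or ended: a zombie,
  // so do not report it as running. (Assumption: null is treated the same.)
  return storeState !== null && storeState !== 'CLOSING' && storeState !== 'ENDED';
}
```

With this shape, the `running` value sent in session_switched agrees with the routing decision that the send and interrupt handlers would make for the same session.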
Owner Author

@dimakis dimakis left a comment

Centaur Review (R2)

Clean fix following the Phase 3 design doc. State-based routing logic is sound, CLOSING exclusion is correct, zombie abort via stopChat() prevents leaked query loops.

No blocking issues.

Minor notes for follow-up:

  • handleReconnect still uses old getSession().isActive pattern (different code path, not a regression)
  • No protocol-parser test for SET_RUNNING dispatch (low risk)

130/130 tests passing. Approved.

@dimakis dimakis merged commit 91a652f into main Jun 1, 2026
1 check passed