Skip to content

Session pool reset detection via message count drop#7

Open
sterling-prog wants to merge 30 commits into
wende:mainfrom
sterling-prog:build/session-pool-reset-detection
Open

Session pool reset detection via message count drop#7
sterling-prog wants to merge 30 commits into
wende:mainfrom
sterling-prog:build/session-pool-reset-detection

Conversation

@sterling-prog
Copy link
Copy Markdown

Summary

  • Adds lastMessageCount to PooledProcess (init: 0) and messageCount to PendingRequest
  • Detects gateway session resets (via /new, idle timeout, compaction) by tracking incoming messageCount per request
  • When messageCount < proc.lastMessageCount on a locked process: recycles stale CLI process and routes to a fresh warm/cold slot
  • Idle process: recycles immediately (clearSessionLock + killAndRespawn)
  • Busy process: marks orphaned=true, clears lock from map; releaseProcess kills after current request completes
  • Threads messageCount through all call sites: execute(), routeToProcess(), assignToProcess(), enqueueOnProcess(), enqueueOnSentinel(), drainNextRequest(), cold spawn .then()
  • Routes layer passes body.messages.length to router.execute()

Test plan

  • 5 requests with increasing message counts → all route to same locked process, no recycle
  • Request with message count lower than previous → session_reset_detected logged, fresh process claimed
  • messageCount: 0 edge case → no recycle (guard: messageCount > 0)
  • First request on fresh process → no recycle (guard: proc.lastMessageCount > 0)
  • Busy process during reset → process enters orphaned, completes current request, then recycles
  • npm run build passes clean ✅

🤖 Generated with Claude Code

Scope and others added 30 commits March 21, 2026 18:06
Addresses 3 critical (request serialization, orphan reclamation, context
accumulation), 5 major (process death, flex zone, disconnect, auth, unknown
model), and 6 minor (DST, stats, testing, commits, logging, caps) findings
from Opus architectural review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three major findings resolved:
- N1: Orphan queue rejection (reject stale-session requests, don't drain)
- N2: Per-request timeout (POOL_REQUEST_TIMEOUT_MS, 5 min default)
- N3: Atomic pool claim via PENDING_SENTINEL (prevents race on new keys)

Four minor findings resolved:
- N4: agentChannel format validation with fallback
- N5: Off-by-one wording fix in backpressure validation
- N6: ClaudeSubprocess fallback documented as uncapped (intentional)
- N7: Aggregate route-hit counters added to stats()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…review

N8: Canonical clearSessionLock() for all unlock paths
N9: Failed cold spawn rejects queued requests before sentinel deletion
N10: Fallback serialization loss documented as known degradation
N11: Sweep refill checks cap per-spawn, not once
N12: Shutdown closes listening socket before draining
N13: Health endpoint includes locked counts per model
N14: Sweep skips processes in recycling state

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… review

N15: Router checks POOLED_MODELS set, not type-level unknowns (extractModel defaults to opus)
N16: PENDING_SENTINEL specified as lightweight object with requestQueue
N17: clearSessionLock called after drain completes, not before
N18: standalone.ts owns full shutdown sequence (no pool threading into index.ts)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…weep (session-pooling.spec.md)

- M1: SessionPoolRouter (src/subprocess/router.ts)
  - Per-model warm pools (opus, sonnet) with configurable sizes
  - Session-key locking with lockedSessions map
  - PendingSentinel atomic pool claim (prevents race on new-key requests)
  - Per-process FIFO request queue (max depth 3) with 429 backpressure
  - Lineage tracking (agentChannel) for orphan detection on session reset
  - Orphan queue rejection: queued requests rejected 503, not drained
  - Context accumulation guard: recycle process when requestCount > threshold
  - canonical clearSessionLock() used by all unlock paths
  - Failed cold spawn recovery: reject queued requests before deleting sentinel
  - Per-request timeout (POOL_REQUEST_TIMEOUT_MS): hung process treated as dead
  - Process death recovery: atomic cleanup, reject queue, spawn replacement
  - Auth error detection via stderr pattern matching
  - ClaudeSubprocess fallback for non-pooled models and headerless requests
  - Nightly sweep: idle > 2h or requestCount > threshold; refill warm pool
  - stats() returns full RouterStats (total, locked, warm, busy, queued, etc.)

- M2: Route integration (src/server/routes.ts)
  - Extracts x-openclaw-session-key header; routes through SessionPoolRouter.execute()
  - Fallback to ClaudeSubprocess if no session key or no pool
  - Client disconnect: detach emitter for pooled process (let it finish); kill for subprocess fallback
  - Health endpoint includes pool stats from stats()
  - Structured logging per request: sessionKey, model, pid, latencyMs

- M3: Server startup and sweep scheduling (src/server/standalone.ts)
  - Initializes SessionPoolRouter on startup with env var config
  - Nightly sweep via node-cron at 3 AM ET (DST-aware timezone: America/New_York)
  - Graceful shutdown: close socket immediately, wait 30s, router.shutdown(), exit
  - All pool sizes configurable via env vars (POOL_OPUS_SIZE, POOL_SONNET_SIZE, etc.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ate CLAUDE.md

- Remove src/subprocess/pool.ts (shared stateless pool prototype — superseded by SessionPoolRouter)
- Remove src/server/standalone-pool.ts (prototype server entry point — superseded by production standalone.ts)
- Update CLAUDE.md with session-pooled architecture docs, request routing diagram, env var table

These were untracked prototype files. The production implementation is in router.ts (M1).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n, M2-02 agent-id header, M2-01 queueDepth+cacheHit logging

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two inline lockedSessions.delete() calls at sentinel cleanup points
(pool-at-capacity fallback and cold-spawn-failure) bypassed clearSessionLock().
These are PendingSentinel entries (not PooledProcess), so the canonical
function signature is extended to accept null for proc.

The spec grep validation — lockedSessions.delete appearing ONLY inside
clearSessionLock() — now passes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ation

Session Pooling — Claude Max API Proxy
The gateway does not send x-openclaw-session-key headers on outbound
provider calls. Without this fallback, all production requests bypassed
the session pool and fell through to one-shot ClaudeSubprocess.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ash-handler.spec.md)

Adds process-level error handlers so silent crashes log [FATAL] to
stderr before exiting. Fixes the 04:35/06:14/06:32 silent crash pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…safe-emit.spec.md)

Prevents uncaught exception crash when error is emitted into a bare EventEmitter
after routes.ts has stripped listeners via removeAllListeners().

Sites fixed: child process error handler, request timeout, handleProcessDeath, shutdown loop.
Suppressed errors are logged to stderr with process ID and error message.
Converts safeEmitter(emitter) wrapper to safeEmitter() factory — EventEmitter
is created inside the function, preventing future call sites from forgetting
the guard. All 7 creation sites updated; build passes clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The complex version (activeListeners counter + monkey-patched on/off/removeListener +
prependListener guard) was over-engineered. A permanent `emitter.on('error', ...)`
at creation time is sufficient — it guarantees listenerCount('error') >= 1, preventing
Node's uncaught exception on emit('error') with zero listeners.

Root cause: routes.ts calls removeAllListeners() on client disconnect, stripping all
listeners. If the CLI process emits an error after cleanup, Node throws (EventEmitter
contract: emit('error') with 0 listeners = throw). In Node v25, unhandled exceptions
exit silently with code 1.

Defense layers:
1. safeEmitter() — permanent error listener prevents the throw (this commit)
2. uncaughtException/unhandledRejection handlers in standalone.ts (commit 4ef1787)

Known limitation: removeAllListeners() in routes.ts:178 can strip the permanent
listener. A follow-up should change to targeted removeAllListeners('data') etc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ghtException handlers

Adds [FATAL] logging for previously-silent crashes (Node v25 exits code 1 with no
output on unhandled rejections/exceptions).
…stener

Root cause fix for recurring proxy crashes. EventEmitter error events emitted after
removeAllListeners() caused Node uncaught exceptions → silent exit code 1 in Node v25.

safeEmitter() factory attaches a permanent error listener at creation time, preventing
the zero-listener throw. Applied to all 7 EventEmitter creation sites in router.ts.
router.ts:
- Add safeListenerCount() helper — wraps listenerCount() in try-catch,
  returns 0 if emitter is corrupted (prevents [FATAL] crash)
- Update timeout handler and process error handler to use safeListenerCount()
- Add else-branch logging when listeners are absent (observability)

routes.ts:
- Replace removeAllListeners() with targeted removeListener() calls
- Convert inline emitter.on() to named function refs (onTextBlockStart,
  onContentDelta, onAssistant, onResult, onError)
- Client disconnect now removes only request-specific listeners,
  preserving safeEmitter()'s permanent error listener

Fixes: pool crash on corrupted emitter + removeAllListeners stripping
permanent error guard
Logs arrival time, sessionKey, model, messageCount, contentLength
at the top of handleChatCompletions — before any routing logic.
Also logs route decision (routeType, pid, queueDepth, elapsedMs).

Purpose: diagnose dropped requests where gateway times out but
proxy has no record of receiving the request.
Root cause: safeEmitter patches prevented crashes but left processes in
zombie 'busy' state when the gateway timed out and disconnected. The
process would stay locked to the session, still processing a response
nobody was consuming. Next request to the same session routed to the
zombie process → hung → cascading timeout.

Fix: When res.on('close') fires and the request isn't complete:
1. Remove request-specific listeners (existing)
2. Call router.cancelRequest(sessionKey) (NEW)
3. cancelRequest kills the stuck CLI process and respawns a fresh one

This preserves safeEmitter's crash prevention while ensuring processes
don't stay in zombie state after client disconnect.
Send ': keep-alive' SSE comments every 15s while waiting for the CLI
to produce the first real token. SSE comments (lines starting with ':')
are ignored by OpenAI SDK parsers but reset the HTTP connection idle
timer, preventing the gateway's 60s timeout from firing during opus
thinking time on large prompts.

Timer is cleared on: first content, result, error, or client disconnect.

Root cause: opus on 100KB+ prompts can take >60s to produce first token.
The gateway's OpenAI SDK times out at 60s idle, triggering cascade to
cyberdyne/deepseek even though the proxy and CLI are working correctly.
The SSE comment approach was wrong — OpenClaw's idle timeout tracks the
parsed token stream, not raw HTTP bytes, so comment lines don't reset it.
Worse, the ': keep-alive' SSE comments appear to have confused the
gateway's stream parser, causing responses to complete at the proxy but
not reach Discord.

The correct fix is agents.defaults.llm.idleTimeoutSeconds = 300 in
openclaw.json (already applied), which configures the idle timer directly.
Bug: cancelRequest() killed the process but never called clearSessionLock().
The lockedSessions map kept pointing at the dead process. Every subsequent
request for the same session found the dead process, enqueued behind it
(queueDepth=1), waited 300s for idleTimeout, cancelled, and repeated
indefinitely — a permanent cascade loop.

Fix: call clearSessionLock(sessionKey, proc) before killAndRespawn so
the next request routes to a fresh warm process instead of queuing behind
a dead one.

Evidence: pid 341522 was spawned at 18:49, received 4 requests over 3.5h,
all cancelled at 300s, never logged a process_death, always showed
queueDepth=1 — the lock was never released.
…n pooled processes (session-pool-context-fix.spec.md)
safeEmitter() guarantees a permanent error listener, making all
listenerCount("error") > 0 checks always-true dead code. Remove all 8
guards and emit unconditionally. Also remove PendingRequest.resolve
which was always a no-op.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…uard (session-pool-context-fix.spec.md)

- onError() now removes all event listeners before resolving (was leaving
  text_block_start/content_delta/assistant/result attached after error)
- extractText() filters blocks where text is null/undefined to prevent
  silent empty returns on malformed content arrays

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Track lastMessageCount on PooledProcess; recycle stale CLI process when
incoming messageCount drops below stored value, indicating a gateway reset
(/new, idle timeout, or compaction). Threads messageCount through execute(),
routeToProcess(), assignToProcess(), enqueueOnProcess(), enqueueOnSentinel(),
drainNextRequest(), and PendingRequest. Routes layer passes messages.length.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • In SessionPoolRouter.execute(), the busy-process session reset path directly calls lockedSessions.delete(sessionKey) and mutates proc.lockedTo rather than going through clearSessionLock(), which breaks the invariant that all unlocks use the canonical helper and leaves fields like agentChannel/requestCount uncleared; consider refactoring this branch to reuse clearSessionLock() (or a variant that marks the process orphaned) to keep lock state consistent.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In SessionPoolRouter.execute(), the busy-process session reset path directly calls `lockedSessions.delete(sessionKey)` and mutates `proc.lockedTo` rather than going through `clearSessionLock()`, which breaks the invariant that all unlocks use the canonical helper and leaves fields like `agentChannel`/`requestCount` uncleared; consider refactoring this branch to reuse `clearSessionLock()` (or a variant that marks the process orphaned) to keep lock state consistent.

## Individual Comments

### Comment 1
<location path="src/subprocess/router.ts" line_range="385-387" />
<code_context>
+    this.warmPool.get(model)!.push(proc);
+  }
+
+  private async spawnCold(model: PooledModel): Promise<PooledProcess> {
+    const proc = this.spawnProcess(model);
+    await new Promise<void>((resolve) => {
+      if (proc.ready) {
+        resolve();
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The spawnCold readiness wait never actually triggers because `proc.ready` is set synchronously in spawnProcess.

Because `spawnProcess` sets `pooled.ready = true` just before returning, the `spawnCold` promise always resolves immediately and never uses the `stdout.once("data", onReady)`/timeout path. As a result, `spawnCold` effectively behaves like `spawnWarm`.

If you do want to wait for CLI output before marking readiness, consider moving the `pooled.ready = true` assignment to the point where the first init message is parsed and use that here. If not, it would be clearer to remove the unused readiness logic from `spawnCold`.

Suggested implementation:

```typescript
  private async spawnCold(model: PooledModel): Promise<PooledProcess> {
    return this.spawnProcess(model);
  }

```

If the existing implementation of `spawnCold` in your local file differs slightly in formatting or structure (e.g. missing `return proc;` or different brace layout), adjust the SEARCH block accordingly so that the entire function body (including the `await new Promise(...)` readiness logic) is replaced with the simplified implementation that just returns `this.spawnProcess(model)`.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread src/subprocess/router.ts
Comment on lines +385 to +387
private async spawnCold(model: PooledModel): Promise<PooledProcess> {
const proc = this.spawnProcess(model);
await new Promise<void>((resolve) => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): The spawnCold readiness wait never actually triggers because proc.ready is set synchronously in spawnProcess.

Because spawnProcess sets pooled.ready = true just before returning, the spawnCold promise always resolves immediately and never uses the stdout.once("data", onReady)/timeout path. As a result, spawnCold effectively behaves like spawnWarm.

If you do want to wait for CLI output before marking readiness, consider moving the pooled.ready = true assignment to the point where the first init message is parsed and use that here. If not, it would be clearer to remove the unused readiness logic from spawnCold.

Suggested implementation:

  private async spawnCold(model: PooledModel): Promise<PooledProcess> {
    return this.spawnProcess(model);
  }

If the existing implementation of spawnCold in your local file differs slightly in formatting or structure (e.g. missing return proc; or different brace layout), adjust the SEARCH block accordingly so that the entire function body (including the await new Promise(...) readiness logic) is replaced with the simplified implementation that just returns this.spawnProcess(model).

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a session-aware process pool for the Claude CLI proxy, transitioning from a subprocess-per-request model to persistent, session-locked processes to improve performance and prevent context leakage. The changes include the SessionPoolRouter for pool management, request queuing, and orphan reclamation, along with integration into the Express routes and standalone server. Review feedback emphasizes the need for architectural consistency by using the canonical clearSessionLock method in all session termination paths, immediately rejecting queued requests upon session reset, and ensuring that client disconnects do not kill processes, as required by the specification.

Comment thread src/subprocess/router.ts
Comment on lines +288 to +290
this.lockedSessions.delete(sessionKey);
proc.lockedTo = null;
proc.orphaned = true;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This block manually deletes the session lock and clears lockedTo instead of using the canonical clearSessionLock() method. This violates the "Major" rule defined in the specification (Finding N8). Additionally, it should call this.rejectProcessQueue(proc) here to immediately reject any stale requests that might be queued for this session, ensuring they are failed as soon as the reset is detected rather than waiting for the active request to complete.

Suggested change
this.lockedSessions.delete(sessionKey);
proc.lockedTo = null;
proc.orphaned = true;
this.rejectProcessQueue(proc);
this.clearSessionLock(sessionKey, proc);
proc.orphaned = true;
References
  1. Every code path that unlocks a session MUST call the canonical clearSessionLock() function to prevent stale artifacts (Finding N8).

Comment thread src/subprocess/router.ts
// Clear the session lock BEFORE kill so the next request gets a fresh
// warm process instead of enqueuing behind the dead one.
this.clearSessionLock(sessionKey, proc);
this.killAndRespawn(proc);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The implementation kills and respawns the process when a client disconnects. This contradicts the specification in specs/session-pooling.spec.md (Module 2 AC and "Client Disconnect Handling"), which states that the process should be allowed to finish generating its response and then return to an idle state. While killing the process prevents context divergence, it increases latency for subsequent requests due to the need for a fresh process claim or cold start.

Comment thread src/subprocess/router.ts
} else {
// Busy — mark for reclamation after current request
proc.orphaned = true;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In the busy branch of orphan reclamation, the session lock is not cleared. While releaseProcess will eventually clear it, the specification (Finding N8) requires all unlock paths to use the canonical clearSessionLock() method. Clearing it here immediately (as is done for the idle case at line 951) would be more consistent and ensure the old session key is unbound immediately.

          this.clearSessionLock(key, proc);
          proc.orphaned = true;
References
  1. Every code path that unlocks a session MUST call the canonical clearSessionLock() function to prevent stale artifacts (Finding N8).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant