Session pool reset detection via message count drop#7
Conversation
Addresses 3 critical (request serialization, orphan reclamation, context accumulation), 5 major (process death, flex zone, disconnect, auth, unknown model), and 6 minor (DST, stats, testing, commits, logging, caps) findings from Opus architectural review. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three major findings resolved: - N1: Orphan queue rejection (reject stale-session requests, don't drain) - N2: Per-request timeout (POOL_REQUEST_TIMEOUT_MS, 5 min default) - N3: Atomic pool claim via PENDING_SENTINEL (prevents race on new keys) Four minor findings resolved: - N4: agentChannel format validation with fallback - N5: Off-by-one wording fix in backpressure validation - N6: ClaudeSubprocess fallback documented as uncapped (intentional) - N7: Aggregate route-hit counters added to stats() Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…review N8: Canonical clearSessionLock() for all unlock paths N9: Failed cold spawn rejects queued requests before sentinel deletion N10: Fallback serialization loss documented as known degradation N11: Sweep refill checks cap per-spawn, not once N12: Shutdown closes listening socket before draining N13: Health endpoint includes locked counts per model N14: Sweep skips processes in recycling state Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… review N15: Router checks POOLED_MODELS set, not type-level unknowns (extractModel defaults to opus) N16: PENDING_SENTINEL specified as lightweight object with requestQueue N17: clearSessionLock called after drain completes, not before N18: standalone.ts owns full shutdown sequence (no pool threading into index.ts) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…weep (session-pooling.spec.md) - M1: SessionPoolRouter (src/subprocess/router.ts) - Per-model warm pools (opus, sonnet) with configurable sizes - Session-key locking with lockedSessions map - PendingSentinel atomic pool claim (prevents race on new-key requests) - Per-process FIFO request queue (max depth 3) with 429 backpressure - Lineage tracking (agentChannel) for orphan detection on session reset - Orphan queue rejection: queued requests rejected 503, not drained - Context accumulation guard: recycle process when requestCount > threshold - canonical clearSessionLock() used by all unlock paths - Failed cold spawn recovery: reject queued requests before deleting sentinel - Per-request timeout (POOL_REQUEST_TIMEOUT_MS): hung process treated as dead - Process death recovery: atomic cleanup, reject queue, spawn replacement - Auth error detection via stderr pattern matching - ClaudeSubprocess fallback for non-pooled models and headerless requests - Nightly sweep: idle > 2h or requestCount > threshold; refill warm pool - stats() returns full RouterStats (total, locked, warm, busy, queued, etc.) - M2: Route integration (src/server/routes.ts) - Extracts x-openclaw-session-key header; routes through SessionPoolRouter.execute() - Fallback to ClaudeSubprocess if no session key or no pool - Client disconnect: detach emitter for pooled process (let it finish); kill for subprocess fallback - Health endpoint includes pool stats from stats() - Structured logging per request: sessionKey, model, pid, latencyMs - M3: Server startup and sweep scheduling (src/server/standalone.ts) - Initializes SessionPoolRouter on startup with env var config - Nightly sweep via node-cron at 3 AM ET (DST-aware timezone: America/New_York) - Graceful shutdown: close socket immediately, wait 30s, router.shutdown(), exit - All pool sizes configurable via env vars (POOL_OPUS_SIZE, POOL_SONNET_SIZE, etc.) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ate CLAUDE.md - Remove src/subprocess/pool.ts (shared stateless pool prototype — superseded by SessionPoolRouter) - Remove src/server/standalone-pool.ts (prototype server entry point — superseded by production standalone.ts) - Update CLAUDE.md with session-pooled architecture docs, request routing diagram, env var table These were untracked prototype files. The production implementation is in router.ts (M1). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n, M2-02 agent-id header, M2-01 queueDepth+cacheHit logging Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two inline lockedSessions.delete() calls at sentinel cleanup points (pool-at-capacity fallback and cold-spawn-failure) bypassed clearSessionLock(). These are PendingSentinel entries (not PooledProcess), so the canonical function signature is extended to accept null for proc. The spec grep validation — lockedSessions.delete appearing ONLY inside clearSessionLock() — now passes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ation Session Pooling — Claude Max API Proxy
The gateway does not send x-openclaw-session-key headers on outbound provider calls. Without this fallback, all production requests bypassed the session pool and fell through to one-shot ClaudeSubprocess. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ash-handler.spec.md) Adds process-level error handlers so silent crashes log [FATAL] to stderr before exiting. Fixes the 04:35/06:14/06:32 silent crash pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…safe-emit.spec.md) Prevents uncaught exception crash when error is emitted into a bare EventEmitter after routes.ts has stripped listeners via removeAllListeners(). Sites fixed: child process error handler, request timeout, handleProcessDeath, shutdown loop. Suppressed errors are logged to stderr with process ID and error message.
Converts safeEmitter(emitter) wrapper to safeEmitter() factory — EventEmitter is created inside the function, preventing future call sites from forgetting the guard. All 7 creation sites updated; build passes clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The complex version (activeListeners counter + monkey-patched on/off/removeListener +
prependListener guard) was over-engineered. A permanent `emitter.on('error', ...)`
at creation time is sufficient — it guarantees listenerCount('error') >= 1, preventing
Node's uncaught exception on emit('error') with zero listeners.
Root cause: routes.ts calls removeAllListeners() on client disconnect, stripping all
listeners. If the CLI process emits an error after cleanup, Node throws (EventEmitter
contract: emit('error') with 0 listeners = throw). In Node v25, unhandled exceptions
exit silently with code 1.
Defense layers:
1. safeEmitter() — permanent error listener prevents the throw (this commit)
2. uncaughtException/unhandledRejection handlers in standalone.ts (commit 4ef1787)
Known limitation: removeAllListeners() in routes.ts:178 can strip the permanent
listener. A follow-up should change to targeted removeAllListeners('data') etc.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ghtException handlers Adds [FATAL] logging for previously-silent crashes (Node v25 exits code 1 with no output on unhandled rejections/exceptions).
…stener Root cause fix for recurring proxy crashes. EventEmitter error events emitted after removeAllListeners() caused Node uncaught exceptions → silent exit code 1 in Node v25. safeEmitter() factory attaches a permanent error listener at creation time, preventing the zero-listener throw. Applied to all 7 EventEmitter creation sites in router.ts.
router.ts: - Add safeListenerCount() helper — wraps listenerCount() in try-catch, returns 0 if emitter is corrupted (prevents [FATAL] crash) - Update timeout handler and process error handler to use safeListenerCount() - Add else-branch logging when listeners are absent (observability) routes.ts: - Replace removeAllListeners() with targeted removeListener() calls - Convert inline emitter.on() to named function refs (onTextBlockStart, onContentDelta, onAssistant, onResult, onError) - Client disconnect now removes only request-specific listeners, preserving safeEmitter()'s permanent error listener Fixes: pool crash on corrupted emitter + removeAllListeners stripping permanent error guard
Logs arrival time, sessionKey, model, messageCount, contentLength at the top of handleChatCompletions — before any routing logic. Also logs route decision (routeType, pid, queueDepth, elapsedMs). Purpose: diagnose dropped requests where gateway times out but proxy has no record of receiving the request.
Root cause: safeEmitter patches prevented crashes but left processes in
zombie 'busy' state when the gateway timed out and disconnected. The
process would stay locked to the session, still processing a response
nobody was consuming. Next request to the same session routed to the
zombie process → hung → cascading timeout.
Fix: When res.on('close') fires and the request isn't complete:
1. Remove request-specific listeners (existing)
2. Call router.cancelRequest(sessionKey) (NEW)
3. cancelRequest kills the stuck CLI process and respawns a fresh one
This preserves safeEmitter's crash prevention while ensuring processes
don't stay in zombie state after client disconnect.
Send ': keep-alive' SSE comments every 15s while waiting for the CLI to produce the first real token. SSE comments (lines starting with ':') are ignored by OpenAI SDK parsers but reset the HTTP connection idle timer, preventing the gateway's 60s timeout from firing during opus thinking time on large prompts. Timer is cleared on: first content, result, error, or client disconnect. Root cause: opus on 100KB+ prompts can take >60s to produce first token. The gateway's OpenAI SDK times out at 60s idle, triggering cascade to cyberdyne/deepseek even though the proxy and CLI are working correctly.
The SSE comment approach was wrong — OpenClaw's idle timeout tracks the parsed token stream, not raw HTTP bytes, so comment lines don't reset it. Worse, the ': keep-alive' SSE comments appear to have confused the gateway's stream parser, causing responses to complete at the proxy but not reach Discord. The correct fix is agents.defaults.llm.idleTimeoutSeconds = 300 in openclaw.json (already applied), which configures the idle timer directly.
Bug: cancelRequest() killed the process but never called clearSessionLock(). The lockedSessions map kept pointing at the dead process. Every subsequent request for the same session found the dead process, enqueued behind it (queueDepth=1), waited 300s for idleTimeout, cancelled, and repeated indefinitely — a permanent cascade loop. Fix: call clearSessionLock(sessionKey, proc) before killAndRespawn so the next request routes to a fresh warm process instead of queuing behind a dead one. Evidence: pid 341522 was spawned at 18:49, received 4 requests over 3.5h, all cancelled at 300s, never logged a process_death, always showed queueDepth=1 — the lock was never released.
…n pooled processes (session-pool-context-fix.spec.md)
safeEmitter() guarantees a permanent error listener, making all
listenerCount("error") > 0 checks always-true dead code. Remove all 8
guards and emit unconditionally. Also remove PendingRequest.resolve
which was always a no-op.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…uard (session-pool-context-fix.spec.md) - onError() now removes all event listeners before resolving (was leaving text_block_start/content_delta/assistant/result attached after error) - extractText() filters blocks where text is null/undefined to prevent silent empty returns on malformed content arrays Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Track lastMessageCount on PooledProcess; recycle stale CLI process when incoming messageCount drops below stored value, indicating a gateway reset (/new, idle timeout, or compaction). Threads messageCount through execute(), routeToProcess(), assignToProcess(), enqueueOnProcess(), enqueueOnSentinel(), drainNextRequest(), and PendingRequest. Routes layer passes messages.length. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- In SessionPoolRouter.execute(), the busy-process session reset path directly calls
lockedSessions.delete(sessionKey)and mutatesproc.lockedTorather than going throughclearSessionLock(), which breaks the invariant that all unlocks use the canonical helper and leaves fields likeagentChannel/requestCountuncleared; consider refactoring this branch to reuseclearSessionLock()(or a variant that marks the process orphaned) to keep lock state consistent.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In SessionPoolRouter.execute(), the busy-process session reset path directly calls `lockedSessions.delete(sessionKey)` and mutates `proc.lockedTo` rather than going through `clearSessionLock()`, which breaks the invariant that all unlocks use the canonical helper and leaves fields like `agentChannel`/`requestCount` uncleared; consider refactoring this branch to reuse `clearSessionLock()` (or a variant that marks the process orphaned) to keep lock state consistent.
## Individual Comments
### Comment 1
<location path="src/subprocess/router.ts" line_range="385-387" />
<code_context>
+ this.warmPool.get(model)!.push(proc);
+ }
+
+ private async spawnCold(model: PooledModel): Promise<PooledProcess> {
+ const proc = this.spawnProcess(model);
+ await new Promise<void>((resolve) => {
+ if (proc.ready) {
+ resolve();
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The spawnCold readiness wait never actually triggers because `proc.ready` is set synchronously in spawnProcess.
Because `spawnProcess` sets `pooled.ready = true` just before returning, the `spawnCold` promise always resolves immediately and never uses the `stdout.once("data", onReady)`/timeout path. As a result, `spawnCold` effectively behaves like `spawnWarm`.
If you do want to wait for CLI output before marking readiness, consider moving the `pooled.ready = true` assignment to the point where the first init message is parsed and use that here. If not, it would be clearer to remove the unused readiness logic from `spawnCold`.
Suggested implementation:
```typescript
private async spawnCold(model: PooledModel): Promise<PooledProcess> {
return this.spawnProcess(model);
}
```
If the existing implementation of `spawnCold` in your local file differs slightly in formatting or structure (e.g. missing `return proc;` or different brace layout), adjust the SEARCH block accordingly so that the entire function body (including the `await new Promise(...)` readiness logic) is replaced with the simplified implementation that just returns `this.spawnProcess(model)`.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| private async spawnCold(model: PooledModel): Promise<PooledProcess> { | ||
| const proc = this.spawnProcess(model); | ||
| await new Promise<void>((resolve) => { |
There was a problem hiding this comment.
suggestion (bug_risk): The spawnCold readiness wait never actually triggers because proc.ready is set synchronously in spawnProcess.
Because spawnProcess sets pooled.ready = true just before returning, the spawnCold promise always resolves immediately and never uses the stdout.once("data", onReady)/timeout path. As a result, spawnCold effectively behaves like spawnWarm.
If you do want to wait for CLI output before marking readiness, consider moving the pooled.ready = true assignment to the point where the first init message is parsed and use that here. If not, it would be clearer to remove the unused readiness logic from spawnCold.
Suggested implementation:
private async spawnCold(model: PooledModel): Promise<PooledProcess> {
return this.spawnProcess(model);
}If the existing implementation of spawnCold in your local file differs slightly in formatting or structure (e.g. missing return proc; or different brace layout), adjust the SEARCH block accordingly so that the entire function body (including the await new Promise(...) readiness logic) is replaced with the simplified implementation that just returns this.spawnProcess(model).
There was a problem hiding this comment.
Code Review
This pull request implements a session-aware process pool for the Claude CLI proxy, transitioning from a subprocess-per-request model to persistent, session-locked processes to improve performance and prevent context leakage. The changes include the SessionPoolRouter for pool management, request queuing, and orphan reclamation, along with integration into the Express routes and standalone server. Review feedback emphasizes the need for architectural consistency by using the canonical clearSessionLock method in all session termination paths, immediately rejecting queued requests upon session reset, and ensuring that client disconnects do not kill processes, as required by the specification.
| this.lockedSessions.delete(sessionKey); | ||
| proc.lockedTo = null; | ||
| proc.orphaned = true; |
There was a problem hiding this comment.
This block manually deletes the session lock and clears lockedTo instead of using the canonical clearSessionLock() method. This violates the "Major" rule defined in the specification (Finding N8). Additionally, it should call this.rejectProcessQueue(proc) here to immediately reject any stale requests that might be queued for this session, ensuring they are failed as soon as the reset is detected rather than waiting for the active request to complete.
| this.lockedSessions.delete(sessionKey); | |
| proc.lockedTo = null; | |
| proc.orphaned = true; | |
| this.rejectProcessQueue(proc); | |
| this.clearSessionLock(sessionKey, proc); | |
| proc.orphaned = true; |
References
- Every code path that unlocks a session MUST call the canonical clearSessionLock() function to prevent stale artifacts (Finding N8).
| // Clear the session lock BEFORE kill so the next request gets a fresh | ||
| // warm process instead of enqueuing behind the dead one. | ||
| this.clearSessionLock(sessionKey, proc); | ||
| this.killAndRespawn(proc); |
There was a problem hiding this comment.
The implementation kills and respawns the process when a client disconnects. This contradicts the specification in specs/session-pooling.spec.md (Module 2 AC and "Client Disconnect Handling"), which states that the process should be allowed to finish generating its response and then return to an idle state. While killing the process prevents context divergence, it increases latency for subsequent requests due to the need for a fresh process claim or cold start.
| } else { | ||
| // Busy — mark for reclamation after current request | ||
| proc.orphaned = true; | ||
| } |
There was a problem hiding this comment.
In the busy branch of orphan reclamation, the session lock is not cleared. While releaseProcess will eventually clear it, the specification (Finding N8) requires all unlock paths to use the canonical clearSessionLock() method. Clearing it here immediately (as is done for the idle case at line 951) would be more consistent and ensure the old session key is unbound immediately.
this.clearSessionLock(key, proc);
proc.orphaned = true;References
- Every code path that unlocks a session MUST call the canonical clearSessionLock() function to prevent stale artifacts (Finding N8).
Summary
lastMessageCounttoPooledProcess(init: 0) andmessageCounttoPendingRequest/new, idle timeout, compaction) by tracking incomingmessageCountper requestmessageCount < proc.lastMessageCounton a locked process: recycles stale CLI process and routes to a fresh warm/cold slotclearSessionLock+killAndRespawn)orphaned=true, clears lock from map;releaseProcesskills after current request completesmessageCountthrough all call sites:execute(),routeToProcess(),assignToProcess(),enqueueOnProcess(),enqueueOnSentinel(),drainNextRequest(), cold spawn.then()body.messages.lengthtorouter.execute()Test plan
session_reset_detectedlogged, fresh process claimedmessageCount: 0edge case → no recycle (guard:messageCount > 0)proc.lastMessageCount > 0)orphaned, completes current request, then recyclesnpm run buildpasses clean ✅🤖 Generated with Claude Code