feat(orchestrator): startup hygiene — stale per-PID files + orphan detection#8
Open
evannadeau wants to merge 2 commits into
Per-PID `active-session-<pid>` files (introduced in 0.30.19+) make session_id lookup race-free under concurrent sessions, but nothing has been reaping them when the owning claude process exits. On a developer machine with many short-lived sessions per day, they accumulate indefinitely: 8 stale files observed in one project on 2026-05-13, from claude PIDs long since dead. The files are cosmetic in the sense that the legacy single `active-session` file remains the primary lookup, but a slow directory listing eventually becomes a real cost on a hot-spot workstation.

This patch adds a startup sweep that walks `<project>/.orchestrator-state/`, matches files of shape `active-session-<pid>`, probes liveness via `process.kill(pid, 0)`, and unlinks dead-PID entries. The probe is cross-platform via Node's API. Cheap, idempotent, race-safe (we only unlink files whose PID is verifiably gone). Lost races with concurrent sessions are tolerated; the next startup retries.

Runs once at MCP startup, unconditionally (even when the no-claude-ancestor branch is about to exit, so future startups benefit).

Tested: `bun run typecheck` clean, `bun test` 516 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
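The liveness probe described above can be sketched as follows. This is a minimal illustration, not the code from the patch: the helper name `isPidAlive` is hypothetical, and treating `EPERM` as "alive" is an assumption about how one would want the probe to behave (the process exists but belongs to another user).

```typescript
// Hypothetical helper; the commit only states that liveness is probed
// via process.kill(pid, 0).
function isPidAlive(pid: number): boolean {
  // Reject non-integer and non-positive values: on POSIX, pid 0 and
  // negative pids address process groups, not a single process.
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    // Signal 0 only tests whether the process can be signalled;
    // nothing is delivered to the target.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but we may not signal it (assumed alive).
    // ESRCH or anything else: treat the process as gone.
    return (err as { code?: string }).code === "EPERM";
  }
}
```

Returning `true` on `EPERM` errs on the side of keeping a file, which matches the stated goal of only unlinking entries whose PID is verifiably gone.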
Complements the existing orphan-bun watchdog (which catches "parent dies while I'm alive" cases for the current process). The watchdog only protects processes that LOADED the watchdog code - older bun processes whose in-memory bytecode predates a fix do not benefit from that fix, and can survive forever if their original parent claude died without triggering whatever watchdog they happen to be running.

Concretely: on a developer machine that pulls plugin updates, an MCP process loaded at time T1 may still be alive after the on-disk `dist/server.js` is rebuilt at T2 > T1. If the parent claude that spawned T1's bun dies after T2, the T1 bun's in-memory watchdog code is the version from T1 - any later improvements to watchdog detection are invisible to it. We observed this 2026-05-13: an orphan bun survived ~30 minutes across multiple watchdog tick intervals before manual cleanup via `kill -9`.

This patch adds a startup-time scan (Linux only) that walks `/proc` for bun processes whose cmdline references `orchestrator/dist/server.js` and whose parent chain contains no live `claude` process within 8 hops. Suspects are logged with diagnostic guidance; we do NOT auto-kill, because sibling MCPs may co-own infrastructure shared across live sessions (the python sidecar is deliberately shared via `.sidecar-port`). Detection surfaces the issue; the operator decides.

Windows is unchanged - `killOlderDuplicateMcps` already handles a related case (siblings sharing our parent claude). Pure orphans on Windows are rare because parent death typically reaps children.

Tested: `bun run typecheck` clean, `bun test` 516 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
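The parent-chain scan described above could look roughly like the sketch below. All names here (`ProcInfo`, `readProcTable`, `hasClaudeAncestor`, `findLikelyOrphans`) are hypothetical, and matching on a `comm` of exactly `bun` or `claude` is an assumption; the patch may identify processes differently. The detection logic is kept separate from the `/proc` reader so it can be exercised with a hand-built table.

```typescript
import { readdirSync, readFileSync } from "node:fs";

interface ProcInfo { pid: number; ppid: number; comm: string; cmdline: string }

const MAX_HOPS = 8; // the hop limit stated in the commit message

// Linux only: snapshot pid, ppid, comm and cmdline for every process in /proc.
function readProcTable(): Map<number, ProcInfo> {
  const table = new Map<number, ProcInfo>();
  for (const entry of readdirSync("/proc")) {
    if (!/^\d+$/.test(entry)) continue;
    try {
      const stat = readFileSync(`/proc/${entry}/stat`, "utf8");
      // Format: "pid (comm) state ppid ..."; comm may itself contain ")".
      const close = stat.lastIndexOf(")");
      const comm = stat.slice(stat.indexOf("(") + 1, close);
      const ppid = Number(stat.slice(close + 2).split(" ")[1]);
      const cmdline = readFileSync(`/proc/${entry}/cmdline`, "utf8")
        .split("\0").join(" ").trim();
      table.set(Number(entry), { pid: Number(entry), ppid, comm, cmdline });
    } catch {
      // The process exited between listing and reading; skip it.
    }
  }
  return table;
}

// Walk up the parent chain, at most maxHops steps, looking for a claude process.
function hasClaudeAncestor(pid: number, table: Map<number, ProcInfo>, maxHops = MAX_HOPS): boolean {
  let current = table.get(pid);
  for (let hop = 0; hop < maxHops && current; hop++) {
    const parent = table.get(current.ppid);
    if (!parent) return false;                 // chain ended without a match
    if (parent.comm === "claude") return true; // assumption: comm is "claude"
    current = parent;
  }
  return false;
}

// Detection only: returns suspects for logging and never signals them.
function findLikelyOrphans(table: Map<number, ProcInfo>, selfPid: number): ProcInfo[] {
  return [...table.values()].filter((p) =>
    p.pid !== selfPid &&
    p.comm === "bun" && // assumption: comm is "bun"
    p.cmdline.includes("orchestrator/dist/server.js") &&
    !hasClaudeAncestor(p.pid, table));
}
```

On Linux, `findLikelyOrphans(readProcTable(), process.pid)` would produce the list to log. The hop bound also guards against a malformed table containing a parent cycle.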
evannadeau added a commit to evannadeau/claude-plugins that referenced this pull request May 31, 2026
…reaper regression (#4)

Two related fixes that share a root cause: a fresh single-claude launch can end up with the MCP server's selfSession.session_id drifted from the harness session_id, and nothing reaps the per-PID active-session-<pid> files that feed the drift.

Bug 1 (phantom-sibling false positive): On a cold start with no /resume, if a stale active-session-<pid> file from a prior session collides on PID with the new claude process (or the legacy single-file fallback wins a race), the MCP self-registers in agent_channel.db under an id8 that is NOT the harness session_id. The every-turn cross-session injection then compares against the harness id only and reports the MCP's own row as "1 sibling session active" for the whole session lifetime - heartbeat keeps advancing so the bug never self-clears.

Fix: live_sessions.ts gains a process-wide self-id filter. server.ts registers the MCP's selfSession id at startAgentChannel time AND on the first explicit session_id observed via resolveSessionId. getLiveOtherSessionIds excludes BOTH the caller id and the registered self id, preventing the phantom even when the two disagree.

Bug 2 (stale per-PID active-session file reaper): The per-PID active-session-<pid> scheme (0.30.19+) makes session_id lookup race-free under concurrent sessions, but nothing reaps the files when the owning claude process exits. They accumulate indefinitely - one project hit 30 stale files in ~12 days. Worse, PID reuse hands the fallback resolver a session_id that is no longer live, feeding directly into Bug 1.

Fix: extract reapStaleActiveSessionFiles into its own engine module (testable in isolation with an injectable liveness probe) and wire it at MCP server startup. Cheap, idempotent, race-safe via process.kill(pid, 0).

Originally proposed upstream as spawnbox-dev/claude-plugins PR SpawnBox-dev#8; that PR never merged so this fork carries the fix.
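The self-id filter from the Bug 1 fix can be illustrated with a small sketch. Only the name `getLiveOtherSessionIds` comes from the commit message; its signature here, the `registerSelfSessionId` helper, and the example ids are assumptions made for illustration, and the real module presumably reads live sessions from `agent_channel.db` rather than taking a list.

```typescript
// Process-wide record of the id the MCP registered itself under.
let registeredSelfId: string | null = null;

// Hypothetical registration hook; calling it again with the same id is a no-op.
function registerSelfSessionId(id: string): void {
  registeredSelfId = id;
}

// Assumed signature: takes the ids of all live sessions and the caller's
// (harness) id, and returns only genuine siblings.
function getLiveOtherSessionIds(liveIds: string[], callerId: string): string[] {
  // Exclude BOTH the caller id and the registered self id, so a drifted
  // self-registration is not reported as a sibling.
  return liveIds.filter((id) => id !== callerId && id !== registeredSelfId);
}

// Drift scenario: the MCP registered as "aaaa1111", the harness id is "bbbb2222".
const live = ["aaaa1111", "bbbb2222"];
const beforeFix = getLiveOtherSessionIds(live, "bbbb2222"); // phantom sibling present
registerSelfSessionId("aaaa1111");
const afterFix = getLiveOtherSessionIds(live, "bbbb2222");  // phantom filtered out
```

Filtering on both ids means the result stays correct whether or not the two ids agree, which is the property the commit message describes.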
Tests:
- tests/engine/live_sessions_phantom_self.test.ts (3 cases) - covers drift, genuine siblings still surface, idempotent self-id updates.
- tests/engine/startup_hygiene.test.ts (6 cases) - covers reap of dead-PID files, non-PID files ignored, missing dir no-op, empty dir no-op, real process.kill probe smoke test, pid<=0 rejection.
- Full suite: 599 pass / 0 fail (was 590 baseline).
- tsc --noEmit clean.

Version: 0.30.52 -> 0.30.53.

Co-authored-by: Evan Nadeau <1878498+evannadeau@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
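A reaper with an injectable liveness probe, of the kind the test list above exercises, might be shaped like the sketch below. The function name `reapStaleActiveSessionFiles` appears in the commit message, but the signature, the return value, and the `LivenessProbe` type are assumptions rather than the module's actual interface.

```typescript
import { existsSync, readdirSync, unlinkSync } from "node:fs";
import { join } from "node:path";

// Assumed probe shape: returns true when the PID belongs to a live process.
type LivenessProbe = (pid: number) => boolean;

// Matches only per-PID files; the legacy `active-session` file has no suffix.
const ACTIVE_SESSION_RE = /^active-session-(\d+)$/;

// Returns the number of files removed (assumed return value).
function reapStaleActiveSessionFiles(stateDir: string, isAlive: LivenessProbe): number {
  if (!existsSync(stateDir)) return 0; // missing directory: silent no-op
  let reaped = 0;
  for (const name of readdirSync(stateDir)) {
    const match = ACTIVE_SESSION_RE.exec(name);
    if (!match) continue;                    // non-PID files are ignored
    const pid = Number(match[1]);
    if (pid <= 0 || isAlive(pid)) continue;  // reject pid<=0, keep live PIDs
    try {
      unlinkSync(join(stateDir, name));
      reaped++;
    } catch {
      // Lost a race with a concurrent session; the next startup retries.
    }
  }
  return reaped;
}
```

In production the probe would wrap `process.kill(pid, 0)`; in tests a fake such as `(pid) => pid === 222` makes the dead-PID, non-PID, and missing-directory cases deterministic without depending on real processes.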
Summary

Two additive startup-time hygiene improvements in `plugins/orchestrator/mcp/server.ts`. Both run once per MCP startup, both are detection/cleanup (never auto-kill), both are no-ops in the steady state.

1. Reap stale per-PID `active-session-<pid>` files (b7f43b7)

Per-PID `active-session-<pid>` files (introduced in 0.30.19+) make session_id lookup race-free under concurrent sessions, but nothing has been reaping them when the owning claude process exits. On a developer machine with many short-lived sessions per day, they accumulate indefinitely: 8 stale files observed in one project on 2026-05-13, from claude PIDs long since dead.

They are cosmetic (the legacy single `active-session` file remains the primary lookup), but a slow directory listing eventually becomes a real cost.

This patch adds a startup sweep that walks `<project>/.orchestrator-state/`, matches files of shape `active-session-<pid>`, probes liveness via `process.kill(pid, 0)`, and unlinks dead-PID entries. Cross-platform; cheap; idempotent; race-safe (only unlinks PIDs verified gone). Lost races with concurrent sessions tolerated; next startup retries.

2. Warn about likely-orphan sibling MCPs (a28388e)

Complements the existing orphan-bun watchdog (which catches "parent dies while I'm alive" for the current process). The watchdog only protects processes that LOADED the watchdog code: older bun processes whose in-memory bytecode predates a fix do not benefit, and can survive forever if their original parent claude died without triggering whatever watchdog they happen to be running.

Concretely observed 2026-05-13: an orphan bun survived ~30 minutes across multiple watchdog tick intervals before manual `kill -9`. The on-disk `dist/server.js` had been rebuilt while the orphan was running, so any subsequent watchdog improvements were invisible to it.

This patch adds a startup-time scan (Linux only) that walks `/proc` for bun processes whose cmdline references `orchestrator/dist/server.js` and whose parent chain contains no live `claude` process within 8 hops. Suspects are logged with diagnostic guidance.

Detection only: does NOT auto-kill. Sibling MCPs may co-own infrastructure shared across live sessions (the python sidecar is deliberately shared via `.sidecar-port`; killing an unrelated bun could take down a live session's embeddings). The operator decides whether to clean up.

Windows is unchanged: `killOlderDuplicateMcps` already handles a related case (siblings sharing our parent claude). Pure orphans on Windows are rare because parent death typically reaps children.

Why "detection, not auto-kill"

The sidecar reuse pattern at `startSidecar()` line 338–350 deliberately shares the python embedding server across MCPs. An auto-kill could take down a sidecar that a live session depends on. Surfacing the problem at startup gives the operator the information without the risk.

Files changed

- `plugins/orchestrator/mcp/server.ts`: 2 new functions + 1 startup block wiring them in
- `plugins/orchestrator/dist/server.js`: rebuilt via `bun run build`

Imports updated: added `readdirSync` and `unlinkSync` to the existing `node:fs` import line.

Tested

- `bun run typecheck`: clean
- `bun test`: 516 pass / 0 fail / 38 files / 1207 assertions (no test changes)
- … `active-session-<pid>` files in a real workspace were correctly identified as dead and could be removed by the same logic the reaper applies.

Test plan

- … `.orchestrator-state/` directory: both functions should be silent no-ops.
- … `active-session-<pid>` files: verify the startup log line reports the reap count and the files are gone.
- … `claude` ancestor).

🤖 Generated with Claude Code