fix(runtime): clean up stale brokers and lock state updates#343
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6142e6cd08
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
6142e6c to
bc4de41
Compare
|
Good catch. The timeout needs to be longer than the stale-lock recovery window, otherwise a fresh orphaned lock can make every writer fail before recovery is allowed. I updated the PR so |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bc4de41deb
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| killProcess = terminateProcessTree | ||
| }) { | ||
| if (Number.isFinite(pid)) { | ||
| try { | ||
| killProcess(pid); |
There was a problem hiding this comment.
Validate stale-session PID before killing process tree
This change makes teardownBrokerSession kill any finite persisted pid by default, but ensureBrokerSession can reach this path using stale broker.json data when the endpoint is no longer reachable. In that scenario, PID reuse can cause terminateProcessTree to signal an unrelated process (or process group) that now owns the same PID, so simply restarting the companion after a crash can terminate the wrong workload. Please gate termination on broker identity (for example, command/start-time checks) before sending signals.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch. I updated the broker teardown path so the default signal is now gated on broker identity before calling terminateProcessTree: the pid file must still point at the same pid and the live command line must look like app-server-broker.mjs serve for the same endpoint/pid file. If that check fails, teardown only removes the stale metadata/socket files and leaves the process alone.
I also added regression coverage for both sides: one test verifies a matching stale broker is terminated, and another verifies a reused stale pid running unrelated Node code is not killed. Reran node --test tests/broker-lifecycle.test.mjs tests/state.test.mjs and git diff --check.
|
Hitting the same broker-leak symptoms on macOS (Claude Code with |
bc4de41 to
df73eb3
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: df73eb3129
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Thanks for the macOS details. The latest patch keeps the SessionEnd/stale cleanup behavior aimed at the broker leak, but now validates the persisted pid before signalling: the live command has to match this broker for the same endpoint/pid file. That should still reap the leaked broker process group while avoiding PID-reuse kills of unrelated workloads. |
df73eb3 to
8aea41f
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8aea41f757
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
8aea41f to
217aa3c
Compare
Problem
Stale broker sessions could remove their socket and metadata without terminating the recorded broker process when no test-only
killProcessoverride was supplied. State updates also used an unlocked load/mutate/save sequence, so concurrent workers could overwrite each other and drop job records.Changes
terminateProcessTreeand pass that default through stale-session and failed-start cleanup.upsertJobcalls preserve each other.Testing
node --test tests/broker-lifecycle.test.mjs tests/state.test.mjsnode --test tests/broker-endpoint.test.mjs tests/bump-version.test.mjs tests/commands.test.mjs tests/git.test.mjs tests/process.test.mjs tests/render.test.mjsgit diff --cached --checkNote: local
npm run buildreachestscbut fails on generated app-server type drift unrelated to this change (requestAttestation/experimentalRawEvents).