Skip to content

fix(runtime): clean up stale brokers and lock state updates#343

Open
60ke wants to merge 1 commit into
openai:mainfrom
60ke:codex/fix-broker-state-races
Open

fix(runtime): clean up stale brokers and lock state updates#343
60ke wants to merge 1 commit into
openai:mainfrom
60ke:codex/fix-broker-state-races

Conversation

@60ke
Copy link
Copy Markdown

@60ke 60ke commented May 22, 2026

Problem

Stale broker sessions could remove their socket and metadata without terminating the recorded broker process when no test-only killProcess override was supplied. State updates also used an unlocked load/mutate/save sequence, so concurrent workers could overwrite each other and drop job records.

Changes

  • Default broker teardown to terminateProcessTree and pass that default through stale-session and failed-start cleanup.
  • Add a small cross-process state lock around the full state transaction so concurrent upsertJob calls preserve each other.
  • Add regression coverage for stale broker termination and concurrent job state writes.

Testing

  • node --test tests/broker-lifecycle.test.mjs tests/state.test.mjs
  • node --test tests/broker-endpoint.test.mjs tests/bump-version.test.mjs tests/commands.test.mjs tests/git.test.mjs tests/process.test.mjs tests/render.test.mjs
  • git diff --cached --check

Note: local npm run build reaches tsc but fails on generated app-server type drift unrelated to this change (requestAttestation / experimentalRawEvents).

@60ke 60ke requested a review from a team May 22, 2026 09:34
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6142e6cd08

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread plugins/codex/scripts/lib/state.mjs Outdated
@60ke 60ke force-pushed the codex/fix-broker-state-races branch from 6142e6c to bc4de41 Compare May 22, 2026 09:41
@60ke
Copy link
Copy Markdown
Author

60ke commented May 22, 2026

Good catch. The timeout needs to be longer than the stale-lock recovery window, otherwise a fresh orphaned lock can make every writer fail before recovery is allowed. I updated the PR so STATE_LOCK_TIMEOUT_MS is derived from STATE_LOCK_STALE_MS + 5000, then reran node --test tests/broker-lifecycle.test.mjs tests/state.test.mjs and git diff --check.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc4de41deb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 181 to 185
killProcess = terminateProcessTree
}) {
if (Number.isFinite(pid)) {
try {
killProcess(pid);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Validate stale-session PID before killing process tree

This change makes teardownBrokerSession kill any finite persisted pid by default, but ensureBrokerSession can reach this path using stale broker.json data when the endpoint is no longer reachable. In that scenario, PID reuse can cause terminateProcessTree to signal an unrelated process (or process group) that now owns the same PID, so simply restarting the companion after a crash can terminate the wrong workload. Please gate termination on broker identity (for example, command/start-time checks) before sending signals.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I updated the broker teardown path so the default signal is now gated on broker identity before calling terminateProcessTree: the pid file must still point at the same pid and the live command line must look like app-server-broker.mjs serve for the same endpoint/pid file. If that check fails, teardown only removes the stale metadata/socket files and leaves the process alone.

I also added regression coverage for both sides: one test verifies a matching stale broker is terminated, and another verifies a reused stale pid running unrelated Node code is not killed. Reran node --test tests/broker-lifecycle.test.mjs tests/state.test.mjs and git diff --check.

@Pelton
Copy link
Copy Markdown

Pelton commented May 22, 2026

Hitting the same broker-leak symptoms on macOS (Claude Code with codex@openai-codex v1.0.4): app-server-broker.mjs plus node codex app-server and the Rust codex app-server survive past SessionEnd, reparent to launchd (PPID 1), and accumulate across sessions.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: df73eb3129

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread plugins/codex/scripts/lib/state.mjs Outdated
@60ke
Copy link
Copy Markdown
Author

60ke commented May 23, 2026

Hitting the same broker-leak symptoms on macOS (Claude Code with codex@openai-codex v1.0.4): app-server-broker.mjs plus node codex app-server and the Rust codex app-server survive past SessionEnd, reparent to launchd (PPID 1), and accumulate across sessions.

Thanks for the macOS details. The latest patch keeps the SessionEnd/stale cleanup behavior aimed at the broker leak, but now validates the persisted pid before signalling: the live command has to match this broker for the same endpoint/pid file. That should still reap the leaked broker process group while avoiding PID-reuse kills of unrelated workloads.

@60ke 60ke force-pushed the codex/fix-broker-state-races branch from df73eb3 to 8aea41f Compare May 23, 2026 04:43
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8aea41f757

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread plugins/codex/scripts/lib/broker-lifecycle.mjs Outdated
Comment thread plugins/codex/scripts/lib/state.mjs Outdated
@60ke 60ke force-pushed the codex/fix-broker-state-races branch from 8aea41f to 217aa3c Compare May 23, 2026 09:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants