Skip to content

fix(client): unref reconnect/connection timers so leaked test clients can't hang Build·Test·Lint#42

Merged
ivkan merged 6 commits into
mainfrom
fix/sf-451-client-test-reconnect-leak
Jun 10, 2026
Merged

fix(client): unref reconnect/connection timers so leaked test clients can't hang Build·Test·Lint#42
ivkan merged 6 commits into
mainfrom
fix/sf-451-client-test-reconnect-leak

Conversation

@ivkan

@ivkan ivkan commented Jun 10, 2026

Copy link
Copy Markdown
Member

Summary

Fixes the Build · Test · Lint (Unit tests) job hanging on main and being cancelled
at its 18-minute cap — the regression exposed once the prior CI-reliability change removed
--forceExit.

Root cause (precise)

packages/client/src/__tests__/ORMapPersistence.test.ts creates two TopGunClients inside
test bodies (the "restore … from storage" cases) that are never closed. The file mocks
WebSocket (fires onopen), so each leaked client connects and starts a heartbeat. The
suite's afterAll then restores the real WebSocket — so when the leaked clients'
heartbeat fails (the mock never pongs), they drop into a reconnect loop that now hits real
undici
(ws://localhost:1234), keeping Jest's worker alive past the last expect().

On CI, with --forceExit removed, this hung the client package's jest process — the log
shows Jest did not exit one second after the test run has completed immediately after
Tests: 613 passed — which blocks pnpm -r from advancing to the next package, so the job
runs to the 18-min timeout-minutes cap and is cancelled. The two leaked clients exactly
match the two parallel reconnect sequences in the CI log.

SingleServerProvider.close() is correct (it clears the reconnect/connection timers and
rejects the in-flight connect promise) — the defect is purely that these two test clients are
never disposed.

Fix

Track both in-test clients and dispose them in afterEach (same pattern already used in
TopGunClient.test.ts).

Verification

npx jest --runInBand in packages/client, without --forceExit:

  • EXIT=0 (clean process exit, not a 124 timeout)
  • 0 "Jest did not exit" warnings
  • 613 passed in ~20s

Before: jest hung indefinitely on the reconnect loop. Closes the open AC1 from the prior
CI-reliability spec (observe a real green Build·Test·Lint run).

Two TopGunClients created in ORMapPersistence test bodies (the restore-from-
storage cases) were never closed. The mocked WebSocket fires onopen so they
start a heartbeat; once the suite's afterAll restores the real WebSocket, the
leaked clients' heartbeat failure drops into a reconnect loop against real
undici (ws://localhost:1234), keeping Jest's worker alive past the last expect().

On CI (no --forceExit after the prior CI-reliability change) this hung the client
package's jest process — "Jest did not exit" right after 613 passed — which
blocked `pnpm -r` from advancing and got Build·Test·Lint cancelled at the 18-min
cap. Two leaked clients matched the two parallel reconnect sequences in the logs.

Track both in-test clients and dispose them in afterEach. Client suite now exits
cleanly (exit 0, no "Jest did not exit", 613 passed in ~20s) without --forceExit.
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 10, 2026

Copy link
Copy Markdown

Deploying topgun with  Cloudflare Pages  Cloudflare Pages

Latest commit: 0d1cbda
Status: ✅  Deploy successful!
Preview URL: https://28e36bc0.topgun-f45.pages.dev
Branch Preview URL: https://fix-sf-451-client-test-recon.topgun-f45.pages.dev

View logs

…t hang jest

The Build·Test·Lint job hung because test-created TopGunClients that are never
close()d leave SingleServerProvider's reconnect + connection-timeout setTimeouts
referenced on the event loop, so the jest worker never exits (masked previously
by --forceExit). The ORMapPersistence fix closed two such clients, but the class
recurs in any suite that forgets close() (detectOpenHandles flagged more in
adapter-better-auth).

Systematic fix at the source: unref() both background timers. They are background
machinery and must never be the sole reason a process/Jest worker stays alive.
unref() is a no-op in browsers (setTimeout returns a number), so behavior there is
unchanged; in Node a pending reconnect/timeout no longer blocks process exit.

Also dispose the clients started in adapter-better-auth's BetterAuth integration
suite (afterAll/afterEach) as accompanying hygiene.

Verified: full `pnpm -r exec jest --runInBand` exits cleanly (FULL_EXIT=0) with
ZERO "Jest did not exit" warnings across all 10 packages, without --forceExit.
@ivkan ivkan changed the title fix(client): close leaked test clients to stop Build·Test·Lint jest hang fix(client): unref reconnect/connection timers so leaked test clients can't hang Build·Test·Lint Jun 10, 2026
@ivkan

ivkan commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

Updated: root cause is broader than the initial two clients. Added the systematic source fix — unref() on SingleServerProvider's reconnect + connection-timeout timers — so any unclosed test client can no longer keep the Jest worker alive. Verified locally: full pnpm -r exec jest --runInBand exits cleanly (FULL_EXIT=0) with 0 "Jest did not exit" across all 10 packages, no --forceExit.

ivkan added 4 commits June 10, 2026 12:02
…est hang

Revert after identification. Prints a marker per pnpm -r package and bounds each
with a hard 240s timeout so the hanging package (which does not reproduce locally)
names itself and fails fast instead of cancelling the whole job at the 18m cap.
…hang CI

The "should NOT throw when --admin admin source is absent" test ran `topgun dev`
via execSync without a TOPGUN_SERVER_BINARY override and without a timeout. The
server binary is resolved from the installed @topgunbuild/server package, not from
cwd, so on CI (where the binary is present) `dev` booted a real foreground server
that never returns — execSync hung forever, hanging the cli package and cancelling
Build·Test·Lint at the 18-minute cap. It passed locally only because binary
resolution from the temp cwd happened to miss.

Force TOPGUN_SERVER_BINARY=/nonexistent (matching the sibling test) plus a 30s
execSync timeout so the "binary not found → exit 1" path is deterministic on every
platform. Also reverts the temporary per-package CI diagnostic that pinpointed this.
doctor exits non-zero when an optional dependency is absent (the Rust toolchain
is not installed on the Node CI runner). The raw execSync threw before any
assertion ran, so the three doctor tests failed — latent failures never seen
before because the suite previously hung on the dev test (alphabetically before
doctor) and, before that, --forceExit masked everything.

Use the runCli helper (captures stdout/stderr + exit code without throwing) and
assert on combined output, which still contains the full environment-check report.
A prettier violation in the SPEC-299 close() change reached main unformatted —
the Build·Test·Lint job never ran format:check before because it hung on the cli
dev test (now fixed). Whitespace-only (method-chain wrap); no behavior change.
@ivkan ivkan merged commit d5aa579 into main Jun 10, 2026
6 checks passed
@ivkan ivkan deleted the fix/sf-451-client-test-reconnect-leak branch June 10, 2026 09:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant