Skip to content

Fix connection pool lifecycle to stop leaking connections#177

Open
jens wants to merge 1 commit into
crystaldba:mainfrom
jens:fix/connection-pool-lifecycle
Open

Fix connection pool lifecycle to stop leaking connections#177
jens wants to merge 1 commit into
crystaldba:mainfrom
jens:fix/connection-pool-lifecycle

Conversation

@jens

@jens jens commented Jun 5, 2026

Copy link
Copy Markdown

Problem

The stdio server can outlive its client and hold pooled connections indefinitely, eventually exhausting the database role's connection limit. Once that happens, every newly started server fails to connect with FATAL: too many connections for role "...".

This is the leak described in #98.

Root cause

An MCP client signals shutdown by closing the server's stdin, which makes run_stdio_async() return. But if the client is force-killed (crash, kill -9, or a supervisor tearing down the process tree), stdin EOF never arrives — run_stdio_async() never returns and the process is reparented to init, lingering indefinitely.

Each lingering process keeps its pool open. With min_size=1 the pool pins at least one connection for the whole life of the process, so a handful of orphaned servers is enough to exhaust a modest per-role CONNECTION LIMIT.

Fix

  • Zero idle footprint. min_size=0 + max_idle=60, so an idle server holds no connections and drifts back to zero between queries; max_size lowered 5 → 3.
  • Orphan watchdog (stdio only). A lightweight task runs alongside the transport and detects orphaning (the process is reparented to PID 1), closes the pool, and exits. The transport coroutine can't be cancelled cleanly — it blocks in a stdin reader thread — so the watchdog releases the pool and hard-exits rather than trying to unwind it.
  • Guaranteed release. The pool is closed in a finally on every clean exit path (stdin EOF, signal, error), and the watchdog is cancelled there.

Either change alone reduces the impact; together, an idle or orphaned server holds zero connections, and an orphaned one exits within a couple of seconds.

Tests

tests/unit/test_lifecycle.py covers: the watchdog closes the pool before exiting, waits until actually orphaned, is started exactly once and cancelled on clean stdio exit, and is never started for the sse/streamable-http transports.

Notes

Orphan detection keys on reparenting to PID 1, the common case on Linux and macOS. Under a PID namespace or a custom subreaper the parent may differ; there the watchdog simply doesn't fire, and the min_size=0 change still bounds the leak.

Fixes #98

The stdio server could outlive its client and hold pooled connections
indefinitely. A client normally signals shutdown by closing stdin, which
makes run_stdio_async() return. But if the client is force-killed, stdin EOF
never arrives, run_stdio_async() never returns, and the process is reparented
to init — lingering forever. With min_size=1 each such process pins at least
one connection, so a few orphans exhaust a low per-role connection limit,
after which every new server fails with "too many connections for role".

- Pool: min_size=0 + max_idle=60 so an idle server drifts back to zero
  connections; lower max_size 5 -> 3.
- stdio: run a lightweight watchdog alongside the transport that detects
  orphaning (parent reparented to PID 1), closes the pool, and exits. The
  transport coroutine can't be cancelled cleanly (it blocks in a stdin reader
  thread), so the watchdog releases the pool and hard-exits.
- Release the pool in a finally on every clean exit path (stdin EOF, signal,
  or error), and stop the watchdog there.

Add unit tests for the watchdog behavior and per-transport wiring.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Misshandling of Postgres Pools

1 participant