feat: add Postgres query timeouts (AIPLAT-921)#177

Open
subpath wants to merge 2 commits into main from feat-set-pg-statement-timeouts-AIPLAT-921

Conversation

@subpath subpath commented Jun 16, 2026

Collaborator

Context

This PR is a preventive measure following a recent DB incident.

On 2026-04-30 our database ran out of connections for about an hour. Active
queries went from a normal ~3.5 up to 217, total connections hit 251, and the DB
started refusing new ones. The same thing happened in smaller bursts on 05-24,
05-25 and 05-26.

One of the reasons is that our connection pool has no query timeout. A single slow
query runs indefinitely and holds its connection, and everything else piles up
behind it until the pool is full. This PR stops that.

What's new

Two timeouts on the asyncpg pool, set on the Postgres side so the query is killed
even when the client is stuck:

  • statement_timeout = 3s: a runaway query gets killed after 3s. This is the one
    that actually addresses the Apr 30 incident: it stops active queries from
    piling up.
  • idle_in_transaction_session_timeout = 10s: kills transactions left open and
    idle. We basically never hit this (max 2 in 30 days), so it's just a safety net.

Both are config-driven (PG_*_MS, 0 = no limit) and re-applied on every
reconnect, so you can tune them per env without touching code.

Connections also set application_name="mlpa:{db}" now, so MLPA shows up in the
CloudSQL num_backends_by_application metric and we can finally see how many
connections are ours.
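
For reviewers who want a picture of the mechanism, here is a minimal sketch of how the settings above could be passed to asyncpg. It is not the code in this PR: `build_server_settings` and `create_pool` are hypothetical names, and the defaults simply mirror the values listed above. asyncpg sends `server_settings` to Postgres when each connection is opened, which is one way to get the "re-applied on every reconnect" behaviour. A bare integer for either timeout is interpreted by Postgres as milliseconds, and 0 disables the limit.

```python
from typing import Dict


def build_server_settings(
    db_name: str,
    statement_timeout_ms: int,
    idle_in_transaction_timeout_ms: int,
) -> Dict[str, str]:
    """Build per-connection Postgres settings (hypothetical helper)."""
    for value in (statement_timeout_ms, idle_in_transaction_timeout_ms):
        if value < 0:
            raise ValueError("timeouts must be >= 0 (0 = no limit)")
    return {
        # Makes the service identifiable in per-application connection metrics.
        "application_name": f"mlpa:{db_name}",
        # Postgres reads a unitless value as milliseconds; "0" means no limit.
        "statement_timeout": str(statement_timeout_ms),
        "idle_in_transaction_session_timeout": str(idle_in_transaction_timeout_ms),
    }


async def create_pool(
    dsn: str,
    db_name: str,
    statement_timeout_ms: int = 3000,
    idle_in_transaction_timeout_ms: int = 10000,
):
    """Create an asyncpg pool whose connections carry the settings above."""
    import asyncpg  # imported lazily so the helper above has no dependency

    return await asyncpg.create_pool(
        dsn,
        # Sent to the server each time a connection is opened, so new and
        # replacement connections all get the same limits.
        server_settings=build_server_settings(
            db_name, statement_timeout_ms, idle_in_transaction_timeout_ms
        ),
    )
```

Because the limits live on the server side of the session, Postgres cancels the statement itself even if the client coroutine is stuck.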

Ticket:

AIPLAT-921

@subpath subpath requested a review from a team as a code owner June 16, 2026 08:45
self.statement_timeout_ms = (
statement_timeout_ms
if statement_timeout_ms is not None
else env.PG_STATEMENT_TIMEOUT_MS

Collaborator

question: If we're defining it here, do we still need to do

    async with self.statement_timeout(
        env.PG_MAINTENANCE_STATEMENT_TIMEOUT_MS
    ) as conn:

on every query?

`{base_user_id}:{service_type}`.
"""
try:
rows = await self.pool.fetch(

Collaborator

question: Are we sure we can safely remove all the pool methods?
