feat: add Postgres query timeouts (AIPLAT-921) #177
Open
subpath wants to merge 2 commits into
Conversation
    self.statement_timeout_ms = (
        statement_timeout_ms
        if statement_timeout_ms is not None
        else env.PG_STATEMENT_TIMEOUT_MS
Collaborator
question: If we're defining it here, do we still need to do
    async with self.statement_timeout(
        env.PG_MAINTENANCE_STATEMENT_TIMEOUT_MS
    ) as conn:
on every query?
    `{base_user_id}:{service_type}`.
    """
    try:
        rows = await self.pool.fetch(
Collaborator
question: Are we sure we can safely remove all the pool methods?
Context
This PR is a prevention measure for a recent DB incident.
On 2026-04-30 our database ran out of connections for about an hour. Active
queries went from a normal ~3.5 up to 217, total connections hit 251, and the DB
started refusing new ones. The same thing happened in smaller bursts on 05-24,
05-25 and 05-26.
One of the reasons is that our connection pool has no query timeout. A single slow
query runs for as long as it takes, holds its connection the whole time, and
everything else piles up behind it until the pool is full. This PR stops that.
What's new
Two timeouts on the asyncpg pool, set on the Postgres side so the query is killed
even when the client is stuck:
statement_timeout = 3s: a runaway query gets killed after 3s. This is the one that actually fixes Apr 30: it stops active queries from piling up.
idle_in_transaction_session_timeout = 10s: kills transactions left open and idle. We basically never hit this (max 2 in 30 days), so it's just a safety net.
Both are config-driven (PG_*_MS, 0 = no limit) and re-applied on every reconnect, so you can tune them per env without touching code.
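For readers unfamiliar with server-side timeouts in asyncpg, here is a minimal sketch of one way to wire this up. It is not the code in this PR: `build_server_settings`, `create_pool`, the `PG_IDLE_IN_TRANSACTION_TIMEOUT_MS` variable name and the defaults shown are assumptions for illustration. It relies on asyncpg's `server_settings` argument, which is sent when each connection is opened, so new and replaced connections start with the same values.

```python
import os


def build_server_settings(db_name: str, environ=os.environ) -> dict[str, str]:
    """Build Postgres session settings from PG_*_MS-style env vars.

    The idle-in-transaction variable name and both defaults are assumed here.
    """
    statement_ms = int(environ.get("PG_STATEMENT_TIMEOUT_MS", "3000"))
    idle_tx_ms = int(environ.get("PG_IDLE_IN_TRANSACTION_TIMEOUT_MS", "10000"))
    return {
        # Postgres reads a bare integer as milliseconds; 0 disables the timeout.
        "statement_timeout": str(statement_ms),
        "idle_in_transaction_session_timeout": str(idle_tx_ms),
        "application_name": f"mlpa:{db_name}",
    }


async def create_pool(dsn: str, db_name: str):
    # Imported lazily so the helper above has no driver dependency.
    import asyncpg

    # server_settings is applied at connection startup, so every connection
    # the pool opens (including replacements) gets these values.
    return await asyncpg.create_pool(
        dsn, server_settings=build_server_settings(db_name)
    )
```

Because the limits live in the session on the Postgres side, the server cancels the query itself and does not depend on the client noticing that it is stuck.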
Connections also set application_name="mlpa:{db}" now, so MLPA shows up in the CloudSQL num_backends_by_application metric and we can finally see how many connections are ours.
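To spot-check the same breakdown directly against the database, something like the sketch below could work. It is an assumption, not part of this PR: `count_backends_by_application` is a hypothetical helper, and `conn` is any object with an asyncpg-style `fetch` coroutine (a connection or a pool). It groups `pg_stat_activity` by `application_name`, which is roughly what the CloudSQL metric reports.

```python
async def count_backends_by_application(conn) -> dict[str, int]:
    """Return {application_name: connection_count} from pg_stat_activity."""
    rows = await conn.fetch(
        """
        SELECT application_name, count(*) AS backends
        FROM pg_stat_activity
        WHERE backend_type = 'client backend'
        GROUP BY application_name
        """
    )
    # Each row supports lookup by column name.
    return {row["application_name"]: row["backends"] for row in rows}
```

Connections opened without an application name are counted under the empty string, so anything not labelled mlpa:{db} is easy to tell apart.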
Ticket: AIPLAT-921