feat: alert on slow queries via pg_stat_statements #567

Open
BigJohn-dev wants to merge 1 commit into CalloraOrg:main from BigJohn-dev:feature/slow-query-alerts

Closes #467

Summary

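Adds a background worker that polls PostgreSQL's pg_stat_statements view and fires a webhook when any query's mean_exec_time exceeds a configurable latency threshold (SLOW_QUERY_P95_THRESHOLD_MS). Despite the variable name, the comparison is against the mean execution time, because pg_stat_statements does not expose percentiles.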
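
As a rough illustration of the polling step, the sketch below shows one way such a fetch could look. The `Queryable` interface, the `SlowQueryRow` shape, the `limit` parameter, and the exact SQL are assumptions made for this example and are not taken from the PR. Column names follow PostgreSQL 13 and later.

```typescript
// Hypothetical row shape for this sketch. Some drivers return bigint columns
// (calls, rows) as strings, so real code may need to convert them.
interface SlowQueryRow {
  query: string;
  calls: number;
  mean_exec_time: number;
  max_exec_time: number;
  rows: number;
}

// Hypothetical minimal client interface standing in for a connection pool.
interface Queryable {
  query(text: string, values: unknown[]): Promise<{ rows: SlowQueryRow[] }>;
}

// PostgreSQL 13+ column names; older releases use mean_time / max_time.
const SLOW_QUERY_SQL = `
  SELECT query, calls, mean_exec_time, max_exec_time, rows
    FROM pg_stat_statements
   WHERE mean_exec_time > $1
   ORDER BY mean_exec_time DESC
   LIMIT $2`;

// Returns statements whose mean execution time exceeds thresholdMs.
async function fetchSlowQueries(
  db: Queryable,
  thresholdMs: number,
  limit: number = 50, // assumed cap, not part of the PR's configuration
): Promise<SlowQueryRow[]> {
  const result = await db.query(SLOW_QUERY_SQL, [thresholdMs, limit]);
  return result.rows;
}
```

Passing the threshold as a bind parameter keeps the statement text constant, so the alerter's own query shows up as a single entry in pg_stat_statements.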

Changes

New Files

| File | Purpose |
| --- | --- |
| src/workers/slowQueryAlerter.ts | Worker: polls pg_stat_statements, deduplicates by fingerprint, POSTs alerts |
| src/workers/slowQueryAlerter.test.ts | 21 tests across fetchSlowQueries, createDedupStore, and createSlowQueryAlerterJob (91% line coverage) |
| docs/slow-query-alerts.md | Setup, config reference, payload schema, architecture |

Modified Files

| File | Change |
| --- | --- |
| src/config/env.ts | Adds SLOW_QUERY_ALERT_WEBHOOK_URL, SLOW_QUERY_P95_THRESHOLD_MS, SLOW_QUERY_POLL_INTERVAL_MS, SLOW_QUERY_DEDUP_WINDOW_SECONDS |
| src/config/index.ts | Maps env vars to config.slowQueryAlerter section |
| src/index.ts | Conditionally creates worker (only when webhook URL is set) and registers graceful shutdown |
| src/metrics.ts | Adds slow_query_alerter_runs_total, slow_query_alerter_alerts_total, slow_query_alerter_queries_above_threshold |
| .env.example | Documents all 4 new env vars with defaults |

Architecture

┌──────────────────────────────────────────────────┐
│  setInterval (every SLOW_QUERY_POLL_INTERVAL_MS) │
│  ┌────────────────────────────────────────────┐  │
│  │ 1. SELECT FROM pg_stat_statements          │  │
│  │    WHERE mean_exec_time > threshold        │  │
│  │ 2. Filter by dedup store (md5 fingerprint) │  │
│  │ 3. POST new queries to webhook             │  │
│  │ 4. Record Prometheus metrics               │  │
│  └────────────────────────────────────────────┘  │
│                                                  │
│  Graceful Shutdown:                              │
│  beginShutdown() → stop timer + reject new ticks │
│  awaitIdle()     → wait for in-flight poll       │
└──────────────────────────────────────────────────┘
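
The loop in the diagram can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `createPollingJob` and `PollingJob` are hypothetical names, and only `beginShutdown` and `awaitIdle` are taken from the description above. The shape of the existing DrainableSubsystem interface is not shown in this PR, so it is not reproduced here.

```typescript
type Tick = () => Promise<void>;

// Hypothetical job interface for this sketch.
interface PollingJob {
  start(): void;
  beginShutdown(): void;
  awaitIdle(): Promise<void>;
}

function createPollingJob(tick: Tick, intervalMs: number): PollingJob {
  let timer: ReturnType<typeof setInterval> | undefined;
  let inFlight: Promise<void> | undefined;
  let shuttingDown = false;

  const runTick = (): void => {
    // Skip when draining, or when the previous poll has not finished yet.
    if (shuttingDown || inFlight) return;
    inFlight = tick()
      .catch((err: unknown) => console.error("slow-query tick failed", err))
      .finally(() => {
        inFlight = undefined;
      });
  };

  return {
    start(): void {
      if (timer || shuttingDown) return;
      timer = setInterval(runTick, intervalMs);
      runTick(); // run one tick immediately on start
    },
    beginShutdown(): void {
      // Stop the timer and reject any further ticks.
      shuttingDown = true;
      if (timer) clearInterval(timer);
      timer = undefined;
    },
    async awaitIdle(): Promise<void> {
      // Resolves immediately when no poll is running.
      await inFlight;
    },
  };
}
```

A caller would typically invoke `beginShutdown()` on SIGTERM and then `await awaitIdle()` before closing the database pool, so that an in-flight poll is not cut off mid-query.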

Acceptance Criteria

  • Worker runs every 5 min (configurable via SLOW_QUERY_POLL_INTERVAL_MS)
  • Threshold configurable via SLOW_QUERY_P95_THRESHOLD_MS
  • Webhook fires with structured JSON payload
  • Dedup per query fingerprint (md5(query)) with configurable window
  • Input validation at boundary (positive integers, required URL)
  • Prometheus metrics for observability
  • Graceful shutdown via existing DrainableSubsystem pattern
  • Conditionally started — no-op when webhook URL is unset
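
To make the dedup criterion concrete, here is a minimal in-memory sketch. It assumes the fingerprint is an MD5 hex digest of the query text computed in Node; the PR may compute it elsewhere. The `DedupStore` interface, its method names, and the injectable `now` argument are hypothetical and chosen only to keep the example testable.

```typescript
import { createHash } from "node:crypto";

// MD5 hex digest of the query text (32 hex characters).
function fingerprint(query: string): string {
  return createHash("md5").update(query).digest("hex");
}

// Hypothetical store interface for this sketch.
interface DedupStore {
  has(key: string, now?: number): boolean;
  set(key: string, now?: number): void;
  cleanup(now?: number): void;
  size(): number;
}

function createDedupStore(windowMs: number): DedupStore {
  const seenAt = new Map<string, number>();

  // An entry counts as fresh while it is younger than the window.
  const isFresh = (ts: number | undefined, now: number): boolean =>
    ts !== undefined && now - ts < windowMs;

  return {
    has(key: string, now: number = Date.now()): boolean {
      return isFresh(seenAt.get(key), now);
    },
    set(key: string, now: number = Date.now()): void {
      seenAt.set(key, now);
    },
    cleanup(now: number = Date.now()): void {
      // Drop expired entries so the map does not grow without bound.
      seenAt.forEach((ts, key) => {
        if (!isFresh(ts, now)) seenAt.delete(key);
      });
    },
    size(): number {
      return seenAt.size;
    },
  };
}
```

An in-memory map like this is per-process, so with several application instances each one would alert independently; a shared store would be needed to deduplicate across instances.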

Webhook Payload

{
  "event": "slow_query_alert",
  "timestamp": "2025-01-01T00:00:00.000Z",
  "data": {
    "thresholdMs": 500,
    "queryCount": 1,
    "queries": [
      {
        "fingerprint": "abc123def456",
        "querySample": "SELECT * FROM large_table WHERE ...",
        "calls": 1500,
        "meanExecTimeMs": 1234.56,
        "maxExecTimeMs": 8901.23,
        "rows": 100
      }
    ]
  }
}
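
The payload above could be typed and assembled along these lines. Field names mirror the JSON sample; `buildPayload`, `postAlert`, and the `HttpPost` type are hypothetical helpers for this sketch, and it assumes `queryCount` equals the number of entries in `queries`. In Node 18 or later the global `fetch` should be usable as the `post` argument.

```typescript
interface SlowQueryEntry {
  fingerprint: string;
  querySample: string;
  calls: number;
  meanExecTimeMs: number;
  maxExecTimeMs: number;
  rows: number;
}

interface SlowQueryAlertPayload {
  event: "slow_query_alert";
  timestamp: string; // ISO 8601, UTC
  data: {
    thresholdMs: number;
    queryCount: number;
    queries: SlowQueryEntry[];
  };
}

// Assumption: queryCount is simply the number of queries in this alert.
function buildPayload(
  thresholdMs: number,
  queries: SlowQueryEntry[],
  now: Date = new Date(),
): SlowQueryAlertPayload {
  return {
    event: "slow_query_alert",
    timestamp: now.toISOString(),
    data: { thresholdMs, queryCount: queries.length, queries },
  };
}

// Minimal HTTP poster signature so the sender can be swapped out in tests.
type HttpPost = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number }>;

async function postAlert(url: string, payload: SlowQueryAlertPayload, post: HttpPost): Promise<void> {
  const res = await post(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  // Treat any non-2xx response as a failure so the caller can log it.
  if (!res.ok) throw new Error(`webhook responded with HTTP ${res.status}`);
}
```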

Configuration

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| SLOW_QUERY_ALERT_WEBHOOK_URL | No (feature-gated) | (none) | Webhook URL; worker is skipped when unset |
| SLOW_QUERY_P95_THRESHOLD_MS | No | 500 | Queries whose mean_exec_time exceeds this value (in ms) trigger an alert |
| SLOW_QUERY_POLL_INTERVAL_MS | No | 300000 | Polling interval in ms (default 5 min) |
| SLOW_QUERY_DEDUP_WINDOW_SECONDS | No | 3600 | Dedup window per fingerprint in seconds (default 1 h) |
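
A minimal sketch of how these variables might be parsed and validated at the boundary is shown below. The defaults come from the table above; `loadSlowQueryConfig` and `positiveInt` are hypothetical helpers, and the conversion of the dedup window from seconds to milliseconds is an assumption based on the `dedupWindowMs` name in the test list.

```typescript
interface SlowQueryAlerterConfig {
  webhookUrl: string;
  p95ThresholdMs: number;
  pollIntervalMs: number;
  dedupWindowMs: number;
}

// Parses a positive integer, falling back to the default when unset or blank.
function positiveInt(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

// Returns undefined when the webhook URL is unset, i.e. the feature is off.
function loadSlowQueryConfig(
  env: Record<string, string | undefined>,
): SlowQueryAlerterConfig | undefined {
  const webhookUrl = (env["SLOW_QUERY_ALERT_WEBHOOK_URL"] ?? "").trim();
  if (webhookUrl === "") return undefined;
  new URL(webhookUrl); // throws a TypeError on a malformed URL

  return {
    webhookUrl,
    p95ThresholdMs: positiveInt("SLOW_QUERY_P95_THRESHOLD_MS", env["SLOW_QUERY_P95_THRESHOLD_MS"], 500),
    pollIntervalMs: positiveInt("SLOW_QUERY_POLL_INTERVAL_MS", env["SLOW_QUERY_POLL_INTERVAL_MS"], 300000),
    // Assumed conversion: the env var is in seconds, the worker option in ms.
    dedupWindowMs:
      positiveInt("SLOW_QUERY_DEDUP_WINDOW_SECONDS", env["SLOW_QUERY_DEDUP_WINDOW_SECONDS"], 3600) * 1000,
  };
}
```

Called as `loadSlowQueryConfig(process.env)`, a return value of `undefined` would let the entry point skip creating the worker entirely.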

Testing

npm run test:unit           # full unit suite
npx jest src/workers/slowQueryAlerter.test.ts --coverage

Test Coverage

PASS  src/workers/slowQueryAlerter.test.ts
  slowQueryAlerter
    fetchSlowQueries
      ✓ returns rows from the pool query
      ✓ returns empty array when no slow queries
    createDedupStore
      ✓ returns false for unseen keys
      ✓ returns true for set keys within window
      ✓ returns false for expired keys
      ✓ cleanup removes expired entries
    createSlowQueryAlerterJob
      ✓ throws on invalid pollIntervalMs
      ✓ throws on invalid p95ThresholdMs
      ✓ throws on invalid dedupWindowMs
      ✓ throws on missing webhookUrl
      ✓ runs a tick on start and alerts for new slow queries
      ✓ does not alert for queries already in dedup window
      ✓ alerts again after dedup window expires
      ✓ skips tick when already running
      ✓ respects beginShutdown and does not start ticks
      ✓ awaitIdle resolves when no tick is running
      ✓ stops and starts cleanly
      ✓ records Prometheus metrics on successful run
      ✓ logs error when webhook returns non-2xx
      ✓ logs error when webhook fetch throws
      ✓ logs error when pool query throws

Prerequisites

The pg_stat_statements extension must be enabled on the database:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
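
Note that PostgreSQL also requires pg_stat_statements to be listed in shared_preload_libraries (which needs a server restart) before the view returns data. A worker could verify the prerequisite at startup with a check like the sketch below. `hasPgStatStatements` and the `ExtensionQueryable` interface are hypothetical and not part of this PR; the query reads the standard pg_extension catalog.

```typescript
// Hypothetical minimal client interface for this sketch.
interface ExtensionQueryable {
  query(text: string): Promise<{ rows: unknown[] }>;
}

// True when the extension has been created in the current database.
async function hasPgStatStatements(db: ExtensionQueryable): Promise<boolean> {
  const result = await db.query(
    "SELECT 1 FROM pg_extension WHERE extname = 'pg_stat_statements'",
  );
  return result.rows.length > 0;
}
```

Logging a clear warning when this returns false is likely friendlier than letting the first poll fail with a missing-relation error.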

drips-wave Bot commented Jun 28, 2026

@BigJohn-dev Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀
