|
| 1 | +# Reaper Batch Consolidation — Implementation Summary |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This document summarizes the implementation of the pgwatch reaper goroutine consolidation, which replaces the previous one-goroutine-per-(source × metric) architecture with a one-goroutine-per-source model using pgx Batch queries. |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## What Changed |
| 10 | + |
| 11 | +### Architecture: Before vs After |
| 12 | + |
| 13 | +| Aspect | Before | After | |
| 14 | +|--------|--------|-------| |
| 15 | +| Goroutine model | 1 goroutine per (source × metric) | 1 goroutine per source | |
| 16 | +| SQL execution | 1 `Query()` call per metric per tick | 1 `SendBatch()` call per source per tick | |
| 17 | +| Query protocol | Individual queries, each a network round-trip | pgx pipeline protocol — multiple queries in one round-trip | |
| 18 | +| Cancel granularity | Per `source¤¤¤metric` key | Per source name | |
| 19 | +| Config hot-reload | Cancel + re-spawn per metric goroutine | `UpdateSchedules()` on existing `SourceReaper` | |
| 20 | + |
| 21 | +### Goroutine Reduction |
| 22 | + |
| 23 | +| Scenario | Before | After | Reduction | |
| 24 | +|----------|--------|-------|-----------| |
| 25 | +| 10 sources × exhaustive (32 metrics) | 320 | 10 | **97%** | |
| 26 | +| 50 sources × exhaustive | 1,600 | 50 | **97%** | |
| 27 | +| 1 source × basic (4 metrics) | 4 | 1 | **75%** | |
| 28 | + |
| 29 | +### Network Round-Trip Reduction |
| 30 | + |
| 31 | +With the `exhaustive` preset at 60-second alignment, ~12 SQL metrics are due simultaneously. |
| 32 | + |
| 33 | +| Before | After | Reduction | |
| 34 | +|--------|-------|-----------| |
| 35 | +| 12 separate `Query()` calls | 1 `SendBatch()` call | **~92%** | |
| 36 | + |
| 37 | +At peak alignment (t = 7200s, all 32 metrics due): **32 → 1 round-trip = 97% reduction**. |
| 38 | + |
| 39 | +--- |
| 40 | + |
| 41 | +## Implementation Phases |
| 42 | + |
| 43 | +### Phase 1: Core Infrastructure ✅ |
| 44 | + |
| 45 | +- Added `SendBatch(ctx, *pgx.Batch) pgx.BatchResults` to `PgxPoolIface` interface |
| 46 | +- Created `SourceReaper` struct with per-source state: metric schedules, tick interval, connection |
| 47 | +- Implemented `GCD()` / `GCDSlice()` for computing tick interval from metric intervals |
| 48 | +- Implemented `isDue()` check with zero-value = "never fetched" semantics |
| 49 | +- Minimum tick interval floor: **5 seconds** (prevents excessive wake-ups for coprime intervals) |
| 50 | + |
| 51 | +### Phase 2: Batch Query Execution ✅ |
| 52 | + |
| 53 | +- `executeBatch()`: Builds `pgx.Batch` from due metrics, sends in one round-trip, dispatches results per-metric |
| 54 | +- Preserves: instance-level caching, primary/standby filtering, `AddSysinfoToMeasurements`, server restart detection |
| 55 | +- `fetchSequentialMetric()`: Fallback for non-Postgres sources (pgbouncer, pgpool) using simple protocol |
| 56 | +- `fetchOSMetric()`, `fetchSpecialMetric()`: Handle gopsutil and special metrics inline |
| 57 | +- `BatchQueryMeasurements()`: Standalone batch helper with deterministic (sorted) key ordering |
| 58 | + |
| 59 | +### Phase 3: Main Loop Integration ✅ |
| 60 | + |
| 61 | +- `Reap()` now spawns `go sr.Run(sourceCtx)` per source instead of per-metric goroutines |
| 62 | +- `cancelFuncs` map simplified from `map[string]context.CancelFunc` keyed by `db¤¤¤metric` to keyed by source name |
| 63 | +- Added `sourceReapers map[string]*SourceReaper` to `Reaper` struct |
| 64 | +- `ShutdownOldWorkers()` simplified — only checks if source was removed from config |
| 65 | +- Removed dead `reapMetricMeasurements()` function (was ~100 lines) |
| 66 | + |
| 67 | +### Phase 4: Change Detection Batching ✅ |
| 68 | + |
| 69 | +- `GetObjectChangesMeasurement()` now calls `prefetchChangeDetectionData()` to batch-fetch all hash queries (`sproc_hashes`, `table_hashes`, `index_hashes`, `privilege_changes`) in one `pgx.Batch` |
| 70 | +- Added `Detect*ChangesWithData()` variants that accept pre-fetched data, falling back to original methods if nil |
| 71 | +- Configuration changes (`DetectConfigurationChanges`) remain unbatched — different `Scan()` pattern with typed variables |
| 72 | + |
| 73 | +### Phase 5: Cleanup & Observability ✅ |
| 74 | + |
| 75 | +- New Prometheus metrics in `observability.go`: |
| 76 | + - `pgwatch_reaper_batch_size` (histogram) — queries per batch |
| 77 | + - `pgwatch_reaper_batch_duration_seconds` (histogram) — wall-clock time per batch |
| 78 | + - `pgwatch_reaper_metric_fetch_total` (counter, labels: source, status) — success/error counts |
| 79 | + - `pgwatch_reaper_active_source_reapers` (gauge) — currently running source reapers |
| 80 | +- Batch timeout: 80% of tick interval, prevents slow queries from blocking the next tick |
| 81 | + |
| 82 | +--- |
| 83 | + |
| 84 | +## Files Changed |
| 85 | + |
| 86 | +### New Files |
| 87 | + |
| 88 | +| File | Lines | Purpose | |
| 89 | +|------|-------|---------| |
| 90 | +| `internal/reaper/source_reaper.go` | 494 | SourceReaper struct, GCD, batch execution, Run loop | |
| 91 | +| `internal/reaper/source_reaper_test.go` | 471 | pgxmock unit tests (12 test functions, 20+ subtests) | |
| 92 | +| `internal/reaper/source_reaper_integration_test.go` | 301 | testcontainers integration tests (6 test functions) | |
| 93 | +| `internal/reaper/observability.go` | 38 | Prometheus metrics for batch observability | |
| 94 | + |
| 95 | +### Modified Files |
| 96 | + |
| 97 | +| File | Changes | Purpose | |
| 98 | +|------|---------|---------| |
| 99 | +| `internal/db/conn.go` | +1 line | Added `SendBatch` to `PgxPoolIface` interface | |
| 100 | +| `internal/reaper/reaper.go` | +21 / -186 lines | Per-source goroutines, simplified cancel management | |
| 101 | +| `internal/reaper/database.go` | +303 lines | Batched change detection, `prefetchChangeDetectionData` | |
| 102 | +| `internal/reaper/reaper_test.go` | +15 / -14 lines | Updated tests for per-source cancel pattern | |
| 103 | + |
| 104 | +**Total: 4 new files (1,304 lines), 4 modified files (+340 / -200 net)** |
| 105 | + |
| 106 | +--- |
| 107 | + |
| 108 | +## Test Coverage |
| 109 | + |
| 110 | +### Unit Tests (pgxmock) — 12 test functions, 20+ subtests |
| 111 | + |
| 112 | +| Test | What it verifies | |
| 113 | +|------|-----------------| |
| 114 | +| `TestGCD` | Euclidean GCD for two integers | |
| 115 | +| `TestGCDSlice` | GCD across slices: empty, single, coprime, exhaustive preset (30s) | |
| 116 | +| `TestCalcTickInterval` | Tick interval calculation: normal, floor to 5s, single metric, empty | |
| 117 | +| `TestIsDue` | Never-fetched (always due), recently fetched (not due), past interval (due) | |
| 118 | +| `TestSourceReaper_DueMetrics` | Correct partitioning of due vs not-yet-due metrics | |
| 119 | +| `TestNewSourceReaper` | Constructor sets schedules and tick interval | |
| 120 | +| `TestUpdateSchedules` | Hot-reload preserves lastFetch for retained metrics, purges removed | |
| 121 | +| `TestSourceReaper_ExecuteBatch` | Full batch path: 2 metrics → pgxmock batch → 2 envelopes on channel | |
| 122 | +| `TestSourceReaper_RunOneIteration` | Run() loop fires, collects metrics, exits on context cancel | |
| 123 | +| `TestSourceReaper_DetectServerRestart` | Detects uptime regression → emits `object_changes` envelope | |
| 124 | +| `TestSourceReaper_NonPostgresSequential` | Sequential fallback path for postgres source | |
| 125 | +| `TestBatchQueryMeasurements` | Standalone batch (Postgres) and sequential (non-Postgres) paths | |
| 126 | + |
| 127 | +### Integration Tests (testcontainers) — 6 test functions |
| 128 | + |
| 129 | +| Test | What it verifies | |
| 130 | +|------|-----------------| |
| 131 | +| `TestIntegration_BatchQueryMeasurements` | 4 real SQL queries batched against Postgres 18, all return correct data | |
| 132 | +| `TestIntegration_ExecuteBatch` | Full `executeBatch()` path with 2 metric definitions → envelopes arrive | |
| 133 | +| `TestIntegration_SourceReaper_RunCollectsMetrics` | `Run()` loop starts, collects 2 metrics within 15s, exits cleanly | |
| 134 | +| `TestIntegration_BatchVsSequentialConsistency` | Batch and sequential paths return identical results for same query | |
| 135 | +| `TestIntegration_BatchEmptySQL` | Empty/whitespace SQL queries are silently skipped | |
| 136 | +| `TestIntegration_BatchMultipleMetricsSameRoundTrip` | 10 queries sent in one batch, all 10 return results | |
| 137 | + |
| 138 | +### Existing Tests — 0 regressions |
| 139 | + |
| 140 | +All 152 test cases in `internal/reaper/` pass, including all pre-existing tests for `DetectSprocChanges`, `DetectTableChanges`, `DetectIndexChanges`, `DetectPrivilegeChanges`, `DetectConfigurationChanges`, `FetchMetric`, `LoadSources`, `LoadMetrics`, log parser tests, and OS metric tests. |
| 141 | + |
| 142 | +--- |
| 143 | + |
| 144 | +## Design Decisions |
| 145 | + |
| 146 | +| Decision | Rationale | |
| 147 | +|----------|-----------| |
| 148 | +| GCD-based tick loop (Option A) | Zero external dependencies, natural fit with `context` cancellation, simple reasoning | |
| 149 | +| No external scheduler (gocron) | gocron v2 uses one goroutine per job, defeating the consolidation purpose | |
| 150 | +| 5-second minimum tick | Prevents excessive wake-ups for coprime intervals (e.g., GCD(7, 13) = 1) | |
| 151 | +| Sequential fallback for non-Postgres | pgbouncer/pgpool use `SimpleProtocol` and don't support pipeline batching | |
| 152 | +| `server_log_event_counts` keeps its own goroutine | Streaming CSV parser with long-running I/O, not batchable | |
| 153 | +| Batch timeout = 80% of tick interval | Prevents slow queries from blocking the next tick cycle | |
| 154 | +| Sorted keys in `BatchQueryMeasurements` | Ensures deterministic batch ordering for testing and debugging | |
| 155 | +| `Detect*ChangesWithData()` variants | Allow pre-fetched batch data while preserving original methods as fallbacks | |
| 156 | + |
| 157 | +--- |
| 158 | + |
| 159 | +## Expected Production Impact |
| 160 | + |
| 161 | +### Resource Usage |
| 162 | + |
| 163 | +- **Memory**: ~97% reduction in goroutine stack allocations (320 × 8KB default stack → 10 × 8KB) |
| 164 | +- **CPU scheduling**: Fewer goroutines means less scheduler overhead and context switching |
| 165 | +- **Connection pool**: Batch acquires 1 connection per tick instead of N concurrent acquires; reduces pool contention |
| 166 | + |
| 167 | +### Network |
| 168 | + |
| 169 | +- **Round-trips**: ~92% reduction at 60s alignment for exhaustive preset |
| 170 | +- **Latency**: Batch queries benefit from TCP connection reuse and PostgreSQL's pipeline protocol |
| 171 | +- **Bandwidth**: Slight reduction from fewer TCP handshakes and acknowledgements |
| 172 | + |
| 173 | +### Observability |
| 174 | + |
| 175 | +- `pgwatch_reaper_batch_size` histogram reveals how many queries are batched per tick |
| 176 | +- `pgwatch_reaper_batch_duration_seconds` tracks end-to-end batch latency |
| 177 | +- `pgwatch_reaper_metric_fetch_total` with source/status labels enables per-source error rate alerting |
| 178 | +- `pgwatch_reaper_active_source_reapers` gauge shows current source count |
| 179 | + |
| 180 | +--- |
| 181 | + |
| 182 | +## Future Enhancements |
| 183 | + |
| 184 | +1. **Overflow workers (Option D)**: Offload known-slow metrics (e.g., `table_bloat_approx_summary_sql`) to a separate goroutine if they exceed a time threshold, preventing them from blocking the batch |
| 185 | +2. **Adaptive tick interval**: Dynamically adjust tick interval based on observed query latency |
| 186 | +3. **Per-metric batch timeout**: Use `SET LOCAL statement_timeout` within the batch for metrics with `StatementTimeoutSeconds` configured |
| 187 | +4. **Batch configuration changes**: Batch the `DetectConfigurationChanges` hash queries (currently excluded due to different `Scan()` pattern) |
0 commit comments