# Observability

You can't improve what you can't measure. Postgres ships with a remarkably complete observability system built in — the `pg_stat_*` family of views, `EXPLAIN ANALYZE`, slow query logging, wait event tracking, and more. This chapter is about using those instruments effectively.

The goal of database observability is to be able to answer: "What is the database doing right now? What has it been doing? What's slow? Why?" Without good observability, you're flying blind. With it, performance investigations that used to take days take minutes.

## The `pg_stat_*` Views

Postgres maintains statistics about nearly everything it does, surfaced as views in the `pg_catalog` schema. These views are essential for understanding what your database is doing.

### `pg_stat_activity`

The most important view for real-time diagnostics: it shows all active connections and what they're doing.

```sql
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    wait_event_type,
    wait_event,
    left(query, 100) AS query_snippet,
    now() - query_start AS duration,
    now() - state_change AS state_duration
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC NULLS LAST;
```

The `state` column:

- `active`: Actively executing a query
- `idle`: Waiting for a query from the client
- `idle in transaction`: In a transaction, waiting between statements
- `idle in transaction (aborted)`: In a failed transaction (must be rolled back)
- `fastpath function call`: Executing a fast-path function

**`idle in transaction` is a red flag.** A connection that's `idle in transaction` holds locks for the duration. If a long transaction is idle (application code is doing slow work between database calls), those locks block other queries. Alert on connections that are `idle in transaction` for more than 30 seconds.
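
A minimal query for exactly that alert; the 30-second threshold is an assumption to tune:

```sql
-- Long-lived "idle in transaction" sessions and what they last ran
SELECT
    pid,
    usename,
    application_name,
    now() - state_change AS idle_for,
    left(query, 100) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > interval '30 seconds'
ORDER BY idle_for DESC;
```

As a server-side backstop, `idle_in_transaction_session_timeout` will terminate such sessions automatically once they exceed the configured duration.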

The `wait_event_type` and `wait_event` columns tell you what the backend is waiting for. Common wait events:

- `Lock / relation`: Waiting for a table lock
- `Lock / tuple`: Waiting for a row lock
- `LWLock / buffer_content`: Waiting for a shared buffer lock
- `IO / DataFileRead`: Waiting for a page read from disk (cache miss)
- `Client / ClientRead`: Waiting for the client to send the next query
- `IPC / BgWorkerShutdown`: Internal — background worker coordination

### `pg_stat_user_tables`

Per-table statistics:

```sql
SELECT
    relname,
    seq_scan,
    idx_scan,
    n_live_tup,
    n_dead_tup,
    round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0) * 100, 2) AS bloat_pct,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;
```

High `seq_scan` relative to `idx_scan` suggests missing indexes (or queries written so they can't use the ones you have). High `n_dead_tup` suggests autovacuum isn't keeping up. `last_autovacuum` and `last_autoanalyze` tell you when maintenance last ran.

### `pg_stat_user_indexes`

Per-index usage statistics:

```sql
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
```

`idx_scan = 0` means this index has never been used since the last statistics reset. These are candidates for removal — they add write overhead and waste space without helping any query.
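
One caveat before dropping anything: an index with `idx_scan = 0` may still be enforcing a unique constraint. A sketch that filters those out and sizes the potential win:

```sql
-- Unused indexes that don't back a constraint: drop candidates.
-- Check standbys too -- usage counters are tracked per server.
SELECT
    s.schemaname,
    s.relname AS table_name,
    s.indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique   -- unique/PK indexes enforce constraints
ORDER BY pg_relation_size(s.indexrelid) DESC;
```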

### `pg_stat_bgwriter`

Background writer and checkpoint statistics:

```sql
SELECT
    checkpoints_timed,
    checkpoints_req,
    round(checkpoints_req::numeric / nullif(checkpoints_timed + checkpoints_req, 0) * 100, 2) AS checkpoint_req_pct,
    buffers_checkpoint,
    buffers_clean,
    maxwritten_clean,
    buffers_alloc,
    stats_reset
FROM pg_stat_bgwriter;
```

A high `checkpoint_req_pct` (above 20%) means checkpoints are being triggered by WAL volume rather than by the timer — increase `max_wal_size`. `maxwritten_clean > 0` means the background writer hit its per-round limit — increase `bgwriter_lru_maxpages`. (On Postgres 17 and later, the checkpoint counters live in the separate `pg_stat_checkpointer` view.)
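
The config side of those two fixes, as a sketch; the numbers are illustrative starting points, not recommendations:

```ini
# If checkpoint_req_pct is high: give WAL more room between checkpoints
max_wal_size = 4GB              # default is 1GB
checkpoint_timeout = 15min      # default is 5min

# If maxwritten_clean keeps growing: let the bgwriter write more per round
bgwriter_lru_maxpages = 400     # default is 100
```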

### `pg_stat_replication`

Covered in Chapter 14. Essential for monitoring standby lag.

### `pg_stat_database`

Database-level aggregates:

```sql
SELECT
    datname,
    numbackends AS active_connections,
    xact_commit,
    xact_rollback,
    round(xact_rollback::numeric / nullif(xact_commit + xact_rollback, 0) * 100, 2) AS rollback_pct,
    blks_read,
    blks_hit,
    round(blks_hit::numeric / nullif(blks_read + blks_hit, 0) * 100, 2) AS cache_hit_pct,
    deadlocks,
    temp_files,
    pg_size_pretty(temp_bytes) AS temp_bytes_used
FROM pg_stat_database
WHERE datname = current_database();
```

`cache_hit_pct` below 95% suggests insufficient `shared_buffers` or a workload with poor cache locality. Nonzero `temp_files` and `temp_bytes` indicate sorts and hash joins spilling to disk — increase `work_mem`. The `deadlocks` count should stay near zero; anything else warrants investigation.

## `pg_stat_statements`: Your Query Analysis Workhorse

Already covered in Chapter 11, but it bears repeating: `pg_stat_statements` is the single most important observability tool in Postgres.
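
If it isn't loaded yet, setup is one config line and one extension (details in Chapter 11):

```sql
-- Prerequisite: shared_preload_libraries = 'pg_stat_statements'
-- in postgresql.conf, followed by a server restart. Then:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```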

```sql
-- Most time-consuming queries overall
SELECT
    round(total_exec_time::numeric, 2) AS total_ms,
    calls,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(stddev_exec_time::numeric, 2) AS stddev_ms,
    round(rows::numeric / calls, 1) AS avg_rows,
    left(query, 120) AS query
FROM pg_stat_statements
WHERE calls > 50
ORDER BY total_exec_time DESC
LIMIT 20;

-- High-variance queries (sometimes fast, sometimes slow)
SELECT
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(stddev_exec_time::numeric, 2) AS stddev_ms,
    round(stddev_exec_time::numeric / nullif(mean_exec_time, 0) * 100, 2) AS cv_pct,
    calls,
    left(query, 120) AS query
FROM pg_stat_statements
WHERE calls > 100
ORDER BY cv_pct DESC
LIMIT 20;

-- Queries with the most cache misses (lots of disk reads)
SELECT
    round(shared_blks_read::numeric / calls, 1) AS avg_disk_reads,
    round(shared_blks_hit::numeric / calls, 1) AS avg_cache_hits,
    calls,
    left(query, 120) AS query
FROM pg_stat_statements
WHERE calls > 10
ORDER BY avg_disk_reads DESC
LIMIT 20;
```

Reset statistics to get a fresh baseline:

```sql
SELECT pg_stat_reset();            -- Reset the current database's statistics
SELECT pg_stat_statements_reset(); -- Reset pg_stat_statements only
```

## EXPLAIN ANALYZE: Query-Level Diagnostics

`EXPLAIN (ANALYZE, BUFFERS)` is the microscope for individual queries. Know how to read it.

```sql
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT TEXT)
SELECT u.id, u.email, count(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > now() - interval '30 days'
GROUP BY u.id, u.email
ORDER BY order_count DESC
LIMIT 20;
```

Example output breakdown:

```
Sort (cost=14523.00..14523.05 rows=20 width=48) (actual time=234.1..234.1 rows=20 loops=1)
  Sort Key: (count(o.id)) DESC
  Sort Method: top-N heapsort  Memory: 26kB
  Buffers: shared hit=8234 read=126
  ->  HashAggregate (cost=14500.00..14520.00 rows=2000 width=48) (actual time=231.4..234.0 rows=1523 loops=1)
        Group Key: u.id, u.email
        Batches: 1  Memory Usage: 369kB
        Buffers: shared hit=8234 read=126
        ->  Hash Left Join (cost=... rows=...) (actual time=... loops=1)
              Hash Cond: (o.user_id = u.id)
              Buffers: shared hit=8234 read=126
              ->  Seq Scan on orders (cost=0.00..5280.00 rows=...) (actual time=... loops=1)
                    Buffers: shared hit=2100 read=80
              ->  Hash (cost=...) (actual time=... loops=1)
                    Buckets: 2048  Batches: 1  Memory Usage: 121kB
                    Buffers: shared hit=6134 read=46
                    ->  Index Scan using idx_users_created on users (cost=...) (actual time=...)
                          Index Cond: (created_at > ...)
                          Buffers: shared hit=6134 read=46
Planning Time: 0.8 ms
Execution Time: 234.2 ms
```

Key patterns to recognize:

**Seq Scan on a large table:** The query is reading the whole table. Look for missing indexes.

**Nested Loop with a large outer side:**

```
Nested Loop (cost=... rows=100000)
  ->  Seq Scan on large_table (rows=100000)
  ->  Index Scan on other_table (rows=1)
```

This is `100000 × 1` index lookups. Fine if the outer side is small; expensive if it's large.

**Row estimate and actual significantly different:**

```
Seq Scan on orders (cost=... rows=100 ...) (actual ... rows=95000 ...)
```

The planner thought there'd be 100 rows but found 95,000 — bad statistics. Run `ANALYZE orders`.
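
A sketch of the fix; the `status` column is purely illustrative, standing in for whatever skewed column is being misestimated:

```sql
ANALYZE orders;  -- refresh the planner's statistics for this table

-- If estimates stay bad on a skewed column, sample it in more detail:
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;  -- default 100
ANALYZE orders;
```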

**`Buffers: read` is high:** Pages are being read from disk, not the buffer cache. Either the data isn't hot in the cache, or `shared_buffers` is too small for this workload.

**`Sort Method: external merge`:** The sort spilled to disk. Increase `work_mem` for this type of query.
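
`work_mem` is allocated per sort or hash node per backend, so raise it for the offending query rather than globally; a sketch:

```sql
BEGIN;
SET LOCAL work_mem = '256MB';  -- applies to this transaction only
-- ... run the reporting query that was spilling ...
COMMIT;
```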

## Slow Query Logging

Configure Postgres to log queries that exceed a duration threshold:

```ini
log_min_duration_statement = 1000   # Log queries > 1 second
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
```

This generates log entries for slow queries, including the full query text, duration, and context. Feed these logs to pgBadger for analysis.

**Careful with `log_statement = 'all'`:** This logs every statement and generates enormous log volumes on a busy system. Use `log_min_duration_statement` instead.
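
For there to be log files to analyze at all, the logging collector needs to be on. A common baseline, assuming default file locations (packaging varies by distro):

```ini
logging_collector = on
log_directory = 'log'                # relative to the data directory
log_filename = 'postgresql-%a.log'   # one file per weekday
log_truncate_on_rotation = on        # recycle last week's files
log_rotation_age = 1d
```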

## pgBadger: Log Analysis

pgBadger is a Perl script that analyzes Postgres log files and generates HTML reports showing:

- Slowest queries (sorted by count, total time, average time)
- Most frequent queries
- Queries per second over time
- Connections, errors, and lock waits over time
- Wait event statistics

```bash
pgbadger /var/log/postgresql/postgresql.log -o report.html
```

Run pgBadger daily on your log files and review the report. The "top 10 slowest queries by total time" section is where you start performance investigations.
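
A fuller invocation, as a sketch: `-j` parallelizes parsing, and `--prefix` should match the `log_line_prefix` configured earlier (paths here are illustrative):

```bash
pgbadger -j 4 \
  --prefix '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ' \
  -o /var/reports/pgbadger-$(date +%F).html \
  /var/log/postgresql/postgresql-*.log
```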

## Prometheus Integration: `postgres_exporter`

For infrastructure-level metrics and alerting, `postgres_exporter` (from the Prometheus community) exposes Postgres metrics in Prometheus format:

```yaml
# docker-compose.yml
services:
  postgres_exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://exporter:password@postgres:5432/mydb?sslmode=disable"
    ports:
      - "9187:9187"
```
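
The `exporter` user in that DSN needs read access to the statistics views. A minimal sketch using the built-in `pg_monitor` role (Postgres 10+):

```sql
CREATE ROLE exporter LOGIN PASSWORD 'change-me';  -- use a real secret
GRANT pg_monitor TO exporter;  -- read access to pg_stat_* and friends
```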

Key metrics exported:

- `pg_stat_database_blks_hit` / `pg_stat_database_blks_read` (cache hit ratio)
- `pg_stat_user_tables_n_dead_tup` (bloat)
- `pg_stat_replication_*` (replication lag)
- `pg_stat_bgwriter_*` (checkpoint and writer stats)
- `pg_locks_count` (lock contention)
- `pg_stat_activity_count` (active connections)

**Essential Grafana dashboards:**

- PostgreSQL Database (by Percona or the official Postgres exporter)
- Dashboard ID 9628 or 12273 on grafana.com

Set up alerts for the following (one example rule is sketched after the list):

- Cache hit ratio < 95%
- Replication lag > 60 seconds
- Dead tuple percentage > 20% on any table
- Number of `idle in transaction` connections > 5
- Checkpoint requested count growing (WAL pressure)
- Total connections > 80% of `max_connections`
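
As one worked example, the cache-hit alert as a Prometheus alerting rule. A sketch: the metric names match `postgres_exporter`'s default collectors, but verify them against your exporter version:

```yaml
groups:
  - name: postgres
    rules:
      - alert: PostgresCacheHitRatioLow
        expr: |
          sum by (datname) (rate(pg_stat_database_blks_hit[5m]))
            / (  sum by (datname) (rate(pg_stat_database_blks_hit[5m]))
               + sum by (datname) (rate(pg_stat_database_blks_read[5m]))) < 0.95
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit ratio under 95% on {{ $labels.datname }}"
```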

## Wait Event Analysis

When Postgres is slow and you're not sure why, wait events tell you where time is being spent:

```sql
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2
ORDER BY count(*) DESC;
```

(A NULL `wait_event` on an active backend means it's actually running on CPU, not waiting.) What the common wait events mean:

- **`Lock/relation`**: Table-level lock contention — migrations, VACUUM FULL, or a high DDL rate
- **`Lock/tuple`**: Row-level lock contention — hot rows being updated by many connections
- **`LWLock/buffer_content`**: Shared buffer access contention — many connections reading/writing the same buffer
- **`IO/DataFileRead`**: Disk reads — cache misses, large sequential scans
- **`IO/WALWrite`**: WAL write overhead — heavy write workload
- **`Client/ClientRead`**: Waiting for the application — it's slow sending the next query

Wait events give you a diagnostic shortcut: instead of guessing why queries are slow, you can observe what they're blocked on and fix that specifically.
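
When the profile is dominated by `Lock` waits, the next question is who holds the lock. A sketch using `pg_blocking_pids()` (Postgres 9.6+):

```sql
-- For each blocked backend, show the sessions blocking it
SELECT
    blocked.pid              AS blocked_pid,
    left(blocked.query, 80)  AS blocked_query,
    blocking.pid             AS blocking_pid,
    blocking.state           AS blocking_state,
    left(blocking.query, 80) AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
```

A blocker sitting in `idle in transaction` here closes the loop back to the `pg_stat_activity` alert from earlier in the chapter.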

## The Observability Stack

A complete observability setup for a production Postgres instance:

```
Postgres logs → pgBadger → daily HTML report
Postgres metrics → postgres_exporter → Prometheus → Grafana dashboards
pg_stat_statements → periodic snapshot → trend analysis
pg_stat_activity → alerting → PagerDuty/Slack (on blocked queries, lock waits)
EXPLAIN ANALYZE → when investigating specific slow queries
```

This is not complex to set up, and the payoff is enormous. The difference between a team that's reactive to database performance issues and one that catches them early is almost always whether they have this observability stack in place.

The best observability investment: add `pg_stat_statements` to `shared_preload_libraries`, set up `postgres_exporter` with Prometheus and Grafana, and configure `log_min_duration_statement = 1000`. These three steps give you 80% of the visibility you need, and they can be done in an afternoon.
