Commit acdf394 (parent 022158c), committed by DavidLiedle and claude

Add Chapter 13: Performance Tuning

Covers memory configuration (shared_buffers, work_mem, effective_cache_size), checkpoint and WAL settings, autovacuum tuning, PgBouncer pool modes, optimal connection counts, the EXPLAIN diagnostic habit, statistics, parallel query, bloat detection, and configuration tools.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1 file changed: src/ch13-performance-tuning.md (380 additions, 0 deletions)
# Performance Tuning

A default Postgres installation is configured conservatively — appropriate for a development machine or a small VM, but leaving significant performance on the table for a production server. This chapter covers the configuration parameters that matter, the autovacuum behavior you need to understand, connection pooling as a force multiplier, and the query optimization habits that prevent most performance problems.

Postgres performance tuning is iterative. There is no single configuration change that makes everything fast. The process is: measure, identify bottlenecks, tune, measure again.
## Memory Configuration

### `shared_buffers`

The most important memory setting. Controls the size of Postgres's shared buffer cache — the pool of memory that all backends share for caching data pages.

**Default:** 128MB (egregiously low for production)

**Recommendation:** ~25% of available RAM

```ini
shared_buffers = 8GB    # on a 32GB server
```

Setting `shared_buffers` higher than 25% of RAM has diminishing returns because the OS page cache also caches frequently-read data. Going above 40% of RAM can actually hurt performance by reducing the OS page cache.
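One way to sanity-check `shared_buffers` sizing is the buffer cache hit ratio from `pg_stat_database`. As a rough heuristic, a healthy OLTP workload stays above roughly 99%; a persistently low ratio suggests the working set doesn't fit in memory:

```sql
-- Buffer cache hit ratio for the current database (higher is better)
SELECT
    blks_hit,
    blks_read,
    round(blks_hit::numeric / nullif(blks_hit + blks_read, 0) * 100, 2) AS hit_pct
FROM pg_stat_database
WHERE datname = current_database();
```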
### `effective_cache_size`

Not a memory allocation — it's a hint to the query planner about how much total memory (shared_buffers + OS cache) is available for caching. The planner uses this to decide between index scans and sequential scans.

**Recommendation:** 50–75% of total RAM

```ini
effective_cache_size = 24GB    # on a 32GB server
```

If this is set too low, the planner assumes little data is cached, overestimates the cost of the random I/O behind index access, and prefers sequential scans when it should use indexes.
### `work_mem`

Memory available per sort or hash operation. Each sort, hash join, and hash aggregate can use up to `work_mem`. A query with multiple operations can use `work_mem` multiple times. With many concurrent connections, total memory usage can be `work_mem × connections × operations_per_query`.

**Default:** 4MB (too low for complex queries)

**Recommendation:** Balance between query complexity and connection count. A starting formula:

```
work_mem = (RAM - shared_buffers) / (max_connections × 2)
```

For a 32GB server with 8GB shared_buffers and 100 max_connections:

```
work_mem = (32GB - 8GB) / (100 × 2) ≈ 120MB
```

Setting `work_mem` too high with many connections causes OOM kills. Set it conservatively globally and increase it per-session for complex analytical queries:

```sql
SET work_mem = '256MB';
SELECT ... ORDER BY ... LIMIT ...; -- benefits from higher work_mem
RESET work_mem;
```

Watch for `Sort Method: external merge` in `EXPLAIN ANALYZE` output — it means the sort spilled to disk because `work_mem` was too low.
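If you run behind a transaction-mode pooler (covered later in this chapter), a plain `SET` can leak onto whichever client the pooler hands the connection to next. `SET LOCAL` inside a transaction is a safer sketch; the `reports` aggregation here is a hypothetical example:

```sql
BEGIN;
SET LOCAL work_mem = '256MB';   -- reverts automatically at COMMIT/ROLLBACK
SELECT user_id, sum(amount)     -- hypothetical large sort/aggregate
FROM reports
GROUP BY user_id
ORDER BY sum(amount) DESC;
COMMIT;
```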
### `maintenance_work_mem`

Memory for maintenance operations: `VACUUM`, `CREATE INDEX`, `ALTER TABLE`, etc.

**Recommendation:** 1GB or more for systems with large tables. This speeds up index creation dramatically.

```ini
maintenance_work_mem = 2GB
```
### `max_wal_size`

Controls how much WAL can accumulate before a checkpoint is forced. Too small causes frequent checkpoints, generating I/O spikes. Too large increases recovery time after a crash.

**Recommendation:** 2–4GB for most systems, higher for write-heavy workloads.

```ini
max_wal_size = 4GB
```

Watch `pg_stat_bgwriter.checkpoints_req` — if it is high relative to `checkpoints_timed`, checkpoints are being triggered by WAL volume rather than by the timer. Increase `max_wal_size`.
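A quick check of how checkpoints are being triggered (note that on Postgres 17 and later these counters moved to `pg_stat_checkpointer`, as `num_timed` and `num_requested`):

```sql
-- Timed vs. WAL-size-requested checkpoints since the last stats reset
SELECT checkpoints_timed, checkpoints_req
FROM pg_stat_bgwriter;
```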
## Checkpoint Configuration

Checkpoints flush dirty pages from the buffer cache to disk. A checkpoint that completes in a burst causes I/O spikes (all the dirty pages written at once). Checkpoints spaced too far apart cause long recovery times after a crash, because more WAL must be replayed.

```ini
checkpoint_completion_target = 0.9 # Spread I/O over 90% of checkpoint interval
checkpoint_timeout = 15min # Maximum time between checkpoints
```

`checkpoint_completion_target = 0.9` (the default in recent Postgres versions) is good. It tells Postgres to spread checkpoint writes over 90% of the interval between checkpoints, smoothing I/O.
## WAL Settings

```ini
wal_level = replica # Minimum for streaming replication
wal_compression = on # Compress WAL records (reduces WAL volume)
wal_buffers = 16MB # WAL write buffer (usually auto-configured from shared_buffers)
synchronous_commit = on # Don't change this unless you understand the trade-off
```

`synchronous_commit = off` gives a performance boost (commits don't wait for the WAL flush) at the cost of potentially losing the last few seconds of committed transactions on crash. This is acceptable for non-critical data (analytics events, logs, rate limit counters). Never set it off globally for transactional data.
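Because `synchronous_commit` can be changed per session or per transaction, you can keep the safe global default and relax it only where losing the last few seconds is tolerable. The `analytics_events` table below is a hypothetical example:

```sql
-- In the session that writes non-critical event rows:
SET synchronous_commit = off;
INSERT INTO analytics_events (name, payload) VALUES ('page_view', '{}');

-- Or scoped to a single transaction:
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO analytics_events (name, payload) VALUES ('click', '{}');
COMMIT;
```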
## Autovacuum: The Most Misunderstood Setting

Autovacuum is the background process that reclaims dead tuple space, updates table statistics, and prevents transaction ID (XID) wraparound. It is not optional. Disabling autovacuum is not a performance optimization — it's setting up a future disaster.

The most important autovacuum settings:

### `autovacuum_vacuum_scale_factor` and `autovacuum_vacuum_threshold`

These control when autovacuum triggers on a table. The formula:

```
threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
```

**Defaults:**
- `autovacuum_vacuum_threshold = 50` (absolute minimum dead tuples)
- `autovacuum_vacuum_scale_factor = 0.2` (20% of table)

For a table with 10 million rows, autovacuum triggers only once there are 2,000,050 dead tuples. For a frequently-updated table, this means autovacuum runs rarely, and dead tuples accumulate heavily before cleanup.
**Recommendation for large tables:** Lower the scale factor significantly, or set per-table thresholds:

```sql
-- Per-table: vacuum after 1% dead tuples (instead of 20%)
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold = 1000,
    autovacuum_analyze_scale_factor = 0.005,
    autovacuum_analyze_threshold = 500
);
```
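To see which tables carry per-table overrides like these, inspect `reloptions` in `pg_class`:

```sql
-- Tables with non-default storage/autovacuum options
SELECT relname, reloptions
FROM pg_class
WHERE reloptions IS NOT NULL
  AND relkind = 'r';
```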
### `autovacuum_max_workers`

**Default:** 3. The number of autovacuum workers that can run simultaneously. For a system with many large, frequently-updated tables, the default is often insufficient. Note that changing it requires a server restart.

```ini
autovacuum_max_workers = 6
```

More workers means more CPU and I/O usage, but tables stay cleaner and queries stay fast.
### `autovacuum_vacuum_cost_delay` and `autovacuum_vacuum_cost_limit`

Autovacuum throttles itself using a cost-based mechanism to avoid overwhelming I/O. Each buffer hit, page read, and page dirtied accrues a cost. When the accumulated cost reaches `autovacuum_vacuum_cost_limit`, autovacuum sleeps for `autovacuum_vacuum_cost_delay` milliseconds.

**Defaults:**
- `autovacuum_vacuum_cost_delay = 2ms` (recent versions)
- `autovacuum_vacuum_cost_limit = -1` (falls back to `vacuum_cost_limit`, default 200)

For SSDs, autovacuum can be much more aggressive. The default throttling is designed for spinning disks:

```ini
# For NVMe SSDs:
autovacuum_vacuum_cost_delay = 2ms
autovacuum_vacuum_cost_limit = 2000 # 10x more aggressive
```
### Monitoring Autovacuum

```sql
-- See when tables were last vacuumed and their dead-tuple percentage
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Tables approaching their autovacuum threshold, honoring per-table overrides
SELECT
    s.schemaname,
    s.relname,
    s.n_dead_tup,
    s.n_live_tup,
    coalesce(opts.threshold, defaults.threshold)
      + coalesce(opts.scale_factor, defaults.scale_factor) * s.n_live_tup AS vacuum_threshold
FROM pg_stat_user_tables s
LEFT JOIN (
    SELECT oid AS relid,
           (regexp_match(array_to_string(reloptions, ','),
                         'autovacuum_vacuum_threshold=(\d+)'))[1]::bigint AS threshold,
           (regexp_match(array_to_string(reloptions, ','),
                         'autovacuum_vacuum_scale_factor=(\d+\.?\d*)'))[1]::numeric AS scale_factor
    FROM pg_class
) opts ON opts.relid = s.relid
CROSS JOIN (
    SELECT current_setting('autovacuum_vacuum_threshold')::bigint AS threshold,
           current_setting('autovacuum_vacuum_scale_factor')::numeric AS scale_factor
) defaults
ORDER BY s.n_dead_tup DESC;
```
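Long-running vacuums can also be watched live via the `pg_stat_progress_vacuum` view:

```sql
-- Progress of currently running VACUUM workers
SELECT pid, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;
```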
## Connection Pooling with PgBouncer

Postgres connections are expensive: each backend process uses roughly 5–10MB of RAM and involves OS process creation overhead. Applications that open many short-lived connections — serverless functions, high-concurrency APIs — can overwhelm Postgres's connection capacity.

PgBouncer is a lightweight connection pooler that sits between your application and Postgres, multiplexing many application connections onto a smaller number of Postgres connections.

### Pool Modes

**Session mode:** A server connection is assigned to a client for the duration of the client session. No query-level multiplexing. Only useful for reducing connection establishment overhead (not connection count).

**Transaction mode:** A server connection is assigned for the duration of each transaction. After the transaction commits or rolls back, the connection returns to the pool. This is the most useful mode — a pool of 50 server connections can handle thousands of concurrent application connections.

**Statement mode:** A server connection is assigned for a single statement, then released. Most restrictive — prepared statements, `SET` commands, and transactions spanning multiple statements don't work.

For most applications, **transaction mode** is the right choice.
### PgBouncer Configuration

```ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432

pool_mode = transaction

# Server connections
max_client_conn = 10000 # Allow many application connections
default_pool_size = 25 # Postgres sees max 25 connections from this app
min_pool_size = 5
reserve_pool_size = 5

# Authentication
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

# Timeouts
server_idle_timeout = 600
client_idle_timeout = 0
query_timeout = 0

# Logging
log_connections = 0
log_disconnections = 0
```

**What doesn't work in transaction mode:** session-level `SET` (`SET LOCAL` inside a transaction is fine), advisory locks held across transactions, `LISTEN`/`NOTIFY`, `WITH HOLD` cursors, and protocol-level prepared statements (unless you enable PgBouncer's prepared-statement tracking via `max_prepared_statements`, available in recent PgBouncer versions). Know these limitations before adopting.
### How Many Postgres Connections?

The optimal number of Postgres server connections for throughput:

```
optimal_connections ≈ CPU_count * 2 + effective_spindle_count
```

For a 4-core server with an NVMe SSD:

```
optimal_connections ≈ 4 * 2 + 1 = 9
```

This seems shockingly low but is supported by benchmarks. More connections than CPUs means context-switching overhead and lock contention outweigh concurrency benefits. For most servers, 20–50 Postgres connections are sufficient for high throughput. PgBouncer's job is to make 1,000 application clients share those efficiently.
## `max_connections`

**Default:** 100. Set this based on what you can afford in terms of memory, not what you hope to use.

Each connection uses about 5–10MB of RAM. For a server with 32GB RAM and 8GB `shared_buffers`, the remaining 24GB could in principle support roughly 2,400–4,800 connections. But you don't want that many — use PgBouncer instead and keep `max_connections` low.

```ini
max_connections = 200 # With PgBouncer handling the fan-out
```
## Query Performance: The EXPLAIN Habit

No amount of server configuration compensates for missing indexes or bad queries. The most impactful performance work is query-level.

The diagnostic workflow:

1. **Find slow queries** via `pg_stat_statements` (requires `shared_preload_libraries = 'pg_stat_statements'` and `CREATE EXTENSION pg_stat_statements`):
```sql
SELECT query, calls, total_exec_time / calls AS avg_ms, rows / calls AS avg_rows
FROM pg_stat_statements
WHERE calls > 100
ORDER BY avg_ms DESC
LIMIT 20;
```

2. **Explain the slow query:**
```sql
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.created_at > now() - interval '7 days'
  AND o.status = 'pending';
```

3. **Look for:**
- Sequential scans on large tables → need an index
- Nested loop with a large outer relation → bad join choice, likely stale statistics
- High `Buffers: read` counts → cache misses, data not in shared buffers
- `rows=X` estimates far from `actual rows=Y` → bad statistics, run `ANALYZE`
- Sorts with `Sort Method: external merge` → increase `work_mem`
- `Rows Removed by Filter: N` with large N → index doesn't exist or isn't selective enough

4. **Fix the issue** (add an index, update statistics, rewrite the query)

5. **Verify the improvement** (run `EXPLAIN ANALYZE` again)
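For the example query above, step 4 might look like this; the index name and column order are illustrative, not prescriptive:

```sql
-- Serve the status filter and the recency range from one index;
-- CONCURRENTLY avoids blocking writes while the index builds.
CREATE INDEX CONCURRENTLY idx_orders_status_created_at
    ON orders (status, created_at);
```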
### Statistics and `ANALYZE`

Run `ANALYZE` after large bulk loads to update table statistics before the planner has to guess:

```sql
ANALYZE orders; -- Update statistics for one table
ANALYZE; -- Update statistics for all tables in the current database
```

For columns with high cardinality or very uneven distributions, increase the per-column statistics target:

```sql
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
```
### Parallel Query

Postgres can parallelize query execution across multiple CPU cores for sequential scans and some aggregations. This is controlled by `max_parallel_workers_per_gather`.

```ini
max_parallel_workers_per_gather = 4 # Up to 4 workers per parallel query
max_parallel_workers = 8 # Total parallel workers (across all queries)
max_worker_processes = 16 # Total background workers
```

Parallel query is automatic — the planner decides when to use it. It helps large analytical queries; it doesn't help indexed lookups.
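You can confirm the planner chose a parallel plan by looking for `Gather` and `Workers Planned` nodes in `EXPLAIN` output; a parallel aggregate over the chapter's `events` table typically has a shape like the sketch in the comments:

```sql
EXPLAIN SELECT count(*) FROM events;
-- A parallel plan contains nodes along these lines:
--   Finalize Aggregate
--     ->  Gather (Workers Planned: 4)
--           ->  Partial Aggregate
--                 ->  Parallel Seq Scan on events
```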
## Partitioning for Performance

As covered in Chapter 3, time-partitioned tables with recent-data access patterns benefit enormously from partition pruning:

```sql
-- This only scans the relevant monthly partition:
SELECT * FROM events
WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01';
```

For queries that always filter on the partition key, partitioning is a very effective performance strategy for large tables.
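Pruning is easy to verify with `EXPLAIN`: the plan should touch only the matching partition. The partition name below is hypothetical:

```sql
EXPLAIN (COSTS OFF)
SELECT * FROM events
WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01';
-- Expect a scan on a single partition (e.g. events_2024_01)
-- rather than an Append over every partition.
```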
## Identifying Bloat

Index and table bloat degrades performance and wastes disk space. A quick check:

```sql
-- Find tables with significant dead tuple bloat
SELECT schemaname, relname, n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;

-- pgstattuple for precise bloat measurement (requires extension)
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('orders');
```

For severe bloat that autovacuum can't reclaim (e.g., after a massive delete), consider `VACUUM FULL` (takes an exclusive lock on the table) or `pg_repack` (mostly online, with only brief locks).
## Configuration Validation Tools

**pgtune** (pgtune.leopard.in.ua) generates a `postgresql.conf` based on your hardware profile. It's a good starting point.

**pgBadger** analyzes Postgres log files and produces detailed reports on slow queries, wait events, and error patterns.

**check_postgres** is a Nagios/Icinga monitoring plugin that checks connection counts, bloat, vacuum age, and many other indicators.

The key principle: Postgres's default configuration is a safe minimum, not a target. Every production Postgres instance should be tuned for its workload and hardware. The changes in this chapter — higher `shared_buffers`, lower autovacuum scale factors, PgBouncer in transaction mode — produce immediate, measurable improvements on almost every system.