# Performance Tuning

A default Postgres installation is configured conservatively — appropriate for a development machine or a small VM, but leaving significant performance on the table for a production server. This chapter covers the configuration parameters that matter, the autovacuum behavior you need to understand, connection pooling as a force multiplier, and the query optimization habits that prevent most performance problems.

Postgres performance tuning is iterative. There is no single configuration change that makes everything fast. The process is: measure, identify bottlenecks, tune, measure again.

## Memory Configuration

### `shared_buffers`

The most important memory setting. Controls the size of Postgres's shared buffer cache — the pool of memory that all backends share for caching data pages.

**Default:** 128MB (egregiously low for production)

**Recommendation:** ~25% of available RAM

```ini
shared_buffers = 8GB # on a 32GB server
```

Setting `shared_buffers` higher than 25% of RAM has diminishing returns because the OS page cache also caches frequently-read data. Going above 40% of RAM can actually hurt performance by reducing the OS page cache.
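
To judge whether the cache is sized sensibly, check the buffer cache hit ratio. A minimal check using `pg_stat_database` (note that `blks_read` counts blocks not found in `shared_buffers`, even when the OS page cache served them):

```sql
-- Shared buffer cache hit ratio per database; most OLTP workloads should be well above 99%
SELECT datname,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0) * 100, 2) AS cache_hit_pct
FROM pg_stat_database
WHERE datname IS NOT NULL
ORDER BY cache_hit_pct;
```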

### `effective_cache_size`

Not a memory allocation — it's a hint to the query planner about how much total memory (shared_buffers + OS cache) is available for caching. The planner uses this to decide between index scans and sequential scans.

**Recommendation:** 50–75% of total RAM

```ini
effective_cache_size = 24GB # on a 32GB server
```

If this is set too low, the planner assumes little data is cached, overestimates the cost of the random I/O an index scan would require, and falls back to sequential scans when an index would be faster.

### `work_mem`

Memory available per sort or hash operation. Each sort, hash join, and hash aggregate can use up to `work_mem`. A query with multiple operations can use `work_mem` multiple times. With many concurrent connections, total memory usage can be `work_mem × connections × operations_per_query`.

**Default:** 4MB (too low for complex queries)

**Recommendation:** Balance between query complexity and connection count. A formula:
```
work_mem = (RAM - shared_buffers) / (max_connections × 2)
```

For a 32GB server with 8GB shared_buffers and 100 max_connections:
```
work_mem = (32 - 8) / (100 × 2) ≈ 120MB
```

Setting `work_mem` too high with many connections causes OOM. Set conservatively globally and increase per-session for complex analytical queries:

```sql
SET work_mem = '256MB';
SELECT ... ORDER BY ... LIMIT ...; -- benefits from higher work_mem
RESET work_mem;
```

Watch for `Sort Method: external merge` in `EXPLAIN ANALYZE` output — it indicates the sort spilled to disk because `work_mem` was too low.
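
A sketch of what to look for, assuming an `orders` table big enough that the sort spills (the sizes shown are illustrative):

```sql
EXPLAIN (ANALYZE) SELECT * FROM orders ORDER BY created_at;
-- In the plan, check the sort method line:
--   Sort Method: external merge  Disk: 146312kB   -- spilled to disk; raise work_mem
-- versus an in-memory sort:
--   Sort Method: quicksort  Memory: 25kB
```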

### `maintenance_work_mem`

Memory for maintenance operations: `VACUUM`, `CREATE INDEX`, `ALTER TABLE`, etc.

**Recommendation:** 1GB or more for systems with large tables. This speeds up index creation dramatically.

```ini
maintenance_work_mem = 2GB
```

### `max_wal_size`

Sets a soft limit on how much WAL can accumulate before a checkpoint is forced. Too small causes frequent checkpoints, generating I/O spikes. Too large increases recovery time after a crash.

**Recommendation:** 2–4GB for most systems, higher for write-heavy workloads.

```ini
max_wal_size = 4GB
```

Watch `pg_stat_bgwriter.checkpoints_req` (moved to `pg_stat_checkpointer` in Postgres 17) — if it is high relative to `checkpoints_timed`, checkpoints are being triggered by WAL size rather than time. Increase `max_wal_size`.
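
A minimal check (column names shown are for Postgres 16 and earlier):

```sql
-- If checkpoints_req dominates, WAL volume is forcing checkpoints; raise max_wal_size.
SELECT checkpoints_timed, checkpoints_req
FROM pg_stat_bgwriter;
```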

## Checkpoint Configuration

Checkpoints flush dirty pages from the buffer cache to disk. A checkpoint whose writes are compressed into a short burst causes I/O spikes (all the dirty pages written at once). Checkpoints that happen too infrequently mean more WAL to replay and longer recovery after a crash.

```ini
checkpoint_completion_target = 0.9 # Spread I/O over 90% of checkpoint interval
checkpoint_timeout = 15min # Maximum time between checkpoints
```

`checkpoint_completion_target = 0.9` (the default in recent Postgres versions) is good. It tells Postgres to spread checkpoint writes over 90% of the interval between checkpoints, smoothing I/O.

## WAL Settings

```ini
wal_level = replica # Minimum for streaming replication
wal_compression = on # Compress WAL records (reduces WAL volume)
wal_buffers = 16MB # WAL write buffer (usually auto-configured from shared_buffers)
synchronous_commit = on # Don't change this unless you understand the trade-off
```

`synchronous_commit = off` gives a performance boost (no fsync on commit) at the cost of potentially losing the last few seconds of committed transactions on crash. This is acceptable for non-critical data (analytics events, logs, rate limit counters). Never set it off globally for transactional data.
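
Because it can be changed per session or per transaction, a common pattern is to turn it off only for writes that can tolerate loss. A sketch, assuming a hypothetical `analytics_events` table:

```sql
BEGIN;
-- Applies to this transaction only; a crash could lose this insert,
-- but it will never be half-applied or corrupt anything.
SET LOCAL synchronous_commit = off;
INSERT INTO analytics_events (event_name, occurred_at)
VALUES ('page_view', now());
COMMIT;
```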

## Autovacuum: The Most Misunderstood Setting

Autovacuum is the background process that reclaims dead tuple space, updates table statistics, and prevents XID wraparound. It is not optional. Disabling autovacuum is not a performance optimization — it's setting up a future disaster.

The most important autovacuum settings:

### `autovacuum_vacuum_scale_factor` and `autovacuum_vacuum_threshold`

These control when autovacuum triggers on a table. The formula:
```
threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
```

**Defaults:**
- `autovacuum_vacuum_threshold = 50` (absolute minimum dead tuples)
- `autovacuum_vacuum_scale_factor = 0.2` (20% of table)

For a table with 10 million rows, autovacuum triggers when there are 2,000,050 dead tuples. For a frequently-updated table, this means autovacuum runs rarely, and dead tuples accumulate heavily before cleanup.

**Recommendation for large tables:** Lower the scale factor significantly, or set per-table thresholds:

```sql
-- Per-table: vacuum after 1% dead tuples (instead of 20%)
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold = 1000,
  autovacuum_analyze_scale_factor = 0.005,
  autovacuum_analyze_threshold = 500
);
```

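To confirm which tables carry per-table overrides, the storage parameters are visible in `pg_class.reloptions`:

```sql
-- List tables with custom autovacuum (or other) storage parameters
SELECT c.relname, c.reloptions
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.reloptions IS NOT NULL
  AND n.nspname NOT IN ('pg_catalog', 'information_schema');
```
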
### `autovacuum_max_workers`

**Default:** 3. The number of autovacuum workers that can run simultaneously. For a system with many large, frequently-updated tables, the default is often insufficient.

```ini
autovacuum_max_workers = 6
```

More workers means more CPU and I/O, but tables stay cleaner and queries stay fast.
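
To see what the workers are doing at any given moment (a quick check, not a monitoring solution):

```sql
-- Currently running autovacuum workers and the tables they are processing
SELECT pid, state, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';
```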

### `autovacuum_vacuum_cost_delay` and `autovacuum_vacuum_cost_limit`

Autovacuum throttles itself using a cost-based mechanism to avoid overwhelming I/O. Each buffer hit, page read, and page dirtied accrues a cost. When the accumulated cost reaches `autovacuum_vacuum_cost_limit`, autovacuum sleeps for `autovacuum_vacuum_cost_delay` milliseconds.

**Defaults:**
- `autovacuum_vacuum_cost_delay = 2ms` (recent versions)
- `autovacuum_vacuum_cost_limit = -1` (falls back to `vacuum_cost_limit`, which defaults to 200)

For SSDs, autovacuum can be much more aggressive. The default throttling is designed for spinning disks:

```ini
# For NVMe SSDs:
autovacuum_vacuum_cost_delay = 2ms
autovacuum_vacuum_cost_limit = 2000 # 10x more aggressive
```

### Monitoring Autovacuum

```sql
-- See when tables were last vacuumed and their bloat
SELECT
  schemaname,
  relname,
  n_live_tup,
  n_dead_tup,
  round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
  last_vacuum,
  last_autovacuum,
  last_analyze,
  last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Tables approaching the autovacuum threshold, using the global defaults
-- (per-table reloptions overrides are not accounted for here)
SELECT
  s.schemaname,
  s.relname,
  s.n_dead_tup,
  s.n_live_tup,
  current_setting('autovacuum_vacuum_threshold')::bigint
    + current_setting('autovacuum_vacuum_scale_factor')::numeric * c.reltuples::numeric
    AS vacuum_threshold
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY s.n_dead_tup DESC;
```

## Connection Pooling with PgBouncer

Postgres connections are expensive: each backend process uses ~5-10MB of RAM and involves OS process creation overhead. Applications that open many short-lived connections — serverless functions, high-concurrency APIs — can overwhelm Postgres's connection capacity.

PgBouncer is a lightweight connection pooler that sits between your application and Postgres, multiplexing many application connections onto a smaller number of Postgres connections.

### Pool Modes

**Session mode:** A server connection is assigned to a client for the duration of the client session. No query-level multiplexing. Only useful for reducing connection overhead (not connection count).

**Transaction mode:** A server connection is assigned for the duration of each transaction. After the transaction commits or rolls back, the connection returns to the pool. This is the most useful mode — a pool of 50 server connections can handle thousands of concurrent application connections.

**Statement mode:** A server connection is assigned for a single statement, then released. Most restrictive — prepared statements, `SET` commands, and transactions spanning multiple statements don't work.

For most applications, **transaction mode** is the right choice.

### PgBouncer Configuration

```ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432

pool_mode = transaction

# Server connections
max_client_conn = 10000 # Allow many application connections
default_pool_size = 25 # Postgres sees max 25 connections from this app
min_pool_size = 5
reserve_pool_size = 5

# Authentication
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

# Timeouts
server_idle_timeout = 600
client_idle_timeout = 0
query_timeout = 0

# Logging
log_connections = 0
log_disconnections = 0
```
**What doesn't work in transaction mode:** session-level `SET` (`SET LOCAL` inside a transaction is fine), advisory locks held across transactions, `LISTEN`/`NOTIFY`, server-side cursors held open outside a transaction, and prepared statements (unless you enable PgBouncer's prepared-statement tracking via `max_prepared_statements`). Know these limitations before adopting.
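
The session-level `SET` pitfall is the one that bites most often. A sketch of the difference (the timeout value and the `orders` update are illustrative):

```sql
-- Unsafe under transaction pooling: the next transaction may run on a
-- different server connection that never saw this SET.
SET statement_timeout = '5s';

-- Safe: SET LOCAL is scoped to the transaction, and a transaction stays
-- pinned to one server connection until COMMIT or ROLLBACK.
BEGIN;
SET LOCAL statement_timeout = '5s';
UPDATE orders SET status = 'cancelled' WHERE id = 42;
COMMIT;
```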

### How Many Postgres Connections?

The optimal number of Postgres server connections for throughput:

```
optimal_connections ≈ CPU_count * 2 + effective_spindle_count
```

For a 4-core server with an NVMe SSD:
```
optimal_connections ≈ 4 * 2 + 1 ≈ 9
```

This seems shockingly low but is supported by benchmarks. More connections than CPUs means context-switching overhead and lock contention outweigh concurrency benefits. For most servers, 20-50 Postgres connections is sufficient for high throughput. PgBouncer's job is to make 1000 application clients share those efficiently.

## `max_connections`

**Default:** 100. Set this based on what you can afford in terms of memory, not what you hope to use.

Each connection uses about 5-10MB of RAM. For a server with 32GB RAM and 8GB `shared_buffers`, the remaining 24GB can support roughly 2,400-4,800 connections. But you don't want that many — use PgBouncer instead and keep `max_connections` low.

```ini
max_connections = 200 # With PgBouncer handling the fan-out
```

## Query Performance: The EXPLAIN Habit

No amount of server configuration compensates for missing indexes or bad queries. The most impactful performance work is query-level.

The diagnostic workflow:

1. **Find slow queries** via `pg_stat_statements`:
```sql
SELECT query, calls, total_exec_time / calls AS avg_ms, rows / calls AS avg_rows
FROM pg_stat_statements
WHERE calls > 100
ORDER BY avg_ms DESC
LIMIT 20;
```

2. **Explain the slow query:**
```sql
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.created_at > now() - interval '7 days'
  AND o.status = 'pending';
```

3. **Look for:**
   - Sequential scans on large tables → need an index
   - Nested loop with large outer relation → bad join choice, likely bad statistics
   - High `Buffers: read` → cache miss, data not in buffer
   - `rows=X (actual rows=Y)` with large discrepancy → bad statistics, run `ANALYZE`
   - Sort operations with "external sort" → increase `work_mem`
   - Filter: `(Rows Removed by Filter: N)` with large N → index doesn't exist or isn't selective enough

4. **Fix the issue** (add an index, update statistics, or rewrite the query; see the index sketch below)

5. **Verify improvement** (run `EXPLAIN ANALYZE` again)

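For the example query in step 2, the typical fix would be an index that matches its predicates. A sketch (a partial index on `orders`, assuming pending orders are a small fraction of rows):

```sql
-- Serves "recent pending orders" lookups without indexing the whole table.
-- CONCURRENTLY avoids blocking writes, but cannot run inside a transaction.
CREATE INDEX CONCURRENTLY idx_orders_pending_created_at
    ON orders (created_at)
    WHERE status = 'pending';
```
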
### Statistics and `ANALYZE`

Run `ANALYZE` after large bulk loads to update table statistics before the planner has to guess:

```sql
ANALYZE orders; -- Update statistics for one table
ANALYZE; -- Update statistics for all tables in current database
```

For columns with high cardinality or very uneven distributions, increase per-column statistics:

```sql
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
```

### Parallel Query

Postgres can parallelize query execution across multiple CPU cores for sequential scans and some aggregations. This is controlled by `max_parallel_workers_per_gather`.

```ini
max_parallel_workers_per_gather = 4 # Up to 4 workers per parallel query
max_parallel_workers = 8 # Total parallel workers (across all queries)
max_worker_processes = 16 # Total background workers
```

Parallel query is automatic — the planner decides when to use it. It helps large analytical queries; it doesn't help indexed lookups.
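
To confirm a query actually got a parallel plan, look for `Gather` and `Parallel Seq Scan` nodes in `EXPLAIN`. Roughly (the plan shape is illustrative, assuming a large `events` table):

```sql
EXPLAIN SELECT count(*) FROM events;
--  Finalize Aggregate
--    ->  Gather
--          Workers Planned: 4
--          ->  Partial Aggregate
--                ->  Parallel Seq Scan on events
```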

## Partitioning for Performance

As covered in Chapter 3, time-partitioned tables with recent-data access patterns benefit enormously from partition pruning:

```sql
-- This only scans the relevant monthly partition:
SELECT * FROM events
WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01';
```

For queries that always filter on the partition key, partitioning is a very effective performance strategy for large tables.

## Identifying Bloat

Index and table bloat degrades performance and wastes disk space. A quick check:

```sql
-- Find tables with significant dead tuple bloat
SELECT schemaname, relname, n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;

-- pgstattuple for precise bloat measurement (requires extension)
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('orders');
```

For severe bloat that `autovacuum` can't reclaim (e.g., after a massive delete), consider `VACUUM FULL` (rewrites the table under an `ACCESS EXCLUSIVE` lock) or `pg_repack` (online, with only brief locks).
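
A minimal sketch of the heavyweight option; run it in a maintenance window, because reads and writes on the table block until it finishes:

```sql
-- Rewrites the table (and its indexes) to reclaim space, then refreshes statistics.
VACUUM (FULL, ANALYZE, VERBOSE) orders;
```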

## Configuration Validation Tools

**pgtune** (pgtune.leopard.in.ua) generates a `postgresql.conf` based on your hardware profile. It's a good starting point.

**pgBadger** analyzes Postgres log files and produces detailed reports on slow queries, wait events, and error patterns.

**check_postgres** is a Nagios/Icinga monitoring plugin that checks connection counts, bloat, vacuum age, and many other indicators.

The key principle: Postgres's default configuration is a safe minimum, not a target. Every production Postgres instance should be tuned for its workload and hardware. The changes in this chapter — higher `shared_buffers`, lower autovacuum scale factors, PgBouncer in transaction mode — produce immediate, measurable improvements on almost every system.