Commit 5ce5a1c

DavidLiedle and claude committed
Add Chapter 14: Replication and High Availability
Covers streaming replication setup, synchronous vs. asynchronous replication, replication slots, read replicas, logical replication and CDC with Debezium, Patroni for automatic failover, client routing patterns, replication monitoring, and managed HA options.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent acdf394 commit 5ce5a1c

1 file changed

Lines changed: 290 additions & 0 deletions

src/ch14-replication-ha.md
# Replication and High Availability

A single Postgres instance, no matter how well-tuned, is a single point of failure. High availability means the database continues to serve traffic when a hardware failure, network partition, or software crash takes down the primary server. Replication is the mechanism that makes this possible.

Postgres's replication story is mature, well-understood, and capable of supporting serious production requirements — from simple read replicas to automatic failover to logical change data capture.

## Physical (Streaming) Replication

Streaming replication is the primary replication mechanism in Postgres. The standby connects to the primary, streams WAL records, and applies them to maintain a byte-for-byte copy of the primary's data.

**How it works:**

1. The primary writes all changes to WAL (Write-Ahead Log)
2. The standby's WAL receiver process connects to the primary's WAL sender process (see the check after this list)
3. WAL records stream continuously to the standby
4. The standby's startup process applies WAL records, keeping the standby's data in sync
5. The standby is read-only — queries can run against it, but no writes are allowed
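The WAL receiver described in step 2 is directly observable from SQL. A minimal check on the standby (column availability varies slightly across Postgres versions):

```sql
-- On the standby: the WAL receiver's connection back to the primary
SELECT pid, status, sender_host, sender_port
FROM pg_stat_wal_receiver;
```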

### Setting Up Streaming Replication

On the primary, in `postgresql.conf` (note that changing `wal_level` requires a restart):

```ini
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1GB   # Keep enough WAL for standbys to catch up
```
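You can confirm the settings took effect from SQL; `pg_settings` flags values that are still waiting on a restart:

```sql
-- pending_restart is true for values changed in the config file
-- but not yet applied
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('wal_level', 'max_wal_senders', 'wal_keep_size');
```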

Create a replication role:

```sql
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';
```

In `pg_hba.conf`, allow the standby to connect for replication:

```
host    replication    replicator    10.0.0.2/32    scram-sha-256
```

On the standby, clone the primary's data directory (typically with `pg_basebackup -R`, which writes `primary_conninfo` and creates `standby.signal` for you), or create a `standby.signal` file yourself and configure `postgresql.conf`:

```ini
primary_conninfo = 'host=10.0.0.1 port=5432 user=replicator password=secure_password'
restore_command = ''   # Only needed for WAL archiving
```

The standby automatically starts replicating when it finds `standby.signal` in the data directory.

Verify replication is working on the primary:

```sql
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
```

Replication lag is shown in `replay_lag` — the time difference between when the primary committed a transaction and when the standby applied it.
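Lag can also be measured in bytes of WAL rather than time, which is often steadier to alert on. A small sketch using `pg_wal_lsn_diff`:

```sql
-- Bytes of WAL each standby has yet to replay
SELECT client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```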

### Synchronous vs. Asynchronous Replication

**Asynchronous replication (default):** The primary commits and returns to the client without waiting for the standby to acknowledge. The standby eventually applies the changes. Replication lag is typically milliseconds to seconds. If the primary crashes, recent transactions may not have reached the standby — *replication lag = potential data loss*.

**Synchronous replication:** The primary waits for one or more standbys to acknowledge writing the WAL before committing. Zero data loss on primary failure, at the cost of added commit latency (every commit waits for a network round-trip to the standby).

Configure synchronous replication on the primary:

```ini
synchronous_standby_names = 'standby1'   # A specific standby by name
# Or: 'ANY 1 (standby1, standby2)'       # Any one of multiple standbys
```

On the standby, set `application_name` in `primary_conninfo`:

```ini
primary_conninfo = 'host=primary port=5432 user=replicator application_name=standby1'
```

For most applications, asynchronous replication with a well-monitored replica is sufficient. The typical replication lag of milliseconds means the RPO (Recovery Point Objective) is measured in seconds, not minutes.
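The guarantee is also tunable per transaction via `synchronous_commit`, so you can pay the synchronous price only for writes that deserve it. A sketch (the `page_views` table is hypothetical):

```sql
-- With synchronous replication configured, opt low-value writes out:
BEGIN;
SET LOCAL synchronous_commit = 'local';  -- commit without waiting for the standby
INSERT INTO page_views (path) VALUES ('/home');  -- hypothetical table
COMMIT;

-- Or wait until the standby has applied (not just flushed) the WAL,
-- so the row is immediately visible to reads on the replica:
SET synchronous_commit = 'remote_apply';
```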

### Replication Slots

A replication slot guarantees that the primary retains WAL segments until all subscribers have consumed them. Without a slot, if a standby falls behind (network outage, maintenance), the primary might delete WAL that the standby still needs, requiring a full resync.

```sql
-- Create a physical replication slot
SELECT pg_create_physical_replication_slot('standby1_slot');

-- List slots and their lag
SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
```

To use the slot, point the standby at it with `primary_slot_name = 'standby1_slot'` in its `postgresql.conf`.

**Warning:** Replication slots whose consumers stop reading WAL cause WAL to accumulate indefinitely, eventually filling the disk. Monitor slot lag and set `max_slot_wal_keep_size` as a safety valve:

```ini
max_slot_wal_keep_size = 10GB   # Invalidate slots that retain more WAL than this
```
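Given that failure mode, it's worth checking periodically for slots that have lost their consumer and dropping any that are truly abandoned:

```sql
-- Slots with no connected consumer, and the WAL they are pinning
SELECT slot_name, slot_type,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
WHERE NOT active;

-- Remove a slot once its consumer is gone for good
SELECT pg_drop_replication_slot('standby1_slot');
```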

## Read Replicas

A standby server in hot standby mode accepts read-only queries. This is a simple way to scale read traffic.

```sql
-- On the primary: hot standby is the default
-- On the standby, queries run normally (read-only)
SELECT COUNT(*) FROM orders;  -- works on standby

-- Queries requiring writes fail:
INSERT INTO orders ...;  -- ERROR: cannot execute INSERT in a read-only transaction
```

Read replica use cases:

- Long-running analytical queries that would impact primary performance
- Reporting and BI tools
- Full-text search queries
- Geographic distribution (read from a nearby region)

For application routing, you need to direct read traffic to replicas and write traffic to the primary. Tools like HAProxy, PgBouncer, or application-level connection routing handle this.
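All of these routers ultimately rely on the same probe: asking a node whether it is writable. A sketch of two standard checks (drivers typically use a probe like the first to implement `target_session_attrs`, shown later in this chapter):

```sql
-- 'off' on the primary, 'on' on a hot standby
SHOW transaction_read_only;

-- true while this node is replaying WAL, i.e. it is a replica
SELECT pg_is_in_recovery();
```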

**Standby hint bits:** One subtlety — when a read query on the standby touches a tuple that needs its hint bits set (a tuple that hasn't been frozen), the standby can't durably update those hint bits the way the primary would. Postgres handles this gracefully by applying the visibility rules manually, but it means hot standby reads can be slightly slower than primary reads for data that has never been frozen.

## Logical Replication

Physical replication creates an exact byte-for-byte copy of the primary. Logical replication replicates changes at the row level — individual INSERT, UPDATE, and DELETE events — allowing more flexibility:

- Replicate a subset of tables
- Replicate to a different major Postgres version
- Replicate to multiple downstream consumers
- Support bidirectional replication (with care)
- Use as a CDC (Change Data Capture) source

Logical replication uses a publish/subscribe model:

**Publisher (primary):**

```sql
-- Requires wal_level = logical in postgresql.conf
CREATE PUBLICATION my_publication
    FOR TABLE users, orders, products;
-- Or: FOR ALL TABLES (replicates everything)
```

**Subscriber (downstream Postgres):**

```sql
CREATE SUBSCRIPTION my_subscription
    CONNECTION 'host=primary port=5432 user=replicator password=xxx dbname=mydb'
    PUBLICATION my_publication;
```

The subscription starts by copying the initial table data, then applies ongoing changes. Tables on the subscriber must already exist with compatible schemas.
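Once the subscription is live, the publication can evolve and apply progress can be watched from the subscriber side. A short sketch (the `invoices` table is a hypothetical addition):

```sql
-- On the publisher: add a table to the existing publication
ALTER PUBLICATION my_publication ADD TABLE invoices;  -- invoices is hypothetical

-- On the subscriber: pick up the publication change, then check progress
ALTER SUBSCRIPTION my_subscription REFRESH PUBLICATION;

SELECT subname, received_lsn, latest_end_time
FROM pg_stat_subscription;
```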

### Logical Decoding and CDC

Logical replication is built on *logical decoding* — the ability to decode WAL into a stream of data changes. Tools like Debezium use logical decoding to stream Postgres changes into Kafka, enabling event sourcing and data integration patterns without polling:

```sql
-- Create a logical replication slot for an external consumer
SELECT pg_create_logical_replication_slot('debezium', 'pgoutput');

-- Peek at changes for debugging. pgoutput speaks a binary protocol and
-- requires extra options, so use a test_decoding slot for readable output:
SELECT pg_create_logical_replication_slot('debug_slot', 'test_decoding');
SELECT * FROM pg_logical_slot_peek_changes('debug_slot', NULL, NULL);
```

Debezium + Kafka is the gold standard for streaming Postgres changes to other systems. This is how you bridge Postgres (the source of truth) with Elasticsearch, Redis, data warehouses, and other consumers while maintaining Postgres as the authoritative store.

## Automatic Failover with Patroni

Physical replication sets up the data flow, but failover — promoting a standby when the primary fails — requires additional tooling. Patroni is the most widely used solution.
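The promotion step itself is a single operation; what failover tooling adds is deciding when promotion is safe, fencing the old primary, and repointing clients and standbys. For reference, the bare primitive (Postgres 12+; older versions use `pg_ctl promote`):

```sql
-- On a standby: exit recovery and begin accepting writes.
-- Returns true once promotion completes (waits up to 60s by default).
SELECT pg_promote();
```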

**Patroni** is a Postgres cluster manager that:

- Uses a distributed consensus store (etcd, Consul, or ZooKeeper) to elect a leader
- Automatically promotes a standby when the primary becomes unavailable
- Manages the Postgres configuration for the current role (primary vs. standby)
- Provides a REST API for cluster status and management

A Patroni cluster consists of:

- Two or more Postgres instances with Patroni agents
- A distributed consensus store (a 3-node etcd cluster is typical)
- Optionally, HAProxy or similar for client routing

**Basic Patroni configuration** (`patroni.yml`):

```yaml
name: postgres1
scope: my-cluster

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.1:8008

etcd3:
  hosts: 10.0.0.10:2379,10.0.0.11:2379,10.0.0.12:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB: max lag for a standby to be failover-eligible

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.1:5432
  data_dir: /var/lib/postgresql/data
  parameters:
    wal_level: replica
    hot_standby: on
    max_wal_senders: 10
    max_replication_slots: 10
```

Patroni handles:

- Leader election (which Postgres node is primary)
- Automatic failover (promotes the best standby when the primary fails)
- Replication configuration management
- Health checking

**The `patronictl` CLI** for cluster management:

```bash
patronictl -c /etc/patroni.yml list                   # Show cluster state
patronictl -c /etc/patroni.yml failover my-cluster    # Manual failover
patronictl -c /etc/patroni.yml switchover my-cluster  # Planned failover
```

## Client Routing

When the primary changes (failover), your application needs to learn the new primary's address. Options:

**DNS-based routing:** Maintain a DNS record that always points to the current primary, updated on failover by a Patroni callback. Simple, but subject to TTL lag.

**HAProxy:** A reverse proxy that routes connections based on health checks. Patroni's REST API reports whether a node is primary or standby; HAProxy queries it to route traffic (`mode tcp` is required, since the proxied traffic is the Postgres wire protocol, not HTTP):

```
frontend postgres_primary
    bind *:5432
    mode tcp
    default_backend primary

backend primary
    mode tcp
    option httpchk GET /primary
    server postgres1 10.0.0.1:5432 check port 8008
    server postgres2 10.0.0.2:5432 check port 8008
```

**PgBouncer + Patroni:** PgBouncer can be reconfigured to point to the new primary after failover, either via a Patroni callback or by pointing it at a virtual IP that moves with the primary.

**Cluster-aware drivers:** Some Postgres drivers accept multiple hosts and automatically discover the current primary. For example, in Go with `pgx`:

```go
connStr := "postgres://user:pass@host1,host2,host3/mydb?target_session_attrs=read-write"
```

This connects to the first host in the list that accepts read-write connections (i.e., the primary).

## Replication Monitoring

Essential metrics to monitor:

```sql
-- Replication lag on the primary
SELECT client_addr, replay_lag, sync_state
FROM pg_stat_replication;

-- On the standby: how far behind is this standby?
-- (Grows spuriously while the primary is idle; meaningful under write load.)
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;

-- WAL sender activity
SELECT pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;
```

Alert on:

- Replication lag > 60 seconds (normal async lag is under 100 ms)
- Replication slot lag growing indefinitely
- A standby not connected at all (see the sketch below)
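These conditions translate directly into queries an alerting system can poll. A sketch using the thresholds above:

```sql
-- Standbys replaying more than 60 seconds behind the primary
SELECT client_addr, replay_lag
FROM pg_stat_replication
WHERE replay_lag > interval '60 seconds';

-- Fire if this returns fewer standbys than you expect to be connected
SELECT count(*) AS connected_standbys
FROM pg_stat_replication;
```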

## Cloud Managed HA

If you're using a managed Postgres service (AWS RDS, Cloud SQL, Aurora, Supabase), high availability is typically handled for you:

- **AWS RDS Multi-AZ:** Synchronous standby in a different availability zone. Automatic failover in 60-120 seconds. Read replicas available separately.
- **AWS Aurora:** Multi-AZ by design with storage-level replication. Very fast failover (~30 seconds). Up to 15 read replicas.
- **Google Cloud SQL:** HA with an automatic standby in a different zone. Failover in ~60 seconds.
- **Supabase:** Built-in HA with read replicas.

Managed HA trades control for operational simplicity. For teams without dedicated DBAs, managed HA is often the right choice — let the cloud provider handle the Patroni equivalent.

## The Availability Math

A single Postgres instance with good hardware and no HA has roughly 99.9% availability, which works out to about nine hours of downtime per year (0.1% of 8,760 hours) from scheduled maintenance and unexpected failures. With HA and automatic failover, you can achieve 99.95% or better (under five hours per year). The distance between "I should add a standby" and an actual production outage is often a single hardware failure.

For any production system with users, HA is not optional — it's part of operating responsibly. The complexity of Patroni (or using a managed service) is a one-time investment that pays off the first time a disk fails or a host becomes unavailable and traffic just... fails over.
