Commit 69dd6e7 — Add Chapter 15: Backup and Recovery

Covers RPO/RTO definitions, pg_dump/pg_restore for logical backups, WAL archiving for continuous backup, PITR configuration, pgBackRest as the production standard, Barman, backup testing practices, the 3-2-1 rule, and monitoring backup health.

Author: DavidLiedle, Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
File changed: src/ch15-backup-recovery.md (322 additions)
# Backup and Recovery

A database you can't restore is a database you don't have. Backup is not about the backup — it's about the restore. The question is not "do we have backups?" but "have we tested restoring from them recently?" and "what's our actual recovery time in an emergency?"

This chapter covers the full spectrum of Postgres backup strategies: logical dumps with `pg_dump`, continuous archiving with WAL, point-in-time recovery, and the production-grade tools that make this manageable.
## Terminology: RPO and RTO

Two metrics define your backup requirements:

**RPO (Recovery Point Objective):** How much data can you afford to lose? An RPO of 5 minutes means you can tolerate losing up to 5 minutes of transactions in a catastrophic failure. An RPO of 0 means you need synchronous replication with no data loss.

**RTO (Recovery Time Objective):** How long can the database be unavailable during recovery? An RTO of 4 hours means you can take up to 4 hours to restore from a backup and catch up. An RTO of 5 minutes requires a hot standby.

Your backup strategy flows from these numbers. A startup with moderate data loss tolerance might accept RPO=24h (daily backups) and RTO=4h. A financial system might require RPO=0 and RTO=minutes. Most production applications land somewhere in between.
## pg_dump: Logical Backups

`pg_dump` is Postgres's built-in tool for creating logical backups. It connects to the database and dumps the schema and data as SQL statements (plain format) or as an archive (custom, directory, or tar format) that `pg_restore` can read.

```bash
# Dump to SQL file (plain text format)
pg_dump --dbname=mydb --file=backup.sql

# Dump to custom binary format (recommended: smaller, parallelizable restore)
pg_dump --dbname=mydb --format=custom --file=backup.dump

# Dump with connection string
pg_dump "postgresql://user:pass@host:5432/mydb" --format=custom --file=backup.dump

# Dump only specific tables
pg_dump --dbname=mydb --table=orders --table=users --format=custom --file=partial_backup.dump

# Dump schema only (no data)
pg_dump --dbname=mydb --schema-only --format=custom --file=schema_only.dump

# Dump data only (no schema)
pg_dump --dbname=mydb --data-only --format=custom --file=data_only.dump
```
### Restoring with `pg_restore`

```bash
# Restore from custom format (parallel restore with 4 workers)
pg_restore --dbname=mydb --jobs=4 backup.dump

# Restore to a new database
createdb mydb_restored
pg_restore --dbname=mydb_restored --jobs=4 backup.dump

# Restore a specific table
pg_restore --dbname=mydb --table=orders backup.dump

# Restore schema only
pg_restore --schema-only --dbname=mydb backup.dump

# From SQL (plain) format:
psql --dbname=mydb < backup.sql
```
### `pg_dumpall`

Dumps the entire PostgreSQL cluster — all databases plus global objects (roles, tablespaces). Note that `pg_dumpall` only writes plain SQL, so its output is restored with `psql`, not `pg_restore`:

```bash
pg_dumpall --globals-only --file=globals.sql   # Just roles/tablespaces
pg_dumpall --file=cluster_backup.sql           # Everything
```
### Limitations of Logical Backups

- **Slow:** Dumping a large database takes hours; restoring takes longer (the data must be reloaded and all indexes rebuilt)
- **Point-in-time is limited:** A dump reflects the state at one moment. No PITR without WAL archiving.
- **Not suitable for very large databases:** For databases beyond several hundred GB, dump/restore is too slow for a realistic RTO

For small databases (<50GB), pg_dump with daily automation is often sufficient. For larger databases or stricter RPO/RTO requirements, WAL archiving is necessary.
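Daily automation can be as simple as a cron-driven script that writes a date-stamped dump and prunes old ones. A minimal sketch — the function name, paths, and retention window are assumptions, not anything from a standard tool:

```shell
# Hypothetical daily-dump helper: writes a date-stamped custom-format
# dump, then deletes dumps older than the retention window.
dump_and_rotate() {
  db="$1"; dir="$2"; keep_days="$3"
  mkdir -p "$dir"
  # Custom format so the restore can run in parallel
  pg_dump --dbname="$db" --format=custom \
          --file="$dir/${db}_$(date +%Y%m%d).dump"
  # Prune dumps older than the retention window
  find "$dir" -name "${db}_*.dump" -type f -mtime +"$keep_days" -delete
}
```

Invoked from a 2am cron entry as, say, `dump_and_rotate mydb /var/backups/postgres 14`.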
## WAL Archiving: Continuous Backup

Physical backup (the foundation of PITR) works by:

1. Taking a base backup (a copy of the data directory)
2. Archiving WAL segments continuously as they're generated
3. At recovery time: restoring the base backup, then replaying WAL segments up to the desired point in time

This allows recovery to any point in time since the base backup — with recency limited only by how promptly WAL segments are archived.
### Configuring WAL Archiving

In `postgresql.conf`:

```ini
wal_level = replica # Minimum for archiving
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'
# Or to S3:
# archive_command = 'aws s3 cp %p s3://my-bucket/wal/%f'
# Or with WAL-G:
# archive_command = 'wal-g wal-push %p'
```

`%p` is the path to the WAL file, `%f` is just the filename. The archive command must return 0 on success and non-zero on failure. The bare `cp` example is for illustration only: it neither syncs the file to durable storage nor refuses to overwrite an existing segment, both of which a production archive command should do.
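A slightly safer local-archive command, in the spirit of the `test ! -f … && cp …` example in the Postgres documentation, refuses to overwrite an already-archived segment so a misconfigured second server can't silently corrupt the archive. A sketch under assumed paths, wrapped in a function so the logic is easy to exercise:

```shell
# Sketch of a safer archive_command, which could be invoked as:
#   archive_command = '/usr/local/bin/archive_wal.sh %p %f'
archive_wal() {
  wal_path="$1"     # %p: path to the WAL segment
  wal_name="$2"     # %f: segment file name
  archive_dir="$3"  # destination directory (an assumption)
  # Non-zero exit makes Postgres retry later instead of overwriting
  test ! -f "$archive_dir/$wal_name" || return 1
  cp "$wal_path" "$archive_dir/$wal_name"
}
```

A production version should also fsync the copied file — or simply delegate archiving to a tool like pgBackRest or WAL-G that handles durability for you.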
### Taking a Base Backup

```bash
# Simple base backup with pg_basebackup
pg_basebackup \
  --host=localhost \
  --username=replicator \
  --pgdata=/var/lib/postgresql/base_backup \
  --format=tar \
  --compress=9 \
  --wal-method=stream \
  --progress

# The same with short options:
pg_basebackup -h localhost -U replicator -D /var/lib/postgresql/base_backup -Ft -Z9 -Xs -P
```

`--wal-method=stream` streams WAL during the backup, ensuring the backup is immediately usable without waiting for WAL segments to be archived.
### Point-in-Time Recovery

To recover to a specific point in time:

1. Restore the base backup to the data directory
2. Configure recovery in `postgresql.conf`:

```ini
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
# Or from S3: 'aws s3 cp s3://my-bucket/wal/%f %p'

recovery_target_time = '2024-06-15 14:30:00+00'
recovery_target_action = 'promote' # Promote to primary after reaching target
```

3. Create `recovery.signal` in the data directory
4. Start Postgres

Postgres will replay archived WAL up to the target time, then promote the instance to read-write mode.

**Recovery targets:**

- `recovery_target_time`: Recover to a specific timestamp
- `recovery_target_lsn`: Recover to a specific WAL position
- `recovery_target_xid`: Recover through a specific transaction ID (inclusive by default)
- `recovery_target_name`: Recover to a named restore point (created with `pg_create_restore_point()`)

PITR is invaluable for "logical corruption" disasters — a bad deployment drops the wrong column, a bug deletes the wrong rows. You can recover the database to just before the accident.
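The configure-and-signal steps above are easy to script. This is a sketch (`stage_pitr` and its arguments are hypothetical names; the archive path is an assumption) — run it against a restored *copy* of the base backup, never the live data directory:

```shell
# Stage a PITR attempt: append recovery settings and create
# recovery.signal so the next startup enters targeted recovery.
stage_pitr() {
  pgdata="$1"        # restored data directory
  target_time="$2"   # e.g. '2024-06-15 14:30:00+00'
  archive_dir="$3"   # WAL archive location (an assumption)
  cat >> "$pgdata/postgresql.conf" <<EOF
restore_command = 'cp $archive_dir/%f %p'
recovery_target_time = '$target_time'
recovery_target_action = 'promote'
EOF
  touch "$pgdata/recovery.signal"
}
```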
## pgBackRest: The Production Standard

pgBackRest is an enterprise-grade backup tool for Postgres, widely considered the most complete solution available. It handles:

- Full, differential, and incremental backups
- WAL archiving and PITR
- Parallel backup and restore (dramatically faster than `pg_basebackup`)
- Backup catalog management
- Encryption and compression
- Remote repositories (S3, GCS, Azure, SFTP)
- Backup from a standby (don't stress the primary)
### pgBackRest Configuration

```ini
# /etc/pgbackrest/pgbackrest.conf

[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
repo1-retention-diff=7
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=very_secure_passphrase

# Or S3:
# repo1-type=s3
# repo1-s3-bucket=my-postgres-backups
# repo1-s3-region=us-east-1
# repo1-s3-endpoint=s3.amazonaws.com

[main]
pg1-path=/var/lib/postgresql/data
pg1-user=postgres
```

PostgreSQL configuration (for archiving):

```ini
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
```
### pgBackRest Operations

```bash
# Create the stanza (repository initialization)
pgbackrest --stanza=main stanza-create

# Take a full backup
pgbackrest --stanza=main backup --type=full

# Take a differential backup (changes since the last full backup)
pgbackrest --stanza=main backup --type=diff

# Take an incremental backup (changes since the last backup of any type)
pgbackrest --stanza=main backup --type=incr

# List backups
pgbackrest --stanza=main info

# Restore to latest
pgbackrest --stanza=main restore

# PITR to a specific time
pgbackrest --stanza=main restore --type=time --target="2024-06-15 14:30:00+00"

# Restore to a new path
pgbackrest --stanza=main restore --pg1-path=/var/lib/postgresql/restored
```
### Backup Schedule

A typical schedule:

- Full backup: Weekly (Sunday at 2am)
- Differential backup: Daily on the other six days (2am)
- WAL archiving: Continuous

This gives you the ability to restore to any point in time within the retention window, with a restore time of (base backup restore time) + (WAL replay time since that backup). With daily differential backups, the WAL replay window is at most about a day.
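In cron syntax, the schedule above might look like this (the stanza name matches the earlier examples; the times and user are assumptions):

```bash
# /etc/cron.d/pgbackrest  —  m h dom mon dow user command
0 2 * * 0    postgres  pgbackrest --stanza=main backup --type=full
0 2 * * 1-6  postgres  pgbackrest --stanza=main backup --type=diff
```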
## Barman: Another Production Option

Barman (Backup and Recovery Manager) is another mature backup tool, popular in enterprises. It focuses on centralized, remote backup management — Barman runs on a dedicated backup server and fetches backups from your Postgres servers.

For teams that want Barman's centralized management model, it's an excellent choice. pgBackRest and Barman have roughly equivalent capabilities; the choice is often based on operational preference.
## Testing Your Backups

Here is a hard rule: **if you haven't tested a restore, you don't have a backup.**

Databases fail in ways that expose backup configuration bugs you didn't know existed. Backup processes fail silently. WAL archiving gets misconfigured. Encryption keys get lost. The restore command in your documentation is wrong.

A practical test regime:

**Monthly:** Restore the latest backup to a test environment. Verify the database starts, data is present, and basic queries work.

**After major configuration changes:** Any time you change the backup configuration, test a restore before assuming it works.

**After Postgres upgrades:** Major version upgrades may require updating backup tool versions. Test the restore process after any upgrade.

**Restore test script:**
```bash
#!/bin/bash
set -e

# Restore the latest backup to a scratch data directory
# (no --target: the default restores to the end of archived WAL)
pgbackrest --stanza=main restore \
    --pg1-path=/var/lib/postgresql/test_restore

# Start the test instance on a different port
postgres -D /var/lib/postgresql/test_restore -p 5433 &

# Wait for startup (and WAL replay) to finish
until pg_isready -p 5433; do sleep 1; done

# Verify critical data is present
psql -p 5433 -d mydb -tA -c "SELECT COUNT(*) FROM orders;" | grep -q "[0-9]"
psql -p 5433 -d mydb -c "SELECT max(created_at) FROM orders;"

# Verify WAL was applied (recent transactions are present)
psql -p 5433 -d mydb -c "SELECT id FROM orders ORDER BY id DESC LIMIT 1;"

echo "Restore test passed"

# Cleanup
pg_ctl stop -D /var/lib/postgresql/test_restore
rm -rf /var/lib/postgresql/test_restore
```
Run this in CI or as a scheduled job. The 30 minutes this takes each month is cheap compared to discovering your backup is broken during an actual disaster.
## The 3-2-1 Rule

Apply the 3-2-1 backup rule to Postgres:

- **3** copies of the data
- **2** different storage media
- **1** offsite copy

In practice:

- **Copy 1:** The live database
- **Copy 2:** A hot standby (streaming replication)
- **Copy 3:** WAL archives and base backups in cloud object storage (S3, GCS)

The standby gives you fast failover (RTO in minutes). The cloud backup gives you PITR and protection against logical corruption and catastrophic datacenter failures.
## Monitoring Backup Health

Critical metrics to alert on:

```sql
-- Is WAL archiving working? Check pg_stat_archiver:
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;
```

Alert if:

- `last_archived_time` is more than 5 minutes old
- `failed_count` increased since the last check
- A base backup hasn't run in the configured interval
- Backup storage usage is growing unexpectedly

pgBackRest has built-in `check` and `verify` commands:

```bash
pgbackrest --stanza=main check   # Verify archiving is working
pgbackrest --stanza=main verify  # Verify backup files are intact
```
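The archiver-lag alert reduces to a small shell predicate. A sketch — the function name is hypothetical, GNU `date` is assumed, and in practice you would feed it `last_archived_time` fetched with `psql -tA`:

```shell
# True (exit 0) if the last archived WAL segment is recent enough.
archiver_lag_ok() {
  last_archived="$1"   # value of pg_stat_archiver.last_archived_time
  max_lag_secs="$2"    # alert threshold, e.g. 300 for 5 minutes
  now=$(date +%s)
  ts=$(date -d "$last_archived" +%s)   # GNU date parses the timestamp
  [ $(( now - ts )) -le "$max_lag_secs" ]
}
```

For example: `archiver_lag_ok "$(psql -tA -c 'SELECT last_archived_time FROM pg_stat_archiver')" 300 || page_oncall` (where `page_oncall` stands in for your alerting hook).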
Backup and recovery is the insurance policy for everything in this book. You can optimize performance, harden security, and design elegant schemas — but without working backups, a single hardware failure or operator error can erase it all. Invest in your backup strategy proportionally to the value of your data. For almost any production system, that means WAL archiving, regular base backups to offsite storage, and tested restores.
