Backups That Restore
Backups That Restore
The uncomfortable truth about database backups: most teams have them, almost no team has verified them under pressure. A backup that has never been restored is not a backup — it is a hope. This lesson teaches you the two fundamental backup strategies, how Point-In-Time Recovery (PITR) works at the byte level, and why a rigorous restore-testing discipline is what separates teams that survive failures from teams that lose data.
Logical vs Physical Backups
Every backup mechanism falls into one of two categories. Choosing the wrong one for your situation is a production mistake you discover at the worst possible time.
Logical backups export the database as SQL statements — CREATE TABLE, INSERT INTO, and so on. pg_dump for PostgreSQL and mysqldump for MySQL are the canonical tools. They are portable (restore to a different OS or minor version), human-readable for auditing, and selective (dump a single table or schema). The trade-off: they are slow to restore at scale because the database must re-execute every SQL statement, rebuild every index, and enforce every constraint. A 500 GB logical dump can take 4–6 hours to restore. That is fine for a nightly dev-refresh workflow; it is catastrophic during a production incident where your RTO is 30 minutes.
Physical backups copy the raw data files — the pages, blocks, and WAL segments that make up the database on disk. Tools: pg_basebackup for PostgreSQL, Percona XtraBackup for MySQL/InnoDB, and filesystem-level snapshots (EBS snapshot, ZFS send, LVM snapshot). Physical backups are fast to restore because you are copying bytes, not re-executing SQL. A 500 GB physical backup might restore in 20–40 minutes with a fast network. The trade-off: they are not portable across major versions and cannot selectively restore a single table without extracting it into a logical dump first.
In practice, a mature backup strategy uses both: physical backups for the base (fast restore, small RTO), logical backups for selective object recovery and cross-version portability.
Point-In-Time Recovery (PITR)
A base backup captures a consistent snapshot at one moment. But what if data was corrupted or deleted three hours after that snapshot? This is where PITR becomes essential. PITR lets you restore the database to any arbitrary point in time — not just "as of the last backup" but "as of 14:27:43 on Tuesday, before the bad DELETE ran."
PITR works by combining a base backup with a continuous stream of Write-Ahead Log (WAL) segments. Every write to the database — insert, update, delete, DDL — is first recorded as a WAL record before it touches the data files. If you archive these WAL segments continuously, you can replay them on top of a base backup to reconstruct the database at any point between the base backup time and now.
To enable PITR in PostgreSQL, you must configure WAL archiving before you need it. Enabling it after a disaster is too late:
Backup Storage: The 3-2-1 Rule Applied
The 3-2-1 rule: keep 3 copies, on 2 different media types, with 1 copy offsite. For databases this translates concretely: nightly full backups retained for 30 days, WAL archives retained for 7 days minimum (longer for compliance), replicated to a different AWS region or cloud provider. At Amazon and Google, production databases are backed up to at least two geographically separate locations. A single-region backup strategy fails the same disaster that took out your primary.
rm -rf that hits your primary will hit your co-located backup simultaneously. Always ship backups to a separate storage system — cloud object storage (S3, GCS) with versioning and MFA-delete enabled is the minimum bar.
Testing Restores: The Practice That Saves You
Restore testing is not a one-time event. It is a recurring operational practice. The failure modes that bite teams at the worst time are almost always discovered during a real incident — not during a fire drill — because they never scheduled one. There are three levels of restore testing you should run:
- Weekly automated restore validation — restore the most recent backup to an isolated throwaway instance, run a checksum or row-count comparison against a known baseline, tear it down. This takes 20 minutes and runs unattended via a CI job or cron.
- Monthly full PITR drill — restore to a specific timestamp, verify application correctness (not just DB connectivity), confirm the restored data matches what you expected at that point in time. This catches WAL archive gaps that an automated count check misses.
- Quarterly RTO measurement — simulate a complete primary failure, restore from cold, and time every step end-to-end. Compare to your documented RTO. If you exceed it, you have found a process gap before it costs you a SLA breach.
pgbackrest info output in your observability stack and page on-call if the most recent successful backup is older than 25 hours.
Key Metrics to Track
Backup and recovery are operational disciplines, which means you need numbers. Track these in your monitoring system:
- RPO (Recovery Point Objective): the maximum data loss acceptable. If your RPO is 1 hour, your WAL archive must be no more than 1 hour behind, continuously.
- RTO (Recovery Time Objective): how long a full restore may take. Measure it in drills; never rely on an estimate.
- Backup age: time since the last successful backup completed. Alert at 25 hours for daily backups.
- WAL archive lag: how far behind the archive is from the primary. Alert at 15 minutes for sub-hour RPO targets.
- Restore test pass rate: percentage of weekly automated tests that pass. Anything below 100% is a blocked incident waiting to happen.
A backup strategy that cannot demonstrate these numbers in a runbook is aspirational, not operational. The goal of this entire lesson distills to one sentence: know your RPO and RTO, automate the backup, automate the restore test, and measure both continuously.