Databases in Production

Backups That Restore

18 min Lesson 4 of 30

Backups That Restore

The uncomfortable truth about database backups: most teams have them, almost no team has verified them under pressure. A backup that has never been restored is not a backup — it is a hope. This lesson teaches you the two fundamental backup strategies, how Point-In-Time Recovery (PITR) works at the byte level, and why a rigorous restore-testing discipline is what separates teams that survive failures from teams that lose data.

The only backup that counts is one you have successfully restored. Google SRE teams run restore drills on a regular cadence. Netflix Chaos Engineering explicitly includes database recovery scenarios. If you cannot demonstrate a restore in under your RTO, your backup strategy is incomplete.

Logical vs Physical Backups

Every backup mechanism falls into one of two categories. Choosing the wrong one for your situation is a production mistake you discover at the worst possible time.

Logical backups export the database as SQL statements — CREATE TABLE, INSERT INTO, and so on. pg_dump for PostgreSQL and mysqldump for MySQL are the canonical tools. They are portable (restore to a different OS or minor version), human-readable for auditing, and selective (dump a single table or schema). The trade-off: they are slow to restore at scale because the database must re-execute every SQL statement, rebuild every index, and enforce every constraint. A 500 GB logical dump can take 4–6 hours to restore. That is fine for a nightly dev-refresh workflow; it is catastrophic during a production incident where your RTO is 30 minutes.

# PostgreSQL logical backup — dump in custom format (compressed, parallel-restorable)
pg_dump \
  --host=db-primary.internal \
  --port=5432 \
  --username=backup_user \
  --format=custom \
  --compress=9 \
  --jobs=4 \
  --file=/backups/mydb_$(date +%Y%m%d_%H%M%S).dump \
  mydb

# Restore a custom-format dump with parallel restore workers
pg_restore \
  --host=db-restore-target.internal \
  --port=5432 \
  --username=postgres \
  --dbname=mydb \
  --jobs=8 \
  --verbose \
  /backups/mydb_20250801_020000.dump

Physical backups copy the raw data files — the pages, blocks, and WAL segments that make up the database on disk. Tools: pg_basebackup for PostgreSQL, Percona XtraBackup for MySQL/InnoDB, and filesystem-level snapshots (EBS snapshot, ZFS send, LVM snapshot). Physical backups are fast to restore because you are copying bytes, not re-executing SQL. A 500 GB physical backup might restore in 20–40 minutes with a fast network. The trade-off: they are not portable across major versions and cannot selectively restore a single table without extracting it into a logical dump first.

# PostgreSQL physical base backup — streams WAL continuously during backup
pg_basebackup \
  --host=db-primary.internal \
  --port=5432 \
  --username=replication_user \
  --pgdata=/backups/basebackup_$(date +%Y%m%d) \
  --format=tar \
  --compress=lz4 \
  --wal-method=stream \
  --checkpoint=fast \
  --progress \
  --verbose

# Result: base.tar.lz4 (data files) + pg_wal.tar.lz4 (WAL from start of backup)
# These two together form a consistent snapshot of the database.

In practice, a mature backup strategy uses both: physical backups for the base (fast restore, small RTO), logical backups for selective object recovery and cross-version portability.

Point-In-Time Recovery (PITR)

A base backup captures a consistent snapshot at one moment. But what if data was corrupted or deleted three hours after that snapshot? This is where PITR becomes essential. PITR lets you restore the database to any arbitrary point in time — not just "as of the last backup" but "as of 14:27:43 on Tuesday, before the bad DELETE ran."

PITR works by combining a base backup with a continuous stream of Write-Ahead Log (WAL) segments. Every write to the database — insert, update, delete, DDL — is first recorded as a WAL record before it touches the data files. If you archive these WAL segments continuously, you can replay them on top of a base backup to reconstruct the database at any point between the base backup time and now.

PITR: restore a base backup, then replay archived WAL segments up to the exact recovery target timestamp — stopping just before the destructive event.

To enable PITR in PostgreSQL, you must configure WAL archiving before you need it. Enabling it after a disaster is too late:

# postgresql.conf — enable WAL archiving (must be set before disaster strikes)
wal_level = replica           # or logical; must be replica+ for archiving
archive_mode = on
archive_command = 'pgbackrest --stanza=mydb archive-push %p'
# pgBackRest handles compression, deduplication, and S3/GCS upload

# recovery.conf (PostgreSQL 11 and below) OR postgresql.conf (12+)
restore_command = 'pgbackrest --stanza=mydb archive-get %f "%p"'
recovery_target_time = '2025-08-05 14:27:42'
recovery_target_action = 'promote'   # make writable once target is reached
# Start PostgreSQL in recovery mode — it will replay WAL until the target time

Use pgBackRest or Barman instead of raw archive_command scripts. They handle WAL deduplication, parallel transfer, integrity checksumming, backup rotation, and direct restore from S3/GCS. Rolling your own shell-script archiver means you are also on the hook for all of that — and you will discover the gaps during an incident.

Backup Storage: The 3-2-1 Rule Applied

The 3-2-1 rule: keep 3 copies, on 2 different media types, with 1 copy offsite. For databases this translates concretely: nightly full backups retained for 30 days, WAL archives retained for 7 days minimum (longer for compliance), replicated to a different AWS region or cloud provider. At Amazon and Google, production databases are backed up to at least two geographically separate locations. A single-region backup strategy fails the same disaster that took out your primary.

Never store backups on the same server as the database. A disk failure, ransomware, or accidental rm -rf that hits your primary will hit your co-located backup simultaneously. Always ship backups to a separate storage system — cloud object storage (S3, GCS) with versioning and MFA-delete enabled is the minimum bar.

Testing Restores: The Practice That Saves You

Restore testing is not a one-time event. It is a recurring operational practice. The failure modes that bite teams at the worst time are almost always discovered during a real incident — not during a fire drill — because they never scheduled one. There are three levels of restore testing you should run:

Weekly automated restore validation — restore the most recent backup to an isolated throwaway instance, run a checksum or row-count comparison against a known baseline, tear it down. This takes 20 minutes and runs unattended via a CI job or cron.
Monthly full PITR drill — restore to a specific timestamp, verify application correctness (not just DB connectivity), confirm the restored data matches what you expected at that point in time. This catches WAL archive gaps that an automated count check misses.
Quarterly RTO measurement — simulate a complete primary failure, restore from cold, and time every step end-to-end. Compare to your documented RTO. If you exceed it, you have found a process gap before it costs you a SLA breach.

#!/bin/bash
# restore-test.sh — weekly automated restore validation (runs in CI or cron)
set -euo pipefail

STANZA="mydb"
RESTORE_HOST="restore-sandbox.internal"
RESTORE_PORT="5433"
RESTORE_PATH="/var/lib/postgresql/15/restore_test"
LOG="/var/log/backup-restore-test.log"

echo "[$(date)] Starting restore validation" | tee -a "$LOG"

# 1. Restore latest backup to sandbox instance
pgbackrest \
  --stanza="$STANZA" \
  --pg1-path="$RESTORE_PATH" \
  --pg1-port="$RESTORE_PORT" \
  --recovery-option="recovery_target_action=promote" \
  restore | tee -a "$LOG"

# 2. Start PostgreSQL on the restored data
pg_ctl -D "$RESTORE_PATH" -o "-p $RESTORE_PORT" start

# 3. Run integrity checks
EXPECTED_USERS=142850
ACTUAL_USERS=$(psql -h localhost -p "$RESTORE_PORT" -U postgres -At \
  -c "SELECT COUNT(*) FROM users;")

if [ "$ACTUAL_USERS" -lt "$((EXPECTED_USERS - 100))" ]; then
  echo "[FAIL] User count $ACTUAL_USERS is below threshold" | tee -a "$LOG"
  # Alert via PagerDuty / Slack webhook
  curl -s -X POST "$PAGERDUTY_WEBHOOK" \
    -d "{\"summary\": \"Restore test FAILED: user count $ACTUAL_USERS\"}"
  exit 1
fi

echo "[PASS] Restore validated — $ACTUAL_USERS users. Tearing down." | tee -a "$LOG"
pg_ctl -D "$RESTORE_PATH" stop
rm -rf "$RESTORE_PATH"

Alert on backup job failures, not just restore failures. If your nightly backup silently fails for a week and then a disaster hits, you have zero valid recovery points. Monitor pgbackrest info output in your observability stack and page on-call if the most recent successful backup is older than 25 hours.

Key Metrics to Track

Backup and recovery are operational disciplines, which means you need numbers. Track these in your monitoring system:

RPO (Recovery Point Objective): the maximum data loss acceptable. If your RPO is 1 hour, your WAL archive must be no more than 1 hour behind, continuously.
RTO (Recovery Time Objective): how long a full restore may take. Measure it in drills; never rely on an estimate.
Backup age: time since the last successful backup completed. Alert at 25 hours for daily backups.
WAL archive lag: how far behind the archive is from the primary. Alert at 15 minutes for sub-hour RPO targets.
Restore test pass rate: percentage of weekly automated tests that pass. Anything below 100% is a blocked incident waiting to happen.

A backup strategy that cannot demonstrate these numbers in a runbook is aspirational, not operational. The goal of this entire lesson distills to one sentence: know your RPO and RTO, automate the backup, automate the restore test, and measure both continuously.