Disaster Recovery & Multi-Region

Backup Architecture

18 min Lesson 3 of 27

The previous lesson established RTO and RPO as contracts. Backup architecture is the engineering system that fulfills those contracts. Most engineers have run pg_dump; very few have designed a backup system that survives a simultaneous primary region failure, a ransomware attack that encrypts your backup bucket, and a rogue admin with DELETE privileges — on the same night. That is the threat model for a production backup architecture. This lesson covers what you actually back up, why immutability is a hard requirement, how the 3-2-1 rule scales to cloud-native environments, and how cross-region and cross-account replication closes the last gap.

Backup is not the same as DR. A backup is a point-in-time copy. Disaster recovery is the process and infrastructure required to restore service using that copy within your RTO. A great backup with no tested restore procedure is not a DR strategy — it is hope. Everything in this lesson feeds a tested restore procedure, not just a green backup job.

What to Back Up (and What to Skip)

Senior engineers spend as much energy deciding what not to back up as what to include. Backing up everything wastes money and slows restores; backing up too little leaves gaps. The correct mental model is to ask: "If I lose this artefact and cannot reconstruct it from other artefacts, what is the business impact and recovery time?" Classify every data class along those two axes.

Databases (highest priority). Transactional databases such as PostgreSQL, MySQL, and Aurora hold state that cannot be regenerated. Back up with both logical exports (pg_dump, mysqldump) and point-in-time recovery (PITR)-capable snapshots. Logical exports are portable across major versions and allow table-level restores; PITR closes the gap between scheduled snapshots by replaying WAL/binlog forward.
Object storage and file stores. User-uploaded content (S3, GCS) is typically irreplaceable. Enable versioning and cross-region replication at the bucket level. Do not confuse S3 replication with backup: replication faithfully copies corrupted or encrypted overwrites, and it copies delete markers too whenever delete marker replication is enabled. You need immutable versioned copies plus a separate "lock vault" bucket.
Secrets and configuration. Vault unseal keys, KMS keys, and Terraform state. Loss of the unseal keys means your entire secret store is permanently inaccessible. Export Vault snapshots; export Terraform state to a separate locked bucket; back up etcd for self-managed Kubernetes clusters. KMS key material cannot be exported, so protect those keys against deletion rather than trying to copy them.
Application code and IaC. Git is already replicated across developer machines and your CI provider. A DR event is unlikely to destroy all git remotes simultaneously. Treat code repositories as low priority for additional backup, but do back up your CI/CD configuration: pipeline definitions and runner secrets, plus registry images mirrored to a disaster-recovery registry in another account.
Derived / caches. Elasticsearch indices built from a database, Redis caches, CDN caches. These can be reconstructed from the source of truth. Do not back them up unless rebuild time exceeds your RTO. For a 100 TB Elasticsearch index that takes 18 hours to reindex, back it up. For a 5 GB Redis LRU cache, let it warm from the database on restart.
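Logs and observability data. Logs are large, expensive to store, and rarely needed for application restoration. Back up structured logs only where compliance requires it, and for as long as it requires. PCI DSS, for example, mandates at least 12 months of audit log history with the most recent 3 months immediately available. Otherwise rely on log retention in your SIEM/S3.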
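The rule for derived data above (skip it unless the rebuild time exceeds your RTO) can be written as a small decision function. This is a minimal sketch: the `DataClass` type, its field names, and the 4-hour RTO are assumptions made for illustration, not part of any backup tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataClass:
    """Hypothetical inventory entry for one class of data."""
    name: str
    reconstructible: bool  # can it be rebuilt from another source of truth?
    rebuild_hours: float   # estimated rebuild time (ignored if not reconstructible)


def needs_backup(item: DataClass, rto_hours: float) -> bool:
    """Back up irreplaceable data, plus derived data whose rebuild exceeds the RTO."""
    if not item.reconstructible:
        return True
    return item.rebuild_hours > rto_hours


inventory = [
    DataClass("orders-postgres", reconstructible=False, rebuild_hours=0.0),
    DataClass("search-index-100tb", reconstructible=True, rebuild_hours=18.0),
    DataClass("redis-lru-cache-5gb", reconstructible=True, rebuild_hours=0.1),
]

# Assumed RTO of 4 hours, purely for illustration.
to_back_up = [d.name for d in inventory if needs_backup(d, rto_hours=4.0)]
print(to_back_up)
```

With these example figures the database and the large search index are selected, while the small cache is left to warm from the database.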

Backup Types and the Scheduling Equation

Production backup schedules are rarely simple nightly dumps. You compose three backup types to balance storage cost against restore time.

Full backup — a complete snapshot of the dataset. Restore from a single artefact. Cost: proportional to dataset size. For a 10 TB PostgreSQL database, a full backup can plausibly take 6–8 hours and produce a 2–3 TB compressed artefact, depending on I/O throughput, network bandwidth, and how compressible the data is.
Incremental backup — only the changes since the last backup (full or incremental). Cheap to produce. Restore requires replaying the full plus every incremental in order — a chain that can grow long and brittle.
Differential backup — changes since the last full only. Restore requires the full plus one differential — a two-step restore that never gets longer, at the cost of larger differentials as time passes.
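The difference between the three types shows up most clearly in what a restore needs. The sketch below computes the restore chain for a target day; the `Backup` type and the labels are hypothetical, and it assumes an incremental depends on whichever backup immediately preceded it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    label: str
    kind: str  # "full", "diff", or "incr"
    day: int   # day number on which the backup was taken


def restore_chain(backups: list[Backup], target_day: int) -> list[str]:
    """Return the labels to restore, in order, to reach target_day."""
    usable = sorted((b for b in backups if b.day <= target_day), key=lambda b: b.day)
    full_positions = [i for i, b in enumerate(usable) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup at or before target_day")
    start = full_positions[-1]  # most recent full at or before the target
    chain = [usable[start]]
    for b in usable[start + 1:]:
        if b.kind == "diff":
            # A differential holds everything since the full, so it
            # supersedes any earlier differentials or incrementals.
            chain = [chain[0], b]
        elif b.kind == "incr":
            # An incremental only holds changes since the previous backup.
            chain.append(b)
    return [b.label for b in chain]


incr_week = [Backup("full-0", "full", 0)] + [Backup(f"incr-{d}", "incr", d) for d in range(1, 7)]
diff_week = [Backup("full-0", "full", 0)] + [Backup(f"diff-{d}", "diff", d) for d in range(1, 7)]

print(restore_chain(incr_week, 4))  # full plus four incrementals
print(restore_chain(diff_week, 4))  # full plus one differential
```

The incremental chain grows by one artefact per day, while the differential chain stays at two, which is the trade-off described above.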

A common schedule for large databases is: weekly full + daily differential + continuous WAL/binlog streaming. This gives you point-in-time recovery to any second within the retained window at the cost of one weekly full plus a differential that grows across 7 days. Aurora, RDS, and Cloud SQL provide an equivalent natively (automated snapshots plus continuous log capture); on self-managed PostgreSQL, use pgBackRest, or pg_basebackup + WAL archiving to S3 if full backups plus WAL are enough.

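# pgBackRest configuration for weekly full + daily diff + continuous WAL archiving.
# /etc/pgbackrest/pgbackrest.conf on the database host.

[global]
# Repo 1: primary-region S3 bucket. repo1-path is the prefix inside the bucket.
repo1-path=/var/lib/pgbackrest
repo1-type=s3
repo1-s3-bucket=myco-db-backups-primary
repo1-s3-region=us-east-1
# S3 repositories need an explicit endpoint.
repo1-s3-endpoint=s3.us-east-1.amazonaws.com
# Where possible, prefer repo1-s3-key-type=auto (instance role) over static keys.
repo1-s3-key=<IAM_ACCESS_KEY>
repo1-s3-key-secret=<IAM_SECRET_KEY>
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=<STRONG_PASSPHRASE>

# Cross-region second repo (S3 in us-west-2, separate account)
repo2-path=/dr
repo2-type=s3
repo2-s3-bucket=myco-db-backups-dr
repo2-s3-region=us-west-2
repo2-s3-endpoint=s3.us-west-2.amazonaws.com
repo2-s3-key=<DR_IAM_ACCESS_KEY>
repo2-s3-key-secret=<DR_IAM_SECRET_KEY>
# Encrypt the DR repo too, with its own passphrase.
repo2-cipher-type=aes-256-cbc
repo2-cipher-pass=<DR_STRONG_PASSPHRASE>

[global:archive-push]
compress-level=3

[mydb]
pg1-path=/var/lib/postgresql/16/main

# Schedule (crontab):
# A backup run writes to one repo (repo1 unless --repo is given), so each repo
# gets its own entry. WAL archive-push goes to every configured repo.
# Weekly full backup on Sunday: repo1 at 01:00, repo2 at 03:00
0 1 * * 0   pgbackrest --stanza=mydb --repo=1 --type=full backup
0 3 * * 0   pgbackrest --stanza=mydb --repo=2 --type=full backup
# Daily differential Mon-Sat: repo1 at 01:00, repo2 at 03:00
0 1 * * 1-6 pgbackrest --stanza=mydb --repo=1 --type=diff backup
0 3 * * 1-6 pgbackrest --stanza=mydb --repo=2 --type=diff backup
# WAL archiving is always-on via postgresql.conf:
#   archive_mode = on
#   archive_command = 'pgbackrest --stanza=mydb archive-push %p'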

Immutability: The Non-Negotiable Requirement

Ransomware operators now routinely target backup infrastructure, because destroying the backups maximizes their leverage. If the attacker can encrypt or delete your backups, the business may have little choice but to pay. Immutability removes that leverage. An immutable backup, once written, cannot be modified or deleted for a configured retention period: not by an application, not by an operator, and not by the cloud root account.

Two mechanisms enforce immutability at scale:

S3 Object Lock (WORM). Requires a versioned bucket. It is simplest to enable at bucket creation, although since late 2023 AWS also allows enabling it on an existing bucket. Two modes: Compliance mode, in which no one, including the AWS root user, can delete a locked object version or shorten its retention during the retention period; and Governance mode, in which principals granted s3:BypassGovernanceRetention can override the lock. Use Compliance mode for your DR vault and Governance mode for your primary bucket (to allow corrections by your team).
Azure immutable Blob Storage / GCS Bucket Lock and object holds. The equivalent mechanisms on other clouds, with broadly similar semantics. All major clouds offer WORM storage; use it.

Immutability also defends against accidental deletion. A developer running aws s3 rm --recursive against the wrong bucket is a more common failure mode than ransomware. Both are prevented by the same lock.
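The difference between the two Object Lock modes can be modeled in a few lines. This is a simplified sketch of the semantics described above, not an AWS API: the function name and parameters are hypothetical, and legal holds are not modeled.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def delete_allowed(mode: Optional[str], retain_until: Optional[datetime],
                   now: datetime, has_bypass_permission: bool = False) -> bool:
    """Model whether deleting a locked object version would succeed."""
    if mode is None or retain_until is None or now >= retain_until:
        return True  # no lock, or the retention period has expired
    if mode == "GOVERNANCE":
        # Only principals with s3:BypassGovernanceRetention may override.
        return has_bypass_permission
    if mode == "COMPLIANCE":
        # Nobody, including the root user, until retention expires.
        return False
    raise ValueError(f"unknown Object Lock mode: {mode}")


written = datetime(2024, 1, 1, tzinfo=timezone.utc)
retain_until = written + timedelta(days=90)
day_10 = written + timedelta(days=10)

print(delete_allowed("GOVERNANCE", retain_until, day_10, has_bypass_permission=True))
print(delete_allowed("COMPLIANCE", retain_until, day_10, has_bypass_permission=True))
```

The only input that changes the Compliance result is the clock, which is why that mode suits the DR vault.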

# Enable S3 Object Lock Compliance mode on a new DR vault bucket.
# Enabling Object Lock at creation also turns on versioning, which Object Lock requires.

aws s3api create-bucket \
  --bucket myco-db-backups-dr-vault \
  --region us-west-2 \
  --create-bucket-configuration LocationConstraint=us-west-2 \
  --object-lock-enabled-for-bucket

# Set default Compliance retention: 90 days for all new objects.
aws s3api put-object-lock-configuration \
  --bucket myco-db-backups-dr-vault \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {
      "DefaultRetention": {
        "Mode": "COMPLIANCE",
        "Days": 90
      }
    }
  }'

# Deny ALL deletes and lock overrides — even from the bucket-owning account.
# Attach this as a bucket policy (SCPs add another account-level layer).
aws s3api put-bucket-policy \
  --bucket myco-db-backups-dr-vault \
  --policy '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "DenyDeleteAndBypass",
        "Effect": "Deny",
        "Principal": "*",
        "Action": [
          "s3:DeleteObject",
          "s3:DeleteObjectVersion",
          "s3:PutObjectRetention",
          "s3:BypassGovernanceRetention"
        ],
        "Resource": "arn:aws:s3:::myco-db-backups-dr-vault/*"
      }
    ]
  }'

The 3-2-1 Rule and Its Cloud-Native Extension

The 3-2-1 rule, popularized in the mid-2000s and commonly attributed to photographer Peter Krogh, states: keep 3 copies of data, on 2 different media types, with 1 copy off-site. It was designed for a world of local disks, tapes, and optical media. Its cloud-native extension is the 3-2-1-1-0 rule, which adds two critical requirements for the modern threat model.

The 3-2-1-1-0 backup architecture: Copy 1 in the primary region (Governance lock); Copy 2 cross-region WORM (Compliance lock, separate bucket); Copy 3 in an isolated AWS account with Glacier Deep Archive — unreachable from production credentials. The +0 requires regular automated restore tests to confirm all copies are recoverable.

The first "1" adds an offline, air-gapped, or logically isolated copy. In the cloud this usually means a copy in a separate AWS/GCP account that trusts no role or credential from the production account. Even if an attacker obtains root in your production account, they cannot reach the isolated account as long as no cross-account trust leads into it. Glacier Deep Archive in a separate account costs roughly $0.00099/GB/month (about $1 per TB per month), which is very cheap for the insurance it provides.
The "0" adds zero unverified backups: every backup must have a regularly tested restore. An untested backup is not a backup. It is an assumption. Google's SRE literature makes the same argument, that backups only matter if recovery is tested continuously. In practice this means an automated restore pipeline that runs weekly, spins up a fresh environment, restores the backup, runs schema and row-count assertions, and reports the result to an SLO dashboard.
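The five requirements can be checked mechanically against a backup inventory. The sketch below is one way to do that, under stated assumptions: the `Copy` type and its fields are hypothetical, storage class stands in for "media type", and "off-site" is read as "outside the primary region".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Copy:
    name: str
    media: str    # storage class used as a proxy for media type
    region: str
    account: str
    immutable: bool
    days_since_restore_test: Optional[int]  # None means never tested


def check_3_2_1_1_0(copies, primary_region, prod_account, max_test_age_days=7):
    """Return the list of violated requirements (empty list means compliant)."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.region != primary_region for c in copies):
        problems.append("no off-site copy")
    if not any(c.account != prod_account and c.immutable for c in copies):
        problems.append("no isolated immutable copy")
    untested = [c.name for c in copies
                if c.days_since_restore_test is None
                or c.days_since_restore_test > max_test_age_days]
    if untested:
        problems.append("unverified copies: " + ", ".join(untested))
    return problems


copies = [
    Copy("copy1-primary", "s3-standard", "us-east-1", "prod", True, 2),
    Copy("copy2-cross-region", "s3-standard", "us-west-2", "prod", True, 5),
    Copy("copy3-vault", "glacier-deep-archive", "us-west-2", "vault", True, None),
]
problems = check_3_2_1_1_0(copies, primary_region="us-east-1", prod_account="prod")
print(problems)
```

In this example the layout satisfies 3-2-1-1, but the vault copy has never been restored, so the "0" fails.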

Cross-Region and Cross-Account Replication

Replication and backup serve different purposes but must be coordinated. S3 Cross-Region Replication (CRR) asynchronously copies objects to a destination bucket in another region. It is the right tool for minimizing RPO on object data: most user uploads in us-east-1 appear in us-west-2 within seconds to minutes, and S3 Replication Time Control (RTC) adds an SLA of 99.99% of objects within 15 minutes. But CRR is not a backup. It replicates overwrites, so ransomware-encrypted versions reach the replica, and it replicates delete markers whenever DeleteMarkerReplication is set to Enabled. Keep that setting Disabled for backup targets; permanent deletes of specific versions are never replicated. The DR vault (Copy 3) must be in a separate account with Object Lock Compliance and no delete marker replication.

The cross-account replication pattern uses an IAM role in the source (production) account that the S3 service is allowed to assume. The destination account's bucket policy grants that role only s3:ReplicateObject and s3:ReplicateTags, never s3:ReplicateDelete or s3:DeleteObject. This means the source account can write to the vault but never delete from it.

# Terraform: cross-account S3 replication from production to DR vault.
# Production account role that S3 Replication service assumes.

resource "aws_iam_role" "s3_replication" {
  name = "s3-backup-replication-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "s3.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

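resource "aws_iam_role_policy" "replication_policy" {
  role = aws_iam_role.s3_replication.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Read the replication configuration and list the source bucket.
        Effect   = "Allow"
        Action   = ["s3:GetReplicationConfiguration", "s3:ListBucket"]
        Resource = "arn:aws:s3:::myco-db-backups-primary"
      },
      {
        # Read source object versions, ACLs, and tags for replication.
        Effect = "Allow"
        Action = ["s3:GetObjectVersionForReplication",
                  "s3:GetObjectVersionAcl",
                  "s3:GetObjectVersionTagging"]
        Resource = "arn:aws:s3:::myco-db-backups-primary/*"
      },
      {
        # Write replicas to the vault. s3:ReplicateDelete is deliberately
        # omitted so deletes in production never reach the vault.
        Effect = "Allow"
        Action = ["s3:ReplicateObject",
                  "s3:ReplicateTags"]
        Resource = "arn:aws:s3:::myco-db-backups-dr-vault/*"
      }
    ]
  })
}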

# DR vault bucket policy (in the DR AWS account): allows replication writes only.
# No delete action is ever granted; Object Lock provides the final defence.
resource "aws_s3_bucket_policy" "dr_vault_policy" {
  provider = aws.dr_account
  bucket   = aws_s3_bucket.dr_vault.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid       = "AllowCrossAccountReplicate"
        Effect    = "Allow"
        Principal = { AWS = "arn:aws:iam::<PROD_ACCOUNT_ID>:role/s3-backup-replication-role" }
        Action    = ["s3:ReplicateObject", "s3:ReplicateTags"]
        Resource  = "arn:aws:s3:::myco-db-backups-dr-vault/*"
      }
    ]
  })
}
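Because a single delete-capable action in an Allow statement defeats the write-only design, it can be worth linting policy documents before they are applied. The following is a minimal sketch, assuming the policy is available as a parsed JSON dictionary; the function name and the list of actions it looks for are illustrative choices, and it does not evaluate NotAction, conditions, or Deny statements.

```python
from fnmatch import fnmatchcase

# Actions that would let a principal remove data or bypass a lock.
DELETE_ACTIONS = (
    "s3:DeleteObject",
    "s3:DeleteObjectVersion",
    "s3:ReplicateDelete",
    "s3:BypassGovernanceRetention",
)


def delete_grants(policy: dict) -> list[str]:
    """Return the delete-capable actions granted by any Allow statement."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    granted = set()
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        patterns = stmt.get("Action", [])
        if isinstance(patterns, str):
            patterns = [patterns]
        for pattern in patterns:
            for action in DELETE_ACTIONS:
                # IAM action names are case-insensitive and may use wildcards.
                if fnmatchcase(action.lower(), pattern.lower()):
                    granted.add(action)
    return sorted(granted)


write_only = {"Statement": [{"Effect": "Allow",
                             "Action": ["s3:ReplicateObject", "s3:ReplicateTags"]}]}
too_broad = {"Statement": [{"Effect": "Allow",
                            "Action": ["s3:ReplicateObject", "s3:Replicate*"]}]}

print(delete_grants(write_only))
print(delete_grants(too_broad))
```

A check like this fits naturally in CI, failing the pipeline whenever the returned list for a vault-facing policy is not empty.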

Backup Monitoring and the Restore SLO

A backup pipeline is a production system and must be monitored like one. The key metrics are: backup job success rate, backup duration (alert if +50% above baseline — often the first sign of data growth hitting I/O limits), backup size versus prior run (an anomalous 10x increase means a bug or a data explosion), and last successful restore age (alert if older than 7 days).
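The four alert conditions above translate directly into code. This sketch uses the thresholds from the text; the `BackupRun` type and the function signature are assumptions for illustration, and a real setup would more likely express these as rules in the monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupRun:
    succeeded: bool
    duration_min: float
    size_gb: float


def backup_alerts(run: BackupRun, baseline_duration_min: float,
                  previous_size_gb: float, days_since_restore_test: float) -> list[str]:
    """Evaluate one backup run against the alert thresholds."""
    alerts = []
    if not run.succeeded:
        alerts.append("backup job failed")
    if run.duration_min > 1.5 * baseline_duration_min:
        alerts.append("duration more than 50% above baseline")
    if previous_size_gb > 0 and run.size_gb >= 10 * previous_size_gb:
        alerts.append("size grew 10x or more versus prior run")
    if days_since_restore_test > 7:
        alerts.append("last successful restore older than 7 days")
    return alerts


run = BackupRun(succeeded=True, duration_min=200, size_gb=2500)
alerts = backup_alerts(run, baseline_duration_min=120,
                       previous_size_gb=2400, days_since_restore_test=9)
print(alerts)
```

Here the job itself succeeded, yet two alerts fire: the run was slow and no restore has been verified recently.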

Model your restore objective explicitly: if your RTO is 1 hour, the data restore must complete within the RTO minus the time spent detecting the failure, deciding to fail over, and bringing the application back up, which often leaves only around 30 minutes for actual data restoration. Test that your restoration process, including DNS failover and application start-up, meets that budget. A very common DR audit finding is that backup jobs complete successfully but restores take 4× the assumed time because no one has run one under production-scale load in the past 12 months.
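The restore budget is simple arithmetic, but writing it down makes the assumptions visible. In this sketch every duration is an assumed example value and the function names are hypothetical.

```python
def restore_budget_minutes(rto_min: float, detect_and_decide_min: float,
                           dns_failover_min: float, app_startup_min: float) -> float:
    """Minutes left for the data restore once every other step is subtracted."""
    budget = rto_min - detect_and_decide_min - dns_failover_min - app_startup_min
    if budget <= 0:
        raise ValueError("no time left for data restoration within the RTO")
    return budget


def meets_rto(measured_restore_min: float, budget_min: float) -> bool:
    """Compare a measured restore duration against the budget."""
    return measured_restore_min <= budget_min


# Assumed figures: 60-minute RTO, 15 minutes to detect and decide,
# 5 minutes for DNS failover, 10 minutes for application start-up.
budget = restore_budget_minutes(60, 15, 5, 10)
assumed_restore = 20
measured_restore = 4 * assumed_restore  # the "4x the assumed time" case

print(budget)
print(meets_rto(assumed_restore, budget))
print(meets_rto(measured_restore, budget))
```

The assumed 20-minute restore fits the 30-minute budget, while the measured 80-minute restore does not, and only a real test reveals which number is true.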

Automate restore verification weekly. Write a pipeline job (GitHub Actions, Jenkins, or a scheduled Lambda) that restores the latest backup to a throwaway RDS instance, runs SELECT COUNT(*) and row hash spot checks on 10 critical tables, asserts the row counts are within 0.1% of production, then drops the instance. For a modest database the whole job can finish in about 20 minutes; larger datasets take longer, which is itself worth measuring. A failed run pages on-call. This single automation closes the biggest gap in most organizations' DR posture: untested backups discovered only during an actual disaster.
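The row-count assertion at the heart of that job can be sketched as follows. The table names and counts are made up, and the function assumes the counts have already been collected from production and from the restored instance by whatever query runner the pipeline uses.

```python
def row_count_failures(prod_counts: dict, restored_counts: dict,
                       tolerance: float = 0.001) -> list[str]:
    """Return one message per table whose restored count drifts past tolerance."""
    failures = []
    for table, prod in prod_counts.items():
        restored = restored_counts.get(table)
        if restored is None:
            failures.append(f"{table}: missing in restore")
            continue
        if prod == 0:
            drift = 0.0 if restored == 0 else 1.0
        else:
            drift = abs(restored - prod) / prod
        if drift > tolerance:
            failures.append(f"{table}: drift {drift:.4%}")
    return failures


prod = {"orders": 1_000_000, "users": 50_000, "invoices": 200_000}
restored = {"orders": 999_500, "users": 49_000}

failures = row_count_failures(prod, restored)
print(failures)
```

A non-empty result should fail the job and page on-call. The tolerance exists because production keeps changing after the backup was taken, so small differences are expected.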

Production pitfall: backing up to the same account as production. If your production AWS account is compromised, and your backup bucket lives in the same account, an attacker can delete backups and production simultaneously. This failure mode has ended companies: the code-hosting service Code Spaces shut down in 2014 after an attacker with access to its AWS console deleted both its production data and its backups. The DR vault bucket must live in a separate AWS account, ideally in a separate AWS Organization under a separate email address with a hardware MFA device that is physically stored off-site.