Disaster Recovery & Multi-Region

The Cost of Resilience

Every DR tier you add costs money — sometimes a lot of money. A synchronous multi-region active-active deployment that cuts RTO to under a minute can easily cost 3–4× the bill of a single-region setup. That is fine if the service is your payment processor. It is almost certainly wrong for your internal analytics dashboard. The discipline of DR cost engineering is matching resilience spend to business value — no more, no less.

The Real Economics: What You Are Actually Buying

DR infrastructure has three independent cost axes:

Compute standby cost — idle or low-utilisation replicas sitting warm in a secondary region. A warm standby at 25 % of primary capacity for a 200-node cluster means 50 additional nodes paying full reserved-instance rates, 24 × 7.
Data replication cost — cross-region data transfer is billed per GB on every major cloud. AWS charges $0.02/GB for inter-region transfer. A database pushing 500 GB/day of WAL replication adds ~$300/month before storage. At 5 TB/day (large SaaS), that is $3,000/month on transfer alone.
Operational overhead — runbooks, game days, chaos engineering, on-call rotations that cover two regions, and the engineering time to keep configs in sync. Google SRE estimates that operating a multi-region service costs 1.5–2× the SRE headcount of a single-region service.
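
The first two axes reduce to simple arithmetic, and it is worth scripting them so the numbers can be rerun as traffic grows. The sketch below reproduces the figures above. The function names, the 30-day month, and the default $0.02/GB rate are assumptions lifted from this section's examples, not a pricing API, so check current rates for your provider and region pair.

```python
import math


def standby_nodes(primary_nodes: int, standby_fraction: float) -> int:
    """Nodes kept warm in the secondary region, rounded up to whole nodes."""
    # round() guards against float noise such as 30 * 0.1 == 3.0000000000000004
    return math.ceil(round(primary_nodes * standby_fraction, 9))


def monthly_transfer_cost(gb_per_day: float,
                          usd_per_gb: float = 0.02,
                          days_per_month: int = 30) -> float:
    """Cross-region transfer bill for one month of steady replication."""
    return gb_per_day * days_per_month * usd_per_gb


if __name__ == "__main__":
    print(f"warm standby nodes: {standby_nodes(200, 0.25)}")
    print(f"500 GB/day: ${monthly_transfer_cost(500):,.0f}/month")
    print(f"5 TB/day:   ${monthly_transfer_cost(5_000):,.0f}/month")
```

Operational overhead is deliberately left out: it is headcount, not a metered rate, and is better estimated with the teams who will carry the pager.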

These three axes compound. Before committing to a tier, price each one explicitly, then divide by your estimated annual downtime cost to get a cost-of-resilience ratio.

Annual avoided-downtime formula: Avoided downtime cost = (Revenue/hour) × (current MTTR − RTO target) × (expected incidents/year). If you earn $50,000/hour and expect two incidents/year with a current MTTR of 4 hours, moving RTO from 4 h to 15 min saves $50,000 × 3.75 h × 2 = $375,000/year. If the warm-standby upgrade costs $200,000/year, the ROI is positive. If it costs $600,000, it is not.
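
The formula above fits in a few lines of Python, which makes it easy to rerun for each candidate tier upgrade. This is a minimal sketch: the function and parameter names are illustrative, and it assumes every incident costs the same revenue per hour.

```python
def avoided_downtime_cost(revenue_per_hour: float,
                          current_mttr_hours: float,
                          rto_target_hours: float,
                          incidents_per_year: float) -> float:
    """Annual revenue protected by shortening recovery from MTTR to the RTO target."""
    # A target slower than today's MTTR saves nothing, so clamp at zero.
    hours_saved = max(current_mttr_hours - rto_target_hours, 0.0)
    return revenue_per_hour * hours_saved * incidents_per_year


def net_benefit(avoided_cost: float, annual_dr_cost: float) -> float:
    """Positive means the upgrade pays for itself; negative means it does not."""
    return avoided_cost - annual_dr_cost


if __name__ == "__main__":
    saved = avoided_downtime_cost(50_000, 4.0, 0.25, 2)
    for dr_cost in (200_000, 600_000):
        print(f"DR cost ${dr_cost:,}: net ${net_benefit(saved, dr_cost):,.0f}")
```

With the example inputs, the $200,000 upgrade nets +$175,000/year and the $600,000 upgrade nets −$225,000/year.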

Service Criticality Tiering

The standard industry framework assigns every service one of four tiers. Tier assignments drive DR requirements, not the other way around.

Service criticality tiers with RTO/RPO targets and matching DR strategies.

The single most common DR cost mistake at large organisations is treating every service as Tier 0. When everything is "critical", nothing is. A realistic Fortune-500 service portfolio has roughly 5 % Tier 0, 20 % Tier 1, 50 % Tier 2, and 25 % Tier 3 services. Applying Tier 0 treatment across the board inflates the DR budget by 4–6×.
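
One way to sanity-check that multiplier is to compare a tiered portfolio with an all-Tier-0 one. The sketch below assumes every service has the same base cost and borrows the rough per-tier DR overhead fractions used by the cost-model script later in this lesson (60 %, 35 %, 10 %, 2 %). Both are simplifying assumptions, so treat the result as an order-of-magnitude check rather than a forecast.

```python
# Share of services in each tier (from the portfolio split above)
PORTFOLIO_SHARE = {0: 0.05, 1: 0.20, 2: 0.50, 3: 0.25}

# Assumed DR overhead as a fraction of each service's base cost
DR_OVERHEAD = {0: 0.60, 1: 0.35, 2: 0.10, 3: 0.02}


def blended_overhead(share: dict, overhead: dict) -> float:
    """Portfolio-wide DR overhead when each service gets its own tier's treatment."""
    return sum(share[tier] * overhead[tier] for tier in share)


def all_tier0_inflation(share: dict, overhead: dict) -> float:
    """How many times larger the DR budget is if everything is treated as Tier 0."""
    return overhead[0] / blended_overhead(share, overhead)


if __name__ == "__main__":
    print(f"tiered overhead:  {blended_overhead(PORTFOLIO_SHARE, DR_OVERHEAD):.1%}")
    print(f"all-Tier-0 ratio: {all_tier0_inflation(PORTFOLIO_SHARE, DR_OVERHEAD):.1f}x")
```

Under these assumptions the ratio comes out just under 4×. If Tier 0 services are also the most expensive ones to run, the real figure shifts, which is why the per-service numbers matter more than the portfolio average.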

Tiering Services in Practice

Assign tiers using a structured scoring matrix, not tribal knowledge. Score each service on four dimensions (1–5 scale):

Revenue impact — does the service directly process money or block transactions?
Customer-facing SLA — is there a contractual or regulatory uptime commitment?
Dependency fan-out — how many other services call this one?
Data loss sensitivity — is data loss recoverable or catastrophic?

Score of 17–20 → Tier 0. Score 12–16 → Tier 1. Score 7–11 → Tier 2. Score 4–6 → Tier 3. Encode this in a YAML service catalogue that your infrastructure-as-code reads to provision the correct DR tier automatically.
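
The scoring matrix translates directly into code. The sketch below is one possible encoding; the class and field names are hypothetical, and only the four dimensions and the score bands come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceScore:
    """One service's scores on the four dimensions, each on a 1-5 scale."""
    revenue_impact: int
    customer_sla: int
    dependency_fan_out: int
    data_loss_sensitivity: int

    def total(self) -> int:
        values = (self.revenue_impact, self.customer_sla,
                  self.dependency_fan_out, self.data_loss_sensitivity)
        for value in values:
            if not 1 <= value <= 5:
                raise ValueError(f"scores must be between 1 and 5, got {value}")
        return sum(values)


def tier_for(score: ServiceScore) -> int:
    """Map a total score (4-20) onto a criticality tier (0 is most critical)."""
    total = score.total()
    if total >= 17:
        return 0
    if total >= 12:
        return 1
    if total >= 7:
        return 2
    return 3


if __name__ == "__main__":
    payment = ServiceScore(5, 5, 4, 5)      # total 19
    reporting = ServiceScore(1, 1, 2, 2)    # total 6
    print(tier_for(payment), tier_for(reporting))
```

Keeping the raw scores next to the derived tier in the catalogue makes the annual review easier: you can see which dimension moved, not just that the tier changed.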

Automate tier enforcement: store the tier in a service catalogue (a Backstage YAML or a custom ConfigMap) and write a Terraform module that consumes it. Tier 0 services automatically get multi-region Route 53 health checks, global Aurora clusters, and synchronous replication. Tier 3 services get a nightly snapshot and nothing else. Engineering decisions should not require a meeting every time a new service is deployed.

Cost-Optimisation Techniques Per Tier

Tier 0 cost controls: Use reserved instances or committed-use discounts for standby compute (you know you need it 24 × 7, so pay 1-year or 3-year rates). Use synchronous replication only for the write path; reads can be regional. Global load balancers (AWS Global Accelerator, GCP Premium Tier) add ~$0.01/GB but can prevent latency-SLA violations that cost far more.

Tier 1 cost controls: Size warm standbys at 25–50 % of primary, not 100 %, and scale out during failover via ASG/HPA rather than pre-provisioning full parity. Because failovers often happen while the primary is under load, verify in game days that scale-out completes within the RTO. Use Aurora Global Database instead of self-managed replication: it is cheaper to operate and the cross-region replication lag is typically under 1 second.

Tier 2 cost controls: Pilot light means only the database replicates continuously; application compute is off (stopped instances, or spot capacity requested only at failover). When a failover is declared, a targeted terraform apply spins up compute in minutes, and the database is already warm. This pattern cuts standby compute cost by 80–90 % compared to a warm standby.

Tier 3 cost controls: Point-in-time recovery (PITR) on managed databases costs a fraction of a cent per GB-month. Combined with a weekly AMI snapshot and S3 Cross-Region Replication (CRR) at $0.015/GB, a 1 TB Tier 3 service costs roughly $25/month for its entire DR infrastructure. That is the correct answer — not a hot standby.

# Terraform: tier-driven DR module dispatch
# services.yaml excerpt
services:
  payment-service:
    tier: 0
    regions: [us-east-1, eu-west-1, ap-southeast-1]
  order-service:
    tier: 1
    regions: [us-east-1, eu-west-1]
  reporting-service:
    tier: 3
    regions: [us-east-1]

# main.tf — read the catalogue, dispatch to the right module
locals {
  # Resolve the catalogue relative to this module, not the caller's working directory
  svc = yamldecode(file("${path.module}/services.yaml")).services
}

module "payment_dr" {
  source  = "./modules/dr-tier0"
  count   = local.svc["payment-service"].tier == 0 ? 1 : 0
  regions = local.svc["payment-service"].regions
}

module "order_dr" {
  source  = "./modules/dr-tier1"
  count   = local.svc["order-service"].tier == 1 ? 1 : 0
  regions = local.svc["order-service"].regions
}

module "reporting_dr" {
  source = "./modules/dr-tier3"
  count  = local.svc["reporting-service"].tier == 3 ? 1 : 0
  region = local.svc["reporting-service"].regions[0]
}
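
Because Terraform trusts whatever the catalogue says, a small pre-flight check in CI can catch a mistyped tier or a missing region before anything is provisioned. The following is a sketch, not part of the Terraform setup above: the validate_catalogue function is hypothetical, and the rule that Tier 0 and Tier 1 services need at least two regions is an assumption drawn from their multi-region and warm-standby strategies.

```python
VALID_TIERS = (0, 1, 2, 3)
MULTI_REGION_TIERS = (0, 1)  # assumption: these tiers need a secondary region


def validate_catalogue(services: dict) -> list[str]:
    """Return human-readable problems; an empty list means the catalogue passes."""
    errors = []
    for name, spec in services.items():
        tier = spec.get("tier")
        regions = spec.get("regions") or []
        # bool is a subclass of int in Python, so reject True/False explicitly
        if isinstance(tier, bool) or tier not in VALID_TIERS:
            errors.append(f"{name}: tier must be 0-3, got {tier!r}")
            continue
        if not regions:
            errors.append(f"{name}: at least one region is required")
        if len(set(regions)) != len(regions):
            errors.append(f"{name}: duplicate regions listed")
        if tier in MULTI_REGION_TIERS and len(set(regions)) < 2:
            errors.append(f"{name}: tier {tier} needs at least two regions")
    return errors


if __name__ == "__main__":
    # Same shape as the services.yaml excerpt, already parsed into a dict
    catalogue = {
        "payment-service": {"tier": 0, "regions": ["us-east-1", "eu-west-1", "ap-southeast-1"]},
        "order-service": {"tier": 1, "regions": ["us-east-1", "eu-west-1"]},
        "reporting-service": {"tier": 3, "regions": ["us-east-1"]},
    }
    problems = validate_catalogue(catalogue)
    print("catalogue OK" if not problems else "\n".join(problems))
```

The function takes an already-parsed dict so it can be tested without a YAML library; in a pipeline you would feed it the result of loading services.yaml.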

The Hidden Costs Nobody Budgets

Egress amplification: Multi-region active-active doubles outbound traffic for any globally-replicated write. If your primary region handles 10 Gbps of writes, you will pay for another ~10 Gbps of cross-region replication. Sustained around the clock, 10 Gbps is about 1.25 GB/s, or roughly 39 PB per year; at the $0.02/GB inter-region rate that is close to $790,000/year. Even at a fraction of that average throughput, it is a line item that disappears in early architecture conversations and reappears in the first cloud bill.

Test infrastructure: A DR system that is never tested is not a DR system. Each game day costs roughly 1–2 engineer-days per Tier 0 service in preparation and execution. For a team with 20 Tier 0 services running quarterly game days, that is 20 services × 4 game days × 1–2 days = 80–160 engineer-days per year. Budget for it explicitly or it will not happen.

Observability duplication: Every secondary region needs its own monitoring stack (Prometheus, Grafana, alerting pipelines). These are not optional; you cannot fail over to a region you cannot observe. A lean secondary observability stack costs 15–25 % of the primary stack.

Avoid the "disaster tax" trap: some organisations add DR infrastructure reactively after every incident, without updating their tier assignments. Over three years this produces a baroque mess where Tier 3 services have Tier 1 infrastructure (wasting money) while some actual Tier 1 services are still on Tier 3 budgets (dangerous). Schedule an annual tier-review with finance, product, and engineering. Demote or promote based on current revenue contribution, not historical incident trauma.

Building a DR Cost Model

Before any DR architecture discussion, build a simple cost model. The following script pulls month-to-date spend from AWS Cost Explorer grouped by a Service cost-allocation tag, annotates each service with its tier from the catalogue, and estimates the DR spend ratio (DR infrastructure cost divided by total service cost) from a rough per-tier overhead fraction.

#!/usr/bin/env python3
# dr_cost_model.py: annotate AWS Cost Explorer output with tier data
from datetime import date, timedelta

import boto3
import yaml

# Approximate DR overhead as a fraction of service cost, keyed by tier
DR_OVERHEAD_BY_TIER = {0: 0.60, 1: 0.35, 2: 0.10, 3: 0.02}

ce = boto3.client("ce", region_name="us-east-1")

with open("services.yaml") as f:
    catalogue = yaml.safe_load(f)["services"]

# Month-to-date window. End is exclusive and must be later than Start,
# so on the 1st of the month report the previous full month instead.
end = date.today()
start = end.replace(day=1)
if start == end:
    start = (end - timedelta(days=1)).replace(day=1)

# "Service" must be activated as a cost-allocation tag for this grouping to return data
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "Service"}],
)

print(f"{'Service':<30} {'Tier':<6} {'Cost (USD)':<14} {'DR Ratio'}")
print("-" * 65)

for group in resp["ResultsByTime"][0]["Groups"]:
    # Group keys look like "Service$<value>"; the value is empty for untagged spend
    svc_name = group["Keys"][0].split("$", 1)[-1] or "(untagged)"
    cost     = float(group["Metrics"]["UnblendedCost"]["Amount"])
    tier     = catalogue.get(svc_name, {}).get("tier", "?")

    # Services missing from the catalogue show tier "?" and 0 % overhead
    dr_overhead = DR_OVERHEAD_BY_TIER.get(tier, 0)
    dr_cost = cost * dr_overhead
    print(f"{svc_name:<30} {str(tier):<6} ${cost:<13.2f} {dr_overhead*100:.0f}% (~${dr_cost:.0f})")

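Run this monthly in your cost-management pipeline. The per-tier overhead fractions in the script are rough estimates, so replace them with measured DR spend wherever your tagging can separate DR resources from primary ones. Flag any service where the measured DR overhead exceeds 50 % of total service cost: that is a signal the tier assignment is wrong or the architecture is inefficient for its actual risk profile.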
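
The flagging rule itself is a short filter over per-service costs. In the sketch below, the ServiceCost record and the sample figures are hypothetical; in practice dr_cost would come from whatever tagging or account structure separates DR resources from primary ones.

```python
from typing import NamedTuple


class ServiceCost(NamedTuple):
    name: str
    tier: int
    total_cost: float  # all spend attributed to the service, USD
    dr_cost: float     # the part of total_cost spent on DR infrastructure, USD


def flag_high_dr_overhead(rows, threshold: float = 0.50):
    """Return (name, ratio) pairs whose DR share exceeds the threshold, worst first."""
    flagged = []
    for row in rows:
        if row.total_cost <= 0:
            continue  # nothing billed this period, so no meaningful ratio
        ratio = row.dr_cost / row.total_cost
        if ratio > threshold:
            flagged.append((row.name, ratio))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = [
        ServiceCost("order-service", 1, 40_000.0, 12_000.0),
        ServiceCost("reporting-service", 3, 2_000.0, 1_300.0),
        ServiceCost("idle-service", 2, 0.0, 0.0),
    ]
    for name, ratio in flag_high_dr_overhead(sample):
        print(f"{name}: DR is {ratio:.0%} of total cost")
```

In the sample data only reporting-service is flagged: a Tier 3 service spending 65 % of its budget on DR is exactly the kind of mismatch the annual tier review should resolve.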

Resilience is not free, and it should not be. The goal is not to minimise DR spend — it is to ensure every dollar of DR spend is justified by a proportional reduction in business risk. A payment service with $50,000/hour revenue exposure has no business running on a Tier 2 pilot-light setup. A nightly batch report with zero customer SLA has no business running on a cross-region active-active cluster. Get the tier right and the cost naturally follows.