The Cost of Resilience
Every DR tier you add costs money — sometimes a lot of money. A synchronous multi-region active-active deployment that cuts RTO to under a minute can easily cost 3–4× the bill of a single-region setup. That is fine if the service is your payment processor. It is almost certainly wrong for your internal analytics dashboard. The discipline of DR cost engineering is matching resilience spend to business value — no more, no less.
The Real Economics: What You Are Actually Buying
DR infrastructure has three independent cost axes:
- Compute standby cost — idle or low-utilisation replicas sitting warm in a secondary region. A warm standby at 25 % of primary capacity for a 200-node cluster means 50 additional nodes paying full reserved-instance rates, 24 × 7.
- Data replication cost — cross-region data transfer is billed per GB on every major cloud. AWS charges $0.02/GB for inter-region transfer. A database pushing 500 GB/day of WAL replication adds ~$300/month before storage. At 5 TB/day (large SaaS), that is $3,000/month on transfer alone.
- Operational overhead — runbooks, game days, chaos engineering, on-call rotations that cover two regions, and the engineering time to keep configs in sync. Google SRE estimates that operating a multi-region service costs 1.5–2× the SRE headcount of a single-region service.
These three axes compound. Before committing to a tier, price each one explicitly, then divide by your estimated annual downtime cost to get a cost-of-resilience ratio.
Avoided downtime cost = (Revenue/hour) × (MTTR − RTO target) × (expected incidents/year). If you earn $50,000/hour and expect two incidents/year with a current MTTR of 4 hours, moving RTO from 4 h to 15 min saves 3.75 hours per incident, or $50,000 × 3.75 × 2 = $375,000/year. If the warm-standby upgrade costs $200,000/year, the ROI is positive. If it costs $600,000/year, it is not.
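The arithmetic above can be captured in a few lines so that every proposal is priced the same way. This is a minimal sketch, not a finished model: the `DrProposal` class and its field names are illustrative, and the ratio is computed against the avoided downtime cost, which is one reasonable reading of the cost-of-resilience ratio.

```python
from dataclasses import dataclass


@dataclass
class DrProposal:
    """Inputs for one DR upgrade decision (illustrative names, not a standard)."""
    revenue_per_hour: float    # dollars lost per hour of downtime
    current_mttr_hours: float  # recovery time today
    target_rto_hours: float    # recovery time after the upgrade
    incidents_per_year: float  # expected qualifying incidents per year
    annual_dr_cost: float      # compute + replication + operations, per year


def avoided_downtime_cost(p: DrProposal) -> float:
    """Revenue/hour x (MTTR - RTO target) x incidents/year."""
    hours_saved = max(p.current_mttr_hours - p.target_rto_hours, 0.0)
    return p.revenue_per_hour * hours_saved * p.incidents_per_year


def cost_of_resilience_ratio(p: DrProposal) -> float:
    """Annual DR cost divided by avoided downtime cost; below 1.0 pays for itself."""
    saved = avoided_downtime_cost(p)
    return float("inf") if saved == 0 else p.annual_dr_cost / saved


warm_standby = DrProposal(50_000, 4.0, 0.25, 2, 200_000)
print(avoided_downtime_cost(warm_standby))               # 375000.0
print(round(cost_of_resilience_ratio(warm_standby), 2))  # 0.53
```

A ratio above 1.0, as in the $600,000/year case, means the upgrade costs more than the downtime it is expected to prevent.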
Service Criticality Tiering
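The standard industry framework assigns every service one of four tiers, from Tier 0 (most critical) to Tier 3 (least critical). In this section, Tier 0 maps to multi-region active-active, Tier 1 to warm standby, Tier 2 to pilot light, and Tier 3 to backup and restore. Tier assignments drive DR requirements, not the other way around.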
The single most common DR cost mistake at large organisations is treating every service as Tier 0. When everything is "critical", nothing is. A realistic Fortune-500 service portfolio has roughly 5 % Tier 0, 20 % Tier 1, 50 % Tier 2, and 25 % Tier 3 services. Applying Tier 0 treatment across the board inflates the DR budget by 4–6×.
Tiering Services in Practice
Assign tiers using a structured scoring matrix, not tribal knowledge. Score each service on four dimensions (1–5 scale):
- Revenue impact — does the service directly process money or block transactions?
- Customer-facing SLA — is there a contractual or regulatory uptime commitment?
- Dependency fan-out — how many other services call this one?
- Data loss sensitivity — is data loss recoverable or catastrophic?
Score of 17–20 → Tier 0. Score 12–16 → Tier 1. Score 7–11 → Tier 2. Score 4–6 → Tier 3. Encode this in a YAML service catalogue that your infrastructure-as-code reads to provision the correct DR tier automatically.
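The thresholds above translate directly into a small function. The sketch below assumes the catalogue has already been parsed into a dictionary; the dimension keys and service names are hypothetical, chosen only to mirror the four dimensions listed above.

```python
DIMENSIONS = (
    "revenue_impact",
    "customer_sla",
    "dependency_fan_out",
    "data_loss_sensitivity",
)


def tier_for(scores: dict[str, int]) -> int:
    """Map four 1-5 scores to a DR tier (0 is the most critical)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in DIMENSIONS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    total = sum(scores[d] for d in DIMENSIONS)
    if total >= 17:
        return 0
    if total >= 12:
        return 1
    if total >= 7:
        return 2
    return 3


# Hypothetical entries, shaped the way a parsed YAML catalogue might look.
catalogue = {
    "payments-api": {"revenue_impact": 5, "customer_sla": 5,
                     "dependency_fan_out": 4, "data_loss_sensitivity": 5},
    "nightly-report": {"revenue_impact": 1, "customer_sla": 1,
                       "dependency_fan_out": 1, "data_loss_sensitivity": 2},
}

for service, scores in catalogue.items():
    print(service, "-> Tier", tier_for(scores))
```

Rejecting missing or out-of-range scores keeps a half-filled catalogue entry from silently landing in Tier 3.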
Cost-Optimisation Techniques Per Tier
Tier 0 cost controls: Use reserved instances or committed-use discounts for standby compute: you know you need it 24 × 7, so pay 1-year or 3-year rates. Use synchronous replication only for the write path; reads can be served regionally. Global load balancers (AWS Global Accelerator, GCP Premium Tier) add ~$0.01/GB but prevent latency SLA violations that cost far more.
Tier 1 cost controls: Size warm standbys at 25–50 % of primary, not 100 %. Scale out during failover via ASG/HPA rather than pre-provisioning full parity, and because most failovers happen while the primary is under load, verify in game days that the scale-out completes within the RTO. Use Aurora Global Database instead of self-managed replication: it is cheaper to operate and the cross-region replication lag is typically under 1 second.
Tier 2 cost controls: Pilot light means only the database replicates continuously; application compute is off (stopped instances, or spot capacity launched only at failover). When a failover is declared, `terraform apply` with a targeted plan spins up compute in minutes. The database is already warm. This pattern cuts standby compute cost by 80–90 % compared to a warm standby.
Tier 3 cost controls: Point-in-time recovery (PITR) on managed databases costs a fraction of a cent per GB-month. Combined with a weekly AMI snapshot and S3 Cross-Region Replication (CRR) at $0.015/GB, a 1 TB Tier 3 service costs roughly $25/month for its entire DR infrastructure. That is the correct answer — not a hot standby.
The Hidden Costs Nobody Budgets
Egress amplification: Multi-region active-active doubles outbound traffic for any globally-replicated write. If your primary region handles 10 Gbps of writes, you will pay for another ~10 Gbps of cross-region replication. Sustained, 10 Gbps is 1.25 GB/s, or about 39.4 million GB/year; at the $0.02/GB inter-region rate, that is roughly $790,000/year. It is a line item that disappears in early architecture conversations and reappears in the first cloud bill.
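Converting a replication bit rate into an annual transfer bill is easy to get wrong by a factor of eight (bits versus bytes), so it is worth scripting. The sketch below assumes a flat per-GB rate, defaulting to the $0.02/GB inter-region figure quoted earlier, and decimal gigabytes; the `utilisation` parameter is an added assumption for links that are not saturated around the clock.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def annual_transfer_cost(gbps: float,
                         rate_per_gb: float = 0.02,
                         utilisation: float = 1.0) -> float:
    """Yearly cross-region transfer cost for a given replication bit rate.

    gbps         -- replication throughput in gigabits per second
    rate_per_gb  -- assumed flat transfer price in dollars per GB
    utilisation  -- fraction of the year the link runs at that rate (0.0-1.0)
    """
    if not 0.0 <= utilisation <= 1.0:
        raise ValueError("utilisation must be between 0 and 1")
    gb_per_second = gbps / 8  # gigabits -> gigabytes
    gb_per_year = gb_per_second * SECONDS_PER_YEAR * utilisation
    return gb_per_year * rate_per_gb


print(f"${annual_transfer_cost(10):,.0f} per year at full utilisation")
print(f"${annual_transfer_cost(10, utilisation=0.25):,.0f} per year at 25 % utilisation")
```

Running it with your measured average write rate, rather than the peak, gives a figure closer to what the bill will show.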
Test infrastructure: A DR system that is never tested is not a DR system. Each game day costs roughly 1–2 engineer-days per Tier 0 service in preparation and execution. For a team with 20 Tier 0 services running quarterly game days, that is 80 game days a year, or 80–160 engineer-days. Budget for it explicitly or it will not happen.
Observability duplication: Every secondary region needs its own monitoring stack (Prometheus, Grafana, alerting pipelines). These are not optional; you cannot fail over to a region you cannot observe. A lean secondary observability stack costs 15–25 % of the primary stack.
Building a DR Cost Model
Before any DR architecture discussion, build a simple cost model. The following script pulls the current month's regional spend, annotates each service with its tier from the catalogue, and computes the DR spend ratio (DR infrastructure cost divided by total service cost).
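A minimal sketch of such a model follows. It assumes the month's spend has already been exported from your billing system as per-service, per-region records instead of calling a billing API directly. The `SpendRecord` class, the catalogue layout, and the `primary_region` key are hypothetical, and the sketch treats all spend outside a service's primary region as DR spend, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class SpendRecord:
    service: str
    region: str
    cost: float  # month-to-date spend in dollars


def dr_spend_report(records, catalogue):
    """Return {service: {"tier", "dr_cost", "total_cost", "ratio"}}."""
    totals = {}
    for rec in records:
        entry = catalogue[rec.service]
        dr_cost, total_cost = totals.get(rec.service, (0.0, 0.0))
        # Assumption: anything outside the primary region counts as DR spend.
        if rec.region != entry["primary_region"]:
            dr_cost += rec.cost
        totals[rec.service] = (dr_cost, total_cost + rec.cost)
    return {
        service: {
            "tier": catalogue[service]["tier"],
            "dr_cost": dr_cost,
            "total_cost": total_cost,
            "ratio": dr_cost / total_cost if total_cost else 0.0,
        }
        for service, (dr_cost, total_cost) in totals.items()
    }


def flag_over_threshold(report, threshold=0.5):
    """Services whose DR spend exceeds the given share of total service cost."""
    return sorted(s for s, row in report.items() if row["ratio"] > threshold)


# Hypothetical sample data.
catalogue = {
    "payments-api": {"tier": 0, "primary_region": "us-east-1"},
    "reports": {"tier": 3, "primary_region": "us-east-1"},
}
records = [
    SpendRecord("payments-api", "us-east-1", 8_000.0),
    SpendRecord("payments-api", "us-west-2", 6_000.0),
    SpendRecord("reports", "us-east-1", 400.0),
    SpendRecord("reports", "us-west-2", 600.0),
]

report = dr_spend_report(records, catalogue)
for service, row in report.items():
    print(f"{service}: tier {row['tier']}, DR ratio {row['ratio']:.0%}")
print("review:", flag_over_threshold(report))
```

In this sample the Tier 3 `reports` service spends 60 % of its budget outside the primary region and is flagged, while the Tier 0 service at 43 % is not.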
Run this monthly in your cost-management pipeline. Flag any service where the DR overhead exceeds 50 % of total service cost — that is a signal the tier assignment is wrong or the architecture is inefficient for its actual risk profile.
Resilience is not free, and it should not be. The goal is not to minimise DR spend — it is to ensure every dollar of DR spend is justified by a proportional reduction in business risk. A payment service with $50,000/hour revenue exposure has no business running on a Tier 2 pilot-light setup. A nightly batch report with zero customer SLA has no business running on a cross-region active-active cluster. Get the tier right and the cost naturally follows.