Disaster Recovery & Multi-Region

Multi-Region Application Patterns

Running an application in multiple regions is not a deployment detail — it is an architectural decision that propagates into every layer of your stack: routing logic, database schema, cache coherency, session handling, and the SLAs you can honestly promise to customers. This lesson covers the three dimensions engineers must get right before a second region ever starts taking traffic: how requests reach the right region, where data lives and why that matters, and how the system behaves when a region is partially or completely unavailable.

Traffic Routing Across Regions

At the DNS layer, three routing strategies dominate production deployments:

Latency-based routing: the DNS service answers each query with the healthy region that has the lowest measured latency from the resolver's network. AWS Route 53 offers this natively, based on latency data it collects over time rather than a live probe per query; other managed DNS and load-balancing services, Cloudflare among them, offer comparable latency-aware steering. This is the default choice for user-facing workloads: a user in Singapore hits ap-southeast-1, not us-east-1. The trap is resolver-location vs. user-location mismatch: a corporate DNS resolver in London serving a mobile user in Dubai will route that user to the wrong region.
Geolocation routing: routes based on the geographic origin of the DNS query, not measured latency. Useful for data-residency compliance (EU data must stay in the EU) and for mapping specific ISPs or countries to dedicated capacity.
Weighted routing: splits traffic across records in proportion to their weights. Used during a regional canary rollout (10% to the new region) and during a failover drain (ramp the us-east-1 weight from 100 to 0 over 5 minutes rather than a hard cut).
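The failover drain mentioned under weighted routing can be sketched as a simple linear ramp. This is a minimal illustration, not a Route 53 client: the function name `drain_schedule` and its parameters are hypothetical, and applying each weight to the DNS record is left to whatever tooling you already use.

```python
def drain_schedule(start_weight: int, steps: int, duration_s: int) -> list[tuple[int, int]]:
    """Return (seconds_from_start, weight) pairs ramping linearly from start_weight to 0."""
    if steps < 1:
        raise ValueError("steps must be >= 1")
    schedule = []
    for i in range(steps + 1):
        offset = round(duration_s * i / steps)
        weight = round(start_weight * (steps - i) / steps)
        schedule.append((offset, weight))
    return schedule


if __name__ == "__main__":
    # Drain us-east-1 from weight 100 to 0 over 5 minutes in 5 steps.
    for offset, weight in drain_schedule(100, 5, 300):
        print(f"t+{offset:>3}s  us-east-1 weight={weight}")
```

Keep in mind that a weight is relative to the sum of all weights on the record set, so the share of traffic a region receives at each step depends on what the other regions' weights are at that moment.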

DNS TTL is a hard constraint. At a 60-second TTL, a failing region can keep receiving traffic for up to 60 seconds after the record changes, and longer from resolvers or clients that ignore the TTL. Route 53 health checks with a 10-second interval and a failure threshold of 3 add roughly 30 seconds of detection time before the record changes at all. Design your RTO around the sum: with these settings, DNS-based failover takes up to roughly 90 seconds in the worst case, and the only ways to shrink it are a lower TTL or faster detection. If you need sub-30-second failover, you need an anycast layer (Cloudflare, AWS Global Accelerator, GCP Premium Tier) that re-routes at the network level, bypassing DNS caching entirely.
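The arithmetic behind that failover budget is small enough to write down. The sketch below assumes the simplest model (detection time is interval times threshold, and every resolver honors the TTL); the class name `DnsFailoverBudget` is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsFailoverBudget:
    ttl_s: int               # record TTL in seconds
    check_interval_s: int    # health check interval in seconds
    failure_threshold: int   # consecutive failures before the endpoint is marked unhealthy

    @property
    def detection_s(self) -> int:
        # Time for the health checker to declare the endpoint unhealthy.
        return self.check_interval_s * self.failure_threshold

    @property
    def worst_case_s(self) -> int:
        # Detection, then one full TTL for cached answers to expire.
        return self.detection_s + self.ttl_s


budget = DnsFailoverBudget(ttl_s=60, check_interval_s=10, failure_threshold=3)
print(budget.detection_s, budget.worst_case_s)  # prints "30 90"
```

Plugging in a 30-second check interval with the same TTL and threshold gives 150 seconds, which is why the check interval matters as much as the TTL.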

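Anycast vs. DNS routing: AWS Global Accelerator assigns static anycast IP addresses that carry traffic over the AWS backbone to the nearest healthy regional endpoint. Unlike DNS, there is no TTL to wait out and no resolver cache to expire: once health checks mark an endpoint unhealthy, new connections shift to another region without any client-side change. Detection still takes as long as the configured health-check interval and threshold. The trade-off is cost (a fixed hourly charge per accelerator plus a per-GB premium on top of standard data transfer; check current pricing for your region pairs) and the fact that all client traffic now passes through AWS edge infrastructure, which can complicate mTLS termination and source-IP logging depending on the endpoint type.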

# Route 53: latency-based routing with health check failover
# Terraform — two A records, one per region, latency policy

resource "aws_route53_health_check" "us_east" {
  fqdn              = "us-east-1.api.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 10
}

resource "aws_route53_record" "api_us_east" {
  zone_id         = var.hosted_zone_id
  name            = "api.example.com"
  type            = "A"
  set_identifier  = "us-east-1"
  health_check_id = aws_route53_health_check.us_east.id

  latency_routing_policy {
    region = "us-east-1"
  }

  alias {
    name                   = aws_lb.us_east.dns_name
    zone_id                = aws_lb.us_east.zone_id
    evaluate_target_health = true
  }
}

# Identical block for eu-west-1 (set_identifier = "eu-west-1")

Below the DNS layer, a global load balancer or service mesh controls per-request routing. Envoy and Istio support locality-aware load balancing — a pod in us-east-1a prefers endpoints in the same AZ, then the same region, and only spills to another region when local capacity is exhausted. This matters enormously for latency and cost: cross-region data transfer is both slower and billed at egress rates.
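The zone, then region, then anywhere preference can be sketched in a few lines. This is a simplified, all-or-nothing illustration under assumed names (`Endpoint`, `pick_candidates`); real meshes typically spill over gradually based on the fraction of healthy local endpoints rather than waiting for the local tier to be empty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    address: str
    region: str
    zone: str
    healthy: bool = True


def pick_candidates(endpoints: list[Endpoint], local_region: str, local_zone: str) -> list[Endpoint]:
    """Return the healthy endpoints from the most local tier that has any."""
    healthy = [e for e in endpoints if e.healthy]

    same_zone = [e for e in healthy if e.region == local_region and e.zone == local_zone]
    if same_zone:
        return same_zone

    same_region = [e for e in healthy if e.region == local_region]
    if same_region:
        return same_region

    # Last resort: cross-region traffic, which is slower and billed.
    return healthy
```

A caller would then load-balance across the returned list. Returning the whole tier rather than a single endpoint keeps the locality decision separate from the balancing algorithm.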

Data Locality

Routing is the easy part. Data is where multi-region gets hard. Every read/write must answer: which copy of the data am I operating on, and is it consistent with the others?

The dominant production patterns are:

Primary-region writes, replica reads: all writes go to one region (the primary), which replicates asynchronously to secondaries. Reads are served locally from the nearest replica. Simple to reason about; the trade-off is replica lag (typically 10–200ms for well-tuned Postgres streaming replication, though it can spike to seconds under heavy write load), which means readers may see slightly stale data. Aurora Global Database implements this pattern with a typical replication lag of under 1 second and can promote a secondary to primary in under 60 seconds.
Active-active with conflict resolution: each region accepts writes to a shared dataset. Requires either a multi-region database (CockroachDB, Spanner, DynamoDB Global Tables) or careful partitioning that ensures no two regions ever write to the same record. The databases make different trade-offs. DynamoDB Global Tables, in their default mode, accept writes locally, replicate asynchronously, and resolve conflicts with timestamp-based last-writer-wins, so writes stay fast but a concurrent update to the same item can be silently overwritten. Spanner provides external consistency using TrueTime, and CockroachDB provides serializable transactions through consensus replication; both pay for this with write latency on the order of an inter-region round-trip (50–150ms for adjacent regions).
Data locality by user partition: partition users by geography. European users own their data in the EU region; North American users own theirs in us-east-1. Each region is effectively the primary for its own shard. Steady-state operations need no cross-region reads, but failover still requires cross-region access to a replica of the failed region's shard. This is the pattern used by Stripe, Shopify, and most companies with EU GDPR obligations.
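To make the conflict-resolution trade-off in the active-active pattern concrete, here is a minimal sketch of timestamp-based last-writer-wins applied to a whole record. The `Versioned` type and the tie-break on region name are assumptions made for illustration, not any particular database's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: dict
    timestamp_ms: int
    region: str


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Keep the write with the later timestamp; break ties by region name (an assumption)."""
    return max(a, b, key=lambda w: (w.timestamp_ms, w.region))


# Two regions update different fields of the same record 5 ms apart.
us = Versioned({"email": "new@example.com", "plan": "free"}, 1_000, "us-east-1")
eu = Versioned({"email": "old@example.com", "plan": "pro"}, 1_005, "eu-west-1")

winner = lww_merge(us, eu)
print(winner.value)  # the eu-west-1 write wins; the email change from us-east-1 is lost
```

The merge is deterministic and order-independent, which is what lets every region converge on the same value. The cost is visible in the example: the losing write disappears without an error, which is why partitioning writes so that conflicts cannot occur is usually preferable.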

The cache coherency trap: Teams often run Redis with a primary in one region and read replicas in others, then discover that a stale cache in eu-west-1 is serving a user data that was deleted or updated in us-east-1. For user-scoped data (sessions, entitlements, balances), stale reads are a correctness bug, not just a performance issue. Either accept eventual consistency explicitly in your product design, or route writes and reads for a given user to a single region (the data-locality-by-user pattern).

Active-Active Architecture

Active-active means all regions serve live write traffic simultaneously with no single primary. It is the gold standard for availability but the most complex pattern to operate correctly.

Figure: active-active across three regions. Each region serves live writes, Kafka replicates events bidirectionally, and DNS health checks automate failover when a region goes dark.

Key design rules for active-active:

Partition writes by a stable key. The safest active-active design is one where the same logical record is never written from two regions at the same time. Use consistent hashing on user_id or tenant_id to pin each user to a home region. That region owns their writes; the other regions hold read replicas for that user. In steady state this removes write conflicts; the remaining risk is the window during failover, when a user's home region changes while replication from the old home may still be in flight.
Make operations idempotent. Events replicated cross-region may arrive out of order or be delivered more than once. Every write should carry an idempotency key (for example a deterministic UUID) so that applying a duplicate has no effect, and a version number or version vector so that a late-arriving older update can be detected.
Track replication lag as a first-class SLI. Wire a Prometheus metric for replication lag on your database and message queue. Alert at 5 seconds, page at 30 seconds. During a brown-out in one region, lag spikes before availability drops — it is your earliest warning.
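The first two rules can be sketched together. The block below uses rendezvous (highest-random-weight) hashing as one simple way to get the consistent-hashing property, and an in-memory set of applied event IDs for idempotency. The region list, `home_region`, and the `Account` class are hypothetical; a real system would persist the applied-ID set alongside the data it protects.

```python
import hashlib

REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]


def home_region(user_id: str, regions: list[str] = REGIONS) -> str:
    """Rendezvous hashing: score every region for this user and pick the highest score."""
    def score(region: str) -> int:
        digest = hashlib.sha256(f"{region}:{user_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(regions, key=score)


class Account:
    """Applies replicated balance events at most once each."""

    def __init__(self) -> None:
        self.balance = 0
        self.applied_event_ids: set[str] = set()

    def apply(self, event_id: str, delta: int) -> bool:
        """Apply an event once; return False if this event_id was already applied."""
        if event_id in self.applied_event_ids:
            return False
        self.balance += delta
        self.applied_event_ids.add(event_id)
        return True


account = Account()
account.apply("evt-1", 50)
account.apply("evt-1", 50)   # duplicate delivery, ignored
print(account.balance)        # prints 50
```

Two properties are worth noting. Removing a region from the list only re-homes the users who lived in that region, so a failover does not reshuffle everyone. And because the events here are additive deltas, out-of-order delivery is harmless; updates that overwrite a value would additionally need the version check described above.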

Static Stability

Static stability is the principle that a region must be able to operate independently — serving all its traffic, scaling its compute, and recovering from failures — without making any cross-region API calls. This sounds obvious but breaks down constantly in practice.

Common static stability violations:

An auth service that validates JWTs by calling a token-introspection endpoint in the primary region. When us-east-1 is degraded, authenticated requests in eu-west-1 fail even though eu-west-1 itself is healthy.
Kubernetes clusters that pull container images from an ECR registry in a single region. When that region fails, no other region can pull images, so new pods, scale-ups, and node replacements stall everywhere.
Feature flags fetched from a central control plane on every request. If the control plane is unavailable, the fallback is undefined and may default to "all features off."
Secrets retrieved from a single-region Vault or AWS Secrets Manager endpoint at pod startup. Pods fail to start in unaffected regions during a control-plane event.
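Most of these violations share one fix: keep a local copy of the last good value and serve it when the remote dependency is unreachable. The sketch below shows that idea for feature flags. The `LastKnownGood` class and its parameters are hypothetical, and it refreshes inline for brevity; a production version would more likely refresh in the background so a slow control plane never sits on the request path.

```python
import time
from typing import Callable, Optional


class LastKnownGood:
    """Serve the last successfully fetched value when the upstream is unreachable."""

    def __init__(self, fetch: Callable[[], dict], default: dict,
                 refresh_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._value = dict(default)   # explicit, safe default until the first success
        self._refresh_s = refresh_s
        self._clock = clock
        self._next_refresh: Optional[float] = None
        self.stale = True             # True until a fetch succeeds, and after any failure

    def get(self) -> dict:
        now = self._clock()
        if self._next_refresh is None or now >= self._next_refresh:
            self._next_refresh = now + self._refresh_s
            try:
                self._value = self._fetch()
                self.stale = False
            except Exception:
                # Keep the previous value; a failed refresh must never fail the request.
                self.stale = True
        return self._value
```

The explicit `default` argument addresses the "undefined fallback" problem directly: the behavior with no data at all is a decision made in code review, not an accident discovered during an outage. Exposing `stale` lets the region report that it is running on cached state.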

# Static stability: cache feature flags locally in each region
# LaunchDarkly relay proxy (or Flagsmith self-hosted) per region
# Each app reads from the local relay; relay streams from central

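# Kubernetes deployment snippet (LaunchDarkly relay)
# NOTE: the variable and option names below are illustrative; confirm them
# against the Relay Proxy documentation for your version.
env:
  - name: LD_RELAY_CONFIG
    value: /etc/ld/relay.toml

# relay.toml (mounted at /etc/ld/relay.toml)
[relay]
  heartbeatIntervalMs = 10000
  # Falls back to last known flag state if upstream unreachable
  offlineCachePath = "/var/cache/ld-relay"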

---
# Static stability: per-region ECR registries (no shared registry)
# Each region has its own ECR registry; images are pushed to all on build
# Argo CD Image Updater tracks each regional registry independently
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-image-updater-config
data:
  registries.conf: |
    registries:
      - name: ECR us-east-1
        prefix: 123456789.dkr.ecr.us-east-1.amazonaws.com
        api_url: https://123456789.dkr.ecr.us-east-1.amazonaws.com
        credentials: ext:/scripts/ecr-login.sh
        default: true
      - name: ECR eu-west-1
        prefix: 123456789.dkr.ecr.eu-west-1.amazonaws.com
        api_url: https://123456789.dkr.ecr.eu-west-1.amazonaws.com
        credentials: ext:/scripts/ecr-login.sh

The litmus test for static stability is a chaos game day where you block all cross-region network traffic (use network ACL deny rules or a network chaos experiment; security groups are allow-only and cannot express a deny) and verify that each region continues to serve 100% of its local traffic for at least 30 minutes. Teams running this test for the first time almost always discover a cross-region dependency they did not know existed.

Design for the partially degraded steady state. Multi-region systems spend more time in partial-degradation mode (one region lagging, one control plane unreachable) than in clean full-outage mode. Your runbooks, alerts, and application code must handle the middle cases: a region that is reachable but returning elevated errors, a database replica that is 45 seconds behind, or a configuration store that is returning stale data. These are harder to detect and harder to recover from than a clean region failure, and they are distinct from a true split-brain, where two regions each believe they are the primary.
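One way to make the middle cases explicit in code is to give "degraded" its own state instead of collapsing everything into up or down. The sketch below is illustrative only: the `classify` function and its thresholds (1% errors, 30 seconds of lag) are assumptions to be replaced with values from your own SLOs.

```python
from enum import Enum


class RegionState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DOWN = "down"


def classify(reachable: bool, error_rate: float, replication_lag_s: float,
             max_error_rate: float = 0.01, max_lag_s: float = 30.0) -> RegionState:
    """Classify a region from three signals; thresholds are placeholder assumptions."""
    if not reachable:
        return RegionState.DOWN
    if error_rate > max_error_rate or replication_lag_s > max_lag_s:
        return RegionState.DEGRADED
    return RegionState.HEALTHY


# A reachable region with low errors but a replica 45 seconds behind.
print(classify(reachable=True, error_rate=0.002, replication_lag_s=45.0))
```

With a distinct degraded state, routing and runbooks can respond proportionally, for example by shifting weight away from the region gradually rather than failing over outright.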

Choosing the Right Pattern

Not every workload needs active-active. The cost — in engineering complexity, database licensing, cross-region egress fees, and operational burden — is substantial. Match the pattern to the actual availability and latency requirement:

Active-passive with warm standby: RTO 5–15 minutes, 30–50% of active cost. Sufficient for most B2B SaaS with 99.9% SLAs.
Active-active, partitioned by user: RTO under 90 seconds (bounded by DNS TTL plus health-check detection), no write conflicts in steady state. Correct choice for consumer products with global users and 99.95%+ SLAs.
Active-active, fully distributed (Spanner/CockroachDB): RTO under 30 seconds, true multi-master. Required for financial ledgers, inventory systems, or any workload where a stale read or lost write during failover is unacceptable. Cost is 3–10x a single-region deployment.

The pattern you choose today determines the migration path you face if your requirements change. Starting with a well-isolated active-passive design — clean service boundaries, idempotent writes, replication in place — leaves the door open for a later migration to active-active without a full rewrite.