Multi-Region Application Patterns
Running an application in multiple regions is not a deployment detail — it is an architectural decision that propagates into every layer of your stack: routing logic, database schema, cache coherency, session handling, and the SLAs you can honestly promise to customers. This lesson covers the three dimensions engineers must get right before a second region ever starts taking traffic: how requests reach the right region, where data lives and why that matters, and how the system behaves when a region is partially or completely unavailable.
Traffic Routing Across Regions
At the DNS layer, three routing strategies dominate production deployments:
- Latency-based routing: AWS Route 53, Google Cloud DNS, and Cloudflare all offer routing policies that answer each query with the healthy regional endpoint expected to be closest to the resolver. This is the default choice for user-facing workloads: a user in Singapore hits ap-southeast-1, not us-east-1. The trap is a mismatch between resolver location and user location: a corporate DNS resolver in London serving a mobile user in Dubai will send that user to the wrong region.
- Geolocation routing: routes based on the geographic origin of the DNS query, not on estimated latency. Useful for data-residency compliance (EU data must stay in the EU) and for mapping specific ISPs or countries to dedicated capacity.
- Weighted routing: splits traffic by percentage. Used during a regional canary rollout (10% to the new region) and during a failover drain (ramp the us-east-1 weight from 100 to 0 over 5 minutes rather than making a hard cut).
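The failover drain described in the weighted-routing bullet can be sketched as a schedule of weight updates. This is a minimal sketch that assumes a linear ramp; the function name drain_schedule and the 30-second step are hypothetical, and pushing each weight to a real DNS provider is deliberately left out.

```python
def drain_schedule(start_weight=100, duration_s=300, step_s=30):
    """Return (seconds_elapsed, weight) pairs that ramp a region's weight to 0.

    Linear ramp. step_s is an assumed update interval, not a provider default.
    """
    if duration_s <= 0 or step_s <= 0:
        raise ValueError("duration_s and step_s must be positive")
    steps = max(1, duration_s // step_s)
    schedule = []
    for i in range(steps + 1):
        # Weight falls in equal increments and reaches exactly 0 on the last step.
        weight = round(start_weight * (steps - i) / steps)
        schedule.append((i * step_s, weight))
    return schedule


if __name__ == "__main__":
    for elapsed, weight in drain_schedule():
        print(f"t+{elapsed:>3}s  us-east-1 weight={weight}")
```

Steps much shorter than the record TTL buy little, because resolvers keep serving cached answers until the TTL expires.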
DNS TTL is a hard constraint. At a 60-second TTL, a failing region can keep receiving traffic for up to 60 seconds after you update the record. Route 53 health checks with a 10-second interval and a 3-failure threshold add roughly 30 seconds of detection time on top of that. Design your RTO around this: with these settings, DNS-based failover takes up to roughly 90 seconds (30 seconds of detection plus 60 seconds of cached answers), and longer for clients or resolvers that ignore TTLs. If you need sub-30-second failover, you need an anycast layer (Cloudflare, AWS Global Accelerator, GCP Premium Tier) that re-routes at the network level, bypassing DNS caching entirely.
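The arithmetic behind the 90-second figure is simple enough to keep next to your runbook. The sketch below assumes detection time is the check interval multiplied by the failure threshold, and it ignores record-update propagation and clients that disregard TTLs; the function name is hypothetical.

```python
def dns_failover_budget_s(check_interval_s, failure_threshold, ttl_s):
    """Worst-case seconds from region failure until cached DNS answers expire."""
    detection_s = check_interval_s * failure_threshold  # consecutive failed checks
    return detection_s + ttl_s                          # plus cached answers draining


if __name__ == "__main__":
    # Values from the text: 10 s interval, 3 failures, 60 s TTL.
    print(dns_failover_budget_s(10, 3, 60))
```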
Below the DNS layer, a global load balancer or service mesh controls per-request routing. Envoy and Istio support locality-aware load balancing — a pod in us-east-1a prefers endpoints in the same AZ, then the same region, and only spills to another region when local capacity is exhausted. This matters enormously for latency and cost: cross-region data transfer is both slower and billed at egress rates.
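The preference order (same AZ, then same region, then anywhere) can be illustrated with a small selection function. This is a simplified sketch, not Envoy's or Istio's actual algorithm: it spills to the next tier only when a tier has no healthy endpoint, whereas real meshes also weigh capacity. The Endpoint type and pick_endpoint function are hypothetical names.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    address: str
    region: str
    zone: str
    healthy: bool = True


def pick_endpoint(endpoints, caller_region, caller_zone, rng=random):
    """Prefer the caller's zone, then its region, then any healthy endpoint."""
    healthy = [e for e in endpoints if e.healthy]
    tiers = (
        [e for e in healthy if e.region == caller_region and e.zone == caller_zone],
        [e for e in healthy if e.region == caller_region],
        healthy,  # last resort: cross-region, slower and billed as egress
    )
    for tier in tiers:
        if tier:
            return rng.choice(tier)
    raise RuntimeError("no healthy endpoints in any region")
```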
Data Locality
Routing is the easy part. Data is where multi-region gets hard. Every read/write must answer: which copy of the data am I operating on, and is it consistent with the others?
The dominant production patterns are:
- Primary-region writes, replica reads: all writes go to one region (the primary), which replicates asynchronously to secondaries. Reads are served locally from the nearest replica. Simple to reason about; the trade-off is replica lag (typically 10–200 ms for well-tuned Postgres streaming replication, though it can spike to seconds under heavy write load), so readers may see slightly stale data. Aurora Global Database runs this pattern with a typical replication lag of under 1 second and allows a secondary to be promoted to primary in under 60 seconds.
- Active-active with conflict resolution: each region accepts writes to a shared dataset. Requires either a multi-region database (CockroachDB, Spanner, DynamoDB Global Tables) or careful partitioning that ensures no two regions ever write to the same record. DynamoDB Global Tables replicate asynchronously and resolve concurrent updates with timestamp-based last-writer-wins, so writes stay local but the losing side of a conflict is silently discarded. Spanner and CockroachDB commit through cross-region consensus (Spanner relies on TrueTime to provide external consistency), which adds write latency on the order of the inter-region round trip (50–150 ms for adjacent regions).
- Data locality by user partition: partition users by geography. European users own their data in the EU region; North American users own theirs in us-east-1. Each region is effectively the primary for its own shard. Steady-state operations need no cross-region reads; failover still requires cross-region access if a region is down. This is the pattern used by Stripe, Shopify, and most companies with EU GDPR obligations.
The failure mode to design against is the stale read: eu-west-1 serving a user data that was deleted or updated in us-east-1. For user-scoped data (sessions, entitlements, balances), stale reads are a correctness bug, not just a performance issue. Either accept eventual consistency explicitly in your product design, or route writes and reads for a given user to a single region (the data-locality-by-user pattern).
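Routing a user's reads and writes to a single region can be as small as a lookup performed at the edge of each service. The sketch below assumes every user record carries a residency tag; the mapping, the tag values, and the function name are hypothetical, and the actual proxying of forwarded requests is omitted.

```python
# Hypothetical mapping from a user's residency tag to the region that owns their data.
HOME_REGION_BY_RESIDENCY = {
    "EU": "eu-west-1",
    "NA": "us-east-1",
}


def route_user_request(residency, serving_region):
    """Decide whether this region may serve a user-scoped read or write.

    Returns ("local", region) when the serving region is the user's home,
    otherwise ("forward", home_region) so the caller can proxy the request.
    """
    try:
        home = HOME_REGION_BY_RESIDENCY[residency]
    except KeyError:
        raise ValueError(f"no home region configured for residency {residency!r}") from None
    if home == serving_region:
        return ("local", home)
    return ("forward", home)
```

Forwarding costs one cross-region round trip for a travelling user, which is the price paid for never serving that user a stale balance or session.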
Active-Active Architecture
Active-active means all regions serve live write traffic simultaneously with no single primary. It is the gold standard for availability but the most complex pattern to operate correctly.
Key design rules for active-active:
- Partition writes by a stable key. The safest active-active is one where the same logical record is never written from two regions simultaneously. Use consistent hashing on user_id or tenant_id to pin a user to a home region. That region owns their writes; the other regions are read replicas for that user. This eliminates the conflict problem entirely.
- Make operations idempotent. Events replicated cross-region may arrive out of order or be delivered more than once. Every write operation should carry a version vector or a deterministic UUID so a duplicate application has no effect.
- Track replication lag as a first-class SLI. Wire a Prometheus metric for replication lag on your database and message queue. Alert at 5 seconds, page at 30 seconds. During a brown-out in one region, lag spikes before availability drops — it is your earliest warning.
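The first two rules can be sketched together. For pinning, the example uses rendezvous (highest-random-weight) hashing as a simple stand-in for a consistent-hash ring. For idempotency, it uses an event ID plus a single integer version per key, which is a simplification of a version vector that is only safe when each key has one writing region. All names here are hypothetical.

```python
import hashlib


def home_region(user_id, regions):
    """Pick a stable home region for a user with rendezvous hashing."""
    if not regions:
        raise ValueError("regions must not be empty")

    def score(region):
        digest = hashlib.sha256(f"{region}:{user_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The highest score wins, so removing some other region never moves this user.
    return max(regions, key=score)


class IdempotentStore:
    """Applies replicated events at most once per event_id; the newest version wins."""

    def __init__(self):
        self.records = {}            # key -> (version, value)
        self.seen_event_ids = set()  # unbounded here; a real store would expire entries

    def apply(self, event_id, key, version, value):
        """Return True if the event changed state, False if it was a no-op."""
        if event_id in self.seen_event_ids:
            return False             # duplicate delivery
        self.seen_event_ids.add(event_id)
        current = self.records.get(key)
        if current is not None and current[0] >= version:
            return False             # stale or out-of-order event
        self.records[key] = (version, value)
        return True
```

Replaying the same event, or delivering an older version after a newer one, leaves the store unchanged, which is the property cross-region replication needs.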
Static Stability
Static stability is the principle that a region must be able to operate independently — serving all its traffic, scaling its compute, and recovering from failures — without making any cross-region API calls. This sounds obvious but breaks down constantly in practice.
Common static stability violations:
- An auth service that validates JWTs by calling a token-introspection endpoint in the primary region. When us-east-1 is degraded, logins in eu-west-1 fail.
- Kubernetes clusters that pull container images from an ECR registry in a single region. A region failure takes down deployments everywhere.
- Feature flags fetched from a central control plane on every request. If the control plane is unavailable, the fallback is undefined and may default to "all features off."
- Secrets retrieved from a single-region Vault or AWS Secrets Manager endpoint at pod startup. Pods fail to start in unaffected regions during a control-plane event.
The litmus test for static stability is a chaos game day where you fully block all cross-region network traffic (for example, with network ACL deny rules or a network chaos experiment) and verify that each region continues to serve 100% of its local traffic for at least 30 minutes. Teams that have never run this test almost always discover a cross-region dependency they did not know existed.
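One way to remove the feature-flag violation listed above is to answer every flag check from a local snapshot and treat the control plane as an optional refresh source. The sketch below assumes a caller-supplied fetch_remote callable that returns a dict of flag names to booleans; that callable, the class name, and the 30-second refresh interval are all hypothetical.

```python
import time


class StaticallyStableFlags:
    """Serves feature flags from a local snapshot; refresh failures never block reads."""

    def __init__(self, fetch_remote, defaults, refresh_interval_s=30.0, clock=time.monotonic):
        self._fetch_remote = fetch_remote   # hypothetical callable -> dict[str, bool]
        self._snapshot = dict(defaults)     # explicit, reviewed defaults
        self._refresh_interval_s = refresh_interval_s
        self._clock = clock
        self._next_refresh = clock()

    def refresh_if_due(self):
        """Try to refresh; on any failure keep the last known-good snapshot."""
        now = self._clock()
        if now < self._next_refresh:
            return False
        self._next_refresh = now + self._refresh_interval_s
        try:
            fresh = self._fetch_remote()
        except Exception:
            return False                    # control plane down: keep serving the snapshot
        self._snapshot = {**self._snapshot, **fresh}
        return True

    def is_enabled(self, name):
        """Answer from local memory only; unknown flags are off."""
        return self._snapshot.get(name, False)
```

Calling refresh_if_due from a background loop rather than the request path means a control-plane outage degrades flag freshness, not request handling.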
Choosing the Right Pattern
Not every workload needs active-active. The cost — in engineering complexity, database licensing, cross-region egress fees, and operational burden — is substantial. Match the pattern to the actual availability and latency requirement:
- Active-passive with warm standby: RTO 5–15 minutes, 30–50% of active cost. Sufficient for most B2B SaaS with 99.9% SLAs.
- Active-active, partitioned by user: RTO under 90 seconds (DNS TTL), no write conflicts. Correct choice for consumer products with global users and 99.95%+ SLAs.
- Active-active, fully distributed (Spanner/CockroachDB): RTO under 30 seconds, true multi-master. Required for financial ledgers, inventory systems, or any workload where a stale read or lost write under failover is unacceptable. Cost is 3–10x a single-region deployment.
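A quick way to sanity-check an RTO against an SLA is to convert the SLA into a downtime budget. The sketch below assumes a 30-day month and counts all downtime against the budget; the function name is hypothetical.

```python
def monthly_downtime_budget_min(sla_percent, days=30):
    """Minutes of downtime per month that an availability SLA allows."""
    total_min = days * 24 * 60
    return total_min * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.9, 99.95):
        print(f"{sla}% -> {monthly_downtime_budget_min(sla):.1f} min/month")
```

At 99.9% the budget is about 43 minutes a month, so a single 15-minute warm-standby failover fits. At 99.95% it is about 22 minutes, which leaves little room for a failover measured in minutes.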
The pattern you choose today determines the migration path you face if your requirements change. Starting with a well-isolated active-passive design — clean service boundaries, idempotent writes, replication in place — leaves the door open for a later migration to active-active without a full rewrite.