Data Replication Across Regions
Moving compute across regions during a failover is relatively straightforward — you re-point DNS, spin up instances, and let your orchestrator reschedule pods. Moving data is a fundamentally harder problem. Data has weight: it must be consistent, durable, and available at the right place at the right time. The choices you make about replication topology determine your effective RPO and set a hard lower bound on your RTO. Get them wrong and your DR plan exists only on paper.
Synchronous vs Asynchronous Replication
Every cross-region replication design starts with one decision: does the primary wait for the remote replica to confirm a write before acknowledging success to the caller?
Synchronous replication means the write is not acknowledged until at least one replica in the secondary region has durably written it. RPO is effectively zero: no acknowledged transaction can be lost. The cost is latency: if your primary is in us-east-1 and your replica is in eu-west-1, the round trip is roughly 80 ms, and every write adds at least that RTT to its latency. The tail is worse. Cross-region p99 latency sits well above the median, and writes that must be serialized (for example, repeated updates to one hot row) queue behind each other's round trips. For OLTP workloads at scale, synchronous cross-region replication is rarely feasible beyond a few hundred kilometres.
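The latency cost can be made concrete with a little arithmetic. The sketch below is illustrative only: the 2 ms local commit time is an assumed figure, the 80 ms round trip is the one quoted above, and both function names are hypothetical.

```python
def sync_write_latency_ms(local_commit_ms: float, cross_region_rtt_ms: float) -> float:
    """Latency of one synchronously replicated write: local commit plus one round trip."""
    return local_commit_ms + cross_region_rtt_ms


def max_serial_writes_per_second(write_latency_ms: float) -> float:
    """Upper bound on throughput when each write must wait for the previous one,
    for example repeated updates to the same hot row."""
    return 1000.0 / write_latency_ms


if __name__ == "__main__":
    local_ms = 2.0  # assumed local commit latency
    rtt_ms = 80.0   # approximate us-east-1 <-> eu-west-1 round trip from the text
    latency = sync_write_latency_ms(local_ms, rtt_ms)
    print(f"sync write latency: {latency:.0f} ms")
    print(f"serialized writes per second: {max_serial_writes_per_second(latency):.1f}")
```

Under these assumptions a single hot row tops out at roughly a dozen updates per second, which is why the distance between regions matters so much for synchronous designs.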
Asynchronous replication means the primary acknowledges the write immediately and ships the change log to the replica in the background. Write latency is unchanged. The trade-off is replication lag: if the primary fails before the replica has received the last few seconds of the log, those transactions are lost. On Amazon RDS for MySQL, typical cross-region async lag sits between 50 ms and a few seconds under normal load; it can spike to minutes during a write storm or a large DDL migration.
Replication Lag Is a Production Trap
Teams often configure async replication, measure lag at 50 ms during testing, and declare their RPO as "near zero." This is dangerous. Replication lag is not a constant; it is a function of write rate, network conditions, and replica I/O capacity. During a large batch import, a schema migration, or a traffic spike, lag routinely climbs to minutes. If the primary fails at that moment, you lose minutes of data, not milliseconds. Always monitor Seconds_Behind_Source (MySQL 8.0.22 and later; Seconds_Behind_Master on older versions) or the write_lag, flush_lag, and replay_lag columns of pg_stat_replication (PostgreSQL) as an SLO metric with alerting thresholds.
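To illustrate why a quiet-period measurement understates RPO, here is a minimal sketch that summarises a series of lag samples. The sample values are invented for illustration, and the function names are hypothetical rather than part of any monitoring tool.

```python
import statistics


def rpo_exposure_seconds(lag_samples_s):
    """Summarise replication lag samples (in seconds) as typical vs worst case."""
    if not lag_samples_s:
        raise ValueError("need at least one lag sample")
    return {
        "median_s": statistics.median(lag_samples_s),
        "worst_s": max(lag_samples_s),
    }


def writes_at_risk(write_rate_per_s: float, lag_s: float) -> float:
    """Approximate number of acknowledged writes not yet on the replica."""
    return write_rate_per_s * lag_s


if __name__ == "__main__":
    # Invented samples: mostly 50 ms, with two spikes during a batch import.
    samples = [0.05] * 8 + [45.0, 180.0]
    summary = rpo_exposure_seconds(samples)
    print(summary)
    print(writes_at_risk(write_rate_per_s=2000, lag_s=summary["worst_s"]))
```

The median looks like "near zero", but the number to commit to as an RPO is the worst case the system actually reaches under load.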
Cross-Region Replication: Database-Specific Options
At big-tech scale, teams choose their replication mechanism based on their database engine, their RPO/RTO targets, and their operational complexity tolerance.
Amazon Aurora Global Database
Aurora Global Database uses a dedicated replication layer that bypasses the database's own redo log shipping. Changes are written to Aurora's distributed storage layer (SSD-backed, six copies across three AZs in the primary region) and replicated to up to five secondary regions using a proprietary fast replication protocol. Typical replication lag is under one second — often under 100 ms. During a failover, the secondary region is promoted with an RPO under one second and an RTO of approximately one minute. This is purpose-built DR architecture at managed-service quality.
PostgreSQL Logical Replication Across Regions
For self-managed PostgreSQL on EC2 or bare metal, logical replication (available since PostgreSQL 10) lets you replicate individual tables or sets of tables to a remote subscriber. Unlike streaming (physical) replication, logical replication works between different major versions and leaves the subscriber writable, which is a key property for active-active patterns, although local writes to subscribed tables can conflict with incoming changes. The trade-off is that logical replication does not replicate DDL: schema changes must be applied manually, and additive changes such as a new column must reach the subscribers before the publisher. Otherwise the subscriber's apply worker fails on the first row it cannot map, replication stalls, and the publisher's replication slot retains WAL until the schema is fixed.
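The ordering rule for schema changes can be expressed as a pre-flight check. The sketch below is a hypothetical helper, not part of PostgreSQL or any migration tool. It relies on the documented behaviour that logical replication matches columns by name and that a subscriber table may have extra columns; the column lists would come from your own catalog queries.

```python
def columns_missing_on_subscriber(publisher_columns, subscriber_columns):
    """Return published columns that the subscriber table lacks, sorted by name."""
    return sorted(set(publisher_columns) - set(subscriber_columns))


def safe_to_apply_on_publisher(publisher_columns_after_ddl, subscriber_columns):
    """True when every column the publisher will send already exists downstream."""
    return not columns_missing_on_subscriber(
        publisher_columns_after_ddl, subscriber_columns
    )


if __name__ == "__main__":
    planned_publisher = ["id", "email", "created_at", "marketing_opt_in"]
    subscriber_now = ["id", "email", "created_at"]

    # The new column has not reached the subscriber yet, so the migration must wait.
    print(columns_missing_on_subscriber(planned_publisher, subscriber_now))
    print(safe_to_apply_on_publisher(planned_publisher, subscriber_now))
```

Running a check like this in the migration pipeline turns the subscriber-first rule into a gate instead of a convention.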
Apache Kafka Cross-Region Mirroring (MirrorMaker 2)
For event-streaming data — the backbone of modern data pipelines — Kafka MirrorMaker 2 (MM2) replicates topics across clusters in different regions with configurable consumer offset translation. Topics in the secondary cluster are prefixed with the source cluster alias (us-east.orders), which means failover consumers need only re-point their bootstrap servers and adjust topic names. MM2 is built on Kafka Connect and inherits its operational model: distributed workers, REST-managed connectors, and connector status visible via the Connect API.
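As a small illustration of the renaming step, the sketch below derives the mirrored topic names a failover consumer would subscribe to. It assumes MM2's default replication policy with the "." separator; the function names are hypothetical and the topic list is made up.

```python
def remote_topic_name(source_alias: str, topic: str, separator: str = ".") -> str:
    """Name of a mirrored topic on the target cluster under the default policy."""
    return f"{source_alias}{separator}{topic}"


def failover_subscriptions(source_alias: str, topics):
    """Topic names a consumer should use after moving to the secondary cluster."""
    return [remote_topic_name(source_alias, t) for t in topics]


if __name__ == "__main__":
    print(failover_subscriptions("us-east", ["orders", "payments"]))
```

Keeping this mapping in one shared function, rather than scattering string concatenation across services, makes a failover runbook shorter and easier to test.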
Conflict Handling in Active-Active Replication
When both regions accept writes (the active-active pattern), the same logical record can be mutated concurrently in two regions. This is a fundamental distributed-systems problem with no perfect solution. The practical options are:
- Last-write-wins (LWW) by wall clock: The write with the highest timestamp wins. Simple but dangerous — clock skew between regions means a write from 200 ms ago can silently overwrite a more recent write from the other region. Use only for records where occasional overwrites are acceptable (user preferences, cache entries).
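- LWW by logical clock (Lamport timestamp or version vector): Replace the wall clock with a Lamport timestamp, hybrid logical clock, or version vector, so that ordering no longer depends on synchronized clocks. CockroachDB orders transactions with hybrid logical clocks, and the original Dynamo design used vector clocks to detect concurrent versions. More correct than wall-clock LWW, but still lossy whenever a single winner is picked: one write is silently discarded.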
- Application-level conflict detection: Tag every write with an origin region and a sequence number. On merge, detect conflicts explicitly and push them to an application-layer resolution queue. This preserves both writes and lets business logic decide (e.g., sum the inventory deltas rather than pick a winner). This is the correct approach for financial data but adds significant operational complexity.
- Avoid conflicts by design (entity ownership): Partition entities so each region owns a disjoint subset. Region A owns even-numbered user IDs, Region B owns odd-numbered ones. No entity is ever written by two regions simultaneously. This is the most operationally reliable approach and the one most large-scale systems converge on after fighting LWW bugs in production.
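Three of the options above can be compared in a few lines. This is a minimal sketch under stated assumptions: the Write type, the region names, the 200 ms skew, and the inventory numbers are all invented for illustration, and real systems would carry far more metadata per write.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    region: str
    wall_clock_ms: int  # timestamp as reported by the writing region's own clock
    value: int


def lww_by_wall_clock(a: Write, b: Write) -> Write:
    """Last-write-wins: keep whichever write carries the larger timestamp."""
    return a if a.wall_clock_ms >= b.wall_clock_ms else b


def merge_inventory_deltas(base: int, deltas) -> int:
    """Application-level merge: apply every region's delta instead of picking one."""
    return base + sum(deltas)


def owner_region(user_id: int, regions=("region-a", "region-b")) -> str:
    """Entity ownership: even IDs belong to the first region, odd IDs to the second."""
    return regions[user_id % len(regions)]


if __name__ == "__main__":
    # Region A's clock runs 200 ms fast. A writes at true time 1000 (stamped 1200);
    # B writes later, at true time 1100 (stamped 1100). LWW keeps A's older write.
    older = Write("region-a", wall_clock_ms=1200, value=10)
    newer = Write("region-b", wall_clock_ms=1100, value=7)
    print(lww_by_wall_clock(older, newer).region)

    # Both regions sold stock concurrently; summing deltas preserves both sales.
    print(merge_inventory_deltas(100, [-3, -5]))

    # With ownership, each user ID is only ever written in one region.
    print(owner_region(42), owner_region(7))
```

The first case shows the skew hazard, the second shows why delta merging loses nothing, and the third shows how ownership removes the conflict before it can occur.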
Figure: Cross-Region Replication Topology Diagram
Monitoring Replication Health as an SLO
Replication lag is not a background operational metric — it is a first-class SLO because it is your real-time RPO. Instrument it accordingly. At Amazon scale, teams alert on two thresholds: a warning at 10× the average lag (indicating a developing problem) and a critical alert at the SLO breach point (e.g., 60 seconds for a system with a 1-minute RPO commitment). The critical alert should page the on-call engineer, not just email the team.
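The two-threshold rule described above can be sketched as a small classifier. The function name and the returned labels are hypothetical; the 10× multiplier and the 60-second RPO come from the text, and the lag values in the usage example are made up.

```python
def classify_lag(current_lag_s: float, average_lag_s: float, rpo_s: float,
                 warn_multiplier: float = 10.0) -> str:
    """Map a lag reading to 'ok', 'warning', or 'critical'.

    critical: lag has reached the RPO commitment (page the on-call engineer).
    warning:  lag is at least warn_multiplier times the long-run average.
    """
    if current_lag_s >= rpo_s:
        return "critical"
    if current_lag_s >= warn_multiplier * average_lag_s:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for lag in (0.2, 1.5, 75.0):
        print(lag, classify_lag(lag, average_lag_s=0.1, rpo_s=60.0))
```

Checking the critical condition first means a breach is never downgraded to a warning, whatever the average happens to be.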
For Aurora Global Database, the key CloudWatch metric is AuroraGlobalDBReplicationLag, reported in milliseconds per secondary cluster. For self-managed PostgreSQL, expose pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) from pg_stat_replication as a Prometheus gauge via postgres_exporter. This value is in bytes of WAL, so pair it with the time-based replay_lag column when alerting against an RPO expressed in seconds. For Kafka MM2, the replication-latency-ms attribute of the kafka.connect.mirror:type=MirrorSourceConnector MBean gives end-to-end lag per topic partition, including connector processing time.