Replication Deep Dive
Replication Deep Dive
Replication is the backbone of database high availability and horizontal read scaling — and it is also one of the most fertile sources of production incidents. Understanding replication at the protocol level, not just as a feature you enable, separates engineers who manage databases from those who are managed by them.
How Replication Actually Works
MySQL/MariaDB and PostgreSQL both ship binary log (binlog / WAL) based replication. The primary writes every committed change to a durable log. Replicas stream that log and replay it in order. The critical insight is that this is asynchronous by default: the primary considers a transaction committed the moment it hits the primary's disk, before any replica has acknowledged receipt.
Replication Lag: Causes, Measurement, and Thresholds
Replication lag is the age of the oldest unexecuted event on a replica. It is not a single number — it is a compound of network latency, I/O saturation on the replica, and single-threaded SQL replay bottlenecks (pre-MySQL 8.0 / pre-Postgres parallel apply).
Common lag causes at scale:
- Large transactions — a single
UPDATE orders SET ... WHERE 1=1that takes 40 s on the primary blocks the replica SQL thread for the same 40 s. - I/O-bound replay — replicas often run on cheaper hardware; their disk is the bottleneck.
- DDL statements —
ALTER TABLEon a 200 GB table locks the SQL thread for its full duration on row-based replicas. - Single-threaded SQL thread — MySQL pre-5.7, Postgres pre-16 Logical Replication parallel apply. Enable parallel SQL threads.
Measure lag from inside the database, not from application-side pings:
In Prometheus, expose these metrics via mysqld_exporter or postgres_exporter and alert when lag exceeds your SLO threshold (typically 30 s for OLTP, 5 min for analytics replicas):
replica_parallel_workers = 8 and replica_parallel_type = LOGICAL_CLOCK in /etc/mysql/conf.d/replica.cnf. This allows independent transactions to replay in parallel, cutting lag by 4-8x on write-heavy workloads. Verify with SHOW PROCESSLIST — you should see multiple system user threads running SQL.Read Replicas in Production: Patterns and Pitfalls
Read replicas are powerful — they absorb 80-95% of traffic on read-heavy services. The rules for using them safely:
- Never read your own writes from a replica. After a write on the primary, route the next read for that session to the primary (or wait for
WAIT_FOR_EXECUTED_GTID_SET/pg_wal_replay_wait()to confirm the replica has caught up). Session stickiness to the primary after writes is the safest pattern. - Route by query type, not randomly. Analytics queries belong on a dedicated replica; OLTP reads can use a pool of general replicas. Never mix them — a 60-second analytical query will delay OLTP replay behind it.
- Use a proxy. ProxySQL (MySQL) and PgBouncer+HAProxy (Postgres) handle read/write splitting, health checks, and automatic failover without baking connection strings into application code.
- Health-check lag before routing. ProxySQL
mysql_servers.max_replication_lagremoves a backend from the read pool when lag exceeds the threshold. Configure it. A lagging replica with stale data is worse than routing the read to the primary.
Split-Brain: The Most Dangerous Failure Mode
Split-brain occurs when two nodes simultaneously believe they are the primary and accept writes. Both diverge. When the network partition heals, you have two conflicting histories with no automated way to merge them. Data is lost.
Prevention strategies (ranked by effectiveness):
- STONITH / fencing — Shoot The Other Node In The Head. Before a replica promotes itself, it must fence the old primary (power-cycle it via IPMI, detach its cloud disk, or kill its network interface). Without fencing, two nodes can simultaneously believe they are primary.
- Quorum-based promotion — Patroni (Postgres), Orchestrator (MySQL), and MHA all require a majority vote from the cluster before promoting. A minority partition cannot achieve quorum, so it cannot promote.
- Semi-synchronous replication — requires at least one replica to acknowledge receipt before the primary returns success. If all replicas are unreachable, semi-sync falls back to async after
rpl_semi_sync_source_timeout(default 10 s) — so it degrades gracefully but prevents the most common data-loss scenario. - Monitoring replica slot pressure (Postgres) — unused replication slots hold WAL indefinitely. A replica that falls too far behind and then reconnects can cause the primary to OOM. Monitor
pg_replication_slots.activeand drop inactive slots promptly.
max_wal_size = 1GB and no replication slots — meaning WAL is recycled before a slow replica can replay it. Enable replication slots for replicas you care about, but set max_slot_wal_keep_size to cap WAL retention so a dead replica cannot fill the disk. Without this cap, a replica that goes offline for 24 hours can bring down the primary when WAL storage is exhausted.Operational Checklist: Running Replicas in Production
- Enable GTID-based replication (
gtid_mode=ON+enforce_gtid_consistency=ONon MySQL) so failover does not require manual binlog position arithmetic. - Monitor
Seconds_Behind_Sourceorpg_replication_lagwith alerting; page on lag > 60 s. - Set
read_only=ONon every replica at the engine level — your HA manager should setsuper_read_only=ONtoo, so even privileged users cannot accidentally write. - Verify replication integrity weekly with
pt-table-checksum(MySQL) orpg_amcheck(Postgres 14+). Silent corruption is real. - Test failover in a staging environment monthly. Mean time to confirm your HA stack actually works is more valuable than any runbook.