Databases in Production

Replication Deep Dive

18 min Lesson 3 of 30

Replication Deep Dive

Replication is the backbone of database high availability and horizontal read scaling — and it is also one of the most fertile sources of production incidents. Understanding replication at the protocol level, not just as a feature you enable, separates engineers who manage databases from those who are managed by them.

How Replication Actually Works

MySQL/MariaDB and PostgreSQL both ship binary log (binlog / WAL) based replication. The primary writes every committed change to a durable log. Replicas stream that log and replay it in order. The critical insight is that this is asynchronous by default: the primary considers a transaction committed the moment it hits the primary's disk, before any replica has acknowledged receipt.

Async replication (default) vs semi-sync acknowledge path. Semi-sync requires at least one replica to confirm receipt before the primary responds to the client.

Replication Lag: Causes, Measurement, and Thresholds

Replication lag is the age of the oldest unexecuted event on a replica. It is not a single number — it is a compound of network latency, I/O saturation on the replica, and single-threaded SQL replay bottlenecks (pre-MySQL 8.0 / pre-Postgres parallel apply).

Common lag causes at scale:

Large transactions — a single UPDATE orders SET ... WHERE 1=1 that takes 40 s on the primary blocks the replica SQL thread for the same 40 s.
I/O-bound replay — replicas often run on cheaper hardware; their disk is the bottleneck.
DDL statements — ALTER TABLE on a 200 GB table locks the SQL thread for its full duration on row-based replicas.
Single-threaded SQL thread — MySQL pre-5.7, Postgres pre-16 Logical Replication parallel apply. Enable parallel SQL threads.

Measure lag from inside the database, not from application-side pings:

-- MySQL / MariaDB — on any replica
SHOW REPLICA STATUS\G
-- Key fields:
--   Seconds_Behind_Source  : lag in seconds (NULL = IO thread disconnected)
--   Retrieved_Gtid_Set     : what the replica received
--   Executed_Gtid_Set      : what the replica replayed
--   Relay_Log_Space        : bytes of unreplayed events buffered

-- PostgreSQL — on any standby
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;

-- PostgreSQL — on the primary, lag per standby slot
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag_bytes
FROM pg_replication_slots;

In Prometheus, expose these metrics via mysqld_exporter or postgres_exporter and alert when lag exceeds your SLO threshold (typically 30 s for OLTP, 5 min for analytics replicas):

# PromQL alert — MySQL replica lag
ALERT MySQLReplicaLagHigh
  IF mysql_slave_status_seconds_behind_master{job="mysql"} > 30
  FOR 2m
  LABELS { severity="warning" }
  ANNOTATIONS {
    summary = "Replica {{ $labels.instance }} is {{ $value }}s behind primary",
  }

# PromQL alert — PostgreSQL standby lag
ALERT PGStandbyLagHigh
  IF pg_replication_lag{job="postgres"} > 30
  FOR 2m
  LABELS { severity="warning" }

Enable parallel SQL threads on MySQL replicas: set replica_parallel_workers = 8 and replica_parallel_type = LOGICAL_CLOCK in /etc/mysql/conf.d/replica.cnf. This allows independent transactions to replay in parallel, cutting lag by 4-8x on write-heavy workloads. Verify with SHOW PROCESSLIST — you should see multiple system user threads running SQL.

Read Replicas in Production: Patterns and Pitfalls

Read replicas are powerful — they absorb 80-95% of traffic on read-heavy services. The rules for using them safely:

Never read your own writes from a replica. After a write on the primary, route the next read for that session to the primary (or wait for WAIT_FOR_EXECUTED_GTID_SET / pg_wal_replay_wait() to confirm the replica has caught up). Session stickiness to the primary after writes is the safest pattern.
Route by query type, not randomly. Analytics queries belong on a dedicated replica; OLTP reads can use a pool of general replicas. Never mix them — a 60-second analytical query will delay OLTP replay behind it.
Use a proxy. ProxySQL (MySQL) and PgBouncer+HAProxy (Postgres) handle read/write splitting, health checks, and automatic failover without baking connection strings into application code.
Health-check lag before routing. ProxySQL mysql_servers.max_replication_lag removes a backend from the read pool when lag exceeds the threshold. Configure it. A lagging replica with stale data is worse than routing the read to the primary.

Split-Brain: The Most Dangerous Failure Mode

Split-brain occurs when two nodes simultaneously believe they are the primary and accept writes. Both diverge. When the network partition heals, you have two conflicting histories with no automated way to merge them. Data is lost.

Split-brain: both nodes accept writes after a partition; data diverges and one side must be discarded on heal.

Prevention strategies (ranked by effectiveness):

STONITH / fencing — Shoot The Other Node In The Head. Before a replica promotes itself, it must fence the old primary (power-cycle it via IPMI, detach its cloud disk, or kill its network interface). Without fencing, two nodes can simultaneously believe they are primary.
Quorum-based promotion — Patroni (Postgres), Orchestrator (MySQL), and MHA all require a majority vote from the cluster before promoting. A minority partition cannot achieve quorum, so it cannot promote.
Semi-synchronous replication — requires at least one replica to acknowledge receipt before the primary returns success. If all replicas are unreachable, semi-sync falls back to async after rpl_semi_sync_source_timeout (default 10 s) — so it degrades gracefully but prevents the most common data-loss scenario.
Monitoring replica slot pressure (Postgres) — unused replication slots hold WAL indefinitely. A replica that falls too far behind and then reconnects can cause the primary to OOM. Monitor pg_replication_slots.active and drop inactive slots promptly.

The default Postgres configuration ships with max_wal_size = 1GB and no replication slots — meaning WAL is recycled before a slow replica can replay it. Enable replication slots for replicas you care about, but set max_slot_wal_keep_size to cap WAL retention so a dead replica cannot fill the disk. Without this cap, a replica that goes offline for 24 hours can bring down the primary when WAL storage is exhausted.

At Google scale, Spanner sidesteps replication lag entirely by using TrueTime (GPS + atomic clock) to provide externally consistent reads. In your production AWS/GCP environment, Aurora Global Database and Cloud Spanner provide sub-second cross-region replication with automated failover — study their consistency guarantees before assuming all managed databases behave like vanilla MySQL.

Operational Checklist: Running Replicas in Production

Enable GTID-based replication (gtid_mode=ON + enforce_gtid_consistency=ON on MySQL) so failover does not require manual binlog position arithmetic.
Monitor Seconds_Behind_Source or pg_replication_lag with alerting; page on lag > 60 s.
Set read_only=ON on every replica at the engine level — your HA manager should set super_read_only=ON too, so even privileged users cannot accidentally write.
Verify replication integrity weekly with pt-table-checksum (MySQL) or pg_amcheck (Postgres 14+). Silent corruption is real.
Test failover in a staging environment monthly. Mean time to confirm your HA stack actually works is more valuable than any runbook.