Databases in Production

High Availability Architectures

22 min Lesson 2 of 30

High Availability Architectures

A database that goes down takes your entire product with it. Every other layer of the stack — load balancers, application servers, CDNs — can be made stateless and replaced in seconds. Databases cannot. They own the state, and losing them means losing user data, transaction history, and often the business itself. High availability (HA) architecture is the set of patterns that make a database cluster survive node failures, network partitions, and AZ outages without human intervention. This lesson covers the three pillars of production HA: replication topology, synchronous versus asynchronous replication trade-offs, and automatic failover.

The Primary/Replica Model

The foundational HA pattern for relational databases is primary/replica replication (historically called master/slave). One node — the primary — accepts all writes. One or more replica nodes receive a stream of changes from the primary and replay them to maintain an identical copy of the data.

This separation gives you three things simultaneously:

  • Redundancy: if the primary dies, a replica can be promoted to take its place.
  • Read scalability: read-heavy workloads can distribute SELECT queries across replicas, relieving primary write pressure.
  • Operational safety: long-running analytics queries run on a replica without slowing down OLTP writes on the primary.
Key idea: The primary/replica model is not a backup strategy — it is a failover strategy. A replica that is 30 seconds behind the primary cannot replace a backup taken before a destructive DROP TABLE. Replication and backups solve different failure modes and both are required in production.

In PostgreSQL, replication is built on WAL (Write-Ahead Log) streaming. Every transaction on the primary is first written to the WAL; replicas connect as streaming replication clients and receive WAL segments in near-real-time. In MySQL/MariaDB, the equivalent is the binary log (binlog) streamed in row-based format. Both mechanisms mean replicas are not reading from the primary's tables directly — they are replaying a log of changes, which makes replication lightweight even at high write rates.

Synchronous vs. Asynchronous Replication

The single most important architectural decision in database HA is whether replication is synchronous or asynchronous. This is not a preference — it is a direct trade-off between durability guarantees and write latency, and the wrong choice causes data loss or unacceptable slowdowns.

Asynchronous replication (default in most engines): The primary writes to its own WAL, commits, acknowledges success to the client, and then ships the WAL to replicas in the background. If the primary crashes before the replica has consumed the latest entries, those entries are lost permanently — this is called replication lag becoming data loss during a failover. The upside is that the commit path on the primary adds zero latency from network round-trips to replicas.

Synchronous replication: The primary writes to the WAL and only acknowledges the client commit after at least one designated synchronous replica has confirmed it received and flushed the WAL entry to durable storage. No data can be lost in a failover to that replica. The cost: every write incurs an additional network round-trip to the replica — typically 0.5–2 ms in the same AZ, 5–30 ms cross-AZ. At high write rates (thousands of transactions per second), this latency compounds.

Production pitfall — the durability illusion: Many teams run asynchronous replication and believe they have HA because they have a replica. They do not. If the primary crashes and the replica is 10 seconds behind, you just lost 10 seconds of committed transactions. For financial systems, order-entry, or any write that a user has already received "Success" for, that is catastrophic. Use synchronous replication on at least one replica for every workload where durability of committed data is a hard requirement.

The industry solution at companies like Google (Spanner, AlloyDB) and AWS (Aurora) is to use a quorum-based write protocol that achieves synchronous durability across multiple nodes while hiding most of the latency penalty. Aurora, for example, writes the log record to 6 storage nodes across 3 AZs and acknowledges the commit after 4 of 6 confirm — this provides synchronous guarantees with much better availability than a single synchronous replica.

For PostgreSQL specifically, you configure synchronous commit at the session or server level:

-- postgresql.conf: make one replica synchronous synchronous_standby_names = 'FIRST 1 (replica1, replica2)' synchronous_commit = on -- default; set to 'remote_write' for a middle ground -- Per-session override (useful for bulk loads that can tolerate lag): SET synchronous_commit = off; COPY large_table FROM '/tmp/data.csv'; SET synchronous_commit = on; -- Check current replication state from the primary: SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, (sent_lsn - replay_lsn) AS replication_lag_bytes, sync_state FROM pg_stat_replication;

MySQL/InnoDB Group Replication and semi-synchronous plugins offer analogous controls. In a managed service (RDS Multi-AZ, CloudSQL), the sync/async mode is set at the instance level and the engine handles the WAL shipping transparently.

Automatic Failover

Having a warm replica is not HA. HA requires that when the primary fails, a replica is promoted to primary automatically — without a human paging someone at 3 AM to run pg_ctl promote. Automatic failover is the operational differentiator between a system with replicas and a system with true HA.

The components of a production automatic failover system are:

  1. Health monitoring: an agent or sidecar that continuously polls the primary. A missed heartbeat alone is not sufficient — good implementations distinguish between "primary unreachable from this node" and "primary unreachable from the majority of the cluster" to avoid split-brain.
  2. Leader election / quorum: a distributed consensus mechanism (typically based on Raft or Paxos — etcd, ZooKeeper, or built-in engine consensus) that safely elects exactly one new primary. The quorum requirement prevents two nodes simultaneously promoting themselves and accepting conflicting writes.
  3. Promotion: the winning replica is promoted to primary — WAL replay stops, write connections are accepted.
  4. Re-routing: the connection layer (virtual IP, DNS failover, or a proxy like PgBouncer/ProxySQL) redirects application traffic to the new primary. Applications should use a single endpoint that always points to the current primary — baking in replica IPs breaks this.
  5. Old primary fencing (STONITH): the previous primary is isolated before the new primary accepts writes. Without fencing, a primary that paused (GC, network blip) and comes back can momentarily believe it is still primary, causing a split-brain write window.
PostgreSQL HA Failover: Primary, Replicas, Patroni/Etcd, and Connection Layer Application single DB endpoint HAProxy / PgBouncer virtual IP / DNS Primary (AZ-1) Writes accepted WAL streaming out Replica 1 (AZ-2) sync standby Read queries Replica 2 (AZ-3) async standby DR / analytics writes reads WAL sync WAL async Patroni + etcd leader election · health checks · fencing On primary failure: Patroni promotes Replica 1, flips HAProxy endpoint
Production HA topology: Primary in AZ-1 streams WAL synchronously to Replica 1 and asynchronously to Replica 2. Patroni with etcd handles leader election and fencing; HAProxy re-routes the write endpoint on failover.

Patroni: Production Automatic Failover for PostgreSQL

The de-facto open-source HA solution for PostgreSQL is Patroni, used in production at Zalando, GitLab, Aiven, and many others. Patroni runs as a sidecar on each PostgreSQL node and uses etcd (or Consul, or ZooKeeper) as the distributed consensus store to hold the cluster leader lock. The node that holds the lock is primary; all others are replicas. A minimal Patroni configuration looks like:

# /etc/patroni/config.yml (node: pg-node-1) scope: prod-postgres-cluster name: pg-node-1 restapi: listen: 0.0.0.0:8008 connect_address: 10.0.1.11:8008 etcd3: hosts: 10.0.1.50:2379,10.0.1.51:2379,10.0.1.52:2379 bootstrap: dcs: ttl: 30 # leader lock TTL in seconds loop_wait: 10 # health-check interval retry_timeout: 10 maximum_lag_on_failover: 1048576 # 1 MB — refuse to promote if lag exceeds this synchronous_mode: true # require at least one sync replica before commit initdb: - encoding: UTF8 - data-checksums postgresql: listen: 0.0.0.0:5432 connect_address: 10.0.1.11:5432 data_dir: /var/lib/postgresql/14/main bin_dir: /usr/lib/postgresql/14/bin parameters: wal_level: replica max_wal_senders: 5 max_replication_slots: 5 hot_standby: "on" synchronous_commit: "on" synchronous_standby_names: "FIRST 1 (*)" tags: nofailover: false noloadbalance: false clonefrom: false nosync: false

With maximum_lag_on_failover set, Patroni refuses to promote a replica that has fallen too far behind — preventing the promotion of a stale node that would roll back committed transactions. The combination of synchronous_mode: true and a low lag threshold gives you strong durability guarantees without building custom orchestration.

Failover RTO and RPO in Practice

Two metrics govern what your HA architecture actually promises to the business:

  • RTO (Recovery Time Objective): how long the database is unavailable during a failover. With Patroni, a typical automatic failover completes in 15–45 seconds from the moment the primary stops heartbeating (Patroni's ttl setting is the dominant factor). Managed services like RDS Multi-AZ typically achieve 20–60 seconds. Plan your application connection retry logic around this window — circuit breakers and retry with backoff are not optional.
  • RPO (Recovery Point Objective): how much data can be lost in the worst case. With asynchronous replication, RPO equals the replication lag at the time of failure — potentially seconds to minutes of committed transactions. With synchronous replication on at least one replica, RPO is zero for a failover to that replica.
Pro practice — test your failover regularly: Document your RTO/RPO targets, then measure them against reality by injecting a controlled primary failure in a staging environment (or, with proper blast-radius controls, in production during a low-traffic window). Teams that have never tested failover discover during an actual incident that their application does not reconnect cleanly, their connection pool holds stale connections, or their monitoring fires 20 alerts simultaneously with no clear runbook. Make failover testing a quarterly ritual.

Multi-AZ vs. Multi-Region HA

Synchronous replication across availability zones (AZs within the same AWS region) is the production baseline — the network round-trip is 0.5–2 ms, making synchronous commit practical. Cross-region synchronous replication (latency 50–200 ms) is almost never used for OLTP because the write latency penalty is prohibitive. Cross-region replicas are almost always asynchronous and serve as disaster recovery (DR) targets, not automatic failover candidates. The trade-off: if the primary region is entirely lost, promoting a cross-region async replica is a manual operation and accepts data loss equal to the replication lag.

AWS Aurora Global Database solves part of this by using its custom storage layer to replicate across regions with under 1 second of lag — still asynchronous in the CAP sense, but operationally much faster than traditional streaming replication. For systems that require zero cross-region RPO, Spanner or CockroachDB with synchronous multi-region consensus are the options, at significant cost and complexity.