Caching & Messaging Infrastructure

Kafka Reliability & Disaster Scenarios

18 min Lesson 7 of 30

Kafka Reliability & Disaster Scenarios

Kafka's reliability model is deceptively simple: replicate data across brokers, elect a leader per partition, acknowledge writes to enough replicas. The nuance — and the source of every production outage — lives in the gaps between those steps. This lesson covers the three scenarios that have caused the most real-world data loss and availability failures at production scale: unclean leader elections, the many paths to acknowledged-but-lost writes, and the operational complexity of cross-datacenter replication.

Unclean Leader Elections: Trading Safety for Availability

Every Kafka partition has exactly one leader at any moment. When that leader fails, the controller must elect a new one. The clean path: elect only from the In-Sync Replica (ISR) set — the subset of replicas that have fully caught up with the leader. Replicas in the ISR are guaranteed to hold all messages acknowledged by the previous leader, so a new leader elected from this set has no data gap.

The dangerous path is the unclean election: electing a replica that is not in the ISR because no ISR replica is available. This happens when the leader fails while its followers are lagging behind — a scenario that occurs more often than intuition suggests, especially during rolling restarts or network partitions that outlast replica.lag.time.max.ms (default: 30 seconds).

Production pitfall: unclean.leader.election.enable=true is the default in some Kafka distributions prior to 3.x. With this on, a broker that is significantly behind can win a leader election, becoming authoritative while permanently discarding every message the previous leader acknowledged but this replica never received. Producers get no error. Consumers skip ahead silently. This is silent, permanent data loss.

Set unclean.leader.election.enable=false cluster-wide for any workload where data loss is unacceptable. The trade-off: if the ISR shrinks to zero (all caught-up replicas fail simultaneously), the partition becomes unavailable until at least one ISR member recovers. This is the correct behavior — unavailability is recoverable; data loss is not. Reserve true only for high-throughput logging topics where you explicitly accept loss in exchange for continuous availability.

# server.properties — safety-first baseline for production
unclean.leader.election.enable=false
default.replication.factor=3
min.insync.replicas=2

# Per-topic override (takes precedence over broker default)
kafka-topics.sh --alter \
  --bootstrap-server kafka-1:9092 \
  --topic payments \
  --config unclean.leader.election.enable=false \
  --config min.insync.replicas=2

# Inspect ISR health — healthy vs degraded output
kafka-topics.sh --describe \
  --bootstrap-server kafka-1:9092 \
  --topic payments
# Healthy:  Partition: 0  Leader: 2  Replicas: 2,0,1  Isr: 2,0,1
# Degraded: Partition: 0  Leader: 2  Replicas: 2,0,1  Isr: 2
# (ISR=1 means you are one broker failure away from unavailability)

Data Loss Scenarios: The Full Taxonomy

Unclean elections are one path to data loss. Senior operators maintain a mental map of all paths, because each requires a different mitigation:

Producer acks=0 or acks=1: With acks=0, the producer fires and forgets — any broker failure before the write lands on disk loses the message. With acks=1, the leader acknowledges after writing to its own log but before any follower replicates; if the leader crashes in that window, the message is lost even though the producer received a success response. Use acks=all (equivalent to acks=-1) for any topic that matters.
ISR shrinkage below min.insync.replicas: With acks=all and min.insync.replicas=2, a write requires acknowledgment from at least 2 replicas. If only one replica is in the ISR (due to broker failure or lag), the producer receives NotEnoughReplicasException. This is the correct behavior — the cluster rejects the write rather than accepting it unsafely. Monitor ISR size via JMX metric kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions.
Log truncation on follower promotion: When a follower is elected leader, it truncates its log to the last committed offset — the high-water mark. Any messages the old leader wrote after the high-water mark but before crashing are discarded. With acks=all this cannot happen by definition (those messages would not have been acknowledged), but with acks=1 it is a real loss scenario.
Consumer offset commits ahead of processing: A consumer can commit an offset before fully processing (or durably storing) the corresponding message. If the process crashes after commit but before the side effect completes, the message is effectively lost from the consumer's perspective — it will never be reprocessed. Commit offsets only after your processing side effect is durable.
Compacted topic tombstone expiry: Log-compacted topics delete tombstone records after delete.retention.ms (default: 24 hours). A downstream consumer that was offline longer than this interval misses the deletes and retains stale state indefinitely.

Key insight: The combination acks=all + min.insync.replicas=N-1 (where N is the replication factor) + unclean.leader.election.enable=false gives you the strongest durability guarantee Kafka can provide. This is the baseline for any financially or contractually sensitive topic at big-tech scale. Stripe, LinkedIn, and Confluent's own internal clusters all run with this configuration.

Kafka durability configuration paths: how acks, ISR size, and unclean election settings combine to determine whether data is safe or silently lost.

Multi-Cluster Replication: MirrorMaker 2 in Production

A single Kafka cluster, no matter how well-tuned, is a single failure domain. Network partitions, availability zone failures, and datacenter-level events require a multi-cluster topology. Kafka's native answer is MirrorMaker 2 (MM2), introduced in KIP-382 and built on the Kafka Connect framework. MM2 replaces the original MirrorMaker — which had serious offset-mapping and consumer-group replication gaps — and is the production-standard tool at LinkedIn, Confluent, and AWS MSK.

MM2 provides three key capabilities that its predecessor lacked:

Offset translation: Consumer group offsets are replicated between clusters with a translation layer that accounts for the fact that the same logical message may have different offsets in the source and target clusters. This is essential for failover — consumers can resume from the correct position in the target cluster without replaying the entire topic.
Topic namespace isolation: Topics are prefixed with the source cluster alias (e.g., us-east.payments on the eu-west cluster), preventing collisions and making topology explicit.
Heartbeat and checkpoint topics: MM2 writes synthetic records to mm2-heartbeats and mm2-checkpoints — used by failover tooling (like the RemoteClusterUtils API) to calculate consumer offset translation on-demand.

# mm2.properties — active/passive replication (us-east -> eu-west)
clusters = us-east, eu-west
us-east.bootstrap.servers = kafka-us-east-1:9092,kafka-us-east-2:9092,kafka-us-east-3:9092
eu-west.bootstrap.servers = kafka-eu-west-1:9092,kafka-eu-west-2:9092,kafka-eu-west-3:9092

# Replicate all topics from us-east to eu-west (active/passive DR)
us-east->eu-west.enabled = true
us-east->eu-west.topics = .*
us-east->eu-west.groups = .*

# Topic replication config
replication.factor = 3
tasks.max = 8

# Offset sync interval — lower = faster failover, higher = less overhead
emit.checkpoints.interval.seconds = 10
emit.heartbeats.interval.seconds = 5

# Do not replicate internal MM2 topics back (prevents loops)
us-east->eu-west.topics.blacklist = mm2-.*,__.*

# Start MM2 as a distributed Connect cluster
connect-mirror-maker.sh mm2.properties

# After failover: translate offsets for consumer group 'payment-processor'
# so it resumes from the correct position on eu-west
kafka-consumer-groups.sh \
  --bootstrap-server kafka-eu-west-1:9092 \
  --group payment-processor \
  --reset-offsets \
  --to-datetime 2025-11-01T00:00:00.000 \
  --topic us-east.payments \
  --execute

Active-Active vs. Active-Passive: The Operational Trade-off

MM2 supports both topologies, and the choice has significant operational consequences:

Active-passive is operationally simpler. One cluster is the source of truth; the other is a warm standby. Failover is a deliberate operational action: redirect producers to the standby, verify consumer offset translation, cut DNS. The risk is that the standby cluster is never exercised under production load, meaning failover is slower and more error-prone when you actually need it. Aim for a regular DR drill — at minimum quarterly — to validate the runbook.

Active-active routes different topic namespaces to different clusters and allows each cluster to mirror the other's namespaces. This gives you zero-RPO for reads and automatic fan-out, but introduces the hardest problem in distributed systems: avoiding message cycles (a message replicated to cluster B must not be re-replicated back to cluster A). MM2 uses topic prefixing and the replication.policy.class setting to break cycles, but this must be explicitly tested.

Pro practice: Instrument your MM2 replication lag with the JMX metric kafka.connect:type=MirrorSourceConnector,target=(*),topic=(*),name=replication-latency-ms. Alert when lag exceeds your RTO budget. At LinkedIn's scale (trillions of messages per day cross-datacenter), this metric is on every SRE dashboard. A sustained lag spike is usually the first sign of a network brownout or a target-cluster broker failure before the cluster itself reports degradation.

Disaster Recovery Runbook: The First 15 Minutes

When the primary cluster becomes unavailable, the sequence of decisions matters as much as the technical mechanics:

Confirm scope (minutes 0-2): Is it one broker, the entire cluster, or the network path between clusters? Check ActiveControllerCount, OfflinePartitionsCount, and MM2 lag metrics. A single broker failure in a properly configured cluster is self-healing — do not trigger failover.
Declare incident and freeze writes (minutes 2-5): If cluster-level failure is confirmed, coordinate with the on-call to freeze writes to the primary. This prevents producers from continuing to write to a cluster whose fate is uncertain, which complicates offset reconciliation later.
Translate consumer offsets (minutes 5-10): Use RemoteClusterUtils.translateOffsets() or the kafka-consumer-groups.sh --reset-offsets workflow to map consumer group checkpoints from the primary to the DR cluster. This is the step most teams under-practice and most often get wrong under pressure.
Redirect traffic and validate (minutes 10-15): Update bootstrap server configuration in producer and consumer clients (typically via service discovery or environment config), restart consumers on the DR cluster, and validate that consumer groups are progressing and that key business metrics (e.g., payment processing rate) are recovering.

Key insight: The difference between a 15-minute and a 4-hour DR recovery is almost never the infrastructure — it is the runbook clarity and the frequency of drills. Every cluster failover procedure should be executed in a staging environment at minimum once per quarter. The teams that recover fastest are the ones for whom failover is boring, because they have done it dozens of times.