Caching & Messaging Infrastructure

Kafka Architecture for Operators

22 min Lesson 4 of 30

Kafka Architecture for Operators

Apache Kafka is a distributed commit log masquerading as a message broker. Understanding its internal architecture — not just how to use it from a producer or consumer perspective, but how data is stored, replicated, and coordinated across a cluster — is what separates operators who can keep a Kafka cluster healthy at scale from those who get paged at 3 AM and have no idea where to look. This lesson maps the entire internal architecture: brokers, topics, partitions, replication, and the controller, at a level of detail that lets you reason about production failures before they happen.

Brokers: The Unit of Scale

A Kafka broker is an ordinary JVM process running on a host. Its job is to persist and serve log segments for the partitions it is assigned. A production cluster at big-tech companies runs anywhere from 12 to several hundred brokers, typically on dedicated hosts with local NVMe or high-throughput EBS-optimized instances. Each broker is identified by a numeric broker.id (or, in KRaft mode, a UUID that maps to the same concept).

The broker stores partition data on disk as immutable, append-only segment files. Each segment is a pair: a .log file containing raw message bytes and an .index file mapping offsets to byte positions. When a segment reaches the configured log.segment.bytes (default 1 GB), it is rolled to a new active segment. The broker never overwrites existing segments — it only appends and eventually deletes old segments via the retention policy. This is the core design that makes Kafka fast: sequential I/O is an order of magnitude faster than random I/O on both spinning disks and SSDs.

Kafka is not a database, but it behaves like one. Every message written to a partition is durably persisted to disk before the broker acknowledges the producer (when acks=all). The log-structured storage model means recovery after a crash is simply replaying from the last committed offset — no WAL replay, no B-tree reconstruction.

Key broker configuration knobs every operator must know:

# /etc/kafka/server.properties — critical broker settings

# Disk and storage
log.dirs=/data/kafka/logs                    # Comma-separated list of log dirs
log.segment.bytes=1073741824                 # Roll segment at 1 GB
log.retention.bytes=107374182400             # Retain up to 100 GB per partition
log.retention.hours=168                      # OR retain for 7 days (whichever first)
log.retention.check.interval.ms=300000       # Check every 5 min

# Replication
default.replication.factor=3                 # MINIMUM for production
min.insync.replicas=2                        # Producers with acks=all need 2 ACKs
unclean.leader.election.enable=false         # NEVER allow stale replicas to become leader

# Performance
num.io.threads=8                             # I/O threads per log dir
num.network.threads=3                        # Network handler threads
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600           # Max request size (100 MB)

# JVM heap — set externally via KAFKA_HEAP_OPTS
# KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"           # 6 GB, no heap growth at runtime

Topics, Partitions, and the Log

A topic is a logical feed of records. It is a namespace, not a storage unit. All actual data lives in partitions. When you create a topic with --partitions 12, you create 12 independent, ordered, append-only logs that are distributed across the cluster. Each partition is stored on exactly one broker at any moment (the leader), with copies on additional brokers (followers/replicas).

The partition count determines your maximum parallelism for consumption: a consumer group can have at most one active consumer per partition. If you have 12 partitions and 20 consumers in a group, 8 consumers will be idle. Partition count is permanent in the upward direction — you can increase partitions but never decrease them, and increasing partitions breaks key-based ordering for existing consumers because the same key may now hash to a different partition. Size your partition count correctly at creation time.

Production guidance: at LinkedIn (Kafka's birthplace), the rule of thumb is that a single partition can handle roughly 10 MB/s of throughput on modern hardware. For a topic expected to carry 1 GB/s, that implies 100 partitions. Across a 10-broker cluster, that is 10 partition leaders per broker, which is manageable. Clusters with more than 4,000 partitions per broker begin to show ZooKeeper (or KRaft metadata) pressure and longer failover times.

# Create a topic with explicit partition and replication settings
kafka-topics.sh --bootstrap-server kafka1:9092 \
  --create \
  --topic orders \
  --partitions 24 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000 \
  --config retention.bytes=53687091200

# Describe the topic — shows leader, replicas, ISR per partition
kafka-topics.sh --bootstrap-server kafka1:9092 \
  --describe --topic orders

# Example output (truncated):
# Topic: orders  Partition: 0  Leader: 3  Replicas: 3,1,2  Isr: 3,1,2
# Topic: orders  Partition: 1  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
# Topic: orders  Partition: 2  Leader: 2  Replicas: 2,3,1  Isr: 2,3,1

# Increase partition count (irreversible — plan carefully)
kafka-topics.sh --bootstrap-server kafka1:9092 \
  --alter --topic orders --partitions 48

Replication: ISR and the Durability Contract

Kafka replication is pull-based: follower replicas continuously fetch from the leader, not the other way around. Each partition has exactly one leader broker and zero or more follower brokers. Together these form the replica set. The subset of replicas that are currently caught up with the leader is called the In-Sync Replica set (ISR).

The ISR is the heart of Kafka's durability guarantee. When a producer sends with acks=all (also written acks=-1), the leader does not acknowledge the write until every broker in the ISR has persisted the message. If a follower falls behind by more than replica.lag.time.max.ms (default 30 seconds), it is removed from the ISR. This means a slow or crashed follower does not block producers — it just leaves the ISR, degrading your redundancy level silently until it catches up or is replaced.

The critical operator invariant: min.insync.replicas is the floor. If the ISR shrinks below this value, producers with acks=all will receive a NotEnoughReplicasException and the partition becomes read-only for writes. This is intentional: it is better to reject writes than to accept them with insufficient durability. At Google-scale operations, min.insync.replicas=2 with replication.factor=3 is the universal baseline — it tolerates one broker failure without any write disruption, and requires two simultaneous failures before writes are blocked.

# Monitor ISR health — any partition with ISR shorter than replication factor is degraded
kafka-topics.sh --bootstrap-server kafka1:9092 --describe \
  | awk '/Isr:/ {split($0, a, "Isr: "); split(a[2], isr, ","); \
    split($0, r, "Replicas: "); split(r[2], rep, ","); \
    if (length(isr) < length(rep)) print "DEGRADED:", $0}'

# Better: use kafka-topics.sh built-in filter
kafka-topics.sh --bootstrap-server kafka1:9092 \
  --describe --under-replicated-partitions

# Check which brokers are in the ISR for a specific partition
kafka-log-dirs.sh --bootstrap-server kafka1:9092 \
  --topic-list orders --broker-list 1,2,3

# Trigger preferred leader election after a broker restarts
# (redistributes leadership to original preferred leaders)
kafka-leader-election.sh --bootstrap-server kafka1:9092 \
  --election-type preferred --all-topic-partitions

A 3-broker Kafka cluster with one topic, 3 partitions, and replication factor 3. Each broker holds one partition leader and two ISR followers. Broker 1 also acts as the active Controller.

The Controller: Cluster Brain

The controller is a single broker elected to manage cluster metadata and coordinate partition leadership. In the legacy ZooKeeper-based architecture (Kafka pre-3.3), the controller was elected via an ephemeral ZooKeeper node — whichever broker claimed /controller first became the controller. When the controller crashed or its ZooKeeper session expired, other brokers raced to claim the node and a new controller was elected within seconds.

In the modern KRaft mode (Kafka 3.3+ GA, mandatory in Kafka 4.0+), ZooKeeper is removed entirely. The controller is now a Raft-based consensus group: a small set of brokers (typically 3 or 5, separate from or co-located with the data brokers) forms a Raft quorum. One is the active controller, the others are followers. The metadata log — every topic creation, partition reassignment, broker registration, ISR change — is itself a Kafka partition replicated across this quorum. This means controller failover is now sub-second rather than the 15-30 second ZooKeeper session timeout window that defined legacy Kafka SLOs.

The controller's responsibilities are:

Leader election: When a partition leader fails, the controller picks the next in-sync replica to promote and pushes the new leader epoch to all affected brokers.
Broker registration: When a broker starts or crashes, it registers/deregisters with the controller. The controller then triggers partition reassignments as needed.
ISR changes: Brokers report ISR shrink/expand events to the controller, which persists them in the metadata log and propagates them to other brokers.
Topic and partition management: All administrative operations (create/delete topic, reassign partitions, alter configs) flow through the controller.

# KRaft mode: configure a 3-node controller quorum
# On each controller node, server.properties:

# Roles: 'controller' for dedicated controller nodes,
#        'broker,controller' for combined mode (small clusters only)
process.roles=controller
node.id=1                                    # Unique across the cluster
controller.quorum.voters=1@kafka1:9093,2@kafka2:9093,3@kafka3:9093

# On broker-only nodes:
process.roles=broker
node.id=4
controller.quorum.voters=1@kafka1:9093,2@kafka2:9093,3@kafka3:9093

# Generate a cluster UUID once (do this before first boot):
kafka-storage.sh random-uuid
# Output: abc123XYZ...

# Format each node storage with that UUID:
kafka-storage.sh format -t abc123XYZ -c /etc/kafka/server.properties

# Verify controller status:
kafka-metadata-quorum.sh --bootstrap-controller kafka1:9093 describe --status
# Output shows: LeaderId, LeaderEpoch, HighWatermark, MaxFollowerLag

Partition Assignment and Rack Awareness

Kafka distributes partition replicas across brokers to balance leader load and tolerate broker failures. The default assignment algorithm spreads replicas in a round-robin manner, but in a multi-rack or multi-AZ deployment, you need rack awareness to ensure that each partition has replicas in different physical failure domains.

Configure broker.rack on each broker to match its AZ or physical rack label. When Kafka creates a topic, it will ensure that the replica set for each partition spans different racks. This means losing an entire AZ will not cause data loss for any partition (as long as the remaining replicas are in the ISR).

# Broker in us-east-1a:
broker.rack=us-east-1a

# Broker in us-east-1b:
broker.rack=us-east-1b

# Broker in us-east-1c:
broker.rack=us-east-1c

# Verify rack assignment:
kafka-topics.sh --bootstrap-server kafka1:9092 --describe --topic orders
# Each partition should show replicas on brokers from different racks

# Manual partition reassignment (e.g., after adding brokers):
# 1. Generate a reassignment plan
kafka-reassign-partitions.sh --bootstrap-server kafka1:9092 \
  --broker-list "1,2,3,4,5" \
  --topics-to-move-json-file topics.json \
  --generate

# 2. Execute the reassignment (throttled to avoid impacting producers)
kafka-reassign-partitions.sh --bootstrap-server kafka1:9092 \
  --reassignment-json-file reassignment.json \
  --throttle 50000000 \
  --execute

# 3. Verify completion
kafka-reassign-partitions.sh --bootstrap-server kafka1:9092 \
  --reassignment-json-file reassignment.json \
  --verify

Partition reassignment is I/O-intensive and operator-visible. Moving a partition replica copies every byte of data from the source broker to the destination. On a topic with 500 GB of retention and no throttle set, a reassignment can saturate your disk and network, impacting producer and consumer latency. Always use --throttle with a value set to 20-30% of your inter-broker bandwidth. Also watch the ISR during reassignment: the new replica starts as an observer, builds its log, then joins the ISR — until it joins, your partition has reduced redundancy.

Key Failure Modes Every Operator Must Know

Unclean leader election: If unclean.leader.election.enable=true and all ISR members are down, Kafka can elect an out-of-sync replica as leader. This replica may be missing the last committed messages, causing data loss. Set this to false in all production clusters. The trade-off is availability: if the entire ISR is lost, the partition will be unavailable until an ISR member recovers. This is the correct trade-off for a durable log system.

Controller storm: Rapid broker restarts can trigger repeated controller elections, causing a storm of leader epoch bumps and metadata propagation across all brokers. Symptom: producers and consumers see frequent NOT_LEADER_OR_FOLLOWER and LEADER_NOT_AVAILABLE errors. Mitigation: stagger broker restarts with at least 30-60 seconds between each, and watch kafka.controller:type=KafkaController,name=ActiveControllerCount — it must always be exactly 1 across the cluster.

Log directory failure: If a broker loses a disk, all partitions on that log directory are taken offline. The broker remains up and serves its other log directories. Kafka 1.1+ introduced JBOD (Just a Bunch of Disks) support: configure multiple log.dirs entries pointing to separate disks. A disk failure only offline the partitions on that disk, not the entire broker. Use this in production instead of RAID.

Instrument these four metrics as your Kafka health baseline: (1) UnderReplicatedPartitions — must be 0 at steady state; any value above 0 means your cluster is degraded. (2) ActiveControllerCount — must be exactly 1. (3) OfflinePartitionsCount — must be 0; any offline partition is a complete data unavailability event. (4) RequestHandlerAvgIdlePercent — should stay above 30%; dropping below 20% signals that the broker is CPU-bound and about to start dropping requests. Wire these into your existing Prometheus alerting stack; they are the four-metric SLO for any Kafka cluster.

Summary

Kafka's architecture is a set of deliberate trade-offs: append-only segment files for throughput, pull-based ISR replication for durability without blocking producers, a centralized controller (now Raft-based in KRaft) for consistent metadata, and rack-aware partition assignment for fault isolation. Every production failure mode — unclean elections, ISR shrink, controller storms, log directory failures — flows directly from these design decisions. Operators who understand the architecture can read the failure from metrics and logs in seconds; those who do not spend hours in the dark. In the next lesson we move to the producer and consumer side: how writes are batched, compressed, and acknowledged, and how consumer groups coordinate partition assignment.