Operating Kafka
Operating Kafka
Running Kafka in production at scale means going well beyond "start the broker and publish messages." Four decisions govern cluster health day-to-day: how partitions are sized, how long data is retained, whether compaction is on, and whether consumer lag and ISR health are tracked close to real time. Misconfigure any one of these and you will face disk exhaustion, undetectable consumer drift, or silent data loss — often all three simultaneously on the worst day of the year.
Partition Sizing
Partitions are the unit of parallelism, replication, and file-handle consumption. Getting the count wrong at topic creation is expensive to fix later because repartitioning requires a full data migration.
Throughput formula. Target partition count = max(Tin / p, Tout / c), where Tin is peak ingestion MB/s, p is per-partition write throughput (~10 MB/s on commodity NVMe), and c is per-partition consumer throughput. A topic receiving 500 MB/s of raw events with 5 consumer threads each sustaining 50 MB/s needs at least 10 partitions from the producer side and 10 from the consumer side — so start at 12 with a growth margin.
File-handle cost. Each partition replica opens two file handles (index + segment). A broker hosting 50 000 partition replicas needs at least 100 000 file descriptors; set ulimit -n 200000 in the systemd unit and fs.file-max accordingly. At Google and LinkedIn the practical ceiling per broker is 4 000–6 000 leader partitions before leader election latency and controller overhead degrade.
Partition reassignment. When you must repartition, use kafka-reassign-partitions.sh with throttle flags so rebalancing does not saturate inter-broker network. A 50 MB/s replication throttle is conservative but safe for a cluster that is still serving traffic.
Retention Policies
Retention is a storage budget, not a backup strategy. Two dimensions control it: time-based (retention.ms) and size-based (retention.bytes). Both apply per partition; the per-topic size limit is retention.bytes * partition_count. Kafka enforces whichever limit is hit first.
Segment rolling. Retention is applied at segment granularity, not individual message granularity. A segment is only deleted when it is older than retention.ms AND a newer segment exists. The active (open) segment is never deleted. This means a low-volume topic with a very large segment.ms (default 7 days) can silently retain data far longer than retention.ms. For event-sourcing or audit topics, set segment.ms close to retention.ms / 2 to bound overage.
Tiered storage. Confluent and AWS MSK both offer tiered storage where segments older than a threshold are offloaded to object storage (S3/GCS), keeping only recent hot data on local NVMe. This decouples storage cost from compute scaling and is the recommended pattern for topics with retention measured in weeks or months. Configure via remote.storage.enable=true and local.retention.ms (how long to keep locally before delegating reads to the remote tier).
Log Compaction
Compaction retains the most recent record for each key, discarding older tombstones over time. It is the right choice for changelog topics (database CDC, entity state) where consumers need the latest value, not full history. It is the wrong choice for event streams where ordering within a time window matters.
Compaction runs in background threads controlled by log.cleaner.threads (default 1, raise to 4–6 on large clusters). The cleaner targets partitions whose "dirty ratio" (uncompacted bytes / total bytes) exceeds min.cleanable.dirty.ratio (default 0.5). For aggressive compaction on low-latency CDC topics, lower this to 0.1 and also set min.compaction.lag.ms to prevent records newer than a threshold from being compacted (useful when downstream consumers have a known maximum lag SLO).
Tombstones. A record with a null value is a tombstone — it signals deletion of that key. Tombstones are preserved for at least delete.retention.ms (default 24 hours) so consumers catch up before the key disappears entirely. If a consumer group restarts after a gap longer than delete.retention.ms, it will miss deletions. This is a documented Kafka limitation; mitigate by enforcing consumer SLOs strictly on compacted topics.
cleanup.policy=compact,delete (enabled by default on some MSK presets) applies both: compaction runs, and segments are still dropped after retention.ms. This means even the "latest" value for a key can disappear if the topic is small enough that compaction never wins the race against deletion. For pure changelog semantics, use cleanup.policy=compact only, and accept that disk grows unbounded unless tiered storage is enabled.
Monitoring Consumer Lag
Consumer lag is the difference between the log-end offset (LEO) and the committed offset for each partition. Lag growing monotonically is the first sign of a consumer that cannot keep up; lag that spikes and recovers signals bursty processing. Both are actionable, but the response differs: the former demands scaling consumer instances or optimizing processing, the latter demands smoothing the producer side or tuning fetch sizes.
Native tooling. kafka-consumer-groups.sh --describe shows lag per partition and consumer assignment. It is correct but pulls a point-in-time snapshot; it misses lag that spikes and recovers between polls.
Prometheus integration. Deploy kafka-exporter (danielqsj/kafka-exporter) or Confluent's JMX exporter. The critical metric is kafka_consumergroup_lag per group/topic/partition. Alert on: total lag exceeding 10× the normal steady-state, lag rate-of-change positive for more than 5 minutes, any partition with lag > 0 and no active consumer assignment (orphaned partition).
Monitoring ISR (In-Sync Replicas)
The ISR set is the authoritative indicator of replication health. When a replica falls out of ISR — because it is more than replica.lag.time.max.ms (default 30 s) behind the leader — the cluster is operating below its durability guarantee. If your topic has min.insync.replicas=2 and RF=3, losing two replicas from ISR blocks all acks=all producers with NotEnoughReplicasException.
Key JMX metrics to alert on:
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions— any value above 0 is a P1 alert during normal operations (0 is the only acceptable steady state).kafka.server:type=ReplicaManager,name=IsrShrinksPerSec— a positive rate means replicas are actively falling behind; check network, disk I/O, and GC on the lagging broker immediately.kafka.controller:type=KafkaController,name=ActiveControllerCount— must be exactly 1; 0 means a split-brain election is in progress, 2+ is impossible but indicates a monitoring bug.kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce— baseline this to detect traffic spikes that precede ISR shrinkage.
UnderReplicatedPartitions = 0 and that all consumer groups have zero lag on critical topics. A broker restart while ISR is already degraded can knock a topic below min.insync.replicas and cause producer outages. Instrument your deployment pipeline to gate on these checks — the same way you gate on pod readiness probes.
Operational Runbook Patterns
Senior engineers encode reactions to ISR shrinkage and runaway lag as runbook steps, not intuition:
- ISR shrinkage on one broker. Check
iostat -xz 1and GC logs on that broker. If disk I/O wait exceeds 80%, the broker is log-flushing too eagerly — lowerlog.flush.interval.messagespressure by ensuring OS page-cache flush is handling it (log.flush.interval.msshould be unset, delegating to the OS). If GC pause exceeds 200 ms, tune heap or upgrade to G1GC/ZGC. - Consumer lag growing steadily. Add consumer instances up to the partition count (beyond that they are idle). Check whether processing is CPU-bound or I/O-bound. If I/O-bound (external DB calls), decouple with an async worker pool. Increase
max.poll.recordsonly if processing time per record is very low — larger batches increase commit latency and reprocessing on crash. - Disk saturation. Emergency: reduce
retention.byteson the largest topics, or trigger early segment deletion viakafka-delete-records.shto the current offset minus a safe buffer. Long-term: add brokers and rebalance partitions.
unclean.leader.election.enable=true (disabled by default since Kafka 0.11), you trade consistency for availability — a lagging replica can become leader, causing message loss. This is never acceptable for financial or audit topics. For high-availability workloads that can tolerate reprocessing (metrics ingestion, clickstream), you might enable it as a last resort during a full ISR loss event, but document it as an explicit architectural decision and revert immediately after recovery.