Caching & Messaging Infrastructure

Project: A Messaging Platform Runbook

18 min Lesson 10 of 30

Project: A Messaging Platform Runbook

A runbook is not documentation for its own sake — it is the executable knowledge that lets any on-call engineer, at 3 a.m., restore a platform to health without escalating. This lesson synthesizes everything covered in this tutorial into a single, production-grade runbook for a platform where Redis Cluster serves the session and rate-limit tier and Apache Kafka serves the event streaming tier. The patterns apply whether you are running on bare metal, on AWS MSK + ElastiCache, or on self-managed Kubernetes operators.

Platform Architecture Snapshot

Before you can operate a platform, every team member must share a mental model of its topology. The runbook begins with an authoritative architecture diagram, pinned in your internal wiki and reviewed on every significant change.

Three-AZ topology: Redis primary-replica per shard with Sentinel quorum, Kafka brokers spreading partition leaders, KRaft controllers in an independent Raft group.

Section 1: High-Availability Design Decisions

Redis HA contract: Run a minimum of 3 Sentinel nodes (never 2 — a split-brain between two sentinels cannot resolve quorum). For Redis Cluster, use 3 shards with 1 replica each, spread across 3 AZs, with min-replicas-to-write 1 and min-replicas-max-lag 10 on every primary. Set cluster-require-full-coverage no so a lost shard degrades the cluster rather than bringing the whole thing down. Failover time under Sentinel is 30-60 s by default; tune down-after-milliseconds 5000 and failover-timeout 10000 for sub-30-second recovery in latency-sensitive environments, but accept that aggressive values increase false-positive failovers under network blips.

Kafka HA contract: Use a replication factor of 3 (RF=3) for all topics, with min.insync.replicas=2. This tolerates one broker loss with no data loss and no producer error. For the KRaft controller quorum, 3 dedicated controller nodes is the minimum for production — they require a quorum of 2/3 to elect a new active controller, so a simultaneous loss of 2 controllers stalls metadata operations. Keep controllers on separate hosts from brokers.

Section 2: Monitoring Stack and Golden Signals

The monitoring layer for this platform is built on Prometheus and Grafana, with redis_exporter (oliver006) and JMX exporter (or the native Kafka Prometheus reporter) feeding metrics. The following alert rules are non-negotiable for production:

# prometheus/alerts/messaging-platform.yml

groups:
- name: redis
  rules:
  - alert: RedisReplicationLagHigh
    expr: redis_connected_slaves{job="redis"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis primary {{ $labels.instance }} has no connected replicas"

  - alert: RedisMemoryUsageHigh
    expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis {{ $labels.instance }} memory usage high"

  - alert: RedisKeyEvictions
    expr: increase(redis_evicted_keys_total[5m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Redis evicting keys — check memory policy"

- name: kafka
  rules:
  - alert: KafkaUnderReplicatedPartitions
    expr: kafka_server_replicamanager_underreplicatedpartitions > 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Kafka broker {{ $labels.instance }} has {{ $value }} under-replicated partitions"

  - alert: KafkaConsumerGroupLagHigh
    expr: kafka_consumergroup_lag > 50000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Consumer group {{ $labels.consumergroup }} lag {{ $value }}"

  - alert: KafkaBrokerOffline
    expr: kafka_brokers < 3
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Kafka cluster has fewer than 3 brokers"

The two most actionable Redis metrics in production are used_memory vs maxmemory and connected_slaves. If a primary has zero replicas and crashes, you lose data regardless of your persistence settings. Replication status is your first check on any Redis incident.

Section 3: Incident Response Playbook

The following runbook steps are ordered for speed. Every section maps to a PagerDuty alert rule. The on-call engineer should be able to copy-paste these commands without looking anything up.

Redis Primary Failure (Sentinel-managed)

Confirm the alert: redis-cli -h sentinel1 -p 26379 SENTINEL masters — check the flags field; s_down or o_down confirm sentinel suspicion.
Check quorum: redis-cli -h sentinel1 -p 26379 SENTINEL ckquorum mymaster — must return OK.
If sentinel is healthy and failover has not yet started, trigger manually: redis-cli -h sentinel1 -p 26379 SENTINEL failover mymaster.
Confirm the new primary: redis-cli -h sentinel1 -p 26379 SENTINEL get-master-addr-by-name mymaster.
Verify application reconnection — most Redis client libraries (ioredis, jedis, lettuce) handle sentinel failover transparently within 1-3 s after the sentinel propagates the new master address.
File a post-incident ticket: root cause (OOM, kernel OOM-killer, hardware), replica reattachment status, data loss window (compare master_repl_offset from INFO replication before and after).

Kafka Broker Loss and Partition Re-election

# 1. Identify the offline broker
kafka-broker-api-versions.sh --bootstrap-server broker1:9092

# 2. Check under-replicated partitions — should converge to 0 as replication catches up
kafka-topics.sh --bootstrap-server broker1:9092 \
  --describe --under-replicated-partitions

# 3. If broker is permanently lost, trigger preferred-replica election
# to move leaders back after a replacement broker comes online
kafka-leader-election.sh --bootstrap-server broker1:9092 \
  --election-type PREFERRED \
  --all-topic-partitions

# 4. Verify consumer group lag is recovering
kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
  --describe --group payments-consumer-group

# 5. If lag is growing (consumers cannot keep up after rebalance), scale out:
# kubectl -n kafka scale deployment payments-consumer --replicas=6

Consumer Group Lag Spike (not caused by broker loss)

Isolate the root cause: slow consumer processing, producer throughput spike, or a poison-pill message causing repeated deserialization errors.
Check consumer logs for DeserializationException or repeated rebalances. Repeated rebalances indicate consumers exceeding max.poll.interval.ms (default 5 min) — increase it or reduce batch size.
For a poison-pill: identify the offset with kafka-get-offsets.sh, then use a one-off consumer with auto.offset.reset=none and a seek call to skip past it. Never delete the message — route it to a dead-letter topic.
Scale consumers horizontally up to the number of partitions. Beyond that count, additional consumer instances idle and waste resources.

Never decrease a topic partition count. Kafka does not support repartitioning downward — you must create a new topic, migrate consumers, and drain the old one. Plan partition counts at creation time based on your target throughput: a single partition handles roughly 10-50 MB/s depending on message size and compression codec.

Section 4: Capacity and Scaling Decision Tree

The runbook includes a scaling decision tree that on-call engineers follow before spinning up new infrastructure. This prevents both under-provisioning (platform degradation) and over-provisioning (wasted spend).

Redis scaling triggers:

Memory above 75% for 30 min: First audit key TTLs and eviction policy. If growth is legitimate, add a shard: redis-cli --cluster add-node new_host:6379 existing_host:6379 followed by --cluster rebalance.
CPU above 70% on the primary: Redis is single-threaded per data plane. Scale horizontally (more shards) not vertically. Identify hot keys first with redis-cli --hotkeys or redis-cli OBJECT FREQ key (requires maxmemory-policy allkeys-lfu).
Read latency above 5 ms p99: Route reads to replicas. Ensure the client is configured with READONLY on replica connections.

Kafka scaling triggers:

Broker disk above 70%: Reduce log.retention.hours or increase broker count. Reassign partitions with kafka-reassign-partitions.sh to spread data across the fleet.
Network throughput above 80% of NIC capacity: Add brokers and rebalance partition leaders. For AWS MSK, resize the broker instance type.
Producer latency above 50 ms: Check request.required.acks. If acks=all, check ISR size — if ISR has shrunk to 1, one slow replica is blocking all acks. Identify the lagging replica with kafka-topics.sh --under-min-isr-partitions.

Maintain a capacity baseline document updated monthly. Record peak throughput (MB/s), p99 latency, memory utilization, and consumer lag for each component. Without a baseline you cannot distinguish a real anomaly from normal traffic growth during an incident — and that ambiguity costs precious MTTR minutes.

Section 5: Runbook Maintenance and Drill Schedule

A runbook that is never tested is a liability, not an asset. Integrate the following into your team engineering calendar:

Monthly chaos drill: Kill one Redis replica in staging, verify Sentinel promotes correctly, verify application RTO. Kill one Kafka broker, verify lag converges within SLO. Use the Chaos Engineering principles from the earlier tutorial in this course — document the steady-state hypothesis before each drill.
Quarterly failover test: Perform a full Redis primary failover in production during a low-traffic window. This surfaces configuration drift (firewall rules, client library versions, DNS TTL settings) that only appear under real conditions.
Runbook review on every incident: After each post-mortem, update the affected runbook section within 48 hours. The on-call engineer who ran the incident owns this task — they are the subject-matter expert until the next rotation.
Alert fatigue audit: Quarterly, review firing frequency of every alert. An alert that fires more than twice per week without a P1 is either miscalibrated or pointing at a systemic problem that should be fixed, not silenced.

The measure of a mature messaging platform operation is not that incidents never happen — it is that every incident is shorter, better-contained, and better-documented than the last. This runbook is the operational contract between your platform and every engineer who depends on it.