Caching & Messaging Infrastructure

Capacity & Scaling Data Infrastructure

18 min Lesson 9 of 30

Capacity & Scaling Data Infrastructure

Sizing Redis and Kafka correctly — and knowing when and how to scale them — is one of the highest-leverage skills a production engineer can have. Get it wrong and you face silent data loss, cascading latency spikes, or a six-hour cluster migration at 2 AM. This lesson teaches the sizing models, scaling patterns, and upgrade strategies used at companies running hundreds of terabytes and millions of messages per second.

Redis: Capacity Sizing

Redis is an in-memory store, so every byte counts. The sizing formula is straightforward:

Working set size: estimate the total byte footprint of all live keys using the OBJECT ENCODING command and Redis DEBUG JMAP (or the redis-memory-analyzer tool). Do not rely on INFO memory alone — used_memory_rss always exceeds used_memory due to fragmentation.
Fragmentation ratio: target mem_fragmentation_ratio between 1.0 and 1.5. Ratios above 2.0 on Redis 6+ trigger active defragmentation (activedefrag yes) automatically, but you still need headroom.
Headroom rule: provision at 60–70 % peak utilisation. A fully loaded Redis node cannot accept a BGSAVE fork (needs ~100 % of used_memory for CoW pages) without swapping, which destroys latency.

# Memory footprint of the live dataset
redis-cli INFO memory | grep -E 'used_memory_human|used_memory_rss_human|mem_fragmentation_ratio|maxmemory_human'

# Per-key-space row count and avg TTL
redis-cli INFO keyspace

# Estimate memory for a sample key pattern (redis-memory-analyzer pip package)
rma scan -p 'session:*' --host 127.0.0.1 --port 6379

For eviction sizing, set maxmemory to 75 % of available RAM and choose an eviction policy before you need it. The big-tech default for cache workloads is allkeys-lru; for session stores use volatile-lru so only TTL-bearing keys are eligible. Never leave maxmemory at 0 on a shared host — OOM Killer will terminate the process with no warning.

Redis: Scaling Patterns

There are three axes of scale for Redis:

Vertical scaling: move to a larger instance. Straightforward but limited. Above ~256 GB RAM, fork-based BGSAVE becomes painfully slow; consider switching to RDB-disabled persistence + Replica AOF.
Read scaling via replicas: route read-only commands to replicas with client-side read preference or a proxy (Envoy, Twemproxy). Replicas add replication lag; never route reads that require linearisability to a replica.
Horizontal sharding via Redis Cluster: 16,384 hash slots distributed across N primaries. Minimum viable cluster is 3 primaries + 3 replicas. Target ≤200 GB of working set per primary shard at the 70 % headroom rule. Resharding is online but CPU-expensive — schedule during low-traffic windows and watch cluster_stats_messages_sent for backpressure.

Redis Cluster: 3 primary shards covering all 16,384 hash slots, each with one async replica.

Cluster resharding in practice: use redis-cli --cluster reshard <host>:<port> interactively, or automate with --cluster-from all --cluster-to <target-id> --cluster-slots <N> --cluster-yes. Run during off-peak. Monitor cluster_state and cluster_slots_assigned in INFO cluster throughout. A failed reshard leaves slots in migrating state — finish or abort with CLUSTER SETSLOT <slot> STABLE.

Kafka: Capacity Sizing

Kafka sizing has three independent dimensions: throughput, storage, and partitions.

Throughput: a single Kafka broker on modern hardware handles 200–600 MB/s of aggregate write throughput. Measure actual producer byte-rate with kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec. The bottleneck is almost always the NIC or disk write bandwidth, not CPU.
Storage: total bytes = message_rate × avg_message_size × retention_hours × replication_factor. Add 20 % for index files. Use kafka-log-dirs.sh to measure actual disk usage per topic/partition.
Partition count: start with max(target_throughput / per_partition_throughput, consumer_parallelism). A single partition saturates at roughly 10–100 MB/s depending on consumer count and compression. Over-partitioning has real costs: more open file descriptors, longer leader election, higher ZooKeeper/KRaft metadata load. Google and LinkedIn historically cap partitions-per-broker at 4,000; above that, broker restarts become painful.

# Measure topic throughput and consumer lag
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group my-consumer-group

# Per-broker disk usage breakdown
kafka-log-dirs.sh --bootstrap-server kafka:9092 \
  --topic-list orders,payments --describe \
  | grep -E 'size|offsetLag'

# Estimate storage for a topic (bash snippet)
MSG_RATE=50000          # messages/sec
AVG_SIZE=512            # bytes
RETENTION=604800        # 7 days in seconds
REP_FACTOR=3
echo $(( MSG_RATE * AVG_SIZE * RETENTION * REP_FACTOR / 1073741824 )) GB

Kafka: Scaling Patterns

Kafka scales horizontally by adding brokers and rebalancing partition leadership. The process:

Add new brokers to the cluster (they join with empty logs).
Run the partition reassignment tool (kafka-reassign-partitions.sh) to generate and execute a reassignment plan that spreads replicas across the enlarged broker set.
Monitor replication progress with kafka-reassign-partitions.sh --verify. Network throttle the replication to avoid starving producers: set leader.replication.throttled.rate and follower.replication.throttled.rate (bytes/sec) on affected topics before starting.

# Step 1: generate reassignment plan
cat topics-to-move.json
# {"version":1,"topics":[{"topic":"orders"},{"topic":"payments"}]}

kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --broker-list "1,2,3,4,5" \
  --topics-to-move-json-file topics-to-move.json \
  --generate > reassignment-plan.json

# Step 2: throttle and execute
kafka-configs.sh --bootstrap-server kafka:9092 \
  --alter --entity-type brokers --entity-default \
  --add-config 'leader.replication.throttled.rate=104857600,follower.replication.throttled.rate=104857600'

kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --reassignment-json-file reassignment-plan.json \
  --execute

# Step 3: verify and remove throttle when done
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --reassignment-json-file reassignment-plan.json \
  --verify

kafka-configs.sh --bootstrap-server kafka:9092 \
  --alter --entity-type brokers --entity-default \
  --delete-config 'leader.replication.throttled.rate,follower.replication.throttled.rate'

Kafka broker expansion: adding a fourth broker and redistributing partition leadership to rebalance load.

Partition count is immutable in Kafka (you can only increase, never decrease). Adding partitions to a topic breaks message ordering guarantees for keyed producers — consumers that assumed all messages for a key land on the same partition will see interleaving after the change. Plan partition counts for 2–3× future throughput before going live, or use a compacted changelog topic pattern to rebase ordering.

Upgrade Strategies

Both Redis and Kafka have well-defined rolling upgrade paths. The cardinal rule is: never skip major versions.

Redis rolling upgrade: upgrade replicas first, then failover to a replica (using FAILOVER command in Redis 6.2+ or CLUSTER FAILOVER), then upgrade the former primary. For Sentinel setups, trigger a manual failover with SENTINEL FAILOVER <master-name> after upgrading all replicas.

Kafka rolling upgrade: upgrade one broker at a time. Set inter.broker.protocol.version and log.message.format.version to the current version before starting so the cluster stays protocol-compatible during the transition. Only bump those values once all brokers are on the new version. This is the same pattern used for KRaft migrations: run in mixed mode (ZooKeeper + KRaft controllers) until all brokers are migrated, then cut over.

# Kafka: safe rolling upgrade sequence (broker-by-broker)
# 1. On each broker node, before upgrade:
kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type brokers --entity-name 3 \
  --describe | grep protocol.version

# 2. Restart broker 3 with new binary; watch under-replicated partitions
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --under-replicated-partitions

# 3. Wait for URP count to hit 0 before moving to broker 4
# 4. After ALL brokers upgraded, bump protocol versions:
kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type brokers --entity-default \
  --alter --add-config 'inter.broker.protocol.version=3.7,log.message.format.version=3.7'

Capacity Planning in Practice

Treat your data infrastructure capacity exactly as you treat Kubernetes node capacity: model it, alert on it, and plan upgrades before you hit 70 % utilisation. Key metrics to track in your observability stack (you already have Prometheus/Grafana from earlier in the course):

Redis: redis_memory_used_bytes / redis_memory_max_bytes, redis_evicted_keys_total, redis_connected_clients
Kafka: kafka_server_brokertopicmetrics_bytesinpersec, kafka_log_log_size, kafka_controller_kafkacontroller_activecontrollercount, consumer group lag via kafka_consumer_group_lag

The most common production capacity surprise is Kafka topic log growth during a consumer outage. If a consumer group stops consuming, retention keeps accumulating. Set topic-level retention.bytes (not just retention.ms) as a hard backstop so a runaway topic cannot fill your disks while the on-call engineer is paged.