Caching & Messaging Infrastructure

Capacity & Scaling Data Infrastructure

18 min Lesson 9 of 30

Capacity & Scaling Data Infrastructure

Sizing Redis and Kafka correctly — and knowing when and how to scale them — is one of the highest-leverage skills a production engineer can have. Get it wrong and you face silent data loss, cascading latency spikes, or a six-hour cluster migration at 2 AM. This lesson teaches the sizing models, scaling patterns, and upgrade strategies used at companies running hundreds of terabytes and millions of messages per second.

Redis: Capacity Sizing

Redis is an in-memory store, so every byte counts. The sizing formula is straightforward:

  • Working set size: estimate the total byte footprint of all live keys using the OBJECT ENCODING command and Redis DEBUG JMAP (or the redis-memory-analyzer tool). Do not rely on INFO memory alone — used_memory_rss always exceeds used_memory due to fragmentation.
  • Fragmentation ratio: target mem_fragmentation_ratio between 1.0 and 1.5. Ratios above 2.0 on Redis 6+ trigger active defragmentation (activedefrag yes) automatically, but you still need headroom.
  • Headroom rule: provision at 60–70 % peak utilisation. A fully loaded Redis node cannot accept a BGSAVE fork (needs ~100 % of used_memory for CoW pages) without swapping, which destroys latency.
# Memory footprint of the live dataset redis-cli INFO memory | grep -E 'used_memory_human|used_memory_rss_human|mem_fragmentation_ratio|maxmemory_human' # Per-key-space row count and avg TTL redis-cli INFO keyspace # Estimate memory for a sample key pattern (redis-memory-analyzer pip package) rma scan -p 'session:*' --host 127.0.0.1 --port 6379

For eviction sizing, set maxmemory to 75 % of available RAM and choose an eviction policy before you need it. The big-tech default for cache workloads is allkeys-lru; for session stores use volatile-lru so only TTL-bearing keys are eligible. Never leave maxmemory at 0 on a shared host — OOM Killer will terminate the process with no warning.

Redis: Scaling Patterns

There are three axes of scale for Redis:

  1. Vertical scaling: move to a larger instance. Straightforward but limited. Above ~256 GB RAM, fork-based BGSAVE becomes painfully slow; consider switching to RDB-disabled persistence + Replica AOF.
  2. Read scaling via replicas: route read-only commands to replicas with client-side read preference or a proxy (Envoy, Twemproxy). Replicas add replication lag; never route reads that require linearisability to a replica.
  3. Horizontal sharding via Redis Cluster: 16,384 hash slots distributed across N primaries. Minimum viable cluster is 3 primaries + 3 replicas. Target ≤200 GB of working set per primary shard at the 70 % headroom rule. Resharding is online but CPU-expensive — schedule during low-traffic windows and watch cluster_stats_messages_sent for backpressure.
Redis Cluster Scaling Topology Client (cluster-aware) Primary A Slots 0–5460 Primary B Slots 5461–10922 Primary C Slots 10923–16383 Replica A Replica B Replica C Hash-slot routing · dashed = async replication
Redis Cluster: 3 primary shards covering all 16,384 hash slots, each with one async replica.
Cluster resharding in practice: use redis-cli --cluster reshard <host>:<port> interactively, or automate with --cluster-from all --cluster-to <target-id> --cluster-slots <N> --cluster-yes. Run during off-peak. Monitor cluster_state and cluster_slots_assigned in INFO cluster throughout. A failed reshard leaves slots in migrating state — finish or abort with CLUSTER SETSLOT <slot> STABLE.

Kafka: Capacity Sizing

Kafka sizing has three independent dimensions: throughput, storage, and partitions.

  • Throughput: a single Kafka broker on modern hardware handles 200–600 MB/s of aggregate write throughput. Measure actual producer byte-rate with kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec. The bottleneck is almost always the NIC or disk write bandwidth, not CPU.
  • Storage: total bytes = message_rate × avg_message_size × retention_hours × replication_factor. Add 20 % for index files. Use kafka-log-dirs.sh to measure actual disk usage per topic/partition.
  • Partition count: start with max(target_throughput / per_partition_throughput, consumer_parallelism). A single partition saturates at roughly 10–100 MB/s depending on consumer count and compression. Over-partitioning has real costs: more open file descriptors, longer leader election, higher ZooKeeper/KRaft metadata load. Google and LinkedIn historically cap partitions-per-broker at 4,000; above that, broker restarts become painful.
# Measure topic throughput and consumer lag kafka-consumer-groups.sh --bootstrap-server kafka:9092 \ --describe --group my-consumer-group # Per-broker disk usage breakdown kafka-log-dirs.sh --bootstrap-server kafka:9092 \ --topic-list orders,payments --describe \ | grep -E 'size|offsetLag' # Estimate storage for a topic (bash snippet) MSG_RATE=50000 # messages/sec AVG_SIZE=512 # bytes RETENTION=604800 # 7 days in seconds REP_FACTOR=3 echo $(( MSG_RATE * AVG_SIZE * RETENTION * REP_FACTOR / 1073741824 )) GB

Kafka: Scaling Patterns

Kafka scales horizontally by adding brokers and rebalancing partition leadership. The process:

  1. Add new brokers to the cluster (they join with empty logs).
  2. Run the partition reassignment tool (kafka-reassign-partitions.sh) to generate and execute a reassignment plan that spreads replicas across the enlarged broker set.
  3. Monitor replication progress with kafka-reassign-partitions.sh --verify. Network throttle the replication to avoid starving producers: set leader.replication.throttled.rate and follower.replication.throttled.rate (bytes/sec) on affected topics before starting.
# Step 1: generate reassignment plan cat topics-to-move.json # {"version":1,"topics":[{"topic":"orders"},{"topic":"payments"}]} kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \ --broker-list "1,2,3,4,5" \ --topics-to-move-json-file topics-to-move.json \ --generate > reassignment-plan.json # Step 2: throttle and execute kafka-configs.sh --bootstrap-server kafka:9092 \ --alter --entity-type brokers --entity-default \ --add-config 'leader.replication.throttled.rate=104857600,follower.replication.throttled.rate=104857600' kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \ --reassignment-json-file reassignment-plan.json \ --execute # Step 3: verify and remove throttle when done kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \ --reassignment-json-file reassignment-plan.json \ --verify kafka-configs.sh --bootstrap-server kafka:9092 \ --alter --entity-type brokers --entity-default \ --delete-config 'leader.replication.throttled.rate,follower.replication.throttled.rate'
Kafka Broker Expansion - Before and After Before (3 brokers) Broker 1 P0 P3 P6 Broker 2 P1 P4 P7 Broker 3 P2 P5 P8 9 partitions across 3 brokers add broker After (4 brokers) Broker 1 P0 P6 Broker 2 P1 P7 Broker 3 P2 P5 Broker 4 P3 P4 P8 balanced across 4 brokers
Kafka broker expansion: adding a fourth broker and redistributing partition leadership to rebalance load.
Partition count is immutable in Kafka (you can only increase, never decrease). Adding partitions to a topic breaks message ordering guarantees for keyed producers — consumers that assumed all messages for a key land on the same partition will see interleaving after the change. Plan partition counts for 2–3× future throughput before going live, or use a compacted changelog topic pattern to rebase ordering.

Upgrade Strategies

Both Redis and Kafka have well-defined rolling upgrade paths. The cardinal rule is: never skip major versions.

Redis rolling upgrade: upgrade replicas first, then failover to a replica (using FAILOVER command in Redis 6.2+ or CLUSTER FAILOVER), then upgrade the former primary. For Sentinel setups, trigger a manual failover with SENTINEL FAILOVER <master-name> after upgrading all replicas.

Kafka rolling upgrade: upgrade one broker at a time. Set inter.broker.protocol.version and log.message.format.version to the current version before starting so the cluster stays protocol-compatible during the transition. Only bump those values once all brokers are on the new version. This is the same pattern used for KRaft migrations: run in mixed mode (ZooKeeper + KRaft controllers) until all brokers are migrated, then cut over.

# Kafka: safe rolling upgrade sequence (broker-by-broker) # 1. On each broker node, before upgrade: kafka-configs.sh --bootstrap-server kafka:9092 \ --entity-type brokers --entity-name 3 \ --describe | grep protocol.version # 2. Restart broker 3 with new binary; watch under-replicated partitions kafka-topics.sh --bootstrap-server kafka:9092 \ --describe --under-replicated-partitions # 3. Wait for URP count to hit 0 before moving to broker 4 # 4. After ALL brokers upgraded, bump protocol versions: kafka-configs.sh --bootstrap-server kafka:9092 \ --entity-type brokers --entity-default \ --alter --add-config 'inter.broker.protocol.version=3.7,log.message.format.version=3.7'

Capacity Planning in Practice

Treat your data infrastructure capacity exactly as you treat Kubernetes node capacity: model it, alert on it, and plan upgrades before you hit 70 % utilisation. Key metrics to track in your observability stack (you already have Prometheus/Grafana from earlier in the course):

  • Redis: redis_memory_used_bytes / redis_memory_max_bytes, redis_evicted_keys_total, redis_connected_clients
  • Kafka: kafka_server_brokertopicmetrics_bytesinpersec, kafka_log_log_size, kafka_controller_kafkacontroller_activecontrollercount, consumer group lag via kafka_consumer_group_lag
The most common production capacity surprise is Kafka topic log growth during a consumer outage. If a consumer group stops consuming, retention keeps accumulating. Set topic-level retention.bytes (not just retention.ms) as a hard backstop so a runaway topic cannot fill your disks while the on-call engineer is paged.