Capacity Planning Fundamentals
Capacity planning is the practice of ensuring your infrastructure can serve expected demand — with enough headroom to absorb spikes — without over-provisioning to the point of wasting money. At hyperscale companies, capacity planning is a formal engineering discipline with dedicated teams, quarterly forecasting cycles, and automated procurement pipelines. For the rest of us, getting the fundamentals right prevents two failure modes that kill reliability: running out of capacity at the worst possible moment, and burning the company's cloud budget on idle instances.
This lesson focuses on the three pillars that sit beneath every autoscaling strategy you will configure in subsequent lessons: demand forecasting, headroom policy, and lead times. Get these wrong and no amount of HPA tuning or Karpenter configuration will save you during a traffic event.
Demand Forecasting
Forecasting answers the question: how much capacity will I need at time T in the future? There are three models, used in combination at mature organizations.
- Trend-based forecasting — fit a curve (linear or exponential) to historical utilization data. Useful for organic growth but blind to business events.
- Event-driven forecasting — overlay known business calendar events: product launches, marketing campaigns, Black Friday, fiscal quarter-end spikes in B2B SaaS. These inputs are non-negotiable: post-mortems that start with "we ran out of capacity" very often trace back to a calendar event that never made it into the forecast.
- Workload-decomposition forecasting — break total demand into its constituent signals (active users × requests/user/s × avg payload). This lets you reason about which services will saturate first and model growth independently per tier.
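The decomposition in the last bullet can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the tier names and every number are hypothetical, chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class TierDemand:
    """Demand signals for one service tier (all values are illustrative)."""
    name: str
    active_users: int            # users actively hitting this tier
    requests_per_user_s: float   # requests per user per second
    avg_payload_bytes: int       # average payload size per request

    @property
    def rps(self) -> float:
        # active users x requests/user/s
        return self.active_users * self.requests_per_user_s

    @property
    def bandwidth_bytes_s(self) -> float:
        # requests/s x average payload
        return self.rps * self.avg_payload_bytes


# Hypothetical tiers; replace with your own measured signals.
tiers = [
    TierDemand("api-gateway", active_users=50_000,
               requests_per_user_s=0.2, avg_payload_bytes=2_000),
    TierDemand("search", active_users=50_000,
               requests_per_user_s=0.05, avg_payload_bytes=20_000),
]

for t in tiers:
    print(f"{t.name}: {t.rps:,.0f} req/s, {t.bandwidth_bytes_s / 1e6:,.1f} MB/s")
```

With these made-up inputs the search tier handles a quarter of the gateway's request rate but moves more bytes per second, which is the kind of per-tier difference that a single aggregate number hides.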
In practice, export 90 days of CPU/memory/RPS from Prometheus, apply a simple linear regression in Python or in your observability platform's forecast function, then add the known event multipliers on top. The goal is a P95 demand curve, not a mean — size for the tail, not the average.
Headroom Policy
Headroom is the gap you intentionally leave between provisioned capacity and expected peak demand. It serves three purposes: absorbing unexpected spikes before autoscaling responds, providing runway while autoscaling acts (a new node typically takes 2–6 minutes to join a Kubernetes cluster), and keeping CPU and memory away from saturation so latency does not degrade before you can scale out.
The right headroom number depends on your scaling speed and your SLO aggressiveness:
- 20 % headroom — minimum viable for services with fast horizontal scaling (<90 s to add a pod that is already scheduled on a warm node). Acceptable for stateless microservices backed by HPA.
- 30–40 % headroom — appropriate when node provisioning is in the path (cluster autoscaling, roughly 2–6 min for standard instance types). This range is often cited as the default for core serving tiers at companies such as Google and Netflix.
- 50 %+ headroom — required for services with long warm-up times (JVM, ML model loading), stateful systems (databases, Kafka brokers), or single-region deployments where losing one AZ instantly shifts its load onto the survivors (losing one of two AZs doubles the load on the other; losing one of three raises it by 50 %).
Headroom is not free: 30 % headroom means you are permanently paying for 1.3x your expected peak demand. The counter-argument, and it is correct, is that the cost of a 30-minute outage during a traffic spike almost always exceeds months of headroom spend. Encode your policy in a runbook so on-call engineers do not under-provision to save money.
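The arithmetic behind a headroom policy is simple enough to write down. The sketch below is illustrative only: both function names are hypothetical, and the AZ calculation assumes load is spread evenly across zones before the failure.

```python
def provisioned_capacity(expected_peak, headroom):
    """Capacity to provision so `headroom` (e.g. 0.30) sits above expected peak."""
    return expected_peak * (1 + headroom)


def headroom_for_az_loss(az_count):
    """Headroom needed for the surviving AZs to absorb one lost AZ.

    Assumes load is evenly spread over N zones. After losing one zone each
    survivor carries N / (N - 1) of its previous load, so the extra fraction
    required is 1 / (N - 1).
    """
    if az_count < 2:
        raise ValueError("need at least two AZs to survive the loss of one")
    return 1 / (az_count - 1)


# Hypothetical service with an expected peak of 10,000 RPS.
print(provisioned_capacity(10_000, 0.30))

for zones in (2, 3, 4):
    print(f"{zones} AZs -> {headroom_for_az_loss(zones):.0%} headroom to survive one AZ loss")
```

Under that even-spread assumption, a three-AZ deployment needs 50 % headroom to survive the loss of one zone, and a two-AZ deployment needs 100 %.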
Lead Times
Lead time is how long it takes to get additional capacity into production. It governs how far ahead you must forecast and how much headroom you must maintain. Ignoring lead times is the single most common capacity planning mistake.
Lead times exist at every layer of the stack:
- Pod scheduling (warm node): 5–30 seconds. Kubernetes scheduler + container pull if not cached. This is the HPA regime — reactive, fast.
- Node provisioning (cluster autoscaler / Karpenter): 2–6 minutes for standard instance types; 10–20 minutes for GPU instances or large bare-metal nodes. This is the layer where the "30 % headroom" rule comes from — you need enough buffer to survive while new nodes join.
- Reserved capacity procurement: On-demand instances are available immediately (subject to quota), but reserved capacity (AWS RIs, Committed Use) requires a 1- or 3-year commitment purchased in advance. Misjudge your reserved baseline and you either overpay for unused commitments or exhaust on-demand quota.
- Hardware procurement (on-prem / colocation): 8–26 weeks for standard servers; 6–12 months for specialized hardware (GPUs, high-memory nodes, custom ASICs). At this scale, capacity planning is a capital expenditure process with finance and procurement stakeholders.
Run the cluster autoscaler with --scale-down-unneeded-time=10m and set a minReplicas floor on your HPAs to maintain this buffer even during off-peak hours.
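Lead time turns into two concrete numbers: the latest date you can request slow capacity, and the buffer you need to ride out a ramp while fast capacity arrives. The sketch below illustrates both. The function names, the need date, and the ramp rate are hypothetical, and the buffer calculation assumes traffic ramps linearly during the lead time.

```python
from datetime import date, timedelta


def order_by_date(need_date, lead_time_weeks, safety_weeks=0):
    """Latest date to submit a capacity request so it lands by need_date."""
    return need_date - timedelta(weeks=lead_time_weeks + safety_weeks)


def buffer_for_ramp(ramp_rps_per_min, lead_time_min):
    """Extra RPS of capacity needed to absorb a linear ramp until new capacity joins."""
    return ramp_rps_per_min * lead_time_min


# Hypothetical hardware request with a 12-week lead time.
need = date(2025, 11, 28)
print("submit by:", order_by_date(need, lead_time_weeks=12))

# Hypothetical ramp of 500 RPS per minute while a node takes 6 minutes to join.
print("buffer RPS:", buffer_for_ramp(ramp_rps_per_min=500, lead_time_min=6))
```

The second function shows why slower layers need more headroom: the same ramp against a 20-minute GPU node lead time would need more than three times the buffer.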
Putting It Together: The Capacity Planning Cycle
Mature organizations run capacity planning as a quarterly cycle, not a reactive fire-drill. The workflow:
- Collect signals — pull 90-day utilization trends from Prometheus/Datadog; add business-calendar events for the next quarter.
- Forecast P50 and P95 demand — per service, per resource type (CPU, memory, network, storage IOPS).
- Apply headroom policy — multiply P95 by your headroom factor (1.3x for most services, higher for stateful tiers).
- Account for lead times — if hardware procurement is in the path, submit requests 12+ weeks before the need date.
- Review autoscaling configurations — validate that HPA targets, VPA recommendations, and Karpenter limits align with the new forecast. Adjust minReplicas floors before the busy season, not during it.
- Document assumptions — so the next on-call engineer understands why you provisioned 140 % of current load.
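The forecast and headroom steps of the cycle can be sketched for a single service and resource as follows. This is a simplified illustration: it uses a nearest-rank percentile over raw samples rather than a forecast model, and the function names and sample data are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def plan(samples, provisioned, headroom_factor=1.3):
    """Return P50, P95, the capacity target (P95 x factor), and any shortfall."""
    p50 = percentile(samples, 50)
    p95 = percentile(samples, 95)
    target = p95 * headroom_factor
    return {
        "p50": p50,
        "p95": p95,
        "target": target,
        "shortfall": max(0.0, target - provisioned),
    }


# Hypothetical CPU-core usage samples for one service: 1, 2, ..., 100 cores.
cpu_cores = list(range(1, 101))
result = plan(cpu_cores, provisioned=120)
print(result)
```

A non-zero shortfall is the signal to act on the remaining steps: check which lead time applies to closing the gap, and record the inputs that produced the target.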