Site Reliability Engineering (SRE)

Capacity & Demand Forecasting

Running out of capacity in production is a reliability incident — one that your error budget pays for whether or not your code is correct. SRE teams treat capacity planning as a first-class reliability practice: predict demand before it hits, provision headroom so normal growth never triggers an outage, and validate launch traffic before it reaches your users at scale. This lesson covers the three pillars of production-grade capacity work: organic growth modeling, launch surge planning, and headroom policy.

Why Capacity Failures Are Different

Most reliability problems are code bugs or configuration errors — they appear suddenly and usually disappear when you roll back. Capacity problems are different: they creep in slowly, they look healthy until the moment they do not, and rolling back code does not help once a node is at 100 % CPU. The blast radius is also wider — a single overloaded database primary can cascade into full service unavailability for millions of users.

Google's SRE book treats capacity planning as a continuous process that runs in parallel with software development, not something you do once a quarter. The key insight is that headroom is not wasted resources; it is your reliability buffer.

Pillar 1: Organic Growth Modeling

Organic growth is the slow, predictable increase in traffic driven by user acquisition, seasonal patterns, and product expansion. You model it by extracting historical request rates and fitting a trend — typically linear or exponential depending on the product lifecycle stage.

The standard SRE workflow is to export a key capacity signal (RPS, QPS, active connections) from Prometheus and project it forward. A simple approach uses the PromQL predict_linear() function, which fits a least-squares line through the samples in a range and extrapolates forward. It expects a gauge, so a counter such as http_requests_total must first be converted to a rate. The subquery below samples the 5-minute request rate once an hour over the last 7 days:

# Project the request rate 30 days into the future,
# using the last 7 days of hourly rate samples as the baseline trend
predict_linear(
  sum(rate(http_requests_total{job="api-server"}[5m]))[7d:1h],
  30 * 24 * 3600
)

This query returns an instant vector whose single sample estimates where your RPS will be in 30 days if the current trend continues. Run the same kind of projection against your other primary capacity signals (CPU saturation, memory usage, and storage fill rate) and compare the projected values against your resource limits.
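The same least-squares projection can be reproduced offline to sanity-check a dashboard value. The sketch below is a minimal illustration in plain Python, not Prometheus's implementation: the function names and the sample data are hypothetical, and it extrapolates from the last sample rather than from the query evaluation time.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def project(xs, ys, steps_ahead):
    """Extrapolate the fitted line steps_ahead units past the last sample."""
    slope, intercept = fit_line(xs, ys)
    return slope * (xs[-1] + steps_ahead) + intercept


# Hypothetical daily-average RPS for one week, growing by 10 RPS per day.
days = list(range(7))
rps = [1000 + 10 * d for d in days]

projected_rps = project(days, rps, steps_ahead=30)
print(f"Projected RPS in 30 days: {projected_rps:.0f}")
```

A straight line is only as good as the trend it is fitted to, so compare the projection with actuals regularly rather than treating it as a commitment.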

Signal selection matters. Request rate is a leading indicator; CPU and memory are lagging. Build your growth model on the metric that most directly drives resource consumption for your workload — for a stateless API it is RPS; for a database it is write throughput and active connections; for a blob store it is bytes stored.

For production forecasting, export daily aggregates to an analytical datastore (BigQuery, ClickHouse) and use a proper statistical library. Seasonal methods such as Holt-Winters exponential smoothing capture the weekly rhythm in which traffic drops on weekends. That pattern is real and should not be flattened by a naive linear fit.
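To make the seasonality point concrete, here is a minimal sketch of additive Holt-Winters smoothing in plain Python. It is illustrative only: the function name, the smoothing parameters, and the traffic numbers are assumptions, the initialisation is a simple heuristic, and a production forecast would normally come from a maintained statistical library rather than hand-rolled code.

```python
def holt_winters_additive(series, season_length, alpha, beta, gamma, horizon):
    """Additive Holt-Winters: returns `horizon` forecasts past the end of `series`."""
    m = season_length
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    # Initialise level, trend and seasonal offsets from the first two seasons.
    first_mean = sum(series[:m]) / m
    second_mean = sum(series[m:2 * m]) / m
    level = first_mean
    trend = (second_mean - first_mean) / m
    seasonal = [series[i] - first_mean for i in range(m)]

    # Update the three components one observation at a time.
    for t in range(m, len(series)):
        prev_level = level
        level = alpha * (series[t] - seasonal[t % m]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (series[t] - level) + (1 - gamma) * seasonal[t % m]

    n = len(series)
    return [level + h * trend + seasonal[(n + h - 1) % m]
            for h in range(1, horizon + 1)]


# Hypothetical daily peak RPS: flat weekdays, quieter weekends, four weeks.
week = [1000, 1000, 1000, 1000, 1000, 600, 500]
history = week * 4
forecast = holt_winters_additive(history, season_length=7,
                                 alpha=0.3, beta=0.1, gamma=0.2, horizon=7)
print([round(v) for v in forecast])
```

Unlike a straight-line fit, the forecast keeps the weekend dip, so weekday peaks are not averaged down when you size capacity.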

Pillar 2: Launch Surge Planning

Launches break services. A major product launch, a viral campaign, or an App Store feature can spike traffic by 5–20× in minutes. Your organic growth model is useless here — you need a separate launch-traffic process.

The SRE production readiness review (PRR) gate includes a mandatory launch capacity section with three inputs:

Expected peak QPS: Provided by product/marketing based on user acquisition projections. Apply a safety multiplier (1.5×–2×) because these estimates are almost always wrong on the low side.
Per-request resource cost: Measured via load test — CPU seconds per request, memory allocation, downstream fan-out calls. Do not estimate; measure it under realistic load.
Time to scale: How long does your autoscaler take to add capacity? For Kubernetes HPA with cold container images and JVM warmup, a realistic scale-out time is 3–5 minutes. During that window you must absorb the burst from existing headroom.
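The three inputs combine into a simple sizing calculation. The sketch below is one way to do that arithmetic, under stated assumptions: the function names and every number in the example are hypothetical, CPU is taken to be the only bottleneck, and the 70 % utilization ceiling is borrowed from the headroom targets later in this lesson.

```python
import math


def replicas_for_launch(expected_peak_qps, safety_multiplier,
                        cpu_seconds_per_request, cores_per_replica,
                        max_utilization):
    """Replicas needed to serve the padded peak while staying under max_utilization."""
    planned_qps = expected_peak_qps * safety_multiplier
    cores_needed = planned_qps * cpu_seconds_per_request
    usable_cores_per_replica = cores_per_replica * max_utilization
    return math.ceil(cores_needed / usable_cores_per_replica)


def must_prescale(current_replicas, needed_replicas, ramp_seconds, scale_out_seconds):
    """True when the spike arrives faster than the autoscaler can add capacity."""
    return current_replicas < needed_replicas and ramp_seconds < scale_out_seconds


# Hypothetical launch: 5,000 QPS estimate padded 2x, 20 ms of CPU per request,
# 4-core replicas, traffic ramping in 2 minutes against a 5-minute scale-out.
needed = replicas_for_launch(
    expected_peak_qps=5000, safety_multiplier=2.0,
    cpu_seconds_per_request=0.02, cores_per_replica=4, max_utilization=0.70)
prescale = must_prescale(current_replicas=30, needed_replicas=needed,
                         ramp_seconds=120, scale_out_seconds=300)
print(needed, prescale)
```

When must_prescale returns True, the autoscaler cannot be relied on for the launch window and the replicas should be added ahead of time.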

Run a traffic simulation before every major launch. At minimum, use a load test that ramps from baseline to 2× expected peak and measures latency percentiles, error rate, and resource saturation. Tools like k6, vegeta, and locust are standard for this:

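// k6 ramp test simulating a launch spike
// Save as launch-test.js and run: k6 run launch-test.js
// Note: `target` is a number of virtual users (VUs), not requests per second;
// with sleep(1), each VU sends at most one request per second.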
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // warm up to baseline
    { duration: '5m', target: 100 },   // hold baseline
    { duration: '2m', target: 2000 },  // ramp to 20x spike
    { duration: '5m', target: 2000 },  // hold peak
    { duration: '2m', target: 0 },     // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],  // 99th percentile < 500ms
    http_req_failed:   ['rate<0.01'],   // error rate < 1%
  },
};

export default function () {
  const res = http.get('https://staging.api.example.com/v1/feed');
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(1);
}

Never load-test production directly. Run against a staging environment that mirrors production topology (same instance types, same database sizes, same downstream dependencies). A load test against production that goes wrong is itself an incident.

Pillar 3: Headroom Policy

Headroom is the fraction of capacity you reserve above your expected peak. Without an explicit policy, teams drift toward full utilization because spare capacity looks like waste to infrastructure cost reviewers. SRE enforces headroom as a reliability invariant.

Typical headroom targets vary by tier. Treat them as starting points and tune them against your own load tests:

Stateless compute (Kubernetes pods, Lambda): Keep peak CPU and memory below 70 % of cluster capacity. The remaining 30 % absorbs organic spikes, single-node failures (N+1), and autoscaler lag.
Databases (primaries): Keep peak CPU below 50 % and IOPS below 60 %. Databases degrade non-linearly near saturation — a primary at 80 % CPU under normal load will fall over during a backup window or a slow query incident.
Storage: Alert at 70 % fill; stop accepting new data at 80 %. Storage is typically not auto-scaled, so nothing adds capacity automatically as a volume fills, and expanding it by hand is slow.
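The targets above can also be kept as data and checked mechanically during a review. This is a minimal sketch assuming peak utilization has already been collected as fractions of capacity; the signal names, the function, and the sample readings are hypothetical.

```python
# Peak-utilization ceilings from the tiers above, as fractions of capacity.
HEADROOM_POLICY = {
    "stateless_cpu": 0.70,
    "stateless_memory": 0.70,
    "db_primary_cpu": 0.50,
    "db_primary_iops": 0.60,
    "storage_fill": 0.70,  # alert threshold; the 80 % stop is handled separately
}


def headroom_violations(peak_utilization, policy=HEADROOM_POLICY):
    """Return {signal: (observed, ceiling)} for every signal above its ceiling."""
    return {
        signal: (observed, policy[signal])
        for signal, observed in peak_utilization.items()
        if signal in policy and observed > policy[signal]
    }


# Hypothetical peak readings for one service.
observed_peaks = {
    "stateless_cpu": 0.64,
    "db_primary_cpu": 0.58,
    "storage_fill": 0.73,
}
violations = headroom_violations(observed_peaks)
for signal, (observed, ceiling) in violations.items():
    print(f"{signal}: peak {observed:.0%} exceeds ceiling {ceiling:.0%}")
```

Keeping the ceilings in one dictionary means a threshold change is a reviewable one-line diff.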

Operationalize headroom by writing Prometheus alerting rules that fire before you hit the limit, not after:

# prometheus/rules/capacity.yaml
groups:
  - name: capacity.headroom
    rules:
      - alert: ComputeHeadroomLow
        expr: |
          (
            sum by (cluster) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
            /
            sum by (cluster) (kube_node_status_allocatable{resource="cpu"})
          ) > 0.70
        for: 15m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "Cluster {{ $labels.cluster }} CPU headroom below 30%"
          description: "CPU utilization is {{ $value | humanizePercentage }}. Provision additional nodes before headroom is exhausted."

      - alert: DatabasePrimaryHighCPU
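        # Assumes node_exporter runs on the database host and is scraped
        # under job="mysql-primary" (hypothetical label; adjust to your setup).
        expr: |
          (
            1 - avg by (instance) (
              rate(node_cpu_seconds_total{mode="idle", job="mysql-primary"}[5m])
            )
          ) > 0.50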
        for: 10m
        labels:
          severity: warning
          team: sre-dba
        annotations:
          summary: "DB primary {{ $labels.instance }} CPU exceeds 50% headroom threshold"

Putting It Together: The Capacity Review Cadence

Capacity planning is not a one-off exercise. Mature SRE teams run a quarterly capacity review that produces provisioning tickets before the need is urgent:

Extract actuals: Pull 90-day traffic trends for all capacity signals.
Project 6 months forward: Apply growth model; add launch estimates from the product roadmap.
Compare to provisioned capacity minus headroom policy: Identify the date each resource hits the headroom threshold.
Raise provisioning tickets: Large capacity requests, such as quota increases, capacity reservations, or hardware orders, can require 6–12 weeks of lead time. Do not order on the day you need it.
Revisit after every major launch: Launches change your growth trajectory. Organic modeling must be re-baselined on post-launch actuals.
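The gap-analysis and ticketing steps reduce to date arithmetic. The sketch below is a simplified illustration that assumes linear growth; the function names, the storage figures, and the review date are hypothetical, and the lead time uses the upper end of the range mentioned in the steps above.

```python
from datetime import date, timedelta


def days_until_threshold(current, daily_growth, capacity, threshold):
    """Whole days before usage crosses capacity * threshold (None if not growing)."""
    limit = capacity * threshold
    if current >= limit:
        return 0
    if daily_growth <= 0:
        return None
    return int((limit - current) // daily_growth)


def order_deadline(today, days_left, lead_time_days):
    """Latest date to raise the provisioning ticket so capacity lands in time."""
    return today + timedelta(days=days_left - lead_time_days)


# Hypothetical storage pool: 40 TB used of 100 TB, growing 0.25 TB/day,
# with a 70 % headroom threshold and a 12-week (84-day) lead time.
days_left = days_until_threshold(current=40, daily_growth=0.25,
                                 capacity=100, threshold=0.70)
deadline = order_deadline(date(2025, 1, 1), days_left, lead_time_days=84)
print(days_left, deadline)
```

A deadline that is already in the past means the resource will cross its threshold before new capacity can arrive, which is the case the quarterly review exists to prevent.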

Capacity planning feedback loop: trend data feeds the growth model, gap analysis against the headroom policy drives provisioning, and every launch re-baselines the model.

Treat capacity as code. Store your headroom thresholds, alert rules, and provisioning runbooks in version control alongside your application code. When a service is handed from one team to another, the new team inherits not just the service but its capacity commitments. A Git history tells them why a threshold is 50 % and not 70 %.

Common Production Failure Modes

The silent fill: Storage or connection pools that fill slowly over months, with no alerting until the service falls over at 100 %.
Autoscaler doesn't keep up: HPA scales on CPU but your bottleneck is database connections. The pods scale out; the DB falls over because connection count grows with pod count.
Launch traffic underestimated by 10×: Marketing sends an email blast to 10 M users at 09:00 on Monday. Even with autoscaling, cold-start latency causes a queue backup that cascades into timeouts.
Reserved capacity expired: A capacity reservation lapses unnoticed, and when the service next needs to scale out, on-demand capacity is unavailable in the target region due to a capacity event. Always check reservation expiry dates in your quarterly review.
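The autoscaler failure mode above is easy to check with arithmetic before it happens. This is a minimal sketch assuming every pod opens a fixed-size connection pool; the function name and all the numbers are hypothetical.

```python
def max_pods_for_database(db_max_connections, pool_size_per_pod,
                          reserved_connections=0):
    """Largest pod count whose combined pools fit within the database limit."""
    usable = db_max_connections - reserved_connections
    return usable // pool_size_per_pod


# Hypothetical service: the DB accepts 500 connections, 20 are kept back for
# admin and replication, and each pod opens a pool of 20 connections.
pod_ceiling = max_pods_for_database(db_max_connections=500,
                                    pool_size_per_pod=20,
                                    reserved_connections=20)
hpa_max_replicas = 40  # hypothetical autoscaler maximum
hpa_is_safe = hpa_max_replicas <= pod_ceiling
print(pod_ceiling, hpa_is_safe)
```

If the autoscaler maximum exceeds the ceiling, either lower the maximum, shrink the per-pod pool, or put a connection pooler in front of the database.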

Capacity planning is one of the highest-leverage activities an SRE team performs. Every minute spent building an accurate model and enforcing headroom policy is a minute not spent on-call during a capacity incident.