Capacity Planning & Autoscaling

Project: An Autoscaling Strategy

This capstone lesson puts every concept from the tutorial into a single coherent design. You will walk through the end-to-end process of building an autoscaling strategy for a bursty, event-driven SaaS workload — the kind of system that sits at 4,000 req/s most of the day, then spikes to 40,000 req/s within two minutes when a marketing email drops. The goal is not a toy config; it is a production blueprint with real numbers, real failure modes, and the senior judgment calls that separate a reliable system from one that pages you at 2 AM.

The Workload: Anatomy of a Bursty System

Our reference system is a multi-tenant SaaS API with the following profile:

Baseline: 4,000 req/s, avg latency 45 ms, P99 120 ms. CPU at ~35 % on 20 m6i.2xlarge nodes (8 vCPU each = 160 vCPU total).
Burst pattern: marketing campaign emails land at predictable calendar events — typically 09:00 local time for each of 4 geographic regions. The burst multiplier is 8–12x. Duration: 4–8 minutes of peak, 20–30 minutes to drain back to baseline.
Tail latency sensitivity: the SLO is P99 < 500 ms at burst peak. Violating it triggers customer SLA credits.
Mixed pod types: a stateless API tier, an async worker tier (SQS consumers), and a Redis cluster used for idempotency keys. Each tier scales differently.

Design principle: Before writing a single YAML file, model the worst case on paper. The burst is 10x, lasts 6 minutes, and must stay within SLO. Can your autoscaling chain (HPA pod scale-out → Karpenter node provision → pod ready) complete within the first 90 seconds? If not, the strategy must pre-scale proactively instead.
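
The paper model above can be sketched as a small budget check. Only the 90-second budget comes from the text; the per-stage latencies below are hypothetical placeholders that you would replace with measurements from your own cluster.

```javascript
// Sketch: does the reactive scaling chain fit inside the burst budget?
// Sums the latency of each stage and compares it with the budget.
function scalingChainFits(stagesSeconds, budgetSeconds) {
  const total = Object.values(stagesSeconds).reduce((sum, s) => sum + s, 0);
  return {
    total,
    fits: total <= budgetSeconds,
    shortfall: Math.max(0, total - budgetSeconds),
  };
}

// Hypothetical stage latencies (assumptions, not values from the lesson).
const assumedStages = {
  hpaReaction: 15,    // assumed: time for the HPA to see the new load
  nodeProvision: 60,  // assumed: Karpenter launch until the node is Ready
  podReady: 30,       // assumed: image pull plus readiness probe
};

console.log(scalingChainFits(assumedStages, 90));
```

With these assumed numbers the chain takes 105 seconds, 15 seconds over budget, which is the case where proactive pre-scaling is needed.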

Layer 1 — HPA for the API Tier

The API tier is stateless, CPU-bound, and horizontally sharded. The HPA uses a composite metric: custom RPS-per-pod from Prometheus Adapter plus CPU as a safety backstop. Target utilization is set conservatively at 50 % CPU to leave headroom for burst absorption before new pods are ready.

# hpa-api.yaml: production HPA for the API tier
# Key decisions:
#   averageValue 50 rps/pod (4000 rps baseline ÷ 80 pods = 50; matches minReplicas)
#   scaleUp: up to 100% or 100 pods per 30s period; aggressive upscale, pods are cheap
#   scaleDown: slow (stabilizationWindowSeconds: 300) to avoid flapping after the burst drains

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 80         # floor = 4000 rps / 50 rps per pod; never go below this
  maxReplicas: 600        # ceiling = 30000 rps / 50 rps per pod; matches cluster limit
                          # note: a 40000 rps peak needs 800 pods at 50 rps each, so pods run above target at this ceiling
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "50"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react immediately
      policies:
        - type: Percent
          value: 100                     # double pod count each 30s if needed
          periodSeconds: 30
        - type: Pods
          value: 100                     # or add 100 pods at once, whichever is larger
          periodSeconds: 30
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 min after burst before scaling in
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60

With minReplicas: 80, the cluster always has 80 pods scheduled and warm. At burst start, HPA fires within the first scrape interval (15 s) and, with selectPolicy: Max, can add the larger of 100 % or 100 pods every 30 seconds. From 80 to 600 pods takes 3 scale steps (80 → 180 → 360 → 600): about 90 seconds of policy periods, or roughly 2 minutes once metric lag is included, and only if nodes are already warm. This is why the cluster layer matters.
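
The scale-up arithmetic can be checked with a simplified model of the two scaleUp policies. This is a sketch, not the HPA controller's actual algorithm: it assumes the metric demands more pods at every period and ignores pod startup and node provisioning time, so it gives a best case.

```javascript
// Simplified model of scaleUp with selectPolicy: Max.
// Each period, the pod count may grow by `percent` of the current count
// or by `pods` absolute, whichever is larger, capped at `max`.
function scaleUpSteps(start, max, percent, pods) {
  const steps = [start];
  let current = start;
  while (current < max) {
    const byPercent = current + Math.ceil((current * percent) / 100);
    const byPods = current + pods;
    const next = Math.min(max, Math.max(byPercent, byPods));
    if (next <= current) {
      throw new Error('policies do not allow any growth');
    }
    current = next;
    steps.push(current);
  }
  return steps;
}

const steps = scaleUpSteps(80, 600, 100, 100);
console.log(steps.join(' -> '));                 // 80 -> 180 -> 360 -> 600
console.log(`${(steps.length - 1) * 30}s`);      // 90s of policy periods
```

The first step is driven by the Pods policy (80 + 100 beats 80 doubled); after that the Percent policy dominates.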

Layer 2 — Karpenter for the Cluster

Karpenter must provision new nodes faster than the HPA exhausts the existing ones. Two NodePool configurations are used: a baseline pool of on-demand m6i.2xlarge nodes (always warm) and a burst pool of spot m6i.4xlarge / c6i.4xlarge nodes (provisioned on demand, spot for cost, large instance types to minimise node-provision latency).

# nodepool-burst.yaml — Karpenter burst pool for traffic spikes
# Spot instances: ~70% cost savings vs on-demand; acceptable for stateless API pods
# Large instance types: a single m6i.4xlarge provides 16 vCPU — one node runs ~50 pods
# This means 10 new nodes covers 500 new pods; Karpenter can provision all 10 in parallel

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: burst-spot
  # NodePool is cluster-scoped, so metadata.namespace is omitted
spec:
  disruption:
    consolidationPolicy: WhenEmpty      # only consolidate fully empty nodes post-burst
    consolidateAfter: 5m
    budgets:
      - nodes: "30%"                    # safe scale-down rate after burst drains
  limits:
    cpu: "1280"                         # 80 x m6i.4xlarge; hard ceiling
    memory: "5120Gi"
  weight: 10                            # lower priority than on-demand baseline pool
  template:
    metadata:
      labels:
        pool: burst-spot
    spec:
      taints:
        - key: pool
          value: burst-spot
          effect: NoSchedule            # only burst-tolerating pods land here
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.4xlarge", "m6i.8xlarge", "m7i.4xlarge", "c6i.4xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
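      nodeClassRef:
        group: karpenter.k8s.aws        # karpenter.sh/v1 references the class by group, not apiVersion
        kind: EC2NodeClass
        name: default
      # kubelet settings such as maxPods: 110 are set on the EC2NodeClass (spec.kubelet) in karpenter.sh/v1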

The API deployment must tolerate the burst pool taint so Karpenter can schedule pods there. Add tolerations to the Deployment spec: key: pool, operator: Equal, value: burst-spot, effect: NoSchedule. Additionally, use topologySpreadConstraints to spread across all three AZs — if a spot interruption hits one AZ during burst, the remaining pods in the other two AZs absorb the traffic without breaching SLO.

Two-layer autoscaling: HPA fires immediately against warm pods in the baseline pool; Karpenter provisions burst-pool spot nodes in parallel within 90 seconds.

Layer 3 — Queue-Based Scaling for the Worker Tier

The async worker tier processes jobs enqueued by the API (image resizing, webhook delivery, email sends). During a burst, the queue depth spikes before worker capacity catches up. KEDA scales workers directly from SQS queue depth, which is a far more reliable signal than CPU for queue consumers.

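# keda-scaledobject-worker.yaml
# Scale workers based on SQS queue depth, not CPU.
# scaleDown stabilizationWindowSeconds 180: keep workers alive after the burst to drain the backlog
# cooldownPeriod 120s: only governs scale-to-zero, so it has no effect while minReplicaCount is 5
# minReplicaCount 5: never fully scale to zero; cold starts hurt burst response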

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: job-worker
  namespace: production
spec:
  scaleTargetRef:
    name: job-worker
  minReplicaCount: 5
  maxReplicaCount: 400
  cooldownPeriod: 120
  pollingInterval: 10
  advanced:
    restoreToOriginalReplicaCount: true
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0
          policies:
            - type: Percent
              value: 200
              periodSeconds: 15
        scaleDown:
          stabilizationWindowSeconds: 180
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/job-queue
        queueLength: "20"          # target: 20 messages per worker pod (visible plus in-flight, see scaleOnInFlight)
        awsRegion: us-east-1
        scaleOnInFlight: "true"
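
As a rough guide to what this trigger asks for, the replica target implied by a per-pod queue target is the message count divided by queueLength, rounded up and clamped to the ScaledObject bounds. The sketch below is a simplified model of that calculation, not KEDA's implementation, and the queue depths in the loop are hypothetical.

```javascript
// Sketch: worker count implied by queue depth and a per-pod target.
// desired = ceil(messages / perPodTarget), clamped to [minReplicas, maxReplicas].
function desiredWorkers(messages, perPodTarget, minReplicas, maxReplicas) {
  const raw = Math.ceil(messages / perPodTarget);
  return Math.min(maxReplicas, Math.max(minReplicas, raw));
}

// Hypothetical queue depths, using queueLength 20 and bounds 5..400 from the manifest.
for (const depth of [0, 60, 2000, 12000]) {
  console.log(`${depth} messages -> ${desiredWorkers(depth, 20, 5, 400)} workers`);
}
```

An empty or nearly empty queue stays at the floor of 5, 2,000 messages ask for 100 workers, and 12,000 messages hit the ceiling of 400.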

Pre-Scaling for Predictable Bursts

The most important insight for calendar-driven bursts: reactive autoscaling is too slow. The 09:00 email drop is known 48 hours in advance. Pre-scale 10 minutes early, before the traffic arrives, using KEDA's cron trigger or a simple Kubernetes CronJob that patches minReplicas directly.

#!/bin/bash
# pre-scale.sh: run as a Kubernetes CronJob at 08:50 in each target timezone
# Raises minReplicas floors before the burst; a second job restores them at 10:30
# Usage: pre-scale.sh [burst|restore]   (defaults to burst)

set -euo pipefail

NAMESPACE="production"
ACTION="${1:-burst}"

# Pick the replica floors for the requested action
case "$ACTION" in
  burst)
    MIN_API=300       # 15,000 rps pre-provisioned (300 pods x 50 rps)
    MIN_WORKER=100
    ;;
  restore)
    MIN_API=80
    MIN_WORKER=5
    ;;
  *)
    echo "Unknown action: ${ACTION} (expected burst or restore)" >&2
    exit 1
    ;;
esac

# Patch the HPA floor for the API tier
kubectl patch hpa api-server -n "$NAMESPACE" \
  --type=merge \
  -p "{\"spec\":{\"minReplicas\":${MIN_API}}}"

# Patch the KEDA ScaledObject floor for the worker tier
kubectl patch scaledobject job-worker -n "$NAMESPACE" \
  --type=merge \
  -p "{\"spec\":{\"minReplicaCount\":${MIN_WORKER}}}"

echo "Applied ${ACTION}: api=${MIN_API}, worker=${MIN_WORKER}"

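KEDA cron trigger as an alternative: KEDA supports a cron trigger type that holds desiredReplicas for a scheduled window (start and end cron expressions plus a timezone) without external scripts. This is cleaner for GitOps workflows since the intent lives in the ScaledObject manifest, not a separate CronJob. It only applies to workloads managed by a ScaledObject, so the API tier would need to move from its plain HPA to KEDA to use it.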
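
Before choosing a pre-scale floor, it helps to compute how much of the peak it covers. The sketch below uses the lesson's own figures (300-pod floor, 50 rps per pod, 40,000 req/s peak, maxReplicas 600); the function name and result fields are illustrative.

```javascript
// Sketch: how much of the peak does a pre-scaled floor cover,
// and how many more pods must reactive scaling add?
function floorCoverage(floorPods, rpsPerPod, peakRps, maxPods) {
  const floorRps = floorPods * rpsPerPod;
  const podsAtPeak = Math.ceil(peakRps / rpsPerPod);
  return {
    floorRps,                                    // rps served at target by the floor
    coveredFraction: floorRps / peakRps,         // share of the peak the floor absorbs
    podsAtPeak,                                  // pods needed to hold the per-pod target
    reactivePodsNeeded: Math.max(0, podsAtPeak - floorPods),
    exceedsMax: podsAtPeak > maxPods,            // true if maxReplicas is too low
  };
}

console.log(floorCoverage(300, 50, 40000, 600));
```

With these numbers the floor covers 15,000 req/s (37.5 % of the peak), and the 800 pods needed to hold 50 rps per pod exceed the 600-pod ceiling, so either pods run above target at peak or the ceiling has to rise.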

Load Shedding as the Last Line of Defense

Even with perfect pre-scaling, an unexpected 20x spike or a multi-AZ failure can exhaust capacity. The strategy must include load shedding so the system degrades gracefully rather than collapsing. Implement two layers:

Nginx rate limiting at the ingress: limit_req_zone with a per-tenant burst allowance. Tenants on free tiers are capped first; paid tiers get a larger burst budget.
Application-level circuit breaker: when the internal queue depth exceeds a threshold, the API returns 429 Too Many Requests with a Retry-After header. This is cheaper than letting requests queue inside the app and time out with a 504.
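
The application-level check can be as small as one function called at the top of each request handler. This is a minimal sketch: the threshold and Retry-After values are hypothetical, and how the result is wired into a real HTTP framework is left out.

```javascript
// Sketch of the shed decision for the application-level circuit breaker.
// Returns a 429 response description when the queue is over threshold,
// or null when the request should be admitted.
function shedDecision(queueDepth, threshold, retryAfterSeconds) {
  if (queueDepth > threshold) {
    return {
      status: 429,
      headers: { 'Retry-After': String(retryAfterSeconds) },
      body: 'Too Many Requests',
    };
  }
  return null;
}

// Hypothetical values: shed above 1000 queued items, ask clients to wait 5 s.
console.log(shedDecision(1200, 1000, 5)); // over threshold: 429 response
console.log(shedDecision(800, 1000, 5));  // under threshold: null (admit)
```

Rejecting early keeps the decision cheap: no downstream work is started for a request that would have timed out anyway.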

Production pitfall: autoscaling + spot interruption during burst. Spot instances can be reclaimed with 2 minutes' notice at any time, including during a burst. Mitigate by: (1) using multiple instance families in the Karpenter NodePool so AWS has more spot pools to draw capacity from; (2) setting PodDisruptionBudget maxUnavailable: 10% on the API deployment so node drains evict at most 10 % of pods at a time (a PDB paces evictions, but it cannot stop the instance from being terminated once the notice expires); (3) running a minimum floor of on-demand nodes in the baseline pool that alone can sustain 1x load, so spot is only for the burst multiplier.

Validation: Load Test the Strategy Before Production

Every autoscaling strategy must be validated under synthetic load before a real event. Use k6 to simulate the burst shape: ramp to 10x over 60 seconds, hold for 5 minutes, drain over 2 minutes, then hold at baseline for 3 minutes to confirm stability. Observe HPA reaction time, Karpenter provisioning latency, P99 latency, and error rate in Grafana. If P99 breaches 400 ms during the ramp phase (before new nodes join), increase the pre-scale floor or reduce HPA target utilization.

// burst-loadtest.js: k6 script simulating a 10x traffic burst
// Run: K6_PROMETHEUS_RW_SERVER_URL=<remote_write_url> k6 run -o experimental-prometheus-rw burst-loadtest.js

import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 400 },    // baseline: 400 VUs, at most ~4000 rps (each VU sleeps 0.1 s per iteration)
    { duration: '1m', target: 4000 },   // ramp to 10x burst
    { duration: '5m', target: 4000 },   // sustain burst
    { duration: '2m', target: 400 },    // drain back
    { duration: '3m', target: 400 },    // confirm stable at baseline
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],   // SLO: P99 under 500ms
    http_req_failed:   ['rate<0.001'],   // error rate under 0.1%
  },
};

export default function () {
  const res = http.get('https://api.example.com/v1/healthz', {
    headers: { 'X-Tenant-ID': `tenant-${Math.floor(Math.random() * 1000)}` },
  });
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(0.1);
}
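
Because each virtual user waits for its response before sleeping, the request rate depends on response time as well as VU count. The sketch below estimates that relationship; it assumes the 45 ms baseline latency from the workload profile applies to the tested endpoint, which may not hold for a health check.

```javascript
// Sketch: closed-model throughput estimate for a k6 script where each VU
// issues one request and then sleeps. Times are in milliseconds.
function estimateRps(vus, sleepMs, responseMs) {
  return (vus * 1000) / (sleepMs + responseMs);
}

// VUs needed to reach a target rate under the same assumptions.
function vusForRps(targetRps, sleepMs, responseMs) {
  return Math.ceil((targetRps * (sleepMs + responseMs)) / 1000);
}

console.log(Math.round(estimateRps(400, 100, 45))); // 2759
console.log(vusForRps(4000, 100, 45));              // 580
console.log(vusForRps(40000, 100, 45));             // 5800
```

Under this assumption 400 VUs generate about 2,759 req/s rather than 4,000, so the stage targets would need to rise to roughly 580 and 5,800 VUs to hit the baseline and burst rates. Check the achieved rate in the k6 http_reqs metric rather than relying on the VU count.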

Strategy summary — the four decisions: (1) Set HPA minReplicas high enough that warm capacity absorbs the first 60 seconds before new nodes are ready. (2) Use Karpenter burst-pool spot nodes of large instance type to minimise node-provisioning round trips. (3) Pre-scale for known calendar events — reactive scaling alone is too slow for predictable bursts. (4) Add load shedding and PodDisruptionBudgets as safety nets for the unexpected. Together, these four layers let the system absorb a 10x burst without breaching SLO.