Capacity Planning & Autoscaling

Cluster Autoscaling & Karpenter


HPA and VPA operate entirely within the existing node pool: HPA changes replica counts and VPA resizes resource requests, but neither can create new capacity when the cluster exhausts allocatable CPU or memory. That responsibility belongs to the cluster autoscaler layer. This lesson covers the two dominant approaches on Kubernetes: the long-established Cluster Autoscaler (CA) and Karpenter, the newer node provisioner that originated at AWS. We also cover spot instance integration, node consolidation, and the production failure modes that catch engineers off guard at scale.

How Cluster Autoscaler Works

Cluster Autoscaler watches for Unschedulable pods — pods the scheduler marks pending because no node has enough allocatable capacity. When CA sees such a pod it simulates whether adding a node from a configured Auto Scaling Group (ASG) would unblock it. If yes, it triggers an ASG scale-out. On the scale-in path CA periodically checks whether any node's running pods could fit on the remaining fleet; if so, it cordons the node, drains it respecting PodDisruptionBudgets, and triggers an ASG scale-in.
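
The trigger described above shows up in the pod's status. The fragment below is an illustrative sketch of the relevant part of `kubectl get pod <name> -o yaml` for a pending pod; the pod name, node count, and message text are hypothetical examples, not output captured from a real cluster.

```yaml
# Illustrative status of a pending pod (names and numbers are hypothetical)
apiVersion: v1
kind: Pod
metadata:
  name: api-service-7c9d8-x2k4p   # hypothetical pod name
  namespace: production
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: "False"
      reason: Unschedulable        # the condition Cluster Autoscaler reacts to
      message: "0/3 nodes are available: 3 Insufficient cpu."
```

A pod that is Pending for another reason, such as a slow image pull, does not carry this condition and will not trigger a scale-out.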

Key tuning parameters that matter at production scale:

--scale-down-utilization-threshold (default 0.5) — a node is considered underutilized when requested CPU and memory are both below this fraction. Raising it to 0.7 on cost-sensitive clusters speeds consolidation but risks churn during bursty traffic.
--scale-down-delay-after-add (default 10m) — how long after a scale-out before scale-in is re-evaluated. Set too low and you get flapping; 15–20 minutes is safer for workloads with irregular traffic shapes.
--max-node-provision-time (default 15m) — CA gives up on a node group if the node is not Ready within this window. With spot instances, set 8–10m to fail fast and try a different instance pool.
--balance-similar-node-groups — critical for multi-AZ deployments: forces CA to scale out evenly across availability zones rather than filling one AZ first.
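
As a rough sketch, the flags above are passed as container arguments in the cluster-autoscaler Deployment. The image tag, the cluster name `my-cluster`, and the ASG tag convention in the auto-discovery flag are assumptions for illustration; the values follow the guidance in the list rather than the defaults.

```yaml
# Fragment of the cluster-autoscaler Deployment pod spec (sketch, not a full manifest)
spec:
  containers:
    - name: cluster-autoscaler
      # Assumed image tag; match the minor version to your Kubernetes version
      image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
      command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        # Assumed ASG tags; replace my-cluster with your cluster name
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --scale-down-utilization-threshold=0.5   # default
        - --scale-down-delay-after-add=15m         # raised from the 10m default
        - --max-node-provision-time=10m            # lowered from 15m to fail fast on spot
        - --balance-similar-node-groups=true
```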

CA anti-pattern — over-requested resources: CA scales on requested resources, not actual consumption. Pods requesting 2 vCPU but using 0.3 vCPU fill the cluster on paper while real utilization sits at 15%. Fix this upstream with VPA right-sizing before trusting CA to make sensible scale decisions.
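
One low-risk way to start the right-sizing is to run VPA in recommendation-only mode and compare its suggestions with the current requests. The manifest below is a minimal sketch: it assumes the VPA components are installed in the cluster and that a Deployment named `api-service` exists in the `production` namespace (both names are placeholders).

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-service-vpa        # placeholder name
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service          # placeholder target
  updatePolicy:
    updateMode: "Off"          # recommend only; never evict or mutate pods
```

With `updateMode: "Off"`, the recommendations appear in the object's status (for example via `kubectl describe vpa api-service-vpa -n production`) and can be applied to the Deployment's requests by hand.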

Karpenter: The Modern Approach

Karpenter (open source, created by AWS and now maintained under Kubernetes SIG Autoscaling) takes a fundamentally different approach. Instead of managing ASGs, Karpenter calls the EC2 Fleet API (CreateFleet) directly to launch individual instances. Instead of pre-configured node groups with fixed sizes, you define NodePools that describe constraints (instance families, architectures, capacity types), and Karpenter selects a suitable instance in real time by bin-packing pending pods against current EC2 pricing and availability.

Provisioning latency drops from the typical 3–5 minutes of CA (ASG warm-up + AMI bootstrap + kubelet registration) to under 60 seconds for most instance types. More importantly, Karpenter can select the exact right instance size for a batch of pending pods rather than always rounding up to the next size in a pre-defined ASG.

# Install Karpenter via Helm (Karpenter v1.x on EKS) -- configure IRSA for the controller first
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "1.0.6" \
  --namespace "kube-system" \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${KARPENTER_INTERRUPT_QUEUE}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --wait

---
# NodePool: defines what Karpenter may provision
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws   # karpenter.sh/v1 uses group, not apiVersion
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.xlarge","m5.2xlarge","m6i.xlarge","m6i.2xlarge","c5.2xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a","us-east-1b","us-east-1c"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: "1000"
    memory: 4000Gi

---
# EC2NodeClass: AWS-specific details (AMI, subnets, security groups, IAM role)
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  instanceStorePolicy: RAID0

Node Provisioning Flow

The diagram below contrasts the CA provisioning path (through ASGs) with Karpenter's direct EC2 path. Understanding this helps you reason about latency budgets and failure modes during traffic spikes.

Figure: Node provisioning flow, Cluster Autoscaler (via ASG) vs Karpenter (direct EC2 API). Karpenter eliminates the ASG layer and picks the optimal instance size per workload.

Spot Instance Strategy

Running a majority of stateless, interruption-tolerant workloads on spot instances can cut EC2 costs by 60–80%. The correct approach is diversification: spread requests across many instance families and sizes so that a single spot pool interruption does not drain your capacity. Karpenter handles this natively by evaluating multiple instance types per NodePool requirement. With CA, you achieve diversification by defining multiple ASGs and setting --expander=least-waste or random.
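
With Karpenter, one way to diversify without maintaining a long list of instance types is to constrain by category, generation, and size instead. The fragment below is a sketch of the `requirements` section of a NodePool; the specific categories, generation cutoff, and sizes are assumptions chosen for illustration, not recommendations from this lesson.

```yaml
# Fragment of spec.template.spec in a NodePool (sketch)
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["c", "m", "r"]            # compute, general purpose, memory optimized
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values: ["4"]                      # generation 5 and newer
  - key: karpenter.k8s.aws/instance-size
    operator: In
    values: ["2xlarge", "4xlarge"]
```

A constraint like this typically matches far more spot pools than a hand-written list, at the cost of less control over exactly which instance types appear in the cluster.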

Handling spot interruptions gracefully requires two things:

Interruption notice handling: EC2 sends a 2-minute warning before reclaiming a spot instance. Karpenter integrates with an SQS queue (interruptionQueue) to receive these events and proactively cordon and drain the node before the 2-minute window expires. Without this, pods are terminated along with the instance and get no graceful drain.
Pod disruption budgets: every production deployment needs a PDB so that the draining step cannot take down more than a safe fraction of replicas simultaneously. A common configuration is minAvailable: 50% for stateless services.

# Separate NodePool for spot-only batch workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-batch
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws   # karpenter.sh/v1 uses group, not apiVersion
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]         # spot-only; batch jobs tolerate interruption
        - key: node.kubernetes.io/instance-type
          operator: In
          # Diversify across 8+ instance types to avoid capacity drought
          values: ["m5.2xlarge","m5.4xlarge","m6i.2xlarge","m6i.4xlarge",
                   "c5.4xlarge","c5a.4xlarge","r5.2xlarge","r6i.2xlarge"]
      taints:
        - key: workload-type
          value: batch
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s
  limits:
    cpu: "500"

---
# PodDisruptionBudget -- protect stateless service during node drain
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
  namespace: production
spec:
  selector:
    matchLabels:
      app: api-service
  minAvailable: "50%"     # at least half the replicas always up during eviction
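
Because the spot-batch NodePool above taints its nodes, a workload has to tolerate that taint to land there. The Job below is a minimal sketch of the matching side; the job name, image, and resource requests are placeholders, and the `nodeSelector` relies on the `karpenter.sh/nodepool` label that Karpenter applies to the nodes it provisions.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report                 # hypothetical job name
spec:
  backoffLimit: 4                      # allow a few retries for failed pods
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        karpenter.sh/nodepool: spot-batch   # pin to the spot-only NodePool
      tolerations:
        - key: workload-type
          operator: Equal
          value: batch
          effect: NoSchedule           # matches the taint on spot-batch nodes
      containers:
        - name: report
          image: registry.example.com/report-job:1.0   # placeholder image
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
```

The toleration alone only permits scheduling onto the tainted nodes; the `nodeSelector` is what keeps the Job off the general-purpose pool.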

Node Consolidation

Provisioning is only half the story. A cluster that scales out on traffic spikes will accumulate underutilized nodes after traffic recedes. Consolidation is the process of compacting running workloads onto fewer nodes and terminating the surplus.

Karpenter's consolidation engine (consolidationPolicy: WhenEmptyOrUnderutilized) runs continuously. It evaluates every node against a model of where its pods could be rescheduled. When a valid consolidation move is found, either emptying a node entirely or replacing it with a smaller or cheaper instance, Karpenter executes the drain and new-node-provision cycle. This is more aggressive than CA's scale-in, which removes a node only after it has stayed below the utilization threshold for the configured delay and never swaps a node for a cheaper instance type.

Production consolidation settings: Use WhenEmptyOrUnderutilized with a consolidateAfter: 30s for stateless microservices clusters — it keeps costs tight. For clusters running stateful workloads (databases, Kafka brokers) use WhenEmpty only, or add a karpenter.sh/do-not-disrupt: "true" annotation to those pods to exclude them from consolidation consideration entirely.
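
The annotation mentioned above goes on the pod template, not on the workload object itself. The fragment below is a sketch using a hypothetical StatefulSet named `kafka`; the fields a real StatefulSet requires (selector, serviceName, containers) are omitted for brevity.

```yaml
# Fragment of a StatefulSet (sketch); only the annotation matters here
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka                          # hypothetical workload
spec:
  template:
    metadata:
      annotations:
        # Karpenter will not voluntarily disrupt a node while this pod runs on it
        karpenter.sh/do-not-disrupt: "true"
```

The annotation covers voluntary disruption such as consolidation. It does not protect against involuntary events such as a spot interruption, so stateful pods are usually better kept on on-demand capacity as well.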

Production Failure Modes

Several failure modes surface repeatedly in large Karpenter deployments:

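NodePool limits hit: Once limits.cpu or limits.memory is exhausted, Karpenter stops provisioning from that NodePool. Pods remain pending indefinitely. Monitor the ratio of karpenter_nodepools_usage to karpenter_nodepools_limit per resource type (metric names as of Karpenter v1; verify against your version) and alert before it reaches 90%.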
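AMI drift causing bootstrap failures: If al2023@latest rolls out a broken release, every new node fails to join the cluster. Pin to a known-good AMI version, for example al2023@v20250501, rather than tracking @latest in production. Alert on nodes that launch but never become Ready; Karpenter's metric names have changed between releases, so confirm the exact series against the controller's metrics endpoint.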
Consolidation thrashing: If your HPA is reactive and your NodePool consolidation is aggressive, you can enter a loop: HPA scales down -> Karpenter consolidates -> load spike -> HPA scales up -> Karpenter provisions. Mitigate by setting KEDA or HPA scale-down stabilization windows longer than the consolidation window.
PDB blocking drain: If a PDB has minAvailable equal to total replicas (a common mistake), eviction is permanently blocked and consolidation stalls. Audit PDBs regularly with kubectl get pdb -A.
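
For the consolidation-thrashing case, the mitigation can be sketched with the `behavior` field of an `autoscaling/v2` HPA. The target name, replica bounds, CPU target, and window length below are illustrative assumptions; the point is only that the scale-down window is much longer than the 30s `consolidateAfter` used earlier.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 4                        # illustrative bounds
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # 10 min; well above consolidateAfter: 30s
      policies:
        - type: Percent
          value: 25                     # remove at most 25% of replicas per minute
          periodSeconds: 60
```

With a long scale-down window, replicas stay in place through short dips in load, so Karpenter sees fewer freshly emptied nodes to consolidate right before the next spike.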

Choosing between CA and Karpenter: If you are on EKS and run primarily stateless workloads, Karpenter is the clear choice for new clusters: lower provisioning latency, better cost optimization, and simpler operations. For clusters on GKE or AKS, or clusters with strict compliance requirements around instance type control, the provider-managed cluster autoscalers (the GKE cluster autoscaler, the AKS cluster autoscaler) remain appropriate. Do not let CA and Karpenter manage the same nodes; pick one per node group.