Cluster Autoscaling & Karpenter
HPA and VPA operate entirely within the existing node pool — they shuffle pods and resize resource requests, but they cannot create new capacity when the cluster exhausts allocatable CPU or memory. That responsibility belongs to the cluster autoscaler layer. This lesson covers the two dominant approaches on Kubernetes: the venerable Cluster Autoscaler (CA) and AWS's modern replacement, Karpenter. We also cover spot instance integration, node consolidation, and the production failure modes that catch engineers off guard at scale.
How Cluster Autoscaler Works
Cluster Autoscaler watches for Unschedulable pods — pods the scheduler marks pending because no node has enough allocatable capacity. When CA sees such a pod it simulates whether adding a node from a configured Auto Scaling Group (ASG) would unblock it. If yes, it triggers an ASG scale-out. On the scale-in path CA periodically checks whether any node's running pods could fit on the remaining fleet; if so, it cordons the node, drains it respecting PodDisruptionBudgets, and triggers an ASG scale-in.
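The scale-out check described above can be sketched in a few lines of Python. This is a simplified illustration, not CA's actual code: the `Resources` type, the `pick_node_group` helper, and the node group names and sizes are all hypothetical, and the real autoscaler also evaluates taints, affinity rules, and the configured expander rather than only CPU and memory.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Resources:
    cpu_millis: int
    memory_mib: int

    def fits_in(self, other: "Resources") -> bool:
        # A request fits only if every dimension fits.
        return (self.cpu_millis <= other.cpu_millis
                and self.memory_mib <= other.memory_mib)


def pick_node_group(pod: Resources,
                    node_groups: Dict[str, Resources]) -> Optional[str]:
    """Return the first node group whose template node could hold the pod."""
    for name, allocatable in node_groups.items():
        if pod.fits_in(allocatable):
            return name
    return None  # no group helps, so the pod stays pending


# Hypothetical node groups with the allocatable capacity of one fresh node.
groups = {
    "general-purpose": Resources(cpu_millis=1900, memory_mib=7000),
    "memory-optimized": Resources(cpu_millis=3900, memory_mib=30000),
}
pending_pod = Resources(cpu_millis=1000, memory_mib=12000)

print(pick_node_group(pending_pod, groups))  # memory-optimized
```

The pod needs more memory than a general-purpose node offers, so only the second group would unblock it. A pod that fits no template returns `None`, which corresponds to CA declining to scale out.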
Key tuning parameters that matter at production scale:
- --scale-down-utilization-threshold (default 0.5): a node is considered underutilized when requested CPU and memory are both below this fraction. Raising it to 0.7 on cost-sensitive clusters speeds consolidation but risks churn during bursty traffic.
- --scale-down-delay-after-add (default 10m): how long after a scale-out before scale-in is re-evaluated. Set too low and you get flapping; 15–20 minutes is safer for workloads with irregular traffic shapes.
- --max-node-provision-time (default 15m): CA gives up on a node group if the node is not Ready within this window. With spot instances, set 8–10m to fail fast and try a different instance pool.
- --balance-similar-node-groups: critical for multi-AZ deployments. It forces CA to scale out evenly across availability zones rather than filling one AZ first.
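To make the utilization threshold concrete, here is a small Python sketch of the check. It assumes utilization is the larger of the CPU and memory request ratios, which is equivalent to requiring both to be below the threshold; the function names and the sample node are illustrative only.

```python
def node_utilization(cpu_requested: float, cpu_allocatable: float,
                     mem_requested: float, mem_allocatable: float) -> float:
    """Utilization is the larger of the CPU and memory request ratios."""
    return max(cpu_requested / cpu_allocatable,
               mem_requested / mem_allocatable)


def is_scale_down_candidate(utilization: float, threshold: float = 0.5) -> bool:
    """A node is a candidate only when utilization is below the threshold."""
    return utilization < threshold


# Hypothetical node: 1.2 of 4 CPUs and 9 of 16 GiB requested.
util = node_utilization(1.2, 4.0, 9.0, 16.0)

print(util)                                # 0.5625
print(is_scale_down_candidate(util))       # False at the default 0.5
print(is_scale_down_candidate(util, 0.7))  # True at 0.7
```

The same node is left alone at the default threshold and becomes a scale-in candidate at 0.7, which is the trade-off between tighter packing and more churn.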
Karpenter: The Modern Approach
Karpenter (originally built by AWS and now maintained under Kubernetes SIG Autoscaling) takes a fundamentally different architecture. Instead of managing ASGs, Karpenter calls the EC2 Fleet API directly to provision individual instances. Instead of pre-configured node groups with fixed sizes, you define NodePools that describe constraints such as instance families, architectures, and capacity types, and Karpenter selects the optimal instance in real time by binpacking pending pods against live EC2 pricing and availability.
Provisioning latency drops from the typical 3–5 minutes of CA (ASG warm-up + AMI bootstrap + kubelet registration) to under 60 seconds for most instance types. More importantly, Karpenter can select the exact right instance size for a batch of pending pods rather than always rounding up to the next size in a pre-defined ASG.
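The "exact right instance size" idea can be illustrated with a toy selector that picks the cheapest instance type able to hold a whole batch of pending pods. This is only a sketch under strong assumptions: the catalog, instance names, and prices are invented, and the real scheduler also accounts for daemonset overhead, scheduling constraints, availability, and splitting pods across several nodes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu_millis: int
    memory_mib: int
    hourly_price: float  # hypothetical price, not real EC2 pricing


def cheapest_fit(pods: List[Tuple[int, int]],
                 catalog: List[InstanceType]) -> Optional[InstanceType]:
    """Pick the cheapest single instance that holds all pods (cpu, mem)."""
    total_cpu = sum(cpu for cpu, _ in pods)
    total_mem = sum(mem for _, mem in pods)
    candidates = [it for it in catalog
                  if it.cpu_millis >= total_cpu and it.memory_mib >= total_mem]
    return min(candidates, key=lambda it: it.hourly_price, default=None)


# Hypothetical catalog; a fixed-size ASG would only ever offer one of these.
catalog = [
    InstanceType("example-small", 2000, 4096, 0.05),
    InstanceType("example-medium", 4000, 8192, 0.10),
    InstanceType("example-large", 8000, 16384, 0.20),
]
pending = [(500, 1024)] * 3  # three pods: 500m CPU, 1 GiB each

choice = cheapest_fit(pending, catalog)
print(choice.name)  # example-small
```

A node group pinned to the large size would provision four times the cost for the same three pods, which is the rounding-up waste the paragraph above describes.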
Node Provisioning Flow
The diagram below contrasts the CA provisioning path (through ASGs) with Karpenter's direct EC2 path. Understanding this helps you reason about latency budgets and failure modes during traffic spikes.
Spot Instance Strategy
Running a majority of stateless, interruption-tolerant workloads on spot instances can cut EC2 costs by 60–80%. The correct approach is diversification: spread requests across many instance families and sizes so that a single spot pool interruption does not drain your capacity. Karpenter handles this natively by evaluating multiple instance types per NodePool requirement. With CA, you achieve diversification by defining multiple ASGs and setting --expander=least-waste or random.
Handling spot interruptions gracefully requires two things:
- Interruption notice handling: EC2 sends a 2-minute warning before reclaiming a spot instance. Karpenter integrates with an SQS queue (interruptionQueue) to receive these events and proactively cordon and drain the node before the 2-minute window expires. Without this, your pods get SIGKILL with no draining.
- Pod disruption budgets: every production deployment needs a PDB so that the draining step cannot take down more than a safe fraction of replicas simultaneously. A common configuration is minAvailable: 50% for stateless services.
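The effect of a percentage-based PDB during a drain comes down to simple arithmetic. The sketch below assumes the percentage is rounded up to a whole pod count, as the Kubernetes documentation describes for minAvailable; the helper name is made up for illustration.

```python
def allowed_disruptions(healthy_replicas: int, expected_replicas: int,
                        min_available_percent: int) -> int:
    """How many pods may be evicted right now under a percentage minAvailable."""
    # Round the required count up to a whole pod (integer ceiling division).
    required = (expected_replicas * min_available_percent + 99) // 100
    return max(0, healthy_replicas - required)


print(allowed_disruptions(7, 7, 50))    # 3: four of seven pods must stay up
print(allowed_disruptions(10, 10, 50))  # 5
print(allowed_disruptions(2, 2, 100))   # 0: drains are fully blocked
```

With seven replicas and minAvailable: 50%, a spot drain can evict at most three pods at once; the remaining evictions wait until replacements become healthy.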
Node Consolidation
Provisioning is only half the story. A cluster that scales out on traffic spikes will accumulate underutilized nodes after traffic recedes. Consolidation is the process of compacting running workloads onto fewer nodes and terminating the surplus.
Karpenter's consolidation engine (consolidationPolicy: WhenEmptyOrUnderutilized) runs continuously. It evaluates every node against a model of where its pods could be rescheduled. When a valid consolidation move is found, either emptying a node entirely or replacing it with a smaller or cheaper instance, Karpenter executes the drain and new-node-provision cycle. This is more aggressive than CA's scale-in, which removes a node only when its utilization falls below --scale-down-utilization-threshold and its pods fit on the remaining fleet, and which never swaps a node for a cheaper instance type.
Use WhenEmptyOrUnderutilized with consolidateAfter: 30s for stateless microservices clusters to keep costs tight. For clusters running stateful workloads (databases, Kafka brokers), use WhenEmpty only, or add a karpenter.sh/do-not-disrupt: "true" annotation to those pods to exclude them from consolidation entirely.
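The core question behind a consolidation move is whether a node's pods can be placed on the spare capacity of the other nodes. The following Python sketch answers it with a simple first-fit pass and treats the do-not-disrupt annotation as a hard veto. It is an assumption-laden toy, not Karpenter's algorithm: the pod and capacity shapes are hypothetical, and real consolidation also weighs price, PDBs, and scheduling constraints.

```python
from typing import Dict, List, Tuple

DO_NOT_DISRUPT = "karpenter.sh/do-not-disrupt"


def can_drain(pods: List[Dict], free_capacity: List[Tuple[int, int]]) -> bool:
    """True if every pod fits into the free (cpu, mem) of the other nodes."""
    # Any annotated pod excludes the whole node from consolidation.
    if any(p.get("annotations", {}).get(DO_NOT_DISRUPT) == "true" for p in pods):
        return False
    free = [list(slot) for slot in free_capacity]  # mutable working copy
    # Place the largest pods first so small ones fill the leftover gaps.
    for pod in sorted(pods, key=lambda p: p["cpu"], reverse=True):
        for slot in free:
            if pod["cpu"] <= slot[0] and pod["mem"] <= slot[1]:
                slot[0] -= pod["cpu"]
                slot[1] -= pod["mem"]
                break
        else:
            return False  # this pod fits nowhere, so the node must stay
    return True


pods = [{"cpu": 500, "mem": 512}, {"cpu": 250, "mem": 256}]
print(can_drain(pods, [(600, 1024), (300, 512)]))  # True
print(can_drain(pods, [(600, 1024)]))              # False

pinned = pods + [{"cpu": 100, "mem": 128,
                  "annotations": {DO_NOT_DISRUPT: "true"}}]
print(can_drain(pinned, [(4000, 8192)]))           # False
```

The last call shows why the annotation suits stateful pods: even with ample spare capacity elsewhere, the node is never selected.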
Production Failure Modes
Several failure modes surface repeatedly in large Karpenter deployments:
- NodePool limits hit: Once limits.cpu or limits.memory is exhausted, Karpenter stops provisioning. Pods remain pending indefinitely. Monitor with karpenter_nodepools_limit_usage_percentage and alert before it reaches 90%.
- AMI drift causing bootstrap failures: If al2023@latest rolls out a broken release, every new node fails to join the cluster. Pin to a specific AMI alias during outages: al2023@v20250501. Track karpenter_nodes_total{phase="NotReady"} in your alerting stack.
- Consolidation thrashing: If your HPA is reactive and your NodePool consolidation is aggressive, you can enter a loop: HPA scales down -> Karpenter consolidates -> load spike -> HPA scales up -> Karpenter provisions. Mitigate by setting KEDA or HPA scale-down stabilization windows longer than the consolidation window.
- PDB blocking drain: If a PDB has minAvailable equal to total replicas (a common mistake), eviction is permanently blocked and consolidation stalls. Audit PDBs regularly with kubectl get pdb -A.
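The PDB audit can be partly automated. The sketch below assumes you feed it the parsed output of kubectl get pdb -A -o json and relies on the expectedPods and desiredHealthy fields of the PDB status; the function name and the sample namespaces are hypothetical.

```python
import json
from typing import List


def find_blocking_pdbs(pdb_list: dict) -> List[str]:
    """Return namespace/name of PDBs that can never allow an eviction."""
    blocking = []
    for item in pdb_list.get("items", []):
        status = item.get("status", {})
        expected = status.get("expectedPods", 0)
        desired = status.get("desiredHealthy", 0)
        # If every expected pod must stay healthy, no drain can proceed.
        if expected > 0 and desired >= expected:
            meta = item["metadata"]
            blocking.append(f'{meta["namespace"]}/{meta["name"]}')
    return blocking


# Trimmed, hypothetical sample of `kubectl get pdb -A -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"namespace": "payments", "name": "api-pdb"},
   "status": {"expectedPods": 3, "desiredHealthy": 3}},
  {"metadata": {"namespace": "web", "name": "frontend-pdb"},
   "status": {"expectedPods": 6, "desiredHealthy": 3}}
]}
""")

print(find_blocking_pdbs(sample))  # ['payments/api-pdb']
```

Running a check like this on a schedule catches misconfigured budgets before a consolidation or spot drain stalls on them.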