Cluster Autoscaling & Karpenter
Horizontal Pod Autoscaling (HPA) scales your workloads; Cluster Autoscaling scales your infrastructure. When every node is full and a new Pod cannot be scheduled, you need capacity to appear — fast, cheaply, and with the right shape. Getting this right is the difference between a platform that absorbs traffic spikes gracefully and one that pages you at 3 AM because Pods have been Pending for 20 minutes.
This lesson covers two generations of the same idea: the Cluster Autoscaler (CA), the long-standing Kubernetes-project tool, and Karpenter, the modern AWS-originated (now CNCF) alternative that fixes most of CA's structural limitations.
Why Nodes Run Out: The Bin-Packing Problem
Scheduling is a bin-packing problem: the scheduler has to fit Pods into whatever capacity the nodes have available. Each Pod declares requests (the CPU/memory reserved for it, which the scheduler uses for placement) and limits (a hard cap enforced at runtime). The scheduler sums requests per node and will not place a Pod if doing so would exceed the node's allocatable capacity.
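The fit rule described above can be sketched in a few lines. This is a minimal illustration, not the scheduler's actual code: the Pod and Node classes, their field names, and the sample numbers are all assumptions made for the example, and only CPU and memory requests are considered.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    cpu_request_m: int    # CPU request in millicores
    mem_request_mi: int   # memory request in MiB


@dataclass
class Node:
    name: str
    alloc_cpu_m: int      # allocatable CPU in millicores
    alloc_mem_mi: int     # allocatable memory in MiB
    pods: list = field(default_factory=list)

    def requested(self):
        """Sum the requests of the Pods already bound to this node."""
        cpu = sum(p.cpu_request_m for p in self.pods)
        mem = sum(p.mem_request_mi for p in self.pods)
        return cpu, mem


def fits(node: Node, pod: Pod) -> bool:
    """True if adding the Pod keeps summed requests within allocatable."""
    cpu, mem = node.requested()
    return (cpu + pod.cpu_request_m <= node.alloc_cpu_m
            and mem + pod.mem_request_mi <= node.alloc_mem_mi)


# Hypothetical node with 3.5 vCPU allocatable and one Pod already placed.
node = Node("node-a", alloc_cpu_m=3500, alloc_mem_mi=14000)
node.pods.append(Pod("api-1", 2000, 4096))

print(fits(node, Pod("api-2", 1000, 2048)))  # requests would total 3000m CPU
print(fits(node, Pod("api-3", 2000, 2048)))  # requests would total 4000m CPU
```

Note that limits play no part in the check: placement is decided on requests alone, which is why a Pod with no requests can land on an already busy node.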
In practice, nodes are never 100% utilized: system daemons (kubelet, kube-proxy, log agents, CNI plugins) consume 5–15% of each node's capacity before workloads even start. A 4-vCPU node typically exposes ~3.5 vCPU as allocatable. This gap means you always need slightly more nodes than raw math suggests.
Run kubectl describe node <name> and look at the Allocatable block; that is the number the scheduler uses. The delta between Capacity and Allocatable is reserved for the OS and Kubernetes system components.
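To see how the Capacity/Allocatable gap changes node counts, here is a small arithmetic sketch. The 4000m and 3500m figures come from the 4-vCPU example above; the workload of 40 Pods at 500m each is a made-up assumption, and the result is only a lower bound because it ignores memory and packing fragmentation.

```python
import math


def nodes_needed(total_request_m: int, per_node_m: int) -> int:
    """Lower bound on node count: total requested CPU over per-node CPU."""
    return math.ceil(total_request_m / per_node_m)


capacity_m = 4000       # raw CPU of a 4-vCPU node, in millicores
allocatable_m = 3500    # the ~3.5 vCPU allocatable figure quoted above
workload_m = 40 * 500   # hypothetical workload: 40 Pods requesting 500m each

reserved_fraction = (capacity_m - allocatable_m) / capacity_m
naive = nodes_needed(workload_m, capacity_m)         # sizing on raw capacity
realistic = nodes_needed(workload_m, allocatable_m)  # sizing on allocatable

print(f"reserved: {reserved_fraction:.1%}, naive: {naive}, realistic: {realistic}")
```

Sizing on raw capacity undercounts by one node in this example, which is exactly the kind of gap that leaves the last few Pods Pending.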
Cluster Autoscaler (CA) — The Classic Approach
CA watches for Pending Pods and checks whether adding a node from any configured Node Group (AWS Auto Scaling Groups, GKE Managed Instance Groups, Azure VMSS) would make the Pod schedulable. If yes, it increments the group's desired count. It also scans for underutilized nodes and, after a configurable idle window (default 10 minutes), drains and terminates them.
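The scale-up decision can be pictured as the loop below. It is a simplified sketch, not CA's implementation: NodeGroup, pick_group, and scale_up are hypothetical names, the check looks only at CPU and memory, and it takes the first matching group, whereas the real CA chooses among candidate groups with a configurable expander strategy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeGroup:
    name: str
    node_cpu_m: int    # allocatable CPU of one node in this group
    node_mem_mi: int   # allocatable memory of one node in this group
    desired: int       # current desired count of the group
    max_size: int      # upper bound configured on the group


def pick_group(groups, pod_cpu_m, pod_mem_mi) -> Optional[NodeGroup]:
    """Return the first group that has headroom and whose node fits the Pod."""
    for g in groups:
        has_room = g.desired < g.max_size
        template_fits = pod_cpu_m <= g.node_cpu_m and pod_mem_mi <= g.node_mem_mi
        if has_room and template_fits:
            return g
    return None


def scale_up(groups, pod_cpu_m, pod_mem_mi) -> Optional[str]:
    """Increment the chosen group's desired count, or do nothing."""
    g = pick_group(groups, pod_cpu_m, pod_mem_mi)
    if g is None:
        return None        # no group would help; the Pod stays Pending
    g.desired += 1         # the cloud provider then launches the node
    return g.name


groups = [
    NodeGroup("small", 1900, 7000, desired=3, max_size=10),
    NodeGroup("large", 7800, 30000, desired=1, max_size=4),
]
print(scale_up(groups, pod_cpu_m=4000, pod_mem_mi=8192))
```

The important design point is that CA never picks an instance type itself; it can only raise the count of a group whose shape was decided in advance.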
Key CA limitations:
- ASG-coupled thinking: CA operates on pre-defined instance types within each ASG. You must create an ASG per instance family — mixing instance types requires multiple groups and careful priority configuration.
- Slow scale-up: CA polls every 10 seconds, then waits for the cloud provider to provision a node (1–3 minutes for EC2), then for kubelet to bootstrap (~30–60 s). Total: often 3–5 minutes from Pending to a running Pod.
- Scale-down conservatism: CA will not remove a node if any non-mirrored, non-DaemonSet Pod on it lacks a controller, or if a PodDisruptionBudget would be violated. This is safe but leaves idle nodes running longer than necessary.
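The scale-down rules in the last bullet can be approximated as a predicate. This is a deliberately conservative sketch under stated assumptions: PodInfo and can_remove_node are hypothetical names, the PodDisruptionBudget is modelled as a simple minAvailable count per app, and all Pods on the node are treated as evicted at once, which is stricter than a real drain that evicts Pods one at a time.

```python
from dataclasses import dataclass


@dataclass
class PodInfo:
    name: str
    app: str
    is_daemonset: bool = False
    is_mirror: bool = False
    has_controller: bool = True


def can_remove_node(pods_on_node, ready_by_app, min_available_by_app) -> bool:
    """Return True if draining the node breaks neither rule from the text."""
    evicted = {}
    for p in pods_on_node:
        if p.is_daemonset or p.is_mirror:
            continue                   # these Pods never block removal
        if not p.has_controller:
            return False               # a bare Pod would not be recreated
        evicted[p.app] = evicted.get(p.app, 0) + 1

    # Simplified PDB check: each app must stay at or above minAvailable.
    for app, count in evicted.items():
        remaining = ready_by_app.get(app, 0) - count
        if remaining < min_available_by_app.get(app, 0):
            return False
    return True


pods = [PodInfo("web-1", "web"), PodInfo("log-agent", "logging", is_daemonset=True)]
print(can_remove_node(pods, ready_by_app={"web": 3}, min_available_by_app={"web": 2}))
```

A single bare Pod or one tight budget is enough to pin a node, which is how mostly idle nodes end up running for hours.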
Karpenter — A Fundamentally Different Model
Karpenter (originally built by AWS and now maintained under Kubernetes SIG Autoscaling, within the CNCF) abandons the Node Group abstraction entirely. Instead of managing ASGs, Karpenter calls cloud APIs directly to launch an instance type that fits the pending workload. This removes the ASG indirection layer and makes provisioning both faster and more cost-efficient.
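The idea of choosing the instance to fit the workload, rather than the other way around, can be sketched as follows. This is not Karpenter's algorithm: the catalog entries, their sizes, and the prices are invented for illustration, and the sketch batches all pending Pods onto a single node and compares CPU and memory only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu_m: int          # allocatable CPU in millicores (illustrative)
    mem_mi: int         # allocatable memory in MiB (illustrative)
    hourly_usd: float   # made-up price, for comparison only


def cheapest_fit(catalog, pending):
    """Pick the cheapest type that holds the summed requests of all pending Pods."""
    need_cpu = sum(cpu for cpu, _ in pending)
    need_mem = sum(mem for _, mem in pending)
    candidates = [t for t in catalog
                  if t.cpu_m >= need_cpu and t.mem_mi >= need_mem]
    return min(candidates, key=lambda t: t.hourly_usd, default=None)


# Hypothetical catalog; names and prices are placeholders, not real offerings.
catalog = [
    InstanceType("small", 2000, 8192, 0.10),
    InstanceType("medium", 4000, 16384, 0.20),
    InstanceType("large", 8000, 32768, 0.40),
]
pending = [(500, 1024)] * 6   # six Pending Pods: 500m CPU, 1 GiB each

choice = cheapest_fit(catalog, pending)
print(choice.name if choice else "no single type fits")
```

With Node Groups, the same six Pods would get whatever shape the group was built from, even if that is twice what they need.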
Karpenter introduces two CRDs:
- NodePool: replaces the old Provisioner (pre-v0.32). Defines which instance families, zones, capacity types (on-demand or spot), taints/labels, and disruption budgets are allowed.
- EC2NodeClass (AWS-specific): describes the underlying EC2 configuration: AMI family, subnet selectors, security group selectors, instance profile, userData.
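To make the two CRDs less abstract, the sketch below builds a minimal NodePool as a Python dictionary and prints it as JSON, which kubectl apply -f accepts alongside YAML. The field layout follows the karpenter.sh/v1 API as I understand it and should be checked against the Karpenter version you run; the resource names, zones, and CPU limit are placeholders.

```python
import json

# Minimal NodePool sketch. Names, zones, and limits are hypothetical values.
node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "general-purpose"},
    "spec": {
        "template": {
            "spec": {
                # Points at the EC2NodeClass that holds the AWS-specific settings.
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",
                },
                # Constraints on what Karpenter may launch for this pool.
                "requirements": [
                    {"key": "karpenter.sh/capacity-type",
                     "operator": "In",
                     "values": ["spot", "on-demand"]},
                    {"key": "topology.kubernetes.io/zone",
                     "operator": "In",
                     "values": ["us-east-1a", "us-east-1b"]},
                ],
            }
        },
        # Cap on the total CPU this pool may provision.
        "limits": {"cpu": "200"},
    },
}

print(json.dumps(node_pool, indent=2))
```

The split mirrors the two bullets: the NodePool says what kinds of nodes are acceptable, while the referenced EC2NodeClass says how to build them on AWS.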
Spot Capacity: Saving 60–90% on Compute
Spot Instances (AWS) / Preemptible VMs (GCP) / Spot VMs (Azure) offer deep discounts, typically 60–90% below on-demand pricing, in exchange for the provider's right to reclaim the capacity on short notice (2 minutes on AWS). Used correctly, spot is transformative for batch workloads, stateless services, and even some stateful workloads with proper disruption handling.
Production rules for spot-safe workloads:
- Always define a PodDisruptionBudget (PDB) so Karpenter/CA cannot drain too many replicas simultaneously.
- Spread Pods across multiple instance families and Availability Zones. Spot pools are per-instance-type per-AZ, so diversification dramatically reduces simultaneous interruption risk.
- Handle SIGTERM gracefully. Your application must finish in-flight requests within terminationGracePeriodSeconds (recommend 60–120 s).
- Never run singleton critical components (etcd, CA itself, admission webhooks) on spot.
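For the SIGTERM rule, a minimal sketch of a graceful-shutdown pattern using Python's standard signal module is shown below. The serve function and its doubling step are stand-ins for real request handling, and a production server would also need to close listeners and fail its readiness probe; treat this as an outline of the idea only.

```python
import signal
import threading

# Set once SIGTERM arrives; checked before each new unit of work.
shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Mark the process as draining instead of exiting immediately."""
    shutting_down.set()


# Register the handler (must run in the main thread of the process).
signal.signal(signal.SIGTERM, handle_sigterm)


def serve(work_items):
    """Process items until SIGTERM has been seen, then stop taking new work."""
    done = []
    for item in work_items:
        if shutting_down.is_set():
            break                  # stop accepting new work
        done.append(item * 2)      # stand-in for handling one request
    return done


print(serve([1, 2, 3]))
```

Work that is already in progress completes, and only new work is refused. The whole drain has to finish before terminationGracePeriodSeconds elapses, after which the kubelet sends SIGKILL.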
Karpenter handles spot interruptions through an SQS interruption queue: when EC2 issues a spot interruption notice, Karpenter receives it from SQS, cordons the node immediately, and starts provisioning replacement capacity, all before the 2-minute window expires. This shortens recovery considerably compared with setups that only react once the node has already disappeared.
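The interruption flow can be sketched as a small message handler. The event shape (source aws.ec2, detail-type "EC2 Spot Instance Interruption Warning", and an instance-id inside detail) is the EventBridge format for spot warnings; everything else, including the function names, the node lookup table, and the list of actions, is an illustrative assumption rather than Karpenter's code.

```python
import json

SPOT_WARNING = "EC2 Spot Instance Interruption Warning"


def interrupted_instance(message_body: str):
    """Return the instance ID from a spot warning event, or None otherwise."""
    event = json.loads(message_body)
    if event.get("source") != "aws.ec2":
        return None
    if event.get("detail-type") != SPOT_WARNING:
        return None
    return event.get("detail", {}).get("instance-id")


def react(message_body: str, node_by_instance: dict) -> list:
    """Plan the steps described in the text for the affected node, if any."""
    instance_id = interrupted_instance(message_body)
    node = node_by_instance.get(instance_id)
    if node is None:
        return []                  # not a spot warning, or not our node
    return [f"cordon {node}",
            f"launch replacement for {node}",
            f"drain {node}"]


# Hypothetical message body as it might arrive from the queue.
body = json.dumps({
    "source": "aws.ec2",
    "detail-type": SPOT_WARNING,
    "detail": {"instance-id": "i-0abc", "instance-action": "terminate"},
})
print(react(body, {"i-0abc": "node-7"}))
```

Launching the replacement before the drain starts is the point of the pattern: the new node boots while the old one is still serving.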
Setting karpenter.k8s.aws/instance-category: [c, m, r] combined with instance-generation >= 4 typically yields 30–50 eligible instance types per zone, a wide spot pool that maximizes capacity availability.
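The effect of these two requirements can be illustrated by filtering a tiny catalog. The sketch assumes the generation constraint is written with the Gt operator and value "3" (that is, generation 4 or newer), uses four instance types as a stand-in for the real catalog, and implements only the In and Gt operators.

```python
def matches(labels: dict, requirements: list) -> bool:
    """Check one instance type's labels against every requirement."""
    for req in requirements:
        value = labels.get(req["key"])
        if value is None:
            return False
        if req["operator"] == "In":
            if value not in req["values"]:
                return False
        elif req["operator"] == "Gt":
            if not int(value) > int(req["values"][0]):
                return False
        else:
            raise ValueError(f"operator not covered by this sketch: {req['operator']}")
    return True


CATEGORY = "karpenter.k8s.aws/instance-category"
GENERATION = "karpenter.k8s.aws/instance-generation"

requirements = [
    {"key": CATEGORY, "operator": "In", "values": ["c", "m", "r"]},
    {"key": GENERATION, "operator": "Gt", "values": ["3"]},   # generation >= 4
]

# Tiny illustrative catalog; a real zone offers far more types.
catalog = {
    "c5.large": {CATEGORY: "c", GENERATION: "5"},
    "m3.medium": {CATEGORY: "m", GENERATION: "3"},
    "r6i.large": {CATEGORY: "r", GENERATION: "6"},
    "t3.medium": {CATEGORY: "t", GENERATION: "3"},
}

eligible = [name for name, labels in catalog.items() if matches(labels, requirements)]
print(eligible)
```

Every type that passes the filter is one more spot pool Karpenter can draw from, so loose requirements translate directly into better availability.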
Consolidation: Karpenter's Superpower
Consolidation is Karpenter's ability to continuously right-size your node fleet. When nodes are underutilized, Karpenter simulates whether all their Pods could fit on fewer (or smaller) nodes, then executes the bin-packing: it launches a cheaper replacement node, drains the inefficient nodes, and lets the Pods reschedule. This happens autonomously, within your PDB constraints, without any manual intervention.
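The simulation step can be sketched as a first-fit-decreasing check: could every Pod on a candidate node be placed into the free capacity of the remaining nodes? This covers only the simplest case (removing a node without launching a replacement), considers CPU alone, and uses a hypothetical function name and made-up numbers; Karpenter's real simulation runs full scheduling logic.

```python
def can_consolidate(candidate_pods_m, other_nodes_free_m) -> bool:
    """First-fit-decreasing: place each Pod's CPU request into remaining free CPU."""
    free = list(other_nodes_free_m)           # copy so the caller's data is untouched
    for request in sorted(candidate_pods_m, reverse=True):
        for i, available in enumerate(free):
            if request <= available:
                free[i] -= request            # reserve the space on that node
                break
        else:
            return False                      # this Pod fits nowhere
    return True


# Candidate node runs two Pods (500m and 300m); two other nodes have spare CPU.
print(can_consolidate([500, 300], [600, 400]))
print(can_consolidate([700], [600, 400]))
```

The second call fails even though 1000m is free in total, because no single node has 700m spare. Fragmentation of this kind is why consolidation sometimes launches a replacement node instead of simply deleting one.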
Set consolidationPolicy: WhenEmptyOrUnderutilized for full consolidation. Use WhenEmpty only for sensitive production tiers where you want to avoid any voluntary disruption of running Pods.
Avoid consolidateAfter: 0s in production. Aggressive consolidation triggers constant node churn, which disrupts Pods unnecessarily and can cause cascading failures if your workload has slow startup times. A value of 1m to 5m gives the scheduler time to stabilize after a scale event before another consolidation round fires.
Observing the Autoscaler in Action
Pair Karpenter with Kubernetes Metrics Server and HPA for a complete autoscaling stack: HPA scales Pods in response to CPU/memory/custom metrics, which causes Pending Pods when nodes are full, which Karpenter sees and resolves by launching new capacity. The two controllers operate on different resources (Pods vs Nodes), so they do not compete for the same objects.
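The hand-off between the two controllers can be traced with a little arithmetic. The replica formula below is the one documented for HPA (ceiling of current replicas times the ratio of current to target metric); the tolerance band and stabilization windows are ignored, and the free-slot count is a made-up stand-in for what the scheduler would actually find.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float) -> int:
    """HPA's core formula, without tolerance or stabilization handling."""
    return math.ceil(current_replicas * current_metric / target_metric)


def pending_after_scale(current_replicas: int, desired: int, free_slots: int) -> int:
    """Pods that cannot be placed on existing nodes and would stay Pending."""
    new_pods = max(0, desired - current_replicas)
    return max(0, new_pods - free_slots)


current = 4
desired = desired_replicas(current, current_metric=90, target_metric=60)
pending = pending_after_scale(current, desired, free_slots=1)  # hypothetical free space

print(f"HPA wants {desired} replicas; {pending} would be Pending until a node is added")
```

Those Pending Pods are the only signal Karpenter needs: it never reads the HPA's metrics, only the unschedulable Pods the HPA leaves behind.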