VPA & Right-Sizing Workloads
Horizontal Pod Autoscaler scales out by adding replicas. Vertical Pod Autoscaler (VPA) takes a different approach: it observes a workload over time and recommends, or directly applies, better CPU and memory requests and limits. At Google-scale, VPA is not optional. Pods with wildly over-provisioned requests starve the scheduler of allocatable capacity; pods with under-provisioned requests get OOMKilled or CPU-throttled in ways that latency SLOs cannot tolerate. Right-sizing is how you can reclaim a large share of cluster cost (30–50% on heavily over-provisioned clusters) without touching a line of application code.
How VPA Works Internally
VPA consists of three components that run in the kube-system namespace by default (or in a dedicated namespace if you customize the manifests or install through a Helm chart):
- Recommender: continuously watches CPU and memory usage via the Metrics API and builds a recommendation from exponentially decaying usage histograms, so recent samples count more than old ones. The target is a high percentile of observed usage rather than the median, which leaves headroom for burst traffic.
- Admission Controller (webhook): intercepts pod creation and mutates resource requests inline. This is the path for updateMode: Auto and Initial.
- Updater: for Auto mode, evicts pods whose current requests deviate significantly from the recommendation so the admission controller can rewrite them on restart. Eviction respects PodDisruptionBudgets.
Running VPA in Auto mode alongside an HPA that scales on CPU or memory utilization causes a feedback loop: VPA raises the CPU request, which lowers the observed utilization percentage, which causes HPA to scale in, which concentrates load, which causes VPA to raise requests again. The safe combination is HPA on custom metrics (RPS, queue depth) while VPA manages requests/limits.
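As a rough sketch of the HPA half of that combination, the manifest below scales on a per-pod request rate instead of CPU utilization. The Deployment name web, the metric name http_requests_per_second, and the replica and target numbers are all assumptions for illustration; the metric has to be served by a custom metrics adapter that you run yourself.

```yaml
# Sketch only: HPA driven by a custom per-pod metric, so VPA changes to
# CPU requests do not feed back into the scaling decision.
# "web" and "http_requests_per_second" are hypothetical names.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # must be exposed via a custom metrics adapter
        target:
          type: AverageValue
          averageValue: "100"              # illustrative target per pod
```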
Installing VPA
VPA is not bundled with Kubernetes. Install it from the autoscaler repository. The default installation targets the kube-system namespace and uses a self-signed cert for the admission webhook.
VPA Update Modes
The updatePolicy.updateMode field controls how aggressively VPA applies its recommendations. Choose based on how disruptive a pod restart is for your workload:
- Off: Recommender runs and writes recommendations to the VPA status; nothing is applied automatically. Use this for the first 7–14 days on any new workload to audit recommendations before trusting them.
- Initial: Admission controller applies recommendations at pod creation time only. Existing pods are never evicted. Safe for stateful workloads or pods where restart is expensive.
- Recreate: Updater will evict pods when the current requests deviate far from the recommendation, but only if the pod can be safely evicted per its PDB. Useful for Deployments with multiple replicas.
- Auto: Same as Recreate currently; in the future may apply in-place updates when the Kubernetes in-place resize API stabilizes (KEP-1287, alpha in 1.27+).
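A minimal VPA object for the audit phase might look like the following. The names web and web-vpa are placeholders for your own workload.

```yaml
# Sketch: recommendation-only VPA for a Deployment named "web" (hypothetical).
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    # Keep the quotes: an unquoted Off is read as boolean false by YAML 1.1 parsers.
    updateMode: "Off"
```

Switching to another mode later only means changing the updateMode value; the Recommender's accumulated history is unaffected.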
After a week, inspect the recommendation:
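The recommendation is stored under status.recommendation of the VPA object, which you can see in the output of kubectl get vpa <name> -o yaml. The fragment below shows the shape of that field; the container name and every number are made up for illustration.

```yaml
# Illustrative status fragment; values are invented, not measured.
status:
  recommendation:
    containerRecommendations:
      - containerName: web
        lowerBound:        # below this, the pod is considered under-provisioned
          cpu: 150m
          memory: 300Mi
        target:            # what VPA would apply as the request
          cpu: 250m
          memory: 420Mi
        uncappedTarget:    # target before minAllowed/maxAllowed are applied
          cpu: 250m
          memory: 420Mi
        upperBound:        # above this, the pod is considered over-provisioned
          cpu: "1"
          memory: 900Mi
```

Compare target against the requests currently in your manifest; a large gap between target and uncappedTarget means your bounds are doing the work rather than the model.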
Goldilocks: Fleet-Wide Right-Sizing Analysis
Running individual VPA objects per workload in Off mode is the right approach, but reading kubectl describe vpa for 200 Deployments is not. Goldilocks (by Fairwinds) automates this: it creates a VPA object in Off mode for every Deployment in a namespace you label, then presents a web dashboard with current requests vs. recommended requests side by side, along with estimated monthly cost savings.
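Opting a namespace in could look like this. The namespace name payments is hypothetical, and the label key is the one given in the Goldilocks documentation at the time of writing, so check it against the version you install.

```yaml
# Sketch: namespace opted in to Goldilocks via its enable label.
apiVersion: v1
kind: Namespace
metadata:
  name: payments                                 # hypothetical namespace
  labels:
    goldilocks.fairwinds.com/enabled: "true"     # label values must be strings
```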
Setting resourcePolicy Bounds Correctly
Without bounds, VPA will recommend whatever its model sees — potentially setting a memory request to 32 GiB for a workload that had one outlier OOM event, or reducing a CPU request to 10m for a batch job that is simply idle at night. Always set minAllowed and maxAllowed per container:
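- minAllowed: never let VPA drop below your known minimum for startup, JVM heap warm-up, or passing health checks. For Java services, a safe floor is typically 512Mi memory even for lightly loaded replicas.
- maxAllowed: cap at a fraction of your largest node's allocatable capacity. A pod that requests more than 50% of a node's allocatable CPU or memory cannot share that node with a second pod of the same size, so pod density drops and the leftover capacity is stranded, which defeats the purpose.
- controlledValues: RequestsOnly vs. RequestsAndLimits: for CPU, prefer RequestsOnly with no CPU limit (a Google-recommended pattern to avoid CFS throttling). For memory, use RequestsAndLimits so that OOM budgets scale with actual usage. The field is set per container, not per resource, so a container that needs both behaviors can use RequestsAndLimits and leave the CPU limit unset in the pod spec; VPA scales existing limits in proportion to requests and does not add a limit where none is defined.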
For CPU, set requests.cpu so the scheduler can bin-pack correctly, leave the CPU limit unset, and let the node's fair-share scheduler handle bursts. Use controlledValues: RequestsOnly in VPA so that it manages requests and never touches limits.
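Put together, a bounded policy might be sketched as follows. The workload and container name api and all of the quantities are assumptions; derive the floor from measured warm-up usage and the ceiling from your node sizes.

```yaml
# Sketch: VPA with per-container bounds, managing requests only.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                        # hypothetical Deployment
  updatePolicy:
    updateMode: "Initial"
  resourcePolicy:
    containerPolicies:
      - containerName: api           # must match the container name in the pod spec
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsOnly
        minAllowed:                  # floor: illustrative warm-up minimum
          cpu: 250m
          memory: 512Mi
        maxAllowed:                  # ceiling: illustrative, well under half a node
          cpu: "4"
          memory: 8Gi
```

Any recommendation outside the bounds is clamped, and the unclamped value stays visible as uncappedTarget in the VPA status.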
Production Failure Modes to Know
VPA in Auto mode has evicted entire Deployments to zero in the following scenarios — all seen in production:
- PDB misconfiguration: A Deployment with 3 replicas and maxUnavailable: 3 (or no PDB at all) allows VPA Updater to evict all pods simultaneously. Enforce a minimum PDB of minAvailable: 1 or maxUnavailable: 33% for every Deployment that serves traffic.
- Cold-start amplification: A workload that uses very little CPU at idle but spikes heavily on first request (Node.js, Python with lazy imports, JVM). VPA observes idle usage and cuts the CPU request. The next deploy restarts pods with the new low request, the JVM warm-up causes throttling, health checks fail, the Deployment rolls back, and the cycle repeats. Mitigation: minAllowed.cpu set to the measured p95 warm-up consumption.
- Recommendation churn: A workload with highly variable memory (e.g., a batch job mixed with a web server in the same pod). VPA oscillates between low and high recommendations, causing frequent evictions. Solution: split batch and web into separate containers or separate Deployments so VPA can model each independently.
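The PDB floor from the first scenario can be sketched as below. The name web-pdb and the app: web selector are assumptions and must match the labels on your pods.

```yaml
# Sketch: PDB that lets the Updater evict at most about a third of the pods at once.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 33%      # percentages round up, so this allows 1 of 3 replicas
  selector:
    matchLabels:
      app: web             # hypothetical pod label
```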
If the PDB requires minAvailable: 1 and you only have 1 replica, the eviction is blocked, but the recommendation is never applied either, leaving you stuck. The correct approach for single-replica workloads is updateMode: Initial and a manual rolling restart during a maintenance window.
Integrating VPA Recommendations into GitOps
In a GitOps workflow (ArgoCD, Flux), VPA Auto mode undermines git as the source of truth: VPA rewrites requests on pods in-cluster, while the manifests in git, which the reconciler keeps enforcing on the Deployment, still carry the original requests. The canonical solution is to keep VPA in Off mode, export recommendations periodically, and feed them back into git via a pipeline stage.
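One shape the committed change might take is a small patch file that a pipeline stage regenerates from the VPA target values and opens as a pull request. The file name, the workload name web, and the quantities are hypothetical; the fragment is written as a strategic merge patch of the kind Kustomize can apply.

```yaml
# deployment-resources-patch.yaml (hypothetical file written by the pipeline)
# Carries only the requests copied from the VPA target; everything else
# in the Deployment stays defined in the base manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  template:
    spec:
      containers:
        - name: web
          resources:
            requests:
              cpu: 250m        # illustrative, from status.recommendation target
              memory: 420Mi    # illustrative, from status.recommendation target
```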
This keeps git as the source of truth while still benefiting from VPA's data-driven analysis. Engineering teams at Spotify, Shopify, and companies of similar scale use this pattern to run quarterly right-sizing reviews rather than relying on fully automated mutation in production.