Capacity Planning & Autoscaling

VPA & Right-Sizing Workloads

Horizontal Pod Autoscaler (HPA) scales out by adding replicas. Vertical Pod Autoscaler (VPA) takes a different approach: it observes a workload over time and recommends, or directly applies, better CPU and memory requests and limits. On large clusters the payoff is substantial. Pods with heavily over-provisioned requests reserve allocatable capacity that the scheduler cannot give to anything else; pods with under-provisioned requests and limits get OOMKilled or CPU-throttled in ways that latency SLOs cannot tolerate. Right-sizing can reclaim a significant share of cluster cost (30–50% is the figure often quoted, though it depends on how over-provisioned you start) without touching a line of application code.

How VPA Works Internally

VPA consists of three components. The official manifests deploy them to the kube-system namespace; other installation methods may place them in a dedicated namespace:

Recommender: continuously watches CPU and memory usage via the Metrics API and aggregates the samples into a decaying histogram, so recent usage weighs more than old usage. The recommendation is taken from a high percentile of that histogram rather than the median, so it is intentionally sized to cover burst traffic.
Admission Controller (webhook): intercepts pod creation and mutates resource requests inline. Every update mode except Off applies recommendations through this path.
Updater: in Recreate and Auto modes, evicts pods whose current requests deviate significantly from the recommendation so that the admission controller can rewrite them when the replacement pod is created. Eviction respects PodDisruptionBudgets.
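To make the Recommender's "recent usage weighs more" idea concrete, here is a simplified sketch of a decay-weighted percentile. It is not the Recommender's code: the real component uses bucketed histograms and treats memory peaks separately. The function name weighted_percentile, the 24-hour half-life, and the sample values are all assumptions chosen for illustration.

```shell
# weighted_percentile PERCENTILE HALF_LIFE_HOURS
# Reads "age_hours value" lines on stdin. Each sample gets weight
# 2^(-age / half_life); prints the smallest value at which the cumulative
# weight reaches PERCENTILE percent of the total weight.
weighted_percentile() {
  awk -v hl="$2" '{ printf "%d %.6f\n", $2, 2 ^ (-$1 / hl) }' |
    sort -n |
    awk -v p="$1" '
      { value[NR] = $1; weight[NR] = $2; total += $2 }
      END {
        for (i = 1; i <= NR; i++) {
          acc += weight[i]
          if (acc >= total * p / 100) { print value[i]; exit }
        }
      }'
}

# Four recent samples at 400m and one three-day-old spike at 2000m (hypothetical).
samples='0 400
1 400
2 400
3 400
72 2000'

echo "$samples" | weighted_percentile 90 24        # decayed: prints 400
echo "$samples" | weighted_percentile 90 1000000   # almost no decay: prints 2000
```

With a 24-hour half-life the old spike carries one eighth of the weight of a fresh sample, so it no longer dominates the 90th percentile.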

VPA and HPA must not act on the same resource metric. If HPA scales on CPU utilization (the most common default), enabling VPA in Auto mode causes a feedback loop: VPA raises the CPU request, which lowers the observed utilization percentage (usage divided by request), which causes HPA to scale in, which concentrates load on fewer pods, which causes VPA to raise requests again. The safe combination is HPA on custom metrics (RPS, queue depth) while VPA manages requests/limits.
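The feedback loop is easier to see with numbers. The sketch below applies the documented HPA rule, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), with utilization taken as usage divided by request. The function name hpa_desired and every figure (6 replicas, 400m usage per pod, an 80% target) are illustrative assumptions, and the sketch ignores HPA's tolerance band and stabilization window.

```shell
# hpa_desired REPLICAS USAGE_M REQUEST_M TARGET_PCT
# Prints ceil(replicas * utilization / target), where utilization = usage / request.
# Integer millicores only; a sketch of the arithmetic, not the HPA controller.
hpa_desired() {
  num=$(( $1 * $2 * 100 ))
  den=$(( $3 * $4 ))
  echo $(( (num + den - 1) / den ))
}

# Before VPA acts: 400m used of a 500m request is 80%, exactly on target.
hpa_desired 6 400 500 80    # prints 6

# VPA raises the request to 800m: utilization drops to 50%, so HPA scales in.
hpa_desired 6 400 800 80    # prints 4
```

With 4 replicas the same 2400m of total load becomes 600m per pod, which VPA then reads as grounds to raise the request again.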

Installing VPA

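VPA is not bundled with Kubernetes. Install it from the autoscaler repository. The Recommender reads from the Metrics API, so metrics-server must already be running in the cluster. The default installation targets the kube-system namespace and uses a self-signed cert for the admission webhook.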

# Clone the official autoscaler repo (pin to a release tag in production)
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler

# Generate certs and deploy all three components
./hack/vpa-up.sh

# Verify all three components are running
kubectl -n kube-system get pods | grep vpa
# vpa-admission-controller-xxxx   1/1   Running
# vpa-recommender-xxxx            1/1   Running
# vpa-updater-xxxx                1/1   Running

VPA Update Modes

The updatePolicy.updateMode field controls how aggressively VPA applies its recommendations. Choose based on how disruptive a pod restart is for your workload:

Off: the Recommender runs and writes recommendations to the VPA status; nothing is applied automatically. Use this for the first 7–14 days on any new workload to audit recommendations before trusting them.
Initial: the admission controller applies recommendations at pod creation time only. Existing pods are never evicted. A good fit for stateful workloads or pods where a restart is expensive.
Recreate: the Updater evicts pods when their current requests deviate far from the recommendation, but only if the pod can be safely evicted per its PDB. Useful for Deployments with multiple replicas.
Auto: currently behaves the same as Recreate. In-place pod resizing (KEP-1287) was alpha in Kubernetes 1.27 and has been beta since 1.33; recent VPA releases expose it through a separate InPlaceOrRecreate mode, so check the release notes for the version you run before relying on in-place updates.

# VPA object targeting a web API Deployment — start in Off mode to collect data
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4000m
        memory: 8Gi
      controlledResources:
        - cpu
        - memory
      controlledValues: RequestsAndLimits

After a week, inspect the recommendation:

kubectl describe vpa api-server-vpa -n production

# Look for the Recommendation section:
# Recommendation:
#   Container Recommendations:
#     Container Name:  api
#     Lower Bound:
#       Cpu:     250m
#       Memory:  512Mi
#     Target:
#       Cpu:     620m
#       Memory:  920Mi
#     Uncapped Target:
#       Cpu:     620m
#       Memory:  920Mi
#     Upper Bound:
#       Cpu:     2100m
#       Memory:  3Gi

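# Target is what VPA would set as the new request.
# Uncapped Target is the same value before minAllowed/maxAllowed are applied.
# Lower Bound and Upper Bound mark the range VPA considers acceptable; the
# Updater acts on pods whose requests fall outside it. The range is wide
# while history is short and narrows as more samples arrive.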
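The bounds in that output can be compared against a container's current CPU request. The helper below is a sketch under one assumption: that a request outside the Lower Bound to Upper Bound range is what makes a pod a candidate for eviction. The real Updater also weighs other factors, such as pod age and the size of the change. Both function names are hypothetical.

```shell
# cpu_to_millicores QUANTITY: "620m" -> 620, "2" -> 2000, "0.5" -> 500
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v v="$1" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;;
  esac
}

# outside_recommended_range CURRENT LOWER UPPER
# Succeeds (exit 0) when CURRENT is below LOWER or above UPPER.
outside_recommended_range() {
  cur=$(cpu_to_millicores "$1")
  lo=$(cpu_to_millicores "$2")
  hi=$(cpu_to_millicores "$3")
  [ "$cur" -lt "$lo" ] || [ "$cur" -gt "$hi" ]
}

# Bounds from the sample recommendation: 250m to 2100m.
if outside_recommended_range 100m 250m 2100m; then
  echo "100m: outside range, candidate for eviction"
fi
if ! outside_recommended_range 1 250m 2100m; then
  echo "1 CPU: inside range, left alone"
fi
```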

Goldilocks: Fleet-Wide Right-Sizing Analysis

Running individual VPA objects per workload in Off mode is the right approach, but reading kubectl describe vpa for 200 Deployments is not. Goldilocks (by Fairwinds) automates this: it creates a VPA object in Off mode for every Deployment in a namespace you label, then presents a web dashboard with current requests vs. recommended requests side by side, along with estimated monthly cost savings.

# Install Goldilocks via Helm
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks \
  --create-namespace

# Label any namespace to enable Goldilocks monitoring
kubectl label namespace production goldilocks.fairwinds.com/enabled=true

# Port-forward the dashboard
kubectl -n goldilocks port-forward svc/goldilocks-dashboard 8080:80

# Open http://localhost:8080 — shows all Deployments in labeled namespaces
# with columns: current request, VPA target, estimated monthly savings

VPA's three controllers read metrics, write recommendations, and apply them on pod creation or restart. Goldilocks reads VPA objects fleet-wide for cost analysis.

Setting resourcePolicy Bounds Correctly

Without bounds, VPA will recommend whatever its model sees — potentially setting a memory request to 32 GiB for a workload that had one outlier OOM event, or reducing a CPU request to 10m for a batch job that is simply idle at night. Always set minAllowed and maxAllowed per container:

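minAllowed: never let VPA drop below the known minimum your workload needs for startup, JVM heap warm-up, or passing health checks. For Java services, a safe floor is typically 512Mi memory even for lightly loaded replicas.
maxAllowed: cap at a fraction of your largest node's allocatable capacity. A pod that requests more than 50% of a node's allocatable CPU or memory can only be scheduled one per node, and the rest of that node is often stranded, which defeats the purpose.
controlledValues: RequestsOnly vs. RequestsAndLimits: this field is set per container and applies to every resource in controlledResources, so it cannot differ between CPU and memory within one container. With RequestsAndLimits (the default), VPA scales each existing limit in proportion to its request and does not add a limit where the pod template defines none. To run without a CPU limit (a common pattern to avoid CFS throttling) while keeping a memory limit that tracks actual usage, omit limits.cpu from the pod template and use RequestsAndLimits. Choose RequestsOnly when VPA should leave all limits untouched.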
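The reason for capping maxAllowed well below a node's allocatable capacity is plain bin-packing arithmetic. The sketch below looks at CPU only and ignores DaemonSets, system reservations, and memory. The node size of 16000m and the function name pods_per_node are assumptions for illustration.

```shell
# pods_per_node NODE_ALLOCATABLE_M POD_REQUEST_M
# Prints how many pods fit on one node by CPU request and the millicores left over.
pods_per_node() {
  fit=$(( $1 / $2 ))
  echo "$fit pods, $(( $1 - fit * $2 ))m stranded"
}

pods_per_node 16000 4000   # prints "4 pods, 0m stranded"
pods_per_node 16000 8500   # prints "1 pods, 7500m stranded"
```

Once a single request passes half the node, only one pod fits and nearly half the node is left for whatever small workloads happen to land there.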

The "no CPU limit" pattern: a widely used practice for latency-sensitive services is to omit CPU limits entirely. CFS throttling is easy to miss: it generates no Kubernetes event and shows up only in cgroup statistics (for example the cAdvisor metric container_cpu_cfs_throttled_periods_total), while causing p99 latency spikes that look like application regressions. Set only requests.cpu so the scheduler can bin-pack correctly, and let the kernel's fair-share scheduling handle bursts. VPA preserves this choice: it does not add a CPU limit to a container whose pod template defines none, and controlledValues: RequestsOnly stops it from changing limits at all.
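When controlledValues is RequestsAndLimits and a container does define a limit, VPA keeps the limit-to-request ratio from the pod template as it changes the request. A minimal sketch of that proportional scaling follows. The function name scaled_limit and the 512Mi/1024Mi starting values are assumptions; the 920Mi target is taken from the sample recommendation earlier in the lesson.

```shell
# scaled_limit OLD_REQUEST OLD_LIMIT NEW_REQUEST  (all in the same unit, e.g. Mi)
# Prints the limit that keeps the original limit:request ratio.
scaled_limit() {
  echo $(( $3 * $2 / $1 ))
}

# Template: request 512Mi, limit 1024Mi (ratio 2). VPA target: 920Mi.
scaled_limit 512 1024 920   # prints 1840
```

A container with a generous ratio in its template therefore ends up with a generous limit after VPA acts, which is worth checking against maxAllowed and node size.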

Production Failure Modes to Know

VPA in Auto mode can take down or destabilize a Deployment in the following scenarios, all of which have been seen in production:

PDB misconfiguration: A Deployment with 3 replicas whose PDB sets maxUnavailable: 3 (or that has no PDB at all) places no disruption-budget constraint on the VPA Updater, which in the worst case can evict every pod within a short window. Enforce a minimum PDB of minAvailable: 1 or maxUnavailable: 33% for every Deployment that serves traffic.
Cold-start amplification: A workload that uses very little CPU at idle but spikes heavily during startup or on the first request (Node.js, Python with lazy imports, JVM). VPA observes idle usage and cuts the CPU request. The next deploy starts pods with the new low request, warm-up is throttled or starved of CPU, health checks fail, the rollout stalls or is rolled back, and the cycle repeats. Mitigation: set minAllowed.cpu to the measured p95 warm-up consumption.
Recommendation churn: A workload with highly variable memory (e.g., a batch job mixed with a web server in the same pod). VPA oscillates between low and high recommendations, causing frequent evictions. Solution: split batch and web into separate containers or separate Deployments so VPA can model each independently.

Never run VPA in Auto mode on single-replica Deployments or StatefulSets in production without a verified restart procedure. By default the Updater skips workloads with fewer than two replicas (the --min-replicas flag, overridable per object with updatePolicy.minReplicas), and a PDB with minAvailable: 1 on a single replica blocks eviction as well. Either way the recommendation is never applied to the running pod, leaving you stuck. The correct approach for single-replica workloads is updateMode: Initial and a manual rolling restart during a maintenance window.
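The PDB arithmetic behind both the "evict everything" and the "stuck" cases is small enough to sketch. For a minAvailable budget, the number of voluntary disruptions allowed at once is the count of healthy pods minus minAvailable, floored at zero. The function name allowed_disruptions is hypothetical.

```shell
# allowed_disruptions HEALTHY_PODS MIN_AVAILABLE
# Prints how many pods may be evicted at once under a minAvailable PDB.
allowed_disruptions() {
  n=$(( $1 - $2 ))
  if [ "$n" -lt 0 ]; then n=0; fi
  echo "$n"
}

allowed_disruptions 3 1   # prints 2: evictions proceed, one pod always stays
allowed_disruptions 1 1   # prints 0: eviction blocked, recommendation never applied
allowed_disruptions 3 0   # prints 3: nothing stops all pods going at once
```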

Integrating VPA Recommendations into GitOps

In a GitOps workflow (ArgoCD, Flux), VPA Auto mode undermines git as the source of truth. The admission webhook rewrites requests on pods as they are created, not on the Deployment, so the values actually running in the cluster diverge from the manifest in git, often without the reconciler reporting any drift. The canonical solution is to keep VPA in Off mode, export recommendations periodically, and feed them back into git via a pipeline stage.

#!/usr/bin/env bash
# Export VPA recommendations for all VPA objects in a namespace.
# Run as a scheduled CI job; opens a PR when the recommendations change.
set -euo pipefail

NAMESPACE=production
OUTPUT_DIR=k8s/resource-recommendations
OUTPUT_FILE="$OUTPUT_DIR/recommendations.txt"

mkdir -p "$OUTPUT_DIR"
: > "$OUTPUT_FILE"   # start clean so reruns do not append duplicate lines

for vpa in $(kubectl get vpa -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}'); do
  TARGET=$(kubectl get vpa "$vpa" -n "$NAMESPACE" \
    -o jsonpath='{.spec.targetRef.name}')

  # Emit one line per container rather than only containerRecommendations[0]
  kubectl get vpa "$vpa" -n "$NAMESPACE" \
    -o jsonpath='{range .status.recommendation.containerRecommendations[*]}{.containerName} | CPU: {.target.cpu} | Memory: {.target.memory}{"\n"}{end}' |
    while read -r rec; do
      echo "Deployment: $TARGET | Container: $rec"
    done >> "$OUTPUT_FILE"
done

# git diff ignores untracked files, so use git status to catch a first export too
if [ -z "$(git status --porcelain -- "$OUTPUT_DIR")" ]; then
  echo "No recommendation changes."
  exit 0
fi

# A PR needs a pushed branch with a commit on it
BRANCH="vpa-recommendations-$(date +%F)"
git switch -c "$BRANCH"
git add "$OUTPUT_DIR"
git commit -m "chore: VPA resource recommendations $(date +%F)"
git push -u origin "$BRANCH"

gh pr create --title "chore: VPA resource recommendations $(date +%F)" \
  --body "Automated VPA recommendation export. Review before merging." \
  --base main

This keeps git as the source of truth while still benefiting from VPA's data-driven analysis. Teams operating at large scale commonly use this pattern to run periodic (for example quarterly) right-sizing reviews rather than relying on fully automated mutation in production.