FinOps & Cloud Cost Optimization

Kubernetes Cost Management

Kubernetes solved resource scheduling beautifully but shipped with almost no cost accountability. A cluster is a shared pool: compute, memory, and network are consumed by hundreds of pods across dozens of namespaces, yet the cloud bill arrives as a single line item — the node pool. Recovering per-team, per-service cost from that shared pool is the central challenge of Kubernetes FinOps, and at big-tech scale it determines whether engineering leadership can make rational investment decisions or is operating in the dark.

Why Kubernetes Cost Is Hard

Three structural properties make Kubernetes cost uniquely difficult compared to VM-based infrastructure.

Bin packing obscures ownership. The scheduler places pods from different teams on the same node. The node cost is real, but attributing it to a workload requires knowing both the resource requested by each pod and the idle capacity on the node — and deciding who pays for that idle capacity.
Requests vs. limits vs. actual usage diverge. A pod can request 2 CPU and 4 GiB, be limited to 4 CPU and 8 GiB, and actually consume 0.3 CPU and 900 MiB. Cost tools must decide which number to use. Charging on requests is conservative and predictable; charging on actual usage is accurate but volatile and harder to budget against.
Shared infrastructure has no single owner. Cluster add-ons — CoreDNS, kube-proxy, metrics-server, the ingress controller, the CNI daemonset — consume real resources but cannot be attributed to any team. This overhead typically runs 8–15% of total cluster capacity and must be socialised across consumers.
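
The choice between charging on requests and charging on usage can be made concrete with the pod described above (2 CPU requested, 0.3 CPU used). This is a minimal sketch: the `PodCpu` type, the `hourly_cpu_charge` helper, and the $0.04 per core-hour price are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PodCpu:
    requested: float  # cores reserved by the scheduler
    limit: float      # hard ceiling in cores
    used: float       # average cores actually consumed


def hourly_cpu_charge(pod: PodCpu, price_per_core_hour: float, basis: str) -> float:
    """Return the hourly CPU charge for a pod under one charging basis."""
    if basis == "requests":
        cores = pod.requested
    elif basis == "usage":
        cores = pod.used
    else:
        raise ValueError(f"unknown charging basis: {basis!r}")
    return cores * price_per_core_hour


# The pod from the text: requests 2 CPU, limited to 4 CPU, uses 0.3 CPU.
pod = PodCpu(requested=2.0, limit=4.0, used=0.3)
PRICE = 0.04  # hypothetical price per core-hour

by_request = hourly_cpu_charge(pod, PRICE, "requests")
by_usage = hourly_cpu_charge(pod, PRICE, "usage")
print(f"requests basis: ${by_request:.3f}/h, usage basis: ${by_usage:.3f}/h")
```

The request-based figure stays constant as long as the manifest does, which is what makes it easy to budget against; the usage-based figure moves with traffic.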

Industry baseline: Most mature Kubernetes platforms at 500+ node scale report 20–35% idle capacity under normal operating conditions, because applications over-request to avoid OOMKills and throttling. Recovering even half of that waste through better request tuning is typically worth more than any Reserved Instance discount programme.

The Namespace-as-Team Model

The standard big-tech approach is to map cost boundaries to Kubernetes namespaces and enforce them with labels. Every team owns one or more namespaces, and every namespace carries team, env, and cost-centre labels, either injected by a MutatingAdmissionWebhook or required by OPA/Kyverno policies at the namespace level. Cost is then aggregated per namespace and reported weekly to team leads as showback, and monthly to finance as chargeback.

Resource quotas make the model operational: without them, a single misconfigured deployment in one namespace can consume the entire node pool and starve other teams. Every namespace that participates in chargeback should have a ResourceQuota and a LimitRange.

# Namespace with mandatory cost labels
apiVersion: v1
kind: Namespace
metadata:
  name: payments-prod
  labels:
    team: payments
    env: production
    cost-centre: cc-1042
---
# ResourceQuota: hard ceiling per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-prod-quota
  namespace: payments-prod
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    count/pods: "200"
---
# LimitRange: default requests/limits so pods without explicit values
# do not schedule with zero requests (which hides cost attribution)
apiVersion: v1
kind: LimitRange
metadata:
  name: payments-prod-defaults
  namespace: payments-prod
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: 512Mi
    defaultRequest:
      cpu: "100m"
      memory: 128Mi
    max:
      cpu: "8"
      memory: 16Gi
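
The label requirement is normally enforced by the admission layer, but the underlying check is simple enough to sketch. The snippet below is an illustration only: `REQUIRED_LABELS` and `missing_cost_labels` are hypothetical names, and a real cluster would express the same rule as a webhook or an OPA/Kyverno policy rather than as a script.

```python
REQUIRED_LABELS = ("team", "env", "cost-centre")


def missing_cost_labels(labels: dict) -> list:
    """Return the required cost labels that are absent or empty."""
    return [key for key in REQUIRED_LABELS if not labels.get(key)]


# Labels copied from the payments-prod namespace manifest above.
compliant = {"team": "payments", "env": "production", "cost-centre": "cc-1042"}
incomplete = {"team": "payments"}

print(missing_cost_labels(compliant))
print(missing_cost_labels(incomplete))
```

A policy built on this check would reject (or flag) any namespace for which the returned list is not empty.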

Bin Packing: Turning Idle Capacity Into Savings

The gap between what nodes provide and what pods request is called cluster slack. At 1,000 nodes × $0.50/node-hour the fleet costs $12,000/day, so 25% slack costs $3,000/day. There are three levers to reduce it.

1. Vertical Pod Autoscaler (VPA) in recommendation mode. Run VPA in Off mode first — it emits recommendations without acting — so you can audit request accuracy before enabling auto-updates. After 7 days of data, sort by the ratio of requested CPU to recommended CPU; workloads with a ratio above 4x are the highest-priority right-sizing targets.

# VPA in recommendation-only mode (safe for production audit)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-gateway-vpa
  namespace: platform-prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  updatePolicy:
    updateMode: "Off"          # Recommend only; do not evict pods

# After a week, read recommendations:
# kubectl get vpa api-gateway-vpa -n platform-prod -o json \
#   | jq '.status.recommendation.containerRecommendations[]
#          | {container: .containerName,
#             lowerBound: .lowerBound,
#             target: .target,
#             upperBound: .upperBound}'

2. Node consolidation via Karpenter (or the Cluster Autoscaler's scale-down of underutilised nodes). Karpenter's disruption.consolidationPolicy: WhenUnderutilized (named WhenEmptyOrUnderutilized in the v1 NodePool API) drains underutilised nodes and repacks workloads onto fewer, fuller nodes. At steady state this typically improves bin-packing efficiency from 55–65% to 75–85% within a few hours of enabling it.

3. Node instance family selection. Karpenter's NodePool lets you list the instance families it is allowed to launch. Mixing m7i (x86) with m7g and c7g (both Graviton) lets Karpenter pick the cheapest instance type that fits the pending pods, provided your images are built for both architectures. Graviton instances are typically 20% cheaper per vCPU than x86 equivalents for the same workload, but validate with your own benchmarks; memory-intensive or x86-only workloads may not benefit.
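
For the first lever, the ranking step (requested CPU divided by recommended CPU, keep anything above 4x) can be sketched as follows. The workload names and quantities are made up, `parse_cpu` and `rank_rightsizing_targets` are hypothetical helpers, and the parser assumes CPU quantities are either plain cores ("2") or millicores ("250m").

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("250m" or "2") to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def rank_rightsizing_targets(workloads, threshold=4.0):
    """Return (name, ratio) pairs above the threshold, worst offender first.

    workloads: iterable of (name, requested_cpu, recommended_cpu) tuples,
    where recommended_cpu is the VPA target for the container.
    """
    ranked = []
    for name, requested, recommended in workloads:
        ratio = parse_cpu(requested) / parse_cpu(recommended)
        if ratio > threshold:
            ranked.append((name, ratio))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical audit data: (workload, current request, VPA target).
audit = [
    ("api-gateway", "2", "250m"),
    ("checkout", "500m", "400m"),
    ("search", "8", "500m"),
]

for name, ratio in rank_rightsizing_targets(audit):
    print(f"{name}: requests {ratio:.1f}x the recommended CPU")
```

In practice the tuples would be built from the `kubectl get vpa ... -o json` output rather than typed by hand.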

Kubecost-Style Visibility

Kubecost (and its OSS core, OpenCost) runs inside the cluster and continuously models the cost of every pod, deployment, namespace, and label combination by combining Prometheus resource metrics with cloud provider pricing APIs. This gives you sub-hour cost attribution without writing any custom tooling.

OpenCost attribution pipeline: Prometheus metrics and cloud pricing APIs feed the cost model, which surfaces namespace, workload, and efficiency views.

The core cost model is: pod cost = (CPU requested / node CPU) × the CPU share of the node hourly rate × hours running, plus the same calculation for memory against the memory share of the rate, then summed. Idle node capacity is split across all pods in proportion to their requests. Because charges follow requests rather than usage, teams that over-request pay for their own waste rather than socialising it to neighbours.
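
The model can be sketched for a single node as below. This is an illustration, not OpenCost's implementation: the `Node` and `Pod` types and `allocate_node_cost` are hypothetical, and the sketch assumes the node's hourly rate is divided between CPU and memory by a `cpu_weight` parameter (0.5 here, an arbitrary choice) so that the two terms do not double-count the node.

```python
from dataclasses import dataclass


@dataclass
class Node:
    cpu_cores: float
    memory_gib: float
    hourly_rate: float  # dollars per node-hour


@dataclass
class Pod:
    name: str
    cpu_request: float         # cores
    memory_request_gib: float  # GiB


def allocate_node_cost(node, pods, hours, cpu_weight=0.5):
    """Split one node's cost across its pods, including idle capacity."""
    cpu_budget = node.hourly_rate * cpu_weight * hours
    mem_budget = node.hourly_rate * (1 - cpu_weight) * hours
    total_cpu = sum(p.cpu_request for p in pods)
    total_mem = sum(p.memory_request_gib for p in pods)
    # Cost of capacity that no pod has requested.
    idle_cpu = cpu_budget * (1 - total_cpu / node.cpu_cores)
    idle_mem = mem_budget * (1 - total_mem / node.memory_gib)

    costs = {}
    for p in pods:
        direct = (p.cpu_request / node.cpu_cores) * cpu_budget \
            + (p.memory_request_gib / node.memory_gib) * mem_budget
        # Idle cost follows each pod's share of total requests.
        cpu_share = p.cpu_request / total_cpu if total_cpu else 0.0
        mem_share = p.memory_request_gib / total_mem if total_mem else 0.0
        idle = idle_cpu * cpu_share + idle_mem * mem_share
        costs[p.name] = {"direct": direct, "idle": idle, "total": direct + idle}
    return costs


node = Node(cpu_cores=16, memory_gib=64, hourly_rate=0.50)
pods = [Pod("a", 4, 8), Pod("b", 2, 24)]
for name, cost in allocate_node_cost(node, pods, hours=24).items():
    print(name, {k: round(v, 2) for k, v in cost.items()})
```

With idle cost distributed this way, the per-pod totals add up to the full node cost for the period, so nothing is left unattributed.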

# Install OpenCost (Helm) alongside Prometheus
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update

helm install opencost opencost/opencost \
  --namespace opencost \
  --create-namespace \
  --set opencost.exporter.cloudProviderApiKey="$(cat /path/to/aws-billing-key)" \
  --set opencost.prometheus.external.enabled=true \
  --set opencost.prometheus.external.url="http://prometheus-operated.monitoring:9090"

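# Query cost by namespace for the past 7 days via the OpenCost API.
# accumulate=true asks for a single summed set, so .data[0] covers the whole
# window; with accumulate=false the response may be split into several sets
# and .data[0] would then hold only the first one.
kubectl port-forward svc/opencost 9003 -n opencost &
sleep 2   # give the port-forward a moment to start listening
curl -s "http://localhost:9003/allocation/compute?window=7d&aggregate=namespace&accumulate=true" \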
  | jq '.data[0] | to_entries[]
        | {namespace: .key,
           cpuCost: (.value.cpuCost | . * 100 | round / 100),
           memoryCost: (.value.memoryCost | . * 100 | round / 100),
           totalCost: (.value.totalCost | . * 100 | round / 100),
           efficiency: (.value.totalEfficiency | . * 100 | round / 100)}'

Rightsizing at Scale: The Automated Feedback Loop

Manual right-sizing does not scale past 50 services. The production pattern is an automated weekly pipeline: VPA recommendations are collected, filtered for statistical significance (a workload must have at least 7 days of data and variance below 40%), and surfaced as pull requests against the team's Helm values file with the current and recommended request values side by side. The PR is auto-approved if the change reduces requests by more than 20% and the service has an HPA configured (so it can scale out if needed).
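
The filtering and auto-approval rules can be sketched as a small triage function. Everything here is hypothetical scaffolding (the `Recommendation` type, `triage`, and the sample data), and the sketch assumes that "variance below 40%" means a coefficient of variation (standard deviation divided by mean) under 0.40, which is one possible reading.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Recommendation:
    workload: str
    days_of_data: int
    usage_samples: list       # observed CPU usage in cores
    current_request: float    # cores
    recommended_request: float
    has_hpa: bool


def is_significant(rec, min_days=7, max_cv=0.40):
    """Enough history and stable enough usage to trust the recommendation."""
    if rec.days_of_data < min_days or not rec.usage_samples:
        return False
    avg = mean(rec.usage_samples)
    if avg <= 0:
        return False
    return pstdev(rec.usage_samples) / avg < max_cv


def reduction_fraction(rec):
    """Fraction by which the recommendation lowers the current request."""
    if rec.current_request <= 0:
        return 0.0
    return (rec.current_request - rec.recommended_request) / rec.current_request


def triage(rec):
    """Return 'skip', 'auto-approve', or 'needs-review' for one workload."""
    if not is_significant(rec):
        return "skip"
    if reduction_fraction(rec) > 0.20 and rec.has_hpa:
        return "auto-approve"
    return "needs-review"


rec = Recommendation("api-gateway", 10, [0.30, 0.32, 0.28, 0.31], 2.0, 0.5, True)
print(rec.workload, triage(rec))
```

Only workloads that come back as "auto-approve" or "needs-review" would get a pull request; "skip" means the pipeline waits for more data.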

The golden ratio for production pods: set CPU requests at the p95 of actual CPU usage, CPU limits at 2–3x requests (CPU is compressible — throttling is safe). Set memory requests at the p99 of actual memory usage plus a 20% headroom buffer, and memory limits equal to requests (memory is not compressible — OOMKill is better than node eviction). This combination minimises waste while keeping failure modes predictable.
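
The sizing rule translates directly into code. The sketch below uses a nearest-rank percentile and a CPU limit factor of 2 (the low end of the 2–3x range); `percentile`, `size_container`, and the sample data are hypothetical, and real usage samples would come from your metrics store.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def size_container(cpu_samples_cores, mem_samples_mib,
                   cpu_limit_factor=2.0, mem_headroom=0.20):
    """Apply the rule: CPU request = p95, CPU limit = factor x request,
    memory request = p99 plus headroom, memory limit = memory request."""
    cpu_request = percentile(cpu_samples_cores, 95)
    mem_request = percentile(mem_samples_mib, 99) * (1 + mem_headroom)
    return {
        "cpu_request": cpu_request,
        "cpu_limit": cpu_request * cpu_limit_factor,
        "memory_request_mib": mem_request,
        "memory_limit_mib": mem_request,
    }


# Synthetic samples: CPU ramps from 0.01 to 1.00 cores; memory is flat with one spike.
cpu_samples = [i / 100 for i in range(1, 101)]
mem_samples = [900] * 99 + [1000]
print(size_container(cpu_samples, mem_samples))
```

Note that the single 1000 MiB spike falls outside the p99 in this data set, so the headroom buffer is what absorbs it.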

Shared Cluster vs. Dedicated Clusters: The Architecture Trade-off

A shared multi-tenant cluster has the best bin-packing efficiency — overhead is amortised, nodes are fuller, and Karpenter can repack across all workloads. A per-team dedicated cluster eliminates noisy-neighbour risk, simplifies cost attribution (one cloud bill = one team), and allows independent upgrade schedules, but multiplies overhead: every cluster needs its own control plane, add-ons, and operations staff.

The industry consensus at most hyperscaler-adjacent companies is a tiered model: one shared platform cluster per environment (dev/staging/prod) for the majority of workloads, plus opt-in dedicated clusters for workloads with strict compliance, network isolation, or hardware requirements (GPU training jobs, PCI-scoped payment processors). This delivers roughly 80% of the bin-packing benefit while containing the blast radius of the remaining 20%.

The namespace-as-cost-centre trap: Namespaces map to teams cleanly in theory, but in practice many organisations end up with hundreds of namespaces — one per microservice, per PR environment, per experiment. Cost reporting then requires a second-level aggregation (namespace → team → cost-centre) and the mapping must be maintained. Model it explicitly from day one: a ConfigMap or a CMDB entry that maps namespace to team is mandatory infrastructure for any FinOps programme at scale.
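
The second-level aggregation can be sketched as a lookup plus a sum. The mapping, the namespace names other than payments-prod, and the costs below are all invented for illustration; in practice `NAMESPACE_OWNERS` would be loaded from the ConfigMap or CMDB.

```python
from collections import defaultdict

# Hypothetical namespace -> (team, cost-centre) mapping.
NAMESPACE_OWNERS = {
    "payments-prod": ("payments", "cc-1042"),
    "payments-pr-1234": ("payments", "cc-1042"),
    "search-prod": ("search", "cc-2077"),
}


def roll_up(namespace_costs, owners):
    """Aggregate per-namespace cost to (team, cost-centre) and list unmapped namespaces."""
    by_owner = defaultdict(float)
    unmapped = []
    for namespace, cost in namespace_costs.items():
        owner = owners.get(namespace)
        if owner is None:
            unmapped.append(namespace)
            continue
        by_owner[owner] += cost
    return dict(by_owner), sorted(unmapped)


weekly_costs = {
    "payments-prod": 100.0,
    "payments-pr-1234": 12.5,
    "search-prod": 40.0,
    "exp-foo": 3.0,
}
totals, unmapped = roll_up(weekly_costs, NAMESPACE_OWNERS)
print(totals)
print("unmapped:", unmapped)
```

Reporting the unmapped namespaces separately, rather than dropping them, is what keeps the mapping honest as new namespaces appear.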

Surfacing Cost in the Developer Workflow

The most effective behavioural change is making cost visible at the moment a developer makes a sizing decision, not after the bill arrives. Three integration points matter most:

CI cost estimates. A step in the deploy pipeline (post-helm-diff) calls the OpenCost API to project the monthly cost of the new deployment spec and posts it as a PR comment: "This deployment will cost approximately $1,240/month (+18% vs. current). CPU request increased from 500m to 800m across 10 replicas." Engineers cannot act on what they cannot see.
Weekly Slack digest. An automated message every Monday to each team's channel: top 5 most expensive workloads, efficiency score, week-over-week change, and a link to the Kubecost dashboard. Teams that see their cost go up without a matching business justification investigate; teams that do not see it do not.
Grafana cost panel on every service dashboard. A standard Grafana panel (using OpenCost's Prometheus metrics) showing cost/day and efficiency % next to latency and error rate. Cost becomes a first-class operational signal, not a finance artifact.
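
The CI cost estimate boils down to pricing the requested resources before and after the change. The sketch below is a simplified stand-in that does not call OpenCost: `monthly_cost` and `pr_comment` are hypothetical helpers, the unit prices are invented, and a 730-hour month is assumed.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours


def monthly_cost(replicas, cpu_cores, memory_gib, cpu_price, mem_price):
    """Projected monthly cost of a deployment, based on its requests."""
    hourly = replicas * (cpu_cores * cpu_price + memory_gib * mem_price)
    return hourly * HOURS_PER_MONTH


def pr_comment(current, proposed):
    """Format the projected cost and its change as a PR comment line."""
    message = f"This deployment will cost approximately ${proposed:,.0f}/month"
    if current > 0:
        delta_pct = (proposed - current) / current * 100
        message += f" ({delta_pct:+.0f}% vs. current)"
    return message + "."


# Hypothetical unit prices: $0.04 per core-hour, $0.005 per GiB-hour.
CPU_PRICE, MEM_PRICE = 0.04, 0.005

# CPU request raised from 500m to 800m across 10 replicas, memory unchanged at 1 GiB.
current = monthly_cost(10, 0.5, 1.0, CPU_PRICE, MEM_PRICE)
proposed = monthly_cost(10, 0.8, 1.0, CPU_PRICE, MEM_PRICE)
print(pr_comment(current, proposed))
```

A pipeline step would take the two request sets from the helm diff and post the resulting string on the pull request.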

Kubernetes cost management is not a tool purchase — it is an operating model change. The tooling (OpenCost, VPA, Karpenter) is mature and largely free. The hard work is the labelling taxonomy, the quota enforcement, and the cultural shift that makes every team accountable for their own cloud spend.