Databases in Production

Databases on Kubernetes

Running databases on Kubernetes is one of the most debated topics in platform engineering. For years the conventional wisdom was "stateless workloads only — keep your databases outside the cluster." That advice made sense in 2016 when Kubernetes storage primitives were immature. In 2025, with mature operators, CSI drivers, and production battle-testing at companies like Zalando, Cloudflare, and GitLab, the calculus has shifted. But "you can" is not the same as "you should," and the conditions under which each answer is correct matter enormously.

Why Running Databases in Kubernetes Is Hard

Kubernetes is designed around cattle — stateless pods that can be killed, rescheduled, and scaled horizontally at will. Databases are pets: they have identity, disk state, replication topology, and cluster membership that must survive restarts and node failures without data loss. Several primitives bridge this gap:

  • StatefulSet — assigns stable, ordered pod identities (pg-0, pg-1, pg-2) that survive pod restarts. Each pod gets its own PersistentVolumeClaim.
  • PersistentVolumeClaim (PVC) — the pod's durable storage, backed by a CSI driver (EBS, GCE PD, Ceph, Longhorn, etc.). The PVC outlives the pod.
  • Headless Service — a Service with clusterIP: None that exposes each pod's DNS name directly (pg-0.postgres-headless.default.svc.cluster.local), allowing the operator to route writes to the primary and reads to replicas.
  • PodDisruptionBudget (PDB) — prevents Kubernetes from draining too many database pods simultaneously during node maintenance.
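As a rough illustration of how the first three primitives fit together, here is a minimal sketch of a headless Service plus a StatefulSet with a per-pod volume claim. The names pg and postgres-headless follow the examples above; the pg-superuser Secret and the gp3-encrypted StorageClass are assumptions made for illustration. Note that this only starts three independent PostgreSQL instances. It sets up no replication or failover, which is exactly the gap an operator fills.

```yaml
# Headless Service: no cluster IP, so DNS resolves to individual pods,
# e.g. pg-0.postgres-headless.default.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
spec:
  clusterIP: None
  selector:
    app: pg
  ports:
    - name: postgres
      port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pg
spec:
  serviceName: postgres-headless   # ties pod DNS names to the headless Service
  replicas: 3                      # creates pg-0, pg-1, pg-2 in order
  selector:
    matchLabels:
      app: pg
  template:
    metadata:
      labels:
        app: pg
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - name: postgres
              containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: pg-superuser   # hypothetical pre-created Secret
                  key: password
            - name: PGDATA             # use a subdirectory of the mounted volume
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  # One PVC per pod (data-pg-0, data-pg-1, data-pg-2); each outlives its pod
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-encrypted   # assumed StorageClass name
        resources:
          requests:
            storage: 100Gi
```

If pg-1 is deleted, the StatefulSet controller recreates it under the same name and reattaches the same claim, which is what gives the database its stable identity.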

Even with these primitives, you would not want to hand-write the coordination logic a production database needs. Leader election, replica promotion, switchover without data loss, and configuration reload on spec change are complex enough that the community has converged on operators as the standard delivery mechanism.

Operators: The Right Abstraction

A Kubernetes operator encodes a human operator's runbook as a control loop. It watches Custom Resources (CRs) you define — e.g. a Cluster object for PostgreSQL — and reconciles the actual cluster state toward the desired state. Good operators handle:

  • Bootstrap: initialising the primary, streaming replicas from it, and registering them in the topology
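  • Failover: detecting primary failure, electing a new leader, and updating the headless service endpoints; data loss is avoided only if the promoted replica has received every committed transaction, which in practice means synchronous replication
  • Switchover: graceful, planned primary hand-off for maintenance, with only a brief interruption to writes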
  • Backup: scheduled base backups to object storage (S3, GCS) and continuous WAL archiving
  • Configuration management: translating a spec change into a coordinated rolling reload or restart
  • Minor and major version upgrades: orchestrated cluster-wide upgrades with rollback support

CloudNativePG — The Benchmark PostgreSQL Operator

CloudNativePG (CNPG) is a CNCF sandbox project and one of the most production-mature PostgreSQL operators available. It runs PostgreSQL directly in pods (no sidecar), manages those pods and their PVCs itself rather than through a StatefulSet, uses PostgreSQL streaming replication natively, and integrates WAL archiving with object storage out of the box. Here is a minimal but production-aligned cluster manifest:

    # cnpg-cluster.yaml
    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: pg-prod
      namespace: databases
    spec:
      # 1 primary + 2 standbys. Standbys replicate asynchronously unless
      # synchronous replication is configured separately in this spec.
      instances: 3
      imageName: ghcr.io/cloudnative-pg/postgresql:16.3
      postgresql:
        parameters:
          max_connections: "200"
          shared_buffers: "256MB"
          wal_level: logical            # enable logical replication if needed
          synchronous_commit: "on"
      bootstrap:
        initdb:
          database: app
          owner: app
          secret:
            name: pg-app-credentials    # pre-created Secret with username/password
      storage:
        size: 100Gi
        storageClass: gp3-encrypted     # CSI-backed, encrypted EBS gp3
      backup:
        retentionPolicy: "30d"
        barmanObjectStore:
          destinationPath: s3://my-cluster-backups/pg-prod
          s3Credentials:
            accessKeyId:
              name: s3-creds
              key: ACCESS_KEY_ID
            secretAccessKey:
              name: s3-creds
              key: SECRET_ACCESS_KEY
          wal:
            compression: gzip
            maxParallel: 2
      monitoring:
        enablePodMonitor: true          # creates a PodMonitor for Prometheus scraping
      resources:
        requests:
          memory: "2Gi"
          cpu: "1"
        limits:
          memory: "4Gi"
          cpu: "2"
    ---
    # PodDisruptionBudget: prevent draining more than 1 DB pod at once.
    # Note: CNPG also creates PDBs for its clusters by default, so check for overlap.
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: pg-prod-pdb
      namespace: databases
    spec:
      minAvailable: 2
      selector:
        matchLabels:
          cnpg.io/cluster: pg-prod

Apply it and CNPG bootstraps the cluster, configures streaming replication, starts WAL archiving, and creates a PodMonitor so that Prometheus scrapes the metrics exporter built into each instance automatically (the PodMonitor resource requires the Prometheus Operator CRDs to be installed). Check cluster status with:

    # Install the CNPG kubectl plugin (once)
    kubectl krew install cnpg

    # Cluster health summary
    kubectl cnpg status pg-prod -n databases

    # Example output (illustrative, abridged):
    # Cluster Summary
    # Name:              pg-prod
    # Namespace:         databases
    # System ID:         7384…
    # PostgreSQL Image:  ghcr.io/cloudnative-pg/postgresql:16.3
    # Primary instance:  pg-prod-1
    # Status:            Cluster in healthy state
    # Instances:         3
    # Ready instances:   3
    #
    # Instances
    # NAME       CURRENT LSN  REPLICATION LAG  STATUS   ROLE
    # pg-prod-1  0/1A3C5F8    -                healthy  Primary
    # pg-prod-2  0/1A3C5F8    0s               healthy  Standby (async)
    # pg-prod-3  0/1A3C5F8    0s               healthy  Standby (async)

    # Trigger a manual switchover (planned primary hand-off)
    kubectl cnpg promote pg-prod pg-prod-2 -n databases

    # Initiate an on-demand backup to object storage
    kubectl cnpg backup pg-prod -n databases
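
The on-demand backup command can be complemented by a recurring schedule. A minimal sketch, assuming the pg-prod cluster from this lesson; the resource name pg-prod-nightly and the 02:00 schedule are illustrative choices:

```yaml
# scheduled-backup.yaml (illustrative name)
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: pg-prod-nightly        # hypothetical name
  namespace: databases
spec:
  # Six cron fields, seconds first: every day at 02:00:00
  schedule: "0 0 2 * * *"
  # Backup objects are owned by this ScheduledBackup and are
  # garbage-collected with it
  backupOwnerReference: self
  cluster:
    name: pg-prod
```

CNPG schedules use six cron fields with seconds first, so a five-field expression copied from a Kubernetes CronJob needs a leading seconds field.
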

Other mature operators worth knowing: Zalando Postgres Operator (uses Patroni under the hood, battle-tested at Zalando scale), PGO from Crunchy Data (strong enterprise support, robust backup/restore UX), the Percona Operators for MySQL and MongoDB (covering XtraDB Cluster and Percona Server for MongoDB), and KubeDB (multi-engine: PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch). For Redis, Redis Enterprise Operator and the open-source spotahome/redis-operator are commonly used options.

Storage: The Make-or-Break Decision

The operator is only as reliable as the storage layer beneath it. Storage choice has a larger impact on database performance and data safety than almost any other Kubernetes decision.

[Figure: Kubernetes database storage stack. Layers, top to bottom: Application Pod; CloudNativePG Operator; StatefulSet Pods; PVC per Pod; CSI Driver (EBS / GCE PD / Ceph / Longhorn); Physical / Network Block Storage.]
Kubernetes database storage stack: the operator manages topology; the CSI driver connects PVCs to durable block storage beneath the cluster.

Key storage principles for production databases on Kubernetes:

  • Use block storage, not shared filesystems. NFS and most CephFS configurations are not safe for PostgreSQL or MySQL data directories because they do not provide the fsync guarantees databases depend on. Use block volumes (EBS, GCE PD, Ceph RBD in block mode).
  • Enable volume encryption at the StorageClass level, not at the application layer. An encrypted: "true" parameter or a KMS key reference in the StorageClass ensures every PVC is encrypted automatically.
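  • Isolate database I/O from noisy neighbours. Kubernetes QoS classes (Guaranteed, Burstable, BestEffort) govern CPU and memory only, not disk I/O, so on nodes with multiple workloads noisy-neighbour disk I/O remains a major latency source. Use dedicated node groups (via node selectors and taints) to isolate database pods, and topologySpreadConstraints to keep instances on separate nodes.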
  • Never use emptyDir for database data. emptyDir is ephemeral: it is deleted when the pod is removed from its node, for example on eviction. This is one of the most common data-loss mistakes when first running databases in Kubernetes.
  • Set a reclaimPolicy: Retain StorageClass for all database PVCs. The default Delete policy will permanently destroy your data volume the moment a PVC is deleted — whether intentionally or by operator error.
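
The encryption and reclaim-policy principles above could be expressed in a StorageClass along these lines. This is a sketch that assumes the AWS EBS CSI driver (provisioner ebs.csi.aws.com); the name gp3-encrypted mirrors the storageClass referenced in the cluster manifest, and parameter names differ for other CSI drivers.

```yaml
# storageclass-gp3-encrypted.yaml (illustrative name)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
provisioner: ebs.csi.aws.com     # assumption: AWS EBS CSI driver is installed
parameters:
  type: gp3
  encrypted: "true"              # every volume from this class is encrypted
reclaimPolicy: Retain            # deleting a PVC does not delete the volume
volumeBindingMode: WaitForFirstConsumer   # provision in the zone where the pod lands
allowVolumeExpansion: true       # lets PVCs be resized later
```

With Retain, deleting a PVC leaves the PersistentVolume in the Released state and the backing volume intact until someone removes both manually.
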
The PVC delete trap: If you run kubectl delete pvc pg-prod-1 on a StorageClass with reclaimPolicy: Delete, the underlying EBS volume is permanently destroyed within seconds. Always use reclaimPolicy: Retain for database volumes. After a deliberate decommission, manually delete the PV and the backing volume only after confirming the data is no longer needed or a verified backup exists.

When NOT to Run Databases on Kubernetes

Even with mature operators, there are situations where running a database inside Kubernetes adds complexity with no proportional benefit:

  • You are already on a managed service and have no operational problem. RDS, Cloud SQL, and Aurora solve real problems cheaply. Migrating to a CNPG cluster to gain control you do not need is pure complexity.
  • Your team does not have Kubernetes expertise. Operating a CNPG cluster requires understanding pods, PVCs, StorageClasses, network policies, and the operator's own CRDs. A team unfamiliar with Kubernetes should not make a production database its first workload on the platform.
  • Your storage layer is not production-grade. If the cluster uses ephemeral local disks, a poorly tuned Longhorn installation, or NFS, running a production database on it will end badly. Fix the infrastructure before running stateful workloads.
  • Compliance or data residency requirements dictate isolation. Some regulatory frameworks require databases to run on dedicated, single-tenant hardware. A shared Kubernetes cluster violates these requirements regardless of operator maturity.
  • Extremely high I/O workloads. A cluster running on general-purpose network block storage will never match the throughput of a bare-metal NVMe host. For write-intensive OLTP at extreme scale, a dedicated host or bare-metal database server is still the right answer.
The pragmatic decision framework: Start with managed cloud databases. Move to a Kubernetes operator when: (1) you are running Kubernetes for everything else and the operational consistency gain is real, (2) you need features the managed service does not expose, or (3) the managed service cost exceeds the engineering cost of a well-operated CNPG cluster. At companies like Cloudflare and GitLab, the break-even is real at sufficient scale — but the prerequisite is a platform team that owns the operator layer as seriously as any other infrastructure service.

Key Operational Practices

If you do run databases on Kubernetes, these practices separate production-grade deployments from experiments:

  • Always configure a PDB. Without a PodDisruptionBudget, a kubectl drain during node maintenance can evict both the primary and a replica simultaneously, causing a brief outage or failover cascade.
  • Use anti-affinity rules to spread database pods across physical nodes and availability zones. A primary and both replicas on the same node means a single node failure takes down the entire cluster.
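  • Test failover and switchover regularly. The operator's promote command exercises a planned switchover; deleting the primary pod in a staging cluster exercises an unplanned failover. Know the RTO from a primary failure before you experience it in production.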
  • Verify WAL archiving and restore. Running kubectl cnpg backup pg-prod is not enough — regularly restore that backup into a separate namespace or staging cluster to confirm it is valid.
  • Separate the database namespace with NetworkPolicies that allow only application namespaces to reach database ports. Databases should never be reachable from the open internet or from unrelated workloads in the same cluster.
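
For the restore-verification practice above, one possible shape of a restore drill is a throwaway CNPG cluster bootstrapped from the object store that pg-prod archives to. This is a sketch under several assumptions: the cluster name pg-restore-test and the namespace databases-staging are hypothetical, and a copy of the s3-creds Secret is assumed to exist in that namespace.

```yaml
# restore-drill.yaml (illustrative name)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-restore-test          # hypothetical throwaway cluster
  namespace: databases-staging   # hypothetical namespace, separate from production
spec:
  instances: 1                   # a single instance is enough to validate a backup
  storage:
    size: 100Gi                  # must be large enough to hold the restored data
    storageClass: gp3-encrypted
  bootstrap:
    recovery:
      source: pg-prod            # refers to the externalClusters entry below
  externalClusters:
    - name: pg-prod              # matches the name the backups were archived under
      barmanObjectStore:
        destinationPath: s3://my-cluster-backups/pg-prod
        s3Credentials:
          accessKeyId:
            name: s3-creds       # assumed to be copied into this namespace
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: s3-creds
            key: SECRET_ACCESS_KEY
        wal:
          maxParallel: 2
```

Once the cluster reports healthy, run a few application-level queries against it to confirm the data is what you expect, then delete the cluster and its volumes.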