StatefulSets
Most Kubernetes workloads are stateless: any Pod is identical to any other, and you can kill or replace Pods in any order. Databases, message brokers, and distributed caches cannot work this way. A Kafka broker has a persistent identity that other brokers and clients use in their configuration. A Cassandra node owns a shard of data that must follow it across restarts. Elasticsearch nodes form a cluster by name and must rejoin with the same node ID after a rolling upgrade. StatefulSets exist to give Pods a stable, persistent identity — a predictable hostname, an ordered startup and teardown sequence, and a dedicated PersistentVolumeClaim that survives Pod rescheduling.
What Makes a StatefulSet Different
A Deployment treats all its Pods as fungible cattle. A StatefulSet treats each Pod as a named, ordered individual. The guarantees it provides are:
- Stable network identity: each Pod gets a DNS name of the form <pod-name>.<headless-service>.<namespace>.svc.cluster.local, for example postgres-0.postgres-headless.production.svc.cluster.local. This hostname is stable across restarts and rescheduling to different nodes.
- Stable storage: each Pod has its own PersistentVolumeClaim created from a volumeClaimTemplates block. The claim is named <volume-name>-<pod-name> (e.g. data-postgres-0). When Pod postgres-0 is rescheduled, it reattaches to the same PVC, and therefore to the same underlying data.
- Ordered, controlled rollout: Pods are created and deleted in a deterministic order (0, 1, 2 … for creation; reverse for deletion). Under the default OrderedReady pod management policy, a Pod must be Running and Ready before the next ordinal is started.
A StatefulSet requires a headless Service (clusterIP: None) to be created separately. This Service is what causes the cluster DNS to publish per-Pod records. Without it the stable hostname guarantee does not function.
Anatomy of a StatefulSet Manifest
Below is a production-realistic PostgreSQL StatefulSet. Note the pairing of serviceName, the headless Service definition, and volumeClaimTemplates:
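A minimal sketch of what such a manifest could look like is shown here. The names postgres, postgres-headless, production, and data come from the text; the image tag, probe settings, storage size, the StorageClass name fast-ssd, and the Secret name postgres-credentials are assumptions made for illustration. The sketch starts three independent PostgreSQL instances and does not configure replication between them.

```yaml
# Headless Service: clusterIP None makes the cluster DNS publish one record per Pod.
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
  namespace: production
spec:
  clusterIP: None
  selector:
    app: postgres
  ports:
    - name: postgres
      port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  serviceName: postgres-headless      # must match the headless Service above
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      terminationGracePeriodSeconds: 60   # assumed value
      containers:
        - name: postgres
          image: postgres:16              # assumed image tag
          ports:
            - name: postgres
              containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials   # hypothetical Secret
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            periodSeconds: 10
          volumeMounts:
            - name: data                  # matches the claim template name below
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data                        # yields PVCs data-postgres-0, -1, -2
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd        # hypothetical StorageClass
        resources:
          requests:
            storage: 50Gi                 # assumed size
```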
Stable Identity Under the Hood
When you apply this manifest, Kubernetes creates Pods named postgres-0, postgres-1, and postgres-2 — in that order, waiting for each to be Ready before proceeding. It also creates PVCs named data-postgres-0, data-postgres-1, and data-postgres-2 automatically. The cluster DNS then registers A records so that postgres-0.postgres-headless.production.svc.cluster.local always resolves to the IP of the Pod currently named postgres-0, regardless of which physical node it runs on.
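Because these names never change, clients and peers can hard-code them in configuration. The sketch below records the per-Pod hostnames in a ConfigMap; the ConfigMap name postgres-peers, its keys, and the choice of postgres-0 as the primary are assumptions for illustration.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-peers          # hypothetical name
  namespace: production
data:
  # Stable per-Pod DNS names: <pod-name>.<headless-service>.<namespace>.svc.cluster.local
  primary-host: postgres-0.postgres-headless.production.svc.cluster.local
  replica-hosts: |
    postgres-1.postgres-headless.production.svc.cluster.local
    postgres-2.postgres-headless.production.svc.cluster.local
```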
Rolling Updates and the Partition Field
StatefulSet rolling updates proceed in reverse ordinal order (highest index first). The partition field in updateStrategy.rollingUpdate is one of the most useful, and underused, production controls. Only Pods whose ordinal is greater than or equal to the partition receive a changed Pod template, so setting partition: 2 on a three-replica set means only Pod postgres-2 is updated. Pods postgres-0 and postgres-1 keep running the old spec. This is a built-in canary mechanism for stateful workloads: validate the new version on the replica with the smallest blast radius before rolling it to the primary (assumed here to be the lowest ordinal).
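As a sketch, the canary setup described above corresponds to the following fragment of the StatefulSet spec. It is a fragment to merge into an existing manifest, not a complete resource.

```yaml
# Fragment of a StatefulSet spec; merge into the full manifest.
spec:
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2   # only ordinals >= 2 (here: postgres-2) get the new Pod template
```

Once postgres-2 looks healthy, lowering partition to 1 and then to 0 rolls the change to the remaining Pods in reverse ordinal order.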
PVC Lifecycle and the Retention Policy
By default, PVCs created by a StatefulSet are not deleted when the StatefulSet is scaled down or deleted. This is intentional: you do not want a kubectl delete statefulset to wipe your database data. Since Kubernetes 1.27, where the feature reached beta and is enabled by default, the persistentVolumeClaimRetentionPolicy field lets you control this explicitly:
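A sketch of the field with both axes set to Retain, which matches the default behaviour, looks like this (a fragment of the StatefulSet spec, not a complete resource):

```yaml
# Fragment of a StatefulSet spec; merge into the full manifest.
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain   # keep PVCs when the StatefulSet itself is deleted
    whenScaled: Retain    # keep PVCs of Pods removed by a scale-down
```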
Do not set whenDeleted: Delete on production databases unless you have verified, tested backups. A kubectl delete statefulset --cascade=foreground combined with that policy would destroy every PVC. The safe default is Retain on both axes: PVCs outlive the StatefulSet and must be cleaned up manually.
Production Failure Modes
Understanding where StatefulSets break is as important as knowing how to configure them:
- Split-brain after node failure: If a node becomes unreachable (network partition rather than crash), the control plane marks its Pods as Unknown but does not forcibly terminate them. The StatefulSet controller will not create a replacement Pod for ordinal N while a Pod of the same name might still be running; it prefers having no Pod over risking two Pods writing to the same PVC. Production fix: set a short tolerationSeconds on the node.kubernetes.io/unreachable toleration in the Pod spec so eviction starts sooner, or use a pod disruption budget combined with a cluster autoscaler node-drain workflow. The replacement is created only once the old Pod object is actually gone, so force-delete it only after confirming that the node is really down.
- PVC capacity exhaustion: Unlike Deployments, a StatefulSet Pod cannot move to a different PVC if its storage is full. Always set storage alerts at 75% capacity and use allowVolumeExpansion: true on the StorageClass so you can resize the claim, without downtime where the storage driver supports online expansion.
- Init Container ordering for cluster bootstrap: Stateful applications like Cassandra or Kafka need to know if they are the first node in a new cluster or a replacement node joining an existing one. Use an init container that queries the headless service: if DNS resolution returns zero records, it is a fresh cluster; if it returns existing IPs, it is a join operation.
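The first two mitigations above can be sketched as configuration. The 30-second toleration, the StorageClass name fast-ssd, and the provisioner and its parameters are assumptions for illustration; pick values that fit your cluster and storage backend. The first document is a Pod template fragment and is not meant to be applied on its own.

```yaml
# Fragment of spec.template.spec in the StatefulSet: shorten the time a Pod
# tolerates an unreachable or not-ready node (Kubernetes adds 300s by default).
tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30   # assumed value
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30   # assumed value
---
# StorageClass that permits growing existing PVCs.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd                 # hypothetical name
provisioner: ebs.csi.aws.com     # assumed CSI driver; use your own
parameters:
  type: gp3                      # driver-specific, assumed
allowVolumeExpansion: true       # lets you raise spec.resources.requests.storage on a PVC
```

With expansion allowed, a full volume is grown by editing the storage request on the individual PVC (for example data-postgres-0), since the claim template of an existing StatefulSet cannot be changed in place.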
A label such as role: primary patched onto the primary Pod at startup makes Service selection trivial.
Scaling and Deletion
Scale with kubectl scale statefulset postgres --replicas=5 -n production. New Pods are added in ascending order (3, 4), and each must be Ready before the next is created. To scale down, run the same command with a lower count; Pods are terminated in descending order (4, 3). Their PVCs survive unless the retention policy says otherwise. Always drain a stateful replica gracefully (hand off in-flight operations, flush WAL, release cluster membership), either with a preStop hook or by configuring the application to handle SIGTERM with a clean shutdown sequence that completes within terminationGracePeriodSeconds.
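A sketch of the preStop approach is shown below as a fragment of the StatefulSet spec. The script path /scripts/pre-stop.sh is hypothetical and stands in for whatever drain logic your application needs; the 120-second grace period is likewise an assumed value and must be long enough for the hook and the shutdown to finish.

```yaml
# Fragment of a StatefulSet spec; merge into the full manifest.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120   # assumed; covers preStop plus SIGTERM handling
      containers:
        - name: postgres
          lifecycle:
            preStop:
              exec:
                # Runs before SIGTERM is sent to the container.
                # Hypothetical script: hand off work, flush, leave the cluster.
                command: ["/bin/sh", "-c", "/scripts/pre-stop.sh"]
```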