Kubernetes Workloads & Configuration

StatefulSets

Most Kubernetes workloads are stateless: any Pod is identical to any other, and you can kill or replace Pods in any order. Databases, message brokers, and distributed caches cannot work this way. A Kafka broker has a persistent identity that other brokers and clients use in their configuration. A Cassandra node owns a shard of data that must follow it across restarts. Elasticsearch nodes form a cluster by name and must rejoin with the same node ID after a rolling upgrade. StatefulSets exist to give Pods a stable, persistent identity — a predictable hostname, an ordered startup and teardown sequence, and a dedicated PersistentVolumeClaim that survives Pod rescheduling.

What Makes a StatefulSet Different

A Deployment treats all its Pods as fungible cattle. A StatefulSet treats each Pod as a named, ordered individual. The guarantees it provides are:

Stable network identity: each Pod gets a DNS name of the form <pod-name>.<headless-service>.<namespace>.svc.cluster.local — for example postgres-0.postgres-headless.production.svc.cluster.local. This hostname is stable across restarts and rescheduling to different nodes.
Stable storage: each Pod has its own PersistentVolumeClaim created from a volumeClaimTemplates block. The claim is named <volume-name>-<pod-name> (e.g. data-postgres-0). When Pod postgres-0 is rescheduled, it reattaches to the same PVC — and therefore to the same underlying data.
Ordered, controlled rollout: Pods are created in ascending ordinal order (0, 1, 2, …) and deleted in reverse. Under the default OrderedReady pod management policy, a Pod must be Running and Ready before the next ordinal is started.

The headless Service is not optional. A StatefulSet requires a headless Service (clusterIP: None) to be created separately. This Service is what registers per-Pod DNS records in the cluster DNS. Without it the stable hostname guarantee does not function.
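
The naming rules above are mechanical enough to sketch as a small shell helper. This is only an illustration of the patterns, not something Kubernetes ships: the function name statefulset_identities is made up here, it does nothing but format strings, and it assumes the default cluster domain cluster.local.

```shell
# Print "<pod> <pvc> <dns-name>" (tab-separated) for each ordinal.
# Hypothetical helper: pure string formatting, it never contacts a cluster.
statefulset_identities() (
  name="$1" service="$2" namespace="$3" volume="$4" replicas="$5"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    pod="${name}-${i}"
    # PVC: <volume-name>-<pod-name>; DNS: <pod-name>.<headless-service>.<namespace>.svc.<cluster-domain>
    printf '%s\t%s\t%s\n' "$pod" "${volume}-${pod}" \
      "${pod}.${service}.${namespace}.svc.cluster.local"
    i=$((i + 1))
  done
)

statefulset_identities postgres postgres-headless production data 3
```

For the example names used in this lesson, the first row pairs postgres-0 with data-postgres-0 and postgres-0.postgres-headless.production.svc.cluster.local.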

Anatomy of a StatefulSet Manifest

Below is a production-realistic PostgreSQL StatefulSet. Note the pairing of serviceName, the headless Service definition, and volumeClaimTemplates:

# 1. Headless Service — must exist before the StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
  namespace: production
  labels:
    app: postgres
spec:
  clusterIP: None          # headless — no VIP, only DNS A records per pod
  selector:
    app: postgres
  ports:
  - name: postgres
    port: 5432
---
# 2. StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  serviceName: postgres-headless   # links to the headless service above
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0          # set to N to canary: only pods >= N are updated
  podManagementPolicy: OrderedReady # default; use Parallel for faster (but less safe) scaling
  template:
    metadata:
      labels:
        app: postgres
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: postgres
        image: postgres:16.2
        ports:
        - containerPort: 5432
          name: postgres
        env:
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "2000m"
            memory: "4Gi"
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
        readinessProbe:
          exec:
            command: ["pg_isready", "-U", "postgres"]
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 3
        livenessProbe:
          exec:
            command: ["pg_isready", "-U", "postgres"]
          initialDelaySeconds: 30
          periodSeconds: 15
          failureThreshold: 5
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3-encrypted   # production: use an encrypted, high-IOPS class
      resources:
        requests:
          storage: 100Gi

Stable Identity Under the Hood

When you apply this manifest, Kubernetes creates Pods named postgres-0, postgres-1, and postgres-2 — in that order, waiting for each to be Ready before proceeding. It also creates PVCs named data-postgres-0, data-postgres-1, and data-postgres-2 automatically. The cluster DNS then registers A records so that postgres-0.postgres-headless.production.svc.cluster.local always resolves to the IP of the Pod currently named postgres-0, regardless of which physical node it runs on.

Figure: a StatefulSet with three replicas. Each Pod has a stable DNS name and its own dedicated PersistentVolumeClaim.

Rolling Updates and the Partition Field

StatefulSet rolling updates proceed in reverse ordinal order (highest index first). The partition field in updateStrategy.rollingUpdate is one of the most useful, and most underused, production controls. Setting partition: 2 on a three-replica set means only Pod postgres-2 is updated when you change the Pod template. Pods postgres-0 and postgres-1 keep running the old spec. Set the partition before you change the template: with partition: 0, a template change rolls every Pod. This gives you a built-in canary mechanism for stateful workloads. Assuming the primary runs at ordinal 0, as it often does, you validate the new version on the replica with the smallest blast radius before rolling it to the primary.

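# Canary rollout: raise the partition first so that only postgres-2 will be updated
kubectl patch statefulset postgres -n production \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/updateStrategy/rollingUpdate/partition","value":2}]'

# Then change the Pod template (the image tag here is only an example)
kubectl set image statefulset/postgres postgres=postgres:16.3 -n production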

# Confirm postgres-2 is on the new image, then proceed to roll postgres-1
kubectl patch statefulset postgres -n production \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/updateStrategy/rollingUpdate/partition","value":1}]'

# Finally, roll postgres-0 (often the primary — update last)
kubectl patch statefulset postgres -n production \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/updateStrategy/rollingUpdate/partition","value":0}]'

# Verify all pods are on the new revision
kubectl rollout status statefulset/postgres -n production
kubectl get pods -n production -l app=postgres \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
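
To make the partition rule concrete, here is a minimal sketch that lists which Pods a template change would touch for a given partition value, in the order the controller updates them. The function name pods_updated is hypothetical and the code only does arithmetic on ordinals; it does not query the cluster.

```shell
# List the Pods a RollingUpdate will touch, highest ordinal first.
# Ordinals below the partition keep the old Pod spec.
pods_updated() (
  name="$1" replicas="$2" partition="$3"
  i=$((replicas - 1))
  while [ "$i" -ge "$partition" ]; do
    echo "${name}-${i}"
    i=$((i - 1))
  done
)

pods_updated postgres 3 2   # prints postgres-2
pods_updated postgres 3 0   # prints postgres-2, postgres-1, postgres-0 (one per line)
```

A partition equal to or greater than the replica count prints nothing, which matches the behaviour of a template change that updates no Pods at all.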

PVC Lifecycle and the Retention Policy

By default, PVCs created by a StatefulSet are not deleted when the StatefulSet is scaled down or deleted. This is intentional: you do not want a kubectl delete statefulset to wipe your database data. The persistentVolumeClaimRetentionPolicy field lets you control this explicitly. It has been enabled by default since Kubernetes 1.27 (beta) and became stable in 1.32:

spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain    # PVCs survive StatefulSet deletion (default-equivalent, safe for prod)
    whenScaled:  Delete    # PVCs are deleted when replicas are scaled down (reclaim storage)

Never set whenDeleted: Delete on production databases unless you have verified, tested backups. With that policy the PVCs are owned by the StatefulSet, so an ordinary cascading kubectl delete statefulset removes every PVC along with it. The default is Retain on both axes: PVCs outlive the StatefulSet and must be cleaned up manually.

Production Failure Modes

Understanding where StatefulSets break is as important as knowing how to configure them:

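Stuck replica after node failure: If a node becomes unreachable (a network partition rather than a clean crash), the control plane taints the node and, once the Pod's node.kubernetes.io/unreachable toleration expires (300 seconds by default), marks its Pods for deletion. The Pods then sit in Terminating or Unknown because no kubelet can confirm that they have stopped. The StatefulSet controller will not create a replacement Pod for ordinal N while a Pod of the same name might still be running: it accepts unavailability rather than risk split-brain, with two Pods writing to the same PVC. A shorter toleration only starts this process sooner; it does not free the ordinal by itself. The replacement is created once the node recovers or the Node object is deleted. Force-delete the Pod, or taint the node node.kubernetes.io/out-of-service, only after you have confirmed that the node is really down.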
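PVC capacity exhaustion: A StatefulSet Pod is bound to its PVC by name, so it cannot move to a fresh volume when its storage fills up. Set storage alerts at around 75% capacity and use allowVolumeExpansion: true on the StorageClass so you can grow a volume by raising the storage request on the PVC itself. Whether the resize completes without restarting the Pod depends on the storage driver.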
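Init container ordering for cluster bootstrap: Stateful applications like Cassandra or Kafka need to know whether they are the first node in a new cluster or a replacement node joining an existing one. Use an init container that queries the headless Service: if DNS resolution returns zero records, treat it as a fresh cluster; if it returns existing IPs, treat it as a join operation. By default a headless Service publishes only Ready Pods, so an empty answer can also mean that every peer is down; check the local data directory as well before bootstrapping.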
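
The init-container check described in the last item could look roughly like the sketch below. The function name bootstrap_mode and the HEADLESS_SERVICE variable are illustrative assumptions, not part of any Kubernetes API. The function takes the resolver output as an argument so the decision logic stays separate from the DNS lookup.

```shell
# Decide between bootstrapping a new cluster and joining an existing one.
# $1 is whatever the resolver printed for the headless Service; an empty
# string means no peer records were returned.
bootstrap_mode() (
  # Count non-empty lines; grep exits non-zero on zero matches, hence "|| true".
  peers="$(printf '%s\n' "$1" | grep -c . || true)"
  if [ "$peers" -eq 0 ]; then
    echo "bootstrap"
  else
    echo "join"
  fi
)

# In an init container the input would come from a real lookup, for example:
#   bootstrap_mode "$(getent hosts "$HEADLESS_SERVICE" || true)"
# HEADLESS_SERVICE is a hypothetical environment variable set in the Pod spec.
```

Treat a "bootstrap" answer as a hint rather than proof: it only says that no Ready peer was visible at the moment of the lookup.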

Use a separate Service for application traffic: the headless Service is for peer-to-peer discovery, not for client connections. Create a standard ClusterIP Service (or LoadBalancer for external access) whose selector routes to the right Pods: the shared app label alone for read-balanced workloads across all replicas, or the app label plus a role label for the primary in a primary/replica setup. Many teams patch a label such as role: primary onto the primary Pod at startup to make Service selection trivial.

Scaling and Deletion

Scale with kubectl scale statefulset postgres --replicas=5 -n production. New Pods are added in ascending order (3, 4) and each must be Ready before the next is created. To scale down, run the same command with a lower count — Pods are terminated in descending order (4, 3). Their PVCs survive unless the retention policy says otherwise. Always drain a stateful replica gracefully (hand off in-flight operations, flush WAL, release cluster membership) by ensuring a proper preStop hook or by configuring the application to handle SIGTERM with a clean shutdown sequence within terminationGracePeriodSeconds.
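
As a quick way to reason about what a scale operation will do, the ordering rules above can be written down as a small helper. This is a sketch with a made-up name, scale_plan; it prints the order under the default OrderedReady policy and does not call kubectl.

```shell
# Print the create/delete sequence for scaling from $2 replicas to $3.
scale_plan() (
  name="$1" current="$2" target="$3"
  i="$current"
  while [ "$i" -lt "$target" ]; do    # scale up: ascending ordinals
    echo "create ${name}-${i}"
    i=$((i + 1))
  done
  i=$((current - 1))
  while [ "$i" -ge "$target" ]; do    # scale down: descending ordinals
    echo "delete ${name}-${i}"
    i=$((i - 1))
  done
)

scale_plan postgres 3 5   # create postgres-3, then create postgres-4
scale_plan postgres 5 3   # delete postgres-4, then delete postgres-3
```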