Volumes & PersistentVolumes
Every container in Kubernetes starts with an empty, ephemeral filesystem layered on top of its image. When the container exits — by crash, OOM kill, or rolling update — that filesystem is gone. For stateless services this is a feature, not a bug. But the moment you run a database, a message broker, a machine-learning checkpoint store, or any workload whose value lives in data written to disk, you need storage that outlives the container and often the Pod itself. Kubernetes models storage at three distinct abstraction levels — the raw Volume, the cluster-level PersistentVolume (PV), and the user-level claim against it, the PersistentVolumeClaim (PVC). Understanding where each fits, and why the binding model is designed the way it is, is prerequisite knowledge for running production stateful workloads at scale.
Ephemeral Volumes: Lifetime Tied to the Pod
A Volume in Kubernetes is not a PersistentVolume. It is a directory, declared in the Pod spec, that is made available inside the containers of a Pod. For the ephemeral volume types its lifetime is scoped to the Pod: when the Pod is deleted, the volume is torn down. Three ephemeral types cover most use cases:
- emptyDir — an empty directory created when the Pod starts, backed by the node's disk (or memory with medium: Memory). Ideal for scratch space, sidecar-shared caches, and multi-container communication within a Pod. Used heavily for read-through caches in Envoy-based service meshes.
- configMap/secret — mounts API objects as files. A secret volume mounted at /etc/tls gives containers access to TLS certificates without baking them into the image. These are also ephemeral in that they follow the Pod.
- projected — combines multiple sources (configMap, secret, serviceAccountToken, downwardAPI) into a single mount point. The standard way to expose a short-lived, auto-rotated service account token to a container in modern clusters (replaces the old mounted secret approach).
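As a rough illustration of the three volume types above, the following Pod manifest mounts one of each. It is a minimal sketch, not a production manifest: the Pod name, image, Secret name (app-tls), ConfigMap name (app-config), token audience, and mount paths are all hypothetical placeholders, and the Secret and ConfigMap are assumed to already exist in the namespace.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: volume-demo                          # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0    # placeholder image
      resources:
        limits:
          memory: 512Mi
      volumeMounts:
        - name: scratch
          mountPath: /scratch
        - name: tls
          mountPath: /etc/tls
          readOnly: true
        - name: identity
          mountPath: /var/run/secrets/tokens
          readOnly: true
  volumes:
    # Memory-backed scratch space, capped so it cannot grow unbounded.
    - name: scratch
      emptyDir:
        medium: Memory
        sizeLimit: 128Mi
    # TLS material from a Secret that is assumed to exist.
    - name: tls
      secret:
        secretName: app-tls
    # One mount point combining a short-lived token and a ConfigMap.
    - name: identity
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 3600
              audience: example-audience     # assumed audience value
          - configMap:
              name: app-config
```

All three volumes disappear with the Pod; only the API objects behind the secret and configMap sources survive.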
A memory-backed emptyDir (medium: Memory) counts against the container's memory limit. If your init container writes 500 MiB of decompressed data to a memory-backed emptyDir and your container limit is 512 MiB, only about 12 MiB of headroom remains, and the Pod is likely to be OOMKilled before it even starts its main workload. Always set explicit sizeLimit on emptyDir volumes in production manifests.
PersistentVolumes: Cluster-Level Storage Resources
A PersistentVolume (PV) is a cluster-scoped resource that represents a piece of storage — an AWS EBS volume, a GCP Persistent Disk, an NFS share, a Ceph RBD image — that has been provisioned and registered with Kubernetes. Think of a PV the way you think of a Node: it is a resource in the cluster inventory, independent of any particular workload. A PV encodes four critical properties:
- Capacity — the storage size (storage: 50Gi).
- Access modes — how many nodes can mount the volume, and in what mode (see below).
- Reclaim policy — what happens to the underlying storage when the PVC is deleted (Retain, Delete, or the deprecated Recycle).
- VolumeMode — Filesystem (default, mounted as a directory) or Block (raw block device, used by software that manages its own block-level I/O instead of relying on a filesystem).
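The four properties map directly onto fields of the PV spec. The manifest below is a sketch of a statically provisioned PV, assuming an NFS backend; the object name, the storage class name (manual), and the NFS server and export path are hypothetical values chosen for illustration.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nfs-example                # hypothetical; PVs are cluster-scoped, so no namespace
spec:
  capacity:
    storage: 50Gi                     # Capacity
  accessModes:
    - ReadWriteMany                   # Access modes (NFS can be shared across nodes)
  persistentVolumeReclaimPolicy: Retain   # Reclaim policy
  volumeMode: Filesystem              # VolumeMode
  storageClassName: manual            # assumed class name used for matching claims
  nfs:
    server: nfs.example.internal      # placeholder server
    path: /exports/data               # placeholder export
```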
Access Modes — the Most Misunderstood Field
Access modes define the contract between the storage backend and the cluster scheduler. There are four modes defined by the API:
- ReadWriteOnce (RWO) — the volume can be mounted read-write by a single node at a time. This is the mode supported by all block-storage backends (EBS, GCP PD, Azure Disk). RWO does not mean single Pod — multiple Pods on the same node can mount it.
- ReadOnlyMany (ROX) — the volume can be mounted read-only by many nodes simultaneously. Useful for distributing read-only reference data (ML model weights, static asset bundles) across a fleet.
- ReadWriteMany (RWX) — the volume can be mounted read-write by many nodes simultaneously. Only shared filesystems support this: NFS, AWS EFS, Azure Files, GCP Filestore, CephFS. Block storage backends do not support RWX.
- ReadWriteOncePod (RWOP) — introduced in Kubernetes 1.22, this is a stricter variant of RWO that enforces single-Pod semantics at the API level, not just single-node. RWOP is the correct choice for a primary database volume where two Pods racing to mount the same volume would cause split-brain or data corruption.
Use ReadWriteOncePod (stable since Kubernetes 1.29) and configure PodDisruptionBudgets carefully.
PersistentVolumeClaims: Portable Storage Requests
A PersistentVolumeClaim (PVC) is a namespace-scoped request for storage. A developer writes a PVC specifying the minimum size, access mode, and (optionally) a storageClassName. The PV controller in the control plane scans available PVs and binds the smallest one that satisfies all three criteria. The binding is exclusive and one-to-one: once a PV is bound to a PVC, no other PVC can bind it. This is a critical design property: a 100Gi PV will be consumed entirely by a 10Gi PVC if that is the only available match, wasting 90Gi. StorageClasses and dynamic provisioning (next lesson) solve this by creating right-sized PVs on demand.
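A claim and the Pod that consumes it might look like the sketch below. The namespace, object names, image, mount path, and the storage class name (manual) are assumptions for illustration. The claim asks for at least 10Gi with ReadWriteOnce, so any available PV of that class with at least 10Gi and a compatible access mode could be bound to it.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data                       # hypothetical name
  namespace: demo                     # PVCs are namespace-scoped
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: manual            # assumed class; must match the PV's class
  resources:
    requests:
      storage: 10Gi                   # minimum size, not an exact size
---
apiVersion: v1
kind: Pod
metadata:
  name: db
  namespace: demo                     # must be the same namespace as the PVC
spec:
  containers:
    - name: db
      image: registry.example.com/db:1.0   # placeholder image
      volumeMounts:
        - name: data
          mountPath: /var/lib/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: db-data            # the Pod names the claim, never the PV
```

The Pod never references the PV directly, which is what keeps the manifest portable across clusters with different storage backends.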
Reclaim Policies and What Actually Happens to Your Data
The persistentVolumeReclaimPolicy field determines what the cluster does with the underlying storage resource when the PVC is deleted:
- Retain — the PV is not deleted and is not made available for rebinding. It enters the Released state. An administrator must manually inspect the data, optionally take a snapshot, then delete the PV object and either delete the underlying storage asset or register it again under a new PV. This is the correct policy for production databases.
- Delete — the PV object and the underlying storage asset (EBS volume, GCP PD, etc.) are deleted automatically when the PVC is deleted. This is the default for dynamically provisioned PVs. Safe for stateless scratch storage; dangerous for databases.
- Recycle — deprecated; dynamic provisioning is the recommended replacement. Do not use it.
To grow a volume, edit the PVC's spec.resources.requests.storage to a larger value; you cannot shrink a PVC. The CSI driver will expand the filesystem online (for most block-storage drivers on Kubernetes 1.24+) without restarting the Pod. Always enable allowVolumeExpansion: true in your StorageClass (covered next lesson). Monitor actual disk usage with kubectl exec + df -h or expose it via the kubelet_volume_stats_* metrics in Prometheus, and alert at 80% capacity to give yourself time to expand before hitting the limit.
Volume Subpaths: Sharing One Volume Across Multiple Uses
A common production pattern for small workloads is to mount a single PVC into multiple paths within a container using subPath. For example, a single NFS PV might provide separate directories for application logs, uploads, and configuration backups for a small team. Use subPath with caution: changes to a mounted ConfigMap or Secret do not propagate to containers using subPath mounts (this is a known Kubernetes limitation — the inode is pinned at mount time). For ConfigMap/Secret rotation, prefer a full mount and read the file path.
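The logs, uploads, and backups layout described above could be sketched as follows. The Pod name, image, claim name (team-shared), mount paths, and subdirectory names are hypothetical; the PVC is assumed to already exist and be bound.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: subpath-demo                  # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      volumeMounts:
        # The same volume is mounted three times, each at a different
        # subdirectory of the volume root.
        - name: shared
          mountPath: /var/log/app
          subPath: logs
        - name: shared
          mountPath: /srv/uploads
          subPath: uploads
        - name: shared
          mountPath: /srv/backups
          subPath: config-backups
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: team-shared        # assumed existing claim
```

Each subPath is a path relative to the root of the volume, so the three mounts share capacity and I/O limits even though they look like separate directories to the container.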
Production Failure Modes to Internalize
Storage failures are among the most operationally damaging in Kubernetes because they often surface silently. The most common patterns at scale:
- PVC stuck in Pending — no PV satisfies the claim. Check that capacity, access mode, and storageClassName all match an available PV. kubectl describe pvc <name> will show the binding failure reason.
- Node-level volume attach timeout — especially with EBS: when a Pod is rescheduled after a node failure, the EBS detach/re-attach cycle can take 60–90 seconds. During this window the new Pod is stuck in ContainerCreating. Mitigate with node termination handlers (AWS Node Termination Handler) that proactively detach volumes before the instance is terminated.
- Full volume kills the process — a write to a full ext4 filesystem returns ENOSPC, and most databases (PostgreSQL, MySQL) crash rather than degrade gracefully. Monitor at 80%; expand before hitting the limit. Consider using fsGroup and setting filesystem reserved blocks to 0 (tune2fs -m 0) on database volumes since the reserved 5% has no value for a DB volume owned by a single process.
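One way to act on the 80% threshold is a Prometheus alerting rule built on the kubelet volume metrics. The rule file below is a sketch under the assumption that Prometheus already scrapes the kubelet and that the kubelet_volume_stats_used_bytes and kubelet_volume_stats_capacity_bytes series carry namespace and persistentvolumeclaim labels; the group name, alert name, severity label, and the 10 minute hold period are arbitrary choices.

```yaml
groups:
  - name: pvc-capacity                # hypothetical group name
    rules:
      - alert: PersistentVolumeFillingUp
        # Fraction of the volume's capacity currently in use.
        expr: |
          kubelet_volume_stats_used_bytes
            / kubelet_volume_stats_capacity_bytes > 0.80
        for: 10m                      # ignore short spikes
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is over 80% full"
```

Alerting on a ratio rather than on absolute free bytes keeps one rule usable across volumes of very different sizes.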