Cloud & Kubernetes Security Hardening

Kubernetes Hardening: Pod Security

Every workload running in Kubernetes executes inside a Pod. By default, that Pod shares the host kernel and can run as root, and unless policy forbids it, a Pod spec can request hostPath mounts of arbitrary host directories. A single compromised container can therefore become the first step toward a full cluster takeover. Pod security is the practice of stripping those defaults away so that an attacker who compromises a container lands in a stripped-down sandbox rather than a root shell with access to every secret in the cluster.

This lesson covers a three-layer model that is common in production clusters: Pod Security Standards (PSS) at the namespace level, securityContext at the Pod and container level, and runtime defaults that enforce secure-by-default behavior even when developers forget.

Pod Security Standards: Policy at the Namespace Level

Pod Security Standards (PSS) define three named policy levels. They are enforced by Pod Security Admission, a built-in admission controller that has been stable since Kubernetes 1.25, the same release that removed PodSecurityPolicy (PSP, deprecated since 1.21). No CRDs, webhooks, or external dependencies are required.

  • Privileged: Completely unrestricted. Reserved for system namespaces like kube-system and CNI plugins that legitimately need host access.
  • Baseline: Prevents the most dangerous escalations (privileged containers, hostPID, hostIPC, dangerous capabilities like NET_ADMIN) while allowing most legacy workloads to run unmodified.
  • Restricted: Heavily constrained — requires non-root UID, drops all capabilities, disallows privilege escalation, enforces a seccomp profile. The gold standard for application workloads.

Each level can be applied in three modes: enforce (reject violating Pods at admission), audit (allow the Pod but add an annotation to the audit log entry), and warn (allow the Pod but return a warning to the API client). A common production migration pattern is to apply warn and audit at the Restricted level everywhere first, monitor violations for a sprint, fix the workloads, and then switch namespaces to enforce.

    # Apply all three modes to a production namespace
    kubectl label namespace production \
      pod-security.kubernetes.io/enforce=restricted \
      pod-security.kubernetes.io/enforce-version=latest \
      pod-security.kubernetes.io/warn=restricted \
      pod-security.kubernetes.io/warn-version=latest \
      pod-security.kubernetes.io/audit=restricted \
      pod-security.kubernetes.io/audit-version=latest

    # Dry-run: what would Restricted reject in this namespace today?
    kubectl label --dry-run=server --overwrite namespace production \
      pod-security.kubernetes.io/enforce=restricted 2>&1 | grep Warning

Pin the version, not just latest. Using enforce-version=latest means a Kubernetes upgrade can start rejecting previously-compliant Pods the moment stricter checks are added to the Restricted profile. In production, pin to a specific version like v1.30 and upgrade deliberately during maintenance windows.
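
The staged rollout described above (warn and audit first, enforce later, with a pinned version throughout) can be sketched as a small shell helper that prints the label arguments for kubectl label. The function name pss_labels and the version v1.30 are illustrative assumptions, not part of kubectl.

```shell
#!/bin/sh
# pss_labels LEVEL VERSION MODE...
# Prints one Pod Security label pair per mode, pinned to VERSION.
# Hypothetical helper for illustration; it only builds strings.
pss_labels() {
  level="$1"
  version="$2"
  shift 2
  for mode in "$@"; do
    printf 'pod-security.kubernetes.io/%s=%s\n' "$mode" "$level"
    printf 'pod-security.kubernetes.io/%s-version=%s\n' "$mode" "$version"
  done
}

# Stage 1: observe only; nothing is rejected yet.
pss_labels restricted v1.30 warn audit

# Stage 2, once the reported violations are fixed: add enforce.
pss_labels restricted v1.30 enforce
```

One way to use it is kubectl label --overwrite namespace production $(pss_labels restricted v1.30 warn audit), relying on the shell splitting the output into separate arguments.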

securityContext: Hardening Individual Pods and Containers

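PSS sets the policy floor. securityContext is where you implement it per workload. Kubernetes exposes two levels: spec.securityContext (Pod-level, applies to all containers) and spec.containers[].securityContext (container-level). Fields that exist at both levels, such as runAsUser, runAsNonRoot, and seccompProfile, take the container-level value when both are set. Some fields exist at only one level: fsGroup is Pod-level only, while allowPrivilegeEscalation, readOnlyRootFilesystem, and capabilities are container-level only.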

Here is a production-ready deployment that satisfies the Restricted standard with explicit, documented intent on every field:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: api-server
      namespace: production
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: api-server
      template:
        metadata:
          labels:
            app: api-server
        spec:
          # Pod-level security context
          securityContext:
            runAsNonRoot: true        # Kubelet refuses to start a container that would run as UID 0
            runAsUser: 1000           # Explicit UID; avoid UID 0 and well-known UIDs
            runAsGroup: 3000
            fsGroup: 2000             # Supported volumes are group-owned by this GID
            seccompProfile:
              type: RuntimeDefault    # Kernel syscall filter via container runtime
            supplementalGroups: []    # No extra group memberships
          containers:
            - name: api
              # Pinned by digest; abc123 is a placeholder for the full sha256 digest
              image: registry.example.com/api:v2.4.1@sha256:abc123
              securityContext:
                allowPrivilegeEscalation: false   # Sets no_new_privs: setuid/setgid binaries cannot gain privileges
                readOnlyRootFilesystem: true      # Container root FS is immutable
                capabilities:
                  drop: ["ALL"]                   # Start with zero Linux capabilities
                  # add: ["NET_BIND_SERVICE"]     # Only add back if a port < 1024 is needed
              resources:
                requests:
                  cpu: "100m"
                  memory: "128Mi"
                limits:
                  cpu: "500m"
                  memory: "256Mi"
              volumeMounts:
                - name: tmp
                  mountPath: /tmp                 # Writable scratch space, not root FS
                - name: cache
                  mountPath: /var/cache/app
          volumes:
            - name: tmp
              emptyDir: {}
            - name: cache
              emptyDir: {}
          automountServiceAccountToken: false     # No API server access unless required

[Figure: nested boxes. Namespace production carries the PSS label enforce=restricted. Inside it, the Pod-level spec.securityContext sets runAsNonRoot, runAsUser, fsGroup, and seccompProfile. Container api inherits the Pod-level settings and adds allowPrivilegeEscalation: false, readOnlyRootFilesystem: true, and capabilities.drop: [ALL]. Sidecar container log-agent sets the same three fields plus runAsUser: 2000, which overrides the Pod-level UID.]
Pod Security layers: the PSS namespace label enforces policy at admission; Pod-level securityContext sets shared defaults; container-level overrides apply per container.
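
To make the inheritance and override behavior concrete, here is a minimal sketch that writes a two-container Pod manifest to a temporary file. The Pod name, the log-agent image, and the UIDs are illustrative assumptions; only the field names come from the Kubernetes API.

```shell
#!/bin/sh
# Write a minimal Pod manifest showing Pod-level defaults and a
# container-level override. Names, images, and UIDs are illustrative.
manifest="${TMPDIR:-/tmp}/pod-override-example.yaml"

cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: override-demo
  namespace: production
spec:
  securityContext:            # Pod-level: default for every container
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: api               # Inherits UID 1000 from the Pod level
      image: registry.example.com/api:v2.4.1
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
    - name: log-agent         # Overrides the Pod-level UID
      image: registry.example.com/log-agent:1.0.0
      securityContext:
        runAsUser: 2000
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
EOF

echo "wrote $manifest"
# To validate against the API server without creating anything:
#   kubectl apply --dry-run=server -f "$manifest"
```

In this sketch the api container runs as UID 1000 and log-agent as UID 2000, while both share the Pod-level seccomp profile and the runAsNonRoot check.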

Runtime Defaults: Secure Without Developer Effort

Relying solely on developers remembering to set securityContext does not scale. Production platforms use two enforcement mechanisms to make the secure path the default path.

1. Admission Webhooks with OPA/Kyverno

A mutating admission webhook can inject a sane securityContext into every Pod that does not specify one. A validating webhook can then reject Pods that still violate policy after mutation. Kyverno is the simpler option for Kubernetes-native teams; OPA/Gatekeeper offers more flexibility for complex policies shared across clouds.

    # Kyverno policy: add default hardening fields to any Pod that omits them
    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: add-default-seccomp
    spec:
      rules:
        - name: add-seccomp-profile
          match:
            any:
              - resources:
                  kinds: ["Pod"]
          mutate:
            patchStrategicMerge:
              spec:
                securityContext:
                  +(seccompProfile):              # +() anchor: only add if not already set
                    type: RuntimeDefault
                containers:
                  - (name): "?*"                  # Conditional anchor: match every named container
                    securityContext:
                      +(allowPrivilegeEscalation): false
                      +(readOnlyRootFilesystem): true   # Can break apps that write outside mounted volumes
                      capabilities:
                        +(drop): ["ALL"]
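
The validating half of the pattern, rejecting or flagging Pods that still violate policy, might look like the sketch below. It is modeled on the privilege-escalation policy in Kyverno's public policy library; the file path is an assumption, and the placement of the failure-action field differs between Kyverno releases, so check the documentation for the version you run.

```shell
#!/bin/sh
# Write a Kyverno validation policy to a temporary file.
# Sketch only: starts in Audit mode so nothing is rejected until
# the reports have been reviewed.
policy="${TMPDIR:-/tmp}/disallow-privilege-escalation.yaml"

cat > "$policy" <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privilege-escalation
spec:
  validationFailureAction: Audit   # Switch to Enforce to reject at admission
  background: true
  rules:
    - name: privilege-escalation
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: >-
          allowPrivilegeEscalation must be set to false on every container.
        pattern:
          spec:
            =(ephemeralContainers):      # =() anchor: check only if present
              - securityContext:
                  allowPrivilegeEscalation: "false"
            =(initContainers):
              - securityContext:
                  allowPrivilegeEscalation: "false"
            containers:
              - securityContext:
                  allowPrivilegeEscalation: "false"
EOF

echo "wrote $policy"
# Apply with: kubectl apply -f "$policy"
```

Starting in Audit mode mirrors the warn-then-enforce rollout used for Pod Security Standards.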

2. seccomp RuntimeDefault as a Global Default

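The kubelet can apply the RuntimeDefault seccomp profile to every Pod that does not specify one. The SeccompDefault feature reached general availability in Kubernetes 1.27, so on 1.27 and later no feature gate is needed: start the kubelet with --seccomp-default, or set seccompDefault: true in the kubelet configuration file. On earlier releases where the feature was still gated, --feature-gates=SeccompDefault=true may also be required. This is the closest Kubernetes comes to a global safe default at the node level. Activate it on new node groups before rolling it out to existing ones.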
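
The same default can be expressed in the kubelet configuration file instead of on the command line. A minimal sketch, assuming your nodes load a KubeletConfiguration file; the output path is hypothetical, and the fragment is meant to be merged into an existing configuration rather than used on its own.

```shell
#!/bin/sh
# Write a KubeletConfiguration fragment that turns on the node-wide
# seccomp default. Merge the seccompDefault line into the node's
# existing kubelet config; the path below is only an example.
kubelet_config="${TMPDIR:-/tmp}/kubelet-seccomp-default.yaml"

cat > "$kubelet_config" <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
seccompDefault: true    # Pods without a seccompProfile get RuntimeDefault
EOF

echo "wrote $kubelet_config"
```

Pods that explicitly set a seccompProfile, including Unconfined, keep their own setting; the default only fills the gap when nothing is specified.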

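Prefer RuntimeDefault over Unconfined. The RuntimeDefault seccomp profile (provided by the container runtime, such as containerd or CRI-O) is an allowlist: it permits the syscalls ordinary applications need and blocks the rest. Docker documents its equivalent default profile as disabling around 44 of the 300+ available syscalls. Blocked calls typically include kernel-module loading, and mount and unshare for processes without CAP_SYS_ADMIN, which are staples of container-escape exploit chains. The performance overhead is negligible for most workloads. The only good reason to run Unconfined in production is dedicated security or debugging tooling that needs the blocked syscalls.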
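
To see which seccomp mode a process is actually running under, the kernel reports it in the Seccomp field of /proc/PID/status. The helper below is a small sketch; the function name seccomp_mode is an assumption made for this example.

```shell
#!/bin/sh
# seccomp_mode: read /proc/<pid>/status text on stdin and print the
# seccomp mode in words. Kernel values: 0 disabled, 1 strict, 2 filter.
seccomp_mode() {
  mode="$(awk '/^Seccomp:/ {print $2}')"
  case "$mode" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}

# Local check on a Linux host; skipped where /proc is unavailable.
if [ -r /proc/self/status ]; then
  seccomp_mode < /proc/self/status
fi
```

Against a cluster, the status text can be piped in from a container, for example kubectl exec -n production deploy/api-server -- cat /proc/1/status | seccomp_mode. A container running under RuntimeDefault should report filter, while an Unconfined one reports disabled.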

Production Failure Modes and What They Cost You

Understanding why each field exists requires seeing what happens without it:

  • Missing runAsNonRoot: true: An image built without a USER directive runs as UID 0 inside the container. Without user namespaces, UID 0 in the container is the same UID 0 as on the host, so a container runtime or kernel vulnerability can hand the attacker host root. CVE-2019-5736 (the runc binary overwrite) required root inside the container to exploit.
  • Missing allowPrivilegeEscalation: false: Binaries with the setuid bit set (like sudo, newgrp, or pkexec) can escalate to root even when the container starts as a non-root user. This is how CVE-2021-4034 (PwnKit) worked.
  • Missing readOnlyRootFilesystem: true: An attacker who achieves code execution can write persistence tools, stage stolen data, or drop a reverse shell into the container's writable layer. A read-only root filesystem confines writes to explicitly mounted volumes (such as an emptyDir at /tmp) and to memory.
  • Missing capability drops: The default set of Linux capabilities granted to a container includes NET_RAW (craft raw packets, ARP spoofing), SYS_CHROOT, and MKNOD. Drop all and add back only what is provably needed.
  • Missing automountServiceAccountToken: false: Every Pod gets a service account token mounted by default. In a compromised container, that token provides API server access and is the primary pivot for lateral movement attacks within the cluster.

Image-level root is a common finding in production audits. Teams set runAsUser: 1000 in the Deployment manifest but never add a matching USER directive to the Dockerfile. Kubernetes will run the container as the UID you specify, but if the image was built expecting UID 0, the process may crash on files it can no longer read or write. Set a numeric non-root user in the Dockerfile (for example USER 1000) and enforce it at the Pod level as well. A numeric UID matters because with runAsNonRoot: true and no runAsUser, the kubelet cannot verify a named user such as USER nonroot and refuses to start the container.
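
One way to catch image-level root before deployment is to classify the user recorded in the image configuration. The sketch below assumes the value has already been read, for example with docker inspect --format '{{.Config.User}}' IMAGE; the function name and the three category labels are made up for this example.

```shell
#!/bin/sh
# classify_image_user USER_FIELD
# USER_FIELD is the image's configured user, optionally "user:group".
# Prints: root, named, or numeric-nonroot.
classify_image_user() {
  user="${1%%:*}"          # strip an optional ":group" suffix
  case "$user" in
    ""|root)  echo "root" ;;             # no USER directive means UID 0
    *[!0-9]*) echo "named" ;;            # non-numeric: runAsNonRoot cannot verify it
    *[!0]*)   echo "numeric-nonroot" ;;  # digits with at least one non-zero
    *)        echo "root" ;;             # all zeros is still UID 0
  esac
}

classify_image_user "1000:3000"   # prints numeric-nonroot
classify_image_user "nonroot"     # prints named
classify_image_user ""            # prints root
```

Only the numeric-nonroot result satisfies runAsNonRoot: true without also setting runAsUser in the manifest.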

Verifying Your Hardening

After applying securityContext settings, validate them without guessing:

    # These checks assume the image ships id, grep, touch, and cat; minimal images may not.

    # Confirm the running container's effective UID and groups
    kubectl exec -n production deploy/api-server -- id
    # Expected, similar to: uid=1000 gid=3000 groups=2000 (some runtimes also list 3000)

    # Confirm privilege escalation is blocked (no_new_privs is set)
    kubectl exec -n production deploy/api-server -- grep NoNewPrivs /proc/1/status
    # Expected: NoNewPrivs: 1

    # Confirm the root filesystem is read-only
    kubectl exec -n production deploy/api-server -- touch /test-write
    # Expected: touch: cannot touch '/test-write': Read-only file system

    # Confirm no capabilities remain (the pipe runs grep locally)
    kubectl exec -n production deploy/api-server -- cat /proc/1/status | grep Cap
    # CapPrm and CapEff should be 0000000000000000 (no capabilities)

    # Check the seccomp profile declared in the Pod spec
    kubectl get pod -n production -l app=api-server \
      -o jsonpath='{.items[0].spec.securityContext.seccompProfile}'
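
When CapEff or CapPrm is not all zeros, the hex mask is hard to read by eye. The sketch below decodes a mask into capability names using the bit positions from linux/capability.h; the function name decode_caps is an assumption, and a80425fb is used only as an example input (it corresponds to a typical default capability set of 14 capabilities).

```shell
#!/bin/sh
# decode_caps HEX
# Prints the name of every capability whose bit is set in a
# CapEff/CapPrm/CapBnd mask, one per line. Names are listed in bit order.
CAP_NAMES="CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID
SETUID SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN
NET_RAW IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE
SYS_PACCT SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE SYS_TIME
SYS_TTY_CONFIG MKNOD LEASE AUDIT_WRITE AUDIT_CONTROL SETFCAP
MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM BLOCK_SUSPEND AUDIT_READ
PERFMON BPF CHECKPOINT_RESTORE"

decode_caps() {
  mask=$(( 0x$1 ))
  bit=0
  for name in $CAP_NAMES; do
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      printf '%s\n' "$name"
    fi
    bit=$(( bit + 1 ))
  done
}

decode_caps 0000000000000400   # prints NET_BIND_SERVICE
decode_caps 0000000000000000   # prints nothing: all capabilities dropped
```

Running decode_caps 00000000a80425fb lists entries such as NET_RAW, SYS_CHROOT, and MKNOD, which is a quick way to spot a container that never dropped its defaults.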

Combine these checks with a policy scanner such as Trivy (trivy k8s --report summary cluster; the exact arguments vary between Trivy releases) or kube-bench, which checks node and control-plane configuration against the CIS Kubernetes Benchmark. In CI, scan the rendered manifests before they reach cluster admission, for example with trivy config, so that regressions are caught before production.
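
As a cheap first gate ahead of a real scanner, a CI job can at least confirm that the key hardening settings appear somewhere in a rendered manifest. The sketch below is deliberately crude: it matches text, not YAML structure, so it cannot tell which container a setting belongs to. The function name check_hardening is an assumption made for this example.

```shell
#!/bin/sh
# check_hardening FILE
# Smoke test: reports hardening settings that never appear in FILE.
# Returns 0 when all are present, 1 otherwise. Text match only; it is
# not a substitute for a policy scanner or admission control.
check_hardening() {
  file="$1"
  missing=0
  for setting in \
    'runAsNonRoot: true' \
    'allowPrivilegeEscalation: false' \
    'readOnlyRootFilesystem: true' \
    'type: RuntimeDefault' \
    'automountServiceAccountToken: false'
  do
    if ! grep -qF -- "$setting" "$file"; then
      echo "missing: $setting"
      missing=1
    fi
  done
  return "$missing"
}

# Example CI usage (rendered.yaml is a placeholder path):
#   helm template ./chart > rendered.yaml
#   check_hardening rendered.yaml || exit 1
```

A failing check prints one line per missing setting, which keeps the CI log readable when a regression slips in.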