Cloud & Kubernetes Security Hardening

Kubernetes Hardening: Pod Security

Every workload running in Kubernetes executes inside a Pod. By default, that Pod shares the host kernel, can run as root, and, unless a policy forbids it, can mount arbitrary host paths, so a single compromised container can pivot into a full cluster takeover. Pod security is the practice of stripping those defaults away so that an attacker who compromises a container lands in a stripped-down sandbox rather than a root shell with access to every secret in the cluster.

This lesson covers the three-layer model that production teams at large-scale companies rely on: Pod Security Standards (PSS) at the namespace level, securityContext at the Pod and container level, and runtime defaults that enforce secure-by-default behavior even when developers forget.

Pod Security Standards: Policy at the Namespace Level

Pod Security Standards (PSS), enforced by the built-in Pod Security Admission controller, replaced PodSecurityPolicy (PSP), which was deprecated in Kubernetes 1.21 and removed in 1.25. PSS defines three named policy levels, and the admission controller needs no CRDs, no webhooks, and no external dependencies.

Privileged: Completely unrestricted. Reserved for system namespaces like kube-system and CNI plugins that legitimately need host access.
Baseline: Prevents the most dangerous escalations (privileged containers, hostPID, hostIPC, dangerous capabilities like NET_ADMIN) while allowing most legacy workloads to run unmodified.
Restricted: Heavily constrained — requires non-root UID, drops all capabilities, disallows privilege escalation, enforces a seccomp profile. The gold standard for application workloads.

Each level can be set in three modes: enforce (reject violating Pods at admission), audit (allow but emit a policy violation audit event), and warn (allow but surface a warning to the API client). The production migration pattern is to apply warn and audit in Restricted mode everywhere first, monitor violations for a sprint, fix workloads, then flip namespaces to enforce.

# Apply all three modes to a production namespace
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=latest \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/warn-version=latest \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/audit-version=latest

# Dry-run: what would Restricted reject in this namespace today?
kubectl label --dry-run=server --overwrite namespace production \
  pod-security.kubernetes.io/enforce=restricted 2>&1 | grep Warning

Pin the version, not just latest. Using enforce-version=latest means a Kubernetes upgrade can start rejecting previously-compliant Pods the moment stricter checks are added to the Restricted profile. In production, pin to a specific version like v1.30 and upgrade deliberately during maintenance windows.
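The staged rollout and the version pinning described above can be combined in a small helper. The following is a minimal sketch, not an established tool: the function name pss_labels and the stage names observe and enforce are made up for illustration, and v1.30 is only an example version.

```shell
# Hypothetical helper: print Pod Security Admission labels for one rollout stage.
# Usage: pss_labels <stage> <level> <version>
#   stage "observe" -> warn + audit only (nothing is rejected)
#   stage "enforce" -> enforce + warn + audit
pss_labels() {
  stage=$1
  level=$2
  version=$3
  case "$stage" in
    observe) modes="warn audit" ;;
    enforce) modes="enforce warn audit" ;;
    *) echo "unknown stage: $stage" >&2; return 1 ;;
  esac
  for mode in $modes; do
    # One label for the level and one for the pinned policy version
    printf 'pod-security.kubernetes.io/%s=%s ' "$mode" "$level"
    printf 'pod-security.kubernetes.io/%s-version=%s ' "$mode" "$version"
  done
  echo
}

# Stage 1: surface violations without blocking anything.
#   kubectl label --overwrite namespace production $(pss_labels observe restricted v1.30)
# Stage 2: once workloads are fixed, start rejecting violations.
#   kubectl label --overwrite namespace production $(pss_labels enforce restricted v1.30)
```

Keeping the version as an explicit argument means a cluster upgrade does not silently change what the namespace enforces; the pinned value is bumped as a separate, reviewed change.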

securityContext: Hardening Individual Pods and Containers

PSS sets the policy floor. securityContext is where you implement it per workload. Kubernetes exposes two levels: spec.securityContext (Pod-level, applies to all containers) and spec.containers[].securityContext (container-level, overrides Pod-level for that container).

Here is a production-ready deployment that satisfies the Restricted standard with explicit, documented intent on every field:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
      # seccomp is declared through securityContext.seccompProfile below; the legacy
      # seccomp.security.alpha.kubernetes.io/pod annotation is deprecated and omitted
    spec:
      # Pod-level security context
      securityContext:
        runAsNonRoot: true          # Reject root UID at runtime, not just "suggest" it
        runAsUser: 1000             # Explicit UID; avoid UID 0 and well-known UIDs
        runAsGroup: 3000
        fsGroup: 2000               # Volumes owned by this GID — no chmod needed
        seccompProfile:
          type: RuntimeDefault      # Kernel syscall filter via container runtime
        supplementalGroups: []      # No extra group memberships
      containers:
        - name: api
          image: registry.example.com/api:v2.4.1@sha256:abc123  # Pin by digest (abc123 is a placeholder)
          securityContext:
            allowPrivilegeEscalation: false   # Sets no_new_privs; setuid/setgid binaries gain nothing
            readOnlyRootFilesystem: true       # Container FS is immutable
            capabilities:
              drop: ["ALL"]                    # Start with zero Linux capabilities
              # add: ["NET_BIND_SERVICE"]      # Only add back if port < 1024 needed
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
          volumeMounts:
            - name: tmp
              mountPath: /tmp           # Writable scratch space, not root FS
            - name: cache
              mountPath: /var/cache/app
      volumes:
        - name: tmp
          emptyDir: {}
        - name: cache
          emptyDir: {}
      automountServiceAccountToken: false   # No API server access unless required

Pod Security layers: the PSS namespace label enforces policy at admission; Pod-level securityContext sets shared defaults; container-level overrides apply per container.

Runtime Defaults: Secure Without Developer Effort

Relying solely on developers remembering to set securityContext does not scale. Production platforms use two enforcement mechanisms to make the secure path the default path.

1. Admission Webhooks with OPA/Kyverno

A mutating admission webhook can inject a sane securityContext into every Pod that does not specify one. A validating webhook can then reject Pods that still violate policy after mutation. Kyverno is the simpler option for Kubernetes-native teams; OPA/Gatekeeper offers more flexibility for complex policies shared across clouds.

# Kyverno policy: mutate any Pod missing a seccompProfile
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-seccomp
spec:
  rules:
    - name: add-seccomp-profile
      match:
        any:
          - resources:
              kinds: ["Pod"]
      mutate:
        patchStrategicMerge:
          spec:
            securityContext:
              +(seccompProfile):          # Only add if not already set
                type: RuntimeDefault
            containers:
              - (name): "?*"
                securityContext:
                  +(allowPrivilegeEscalation): false
                  +(readOnlyRootFilesystem): true
                  capabilities:
                    +(drop): ["ALL"]

2. seccomp RuntimeDefault as a Global Default

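The kubelet can apply the RuntimeDefault seccomp profile to every Pod that does not specify one. This feature (SeccompDefault) has been generally available since Kubernetes 1.27: start the kubelet with --seccomp-default, or set seccompDefault: true in the kubelet configuration file. On releases before 1.25 the SeccompDefault feature gate also had to be enabled explicitly. This is the closest Kubernetes comes to a global safe default at the node level; activate it on new node groups before rolling to existing ones.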

Prefer RuntimeDefault over Unconfined, always. The RuntimeDefault seccomp profile (provided by containerd or CRI-O) blocks roughly 40 to 50 of the 300+ Linux syscalls, including mount, unshare, and kexec_load, which feature in many container-escape exploit chains. The performance overhead is negligible for virtually all workloads. There is no good reason to run Unconfined in production except for dedicated security tooling that genuinely needs the blocked syscalls.
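One way to confirm that a seccomp default actually took effect is to read the Seccomp field in /proc/<pid>/status for a container process, where the kernel reports 0 (disabled), 1 (strict), or 2 (filter). The sketch below assumes that field layout; the function name seccomp_mode is hypothetical.

```shell
# Hypothetical helper: read the text of /proc/<pid>/status on stdin and
# print the seccomp mode in words. The trailing colon in the pattern keeps
# the newer "Seccomp_filters:" line from matching.
seccomp_mode() {
  awk '/^Seccomp:/ {
         if ($2 == 0) print "disabled";
         else if ($2 == 1) print "strict";
         else if ($2 == 2) print "filter";
         else print "unknown";
         found = 1
       }
       END { if (!found) { print "no Seccomp field"; exit 1 } }'
}

# Example against a running workload (requires cat in the image):
#   kubectl exec -n production deploy/api-server -- cat /proc/1/status | seccomp_mode
```

A Pod running under RuntimeDefault should report filter; disabled on a node where the default is supposed to be active suggests the kubelet setting did not apply or the Pod explicitly requested Unconfined.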

Production Failure Modes and What They Cost You

Understanding why each field exists requires seeing what happens without it:

Missing runAsNonRoot: true: An image built without a USER directive runs as UID 0 inside the container. Unless user namespaces are in use, UID 0 inside the container is UID 0 on the host, so a vulnerability in the container runtime or kernel hands the attacker host root. CVE-2019-5736 (runc binary overwrite) required root inside the container to exploit.
Missing allowPrivilegeEscalation: false: Binaries with the setuid bit set (like sudo, newgrp, or pkexec) can escalate to root even when the container starts as a non-root user. CVE-2021-4034 (PwnKit) relied on exactly this: pkexec being installed setuid root.
Missing readOnlyRootFilesystem: true: An attacker who achieves code execution can write persistence tools, stage stolen data, or drop a reverse shell into the container's writable layer. A read-only root filesystem confines writes to explicitly mounted volumes such as an emptyDir scratch space.
Missing capability drops: The default set of Linux capabilities granted to a container includes NET_RAW (craft raw packets, ARP spoofing), SYS_CHROOT, and MKNOD. Drop all and add back only what is provably needed.
Missing automountServiceAccountToken: false: Every Pod gets a service account token mounted by default. In a compromised container, that token provides API server access and is the primary pivot for lateral movement attacks within the cluster.
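The capabilities named in the list above correspond to bits in the CapEff mask that the kernel exposes in /proc/<pid>/status. The following sketch decodes such a mask into names; decode_caps is a hypothetical helper, and the bit order is taken from linux/capability.h. The mask a80425fb used in the example is the value commonly seen for Docker-style default capability sets, which is an assumption about your runtime rather than a guarantee.

```shell
# Hypothetical helper: print the capability names set in a hex mask such as
# the CapEff value from /proc/<pid>/status, one name per line.
# Bit 0 is CAP_CHOWN and bit 40 is CAP_CHECKPOINT_RESTORE.
decode_caps() {
  mask=$((0x$1))
  bit=0
  for name in CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID \
              SETUID SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST \
              NET_ADMIN NET_RAW IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO \
              SYS_CHROOT SYS_PTRACE SYS_PACCT SYS_ADMIN SYS_BOOT SYS_NICE \
              SYS_RESOURCE SYS_TIME SYS_TTY_CONFIG MKNOD LEASE AUDIT_WRITE \
              AUDIT_CONTROL SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM \
              BLOCK_SUSPEND AUDIT_READ PERFMON BPF CHECKPOINT_RESTORE
  do
    # Test the current bit and print the matching name if it is set
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      echo "$name"
    fi
    bit=$((bit + 1))
  done
}

# Example: decode a mask assumed to represent a runtime's default set
#   decode_caps 00000000a80425fb
# A container with drop: ["ALL"] should decode to an empty list:
#   decode_caps 0000000000000000
```

Seeing NET_RAW, SYS_CHROOT, or MKNOD in the decoded output for an application container is a sign that the capability drop was not applied.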

Image-level root is one of the most common findings in production audits. Teams set runAsUser: 1000 in the deployment manifest but forget to add a USER 1000 directive to the Dockerfile. Kubernetes will run the container as whatever UID the manifest specifies, but if the image was built expecting UID 0, the process may crash on missing file permissions. Always build with USER nonroot in the Dockerfile AND enforce it at the Pod level.

Verifying Your Hardening

After applying securityContext settings, validate them without guessing:

# Confirm the running container's effective UID
kubectl exec -n production deploy/api-server -- id
# Expected: uid=1000 gid=3000 groups=2000

# Confirm privilege escalation is blocked (no_new_privs is set on the process)
kubectl exec -n production deploy/api-server -- grep NoNewPrivs /proc/1/status
# Expected: NoNewPrivs:	1

# Confirm root filesystem is read-only
kubectl exec -n production deploy/api-server -- touch /test-write
# Expected: touch: cannot touch '/test-write': Read-only file system

# Confirm no extra capabilities
kubectl exec -n production deploy/api-server -- cat /proc/1/status | grep Cap
# CapPrm and CapEff should be 0000000000000000 (no capabilities)

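# Check the seccomp profile declared in the Pod spec
kubectl get pod -n production -l app=api-server \
  -o jsonpath='{.items[0].spec.securityContext.seccompProfile}'

# Confirm a seccomp filter is active in the kernel (Seccomp: 2 means filter mode)
kubectl exec -n production deploy/api-server -- grep Seccomp /proc/1/status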

Combine these checks with a scanner such as Trivy (trivy k8s --report summary cluster) for workload misconfigurations and kube-bench for CIS Benchmark checks on nodes and the control plane. In CI, scan the rendered manifests (for example with trivy config) before cluster admission so that regressions are caught before they reach production.
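Where a full scanner is not yet wired into the pipeline, a crude text check on the rendered manifest can act as a stopgap. The sketch below is an assumption-laden smoke test, not a replacement for a policy scanner: check_manifest is a hypothetical name, the check is a plain grep rather than a YAML parser, and it only recognizes the inline drop: ["ALL"] form used in this lesson.

```shell
# Hypothetical pre-admission smoke test: fail if a rendered manifest lacks
# any of the hardening settings covered in this lesson. Because this greps
# plain text, it cannot tell which container a setting belongs to.
check_manifest() {
  manifest=$1
  status=0
  for pattern in \
    'runAsNonRoot:[[:space:]]*true' \
    'allowPrivilegeEscalation:[[:space:]]*false' \
    'readOnlyRootFilesystem:[[:space:]]*true' \
    'type:[[:space:]]*RuntimeDefault' \
    'drop:[[:space:]]*\[[[:space:]]*"?ALL"?[[:space:]]*\]'
  do
    if ! grep -Eq "$pattern" "$manifest"; then
      echo "missing: $pattern" >&2
      status=1
    fi
  done
  return $status
}

# Example CI step (rendered.yaml is a placeholder file name):
#   check_manifest rendered.yaml || exit 1
```

A manifest with several containers can pass this check even if only one of them is hardened, which is why a structure-aware scanner should take over as soon as it is available.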