Advanced Docker & Container Security

Container Runtimes & OCI

When you run docker run nginx, you probably think of Docker as "the thing that runs containers." That mental model worked in 2015, but today's production infrastructure is far more layered. Since Kubernetes 1.24, nodes typically do not talk to Docker at all: kubelet talks to containerd (or CRI-O), which invokes runc, which makes the Linux syscalls that create the container. Understanding this stack is not academic trivia; it determines which runtime flags you can set, how your security policy is enforced, and which runtime you choose when you need stronger isolation than the default.

The Problem That Created OCI

By 2015, Docker had become synonymous with containers, but its monolithic architecture created fragility for the ecosystem. Kubernetes, CoreOS rkt, and others all wanted to run containers, but there was no standard, so every player was re-implementing image formats and runtime behavior independently. The Open Container Initiative (OCI), housed under the Linux Foundation, was formed to fix this. Two of its specifications matter here (a third, the Distribution Spec, covers the registry API):

OCI Image Spec — the format of a container image: a manifest, a filesystem layer stack (tar), and a config JSON describing the entrypoint, environment, and more.
OCI Runtime Spec — a config.json that describes everything needed to run a container: root filesystem path, process args, namespaces, cgroups, capabilities, seccomp profile, mount points, hooks.
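
To make the image format concrete, here is a minimal Python sketch that assembles a manifest for a single-layer image. The media type strings are the ones defined by the OCI image spec; the layer contents, the config values, and the helper names descriptor and build_layer_tar are invented for illustration and do not come from any real image.

```python
import gzip
import hashlib
import io
import json
import tarfile


def descriptor(media_type: str, blob: bytes) -> dict:
    """Build an OCI content descriptor: media type, sha256 digest, size in bytes."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


def build_layer_tar() -> bytes:
    """Return an uncompressed tar archive containing one small file (illustrative)."""
    payload = b"hello from a layer\n"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        entry = tarfile.TarInfo(name="hello.txt")
        entry.size = len(payload)
        archive.addfile(entry, io.BytesIO(payload))
    return buffer.getvalue()


layer_tar = build_layer_tar()
layer_blob = gzip.compress(layer_tar)  # layers are commonly stored gzip-compressed

# Image config: entrypoint, environment, and digests of the uncompressed layers.
config_blob = json.dumps({
    "architecture": "amd64",
    "os": "linux",
    "config": {"Entrypoint": ["/bin/sh"], "Env": ["PATH=/usr/bin:/bin"]},
    "rootfs": {
        "type": "layers",
        "diff_ids": ["sha256:" + hashlib.sha256(layer_tar).hexdigest()],
    },
}).encode()

# The manifest ties the config and the ordered layer list together by digest.
manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "config": descriptor("application/vnd.oci.image.config.v1+json", config_blob),
    "layers": [
        descriptor("application/vnd.oci.image.layer.v1.tar+gzip", layer_blob),
    ],
}

print(json.dumps(manifest, indent=2))
```

Everything is addressed by content digest, which is why any compliant tool can verify and unpack an image built by another.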

Any image produced by an OCI-compliant build tool can be run by any OCI-compliant runtime, and any OCI-compliant runtime can be plugged in beneath a higher-level runtime such as containerd or CRI-O. Docker, Buildah, Kaniko, and BuildKit all output OCI images. runc, crun, gVisor's runsc, and Kata Containers all implement the OCI runtime spec.

The Runtime Stack in Full

Modern Kubernetes clusters use a three-layer runtime stack. Each layer has a distinct job:

The three-layer container runtime stack: kubelet contacts containerd via CRI, containerd hands an OCI bundle to runc via the shim, and runc makes the kernel syscalls that create the container.

containerd: The High-Level Runtime

containerd is a CNCF graduated project and the default runtime in every major managed Kubernetes service (EKS, GKE, AKS). It manages the full lifecycle of containers: pulling images into its content store, unpacking them into overlay filesystems through snapshotters, invoking CNI plugins for Pod networking, and tracking container state. It deliberately does not execute the container process itself; that job belongs to the low-level runtime.

containerd exposes a CRI-compatible gRPC API that kubelet speaks. You can also speak to it directly with the ctr CLI (low-level) or the friendlier nerdctl (Docker-compatible CLI backed by containerd).

# Inspect containerd state on a Kubernetes node
# (SSH into the node first — these are node-level commands)

# List running containers via crictl (CRI-compliant CLI)
crictl ps

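# Pull an image directly through containerd
# (without -n, ctr uses the "default" namespace, which Kubernetes does not see)
ctr image pull docker.io/library/nginx:1.27-alpine

# List images in the k8s.io namespace, where Kubernetes-managed images live
ctr -n k8s.io images ls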

# Inspect the OCI runtime config that containerd generates for a container
# (useful for auditing what seccomp/caps are actually applied)
crictl inspect <container-id> | python3 -m json.tool | grep -A5 seccomp

containerd namespaces: containerd has its own namespace concept (not Linux namespaces). Kubernetes containers live in the k8s.io namespace; standalone ctr commands use default. Always specify -n k8s.io when debugging Kubernetes containers with ctr.
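
For auditing, grepping the inspect output gets tedious quickly. The sketch below pulls the security-relevant fields out of a saved crictl inspect document. It assumes the generated OCI spec sits under info.runtimeSpec, which is the layout containerd's CRI plugin typically produces; other runtimes or versions may differ. The SAMPLE document and the function name summarize_security are made up for illustration.

```python
import json


def summarize_security(inspect_doc: dict) -> dict:
    """Extract seccomp and capability settings from a crictl inspect document.

    Assumes the OCI runtime spec is stored under info.runtimeSpec.
    Missing sections are reported with conservative defaults.
    """
    spec = inspect_doc.get("info", {}).get("runtimeSpec", {})
    process = spec.get("process") or {}
    linux = spec.get("linux") or {}
    seccomp = linux.get("seccomp") or {}
    capabilities = process.get("capabilities") or {}
    return {
        "seccomp_default_action": seccomp.get("defaultAction", "none (unconfined)"),
        "bounding_capabilities": sorted(capabilities.get("bounding", [])),
        "no_new_privileges": bool(process.get("noNewPrivileges", False)),
    }


# Hypothetical, heavily abbreviated inspect output used only for this example.
SAMPLE = {
    "status": {"id": "abc123", "state": "CONTAINER_RUNNING"},
    "info": {
        "runtimeSpec": {
            "process": {
                "noNewPrivileges": True,
                "capabilities": {"bounding": ["CAP_NET_BIND_SERVICE", "CAP_CHOWN"]},
            },
            "linux": {"seccomp": {"defaultAction": "SCMP_ACT_ERRNO"}},
        }
    },
}

print(json.dumps(summarize_security(SAMPLE), indent=2))
```

In practice you would save the output with crictl inspect <container-id> > inspect.json and pass json.load(open("inspect.json")) to the function instead of SAMPLE.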

runc: The Low-Level OCI Runtime

runc is the reference implementation of the OCI runtime spec, extracted from Docker's original libcontainer. It is a small Go binary (~8 MB) that does exactly one thing: given a directory with a rootfs/ and a config.json, it creates Linux namespaces, sets cgroup limits, applies seccomp filters and capability bounding sets, then exec()s the process. It exits after handing control to the shim; it does not remain resident.

You can invoke runc directly to understand what it does or to debug a misbehaving container:

# Manually run a container with runc (for understanding — not typical ops)

# 1. Create the OCI bundle directory structure
mkdir -p /tmp/mycontainer/rootfs

# 2. Export a Docker image as a rootfs
docker export $(docker create alpine) | tar -C /tmp/mycontainer/rootfs -xf -

# 3. Generate a default OCI config.json
cd /tmp/mycontainer
runc spec

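# 4. Inspect the generated spec; this is exactly what runc reads
python3 -m json.tool config.json | head -60

# 5. Run the container (needs root; for rootless use, generate the spec with `runc spec --rootless`)
# The default spec starts an interactive sh; type `exit` to leave it
runc run mycontainer-1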

# 6. List running runc containers (in another terminal)
runc list

# 7. Delete
runc delete mycontainer-1
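
The spec that runc spec generates starts an interactive sh with a terminal attached, which is awkward in scripts. As a rough sketch, the Python below rewrites a spec so the container runs a single command without a TTY. The function name make_one_shot is hypothetical, and DEFAULT_SPEC is an abbreviated stand-in for a real config.json (the ociVersion value in particular depends on your runc version).

```python
import copy
import json


def make_one_shot(spec: dict, args: list) -> dict:
    """Return a copy of an OCI runtime spec that runs `args` without a terminal."""
    updated = copy.deepcopy(spec)          # leave the caller's spec untouched
    process = updated.setdefault("process", {})
    process["terminal"] = False            # no TTY: output goes to runc's stdout
    process["args"] = list(args)           # command executed inside the container
    return updated


# Abbreviated stand-in for the file produced by `runc spec` (assumption).
DEFAULT_SPEC = {
    "ociVersion": "1.0.2",
    "process": {"terminal": True, "args": ["sh"], "cwd": "/"},
    "root": {"path": "rootfs", "readonly": True},
    "hostname": "runc",
}

one_shot = make_one_shot(DEFAULT_SPEC, ["echo", "hello from runc"])
print(json.dumps(one_shot["process"], indent=2))
```

To apply this to the bundle from the steps above, load /tmp/mycontainer/config.json with json.load, pass it through make_one_shot, and write the result back before calling runc run.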

Where Docker Fits Today

Docker (the CLI and daemon) was refactored starting in 2016, when Docker Engine 1.11 became the first release built on containerd and runc. Today dockerd is essentially a developer-experience layer on top of containerd: it handles the Docker API, image builds via BuildKit, and the familiar CLI. When you run docker run, the call path is: docker CLI → dockerd → containerd → shim → runc → kernel. On Kubernetes nodes, dockerd is not in the path at all: kubelet speaks CRI directly to containerd.

Production insight: As of Kubernetes 1.24, the dockershim (the adapter that let kubelet speak to dockerd) was removed. If your cluster was still using Docker as the Kubernetes runtime, it had to migrate to containerd or CRI-O, or adopt the external cri-dockerd adapter. Most cloud-managed clusters switched automatically. If you ever see a node in NotReady state after a Kubernetes upgrade, one thing to check is that the CRI socket path in the kubelet config matches the installed runtime (/run/containerd/containerd.sock for containerd, /var/run/crio/crio.sock for CRI-O).
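
The socket check can be scripted. The sketch below maps a kubelet runtime endpoint string (for example the value given to the --container-runtime-endpoint flag) to the runtime it points at. The function names are hypothetical, the table only covers the two sockets mentioned above, and treating /var/run as an alias for /run is an assumption that holds on most modern distributions.

```python
# Socket paths from the text above, normalized to their /run form.
KNOWN_SOCKETS = {
    "/run/containerd/containerd.sock": "containerd",
    "/run/crio/crio.sock": "CRI-O",
}


def normalize_endpoint(endpoint: str) -> str:
    """Strip the unix:// scheme and fold /var/run into /run (assumed symlink)."""
    path = endpoint
    if path.startswith("unix://"):
        path = path[len("unix://"):]
    if path.startswith("/var/run/"):
        path = "/run/" + path[len("/var/run/"):]
    return path


def runtime_for_endpoint(endpoint: str) -> str:
    """Return the runtime name a CRI endpoint refers to, or 'unknown'."""
    return KNOWN_SOCKETS.get(normalize_endpoint(endpoint), "unknown")


for candidate in (
    "unix:///run/containerd/containerd.sock",
    "unix:///var/run/crio/crio.sock",
    "unix:///var/run/dockershim.sock",
):
    print(candidate, "->", runtime_for_endpoint(candidate))
```

An "unknown" result for a node that is NotReady after an upgrade is a hint that kubelet is still pointed at a socket the installed runtime does not provide.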

Alternative Runtimes: When runc Is Not Enough

Containers started by runc share the host kernel: every container on a node uses the same kernel. For multi-tenant workloads where you cannot fully trust the container workload (e.g., a public CI service or a serverless platform), that shared kernel is the weak point in the security boundary. Two production-grade alternatives address this:

gVisor (runsc): Google's OCI runtime that interposes a user-space kernel (the "Sentry") between the container process and the host kernel. System calls from the container hit the Sentry, which re-implements a large subset of Linux in Go. This drastically reduces the host kernel attack surface. GKE Sandbox nodes use runsc. Performance overhead is real but workload-dependent: CPU-bound code runs natively with little penalty, while syscall- and I/O-heavy workloads can slow down noticeably.
Kata Containers: runs each container (or Pod) inside a lightweight VM using QEMU or Cloud Hypervisor. Full hardware VM isolation; the container process never touches the host kernel. Used in Azure Confidential Containers. Higher startup latency (roughly 1 s versus roughly 100 ms for runc).

Both are OCI-compliant, so once the runtime is installed on the node and registered as a handler in containerd's configuration, you can use either as a drop-in replacement for runc by creating a RuntimeClass in Kubernetes:

# Kubernetes RuntimeClass for gVisor (requires gVisor installed on nodes)
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc    # must match the runtime handler name configured in containerd
---
# Use it in a Pod spec — everything else is identical
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app
spec:
  runtimeClassName: gvisor    # <-- the only change
  containers:
    - name: app
      image: gcr.io/myproject/myapp:v2.1.0
      resources:
        requests:
          memory: "128Mi"
          cpu: "250m"
        limits:
          memory: "256Mi"
          cpu: "500m"
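
A Pod that names a RuntimeClass which does not exist will not start, so it can be worth checking the reference before applying manifests. The following sketch does that for manifests already parsed into Python dictionaries (the standard library has no YAML parser, so the two documents above are mirrored by hand). The function name missing_runtime_classes is hypothetical.

```python
def missing_runtime_classes(manifests: list) -> list:
    """Return (pod name, runtimeClassName) pairs that reference no defined RuntimeClass."""
    defined = {
        doc["metadata"]["name"]
        for doc in manifests
        if doc.get("kind") == "RuntimeClass"
    }
    missing = []
    for doc in manifests:
        if doc.get("kind") != "Pod":
            continue
        wanted = doc.get("spec", {}).get("runtimeClassName")
        # Pods without runtimeClassName use the node's default runtime.
        if wanted and wanted not in defined:
            missing.append((doc["metadata"]["name"], wanted))
    return missing


# Hand-written mirror of the RuntimeClass and Pod shown above (abbreviated).
MANIFESTS = [
    {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": "gvisor"},
        "handler": "runsc",
    },
    {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "sandboxed-app"},
        "spec": {
            "runtimeClassName": "gvisor",
            "containers": [{"name": "app", "image": "gcr.io/myproject/myapp:v2.1.0"}],
        },
    },
]

print(missing_runtime_classes(MANIFESTS))
```

This only checks names within the given set of manifests; it cannot tell whether the handler is actually installed on the nodes.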

Not all workloads run on gVisor: The Sentry implements the most common syscalls, but some applications use obscure or new kernel interfaces that the Sentry does not support yet. Always test before migrating — stateful databases, eBPF-dependent tools, and some Go runtime internals have historically had compatibility issues. Check the gVisor compatibility list and run your test suite against runsc in a staging environment before rolling to production.

Choosing a Runtime: Decision Matrix

At big-tech scale, the runtime choice is driven by the threat model and workload type:

Internal, trusted workloads on dedicated nodes: runc (default). Lowest overhead, widest compatibility.
Multi-tenant SaaS or public CI: runsc (gVisor) for CPU/memory-bound jobs; Kata for anything requiring strong VM-level isolation.
Regulated environments (FedRAMP, PCI-DSS): Kata Containers or confidential computing VMs; auditors may ask for evidence of hardware-level isolation.
Edge / IoT with constrained resources: crun (a C reimplementation of runc; the project's own benchmarks show roughly 2× faster container startup, along with a much smaller binary and memory footprint).
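
The matrix above can be written down as a small function, which is handy when the choice has to be made by tooling rather than by hand. This is only one possible encoding: the Workload fields, their names, and the precedence between rules are assumptions made for this sketch, and a real policy would weigh more factors.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a workload's threat model and constraints."""
    trusted: bool = True                 # internal code on dedicated nodes
    regulated: bool = False              # e.g. FedRAMP or PCI-DSS scope
    needs_vm_isolation: bool = False     # untrusted code needing a VM boundary
    resource_constrained: bool = False   # edge / IoT hardware


def choose_runtime(workload: Workload) -> str:
    """Map a workload to a runtime name following the decision matrix."""
    if workload.regulated:
        return "kata"
    if not workload.trusted:
        # Untrusted code: VM isolation if required, otherwise gVisor.
        return "kata" if workload.needs_vm_isolation else "runsc"
    if workload.resource_constrained:
        return "crun"
    return "runc"


print(choose_runtime(Workload()))
print(choose_runtime(Workload(trusted=False)))
print(choose_runtime(Workload(trusted=False, needs_vm_isolation=True)))
print(choose_runtime(Workload(resource_constrained=True)))
```

Isolation requirements are checked before resource constraints on purpose: a weaker boundary is the more expensive mistake to make.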

The OCI spec is the contract: switching runtimes requires no changes to your images or Dockerfiles, and the only change to your workload manifests is the runtimeClassName field in the Pod spec (plus a one-time RuntimeClass object and node setup). This is the practical payoff of the OCI standardization effort.