Container Runtimes & OCI
When you run docker run nginx, you probably think of Docker as "the thing that runs containers." That mental model worked in 2015, but today's production infrastructure is far more layered. Kubernetes nodes do not talk to Docker at all — they talk to containerd, which talks to runc, which makes a handful of Linux syscalls. Understanding this stack is not academic trivia; it determines which runtime flags you can set, how your security policy is enforced, and which runtime you choose when you need stronger isolation than the default.
The Problem That Created OCI
By 2015, Docker had become synonymous with containers, but its monolithic architecture created fragility for the ecosystem. Kubernetes, CoreOS rkt, and others all wanted to run containers, but there was no standard — every player was re-implementing image formats and runtime behavior independently. The Open Container Initiative (OCI), housed under the Linux Foundation, was formed to fix this. It defines two specifications:
- OCI Image Spec: the format of a container image, consisting of a manifest, a stack of filesystem layers (tar archives), and a config JSON describing the entrypoint, environment, and more.
- OCI Runtime Spec: a config.json file describing everything needed to run a container, including the root filesystem path, process args, namespaces, cgroups, capabilities, seccomp profile, mount points, and hooks.
An image produced by any OCI-compliant build tool can be run by any OCI-compliant runtime, and any OCI-compliant runtime can be plugged into any OCI-aware orchestrator. Docker, Buildah, Kaniko, and BuildKit all output OCI images. runc, crun, gVisor's runsc, and Kata Containers all implement the OCI runtime spec.
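As a rough illustration of the image spec's manifest, the Python sketch below builds one for a few made-up blobs. The media types are the ones the OCI Image Spec defines; the blob contents and the helper names descriptor and build_manifest are placeholders invented for this example. A real layer would be a gzip-compressed tar archive, and a real image config carries more fields than shown.

```python
import hashlib
import json


def descriptor(media_type: str, blob: bytes) -> dict:
    """Return an OCI content descriptor: media type, sha256 digest, and size."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


def build_manifest(config_blob: bytes, layer_blobs: list[bytes]) -> dict:
    """Assemble an image manifest pointing at one config blob and N layers."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": descriptor("application/vnd.oci.image.config.v1+json", config_blob),
        "layers": [
            descriptor("application/vnd.oci.image.layer.v1.tar+gzip", blob)
            for blob in layer_blobs
        ],
    }


# Placeholder content (assumption): an abbreviated image config and fake layers.
config_blob = json.dumps(
    {"architecture": "amd64", "os": "linux", "config": {"Entrypoint": ["nginx"]}}
).encode()
manifest = build_manifest(config_blob, [b"fake-layer-1", b"fake-layer-2"])
print(json.dumps(manifest, indent=2))
```

Because every blob is addressed by its digest, a registry or runtime can verify each piece independently of the tool that produced it.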
The Runtime Stack in Full
Modern Kubernetes clusters use a three-layer runtime stack, and each layer has a distinct job.
containerd: The High-Level Runtime
containerd is a CNCF graduated project and the default runtime in every major managed Kubernetes service (EKS, GKE, AKS). It manages the full lifecycle of containers: pulling and storing images (via its snapshot system), setting up overlay filesystems, configuring CNI networking, and managing container state — but it deliberately does not execute the container process itself. That job belongs to the low-level runtime.
containerd exposes a CRI-compatible gRPC API that kubelet speaks. You can also speak to it directly with the ctr CLI (low-level) or the friendlier nerdctl (Docker-compatible CLI backed by containerd).
Containers and images managed by Kubernetes live in containerd's k8s.io namespace; standalone ctr commands use default. Always specify -n k8s.io when debugging Kubernetes containers with ctr.
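To make the namespace point concrete, here is a small Python sketch that builds ctr command lines with an explicit namespace. The --namespace flag (short form -n) and the containers, images, and tasks subcommands are part of ctr; the helper name ctr_argv is made up here. Nothing is executed: the script only prints the commands, which you would run yourself on a node that has ctr installed.

```python
import shlex

K8S_NAMESPACE = "k8s.io"  # namespace used for Kubernetes-managed containers


def ctr_argv(resource: str, namespace: str = K8S_NAMESPACE) -> list[str]:
    """Build a `ctr` listing command scoped to one containerd namespace."""
    if resource not in {"containers", "images", "tasks"}:
        raise ValueError(f"unsupported resource: {resource}")
    return ["ctr", "--namespace", namespace, resource, "list"]


# Print the commands for inspecting what kubelet has created on this node.
for resource in ("containers", "images", "tasks"):
    print(shlex.join(ctr_argv(resource)))
```

Running the same listing with namespace="default" is a quick way to see that the two namespaces are isolated from each other.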
runc: The Low-Level OCI Runtime
runc is the reference implementation of the OCI runtime spec, extracted from Docker's original libcontainer. It is a small Go binary (~8 MB) that does exactly one thing: given a directory with a rootfs/ and a config.json, it creates Linux namespaces, sets cgroup limits, applies seccomp filters and capability bounding sets, then exec()s the process. It exits after handing control to the shim; it does not remain resident.
You can invoke runc directly to understand what it does or to debug a misbehaving container:
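A minimal sketch of that workflow, written in Python: it lays out an OCI bundle (a rootfs/ directory plus a config.json) in a temporary directory and prints the runc command that would start it. The config fields follow the OCI runtime spec, but this is a trimmed illustration, not the output of runc spec. The container ID demo and the helper names are made up, the rootfs is left empty (a real one needs a populated filesystem, for example an exported image), and actually running the printed command requires runc plus root or a rootless setup.

```python
import json
import tempfile
from pathlib import Path

CAPS = ["CAP_NET_BIND_SERVICE"]  # illustrative capability set (assumption)


def minimal_config(args: list[str]) -> dict:
    """A trimmed OCI runtime config: process, rootfs, namespaces, limits."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "terminal": False,
            "user": {"uid": 0, "gid": 0},
            "args": args,
            "env": ["PATH=/usr/sbin:/usr/bin:/sbin:/bin"],
            "cwd": "/",
            "capabilities": {"bounding": CAPS, "effective": CAPS, "permitted": CAPS},
            "noNewPrivileges": True,
        },
        "root": {"path": "rootfs", "readonly": True},
        "hostname": "demo",
        "mounts": [{"destination": "/proc", "type": "proc", "source": "proc"}],
        "linux": {
            "namespaces": [
                {"type": t} for t in ("pid", "network", "ipc", "uts", "mount")
            ],
            "resources": {"memory": {"limit": 64 * 1024 * 1024}},
        },
    }


def write_bundle(bundle: Path, args: list[str]) -> list[str]:
    """Create rootfs/ and config.json, then return the runc command line."""
    (bundle / "rootfs").mkdir(parents=True)
    (bundle / "config.json").write_text(json.dumps(minimal_config(args), indent=2))
    return ["runc", "run", "--bundle", str(bundle), "demo"]


with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp) / "bundle"
    argv = write_bundle(bundle, ["sleep", "30"])
    bundle_files = sorted(p.name for p in bundle.iterdir())
    print(bundle_files)    # ['config.json', 'rootfs']
    print(" ".join(argv))  # the command you would run by hand
```

On a real node, runc list and runc state demo are the usual next steps for inspecting what was created.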
Where Docker Fits Today
Docker (the CLI and daemon) was refactored starting in 2017. Today dockerd is essentially a developer-experience layer on top of containerd: it handles the Docker API, image builds via BuildKit, and the familiar CLI. When you run docker run, the call path is: docker CLI → dockerd → containerd → shim → runc → kernel. On Kubernetes nodes, dockerd is not in the path at all — kubelet speaks CRI directly to containerd.
If a node is stuck in the NotReady state after a Kubernetes upgrade, check that the CRI socket path in the kubelet config matches the installed runtime (/run/containerd/containerd.sock for containerd, /var/run/crio/crio.sock for CRI-O).
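One way to automate that check is sketched below in Python. It takes the endpoint string kubelet is configured with (the --container-runtime-endpoint flag or the containerRuntimeEndpoint config field, typically of the form unix:///path/to.sock) and reports whether a Unix socket actually exists at that path. The function names are illustrative, and the script assumes it runs on the node itself.

```python
import os
import stat

# Default socket locations for the two runtimes named above.
KNOWN_SOCKETS = {
    "containerd": "/run/containerd/containerd.sock",
    "CRI-O": "/var/run/crio/crio.sock",
}


def socket_path(endpoint: str) -> str:
    """Strip the unix:// scheme from a kubelet runtime endpoint."""
    prefix = "unix://"
    return endpoint[len(prefix):] if endpoint.startswith(prefix) else endpoint


def is_live_socket(path: str) -> bool:
    """True if `path` exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        return False


def check_endpoint(endpoint: str) -> str:
    """Compare the configured endpoint with the sockets present on the node."""
    path = socket_path(endpoint)
    if is_live_socket(path):
        return f"ok: {path} is a socket"
    present = [name for name, p in KNOWN_SOCKETS.items() if is_live_socket(p)]
    hint = f"; found sockets for: {', '.join(present)}" if present else ""
    return f"mismatch: no socket at {path}{hint}"


print(check_endpoint("unix:///run/containerd/containerd.sock"))
```

A mismatch message that lists a different runtime's socket is a strong hint that the kubelet endpoint was not updated when the runtime changed.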
Alternative Runtimes: When runc Is Not Enough
runc shares the host kernel — every container on a node uses the same kernel. For multi-tenant workloads where you cannot fully trust the container workload (e.g., a public CI service or a serverless platform), this is a security boundary concern. Two production-grade alternatives handle this:
- gVisor (runsc): Google's OCI runtime that interposes a user-space kernel (the "Sentry") between the container process and the host kernel. System calls from the container hit the Sentry, which re-implements a large subset of Linux in Go. This drastically reduces the host kernel attack surface. GKE Sandbox nodes use runsc. Performance overhead is real (~10-20% for CPU-bound, higher for syscall-heavy workloads).
- Kata Containers: runs each container (or Pod) inside a lightweight VM using QEMU or Cloud Hypervisor. Full hardware VM isolation; the container process never touches the host kernel. Used in Azure Confidential Containers. Higher startup latency (~1s vs ~100ms for runc).
Both are OCI-compliant, so you can use them as a drop-in replacement for runc inside containerd by registering a RuntimeClass in Kubernetes:
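A sketch of what that registration might look like, generated from Python and emitted as JSON (kubectl apply -f accepts JSON as well as YAML). The RuntimeClass kind, its handler field, and the Pod's runtimeClassName field are part of the Kubernetes API. The names gvisor and sandboxed-nginx are made up for this example, and the handler value runsc is an assumption: it must match a runtime handler that is already configured in containerd on the node.

```python
import json


def runtime_class(name: str, handler: str) -> dict:
    """Cluster-scoped object mapping a name to a node-level runtime handler."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,
    }


def pod_with_runtime(pod_name: str, image: str, runtime_class_name: str) -> dict:
    """A Pod that opts into a RuntimeClass via spec.runtimeClassName."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "runtimeClassName": runtime_class_name,
            "containers": [{"name": pod_name, "image": image}],
        },
    }


rc = runtime_class("gvisor", "runsc")  # hypothetical names
pod = pod_with_runtime("sandboxed-nginx", "nginx", rc["metadata"]["name"])

# Wrap both objects in a List so they can be applied from a single file.
manifests = {"apiVersion": "v1", "kind": "List", "items": [rc, pod]}
print(json.dumps(manifests, indent=2))
```

Pods that omit runtimeClassName keep using the node's default handler, so the two runtimes can coexist on the same cluster.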
Because gVisor re-implements only a subset of Linux, test each workload with runsc in a staging environment before rolling to production.
Choosing a Runtime: Decision Matrix
At big-tech scale, the runtime choice is driven by the threat model and workload type:
- Internal, trusted workloads on dedicated nodes: runc (default). Lowest overhead, widest compatibility.
- Multi-tenant SaaS or public CI: runsc (gVisor) for CPU/memory-bound jobs; Kata for anything requiring strong VM-level isolation.
- Regulated environments (FedRAMP, PCI-DSS): Kata Containers or confidential computing VMs; auditors often require hardware-level isolation proof.
- Edge / IoT with constrained resources: crun (C reimplementation of runc, ~15× faster startup, ~50× lower memory footprint).
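The matrix above can be written down as a small lookup, sketched here in Python. The Workload fields and the order of the checks are one interpretation of the list, not a rule published by any of the projects involved; a real decision would weigh more factors, such as syscall compatibility and node hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a workload's threat model and placement."""
    trusted: bool = True               # internal code on dedicated nodes
    regulated: bool = False            # e.g. FedRAMP, PCI-DSS
    needs_vm_isolation: bool = False   # tenant requires a VM boundary
    constrained_node: bool = False     # edge / IoT hardware


def choose_runtime(w: Workload) -> str:
    """Map a workload to a runtime handler, strictest requirement first."""
    if w.regulated or w.needs_vm_isolation:
        return "kata"
    if not w.trusted:
        return "runsc"
    if w.constrained_node:
        return "crun"
    return "runc"


print(choose_runtime(Workload()))                       # runc
print(choose_runtime(Workload(trusted=False)))          # runsc
print(choose_runtime(Workload(regulated=True)))         # kata
print(choose_runtime(Workload(constrained_node=True)))  # crun
```

Checking the isolation requirements before the resource constraints reflects the idea that the threat model, not convenience, should win when the two conflict.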
Switching between these runtimes does not require rebuilding the image; only the runtimeClassName field in the Pod spec changes. This is the practical payoff of the OCI standardization effort.