Docker & Containerization

Containers vs VMs

Before you write a single Dockerfile, you need a precise mental model of what a container actually is at the kernel level. Engineers who skip this step treat Docker as a black box and hit mysterious failures in production — from privilege-escalation security holes to processes that escape their expected resource limits. This lesson builds the model from first principles.

The Virtual Machine Model

A virtual machine achieves isolation by virtualizing an entire hardware stack. A hypervisor (VMware ESXi, KVM, Hyper-V, AWS Nitro) sits between the physical hardware and one or more guest operating systems. Each guest gets a slice of CPU, RAM, and disk that looks, from the guest's perspective, like dedicated hardware. The guest boots its own kernel, manages its own memory pages, and runs its own init system (systemd, OpenRC, etc.).

This model is strong: a bug in the guest kernel cannot corrupt the host kernel because the two kernels never share memory. The attack surface between tenant VMs is the hypervisor, and hypervisors are small, heavily audited codebases. Multi-tenant cloud providers (AWS, GCP, Azure) rely on this guarantee to run competing customers on the same physical host.

But the model is also heavy. Booting a VM requires loading a full kernel (seconds to minutes), allocating memory for the OS overhead (often 200 MB–1 GB of RAM just for the guest OS before your application starts), and maintaining a complete disk image. Cold-starting 50 VMs in response to a traffic spike is measured in minutes, not seconds.

The Container Model: Two Kernel Primitives

Containers are not a new concept invented by Docker. They are built from Linux kernel features that Docker packaged into a usable developer experience in 2013. Two kernel subsystems do the real work: namespaces and cgroups.

Namespaces: What You Can See

A Linux namespace is a wrapper around a global system resource that makes the process inside the namespace believe it has its own isolated instance of that resource. The kernel maintains separate namespaces for:

  • pid — Process ID namespace. PID 1 inside a container is just a regular process on the host (say, PID 3841), but the container sees it as PID 1. The container cannot see or signal host processes or processes in other containers.
  • net — Network namespace. Each container gets its own loopback interface, its own routing table, its own iptables rules, and its own set of sockets. Two containers can both bind port 8080 without conflict because they live in different network namespaces.
  • mnt — Mount namespace. The container's filesystem view is isolated. The container sees a root filesystem (the image layers) that is different from the host's /. You can mount host directories into this namespace, which is the foundation of Docker volumes.
  • uts — UNIX Time-sharing System namespace. The container can have its own hostname and domain name, independent of the host.
  • ipc — Inter-process communication namespace. Shared memory segments and semaphores are isolated per namespace, preventing cross-container IPC.
  • user — User namespace. Maps user IDs inside the container to different user IDs on the host. The container's root (UID 0) can map to an unprivileged UID (e.g., UID 100000) on the host — the foundation of rootless containers.
  • cgroup (Linux 4.6+) — Hides the host cgroup hierarchy from the container, so it sees only its own resource limits as the top-level limits.
Key idea: Namespaces answer the question "what can this process see?" A containerized process is still a normal Linux process. It does not run in a different kernel — it runs in the same kernel as every other process on the host, but its view of the system is scoped to its namespace.
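
The UID mapping described for the user namespace is simple offset arithmetic. The sketch below assumes a single map entry in the three-field format of /proc/<pid>/uid_map (inside start, outside start, length). The helper name host_uid and the range 100000 with length 65536 are illustrative assumptions, not values taken from any particular host.

```shell
#!/bin/sh
# Translate a UID seen inside a user namespace into the UID the host sees.
# $1 = UID inside the container
# $2 = one map entry: "<inside-start> <outside-start> <length>"
host_uid() (
  uid=$1
  set -- $2                       # split the entry into its three fields
  inside=$1 outside=$2 length=$3
  if [ "$uid" -ge "$inside" ] && [ "$uid" -lt $((inside + length)) ]; then
    echo $((outside + uid - inside))
  else
    echo "unmapped"               # UID falls outside the mapped range
    exit 1
  fi
)

# Hypothetical mapping: container UIDs 0..65535 map to host UIDs 100000..165535.
host_uid 0    "0 100000 65536"    # prints 100000
host_uid 1000 "0 100000 65536"    # prints 101000
```

With a mapping like this, a process that is root inside the container holds only the privileges of host UID 100000 outside it.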

You can inspect the namespace membership of any process directly from the host:

    # List all namespaces the current shell belongs to
    ls -la /proc/$$/ns/
    # Output (abbreviated); each symlink names the namespace type and its inode:
    # lrwxrwxrwx 1 root root 0 Jun 11 08:00 cgroup -> 'cgroup:[4026531835]'
    # lrwxrwxrwx 1 root root 0 Jun 11 08:00 ipc -> 'ipc:[4026531839]'
    # lrwxrwxrwx 1 root root 0 Jun 11 08:00 mnt -> 'mnt:[4026531840]'
    # lrwxrwxrwx 1 root root 0 Jun 11 08:00 net -> 'net:[4026531993]'
    # lrwxrwxrwx 1 root root 0 Jun 11 08:00 pid -> 'pid:[4026531836]'
    # lrwxrwxrwx 1 root root 0 Jun 11 08:00 user -> 'user:[4026531837]'
    # lrwxrwxrwx 1 root root 0 Jun 11 08:00 uts -> 'uts:[4026531838]'

    # Start a container and compare its namespace inodes with the host shell
    # (reading another process's /proc entries and using nsenter require root):
    docker run --rm -d --name demo nginx:alpine
    CPID=$(docker inspect demo --format '{{.State.Pid}}')
    echo "Host shell net ns: $(readlink /proc/$$/ns/net)"
    echo "Container net ns:  $(readlink /proc/$CPID/ns/net)"
    # They will be DIFFERENT inodes, confirming network isolation.

    # You can also enter a container's namespace from the host (powerful for debugging):
    nsenter --target $CPID --net ip addr
    # This runs the host's 'ip addr' inside the container's network namespace,
    # without needing a shell in the container.
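
The inode comparison above can be wrapped in a small helper. This is a minimal sketch: the function names ns_inode and same_namespace are illustrative, and it assumes a Linux host where /proc/<pid>/ns/ is readable for both processes (usually this means running as root).

```shell
#!/bin/sh
# Extract the inode number from a namespace link target such as 'net:[4026531993]'.
ns_inode() (
  link=$1
  inode=${link#*\[}     # drop everything up to and including '['
  inode=${inode%\]}     # drop the trailing ']'
  printf '%s\n' "$inode"
)

# Succeed (exit 0) when two PIDs share the namespace of the given type.
# $1, $2 = PIDs; $3 = namespace type (net, pid, mnt, uts, ipc, user, cgroup)
same_namespace() (
  a=$(readlink "/proc/$1/ns/$3") || exit 2   # exit 2: link not readable
  b=$(readlink "/proc/$2/ns/$3") || exit 2
  [ "$(ns_inode "$a")" = "$(ns_inode "$b")" ]
)

ns_inode 'net:[4026531993]'    # prints 4026531993
```

Called as same_namespace $$ "$CPID" net, the helper would be expected to fail for a container on the default bridge network and succeed for one started with --network host, since that option keeps the container in the host's network namespace.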

cgroups: What You Can Consume

Namespaces control visibility. Control groups (cgroups) control consumption. A cgroup is a kernel mechanism for grouping processes and enforcing limits on the resources they collectively use. Docker translates every --memory, --cpus, and --pids-limit flag into cgroup entries in /sys/fs/cgroup/.
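
To make the translation concrete, the sketch below computes the values that would be expected in the cgroup v2 files memory.max and cpu.max for given flag values. The helper names are illustrative, only single-letter size suffixes are handled, and the CPU conversion assumes Docker's default scheduling period of 100000 microseconds.

```shell
#!/bin/sh
# Convert a Docker-style memory value (256m, 1g, 512k, or plain bytes) to bytes.
mem_to_bytes() (
  value=${1%[bkmgBKMG]}            # numeric part
  unit=${1#"$value"}               # optional one-letter suffix
  case "$unit" in
    ''|b|B) factor=1 ;;
    k|K)    factor=1024 ;;
    m|M)    factor=$((1024 * 1024)) ;;
    g|G)    factor=$((1024 * 1024 * 1024)) ;;
  esac
  echo $((value * factor))
)

# Convert --cpus=N into the "<quota> <period>" pair used by cpu.max,
# assuming the default period of 100000 microseconds.
cpus_to_cpu_max() {
  awk -v cpus="$1" 'BEGIN { printf "%.0f 100000\n", cpus * 100000 }'
}

mem_to_bytes 256m       # prints 268435456
cpus_to_cpu_max 0.5     # prints 50000 100000
```

The --pids-limit flag needs no conversion: its value is written to pids.max as given.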

The two cgroup versions differ significantly:

  • cgroups v1 (legacy) — Each resource controller (cpu, memory, blkio, pids, …) has its own independent hierarchy under /sys/fs/cgroup/<controller>/. A process can be in different positions in different hierarchies simultaneously — complex and error-prone.
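  • cgroups v2 (unified; the default on current distributions such as Ubuntu 22.04 and RHEL 9, with kernel 5.8 or later commonly recommended) — A single unified hierarchy. All controllers are under /sys/fs/cgroup/. Simpler delegation model, better support for rootless containers, pressure stall information (PSI) for CPU, memory, and I/O, and improved OOM killer behavior. Whether a host uses v1 or v2 is decided by the distribution and its init system, not by the kernel version alone. Always prefer v2 on new systems.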
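
The difference between the two layouts is visible in /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controllers:path. The sketch below classifies such lines. The function name and the sample container ID abc123 are made up for illustration.

```shell
#!/bin/sh
# Classify lines in the /proc/<pid>/cgroup format.
# cgroups v2 uses a single entry with hierarchy ID 0 and an empty controller field;
# cgroups v1 has one entry per mounted hierarchy, each naming its controllers.
describe_cgroup_lines() {
  while IFS=: read -r id controllers path; do
    if [ "$id" = "0" ] && [ -z "$controllers" ]; then
      echo "v2 $path"
    else
      echo "v1 $controllers $path"
    fi
  done
}

# Sample v2 entry (hypothetical container ID):
printf '0::/system.slice/docker-abc123.scope\n' | describe_cgroup_lines
# prints: v2 /system.slice/docker-abc123.scope

# Sample v1 entries (hypothetical container ID):
printf '5:memory:/docker/abc123\n3:cpu,cpuacct:/docker/abc123\n' | describe_cgroup_lines
# prints: v1 memory /docker/abc123
#         v1 cpu,cpuacct /docker/abc123
```

On a real host, describe_cgroup_lines < /proc/self/cgroup applies the same classification to the current shell.
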
Verify the cgroup version, apply limits, and read back what the kernel enforces:

    # Verify which cgroup version the host runs:
    stat -fc %T /sys/fs/cgroup/
    # cgroup2fs = v2 (unified)    tmpfs = v1 (legacy)

    # Run a container with explicit resource limits:
    #   --memory-swap equal to --memory disables swap (no extra swap)
    #   --cpus="0.5" allows 50% of one CPU core
    #   --pids-limit=100 allows at most 100 processes/threads inside the container
    docker run --rm -d \
      --name limited-app \
      --memory="256m" \
      --memory-swap="256m" \
      --cpus="0.5" \
      --pids-limit=100 \
      nginx:alpine

    # Find the cgroup the container was placed in (cgroups v2):
    CPID=$(docker inspect limited-app --format '{{.State.Pid}}')
    cat /proc/$CPID/cgroup
    # 0::/system.slice/docker-<container-id>.scope

    # Inspect the actual memory limit imposed by the kernel
    # (this path applies when Docker uses the systemd cgroup driver):
    cat /sys/fs/cgroup/system.slice/docker-$(docker inspect limited-app --format '{{.Id}}').scope/memory.max
    # 268435456  (= 256 * 1024 * 1024 bytes, exactly what we requested)

    # What happens when a container exceeds its memory limit:
    # The OOM killer terminates the most memory-hungry process inside the cgroup.
    # On the host (kernel messages read "invoked oom-killer" and "out of memory"):
    dmesg | grep -iE "oom-killer|out of memory"
    # In Kubernetes this appears as a pod in OOMKilled status.

Production pitfall: Setting --memory without --memory-swap allows the container to use additional swap equal to the memory limit (memory plus swap = 2× memory). On a host with heavy swap usage this causes severe latency spikes. Always set both flags, or set --memory-swap equal to --memory to disable swap entirely for latency-sensitive workloads.
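
The swap rule is easy to get wrong because --memory-swap is a total (memory plus swap), not a swap amount. The sketch below encodes the behavior described in Docker's resource-constraint documentation. The function name is illustrative, and values are given in MB for readability.

```shell
#!/bin/sh
# Swap (in MB) a container may use.
# $1 = --memory in MB
# $2 = --memory-swap in MB (omit when the flag is not set)
swap_allowance_mb() (
  memory=$1
  memory_swap=${2:-unset}
  case "$memory_swap" in
    unset|0) echo "$memory" ;;                  # unset (or 0, which is ignored): swap equals memory
    -1)      echo "unlimited" ;;                # -1: swap limited only by what the host has
    *)       echo $((memory_swap - memory)) ;;  # explicit total: swap is the difference
  esac
)

swap_allowance_mb 256         # prints 256 (total of 512 MB memory + swap)
swap_allowance_mb 256 256     # prints 0   (swap disabled)
swap_allowance_mb 256 1024    # prints 768
```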

The Architectural Difference — Visualized

The diagram below shows exactly what is shared and what is isolated in each model. This is the diagram to internalize:

[Figure: VM stack vs. container stack, side by side. Virtual machines: physical hardware (CPU / RAM / disk), then a hypervisor (KVM / ESXi / Nitro), then VM 1 and VM 2, each with its own guest OS and kernel, libs/runtime, and app (~500 MB+, seconds to boot). Containers: physical hardware, then the shared host OS kernel (namespaces + cgroups isolate each container), then the container runtime (containerd + runc), then Container 1 and Container 2, each with only libs/runtime and app (~5-100 MB, milliseconds to start). There is no guest kernel; kernel calls go directly to the host.]
Left: each VM carries a full guest OS and kernel, isolated by the hypervisor. Right: containers share the host kernel — namespaces provide isolation, cgroups enforce resource limits. The fundamental trade-off is security boundary depth vs. startup speed and density.

Why Containers Won (Operationally)

The container model delivers three practical advantages that drove adoption at scale:

  • Startup time: A container process starts in 50–300 ms because no kernel needs to boot. A VM needs 5–60 seconds even with an optimized image. At Kubernetes scale — where pods are created and destroyed continuously in response to load — this difference determines whether autoscaling can keep up with traffic spikes.
  • Density: A 4 vCPU / 16 GB RAM VM running bare Ubuntu loses roughly 1–2 GB to the OS before your application sees a byte. Containers add almost no OS overhead (the host kernel is already running). On the same hardware you might run 5 VMs or 150 containers, which is the economic driver behind container orchestration.
  • Image portability: A container image bundles exactly the libraries and binaries the application needs. The image that passes your CI pipeline is the exact same binary artifact that runs in production. With VMs, the "works on my machine" problem was partly replaced by "works in staging" — configuration drift between image builds and VM base images was a constant operational burden.
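
The density argument is plain division, and the result is very sensitive to the per-instance overhead. The sketch below considers memory only and uses assumed figures: a 16 GB host, a 100 MB application, 1.5 GB of guest OS overhead per VM, and 5 MB of overhead per container. None of these numbers is measured, and a real capacity plan would also account for CPU and I/O.

```shell
#!/bin/sh
# How many instances fit in host RAM when each needs the application's memory
# plus a fixed per-instance overhead (all values in MB, memory only).
instances_that_fit() (
  host_mb=$1 app_mb=$2 overhead_mb=$3
  echo $((host_mb / (app_mb + overhead_mb)))
)

# Assumed figures, for illustration only:
instances_that_fit 16384 100 1536   # VMs:        prints 10
instances_that_fit 16384 100 5      # containers: prints 156
```

Larger guest images or larger applications shift the ratio, which is why concrete counts vary from one environment to another.
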
Pro practice: In production, containers and VMs are complementary, not competitors. Google, AWS, and Azure run containers inside VMs. The VM provides the hypervisor-level security boundary (multi-tenant isolation) while containers provide density and fast scheduling within that boundary. On managed cloud platforms, Kubernetes nodes are typically VMs; the pods running on them are containers. Understand both layers.

The Security Trade-off — What Containers Are NOT

The shared kernel is the source of containers' speed and also their primary security limitation. If a process inside a container exploits a kernel vulnerability (a container escape), it can gain access to the host and all other containers on it. VMs do not expose this attack surface: a kernel vulnerability in a guest compromises only that guest, and crossing into the host requires a separate flaw in the hypervisor.

Google's gVisor and Amazon's Firecracker address this by adding an isolation layer: gVisor interposes on syscalls with a userspace kernel, while Firecracker runs workloads inside lightweight VMs (microVMs) that boot in roughly 125 ms. Kubernetes itself supports the RuntimeClass API to schedule specific pods onto more isolated runtimes.

For most workloads on a private cluster, Linux namespaces + cgroups + seccomp profiles + AppArmor/SELinux provide adequate isolation. For multi-tenant SaaS (running untrusted customer code) or anything processing sensitive regulated data on shared infrastructure, the defense-in-depth argument for gVisor or Firecracker is strong.

Key idea: A container is a process isolation mechanism, not a security boundary in the same sense as a VM. Knowing this distinction prevents both underestimating container security (namespaces do provide real isolation against casual misconfiguration) and overestimating it (a kernel CVE can still break everything).

Putting It Together: What Happens When You Run a Container

When you execute docker run nginx:alpine, the following sequence happens — most of it in under 300 ms:

  1. The Docker CLI sends an HTTP request to the Docker daemon (dockerd) through its REST API, by default over a local Unix socket.
  2. dockerd delegates over gRPC to containerd (the industry-standard container runtime, now a CNCF project).
  3. containerd invokes runc (the OCI-compliant low-level runtime) to create the container.
  4. runc calls clone(2) with namespace flags (CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC) to create a new process in fresh namespaces.
  5. runc writes cgroup entries under /sys/fs/cgroup/ to enforce CPU and memory limits.
  6. The image layers are mounted as an overlay filesystem (OverlayFS) on the host and presented to the container as its root filesystem.
  7. The container process starts — it sees PID 1, a private network interface, and an isolated filesystem.
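
The namespace flags in step 4 are single bits that are OR-ed into one mask. The sketch below uses the flag values defined in <linux/sched.h> to build that mask and decode it again. The function name decode_ns_flags is illustrative, and the code only does the arithmetic; it does not call clone(2).

```shell
#!/bin/sh
# Namespace flag values from <linux/sched.h>.
CLONE_NEWNS=$((0x00020000))     # mount namespace
CLONE_NEWUTS=$((0x04000000))
CLONE_NEWIPC=$((0x08000000))
CLONE_NEWPID=$((0x20000000))
CLONE_NEWNET=$((0x40000000))

# Combine the flags listed in step 4 into the bitmask passed to clone(2).
mask=$((CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC))
printf '0x%08X\n' "$mask"       # prints 0x6C020000

# Decode a bitmask back into the namespaces it requests.
decode_ns_flags() (
  flags=$1
  names=""
  [ $((flags & CLONE_NEWPID)) -ne 0 ] && names="$names pid"
  [ $((flags & CLONE_NEWNET)) -ne 0 ] && names="$names net"
  [ $((flags & CLONE_NEWNS))  -ne 0 ] && names="$names mnt"
  [ $((flags & CLONE_NEWUTS)) -ne 0 ] && names="$names uts"
  [ $((flags & CLONE_NEWIPC)) -ne 0 ] && names="$names ipc"
  echo "${names# }"             # strip the leading space
)

decode_ns_flags "$mask"         # prints: pid net mnt uts ipc
```

To experiment with the same set of namespaces from a shell, the util-linux command unshare accepts matching options (--pid, --net, --mount, --uts, --ipc, together with --fork) and typically needs root.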

This is the complete mental model: namespaces for visibility isolation, cgroups for resource enforcement, and a layered filesystem for image portability. Everything else in Docker — Dockerfiles, volumes, networks, Compose — is built on top of these three primitives.