Advanced Docker & Container Security

Resource Limits & cgroups

A container is, at its core, a process — or a tree of processes — running on the host kernel. Without explicit limits, a single runaway container can consume every CPU cycle and every byte of RAM on the host, bringing down every other container and the node itself. Linux control groups (cgroups) are the kernel mechanism that prevents this: they enforce hard boundaries on CPU, memory, I/O, and PIDs per process group. Docker and Kubernetes both sit on top of cgroups to implement their resource-limit models.

How cgroups Work Under the Hood

When Docker starts a container, the daemon creates a cgroup for it under /sys/fs/cgroup/ and places every container process inside it. The kernel then enforces the limits set on that cgroup, regardless of how much the container's processes try to consume. With cgroups v2 (available since Linux 4.5 and the default on most current distributions, including Fedora 31+, Ubuntu 21.10+, Debian 11+, and RHEL 9), the hierarchy is unified into a single tree and the accounting is more accurate, particularly for memory.

cgroups v1 vs v2: Docker Engine 20.10+ supports cgroups v2, and Kubernetes support has been stable since 1.25. On v2, memory.max replaces v1's memory.limit_in_bytes, and CPU is controlled through a single cpu.max file instead of the cpu.cfs_quota_us and cpu.cfs_period_us pair. The Docker CLI flags and the Kubernetes resource fields stay the same either way; the runtime translates them for you.
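To find out which version a host is running, one common check is the filesystem type mounted at /sys/fs/cgroup. The sketch below assumes GNU or BusyBox stat; the helper name cgroup_version_from_fstype is made up for illustration.

```shell
# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup version.
# cgroup2fs means the unified v2 hierarchy; tmpfs means v1 (one mount per controller).
cgroup_version_from_fstype() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# stat -f -c %T prints the filesystem type; fall back to "unknown" when the
# path or the option is unavailable (for example on macOS).
fstype=$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null || echo unknown)
echo "cgroup version: $(cgroup_version_from_fstype "$fstype")"
```

Hosts running v1 with a partial v2 mount (hybrid mode) report tmpfs here and are treated as v1.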

Memory Limits and OOM Behavior

Memory is the most dangerous unbounded resource. When a container exceeds its memory limit, the kernel OOM (Out-Of-Memory) killer terminates a process inside the cgroup. In Docker, this shows up as the container exiting with status 137 (128 + 9, where 9 is SIGKILL). Without a limit, a host that runs out of memory lets the OOM killer choose any process on the machine, including the Docker daemon or a process in a completely unrelated container.

# Run with a 512 MB hard memory limit and a 768 MB swap limit
# (--memory-swap includes RAM, so 768m - 512m = 256 MB of actual swap)
docker run -d \
  --memory=512m \
  --memory-swap=768m \
  --name api-server \
  myapp:latest

# Inspect the actual cgroup setting on the host
# (cgroups v2 with the systemd cgroup driver; use the full container ID)
cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.max

# Watch live memory usage across all containers
docker stats

# Detect whether a container was OOM-killed
docker inspect api-server --format '{{.State.OOMKilled}}'
# Prints true if the kernel OOM killer ended the container, false otherwise

# Check the exit code (137 = 128 + 9, i.e. SIGKILL; an OOM kill is one possible cause)
docker inspect api-server --format '{{.State.ExitCode}}'
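The 137 above follows the usual shell convention that an exit code above 128 means the process was ended by signal number (code - 128). A minimal sketch of that decoding follows; describe_exit_code is a hypothetical helper, not part of Docker.

```shell
# Decode a container exit code. Codes above 128 conventionally mean
# "terminated by signal (code - 128)".
describe_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    echo "killed by signal $sig"
  elif [ "$code" -eq 0 ]; then
    echo "exited normally"
  else
    echo "exited with error $code"
  fi
}

describe_exit_code 137   # prints: killed by signal 9
describe_exit_code 143   # prints: killed by signal 15
```

Signal 9 is SIGKILL and 15 is SIGTERM. A 137 on its own does not prove an OOM kill, since docker kill produces the same code, so pair it with the OOMKilled flag.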

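Do not set --memory-swap equal to --memory without meaning to. Equal values disable swap for the container, which sounds safe but means the OOM killer fires as soon as usage reaches the memory limit, with no soft landing. In production, either allow a modest amount of swap (a --memory-swap value of 1.5x-2x the memory limit) or, for latency-sensitive workloads, disable swap at the host level and rely on fast OOM kills as your circuit breaker. If --memory-swap is left unset, Docker lets the container use as much swap as its --memory value, provided the host has swap enabled.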
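Because --memory-swap is a combined RAM-plus-swap figure, the usable swap is easy to misread. The following sketch works through the arithmetic for the cases Docker documents; usable_swap_mb is an illustrative helper and takes plain megabyte integers rather than Docker's 512m syntax.

```shell
# usable_swap_mb MEMORY_MB [MEMORY_SWAP_MB]
# Prints how much swap (in MB) a container may use for a given pair of flags.
usable_swap_mb() {
  mem=$1
  total=${2:-0}               # unset and 0 both mean "not set"
  if [ "$total" -eq -1 ]; then
    echo "unlimited"          # bounded only by the host's swap
  elif [ "$total" -eq 0 ]; then
    echo "$mem"               # default: swap equal to the memory limit
  elif [ "$total" -lt "$mem" ]; then
    echo "invalid"            # Docker rejects a total below the memory limit
    return 1
  else
    echo $((total - mem))     # --memory-swap counts RAM and swap together
  fi
}

usable_swap_mb 512 768   # prints: 256
usable_swap_mb 512 512   # prints: 0
usable_swap_mb 512       # prints: 512
```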

You can also influence which process the kernel picks when the host itself runs short of memory, using --oom-score-adj. The flag takes an adjustment between -1000 (never kill) and +1000 (kill first) that the kernel applies to each process's OOM score. Setting a high value on a low-priority container makes it the preferred victim, protecting host-level processes.

CPU Limits: Shares, Quotas, and Periods

CPU limiting works differently from memory because CPU time is compressible — a container that requests more than its share is throttled, not killed. Docker exposes two independent knobs:

--cpus (or --cpu-quota + --cpu-period) — sets a hard ceiling. --cpus=1.5 means the container may use at most 1.5 CPU-seconds per second, regardless of available capacity. Implemented as a CFS (Completely Fair Scheduler) quota.
--cpu-shares — sets a relative weight (default 1024). Only takes effect when CPUs are contended. A container with 2048 shares gets twice the CPU time of a 1024-share container when both are busy. When the host is idle, shares are irrelevant — a low-share container can burst freely.
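The relationship between --cpus, the CFS quota, and the cpu.max file on cgroups v2 is plain arithmetic. Here is a rough sketch, assuming the default 100 ms period; both function names are invented for this example.

```shell
# cpus_to_quota CPUS [PERIOD_US]
# CFS quota in microseconds that corresponds to a --cpus value.
cpus_to_quota() {
  awk -v cpus="$1" -v period="${2:-100000}" \
    'BEGIN { printf "%.0f\n", cpus * period }'
}

# cpu_max_to_cpus "QUOTA PERIOD"
# Effective CPU ceiling encoded in the contents of a cgroups v2 cpu.max file.
cpu_max_to_cpus() {
  set -- $1                      # split "QUOTA PERIOD" into $1 and $2
  if [ "$1" = "max" ]; then
    echo "unlimited"             # "max" means no quota is set
    return
  fi
  awk -v q="$1" -v p="$2" 'BEGIN { printf "%.2f\n", q / p }'
}

cpus_to_quota 1.5                 # prints: 150000
cpu_max_to_cpus "200000 100000"   # prints: 2.00
cpu_max_to_cpus "max 100000"      # prints: unlimited
```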

# Hard limit: container may use at most 2 CPUs (out of however many the host has)
docker run -d \
  --cpus=2 \
  --name worker \
  myapp:latest

# Equivalent using raw CFS knobs (period=100ms, quota=200ms = 2 CPUs)
docker run -d \
  --cpu-period=100000 \
  --cpu-quota=200000 \
  --name worker-cfs \
  myapp:latest

# Soft weight: this container gets 2x CPU priority over default containers under contention
docker run -d \
  --cpu-shares=2048 \
  --name priority-service \
  myapp:latest

# Pin to specific CPUs (useful for NUMA-aware workloads on large hosts)
docker run -d \
  --cpuset-cpus="0,1" \
  --name pinned-service \
  myapp:latest

CPU shares (soft) vs CPU quota (hard): shares only matter when CPUs are fully contended; quotas are always enforced.
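Under full contention, each container's slice is its shares divided by the sum of all competing shares. A small sketch of that calculation, using a hypothetical cpu_share_percent helper and integer math that rounds down:

```shell
# cpu_share_percent MY_SHARES OTHER_SHARES...
# Percentage of CPU time a container receives when every listed container
# is busy at once. The first argument is counted in the total as well.
cpu_share_percent() {
  mine=$1
  total=0
  for s in "$@"; do
    total=$((total + s))
  done
  echo $((mine * 100 / total))
}

cpu_share_percent 2048 1024        # prints: 66
cpu_share_percent 1024 2048        # prints: 33
cpu_share_percent 1024 1024 1024   # prints: 33
```

These figures apply only while all the containers are runnable; on an idle host any of them can use the spare capacity.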

ulimits: File Descriptors and Process Counts

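Beyond CPU and memory, two other resources cause subtle production failures: open file descriptors and process/thread counts. Both can be capped with ulimit-style settings that containers inherit from the daemon defaults and that you can override per container. For process counts, Docker also offers --pids-limit, which uses the pids cgroup controller and applies to the container as a whole.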

A high-concurrency service (database, web server, message broker) can exhaust file descriptor limits under load, causing cryptic "too many open files" errors long before it runs out of CPU or memory. Likewise, a fork bomb or a thread-leaking JVM can consume all PIDs on the host, rendering the node unable to start any new process — including recovery scripts.

# Raise the open file descriptor soft and hard limit to 65535
docker run -d \
  --ulimit nofile=65535:65535 \
  --name postgres \
  postgres:16

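# Cap the number of processes and threads (prevents fork bombs).
# --pids-limit is enforced per container through the pids cgroup;
# --ulimit nproc is counted per user ID, so treat it as a secondary guard.
docker run -d \
  --pids-limit=512 \
  --ulimit nproc=512:512 \
  --name untrusted-job \
  myapp:latest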

# Set daemon-wide defaults in /etc/docker/daemon.json
# (applies to all containers unless overridden at run time)
# {
#   "default-ulimits": {
#     "nofile": { "Name": "nofile", "Soft": 65535, "Hard": 65535 },
#     "nproc":  { "Name": "nproc",  "Soft": 1024,  "Hard": 1024  }
#   }
# }

# Inspect current ulimits of a running container
docker inspect <container> --format '{{json .HostConfig.Ulimits}}'

Kubernetes Resource Requests and Limits

In Kubernetes, CPU and memory are configured per container inside a Pod spec. There are two distinct concepts: requests (what the scheduler reserves; a node must have this much unallocated capacity to accept the pod) and limits (the cgroup ceiling; the container cannot exceed this). If you set a limit but omit the request, Kubernetes copies the limit into the request. The common mistake is the reverse: setting requests far below limits, or omitting both, which lets the scheduler place more pods on a node than it can actually sustain.

# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: myapp:latest
          resources:
            requests:
              cpu: "250m"      # 0.25 CPU reserved by the scheduler
              memory: "256Mi"  # 256 MiB reserved on the node
            limits:
              cpu: "1000m"     # 1.0 CPU hard ceiling (throttled, not killed)
              memory: "512Mi"  # 512 MiB hard ceiling (OOM-killed if exceeded)
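The quantity strings in the manifest use Kubernetes suffixes: m for thousandths of a CPU and Ki, Mi, Gi for powers of 1024 bytes. The sketch below converts them so they can be compared with cgroup values. Both helpers are illustrative and cover only the suffixes shown, not the full quantity grammar.

```shell
# cpu_to_millicores QUANTITY: "250m" -> 250, "1" -> 1000, "0.5" -> 500
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%.0f\n", c * 1000 }' ;;
  esac
}

# mem_to_bytes QUANTITY: handles the binary suffixes Ki, Mi and Gi;
# anything else is assumed to be a plain byte count already.
mem_to_bytes() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$1" ;;
  esac
}

cpu_to_millicores 250m   # prints: 250
cpu_to_millicores 1      # prints: 1000
mem_to_bytes 256Mi       # prints: 268435456
mem_to_bytes 512Mi       # prints: 536870912
```

On a cgroups v2 node, the 512Mi limit should appear as 536870912 in the container's memory.max file.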

Production sizing rule: Set memory request equal to memory limit (or very close to it). Memory is incompressible — once allocated, the kernel cannot reclaim it without killing the process. Having requests much lower than limits encourages over-scheduling, causing a cascade of OOM kills under load. For CPU, a 2x-4x ratio between request and limit is reasonable for bursty workloads, but watch your throttling metrics in Prometheus (container_cpu_cfs_throttled_seconds_total) — heavy throttling is as damaging as OOM kills for latency-sensitive services.

Verifying and Monitoring Limits in Production

Setting limits is only half the work. You must also verify they are being enforced and alert when containers approach them. Key signals to watch:

Memory usage percentage — alert at 80% of the limit. At 100% you get killed with no warning.
OOM kill counter — container_oom_events_total in Prometheus; any value above zero is a production incident.
CPU throttle ratio — container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total; above 25% suggests your CPU limit is too low.
File descriptor usage — compare /proc/<pid>/fd count against the ulimit for long-running services.
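Each signal in the list above reduces to a ratio checked against a threshold. The following is a minimal sketch of that check with made-up helper names and sample numbers. In practice the inputs would come from files such as memory.current and memory.max, or from the nr_throttled and nr_periods counters in cpu.stat on cgroups v2.

```shell
# percent PART WHOLE: integer percentage, rounded down; 0 when WHOLE is 0.
percent() {
  if [ "$2" -eq 0 ]; then
    echo 0
  else
    echo $(( $1 * 100 / $2 ))
  fi
}

# check_threshold VALUE LIMIT THRESHOLD_PERCENT LABEL
check_threshold() {
  pct=$(percent "$1" "$2")
  if [ "$pct" -ge "$3" ]; then
    echo "ALERT $4 at ${pct}%"
  else
    echo "ok $4 at ${pct}%"
  fi
}

# Sample values: 450 MiB used of a 512 MiB limit, 120 of 1000 periods throttled.
check_threshold 450 512 80 memory          # prints: ALERT memory at 87%
check_threshold 120 1000 25 cpu-throttle   # prints: ok cpu-throttle at 12%

# Live example: open file descriptors of this shell against its soft limit.
fds=$(ls /proc/$$/fd 2>/dev/null | wc -l)
limit=$(ulimit -n)
case "$limit" in
  ''|*[!0-9]*) echo "fd limit is $limit; nothing to compare" ;;
  *)           check_threshold "$fds" "$limit" 80 "file descriptors" ;;
esac
```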

Never run containers without resource limits in production. This is one of the most common causes of cascading node failures in shared Kubernetes clusters. Enforce it at the admission level with a LimitRange object in every namespace: it injects default requests and limits into any container that omits them, and its min and max fields reject pods whose values fall outside the allowed range, so a misconfigured deployment does not slip through unbounded.
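As an illustration, a LimitRange for one namespace might look like the manifest emitted below. The object name, the team-a namespace, and every value are placeholders chosen to mirror the Deployment example, not recommendations. The limitrange_manifest function is only a convenience wrapper for this sketch.

```shell
# limitrange_manifest NAMESPACE
# Prints a LimitRange manifest with per-container defaults and an upper bound.
limitrange_manifest() {
  cat <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: $1
spec:
  limits:
    - type: Container
      defaultRequest:      # injected when a container sets no request
        cpu: 250m
        memory: 256Mi
      default:             # injected when a container sets no limit
        cpu: "1"
        memory: 512Mi
      max:                 # containers asking for more are rejected
        cpu: "2"
        memory: 1Gi
EOF
}

limitrange_manifest team-a
# To apply it (requires cluster access):
#   limitrange_manifest team-a | kubectl apply -f -
```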