System-Level Bottlenecks
Load tests surface symptoms — high latency, dropped requests, queue saturation. Diagnosing the root cause means moving one layer deeper: the operating system and hardware that everything runs on. The USE method (Utilization, Saturation, Errors), coined by Brendan Gregg, gives a disciplined checklist for every physical resource: CPU, memory, disk I/O, and network. Apply it in that order whenever a load test reveals unexplained degradation.
The USE Method in Practice
For each resource, three metrics matter:
- Utilization — what fraction of available capacity is busy (0–100 %).
- Saturation — work queued because the resource is already at capacity (e.g., run-queue depth, disk queue depth).
- Errors — hardware or driver-level faults: ECC memory corrections, TCP retransmits, NIC drops.
High utilization alone is not alarming; saturation always is. A CPU at 95 % with zero run-queue lag is merely busy. A CPU at 70 % with a persistent run-queue of 8 on a 4-core machine is saturated and will produce erratic p99 latency spikes.
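The busy-versus-saturated distinction can be written down as a small sketch. The function name `classify_cpu` and the 90 % cutoff for "busy" are illustrative assumptions, not part of the USE method; the only rule taken from the text is that a run queue deeper than the core count means saturation.

```python
def classify_cpu(utilization_pct, run_queue, cores):
    """Classify one CPU sample with USE-style reasoning.

    run_queue counts runnable tasks (running plus waiting), as in the
    `r` column of vmstat. The 90 % "busy" cutoff is an arbitrary,
    illustrative threshold.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    if run_queue > cores:
        # More runnable tasks than cores: some task is always waiting.
        return "saturated"
    if utilization_pct >= 90.0:
        return "busy"
    return "ok"
```

With these assumptions, a 4-core host at 95 % utilization and 4 runnable tasks is classified as busy, while the same host at 70 % with 8 runnable tasks is classified as saturated.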
CPU Bottlenecks
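CPU saturation shows up as a load average that climbs above the core count and a run queue that stays deeper than the number of cores. High %us + %sy in top measures utilization, not saturation, and a non-zero %wa (I/O wait) means the CPU is idle while waiting on storage, which points at disk rather than CPU. On Linux the load average also counts tasks in uninterruptible sleep, so confirm with the run queue before blaming the CPU. The quick-look tools are top for the per-state breakdown and vmstat 1 for the run queue.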
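The load-average comparison can also be scripted. The sketch below parses text in the format of /proc/loadavg and normalises the 1-minute load by core count; the function names are hypothetical, and the sample string in the usage note is made up for illustration.

```python
def parse_loadavg(text):
    """Parse /proc/loadavg-style text, e.g. '6.40 5.10 3.00 9/512 12345'.

    The fourth field is 'runnable/total' scheduling entities.
    """
    fields = text.split()
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable": int(runnable),
        "total": int(total),
    }


def load_per_core(load1, cores):
    """Return the 1-minute load divided by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load1 / cores
```

On a live host the inputs would come from `open("/proc/loadavg").read()` and `os.cpu_count()`. A per-core value persistently above 1 is a hint, not proof, of CPU saturation, because the Linux load average includes tasks blocked in uninterruptible sleep.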
At large scale, scheduler affinity matters. A JVM whose threads run on one NUMA node while its heap lives on another pays a ~100 ns cross-NUMA penalty on every memory access that misses cache, not only at allocation time. Verify with numastat -c <pid> and launch latency-sensitive services under numactl --cpunodebind=0 --membind=0. For containerised workloads, confine the cpuset cgroup (cpuset.cpus and cpuset.mems) to a single NUMA node.
%st (steal) in top is the share of CPU time the hypervisor gave to other guests while this VM had runnable work. Sustained steal above 5 % indicates noisy neighbours on the same physical host; escalate to the cloud provider or migrate to a dedicated host. Steal does not appear in your application metrics; it silently inflates p99.
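Steal can be computed directly from two samples of the aggregate `cpu` line in /proc/stat. The sketch below assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names are hypothetical and the sample lines in the usage note are invented.

```python
def parse_cpu_line(line):
    """Return the first eight counters of a /proc/stat 'cpu' line.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    The guest fields that follow are already included in user/nice.
    """
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in parts[1:9]]


def steal_pct(before, after):
    """Percentage of CPU time stolen between two samples."""
    deltas = [a - b for b, a in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total
```

For example, samples of `cpu  1000 0 500 8000 100 0 50 50 0 0` and `cpu  1400 0 700 8700 120 0 60 120 0 0` give 70 stolen ticks out of 1400, which is exactly the 5 % threshold mentioned above.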
Memory Bottlenecks
Memory pressure in Linux manifests through two separate mechanisms: the OOM killer (hard limit) and swapping / page reclaim (soft saturation). Reclaim and swapping degrade p99 latency long before the OOM killer fires or the swap area fills.
For production services: disable swap on latency-sensitive nodes (swapoff -a, then remove the swap entries from /etc/fstab). A process hitting swap on a modern SSD still adds 10–100 µs of latency per page fault. Kubernetes nodes that run latency-critical pods should set vm.swappiness=1 in the node's sysctl profile; memory.swappiness=0 is the per-cgroup equivalent on the cgroup v1 memory controller and is not a sysctl.
Memory fragmentation is a subtler source of latency spikes. When the kernel cannot find contiguous memory for a transparent huge page (THP), the resulting compaction work can stall memory-hungry services (Redis, Java with large heaps) for several milliseconds. Monitor thp_collapse_alloc_failed in /proc/vmstat and set /sys/kernel/mm/transparent_hugepage/enabled to madvise so only explicitly opted-in allocations use THP.
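Both swap activity and THP collapse failures are exposed as monotonically increasing counters in /proc/vmstat, so what matters is the change between two samples. A minimal sketch, assuming the counters `pswpin`, `pswpout`, and `thp_collapse_alloc_failed` are present; the function names and the `WATCHED` tuple are hypothetical.

```python
WATCHED = ("pswpin", "pswpout", "thp_collapse_alloc_failed")


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def counter_deltas(before, after, names=WATCHED):
    """Return how much each watched counter grew between two samples.

    Counters missing from a sample are treated as zero.
    """
    b = parse_vmstat(before)
    a = parse_vmstat(after)
    return {name: a.get(name, 0) - b.get(name, 0) for name in names}
```

On a live node the two text samples would be reads of /proc/vmstat taken a fixed interval apart; any non-zero `pswpin` or `pswpout` delta during a load test means pages are moving to or from swap.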
Disk I/O Bottlenecks
Disk I/O saturation is lethal to databases and write-heavy microservices. The key metric is avgqu-sz from iostat (reported as aqu-sz by newer sysstat releases). A queue depth that stays above 1 while await climbs indicates the device is backlogged; NVMe devices serve many requests in parallel, so read queue depth together with await rather than on its own. At big-tech scale, even short I/O spikes are noticeable because they add page cache pressure, which evicts hot data and compounds the problem.
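The queue-depth figure that iostat reports can be derived from two samples of a device's line in /proc/diskstats: the growth of the weighted I/O time counter divided by the sampling interval. The sketch below assumes the standard field layout (three identifying columns followed by the statistics); the function names are hypothetical.

```python
def parse_diskstats_line(line):
    """Extract the two time counters needed for utilization and queue depth.

    After major, minor, and device name, statistics field 10 is
    'milliseconds spent doing I/O' and field 11 is the weighted version.
    """
    parts = line.split()
    return {
        "device": parts[2],
        "io_ticks_ms": int(parts[12]),
        "weighted_io_ms": int(parts[13]),
    }


def disk_use(before, after, interval_ms):
    """Return (utilization percent, average queue depth) for one interval."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    b = parse_diskstats_line(before)
    a = parse_diskstats_line(after)
    util_pct = 100.0 * (a["io_ticks_ms"] - b["io_ticks_ms"]) / interval_ms
    queue_depth = (a["weighted_io_ms"] - b["weighted_io_ms"]) / interval_ms
    return util_pct, queue_depth
```

Utilization answers "how often was the device busy", while the queue depth answers "how much work was outstanding on average", which is the saturation signal.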
Tuning levers in production: set the I/O scheduler to none (pass-through) for NVMe devices, because the in-kernel mq-deadline and bfq schedulers add reordering overhead that NVMe hardware queues make unnecessary. Verify with cat /sys/block/nvme0n1/queue/scheduler. For databases, direct I/O (O_DIRECT) bypasses the page cache and eliminates double-buffering. PostgreSQL is an exception: it uses buffered I/O by default, and effective_io_concurrency and wal_buffers tune prefetching and WAL buffering rather than enabling direct I/O.
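The scheduler file lists every available scheduler and marks the active one with square brackets, for example `[none] mq-deadline kyber bfq`. A minimal sketch of a check that could run in a node audit script; the function name `active_scheduler` is hypothetical.

```python
import re


def active_scheduler(text):
    """Return the scheduler marked active in a queue/scheduler file.

    The active entry is wrapped in square brackets. If no brackets are
    present and the file holds a single word, that word is returned.
    """
    match = re.search(r"\[([^\]]+)\]", text)
    if match:
        return match.group(1)
    words = text.split()
    if len(words) == 1:
        return words[0]
    raise ValueError("could not determine the active scheduler")
```

On a live host the input would be the contents of /sys/block/nvme0n1/queue/scheduler, and an audit would flag any NVMe device for which the result is not `none`.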
resources.limits with an I/O-aware storage class or use dedicated node pools for I/O-intensive workloads.
Network Bottlenecks
Network saturation shows up as TCP retransmits, socket send-queue overflow, and NIC hardware drops — all invisible in application-level metrics unless you instrument them explicitly. On a 10 Gbit/s NIC, a single service sending 9+ Gbit/s of data crowds out every other pod on the same host.
Critical kernel tunables for high-throughput services (applied via sysctl or /etc/sysctl.d/):
TCP BBR (Bottleneck Bandwidth and RTT) is the congestion-control algorithm deployed by Google at scale; it can achieve significantly better throughput than CUBIC in lossy or high-BDP networks. Enabling it requires kernel 4.9+ (any modern production Linux). Pair it with the fq (fair-queue) qdisc: on kernels before 4.13 BBR depended on fq for pacing, and although later kernels fall back to TCP-internal pacing, fq remains the recommended pairing.
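A small drift check helps confirm that the BBR and fq settings are actually in effect on every node. The sketch below covers only the two tunables named above, `net.ipv4.tcp_congestion_control` and `net.core.default_qdisc`; the `DESIRED` mapping and the function names are hypothetical, and the key-to-path mapping is a simplification that holds for these two keys.

```python
DESIRED = {
    "net.core.default_qdisc": "fq",
    "net.ipv4.tcp_congestion_control": "bbr",
}


def sysctl_path(key):
    """Map a dotted sysctl key to its /proc/sys path (simple keys only)."""
    return "/proc/sys/" + key.replace(".", "/")


def render_sysctl_conf(desired):
    """Render a fragment suitable for a file under /etc/sysctl.d/."""
    return "".join(f"{key} = {value}\n" for key, value in sorted(desired.items()))


def drift(desired, current):
    """Return {key: (wanted, found)} for every setting that differs."""
    return {
        key: (wanted, current.get(key))
        for key, wanted in desired.items()
        if current.get(key) != wanted
    }
```

On a live host, `current` would be built by reading each `sysctl_path(key)` and stripping the trailing newline; an empty result from `drift` means the node matches the desired profile.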
Connecting System Metrics to Load-Test Results
When a k6 run shows p99 latency climbing past 500 ms, the diagnostic path is:
- Check vmstat 1: is the run-queue saturated (CPU) or is there swap I/O (memory)?
- Check iostat -xz 1: is await elevated or avgqu-sz > 1?
- Check nstat -az | grep Retrans: are TCP retransmits rising?
- If all three are clean, the bottleneck is application-layer (thread pool exhaustion, GC, lock contention); escalate to flame graphs and application tracing.
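The diagnostic path above can be sketched as a single decision function. Everything numeric here is an assumption for illustration: the 10 ms `await_limit_ms` default and the zero `retrans_baseline_per_s` default are hypothetical thresholds that should be replaced with values from a known-good baseline. Only the queue-depth limit of 1 and the order of the checks come from the list.

```python
def triage(run_queue, cores, swap_pages_per_s, await_ms, queue_depth,
           retrans_per_s, await_limit_ms=10.0, retrans_baseline_per_s=0.0):
    """Return the first layer that looks saturated, in checklist order."""
    # Step 1 (vmstat): CPU run queue, then swap activity.
    if run_queue > cores:
        return "cpu"
    if swap_pages_per_s > 0:
        return "memory"
    # Step 2 (iostat): elevated await or a backlog on the device.
    if await_ms > await_limit_ms or queue_depth > 1:
        return "disk"
    # Step 3 (nstat): retransmits above the normal baseline.
    if retrans_per_s > retrans_baseline_per_s:
        return "network"
    # All three clean: look at the application itself.
    return "application"
```

The function returns only the first match, which mirrors the order of the checklist; in practice several resources can be saturated at once, so a fuller version might return every layer that trips its threshold.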
node_exporter metrics and build a USE dashboard in Grafana with panels for node_cpu_seconds_total, node_memory_SwapFree_bytes alongside node_memory_SwapTotal_bytes (the total on its own is static), node_disk_io_time_seconds_total, and node_network_transmit_drop_total. Overlay your load-test timeline as a Grafana annotation so you can visually correlate traffic ramps with resource exhaustion events; this is standard operating procedure in SRE postmortems.