Monitoring System Resources
At big-tech scale, a server that goes silent is a server that is failing. Production engineers spend a large part of their day reading resource metrics (CPU utilisation, memory pressure, disk I/O throughput and latency) and mapping those numbers to the three signals that matter: utilisation, saturation, and errors (the USE Method, coined by Brendan Gregg). Mastering the standard Linux observability toolchain is the prerequisite for everything else in this tutorial: you cannot harden, tune, or automate a system you cannot see.
The USE Method
Before reaching for any tool, internalise the framework. For every resource in the system, ask three questions:
- Utilisation — what percentage of the resource's capacity is being used? (e.g. CPU at 80 %)
- Saturation — is work queueing because the resource is full? (e.g. run-queue length > 0, swap in-use)
- Errors — are requests to the resource failing? (e.g. disk I/O errors, NIC drops)
Apply USE to every subsystem: CPUs, memory, network interfaces, block devices, and even kernel locks. A single saturated resource with zero errors is almost always a capacity problem; errors with low utilisation point to a hardware or driver fault.
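The two rules of thumb above can be written down as a tiny triage helper. This is only a sketch: the function name use_triage and the 50 % cut-off for "low utilisation" are illustrative assumptions, not part of the USE Method itself.

```shell
#!/bin/sh
# use_triage UTIL_PCT SATURATED ERRORS
#   UTIL_PCT   integer utilisation percentage (0-100)
#   SATURATED  amount of queued work, e.g. swap in use (0 = none)
#   ERRORS     error count for the resource (0 = none)
# LOW_UTIL is an illustrative cut-off, not a value from any standard.
LOW_UTIL=50

use_triage() {
  util=$1 saturated=$2 errors=$3
  if [ "$errors" -gt 0 ] && [ "$util" -lt "$LOW_UTIL" ]; then
    echo "errors at low utilisation: suspect hardware or driver"
  elif [ "$errors" -gt 0 ]; then
    echo "errors under load: check both capacity and hardware"
  elif [ "$saturated" -gt 0 ]; then
    echo "saturated without errors: likely a capacity problem"
  else
    echo "no saturation or errors"
  fi
}
```

For example, use_triage 95 3 0 reports a likely capacity problem, while use_triage 10 0 4 points at hardware or a driver.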
CPU: top and htop
top ships on every Linux system and gives a live, sorted process table. The header rows are the most important part:
The fields to focus on: us (user-space work), sy (kernel work), wa (iowait — CPU sitting idle waiting for disk), si (software interrupt — common on network-heavy hosts), and st (steal — CPU cycles taken by the hypervisor; non-zero on a noisy-neighbour VM).
htop adds colour, mouse support, horizontal scrolling, and tree view (F5). Install it with dnf install htop or apt install htop. In production, htop -d 5 (refresh every 0.5 s) is useful during an incident; for scripted checks, use top -b -n 1.
%iowait does NOT mean the disk is the bottleneck — it means CPUs are idle while at least one process blocks on I/O. Cross-check with iostat to see whether the disk is actually saturated.
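For scripted checks, the us/sy/wa/si/st breakdown can also be derived directly from the aggregate cpu line in /proc/stat, which avoids parsing top's header. The sketch below assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the function name cpu_breakdown is hypothetical.

```shell
#!/bin/sh
# cpu_breakdown: read two aggregate "cpu ..." lines from /proc/stat on stdin
# (earlier sample first) and print the share of time spent in each state
# between the two samples. If more than two lines arrive, the last two win.
cpu_breakdown() {
  awk '
    $1 == "cpu" {
      n++
      # Fields 2..9: user nice system idle iowait irq softirq steal.
      # Guest time (fields 10-11) is already included in user/nice.
      for (i = 2; i <= 9; i++) { prev[i] = cur[i]; cur[i] = $i }
    }
    END {
      if (n < 2) { print "need two samples" > "/dev/stderr"; exit 1 }
      for (i = 2; i <= 9; i++) { d[i] = cur[i] - prev[i]; total += d[i] }
      if (total <= 0) { print "no elapsed ticks" > "/dev/stderr"; exit 1 }
      printf "us=%.1f sy=%.1f id=%.1f wa=%.1f si=%.1f st=%.1f\n",
        100 * d[2] / total, 100 * d[4] / total, 100 * d[5] / total,
        100 * d[6] / total, 100 * d[8] / total, 100 * d[9] / total
    }'
}
```

A one-second sample could then be taken with: { grep '^cpu ' /proc/stat; sleep 1; grep '^cpu ' /proc/stat; } | cpu_breakdown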
CPU: vmstat — the snapshot that tells the whole story
vmstat (virtual memory statistics) reports CPU, memory, swap, I/O, and scheduling in one compact line. Run it with an interval to watch trends:
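The run queue (r) is the saturation signal for CPU. vmstat counts tasks that are running as well as those waiting to run, so a value up to the number of logical CPUs is normal. If r consistently exceeds twice the number of logical CPUs, you have CPU saturation: processes are waiting their turn.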
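The run-queue rule can be checked mechanically. The following is a minimal sketch, assuming the default vmstat layout in which r is the first column; the function name runq_check is hypothetical, and the first data row is skipped because it reports averages since boot.

```shell
#!/bin/sh
# runq_check NCPU: read `vmstat <interval> <count>` output on stdin and report
# how many interval samples had r above 2 * NCPU.
# Exit status: 0 if every sample exceeded the limit, 1 otherwise, 2 if no data.
runq_check() {
  awk -v ncpu="$1" '
    $1 ~ /^[0-9]+$/ {              # data rows only; header rows are not numeric
      seen++
      if (seen == 1) next          # first row is the since-boot average
      samples++
      if ($1 > 2 * ncpu) over++
    }
    END {
      if (samples == 0) { print "no samples"; exit 2 }
      printf "%d/%d samples above %d runnable\n", over, samples, 2 * ncpu
      exit ((over == samples) ? 0 : 1)
    }'
}
```

Typical use would be: vmstat 1 5 | runq_check "$(nproc)"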
Memory: free
free -h prints memory and swap usage in human-readable form. Modern Linux kernels aggressively use free RAM as page cache, so the "used" column can look alarmingly high on a healthy server:
Do not judge memory health by the free column alone. A Linux server with 200 MB "free" and 10 GB "available" is perfectly healthy: the kernel is using RAM for disk cache, which it can reclaim quickly when an application needs it. Only alert when available drops below a safe threshold (typically 10–15 % of total RAM) or when swap usage is non-zero.
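The alerting rule can be sketched as a small check. To avoid depending on free's column layout, this version reads /proc/meminfo, where the MemAvailable field carries the same "available" estimate; the function name mem_check and its default threshold of 10 % are assumptions made for illustration.

```shell
#!/bin/sh
# mem_check [MIN_PCT]: read /proc/meminfo-formatted text on stdin and print
# ALERT when MemAvailable is below MIN_PCT percent of MemTotal (default 10)
# or when any swap is in use; otherwise print OK.
mem_check() {
  awk -v min_pct="${1:-10}" '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    $1 == "SwapTotal:"    { st = $2 }
    $1 == "SwapFree:"     { sf = $2 }
    END {
      if (total <= 0) { print "MemTotal not found" > "/dev/stderr"; exit 1 }
      pct  = 100 * avail / total
      swap = st - sf                      # kB of swap currently in use
      state = (pct < min_pct || swap > 0) ? "ALERT" : "OK"
      printf "%s available=%.1f%% swap_used_kB=%d\n", state, pct, swap
    }'
}
```

On a live host this would be run as: mem_check 10 < /proc/meminfo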
Disk I/O: iostat
iostat (part of the sysstat package) is the primary tool for block-device analysis. It maps directly to USE: utilisation (%util), saturation (aqu-sz, the average queue depth), and errors (from dmesg or smartctl):
A %util near 100 % on a spinning disk means the disk is a bottleneck. On modern NVMe SSDs, %util can read 100 % while the device still has headroom, because it services many requests in parallel; rely on aqu-sz and await (reported as r_await and w_await in recent sysstat versions) instead. Latency above 20 ms for a spinning disk or above 1 ms for an NVMe under normal load is a signal to investigate.
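The latency thresholds above can be applied to iostat output automatically. This sketch looks columns up by header name rather than position, since the column order differs between sysstat releases; it assumes the header contains r_await, w_await and aqu-sz, and the function name iostat_flag is hypothetical.

```shell
#!/bin/sh
# iostat_flag [MAX_MS]: read `iostat -dx` output on stdin and print every
# device row whose read or write await exceeds MAX_MS milliseconds (default 20).
iostat_flag() {
  awk -v max_ms="${1:-20}" '
    $1 ~ /^Device/ {               # header row: remember where each column is
      for (i = 1; i <= NF; i++) col[$i] = i
      in_table = 1
      next
    }
    NF == 0 { in_table = 0; next } # a blank line ends the device table
    in_table && ("r_await" in col) && ("w_await" in col) && ("aqu-sz" in col) {
      r = $(col["r_await"]); w = $(col["w_await"]); q = $(col["aqu-sz"])
      worst = (r > w) ? r : w
      if (worst + 0 > max_ms + 0)
        printf "%s await=%.2fms aqu-sz=%.2f\n", $1, worst, q
    }'
}
```

A possible invocation is LC_ALL=C iostat -dx 1 3 | iostat_flag 20 for spinning disks, or with a threshold of 1 for NVMe devices. Note that the first report iostat prints covers the time since boot, so early rows reflect long-term averages.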
Putting It Together: Incident Investigation Pattern
Practical One-Liner Checklist for On-Call
The following sequence runs in under 60 seconds and covers every major resource. Save it as a runbook snippet:
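One possible version of such a snippet is sketched below, built only from the tools covered in this section plus dmesg for kernel errors. The helper names run_step and triage are hypothetical, and the intervals and counts are illustrative choices; a missing tool is reported and skipped rather than aborting the run.

```shell
#!/bin/sh
# run_step LABEL CMD [ARGS...]: print a banner, then run CMD if it is installed.
run_step() {
  label=$1; shift
  printf '== %s ==\n' "$label"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    printf 'skipped: %s not installed\n' "$1"
  fi
}

# triage: one pass over CPU, memory, block devices and kernel messages.
triage() {
  run_step "top processes"          top -b -n 1 | head -n 20
  run_step "cpu and run queue"      vmstat 1 5
  run_step "memory"                 free -h
  run_step "block devices"          iostat -dx 1 3
  run_step "kernel errors/warnings" dmesg --level=err,warn
}
```

Calling triage after sourcing the file prints each section in order. On hosts that restrict dmesg to root, the last step needs elevated privileges.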
perf, bpftrace, and sar come next once the bottleneck is localised.
Persisting Metrics with sar
sar (System Activity Reporter, also from sysstat) reports much the same counters as iostat and vmstat, with historical recording on top. On most distributions, enabling the sysstat collection service writes a sample every 10 minutes to a daily file under /var/log/sa/ (Debian and Ubuntu use /var/log/sysstat/). This is invaluable for post-incident analysis ("what was the I/O rate at 03:47 last Thursday?"):
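A question like that can be answered by pointing sar at the right daily file and time window. The sketch below assumes GNU date and the classic saDD file naming (day of month); the helper names sa_file and sar_window are hypothetical.

```shell
#!/bin/sh
# sa_file DATE [DIR]: path of the daily sysstat data file for DATE.
# Assumes GNU date (for -d) and saDD naming. DIR defaults to /var/log/sa;
# pass /var/log/sysstat on Debian-family systems.
sa_file() {
  day=$(date -d "$1" +%d) || return 1
  printf '%s/sa%s\n' "${2:-/var/log/sa}" "$day"
}

# sar_window DATE START END [SAR_FLAGS...]: replay one time window.
# Useful flags: -d (block devices), -u (CPU), -r (memory), -q (run queue).
sar_window() {
  file=$(sa_file "$1") || return 1
  start=$2 end=$3
  shift 3
  sar "$@" -f "$file" -s "$start" -e "$end"
}
```

For the example in the text, sar_window "last thursday" 03:40:00 03:55:00 -d would show block-device activity around 03:47, provided the data file for that day has not yet been rotated away.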
At companies running thousands of servers, the same metrics are collected into centralised time-series stores (Prometheus + node_exporter, Datadog, CloudWatch), and dashboards replace manual CLI work during normal operations. But the CLI tools remain essential during SSH-only incidents, when bootstrapping new hosts, and when diagnosing issues that pre-date the monitoring stack.