Linux System Administration

Monitoring System Resources

At big-tech scale, a server that goes silent is a server that is failing. Production engineers spend a large part of their day reading resource metrics (CPU utilisation, memory pressure, disk I/O throughput and latency) and mapping those numbers to the three signals that matter: utilisation, saturation, and errors (the USE Method, coined by Brendan Gregg). Mastering the standard Linux observability toolchain is the prerequisite for everything else in this tutorial: you cannot harden, tune, or automate a system you cannot see.

The USE Method

Before reaching for any tool, internalise the framework. For every resource in the system, ask three questions:

Utilisation: what percentage of the resource's capacity is being used? (e.g. CPU at 80 %)
Saturation: is work queueing because the resource is full? (e.g. more runnable tasks than CPUs, active swapping)
Errors: are requests to the resource failing? (e.g. disk I/O errors, NIC drops)

Apply USE to every subsystem: CPUs, memory, network interfaces, block devices, and even kernel locks. A single saturated resource with zero errors is almost always a capacity problem; errors with low utilisation point to a hardware or driver fault.

CPU: top and htop

top ships on every Linux system and gives a live, sorted process table. The header rows are the most important part:

# Run top, then press keys:
# P  — sort by CPU%   M  — sort by MEM%   1  — toggle per-CPU view
# k  — kill process   r  — renice         q  — quit
top

# Non-interactive one-shot snapshot (1 iteration, batch mode):
top -b -n 1 | head -30

# Sample output header:
# %Cpu(s):  3.2 us,  0.4 sy,  0.0 ni, 95.9 id,  0.3 wa,  0.0 hi,  0.2 si,  0.0 st
#            ^^^^                       ^^^^       ^^^^
#          user space                  idle       iowait (waiting for disk)

The fields to focus on: us (user-space work), sy (kernel work), wa (iowait — CPU sitting idle waiting for disk), si (software interrupt — common on network-heavy hosts), and st (steal — CPU cycles taken by the hypervisor; non-zero on a noisy-neighbour VM).
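
For scripted checks it can help to pull a single field out of that header line. The following is a minimal sketch, assuming the procps-ng header format shown above and a locale that prints decimals with a dot; cpu_field is a hypothetical helper name, not part of top.

```shell
# Hypothetical helper: print one field (us, sy, wa, si, st, ...) from a
# top header line read on stdin. Assumes the "%Cpu(s):" layout shown above.
cpu_field() {
  awk -v want="$1" '
    /^%Cpu/ {
      sub(/^[^:]*:/, "")            # drop the "%Cpu(s):" label
      n = split($0, parts, ",")     # one "value name" pair per comma-separated part
      for (i = 1; i <= n; i++) {
        split(parts[i], kv, " ")    # kv[1] = value, kv[2] = field name
        if (kv[2] == want) { print kv[1]; exit }
      }
    }'
}

# Demo on the sample header from this lesson:
echo '%Cpu(s):  3.2 us,  0.4 sy,  0.0 ni, 95.9 id,  0.3 wa,  0.0 hi,  0.2 si,  0.0 st' | cpu_field wa   # prints 0.3
```

On a live host this would typically be fed from batch mode, for example LC_ALL=C top -b -n 1 | cpu_field st. Other top builds may format the header differently, so treat the parsing as something to verify on your own systems.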

htop adds colour, mouse support, horizontal scrolling, and tree view (F5). Install it with dnf install htop (on RHEL-family systems the package comes from EPEL) or apt install htop. In production, htop -d 5 is useful during an incident: the delay is given in tenths of a second, so this refreshes every 0.5 s. For scripted checks, use top -b -n 1.

Production tip: High %iowait does NOT mean the disk is the bottleneck — it means CPUs are idle while at least one process blocks on I/O. Cross-check with iostat to see whether the disk is actually saturated.

CPU: vmstat — the snapshot that tells the whole story

vmstat (virtual memory statistics) reports CPU, memory, swap, I/O, and scheduling in one compact line. Run it with an interval to watch trends:

# vmstat <interval> <count>
# First line is averages since boot; subsequent lines are per-interval deltas
vmstat 2 10

# Example output:
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
#  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
#  2  0      0 512000 128000 2048000   0    0     0    40  850 1200  5  2 91  2  0
# ^  ^                               ^^   ^^    ^^    ^^
# |  blocked                        swapin/out  block in/out (kB/s)
# run-queue

# Key: r (run queue) > nCPUs × 2 means CPU saturation
#      b > 0 means processes blocked in uninterruptible sleep (disk/net)
#      si/so > 0 means active swapping — a major red flag

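The run queue (r) is the saturation signal for CPU. It counts tasks that are running or waiting to run, so any sustained value above the number of logical CPUs means some tasks are waiting their turn. If it consistently exceeds twice the number of logical CPUs, treat the CPU as saturated.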
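
The rule of thumb above can be sketched as a small filter over vmstat output. This is an illustrative sketch, not a standard tool: runq_check is a hypothetical name, and the 2 x CPU limit is simply the threshold used in this lesson.

```shell
# Hypothetical helper: reads `vmstat <interval> <count>` output on stdin and
# compares the average run queue (column r) with 2 x the CPU count given as $1.
runq_check() {
  awk -v ncpu="$1" '
    $1 ~ /^[0-9]+$/ {               # data rows start with a number; header rows do not
      rows++
      if (rows == 1) next           # first data row is the since-boot average; skip it
      sum += $1; n++
    }
    END {
      if (n == 0) { print "no samples"; exit 2 }
      avg = sum / n
      printf "avg_r=%.1f limit=%d %s\n", avg, 2 * ncpu, (avg > 2 * ncpu ? "SATURATED" : "ok")
    }'
}
```

A typical call would be vmstat 2 10 | runq_check "$(nproc)". Averaging over several intervals avoids reacting to a single spike, which matches the "consistently exceeds" wording above.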

Memory: free

free -h prints memory and swap usage in human-readable form. Modern Linux kernels aggressively use otherwise idle RAM as page cache, so the "free" column can look alarmingly low on a healthy server:

free -h
#               total        used        free      shared  buff/cache   available
# Mem:           15Gi        3.2Gi       1.1Gi       512Mi        11Gi        11Gi
#                                                              ^^^^^^^^    ^^^^^^^^^
#                                              (kernel page cache)    (truly free for apps)

# The "available" column is what matters — not "free".
# available = free + reclaimable page cache

# Cumulative memory, swap, and paging counters since boot:
vmstat -s | grep -E 'memory|swap|pages'

Production pitfall: Never alert on the free column alone. A Linux server with 200 MB "free" and 10 GB "available" is perfectly healthy — the kernel is using RAM for disk cache, which it will immediately reclaim when an application needs it. Only alert when available drops below a safe threshold (typically 10–15 % of total RAM) or when swap usage is non-zero.
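
An alert on "available" rather than "free" can be sketched by reading the MemTotal and MemAvailable fields of /proc/meminfo. The helper name mem_available_pct is hypothetical, and the 10 % threshold in the usage note is just the lower bound quoted above, not a universal value.

```shell
# Hypothetical helper: reads /proc/meminfo-style text on stdin and prints
# MemAvailable as a whole-number percentage of MemTotal (rounded down).
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0 && avail != "") printf "%d\n", avail * 100 / total
      else exit 1                   # fields missing: report failure instead of guessing
    }'
}
```

On a live host: pct=$(mem_available_pct < /proc/meminfo), followed by a check such as [ "$pct" -lt 10 ] && echo "WARN: only ${pct}% of RAM available". Reading the input on stdin keeps the function easy to test against saved meminfo snapshots.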

Disk I/O: iostat

iostat (part of the sysstat package) is the primary tool for block-device analysis. It maps directly to USE: utilisation (%util), saturation (aqu-sz, the average queue depth), and errors (from dmesg or smartctl):

# Install sysstat if needed:
dnf install sysstat     # RHEL/CentOS/Rocky
apt install sysstat     # Debian/Ubuntu

# iostat with extended stats, human-readable, 2-second interval, 5 iterations:
iostat -xh 2 5

# Key columns (extended -x):
# r/s    w/s   — reads and writes per second (IOPS)
# rkB/s  wkB/s — read/write throughput in kB/s
# await  — average I/O latency (ms) including queue wait time
# r_await w_await — separate read and write latency
# aqu-sz — average request queue size (saturation signal)
# %util  — percentage of time the device was busy (utilisation)
# Note: column names vary by sysstat version. Older releases call aqu-sz
# "avgqu-sz", and newer ones drop the combined await column in favour of
# r_await and w_await.

# Example: SSD healthy vs. spinning disk saturated
# Device    r/s   w/s  rkB/s  wkB/s  await  aqu-sz  %util
# nvme0n1  120.0  80.0 2048.0 1024.0   0.8    0.1     12.0   <-- healthy SSD
# sda        5.0  40.0   64.0  512.0  80.0   15.0     98.0   <-- spinning disk saturated!

A %util near 100 % on a spinning disk means the disk is a bottleneck. On modern NVMe SSDs, %util can read close to 100 % long before the device is actually saturated, because it only measures the share of time at least one request was in flight and these devices serve many requests in parallel. Rely on aqu-sz and await instead. Latency above 20 ms for a spinning disk or above 1 ms for an NVMe under normal load is a signal to investigate.
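
To flag devices by queue depth rather than %util, the column can be located by name in the header row, since its position differs between sysstat versions. This is a sketch under assumptions: busy_disks is a hypothetical helper, the threshold is whatever you pass in, and the input is expected in the plain (non -h) layout where the device name is the first column.

```shell
# Hypothetical helper: reads `iostat -x` style output on stdin and prints the
# name and queue size of every device whose queue size exceeds the limit in $1.
busy_disks() {
  awk -v limit="$1" '
    NF == 0 { col = 0; next }             # a blank line ends a report section
    $1 ~ /^Device/ {                      # header row: find the queue-size column by name
      for (i = 1; i <= NF; i++)
        if ($i == "aqu-sz" || $i == "avgqu-sz") col = i
      next
    }
    col && NF >= col && $col + 0 > limit + 0 { print $1, $col }
  '
}
```

For example, iostat -dx 2 3 | busy_disks 1 would list devices with more than one request queued on average. The first report from iostat covers the period since boot, so a match there reflects long-term history rather than the current interval.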

Putting It Together: Incident Investigation Pattern

Figure: USE Method triage. Check CPU, memory, and disk I/O in sequence, using the right tool for each layer.

Practical One-Liner Checklist for On-Call

The following sequence runs in under 60 seconds and covers every major resource. Save it as a runbook snippet:

# 1. CPU — utilisation and run queue
vmstat 1 5

# 2. CPU — top offenders (batch, no interact)
top -b -n 1 -o %CPU | head -20

# 3. Memory — available RAM and swap pressure
free -h
vmstat -s | grep -i swap

# 4. Disk I/O — per-device extended stats
iostat -xh 2 3

# 5. Which processes are causing disk I/O?
iotop -bo -n 3        # needs root and the iotop package (dnf install iotop or apt install iotop)

# 6. System-wide file descriptor / socket pressure
cat /proc/sys/fs/file-nr          # allocated / allocated-but-unused / max file handles
ss -s                              # socket summary
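
The three numbers in file-nr are easier to judge as a percentage of the limit. A minimal sketch, assuming the three-field layout (allocated, allocated but unused, maximum); fd_usage_pct is a hypothetical helper name.

```shell
# Hypothetical helper: reads the contents of /proc/sys/fs/file-nr on stdin and
# prints file handles in use (allocated minus unused) as a percentage of the maximum.
fd_usage_pct() {
  awk 'NF == 3 && $3 > 0 { printf "%.1f\n", ($1 - $2) * 100 / $3 }'
}
```

On a live host: fd_usage_pct < /proc/sys/fs/file-nr. On systems where the maximum is set extremely high, the result will simply print as 0.0, which is expected.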

Key insight: The goal of all these tools is the same — answer the USE questions for every resource. Once you identify which resource is saturated (run queue for CPU, available dropping + swap active for memory, or await/aqu-sz for disk), you know where to dig deeper. The tools above are the first tier; deeper tools like perf, bpftrace, and sar come next once the bottleneck is localised.

Persisting Metrics with sar

sar (System Activity Reporter, also from sysstat) is iostat and vmstat combined with historical recording. On most distributions, enabling sysstat records a sample every 10 minutes, stored under /var/log/sa/ on RHEL-family systems and /var/log/sysstat/ on Debian and Ubuntu. This is invaluable for post-incident analysis ("what was the I/O rate at 03:47 last Thursday?"):

# Enable sysstat collection (runs sadc every 10 min):
systemctl enable --now sysstat

# View CPU history for today:
sar -u

# View memory history:
sar -r

# View disk I/O history for all block devices:
sar -d -p      # -p uses human-readable device names

# View historical data from a saved file:
sar -u -f /var/log/sa/sa10     # 10th of the month
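
Because the saved files are named after the day of the month, the path for a given date can be derived rather than typed by hand. A small sketch, assuming GNU date (for the -d option) and a data directory you supply; sa_path is a hypothetical helper name.

```shell
# Hypothetical helper: print the sysstat data file for a given date.
# $1 = data directory (an assumption; it varies by distribution)
# $2 = any date string understood by GNU `date -d`, e.g. 2024-03-10 or yesterday
sa_path() {
  printf '%s/sa%s\n' "$1" "$(date -d "$2" +%d)"
}
```

For example, sar -u -f "$(sa_path /var/log/sa yesterday)" would show CPU history for the previous day on a host that keeps its data in /var/log/sa.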

At companies running thousands of servers, the same metrics are collected into centralised time-series stores (Prometheus + node_exporter, Datadog, CloudWatch), and dashboards replace manual CLI work during normal operations. But the CLI tools remain essential during SSH-only incidents, when bootstrapping new hosts, and when diagnosing issues that pre-date the monitoring stack.