Linux System Administration

Performance Analysis & Troubleshooting

A production server starts responding slowly at 2 AM. Your pager fires. You have minutes to triage. This lesson teaches the systematic methodology that senior SREs at big-tech companies use to identify bottlenecks, interpret performance signals, and resolve them — even on unfamiliar systems.

Load Average: What It Actually Means

Load average is the number engineers most often misread. It appears in uptime, top, and htop output as three values: exponentially weighted moving averages, over 1, 5, and 15 minutes, of the number of tasks that are running on a CPU, runnable and waiting for one, or in uninterruptible sleep (typically disk I/O or kernel locks).
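
To see why the three figures lag behind what is happening right now, here is a minimal sketch of the averaging in shell and awk. It assumes the kernel's 5-second sampling interval and uses floating point where the kernel uses fixed-point arithmetic; the function name ewma_load is made up for this illustration.

```shell
# ewma_load TASKS SECONDS [WINDOW]
# Simplified model: start from an idle system (load 0), then hold TASKS
# tasks runnable for SECONDS seconds, sampling every 5 seconds.
# WINDOW is the averaging window in seconds (60, 300, or 900).
ewma_load() {
    awk -v n="$1" -v secs="$2" -v window="${3:-60}" 'BEGIN {
        decay = exp(-5 / window)          # weight kept from the previous value
        load = 0
        for (t = 5; t <= secs; t += 5)
            load = load * decay + n * (1 - decay)
        printf "%.2f\n", load
    }'
}

ewma_load 4 60        # 1-minute figure after 60 s of 4 busy tasks: 2.53
ewma_load 4 60 900    # 15-minute figure over the same minute: 0.26
```

After a full minute of four constantly runnable tasks, the 1-minute figure in this model has only reached about 2.5, and the 15-minute figure has barely moved. That lag is why the three values are best read together as a trend.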

The critical insight: load average is not CPU utilisation. For CPU-bound work, a machine with 4 CPU cores and a load average of 4.0 is fully utilised: every core has one task to run and nothing is waiting. The same machine at load 8.0 has, on average, 4 tasks waiting for a CPU at any moment. On a 32-core machine, load 8.0 leaves comfortable headroom.

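Rule of thumb: divide load average by the number of CPU cores (nproc or lscpu | grep "^CPU(s)"). A ratio above 1.0 means saturation; above 2.0 means serious contention. Because Linux also counts tasks in uninterruptible sleep, a high ratio alongside mostly idle CPUs points to I/O wait rather than CPU. A ratio near zero with high response times suggests the bottleneck is not local CPU or disk at all; look at the network and downstream dependencies.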
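
# Read load average and core count together
uptime
nproc

# One 5-second sample of per-CPU utilisation (requires sysstat)
mpstat -P ALL 5 1

# Extended device stats plus the avg-cpu %iowait column.
# The first report is the average since boot, so ask for two and read the second.
# High %iowait with low %user/%system suggests a storage bottleneck.
iostat -xz 5 2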
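
The rule of thumb can be sketched as a small shell function. The name classify_load and the wording of the verdicts are illustrative; the 1.0 and 2.0 thresholds are the ones given above.

```shell
# classify_load LOAD CORES
# Prints the load/cores ratio and a verdict based on the rule of thumb.
classify_load() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) { print "cores must be positive"; exit 1 }
        ratio = load / cores
        if (ratio > 2.0)      verdict = "serious contention"
        else if (ratio > 1.0) verdict = "saturated"
        else                  verdict = "ok"
        printf "%.2f %s\n", ratio, verdict
    }'
}

classify_load 8 4     # 2.00 saturated
classify_load 8 32    # 0.25 ok
```

On a live Linux host, something like classify_load "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" feeds it the current 1-minute figure.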

The Performance Triage Methodology

Brendan Gregg, a performance engineer formerly at Netflix, formalised the USE Method: for every resource (CPU, memory, disks, network, filesystems), check Utilisation, Saturation, and Errors. Applied top-down, it replaces guesswork with a checklist:

  1. CPU: top / mpstat. Check %user, %sys, %iowait, and %steal (on VMs).
  2. Memory: free -h / vmstat 1. Watch the si/so swap columns; sustained swapping means memory saturation.
  3. Storage I/O: iostat -xz 5. %util near 100% together with high await (ms) indicates disk saturation.
  4. Network: ss -s / sar -n DEV 1. Check connection counts and interface throughput; sar -n EDEV 1 and sar -n ETCP 1 add interface errors and TCP retransmits.
  5. Processes: ps aux --sort=-%cpu, pidstat 1. Identify the specific offending process.
Figure: Linux performance triage flow. Start from load average, branch to CPU, memory, or I/O, then drill down to the offending process.
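
The CPU percentages in step 1 are derived from tick counters in /proc/stat. As a rough sketch of what tools such as top and mpstat compute, the hypothetical helper cpu_breakdown below takes two samples of the aggregate cpu line and prints the share of elapsed ticks spent in each state. It assumes the field order documented in proc(5): user, nice, system, idle, iowait, irq, softirq, steal.

```shell
# cpu_breakdown FIRST_SAMPLE SECOND_SAMPLE
# Each argument is the aggregate "cpu ..." line from /proc/stat.
cpu_breakdown() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) first[i] = $i }
        NR == 2 {
            total = 0
            for (i = 2; i <= 9; i++) { delta[i] = $i - first[i]; total += delta[i] }
            if (total == 0) { print "no ticks elapsed"; exit 1 }
            # Fields: 2=user 4=system 6=iowait 9=steal
            printf "user=%.0f%% sys=%.0f%% iowait=%.0f%% steal=%.0f%%\n",
                100 * delta[2] / total, 100 * delta[4] / total,
                100 * delta[6] / total, 100 * delta[9] / total
        }'
}

# Synthetic samples, 1000 ticks apart
cpu_breakdown "cpu 100 0 50 800 50 0 0 0 0 0" \
              "cpu 300 0 150 900 600 0 0 50 0 0"
# prints: user=20% sys=10% iowait=55% steal=5%
```

On a real host the two samples could come from a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat); cpu_breakdown "$a" "$b".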

Finding Bottlenecks with vmstat and iostat

vmstat 1 is your heartbeat monitor: run it for 10 seconds and watch the columns, ignoring the first row, which reports averages since boot. The r column counts runnable tasks, a real-time view of the CPU component of load average. The b column shows tasks in uninterruptible sleep, which are almost always blocked on I/O. The swap columns si/so should stay at zero; sustained non-zero values mean the kernel is paging under memory pressure.

# vmstat: r = runnable, b = blocked, si/so = swap-in/out, us/sy/id/wa = CPU %
# (the first row is the average since boot; read from the second row on)
vmstat 1 10

# iostat extended: %util, await (ms), r/s, w/s, one device per line
# -x = extended, -z = skip idle devices, 5 = interval, 3 = count
iostat -xz 5 3

# Which processes are doing disk I/O right now? (usually needs root)
iotop -o -b -n 3

# Per-process CPU (-u), memory (-r), and disk (-d) sampling; scriptable, unlike top
pidstat -u -r -d 2 5
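
Reading vmstat by eye gets tedious under pressure. The sketch below, with the made-up name vmstat_flags, reads vmstat output on stdin, locates columns by their header names, skips the first since-boot row, and prints only the samples that look suspicious. The 20% iowait threshold is an arbitrary assumption for illustration, not an established limit.

```shell
# vmstat_flags: reads `vmstat N M` output on stdin and flags suspicious samples.
vmstat_flags() {
    awk '
        # Header row: remember each column position by name
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Skip the banner row and anything that is not a data row
        !("r" in col) || $1 !~ /^[0-9]+$/ { next }
        # The first data row holds averages since boot
        ++samples == 1 { next }
        {
            msg = ""
            if ($(col["si"]) + $(col["so"]) > 0) msg = msg " swapping"
            if ($(col["b"]) > 0)                 msg = msg " blocked=" $(col["b"])
            if ($(col["wa"]) >= 20)              msg = msg " iowait=" $(col["wa"]) "%"
            if (msg != "") print "sample " samples ":" msg
        }'
}
```

Typical use would be vmstat 1 10 | vmstat_flags; no output means none of the three conditions fired.
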
In production at scale: these one-shot tools are for immediate triage. After stabilising the incident, retrospective analysis should come from time-series metrics (Prometheus + Grafana, Datadog, CloudWatch). The CLI tools answer "what is happening right now"; the observability stack answers "when did this start and how often does it happen".

strace: Tracing System Calls

strace intercepts every system call a process makes — open(), read(), write(), connect(), futex(). This is how you diagnose a hung process that burns CPU but does nothing useful, or a process spinning on a failing stat() call.

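Production caution: strace uses ptrace, which stops the traced process on every system call. The overhead scales with the syscall rate: negligible for a mostly idle process, but a syscall-heavy service can slow down several-fold or worse. Never attach it to a high-throughput process on a saturated production host without understanding this cost. Prefer attaching to a single replica and keeping the trace short; strace -c (summary only) avoids printing every call but still intercepts each one.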
# Attach to a running PID (-p) with microsecond timestamps (-tt), following threads and children (-f)
strace -p 12345 -tt -f 2>&1 | head -50

# Summary mode: count calls and time spent per syscall (press Ctrl-C to print the table)
strace -c -p 12345

# Trace only specific calls (e.g., file opens and network connects)
strace -e trace=openat,connect -p 12345

# Trace a new command and save the full output to a file
strace -o /tmp/trace.txt -tt nginx -t
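
The summary table from strace -c can be reduced to its top entries with a few lines of awk. This is a sketch that assumes the usual table layout (percentage of time in the first column, syscall name in the last); strace_top is a hypothetical helper name.

```shell
# strace_top [N]: reads an `strace -c` summary on stdin, prints the first N
# syscalls with their share of traced time (default 3). The table is
# normally already sorted by time, so the first rows are the heaviest.
strace_top() {
    awk -v n="${1:-3}" '
        $1 ~ /^[0-9]+\.[0-9]+$/ && $NF != "total" {
            printf "%s %s%%\n", $NF, $1
            if (++shown >= n) exit
        }'
}
```

Since strace writes the summary to stderr, usage would look something like strace -c -p 12345 2>&1 | strace_top, stopped with Ctrl-C after a few seconds.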

Classic findings from strace: a process repeatedly calling stat() on a missing config file (misconfiguration); a process stuck in futex(FUTEX_WAIT) forever (deadlock on a mutex); hundreds of connect() calls all returning ECONNREFUSED (downstream dependency down).
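
Patterns like these are easier to spot when failing calls are counted rather than scrolled past. A minimal sketch, assuming strace's default line format in which a failed call ends with "= -1 ERRNO (description)"; strace_errors is a name invented here.

```shell
# strace_errors: reads strace output on stdin and counts failing calls,
# grouped by syscall name and errno, most frequent first.
strace_errors() {
    awk '
        / = -1 E[A-Z0-9]+/ {
            # Leftmost identifier followed by "(" is the syscall name
            match($0, /[a-z_][a-z_0-9]*\(/)
            call = substr($0, RSTART, RLENGTH - 1)
            # The errno name follows the 6 characters " = -1 "
            match($0, / = -1 E[A-Z0-9]+/)
            errno = substr($0, RSTART + 6, RLENGTH - 6)
            count[call " " errno]++
        }
        END { for (k in count) print count[k], k }' | sort -rn
}
```

For a trace saved with strace -o /tmp/trace.txt, running strace_errors < /tmp/trace.txt gives a quick tally; a large count next to ENOENT or ECONNREFUSED is a hint worth following up, not a diagnosis in itself.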

lsof: Open Files and File Descriptors

On Linux, everything is a file: sockets, pipes, device nodes, actual files. lsof (list open files) reveals what a process has open. This is essential when debugging "Too many open files" errors, finding which process holds a deleted file (preventing disk space from being reclaimed), or discovering which process is connected to a remote IP.

# All open files for a PID
lsof -p 12345

# Only network connections for a PID (-a ANDs the filters; without it lsof ORs them)
lsof -a -p 12345 -i

# Which process has a port open?
lsof -i :8080

# Files with a link count below 1: deleted but still open, so their disk space is not reclaimed
lsof +L1

# Rough per-process count of lsof rows (includes cwd, txt, and mem entries) to spot FD leak candidates
lsof -n -P | awk 'NR > 1 {print $2, $1}' | sort | uniq -c | sort -rn | head -20
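
Counting every lsof row overstates descriptor usage, because rows such as cwd, txt, and mem are not file descriptors. The sketch below counts only rows whose FD column starts with a digit. It assumes the default column layout (COMMAND, PID, USER, FD, ...) with no TID column; fd_counts is a hypothetical helper name.

```shell
# fd_counts: reads lsof output on stdin and prints, per process,
# the number of rows that represent real file descriptors.
fd_counts() {
    awk 'NR > 1 && $4 ~ /^[0-9]/ { n[$2 " " $1]++ }
         END { for (k in n) print n[k], k }' | sort -rn
}
```

Usage would be along the lines of lsof -n -P | fd_counts | head -20. For a single process, ls /proc/<PID>/fd | wc -l gives the same kind of figure without lsof.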

Putting It Together: A Real Triage Walkthrough

Load average spikes from 1.2 to 14 on an 8-core host. Here is the systematic approach:

  1. Run uptime: confirm a load/cores ratio of 14/8 = 1.75, i.e. saturation.
  2. Run vmstat 1: r=10, b=4, wa=60%. High I/O wait confirmed.
  3. Run iostat -xz 1: /dev/sda at 99% util, await=420ms. Disk is the bottleneck.
  4. Run iotop -o: mysqld doing 180 MB/s writes.
  5. Run strace -c -p <mysqld PID> for 5 seconds: 90% of time in write().
  6. Run lsof -a -p <mysqld PID> -i: 3,200 client connections. Connection pool exhausted, clients queuing.
  7. Root cause: a runaway batch job opened thousands of connections, all issuing large writes. Fix: kill the batch job, cap connection pool, add slow-query analysis.
Document everything during triage. Run commands inside a script session or pipe output to a timestamped file. Post-incident reviews require evidence. At big-tech companies, the timeline of commands run is part of the incident report.
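
One lightweight way to keep that evidence is a wrapper that stamps each command and appends its output to a log file. This is only a sketch: the function name logrun, the TRIAGE_LOG variable, and the /tmp path are all assumptions chosen for illustration.

```shell
# Log file for this triage session; override by exporting TRIAGE_LOG first.
TRIAGE_LOG="${TRIAGE_LOG:-/tmp/triage-$(date +%Y%m%dT%H%M%S).log}"

# logrun COMMAND [ARGS...]
# Runs the command, shows its output, and appends a UTC-timestamped
# header plus the output (stdout and stderr) to $TRIAGE_LOG.
# Note: the return status is that of tee, not of the command itself.
logrun() {
    {
        printf '=== %s $ %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"
        "$@" 2>&1
    } | tee -a "$TRIAGE_LOG"
}
```

During an incident this would be used as logrun uptime, logrun vmstat 1 5, and so on. Commands containing pipes need wrapping, for example logrun sh -c 'lsof -n -P | head -20'.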