Performance Analysis & Troubleshooting
A production server starts responding slowly at 2 AM. Your pager fires. You have minutes to triage. This lesson teaches the systematic methodology that senior SREs at big-tech companies use to identify bottlenecks, interpret performance signals, and resolve them — even on unfamiliar systems.
Load Average: What It Actually Means
Load average is the single number most engineers misread. It appears in uptime, top, and htop output as three values: 1-minute, 5-minute, and 15-minute exponentially weighted moving averages of the number of processes demanding service. On Linux that count includes processes running on a CPU, processes runnable but waiting for a CPU, and processes in uninterruptible sleep (disk I/O, kernel locks).
The critical insight: load average is not CPU utilisation. If the load is purely CPU-bound, a machine with 4 CPU cores and a load average of 4.0 is fully utilised: every core has one process to run and nothing is waiting. The same machine with load 8.0 has on average 4 processes waiting for CPU at any moment. On a 32-core machine, load 8.0 is comfortable headroom.
Rule of thumb: divide the load average by the number of CPU cores (nproc or lscpu | grep "^CPU(s)"). A ratio above 1.0 means saturation; above 2.0 means serious contention. A ratio near zero with high response times means the CPU is not the bottleneck; look at waits that load average does not count, such as network latency or a slow downstream dependency. On Linux, heavy disk I/O wait raises load average by itself, because uninterruptible sleep is counted.
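The load-per-core arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring tool: the function names load_per_core, describe, and current_ratio are hypothetical, and the thresholds are simply the 1.0 and 2.0 cut-offs from the text.

```python
import os


def load_per_core(load_1min: float, cores: int) -> float:
    """Return the 1-minute load average normalised by core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_1min / cores


def describe(ratio: float) -> str:
    """Map a load-per-core ratio onto the thresholds used in the text."""
    if ratio > 2.0:
        return "serious contention"
    if ratio > 1.0:
        return "saturated"
    return "headroom"


def current_ratio() -> float:
    """Read the live values; os.getloadavg() exists on Unix-like systems only."""
    one_min, _five_min, _fifteen_min = os.getloadavg()
    return load_per_core(one_min, os.cpu_count() or 1)
```

For the two examples in the text, load_per_core(8.0, 4) gives 2.0 and load_per_core(8.0, 32) gives 0.25, which describe() labels "saturated" and "headroom" respectively.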
The Performance Triage Methodology
Brendan Gregg, later a performance engineer at Netflix, formalised the USE Method: for every resource (CPU, memory, disks, network, filesystems), measure Utilisation, Saturation, and Errors. Applied top-down, this eliminates guesswork:
- CPU: top / mpstat. Check %user, %sys, %iowait, %steal (on VMs)
- Memory: free -h / vmstat 1. Watch the si/so swap columns; any swapping is saturation
- Storage I/O: iostat -xz 5. %util near 100% and high await (ms) confirm disk saturation
- Network: ss -s / sar -n DEV 1. Connection counts, retransmits, interface errors
- Processes: ps aux --sort=-%cpu, pidstat 1. Identify the specific offending process
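To make the CPU row of the checklist concrete, here is a hedged sketch of how percentages such as user, system, iowait, and steal can be derived from two samples of the aggregate cpu line in /proc/stat on Linux. It assumes the line carries at least the first eight counters in the standard order; the names FIELDS, parse_cpu_line, and cpu_percentages are hypothetical, and real tools such as mpstat handle more edge cases.

```python
# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict[str, int]:
    """Turn 'cpu  u n s i w q sq st ...' into a name -> tick-count mapping."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    if len(parts) < 1 + len(FIELDS):
        raise ValueError("expected at least eight counters")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def cpu_percentages(before: str, after: str) -> dict[str, float]:
    """Percentage of elapsed CPU time spent in each state between two samples."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {name: second[name] - first[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}
```

In practice the two input strings would be the first line of /proc/stat read a second or so apart; a high iowait share points at storage, while a high steal share points at the hypervisor.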
Finding Bottlenecks with vmstat and iostat
vmstat 1 is your heartbeat monitor: run it for 10 seconds and watch the columns. Ignore the first output line, which reports averages since boot. The r column counts runnable processes (running or waiting for a CPU), a real-time view of the CPU component of load average. The b column shows processes in uninterruptible sleep; these are almost always blocked on I/O. The swap columns si/so should be zero; sustained non-zero values mean the kernel is swapping pages to and from disk because memory is short.
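The column checks described above can be automated against captured vmstat output. The following is a minimal sketch that assumes the default vmstat layout (a column-name line starting with r, then integer rows); the function names parse_vmstat and vmstat_warnings are hypothetical, and the sample row in the usage note is illustrative, not real output.

```python
def parse_vmstat(text: str) -> list[dict[str, int]]:
    """Parse default `vmstat` output into one dict per data row, keyed by column name."""
    rows: list[dict[str, int]] = []
    header: list[str] | None = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":
            # Column-name line; vmstat repeats it periodically.
            header = fields
        elif header and fields[0].isdigit() and len(fields) >= len(header):
            rows.append({name: int(value) for name, value in zip(header, fields)})
    return rows


def vmstat_warnings(row: dict[str, int], cores: int) -> list[str]:
    """Apply the rules of thumb from the text to a single vmstat row."""
    found = []
    if row["r"] > cores:
        found.append("run queue exceeds core count")
    if row["b"] > 0:
        found.append("processes blocked in uninterruptible sleep")
    if row["si"] > 0 or row["so"] > 0:
        found.append("swap activity")
    return found
```

Feeding it a row with r=10, b=4, and zero swap on an 8-core host yields the first two warnings, which is the signature of a CPU queue building up behind blocked I/O.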
strace: Tracing System Calls
strace intercepts every system call a process makes — open(), read(), write(), connect(), futex(). This is how you diagnose a hung process that burns CPU but does nothing useful, or a process spinning on a failing stat() call.
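Warning: strace uses ptrace, which stops the traced process at every system call. The often-quoted figure is ~10–30% overhead, but the slowdown scales with the syscall rate and can be far higher for syscall-heavy workloads. Never attach it to a high-throughput process on a saturated production host without understanding this cost. Prefer attaching to a single replica, or use strace -c (summary only), which cuts output volume but still traces every call.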
Classic findings from strace: a process repeatedly calling stat() on a missing config file (misconfiguration); a process stuck in futex(FUTEX_WAIT) forever (deadlock on a mutex); hundreds of connect() calls all returning ECONNREFUSED (downstream dependency down).
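Patterns like these are easier to spot when failed calls are tallied rather than read line by line. The sketch below counts (syscall, errno) pairs in saved strace output. It assumes the common single-line format of name(args) = -1 ERRNO (message), optionally prefixed by a PID as with -f, and does not handle calls split across "unfinished"/"resumed" lines. The names FAILED_CALL and tally_failures are hypothetical.

```python
import re
from collections import Counter

# Matches e.g.:  stat("/path", 0x7ffc) = -1 ENOENT (No such file or directory)
# with an optional "[pid 1234] " or "1234 " prefix as printed under -f.
FAILED_CALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*\)\s+=\s+-1\s+([A-Z0-9]+)"
)


def tally_failures(lines) -> Counter:
    """Count failed system calls by (syscall name, errno name)."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILED_CALL.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Calling tally_failures(open(path)) on a trace file and printing counts.most_common(5) surfaces a repeated ENOENT on one path or a wall of ECONNREFUSED at a glance; here path stands for wherever the trace was saved.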
lsof: Open Files and File Descriptors
On Linux, everything is a file: sockets, pipes, device nodes, actual files. lsof (list open files) reveals what a process has open. This is essential when debugging "Too many open files" errors, finding which process holds a deleted file (preventing disk space from being reclaimed), or discovering which process is connected to a remote IP.
Putting It Together: A Real Triage Walkthrough
Load average spikes from 1.2 to 14 on an 8-core host. Here is the systematic approach:
- Run uptime: confirm 14/8 = 1.75× load per core.
- Run vmstat 1: r=10, b=4, wa=60%. High I/O wait confirmed.
- Run iostat -xz 1: /dev/sda at 99% util, await=420ms. Disk is the bottleneck.
- Run iotop -o: mysqld doing 180 MB/s writes.
- Run strace -c -f -p <mysqld PID> for 5 seconds (-f follows all threads): 90% of time in write().
- Run lsof -a -p <mysqld PID> -i: 3,200 client connections. Connection pool exhausted, clients queuing. The -a flag ANDs the two filters; without it lsof lists every file matching either one.
- Root cause: a runaway batch job opened thousands of connections, all issuing large writes. Fix: kill the batch job, cap the connection pool, add slow-query analysis.
Record your triage in a script session or pipe output to a timestamped file. Post-incident reviews require evidence. At big-tech companies, the timeline of commands run is part of the incident report.
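One way to keep that evidence is a small wrapper that appends every command, its UTC start time, exit status, and output to a log file. This is a minimal sketch under the assumption that the commands are short-lived and non-interactive; the name run_and_log and the log file name are hypothetical.

```python
import shlex
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def run_and_log(cmd: list[str], log_path: Path) -> str:
    """Run a command, append a timestamped record to log_path, return its stdout."""
    started = datetime.now(timezone.utc).isoformat(timespec="seconds")
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    with log_path.open("a", encoding="utf-8") as log:
        log.write(f"=== {started} $ {shlex.join(cmd)} (exit {result.returncode})\n")
        log.write(result.stdout)
        log.write(result.stderr)
    return result.stdout
```

For example, run_and_log(["uptime"], Path("incident.log")) followed by run_and_log(["vmstat", "1", "5"], Path("incident.log")) leaves an ordered, timestamped record that can be pasted into the incident report.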