Linux System Administration

Performance Analysis & Troubleshooting

A production server starts responding slowly at 2 AM. Your pager fires. You have minutes to triage. This lesson teaches the systematic methodology that senior SREs at big-tech companies use to identify bottlenecks, interpret performance signals, and resolve them — even on unfamiliar systems.

Load Average: What It Actually Means

Load average is the single number most engineers misread. It appears in uptime, top, and htop output as three values: the 1-minute, 5-minute, and 15-minute exponentially weighted moving averages of the number of tasks that are running on a CPU, runnable and waiting for a CPU, or in uninterruptible sleep (typically disk I/O or kernel locks).

The critical insight: load average is not CPU utilisation. On a machine with 4 CPU cores, a CPU-bound load average of 4.0 means the cores are fully utilised but nothing is queuing: every core has one task to run. The same machine at load 8.0 has, on average, 4 tasks waiting at any moment. On a 32-core machine, load 8.0 leaves comfortable headroom.

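Rule of thumb: divide load average by the number of CPU cores (nproc or lscpu | grep "^CPU(s)"). A ratio above 1.0 means saturation; above 2.0 means serious contention. Because Linux counts tasks in uninterruptible sleep, a high ratio with mostly idle CPUs points to I/O wait rather than CPU. A low ratio with high response times points away from both: look at network latency, downstream dependencies, or lock waits in the application.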

# Read load average + core count in one go
uptime
nproc

# 5-second snapshot of per-CPU utilisation (requires sysstat)
mpstat -P ALL 5 1

# %iowait in the avg-cpu header of iostat (wa in vmstat/top): high %iowait with low %user/%system suggests a storage bottleneck
iostat -xz 5 1
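
The rule of thumb above can be sketched as a small script. This is a minimal illustration, not a monitoring tool: the function names load_ratio and classify_ratio are hypothetical, and the 1.0 and 2.0 thresholds are simply the ones stated in this lesson.

```shell
#!/bin/sh
# load_ratio LOAD CORES: print load per core, to two decimals.
load_ratio() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# classify_ratio RATIO: map a per-core ratio onto the lesson's rule-of-thumb bands.
classify_ratio() {
    awk -v r="$1" 'BEGIN {
        if (r > 2.0)      print "serious contention"
        else if (r > 1.0) print "saturated"
        else              print "ok"
    }'
}

# Live reading (Linux only): the first field of /proc/loadavg is the 1-minute average.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one _ < /proc/loadavg
    ratio=$(load_ratio "$one" "$(nproc)")
    echo "load=$one cores=$(nproc) ratio=$ratio status=$(classify_ratio "$ratio")"
fi
```

A result of "ok" only rules out queuing for CPU and disk; it says nothing about latency caused elsewhere.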

The Performance Triage Methodology

Performance engineer Brendan Gregg formalised the USE Method: for every resource (CPU, memory, disks, network, filesystems), measure Utilisation, Saturation, and Errors. Applied top-down, this removes most of the guesswork:

CPU (top / mpstat): check %user, %sys, %iowait, %steal (on VMs)
Memory (free -h / vmstat 1): watch the si/so swap columns; sustained non-zero values indicate saturation
Storage I/O (iostat -xz 5): %util near 100% together with high r_await/w_await (ms) confirms disk saturation; on SSD/NVMe devices that serve requests in parallel, %util alone overstates it
Network (ss -s / sar -n DEV,EDEV,ETCP 1): connection counts from ss, throughput from DEV, interface errors from EDEV, retransmits from ETCP
Processes (ps aux --sort=-%cpu, pidstat 1): identify the specific offending process

Figure: Linux performance triage flow. Start from load average, branch to CPU, memory, or I/O, then drill down to the offending process.
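
When mpstat is not installed, the CPU row of the checklist can be approximated from /proc/stat. The sketch below assumes the standard layout of the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal); the function name cpu_breakdown is hypothetical, and folding nice into user and irq/softirq into sys is a simplification chosen here.

```shell
#!/bin/sh
# cpu_breakdown BEFORE AFTER
# BEFORE and AFTER are two samples of the aggregate "cpu ..." line from /proc/stat.
# Prints the share of elapsed CPU time spent in each state, in percent.
cpu_breakdown() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        # $2 user  $3 nice  $4 system  $5 idle  $6 iowait  $7 irq  $8 softirq  $9 steal
        NR == 1 { for (i = 2; i <= 9; i++) before[i] = $i }
        NR == 2 {
            for (i = 2; i <= 9; i++) { delta[i] = $i - before[i]; total += delta[i] }
            if (total <= 0) { print "no elapsed CPU time between samples"; exit 1 }
            printf "user=%.1f sys=%.1f iowait=%.1f steal=%.1f idle=%.1f\n",
                100 * (delta[2] + delta[3]) / total,
                100 * (delta[4] + delta[7] + delta[8]) / total,
                100 * delta[6] / total,
                100 * delta[9] / total,
                100 * delta[5] / total
        }'
}

# Live reading (Linux only): two samples taken one second apart.
if [ -r /proc/stat ]; then
    before=$(grep '^cpu ' /proc/stat)
    sleep 1
    after=$(grep '^cpu ' /proc/stat)
    cpu_breakdown "$before" "$after"
fi
```

The counters are cumulative since boot, which is why the function works on the difference between two samples rather than on a single reading.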

Finding Bottlenecks with vmstat and iostat

vmstat 1 is your heartbeat monitor: run it for 10 seconds and watch the columns, ignoring the first row, which reports averages since boot. The r column (run queue) is the real-time counterpart of the CPU part of load average. The b column shows processes in uninterruptible sleep, which are almost always blocked on I/O. The swap columns si/so should be zero; sustained non-zero values mean the system is actively swapping under memory pressure.

# vmstat: r=runnable, b=blocked, si/so=swap-in/out, us/sy/wa/id=CPU%
vmstat 1 10

# iostat extended: %util, await (ms), r/s, w/s; one device per line (the first report is the average since boot)
# -x = extended, -z = skip idle devices, 5 = interval, 3 = count
iostat -xz 5 3

# Which processes are doing the disk I/O right now?
iotop -o -b -n 3

# Per-process CPU (-u), memory (-r), and disk I/O (-d) sampling (like top but scriptable)
pidstat -u -r -d 2 5
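
Watching vmstat columns by eye is error-prone under pressure, so the check described above can be sketched as a filter. The function name flag_vmstat is hypothetical. The sketch looks columns up by header name rather than position, and drops the first data row because it holds since-boot averages.

```shell
#!/bin/sh
# flag_vmstat: read `vmstat 1 N` output on stdin and print the live samples that
# show swapping (si/so > 0) or blocked processes (b > 0).
flag_vmstat() {
    awk '
        # Column-name header row: remember where each column sits.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Ignore the "procs ---memory---" banner and anything else that is not data.
        !("r" in col) || $1 !~ /^[0-9]+$/ { next }
        # The first data row is the average since boot, not a live sample.
        ++sample == 1 { next }
        $(col["si"]) > 0 || $(col["so"]) > 0 || $(col["b"]) > 0 {
            printf "sample %d: r=%s b=%s si=%s so=%s wa=%s\n",
                sample - 1, $(col["r"]), $(col["b"]), $(col["si"]), $(col["so"]), $(col["wa"])
        }'
}

# Usage on a live host (takes about 10 seconds):
#   vmstat 1 10 | flag_vmstat
```

No output means none of the live samples showed swapping or blocked processes during the window.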

In production at scale: these one-shot tools are for immediate triage. After stabilising the incident, retrospective analysis should come from time-series metrics (Prometheus + Grafana, Datadog, CloudWatch). The CLI tools answer "what is happening right now"; the observability stack answers "when did this start and how often does it happen".

strace: Tracing System Calls

strace intercepts every system call a process makes, such as open(), read(), write(), connect(), and futex(). This is how you diagnose a process that appears hung in a call that never returns, or one that burns CPU in a tight loop of failing calls such as stat().

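Production caution: strace uses ptrace, which stops the traced process at every system call. The cost depends on the syscall rate: it is negligible for a mostly idle process, but a syscall-heavy workload can slow down several-fold or worse. Never attach it to a high-throughput process on a saturated production host without understanding this cost. Prefer attaching to a single replica and keep the trace short. strace -c (summary only) reduces the output, not the per-call cost.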

# Attach strace to a running PID (-p), show microsecond timestamps (-tt), follow threads and children (-f)
strace -p 12345 -tt -f 2>&1 | head -50

# Summary mode: count calls + time spent per syscall; Ctrl-C detaches and prints the table (same ptrace cost as a full trace)
strace -c -p 12345

# Trace only specific calls (e.g., file opens and network connects)
strace -e trace=openat,connect -p 12345

# Trace a new command and save full output to file
strace -o /tmp/trace.txt -tt nginx -t

Classic findings from strace: a process repeatedly calling stat() on a missing config file (misconfiguration); a process stuck in futex(FUTEX_WAIT) forever (deadlock on a mutex); hundreds of connect() calls all returning ECONNREFUSED (downstream dependency down).
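
Patterns like these are easier to spot when failures in a saved trace are tallied. The sketch below assumes the usual strace line shape, "name(args) = -1 ERRNO (message)"; the function name strace_errors is hypothetical, and "<... resumed>" continuation lines produced by -f are skipped rather than stitched back together.

```shell
#!/bin/sh
# strace_errors FILE: count failed syscalls in an strace log, grouped by
# syscall name and errno symbol, most frequent first.
strace_errors() {
    awk '
        / = -1 E[A-Z0-9]+/ {
            # Syscall name: the identifier directly before the first "(".
            if (!match($0, /[a-z_][a-z0-9_]*\(/)) next
            name = substr($0, RSTART, RLENGTH - 1)
            # Errno symbol: the token after " = -1 " (six characters).
            match($0, / = -1 E[A-Z0-9]+/)
            errno = substr($0, RSTART + 6, RLENGTH - 6)
            count[name " " errno]++
        }
        END { for (key in count) print count[key], key }
    ' "$1" | sort -k1,1nr -k2
}

# Usage with a trace saved via -o, as in the listing above:
#   strace_errors /tmp/trace.txt
```

A large count on a single syscall and errno pair, such as connect with ECONNREFUSED, is usually the quickest pointer to the failing dependency.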

lsof: Open Files and File Descriptors

On Linux, everything is a file: sockets, pipes, device nodes, actual files. lsof (list open files) reveals what a process has open. This is essential when debugging "Too many open files" errors, finding which process holds a deleted file (preventing disk space from being reclaimed), or discovering which process is connected to a remote IP.

# All open files for a PID
lsof -p 12345

# Show only network connections for a PID (-a ANDs the filters; without it lsof ORs them)
lsof -a -p 12345 -i

# Which process has a port open?
lsof -i :8080

# Find who is holding a deleted-but-open file (disk usage won't reclaim until closed)
lsof +L1

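# Count open-file entries per PID and command to spot FD leak candidates
# (approximate: lsof also lists cwd, txt, and mem entries, not only numbered FDs)
lsof -n | awk 'NR>1 {print $2, $1}' | sort | uniq -c | sort -rn | head -20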
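
For a "Too many open files" investigation, an exact count compared against the process limit is often more useful than lsof totals. The sketch below reads /proc/PID/fd and the "Max open files" row of /proc/PID/limits. The function names and the PROC_ROOT override are hypothetical conveniences introduced here, not part of any tool.

```shell
#!/bin/sh
# PROC_ROOT can be overridden to point the functions at a copy of /proc.
PROC_ROOT=${PROC_ROOT:-/proc}

# fd_count PID: number of entries in PID's fd directory (runs in a subshell).
fd_count() (
    n=0
    for entry in "$PROC_ROOT/$1/fd"/*; do
        # Real fd entries are symlinks, some to non-path targets like socket:[123].
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        n=$((n + 1))
    done
    echo "$n"
)

# fd_soft_limit PID: soft limit from the "Max open files" row of the limits file.
fd_soft_limit() {
    awk '/^Max open files/ { print $4 }' "$PROC_ROOT/$1/limits"
}

# fd_report PID: one-line summary of usage against the soft limit.
fd_report() (
    used=$(fd_count "$1")
    limit=$(fd_soft_limit "$1")
    case $limit in
        ''|unlimited) echo "pid=$1 fds=$used limit=${limit:-unknown}" ;;
        *)            echo "pid=$1 fds=$used limit=$limit used_pct=$((100 * used / limit))" ;;
    esac
)

# Example: report on the current shell when a live /proc is available.
if [ -d "$PROC_ROOT/$$/fd" ]; then
    fd_report "$$"
fi
```

Reading another user's /proc/PID/fd generally requires root, so an unexpectedly low count may only mean the directory was not readable.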

Putting It Together: A Real Triage Walkthrough

Load average spikes from 1.2 to 14 on an 8-core host. Here is the systematic approach:

Run uptime: 14/8 = 1.75 per core, so the host is saturated.
Run vmstat 1: r=10, b=4, wa=60%. High I/O wait confirmed.
Run iostat -xz 1: /dev/sda at 99% util, await=420ms. Disk is the bottleneck.
Run iotop -o: mysqld doing 180 MB/s writes.
Run timeout 5 strace -c -f -p <mysqld PID>: 90% of syscall time in write(). The -f flag is needed because mysqld is multi-threaded.
Run lsof -a -p <mysqld PID> -i: 3,200 client connections. Connection pool exhausted, clients queuing.
Root cause: a runaway batch job opened thousands of connections, all issuing large writes. Fix: kill the batch job, cap the connection pool, add slow-query analysis.
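
The iostat step in a walkthrough like this can be scripted so that busy devices stand out. The sketch below is illustrative only: the function name hot_devices and the default threshold of 90 are assumptions, and it expects a "." decimal separator (run iostat under LC_ALL=C if your locale prints commas).

```shell
#!/bin/sh
# hot_devices [THRESHOLD]: read `iostat -x` output on stdin and print
# "device %util" for every device at or above THRESHOLD percent (default 90).
hot_devices() {
    awk -v limit="${1:-90}" '
        # Device header row: find %util by name, since the column order
        # differs between sysstat versions.
        $1 ~ /^Device/ {
            util = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") util = i
            in_devices = 1
            next
        }
        # A blank line ends the device table.
        NF == 0 { in_devices = 0; next }
        in_devices && util && ($util + 0) >= (limit + 0) { printf "%s %s\n", $1, $util }
    '
}

# Usage on a live host (-y omits the since-boot report in recent sysstat versions):
#   LC_ALL=C iostat -xzy 5 3 | hot_devices 90
```

As noted for SSD and NVMe devices, a high %util is a prompt to check the await columns, not proof of saturation on its own.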

Document everything during triage. Run commands inside a session recorded by the script utility, or pipe output to a timestamped file. Post-incident reviews require evidence. At big-tech companies, the timeline of commands run is part of the incident report.
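
One lightweight way to keep that timeline is a wrapper that logs each command before running it. This is a minimal sketch: the function name run_logged, the TRIAGE_LOG variable, and the /tmp path are all hypothetical choices made for illustration.

```shell
#!/bin/sh
# Log file for this incident; override TRIAGE_LOG to choose another location.
TRIAGE_LOG=${TRIAGE_LOG:-/tmp/triage-$(date -u +%Y%m%dT%H%M%SZ).log}

# run_logged CMD [ARGS...]: append a UTC timestamp and the command line to the
# log, then run the command with its output sent to both the terminal and the log.
# Note: the return status is that of tee, not of the command itself.
run_logged() {
    printf '\n[%s] $ %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$TRIAGE_LOG"
    "$@" 2>&1 | tee -a "$TRIAGE_LOG"
}

# Usage during an incident:
#   run_logged uptime
#   run_logged vmstat 1 10
```

Using UTC timestamps makes it simpler to line the log up with dashboards and pager events afterwards.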