Profiling Applications
Load tests tell you that your service is slow. Profiling tells you why. Without profiling, optimisation is guesswork: engineers spend days tuning connection pools or adding caches while a single hot function in userland burns 60 % of the CPU time. This lesson is about finding those hot paths systematically, at the function-call and memory-allocation level, with the same discipline you apply to SLO burn-rate analysis.
The Two Axes: CPU and Memory
Every performance regression sits on one of two axes — or both:
- CPU profiling: measures where time is spent, i.e. which functions are on-CPU and for how long. Captured by sampling the call stack at a fixed rate (e.g. 100 Hz) or by instrumenting every function entry/exit.
- Memory profiling: measures where allocations are made, i.e. which call path triggered malloc or GC pressure. Useful when the symptom is OOMKill, high GC pause latency, or steady RSS growth (leak).
In a containerised fleet you will most often reach for CPU profiling first, because CPU saturation is the dominant cause of latency regressions in compute-bound services. Memory profiling follows when RSS or GC metrics from your Prometheus stack surface an anomaly.
Sampling vs. Instrumentation
Sampling profilers (perf, pprof in Go, async-profiler for the JVM, Pyroscope's eBPF mode) periodically interrupt execution and record the current call stack. Overhead is low (typically 1–3 % CPU), which makes them safe for brief production use. Instrumentation profilers (Python's cProfile, bytecode-instrumenting JVM profilers) hook every function call, giving exact call counts but adding 10–40 % overhead; reserve these for staging or controlled canary windows.
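To make the sampling model concrete, here is a minimal Go sketch that wraps a workload in the standard library's sampling CPU profiler (runtime/pprof, which samples at about 100 Hz). The workload spin and the helper name profileCPU are hypothetical stand-ins, not part of any service discussed in this lesson.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// sink keeps the workload's result alive so the work is not discarded.
var sink int

// spin is a hypothetical CPU-bound workload used only for illustration.
func spin(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// profileCPU runs work while the runtime samples call stacks and
// returns the gzip-compressed profile in pprof's protobuf format.
func profileCPU(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	pprof.StopCPUProfile() // stops sampling and flushes the profile into buf
	return buf.Bytes(), nil
}

func main() {
	data, err := profileCPU(func() { sink = spin(200_000_000) })
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of profile data\n", len(data))
}
```

Writing the returned bytes to a file would give something go tool pprof can open; the workload itself runs unmodified, which is why the overhead stays low.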
Flame Graphs: Reading the Hot Path
A flame graph (invented by Brendan Gregg) collapses thousands of stack samples into a single image. The x-axis is sample count, so a frame's width is its share of the profile; left-to-right order is alphabetical, not chronological. The y-axis is call depth (bottom = entry point, top = leaf). The widest frames at the top of a tower are the hot path: the code spending the most CPU time without calling anything else.
The key insight: you do not optimise the wide frames at the bottom (those are typically your framework's request dispatcher, which you cannot change, and they are wide only because everything else runs beneath them). You optimise the wide frames at the top of a tower, because those are the leaf functions that the CPU is actually executing. Height only shows call depth; width is what matters.
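The "wide frame at the top" rule can be expressed as a small calculation over folded stacks, the one-line-per-stack text format that flame graph tooling consumes ("frame;frame;leaf count"). The sketch below is illustrative only: the function names and the sample data are hypothetical, and it attributes each stack's samples to its leaf frame to find the function with the largest self time.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// hottestLeaf parses folded stacks ("root;child;leaf 42" per line) and
// returns the leaf frame with the most self samples and its share of
// all samples in percent. Ties are broken by name for determinism.
func hottestLeaf(folded string) (string, float64) {
	self := make(map[string]int)
	total := 0
	for _, line := range strings.Split(folded, "\n") {
		line = strings.TrimSpace(line)
		i := strings.LastIndex(line, " ")
		if i < 0 {
			continue // blank or malformed line
		}
		n, err := strconv.Atoi(line[i+1:])
		if err != nil {
			continue
		}
		frames := strings.Split(line[:i], ";")
		self[frames[len(frames)-1]] += n // samples belong to the leaf
		total += n
	}
	best, bestN := "", -1
	for name, n := range self {
		if n > bestN || (n == bestN && name < best) {
			best, bestN = name, n
		}
	}
	if total == 0 {
		return "", 0
	}
	return best, 100 * float64(bestN) / float64(total)
}

func main() {
	// Hypothetical samples: the dispatcher frames are on every stack,
	// but only the leaves are actually burning CPU.
	folded := `main;serve;handle;parseJSON 60
main;serve;handle;render 25
main;serve;handle 5
main;gc 10`
	name, share := hottestLeaf(folded)
	fmt.Printf("%s %.1f%%\n", name, share)
}
```

In this made-up profile, main and serve appear in almost every sample, yet the optimisation target is parseJSON, because that is where the samples terminate.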
Go: pprof in Production
Go ships net/http/pprof in the standard library. Import it and expose the debug endpoint (behind an internal-only ingress or network policy, never on the public port):
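One way to follow that advice is to register the pprof handlers on a dedicated mux rather than on http.DefaultServeMux, so they can be bound to an internal-only listener. This is a minimal sketch under that assumption; newDebugMux and statusFor are hypothetical helper names, and the loopback address in the comment is only an example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux registers the pprof handlers on their own mux, so they
// are never exposed through the service's public handler by accident.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, goroutine, ...
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// statusFor issues an in-memory GET against h and returns the status
// code; it is used here only as a self-check.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newDebugMux()
	fmt.Println("pprof index status:", statusFor(mux, "/debug/pprof/"))
	fmt.Println("unrelated path status:", statusFor(mux, "/metrics"))

	// In a real service you would serve this mux on an internal
	// address (assumed here), separate from the public port:
	//   go http.ListenAndServe("127.0.0.1:6060", mux)
}
```

Importing net/http/pprof purely for its side effect registers the same handlers on http.DefaultServeMux, which is convenient but easier to expose unintentionally.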
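Capture a 30-second CPU profile and view it as a flame graph locally, for example with go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30' (this assumes the debug endpoint listens on port 6060), then choose the flame graph view in the web UI.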
JVM: async-profiler
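async-profiler avoids the safepoint bias of JVMTI-based sampling profilers (JProfiler, YourKit) by combining Linux perf_events with AsyncGetCallTrace to sample threads regardless of JVM safepoint state. This makes it a strong default for production JVM profiling. With recent releases, a typical invocation is asprof -d 30 -f flame.html <pid>, which samples for 30 seconds and writes an HTML flame graph (older releases ship a profiler.sh launcher with the same options).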
Linux: perf + Brendan Gregg's FlameGraph
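For native processes (C, C++, Rust) or kernel-space hot paths (system calls, eBPF programs), perf is the ground truth. A common recipe is perf record -F 99 -g -p <pid> -- sleep 30, followed by perf script piped through stackcollapse-perf.pl and flamegraph.pl from the FlameGraph repository to produce an SVG.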
Binaries built with -O2 or stripped of symbols often produce frames labelled [unknown]. Solutions: (1) compile with -fno-omit-frame-pointer (Go has kept frame pointers by default on amd64 since 1.7; add -XX:+PreserveFramePointer for the JVM); (2) install debuginfo packages; (3) use eBPF-based profilers (Parca, Pyroscope eBPF mode) that resolve symbols from DWARF in a sidecar without touching the application binary.
Memory Profiling: Finding Leaks and GC Pressure
When your Prometheus process_resident_memory_bytes climbs 2 % per hour or jvm_gc_pause_seconds spikes under load, reach for allocation profiling:
- Go heap profile: curl localhost:6060/debug/pprof/heap > heap.pb.gz, then go tool pprof -http=:9090 heap.pb.gz. Toggle "alloc_space" (total allocations since start) vs "inuse_space" (current live heap): inuse_space reveals live leaks, alloc_space reveals allocation pressure causing GC churn.
- JVM: jcmd <pid> VM.native_memory summary for off-heap; async-profiler with -e alloc for allocation flame graphs. Heap dumps (jmap -dump:live) feed into Eclipse MAT for leak detection.
- Python: memray (Bloomberg) instruments allocations at the C level: memray run --live myservice.py or memray attach <pid> for a running process.
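The difference between allocation pressure and live heap can be illustrated in-process with runtime.MemStats. The sketch below is an analogy, not a replacement for the pprof heap profile: TotalAlloc plays the role of alloc_space and HeapAlloc (after a forced GC) the role of inuse_space. The churn workload and the measure helper are hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink is package-level so the buffers escape to the heap.
var sink []byte

// churn allocates n short-lived buffers of size bytes each; only the
// most recent one stays reachable, so this is GC pressure, not a leak.
func churn(n, size int) {
	for i := 0; i < n; i++ {
		sink = make([]byte, size)
	}
}

// measure runs work and reports the cumulative bytes allocated
// (comparable to alloc_space) and the change in live heap after a
// forced collection (comparable to inuse_space).
func measure(work func()) (allocated uint64, liveDelta int64) {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)
	work()
	runtime.GC()
	runtime.ReadMemStats(&after)
	allocated = after.TotalAlloc - before.TotalAlloc
	liveDelta = int64(after.HeapAlloc) - int64(before.HeapAlloc)
	return allocated, liveDelta
}

func main() {
	allocated, liveDelta := measure(func() { churn(1000, 64<<10) })
	fmt.Printf("allocated %d MiB in total, live heap changed by %d KiB\n",
		allocated>>20, liveDelta/1024)
}
```

A workload like this shows a large cumulative figure and a small live delta, which is the signature of GC churn; a leak would show the live heap growing run after run.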
Profiling in CI: Performance Gates
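Integrating profiling into your CI pipeline prevents regressions from reaching production. A typical setup adds a GitHub Actions step after the load test job that compares the freshly captured profile against a stored baseline and fails the build when a threshold is exceeded.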
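The comparison logic behind such a gate can be very small. The following Go sketch assumes the pipeline has already reduced each profile to a map of function name to CPU share in percent; that input format, the function names, and the 5-point threshold are all hypothetical choices for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// regressions returns the functions whose CPU share (percent of
// samples) grew by more than maxIncreasePts percentage points
// relative to the baseline. Functions missing from the baseline are
// treated as having a baseline share of zero.
func regressions(baseline, current map[string]float64, maxIncreasePts float64) []string {
	var out []string
	for name, cur := range current {
		if cur-baseline[name] > maxIncreasePts {
			out = append(out, name)
		}
	}
	sort.Strings(out) // stable output for CI logs
	return out
}

func main() {
	// Hypothetical shares extracted from two profiles.
	baseline := map[string]float64{"parseJSON": 12.0, "render": 8.0}
	current := map[string]float64{"parseJSON": 31.5, "render": 8.4}

	if bad := regressions(baseline, current, 5.0); len(bad) > 0 {
		// A CI wrapper would exit non-zero here to fail the job.
		fmt.Println("performance gate failed:", bad)
		return
	}
	fmt.Println("performance gate passed")
}
```

Comparing shares rather than absolute timings keeps the gate less sensitive to noisy CI runners, although a real gate would still want a tolerance tuned against run-to-run variance.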
The Senior Judgement Layer
Profiling data is only as useful as the action it drives. At senior/staff level, the discipline is not just reading the flame graph but translating it into a change with a predicted and measured impact:
- Baseline first. Record the flame graph before any change. Without a baseline, you cannot prove a fix actually improved throughput.
- One variable at a time. Optimise one hot frame, re-run the load test, re-profile. Combining changes makes regression attribution impossible.
- Beware premature micro-optimisation. A function burning 3 % of CPU that takes two hours to fix is a poor trade. Focus on frames over 10–15 % first.
- Latency vs. throughput. A CPU optimisation that halves CPU usage may not halve p99 latency if the bottleneck is a downstream call (check your distributed traces from Jaeger/Tempo alongside the flame graph).
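The latency-versus-throughput point is easy to check with back-of-the-envelope arithmetic. The sketch below uses a deliberately simple serial model (request latency = CPU time + downstream wait) with made-up numbers; real p99 behaviour also depends on queueing and variance, which this model ignores.

```go
package main

import "fmt"

// latencyAfter models one request as CPU time plus downstream wait
// (both in milliseconds) and returns the latency once CPU time is
// scaled by cpuFactor (0.5 means the CPU work was halved).
func latencyAfter(cpuMs, downstreamMs, cpuFactor float64) float64 {
	return cpuMs*cpuFactor + downstreamMs
}

func main() {
	// Hypothetical request: 20 ms on-CPU, 80 ms waiting on a downstream call.
	before := latencyAfter(20, 80, 1.0)
	after := latencyAfter(20, 80, 0.5)
	fmt.Printf("before=%.0fms after=%.0fms improvement=%.0f%%\n",
		before, after, 100*(before-after)/before)
}
```

Under these assumptions, halving CPU time improves latency by only 10 %, even though it roughly doubles the CPU-bound throughput per core.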