Performance & Load Testing

Profiling Applications

Load tests tell you that your service is slow. Profiling tells you why. Without profiling, optimisation is guesswork: engineers spend days tuning connection pools or adding caches while a single hot function in userland burns 60 % of the service's CPU time. This lesson is about finding those hot paths systematically, at the function-call and memory-allocation level, with the same discipline you apply to SLO burn-rate analysis.

The Two Axes: CPU and Memory

Most performance regressions you will profile sit on one of two axes, or both:

  • CPU profiling — measures where time is spent: which functions are on-CPU and for how long. Captured by sampling the call stack at a fixed interval (e.g. 100 Hz) or by instrumenting every function entry/exit.
  • Memory profiling — measures where allocations are made: which call path triggered malloc / GC pressure. Useful when the symptom is OOMKill, high GC pause latency, or steady RSS growth (leak).

In a containerised fleet you will most often reach for CPU profiling first, because CPU saturation is the dominant cause of latency regressions in compute-bound services. Memory profiling follows when RSS or GC metrics from your Prometheus stack surface an anomaly.

Sampling vs. Instrumentation

Sampling profilers (perf, pprof in Go, async-profiler for the JVM) periodically interrupt execution and record the current call stack. Overhead is low (typically 1–3 % CPU), which makes them safe for brief production use. Instrumentation profilers (Python's cProfile, or the instrumentation modes of JVM profilers such as JProfiler) hook every function entry and exit, giving exact call counts but adding 10–40 % overhead; reserve these for staging or controlled canary windows.
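To make the overhead difference concrete, here is a minimal sketch of what an instrumentation profiler does on every call. The `Recorder` type and its methods are hypothetical names invented for this illustration, not part of any real profiler; the point is that exact counts cost two clock reads and a lock per call, whereas a sampler pays only at each sampling tick.

```go
package main

import (
	"sync"
	"time"
)

// callStats holds exact per-function counters, the kind of data an
// instrumentation profiler collects on every call.
type callStats struct {
	Calls int
	Total time.Duration
}

// Recorder is a hypothetical, minimal instrumentation registry.
type Recorder struct {
	mu    sync.Mutex
	stats map[string]*callStats
}

// NewRecorder returns an empty registry.
func NewRecorder() *Recorder {
	return &Recorder{stats: make(map[string]*callStats)}
}

// Wrap returns fn with entry/exit bookkeeping added. The clock reads and the
// mutex taken on every call are where instrumentation overhead comes from.
func (r *Recorder) Wrap(name string, fn func()) func() {
	return func() {
		start := time.Now()
		fn()
		elapsed := time.Since(start)

		r.mu.Lock()
		defer r.mu.Unlock()
		s, ok := r.stats[name]
		if !ok {
			s = &callStats{}
			r.stats[name] = s
		}
		s.Calls++
		s.Total += elapsed
	}
}

// Calls reports the exact number of recorded calls for name (0 if unknown).
func (r *Recorder) Calls(name string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	if s, ok := r.stats[name]; ok {
		return s.Calls
	}
	return 0
}
```

Wrapping a function that runs millions of times per second multiplies this bookkeeping accordingly, which is one way to see why exact counts are better suited to staging than to full production traffic.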

Production rule of thumb: a sampling profiler running at 100 Hz for 60 seconds on a single pod adds roughly 1–2 % CPU overhead. That is acceptable during a live incident or a controlled 5 % canary. Never run an instrumentation profiler against 100 % of production traffic.

Flame Graphs: Reading the Hot Path

A flame graph (invented by Brendan Gregg) collapses thousands of stack samples into a single image. The x-axis is sample count (width = time share); the left-to-right order is alphabetical, not chronological. The y-axis is call depth (bottom = entry point, top = leaf). The widest frames at the top of a tower are the hot path: the code spending the most CPU time without calling anything else.

[Figure: Flame Graph, CPU Profile (60 s sample). Y-axis: call depth (stack frames). X-axis: sample count, i.e. time share (wider = more CPU time). Frames shown: main() / goroutine entrypoint; http.HandleFunc → ServeHTTP(); runtime.schedule / idle; queryUserOrders(); renderJSON(); db.Query() (42 %); json.Marshal; N+1 loop (34 %, marked HOT PATH); net/tcp; sql.Exec(). Legend: hot path (>30 % CPU), warm path (10–30 %), cold path (<10 %).]
A flame graph reveals an N+1 database loop consuming 34 % of CPU — the widest red frame at the top of the hot tower is the optimisation target.

The key insight: you do not optimise the wide frames at the bottom. Those are entry points and your framework's request dispatcher, and they are wide only because everything else runs beneath them. You optimise the wide frames at the top of the widest towers, because those are the leaf functions the CPU is actually executing. A tower's height only tells you how deep the stack is, not how expensive it is.
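The "wide frames at the top" rule can be sketched in code. The example below assumes the folded-stack text format that flame graph tools consume (one `root;child;leaf count` line per unique stack) and totals the samples in which each function was the leaf. `SelfCosts` and `FrameCost` are illustrative names, not part of pprof or the FlameGraph scripts.

```go
package main

import (
	"sort"
	"strconv"
	"strings"
)

// FrameCost is the self (leaf, on-CPU) sample count for one function.
type FrameCost struct {
	Func string
	Self int
}

// SelfCosts parses folded stacks ("root;child;leaf <count>" per line) and
// returns per-function self samples, widest first. Malformed lines are skipped.
func SelfCosts(folded string) []FrameCost {
	self := make(map[string]int)
	for _, line := range strings.Split(folded, "\n") {
		line = strings.TrimSpace(line)
		i := strings.LastIndex(line, " ")
		if i < 0 {
			continue // blank line or no count column
		}
		count, err := strconv.Atoi(line[i+1:])
		if err != nil {
			continue
		}
		frames := strings.Split(line[:i], ";")
		leaf := frames[len(frames)-1] // top of the tower: the code actually running
		self[leaf] += count
	}

	out := make([]FrameCost, 0, len(self))
	for fn, n := range self {
		out = append(out, FrameCost{Func: fn, Self: n})
	}
	sort.Slice(out, func(a, b int) bool {
		if out[a].Self != out[b].Self {
			return out[a].Self > out[b].Self
		}
		return out[a].Func < out[b].Func // stable order for ties
	})
	return out
}
```

Note that `main` and `ServeHTTP` appear in every stack yet receive almost no self samples, which is exactly why the bottom of the graph is the wrong place to look.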

Go: pprof in Production

Go ships net/http/pprof in the standard library. Import it and expose the debug endpoint (behind an internal-only ingress or network policy, never on the public port):

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
    )

    // In main(), serve the default mux on an internal port (6060, not 8080):
    go func() {
        log.Fatal(http.ListenAndServe(":6060", nil))
    }()

Capture a 30-second CPU profile and generate a flame graph locally:

    # Port-forward to the pod under load (leave this running in one terminal)
    kubectl port-forward pod/api-6d8f9b-xkz2p 6060:6060 -n production

    # In a second terminal: collect a 30 s CPU profile and open the web UI
    # (quote the URL so the shell does not interpret the "?")
    go tool pprof -http=:9090 \
      "http://localhost:6060/debug/pprof/profile?seconds=30"

    # The browser opens the interactive UI at http://localhost:9090
    # Switch to the "Flame Graph" view; sort by "cum" (cumulative) to follow call chains

Continuous profiling in production: tools like Pyroscope (open-source) or Google Cloud Profiler / Datadog Continuous Profiler run a low-overhead sampling agent on every pod 24/7 and store profiles indexed by commit SHA, service, and time. When a latency SLO alert fires, you can diff the flame graph between the current deploy and the previous one: the widening frame is your regression.
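The "widening frame" idea reduces to a per-function subtraction. The sketch below assumes you have already extracted each function's share of CPU samples (in percent) for two deploys; how those shares are obtained depends on the profiling backend and is out of scope here. `Delta` and `DiffShares` are hypothetical names for illustration.

```go
package main

import "sort"

// Delta is the change in CPU share for one function between two profiles.
type Delta struct {
	Func   string
	Before float64 // percent of samples in the previous deploy
	After  float64 // percent of samples in the current deploy
}

// Change is positive when the frame got wider in the current deploy.
func (d Delta) Change() float64 { return d.After - d.Before }

// DiffShares compares per-function CPU shares and returns the deltas sorted
// by largest increase first. A function missing on one side counts as 0 %.
func DiffShares(before, after map[string]float64) []Delta {
	seen := make(map[string]bool)
	var out []Delta
	for fn, a := range after {
		out = append(out, Delta{Func: fn, Before: before[fn], After: a})
		seen[fn] = true
	}
	for fn, b := range before {
		if !seen[fn] {
			out = append(out, Delta{Func: fn, Before: b})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Change() != out[j].Change() {
			return out[i].Change() > out[j].Change()
		}
		return out[i].Func < out[j].Func // stable order for ties
	})
	return out
}
```

The first element of the result is the regression candidate; functions that vanished or shrank sort to the end.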

JVM: async-profiler

async-profiler avoids the safepoint bias of JVMTI-based profilers (JProfiler, YourKit) by using Linux perf_events or AsyncGetCallTrace to sample threads regardless of JVM safepoint state. This is the correct tool for production JVM profiling:

    # async-profiler 3.x ships its launcher as bin/asprof; 2.x releases use ./profiler.sh instead.
    # On a host where the JVM is visible: look up the PID with jps, profile CPU for 30 s
    ./bin/asprof -d 30 -e cpu -f /tmp/flamegraph.html \
      $(jps | grep MyService | awk '{print $1}')

    # In Kubernetes: exec into the pod first
    kubectl exec -it svc-pod-abc123 -n production -- bash

    # Inside the pod (here the JVM runs as PID 1 in its container)
    cd /opt/async-profiler-3.0
    ./bin/asprof -d 30 -e cpu -f /tmp/cpu.html 1
    exit

    # Back on your workstation: copy out the HTML flame graph
    kubectl cp production/svc-pod-abc123:/tmp/cpu.html ./cpu.html

Linux: perf + Brendan Gregg's FlameGraph

For native processes (C, C++, Rust) or kernel-space hot paths (system calls, eBPF programs), perf is the ground truth:

    # Record 30 s across all CPUs at 99 Hz; -g captures call graphs (frame pointers by default)
    perf record -F 99 -a -g -- sleep 30

    # Dump samples, fold stacks, render SVG (the two scripts come from Brendan Gregg's FlameGraph repository)
    perf script | stackcollapse-perf.pl | flamegraph.pl > cpu-flame.svg

    # For a single process built without frame pointers: unwind user stacks from DWARF debug info
    perf record -F 99 --call-graph dwarf -p $(pgrep myservice) -- sleep 30
    perf script | stackcollapse-perf.pl | flamegraph.pl \
      --color=java --title="myservice 30s CPU" > combined-flame.svg

Missing symbols and broken stacks are the most common production pitfall with perf flame graphs. Stripped binaries produce frames labelled [unknown], and optimised builds that omit frame pointers produce truncated stacks. Solutions: (1) compile native code with -fno-omit-frame-pointer (Go has kept frame pointers by default on amd64 since 1.7; add -XX:+PreserveFramePointer for the JVM); (2) install debuginfo packages; (3) use eBPF-based profilers (Parca, Pyroscope eBPF mode) that resolve symbols from DWARF in a sidecar without touching the application binary.

Memory Profiling: Finding Leaks and GC Pressure

When your Prometheus process_resident_memory_bytes climbs 2 % per hour or jvm_gc_pause_seconds spikes under load, reach for allocation profiling:

  • Go heap profile: curl localhost:6060/debug/pprof/heap > heap.pb.gz then go tool pprof -http=:9090 heap.pb.gz. Toggle "alloc_space" (total allocations since start) vs "inuse_space" (current live heap): inuse_space reveals live leaks, alloc_space reveals allocation pressure causing GC churn.
  • JVM: jcmd <pid> VM.native_memory summary for off-heap memory (the JVM must have been started with -XX:NativeMemoryTracking=summary); async-profiler with -e alloc for allocation flame graphs. Heap dumps (jmap -dump:live,format=b,file=heap.hprof <pid>) feed into Eclipse MAT for leak detection.
  • Python: memray (Bloomberg) instruments allocations at the C level: memray run --live myservice.py, or memray attach <pid> for a running process.

Profiling in CI: Performance Gates

Integrating profiling into your CI pipeline prevents regressions from reaching production. A typical GitHub Actions step after the load test job:

    - name: Collect pprof profile during load test
      run: |
        # k6 load test runs in background (started by previous step)
        sleep 10  # let load ramp up
        go tool pprof -top -nodecount=20 \
          "http://localhost:6060/debug/pprof/profile?seconds=20" \
          2>&1 | tee pprof-top.txt
        # Fail if the hottest function exceeds 25% flat (self) CPU.
        # Cumulative % is unsuitable as a gate: entry points such as main sit near 100%.
        # Take flat% (column 2) from the first row after the "flat flat% sum% cum cum%" header.
        HOT=$(awk '$1 == "flat" {hdr = 1; next} hdr {sub(/%/, "", $2); print $2; exit}' pprof-top.txt)
        if awk -v hot="$HOT" 'BEGIN {exit !(hot > 25)}'; then
          echo "FAIL: hot function at ${HOT}% CPU, regression detected"
          exit 1
        fi
    - name: Upload pprof report artifact
      uses: actions/upload-artifact@v4
      with:
        name: pprof-top
        path: pprof-top.txt
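Shell text parsing is easy to get subtly wrong, so a team might prefer to implement the gate as a small Go helper that can be unit-tested. The sketch below assumes the usual `go tool pprof -top` text layout (a header row `flat flat% sum% cum cum%` followed by rows sorted by flat); `HottestFlat` and `ExceedsBudget` are hypothetical helper names, not part of pprof.

```go
package main

import (
	"bufio"
	"errors"
	"strconv"
	"strings"
)

// HottestFlat scans `go tool pprof -top` text output and returns the function
// in the first data row after the column header, together with its flat%.
// With the default sort order that row is the function with the highest flat%.
func HottestFlat(top string) (fn string, flatPct float64, err error) {
	sc := bufio.NewScanner(strings.NewReader(top))
	inTable := false
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if !inTable {
			// Header row: "flat  flat%  sum%  cum  cum%"
			inTable = len(fields) >= 5 && fields[0] == "flat" && fields[1] == "flat%"
			continue
		}
		if len(fields) < 6 {
			continue // not a data row
		}
		pct, perr := strconv.ParseFloat(strings.TrimSuffix(fields[1], "%"), 64)
		if perr != nil {
			return "", 0, perr
		}
		// Function names may contain spaces, so join everything after cum%.
		return strings.Join(fields[5:], " "), pct, nil
	}
	return "", 0, errors.New("no data rows found in pprof -top output")
}

// ExceedsBudget reports whether the hottest function's flat% is over budget.
func ExceedsBudget(top string, budgetPct float64) (bool, error) {
	_, pct, err := HottestFlat(top)
	if err != nil {
		return false, err
	}
	return pct > budgetPct, nil
}
```

Gating on flat rather than cumulative percentage is a deliberate choice here: cumulative figures for entry points are close to 100 % in any healthy service and would trip the gate on every run.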

The Senior Judgement Layer

Profiling data is only as useful as the action it drives. At senior/staff level, the discipline is not just reading the flame graph but translating it into a change with a predicted and measured impact:

  1. Baseline first. Record the flame graph before any change. Without a baseline, you cannot prove a fix actually improved throughput.
  2. One variable at a time. Optimise one hot frame, re-run the load test, re-profile. Combining changes makes regression attribution impossible.
  3. Beware premature micro-optimisation. A function burning 3 % of CPU that takes two hours to fix is a poor trade. Focus on frames over 10–15 % first.
  4. Latency vs. throughput. A CPU optimisation that halves CPU usage may not halve p99 latency if the bottleneck is a downstream call (check your distributed traces from Jaeger/Tempo alongside the flame graph).
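One way to attach a predicted impact to a proposed fix is Amdahl's law, which is brought in here as an illustration rather than something the lesson prescribes: if a frame accounts for a fraction `share` of CPU time and the fix makes it `speedup` times faster, the remaining CPU cost relative to baseline is (1 - share) + share/speedup. The function names below are hypothetical.

```go
package main

// PredictedRemaining estimates CPU cost after a fix, as a fraction of the
// baseline, using Amdahl's law. share is the frame's fraction of CPU time
// (0..1) and speedup is how many times faster the frame becomes (> 0).
func PredictedRemaining(share, speedup float64) float64 {
	return (1 - share) + share/speedup
}

// MeasuredRemaining is the same ratio taken from two load-test runs, for
// example CPU-seconds per 1000 requests before and after the change.
func MeasuredRemaining(baselineCPU, afterCPU float64) float64 {
	return afterCPU / baselineCPU
}
```

For the 34 % N+1 loop from the flame graph, a 10x faster frame predicts about 0.69 of baseline CPU, while even an infinitely fast fix to a 3 % frame cannot go below 0.97. Comparing `PredictedRemaining` with `MeasuredRemaining` after the re-run shows whether the fix did what the flame graph suggested.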
The profiling loop at big tech: automated continuous profiling (Pyroscope / Parca deployed to every cluster) surfaces the top-5 CPU consumers per service in a weekly engineering review. Engineers who own services in the hot list are expected to file a ticket and reduce the cost within two sprints. This makes performance a first-class engineering metric alongside reliability.
