Performance & Load Testing

Profiling Applications

Load tests tell you that your service is slow. Profiling tells you why. Without profiling, optimisation is guesswork: engineers spend days tuning connection pools or adding caches while a single hot function in userland burns 60 % of the service's CPU time. This lesson is about finding those hot paths systematically, at the function-call and memory-allocation level, with the same discipline you apply to SLO burn-rate analysis.

The Two Axes: CPU and Memory

Most performance regressions that a profiler can explain sit on one of two axes, or both:

CPU profiling — measures where time is spent: which functions are on-CPU and for how long. Captured by sampling the call stack at a fixed interval (e.g. 100 Hz) or by instrumenting every function entry/exit.
Memory profiling — measures where allocations are made: which call path triggered malloc / GC pressure. Useful when the symptom is OOMKill, high GC pause latency, or steady RSS growth (leak).

In a containerised fleet you will most often reach for CPU profiling first, because CPU saturation is the dominant cause of latency regressions in compute-bound services. Memory profiling follows when RSS or GC metrics from your Prometheus stack surface an anomaly.

Sampling vs. Instrumentation

Sampling profilers (perf, pprof in Go, async-profiler for the JVM, and eBPF-based agents such as Pyroscope's eBPF mode) periodically interrupt execution and record the current call stack. Overhead is low (1–3 % CPU) and safe for brief production use. Instrumentation profilers (Python's cProfile, bytecode-instrumenting JVM profilers) wrap every function call, giving exact call counts but adding 10–40 % overhead; reserve these for staging or controlled canary windows.

Production rule of thumb: a sampling profiler running at 100 Hz for 60 seconds on a single pod adds roughly 1–2 % CPU overhead. That is acceptable during a live incident or a controlled 5 % canary. Never run an instrumentation profiler against 100 % of production traffic.
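
To make the overhead difference concrete, here is a minimal sketch of what an instrumentation profiler does around every call. The tracer type and its methods are hypothetical, written only for this illustration; real instrumenting profilers hook function entry and exit automatically rather than through a manual wrapper.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// stats is what an instrumenting profiler records per function.
type stats struct {
	calls int
	total time.Duration
}

// tracer is a hypothetical, minimal instrumentation profiler.
type tracer struct {
	mu    sync.Mutex
	funcs map[string]*stats
}

func newTracer() *tracer { return &tracer{funcs: make(map[string]*stats)} }

// wrap returns fn with entry/exit bookkeeping added around every call.
// The two clock reads and the lock are the per-call overhead that a
// sampling profiler avoids.
func (t *tracer) wrap(name string, fn func()) func() {
	return func() {
		start := time.Now()
		fn()
		elapsed := time.Since(start)

		t.mu.Lock()
		defer t.mu.Unlock()
		s, ok := t.funcs[name]
		if !ok {
			s = &stats{}
			t.funcs[name] = s
		}
		s.calls++
		s.total += elapsed
	}
}

// calls reports the exact number of recorded invocations of name.
func (t *tracer) calls(name string) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	if s, ok := t.funcs[name]; ok {
		return s.calls
	}
	return 0
}

func main() {
	t := newTracer()
	parse := t.wrap("parse", func() { time.Sleep(time.Millisecond) })
	for i := 0; i < 5; i++ {
		parse()
	}
	fmt.Printf("parse: %d calls recorded\n", t.calls("parse"))
}
```

The call count is exact, which a sampler can never guarantee, but the bookkeeping cost is paid on every invocation. For a function called millions of times per second, that cost dominates the function itself.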

Flame Graphs: Reading the Hot Path

A flame graph (invented by Brendan Gregg) collapses thousands of stack samples into a single image. The x-axis is sample count: a frame's width is its share of CPU time, and left-to-right order is alphabetical rather than chronological. The y-axis is call depth (bottom = entry point, top = leaf). The widest frames at the top of a tower are the hot path: the code spending the most CPU time without calling anything else.

Figure: a flame graph reveals an N+1 database loop consuming 34 % of CPU. The widest red frame at the top of the hot tower is the optimisation target.

The key insight: you do not optimise the wide frames at the bottom (those are your framework's request dispatcher, wide only because everything else runs beneath them, and you cannot change them). You optimise the wide frames at the top of a tower, because those are leaf functions that the CPU is actually executing. Height only shows call depth: a tall, narrow tower is deep but cheap.
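
The distinction between wide bottom frames and wide top frames is the difference between cumulative and self sample counts. The sketch below computes both from folded stacks, the "frame;frame;frame count" text format that stackcollapse scripts emit. The sample data and function names are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// frameCost holds sample counts for one function.
type frameCost struct {
	Name string
	Self int // samples where the function was the leaf (actually on-CPU)
	Cum  int // samples where the function appeared anywhere in the stack
}

// aggregate parses folded stacks ("a;b;c 12" per line) into per-frame
// costs sorted by Self, descending. It also returns the total sample count.
func aggregate(folded string) ([]frameCost, int, error) {
	costs := map[string]*frameCost{}
	total := 0
	for _, line := range strings.Split(folded, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		i := strings.LastIndex(line, " ")
		if i < 0 {
			return nil, 0, fmt.Errorf("malformed line %q", line)
		}
		n, err := strconv.Atoi(line[i+1:])
		if err != nil {
			return nil, 0, fmt.Errorf("bad count in %q: %w", line, err)
		}
		total += n
		frames := strings.Split(line[:i], ";")
		seen := map[string]bool{} // count a recursive frame once per stack
		for j, f := range frames {
			c, ok := costs[f]
			if !ok {
				c = &frameCost{Name: f}
				costs[f] = c
			}
			if !seen[f] {
				c.Cum += n
				seen[f] = true
			}
			if j == len(frames)-1 {
				c.Self += n
			}
		}
	}
	out := make([]frameCost, 0, len(costs))
	for _, c := range costs {
		out = append(out, *c)
	}
	sort.Slice(out, func(a, b int) bool {
		if out[a].Self != out[b].Self {
			return out[a].Self > out[b].Self
		}
		return out[a].Name < out[b].Name
	})
	return out, total, nil
}

// sample is invented data: 100 stack samples from a hypothetical service.
const sample = `
main;serve;listOrders;loadCustomer;db.Query 34
main;serve;listOrders;render 20
main;serve;listOrders 6
main;serve;auth 25
main;serve 15
`

func main() {
	costs, total, err := aggregate(sample)
	if err != nil {
		panic(err)
	}
	for _, c := range costs {
		fmt.Printf("%-14s self %5.1f%%  cum %5.1f%%\n", c.Name,
			100*float64(c.Self)/float64(total), 100*float64(c.Cum)/float64(total))
	}
}
```

In this data main and serve have the highest cumulative share but little or no self time, while db.Query tops the self ranking. That is the same reading a flame graph gives visually.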

Go: pprof in Production

Go ships net/http/pprof in the standard library. Import it and expose the debug endpoint (behind an internal-only ingress or network policy, never on the public port):

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

// In main(), serve the default mux on an internal-only port (6060, not 8080):
go func() {
    log.Fatal(http.ListenAndServe(":6060", nil))
}()

Capture a 30-second CPU profile and generate a flame graph locally:

# Port-forward to the pod under load (blocks; run it in a separate terminal)
kubectl port-forward pod/api-6d8f9b-xkz2p 6060:6060 -n production

# Collect a 30 s CPU profile (quote the URL so the shell leaves "?" alone)
go tool pprof -http=:9090 \
  "http://localhost:6060/debug/pprof/profile?seconds=30"

# The browser opens the interactive pprof UI at http://localhost:9090
# Choose VIEW > Flame Graph for the hot path; use the Top view sorted by "cum" (cumulative) for call chains

Continuous profiling in production: tools like Pyroscope (open-source) or Google Cloud Profiler / Datadog Continuous Profiler run a low-overhead sampling agent on every pod 24/7 and store profiles indexed by commit SHA, service, and time. When a latency SLO alert fires, you can diff the flame graph between the current deploy and the previous one — the widening frame is your regression.
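
The diff itself is simple arithmetic once each profile is reduced to samples per function. The following is a rough sketch of that comparison with invented sample counts; delta, share and diffProfiles are hypothetical names, and real continuous-profiling tools perform this comparison on full call trees rather than flat maps.

```go
package main

import (
	"fmt"
	"sort"
)

// delta describes how one function's share of CPU samples changed.
type delta struct {
	Name   string
	Before float64 // percent of samples in the baseline profile
	After  float64 // percent of samples in the current profile
}

// share converts raw sample counts into percentages of the profile total,
// so profiles of different durations can be compared.
func share(samples map[string]int) map[string]float64 {
	total := 0
	for _, n := range samples {
		total += n
	}
	out := make(map[string]float64, len(samples))
	if total == 0 {
		return out
	}
	for name, n := range samples {
		out[name] = 100 * float64(n) / float64(total)
	}
	return out
}

// diffProfiles returns every function seen in either profile, ordered by
// the largest increase in share first.
func diffProfiles(before, after map[string]int) []delta {
	b, a := share(before), share(after)
	names := map[string]bool{}
	for n := range b {
		names[n] = true
	}
	for n := range a {
		names[n] = true
	}
	out := make([]delta, 0, len(names))
	for n := range names {
		out = append(out, delta{Name: n, Before: b[n], After: a[n]})
	}
	sort.Slice(out, func(i, j int) bool {
		di, dj := out[i].After-out[i].Before, out[j].After-out[j].Before
		if di != dj {
			return di > dj
		}
		return out[i].Name < out[j].Name
	})
	return out
}

func main() {
	// Invented self-sample counts for two deploys of the same service.
	previous := map[string]int{"json.Marshal": 300, "db.Query": 400, "render": 300}
	current := map[string]int{"json.Marshal": 300, "db.Query": 1100, "render": 300}

	for _, d := range diffProfiles(previous, current) {
		fmt.Printf("%-13s %5.1f%% -> %5.1f%% (%+.1f)\n", d.Name, d.Before, d.After, d.After-d.Before)
	}
}
```

Comparing shares rather than raw counts matters because the two profiles rarely cover the same number of samples.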

JVM: async-profiler

async-profiler avoids the safepoint bias of JVMTI-based profilers (JProfiler, YourKit) by using Linux perf_events or AsyncGetCallTrace to sample threads regardless of JVM safepoint state. This is the correct tool for production JVM profiling:

# Attach to a running JVM (PID looked up with jps)
# async-profiler 2.x ships profiler.sh; 3.0 and later replace it with bin/asprof
./profiler.sh -d 30 -e cpu -f /tmp/flamegraph.html \
  $(jps | grep MyService | awk '{print $1}')

# In Kubernetes: exec into the pod first
kubectl exec -it svc-pod-abc123 -n production -- bash
cd /opt/async-profiler-3.0
./bin/asprof -d 30 -e cpu -f /tmp/cpu.html 1   # PID 1 = the JVM in a single-process container

# Copy out the HTML flame graph
kubectl cp production/svc-pod-abc123:/tmp/cpu.html ./cpu.html

Linux: perf + Brendan Gregg's FlameGraph

For native processes (C, C++, Rust) or kernel-space hot paths (system calls, eBPF programs), perf is the ground truth:

# Record 30 s, all CPUs, call graphs via frame pointers
perf record -F 99 -a -g -- sleep 30

# Generate perf script output, fold stacks, render SVG
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu-flame.svg

# Single process with DWARF-based unwinding (for binaries built without frame pointers)
perf record -F 99 --call-graph dwarf -p "$(pgrep -o myservice)" -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl \
  --title="myservice 30s CPU" > combined-flame.svg

Missing symbols in perf flame graphs are the most common production pitfall. Stripped binaries produce frames labelled [unknown], and optimised builds that omit frame pointers produce truncated or broken stacks. Solutions: (1) compile with -fno-omit-frame-pointer (Go has kept frame pointers by default on x86-64 since 1.7; add -XX:+PreserveFramePointer for the JVM); (2) install debuginfo packages; (3) use eBPF-based profilers (Parca, Pyroscope eBPF mode) that resolve symbols from DWARF in a sidecar without touching the application binary.

Memory Profiling: Finding Leaks and GC Pressure

When your Prometheus process_resident_memory_bytes climbs 2 % per hour or jvm_gc_pause_seconds spikes under load, reach for allocation profiling:

Go heap profile: curl localhost:6060/debug/pprof/heap > heap.pb.gz then go tool pprof -http=:9090 heap.pb.gz. Toggle "alloc_space" (total allocations since start) vs "inuse_space" (current live heap) — inuse_space reveals live leaks, alloc_space reveals allocation pressure causing GC churn.
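JVM: jcmd <pid> VM.native_memory summary for off-heap memory (the JVM must have been started with -XX:NativeMemoryTracking=summary); async-profiler with -e alloc for allocation flame graphs. Heap dumps (jmap -dump:live,format=b,file=heap.hprof <pid>) feed into Eclipse MAT for leak detection.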
Python: memray (Bloomberg) instruments allocations at the C level — memray run --live myservice.py or memray attach <pid> for a running process.
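
The alloc_space versus inuse_space distinction can be reproduced in a few lines with runtime.MemStats, where TotalAlloc plays the role of alloc_space and HeapAlloc after a GC approximates inuse_space. This is an illustrative sketch only: churn, leak and measure are hypothetical helpers, and the buffer sizes are arbitrary.

```go
package main

import (
	"fmt"
	"runtime"
)

const chunk = 1 << 20 // 1 MiB

var (
	scratch  []byte   // package-level, so each buffer is heap-allocated
	retained [][]byte // simulated leak: buffers stay reachable
)

// churn allocates n short-lived buffers; each becomes garbage as soon as
// the next one overwrites scratch.
func churn(n int) {
	for i := 0; i < n; i++ {
		scratch = make([]byte, chunk)
	}
	scratch = nil
}

// leak allocates n buffers and keeps every one of them reachable.
func leak(n int) {
	for i := 0; i < n; i++ {
		retained = append(retained, make([]byte, chunk))
	}
}

// measure runs f and reports the growth in cumulative allocated bytes
// (alloc_space analogue) and in live heap after a GC (inuse_space analogue).
func measure(f func()) (allocated uint64, live int64) {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)
	f()
	runtime.GC()
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc,
		int64(after.HeapAlloc) - int64(before.HeapAlloc)
}

func main() {
	a, l := measure(func() { churn(50) })
	fmt.Printf("churn: allocated %d MiB, live heap grew %d MiB\n", a/chunk, l/chunk)

	a, l = measure(func() { leak(20) })
	fmt.Printf("leak:  allocated %d MiB, live heap grew %d MiB\n", a/chunk, l/chunk)
}
```

The churn case shows heavy allocation with almost no live growth, which is the GC-pressure pattern. The leak case shows live heap growing in step with allocation, which is what inuse_space surfaces.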

Profiling in CI: Performance Gates

Integrating profiling into your CI pipeline prevents regressions from reaching production. Typical GitHub Actions steps, placed after the step that starts the load test:

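- name: Collect pprof profile during load test
  run: |
    # k6 load test runs in background (started by previous step)
    sleep 10  # let load ramp up
    go tool pprof -top -nodecount=20 \
      "http://localhost:6060/debug/pprof/profile?seconds=20" \
      > pprof-top.txt
    cat pprof-top.txt

    # Fail if any single function exceeds 25% flat (self) CPU.
    # Data rows follow the "flat  flat%" header; -top sorts by flat, so the
    # first data row is the hottest function and its flat% is column 2.
    HOT=$(awk 'found { gsub("%", "", $2); print $2; exit } $1 == "flat" { found = 1 }' pprof-top.txt)
    # awk handles the decimal comparison; [ -gt ] only accepts integers
    if awk -v hot="$HOT" 'BEGIN { exit !(hot > 25) }'; then
      echo "FAIL: hot function at ${HOT}% flat CPU, regression detected"
      exit 1
    fi

- name: Upload pprof report artifact
  if: always()  # keep the report even when the gate fails
  uses: actions/upload-artifact@v4
  with:
    name: pprof-top
    path: pprof-top.txt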
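
Shell parsing of profiler output is fragile, so a team might prefer to express the same gate as a small Go program. The sketch below assumes the usual column layout of go tool pprof -top (flat, flat%, sum%, cum, cum%, name). The report text is an invented sample, not captured output, and hottestFlat and gate are hypothetical names.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// hottestFlat scans `go tool pprof -top` text and returns the first data
// row's function name and flat%. With the default sort order that row is
// the function with the highest self CPU.
func hottestFlat(report string) (string, float64, error) {
	sc := bufio.NewScanner(strings.NewReader(report))
	inTable := false
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if !inTable {
			// Header row: flat  flat%  sum%  cum  cum%
			inTable = len(fields) >= 5 && fields[0] == "flat" && fields[1] == "flat%"
			continue
		}
		if len(fields) < 6 {
			continue
		}
		pct, err := strconv.ParseFloat(strings.TrimSuffix(fields[1], "%"), 64)
		if err != nil {
			return "", 0, fmt.Errorf("bad flat%% in %q: %w", sc.Text(), err)
		}
		return strings.Join(fields[5:], " "), pct, nil
	}
	return "", 0, fmt.Errorf("no data rows found")
}

// gate returns an error when the hottest function exceeds limit percent.
func gate(report string, limit float64) error {
	name, pct, err := hottestFlat(report)
	if err != nil {
		return err
	}
	if pct > limit {
		return fmt.Errorf("%s uses %.2f%% flat CPU (limit %.0f%%)", name, pct, limit)
	}
	return nil
}

// sampleReport is invented text in the shape of pprof's -top output.
const sampleReport = `File: api
Type: cpu
Duration: 20s, Total samples = 15.20s (76.00%)
Showing nodes accounting for 12.10s, 79.61% of 15.20s total
      flat  flat%   sum%        cum   cum%
     5.17s 34.01% 34.01%      5.30s 34.87%  main.loadCustomer
     2.10s 13.82% 47.83%      2.10s 13.82%  runtime.mallocgc
`

func main() {
	if err := gate(sampleReport, 25); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("OK: no function above the limit")
}
```

The gate uses flat% rather than cum% deliberately: entry-point frames such as the HTTP server loop always sit near 100 % cumulative, so a cumulative threshold would fail every build.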

The Senior Judgement Layer

Profiling data is only as useful as the action it drives. At senior/staff level, the discipline is not just reading the flame graph but translating it into a change with a predicted and measured impact:

Baseline first. Record the flame graph before any change. Without a baseline, you cannot prove a fix actually improved throughput.
One variable at a time. Optimise one hot frame, re-run the load test, re-profile. Combining changes makes regression attribution impossible.
Beware premature micro-optimisation. A function burning 3 % of CPU that takes two hours to fix is a poor trade. Focus on frames over 10–15 % first.
Latency vs. throughput. A CPU optimisation that halves CPU usage may not halve p99 latency if the bottleneck is a downstream call (check your distributed traces from Jaeger/Tempo alongside the flame graph).
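
The "predicted impact" part of that loop can be estimated before writing any code, using Amdahl's law: if a frame holds a fraction s of CPU samples and the fix makes it k times faster, total CPU drops by roughly s × (1 − 1/k). The sketch below encodes that estimate under the assumption that the rest of the profile is unchanged; predictedSaving is a hypothetical helper and the figures are illustrative.

```go
package main

import "fmt"

// predictedSaving estimates the fraction of total CPU freed when a frame
// holding `share` of samples (0..1) is made `speedup` times faster.
// It assumes the rest of the profile is unaffected (Amdahl's law).
func predictedSaving(share, speedup float64) float64 {
	if speedup <= 0 {
		return 0
	}
	return share * (1 - 1/speedup)
}

func main() {
	// A frame at 34 % of CPU, made 4x faster.
	fmt.Printf("34%% frame, 4x faster:  saves %.1f%% of CPU\n", 100*predictedSaving(0.34, 4))
	// A frame at 3 % of CPU, made 10x faster: capped by its small share.
	fmt.Printf(" 3%% frame, 10x faster: saves %.1f%% of CPU\n", 100*predictedSaving(0.03, 10))
}
```

Recording this number in the ticket before the change, then comparing it with the re-profiled result, is what turns a flame graph reading into a measured improvement.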

The profiling loop at big tech: automated continuous profiling (Pyroscope / Parca deployed to every cluster) surfaces the top-5 CPU consumers per service in a weekly engineering review. Engineers who own services in the hot list are expected to file a ticket and reduce the cost within two sprints. This makes performance a first-class engineering metric alongside reliability.