Benchmarking & Measuring Performance
Measuring performance in Java is deceptively hard. The JVM is a highly adaptive runtime: it interprets bytecode, profiles hot paths, compiles them to native code on the fly, garbage-collects, and inlines methods across call-site boundaries — all while your benchmark is running. Ignore these dynamics and your numbers are fiction. This lesson teaches you why naive timing lies, what warm-up is and why it matters, and how the Java Microbenchmark Harness (JMH) solves the problem correctly.
Why Naive Timing Lies
The first instinct of most developers is to wrap code in System.nanoTime() calls and compute the difference. That approach is broken for microbenchmarks in several distinct ways.
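A minimal sketch of that naive pattern, assuming a hypothetical `sumOfSquares` workload (the class and method names are illustrative only):

```java
public class NaiveTiming {
    // Hypothetical workload used only for illustration.
    static long sumOfSquares(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    // Times a single call the naive way: one start stamp, one end stamp.
    static long timeOnceNanos(int n) {
        long start = System.nanoTime();
        sumOfSquares(n); // result is discarded, which is itself a problem
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        System.out.println("elapsed ns: " + timeOnceNanos(1_000));
    }
}
```

The number this prints varies from run to run and mostly reflects cold, unoptimised execution, for the reasons listed next.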
JIT compilation is not instantaneous. When the JVM first encounters a method it interprets it, slowly. With HotSpot's default tiered compilation, a method is typically compiled by C1 after a few hundred invocations and recompiled by C2 with aggressive optimisations after several thousand more; the older non-tiered server compiler used a threshold of roughly 10,000 invocations. The exact thresholds depend on the JVM version, its flags, and loop back-edge counts. If your benchmark runs a method only a few thousand times, some executions are interpreted, some run C1 code, and some run C2 code: the average is meaningless.
Dead-code elimination. If the JIT can prove that a computation's result is never used, it eliminates the computation entirely. A benchmark that computes a sum but throws the result away may measure nothing at all.
Constant folding. A loop body that depends only on compile-time constants may be evaluated once and the loop removed. You measure a no-op.
GC interference. A garbage collection pause mid-measurement inflates your timing. Without controlling GC, successive runs differ by the GC's mood.
OS scheduling jitter. Thread preemption, CPU frequency scaling (turbo boost, power saving modes), and NUMA memory effects all add noise.
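The dead-code elimination hazard described above can be sketched as follows. This is an illustration under assumptions, not a reliable measurement technique: `mix` is a hypothetical pure function, and whether the JIT actually removes the discarded call depends on the JVM and its flags.

```java
public class DeadCodeDemo {
    // Hypothetical pure computation with no side effects.
    static int mix(int x) {
        return (x * 31) ^ (x >>> 3);
    }

    // Result is thrown away: the JIT is allowed to remove the call entirely.
    static long timeDiscarded(int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            mix(i);
        }
        return System.nanoTime() - start;
    }

    // Result feeds a checksum that the caller observes,
    // so the computation cannot simply be dropped.
    static long checksum(int iterations) {
        long sink = 0;
        for (int i = 0; i < iterations; i++) {
            sink += mix(i);
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println("discarded ns: " + timeDiscarded(1_000_000));

        long start = System.nanoTime();
        long sink = checksum(1_000_000);
        long elapsed = System.nanoTime() - start;
        System.out.println("consumed ns: " + elapsed + " (checksum " + sink + ")");
    }
}
```

Feeding the loop variable into `mix` also avoids the constant-folding trap, because the input is not a compile-time constant.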
Understanding Warm-Up
Warm-up is the period during which the JVM transitions a piece of code from interpreted execution to fully optimised native code. The JVM's tiered compilation pipeline has multiple levels:
- Level 0: Pure interpretation.
- Level 1–3: C1 compiler (client compiler) — fast compilation with basic optimisations.
- Level 4: C2 compiler (server compiler) — aggressive speculative optimisations, inlining, escape analysis.
A benchmark should only measure Level 4 steady-state throughput. That means running the code for enough iterations, typically many thousands of calls, before you start recording measurements. The exact number of warm-up iterations needed varies by method complexity and the JVM's profiling decisions.
Consider this deceptive example:
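A minimal sketch of such a comparison, assuming a hypothetical `work` method and arbitrary iteration counts. The first timed block runs cold; the second runs after the same code path has already executed many times:

```java
public class ColdVsWarm {
    // Hypothetical workload; any small pure method would do.
    static long work(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc += Integer.bitCount(i) * 7L;
        }
        return acc;
    }

    // Runs `calls` invocations and returns the average nanoseconds per call.
    // Results are accumulated into sink[0] so they stay observable.
    static double averageNanos(int calls, long[] sink) {
        long start = System.nanoTime();
        for (int c = 0; c < calls; c++) {
            sink[0] += work(1_000);
        }
        return (System.nanoTime() - start) / (double) calls;
    }

    public static void main(String[] args) {
        long[] sink = new long[1];

        // First block: measured cold, mostly interpreted or lightly compiled.
        double cold = averageNanos(100, sink);

        // Warm-up: same code path, timing result ignored.
        averageNanos(50_000, sink);

        // Second block: measured after the JIT has had a chance to optimise.
        double warm = averageNanos(100, sink);

        System.out.printf("cold: %.1f ns/call, warm: %.1f ns/call (sink %d)%n",
                cold, warm, sink[0]);
    }
}
```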
In practice the second block is often 5–20× faster than the first, even on trivial code. Neither number is wrong — they just measure different JVM states. Production code always runs in the steady state; your benchmark should too.
The Java Microbenchmark Harness (JMH)
JMH, developed by the JVM performance engineers at Oracle and distributed via OpenJDK, is the standard tool for writing correct Java microbenchmarks. It handles warm-up, dead-code elimination prevention (via Blackhole and result consumption), forked JVM processes, and statistical aggregation automatically.
Adding JMH to a Maven Project
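A typical setup declares the JMH core library and its annotation processor as dependencies. The fragment below is a sketch: the version number is an assumption, so check for the current release before copying it.

```xml
<dependencies>
  <!-- JMH runtime and annotations -->
  <dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-core</artifactId>
    <version>1.37</version> <!-- assumed version -->
  </dependency>
  <!-- Annotation processor that generates the benchmark harness code -->
  <dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-generator-annprocess</artifactId>
    <version>1.37</version> <!-- assumed version -->
    <scope>provided</scope>
  </dependency>
</dependencies>
```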
A Minimal JMH Benchmark
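The following sketch shows what a small benchmark might look like. The package, the class name `SumBenchmark`, the array size, and the iteration counts are all illustrative assumptions; it needs the JMH dependencies on the classpath to compile.

```java
package com.example.bench; // hypothetical package name

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class SumBenchmark {

    private int[] data;

    // Runs before measurement, so array construction is not timed.
    @Setup
    public void prepare() {
        data = new int[1_024];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
    }

    @Benchmark
    public void sum(Blackhole bh) {
        long value = 0;
        for (int x : data) {
            value += x;
        }
        bh.consume(value); // keeps the result alive so the loop is not eliminated
    }
}
```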
The Blackhole parameter is essential. Without consuming the result, the JIT is free to determine that the computation is unused and eliminate it entirely. bh.consume(value) creates a fake dependency that defeats this optimisation without adding meaningful overhead itself.
Key JMH Annotations Explained
- @BenchmarkMode: Mode.AverageTime, Mode.Throughput, Mode.SampleTime, or Mode.SingleShotTime. Choose based on what matters: average latency, throughput, or percentile distribution.
- @Fork: runs each benchmark in a fresh JVM. This isolates JIT state between benchmarks and prevents one benchmark's profiling decisions from affecting another. Never run with @Fork(0) in production measurements.
- @Warmup / @Measurement: control the warm-up and measurement phases separately. Warm-up iterations are discarded; only measurement iterations contribute to the reported result.
- @State: Scope.Benchmark (shared), Scope.Thread (per-thread copy), Scope.Group (per benchmark group). Determines object sharing in multi-threaded benchmarks.
- @Setup / @TearDown: initialise and clean up state; never placed inside the @Benchmark method.
Running JMH and Reading the Output
Build a fat JAR and run it from the command line:
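As a sketch, assuming the project's shade-plugin configuration produces `target/benchmarks.jar` (the name used by the JMH Maven archetype) and that `SumBenchmark` is a placeholder for a regex matching your benchmark class:

```shell
# Build the self-contained benchmark JAR
mvn clean package

# Run matching benchmarks: 2 forks, 5 warm-up iterations, 5 measurement iterations
java -jar target/benchmarks.jar SumBenchmark -f 2 -wi 5 -i 5
```

Command-line options such as `-f`, `-wi`, and `-i` override the corresponding annotation values for that run.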
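JMH prints a summary table with one row per benchmark and the columns Benchmark, Mode, Cnt, Score, Error, and Units. The Error column holds the ± value.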
The ± value is a 99.9% confidence interval across forks and iterations. A narrow interval means the measurement is stable. A wide interval means there is high variance — more iterations, more forks, or a more isolated machine are needed.
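To see how variance widens the interval, the sketch below computes a mean and an approximate 99.9% half-width for two hypothetical sample sets. It substitutes the normal quantile 3.291 for the Student's t quantile that a small-sample calculation would use, so it understates the interval when there are only a few samples; it is not JMH's own code.

```java
public class ErrorMargin {
    // Arithmetic mean of the samples.
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Sample standard deviation (n - 1 in the denominator).
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double sq = 0;
        for (double x : xs) {
            sq += (x - m) * (x - m);
        }
        return Math.sqrt(sq / (xs.length - 1));
    }

    // Half-width of an approximate 99.9% interval for the mean.
    // ASSUMPTION: normal quantile 3.291 instead of a Student's t quantile.
    static double halfWidth(double[] xs) {
        return 3.291 * stdDev(xs) / Math.sqrt(xs.length);
    }

    public static void main(String[] args) {
        double[] tight = {100, 101, 99, 100, 100}; // hypothetical ns/op samples
        double[] noisy = {100, 140, 70, 120, 70};  // hypothetical ns/op samples
        System.out.printf("tight: %.1f +/- %.1f%n", mean(tight), halfWidth(tight));
        System.out.printf("noisy: %.1f +/- %.1f%n", mean(noisy), halfWidth(noisy));
    }
}
```

Both sample sets have the same mean, but the noisy one yields a far wider interval, which is the signal to add iterations or forks.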
To reduce noise further, pin the benchmark JVM to dedicated cores with taskset (Linux) and disable frequency scaling.
Common Benchmarking Traps to Avoid
- Benchmark loop fusion: if your benchmark method is so fast that each invocation is a few nanoseconds, the JIT may merge iterations and amortise setup costs. Use @OperationsPerInvocation or restructure the method.
- Too few warm-up iterations: complex methods with deep call graphs need more warm-up. Start with at least 5 × 1-second iterations and verify with -prof gc that GC is not interfering.
- Benchmarking the wrong thing: measuring HashMap.get() with String keys constructed inside the benchmark body measures String allocation and hashing, not retrieval alone.
- Ignoring allocation rate: use -prof gc to see bytes allocated per operation. A method that allocates heavily will trigger GC pauses in production even if it looks fast in isolation.
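The "benchmarking the wrong thing" trap can be made concrete with a pair of benchmarks. This is a sketch under assumptions: the class name `MapGetBenchmark`, the key format, and the map size are hypothetical, and it needs the JMH dependencies on the classpath to compile.

```java
import java.util.HashMap;
import java.util.Map;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class MapGetBenchmark {

    private Map<String, Integer> map;
    private String[] keys;
    private int next;

    @Setup
    public void prepare() {
        map = new HashMap<>();
        keys = new String[1_024]; // power of two, so the mask below cycles cleanly
        for (int i = 0; i < keys.length; i++) {
            keys[i] = "key-" + i;
            map.put(keys[i], i);
        }
    }

    // Measures String construction and hashing plus the lookup.
    @Benchmark
    public Integer getWithFreshKey() {
        next = (next + 1) & (keys.length - 1);
        return map.get("key-" + next);
    }

    // Measures the lookup with keys built ahead of time in @Setup.
    @Benchmark
    public Integer getWithPreparedKey() {
        next = (next + 1) & (keys.length - 1);
        return map.get(keys[next]);
    }
}
```

Returning the looked-up value lets JMH consume it, so no explicit Blackhole is needed here. Comparing the two scores, ideally alongside -prof gc output, shows how much of the first benchmark's cost is key construction rather than retrieval.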
Summary
Naive timing with System.nanoTime() produces unreliable results because JIT compilation, dead-code elimination, and GC all distort the measurement. Warm-up is the JVM's transition from interpreted to fully optimised code; measurements taken before warm-up completes reflect interpreter performance, not production performance. JMH addresses these problems through controlled warm-up phases, forked JVM processes, Blackhole result consumption, and statistical aggregation. Use JMH for any benchmark you intend to act on.