JVM Internals & Performance

Project: Diagnosing a Performance Problem

This capstone lesson walks you through a complete, realistic performance investigation: a service that starts slow, degrades under load, and eventually runs out of memory. You will apply every skill from this tutorial — profiling, GC analysis, JIT awareness, and systematic benchmarking — to find the root causes and fix them.

The Scenario

A team reports that their ReportService takes 12 seconds to generate a report with 50,000 rows and that heap usage climbs with every call. Your job is to diagnose and fix the problem without guessing. The starting code looks like this:

    // BEFORE: the slow, leaky version
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ReportService {
        // Static and unbounded: entries live as long as the class does
        private static final Map<String, List<Row>> cache = new HashMap<>();

        public String generateReport(List<Row> rows, String format) {
            String key = format + rows.hashCode();
            if (cache.containsKey(key)) {
                // A "hit" yields rows, not a finished report, so the work is redone anyway
                rows = cache.get(key);
            }
            StringBuilder sb = new StringBuilder();
            for (Row row : rows) {
                String line = "";
                for (int i = 0; i < row.columns().size(); i++) {
                    line = line + row.columns().get(i);
                    if (i < row.columns().size() - 1) {
                        line = line + ",";
                    }
                }
                sb.append(line).append("\n");
            }
            String result = sb.toString();
            cache.put(key, List.of(rows.toArray(new Row[0]))); // BUG: caches rows not result
            return result;
        }
    }

Never start with guesses. The code above has at least four distinct problems. Without measurement, you might fix the wrong one first and waste hours. Always profile before you patch.

Step 1 — Establish a Baseline Benchmark

Before touching anything, write a JMH benchmark so you have a repeatable, JIT-warmed number to compare against.

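    import java.util.List;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;
    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    @State(Scope.Benchmark)
    @Warmup(iterations = 3, time = 1)
    @Measurement(iterations = 5, time = 2)
    @Fork(1)
    public class ReportBenchmark {
        private List<Row> rows;
        private ReportService svc;

        @Setup
        public void setUp() {
            // ReportService is not a JMH @State class, so create it here
            // rather than asking JMH to inject it as a method parameter
            svc = new ReportService();
            rows = IntStream.range(0, 50_000)
                .mapToObj(i -> new Row(List.of("col1_" + i, "col2_" + i, "col3_" + i)))
                .collect(Collectors.toList());
        }

        @Benchmark
        public String generate() {
            // Returning the result keeps the JIT from eliminating the call as dead code
            return svc.generateReport(rows, "csv");
        }
    }
    // Baseline result: avg 11,843 ms/op, GC overhead: 38%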

Step 2 — Attach a Profiler and Read the Flame Graph

Run the benchmark with async-profiler (or JFR) to collect a CPU flame graph:

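    # Record a JFR profile during the benchmark.
    # @Fork(1) runs the benchmark in a child JVM, so pass the flag through
    # JMH's -jvmArgsAppend option instead of giving it to the launcher JVM.
    # The old -XX:+FlightRecorder switch is deprecated and not needed.
    java -jar benchmarks.jar ReportBenchmark \
      -jvmArgsAppend "-XX:StartFlightRecording=filename=report.jfr,duration=30s"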

Open the .jfr file in JDK Mission Control. The flame graph reveals three hot paths:

  1. String concatenation inside the inner loop — 54% of CPU time.
  2. rows.hashCode() on a 50,000-element list — called on every invocation — 28% of CPU time.
  3. HashMap.put with growing allocations — 11% of CPU time.
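The second hot path is easy to underestimate because `rows.hashCode()` reads like a constant-time call. The following is a minimal sketch, using a hypothetical `CountingElement` class that is not part of the service, of how one could confirm that hashing a list visits every element on every call:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical element type that counts how often its hashCode() is invoked.
class CountingElement {
    static int hashCalls = 0;
    private final int id;

    CountingElement(int id) { this.id = id; }

    @Override
    public int hashCode() {
        hashCalls++;
        return Integer.hashCode(id);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof CountingElement && ((CountingElement) other).id == id;
    }
}

class ListHashCost {
    // Returns how many element hashCode() calls a single list.hashCode() triggers.
    static int hashCallsFor(int size) {
        List<CountingElement> list = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            list.add(new CountingElement(i));
        }
        CountingElement.hashCalls = 0;
        list.hashCode(); // defined over all elements, so the cost is O(n)
        return CountingElement.hashCalls;
    }

    public static void main(String[] args) {
        System.out.println("element hashCode() calls: " + hashCallsFor(50_000));
    }
}
```

A list does not cache its hash, so this cost is paid again on every `generateReport` call, which is consistent with the share of CPU time the flame graph attributes to it.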

Step 3 — Analyse the Heap with a Heap Dump

Trigger a heap dump after ten report generations and open it in Eclipse MAT or VisualVM:

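    // Force a heap dump programmatically (useful in tests)
    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    HotSpotDiagnosticMXBean mxBean = ManagementFactory.newPlatformMXBeanProxy(
        ManagementFactory.getPlatformMBeanServer(),
        "com.sun.management:type=HotSpotDiagnostic",
        HotSpotDiagnosticMXBean.class);
    // live = true dumps only reachable objects.
    // Throws IOException, for example when heap.hprof already exists.
    mxBean.dumpHeap("heap.hprof", true);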

MAT's "Dominator Tree" shows ReportService.cache retaining 480 MB. It holds List<Row> objects, not the formatted strings. The cache key is based on rows.hashCode(), and because Row does not override hashCode(), that value is derived from object identity rather than content. Every freshly built list of rows therefore produces a new key, so the cache never actually hits and only grows. This is the memory leak.
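The key instability can be reproduced in a few lines. This is a sketch under the assumption stated above, namely that the row type inherits `hashCode()` and `equals()` from `Object`; `PlainRow`, `KeyStability`, and `load()` are hypothetical names used only for illustration:

```java
import java.util.List;

// Hypothetical row type WITHOUT equals()/hashCode() overrides.
class PlainRow {
    private final List<String> columns;

    PlainRow(List<String> columns) { this.columns = columns; }

    List<String> columns() { return columns; }
}

class KeyStability {
    // Builds the key the same way the original service does.
    static String identityKey(String format, List<PlainRow> rows) {
        return format + rows.hashCode();
    }

    // Simulates a request that loads the same data into fresh objects.
    static List<PlainRow> load() {
        return List.of(new PlainRow(List.of("a", "b")), new PlainRow(List.of("c", "d")));
    }

    public static void main(String[] args) {
        List<PlainRow> first = load();
        List<PlainRow> second = load();
        // Same column values, but the rows are distinct objects, so the lists are not equal.
        System.out.println("lists equal: " + first.equals(second));
        // Identity hashes normally differ between objects, so the keys normally differ too.
        System.out.println("keys equal:  "
            + identityKey("csv", first).equals(identityKey("csv", second)));
    }
}
```

Each call therefore inserts a new entry instead of finding the old one, which matches the growth seen in the dominator tree.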

A static unbounded Map is one of the most common memory leaks in Java services. Every entry added during load testing stays in memory until the JVM crashes or the map is cleared. Use LinkedHashMap with a max-size eviction policy or a proper cache like Caffeine.
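For the LinkedHashMap option, one possible shape is sketched below. The class name `LruCache` and the limit of two entries in the demo are illustrative choices, not part of the original service:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A size-bounded cache that evicts the least recently used entry.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently accessed
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true drops the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a", "report A");
        cache.put("b", "report B");
        cache.get("a");             // touch "a" so that "b" becomes the eldest entry
        cache.put("c", "report C"); // exceeds the limit and evicts "b"
        System.out.println(cache.keySet());
    }
}
```

LinkedHashMap is not thread-safe, and with access ordering even `get` modifies the map, so a shared instance would need to be wrapped with `Collections.synchronizedMap` or guarded by a lock.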

Step 4 — Fix the Problems One at a Time

Fix each problem in isolation so you can measure the impact of each change independently.

Fix 1 — Replace implicit String concatenation with a dedicated StringBuilder inside the inner loop:
    // Before (allocates a new String object on every iteration):
    line = line + row.columns().get(i);

    // After (one StringBuilder per row, no intermediate Strings):
    List<String> cols = row.columns();
    StringBuilder rowBuilder = new StringBuilder();
    for (int i = 0; i < cols.size(); i++) {
        if (i > 0) rowBuilder.append(',');
        rowBuilder.append(cols.get(i));
    }
    sb.append(rowBuilder).append('\n'); // append char literal, not String

Benchmark after Fix 1: 4,210 ms/op — a 64% reduction from concatenation alone.

Fix 2 — Remove the broken cache or replace it with a bounded, correct one:
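    // Bounded cache using Caffeine (size-based and time-based eviction)
    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;
    import java.time.Duration;

    private final Cache<String, String> cache = Caffeine.newBuilder()
        .maximumSize(100)
        .expireAfterWrite(Duration.ofMinutes(10))
        .build();

    // In generateReport: derive the key from content, not from object identity.
    // List<String>.hashCode() depends on the strings themselves, unlike the
    // identity hash of a Row. Caveat: only the first row is sampled, so two
    // different reports with the same format, size, and first row share a key.
    // Assumes rows is non-empty.
    String key = format + "-" + rows.size() + "-" + rows.get(0).columns().hashCode();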

Benchmark after Fix 2: 3,980 ms/op. Memory stabilises at ~20 MB regardless of how many calls are made.
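Where a cheap sampled key is too risky, a key can be derived from the full content instead. The sketch below is one way to do that with a SHA-256 digest; `ReportKeys` and `contentKey` are hypothetical names, rows are modelled as plain `List<List<String>>` to keep the example self-contained, and the separator bytes assume that column values do not themselves contain the control characters 0x1E or 0x1F:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

class ReportKeys {
    // Returns a hex SHA-256 digest over the format and every column value.
    static String contentKey(String format, List<List<String>> rows) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            sha.update(format.getBytes(StandardCharsets.UTF_8));
            for (List<String> cols : rows) {
                sha.update((byte) 0x1E); // record separator before each row
                for (String col : cols) {
                    sha.update(col.getBytes(StandardCharsets.UTF_8));
                    sha.update((byte) 0x1F); // unit separator after each column
                }
            }
            StringBuilder hex = new StringBuilder(64);
            for (byte b : sha.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(List.of("a", "b"), List.of("c", "d"));
        System.out.println(contentKey("csv", rows));
    }
}
```

This still reads every value on each call, so it trades CPU time for correctness. If callers can supply a stable report or query identifier, using that as the key avoids the scan entirely.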

Fix 3 — Pre-size the outer StringBuilder:
    // Avoid repeated internal array copies
    int estimatedCapacity = rows.size() * 64; // ~64 chars per row on average
    StringBuilder sb = new StringBuilder(estimatedCapacity);

Benchmark after Fix 3: 3,410 ms/op.
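The effect of pre-sizing can be observed directly through `StringBuilder.capacity()`. The following sketch, with the hypothetical helper `growthsWhileAppending`, counts how often the internal buffer is reallocated while appending the same number of characters with and without an initial capacity:

```java
class PresizeDemo {
    // Appends `chars` characters and counts how often the internal buffer had to grow.
    static int growthsWhileAppending(StringBuilder sb, int chars) {
        int growths = 0;
        int capacity = sb.capacity();
        for (int i = 0; i < chars; i++) {
            sb.append('x');
            if (sb.capacity() != capacity) {
                // The capacity changed: a larger array was allocated and the contents copied
                growths++;
                capacity = sb.capacity();
            }
        }
        return growths;
    }

    public static void main(String[] args) {
        int chars = 50_000 * 64; // the Fix 3 estimate: rows * ~64 chars per row
        System.out.println("default:  " + growthsWhileAppending(new StringBuilder(), chars));
        System.out.println("presized: " + growthsWhileAppending(new StringBuilder(chars), chars));
    }
}
```

Each growth copies everything written so far, so the later reallocations are the expensive ones. An estimate that is too low still helps, because it skips the many small early copies.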

Fix 4 — Stream the output instead of building one giant String:
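    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public void writeReport(List<Row> rows, String format, OutputStream out) throws IOException {
        // try-with-resources flushes and closes the writer, which also closes `out`
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8), 65_536)) {
            for (Row row : rows) {
                List<String> cols = row.columns();
                for (int i = 0; i < cols.size(); i++) {
                    if (i > 0) writer.write(',');
                    writer.write(cols.get(i));
                }
                // Fixed "\n" to match generateReport; newLine() would use the platform separator
                writer.write('\n');
            }
        }
    }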

Streaming avoids materialising the full result in memory. For a 50,000-row report this eliminates a 4 MB intermediate allocation. Benchmark after all four fixes: 680 ms/op, roughly a 17x improvement over the original 11,843 ms.
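A streaming method is also straightforward to test, because the caller chooses the sink. The sketch below is a self-contained variant for illustration only: `StreamingCsv`, `writeCsv`, and `render` are hypothetical names, rows are modelled as `List<String>` values, and unlike `writeReport` this version flushes but leaves the stream open for the caller to close:

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;

class StreamingCsv {
    // Writes rows as CSV straight to `out` without building the whole report first.
    static void writeCsv(Iterable<List<String>> rows, OutputStream out) throws IOException {
        Writer writer = new BufferedWriter(
            new OutputStreamWriter(out, StandardCharsets.UTF_8), 65_536);
        for (List<String> cols : rows) {
            for (int i = 0; i < cols.size(); i++) {
                if (i > 0) writer.write(',');
                writer.write(cols.get(i));
            }
            writer.write('\n');
        }
        writer.flush(); // push buffered characters to `out`; the caller owns and closes it
    }

    // Test helper: captures the streamed bytes in memory and decodes them.
    static String render(Iterable<List<String>> rows) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            writeCsv(rows, sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(sink.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(render(List.of(List.of("a", "b"), List.of("c", "d"))));
    }
}
```

In a service the sink would typically be a response or file stream rather than a byte array, so memory use stays bounded by the 64 KB writer buffer instead of the report size.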

Step 5 — Verify Under Load

A single-threaded benchmark is not the whole story. Verify with concurrent load using a simple executor-based stress test:

    ExecutorService pool = Executors.newFixedThreadPool(20);
    List<Future<Long>> futures = new ArrayList<>();
    for (int i = 0; i < 200; i++) {
        futures.add(pool.submit(() -> {
            long start = System.nanoTime();
            service.generateReport(rows, "csv");
            return System.nanoTime() - start;
        }));
    }
    // Collect all latencies (in nanoseconds) and sort them for percentile lookup
    long[] sorted = futures.stream()
        .mapToLong(f -> {
            try { return f.get(); }
            catch (Exception e) { throw new RuntimeException(e); }
        })
        .sorted()
        .toArray();
    pool.shutdown();

    // Nearest-rank percentiles; an average would hide the slow tail
    long p50 = sorted[(int) Math.ceil(0.50 * sorted.length) - 1];
    long p99 = sorted[(int) Math.ceil(0.99 * sorted.length) - 1];
    long max = sorted[sorted.length - 1];
    System.out.printf("p50=%.1f ms, p99=%.1f ms, max=%.1f ms%n",
        p50 / 1e6, p99 / 1e6, max / 1e6);

Always measure percentiles, not just averages. A p99 latency spike that the average hides is what your users actually experience under load. For production-grade measurement, use HdrHistogram or JMH's Mode.SampleTime, which reports latency percentiles.

Step 6 — Document the Investigation

Every professional performance fix should be accompanied by a brief write-up covering: the observed symptom, the profiling evidence that pointed to each root cause, the fix applied, and the before/after numbers. This creates institutional memory and gives future code reviewers what they need to stop the same regression from creeping back in.

Lessons Learned from This Project

  • Measure first, fix second. The flame graph revealed that inner-loop string concatenation was responsible for more than half of CPU time — something no code review would have quantified.
  • Heap dumps expose leaks that metrics miss. The cache was growing invisibly; only the dominator tree made it clear.
  • Fix one thing at a time. If you apply all four fixes simultaneously you cannot tell which one delivered the most value.
  • Streaming beats materialising for large outputs. Avoiding the large intermediate String was the single biggest remaining win after fixing the allocation hot spot.
  • Bounded caches are mandatory. Any cache without a size cap is a memory leak waiting to happen in a long-running service.

Summary

A systematic performance investigation follows a repeatable workflow: baseline benchmark → CPU flame graph → heap analysis → targeted fixes measured individually → concurrent load verification → written record. The tools change (JFR, async-profiler, MAT, JMH) but the process is always the same. Master the process and you will find the problem every time.