Metrics & Monitoring

18 min Lesson 9 of 12

Distributed tracing tells you where time was spent in a single request. Metrics tell you how healthy your service is right now — and over time. In production, you cannot attach a debugger or read every log line; you need dashboards that summarise thousands of requests per second into numbers you can act on. This lesson covers Micrometer, the metrics facade built into Spring Boot 3, and how it feeds data into Prometheus (the collection backend) and Grafana (the visualisation layer).

The Metrics Stack in One Sentence

Your Spring Boot service exposes a /actuator/prometheus endpoint; a Prometheus server scrapes that endpoint on a schedule; Grafana queries Prometheus and renders dashboards. Each component has exactly one job, and the decoupling means you can swap any layer independently.

Why Micrometer? Micrometer is to metrics what SLF4J is to logging — a vendor-neutral facade. You write counter.increment() once; Micrometer translates it to Prometheus format, CloudWatch, Datadog, InfluxDB, or any other registry you add to the classpath. Your application code never imports a Prometheus class directly.

Adding the Dependencies

Spring Boot Actuator ships Micrometer core. To export Prometheus-format metrics add one more dependency:

<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <!-- version managed by Spring Boot BOM -->
</dependency>

Then expose the endpoint in application.yml:

management:
  endpoints:
    web:
      exposure:
        include: health, info, prometheus, metrics
  metrics:
    tags:
      application: ${spring.application.name}   # global tag on every metric

The application tag propagates your service name to every metric automatically, which is essential when multiple services share one Prometheus instance.

Never expose /actuator/prometheus to the public internet. The endpoint reveals internal counters, thread pool sizes, database pool stats, and JVM memory breakdown — a gold mine for an attacker doing reconnaissance. Protect it with Spring Security, a dedicated management port (management.server.port: 9090) reachable only inside your cluster, or a network-level firewall rule. Prometheus should scrape from an internal network, not through your public load balancer.

The Four Core Metric Types

Micrometer provides four fundamental instruments. Choosing the right one matters because Prometheus aggregates them differently:

Counter: a value that only goes up. Use it for events: orders placed, errors thrown, cache misses. Never reset it yourself; Prometheus computes rates with rate(), which also copes with the counter restarting from zero when the process restarts.
Gauge: a value that goes up and down. Use it for current state: active connections, queue depth, memory used. Sampled at scrape time, not accumulated.
Timer: measures duration and counts invocations simultaneously. By default it publishes a count, a total time, and a maximum; percentiles and histogram buckets are added only when you enable them. The most important instrument for latency SLOs.
DistributionSummary: like Timer but for non-time values: bytes transferred, items in a batch, request payload size.
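To make the differences concrete, here is a plain-Java sketch of what each instrument keeps track of. It is a simplified model built only on java.util.concurrent, not Micrometer's implementation, and the class names (SketchCounter, SketchGauge, SketchTimer) are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.DoubleSupplier;

// Simplified models of the instruments; hypothetical names, not Micrometer classes.

/** Counter: a monotonically increasing total; no decrement is offered. */
class SketchCounter {
    private final LongAdder total = new LongAdder();

    void increment() { total.increment(); }

    long count() { return total.sum(); }
}

/** Gauge: stores nothing itself; it reads the current value whenever it is sampled. */
class SketchGauge {
    private final DoubleSupplier source;

    SketchGauge(DoubleSupplier source) { this.source = source; }

    double value() { return source.getAsDouble(); }
}

/** Timer: keeps a count, a running total, and a maximum of the recorded durations. */
class SketchTimer {
    private final LongAdder count = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();
    private final AtomicLong maxNanos = new AtomicLong();

    void record(long nanos) {
        count.increment();
        totalNanos.add(nanos);
        maxNanos.accumulateAndGet(nanos, Math::max);
    }

    long count() { return count.sum(); }

    double meanNanos() {
        long n = count.sum();
        return n == 0 ? 0.0 : (double) totalNanos.sum() / n;
    }

    long maxNanos() { return maxNanos.get(); }
}
```

A DistributionSummary would look like SketchTimer with the unit changed from nanoseconds to whatever is being measured, such as bytes.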

Recording Metrics in Your Service

Inject MeterRegistry (auto-configured by Spring Boot) into any Spring bean:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private final Counter ordersCreated;
    private final Counter ordersFailed;
    private final Timer  orderProcessingTime;

    public OrderService(MeterRegistry registry) {
        this.ordersCreated = Counter.builder("orders.created")
                .description("Total orders successfully created")
                // Tag values must be non-null, so fall back when REGION is unset.
                .tag("region", System.getenv().getOrDefault("REGION", "unknown"))
                .register(registry);

        this.ordersFailed = Counter.builder("orders.failed")
                .description("Total orders that failed processing")
                .register(registry);

        this.orderProcessingTime = Timer.builder("orders.processing.time")
                .description("Time taken to process an order end-to-end")
                .publishPercentiles(0.5, 0.95, 0.99)  // median, p95, p99
                .register(registry);
    }

    public Order createOrder(OrderRequest req) {
        return orderProcessingTime.record(() -> {
            try {
                Order order = processOrder(req);
                ordersCreated.increment();
                return order;
            } catch (RuntimeException e) {
                // Count the failure, then rethrow so callers still see it.
                ordersFailed.increment();
                throw e;
            }
        });
    }
}

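The publishPercentiles option computes percentiles inside each application instance and exports them as precomputed values. Those values cannot be combined meaningfully across instances: the average of two p95s is not the fleet-wide p95. To aggregate across instances, use publishPercentileHistogram() instead, which exports histogram buckets, and compute percentiles in Prometheus with histogram_quantile().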
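The difference is easier to see with numbers. The sketch below models two instances with very different traffic, estimates p95 per instance, and compares the average of those two values with the p95 estimated from merged bucket counts. It is a simplified stand-in for bucketed histograms (the bucket bounds, class name, and sample data are assumptions), and the interpolation is only similar in spirit to what histogram_quantile() does.

```java
import java.util.Arrays;

// Simplified model of bucketed latency histograms; hypothetical names,
// not Micrometer or Prometheus code.
class HistogramSketch {
    // Bucket upper bounds in seconds; the last bucket catches everything else.
    static final double[] BOUNDS = {0.1, 0.25, 0.5, 1.0, Double.POSITIVE_INFINITY};

    /** Cumulative counts: result[i] is the number of samples <= BOUNDS[i]. */
    static long[] bucketize(double[] samples) {
        long[] counts = new long[BOUNDS.length];
        for (double s : samples) {
            for (int i = 0; i < BOUNDS.length; i++) {
                if (s <= BOUNDS[i]) counts[i]++;
            }
        }
        return counts;
    }

    /** Buckets from different instances merge by plain addition. */
    static long[] merge(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[i] + b[i];
        return out;
    }

    /** Estimates quantile q by linear interpolation inside the bucket that holds the rank. */
    static double quantile(double q, long[] cumulative) {
        long total = cumulative[cumulative.length - 1];
        if (total == 0) return Double.NaN;
        double rank = q * total;
        for (int i = 0; i < cumulative.length; i++) {
            if (cumulative[i] >= rank) {
                // In the catch-all bucket, report the highest finite bound.
                if (Double.isInfinite(BOUNDS[i])) return BOUNDS[i - 1];
                double lower = i == 0 ? 0.0 : BOUNDS[i - 1];
                long below = i == 0 ? 0 : cumulative[i - 1];
                long inBucket = cumulative[i] - below;
                return lower + (BOUNDS[i] - lower) * (rank - below) / inBucket;
            }
        }
        return Double.NaN;
    }

    /** Builds n identical latency samples. */
    static double[] repeat(double value, int n) {
        double[] out = new double[n];
        Arrays.fill(out, value);
        return out;
    }

    public static void main(String[] args) {
        long[] fast = bucketize(repeat(0.05, 900)); // instance A: 900 fast requests
        long[] slow = bucketize(repeat(0.8, 100));  // instance B: 100 slow requests
        double averaged = (quantile(0.95, fast) + quantile(0.95, slow)) / 2;
        double merged = quantile(0.95, merge(fast, slow));
        System.out.println("average of per-instance p95: " + averaged);
        System.out.println("p95 from merged buckets:     " + merged);
    }
}
```

With this sample data the averaged figure comes out near 0.535 s while the merged-bucket estimate is near 0.75 s, much closer to the 0.8 s that the slowest 10% of requests actually took.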

The @Timed Annotation — Declarative Timing

For HTTP handler methods, annotating with @Timed is cleaner than wrapping every method body:

import io.micrometer.core.annotation.Timed;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/orders")
public class OrderController {

    @PostMapping
    @Timed(value = "http.orders.create", description = "POST /orders latency",
           percentiles = {0.5, 0.95, 0.99})
    public ResponseEntity<Order> create(@RequestBody OrderRequest req) {
        // ...
    }
}

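@Timed is processed by Micrometer's TimedAspect, which needs Spring AOP (spring-boot-starter-aop) on the classpath. Depending on your Spring Boot version you either declare a TimedAspect bean yourself or, in recent 3.x releases, switch on the auto-configured one with management.observations.annotations.enabled=true. In Prometheus format the dots in the name become underscores, so the timer appears as http_orders_create_seconds, with _count, _sum, and _max series.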
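Conceptually, a timing aspect wraps the method call, measures elapsed time, and records it whether the call succeeds or fails. The sketch below shows that idea with a JDK dynamic proxy. It is only an analogy under stated assumptions: TimingProxySketch and its maps are hypothetical, and this is not how TimedAspect is implemented.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical illustration of declarative timing; not Micrometer or Spring code.
class TimingProxySketch {
    // Invocation count and total elapsed nanoseconds, keyed by method name.
    static final Map<String, LongAdder> COUNTS = new ConcurrentHashMap<>();
    static final Map<String, LongAdder> TOTAL_NANOS = new ConcurrentHashMap<>();

    /** Returns a proxy for the given interface that times every call made on the target. */
    @SuppressWarnings("unchecked")
    static <T> T timed(T target, Class<T> iface) {
        return (T) Proxy.newProxyInstance(
                TimingProxySketch.class.getClassLoader(),
                new Class<?>[] {iface},
                (proxy, method, args) -> {
                    long start = System.nanoTime();
                    try {
                        return method.invoke(target, args);
                    } catch (InvocationTargetException e) {
                        throw e.getCause(); // surface the original exception
                    } finally {
                        // Recording in finally means failed calls are timed too.
                        long elapsed = System.nanoTime() - start;
                        COUNTS.computeIfAbsent(method.getName(), k -> new LongAdder()).increment();
                        TOTAL_NANOS.computeIfAbsent(method.getName(), k -> new LongAdder()).add(elapsed);
                    }
                });
    }
}
```

The business code stays free of timing logic; the caller only has to obtain the wrapped object, which is the same trade-off the annotation offers.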

Spring Boot already instruments your HTTP server. The auto-configured http.server.requests timer records every inbound request with tags for method, uri, status, and outcome. Before adding custom metrics, check whether the built-in ones already answer your question.

Gauges for Live State

Gauges are best for values that fluctuate. A common pattern is tracking a queue's current size:

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Service;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

@Service
public class WorkerService {

    private final BlockingQueue<Task> taskQueue = new LinkedBlockingQueue<>();

    public WorkerService(MeterRegistry registry) {
        Gauge.builder("worker.queue.size", taskQueue, BlockingQueue::size)
             .description("Current number of tasks waiting in the worker queue")
             .register(registry);
    }
}

Notice the gauge takes a reference to the queue and a function to read its size. Micrometer calls the function at scrape time — no manual update needed.
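The "reference plus function" shape can be sketched in a few lines of plain Java. The class below is hypothetical and deliberately minimal. It holds the observed object through a weak reference and returns NaN once that object is gone, which mirrors the behaviour Micrometer documents for gauges by default. This is also why the observed object should be kept alive elsewhere, as the taskQueue field does above.

```java
import java.lang.ref.WeakReference;
import java.util.function.ToDoubleFunction;

// Hypothetical class; models a gauge that observes an object lazily.
class LazyGaugeSketch<T> {
    private final WeakReference<T> ref;
    private final ToDoubleFunction<T> reader;

    LazyGaugeSketch(T target, ToDoubleFunction<T> reader) {
        // A weak reference keeps the gauge from pinning the target in memory.
        this.ref = new WeakReference<>(target);
        this.reader = reader;
    }

    /** Called on each scrape; returns NaN once the target has been garbage collected. */
    double sample() {
        T target = ref.get();
        return target == null ? Double.NaN : reader.applyAsDouble(target);
    }
}
```

Nothing is stored between samples, so the value is always as fresh as the most recent scrape.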

How Prometheus Scrapes the Endpoint

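The /actuator/prometheus response is plain text in the Prometheus text exposition format (the closely related OpenMetrics format is served only when the scraper requests it through the Accept header). Prometheus configuration to scrape it: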

# prometheus.yml
scrape_configs:
  - job_name: 'order-service'
    scrape_interval: 15s
    static_configs:
      - targets: ['order-service:8080']
    metrics_path: /actuator/prometheus

In Kubernetes you would typically replace static_configs with kubernetes_sd_configs (or a Prometheus Operator ServiceMonitor) so that pods are discovered automatically, but the principle is the same.
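For a feel of what Prometheus actually receives on each scrape, the following sketch renders one sample the way the text format lays it out: a HELP line, a TYPE line, then the series name with its labels and value. ExpositionSketch is a hypothetical helper that skips label escaping and every other metric type detail, so treat it as an illustration rather than a replacement for the Micrometer registry.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper; shows the shape of the text format only.
class ExpositionSketch {

    /** Converts a Micrometer-style dotted name to Prometheus style (dots become underscores). */
    static String prometheusName(String dottedName) {
        return dottedName.replace('.', '_');
    }

    /** Renders one sample; label values are assumed not to need escaping. */
    static String render(String name, String type, String help,
                         Map<String, String> labels, double value) {
        String labelText = labels.entrySet().stream()
                .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
                .collect(Collectors.joining(","));
        String series = labelText.isEmpty() ? name : name + "{" + labelText + "}";
        return "# HELP " + name + " " + help + "\n"
                + "# TYPE " + name + " " + type + "\n"
                + series + " " + value + "\n";
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("application", "order-service");
        labels.put("region", "unknown");
        // Counters are exposed with a _total suffix in Prometheus format.
        String name = prometheusName("orders.created") + "_total";
        System.out.print(render(name, "counter", "Total orders successfully created", labels, 42));
    }
}
```

The global application tag from application.yml shows up here as an ordinary label on every series, which is what lets one Prometheus instance tell services apart.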

Useful PromQL Queries

Once Prometheus is collecting data, these queries form the backbone of most dashboards:

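rate(orders_created_total[1m]): orders per second, averaged over the last minute.
histogram_quantile(0.95, sum by (le) (rate(orders_processing_time_seconds_bucket[5m]))): 95th-percentile latency over 5 minutes, aggregated across instances. The _bucket series exist only if the timer publishes a histogram (publishPercentileHistogram).
sum(rate(http_server_requests_seconds_count{status=~"5.."}[1m])): 5xx responses per second. The status label holds the numeric code (500, 503, ...), so it needs a regex match; outcome="SERVER_ERROR" is an alternative.
sum by (instance) (jvm_memory_used_bytes{area="heap"}) / sum by (instance) (jvm_memory_max_bytes{area="heap"}): heap utilisation ratio per instance, summed over the individual heap memory pools.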
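Because every one of these queries leans on rate(), it helps to see roughly what that function computes from raw counter samples. The sketch below is a simplified model: it sums the increases between consecutive samples, treats any drop as a counter reset, and divides by the time span. Prometheus additionally extrapolates to the edges of the window, which is omitted here, and RateSketch is a hypothetical name.

```java
// Simplified model of a per-second rate over counter samples; not Prometheus code.
class RateSketch {

    /**
     * Average per-second increase across the samples.
     * timestamps are in seconds and values are raw counter readings, both in time order.
     */
    static double rate(double[] timestamps, double[] values) {
        if (timestamps.length != values.length || values.length < 2) return Double.NaN;
        double increase = 0.0;
        for (int i = 1; i < values.length; i++) {
            double delta = values[i] - values[i - 1];
            // A drop means the process restarted and the counter began again from zero,
            // so the new reading is itself the increase since the restart.
            increase += delta >= 0 ? delta : values[i];
        }
        double span = timestamps[timestamps.length - 1] - timestamps[0];
        return span > 0 ? increase / span : Double.NaN;
    }

    public static void main(String[] args) {
        double[] t = {0, 15, 30, 45, 60};       // one sample per 15 s scrape
        double[] v = {100, 130, 160, 10, 40};   // restart between the third and fourth scrape
        System.out.println("orders per second: " + rate(t, v));
    }
}
```

This reset handling is the reason application code should never try to reset a counter or compute its own per-second figures.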

The Four Golden Signals in Grafana

Google's Site Reliability Engineering book defines four golden signals that every service dashboard should show: Latency (p50/p95/p99 of request duration), Traffic (requests per second), Errors (error rate as a percentage of total traffic), and Saturation (CPU, heap, thread-pool fullness). Grafana lets you compose PromQL expressions into panels and set alert thresholds. A common starting point is importing the community Spring Boot Statistics dashboard (Grafana ID 6756), which covers JVM, HikariCP pool, and HTTP server metrics out of the box.

Metrics vs Logs vs Traces — the observability triangle. Metrics answer "is something wrong?" (aggregated, cheap to store). Traces answer "where exactly is the problem?" (sampled, moderate cost). Logs answer "what happened in detail?" (verbose, expensive). Use all three together: a Grafana alert fires → you open the relevant trace in Zipkin → you drill into the log lines for that trace ID.

Summary

Micrometer provides a thin, vendor-neutral API over all metric types. Adding micrometer-registry-prometheus and exposing the prometheus actuator endpoint gives Prometheus something to scrape; Prometheus pulls it on a schedule and Grafana visualises the data. Record counters for events, timers for latency, and gauges for live state. Secure the actuator endpoint and never expose it publicly. Build dashboards around the four golden signals. Combined with the distributed tracing from the previous lesson, you now have the full observability stack a production microservice requires.