Distributed Tracing

When a single HTTP request fans out across four microservices before returning a response, a slow P99 latency or a sporadic 500 error is almost impossible to diagnose with ordinary logs. Each service writes its own log lines, on its own clock, with its own format. Correlating them manually is error-prone and slow. Distributed tracing solves this by attaching a unique identifier to every request at its entry point and propagating that identifier — automatically — through every downstream call. The result is a complete, causal timeline of exactly where a request spent its time and where it failed.

Core Vocabulary

Before writing any code it helps to have the terminology straight:

Trace — the entire end-to-end journey of one logical request. Identified by a globally unique traceId.
Span — a named, timed unit of work within a trace. Every service hop, every database call, every significant operation is modelled as a span. A span knows its traceId, its own spanId, and the parentSpanId of the operation that triggered it.
Context propagation — the mechanism by which traceId and spanId cross service boundaries, typically as HTTP headers such as traceparent (W3C standard) or X-B3-TraceId (Zipkin/Brave legacy).
Exporter — the component that sends completed spans to a tracing backend (Zipkin, Jaeger, or an OTLP-compatible collector such as Grafana Tempo).
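To make context propagation concrete, here is a minimal sketch in plain Java that builds and parses a W3C traceparent value. The TraceParent record and its method names are illustrative assumptions, not part of Micrometer Tracing or Brave (which do this for you), and the sketch accepts only version 00 of the header.

```java
import java.security.SecureRandom;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper for the W3C "traceparent" header; not a Micrometer or Brave class.
record TraceParent(String traceId, String parentId, boolean sampled) {

    // Layout: version "00", 32-hex trace-id, 16-hex parent-id, 2-hex trace-flags.
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Starts a new trace: a fresh traceId plus a spanId for the root span. */
    static TraceParent newRoot(boolean sampled) {
        return new TraceParent(randomHex(16), randomHex(8), sampled);
    }

    /** Same trace, new span: the context a caller sends with an outbound request. */
    TraceParent nextSpan() {
        return new TraceParent(traceId, randomHex(8), sampled);
    }

    /** Renders the header value that crosses the service boundary. */
    String toHeader() {
        return "00-" + traceId + "-" + parentId + "-" + (sampled ? "01" : "00");
    }

    /** Parses an inbound header value; returns null when it is missing or malformed. */
    static TraceParent parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return null;
        }
        // All-zero IDs are invalid according to the specification.
        if (isAllZero(m.group(1)) || isAllZero(m.group(2))) {
            return null;
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 1;
        return new TraceParent(m.group(1), m.group(2), sampled);
    }

    private static boolean isAllZero(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }

    private static String randomHex(int bytes) {
        byte[] buf = new byte[bytes];
        RANDOM.nextBytes(buf);
        StringBuilder sb = new StringBuilder(bytes * 2);
        for (byte b : buf) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

A receiving service would call TraceParent.parse on the inbound header, reuse the traceId, and record the parsed parentId as the parentSpanId of the span it starts.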

Micrometer Tracing in Spring Boot 3

Spring Boot 3 replaced the older Spring Cloud Sleuth library with Micrometer Tracing, which is a thin, vendor-neutral facade over tracing implementations (Brave/Zipkin or OpenTelemetry). Add the following to pom.xml:

<!-- Micrometer Tracing core + Brave bridge -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-brave</artifactId>
</dependency>

<!-- Zipkin reporter — sends spans over HTTP to a Zipkin-compatible backend -->
<dependency>
    <groupId>io.zipkin.reporter2</groupId>
    <artifactId>zipkin-reporter-brave</artifactId>
</dependency>

<!-- Spring Boot Actuator — exposes /actuator endpoints and metrics -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

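Brave vs OpenTelemetry: With Spring Boot's defaults, both bridges propagate context in the W3C Trace Context format. Prefer micrometer-tracing-bridge-otel (paired with an OTLP exporter such as opentelemetry-exporter-otlp instead of the Zipkin reporter) if your organisation already runs an OpenTelemetry Collector, since OTLP is the protocol the Collector and most current tracing backends accept natively. Brave is simpler to get started with and has a smaller dependency footprint.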

Minimal Configuration

With those JARs on the classpath, auto-configuration activates tracing. Tune it in application.yml:

management:
  tracing:
    sampling:
      probability: 1.0   # 100 % sampling; Spring Boot's default is 0.1, a sensible value for high-volume production
  zipkin:
    tracing:
      endpoint: http://localhost:9411/api/v2/spans   # Boot 3 property; spring.zipkin.base-url belonged to Sleuth

logging:
  pattern:
    level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"

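The tracing bridge places traceId and spanId in the logging MDC, and this pattern prints them on every log line. From Spring Boot 3.2 onward these correlation IDs are logged by default, so the explicit pattern matters mainly on 3.0 and 3.1 or when you want a custom layout. When a request fails you can copy the traceId from any log line and open Zipkin's UI to see the full waterfall.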

Automatic Instrumentation

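Much of Spring Boot's instrumentation needs no code. Once the dependencies are present:

Incoming HTTP requests: Spring's server observation support (ServerHttpObservationFilter in servlet applications, with a reactive counterpart in WebFlux) starts a new trace, or joins an existing one if a traceparent header is present, and ends the span when the response is sent. The TracingFilter and TracingWebFilter names belong to the older Sleuth/Brave instrumentation.
Outgoing HTTP calls with RestTemplate or WebClient: the auto-configured builders carry the observation support that injects propagation headers into every outbound request, so the downstream service participates in the same trace.
Spring Data / JDBC: Spring Boot 3 does not create spans for SQL statements on its own, even with spring-boot-starter-data-jpa on the classpath. Database calls appear as child spans only after you add a DataSource instrumentation library such as datasource-micrometer-spring-boot.
Message listeners (Kafka, RabbitMQ): message headers carry the trace context and the listener instrumentation restores it, but observation is opt-in. For Kafka, set spring.kafka.template.observation-enabled and spring.kafka.listener.observation-enabled to true; RabbitMQ has equivalent switches on RabbitTemplate and the listener container factory.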

Always create your RestTemplate or WebClient beans through the auto-configured builder. A plain new RestTemplate() bypasses the tracing instrumentation. Instead inject RestTemplateBuilder (synchronous) or the auto-configured WebClient.Builder (reactive); both are pre-configured for tracing.

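import java.time.Duration;

import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

@Configuration
public class HttpClientConfig {

    // CORRECT: use the injected builder; tracing instrumentation is applied automatically
    @Bean
    public RestTemplate restTemplate(RestTemplateBuilder builder) {
        return builder
            // Spring Boot 3.4+; on 3.0 to 3.3 use setConnectTimeout / setReadTimeout
            .connectTimeout(Duration.ofSeconds(3))
            .readTimeout(Duration.ofSeconds(5))
            .build();
    }
}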

Creating Custom Spans

Auto-instrumentation covers the infrastructure layer. For business-logic operations that are expensive or failure-prone — a third-party API call, a complex calculation, a cache lookup — you want a dedicated span so the waterfall shows exactly how long it took. Inject Tracer and use the fluent API:

import java.math.BigDecimal;

import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;
import org.springframework.stereotype.Service;

@Service
public class PricingService {

    private final Tracer tracer;
    private final ExternalPricingClient client;

    public PricingService(Tracer tracer, ExternalPricingClient client) {
        this.tracer = tracer;
        this.client = client;
    }

    public BigDecimal fetchLivePrice(String productId) {
        Span span = tracer.nextSpan()
            .name("external.pricing.fetch")
            .tag("product.id", productId)
            .start();

        try (Tracer.SpanInScope ignored = tracer.withSpan(span)) {
            return client.getPrice(productId);
        } catch (Exception ex) {
            span.error(ex);      // marks the span as errored in the UI
            throw ex;
        } finally {
            span.end();          // MUST always end the span, even on error
        }
    }
}

A few things to notice in this pattern:

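tracer.nextSpan() creates a child of the currently active span (or starts a new trace if none is active), so it slots correctly into the existing trace hierarchy.
tracer.withSpan(span) makes the span current for the duration of the try block, so outbound calls made inside it become its children and log lines carry its spanId.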
span.tag() attaches key-value metadata that appears in the Zipkin/Jaeger span detail view — invaluable for filtering traces by product, user, tenant, or any business dimension.
span.error(ex) records the exception and sets the span status to ERROR, surfacing it immediately in the tracing UI.
The finally block is mandatory; an unclosed span leaks memory and never reaches the exporter.
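Because the error-and-end discipline is easy to get wrong when it is repeated by hand, one option is to pull it into a small helper. The sketch below shows only that control flow. EndableSpan is a hypothetical stand-in for the two Span methods involved, so the code compiles without Micrometer on the classpath, and SpanSupport.inSpan is likewise an invented name.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the two Span methods used below, so the sketch
// compiles without Micrometer on the classpath.
interface EndableSpan {
    void error(Throwable t);
    void end();
}

final class SpanSupport {
    private SpanSupport() {}

    /** Runs the work, records any failure on the span, and always ends the span. */
    static <T> T inSpan(EndableSpan span, Supplier<T> work) {
        try {
            return work.get();
        } catch (RuntimeException ex) {
            span.error(ex);   // mark the span as failed before rethrowing
            throw ex;
        } finally {
            span.end();       // runs on both the success and the failure path
        }
    }
}
```

In a real service the helper would take a Tracer, call tracer.nextSpan().name(...).start() itself, and wrap the work in tracer.withSpan(span) so the span is also current while the work runs.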

Security Considerations

Trace headers can be forged by clients. If a browser or external caller sends a crafted traceparent header, your services will join that trace. The caller then chooses the trace ID under which your internal spans are recorded, which pollutes the backend and can help an attacker correlate timing data with your internal service topology. By setting the sampled flag, the caller can also force every one of its requests to be traced, bypassing your sampling rate and inflating export and storage cost. Mitigate this by trusting incoming trace context only from authenticated internal callers (e.g., services that present a valid mTLS certificate or an internal service-to-service JWT). At the perimeter (API Gateway / edge service), strip and re-issue trace headers for requests arriving from the public internet.
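As a rough illustration of the strip step at the edge, the following plain-Java sketch drops W3C and B3 propagation headers unless the caller has already been authenticated as internal. TraceHeaderSanitizer and the trustedCaller flag are assumptions made for the example; in practice this logic would live in a gateway filter and the trust decision would come from your mTLS or JWT check.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative edge logic; the class and method names are hypothetical.
final class TraceHeaderSanitizer {

    // W3C and B3 propagation headers, lower-cased for case-insensitive matching.
    private static final Set<String> TRACE_HEADERS = Set.of(
            "traceparent", "tracestate", "b3",
            "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid",
            "x-b3-sampled", "x-b3-flags");

    private TraceHeaderSanitizer() {}

    /**
     * Returns a copy of the inbound headers. Trace headers survive only when the
     * caller is trusted; otherwise they are dropped so that the server
     * instrumentation starts a fresh trace with its own sampling decision.
     */
    static Map<String, String> sanitize(Map<String, String> inbound, boolean trustedCaller) {
        Map<String, String> out = new LinkedHashMap<>();
        inbound.forEach((name, value) -> {
            boolean isTraceHeader = TRACE_HEADERS.contains(name.toLowerCase(Locale.ROOT));
            if (trustedCaller || !isTraceHeader) {
                out.put(name, value);
            }
        });
        return out;
    }
}
```

Dropping the headers is enough to "re-issue" them: with no inbound context, the first instrumented service generates a new traceId.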

Additionally, be mindful of what you attach as span tags. A tag like user.id or request.body is stored verbatim in the tracing backend, where it is visible to anyone with access to the tracing UI and is kept for as long as the traces are retained. Treat span attributes as diagnostic metadata rather than a copy of the request, and avoid attaching PII or secrets.
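When a trace still needs to be searchable by user, one possible approach is to tag a keyed hash instead of the raw identifier. The sketch below uses HMAC-SHA256 from the JDK. TagValues.pseudonymize, the 16-character truncation, and the key handling are assumptions made for illustration, and whether a pseudonym is acceptable depends on your own privacy requirements.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Illustrative helper; the name and the truncation length are arbitrary choices.
final class TagValues {
    private TagValues() {}

    /**
     * Returns a stable, non-reversible token for a sensitive value: the first
     * 8 bytes of HMAC-SHA256(secretKey, value), rendered as 16 hex characters.
     */
    static String pseudonymize(String value, byte[] secretKey) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secretKey, "HmacSHA256"));
            byte[] digest = mac.doFinal(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(16);
            for (int i = 0; i < 8; i++) {
                hex.append(String.format("%02x", digest[i]));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable or key invalid", e);
        }
    }
}
```

A span would then carry something like span.tag("user.ref", TagValues.pseudonymize(userId, key)), where user.ref is a made-up tag name and key is loaded from your secret store.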

Sampling Strategy

Tracing 100 % of requests is fine in development. In production at meaningful scale, exporting every span to a backend creates non-trivial overhead and storage cost. Common strategies:

Probabilistic (head-based): sample a fixed percentage (e.g. 10 %), decided once at the trace root and propagated downstream in the sampled flag. Simple, predictable cost. Set with management.tracing.sampling.probability=0.1, which is also Spring Boot's default.
Rate-limited: sample at most N traces per second regardless of load. Protects the backend during traffic spikes. Spring Boot has no property for this; with the Brave bridge you can register a brave.sampler.RateLimitingSampler as the Sampler bean.
Tail-based: buffer all spans and decide to keep only traces that contain an error or exceed a latency threshold. Requires a collector that supports tail sampling (e.g. OpenTelemetry Collector with the tail_sampling processor). More operationally complex, but it keeps every interesting trace without storing every trace; the services still have to export all spans to the collector.
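The three strategies differ mainly in when the keep-or-drop decision is made and on what information. The plain-Java sketch below illustrates each decision in isolation. SamplingSketch and everything in it are hypothetical, the algorithms are simplified, and none of this is how Brave or the OpenTelemetry Collector implements sampling internally.

```java
import java.util.List;

// Illustrative sampling decisions only; not the algorithms used by Brave or OpenTelemetry.
final class SamplingSketch {
    private SamplingSketch() {}

    /**
     * Head-based: decide at the root, before any work is done. Deriving the
     * verdict from the low 64 bits of the trace ID makes it reproducible.
     * Expects a hex trace ID of at least 16 characters.
     */
    static boolean headSample(String traceIdHex, double probability) {
        String low16 = traceIdHex.substring(traceIdHex.length() - 16);
        long low = Long.parseUnsignedLong(low16, 16);
        // Keep the top 53 bits so the value maps exactly onto a double in [0, 1).
        double fraction = (low >>> 11) / (double) (1L << 53);
        return fraction < probability;
    }

    /** Rate-limited: admit at most maxPerSecond new traces per one-second window. */
    static final class RateLimiter {
        private final int maxPerSecond;
        private long windowStartMillis = -1;
        private int usedInWindow;

        RateLimiter(int maxPerSecond) {
            this.maxPerSecond = maxPerSecond;
        }

        synchronized boolean trySample(long nowMillis) {
            if (windowStartMillis < 0 || nowMillis - windowStartMillis >= 1000) {
                windowStartMillis = nowMillis;   // open a new window
                usedInWindow = 0;
            }
            if (usedInWindow < maxPerSecond) {
                usedInWindow++;
                return true;
            }
            return false;
        }
    }

    /** Minimal view of a completed span, as a tail sampler would see it. */
    record FinishedSpan(String name, long durationMillis, boolean error) {}

    /**
     * Tail-based: decide after the trace has finished. Keeps the trace if any
     * span failed or if the longest span (normally the root) hit the threshold.
     */
    static boolean tailKeep(List<FinishedSpan> trace, long latencyThresholdMillis) {
        long longest = 0;
        for (FinishedSpan span : trace) {
            if (span.error()) {
                return true;
            }
            longest = Math.max(longest, span.durationMillis());
        }
        return longest >= latencyThresholdMillis;
    }
}
```

Note that headSample and RateLimiter need nothing but the trace ID or the clock, whereas tailKeep needs every span of the trace, which is why tail sampling has to run in a collector that buffers them.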

Running Zipkin Locally

You can spin up a Zipkin instance in seconds with Docker:

docker run -d -p 9411:9411 openzipkin/zipkin

Point your service at http://localhost:9411, make a few HTTP calls, then open http://localhost:9411 in a browser. Click Run Query to see all traces, then click any trace to view its waterfall. Every span is labelled with its service name, operation name, duration, and any tags you added.

Summary

Distributed tracing turns a sea of disconnected log lines into a structured, visual timeline of every request. With Micrometer Tracing and three Maven dependencies, Spring Boot 3 instruments HTTP server and client traffic automatically; messaging listeners and JDBC calls join the trace once their observation support is enabled or added. Add custom spans for business-critical operations using the Tracer API, attach meaningful tags, always end spans in a finally block, and choose a sampling strategy that balances visibility against overhead. In the next lesson you will see how to complement traces with metrics and health dashboards to complete the observability picture.