Distributed Tracing & OpenTelemetry

Unified Observability

You now have three pillars in place: Prometheus scrapes metrics, Loki aggregates logs, and Tempo stores distributed traces. Each pillar answers different questions. Metrics tell you something is wrong. Logs tell you what happened. Traces tell you where the time went and which service is at fault. The real unlock — the thing that separates a reactive on-call rotation from a high-performing SRE team — is being able to pivot between all three signals in under thirty seconds, following a single thread of causality without copying and pasting IDs or switching tools.

Unified observability is not a product you buy. It is a discipline: instrument consistently, correlate deliberately, and design your dashboards so that the data leads you from symptom to root cause without dead ends.

The Three-Pillar Correlation Model

The three pillars correlate at two levels: through shared labels/resource attributes that appear identically in all three signal types, and through explicit link fields — most importantly the trace_id that ties a log line to the exact trace that generated it.

Resource attributes are the foundation. When your OTel SDK initializes, it decorates every span, every metric data point, and every log record with resource attributes: service.name, service.version, deployment.environment, k8s.pod.name, k8s.namespace.name. These surface as Prometheus labels, Loki labels, and Tempo search tags, with dots normalized to underscores where the backend requires it (service.name becomes service_name). Because the same attribute values appear in all three systems, a Grafana dashboard can fan out to all three with a single variable like ${service}.
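The name mapping between OTel attributes and Prometheus-style labels can be sketched in a few lines. This is a simplified illustration, not the exporter's actual implementation: the function names and the sample resource values are hypothetical, and real exporters apply additional rules (for example, mapping service.name to the job label).

```python
import re

def to_prometheus_label(attribute_name: str) -> str:
    """Replace characters that are not valid in a Prometheus label name with '_'."""
    label = re.sub(r"[^a-zA-Z0-9_]", "_", attribute_name)
    # Label names may not start with a digit; prefix an underscore if needed.
    if label and label[0].isdigit():
        label = "_" + label
    return label

def as_label_matcher(resource: dict) -> str:
    """Render resource attributes as a label matcher usable in PromQL or LogQL."""
    pairs = (f'{to_prometheus_label(k)}="{v}"' for k, v in sorted(resource.items()))
    return "{" + ", ".join(pairs) + "}"

# Hypothetical resource attributes for one service.
RESOURCE = {
    "service.name": "checkout",
    "deployment.environment": "prod",
    "k8s.namespace.name": "shop",
}

if __name__ == "__main__":
    print(as_label_matcher(RESOURCE))
```

The same matcher string works in a Prometheus query and a Loki stream selector only if both backends received the attributes under the same normalized names, which is why consistent instrumentation matters.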

The explicit link is the trace ID injected into structured log records. Every time your service emits a log line while handling a request, the OTel logging bridge attaches trace_id and span_id as fields. Grafana can then render a "View in Tempo" link inline next to a Loki log result: one click navigates to the exact trace, already scoped to the right time window.
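To make the log side concrete, here is a minimal standard-library sketch of a JSON formatter that adds trace_id and span_id to each record. It does not use the OTel SDK: the current_trace context variable is a hypothetical stand-in for the active span context that the SDK would track, and the render helper exists only for demonstration.

```python
import json
import logging
import re
from contextvars import ContextVar

# Hypothetical stand-in for the active span context: (trace_id, span_id) as ints.
current_trace = ContextVar("current_trace", default=None)

class TraceContextJsonFormatter(logging.Formatter):
    """Emit one compact JSON object per record, with trace context when present."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        ctx = current_trace.get()
        if ctx is not None:
            trace_id, span_id = ctx
            payload["trace_id"] = format(trace_id, "032x")  # 128-bit id as 32 hex chars
            payload["span_id"] = format(span_id, "016x")    # 64-bit id as 16 hex chars
        # Compact separators: no space after ':', so a pattern such as
        # "trace_id":"(\w+)" matches the output.
        return json.dumps(payload, separators=(",", ":"))

def render(message: str, ctx=None) -> str:
    """Format a single INFO record as if it were logged inside the given context."""
    token = current_trace.set(ctx)
    try:
        record = logging.LogRecord("checkout", logging.INFO, "app.py", 0, message, None, None)
        return TraceContextJsonFormatter().format(record)
    finally:
        current_trace.reset(token)

if __name__ == "__main__":
    line = render("payment declined", (0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7))
    print(line)
    print(re.search(r'"trace_id":"(\w+)"', line).group(1))
```

One design detail worth noting: json.dumps inserts a space after each colon by default, which would make a regex that expects "trace_id":"..." without whitespace miss every line. Either emit compact JSON as above or loosen the pattern.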

All three pillars share resource attributes and are linked by trace ID — Grafana pivots between them without manual copy-paste.

Exemplars: Bridging Metrics and Traces

An exemplar is a sample data point attached to a metric observation that carries a trace_id and optionally a span_id. It is Prometheus's answer to the question: "you just told me p99 latency spiked to 4 s — show me an actual request that was that slow." Without exemplars, you would have to guess a time window and search Tempo manually. With exemplars, the latency histogram bucket carries a pointer directly to a real trace that landed in that bucket.
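On the wire, an exemplar is just a suffix on a sample line in the OpenMetrics text format. The sketch below formats and parses one histogram bucket line; the metric name, values, and function names are hypothetical, and a real client library would do this for you.

```python
import re

def bucket_line_with_exemplar(metric, le, count, trace_id, observed_value, timestamp):
    """Render one histogram bucket sample followed by its exemplar.

    Layout: <name>_bucket{le="..."} <count> # {trace_id="..."} <value> <timestamp>
    """
    # OpenMetrics limits the combined length of exemplar label names and values
    # to 128 characters.
    if len("trace_id") + len(trace_id) > 128:
        raise ValueError("exemplar label set too long")
    return (
        f'{metric}_bucket{{le="{le}"}} {count} '
        f'# {{trace_id="{trace_id}"}} {observed_value} {timestamp}'
    )

def exemplar_trace_id(line):
    """Extract the trace ID from a sample line, or None if it has no exemplar."""
    match = re.search(r'# \{trace_id="([0-9a-f]+)"\}', line)
    return match.group(1) if match else None

if __name__ == "__main__":
    line = bucket_line_with_exemplar(
        "http_server_duration_seconds", "5.0", 17,
        "4bf92f3577b34da6a3ce929d0e0e4736", 4.2, 1700000000.0,
    )
    print(line)
    print(exemplar_trace_id(line))
```

The bucket count (17 here) is the aggregate; the exemplar after the # is one concrete request that fell into that bucket, with its own observed value and timestamp.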

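Exemplars are part of the OpenMetrics standard. Prometheus has been able to store them since v2.26, as an experimental feature behind the --enable-feature=exemplar-storage flag. OTel SDKs that implement exemplars can attach them to histogram measurements recorded while a sampled span is active. On the Prometheus side, two things are required: exemplar storage must be enabled, and the scrape must use the OpenMetrics text format: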

# prometheus.yml: exemplar storage settings
# (Prometheus must also be started with --enable-feature=exemplar-storage)
global:
  scrape_interval: 15s

storage:
  exemplars:
    max_exemplars: 100000   # circular buffer; memory use grows with exemplar label size

# Your scrape target must expose metrics in OpenMetrics text format:
scrape_configs:
  - job_name: 'checkout-service'
    scrape_interval: 15s
    scrape_protocols:
      - OpenMetricsText1.0.0   # enables exemplar parsing
      - PrometheusText0.0.4
    static_configs:
      - targets: ['checkout:9090']

In Grafana, open a histogram query in a time series or heatmap panel and enable the "Exemplars" toggle in the Prometheus query options. Markers appear on the graph at the moments those high-latency exemplar samples were recorded. Click a marker and Grafana deep-links you directly into Tempo with the trace preloaded. This is the fastest path from a metric anomaly to a root cause in production, with no manual searching.

Pro tip: Exemplars are sampled. Your SDK does not record one for every request, only for a representative subset. Ensure your trace sampling keeps high-latency outliers, otherwise an exemplar can point at a trace that never reached Tempo. A latency-based rule cannot be head-based, because the head decision is made before the request's duration is known. A common production strategy is tail-based sampling (for example in the OTel Collector) that keeps 100% of requests over 1 s, combined with a 1% probabilistic rate for everything else, which keeps storage costs manageable while guaranteeing that traces exist for the slow tail.
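A latency-based keep rule can only be evaluated once the trace has finished, so it is a tail-style decision. The sketch below illustrates that decision in isolation: keep every trace slower than 1 s and a deterministic 1% of the rest. The function name and the hashing scheme are assumptions for illustration; in production this logic would typically live in a tail-sampling component rather than in application code.

```python
import hashlib

SLOW_THRESHOLD_S = 1.0   # keep every trace slower than this
BASELINE_RATE = 0.01     # keep this fraction of all other traces

def keep_trace(trace_id: str, duration_s: float) -> bool:
    """Decide, after the trace has completed, whether to store it."""
    if duration_s > SLOW_THRESHOLD_S:
        return True
    # Hash the trace ID into [0, 1) so that every collector instance makes the
    # same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < BASELINE_RATE

if __name__ == "__main__":
    print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 4.2))
    kept = sum(keep_trace(format(i, "032x"), 0.05) for i in range(10_000))
    print(f"kept {kept} of 10000 fast traces")
```

Hashing the trace ID instead of calling a random number generator keeps the decision reproducible, which matters when several collector replicas may see spans from the same trace.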

Service Graphs: The Topology Layer

A service graph is a directed graph derived from trace data showing which services call which, with aggregated metrics (request rate, error rate, latency percentiles) on each edge. It answers the questions your static architecture diagram cannot: "which dependencies are actually on the critical path right now, and which are degraded?"

Grafana Tempo ships a built-in service graph generator as part of its metrics-generator component. It inspects spans as they are ingested, pairs each client span with the server span it called, and emits Prometheus-compatible metrics under the traces_service_graph_* namespace. You store those metrics in Prometheus and render them in Grafana's Service Graph view. The entire pipeline requires only a few lines of Tempo configuration:

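# tempo.yaml: enable service graph and span metrics generation
metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: prod-us-east-1
  storage:
    path: /var/tempo/wal
    remote_write:
      # Prometheus must run with --web.enable-remote-write-receiver
      - url: http://prometheus:9090/api/v1/write
  processor:
    service_graphs:
      dimensions:
        - http.method
        - http.status_code
        - deployment.environment
      enable_virtual_node_label: true   # surfaces external deps (DBs, 3rd-party)

# Processors are switched on per tenant through overrides
# (older Tempo releases use the flat key metrics_generator_processors)
overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs          # builds the call-graph topology
        - span-metrics            # emits RED metrics per operation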

Once the service graph is running, Grafana's Service Graph view shows a live topology. Each node is drawn with a ring that shows the share of successful requests in green and failed requests in red. Clicking any node or edge opens pre-filtered queries for that service. This is the first screen many SREs open when an alert fires, because it immediately narrows the blast radius.
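The idea behind a service graph can be sketched as pairing each server span with the client span that is its parent and counting requests per (caller, callee) edge. This is a simplified illustration, not Tempo's implementation: the Span fields and the error-counting rule are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    kind: str            # "client", "server", or "internal"
    error: bool = False

def service_graph_edges(spans):
    """Aggregate request and error counts per (client service, server service)."""
    by_id = {s.span_id: s for s in spans}
    edges = defaultdict(lambda: {"requests": 0, "errors": 0})
    for span in spans:
        parent = by_id.get(span.parent_id)
        # An edge exists where a server span's parent is a client span.
        if span.kind == "server" and parent is not None and parent.kind == "client":
            edge = edges[(parent.service, span.service)]
            edge["requests"] += 1
            edge["errors"] += int(span.error or parent.error)
    return dict(edges)

if __name__ == "__main__":
    trace = [
        Span("a", None, "frontend", "server"),
        Span("b", "a", "frontend", "client"),
        Span("c", "b", "checkout", "server"),
        Span("d", "c", "checkout", "client", error=True),
        Span("e", "d", "payments", "server", error=True),
    ]
    for edge, stats in service_graph_edges(trace).items():
        print(edge, stats)
```

Running this over many traces in a time window yields the per-edge request and error counts that a topology view can colour and annotate.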

Key idea: Span metrics (emitted by the span-metrics processor) give you RED metrics (Rate, Errors, Duration) per service and per operation, derived entirely from trace data. This means even services that do not expose a /metrics endpoint (third-party code, off-the-shelf binaries) get meaningful observability as long as they emit spans. It can also act as an early-warning signal: span metrics are computed as spans arrive, so they can surface a problem without waiting for the service's own metrics to be scraped.
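As a rough illustration of how RED metrics fall out of span data, the sketch below groups finished spans by service and operation and computes a request rate, an error ratio, and a nearest-rank p95 duration. The tuple layout and function name are hypothetical; a real span-metrics pipeline emits counters and histograms rather than exact percentiles.

```python
import math
from collections import defaultdict

def red_metrics(spans, window_s):
    """spans: iterable of (service, operation, duration_s, is_error) tuples
    observed during a window of window_s seconds."""
    grouped = defaultdict(list)
    for service, operation, duration_s, is_error in spans:
        grouped[(service, operation)].append((duration_s, is_error))

    result = {}
    for key, samples in grouped.items():
        durations = sorted(d for d, _ in samples)
        errors = sum(1 for _, is_error in samples if is_error)
        # Nearest-rank percentile: the smallest value with at least 95% of
        # samples at or below it.
        rank = max(1, math.ceil(0.95 * len(durations)))
        result[key] = {
            "rate_per_s": len(samples) / window_s,      # Rate
            "error_ratio": errors / len(samples),       # Errors
            "p95_s": durations[rank - 1],               # Duration
        }
    return result

if __name__ == "__main__":
    observed = [
        ("checkout", "POST /pay", 0.1, False),
        ("checkout", "POST /pay", 0.2, False),
        ("checkout", "POST /pay", 0.3, False),
        ("checkout", "POST /pay", 2.0, True),
    ]
    print(red_metrics(observed, window_s=60))
```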

Grafana Data Source Linking in Practice

The final piece is configuring Grafana to know how its data sources relate to each other. This is done through derived fields in Loki and trace-to-logs / trace-to-metrics in Tempo data source settings. A minimal production Grafana provisioning file looks like this:

# grafana/provisioning/datasources/observability.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus-uid               # referenced by the Tempo data source below
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo-uid    # clicking an exemplar opens Tempo

  - name: Loki
    uid: loki-uid                     # referenced by the Tempo data source below
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'
          url: '$${__value.raw}'
          datasourceUid: tempo-uid    # log line trace_id links to Tempo

  - name: Tempo
    uid: tempo-uid
    type: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki-uid
        tags:
          - key: service.name
            value: service_name       # auto-filters Loki by service
      tracesToMetrics:
        datasourceUid: prometheus-uid
        tags:
          - key: service.name
            value: job
        queries:
          - name: Request rate
            query: 'rate(http_server_duration_count{$$__tags}[5m])'   # $$__tags expands to the mapped tags, e.g. job="checkout"
      serviceMap:
        datasourceUid: prometheus-uid

With this configuration in place, every alert becomes a starting point, not an endpoint. An SRE sees an error rate spike on the Prometheus panel, clicks an exemplar dot, lands in Tempo on the specific trace, sees which span failed, clicks the Loki link on that span, reads the exact error log line — all within a single Grafana session, all in under thirty seconds. That is unified observability working as intended.

Production pitfall: clock skew kills correlation. Exemplar links and trace-to-log pivots depend on time alignment. If the timestamps on your log records differ from the span's start time by more than a few seconds, whether from clock skew between hosts or buffering in the OTel Collector, the trace-to-logs link will query Loki over the wrong time window and return no results. Enforce NTP on all nodes (chrony with a consistent upstream), and configure the OTel Collector's batch processor to flush frequently (under 5 s). The otelcol_processor_batch_timeout_trigger_send counter shows how often batches are flushed by the timeout rather than by size, which helps confirm that the flush interval is what bounds delivery delay; it is not itself a latency measurement.
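The effect of skew on a trace-to-logs pivot can be shown with a small calculation. The sketch below derives the log query window from a span's start and duration, then pads it on both sides; Grafana's trace-to-logs settings offer comparable start and end time shifts. The 5-second padding and the function names are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def log_query_window(span_start, span_duration,
                     start_shift=timedelta(seconds=-5),
                     end_shift=timedelta(seconds=5)):
    """Return (from, to) for the log query, padded to tolerate modest skew."""
    return span_start + start_shift, span_start + span_duration + end_shift

def in_window(timestamp, window):
    """True if the log timestamp falls inside the inclusive query window."""
    start, end = window
    return start <= timestamp <= end

if __name__ == "__main__":
    span_start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    span_duration = timedelta(milliseconds=200)
    # A log line stamped by a host whose clock runs 3 s ahead.
    skewed_log = span_start + timedelta(seconds=3)

    exact = log_query_window(span_start, span_duration,
                             start_shift=timedelta(0), end_shift=timedelta(0))
    padded = log_query_window(span_start, span_duration)
    print(in_window(skewed_log, exact))
    print(in_window(skewed_log, padded))
```

Padding the window hides small skew but does not fix it: a wider window also returns more unrelated log lines, so it complements clock synchronization rather than replacing it.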