Unified Observability
You now have three pillars in place: Prometheus scrapes metrics, Loki aggregates logs, and Tempo stores distributed traces. Each pillar answers different questions. Metrics tell you something is wrong. Logs tell you what happened. Traces tell you where the time went and which service is at fault. The real unlock — the thing that separates a reactive on-call rotation from a high-performing SRE team — is being able to pivot between all three signals in under thirty seconds, following a single thread of causality without copying and pasting IDs or switching tools.
Unified observability is not a product you buy. It is a discipline: instrument consistently, correlate deliberately, and design your dashboards so that the data leads you from symptom to root cause without dead ends.
The Three-Pillar Correlation Model
The three pillars correlate at two levels: through shared labels/resource attributes that appear identically in all three signal types, and through explicit link fields — most importantly the trace_id that ties a log line to the exact trace that generated it.
Resource attributes are the foundation. When your OTel SDK initializes, it decorates every span, every metric data point, and every log record with resource attributes: service.name, service.version, deployment.environment, k8s.pod.name, k8s.namespace.name. These become Prometheus labels, Loki stream labels, and Tempo search tags. Because the same attribute values appear in all three systems, a Grafana dashboard can fan out to all three with a single variable like ${service}.
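As a rough illustration, resource attributes can be supplied through the standard OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES environment variables instead of being hard-coded in each service. The fragment below is a sketch of a Kubernetes container spec. The service name checkout, the image, the version, and the environment value are all hypothetical, and it assumes your SDK reads these variables at startup.

```yaml
# Container spec fragment from a Kubernetes Deployment (all values are illustrative)
containers:
  - name: checkout
    image: example.com/checkout:1.4.2   # hypothetical image
    env:
      - name: OTEL_SERVICE_NAME
        value: checkout
      # Expose pod metadata through the Downward API so it can be referenced below
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: POD_NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace
      # Comma-separated key=value pairs; $(VAR) references are expanded by Kubernetes
      - name: OTEL_RESOURCE_ATTRIBUTES
        value: "service.version=1.4.2,deployment.environment=production,k8s.pod.name=$(POD_NAME),k8s.namespace.name=$(POD_NAMESPACE)"
```

Setting the attributes once at the deployment level keeps the values identical across metrics, logs, and traces, which is what the shared dashboard variable relies on.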
The explicit link is the trace ID injected into structured log records. Every time your service emits a log line while handling a request, the OTel logging bridge attaches trace_id and span_id as fields. Grafana can then render a "View in Tempo" button inline in a Loki log result, so that one click navigates to the exact trace, already scoped to the right time window.
Exemplars: Bridging Metrics and Traces
An exemplar is a sample data point attached to a metric observation that carries a trace_id and optionally a span_id. It is Prometheus's answer to the question: "you just told me p99 latency spiked to 4 s — show me an actual request that was that slow." Without exemplars, you would have to guess a time window and search Tempo manually. With exemplars, the latency histogram bucket carries a pointer directly to a real trace that landed in that bucket.
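Exemplars are part of the OpenMetrics standard, and Prometheus has supported them since v2.26 as an opt-in feature. OTel SDKs can attach them to histogram measurements recorded while a sampled span is active, although the default behavior varies by language and SDK version. On the Prometheus side, exemplar storage is disabled by default and must be switched on with --enable-feature=exemplar-storage. If metrics are pushed rather than scraped (for example from Tempo's metrics generator or a remote-write exporter in the OTel Collector), the receiver must also be enabled with --web.enable-remote-write-receiver.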
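A minimal sketch of how such flags might be passed, written as a Docker Compose service fragment. The source does not show which two flags it had in mind, so the pair below (exemplar storage plus the remote-write receiver) is an assumption based on the commonly documented combination. The image tag and config path are placeholders.

```yaml
# docker-compose.yml fragment (image tag and paths are placeholders)
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      # Keep exemplars in memory alongside the samples they annotate
      - --enable-feature=exemplar-storage
      # Accept remote-write pushes; only needed when metrics are pushed, not scraped
      - --web.enable-remote-write-receiver
    ports:
      - "9090:9090"
```

If every metric is scraped, the second flag can be dropped. Scraped exemplars are only transferred when the target exposes the OpenMetrics format.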
In Grafana, open any histogram panel, switch the visualization to a heatmap or time series, and enable the "Exemplars" toggle. Dots appear on the graph at the moments those high-latency exemplar samples were recorded. Click a dot and Grafana deep-links you directly into Tempo with the trace preloaded. This is the fastest path from a metric anomaly to a root cause in production — no manual searching.
Service Graphs: The Topology Layer
A service graph is a directed graph derived from trace data showing which services call which, with aggregated metrics (request rate, error rate, latency percentiles) on each edge. It answers the questions your static architecture diagram cannot: "which dependencies are actually on the critical path right now, and which are degraded?"
Grafana Tempo ships a built-in service graph generator as part of its metrics-generator component. It inspects spans as Tempo ingests them, pairs each client span with the server span it called, and emits Prometheus-compatible metrics under the traces_service_graph_* namespace. Those metrics are remote-written to Prometheus and rendered by Grafana as a service graph (a Node Graph visualization). The entire pipeline requires only a few lines of Tempo configuration.
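The Tempo side might look roughly like the fragment below. It is a sketch, not the original configuration: the Prometheus URL, the WAL path, and the source label are placeholders, and the overrides layout shown is the one used by recent Tempo releases (older releases used a flat metrics_generator_processors key instead).

```yaml
# tempo.yaml fragment (URL, path, and label values are placeholders)
metrics_generator:
  registry:
    external_labels:
      source: tempo                # tags every generated series with its origin
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true       # forward trace IDs with the generated metrics

overrides:
  defaults:
    metrics_generator:
      # Enable both processors for all tenants
      processors: [service-graphs, span-metrics]
```

Because the generator pushes its metrics, Prometheus must be started with its remote-write receiver enabled for this to work.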
Once the service graph is running, Grafana shows a live topology. Each node is ringed in green and red in proportion to its successful and failed requests, so a degraded service stands out immediately. Clicking any node or edge opens links to pre-filtered queries for that service. This is the first screen most SREs open when an alert fires, because it immediately narrows the blast radius.
Span metrics (produced by the span-metrics processor) give you RED metrics (Rate, Errors, Duration) per service and per operation, derived entirely from trace data. This means even services that do not expose a /metrics endpoint (third-party code, off-the-shelf binaries) get meaningful observability as long as they emit spans. It is also an early-warning system: span metrics are computed in near real time, often surfacing a problem before the next Prometheus scrape of the service's own metrics.
Grafana Data Source Linking in Practice
The final piece is configuring Grafana to know how its data sources relate to each other. This is done through exemplar trace ID destinations on the Prometheus data source, derived fields on the Loki data source, and the trace-to-logs / trace-to-metrics settings on the Tempo data source. All of it can be declared in a single Grafana provisioning file.
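A provisioning file along these lines would wire the three data sources together. Treat it as a sketch under assumptions: the URLs and uid values are placeholders, the matcherRegex assumes JSON log lines with a trace_id field, and the example query assumes Tempo's span-metrics processor is writing traces_spanmetrics_* series.

```yaml
# grafana/provisioning/datasources/observability.yaml (names and URLs are placeholders)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    jsonData:
      # Exemplar -> trace: which label holds the trace ID and where to open it
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo

  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      # Log -> trace: extract the trace ID from each line and link it to Tempo
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":\s*"(\w+)"'
          url: '$${__value.raw}'   # $$ escapes provisioning-time variable expansion
          datasourceUid: tempo

  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      # Trace -> logs: query Loki around the span, padded to tolerate clock skew
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: '-5m'
        spanEndTimeShift: '5m'
        filterByTraceID: true
        tags:
          - key: service.name
            value: service_name
      # Trace -> metrics: open a span-metrics query for the span's service
      tracesToMetrics:
        datasourceUid: prometheus
        tags:
          - key: service.name
            value: service
        queries:
          - name: Request rate
            query: 'sum(rate(traces_spanmetrics_calls_total{$$__tags}[5m]))'
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
```

The fixed uid values are what make the cross-references work, so keep them stable across environments.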
With this configuration in place, every alert becomes a starting point, not an endpoint. An SRE sees an error rate spike on the Prometheus panel, clicks an exemplar dot, lands in Tempo on the specific trace, sees which span failed, clicks the Loki link on that span, reads the exact error log line — all within a single Grafana session, all in under thirty seconds. That is unified observability working as intended.
Where clocks drift between nodes, or logs reach Loki later than max_chunk_age allows, the derived field link will open Loki at the wrong time window and return no results. Enforce NTP on all nodes (chrony with a consistent upstream), and configure the OTel Collector's batch processor to flush frequently (under 5 s). Track flush behavior with otelcol_processor_batch_timeout_trigger_send, which counts batches sent because the timeout fired rather than because the batch filled.
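For the Collector side, a batch processor tuned to flush quickly could be sketched as follows. The endpoints are placeholders, the 2 s timeout is an illustrative value under the 5 s ceiling mentioned above, and the exporter assumes a Loki version that accepts OTLP natively.

```yaml
# otel-collector.yaml fragment (endpoints and sizes are illustrative)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 2s             # flush at least every 2 seconds
    send_batch_size: 1024   # or sooner, once 1024 records are queued

exporters:
  otlphttp/loki:
    endpoint: http://loki:3100/otlp   # assumes native OTLP ingestion in Loki

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
```

A short timeout trades some batching efficiency for fresher logs, which matters more here because stale logs break the trace-to-logs pivot.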