Distributed Tracing & OpenTelemetry

Backends: Jaeger & Tempo

After your OpenTelemetry Collector receives spans from instrumented services, those spans must land somewhere that supports efficient storage and querying. Two backends dominate production deployments in 2025: Jaeger (the battle-tested CNCF-graduated project) and Grafana Tempo (the cloud-native, cost-optimised alternative). Choosing the wrong backend — or misconfiguring the right one — leads to either runaway storage costs or traces that vanish exactly when you need them to debug a P0 outage. This lesson covers both systems in depth: architecture, storage engines, query mechanics, and how to wire traces to logs and metrics for unified observability.

Jaeger: Architecture & Storage

Jaeger was open-sourced by Uber in 2017. Its original architecture used separate collector, query, and agent binaries, but modern deployments collapse these into the all-in-one binary (for dev) or a collector + query pair backed by an external store (for production). Jaeger's legacy ingestion path was Thrift over UDP from client libraries to a local agent, but since v1.35 the collector accepts OTLP over gRPC and HTTP natively, so your OTel Collector can export to it directly without the legacy Jaeger agent.

For storage, Jaeger supports three options that matter in production:

Elasticsearch / OpenSearch — the default at most companies. Traces are stored as JSON documents indexed by traceID, serviceName, and startTime. Retention is handled via ILM (Index Lifecycle Management). Drawback: ES index overhead balloons storage costs at high throughput (>50 k spans/s).
Cassandra — Uber's original backend. Excellent write throughput; TTL-based retention is trivial. Operationally heavier than ES for most teams.
Badger (embedded) — local disk, development only. Never use in production.

The Jaeger Query service exposes a UI on port 16686 and a gRPC API on 16685. The UI lets you search by service, operation, tags, duration range, and time window. The underlying query is a tag-indexed lookup against your storage backend, not a full-text scan.

# Minimal Jaeger all-in-one for local development (accepts OTLP gRPC on 4317)
docker run --rm -d \
  --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -e COLLECTOR_OTLP_ENABLED=true \
  jaegertracing/all-in-one:1.57

# Production: Jaeger collector + query backed by Elasticsearch 8
# SPAN_STORAGE_TYPE selects the storage plugin; without it the --es.* flags are not registered
# collector (receives OTLP, writes to ES)
SPAN_STORAGE_TYPE=elasticsearch jaeger-collector \
  --es.server-urls=http://es:9200 \
  --es.num-shards=5 \
  --es.num-replicas=1 \
  --collector.otlp.enabled=true \
  --collector.otlp.grpc.host-port=:4317 \
  --log-level=warn

# query (serves the UI and gRPC API)
SPAN_STORAGE_TYPE=elasticsearch jaeger-query \
  --es.server-urls=http://es:9200 \
  --query.base-path=/jaeger \
  --log-level=warn

Index strategy for Jaeger + ES: By default Jaeger creates one index per day (jaeger-span-YYYY-MM-DD). At >10 M spans/day, set --es.use-aliases=true and --es.use-ilm=true, after initialising the aliases and the ILM policy with the jaeger-es-rollover tool, so ILM controls rollover rather than calendar boundaries. Rollover is then triggered by index size or age, which avoids the "shard explosion" pattern where a single busy day's index grows unchecked.

Grafana Tempo: Architecture & Storage

Tempo was built with one goal: make trace storage as cheap as object storage. It writes spans directly to S3, GCS, or Azure Blob as Parquet-formatted blocks, with only a tiny index mapping traceID → block location instead of a tag index. That index is why Tempo's query model started out fundamentally different from Jaeger's: originally you had to know the traceID to retrieve a trace. This seems limiting, but it is by design. Tempo relies on Prometheus metrics and log queries to surface the traceID, which can keep storage costs an order of magnitude lower than ES-backed Jaeger.

Since Tempo 2.0, the TraceQL query language enables attribute-based search without knowing the traceID upfront. It is backed by the columnar Parquet block format rather than a separate tag index: a query reads only the columns it references. This closes most of the UX gap with Jaeger.

Jaeger uses a tag-indexed document store (ES/Cassandra); Tempo writes Parquet blocks to object storage and answers TraceQL queries by scanning the relevant columns.

# Minimal Tempo config (tempo.yaml) backed by S3
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  trace_idle_period: 10s
  max_block_bytes: 1_000_000
  max_block_duration: 5m

storage:
  trace:
    backend: s3
    s3:
      bucket: my-tempo-traces
      endpoint: s3.us-east-1.amazonaws.com
    wal:
      path: /var/tempo/wal

compactor:
  compaction:
    block_retention: 720h   # 30 days

querier:
  frontend_worker:
    frontend_address: tempo-query-frontend:9095

Querying Traces: TraceQL vs Jaeger UI

The Jaeger UI query model is simple: pick a service, pick an operation, set a time range and tag filter, click Find Traces. The backend translates this to an ES multi-term query against the tag index. For ad-hoc debugging this is fast, but it only returns traces that contain the matching span — there is no language for expressing cross-span conditions like "the db.query span took > 500 ms AND the parent HTTP span returned 200".

TraceQL fills that gap. It uses a pipeline syntax inspired by LogQL:

# TraceQL: find all traces where the checkout service had a db call > 500ms
# (service.name is a resource attribute; duration is a span intrinsic, so it goes inside the braces)
{ resource.service.name = "checkout" && span.db.system = "postgresql" && duration > 500ms }

# Find error traces in the payment service (the time range comes from the query's time picker, not from TraceQL)
{ resource.service.name = "payment-svc" && status = error }

# Structural query: find failed db spans that descend from an order-api span
# (>> matches any descendant; use > to match direct children only)
{ resource.service.name = "order-api" } >> { span.db.statement != "" && status = error }

# Aggregate: request rate by service (TraceQL metrics, Tempo 2.4+)
{ } | rate() by (resource.service.name)

Use TraceQL metrics for RED dashboards without Prometheus instrumentation: Tempo 2.4 introduced TraceQL metrics as an experimental feature, starting with rate(); later releases added functions for duration quantiles and histograms. Together they let you compute request rate, error rate, and latency distributions directly from trace data. This is invaluable for services you do not control (third-party libraries, sidecars) that emit traces via auto-instrumentation but no Prometheus metrics.

Trace-to-Logs Correlation

The most valuable debugging workflow in a distributed system is trace → span → correlated log lines. This works because OTel propagates a traceID and spanID in every context, and your logging library injects them into each log record. In Grafana, a derived field in the Loki datasource converts the raw log field into a clickable link that opens the matching trace in Tempo:

# Grafana datasource config (provisioning/datasources/loki.yaml)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki        # referenced by tracesToLogsV2.datasourceUid in the Tempo datasource
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"([a-f0-9]+)"'
          url: "$${__value.raw}"
          datasourceUid: tempo   # links to the Tempo datasource
          urlDisplayLabel: "Open in Tempo"

  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        filterByTraceID: false   # not applied when customQuery is true; the query below filters by trace ID itself
        filterBySpanID: false
        customQuery: true
        query: '{service_name="$${__span.tags["service.name"]}"} | json | trace_id="$${__trace.traceId}"'
      serviceMap:
        datasourceUid: prometheus

On the application side, your logger must emit trace_id and span_id as structured fields. With OpenTelemetry's opentelemetry-instrumentation-logging (Python) or the OTel Log Bridge API (Java/Go), this happens automatically when you use the OTel context. If you are using a custom logger, extract the IDs manually:

// Go: inject traceID and spanID into every log record via slog
import (
    "context"
    "log/slog"

    "go.opentelemetry.io/otel/trace"
)

// logWithTrace logs msg at Info level and adds trace_id and span_id
// when ctx carries a valid span context.
func logWithTrace(ctx context.Context, msg string, attrs ...slog.Attr) {
    sc := trace.SpanContextFromContext(ctx)
    if sc.IsValid() {
        // Prepend the IDs so they appear before caller-supplied attributes.
        attrs = append([]slog.Attr{
            slog.String("trace_id", sc.TraceID().String()),
            slog.String("span_id", sc.SpanID().String()),
        }, attrs...)
    }
    slog.LogAttrs(ctx, slog.LevelInfo, msg, attrs...)
}

Trace-to-Metrics Correlation

Tempo's service graph feature, part of the metrics-generator component, derives Prometheus metrics (request rate, error rate, duration histograms) from span relationships, specifically by pairing each client span with the server span it calls. It exposes these as traces_service_graph_request_total, traces_service_graph_request_failed_total, and the traces_service_graph_request_server_seconds and traces_service_graph_request_client_seconds histograms. Wire these into Grafana's service map view and you get a live dependency graph of your entire system with RED metrics derived purely from traces, with no Prometheus instrumentation required on each service.

Retention mismatch kills post-incident review: A common production mistake is setting trace retention shorter than your incident review window. If you keep 3 days of traces but your postmortem runs on day 5, every trace from the incident is gone. At minimum, match trace retention to your SLA for completing P1 postmortems, typically 14–30 days. For Tempo on S3, storage is cheap enough that 30 days costs a fraction of what an ES cluster would cost for the same volume. For Jaeger on ES, configure an ILM policy that moves old indices to a warm/cold tier rather than deleting them immediately.

Choosing Between Jaeger and Tempo

For greenfield deployments in 2025, Tempo is often the default choice for teams running on Kubernetes with an existing Grafana stack: the object storage backend eliminates the operational burden of running Elasticsearch or Cassandra, and native Grafana integration makes trace/log/metric correlation seamless. Jaeger remains the right call when your organization already runs a large Elasticsearch cluster for other workloads (shared operational cost), when you need the mature Jaeger UI for teams unfamiliar with Grafana, or when you require Cassandra's write throughput at extreme scale (>100 k spans/s per node).

Both backends accept OTLP natively, so the instrumentation layer is identical regardless of your choice — and you can run both in parallel during a migration without touching application code.