Why Distributed Tracing?
You have Prometheus metrics. You have structured logs in Loki or Elasticsearch. Your Grafana dashboards are green, your SLOs are within budget, and yet — your senior engineers are spending hours in post-mortems tracing a 400ms latency spike that affected 0.3% of requests last Tuesday afternoon. The metrics showed a small blip. The logs showed some slow database queries. But nobody could answer the simple question: which service actually caused the slowdown, and why did only those requests hit it?
This is the problem distributed tracing was built to solve. It is not a replacement for metrics or logs — it is the third pillar that lets you ask a fundamentally different class of questions: what happened to this specific request as it traveled across your system?
The Latency Attribution Problem
In a microservices architecture, a single user-facing request can fan out into dozens of downstream calls. A checkout request might hit an API gateway, an auth service, a cart service, a pricing service, a payment service, a fraud detection service, and an order service, each potentially calling its own databases, caches, or third-party APIs. When those calls run one after another, the end-to-end latency the user experiences is the sum of all the hops plus the network time between them; when some run in parallel, it is the sum along the slowest (critical) path.
Metrics give you aggregates: "the checkout service p99 latency is 380ms." But which 380ms? Is it 200ms in the payment service, 100ms in the fraud check, and 80ms in everything else? Or is it 350ms in the cart service on a cold cache miss? These are completely different problems with completely different solutions, and an aggregate metric cannot tell you which one you are dealing with.
Logs tell you what happened inside each service, but stitching together the story of one request across ten services from ten separate log streams — each with its own timestamp skew, its own log format, its own sampling rate — is genuinely painful at scale. It requires a human to manually correlate entries by request ID, often across different UI screens or grep pipelines.
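To make the manual correlation step concrete, here is a minimal sketch of what "stitching by request ID" amounts to. The service names, log fields (`ts`, `request_id`, `msg`), and values are hypothetical, and the sketch assumes every service logs JSON with the same request ID and roughly synchronized clocks, which is the optimistic case.

```python
import json
from collections import defaultdict

# Hypothetical log lines from three services; field names are assumptions.
RAW_LOGS = {
    "gateway": [
        '{"ts": 1000, "request_id": "req-1", "msg": "received POST /checkout"}',
        '{"ts": 1004, "request_id": "req-2", "msg": "received POST /checkout"}',
    ],
    "cart": [
        '{"ts": 1012, "request_id": "req-1", "msg": "loaded cart"}',
    ],
    "payment": [
        '{"ts": 1390, "request_id": "req-1", "msg": "charge authorized"}',
        '{"ts": 1051, "request_id": "req-2", "msg": "charge authorized"}',
    ],
}


def correlate(raw_logs):
    """Group log entries from every stream by request_id, ordered by timestamp."""
    by_request = defaultdict(list)
    for service, lines in raw_logs.items():
        for line in lines:
            entry = json.loads(line)
            by_request[entry["request_id"]].append((entry["ts"], service, entry["msg"]))
    return {rid: sorted(entries) for rid, entries in by_request.items()}


timeline = correlate(RAW_LOGS)
for ts, service, msg in timeline["req-1"]:
    print(ts, service, msg)
```

Even in this tidy case the result is only an ordered list of events. The gap between the cart and payment entries for `req-1` hints that time was lost somewhere, but nothing in the logs says which call was waiting on which.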
Traces vs Metrics vs Logs
Understanding when to reach for each signal type is a core SRE skill. They are not interchangeable — each has a distinct strength and a distinct cost profile.
Metrics are pre-aggregated numeric measurements sampled or counted over time windows. They are cheap to store (a counter is a few bytes), fast to query (time-series databases are optimized for range queries), and excellent for alerting on known conditions. Their weakness is that labels must stay low-cardinality: a Prometheus counter has labels, but you cannot add a user_id label to a counter that fires millions of times per second without creating a separate time series for every user, which inflates memory use and slows queries. Metrics answer: is something wrong, and how widespread is it?
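The cardinality problem is easy to quantify. The sketch below computes the worst-case number of time series for one counter as the product of its label cardinalities. The label names and counts are illustrative assumptions, not measurements from a real system.

```python
from math import prod

# Hypothetical label sets for one counter; the counts are illustrative.
base_labels = {"service": 20, "endpoint": 15, "status_code": 8}
with_user_id = {**base_labels, "user_id": 1_000_000}


def worst_case_series(label_cardinalities):
    """Upper bound on time series: one per combination of label values."""
    return prod(label_cardinalities.values())


print(worst_case_series(base_labels))
print(worst_case_series(with_user_id))
```

Under these assumptions the counter goes from at most 2,400 series to at most 2.4 billion once `user_id` is added. Real systems rarely hit the full product, but the growth is multiplicative either way.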
Logs are immutable event records emitted at arbitrary points in code. They carry unlimited context — any key/value you want to add. Modern structured logging (JSON lines, logfmt) has made logs queryable, and tools like Loki or Datadog Logs let you filter and aggregate across high-cardinality fields. Their weakness is cost and correlation: at high request rates, storing and indexing every log line is expensive, and correlating logs from multiple services for a single request requires an explicit shared identifier.
Traces are causally linked records of a request's journey. A trace is a tree of spans, where each span represents a unit of work in one service. Spans carry timing data, status, metadata, and a reference to their parent span. Traces excel at latency attribution (which service took how long), dependency mapping (which services call which), and request-level debugging. Their weakness is data volume: storing every span for every request at high throughput is expensive, which is why sampling strategies (covered in lesson 7) are essential.
Anatomy of a Trace: The Waterfall Diagram
A trace is usually visualized as a waterfall chart, a Gantt-style timeline (some tools also offer a flame-graph view of the same data). The horizontal axis is time. Each row is one span, that is, one unit of work in one service. Spans are nested to show causality: if span B was initiated by span A, span B is indented under A, and its time range normally falls within A's time range (asynchronous work can outlive its parent).
Every span carries a standard set of fields:
- trace_id — the 128-bit identifier shared by every span in the same trace.
- span_id — a 64-bit identifier unique to this span.
- parent_span_id — the span_id of the span that initiated this one (empty for the root span).
- name — a human-readable operation name, e.g. cart.GetItems or db.query.
- start_time and end_time — nanosecond-precision timestamps.
- status — OK, ERROR, or UNSET.
- attributes — key/value pairs with arbitrary context: http.method, db.statement, user.id, cart.item_count.
- events — timestamped annotations within the span lifetime (e.g. "cache miss", "retry attempt 2").
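The fields above map naturally onto a small data structure, and the waterfall is just a depth-first walk over the parent links. The following is a minimal sketch, not the OpenTelemetry data model: the `Span` class, the `render_waterfall` helper, the short span IDs, and the timings are all hypothetical, and times are in milliseconds rather than nanoseconds to keep the output readable.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]   # None for the root span
    name: str
    start_ms: int                   # milliseconds here; real spans use nanoseconds
    end_ms: int
    status: str = "UNSET"
    attributes: dict = field(default_factory=dict)


def render_waterfall(spans, width=40):
    """Return one text row per span, indented by depth and positioned by time."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_span_id, []).append(s)
    root = children[None][0]
    total = root.end_ms - root.start_ms
    rows = []

    def walk(span, depth):
        duration = span.end_ms - span.start_ms
        offset = (span.start_ms - root.start_ms) * width // total
        length = max(1, duration * width // total)
        label = ("  " * depth + span.name).ljust(24)
        rows.append(f"{label}|{' ' * offset}{'#' * length} {duration}ms")
        for child in sorted(children.get(span.span_id, []), key=lambda c: c.start_ms):
            walk(child, depth + 1)

    walk(root, 0)
    return rows


TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"
spans = [
    Span(TRACE_ID, "a1", None, "POST /checkout", 0, 400, "OK"),
    Span(TRACE_ID, "b2", "a1", "auth.Verify", 5, 25, "OK"),
    Span(TRACE_ID, "c3", "a1", "cart.GetItems", 30, 70, "OK", {"cart.item_count": 3}),
    Span(TRACE_ID, "d4", "c3", "db.query", 35, 65, "OK"),
    Span(TRACE_ID, "e5", "a1", "fraud.Check", 75, 350, "ERROR"),
]
print("\n".join(render_waterfall(spans)))
```

Printing the rows gives a rough text waterfall in which the `fraud.Check` bar dominates the timeline, which is the same visual cue a tracing UI provides.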
Why Metrics and Logs Cannot Do This Alone
Here is the scenario that convinced every major tech company to invest heavily in distributed tracing. You have a p99 latency regression — checkout went from 120ms to 400ms overnight. Your metrics show the regression clearly. You know it is real. Now what?
With metrics alone: you check the downstream service dashboards one by one. Auth looks fine. Cart looks fine. Fraud does not: its p99 is 280ms, up from 95ms. But you only notice this after 25 minutes of dashboard hunting, and you still do not know whether fraud is the culprit or is itself a victim of a database slowdown.
With logs alone: you grep for requests with high latency, find a trace ID in the logs, open four different log search interfaces to find all log lines with that trace ID, and manually calculate the time gaps between them. This is possible, but it takes 45 minutes and requires that every service actually logged the trace ID (which they often do not).
With traces: you open the tracing UI, filter by service=checkout AND duration>200ms, click one trace, and see the waterfall immediately. The fraud-check span is red and 275ms wide. You click it, read its attributes: fraud.provider=acme-fraud-api, http.url=https://api.acmefraud.com/v2/check. You check the fraud provider status page — they had a degradation starting at 23:47 last night. Total investigation time: 4 minutes.
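The trace workflow above boils down to two lookups: find traces whose root span is slow, then find the widest span inside one of them. Real backends expose this through their own query interfaces; the sketch below only mimics the logic over an in-memory list. The span dictionaries, field names, and the `slow_traces` and `slowest_child` helpers are hypothetical.

```python
# Hypothetical spans as a backend might return them; field names are assumptions.
SPANS = [
    {"trace_id": "t1", "span_id": "a", "parent": None, "service": "checkout",
     "name": "POST /checkout", "duration_ms": 118},
    {"trace_id": "t1", "span_id": "b", "parent": "a", "service": "fraud",
     "name": "fraud.Check", "duration_ms": 90},
    {"trace_id": "t2", "span_id": "c", "parent": None, "service": "checkout",
     "name": "POST /checkout", "duration_ms": 402},
    {"trace_id": "t2", "span_id": "d", "parent": "c", "service": "cart",
     "name": "cart.GetItems", "duration_ms": 40},
    {"trace_id": "t2", "span_id": "e", "parent": "c", "service": "fraud",
     "name": "fraud.Check", "duration_ms": 275},
]


def slow_traces(spans, service, min_duration_ms):
    """Trace IDs whose root span belongs to `service` and exceeds the threshold."""
    return [s["trace_id"] for s in spans
            if s["parent"] is None
            and s["service"] == service
            and s["duration_ms"] > min_duration_ms]


def slowest_child(spans, trace_id):
    """The non-root span with the largest duration in the given trace."""
    kids = [s for s in spans if s["trace_id"] == trace_id and s["parent"] is not None]
    return max(kids, key=lambda s: s["duration_ms"])


for tid in slow_traces(SPANS, "checkout", 200):
    culprit = slowest_child(SPANS, tid)
    print(tid, culprit["name"], culprit["duration_ms"])
```

Picking the longest child is a simplification: with deeper trees you would compare each span's self time (its duration minus its children's) to avoid blaming a parent for time spent further down.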
Context Propagation: The Glue That Holds It Together
For a trace to work across service boundaries, each service must pass the trace context to the next one. When service A calls service B over HTTP, it injects the trace ID and span ID into HTTP headers. When service A calls service B over gRPC, it injects them into gRPC metadata. When service A publishes a message to Kafka, it injects them into the message headers. Service B extracts the context on receipt, creates a new child span, and continues the trace.
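The standard header format (defined by the W3C Trace Context specification, adopted by OpenTelemetry) is the traceparent header, which carries four dash-separated hex fields: version, trace-id, parent-id, and trace-flags. For example:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01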
Context propagation is what transforms a collection of per-service spans into a coherent distributed trace. Without it, you have disconnected local measurements. With it, you have a complete causal graph of a request's journey across your entire system.
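The inject and extract steps can be sketched in a few lines. In the W3C `traceparent` header the four fields are a 2-hex-digit version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and 2 hex digits of flags whose lowest bit marks the trace as sampled. The helper names below (`inject`, `extract`, `start_child_span`) are hypothetical, and the sketch accepts only version `00` and a lowercase header key, which is stricter than the specification. In practice an OpenTelemetry SDK's propagator does this work for you.

```python
import re
import secrets

# W3C Trace Context: version "00", 32 hex trace-id, 16 hex parent-id, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled=True):
    """Caller side: write the current span's identity into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Callee side: parse traceparent; return None if absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def start_child_span(context):
    """Keep the trace_id, mint a new 64-bit span_id, and record the parent."""
    return {
        "trace_id": context["trace_id"],
        "span_id": secrets.token_hex(8),
        "parent_span_id": context["parent_span_id"],
    }


# Service A injects; service B extracts and continues the trace.
outgoing = inject({}, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
child = start_child_span(extract(outgoing))
print(outgoing["traceparent"])
print(child)
```

The child span keeps the caller's trace ID and records the caller's span ID as its parent, which is all a backend needs to reassemble the tree. The same two functions apply to gRPC metadata or Kafka message headers; only the carrier changes.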
The Production Case: When Tracing Pays for Itself
Distributed tracing has real costs: instrumentation work, collector infrastructure, storage for spans. At high throughput (10,000+ requests per second), storing every span naively is prohibitive. But the ROI calculation is straightforward for any organization running microservices at scale.
Consider: a p99 latency regression that degrades checkout completion rate by 2% at a company processing $10M/day. Assuming revenue is spread evenly across the day, every hour that regression persists costs roughly $8.3K in lost revenue ($10M × 2% ÷ 24). If traces cut the time-to-diagnosis from 3 hours (manual log correlation) to 10 minutes (trace waterfall), that is 2 hours and 50 minutes saved per incident, or approximately $24K. The cost of running a Jaeger or Tempo cluster with 10% head-based sampling is a few thousand dollars per month, so one such incident pays for several months of tracing infrastructure. The math is not close.
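Working the figures through from the stated inputs makes the estimate easy to adapt to your own numbers. The sketch below uses only the scenario's inputs ($10M/day, a 2% drop, 3 hours versus 10 minutes) plus one added assumption, flagged in the code: revenue is spread evenly over 24 hours.

```python
daily_revenue = 10_000_000        # dollars per day, from the scenario above
completion_drop = 0.02            # 2% of checkouts lost during the regression
hours_per_day = 24                # assumption: revenue is spread evenly over the day

loss_per_hour = daily_revenue * completion_drop / hours_per_day

manual_diagnosis_h = 3.0          # manual log correlation
trace_diagnosis_h = 10 / 60       # trace waterfall
hours_saved = manual_diagnosis_h - trace_diagnosis_h

saved_per_incident = loss_per_hour * hours_saved

print(f"loss per hour:      ${loss_per_hour:,.0f}")
print(f"hours saved:        {hours_saved:.2f}")
print(f"saved per incident: ${saved_per_incident:,.0f}")
```

With these inputs the loss comes to about $8.3K per hour and the saving to about $23.6K per incident. If your traffic peaks sharply during business hours, the hourly loss during a peak-time incident would be correspondingly higher.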
This is why distributed tracing went from an internal system at Google (described in the Dapper paper, 2010) to a table-stakes requirement at any company operating microservices at scale. It is also why OpenTelemetry, the vendor-neutral standard for emitting traces, metrics, and logs, was created: to prevent the vendor lock-in of proprietary tracing SDKs. The remaining lessons in this tutorial build out the full OpenTelemetry stack, from instrumentation to backends to sampling to production debugging workflows.