Sampling Strategies
A production system processing 100,000 requests per second cannot afford to record every span for every request. A single trace for a typical microservices call might generate 30-50 spans; at full fidelity that is 3-5 million span records per second, before you even factor in storage, ingestion bandwidth, and the CPU overhead of serialization and export on every application pod. Sampling is how you make distributed tracing economically viable at scale without losing the traces that actually matter.
The fundamental tension in sampling is this: you want to reduce volume, but the traces you most need to keep — the slow ones, the errored ones, the statistically rare anomalies — are exactly the ones you cannot afford to drop. Getting this balance right separates organizations that extract real engineering value from their tracing infrastructure from those that just store a random fraction of their traffic and call it "observability."
Head-Based Sampling
Head-based sampling makes the keep-or-drop decision at the very start of a trace: at the root span, before any downstream processing has occurred. The decision is carried in the sampled flag of the W3C Trace Context traceparent header (vendor-specific sampling details can travel alongside it in tracestate) and propagated to every service in the call chain. Every service honors that flag: if the root span was sampled, all its children are recorded; if it was not, every service skips recording spans for that request.
The canonical form is probabilistic (rate) sampling: sample 1% or 5% of all incoming requests at random. This is trivially cheap (one comparison per request) and produces a statistically representative view of your traffic. Most SDKs ship it as a built-in sampler, such as OpenTelemetry's TraceIdRatioBased, although the OpenTelemetry SDK default records every trace until you configure a ratio.
The fatal flaw: a 1% sampler will drop 99% of your 500 errors and 99% of your p99.9 latency spikes. At 100k RPS with a 0.1% error rate (100 errors/sec), you keep on average only 1 error trace per second. That might be acceptable for trend analysis but is useless for debugging a specific production incident in real time.
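As a minimal sketch of the 1% head sampler described above, the fragment below selects a parent-based ratio sampler through the standard OpenTelemetry SDK environment variables. It assumes the service runs as a Kubernetes container, which is an illustration choice rather than anything the SDK requires; the same two variables can be set in any process environment.

```yaml
# Fragment of a Kubernetes container spec (assumed deployment target).
# OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG are the standard
# OpenTelemetry SDK environment variables for choosing a sampler.
env:
  - name: OTEL_TRACES_SAMPLER
    # Root spans: decide by trace-ID ratio.
    # Child spans: follow the sampled flag received from the parent.
    value: parentbased_traceidratio
  - name: OTEL_TRACES_SAMPLER_ARG
    # Ratio in [0.0, 1.0]; 0.01 keeps roughly 1% of new traces.
    value: "0.01"
```

Because the parent-based wrapper defers to the incoming flag, only the service that starts the trace applies the ratio; every downstream service inherits that decision.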
Tail-Based Sampling
Tail-based sampling buffers a complete trace in memory, waits for all spans to arrive, evaluates the full trace against a policy, and then decides whether to keep or drop it. Because the decision happens after you know the outcome, you can implement policies that actually capture what matters:
- Always sample errors: any trace containing a span with status.code = ERROR is kept at 100%.
- Latency threshold sampling: keep all traces where total duration exceeds a configured threshold (e.g., > 2s for a service whose SLO is 500ms p99).
- Rate limiting per route: keep at most N traces per second per endpoint, ensuring low-traffic endpoints are still represented.
- Composite policies: combine multiple rules, such as always-on for errors, latency-based for slow requests, and probabilistic for everything else.
The cost is architectural: tail sampling requires a stateful aggregation layer. All spans for a given traceId must be routed to the same collector instance — you cannot distribute them across a pool of stateless collectors, because no single node would have the complete trace to evaluate. This is implemented via trace-ID-based routing: a load-balancer in front of your collector tier hashes the traceId to a specific collector shard. Each shard holds a decision cache (typically an LRU in-memory store) and a buffer window (usually 30-60 seconds) for incomplete traces.
Configuring Tail Sampling in the OTel Collector
The OpenTelemetry Collector's tail_sampling processor (tailsamplingprocessor in the contrib distribution) implements this pattern. A production-representative configuration always keeps errors and traces slower than 1 second, rate-limits healthy traces to 10 per second, and uses a 30-second buffer window for late-arriving spans.
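A minimal sketch of such a configuration follows, assuming the tail_sampling processor from the Collector contrib distribution. The policy names, the num_traces value, and the traffic figures are illustrative assumptions. The rate_limiting policy budgets spans per second rather than traces, so the 10 traces/sec target is approximated here as 400 spans/sec under an assumed average of 40 spans per trace.

```yaml
processors:
  tail_sampling:
    # Buffer each trace for 30s before evaluating the policies.
    decision_wait: 30s
    # Maximum traces held in memory (illustrative; size it to at least
    # new traces per second multiplied by decision_wait).
    num_traces: 200000
    # Sizing hint for internal data structures (assumed traffic level).
    expected_new_traces_per_sec: 5000
    policies:
      # Top-level policies are evaluated independently; a trace is kept
      # when any one of them decides to sample it.
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: rate-limit-rest
        type: rate_limiting
        rate_limiting:
          # Approximates 10 traces/sec at an assumed 40 spans per trace.
          spans_per_second: 400
```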
Set decision_wait to at least 2x the p99 of your longest inter-service call. If a database query occasionally takes 20 seconds, a 30-second window ensures the DB span arrives before the decision is made. Spans arriving after the decision window are dropped or re-evaluated depending on collector version, so always monitor the otelcol_processor_tail_sampling_late_span_goes_to_new_decision metric.
The Load-Balancing Exporter: Routing by Trace ID
Because tail sampling requires all spans of a trace on one shard, you need a load-balancer layer that routes by traceId, not by connection or round-robin. The OTel Collector's loadbalancingexporter does exactly this — it sits in a "router" tier that receives spans from all services and forwards them to a fixed pool of "sampler" collector instances via consistent hashing on the trace ID.
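The router tier's exporter might be configured roughly as follows. This is a sketch, assuming the loadbalancing exporter from the contrib distribution and a DNS resolver; the hostname is a hypothetical headless Service, and the sampler pods are assumed to listen on the default OTLP gRPC port.

```yaml
exporters:
  loadbalancing:
    # Hash on trace ID so every span of a trace reaches the same backend.
    routing_key: traceID
    protocol:
      otlp:
        tls:
          # Assumes plaintext traffic inside the cluster; enable TLS otherwise.
          insecure: true
    resolver:
      dns:
        # Hypothetical headless Service resolving to all sampler pods.
        hostname: otel-sampler-headless.observability.svc.cluster.local
```

When the set of resolved backends changes, the hash ring is rebuilt and some trace IDs move to a different sampler, so scaling events during a decision window can split traces in the same way a restart does.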
Sampling at Scale: Production Failure Modes
Several failure modes bite teams that only tested tail sampling on small loads:
- Buffer overflow: If num_traces is too small and a traffic spike arrives, the collector evicts the oldest buffered traces before making a decision, silently dropping the very traces that might reveal the root cause of the spike. Monitor otelcol_processor_tail_sampling_sampling_decision_timer_fired and dropped_spans.
- Shard hotspots: Consistent hashing distributes trace IDs uniformly in theory, but a service that generates very high span counts per trace (e.g., an N+1 query producing 5,000 DB spans per request) can overwhelm a single shard. Apply per-trace span limits in your SDK (SpanLimits) or add a filter processor upstream.
- Memory pressure: Each buffered span occupies heap on the collector. At 5,000 TPS with 40 spans/trace and a 30s window, that is 6 million spans in memory. Size collector pods accordingly and set GOGC / memory limits conservatively with an HPA to scale horizontally.
- Collector restarts lose in-flight decisions: A rolling restart of sampler pods mid-decision-window produces split-brain: some spans for the same trace go to the old pod (evicted at shutdown) and some to the new pod (no decision context). This is unavoidable, so design your SLO around it and use a short decision_wait to minimize the window.
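The buffer and memory figures above can be turned into a sizing sketch. The arithmetic uses the numbers from the memory-pressure bullet; the 1 KiB average span size and the resulting limits are assumptions for illustration and should be replaced with measurements from your own collectors.

```yaml
# Sizing arithmetic (per sampler tier, before sharding):
#   5,000 traces/s x 30 s decision_wait  = 150,000 traces buffered
#   150,000 traces x 40 spans per trace  = 6,000,000 spans buffered
#   6,000,000 spans x ~1 KiB (assumed)   = roughly 5.7 GiB of span data
processors:
  # memory_limiter should be the first processor in the pipeline so the
  # collector refuses new data before it runs out of memory.
  memory_limiter:
    check_interval: 1s
    # Illustrative values: about 2x the estimated buffer, for headroom.
    limit_mib: 12000
    spike_limit_mib: 2400
  tail_sampling:
    decision_wait: 30s
    # 150,000 expected concurrent traces plus headroom for spikes.
    num_traces: 200000
    # policies: as configured for your sampling rules
```

Dividing these totals by the number of sampler shards gives a per-pod starting point, which an HPA can then adjust.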
Hybrid Strategy: Head + Tail in Practice
Top-tier engineering organizations do not choose one or the other — they layer both. A common production pattern at scale:
- Client-side head sampling at the SDK: drop health-check endpoints (/healthz, /readyz) entirely. These generate enormous span volume with zero diagnostic value. Use a ParentBased sampler so downstream services honor the upstream decision.
- Probabilistic head sampling in the router-tier collector at 20-50%: this reduces the data the sampler tier must buffer, lowering memory requirements.
- Tail-based policy evaluation in the sampler tier: keep 100% of errors and slow traces from the pre-filtered 20-50%, and apply rate limiting on the remainder.
The result: total trace volume sent to your backend might be 0.5-2% of raw request volume, yet that 0.5-2% contains every error and latency outlier that survived the router-tier sampling. That is 20-50% of all such traces, compared with the 0.5-2% a pure head sampler with the same output volume would keep. This is the approach used by large organizations where full fidelity would cost millions of dollars per month in storage alone.
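The router tier of this layered setup could be wired together roughly as shown below. It is a sketch under stated assumptions: the probabilistic_sampler processor and loadbalancing exporter come from the contrib distribution, the 25% rate is one point in the 20-50% range above, and the hostname and hash_seed value are hypothetical.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  probabilistic_sampler:
    # Keep about 25% of traces. The decision is derived from a hash of the
    # trace ID, so all spans of a trace receive the same verdict.
    sampling_percentage: 25
    # Use the same seed on every router instance so they agree per trace.
    hash_seed: 22

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true   # assumes in-cluster plaintext traffic
    resolver:
      dns:
        # Hypothetical headless Service for the sampler tier.
        hostname: otel-sampler-headless.observability.svc.cluster.local

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [loadbalancing]
```

The sampler tier behind this router then runs the tail-based policies on the surviving traces only, which is what keeps its buffer memory proportional to the pre-filtered volume.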