Distributed Tracing & OpenTelemetry

Sampling Strategies

A production system processing 100,000 requests per second cannot afford to record every span for every request. A single trace for a typical microservices call might generate 30-50 spans; at full fidelity that is 3-5 million span records per second, before you even factor in storage, ingestion bandwidth, and the CPU overhead of serialization and export on every application pod. Sampling is how you make distributed tracing economically viable at scale without losing the traces that actually matter.
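The arithmetic behind those figures is easy to reproduce. The sketch below only multiplies the numbers quoted above; the function name is illustrative.

```python
def spans_per_second(requests_per_second: int, spans_per_trace: int) -> int:
    """Span records produced per second when every request is traced in full."""
    return requests_per_second * spans_per_trace


low = spans_per_second(100_000, 30)
high = spans_per_second(100_000, 50)
print(f"{low:,} to {high:,} spans/sec")  # 3,000,000 to 5,000,000 spans/sec
```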

The fundamental tension in sampling is this: you want to reduce volume, but the traces you most need to keep — the slow ones, the errored ones, the statistically rare anomalies — are exactly the ones you cannot afford to drop. Getting this balance right separates organizations that extract real engineering value from their tracing infrastructure from those that just store a random fraction of their traffic and call it "observability."

Head-Based Sampling

Head-based sampling makes the keep-or-drop decision at the very start of a trace: at the root span, before any downstream processing has occurred. The decision is encoded in the sampled flag (the low bit of the trace-flags field) of the W3C Trace Context traceparent header and propagated to every service in the call chain. Services configured with a parent-based sampler honor that flag: if the root span was sampled, all its children are recorded; if it was not, downstream services still propagate the context but create non-recording spans and export nothing for that request.
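To make the propagation concrete, here is a minimal sketch of reading the sampled flag from a version-00 traceparent header value, where it is bit 0 of the two-hex-digit trace-flags field. The TraceParent type and parse_traceparent function are illustrative names, not part of any SDK, and the validation is deliberately minimal.

```python
from typing import NamedTuple


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> TraceParent:
    """Parse a version-00 W3C traceparent value: version-traceid-parentid-flags."""
    parts = header.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dash-separated fields, got {len(parts)}")
    version, trace_id, parent_id, flags = parts
    if len(trace_id) != 32 or len(parent_id) != 16 or len(flags) != 2:
        raise ValueError("unexpected field length")
    # Bit 0 of trace-flags is the sampled flag.
    sampled = bool(int(flags, 16) & 0x01)
    return TraceParent(version, trace_id, parent_id, sampled)


example = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(example.sampled)  # True
```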

The canonical form is probabilistic (ratio) sampling: keep 1% or 5% of all incoming requests at random. This is trivially cheap, one comparison per request, and produces a statistically representative view of your traffic. Note that the OpenTelemetry SDKs default to ParentBased(AlwaysOn), which records every trace, so a ratio-based sampler is usually the first sampler teams configure explicitly.
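A ratio sampler is often made deterministic on the trace ID rather than drawing a fresh random number, so any service given the same ID and ratio reaches the same verdict. The following is a sketch of that idea, not the actual code of any SDK; should_sample and new_trace_id are hypothetical names.

```python
import secrets

TRACE_ID_LIMIT = (1 << 64) - 1  # mask selecting the low 64 bits of a 128-bit trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_LIMIT) < bound


def new_trace_id() -> int:
    """Generate a random 128-bit trace ID."""
    return secrets.randbits(128)


kept = sum(should_sample(new_trace_id(), 0.01) for _ in range(100_000))
print(f"kept {kept} of 100000 traces")  # close to 1000, varies per run
```

Because the decision depends only on the ID, re-evaluating the same trace never flips the outcome, which is what keeps a trace from being half recorded.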

The fatal flaw: a 1% sampler drops, on average, 99% of your 500 errors and 99% of your p99.9 latency spikes. At 100k RPS with a 0.1% error rate (100 errors/sec), you statistically see only 1 error trace per second. That might be acceptable for trend analysis but is useless for debugging a specific production incident in real time.
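The problem is sharper for a specific bug that fires only a handful of times during an incident. Assuming each occurrence is sampled independently at the head rate, the chance of holding at least one trace of it is 1 - (1 - rate)^n. The function name below is illustrative.

```python
def p_at_least_one(occurrences: int, sample_rate: float) -> float:
    """Probability that at least one of `occurrences` independent events is sampled."""
    return 1.0 - (1.0 - sample_rate) ** occurrences


for n in (1, 10, 100):
    print(n, round(p_at_least_one(n, 0.01), 3))
# 1 0.01
# 10 0.096
# 100 0.634
```

At a 1% head rate, a failure that occurs ten times leaves you with no trace at all roughly nine times out of ten.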

Key idea: Head-based sampling is cheap and easy to implement but is fundamentally uninformed — it decides fate before any outcome is known. Use it as a baseline noise floor, never as your only strategy in a production SRE context.

Tail-Based Sampling

Tail-based sampling buffers a complete trace in memory, waits for all spans to arrive, evaluates the full trace against a policy, and then decides whether to keep or drop it. Because the decision happens after you know the outcome, you can implement policies that actually capture what matters:

Always sample errors — any trace containing a span with status.code = ERROR is kept at 100%.
Latency threshold sampling — keep all traces where total duration exceeds a configured threshold (e.g., > 2s for a service whose SLO is 500ms p99).
Rate limiting per route — keep at most N traces per second per endpoint, ensuring low-traffic endpoints are still represented.
Composite policies — combine multiple rules: always-on for errors, latency-based for slow requests, probabilistic for everything else.
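The policies above can be sketched as plain predicates over a completed trace, combined with a logical OR. This is a simplified model, not the Collector's implementation: the Span fields, policy names, and thresholds are assumptions for illustration, and stateful rate limiting is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Span:
    trace_id: str          # 32 hex characters
    start_ms: int
    end_ms: int
    is_error: bool = False


Trace = List[Span]
Policy = Callable[[Trace], bool]


def has_error(trace: Trace) -> bool:
    """Always-sample-errors: any span with an error status keeps the trace."""
    return any(span.is_error for span in trace)


def slower_than(threshold_ms: int) -> Policy:
    """Latency policy: keep traces whose end-to-end duration exceeds the threshold."""
    def policy(trace: Trace) -> bool:
        duration = max(s.end_ms for s in trace) - min(s.start_ms for s in trace)
        return duration > threshold_ms
    return policy


def probabilistic(percentage: float) -> Policy:
    """Baseline policy: keep a fixed percentage, deterministic on the trace ID."""
    def policy(trace: Trace) -> bool:
        bucket = int(trace[0].trace_id[-8:], 16) % 10_000
        return bucket < percentage * 100
    return policy


def decide(trace: Trace, policies: List[Policy]) -> bool:
    """Composite decision: keep the trace if any policy votes to keep it."""
    return bool(trace) and any(policy(trace) for policy in policies)


policies = [has_error, slower_than(2000), probabilistic(2.0)]
prefix = "0" * 24
fast_ok = [Span(prefix + "0000270f", 0, 120)]                 # bucket 9999: dropped
failed = [Span(prefix + "0000270f", 0, 80, is_error=True)]    # kept by has_error
slow = [Span(prefix + "0000270f", 0, 300), Span(prefix + "0000270f", 250, 2500)]
print(decide(fast_ok, policies), decide(failed, policies), decide(slow, policies))
# False True True
```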

The cost is architectural: tail sampling requires a stateful aggregation layer. All spans for a given traceId must be routed to the same collector instance — you cannot distribute them across a pool of stateless collectors, because no single node would have the complete trace to evaluate. This is implemented via trace-ID-based routing: a load-balancer in front of your collector tier hashes the traceId to a specific collector shard. Each shard holds a decision cache (typically an LRU in-memory store) and a buffer window (usually 30-60 seconds) for incomplete traces.
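The buffering behavior of a single shard can be modeled in a few lines. The sketch below is a toy model under stated assumptions (a monotonic clock passed in by the caller, spans represented as dicts); TraceBuffer and its methods are hypothetical names, not Collector APIs. It shows both the decision window and what happens when the buffer is too small.

```python
from collections import OrderedDict
from typing import Dict, List


class TraceBuffer:
    """Holds spans per trace until the decision window for that trace has elapsed."""

    def __init__(self, decision_wait_s: float, max_traces: int):
        self.decision_wait_s = decision_wait_s
        self.max_traces = max_traces
        # trace_id -> (arrival time of first span, spans); ordered by first arrival
        self._traces = OrderedDict()
        self.evicted_undecided = 0

    def add(self, trace_id: str, span: dict, now: float) -> None:
        if trace_id not in self._traces:
            if len(self._traces) >= self.max_traces:
                # Buffer full: the oldest trace is dropped before any decision is made.
                self._traces.popitem(last=False)
                self.evicted_undecided += 1
            self._traces[trace_id] = (now, [])
        self._traces[trace_id][1].append(span)

    def pop_ready(self, now: float) -> Dict[str, List[dict]]:
        """Return and remove traces whose first span arrived at least decision_wait_s ago."""
        ready = {}
        while self._traces:
            trace_id, (first_seen, spans) = next(iter(self._traces.items()))
            if now - first_seen < self.decision_wait_s:
                break
            ready[trace_id] = spans
            del self._traces[trace_id]
        return ready


buf = TraceBuffer(decision_wait_s=30, max_traces=2)
buf.add("a", {"name": "GET /checkout"}, now=0.0)
buf.add("a", {"name": "SELECT orders"}, now=0.4)
buf.add("b", {"name": "GET /cart"}, now=10.0)
early = buf.pop_ready(now=29.0)   # nothing is 30s old yet
ready = buf.pop_ready(now=31.0)   # trace "a" has waited long enough
buf.add("c", {"name": "GET /search"}, now=32.0)
buf.add("d", {"name": "GET /login"}, now=33.0)  # buffer full: "b" is evicted undecided
print(list(early), list(ready), buf.evicted_undecided)  # [] ['a'] 1
```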

Head-based sampling decides at the entry point (blind); tail-based sampling buffers all spans and decides after the full trace is known.

Configuring Tail Sampling in the OTel Collector

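The OpenTelemetry Collector's tail_sampling processor (shipped in the contrib distribution) implements this pattern. Below is a production-representative configuration: always keep errors and traces slower than 1 second, always keep low-volume admin and internal routes, keep a 2% probabilistic baseline, rate-limit non-health-check traffic to 200 spans per second, and use a 30-second buffer window for late-arriving spans. Policies are evaluated independently, and in general a trace is kept if any policy votes to sample it.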

processors:
  tail_sampling:
    decision_wait: 30s          # buffer window — wait this long for all spans
    num_traces: 200000          # in-memory trace buffer size (tune for your traffic)
    expected_new_traces_per_sec: 5000
    policies:
      - name: always-sample-errors
        type: status_code
        status_code: { status_codes: [ERROR] }

      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }

      - name: low-volume-endpoints
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["^/admin/.*", "^/internal/.*"]  # regular expressions, not globs
          enabled_regex_matching: true

      - name: probabilistic-baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 2 }  # 2% of all traces; policies are OR-ed, not chained

      - name: composite-policy
        type: and
        and:
          and_sub_policy:
            - name: not-health-check
              type: string_attribute
              string_attribute:
                key: http.route
                values: ["/healthz", "/readyz"]
                invert_match: true
            - name: service-rate-limit
              type: rate_limiting
              rate_limiting: { spans_per_second: 200 }

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
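      exporters: [otlp]  # Jaeger ingests OTLP natively; the dedicated jaeger exporter was removed from recent Collector releases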

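Pro tip: Set decision_wait to at least 2x the p99 of your longest inter-service call. If a database query occasionally takes 15 seconds, a 30-second window ensures the DB span arrives before the decision is made. How spans that arrive after the decision are handled (dropped, attached to a cached decision, or treated as a new trace) depends on the collector version and configuration, so monitor the processor's late-span metrics. In recent contrib releases these include otelcol_processor_tail_sampling_sampling_late_span_age, but check the exact names against the metrics your collector version exposes.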

The Load-Balancing Exporter: Routing by Trace ID

Because tail sampling requires all spans of a trace on one shard, you need a load-balancing layer that routes by traceId, not by connection or round-robin. The OTel Collector's loadbalancing exporter does exactly this: it sits in a "router" tier that receives spans from all services and forwards them to a pool of "sampler" collector instances via consistent hashing on the trace ID.

# collector-router.yaml  (the front tier — one per AZ, stateless)
exporters:
  loadbalancing:
    routing_key: traceID       # consistent hash on traceId
    protocol:
      otlp:
        timeout: 1s
        tls: { insecure: false }
    resolver:
      dns:
        hostname: otelcol-sampler.observability.svc.cluster.local
        port: 4317
        interval: 5s           # re-resolve on pod scaling

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
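Consistent hashing matters here because the sampler pool changes size as pods scale. The sketch below is a generic hash ring with virtual nodes, not the exporter's actual algorithm; the backend names, vnode count, and HashRing class are assumptions for illustration. It shows that adding a fourth backend moves only a fraction of trace IDs, and only onto the new backend.

```python
import bisect
import hashlib
from typing import List


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, backends: List[str], vnodes: int = 100):
        if not backends:
            raise ValueError("at least one backend is required")
        # Each backend owns `vnodes` points on the ring to smooth the distribution.
        self._ring = sorted(
            (_hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def backend_for(self, trace_id: str) -> str:
        """Route to the first ring point clockwise from the trace ID's hash."""
        idx = bisect.bisect_right(self._points, _hash(trace_id)) % len(self._ring)
        return self._ring[idx][1]


trace_ids = [f"{i:032x}" for i in range(10_000)]
before = HashRing(["sampler-0", "sampler-1", "sampler-2"])
after = HashRing(["sampler-0", "sampler-1", "sampler-2", "sampler-3"])
moved = [t for t in trace_ids if before.backend_for(t) != after.backend_for(t)]
print(f"{len(moved) / len(trace_ids):.0%} of trace IDs changed shard")
```

With plain modulo hashing, the same scale-up would reassign most trace IDs, splitting far more in-flight traces across shards.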

Sampling at Scale: Production Failure Modes

Several failure modes bite teams that only tested tail sampling on small loads:

Buffer overflow: If num_traces is too small and a traffic spike arrives, the collector evicts the oldest buffered traces before making a decision, silently dropping the very traces that might reveal the root cause of the spike. Monitor the processor's own telemetry for this. In recent contrib releases the relevant counter is otelcol_processor_tail_sampling_sampling_trace_dropped_too_early, but verify the name against your collector version.
Shard hotspots: Consistent hashing distributes trace IDs uniformly in theory, but a service that generates very high span counts per trace (e.g., an N+1 query producing 5,000 DB spans per request) can overwhelm a single shard. Apply per-trace span limits in your SDK (SpanLimits) or add a filter processor upstream.
Memory pressure: Each buffered span occupies heap on the collector. At 5,000 traces per second with 40 spans/trace and a 30s window, that is 6 million spans in memory across the sampler tier. Size collector pods accordingly, set memory limits conservatively (GOGC and GOMEMLIMIT for the Go runtime, plus the memory_limiter processor), and scale the tier horizontally, for example with an HPA.
Collector restarts lose in-flight decisions: A rolling restart of sampler pods mid-decision-window produces split-brain: some spans for the same trace go to the old pod (evicted at shutdown) and some to the new pod (no decision context). This is unavoidable — design your SLO around it and use a short decision_wait to minimize the window.
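A back-of-the-envelope sizing check helps catch the buffer-overflow and memory-pressure cases before production does. The sketch below uses the traffic figures from the memory example and the num_traces value from the configuration shown earlier; the 1 KiB per span figure is purely an assumption and should be replaced with a measurement from your own spans.

```python
ASSUMED_BYTES_PER_SPAN = 1024  # hypothetical average; measure your real spans


def buffered_spans(traces_per_sec: int, spans_per_trace: int, decision_wait_s: int) -> int:
    """Spans held in memory across the sampler tier during one decision window."""
    return traces_per_sec * spans_per_trace * decision_wait_s


def traces_in_flight(traces_per_sec: int, decision_wait_s: int) -> int:
    """Traces awaiting a decision; the tier's total num_traces must exceed this."""
    return traces_per_sec * decision_wait_s


spans = buffered_spans(5_000, 40, 30)
in_flight = traces_in_flight(5_000, 30)
gib = spans * ASSUMED_BYTES_PER_SPAN / 2**30
num_traces = 200_000
print(f"{spans:,} spans, {in_flight:,} traces in flight, ~{gib:.1f} GiB across the tier")
# 6,000,000 spans, 150,000 traces in flight, ~5.7 GiB across the tier
print("num_traces headroom:", f"{num_traces / in_flight:.2f}x")  # 1.33x
```

A headroom factor close to 1.0 means any traffic spike will start evicting undecided traces.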

Production pitfall: Never run tail sampling on the same collector instance that also handles metrics and logs. The memory buffer for trace decisions competes with metric aggregation and log batching. In production, separate your collector fleet into purpose-specific tiers: a router tier (stateless, handles head sampling + load balancing), a sampler tier (stateful, tail sampling), and an export tier (batching to backends). This isolation makes scaling and failure analysis dramatically simpler.

Hybrid Strategy: Head + Tail in Practice

Top-tier engineering organizations do not choose one or the other — they layer both. A common production pattern at scale:

Client-side head sampling at the SDK: drop 100% of health-check requests (/healthz, /readyz), which generate enormous span volume with zero diagnostic value. Use a ParentBased sampler so downstream services honor the upstream decision.
Probabilistic head sampling in the router-tier collector at 20-50% — this reduces the data the sampler tier must buffer, lowering memory requirements.
Tail-based policy evaluation in the sampler tier: keep 100% of errors and slow traces from the pre-filtered 20-50%, and apply rate limiting on the remainder.

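The result: total trace volume sent to your backend might be 0.5-2% of raw request volume, yet that slice contains every error and latency outlier that survived the head-sampling stage. Note the ceiling this implies: with a 20-50% head rate you capture 20-50% of all errors and slow traces, not 100%, so raise the head rate if that coverage is too low for your incident response needs. This is the approach used by large organizations where full fidelity would cost millions of dollars per month in storage alone.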
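A quick calculation makes the trade-off in this layered setup explicit: the tail stage can only keep what the head stage lets through. The sketch below is a simplified model with hypothetical parameter values (a 30% head rate, 1% of traces being errors or slow, a 2% baseline on healthy traces) and ignores rate limiting.

```python
def hybrid_outcome(head_rate: float, interesting_fraction: float, baseline_rate: float) -> dict:
    """
    head_rate:            fraction of traces passing the router's probabilistic sampler
    interesting_fraction: fraction of all traces that are errors or slow
    baseline_rate:        fraction of healthy traces the tail stage keeps
    """
    kept_interesting = head_rate * interesting_fraction
    kept_healthy = head_rate * (1 - interesting_fraction) * baseline_rate
    return {
        "volume_kept": kept_interesting + kept_healthy,
        # The tail stage keeps every interesting trace it sees, so coverage of
        # interesting traces equals the head rate.
        "interesting_coverage": head_rate,
    }


result = hybrid_outcome(head_rate=0.30, interesting_fraction=0.01, baseline_rate=0.02)
print(f"volume kept: {result['volume_kept']:.2%}")                    # 0.89%
print(f"error/slow coverage: {result['interesting_coverage']:.0%}")   # 30%
```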

Key idea: The goal of sampling is not to save money — it is to save the right traces. A good sampling strategy is one where every trace you dropped was one you would not have needed, and every trace you needed was one you kept. Design your policies around that definition, measure it with error-trace capture rate as a KPI, and revisit the policies whenever your traffic patterns change significantly.
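One way to track that KPI is to divide the number of error traces that reached the backend by the number of error requests counted in unsampled metrics over the same window. The function name and the figures below are hypothetical.

```python
def error_capture_rate(error_traces_kept: int, error_requests_total: int) -> float:
    """Fraction of errored requests for which a trace was retained."""
    if error_requests_total == 0:
        return 1.0  # nothing to capture, so nothing was missed
    return error_traces_kept / error_requests_total


rate = error_capture_rate(error_traces_kept=2_850, error_requests_total=3_000)
print(f"error-trace capture rate: {rate:.1%}")  # 95.0%
```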