Observability Foundations

Observability Cost & Data Management

At Google scale, observability infrastructure costs more than many companies' entire engineering budgets. Even at mid-size companies on managed observability platforms — Datadog, New Relic, Honeycomb — teams routinely receive bills of $500K–$2M per year and discover the costs only after they explode. Understanding why telemetry is expensive and what levers you can pull to control costs without sacrificing signal quality is a first-class engineering discipline, not a finance concern.

This lesson covers the three mechanisms that dominate observability cost: cardinality (the number of unique metric time-series), sampling (retaining a statistically useful fraction of traces and logs), and retention tiers (storing data at different cost/fidelity trade-offs over time). Master these three and you can often cut bills by 60–80% while keeping every alert, dashboard, and incident investigation working.

Cardinality Explosions

Every Prometheus metric is identified by its name plus a set of label key-value pairs. The total number of distinct label combinations is the metric's cardinality. Prometheus stores each unique combination as a separate time-series, each requiring in-memory head chunks and on-disk blocks. Cardinality is the single biggest cost driver in metrics systems.

A seemingly harmless label decision compounds fast. A metric with labels region (5), http_method (6), status_code (12), and route (200 normalised paths) has 5 × 6 × 12 × 200 = 72,000 series. Add one more label with 50 values and you're at 3.6 million series. Add user_id with 10 million values and Prometheus runs out of RAM and crashes — taking your alerting system down exactly when you need it most.
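The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name series_count and the label name extra are illustrative, and the product is a worst case, since not every label combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Label value counts taken from the example in the text.
labels = {"region": 5, "http_method": 6, "status_code": 12, "route": 200}

base = series_count(labels)
# "extra" is a hypothetical additional label with 50 distinct values.
with_extra = series_count({**labels, "extra": 50})

print(f"{base:,} series, {with_extra:,} after one more 50-value label")
```

Because the counts multiply, the label with the largest value set dominates the total, which is why a single unbounded label is enough to overwhelm the server.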

The cardinality bomb: The most common production disaster is a developer adding a high-cardinality value as a metric label — user_id, order_id, request_id, email, IP address, or any UUID. Each new entity creates a new time-series. After a traffic spike or signup campaign, Prometheus OOM-kills itself within minutes. The fix is cultural and tooling-enforced: label values must come from a bounded, known-size set. High-cardinality values belong in trace span attributes and structured log fields — never in metric labels.

Detecting cardinality problems before they hit production requires active monitoring of your metrics pipeline itself. Prometheus exposes prometheus_tsdb_head_series (current active series count) and prometheus_tsdb_head_series_created_total (creation rate). Alert when these grow faster than your service count warrants.

# Prometheus rules: track cardinality per scrape job
# prometheus/rules/cardinality.yml
groups:
  - name: cardinality.rules
    interval: 5m
    rules:
      # Samples per scrape, summed per job, approximate the active series
      # each job contributes. (prometheus_tsdb_head_series only carries the
      # job label of the Prometheus server itself, so it cannot be split
      # by scraped job.)
      - record: job:scrape_samples_post_metric_relabeling:sum
        expr: sum by (job) (scrape_samples_post_metric_relabeling)

      # Alert if any job exceeds 50k series
      - alert: CardinalityExplosion
        expr: job:scrape_samples_post_metric_relabeling:sum > 50000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job }} has {{ $value }} active series"
          runbook: "Check for unbounded label values, likely user_id, request_id, or raw URL paths"

# Find the top cardinality contributors with PromQL:
# topk(20, count by (__name__, job) ({__name__=~".+"}))
#
# Prometheus also serves head cardinality statistics (top metric names and
# label names by series count) at /api/v1/status/tsdb. Check it to spot
# offenders before they crash the cluster.
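When hunting for the offending label, it helps to count distinct values per label across a sample of series. The sketch below does that over plain dictionaries. The sample label sets are invented for illustration, and the function names and the limit threshold are assumptions rather than part of any Prometheus API.

```python
from collections import defaultdict


def distinct_values_per_label(series: list[dict[str, str]]) -> dict[str, int]:
    """Count how many distinct values each label takes across the series."""
    seen: dict[str, set[str]] = defaultdict(set)
    for labelset in series:
        for name, value in labelset.items():
            seen[name].add(value)
    return {name: len(values) for name, values in seen.items()}


def suspicious_labels(series: list[dict[str, str]], limit: int) -> list[str]:
    """Labels whose distinct-value count exceeds the (hypothetical) limit."""
    counts = distinct_values_per_label(series)
    return sorted(name for name, count in counts.items() if count > limit)


# Invented sample: three series from one metric.
sample = [
    {"route": "/cart", "user_id": "u1"},
    {"route": "/cart", "user_id": "u2"},
    {"route": "/pay", "user_id": "u3"},
]

print(distinct_values_per_label(sample))
print(suspicious_labels(sample, limit=2))
```

A label whose distinct-value count grows with traffic or signups, rather than with the number of deployed services, is the one to remove.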

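The fix for existing cardinality problems is to relabel at the scrape layer, before data enters the TSDB, using Prometheus metric_relabel_configs. You can drop entire metrics, drop specific labels, or rewrite label values into normalised buckets, all without touching application code. One caveat: relabeling rewrites labels but does not aggregate. If dropping or bucketing a label leaves two series from the same scrape with identical label sets, they collide and Prometheus cannot ingest both, so treat label rewrites as a stopgap and fix the instrumentation for any metric where collisions are possible.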

# prometheus/prometheus.yml — relabel to cap cardinality at scrape time
scrape_configs:
  - job_name: "checkout-service"
    static_configs:
      - targets: ["checkout:9090"]
    metric_relabel_configs:
      # Drop metrics you never query
      - source_labels: [__name__]
        regex: "go_gc_duration_seconds|go_memstats_.*"
        action: drop

      # Normalise route label: replace a numeric segment with {id}.
      # The regex is fully anchored, so the prefix must be captured and
      # re-emitted or it is lost. Each rule rewrites only the last numeric
      # segment; repeat the rule for routes that contain several IDs.
      - source_labels: [route]
        regex: "(.*)/[0-9]+(/.*)?"
        replacement: "${1}/{id}${2}"
        target_label: route
        action: replace

      # Drop the raw user_id label entirely; it must never be a metric label.
      # Series must still be uniquely labelled after the drop.
      - regex: "user_id"
        action: labeldrop

      # Bucket status codes into classes: 200-299 -> 2xx
      - source_labels: [status_code]
        regex: "([2345])[0-9]{2}"
        replacement: "${1}xx"
        target_label: status_class
      - regex: "status_code"
        action: labeldrop
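The same route normalisation can also be applied in the application, before the label value is ever set, so the metric is bounded at the source. The following is a minimal sketch of that idea. The function name normalise_route is hypothetical, and Python's re module is used here, whose matching rules differ from the anchored RE2 expressions Prometheus uses in relabel rules.

```python
import re

# A path segment made only of digits, followed by "/" or the end of the path.
_NUMERIC_SEGMENT = re.compile(r"/[0-9]+(?=/|$)")


def normalise_route(path: str) -> str:
    """Replace every purely numeric path segment with the placeholder {id}."""
    return _NUMERIC_SEGMENT.sub("/{id}", path)


print(normalise_route("/users/123/orders/456"))
print(normalise_route("/healthz"))
```

Unlike a single relabel rule, this rewrites every numeric segment in one pass and leaves segments such as v2 untouched because they are not purely numeric.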

Bounded vs unbounded labels — the difference between 750 series and 7.5 billion series from a single metric.

Sampling: Keeping What Matters

Distributed traces and logs are verbose by nature. A service handling 1,000 requests per second produces 86.4 million traces per day (1,000 × 86,400 seconds), and each trace usually contains several spans, so the span count is a multiple of that. Storing all of them is expensive and delivers diminishing returns: the 999 successful GET /healthz calls tell you nothing that the first one did not. Sampling is the practice of keeping a statistically representative subset while discarding the redundant majority.

There are three sampling strategies, each appropriate at a different point in the pipeline:

Head-based sampling: the decision is made at the start of a request, before any spans are created. Simple and nearly free, but blind, because you do not yet know whether this trace will be interesting (error, slow, unusual). Appropriate for very high-volume, low-value traffic (health checks, metrics scrapes).
Tail-based sampling: the decision is deferred until the entire trace is complete. The collector buffers spans and evaluates rules: keep if any span has an error, keep if end-to-end latency exceeds 2s, keep 1% of the rest. This is the correct strategy for production services because it lets you keep 100% of errors and slow traces. The cost is a buffer (typically 30–60 seconds of spans in memory) and the requirement that all spans of a trace reach the same collector instance.
Adaptive (dynamic) sampling: the sample rate adjusts automatically based on observed error rates, latency distributions, and traffic volume per route. Honeycomb's dynamic sampling is an example. A fixed-rate probabilistic sampler is not adaptive, because its rate never changes in response to traffic. Adaptive sampling is the state of the art for large services but requires more configuration.
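To make the head-based case concrete, here is a minimal sketch of a deterministic head sampler keyed on the trace ID. It assumes a 32-character hexadecimal trace ID and a hypothetical function name head_sample; it illustrates the idea and is not the implementation used by any particular SDK.

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at request start whether to record a trace.

    Uses the low 64 bits of the hex trace ID as a uniform bucket, so every
    service that sees the same trace ID reaches the same decision.
    """
    bucket = int(trace_id[-16:], 16)  # value in [0, 2**64)
    return bucket < rate * 2**64


print(head_sample("0" * 32, 0.01))  # lowest possible bucket
print(head_sample("f" * 32, 0.5))   # highest possible bucket
```

Deriving the decision from the trace ID rather than from a fresh random number keeps traces whole: a downstream service never records spans for a trace its caller discarded. The decision still cannot take errors or latency into account, which is the gap tail-based sampling fills.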

Always keep 100% of errors. The cardinal rule of sampling: never sample away an error trace. A 1-in-a-million error that you sampled away is an incident you cannot investigate. Configure your tail sampler to treat any span with error=true or HTTP status 5xx as non-negotiable — keep rate 100% for that trace, regardless of the global sampling ratio.

In the OTel Collector, tail-based sampling is implemented by the tail_sampling processor, which ships in the Collector's contrib distribution. The configuration below keeps all errors, all slow traces, any trace explicitly tagged for sampling, and 5% of the remaining traffic:

# otel-collector/config.yaml — tail-based sampling processor
processors:
  tail_sampling:
    # How long to wait for all spans in a trace before making the decision
    decision_wait: 30s
    # Max traces held in memory at once; size it to at least
    # new traces per second × decision_wait (here 1000 × 30 = 30,000)
    num_traces: 100000
    expected_new_traces_per_sec: 1000

    policies:
      # Policy 1: Always keep traces with any error span
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }

      # Policy 2: Always keep traces slower than 2 seconds end-to-end
      - name: keep-slow-traces
        type: latency
        latency: { threshold_ms: 2000 }

      # Policy 3: Keep traces tagged for sampling (from feature flags or debug sessions)
      - name: keep-debug-tagged
        type: string_attribute
        string_attribute:
          key: "sampling.priority"
          values: ["always_sample"]

      # Policy 4: Probabilistic — keep 5% of everything else
      - name: probabilistic-sample
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

    # Composite policy: a trace is kept if ANY policy matches
    # (default behaviour — no composite block needed for OR logic)

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/tempo]
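The OR logic of the four policies can be sketched in plain Python to show what decision a completed trace receives. This is an illustration under stated assumptions, not the processor's code: the Span dataclass, the sampling_decision function, and the injectable rng parameter are all hypothetical, and the trace is assumed to contain at least one span.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Span:
    start_ms: float
    end_ms: float
    error: bool = False
    attributes: dict = field(default_factory=dict)


def sampling_decision(
    spans: list[Span],
    slow_ms: float = 2000,
    baseline_rate: float = 0.05,
    rng: Callable[[], float] = random.random,
) -> Optional[str]:
    """Return the name of the first policy that keeps the trace, or None."""
    if any(s.error for s in spans):
        return "keep-errors"
    # End-to-end latency: earliest span start to latest span end.
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration > slow_ms:
        return "keep-slow-traces"
    if any(s.attributes.get("sampling.priority") == "always_sample" for s in spans):
        return "keep-debug-tagged"
    if rng() < baseline_rate:
        return "probabilistic-sample"
    return None


trace = [Span(0, 50), Span(10, 40, error=True)]
print(sampling_decision(trace))
```

Returning on the first match is a simplification; since a trace is kept if any policy matches, the keep-or-drop outcome is the same as evaluating all of them.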

Retention Tiers: The Cost-Fidelity Curve

Not all telemetry data has the same value over time. A spike in error rate is actionable for the first 72 hours while an incident is live. The same data is useful for trend analysis for 30 days. After 90 days, you rarely need raw data — you need aggregated summaries: daily p99 latency, weekly error rate by service. After a year, only compliance-mandated data survives.

Big-tech observability teams implement tiered retention that trades fidelity for cost at each stage:

Hot tier (0–72 hours): full resolution, full cardinality, fast queries. Stored in Prometheus local TSDB or Grafana Mimir ingesters. Cost: highest. Purpose: real-time dashboards and on-call investigations.
Warm tier (72 hours – 30 days): full-resolution metrics compacted into object storage (S3/GCS) blocks via Thanos or Mimir. Queries are slower than on the hot tier (roughly 2–5×), but storage cost drops by about an order of magnitude. Purpose: post-incident reviews and SLO burn-rate calculations.
Cold tier (30 days – 1 year): downsampled data only. Thanos Compactor produces 5m and 1h downsampled blocks alongside the raw ones, so downsampling by itself adds storage. The saving comes from expiring raw blocks earlier than the downsampled ones, and long-range queries get faster because they read far fewer samples. Purpose: capacity planning, QBR metrics, compliance evidence.
Archive / glacier (1 year+): only pre-computed aggregates survive. Raw spans and logs are deleted; summary tables in a data warehouse (BigQuery, Redshift) persist for legal hold periods.

# thanos/objstore.yml: object storage config read by the Thanos Compactor
# Run Thanos Compactor as a standalone deployment alongside Thanos Store

type: S3
config:
  bucket: my-thanos-bucket
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1

# Compactor retention settings (passed as flags, not in this file):
#   --retention.resolution-raw=30d     keep raw 15s samples for 30 days
#   --retention.resolution-5m=90d      keep 5m downsamples for 90 days
#   --retention.resolution-1h=1y       keep 1h downsamples for 1 year
#
# Example compactor Deployment (key flags only):
#
# args:
#   - compact
#   - --data-dir=/var/thanos/compact
#   - --objstore.config-file=/etc/thanos/objstore.yml
#   - --retention.resolution-raw=30d
#   - --retention.resolution-5m=90d
#   - --retention.resolution-1h=1y
#   - --wait
#   - --wait-interval=5m
#
# Thanos Querier does not pick a tier from the time range alone. Start it
# with --query.auto-downsampling to let it choose a resolution from the
# query step, or pass max_source_resolution (0s, 5m, 1h) per query.
# Once raw blocks expire at 30d, older ranges are only available at 5m,
# and beyond 90d only at 1h. No application changes are needed.
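To see what the three retention settings mean per series, the sketch below counts how many data points each tier holds for a single time-series. It assumes the 15s raw interval mentioned in the comments above. It counts points, not bytes, and a downsampled point in Thanos stores several aggregates, so bytes per point differ between tiers.

```python
SECONDS_PER_DAY = 86_400


def points_per_series(resolution_s: int, retention_days: int) -> int:
    """Number of data points one series holds at a resolution and retention."""
    return retention_days * SECONDS_PER_DAY // resolution_s


# (resolution in seconds, retention in days), matching the flags above.
tiers = {"raw 15s": (15, 30), "5m": (300, 90), "1h": (3600, 365)}

for name, (resolution_s, days) in tiers.items():
    print(f"{name:>8}: {points_per_series(resolution_s, days):>7,} points")
```

The 1h tier holds a year of history in fewer points than raw data needs for two days, which is why long-range queries against downsampled blocks are so much cheaper.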

For logs, the tiering pattern maps naturally to Loki's storage backends or to a pipeline that routes logs through the OTel Collector:

All logs go to Loki for 7–14 days (hot/warm).
After 14 days, only level=error and level=warn logs are retained in Loki; level=info and level=debug are expired.
Error logs are archived to S3 with a 1-year lifecycle policy for compliance.
High-value structured fields are extracted and written to a columnar store (ClickHouse, BigQuery) for long-term analytics.
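The log tiering rules above amount to a small decision function of log level and age. The sketch below encodes them. The function name and destination strings are hypothetical, the columnar-store extraction is left out, and loki_extended_days is a required parameter because the text does not say how long error and warn logs stay in Loki after day 14.

```python
def log_destinations(
    level: str,
    age_days: int,
    *,
    loki_all_days: int = 14,
    loki_extended_days: int,
    archive_days: int = 365,
) -> set[str]:
    """Return where a log line of this level and age is still stored."""
    destinations: set[str] = set()
    if age_days <= loki_all_days:
        destinations.add("loki")  # every level is kept at first
    elif level in {"error", "warn"} and age_days <= loki_extended_days:
        destinations.add("loki")  # only error/warn survive past day 14
    if level == "error" and age_days <= archive_days:
        destinations.add("s3-archive")  # compliance copy of error logs
    return destinations


# 90 is an arbitrary placeholder for the extended Loki window.
print(log_destinations("info", 20, loki_extended_days=90))
print(log_destinations("error", 200, loki_extended_days=90))
```

Writing the policy down as a function makes it easy to review with compliance stakeholders before translating it into Loki retention settings and S3 lifecycle rules.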

Recording rules as a savings multiplier. Prometheus recording rules pre-compute expensive aggregations and store the result as a new, low-cardinality metric. Instead of running a heavy sum(rate(http_requests_total[5m])) by (service) across millions of series at dashboard load time, run it once every 30 seconds via a recording rule and query the cheap pre-computed series. On busy Prometheus clusters this can cut query cost by orders of magnitude, and it is the correct pattern for every SLO and dashboard query that runs more than once per minute.

Cost Attribution and Governance

At organisations with multiple teams, telemetry cost attribution prevents runaway spending. Teams should own their own cost centres. In Datadog, per-team cost is visible via the Usage Attribution feature. In self-hosted Prometheus/Thanos, cost can be proxied by series count per team: label metrics with team="payments" and query count by (team) ({__name__=~".+"}) to produce a series-count bill per team. Set per-team series budgets and alert in CI when a PR would cause the deploying service to exceed its budget.
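A per-team budget check of the kind described above could look like the following sketch. It assumes the series counts have already been fetched (for example from the count by (team) query) into a dictionary. The team names, numbers, and function name are invented, and treating a team with no budget as having a budget of zero is a design assumption so that unowned series are always flagged.

```python
def over_budget(series_by_team: dict[str, int], budgets: dict[str, int]) -> dict[str, int]:
    """Return {team: excess series} for every team above its series budget.

    Teams without a configured budget are treated as having a budget of 0.
    """
    return {
        team: count - budgets.get(team, 0)
        for team, count in series_by_team.items()
        if count > budgets.get(team, 0)
    }


# Invented example figures.
series_by_team = {"payments": 62_000, "search": 18_000, "unknown": 400}
budgets = {"payments": 50_000, "search": 25_000}

violations = over_budget(series_by_team, budgets)
for team, excess in sorted(violations.items()):
    print(f"{team}: {excess:,} series over budget")
```

In a CI job, a non-empty result would fail the build, turning the series budget into a gate rather than a report that is read after the bill arrives.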

The Three Dials: Every observability cost decision involves three dials — cardinality (how many series/streams), volume (how many data points/events), and retention (how long). Reducing any one of them cuts costs proportionally. Reducing all three multiplies the savings. Build a monthly review of all three into your team's operational rhythm.

Observability data management is ultimately an engineering economics problem. The goal is not zero telemetry — it is the minimum telemetry required to meet your SLO investigation and alerting requirements. An observability plan that costs $2M/year but catches every incident in under 5 minutes may be worth every dollar. One that costs $500K/year but leaves your team blind during incidents is worthless. Calibrate the levers — cardinality, sampling rates, and retention windows — to the actual risk and recovery time objectives of your system, not to arbitrary cost targets.