Observability Foundations

The Three Pillars of Observability

The three pillars (metrics, logs, and traces) are the raw signals that let you understand what a distributed system is doing at any moment. The framing is often traced to Cory Watson's 2013 work at Twitter and was popularised by Cindy Sridharan's writing on observability. Many large platform teams now organise their production tooling around these three signal types, typically with a dedicated store for each: for example, Prometheus or Thanos for metrics, Elasticsearch or Loki for logs, and Jaeger or Tempo for traces.

No single pillar is sufficient on its own. They are complementary: metrics tell you that something is wrong, logs tell you what happened, and traces tell you where in the call chain the problem originated. Miss any one and your on-call engineer will be flying blind during an incident.

Figure: The Three Pillars of Observability. Your system (services, infra) emits three kinds of signal:

  • Metrics: numeric time series (cpu_usage, req_rate, error_rate, latency_p99). Tool: Prometheus. Viz: Grafana. Strengths: cheap at query time, alerting and trending, aggregation and math. Cost: low per datapoint.
  • Logs: discrete timestamped events (structured JSON records; request, error, audit). Tool: Loki / ELK. Viz: Grafana / Kibana. Strengths: full event context, human-readable detail, debuggability. Cost: high at scale.
  • Traces: causal request spans (trace_id, span_id, parent → child spans). Tool: Jaeger / Tempo. Viz: Grafana / Jaeger UI. Strengths: end-to-end latency, service dependency map, root cause localisation. Cost: medium + sampling.

Caption: The three observability pillars: signals emitted by your system, their primary tools, and cost/strength profile.

Pillar 1: Metrics

Metrics are numeric measurements recorded as time series and aggregated over time. A counter increments every time an HTTP request is served; a gauge tracks a value that can go up or down, such as current memory usage; a histogram counts request durations into buckets and lets you estimate percentiles, with accuracy limited by the bucket boundaries. Metrics are cheap to store (each sample is a float and a timestamp attached to a labelled series), cheap to query at scale, and the right foundation for dashboards and alerts.
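
To make the histogram case concrete, the following is a minimal sketch of how a percentile can be estimated from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. It is similar in spirit to what PromQL's histogram_quantile does, but it is not that implementation. The bucket bounds, the counts, and the function name estimateQuantile are assumptions made up for illustration.

```javascript
// Cumulative histogram buckets: each entry counts observations <= le (seconds).
// Bounds and counts are illustrative, not taken from a real service.
const buckets = [
  { le: 0.1, count: 600 },
  { le: 0.25, count: 900 },
  { le: 0.5, count: 980 },
  { le: 1.0, count: 1000 },
];

// Estimate quantile q (0..1) by linear interpolation inside the target bucket.
function estimateQuantile(q, cumulativeBuckets) {
  const total = cumulativeBuckets[cumulativeBuckets.length - 1].count;
  const rank = q * total;
  let lowerBound = 0;
  let lowerCount = 0;
  for (const { le, count } of cumulativeBuckets) {
    if (count >= rank) {
      const inBucket = count - lowerCount;
      if (inBucket === 0) return le;
      return lowerBound + (le - lowerBound) * ((rank - lowerCount) / inBucket);
    }
    lowerBound = le;
    lowerCount = count;
  }
  return lowerBound;
}

const p50 = estimateQuantile(0.5, buckets);
const p99 = estimateQuantile(0.99, buckets);
console.log(p50.toFixed(3), p99.toFixed(3));
```

The estimate can only be as precise as the bucket layout: here the p99 falls somewhere in the 0.5 s to 1.0 s bucket, and the interpolation assumes observations are spread evenly within it.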

The Prometheus data model is the de facto industry standard. Every metric has a name and a set of key-value labels, for example http_requests_total{method="POST", status="500", service="checkout"}. Labels are what make metrics powerful: you can sum across all services, filter to a single endpoint, or compare error rates by region. But labels have a cost: every unique combination of label values is a separate time series. A label with unbounded cardinality (like user_id or request_id) creates a new series for every value it takes and can exhaust the memory of your Prometheus TSDB.
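
The cardinality cost is easy to see with a little arithmetic. The sketch below computes the worst-case series count for one metric as the product of the number of distinct values per label. The label counts and the helper name seriesUpperBound are hypothetical.

```javascript
// Upper bound on time series for one metric = product of distinct values per label.
function seriesUpperBound(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

// Hypothetical label sets for http_requests_total.
const bounded = { method: 5, status: 10, service: 40 };
const unbounded = { ...bounded, user_id: 1_000_000 };

const boundedSeries = seriesUpperBound(bounded);     // 5 * 10 * 40
const unboundedSeries = seriesUpperBound(unbounded); // the same, times 1,000,000 users

console.log(boundedSeries, unboundedSeries);
```

In practice not every combination occurs, so the real count is usually below this bound, but one unbounded label is enough to move it by several orders of magnitude.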

```yaml
# prometheus.yml: scrape your app's /metrics endpoint
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api-service'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use the prometheus.io/path annotation as the metrics path when it is set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        target_label: __metrics_path__
        regex: (.+)
```

```promql
# 99th-percentile latency over 5m per service
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# Error rate as a percentage
100 * (
  sum(rate(http_requests_total{status=~"5.."}[1m])) by (service)
  /
  sum(rate(http_requests_total[1m])) by (service)
)
```
The Four Golden Signals (Google SRE Book): Latency, Traffic (request rate), Errors (error rate), Saturation (how full a resource is). If you instrument nothing else, instrument these four for every service. They are the minimum viable metric set that tells you whether a service is healthy.
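
As a rough illustration of what the four signals measure, the sketch below derives them from a small window of request records plus one resource reading. The records, the worker-pool numbers, and the function name goldenSignals are all invented for the example; a real service would export counters and histograms and compute these in the metrics backend instead.

```javascript
// Hypothetical 10-second window of request records for one service.
const windowSeconds = 10;
const requests = [
  { durationMs: 120, status: 200 },
  { durationMs: 95, status: 200 },
  { durationMs: 2100, status: 503 },
  { durationMs: 140, status: 200 },
  { durationMs: 180, status: 500 },
];
// Assumed resource reading: worker pool slots in use vs. total.
const pool = { busyWorkers: 45, maxWorkers: 50 };

function goldenSignals(reqs, seconds, resource) {
  const sorted = reqs.map((r) => r.durationMs).sort((a, b) => a - b);
  // Nearest-rank p99 over the raw samples.
  const p99Index = Math.max(0, Math.ceil(0.99 * sorted.length) - 1);
  const errors = reqs.filter((r) => r.status >= 500).length;
  return {
    latencyP99Ms: sorted[p99Index],                          // Latency
    trafficRps: reqs.length / seconds,                       // Traffic
    errorRatio: errors / reqs.length,                        // Errors
    saturation: resource.busyWorkers / resource.maxWorkers,  // Saturation
  };
}

const signals = goldenSignals(requests, windowSeconds, pool);
console.log(signals);
```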

Pillar 2: Logs

Logs are discrete, timestamped records of individual events. A log line captures what happened at an exact moment — an HTTP request was received with these headers, a database query timed out, a user was denied access. Logs give you the full context that metrics cannot: the exact query that failed, the exact user ID, the exact stack trace.

The shift from unstructured ("printf-style") logs to structured JSON logs is one of the highest-leverage improvements a team can make. Structured logs can be indexed, filtered, and aggregated by machines. An unstructured log string is a debugging artifact; a structured log record is data.
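
The difference is easiest to see side by side. Below is a minimal sketch of a structured logger that writes one JSON object per line. It is not pino; createLogger is a hypothetical helper, and the field names simply mirror the ones used in this lesson.

```javascript
// Minimal structured logger sketch: one JSON object per line on stdout.
// createLogger is hypothetical; a production service would use a logging library.
function createLogger(baseFields) {
  const write = (level, msg, fields = {}) => {
    const record = { level, time: Date.now(), ...baseFields, ...fields, msg };
    const line = JSON.stringify(record);
    console.log(line);
    return line;
  };
  return {
    info: (msg, fields) => write('info', msg, fields),
    error: (msg, fields) => write('error', msg, fields),
  };
}

const log = createLogger({ service: 'checkout' });

// Unstructured: a reader has to parse the string to recover the user and the amount.
console.log('payment processed for usr_8821 amount 4999');

// Structured: each value is a field that a log store can index and filter on.
const line = log.info('payment processed', {
  trace_id: '3fa85f64-5717-4562',
  user_id: 'usr_8821',
  amount_cents: 4999,
});
const parsed = JSON.parse(line);
console.log(parsed.amount_cents);
```

Because every record round-trips through JSON.parse, a query such as "all records with amount_cents above some value" becomes a field comparison rather than a regular expression.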

Application emitting structured JSON logs (Node.js / pino example). Every log line is a valid JSON object, queryable by any field. The two records are wrapped here for readability; on the wire each record is a single line.

```json
{"level":"info","time":1718000000000,"pid":42,"hostname":"api-7d4b9c-xkq2p",
 "service":"checkout","trace_id":"3fa85f64-5717-4562","span_id":"a1b2c3d4",
 "msg":"payment processed","user_id":"usr_8821","amount_cents":4999,"currency":"USD",
 "duration_ms":143}
{"level":"error","time":1718000001234,"pid":42,"hostname":"api-7d4b9c-xkq2p",
 "service":"checkout","trace_id":"3fa85f64-5717-4562","span_id":"a1b2c3d4",
 "msg":"stripe charge failed","error":"card_declined","stripe_code":"do_not_honor",
 "user_id":"usr_8821","stack":"Error: card_declined\n at ChargeService.charge ..."}
```

```logql
# Loki LogQL: error logs for a service, with JSON fields extracted.
# The time range (e.g. last 15m) is chosen in the query UI or API, not in the query.
# Assumes "level" is a JSON field and not already a stream label.
{service="checkout"} |= "error" | json | level="error"

# Slow requests: filter on the extracted duration_ms field
{service="checkout"} | json | duration_ms > 500

# Count errors per minute by error type (count_over_time counts lines; rate would be per second)
sum by (error) (
  count_over_time({service="checkout"} | json | __error__="" | level="error" [1m])
)
```
Log volume is your biggest observability cost driver. A busy microservice can emit hundreds of thousands of log lines per second. At an illustrative $0.50–$1.50 per GB ingested (commercial pricing such as Datadog's or Splunk's varies by vendor and plan), a single verbose service can cost thousands of dollars per month. Production practice: log at INFO level for normal operations and at ERROR only for actionable failures. Use dynamic log sampling: log 100% of errors and 1% of successful requests. Never log full request/response bodies at INFO level. Attach the trace_id to every log line so you can correlate with traces.
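
The "100% of errors, 1% of successes" rule can be sketched as a small decision function. The name shouldLog and its parameters are hypothetical, and the random source is injectable only so that the demonstration is deterministic.

```javascript
// Decide whether to emit a log record: keep every error, sample the rest.
// The 1% default follows the guidance above; random is injectable for testing.
function shouldLog(record, successSampleRate = 0.01, random = Math.random) {
  if (record.level === 'error') return true;
  return random() < successSampleRate;
}

// Deterministic demonstration: a fixed sequence stands in for Math.random.
const draws = [0.005, 0.5, 0.9];
let drawIndex = 0;
const fakeRandom = () => draws[drawIndex++ % draws.length];

const decisions = [
  shouldLog({ level: 'error', msg: 'stripe charge failed' }, 0.01, fakeRandom), // error: always kept
  shouldLog({ level: 'info', msg: 'payment processed' }, 0.01, fakeRandom),     // draw 0.005 < 0.01: kept
  shouldLog({ level: 'info', msg: 'payment processed' }, 0.01, fakeRandom),     // draw 0.5 >= 0.01: dropped
];
console.log(decisions);
```

If you sample this way, it is worth recording the sample rate on each kept record so that counts derived from logs can be scaled back up.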

Pillar 3: Distributed Traces

A distributed trace is the story of a single request as it flows across every service that handles it. Each unit of work within a service is a span — it has a start time, a duration, a set of tags (key-value metadata), and a reference to its parent span. The collection of all spans for one request, linked by a shared trace_id, forms the trace.
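
The parent references are what turn a flat list of spans into a tree. A minimal sketch, using made-up spans and hypothetical helper names (buildTraceTree, render), might look like this:

```javascript
// Hypothetical spans for one request, all sharing the same trace id.
const spans = [
  { traceId: 't1', spanId: 'a', parentId: null, name: 'POST /checkout', durationMs: 2000 },
  { traceId: 't1', spanId: 'b', parentId: 'a', name: 'inventory.GetStock', durationMs: 1800 },
  { traceId: 't1', spanId: 'c', parentId: 'b', name: 'postgres SELECT', durationMs: 1700 },
  { traceId: 't1', spanId: 'd', parentId: 'a', name: 'payment.Charge', durationMs: 120 },
];

// Link each span to its parent and return the root span(s) of the trace.
function buildTraceTree(flatSpans) {
  const byId = new Map(flatSpans.map((s) => [s.spanId, { ...s, children: [] }]));
  const roots = [];
  for (const node of byId.values()) {
    const parent = node.parentId ? byId.get(node.parentId) : undefined;
    if (parent) parent.children.push(node);
    else roots.push(node);
  }
  return roots;
}

// Render the tree as indented lines, one per span.
function render(node, depth = 0) {
  const own = `${'  '.repeat(depth)}${node.name} (${node.durationMs} ms)`;
  return [own, ...node.children.flatMap((child) => render(child, depth + 1))];
}

const [root] = buildTraceTree(spans);
const lines = render(root);
console.log(lines.join('\n'));
```

This indented view is essentially what a trace UI draws as a waterfall, with start times added on the horizontal axis.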

Traces answer questions that neither metrics nor logs can resolve: "Which service in our 40-service mesh is responsible for the 2-second tail latency our customers see on checkout?" Metrics might tell you checkout p99 is slow. Logs might show individual slow requests. Only traces reveal that the bottleneck is a specific inventory-service → postgres span that is consistently slow for requests involving more than 10 SKUs.
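
One simple way to locate such a bottleneck is to rank spans by self time, that is, the span's duration minus the time spent in its direct children. The sketch below uses invented spans and assumes the children of a span run sequentially; with parallel children the subtraction would overcount and a critical-path calculation would be needed instead.

```javascript
// Hypothetical spans for one slow checkout request.
const checkoutSpans = [
  { spanId: 'a', parentId: null, name: 'checkout POST /checkout', durationMs: 2000 },
  { spanId: 'b', parentId: 'a', name: 'inventory-service GetStock', durationMs: 1800 },
  { spanId: 'c', parentId: 'b', name: 'postgres query', durationMs: 1700 },
  { spanId: 'd', parentId: 'a', name: 'payment-service Charge', durationMs: 120 },
];

// Self time = span duration minus time spent in direct children.
// Assumes the children of a span run sequentially (no overlap).
function selfTimes(spanList) {
  const childTotal = new Map();
  for (const s of spanList) {
    if (s.parentId) {
      childTotal.set(s.parentId, (childTotal.get(s.parentId) || 0) + s.durationMs);
    }
  }
  return spanList.map((s) => ({
    name: s.name,
    selfMs: s.durationMs - (childTotal.get(s.spanId) || 0),
  }));
}

const ranked = selfTimes(checkoutSpans).sort((x, y) => y.selfMs - x.selfMs);
const bottleneck = ranked[0];
console.log(bottleneck);
```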

```javascript
// OpenTelemetry auto-instrumentation for a Node.js service
// npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
//   @opentelemetry/exporter-trace-otlp-grpc
// tracing.js: loaded before your app (node --require ./tracing.js server.js)
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317', // Collector in the same K8s namespace
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'checkout-service',
});

sdk.start();
```

```yaml
# otel-collector-config.yaml
# OpenTelemetry Collector config: receives traces and forwards them to Tempo.
# The tail_sampling processor ships in the Collector contrib distribution.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  tail_sampling:            # sample only interesting traces
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 500}
      - name: probabilistic-baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

exporters:
  otlp:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]   # sample first, then batch what is kept
      exporters: [otlp]
```
Tail-based sampling is the usual choice for high-traffic production systems. Head-based sampling (decide at the entry point, e.g., "trace 5% of all requests") is simple, but it decides before the outcome is known, so errors and slow outliers are dropped at the same rate as everything else. Tail-based sampling buffers spans for a short window (10–30 seconds) and makes the keep/drop decision after seeing the whole trace. Keep 100% of error traces and traces above a latency threshold, and probabilistically sample the rest. This is the approach implemented by the OTel Collector's tail_sampling processor. By contrast, the sampling described in Google's Dapper paper and the classic Jaeger client samplers are head-based.
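
The keep/drop decision itself is small once the whole trace has been buffered. The following sketch applies the same three rules (errors, latency threshold, probabilistic baseline) to a list of spans. It is not the Collector's implementation; keepTrace, its option names, and the span shape are assumptions for illustration.

```javascript
// Tail-based sampling decision over a fully buffered trace.
// Defaults mirror the collector policies above; random is injectable for testing.
function keepTrace(traceSpans, options = {}) {
  const { latencyThresholdMs = 500, baselineRate = 0.05, random = Math.random } = options;

  // Rule 1: keep every trace that contains an error span.
  if (traceSpans.some((s) => s.status === 'ERROR')) return { keep: true, reason: 'error' };

  // Rule 2: keep traces whose end-to-end duration exceeds the threshold.
  const start = Math.min(...traceSpans.map((s) => s.startMs));
  const end = Math.max(...traceSpans.map((s) => s.startMs + s.durationMs));
  if (end - start > latencyThresholdMs) return { keep: true, reason: 'slow' };

  // Rule 3: keep a small random baseline of everything else.
  return random() < baselineRate
    ? { keep: true, reason: 'baseline' }
    : { keep: false, reason: 'dropped' };
}

const errorTrace = [
  { startMs: 0, durationMs: 40, status: 'OK' },
  { startMs: 5, durationMs: 20, status: 'ERROR' },
];
const slowTrace = [{ startMs: 0, durationMs: 900, status: 'OK' }];
const fastTrace = [{ startMs: 0, durationMs: 80, status: 'OK' }];

const results = [
  keepTrace(errorTrace).reason,
  keepTrace(slowTrace).reason,
  keepTrace(fastTrace, { random: () => 0.5 }).reason,
  keepTrace(fastTrace, { random: () => 0.01 }).reason,
];
console.log(results);
```

A head-based sampler would have to make the same call from the first span alone, before any status or total duration is known, which is why it cannot express rules 1 and 2.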

How the Three Pillars Complement Each Other

The real power comes from using all three together in a workflow. Here is a representative on-call scenario at a production-scale company:

  1. Alert fires (metrics): A Prometheus alert triggers when checkout_error_rate > 2% for 3 minutes in the US-EAST region.
  2. Narrow the blast radius (metrics): The on-call engineer opens Grafana. Error rate is elevated only on pod/checkout-v2-*, not checkout-v1-*. A recent deploy is the suspect. Dashboard shows p99 latency also spiking.
  3. Understand what happened (logs): Filter Loki for service="checkout" level="error" in the affected window. Logs show "error": "upstream timeout", "upstream": "inventory-service" — the checkout service is timing out waiting on inventory.
  4. Find the root cause (traces): Open Tempo, filter for service=checkout status=error. A trace shows the inventory.GetStock span is consuming 1,800 ms of the total 2,000 ms request. Drill into that span — it is making 47 sequential database queries (an N+1 bug introduced in the new deploy).
  5. Correlate (cross-pillar linking): The trace_id embedded in each log line lets you jump directly from the relevant log record to the exact trace in Tempo. Grafana Explore supports this natively when logs and traces share the same trace_id field.
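
Step 5 is, at its core, a join on trace_id. The sketch below shows that join over in-memory data; the log records, trace summaries, and the correlate helper are hypothetical stand-ins for results that would come from a log store and a trace store.

```javascript
// Hypothetical error logs and trace summaries, as if fetched from two separate stores.
const errorLogs = [
  { trace_id: 't-101', level: 'error', error: 'upstream timeout', upstream: 'inventory-service' },
  { trace_id: 't-102', level: 'error', error: 'upstream timeout', upstream: 'inventory-service' },
  { trace_id: 't-999', level: 'error', error: 'card_declined' },
];
const traceSummaries = [
  { trace_id: 't-101', durationMs: 2000, slowestSpan: 'inventory.GetStock' },
  { trace_id: 't-102', durationMs: 2150, slowestSpan: 'inventory.GetStock' },
];

// Join each log record to its trace via the shared trace_id field.
function correlate(logs, traces) {
  const tracesById = new Map(traces.map((t) => [t.trace_id, t]));
  return logs.map((log) => ({ ...log, trace: tracesById.get(log.trace_id) || null }));
}

const joined = correlate(errorLogs, traceSummaries);
// Logs whose trace was not retained (for example, dropped by sampling).
const withoutTrace = joined.filter((r) => r.trace === null).map((r) => r.trace_id);
console.log(joined[0].trace.slowestSpan, withoutTrace);
```

The join only works if every log line carries the trace_id, and a log record can outlive its trace when the trace was sampled away.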

This workflow — alert on metrics, investigate with logs, locate with traces — is the canonical observability loop. Each pillar does one job well and hands off to the next. Trying to do this with logs alone is expensive and slow at scale. Trying to do it with metrics alone leaves you unable to understand causation.
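
The first step of the loop, an alert that fires only after a condition has held for some time, can be sketched as follows. This loosely mimics a threshold with a hold duration and is not how Prometheus evaluates rules internally; shouldFire and the per-minute error rates are invented for the example.

```javascript
// Fire only when every sample in the trailing window breaches the threshold.
// errorRatePerMinute holds one error ratio per minute, oldest first.
function shouldFire(errorRatePerMinute, threshold = 0.02, forMinutes = 3) {
  if (errorRatePerMinute.length < forMinutes) return false;
  return errorRatePerMinute.slice(-forMinutes).every((rate) => rate > threshold);
}

const sustained = shouldFire([0.01, 0.03, 0.025, 0.04]); // last 3 minutes all above 2%
const blip = shouldFire([0.03, 0.01, 0.04]);             // dipped below 2% in the middle
console.log(sustained, blip);
```

Requiring the condition to hold for several minutes trades a slower page for fewer false alarms from short spikes.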

OpenTelemetry (OTel) is now the standard instrumentation layer. It is a CNCF project that provides vendor-neutral SDKs for emitting metrics, logs, and traces from most mainstream languages. The OTel Collector is a standalone binary that receives signals, processes them (batch, filter, sample), and exports them to the backend of your choice. Adopting OTel means you can swap from Jaeger to Tempo, or from Prometheus to Mimir, without re-instrumenting your applications.

Cost and Cardinality: The Engineering Trade-off

Each pillar has a different cost profile that determines how much data you can afford to keep at what resolution:

  • Metrics: Very low cost per data point. Scraping one counter from 1,000 pods every 15 seconds produces about 5.8 million samples per day, which is still cheap in Prometheus. The main cost driver is label cardinality: too many unique label combinations and Prometheus will run out of memory. Keep label values bounded and use recording rules to pre-aggregate expensive queries.
  • Logs: High storage cost at scale. Ingest compression helps (Loki stores compressed chunks), but the volume is fundamentally proportional to request rate × log lines per request. Sampling, retention tiers (hot/warm/cold), and log levels are the primary levers.
  • Traces: Medium cost, controlled by sampling rate. 100% trace capture is only feasible at low request rates. At >1,000 RPS, use tail-based sampling targeting 5–10% overall with 100% error/slow retention. Tempo keeps trace data in object storage, which is typically much cheaper than running Jaeger on an Elasticsearch backend at scale.
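
These cost profiles can be compared with back-of-the-envelope arithmetic. In the sketch below every input (lines per request, bytes per line, sampling rate) is an assumption to replace with your own numbers, and the function names are hypothetical.

```javascript
// Back-of-the-envelope volume estimates; all inputs are illustrative assumptions.
const SECONDS_PER_DAY = 86_400;

// Metrics: samples per day across all pods.
function metricSamplesPerDay(pods, scrapeIntervalSeconds, seriesPerPod = 1) {
  return pods * seriesPerPod * (SECONDS_PER_DAY / scrapeIntervalSeconds);
}

// Logs: GB per day = request rate x log lines per request x bytes per line.
function logGigabytesPerDay(rps, linesPerRequest, bytesPerLine) {
  return (rps * linesPerRequest * bytesPerLine * SECONDS_PER_DAY) / 1e9;
}

// Traces: traces kept per day at a given overall sampling rate.
function tracesKeptPerDay(rps, samplingRate) {
  return rps * samplingRate * SECONDS_PER_DAY;
}

const samples = metricSamplesPerDay(1000, 15);   // 1,000 pods, one series each, scraped every 15 s
const logGb = logGigabytesPerDay(1000, 5, 500);  // assumed 5 lines per request at 500 bytes each
const traces = tracesKeptPerDay(1000, 0.05);     // 5% of 1,000 RPS

console.log(samples, logGb, traces);
```

Under these assumptions the metrics side produces a few million small samples per day, while the logs side produces a couple of hundred gigabytes, which is why sampling and retention are the first levers to reach for on logs.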

The next lesson goes deeper into which specific metrics actually matter — the signals proven to predict user-visible failures, how to define your Golden Signals for each service tier, and the PromQL patterns that surface them in under 60 seconds during an incident.