Observability Foundations

Metrics That Matter

Instrumenting a service is easy. Knowing which metrics actually tell you whether your service is healthy, and which are noise, is the hard part. Two complementary mental models cut through the chaos: the RED method for understanding services from the outside, and the USE method for understanding resources from the inside. Together they cover Google's Four Golden Signals, which the Site Reliability Engineering book recommends as the four things to measure if you can measure nothing else about a user-facing system.

These frameworks were born from painful experience. Before them, teams would instrument dozens of arbitrary internal metrics and then stare at hundreds of Grafana panels during incidents, unable to answer the single question that matters: is this service working for users right now?

The RED Method — Services from the Outside In

RED stands for Rate, Errors, and Duration. It was coined by Tom Wilkie while he was at Weaveworks (he later joined Grafana Labs) and is a good fit for any service that handles requests: HTTP APIs, gRPC services, message consumers, database queries.

Rate — how many requests per second is the service receiving? This is your demand signal. If rate drops suddenly, either load disappeared (upstream problem) or the service is shedding connections (your problem).
Errors — what fraction of requests are failing? Track HTTP 5xx responses, gRPC non-OK codes, and application-level business errors, and keep them separate from infrastructure errors. A 1% error rate sounds small; at 10,000 RPS that is 100 failed requests every second.
Duration — how long do requests take? Always instrument as a histogram or summary, never just an average. The p50 tells you what most users experience; the p99 tells you what your worst 1% experience; the p999 reveals latency outliers that will surface as SLO breaches at scale. Mean latency hides bimodal distributions completely.

RED is user-facing: these three signals answer "is the service doing its job from the user's perspective?" They say nothing about why. A service can have perfect RED metrics while sitting on a saturated disk or leaking memory — until it doesn't.
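
The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not how a metrics backend computes them: the `Request` record, the `red_summary` helper, and the sample window are all hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical record shape)
    duration_s: float  # time taken to serve the request, in seconds


def percentile(sorted_values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(requests, window_s):
    """Compute Rate, Errors, and Duration for one observation window."""
    durations = sorted(r.duration_s for r in requests)
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_s,
        "error_ratio": failed / len(requests),
        "p50_s": percentile(durations, 50),
        "p99_s": percentile(durations, 99),
    }


# Assumed 10-second window: 97 fast successes, 2 slow successes, 1 server error.
window = (
    [Request(200, 0.020)] * 97
    + [Request(200, 1.500)] * 2
    + [Request(503, 0.005)]
)
summary = red_summary(window, window_s=10)
print(summary)
```

In this made-up window the median stays at 20 ms while the p99 lands on the 1.5-second requests, which is the gap a mean would blur.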

Prometheus with Grafana is a common stack for RED. A minimal Prometheus scrape config, followed by PromQL queries that capture all three signals for a Kubernetes service (the metric and label names depend on how your application is instrumented):

# prometheus.yml — scrape your app's /metrics endpoint
scrape_configs:
  - job_name: 'checkout-service'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: checkout

---
# PromQL — RED dashboard queries

# Rate: requests per second, summed across all pods and label sets (5-min window)
sum(rate(http_requests_total{job="checkout-service"}[5m]))

# Errors: fraction of requests that returned 5xx
sum(rate(http_requests_total{job="checkout-service",status=~"5.."}[5m]))
  /
sum(rate(http_requests_total{job="checkout-service"}[5m]))

# Duration: 99th-percentile latency
histogram_quantile(0.99,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{job="checkout-service"}[5m])
  )
)
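
To see why the Duration query depends on bucket boundaries, it helps to sketch how a quantile can be estimated from cumulative histogram buckets. The function below is a simplified illustration of linear interpolation within a bucket, in the spirit of what Prometheus documents for classic histograms; it is not Prometheus's implementation, and the bucket bounds and counts are invented.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with the +Inf bucket. Simplified sketch, not Prometheus code.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls past the last finite bucket, so the best
                # available answer is the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical cumulative counts for le="0.1", "0.5", "1", "+Inf".
buckets = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
p99 = estimate_quantile(0.99, buckets)
p999 = estimate_quantile(0.999, buckets)
print(round(p99, 3), p999)
```

With these assumed buckets, the p99 estimate is interpolated between 0.5 s and 1 s, and the p999 is capped at 1 s because nothing finer is recorded above the last finite bound. Buckets that are too coarse near your SLO threshold make the estimate correspondingly coarse.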

The USE Method — Resources from the Inside Out

USE stands for Utilization, Saturation, and Errors. It was defined by Brendan Gregg and is the correct frame for every resource a system consumes — CPUs, memory, disk I/O, network interfaces, thread pools, database connection pools.

Utilization — what percentage of time is this resource busy? CPU at 70% utilization means 30% headroom. A disk I/O controller at 99% utilization has almost none. Utilization is a capacity planning signal.
Saturation — how much work is queued because the resource cannot keep up? A CPU can average 70% utilization over a window while its run queue regularly holds 4 waiting tasks; it is saturated in bursts despite the moderate average. Queue depth is often the earliest warning of an impending latency blow-up, because latency spikes lag saturation by seconds to minutes.
Errors — is the resource reporting hardware or software errors? Disk read errors, network packet drops, memory ECC corrections, TCP retransmits. These are resource-level errors, distinct from application-level errors in RED.

USE finds the bottleneck: when RED signals degrade and you do not know why, systematically apply USE to every resource in the request path (CPU, memory, disk, network, connection pools). The first resource showing high saturation is usually the bottleneck. Working through the list in order can turn a long war-room call into a short diagnosis.
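
The systematic pass over resources can be sketched as a simple checklist loop. Everything here is an assumption made for illustration: the `ResourceReading` shape, the threshold values, and the sample readings are hypothetical and would need tuning per resource.

```python
from dataclasses import dataclass


@dataclass
class ResourceReading:
    name: str
    utilization: float   # fraction of time busy, 0.0 to 1.0
    queue_depth: float   # work waiting for the resource (saturation)
    errors_per_s: float  # resource-level errors per second


def use_findings(readings, max_utilization=0.9, max_queue_depth=1.0):
    """Walk the request path in order and flag resources that fail a USE check.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    findings = []
    for r in readings:
        reasons = []
        if r.errors_per_s > 0:
            reasons.append("errors")
        if r.queue_depth > max_queue_depth:
            reasons.append("saturation")
        if r.utilization > max_utilization:
            reasons.append("utilization")
        if reasons:
            findings.append((r.name, reasons))
    return findings


# Hypothetical readings for the resources on one request path.
request_path = [
    ResourceReading("cpu", utilization=0.70, queue_depth=0.4, errors_per_s=0),
    ResourceReading("disk", utilization=0.35, queue_depth=0.1, errors_per_s=0),
    ResourceReading("db_connection_pool", utilization=1.0, queue_depth=200, errors_per_s=0),
]
findings = use_findings(request_path)
print(findings)
```

The value of the loop is less the thresholds than the discipline: every resource gets the same three questions, so nothing on the path is skipped.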

Node Exporter exposes Linux resource metrics for USE analysis. Key PromQL queries:

# USE — Node Exporter PromQL queries

# CPU Utilization: fraction busy (not idle), averaged across all cores
1 - avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
)

# CPU Saturation: normalized run-queue length (load avg per CPU core)
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})

# Memory Utilization: fraction of RAM in use
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Memory Saturation: swap activity (pages swapped in + out per second)
rate(node_vmstat_pswpin[5m]) + rate(node_vmstat_pswpout[5m])

# Disk I/O Utilization: fraction of time device was busy
rate(node_disk_io_time_seconds_total{device="sda"}[5m])

# Disk Saturation: average I/O queue length
rate(node_disk_io_time_weighted_seconds_total{device="sda"}[5m])

# Network Errors: receive + transmit errors per second
rate(node_network_receive_errs_total[5m])
  + rate(node_network_transmit_errs_total[5m])
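
Several of the queries above turn a cumulative counter into a utilization figure. The Python sketch below shows the idea behind the CPU query using two samples of a per-core idle-seconds counter. It is a simplification: it ignores the counter resets and window extrapolation that PromQL's rate() handles, and the sample values are made up.

```python
def counter_rate(first_value, last_value, elapsed_s):
    """Per-second increase of a cumulative counter (no reset handling)."""
    return (last_value - first_value) / elapsed_s


def cpu_utilization(idle_first, idle_last, elapsed_s):
    """Fraction of CPU time spent busy, averaged across cores.

    idle_first / idle_last hold cumulative idle seconds per core at the
    start and end of the window.
    """
    idle_fractions = [
        counter_rate(first, last, elapsed_s)
        for first, last in zip(idle_first, idle_last)
    ]
    return 1 - sum(idle_fractions) / len(idle_fractions)


# Hypothetical two-core host sampled 300 seconds apart:
# core 0 was idle for 90 s of the window, core 1 for 150 s.
utilization = cpu_utilization(
    idle_first=[1000.0, 2000.0],
    idle_last=[1090.0, 2150.0],
    elapsed_s=300,
)
print(round(utilization, 2))
```

Because the counter is measured in seconds of idle time, its per-second rate is already a fraction between 0 and 1, which is why subtracting the averaged idle rate from 1 yields utilization.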

The Four Golden Signals

Google's Site Reliability Engineering book defines four signals as the minimum required for any production service. They map cleanly onto RED and USE:

Latency — time to serve a request, distinguishing successful from failed (a fast error is not a success). → RED Duration
Traffic — demand on the system: RPS, transactions/second, active connections. → RED Rate
Errors — rate of failed requests, both explicit (HTTP 500) and implicit (HTTP 200 with wrong payload). → RED Errors
Saturation — how "full" the service is; emphasizes resources most constrained. A service nearing saturation degrades before utilization hits 100%. → USE Saturation

Production pitfall: averaging latency. A mean (for example http_request_duration_seconds_sum divided by http_request_duration_seconds_count) hides the tail. A bimodal distribution where 95% of requests take 5 ms and 5% take 2,000 ms has a mean of about 105 ms, which looks fine on a dashboard while one request in twenty takes 2 seconds. Alert on p99 (and often p999 for high-volume services) rather than on the mean. Averaging latency is one of the most common dashboard mistakes.
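
The arithmetic behind the bimodal example can be checked directly. This short Python sketch builds 1,000 synthetic requests with the 95% / 5% split described above and compares the mean with nearest-rank percentiles; the sample size and the `percentile` helper are illustrative choices.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# 1,000 synthetic requests: 95% take 5 ms, 5% take 2,000 ms.
latencies_ms = sorted([5] * 950 + [2000] * 50)

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

The mean falls between the two modes at a latency no request actually experienced, while the p50 and p99 each report one of the real modes.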

RED covers services (rate, errors, duration); USE covers resources (utilization, saturation, errors). Together they map onto Google's Four Golden Signals.

Where These Methods Break Down — and What Fills the Gap

RED and USE are not exhaustive. They are the minimum. At scale you also need business metrics such as orders per second, checkout conversion rate, or ad click-through rate, because a service can appear perfectly healthy at the infrastructure level while silently returning wrong data that destroys business outcomes. Payment platforms such as Stripe, for example, have every reason to treat charge success rate as a first-class signal alongside RED metrics. These business-level metrics are often the only ones that catch subtle correctness bugs that pass all infrastructure health checks.

You also need dependency RED metrics: instrument every outbound call your service makes — to databases, caches, downstream APIs, message brokers — with its own Rate, Errors, and Duration triple. A dependency degrading silently is one of the most common root causes of latency regressions that look like your service's fault but are not.
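
One way to picture a per-dependency RED triple is a small wrapper around every outbound call. The sketch below keeps the numbers in memory purely for illustration; the `DependencyStats` class and the dependency names are hypothetical, and in a real Prometheus setup the same data would more likely live in a counter and a histogram labeled by dependency.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class DependencyStats:
    """In-memory Rate / Errors / Duration bookkeeping per dependency."""

    def __init__(self):
        self.calls = defaultdict(int)         # Rate: total calls made
        self.errors = defaultdict(int)        # Errors: calls that raised
        self.durations_s = defaultdict(list)  # Duration: seconds per call

    @contextmanager
    def track(self, dependency):
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.errors[dependency] += 1
            raise  # never swallow the caller's exception
        finally:
            self.calls[dependency] += 1
            self.durations_s[dependency].append(time.perf_counter() - start)


stats = DependencyStats()

with stats.track("postgres"):
    pass  # stand-in for a real database query

try:
    with stats.track("payments-api"):
        raise TimeoutError("simulated downstream timeout")
except TimeoutError:
    pass

print(dict(stats.calls), dict(stats.errors))
```

Recording the duration in the `finally` branch means failed calls are timed too, which matters because slow failures such as timeouts are often the ones dragging down your own latency.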

Decision rule for new metrics: before adding any metric to your codebase, ask: "does this metric tell me Rate, Errors, Duration, Utilization, Saturation, or a business outcome?" If the answer is none of the above, the metric is probably noise. This rule keeps your cardinality under control and your dashboards readable.

Applying RED and USE During an Incident

The structured approach to an unknown incident is: RED first, then USE to find the why. Start by confirming which RED signals are degraded — is it latency, errors, or both? Is rate normal or have clients started retrying (rate spike)? Once you know the symptom, apply USE to each resource in the request path until you find saturation or errors. This is the methodical approach that separates engineers who navigate incidents calmly from engineers who thrash randomly through dashboards.

A worked example: p99 latency of an order service spikes to 8 seconds. RED confirms degraded Duration, normal Errors, normal Rate. USE on the database connection pool shows saturation: 200 requests are queued waiting for a connection. Root cause: a slow query (introduced in the last deploy) is holding connections for 4 seconds each, starving all other requests. Fix: roll back or add an index. Without the USE framework, the team could easily have spent time checking CPU, network, and deployment configs before finding the actual bottleneck.
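
A rough back-of-the-envelope model shows why a 4-second hold time starves the pool so quickly. Only the 4-second hold time and the queue depth of 200 come from the example; the pool size, the healthy hold time, and the arrival rate below are hypothetical numbers chosen to make the arithmetic concrete.

```python
def pool_capacity_rps(pool_size, hold_time_s):
    """Maximum requests per second a pool can serve if each request
    holds one connection for hold_time_s seconds."""
    return pool_size / hold_time_s


def seconds_until_queue_depth(target_depth, arrival_rps, capacity_rps):
    """Time for the wait queue to reach target_depth, or None if the
    pool keeps up with demand."""
    growth_per_s = arrival_rps - capacity_rps
    if growth_per_s <= 0:
        return None
    return target_depth / growth_per_s


POOL_SIZE = 20      # assumption: not stated in the example
ARRIVAL_RPS = 25.0  # assumption: not stated in the example

healthy_capacity = pool_capacity_rps(POOL_SIZE, hold_time_s=0.05)  # assumed fast query
degraded_capacity = pool_capacity_rps(POOL_SIZE, hold_time_s=4.0)  # slow query from the example

time_to_200 = seconds_until_queue_depth(200, ARRIVAL_RPS, degraded_capacity)
print(healthy_capacity, degraded_capacity, time_to_200)
```

Under these assumptions the pool's capacity drops from 400 to 5 requests per second, and the queue reaches 200 waiters in about ten seconds while CPU, disk, and network can all still look idle.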