Metrics That Matter
Instrumenting a service is easy. Knowing which metrics actually tell you whether the service is healthy, and which are noise, is the hard part. Two complementary mental models cut through the chaos: the RED method for understanding services from the outside, and the USE method for understanding resources from the inside. Together they map closely onto the Four Golden Signals from Google's Site Reliability Engineering book, which recommends them as the metrics to focus on if you can measure only four things about a user-facing system.
These frameworks were born from painful experience. Before them, teams would instrument dozens of arbitrary internal metrics and then stare at hundreds of Grafana panels during incidents, unable to answer the single question that matters: is this service working for users right now?
The RED Method — Services from the Outside In
RED stands for Rate, Errors, and Duration. It was coined by Tom Wilkie, who introduced it while at Weaveworks and later joined Grafana Labs, and it is the natural frame for any service that handles requests: HTTP APIs, gRPC services, message consumers, database queries.
- Rate: how many requests per second is the service receiving? This is your demand signal. If rate drops suddenly, either load disappeared (an upstream problem) or the service is shedding connections (your problem).
- Errors: what fraction of requests are failing? Track HTTP 5xx responses and gRPC non-OK codes, and track application-level business errors separately from infrastructure errors. A 1% error rate sounds small; at 10,000 RPS that is 100 failed requests every second.
- Duration: how long do requests take? Always instrument as a histogram or summary, never just an average. The p50 tells you what a typical user experiences; the p99 is the latency that the slowest 1% of requests exceed; the p99.9 reveals outliers that surface as SLO breaches at scale. Mean latency hides bimodal distributions completely.
Prometheus with Grafana is a widely used stack for RED. A request counter and a latency histogram exported by a Kubernetes service are enough to derive all three signals in PromQL.
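As a minimal sketch, the three RED signals might be queried as follows. It assumes the service exports a counter named http_requests_total with a status label and a histogram named http_request_duration_seconds, and that it is scraped under job="orders". These names follow common Prometheus conventions but are assumptions, not taken from any particular service.

```promql
# Rate: requests per second over the last 5 minutes (assumed metric and job names).
sum(rate(http_requests_total{job="orders"}[5m]))

# Errors: fraction of requests answered with a 5xx status.
  sum(rate(http_requests_total{job="orders", status=~"5.."}[5m]))
/
  sum(rate(http_requests_total{job="orders"}[5m]))

# Duration: estimated p99 latency in seconds, computed from histogram buckets.
histogram_quantile(
  0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="orders"}[5m]))
)
```

The quantile is interpolated within bucket boundaries, so its accuracy depends on how the histogram buckets are chosen.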
The USE Method — Resources from the Inside Out
USE stands for Utilization, Saturation, and Errors. It was defined by Brendan Gregg and is the correct frame for every resource a system consumes — CPUs, memory, disk I/O, network interfaces, thread pools, database connection pools.
- Utilization: what percentage of time is this resource busy? A CPU at 70% utilization has 30% headroom; a disk I/O controller at 99% utilization has almost none. Utilization is a capacity planning signal.
- Saturation: how much work is queued because the resource cannot keep up? A CPU averaging 70% utilization with a run-queue depth of 4 is saturated despite the moderate average, because work is already waiting for a turn. Queue depth is often the earliest warning of an impending latency blow-up, with latency spikes typically lagging saturation by seconds to minutes.
- Errors: is the resource reporting hardware or software errors? Examples include disk read errors, network packet drops, memory ECC corrections, and TCP retransmits. These are resource-level errors, distinct from the application-level errors in RED.
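The Prometheus Node Exporter exposes the Linux resource metrics needed for USE analysis, including CPU time, load average, memory, disk I/O time, and network drop and retransmit counters.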
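The queries below are a sketch of how USE might be expressed against Node Exporter metrics. The metric names are the ones Node Exporter exposes on Linux with its default collectors, but availability varies by version and configuration, and the 5-minute window is an arbitrary choice.

```promql
# CPU utilization: fraction of CPU time not spent idle, per instance.
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# CPU saturation: 1-minute load average per logical CPU.
# Sustained values above 1 suggest runnable work is queuing.
  sum by (instance) (node_load1)
/
  count by (instance) (node_cpu_seconds_total{mode="idle"})

# Memory utilization: fraction of memory not available for new allocations.
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Disk utilization: fraction of time each device was busy with I/O.
rate(node_disk_io_time_seconds_total[5m])

# Disk saturation: weighted I/O time, which approximates the average I/O queue length.
rate(node_disk_io_time_weighted_seconds_total[5m])

# Errors: dropped packets per interface, and TCP segments retransmitted per second.
rate(node_network_receive_drop_total[5m]) + rate(node_network_transmit_drop_total[5m])
rate(node_netstat_Tcp_RetransSegs[5m])
```

Linux load average also counts tasks in uninterruptible sleep, so the load-per-CPU ratio is only an approximate CPU saturation signal.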
The Four Golden Signals
Google's Site Reliability Engineering book defines four signals as the minimum required for any production service. They map cleanly onto RED and USE:
- Latency: time to serve a request, distinguishing successful requests from failed ones (a fast error is not a success). → RED Duration
- Traffic: demand on the system, measured as RPS, transactions per second, or active connections. → RED Rate
- Errors: rate of failed requests, both explicit (HTTP 500) and implicit (HTTP 200 with the wrong payload). → RED Errors
- Saturation: how "full" the service is, with emphasis on the most constrained resources. A service nearing saturation often degrades before utilization hits 100%. → USE Saturation
avg(http_request_duration_seconds) hides tail latency. Take a bimodal distribution where 95% of requests take 5 ms and 5% take 2,000 ms: the mean is 0.95 × 5 ms + 0.05 × 2,000 ms ≈ 105 ms, which looks fine on a dashboard while one request in twenty takes 2 seconds. At 10,000 RPS that is 500 slow requests every second. Always alert on p99 (and often p99.9 for high-volume services), never on the mean. Alerting on averages is one of the most common dashboard mistakes.
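To make the contrast concrete, here is a sketch of the mean and the p99 computed from the same histogram. It assumes a Prometheus histogram named http_request_duration_seconds, and the 0.5-second threshold is a hypothetical objective chosen only for illustration.

```promql
# Mean latency: total observed seconds divided by total observations.
  sum(rate(http_request_duration_seconds_sum[5m]))
/
  sum(rate(http_request_duration_seconds_count[5m]))

# p99 latency from the same histogram; in the 95%/5% example it falls in the slow mode.
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Alert condition sketch: true when p99 exceeds the assumed 0.5 s objective.
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
```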
Where These Methods Break Down — and What Fills the Gap
RED and USE are not exhaustive; they are the minimum. Large production systems also need business metrics such as orders per second, checkout conversion rate, or ad click-through rate, because a service can look perfectly healthy at the infrastructure level while silently returning wrong data that damages business outcomes. Stripe, for example, is often cited as treating charge success rate as a first-class signal alongside RED metrics. Business-level metrics are often the only ones that catch subtle correctness bugs that pass every infrastructure health check.
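A business metric such as charge success rate can be queried the same way as an error ratio. The sketch below assumes a hypothetical counter named payment_charges_total with an outcome label; neither name comes from a real system.

```promql
# Charge success rate over the last 5 minutes (hypothetical metric and label names).
  sum(rate(payment_charges_total{outcome="succeeded"}[5m]))
/
  sum(rate(payment_charges_total[5m]))
```

A drop in this ratio while the RED error rate stays flat would point to a correctness problem rather than an infrastructure failure.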
You also need dependency RED metrics: instrument every outbound call your service makes — to databases, caches, downstream APIs, message brokers — with its own Rate, Errors, and Duration triple. A dependency degrading silently is one of the most common root causes of latency regressions that look like your service's fault but are not.
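One way to sketch dependency RED is to label client-side metrics by the dependency being called. The metric names dependency_requests_total and dependency_request_duration_seconds, and the dependency and outcome labels, are hypothetical and would need to match whatever the client instrumentation actually exports.

```promql
# Rate: outbound calls per second, per dependency (hypothetical metric names).
sum by (dependency) (rate(dependency_requests_total[5m]))

# Errors: fraction of outbound calls that failed, per dependency.
  sum by (dependency) (rate(dependency_requests_total{outcome="error"}[5m]))
/
  sum by (dependency) (rate(dependency_requests_total[5m]))

# Duration: estimated p99 latency of outbound calls, per dependency.
histogram_quantile(
  0.99,
  sum by (dependency, le) (rate(dependency_request_duration_seconds_bucket[5m]))
)
```

Grouping by dependency keeps one panel per downstream system, which makes it easier to see which dependency degraded first.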
Applying RED and USE During an Incident
The structured approach to an unknown incident is: RED first, then USE to find the why. Start by confirming which RED signals are degraded — is it latency, errors, or both? Is rate normal or have clients started retrying (rate spike)? Once you know the symptom, apply USE to each resource in the request path until you find saturation or errors. This is the methodical approach that separates engineers who navigate incidents calmly from engineers who thrash randomly through dashboards.
A real example: p99 latency of an order service spikes to 8 seconds. RED confirms degraded Duration with normal Errors and normal Rate. USE on the database connection pool shows saturation: 200 requests are queued waiting for a connection. Root cause: a slow query introduced in the last deploy holds each connection for 4 seconds, starving all other requests. Fix: roll back the deploy or add the missing index. Without the USE framework, the team would likely have spent time checking CPU, network, and deployment configs before finding the actual bottleneck.
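The connection pool check in this example could look like the sketch below. The gauges db_pool_connections_in_use, db_pool_connections_max, and db_pool_pending_requests are hypothetical names; real pool libraries export similar gauges under their own names, and the division assumes both gauges carry identical labels.

```promql
# Utilization: fraction of pool connections currently checked out (hypothetical gauges).
db_pool_connections_in_use / db_pool_connections_max

# Saturation: requests waiting for a free connection (about 200 in the incident above).
db_pool_pending_requests

# Peak queue depth over the last 10 minutes, to catch bursts that have since drained.
max_over_time(db_pool_pending_requests[10m])
```

Utilization pinned at 1 with a growing pending count is the pattern that distinguishes pool starvation from a slow database host.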