Alerting on SLOs
Most teams alert on static thresholds: CPU over 80 %, error rate over 1 %, p99 latency over 500 ms. Those thresholds are arbitrary. They page on non-issues, yet they stay quiet during slow, rolling failures that drain your error budget over days. SLO-based alerting inverts this: you page only when you are consuming your error budget faster than you can afford to, measured across multiple time windows simultaneously. This is how Google, Spotify, and Cloudflare achieve both high reliability and low alert fatigue at scale.
The Core Idea: Burn Rate
Your error budget for a 30-day SLO is the fraction of requests you are allowed to serve badly. A 99.9 % SLO gives you 0.1 % bad events: the equivalent of 43.2 minutes of full outage (0.1 % of 43,200 minutes), or a continuous 0.1 % trickle of errors for 30 days. Burn rate tells you how fast you are consuming that budget relative to the pace that would spend exactly the whole budget by the end of the window.
- Burn rate = 1: you are on track to use exactly 100 % of your budget in 30 days. Acceptable.
- Burn rate = 2: you will exhaust the budget in 15 days. Investigate soon.
- Burn rate = 14.4: 2 % of the month's budget is consumed every hour, so the full budget is gone in 50 hours. Page immediately.
- Burn rate = 36: 5 % of the month's budget is consumed every hour, so the budget is exhausted in 20 hours. Critical outage in progress.
The formula is burn_rate = bad_event_rate / (1 - SLO_target). For a 99.9 % SLO, a 1.44 % bad-event rate gives a burn rate of 14.4 (1.44 % / 0.1 %).
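The formula and the time-to-exhaustion arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not part of any monitoring library: the function names are hypothetical, and the exhaustion estimate assumes a full, untouched budget and a constant burn rate.

```python
def burn_rate(bad_event_ratio: float, slo_target: float) -> float:
    """Observed bad-event ratio divided by the error budget (1 - SLO target)."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return bad_event_ratio / error_budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is spent, assuming the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_days * 24 / rate


if __name__ == "__main__":
    rate = burn_rate(bad_event_ratio=0.0144, slo_target=0.999)
    print(f"burn rate: {rate:.1f}")
    print(f"hours to exhaustion: {hours_to_exhaustion(rate):.1f}")
```

For a 1.44 % bad-event ratio against a 99.9 % target this evaluates to a burn rate of about 14.4 and roughly 50 hours until the 30-day budget is spent.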
Why Single-Window Alerts Fail
A single short window (5 min) catches spikes quickly but generates enormous noise: a 10-second spike of 503s fires the pager even if the budget impact is trivial. A single long window (1 hour) suppresses noise but is slow to react to a major outage that burns the budget in minutes, and it keeps firing long after the errors have stopped. The answer is multi-window, multi-burn-rate alerting: pair a long window (to confirm that a significant share of the budget has been burned) with a short window (to confirm that the burn is still happening), and tier your thresholds by how fast the damage accumulates.
The Google-Recommended Alert Tiers
The SRE Workbook (2018) and subsequent industry practice converge on four alert tiers for a 30-day SLO window. Each tier fires only when both the short and long windows simultaneously exceed the burn-rate threshold. This multi-window condition eliminates most false positives.
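Each tier's burn-rate threshold follows from how much of the budget that tier is allowed to spend within its long window: threshold = budget fraction × SLO period / long window. The sketch below derives the thresholds in Python. The tier list is an assumption: it uses the three burn-rate tiers published in the SRE Workbook (2 % in 1 hour, 5 % in 6 hours, 10 % in 3 days, each with a short window one twelfth of the long one), and any additional tier would be derived the same way. The tier names are hypothetical labels.

```python
from dataclasses import dataclass

SLO_PERIOD_HOURS = 30 * 24  # 30-day SLO window


@dataclass(frozen=True)
class Tier:
    name: str                  # hypothetical label, for illustration only
    budget_fraction: float     # share of the period's budget spent within the long window
    long_window_hours: float
    short_window_hours: float

    @property
    def burn_rate_threshold(self) -> float:
        # Burn rate that spends budget_fraction of the budget in long_window_hours.
        return self.budget_fraction * SLO_PERIOD_HOURS / self.long_window_hours


# Assumed tiers, taken from the SRE Workbook's published recommendation.
TIERS = [
    Tier("page-fast", 0.02, 1, 5 / 60),
    Tier("page-slow", 0.05, 6, 0.5),
    Tier("ticket", 0.10, 72, 6),
]

if __name__ == "__main__":
    for tier in TIERS:
        print(f"{tier.name}: burn rate > {tier.burn_rate_threshold:.1f} "
              f"over {tier.long_window_hours}h and {tier.short_window_hours:.2f}h")
```

With these inputs the thresholds come out at approximately 14.4, 6, and 1.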
Implementing in Prometheus + Alertmanager
The PromQL pattern uses increase() or rate() over each window to compute the error ratio, then divides that ratio by the error budget (1 - SLO target) to obtain the burn rate. The canonical approach records a job:slo_errors:rate5m recording rule and builds alerts off it. For a 99.9 % availability SLO on an HTTP service, a burn-rate threshold of 14.4 translates to an error ratio above 0.0144 in both windows.
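As a rough sketch of what such an alert expression looks like, the Python helper below assembles the PromQL string for one tier. It rests on assumptions that are not confirmed by the text: that recording rules named job:slo_errors:rate<window> exist for every window used (only the 5m rule is named above), and that each one holds the error ratio per job. The helper name is hypothetical.

```python
def multiwindow_expr(threshold: float, long_window: str, short_window: str,
                     slo_target: float = 0.999) -> str:
    """Build a dual-window burn-rate condition as a PromQL string.

    Assumes recording rules job:slo_errors:rate<window> that hold the
    per-job error ratio over that window (an assumption, see lead-in).
    """
    # Error ratio at which the burn rate equals the threshold.
    limit = threshold * (1 - slo_target)
    return (
        f"job:slo_errors:rate{long_window} > {limit:.6g}\n"
        f"and\n"
        f"job:slo_errors:rate{short_window} > {limit:.6g}"
    )


if __name__ == "__main__":
    print(multiwindow_expr(14.4, "1h", "5m"))
```

The `and` operator keeps only series present on both sides with matching labels, so the alert fires per job only when both windows exceed the limit.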
A for: 2m clause in Prometheus means the condition must be continuously true for 2 minutes before the alert fires. Combined with the dual-window AND condition, you get three independent noise filters: the short-window error rate must be elevated, the long-window error rate must be elevated, and that elevation must be sustained. Remove any one of these and false-positive pages will return.
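The interaction of the dual-window condition with the for: duration can be modelled in a few lines. This is a simplified simulation, not Prometheus code: it assumes the rule is evaluated at the timestamps supplied, and the function name and sample layout are hypothetical.

```python
from typing import Iterable, Optional, Tuple

Sample = Tuple[int, float, float]  # (timestamp in seconds, long-window ratio, short-window ratio)


def first_firing_time(samples: Iterable[Sample], threshold: float,
                      slo_target: float = 0.999,
                      for_seconds: int = 120) -> Optional[int]:
    """Return the timestamp at which the alert would fire, or None."""
    limit = threshold * (1 - slo_target)
    pending_since: Optional[int] = None
    for ts, long_ratio, short_ratio in samples:
        if long_ratio > limit and short_ratio > limit:
            if pending_since is None:
                pending_since = ts          # condition just became true: pending
            if ts - pending_since >= for_seconds:
                return ts                   # held for the whole duration: firing
        else:
            pending_since = None            # any false evaluation resets the timer
    return None


if __name__ == "__main__":
    steady = [(t, 0.02, 0.05) for t in (0, 60, 120, 180)]
    print(first_firing_time(steady, threshold=14.4))
```

A single evaluation in which either window drops below the limit resets the pending timer, which is why a brief spike does not page.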
Routing and Inhibition in Alertmanager
Configuring the thresholds is only half the job. Route Tier 1 and Tier 2 alerts to PagerDuty with a repeat_interval of 30 minutes; route Tier 3 to Jira or Slack. Add an inhibition rule so that when a critical alert fires for a job, the high alert for the same job is suppressed — otherwise on-call receives two pages for the same incident.
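The effect of the inhibition rule can be illustrated with a small Python model. In Alertmanager itself this is configured under inhibit_rules, with source_matchers, target_matchers, and an equal list naming the labels that must match (here, job). The code below only mimics that behaviour on plain dictionaries and assumes the alerts carry job and severity labels with the values critical and high.

```python
from typing import Dict, List

Alert = Dict[str, str]  # label name -> label value


def apply_inhibition(alerts: List[Alert]) -> List[Alert]:
    """Drop 'high' alerts whose job also has a firing 'critical' alert."""
    critical_jobs = {a["job"] for a in alerts if a["severity"] == "critical"}
    return [
        a for a in alerts
        if not (a["severity"] == "high" and a["job"] in critical_jobs)
    ]


if __name__ == "__main__":
    firing = [
        {"job": "checkout", "severity": "critical"},
        {"job": "checkout", "severity": "high"},
        {"job": "search", "severity": "high"},
    ]
    for alert in apply_inhibition(firing):
        print(alert)
```

Only the high alert for the job that already has a critical alert is suppressed; high alerts for other jobs still notify.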
Latency SLOs and the Histogram Problem
Latency SLOs follow the same burn-rate algebra, but the SLI must come from a histogram, not a gauge. The correct metric is the fraction of requests served under the target latency threshold. Using the average latency as your SLI is a common production mistake: averages hide the long tail where users actually feel pain.
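Prometheus histograms expose cumulative bucket counters labelled with an upper bound (le), so the fraction of fast requests is the count in the bucket at the target threshold divided by the count in the +Inf bucket. The sketch below shows that calculation on a plain dictionary. The function name and the sample numbers are illustrative, and it assumes the latency target coincides with a bucket boundary, since the exact fraction cannot be read from the histogram otherwise.

```python
from typing import Dict


def latency_sli(cumulative_buckets: Dict[float, int], threshold_s: float) -> float:
    """Fraction of requests served within threshold_s.

    cumulative_buckets maps each upper bound (le) to its cumulative count
    and must include float('inf') for the total.
    """
    if threshold_s not in cumulative_buckets:
        raise ValueError("threshold must match a histogram bucket boundary")
    total = cumulative_buckets[float("inf")]
    if total == 0:
        return 1.0  # no traffic: treat the SLI as met (a policy choice)
    return cumulative_buckets[threshold_s] / total


if __name__ == "__main__":
    buckets = {0.1: 900, 0.5: 980, 1.0: 995, float("inf"): 1000}
    sli = latency_sli(buckets, threshold_s=0.5)
    print(f"SLI: {sli:.3f}, bad-event ratio: {1 - sli:.3f}")
```

The resulting bad-event ratio (1 - SLI) plugs into the same burn-rate formula used for availability.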
Alert Fatigue Anti-Patterns
Even with multi-window logic, teams accumulate fatigue through predictable mistakes:
- Setting the burn-rate threshold too low: a threshold of 2× means you page whenever the error rate exceeds 0.2 % on a 99.9 % SLO. That will fire multiple times per week on healthy services with natural variance.
- Forgetting the error budget context: an alert that fires at burn rate 3 on a service with 25 days of budget remaining is informational, not critical. Encode remaining budget into your runbook decision tree.
- Not reviewing alert history monthly: Google's SRE guidance is that every page should be actionable, so each one should result in either a fix or a threshold change. If the same alert fires and on-call does nothing, the threshold is wrong.
If you keep high-cardinality labels such as user_id or request_id in your recording rules, the resulting time-series fan-out can exhaust Prometheus memory. Always aggregate to the job or service level before computing burn rates. Use exemplars and tracing (covered in the Distributed Tracing tutorial) to drill into individual requests after the alert fires.
Connecting Alerts Back to Error Budgets
The final discipline: every fired Tier 1 or Tier 2 alert must appear in your error budget report. Track the start and end times of each budget-burning event, and calculate how many minutes of budget were consumed. This data drives the quarterly SLO review where you decide whether to tighten the SLO, loosen it, or invest engineering time in reliability improvements. Without this feedback loop, your alert thresholds calcify and drift from the actual reliability your users experience.
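The per-incident calculation can be sketched as follows. This is a simplified model with hypothetical function names: it assumes roughly constant traffic, so that an incident's cost in budget minutes is its duration multiplied by the fraction of requests that failed during it.

```python
from datetime import datetime


def total_budget_minutes(slo_target: float = 0.999, window_days: int = 30) -> float:
    """Error budget expressed as equivalent minutes of full outage."""
    return window_days * 24 * 60 * (1 - slo_target)


def budget_minutes_consumed(start: datetime, end: datetime, error_ratio: float) -> float:
    """Equivalent full-outage minutes for one incident (constant traffic assumed)."""
    duration_minutes = (end - start).total_seconds() / 60
    return duration_minutes * error_ratio


if __name__ == "__main__":
    # Illustrative incident: two hours during which 10 % of requests failed.
    used = budget_minutes_consumed(
        datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 12, 0), error_ratio=0.10
    )
    budget = total_budget_minutes()
    print(f"consumed {used:.1f} of {budget:.1f} budget minutes "
          f"({used / budget:.0%} of the period's budget)")
```

Summing these values over the period gives the figure the quarterly review needs: how much of the budget was spent, and on which incidents.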