Site Reliability Engineering (SRE)

Alerting on SLOs


Most teams alert on static thresholds: CPU over 80 %, error rate over 1 %, p99 latency over 500 ms. Those thresholds are arbitrary: they page on non-issues, and they stay quiet during slow, rolling failures that drain your error budget over days. SLO-based alerting inverts this. You page only when you are consuming your error budget faster than you can afford to, measured across multiple time windows simultaneously. This is the approach described in Google's SRE Workbook, and it is how large engineering organizations aim for both high reliability and low alert fatigue.

The Core Idea: Burn Rate

Your error budget for a 30-day SLO is the fraction of requests you are allowed to serve badly. A 99.9 % SLO gives you 0.1 % bad events — that is 43.2 minutes of full outage, or a continuous 0.1 % trickle for 30 days. Burn rate tells you how fast you are consuming that budget relative to the neutral pace.

  • Burn rate = 1: you are on track to use exactly 100 % of your budget in 30 days. Acceptable.
  • Burn rate = 2: you will exhaust the budget in 15 days. Investigate soon.
  • Burn rate = 14.4: you consume 2 % of the 30-day budget every hour, and all of it in 50 hours. Page immediately.
  • Burn rate = 36: budget exhausted in 20 hours. Critical outage in progress.

The formula is: burn_rate = (bad_event_rate / (1 - SLO_target)). For a 99.9 % SLO, a 1.44 % bad-event rate means burn rate 14.4 (1.44 % / 0.1 %).
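
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, assuming a 30-day window and a constant error rate; the function names are illustrative and not part of any library.

```python
def burn_rate(bad_event_rate: float, slo_target: float) -> float:
    """Burn rate = observed bad-event ratio divided by the error budget (1 - SLO)."""
    return bad_event_rate / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full, untouched budget is gone at a constant burn rate."""
    return window_days * 24 / rate


def budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of total outage within the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    for rate in (1, 2, 14.4, 36):
        print(f"burn rate {rate:>4}: budget gone in {hours_to_exhaustion(rate):6.1f} h")
    print(f"99.9 % over 30 days = {budget_minutes(0.999):.1f} min of full outage")
    print(f"1.44 % errors at 99.9 % = burn rate {burn_rate(0.0144, 0.999):.1f}")
```

Time to exhaustion is simply the window length divided by the burn rate, so a burn rate of 14.4 empties a 30-day budget in 720 / 14.4 = 50 hours.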

Why Single-Window Alerts Fail

A single short window (5 min) catches spikes quickly but generates enormous noise: a 10-second spike of 503s fires the pager even if the budget impact is trivial. A single long window (1 hour) suppresses noise but is slower to react and slow to reset, because old errors stay inside the window long after the problem is fixed. The answer is multi-window, multi-burn-rate alerting: use a long window to establish that a significant share of the budget has been burned, pair it with a short window to confirm that the burn is still happening now, and tier your thresholds by how fast the damage accumulates.
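
To make the two-window condition concrete, here is a small simulation. It is a sketch under stated assumptions: per-minute request counts are invented, the 14.4 threshold and the 5 min / 1 hr windows are taken from the tiers in this lesson, and all names are hypothetical.

```python
from typing import List, Tuple

SLO_TARGET = 0.999           # assumed target, matching the 99.9 % example
BUDGET = 1.0 - SLO_TARGET    # allowed bad-event ratio

# One (bad_requests, total_requests) sample per minute, oldest first.
Minute = Tuple[int, int]


def window_burn_rate(samples: List[Minute], minutes: int) -> float:
    """Burn rate over the most recent `minutes` samples."""
    recent = samples[-minutes:]
    bad = sum(b for b, _ in recent)
    total = sum(t for _, t in recent)
    return (bad / total) / BUDGET if total else 0.0


def should_page(samples: List[Minute], threshold: float = 14.4,
                short_min: int = 5, long_min: int = 60) -> bool:
    """Page only when BOTH the short and the long window exceed the threshold."""
    return (window_burn_rate(samples, short_min) > threshold
            and window_burn_rate(samples, long_min) > threshold)


# Scenario A: a healthy hour ending in one minute that contains a ~10-second spike.
blip = [(0, 1000)] * 59 + [(167, 1000)]
# Scenario B: 50 healthy minutes followed by 10 minutes at 20 % errors.
outage = [(0, 1000)] * 50 + [(200, 1000)] * 10

if __name__ == "__main__":
    for name, data in (("blip", blip), ("outage", outage)):
        print(name,
              round(window_burn_rate(data, 5), 1),
              round(window_burn_rate(data, 60), 1),
              should_page(data))
```

In scenario A the 5-minute window alone would page, but the 1-hour burn rate stays below the threshold, so the combined condition stays quiet. In scenario B both windows exceed the threshold and the page fires.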

The Google-Recommended Alert Tiers

The SRE Workbook (2018) recommends multi-window alerts at burn rates of 14.4, 6, and 1 for a 30-day SLO window; a common extension, used in this lesson, adds an intermediate ticket tier at burn rate 3, giving four tiers. Each two-window tier fires only when both its short and long windows simultaneously exceed the burn-rate threshold. This multi-window condition eliminates most false positives:

Multi-Window Multi-Burn-Rate Alert Tiers (in order of decreasing urgency)

  • Tier 1, critical page: burn rate ≥ 14.4 · short window 5 min · long window 1 hr. 2 % of the budget is consumed per hour, and the full budget is exhausted in about 50 hours. Both windows must fire. Immediate page.
  • Tier 2, high page: burn rate ≥ 6 · short window 30 min · long window 6 hr. Budget exhausted in 5 days or less. Both windows must fire. Page (may wait for ack).
  • Tier 3, ticket: burn rate ≥ 3 · short window 2 hr · long window 24 hr. Budget at risk. File a ticket; investigate next business day.
  • Tier 4, info / watch: burn rate ≥ 1 · single 3-day rolling window. Budget consumption on trend. Log and review in SLO report.

Four alert tiers: burn rate thresholds and dual time windows that must both fire before paging.
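
The tier selection can be expressed as a small lookup. The sketch below assumes burn rates have already been computed per window and are passed in as a dictionary keyed by window name; the tier names and the `highest_tier` helper are illustrative. It also shows how much of the 30-day budget is gone by the time each long window has been burning at its threshold.

```python
from typing import Dict, Optional

WINDOW_HOURS = 30 * 24  # 30-day SLO window

# (name, burn-rate threshold, short window, long window); windows are dict keys.
TIERS = [
    ("critical", 14.4, "5m", "1h"),
    ("high", 6.0, "30m", "6h"),
    ("ticket", 3.0, "2h", "24h"),
    ("info", 1.0, "3d", "3d"),   # single-window tier: short == long
]


def highest_tier(burn: Dict[str, float]) -> Optional[str]:
    """Return the most urgent tier whose short AND long windows meet its threshold."""
    for name, threshold, short, long_ in TIERS:  # ordered most urgent first
        if burn[short] >= threshold and burn[long_] >= threshold:
            return name
    return None


def budget_spent_fraction(burn_rate: float, long_window_hours: float) -> float:
    """Share of the 30-day budget burned over one long window at a constant rate."""
    return burn_rate * long_window_hours / WINDOW_HOURS


if __name__ == "__main__":
    observed = {"5m": 20.0, "1h": 8.0, "30m": 9.0, "6h": 7.0,
                "2h": 8.0, "24h": 4.0, "3d": 2.0}
    print(highest_tier(observed))
    for name, threshold, _, long_hours in (("critical", 14.4, None, 1),
                                           ("high", 6.0, None, 6),
                                           ("ticket", 3.0, None, 24),
                                           ("info", 1.0, None, 72)):
        print(name, f"{budget_spent_fraction(threshold, long_hours):.0%}")
```

With the sample values, the 5-minute burn rate is above 14.4 but the 1-hour rate is not, so the result falls through to the high tier rather than the critical one.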

Implementing in Prometheus + Alertmanager

The PromQL pattern uses rate() over each window to compute an error ratio, then compares that ratio with the burn-rate threshold multiplied by the error budget. A common approach records one rule per window, such as job:http_errors:rate5m, and builds the alerts on top of those rules. Below is an example for a 99.9 % availability SLO on an HTTP service; adapt the metric and label names to your own instrumentation:

    # recording_rules.yml: pre-compute per-window error ratios
    groups:
      - name: slo_http_availability
        interval: 30s
        rules:
          # Ratio of bad requests in each window
          - record: job:http_errors:rate5m
            expr: |
              sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
              /
              sum by (job) (rate(http_requests_total[5m]))
          - record: job:http_errors:rate30m
            expr: |
              sum by (job) (rate(http_requests_total{status=~"5.."}[30m]))
              /
              sum by (job) (rate(http_requests_total[30m]))
          - record: job:http_errors:rate1h
            expr: |
              sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
              /
              sum by (job) (rate(http_requests_total[1h]))
          - record: job:http_errors:rate6h
            expr: |
              sum by (job) (rate(http_requests_total{status=~"5.."}[6h]))
              /
              sum by (job) (rate(http_requests_total[6h]))

    # alerts.yml: multi-window multi-burn-rate SLO alerts (99.9 % target → budget = 0.001)
    groups:
      - name: slo_http_alerts
        rules:
          # TIER 1: burn rate ≥ 14.4 for 5m AND 1h
          - alert: HttpAvailabilityCritical
            expr: |
              job:http_errors:rate5m{job="api"} > (14.4 * 0.001)
              and
              job:http_errors:rate1h{job="api"} > (14.4 * 0.001)
            for: 2m
            labels:
              severity: critical
              slo: http_availability
            annotations:
              summary: "SLO CRITICAL: {{ $labels.job }} burning budget at 14.4x rate"
              # $value is the left-hand (5m) error ratio of the `and` expression
              description: "Error rate {{ $value | humanizePercentage }}. Budget exhausted in ~50h at this rate."
              runbook: "https://wiki.example.com/runbook/slo-critical"
          # TIER 2: burn rate ≥ 6 for 30m AND 6h
          - alert: HttpAvailabilityHigh
            expr: |
              job:http_errors:rate30m{job="api"} > (6 * 0.001)
              and
              job:http_errors:rate6h{job="api"} > (6 * 0.001)
            for: 5m
            labels:
              severity: high
              slo: http_availability
            annotations:
              summary: "SLO HIGH: {{ $labels.job }} burning budget at 6x rate"
              description: "Budget will exhaust in ~5 days at current rate."

The "for" clause and the multi-window condition serve different roles. The for: 2m in Prometheus means the condition must be continuously true for 2 minutes before firing. Combined with the dual-window AND condition, you get three noise filters: the short-window error rate must be elevated, the long-window error rate must be elevated, and that elevation must be sustained for the "for" duration. Remove any one of these and false-positive pages become more likely.

Routing and Inhibition in Alertmanager

Configuring the thresholds is only half the job. Route Tier 1 and Tier 2 alerts to PagerDuty with a repeat_interval of 30 minutes; route Tier 3 to Jira or Slack. Add an inhibition rule so that when a critical alert fires for a job, the high alert for the same job is suppressed — otherwise on-call receives two pages for the same incident.

    # alertmanager.yml: routing and inhibition
    route:
      group_by: [job, slo]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 30m
      receiver: slack-ops
      routes:
        - matchers:
            - severity =~ "critical|high"
          receiver: pagerduty-oncall
          continue: false
        - matchers:
            - severity = "ticket"
          receiver: jira-slo
    inhibit_rules:
      - source_matchers:
          - severity = critical
        target_matchers:
          - severity = high
        equal: [job, slo]  # suppress high when critical fires for same job+slo
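
The effect of the inhibition rule can be modelled in a few lines. This is a simplified illustration of the matching logic, not Alertmanager's implementation; the `apply_inhibition` helper and the sample alerts are hypothetical, and a missing label is treated the same as an empty one.

```python
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # label name -> label value


def apply_inhibition(firing: List[Alert],
                     source_severity: str = "critical",
                     target_severity: str = "high",
                     equal: Tuple[str, ...] = ("job", "slo")) -> List[Alert]:
    """Drop target alerts that share all `equal` labels with a firing source alert."""
    sources = {
        tuple(a.get(label, "") for label in equal)
        for a in firing
        if a.get("severity") == source_severity
    }
    return [
        a for a in firing
        if not (a.get("severity") == target_severity
                and tuple(a.get(label, "") for label in equal) in sources)
    ]


firing = [
    {"alertname": "HttpAvailabilityCritical", "severity": "critical",
     "job": "api", "slo": "http_availability"},
    {"alertname": "HttpAvailabilityHigh", "severity": "high",
     "job": "api", "slo": "http_availability"},
    {"alertname": "HttpAvailabilityHigh", "severity": "high",
     "job": "billing", "slo": "http_availability"},
]

if __name__ == "__main__":
    for alert in apply_inhibition(firing):
        print(alert["severity"], alert["job"])
```

Only the high alert for the api job is suppressed; the high alert for the billing job still notifies because no critical alert shares its job and slo labels.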

Latency SLOs and the Histogram Problem

Latency SLOs follow the same burn-rate algebra, but the SLI must come from a histogram, not a gauge. The correct metric is the fraction of requests served under the target latency threshold. Using the average latency as your SLI is a common production mistake, because averages hide the long tail where users actually feel pain.

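    # Latency SLI: fraction of requests completed under 200ms (99.9% target = budget 0.001)
    # le="0.2" only matches if the histogram defines a bucket boundary at 0.2 s
    - record: job:http_latency_ok:rate5m
      expr: |
        sum by (job) (rate(http_request_duration_seconds_bucket{le="0.2",job="api"}[5m]))
        /
        sum by (job) (rate(http_request_duration_seconds_count{job="api"}[5m]))
    # Same SLI over 1h; the Tier 1 alert below needs both windows
    - record: job:http_latency_ok:rate1h
      expr: |
        sum by (job) (rate(http_request_duration_seconds_bucket{le="0.2",job="api"}[1h]))
        /
        sum by (job) (rate(http_request_duration_seconds_count{job="api"}[1h]))
    # Latency burn-rate alert (Tier 1)
    - alert: HttpLatencyCritical
      expr: |
        (1 - job:http_latency_ok:rate5m{job="api"}) > (14.4 * 0.001)
        and
        (1 - job:http_latency_ok:rate1h{job="api"}) > (14.4 * 0.001)
      for: 2m
      labels:
        severity: critical
        slo: http_latency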
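
The same ratio can be worked through by hand from cumulative bucket counts. The snapshot below is invented for illustration and `latency_sli` is a hypothetical helper; it mirrors the bucket / count division in the recording rule rather than any Prometheus API.

```python
from typing import Dict

INF = float("inf")


def latency_sli(cumulative_buckets: Dict[float, int], threshold_s: float) -> float:
    """Fraction of requests at or under `threshold_s`, from cumulative bucket counts.

    `cumulative_buckets` maps each upper bound (`le`) to the number of requests
    at or below it; the INF key holds the total, like the `+Inf` bucket.
    """
    if threshold_s not in cumulative_buckets:
        raise KeyError(f"no bucket boundary at {threshold_s}s")
    total = cumulative_buckets[INF]
    return cumulative_buckets[threshold_s] / total if total else 1.0


# Hypothetical 5-minute snapshot: 10,000 requests, 150 of them slower than 200 ms.
buckets = {0.05: 7000, 0.1: 9000, 0.2: 9850, 0.5: 9950, 1.0: 9990, INF: 10000}

sli = latency_sli(buckets, 0.2)           # share of "good" requests
burn = (1 - sli) / (1 - 0.999)            # burn rate against a 99.9 % target

if __name__ == "__main__":
    print(f"SLI = {sli:.3f}, burn rate = {burn:.1f}")
```

Here 98.5 % of requests meet the 200 ms target, which is a burn rate of about 15 and would exceed the Tier 1 threshold if it held over both windows. The helper raises an error when the threshold is not a bucket boundary, since the SLI cannot be read exactly from the histogram in that case.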

Alert Fatigue Anti-Patterns

Even with multi-window logic, teams accumulate fatigue through predictable mistakes:

  • Setting the burn-rate threshold too low: a threshold of 2× means you page whenever the error rate is 0.2 % on a 99.9 % SLO. That can fire multiple times per week on healthy services with natural variance.
  • Forgetting the error budget context: an alert that fires at burn rate 3 on a service with most of its monthly budget still remaining is informational, not critical. Encode remaining budget into your runbook decision tree.
  • Not reviewing alert history monthly: a widely cited SRE principle is that every page should be actionable, so each page should result in either a fix or a threshold change. If the same alert fires and on-call does nothing, the threshold is wrong.

Production practice: annotate alerts with budget consumed. Add a PromQL expression to your alert annotation that shows how many minutes of error budget have been consumed in the current month. Incident tools such as Grafana OnCall and PagerDuty can display alert annotations in the incident body. On-call engineers make better triage decisions when they immediately see "14 minutes of 43-minute monthly budget consumed" rather than an opaque "error rate 1.44 %".
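
The "minutes consumed" figure is a simple conversion from the error ratio over the SLO window. A minimal sketch, assuming the bad and total request counts cover the trailing 30 days; `budget_status` and the sample counts are hypothetical.

```python
def budget_status(bad: int, total: int, slo_target: float,
                  window_days: int = 30) -> str:
    """Summarise budget use as full-outage-equivalent minutes for the window."""
    budget_ratio = 1.0 - slo_target
    budget_min = budget_ratio * window_days * 24 * 60
    # Fraction of the budget used = observed error ratio / allowed error ratio.
    consumed_fraction = (bad / total) / budget_ratio if total else 0.0
    consumed_min = consumed_fraction * budget_min
    return (f"{consumed_min:.0f} of {budget_min:.0f} min of "
            f"{window_days}-day budget consumed ({consumed_fraction:.0%})")


if __name__ == "__main__":
    # Assumed counts: 32,400 failed requests out of 100 million in 30 days.
    print(budget_status(32_400, 100_000_000, 0.999))
```

With these counts the error ratio is 0.0324 %, roughly a third of the 0.1 % allowance, which the summary reports as about 14 of 43 minutes.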

High-cardinality labels break multi-window recording rules. If you group by user_id or request_id in your recording rules, the resulting time series fan-out can exhaust Prometheus memory. Always aggregate to the job or service level before computing burn rates. Use exemplars and tracing (covered in the Distributed Tracing tutorial) to drill into individual requests after the alert fires.

Connecting Alerts Back to Error Budgets

The final discipline: every fired Tier 1 or Tier 2 alert must appear in your error budget report. Track the start and end times of each budget-burning event, and calculate how many minutes of budget were consumed. This data drives the quarterly SLO review where you decide whether to tighten the SLO, loosen it, or invest engineering time in reliability improvements. Without this feedback loop, your alert thresholds calcify and drift from the actual reliability your users experience.
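
One way to keep that ledger is to weight each incident's duration by its error ratio. The sketch below assumes roughly constant traffic, so that minutes multiplied by error ratio approximates full-outage-equivalent minutes; the `Incident` class, the `budget_report` helper, and the sample incidents are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class Incident:
    start: datetime
    end: datetime
    error_ratio: float  # average share of bad requests during the incident

    def budget_minutes(self) -> float:
        """Full-outage-equivalent minutes: duration weighted by the error ratio."""
        duration_min = (self.end - self.start).total_seconds() / 60
        return duration_min * self.error_ratio


def budget_report(incidents: List[Incident], slo_target: float = 0.999,
                  window_days: int = 30) -> Dict[str, float]:
    """Total, spent, and remaining budget in minutes for the SLO window."""
    budget = (1.0 - slo_target) * window_days * 24 * 60
    spent = sum(i.budget_minutes() for i in incidents)
    return {"budget_min": budget, "spent_min": spent, "remaining_min": budget - spent}


incidents = [
    # 30 minutes at 20 % errors
    Incident(datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 4, 10, 30), 0.20),
    # 2 hours at 5 % errors
    Incident(datetime(2024, 3, 12, 2, 0), datetime(2024, 3, 12, 4, 0), 0.05),
]

if __name__ == "__main__":
    for key, value in budget_report(incidents).items():
        print(f"{key}: {value:.1f}")
```

Each of the two sample incidents costs about 6 budget minutes, leaving roughly 31 of the 43.2 minutes for the rest of the window. A short severe incident and a long mild one can cost the same, which is the point of tracking budget rather than incident count.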