Error Budgets
An error budget is the maximum amount of unreliability a service is permitted to accumulate before it breaches its SLO, and it is the single most important idea that separates SRE practice from traditional operations. If your SLO says "99.9 % of requests succeed over a 30-day window," then 0.1 % of requests may fail, which is equivalent to 43.2 minutes of total outage per window. That 43.2 minutes is your error budget. Once it is spent, any further failure means you have delivered less than you promised your users.
The Error Budget Contract
The error budget is not a metric owned by the SRE team alone. It is a bilateral contract between product and reliability engineering. Product wants to ship features fast; reliability wants stability. The budget reconciles these tensions with a clear, shared number:
- If the budget is healthy — the team may take on more deployment risk, run experiments, and push releases aggressively.
- If the budget is exhausted — the deployment pipeline freezes (or is heavily restricted), and the next sprint is devoted entirely to reliability work, post-mortems, and hardening.
- If the budget is chronically unspent — the SLO is too loose and should be tightened; you are over-engineering reliability at the cost of velocity.
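The three cases above can be sketched as a small classifier. This is only an illustration: the names `BudgetState` and `classify_budget`, the 0.9 "unspent" threshold, and the three-window lookback are assumptions chosen for the example, not values from any standard.

```python
from enum import Enum


class BudgetState(Enum):
    HEALTHY = "healthy"
    EXHAUSTED = "exhausted"
    CHRONICALLY_UNSPENT = "chronically_unspent"


def classify_budget(
    remaining_by_window: list[float],
    unspent_threshold: float = 0.9,  # hypothetical cutoff for "barely touched"
    min_windows: int = 3,            # hypothetical lookback for "chronically"
) -> BudgetState:
    """Classify the budget from the fraction remaining at the end of each
    recent SLO window (oldest first, most recent last)."""
    if not remaining_by_window:
        raise ValueError("need at least one window of data")
    # Exhausted: the most recent window has no budget left.
    if remaining_by_window[-1] <= 0.0:
        return BudgetState.EXHAUSTED
    # Chronically unspent: every one of the last few windows ended nearly full.
    recent = remaining_by_window[-min_windows:]
    if len(recent) >= min_windows and all(r >= unspent_threshold for r in recent):
        return BudgetState.CHRONICALLY_UNSPENT
    return BudgetState.HEALTHY
```

Looking at several windows rather than one keeps a single quiet month from being read as evidence that the SLO is too loose.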
Calculating an Error Budget
The formula is straightforward. Given an SLO target T over a window W:
- Budget (ratio) = 1 − T
- Budget (minutes, 30-day) = (1 − T) × 30 × 24 × 60
For common SLO targets over a 30-day window (43,200 minutes total):
- 99 % → 432 minutes (~7.2 hours) of allowed downtime
- 99.9 % → 43.2 minutes
- 99.95 % → 21.6 minutes
- 99.99 % → 4.32 minutes
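The figures above follow directly from the formula. A minimal sketch that reproduces them, assuming a hypothetical helper named `error_budget_minutes` and a window expressed in whole days:

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total outage permitted by an SLO target over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


if __name__ == "__main__":
    # Print the budget for the common targets listed above.
    for target in (0.99, 0.999, 0.9995, 0.9999):
        print(f"{target:.2%} -> {error_budget_minutes(target):.2f} minutes")
```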
In Prometheus, you track the remaining budget as a fraction of the total budget for the window.
A value of 1.0 means the full budget is intact. 0.0 means exhausted. A negative value means you have breached the SLO.
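The fraction itself is simple arithmetic and can be sketched outside any monitoring system. The function below assumes a request-based SLI with hypothetical `good_events` and `total_events` counts taken over the SLO window; it is an illustration, not a Prometheus query.

```python
def budget_remaining(good_events: int, total_events: int, slo_target: float) -> float:
    """Fraction of the error budget left.

    1.0 means intact, 0.0 means exhausted, negative means the SLO is breached.
    """
    if total_events == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    error_ratio = 1.0 - good_events / total_events  # observed failure fraction
    budget_ratio = 1.0 - slo_target                 # allowed failure fraction
    return 1.0 - error_ratio / budget_ratio
```

For example, with a 99.9 % target, 500 failures in 1,000,000 requests leaves about half the budget, while 2,000 failures leaves roughly -1.0, meaning the budget has been spent twice over.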
Burn Rates
Watching the raw budget remaining is too slow for incident response: you would only notice a problem after significant damage. Burn rate tells you how fast the budget is being consumed relative to the steady pace that would spend exactly the whole budget over the SLO window.
A burn rate of 1 means the service is consuming its error budget at exactly the rate that would exhaust it at the end of the window. A burn rate of 14.4 sustained for one hour consumes 2 % of a 30-day budget, and at that pace the entire month's budget is gone in 50 hours (about two days), a P0 incident by any measure.
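The arithmetic behind these numbers is short enough to check by hand. A minimal sketch, assuming a 30-day window and hypothetical helper names: burn rate is the observed error ratio divided by the allowed error ratio, and a burn rate of 14.4 spends 2 % of the budget per hour and would exhaust it in 50 hours.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate: float, elapsed_hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget spent after burning at `rate`."""
    return rate * elapsed_hours / (window_days * 24)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if `rate` is sustained."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_days * 24 / rate
```

With a 99.9 % SLO, an error ratio of 1.44 % corresponds to a burn rate of 14.4.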
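The Google SRE workbook recommends a multi-window, multi-burn-rate alert strategy to balance precision and recall. Each tier pairs a long window with a short one, and the alert fires only when the burn rate exceeds the threshold in both: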
- Fast burn (1h / 5m window): burn rate > 14.4 → page immediately (2 % budget in 1 hour)
- Moderate burn (6h / 30m window): burn rate > 6 → page (5 % budget in 6 hours)
- Slow burn (1d / 2h window): burn rate > 3 → ticket (10 % budget in 1 day)
- Trend alert (3d / 6h window): burn rate > 1 → warning (10 % budget in 3 days)
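The tiers above can be sketched as data plus one evaluation function. This is an illustration under stated assumptions: `BurnRateRule`, `RULES`, and `firing_rules` are hypothetical names, and the input is assumed to be a mapping from window label to an already-computed burn rate. The key point is that a rule fires only when both its long and short windows exceed the threshold, so an outage that has already recovered stops alerting quickly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    name: str
    long_window: str
    short_window: str
    threshold: float
    action: str


# Thresholds and window pairs taken from the list above.
RULES = (
    BurnRateRule("fast", "1h", "5m", 14.4, "page"),
    BurnRateRule("moderate", "6h", "30m", 6.0, "page"),
    BurnRateRule("slow", "1d", "2h", 3.0, "ticket"),
    BurnRateRule("trend", "3d", "6h", 1.0, "warning"),
)


def firing_rules(burn_rates: dict[str, float]) -> list[BurnRateRule]:
    """Return rules whose long AND short windows both exceed the threshold.

    Windows missing from `burn_rates` are treated as a burn rate of 0.
    """
    return [
        rule
        for rule in RULES
        if burn_rates.get(rule.long_window, 0.0) > rule.threshold
        and burn_rates.get(rule.short_window, 0.0) > rule.threshold
    ]
```

Requiring the short window as well is what gives the alert a fast reset: once the short-window burn rate drops, the rule stops firing even though the long window still carries the earlier errors.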
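In production, these tiers are typically written as Prometheus alerting rules in YAML, with Alertmanager routing the resulting pages, tickets, and warnings.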
[Figure: Burn Rate Diagram]
What Happens When the Budget Is Spent
A drained error budget triggers a defined escalation policy — not a vague "everyone is upset" reaction. A well-run SRE org enforces the following automatically:
- Deployment freeze — the CI/CD pipeline rejects new production rollouts (a policy gate checks the error budget API or Prometheus before allowing a deploy). Hotfixes and rollbacks are excepted.
- Reliability sprint — the next sprint backlog is cleared and filled exclusively with reliability items from the post-mortem action list, debt items, and hardening tasks.
- Escalation to leadership — the SLO miss is surfaced in the weekly executive report with root-cause and remediation timelines.
- Budget review — if the budget is repeatedly exhausted by the same failure mode, either the SLO target is unrealistic and should be renegotiated, or the service architecture needs a fundamental change.
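The deployment-freeze gate in the first item can be sketched as a single predicate that a pipeline step would call. This is a minimal illustration: `deploy_allowed`, the `change_type` labels, and the assumption that the caller has already fetched the remaining-budget fraction from its SLO tooling are all hypothetical.

```python
# Change types that bypass the freeze, per the policy above.
EXEMPT_CHANGE_TYPES = frozenset({"hotfix", "rollback"})


def deploy_allowed(budget_remaining: float, change_type: str) -> bool:
    """Decide whether a production rollout may proceed.

    `budget_remaining` is the fraction of error budget left (<= 0 means spent).
    `change_type` is a hypothetical label supplied by the pipeline.
    """
    if change_type in EXEMPT_CHANGE_TYPES:
        return True  # hotfixes and rollbacks are always permitted
    return budget_remaining > 0.0
```

A CI step could exit non-zero when this returns `False`, which puts the freeze in the pipeline itself rather than in a document someone has to remember to consult.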
Production Failure Modes
The most common mistakes teams make with error budgets in production:
- Measuring the wrong SLI — counting only 5xx errors misses latency violations, partial failures (third-party dependency times out), and client-side errors. Define SLIs at the user journey level.
- Planned maintenance not excluded — if you drain the budget on a maintenance window your users were notified about and agreed to, that spend is misleading. Mark maintenance windows in your SLO tooling and exclude them from budget calculations.
- Single-window alerting — alerting only on the 30-day budget remaining causes huge lag. During a catastrophic outage that could drain 100 % of the budget in one hour, such an alert will not fire until a significant fraction is already gone. Always alert on burn rate, pairing each long window with a short one.
- No policy teeth — error budgets are meaningless if an engineering manager can override the freeze "just this once." Encode the policy in pipeline gates, not in a Confluence page.
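As one illustration of the maintenance-window point above, the sketch below counts failing minutes while skipping any that fall inside an announced window. The representation is an assumption made for the example: bad minutes are integer minute offsets from the start of the SLO window, and maintenance windows are half-open `(start, end)` pairs in the same units.

```python
def bad_minutes_excluding_maintenance(
    bad_minutes: list[int],
    maintenance_windows: list[tuple[int, int]],
) -> int:
    """Count bad minutes that fall outside every [start, end) maintenance window."""

    def in_maintenance(minute: int) -> bool:
        return any(start <= minute < end for start, end in maintenance_windows)

    return sum(1 for minute in bad_minutes if not in_maintenance(minute))
```

Only the minutes returned here would be charged against the budget; the excluded ones are still worth recording separately so the cost of maintenance stays visible.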