Site Reliability Engineering (SRE)

Error Budgets

An error budget is the maximum amount of unreliability a service is permitted to accumulate before it breaches its SLO, and it is the single most important idea that separates SRE practice from traditional operations. If your SLO says "99.9 % of requests succeed over a 30-day window," then 0.1 % of requests may fail, which is equivalent to 43.2 minutes of total outage per 30 days. That 0.1 % (or 43.2 minutes) is your error budget. Once it is spent, any further failure means you have promised your users more than you delivered.

The Error Budget Contract

The error budget is not a metric owned by the SRE team alone. It is a bilateral contract between product and reliability engineering. Product wants to ship features fast; reliability wants stability. The budget reconciles these tensions with a clear, shared number:

If the budget is healthy, the team may take on more deployment risk, run experiments, and push releases aggressively.
If the budget is exhausted, the deployment pipeline freezes (or is heavily restricted), and the next sprint is devoted entirely to reliability work, post-mortems, and hardening.
If the budget is chronically unspent, either the SLO is looser than what the service actually delivers and should be tightened, or the team is over-engineering reliability at the cost of velocity and can afford to take more release risk.

The error budget is a spending account, not a penalty box. Spending it on a deliberate, well-tested release is fine. Watching it drain from an undetected memory leak is not.

Calculating an Error Budget

The formula is straightforward. Given an SLO target T over a window W:

Budget (ratio) = 1 − T
Budget (minutes) = (1 − T) × W, with W expressed in minutes
Budget (minutes, 30-day) = (1 − T) × 30 × 24 × 60

For common SLO targets over a 30-day window (43,200 minutes total):

99 % → 432 minutes (~7.2 hours) of allowed downtime
99.9 % → 43.2 minutes
99.95 % → 21.6 minutes
99.99 % → 4.32 minutes
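
The table above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `error_budget_minutes` and its parameters are illustrative and not part of any SLO tool.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime allowed by an availability target.

    slo_target is a fraction (0.999 for 99.9 %), not a percentage.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


if __name__ == "__main__":
    # Print the budget for the targets listed in the table.
    for target in (0.99, 0.999, 0.9995, 0.9999):
        print(f"{target:.2%} -> {error_budget_minutes(target):.2f} minutes")
```

Passing a different `window_days` (for example 7 or 90) gives the budget for other SLO windows.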

In Prometheus, you track budget remaining as a fraction:

# Remaining error budget fraction for a 99.9 % availability SLO
# error ratio = 5xx requests / all requests over the 30-day window
# remaining   = 1 - (error ratio / budget), where budget = 1 - 0.999 = 0.001
1 - (
  (
    sum(increase(http_requests_total{job="api",status=~"5.."}[30d]))
    /
    sum(increase(http_requests_total{job="api"}[30d]))
  ) / 0.001
)

A value of 1.0 means the full budget is intact. 0.0 means exhausted. A negative value means you have breached the SLO.
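
The same arithmetic, written out in Python, makes the three cases easy to check by hand. This is a sketch only: `budget_remaining` is a hypothetical helper, and the request counts would come from your metrics system.

```python
def budget_remaining(bad: float, total: float, slo_target: float = 0.999) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means the SLO is breached.
    """
    if total <= 0:
        # Assumption: with no traffic in the window, nothing has been spent.
        return 1.0
    error_ratio = bad / total
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001
    return 1.0 - error_ratio / budget


if __name__ == "__main__":
    # 500 failures out of 1,000,000 requests against a 99.9 % SLO
    print(budget_remaining(bad=500, total=1_000_000))
```

With 500 failures in a million requests the error ratio is 0.05 %, half of the 0.1 % allowance, so about half the budget remains.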

Burn Rates

Watching the raw budget remaining is too slow for incident response: you would only notice a problem after significant damage. Burn rate tells you how fast the budget is being consumed relative to the steady rate that would spend it exactly over the SLO window.

A burn rate of 1 means the service is consuming its error budget at exactly the rate that would exhaust it precisely at the end of the window. A burn rate of 14.4 sustained for one hour consumes 2 % of the 30-day budget in that hour; if it continued, the whole month's budget would be gone in about 50 hours (roughly two days). That warrants an immediate page.
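
The relationship between error ratio, burn rate, and time to exhaustion is simple arithmetic. The sketch below uses illustrative function names and assumes a 30-day SLO window.

```python
import math


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio (1 - target)."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return math.inf
    return window_days * 24 / rate


def budget_consumed(rate: float, elapsed_hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget spent after elapsed_hours at this rate."""
    return rate * elapsed_hours / (window_days * 24)


if __name__ == "__main__":
    rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
    print(f"burn rate: {rate:.1f}")
    print(f"hours to exhaustion: {hours_to_exhaustion(rate):.1f}")
    print(f"budget spent in 1 h: {budget_consumed(rate, 1):.1%}")
```

A 1.44 % error ratio against a 99.9 % target is a burn rate of 14.4, which spends 2 % of the budget per hour.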

The Google SRE Workbook recommends a multi-window, multi-burn-rate alert strategy to balance precision and recall. The tiers below follow that pattern:

Fast burn (1h / 5m window): burn rate > 14.4 → page immediately (2 % budget in 1 hour)
Moderate burn (6h / 30m window): burn rate > 6 → page (5 % budget in 6 hours)
Slow burn (1d / 2h window): burn rate > 3 → ticket (10 % budget in 1 day)
Trend alert (3d / 6h window): burn rate > 1 → warning (10 % budget in 3 days; at this pace the budget runs out before the window ends)

Always use two windows for each burn-rate threshold (a long window to detect the sustained trend and a short window to confirm the issue is still active). This prevents alert storms from short transient spikes that have already resolved.
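
To see where the thresholds come from and how the two-window condition behaves, here is a small Python sketch. `BurnTier` and `should_fire` are hypothetical names, and the error ratios would normally come from queries over the long and short windows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnTier:
    name: str
    long_window_hours: float
    budget_fraction: float  # share of the budget spent over the long window

    def threshold(self, slo_window_hours: float = 30 * 24) -> float:
        """Burn rate that spends budget_fraction within long_window_hours."""
        return self.budget_fraction * slo_window_hours / self.long_window_hours


def should_fire(long_error_ratio: float, short_error_ratio: float,
                tier: BurnTier, slo_target: float = 0.999) -> bool:
    """Fire only if BOTH windows exceed the tier's error-ratio limit."""
    limit = tier.threshold() * (1.0 - slo_target)
    return long_error_ratio > limit and short_error_ratio > limit


FAST = BurnTier("fast", long_window_hours=1, budget_fraction=0.02)
MODERATE = BurnTier("moderate", long_window_hours=6, budget_fraction=0.05)
SLOW = BurnTier("slow", long_window_hours=24, budget_fraction=0.10)

if __name__ == "__main__":
    for tier in (FAST, MODERATE, SLOW):
        print(f"{tier.name}: burn rate > {tier.threshold():.1f}")
    # The long window is still elevated, but the short window has recovered.
    print(should_fire(0.02, 0.0, FAST))
```

When a spike has already ended, the short-window ratio drops below the limit first, so the alert stops firing even though the long window still shows the damage.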

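As Prometheus alerting rules (a YAML rule file loaded by Prometheus; Alertmanager then routes notifications based on the severity label):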

groups:
  - name: slo.error_budget
    rules:
      # Fast burn — page
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{job="api"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="api"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
          team: sre
        annotations:
          summary: "Fast error budget burn on api (burn rate > 14.4x)"
          runbook: "https://wiki.internal/runbooks/api-error-budget"

      # Moderate burn — page
      - alert: ErrorBudgetModerateBurn
        expr: |
          (
            sum(rate(http_requests_total{job="api",status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{job="api"}[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api",status=~"5.."}[30m]))
            /
            sum(rate(http_requests_total{job="api"}[30m]))
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: page
          team: sre
        annotations:
          summary: "Moderate error budget burn on api (burn rate > 6x)"

      # Slow burn — ticket
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{job="api",status=~"5.."}[1d]))
            /
            sum(rate(http_requests_total{job="api"}[1d]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api",status=~"5.."}[2h]))
            /
            sum(rate(http_requests_total{job="api"}[2h]))
          ) > (3 * 0.001)
        for: 1h
        labels:
          severity: ticket
          team: sre
        annotations:
          summary: "Slow error budget burn on api — reliability work needed"

Figure (Burn Rate Diagram): error budget burn rates, showing how fast each scenario exhausts the 30-day budget.

What Happens When the Budget Is Spent

A drained error budget triggers a defined escalation policy — not a vague "everyone is upset" reaction. A well-run SRE org enforces the following automatically:

Deployment freeze — the CI/CD pipeline rejects new production rollouts (a policy gate checks the error budget API or Prometheus before allowing a deploy). Hotfixes and rollbacks are excepted.
Reliability sprint — the next sprint backlog is cleared and filled exclusively with reliability items from the post-mortem action list, debt items, and hardening tasks.
Escalation to leadership — the SLO miss is surfaced in the weekly executive report with root-cause and remediation timelines.
Budget review — if budget is repeatedly exhausted by the same failure mode, the SLO itself may need revision (tighter or looser), or the service architecture needs a fundamental change.

Never let teams "borrow" from next month's error budget. Each calendar window starts again at 100 % (with a rolling window, old errors simply age out as time passes). Borrowing forward destroys the accountability mechanism: users experience the cumulative unreliability even if the metric resets.
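
The deployment-freeze rule can be sketched as a small policy function that a pipeline gate might call. Everything here is hypothetical: `ChangeKind`, `deploy_allowed`, and the idea that the gate receives the remaining budget as a fraction are assumptions for illustration, not the API of any CI/CD product.

```python
from enum import Enum


class ChangeKind(Enum):
    FEATURE = "feature"
    HOTFIX = "hotfix"
    ROLLBACK = "rollback"


def deploy_allowed(budget_remaining: float, kind: ChangeKind) -> bool:
    """Decide whether a production rollout may proceed.

    budget_remaining is the unspent fraction of the error budget
    (1.0 = untouched, 0.0 or below = exhausted).
    """
    # Hotfixes and rollbacks are exempt from the freeze.
    if kind in (ChangeKind.HOTFIX, ChangeKind.ROLLBACK):
        return True
    # Feature rollouts need budget left to spend.
    return budget_remaining > 0.0


if __name__ == "__main__":
    print(deploy_allowed(0.35, ChangeKind.FEATURE))
    print(deploy_allowed(-0.10, ChangeKind.FEATURE))
    print(deploy_allowed(-0.10, ChangeKind.ROLLBACK))
```

Keeping the decision in one pure function makes the policy easy to review and test, and leaves no room for a manual override in the pipeline itself.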

Production Failure Modes

The most common mistakes teams make with error budgets in production:

Measuring the wrong SLI: counting only 5xx errors misses latency violations, partial failures (for example, a third-party dependency timing out), and client-side errors. Define SLIs at the user journey level.
Planned maintenance not excluded: if you drain the budget on a maintenance window that your users were notified about and agreed to, that spend is misleading. Mark maintenance windows in your SLO tooling and exclude them from budget calculations.
Single-window alerting: alerting only on the 30-day budget remaining introduces a large lag. Even during a catastrophic outage that would drain 100 % of the budget in one hour, a threshold on budget remaining does not fire until a significant fraction is already gone. Pair it with burn-rate alerts over short windows.
No policy teeth: error budgets are meaningless if an engineering manager can override the freeze "just this once." Encode the policy in pipeline gates, not in a Confluence page.
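
As an illustration of the first failure mode, the sketch below counts a request as "good" only if it both succeeds and is fast enough. The `Request` type, the 300 ms latency threshold, and the choice to treat 4xx responses as good are all assumptions made for the example; the right definitions depend on your user journeys.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    status: int
    latency_ms: float


def is_good(req: Request, latency_slo_ms: float = 300.0) -> bool:
    """Good = no server error AND served within the latency threshold."""
    return req.status < 500 and req.latency_ms <= latency_slo_ms


def sli(requests: Iterable[Request], latency_slo_ms: float = 300.0) -> float:
    """Ratio of good requests to all requests."""
    reqs = list(requests)
    if not reqs:
        # Assumption: no traffic means nothing was violated.
        return 1.0
    good = sum(1 for r in reqs if is_good(r, latency_slo_ms))
    return good / len(reqs)


if __name__ == "__main__":
    sample = [
        Request(200, 120),  # good
        Request(200, 900),  # too slow: bad, although a 5xx-only SLI would miss it
        Request(503, 50),   # server error: bad
        Request(404, 80),   # client error: counted as good in this sketch
    ]
    print(sli(sample))
```

A 5xx-only SLI would score this sample at 75 %, while the combined definition scores it at 50 %, which is closer to what users actually experienced.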

At Google scale, error budget status is surfaced on a real-time dashboard visible to every engineer and manager in the org. Transparency is what gives the budget its social force. Consider publishing yours on an internal status page alongside the SLO dashboard.