SLIs, SLOs & SLAs
When a production system goes down at 3 AM, the first question every engineer asks is: "Are we violating our SLA?" The second question — which separates senior engineers from the rest — is: "Was our error budget already spent before this incident?" Understanding the full vocabulary of reliability, from the raw measurement all the way to the legal contract, is foundational to operating services at Google, Amazon, and Stripe scale. These are not abstract concepts; they are the operational contracts that determine engineering priorities, on-call escalation, and whether a company issues a refund.
Service Level Indicators (SLIs)
A Service Level Indicator is a quantitative measure of some aspect of the level of service being provided. An SLI answers: "How is the service performing right now, expressed as a ratio?" The canonical form of an SLI is the ratio of good events to valid events over a rolling window, where valid events are the ones that should count toward the measurement:
SLI = (good events) / (valid events)
Good SLIs are carefully chosen — not every metric is an SLI. The four categories that matter at production scale are:
- Availability: The fraction of requests served successfully. For HTTP APIs: (requests with status < 500) / (total requests). At Google, a request that returns HTTP 500 or times out counts as bad.
- Latency: The fraction of requests served faster than a threshold. Example: (requests completed in < 200ms) / (total requests). Note that latency SLIs measure a ratio, not a raw p99 value; this makes them composable with the SLO framework.
- Throughput: The fraction of time the service is handling sufficient request volume. Used for batch pipelines: (minutes processing >= target_rate) / (total minutes).
- Quality / Correctness: The fraction of responses that are correct. Used for search ranking, recommendation engines, data pipelines. Harder to measure automatically; it often requires golden datasets or sampling with human validation.
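To make the ratio form concrete, here is a minimal Python sketch that computes an availability SLI and a latency SLI from a handful of request records. The `Request` type, the use of status 0 to represent a timeout, and the sample data are all assumptions made for illustration; a real service would derive these counts from its metrics pipeline rather than from an in-memory list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code; 0 is used here to mean "timed out"
    latency_ms: float  # time taken to serve the request


def sli(good: int, valid: int) -> float:
    """Ratio of good events to valid events; 1.0 when there was no traffic."""
    return good / valid if valid else 1.0


def availability_sli(requests: list[Request]) -> float:
    # Good = answered with a non-5xx status; timeouts (status 0) count as bad.
    good = sum(1 for r in requests if 0 < r.status < 500)
    return sli(good, len(requests))


def latency_sli(requests: list[Request], threshold_ms: float = 200.0) -> float:
    # Good = completed faster than the threshold.
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return sli(good, len(requests))


sample = [
    Request(200, 35.0),
    Request(200, 180.0),
    Request(404, 12.0),     # client error: still "available"
    Request(500, 90.0),     # server error: bad for availability
    Request(0, 30000.0),    # timeout: bad for both SLIs
    Request(200, 250.0),    # succeeded, but too slow for the latency SLI
    Request(200, 60.0),
    Request(201, 150.0),
]

print(f"availability SLI: {availability_sli(sample):.3f}")
print(f"latency SLI:      {latency_sli(sample):.3f}")
```

Both SLIs share the same `sli` helper, which is the point of expressing latency as a ratio: every indicator ends up as good events over valid events and can be compared against a percentage target.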
Service Level Objectives (SLOs)
A Service Level Objective is the target value for an SLI, over a measurement window. An SLO is the internal agreement about how reliable a service must be. The canonical form is:
SLO = SLI target × window — for example: "99.9% of requests will return in < 200ms, measured over a rolling 28-day window."
The measurement window matters enormously. A 28-day rolling window is preferred over calendar month at companies like Google because it has consistent length and rolls forward continuously (no "reset at midnight on the 1st" behaviour that causes perverse incentives to burn budget early in a month). The standard windows are: 28-day rolling, 7-day rolling, or trailing 90 days for quarterly reviews.
An SLO of 99.9% over 28 days means the service is allowed a total of 28 × 24 × 60 × (1 - 0.999) = 40.32 minutes of bad events (the error budget) before it is considered out of compliance. This number drives engineering decisions:
- If error budget is healthy (>50% remaining): ship features aggressively, run chaos experiments, do risky schema migrations.
- If error budget is at 25%: slow down risky deployments, freeze non-critical changes, prioritize reliability work.
- If error budget is exhausted: freeze releases, halt experiments, all hands on reliability until the window rolls forward.
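The arithmetic and the policy above can be sketched in a few lines of Python. The function names are hypothetical, and the policy cut-offs are an assumption: the text names 50% and 25% as reference points, and this sketch treats everything between exhausted and 50% remaining as the "slow down" zone.

```python
def error_budget_minutes(slo: float, window_days: int = 28) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    return window_days * 24 * 60 * (1.0 - slo)


def budget_remaining(slo: float, measured_sli: float) -> float:
    """Fraction of the error budget left: 1.0 untouched, 0.0 exhausted,
    negative when the SLO has already been breached."""
    allowed = 1.0 - slo          # unreliability the SLO permits
    spent = 1.0 - measured_sli   # unreliability actually observed
    return 1.0 - spent / allowed


def release_policy(remaining: float) -> str:
    # Assumed cut-offs for this sketch; tune them to your own policy.
    if remaining <= 0.0:
        return "freeze releases"
    if remaining <= 0.5:
        return "slow down risky changes"
    return "ship features"


slo = 0.999
print(f"budget: {error_budget_minutes(slo):.2f} minutes per 28 days")

for measured in (0.9998, 0.9993, 0.9985):
    left = budget_remaining(slo, measured)
    print(f"SLI {measured:.4f}: {left:6.1%} left, {release_policy(left)}")
```

Keeping the policy in a single function makes the thresholds explicit and reviewable, which is what lets the budget settle arguments instead of starting them.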
The Error Budget
The error budget is the quantity of unreliability you are allowed to spend before breaching the SLO. It is the most powerful tool the SRE discipline introduced to software engineering. Error budget converts the abstract concept of "reliability" into a finite, spendable resource. Teams that understand this stop having the endless "reliability vs. features" argument, because the budget answers it objectively.
Burn rate alerting is the production-grade approach to SLO-based alerting. Instead of alerting on individual metric thresholds, you alert when you are consuming error budget faster than the SLO can sustain: a burn rate of 1 spends exactly the whole budget over the SLO window, and anything higher exhausts it early. Google's SRE Workbook recommends multi-window, multi-burn-rate alerts, including a fast alert (1-hour window, burn rate above 14.4, which catches rapid outages) and a slow alert (6-hour window, burn rate above 6, which catches slow burns you would otherwise miss).
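As a rough illustration, burn rate is the observed error rate divided by the error rate the SLO allows. The sketch below evaluates a fast and a slow alert from per-window counts. The function names, the shape of the `counts` dictionary, and the sample numbers are assumptions; the 14.4 and 6 thresholds follow the SRE Workbook's recommended values, and the Workbook's additional short confirmation window for each alert is left out to keep the example small.

```python
import math


def burn_rate(bad: int, valid: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if valid == 0:
        return 0.0
    return (bad / valid) / (1.0 - slo)


def days_until_exhausted(rate: float, window_days: float = 28.0) -> float:
    """How long a full budget lasts if this burn rate is sustained."""
    return math.inf if rate <= 0 else window_days / rate


# (alert name, lookback window in hours, burn-rate threshold)
ALERTS = [("fast", 1, 14.4), ("slow", 6, 6.0)]


def firing_alerts(counts: dict[int, tuple[int, int]], slo: float) -> list[str]:
    """counts maps a lookback window in hours to (bad events, valid events)."""
    firing = []
    for name, hours, threshold in ALERTS:
        bad, valid = counts[hours]
        if burn_rate(bad, valid, slo) > threshold:
            firing.append(name)
    return firing


# Hypothetical traffic: a sharp spike in the last hour, quieter before that.
counts = {1: (200, 10_000), 6: (300, 60_000)}
slo = 0.999

for hours, (bad, valid) in counts.items():
    rate = burn_rate(bad, valid, slo)
    print(f"{hours}h window: burn rate {rate:.1f}, "
          f"budget gone in {days_until_exhausted(rate):.1f} days")
print("firing:", firing_alerts(counts, slo))
```

In this made-up scenario only the fast alert fires: the last hour burns budget at roughly 20 times the sustainable rate, while the 6-hour average stays under its threshold.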
Service Level Agreements (SLAs)
An SLA is the external, legally binding contract between a service provider and its customers, specifying the consequences of failing to meet a target. SLAs typically include: the metric being promised (availability, latency), the measurement period, the measurement methodology, and the remedy (service credits, refunds, contract termination). AWS's published SLAs, for example, commit to 99.99% monthly uptime for EC2 at the region level and 99.9% for S3 Standard, with tiered service credits that start at 10% of the monthly bill for breaches.
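The remedy clause of an SLA usually reduces to a lookup from measured uptime to a credit percentage. The sketch below shows that shape only: the tier boundaries and credit percentages are hypothetical values chosen for illustration and are not taken from any provider's actual SLA.

```python
def monthly_uptime_pct(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Uptime over the billing month, as a percentage."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


# Hypothetical tiers: (minimum uptime %, credit as % of the monthly bill),
# ordered from best uptime to worst.
CREDIT_TIERS = [(99.9, 0), (99.0, 10), (0.0, 30)]


def service_credit_pct(uptime_pct: float) -> int:
    """Credit owed for the first tier whose uptime floor is met."""
    for floor, credit in CREDIT_TIERS:
        if uptime_pct >= floor:
            return credit
    return CREDIT_TIERS[-1][1]


for downtime in (20, 60, 500):
    uptime = monthly_uptime_pct(downtime)
    print(f"{downtime:>3} min down: {uptime:.3f}% uptime, "
          f"{service_credit_pct(uptime)}% credit")
```

The measurement methodology matters as much as the tiers: what counts as downtime, and who measures it, decides which row of the table applies.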
Defining SLIs in Practice: Prometheus and OpenTelemetry
At big-tech scale, SLIs are computed from metrics instrumentation in the service itself. The industry standard is to instrument request counters and histograms, then compute the SLI ratio in your monitoring system. Here is how this looks end-to-end with Prometheus:
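One way to sketch the Prometheus side is to build the PromQL ratio expressions from a request counter and a latency histogram. The metric names `http_requests_total` and `http_request_duration_seconds`, the `status` label, and the helper functions themselves are assumptions made for this example; substitute whatever your instrumentation actually exports. The latency query also assumes the histogram has a bucket boundary at exactly 0.2 seconds.

```python
def availability_sli_query(metric: str = "http_requests_total",
                           window: str = "5m") -> str:
    """PromQL for (requests with status < 500) / (total requests)."""
    good = f'sum(rate({metric}{{status!~"5.."}}[{window}]))'
    total = f"sum(rate({metric}[{window}]))"
    return f"{good} / {total}"


def latency_sli_query(metric: str = "http_request_duration_seconds",
                      le: str = "0.2",
                      window: str = "5m") -> str:
    """PromQL for (requests completed within `le` seconds) / (total requests).

    Histogram buckets are cumulative, so the `le` bucket already counts
    every request at or below that boundary.
    """
    good = f'sum(rate({metric}_bucket{{le="{le}"}}[{window}]))'
    total = f"sum(rate({metric}_count[{window}]))"
    return f"{good} / {total}"


print(availability_sli_query())
print(latency_sli_query())
print(availability_sli_query(window="28d"))  # SLO-window compliance query
```

Wrapping both sides in `sum(...)` aggregates across every instance before dividing, so the result is a service-wide SLI rather than a per-pod ratio. Long ranges such as `28d` are expensive to evaluate on the fly, so in practice they are usually precomputed with recording rules.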
Common Pitfalls at Production Scale
- Health-check traffic in the SLI denominator: Synthetic health checks from your load balancer or uptime monitor make your availability SLI look better than it is for real users. Exclude them by filtering on a source=synthetic label or using separate counter metrics.
- Using availability as your only SLI: A service that responds with HTTP 200 containing an error JSON body is "available" but broken. Include a correctness SLI for critical paths. Stripe monitors whether payments actually clear, not just whether the endpoint responds.
- Setting SLOs on individual instances: An SLI that measures one pod will be noisy. SLIs should be aggregated across the entire service fleet — if one pod is bad but 99 are healthy, your users are fine and the SLI should reflect that.
- Not distinguishing user-impacting errors from infrastructure errors: A timeout caused by a misconfigured health check probe should not count against your user-facing availability SLI. Classify events by user impact, not by what the load balancer logged.
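Two of these pitfalls, synthetic traffic and per-instance measurement, are easy to see with a small Python sketch. The `Event` type, the pod names, and the traffic mix are invented for illustration; the `source` field mirrors the source=synthetic label mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    pod: str
    source: str  # "user" or "synthetic"
    ok: bool


def ratio(events: list[Event]) -> float:
    """Good events over all events given; 1.0 when the list is empty."""
    return sum(e.ok for e in events) / len(events) if events else 1.0


def fleet_sli(events: list[Event]) -> float:
    """Fleet-wide SLI over real user traffic only."""
    return ratio([e for e in events if e.source != "synthetic"])


events = (
    [Event("pod-a", "user", True)] * 40
    + [Event("pod-b", "user", True)] * 40
    + [Event("pod-c", "user", True)] * 10
    + [Event("pod-c", "user", False)] * 10       # one unhealthy pod
    + [Event("pod-a", "synthetic", True)] * 100  # health checks always pass
)

print(f"naive SLI (synthetic included): {ratio(events):.2f}")
print(f"fleet SLI (user traffic only):  {fleet_sli(events):.2f}")
print(f"pod-c alone:                    "
      f"{fleet_sli([e for e in events if e.pod == 'pod-c']):.2f}")
```

The always-passing health checks lift the naive figure above what users experienced, while the single-pod figure overstates the damage. The fleet-wide, user-only ratio sits between the two and is the number an SLO should be set against.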