SLIs & SLOs in Practice
Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are the quantitative backbone of SRE. Every reliability conversation — budget allocation, incident escalation, feature gating — ultimately traces back to a number: what fraction of interactions are "good" right now? Getting that number wrong means your team either over-invests in reliability (slowing product velocity) or under-invests (burning users). This lesson teaches you to choose, instrument, and govern SLIs and SLOs the way Google, Netflix, and Stripe do it in production.
What Makes a Good SLI?
An SLI is a ratio: good events / valid events. The "good" and "valid" definitions are where most teams go wrong. A well-formed SLI has four properties:
- User-proximate — measures what the user actually experiences, not internal queue depth or CPU usage.
- Actionable — when the SLI degrades, you can trace it to a specific failure domain.
- Unambiguous — a new engineer reading the definition reaches the same number as a senior engineer.
- Independent of load — a 50 % traffic spike must not mechanically move the SLI without a real degradation.
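The ratio definition and the load-independence property can be sketched in a few lines of Python. The function name `sli` and the request counts are made up for illustration, and treating an empty window as fully compliant is an assumption rather than a rule from this lesson.

```python
def sli(good_events: int, valid_events: int) -> float:
    """Return the SLI as good / valid events."""
    if valid_events < 0 or not 0 <= good_events <= valid_events:
        raise ValueError("need 0 <= good_events <= valid_events")
    if valid_events == 0:
        return 1.0  # assumption: no valid traffic means nothing was violated
    return good_events / valid_events


# Normal traffic: 10,000 valid requests, 9,990 of them good.
baseline = sli(9_990, 10_000)

# A 50 % traffic spike with the same failure proportion.
spike = sli(14_985, 15_000)

print(baseline, spike)
```

Because both counts scale together, the two ratios come out identical. An SLI built on an absolute count (for example, errors per minute) would have moved with the spike even though nothing degraded.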
The Four Golden SLI Families
Most services fall into one of four categories. Choose SLIs from the family that matches your service type:
- Request/response services (REST APIs, gRPC) — availability (non-5xx / total), latency (fraction served under threshold), correctness (valid response body).
- Data processing pipelines (Kafka consumers, ETL) — freshness (lag under threshold), throughput, completeness (no lost events).
- Storage systems (databases, object stores) — durability (reads return data successfully written), availability of read/write operations.
- User-journey flows (checkout, login) — success rate of the end-to-end flow, measured at the browser or mobile client.
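As one instance of the request/response family, a latency SLI ("fraction served under threshold") can be computed from cumulative histogram buckets. This is a minimal sketch: the bucket boundaries and counts are hypothetical, and the dictionary layout only mimics the shape of Prometheus `le` buckets.

```python
import math


def latency_sli(cumulative_buckets: dict[float, int], threshold_s: float) -> float:
    """Fraction of requests served at or under threshold_s.

    cumulative_buckets maps an upper bound in seconds to the number of
    requests at or below that bound; math.inf holds the total count.
    """
    if threshold_s not in cumulative_buckets:
        raise KeyError("threshold must coincide with a bucket boundary")
    total = cumulative_buckets[math.inf]
    if total == 0:
        return 1.0  # assumption: an empty window counts as compliant
    return cumulative_buckets[threshold_s] / total


# Hypothetical counts: 9,650 of 10,000 requests finished within 300 ms.
buckets = {0.1: 7_200, 0.3: 9_650, 1.0: 9_940, math.inf: 10_000}
fast_enough = latency_sli(buckets, 0.3)
print(f"{fast_enough:.2%} of requests served within 300 ms")
```

The threshold has to line up with a bucket boundary, which is a good reason to choose the SLI threshold before choosing the histogram buckets.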
Picking the Right Threshold
For a latency SLI on a synchronous API, a common approach is to measure current p95/p99 over a 30-day baseline, find the knee of the distribution curve, then set the threshold 20–30 % above the median acceptable latency. For availability, 99.9 % (three nines) is a reasonable starting SLO for most internal services, 99.95 % for customer-facing APIs, and 99.99 % only when genuinely required (payment auth, emergency services). As a rule of thumb, each additional nine costs roughly 10× more to operate.
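To make the availability targets concrete, the sketch below converts each one into the minutes of total unavailability it tolerates. The helper name `allowed_bad_minutes` is hypothetical, and the 30-day window is chosen only to match the baseline period mentioned above.

```python
def allowed_bad_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage a given SLO tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


for target in (0.999, 0.9995, 0.9999):
    minutes = allowed_bad_minutes(target)
    print(f"{target:.2%} -> {minutes:.1f} min per 30 days")
```

The three targets allow roughly 43, 22, and 4 minutes respectively. Each added nine shrinks the tolerated downtime tenfold, which gives a sense of why the operating cost climbs so steeply.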
Measurement Windows
The window over which you evaluate compliance determines how quickly you detect problems and how stable the signal is. Two patterns dominate production:
- Rolling window — a trailing 28- or 30-day window, recalculated every minute. Gives a continuously accurate picture of recent reliability. This is the Google SRE recommendation and what most Prometheus-based stacks implement.
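- Calendar window — a fixed month (Jan 1–Jan 31). Aligns with billing cycles and SLAs but creates a "cliff" at month boundaries: the budget resets in full when the new month opens, so an outage on the 31st stops counting a day later, and a budget exhausted early in the month carries no further consequence until the reset.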
For most teams, use a 28-day rolling window for operational decisions (error budget) and a calendar window only for contractual SLA reporting.
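The difference between the two windows shows up clearly around a month boundary. The sketch below uses made-up daily counts (1,000 requests per day, with 100 failures on Jan 31) and hypothetical helper names. The calendar figure is computed month-to-date.

```python
from datetime import date, timedelta

# (good, valid) request counts per day; hypothetical data for illustration.
DailyCounts = dict[date, tuple[int, int]]


def window_sli(counts: DailyCounts, start: date, end: date) -> float:
    """SLI over the inclusive date range [start, end]."""
    good = valid = 0
    day = start
    while day <= end:
        g, v = counts.get(day, (0, 0))
        good += g
        valid += v
        day += timedelta(days=1)
    return good / valid if valid else 1.0


def rolling_sli(counts: DailyCounts, today: date, days: int = 28) -> float:
    """Trailing window that ends today and spans `days` days."""
    return window_sli(counts, today - timedelta(days=days - 1), today)


def calendar_sli(counts: DailyCounts, today: date) -> float:
    """Month-to-date window that resets on the 1st."""
    return window_sli(counts, today.replace(day=1), today)


counts = {date(2024, 1, 1) + timedelta(days=i): (1_000, 1_000) for i in range(32)}
counts[date(2024, 1, 31)] = (900, 1_000)  # incident on the last day of January

today = date(2024, 2, 1)
print(f"rolling:  {rolling_sli(counts, today):.4%}")
print(f"calendar: {calendar_sli(counts, today):.4%}")
```

On Feb 1 the calendar window already reports a perfect score, while the rolling window still carries the Jan 31 incident and will keep doing so for four weeks.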
Instrumenting SLIs in Prometheus
The canonical Prometheus pattern for a request-based SLI uses two metrics: a _total counter for all valid events and a _total counter (or histogram) for good events. Below is a PromQL availability SLI over a 28-day rolling window:
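A minimal sketch of such a query, held in a Python template so the job and window can be varied. The metric name `http_requests_total`, the `job` and `code` labels, and the `checkout-api` job are assumptions for illustration; substitute whatever your instrumentation actually exposes.

```python
# PromQL template. Metric and label names are assumptions, not taken from
# any particular service. Doubled braces are literal braces for str.format.
AVAILABILITY_SLI = (
    'sum(rate(http_requests_total{{job="{job}", code!~"5.."}}[{window}]))\n'
    "/\n"
    'sum(rate(http_requests_total{{job="{job}"}}[{window}]))'
)


def availability_query(job: str, window: str = "28d") -> str:
    """Render the good / valid ratio for one job over a trailing window."""
    return AVAILABILITY_SLI.format(job=job, window=window)


query = availability_query("checkout-api")
print(query)
```

The numerator counts every response whose status code is not 5xx, the denominator counts every response, and both use the same range selector so the ratio covers one consistent window.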
Wrap these in a recording rule so dashboards and alerts read a precomputed series instead of re-evaluating an expensive range query on every dashboard refresh or alert evaluation:
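One way such a rule file might look, rendered from Python so the expression is indented consistently. The group name, the record name `job:sli_availability:ratio_rate28d`, the one-minute interval, and the metric name are all assumptions for illustration, not values prescribed by this lesson.

```python
import textwrap

# Skeleton of a Prometheus rule file containing a single recording rule.
RULE_TEMPLATE = """\
groups:
  - name: {group}
    interval: 1m
    rules:
      - record: {record}
        expr: |
{expr}
"""


def recording_rule(group: str, record: str, expr: str) -> str:
    """Render a rule file that stores `expr` under the series name `record`."""
    indented = textwrap.indent(expr, " " * 10)  # body of the YAML block scalar
    return RULE_TEMPLATE.format(group=group, record=record, expr=indented)


# Hypothetical availability expression; metric and label names are assumed.
expr = (
    'sum(rate(http_requests_total{code!~"5.."}[28d]))\n'
    "/\n"
    "sum(rate(http_requests_total[28d]))"
)
rule_yaml = recording_rule("sli_rules", "job:sli_availability:ratio_rate28d", expr)
print(rule_yaml)
```

Dashboards and alert rules can then query the recorded series by name, which keeps every consumer on the same definition of "good".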
Common Failure Modes When Setting SLOs
- Measuring at the load balancer, not the client. A request that the LB retries three times before succeeding appears as "good" server-side but the user waited 3×. Instrument client-perceived latency where possible, or account for retries explicitly.
- Excluding "expected" errors. Teams often exclude HTTP 429 (rate-limit) or 503 (maintenance) from the SLI denominator. Do not — from the user's perspective these are failures. Put the budget pressure on your system, not your denominator math.
- SLO target set to match current performance. If your service runs at 99.92 % today and you set SLO = 99.9 % simply because that is what it already achieves, the target describes the status quo rather than what users need. Set it at or slightly below the reliability level users actually need.
- Too many SLIs. Google's internal guidance: two to four SLIs per service tier. More creates alert fatigue and diffuses engineering focus. Pick the few that most faithfully represent user happiness.
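The effect of excluding "expected" errors from the denominator can be shown with a small calculation. The status-code counts below are invented, and the sketch assumes a response is good only if it is neither a 429 nor a 5xx, in line with the user's-perspective argument above.

```python
def availability(status_counts: dict[int, int],
                 excluded: frozenset[int] = frozenset()) -> float:
    """good / valid, optionally dropping some status codes from valid."""
    good = valid = 0
    for status, count in status_counts.items():
        if status in excluded:
            continue  # removed from the denominator entirely
        valid += count
        if status < 500 and status != 429:  # assumption: 429 and 5xx are bad
            good += count
    return good / valid if valid else 1.0


# Hypothetical traffic: 10,000 requests in the window.
counts = {200: 9_900, 429: 60, 503: 30, 500: 10}

honest = availability(counts)
flattering = availability(counts, excluded=frozenset({429, 503}))
print(f"all requests counted: {honest:.3%}")
print(f"429/503 excluded:     {flattering:.3%}")
```

With these numbers, dropping 429 and 503 from the denominator lifts the reported figure from 99.0 % to about 99.9 %, even though the same 100 users saw a failure.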
Multi-Window, Multi-Burn-Rate Alerting Preview
Once recording rules emit your SLI ratio, the next step (covered in Lesson 8) is alerting. The key insight is that a single 28-day window moves too slowly to page on. You need a short window (1 h, 6 h) to detect fast burns and a long window (1 d, 3 d) to detect slow burns. Burn rate is the observed error ratio divided by the error budget (1 − SLO): at a burn rate of 1 the budget lasts exactly the full window, and at a burn rate of N it is gone in 1/N of the window, which is what determines urgency. All of this math is driven by the same SLI recording rules you set up here, so getting the SLI right is the prerequisite for reliable alerting.
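The burn-rate arithmetic can be sketched as follows. The thresholds in `POLICY` (14 for the 1 h window, 1 for the 3 d window) and the resulting actions are illustrative assumptions, not values from this lesson; Lesson 8 is where the actual policy is developed.

```python
import math


def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio as a multiple of the error budget (1 - SLO)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Hours until a full budget is spent if this burn rate persists."""
    if rate <= 0:
        return math.inf
    return window_days * 24 / rate


# (window label, burn-rate threshold, action); illustrative values only.
POLICY = [("1h", 14.0, "page"), ("3d", 1.0, "ticket")]


def evaluate(error_ratios: dict[str, float], slo: float) -> list[str]:
    """Return the actions whose window burns at or above its threshold."""
    actions = []
    for window, threshold, action in POLICY:
        if burn_rate(error_ratios[window], slo) >= threshold:
            actions.append(f"{action} ({window})")
    return actions


# 2 % errors over the last hour and 0.2 % over three days, SLO 99.9 %.
print(evaluate({"1h": 0.02, "3d": 0.002}, slo=0.999))
print(f"{hours_to_exhaustion(14.0):.0f} h to exhaustion at burn rate 14")
```

At a burn rate of 14 a 28-day budget lasts 48 hours, which is why a fast burn seen in a short window justifies a page while a slow burn seen in a long window can wait for a ticket.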
Summary
Choose SLIs that are user-proximate ratios of good/valid events. Use the four golden families as your starting palette. Set SLO targets based on user need, not current performance. Use 28-day rolling windows for operational decisions. Express everything as Prometheus recording rules so your dashboards, alerts, and error-budget calculations all derive from a single authoritative number.