Site Reliability Engineering (SRE)

SLIs & SLOs in Practice

Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are the quantitative backbone of SRE. Every reliability conversation (budget allocation, incident escalation, feature gating) ultimately traces back to a number: what fraction of interactions are "good" right now? Getting that number wrong means your team either over-invests in reliability (slowing product velocity) or under-invests (burning users). This lesson teaches you to choose, instrument, and govern SLIs and SLOs the way mature SRE organizations run them in production.

What Makes a Good SLI?

An SLI is a ratio: good events / valid events. The "good" and "valid" definitions are where most teams go wrong. A well-formed SLI has four properties:

User-proximate — measures what the user actually experiences, not internal queue depth or CPU usage.
Actionable — when the SLI degrades, you can trace it to a specific failure domain.
Unambiguous — a new engineer reading the definition reaches the same number as a senior engineer.
Independent of load — a 50 % traffic spike must not mechanically move the SLI without a real degradation.
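The good/valid ratio can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `Request` type, the `is_health_check` field, and the choice to exclude probe traffic from "valid" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Request:
    status: int            # HTTP status code returned to the caller
    is_health_check: bool  # synthetic probe traffic, not a real user (assumed field)


def is_valid(r: Request) -> bool:
    """Valid events: real user traffic only (assumption: probes are excluded)."""
    return not r.is_health_check


def is_good(r: Request) -> bool:
    """Good events: anything that is not a server error (non-5xx)."""
    return r.status < 500


def sli(requests: List[Request]) -> Optional[float]:
    """Return good / valid, or None when there is no valid traffic."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None  # undefined rather than 0 or 1
    good = sum(1 for r in valid if is_good(r))
    return good / len(valid)


sample = [
    Request(200, False),
    Request(200, False),
    Request(503, False),  # valid but not good
    Request(404, False),  # client error: still "good" for availability
    Request(200, True),   # health check: not a valid event
]
print(sli(sample))  # 3 good out of 4 valid
```

Writing the two predicates as separate functions keeps the definition unambiguous: anyone who reads `is_valid` and `is_good` arrives at the same number.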

The Four Golden SLI Families

Most services fall into one of four categories. Choose SLIs from the family that matches your service type:

Request/response services (REST APIs, gRPC) — availability (non-5xx / total), latency (fraction served under threshold), correctness (valid response body).
Data processing pipelines (Kafka consumers, ETL) — freshness (lag under threshold), throughput, completeness (no lost events).
Storage systems (databases, object stores) — durability (reads return data successfully written), availability of read/write operations.
User-journey flows (checkout, login) — success rate of the end-to-end flow, measured at the browser or mobile client.

Latency SLIs need a threshold, not an average. Averages hide tail pain: with a p50 of 20 ms and a p99 of 4 s, the mean still looks healthy, yet 1 in 100 requests takes four seconds or longer. Define your latency SLI as "fraction of requests completing in under X ms", choose X as the slowest response users still find acceptable, and express the target as the share of requests (for example 90 % or 99 %) that must beat it.
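The point about averages can be seen with a small synthetic sample. The numbers below are invented for illustration: 990 requests at 20 ms and 10 requests at 4 s, checked against an assumed 200 ms threshold.

```python
from statistics import mean


def latency_sli(latencies_s, threshold_s):
    """Fraction of requests that completed in under threshold_s seconds."""
    if not latencies_s:
        raise ValueError("no requests observed")
    good = sum(1 for x in latencies_s if x < threshold_s)
    return good / len(latencies_s)


# Synthetic sample: 99 % fast requests, 1 % very slow ones.
latencies = [0.020] * 990 + [4.0] * 10

avg = mean(latencies)
ratio = latency_sli(latencies, threshold_s=0.2)

print(f"mean latency: {avg * 1000:.1f} ms")  # looks fine against a 200 ms goal
print(f"latency SLI:  {ratio:.3f}")          # shows that 1 % of requests missed it
```

The mean stays well under the threshold, while the threshold-based ratio makes the slow 1 % visible and directly comparable to an SLO target.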

Picking the Right Threshold

For a latency SLI on a synchronous API, a common approach is: measure current p95/p99 over a 30-day baseline, find the knee of the distribution curve, then set the threshold 20–30 % above the median latency you consider acceptable, so that normal jitter does not register as failure. For availability, three nines (99.9 %) is a reasonable starting SLO for most internal services, 99.95 % for customer-facing APIs, and 99.99 % only when genuinely required (payment auth, emergency services). As a rule of thumb, each additional nine costs roughly 10× more to operate.
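To make the cost of each nine concrete, the sketch below converts an availability target into the amount of full downtime it tolerates. It assumes a 30-day window and treats the SLO as time-based (a total outage for the whole duration); for a request-based SLI the budget is counted in failed requests instead.

```python
WINDOW_DAYS = 30  # assumed window length for this illustration


def allowed_downtime_minutes(slo: float, window_days: int = WINDOW_DAYS) -> float:
    """Minutes of total outage a time-based SLO tolerates within the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


for target in (0.999, 0.9995, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} -> {minutes:.1f} min per {WINDOW_DAYS} days")
```

Moving from 99.9 % to 99.99 % shrinks the allowance from about 43 minutes to about 4 minutes per 30 days, which is less time than most teams need to notice and respond to a page.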

SLO ≠ SLA. The SLO is your internal engineering target. The SLA is the contractual commitment with customers, always set lower (e.g., SLO = 99.9 %, SLA = 99.5 %). The gap is your buffer — never set an SLA at or above your SLO.

Measurement Windows

The window over which you evaluate compliance determines how quickly you detect problems and how stable the signal is. Two patterns dominate production:

Rolling window — a trailing 28- or 30-day window, recalculated every minute. Gives a continuously accurate picture of recent reliability. This is the Google SRE recommendation and what most Prometheus-based stacks implement.
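Calendar window — a fixed month (Jan 1–Jan 31). Aligns with billing cycles and SLAs but creates a "cliff" at month boundaries: the error budget resets in full on the 1st, so an outage on the last day of the month is forgiven a day later, while a budget exhausted early in the month leaves nothing to spend until the next window opens.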

For most teams, use a 28-day rolling window for operational decisions (error budget) and a calendar window only for contractual SLA reporting.
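A rolling window can be sketched as a fixed-length queue of per-day counts. The `RollingSLI` class and the daily granularity are assumptions for illustration; a Prometheus-based stack would compute the same thing with a range query.

```python
from collections import deque


class RollingSLI:
    """Trailing-window SLI over per-day (good, valid) counts."""

    def __init__(self, window_days: int = 28):
        # deque with maxlen drops the oldest day automatically
        self.days = deque(maxlen=window_days)

    def add_day(self, good: int, valid: int) -> None:
        if not 0 <= good <= valid:
            raise ValueError("need 0 <= good <= valid")
        self.days.append((good, valid))

    def value(self):
        good = sum(g for g, _ in self.days)
        valid = sum(v for _, v in self.days)
        return good / valid if valid else None


roll = RollingSLI(window_days=28)
roll.add_day(good=9_000, valid=10_000)       # one bad day: 10 % errors
for _ in range(27):
    roll.add_day(good=10_000, valid=10_000)  # 27 clean days

before = roll.value()                        # bad day is still inside the window
roll.add_day(good=10_000, valid=10_000)      # day 29 pushes the bad day out
after = roll.value()

print(f"{before:.5f} {after:.5f}")
```

The bad day weighs on the SLI for exactly 28 days and then ages out gradually, with no reset at an arbitrary calendar boundary.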

Instrumenting SLIs in Prometheus

The canonical Prometheus pattern for a request-based SLI uses two metrics: a _total counter for all valid events and a _total counter (or histogram) for good events. Below is a PromQL availability SLI over a 28-day rolling window:

# Availability SLI: fraction of non-5xx responses (28-day rolling)
sum(rate(http_requests_total{job="checkout",code!~"5.."}[28d]))
/
sum(rate(http_requests_total{job="checkout"}[28d]))

# Latency SLI: fraction of requests under 200 ms (28-day rolling)
sum(rate(http_request_duration_seconds_bucket{job="checkout",le="0.2"}[28d]))
/
sum(rate(http_request_duration_seconds_count{job="checkout"}[28d]))

Wrap these in recording rules so dashboards and alerts do not re-evaluate expensive range queries on every panel refresh or alert evaluation:

# prometheus/rules/slo-checkout.yaml
groups:
  - name: slo_checkout
    interval: 1m
    rules:
      - record: slo:checkout_availability:ratio_rate28d
        expr: |
          sum(rate(http_requests_total{job="checkout",code!~"5.."}[28d]))
          / sum(rate(http_requests_total{job="checkout"}[28d]))

      - record: slo:checkout_latency200ms:ratio_rate28d
        expr: |
          sum(rate(http_request_duration_seconds_bucket{job="checkout",le="0.2"}[28d]))
          / sum(rate(http_request_duration_seconds_count{job="checkout"}[28d]))

      # Error budget remaining = (SLI - SLO) / (1 - SLO)
      # 1 = budget untouched, 0 = budget spent, negative = SLO violated
      - record: slo:checkout_availability:error_budget_remaining
        expr: |
          (slo:checkout_availability:ratio_rate28d - 0.999) / (1 - 0.999)

SLI measurement pipeline: raw metrics → Prometheus recording rules → SLO compliance decision, with error-budget outcomes.
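The error-budget arithmetic is easy to check by hand. The sketch below assumes a 99.9 % SLO and expresses the remaining budget as a fraction: consumed budget is the observed error ratio divided by the allowed error ratio, which is equivalent to (SLI - SLO) / (1 - SLO).

```python
def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of error budget left: 1.0 = untouched, 0.0 = spent, < 0 = overspent."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed = 1.0 - slo   # error ratio the SLO permits
    consumed = 1.0 - sli  # error ratio actually observed
    return 1.0 - consumed / allowed


for observed in (1.0, 0.9995, 0.999, 0.998):
    left = error_budget_remaining(observed, slo=0.999)
    print(f"SLI {observed:.4f} -> budget remaining {left:+.2f}")
```

A perfect SLI leaves the whole budget, an SLI exactly at the target leaves none, and an SLI of 99.8 % against a 99.9 % target has spent the budget twice over.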

Common Failure Modes When Setting SLOs

Measuring at the load balancer, not the client. A request that the LB retries three times before succeeding appears as "good" server-side but the user waited 3×. Instrument client-perceived latency where possible, or account for retries explicitly.
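Excluding "expected" errors. Teams often exclude HTTP 429 (rate-limit) or 503 (maintenance) from the SLI denominator. Do not: from the user's perspective these are failures. Keep them in the denominator and decide explicitly whether they count as good; a plain non-5xx availability query silently treats every 429 as good. Put the budget pressure on your system, not your denominator math.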
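SLO target set to match current performance. If your service runs at 99.92 % today and you set SLO = 99.9 % simply because it is the nearest round number, the target describes what the system happens to do, not what users need, and it cannot guide trade-offs. Set it at the reliability level users actually need, whether that is looser or tighter than today's performance.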
Too many SLIs. A common guideline is two to four SLIs per service tier. More creates alert fatigue and diffuses engineering focus. Pick the few that most faithfully represent user happiness.
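The retry failure mode in the first item can be illustrated with a small comparison. The `Attempt` type, the 200 ms threshold, and the attempt latencies are invented for the example, and backoff delays between attempts are ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Attempt:
    latency_s: float  # time this single attempt took
    status: int       # HTTP status of this attempt


def last_attempt_view(attempts: List[Attempt], threshold_s: float) -> bool:
    """Judge the request by its final attempt only (a naive server-side view)."""
    last = attempts[-1]
    return last.status < 500 and last.latency_s < threshold_s


def client_view(attempts: List[Attempt], threshold_s: float) -> bool:
    """Judge the request by the total time the user waited across all attempts."""
    total = sum(a.latency_s for a in attempts)
    return attempts[-1].status < 500 and total < threshold_s


# Two failed attempts, then a success, each taking 150 ms.
attempts = [Attempt(0.15, 503), Attempt(0.15, 503), Attempt(0.15, 200)]

print(last_attempt_view(attempts, threshold_s=0.2))  # True: looks good
print(client_view(attempts, threshold_s=0.2))        # False: user waited ~450 ms
```

The same request is good by one definition and bad by the other, which is why the SLI definition should state where the measurement is taken.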

Do not set an SLO during an incident. Post-incident adrenaline produces unrealistically aggressive targets ("we will never drop below 99.99 % again"). Set SLOs during calm, data-informed quarters. Ratchet them up incrementally as reliability improves.

Multi-Window, Multi-Burn-Rate Alerting Preview

Once recording rules emit your SLI ratio, the next step (covered in Lesson 8) is alerting. The key insight is that a single 28-day window moves too slowly to page on. You need short windows (1 h, 6 h) to detect fast burns and long windows (1 d, 3 d) to detect slow burns. Urgency follows from the burn rate, which is the observed error rate divided by the error rate the SLO allows: at a burn rate of 1 the budget lasts exactly one window, and at a burn rate of 10 it is gone in a tenth of that time. All of this math is driven by the same SLI recording rules you set up here, so getting the SLI right is the prerequisite for reliable alerting.
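As a preview of the burn-rate arithmetic, the sketch below assumes a 99.9 % SLO and a 28-day window. The function names are illustrative, and the alert thresholds themselves are left to the later lesson.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


# 1 % of requests failing against a 99.9 % SLO.
rate = burn_rate(error_ratio=0.01, slo=0.999)
hours = hours_to_exhaustion(rate)

print(f"burn rate: {rate:.1f}x")
print(f"full budget exhausted in about {hours:.0f} hours")
```

A 1 % error ratio sounds small, but against a 99.9 % target it burns budget ten times faster than sustainable, which is the kind of condition a short-window alert is meant to catch.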

Summary

Choose SLIs that are user-proximate ratios of good/valid events. Use the four golden families as your starting palette. Set SLO targets based on user need, not current performance. Use 28-day rolling windows for operational decisions. Express everything as Prometheus recording rules so your dashboards, alerts, and error-budget calculations all derive from a single authoritative number.