Observability Foundations

Project: An Observability Plan

Everything in this tutorial has built toward one deliverable: a production-grade observability plan for a real service. At Google, services that SRE teams take on go through a Production Readiness Review (PRR), and observability (defined SLIs, SLOs, dashboards, and alert rules) is a core part of what that review checks. This lesson walks you through producing that artifact, end to end, for a sample e-commerce checkout service.

The output of this lesson is a set of files you can commit to a repository: SLO definitions in YAML, a Prometheus alerting rule file, and a Grafana dashboard JSON model. These are not toy examples — they follow the same structure used by SRE teams at scale.

Step 1: Define the Service and Its User Journeys

Before writing a single metric query, document what the service does and which user actions are critical. For the checkout-service:

Primary journey: User submits an order — the checkout service validates the cart, calls the payment processor, reserves inventory, and returns a confirmation within 3 seconds.
Secondary journey: Retrieve past orders for the order history page (latency-sensitive but not revenue-blocking).
Background job: Send post-purchase email (async, failure acceptable with retry).

Identify the reliability targets your users actually notice. A payment that takes 15 seconds feels broken even if it technically succeeds. An email that arrives 30 minutes late is invisible to the user.

Step 2: Define SLIs and SLOs

For each journey, define a Service Level Indicator (the metric) and a Service Level Objective (the target). Use the request-based availability and latency SLI formulas standardised in the Google SRE Workbook.

SLI definitions for checkout-service:

Availability SLI: good_requests / total_requests, where a good request is any HTTP 2xx or 3xx response to POST /checkout.
Latency SLI: Proportion of POST /checkout requests completing in under 2 seconds, measured at the 99th percentile over a 5-minute window.
Order retrieval latency SLI: Proportion of GET /orders requests completing in under 500 ms at the 95th percentile.
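
As a rough illustration of the request-based formulas above, the sketch below computes an availability SLI and a latency SLI from a small in-memory list of requests. The Request record and the sample data are hypothetical; in production these ratios would come from Prometheus counters and histograms rather than from raw request records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record, for illustration only)."""
    status: int        # HTTP status code
    duration_s: float  # server-side latency in seconds


def availability_sli(requests: list[Request]) -> float:
    """good_requests / total_requests, where good means a 2xx or 3xx status."""
    if not requests:
        return 1.0  # assumption: no traffic counts as no failures
    good = sum(1 for r in requests if 200 <= r.status < 400)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_s: float) -> float:
    """Proportion of requests that completed in under threshold_s seconds."""
    if not requests:
        return 1.0  # same assumption as above
    fast = sum(1 for r in requests if r.duration_s < threshold_s)
    return fast / len(requests)


# Made-up sample: one server error and one slow (but successful) request.
sample = [
    Request(200, 0.4),
    Request(201, 1.1),
    Request(302, 0.2),
    Request(500, 0.3),
    Request(200, 2.6),
]

print(f"availability SLI: {availability_sli(sample):.2%}")
print(f"latency SLI (<2 s): {latency_sli(sample, 2.0):.2%}")
```

Note that the slow request still counts as good for availability and the failed request still counts as fast for latency, which is why the two SLIs are tracked separately.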

SLO targets (28-day rolling window):

Availability: 99.9% (allows ~40 minutes of total failure per 28-day window, or the equivalent share of failed requests)
Checkout latency p99 < 2 s: 99.5% of requests
Order retrieval latency p95 < 500 ms: 99% of requests

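Error Budget = 1 - SLO. A 99.9% availability SLO gives you a 0.1% error budget per 28-day window: 0.001 × 28 × 24 × 60 ≈ 40 minutes of total failure (the often-quoted 43 minutes assumes a 30-day month). Track your burn rate, which is how fast you are spending the budget relative to the pace that would use it up exactly at the end of the window. If you burn 100% of the budget in 72 hours, that is a burn rate of roughly 9x and something is seriously wrong. If you have unused budget at the end of the window, you have room to take deployment risk or invest in reliability work.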
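
The arithmetic behind error budgets and burn rates is easy to get wrong by a factor, so here is a minimal sketch of it. The function names are illustrative, and the time-based view assumes traffic is roughly constant; with a request-based SLO the budget is really a count of failed requests.

```python
def error_budget_minutes(slo_target: float, window_days: float) -> float:
    """Minutes of total failure the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent.

    A burn rate of 1 uses up the budget exactly at the end of the window.
    """
    return observed_error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: float) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    if rate <= 0:
        raise ValueError("burn rate must be positive")
    return window_days * 24 / rate


SLO_TARGET = 0.999
WINDOW_DAYS = 28

budget = error_budget_minutes(SLO_TARGET, WINDOW_DAYS)
print(f"budget: {budget:.1f} min per {WINDOW_DAYS} days")

for rate in (1, 6, 14):
    hours = hours_to_exhaustion(rate, WINDOW_DAYS)
    print(f"burn rate {rate:>2}x -> budget gone in {hours:.0f} h")

# A 1.4% error ratio against a 99.9% SLO is a 14x burn.
print(f"burn rate at 1.4% errors: {burn_rate(0.014, SLO_TARGET):.1f}x")
```

For the 28-day window used here, a 14x burn empties the budget in 48 hours and a 6x burn in 112 hours (a little under 5 days).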

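Encode these SLOs in a machine-readable format. The OpenSLO spec provides a vendor-neutral YAML schema. Generators such as Sloth can turn SLO definitions into Prometheus recording and alerting rules, and Pyrra does the same from its own resource format. Support for OpenSLO varies by tool and release, so check which spec versions your generator accepts before relying on the file below: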

# slo/checkout-availability.yaml  (OpenSLO format)
apiVersion: openslo/v1
kind: SLO
metadata:
  name: checkout-availability
  namespace: checkout-service
spec:
  service: checkout-service
  description: "Availability of the checkout endpoint for order submission"
  timeWindow:
    - duration: 28d
      isRolling: true
  budgetingMethod: Occurrences
  objectives:
    - displayName: "99.9% availability"
      target: 0.999
  indicator:
    metadata:
      name: checkout-good-requests
    spec:
      ratioMetric:
        good:
          metricSource:
            type: Prometheus
            spec:
              query: |
                sum(rate(http_server_requests_total{
                  job="checkout-service",
                  route="/checkout",
                  status_class=~"2xx|3xx"
                }[5m]))
        total:
          metricSource:
            type: Prometheus
            spec:
              query: |
                sum(rate(http_server_requests_total{
                  job="checkout-service",
                  route="/checkout"
                }[5m]))

---
# Generate Prometheus recording + alerting rules from this SLO:
# sloth generate -i slo/checkout-availability.yaml -o alerts/checkout-slo.yaml

Step 3: Design the Dashboard

A production dashboard answers four questions at a glance: Is the service healthy right now? Is it meeting its SLOs? What changed recently? Where is the bottleneck? Structure every service dashboard around these sections, in this order:

SLO Status row — Current error budget burn rate, remaining budget this window, SLO compliance percentage. If this row is red, nothing else matters.
Golden Signals row — Latency (p50/p95/p99), traffic (RPS), error rate (%), saturation (CPU %, memory %, queue depth).
Dependency Health row — Downstream service latency and error rates (payment processor, inventory service, database).
Recent Changes row — Deployment markers from your CI/CD system overlaid on the latency graph. Most incidents are caused by recent changes.

Figure: The observability pipeline, from user journeys to SLIs/SLOs, through Prometheus, to Grafana dashboards and Alertmanager on-call routing.

In Grafana, define your dashboard as code using JSON model files committed to git. Never click-to-configure production dashboards: hand-edited dashboards drift from what is in git and are hard to recover after an accidental deletion. Use Grafana's provisioning directory or the grafana_dashboard Terraform resource to deploy them automatically.
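
One way to keep the JSON model reviewable is to generate it from a short script instead of editing it by hand. The sketch below builds a skeleton with the four rows described above. It is an assumption-laden starting point, not a complete model: the uid and titles are illustrative, fields such as schemaVersion and datasource are left out, and the Recent Changes row is only a header because deployment markers depend on how your CI/CD system exposes them. Export a dashboard from your own Grafana version to see the full set of fields.

```python
import json
from itertools import count

# Label selector reused from the alert rules in this lesson.
CHECKOUT = 'job="checkout-service",route="/checkout"'

ERROR_RATIO = (
    f'sum(rate(http_server_requests_total{{{CHECKOUT},status_class=~"4xx|5xx"}}[5m]))'
    f' / sum(rate(http_server_requests_total{{{CHECKOUT}}}[5m]))'
)
P99_LATENCY = (
    "histogram_quantile(0.99, sum by (le) ("
    f"rate(http_server_request_duration_seconds_bucket{{{CHECKOUT}}}[5m])))"
)


def build_dashboard() -> dict:
    """Return a skeleton dashboard model with one header per section."""
    ids = count(1)   # panel ids must be unique within a dashboard
    panels: list[dict] = []
    y = 0            # vertical position on Grafana's 24-column grid

    def add_section(title: str, charts: list[tuple[str, str]]) -> None:
        nonlocal y
        panels.append({"id": next(ids), "type": "row", "title": title,
                       "collapsed": False,
                       "gridPos": {"h": 1, "w": 24, "x": 0, "y": y}})
        y += 1
        for chart_title, expr in charts:
            panels.append({"id": next(ids), "type": "timeseries",
                           "title": chart_title,
                           "gridPos": {"h": 8, "w": 24, "x": 0, "y": y},
                           "targets": [{"refId": "A", "expr": expr}]})
            y += 8

    add_section("SLO Status", [("Checkout error ratio (5m)", ERROR_RATIO)])
    add_section("Golden Signals", [("Checkout p99 latency", P99_LATENCY)])
    add_section("Dependency Health", [])  # add downstream panels here
    add_section("Recent Changes", [])     # add deployment markers here

    return {"uid": "checkout-slo", "title": "Checkout SLO",
            "time": {"from": "now-6h", "to": "now"}, "panels": panels}


print(json.dumps(build_dashboard(), indent=2))
```

Redirecting the output to dashboards/checkout-slo.json gives a file that can be diffed in code review like any other change.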

Step 4: Write Alerting Rules

Alerts must be actionable. Every alert that fires should have a corresponding runbook entry. If an engineer cannot start acting on the problem within minutes of reading the alert, the alert needs work: it is firing too early, its threshold is wrong, or it lacks a runbook. The multi-window multi-burn-rate alerting strategy (from the Google SRE Workbook) gives you fast detection of severe issues and slow detection of gradual degradation without excessive alert noise:

# alerts/checkout-slo-alerts.yaml
groups:
  - name: checkout-slo-alerts
    rules:

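      # Fast burn: error budget is being consumed at 14x the sustainable rate.
      # At that rate a 28-day budget is exhausted in about 2 days.
      # The 1h window shows the burn is significant; the 5m window shows it is still happening.
      - alert: CheckoutAvailabilityFastBurn
        expr: |
          (
            sum(rate(http_server_requests_total{job="checkout-service",route="/checkout",status_class=~"4xx|5xx"}[1h]))
            /
            sum(rate(http_server_requests_total{job="checkout-service",route="/checkout"}[1h]))
          ) > (14 * 0.001)
          and
          (
            sum(rate(http_server_requests_total{job="checkout-service",route="/checkout",status_class=~"4xx|5xx"}[5m]))
            /
            sum(rate(http_server_requests_total{job="checkout-service",route="/checkout"}[5m]))
          ) > (14 * 0.001)
        for: 2m
        labels:
          severity: critical
          team: checkout
          slo: availability
        annotations:
          summary: "Checkout SLO fast burn: error budget draining"
          description: "Error rate {{ $value | humanizePercentage }} over the last hour is more than 14x the rate the SLO allows. Budget exhaustion in ~2 days."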
          runbook: "https://wiki.example.com/runbooks/checkout-availability"
          dashboard: "https://grafana.example.com/d/checkout-slo"

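      # Slow burn: 6x the sustainable rate, which exhausts a 28-day budget in ~5 days.
      # The 6h window shows the burn is significant; the 30m window shows it is still happening.
      - alert: CheckoutAvailabilitySlowBurn
        expr: |
          (
            sum(rate(http_server_requests_total{job="checkout-service",route="/checkout",status_class=~"4xx|5xx"}[6h]))
            /
            sum(rate(http_server_requests_total{job="checkout-service",route="/checkout"}[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_server_requests_total{job="checkout-service",route="/checkout",status_class=~"4xx|5xx"}[30m]))
            /
            sum(rate(http_server_requests_total{job="checkout-service",route="/checkout"}[30m]))
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
          team: checkout
          slo: availability
        annotations:
          summary: "Checkout SLO slow burn: error budget degrading"
          description: "Sustained 6x burn rate over 6h. Budget exhaustion in ~5 days without remediation."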
          runbook: "https://wiki.example.com/runbooks/checkout-availability"

      # Latency SLO: p99 > 2s for more than 5 minutes
      - alert: CheckoutLatencySLOBreach
        expr: |
          histogram_quantile(0.99,
            sum by (le) (
              rate(http_server_request_duration_seconds_bucket{
                job="checkout-service",
                route="/checkout"
              }[5m])
            )
          ) > 2.0
        for: 5m
        labels:
          severity: critical
          team: checkout
          slo: latency
        annotations:
          summary: "Checkout p99 latency exceeds 2s SLO"
          description: "p99 latency is {{ $value | humanizeDuration }}. SLO target is 2s."
          runbook: "https://wiki.example.com/runbooks/checkout-latency"

      # Dependency: payment processor error rate spike
      - alert: PaymentProcessorErrorRate
        expr: |
          sum(rate(http_client_requests_total{
            job="checkout-service",
            target="payment-processor",
            status_class=~"4xx|5xx"
          }[5m]))
          /
          sum(rate(http_client_requests_total{
            job="checkout-service",
            target="payment-processor"
          }[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
          team: checkout
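        annotations:
          summary: "Payment processor returning >5% errors"
          description: "External payment gateway error rate {{ $value | humanizePercentage }}; likely causing checkout failures."
          # Placeholder URLs following the pattern of the alerts above
          runbook: "https://wiki.example.com/runbooks/payment-processor"
          dashboard: "https://grafana.example.com/d/checkout-slo"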
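
The multi-window idea can be stated in a few lines of ordinary code, which helps when reasoning about why a rule did or did not fire. In the sketch below, an alert condition holds only when both a long and a short window exceed the same burn-rate threshold: the long window shows the burn is significant, and the short window shows it is still happening. The function name and the sample ratios are illustrative assumptions.

```python
def burn_alert_fires(long_window_ratio: float,
                     short_window_ratio: float,
                     burn_rate: float,
                     slo_target: float) -> bool:
    """True when both windows exceed burn_rate times the allowed error ratio."""
    threshold = burn_rate * (1.0 - slo_target)
    return long_window_ratio > threshold and short_window_ratio > threshold


SLO_TARGET = 0.999  # 99.9% availability, so 0.1% of requests may fail

# Ongoing incident: 2% errors over the long window, 3% over the short one.
print(burn_alert_fires(0.02, 0.03, burn_rate=14, slo_target=SLO_TARGET))

# Incident already over: the long window is still elevated, the short one is clean.
print(burn_alert_fires(0.02, 0.001, burn_rate=14, slo_target=SLO_TARGET))

# Brief blip: the short window spikes but the long window barely moves.
print(burn_alert_fires(0.001, 0.03, burn_rate=14, slo_target=SLO_TARGET))
```

Only the first case fires. The second is what makes the alert reset quickly after recovery, and the third is what keeps short spikes from paging anyone.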

Add a runbook URL and dashboard link to every alert annotation. At 3 AM, the on-call engineer should not need to search for context: the alert fires, they click the runbook link, follow the diagnosis steps, and click the dashboard link to see the data. The aim is to make the right action the easy action. A bare alert with just a description and a severity label is an incomplete alert.
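
A rule like this is easier to keep when it is checked automatically. The sketch below walks a rule file that has already been parsed into Python dictionaries (for example by a YAML library, which is not shown here) and reports alerts that lack any of the expected annotations. The required keys and the second alert name are assumptions chosen for illustration.

```python
REQUIRED_ANNOTATIONS = ("summary", "description", "runbook", "dashboard")


def missing_annotations(rule_file: dict) -> dict[str, list[str]]:
    """Map each incomplete alert name to the annotation keys it lacks."""
    problems: dict[str, list[str]] = {}
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:
                continue  # recording rules carry no annotations to check
            annotations = rule.get("annotations") or {}
            missing = [key for key in REQUIRED_ANNOTATIONS
                       if not annotations.get(key)]
            if missing:
                problems[name] = missing
    return problems


# Abbreviated, already-parsed rule file (expressions shortened for the example).
rules = {
    "groups": [{
        "name": "checkout-slo-alerts",
        "rules": [
            {"alert": "CheckoutAvailabilityFastBurn",
             "expr": "...",
             "annotations": {
                 "summary": "Checkout SLO fast burn",
                 "description": "Error budget draining",
                 "runbook": "https://wiki.example.com/runbooks/checkout-availability",
                 "dashboard": "https://grafana.example.com/d/checkout-slo"}},
            {"alert": "ExampleIncompleteAlert",  # hypothetical
             "expr": "...",
             "annotations": {"summary": "Something is wrong",
                             "description": "No links attached"}},
            {"record": "job:http_errors:ratio_rate5m", "expr": "..."},
        ],
    }]
}

for alert, keys in missing_annotations(rules).items():
    print(f"{alert}: missing {', '.join(keys)}")
```

Run in CI next to promtool, a check of this kind turns the guideline into something a merge cannot skip.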

Step 5: The Observability Plan Document

Combine everything into a single structured document, a living artifact checked into the service repository at docs/observability-plan.md. This is the document a new on-call engineer should read before their first rotation. It must cover:

Service overview: What it does, its dependencies, and its traffic patterns (peak RPS, typical latency, batch job schedule).
SLI catalogue: Every defined SLI with its PromQL query and the reasoning for that specific threshold.
SLO table: Target, window, error budget per month, and what "SLO breach" means for the business.
Dashboard index: Links to Grafana dashboards with a one-line description of what each answers.
Alert catalogue: Every alert, its severity, its burn rate multiplier, and a link to its runbook.
Data retention policy: How long raw metrics, traces, and logs are kept, and at what resolution (e.g., raw metrics 15 days, downsampled 5-minute aggregates 13 months).
On-call escalation path: Who gets paged for critical vs. warning alerts, and the escalation chain if they do not acknowledge within N minutes.

An observability plan that is not reviewed is a plan that is wrong. SLOs drift from reality: traffic patterns change, new features are added, dependencies change. Schedule a quarterly SLO review — compare the target against actual measured performance, review the error budget burn rate trend, and ask whether the SLO is still meaningful. A service sitting at 99.99% availability with a 99.9% SLO is leaving engineering risk capacity on the table; one at 99.85% with the same SLO is in chronic breach without knowing it.
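
The comparison at the heart of the quarterly review can be reduced to one number: the fraction of the error budget that measured performance actually spent. The sketch below classifies a service from that fraction. The 0.5 cut-off for "headroom" is an arbitrary assumption for illustration; pick a value that matches how much risk your team wants to take.

```python
def budget_spent(measured: float, target: float) -> float:
    """Fraction of the error budget used (1.0 means exactly on the SLO)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return (1.0 - measured) / (1.0 - target)


def review_slo(measured: float, target: float,
               headroom_below: float = 0.5) -> str:
    """Classify measured availability against its SLO target."""
    spent = budget_spent(measured, target)
    if spent > 1.0:
        return "breach"      # more failures than the SLO allows
    if spent < headroom_below:
        return "headroom"    # budget largely unused: room for more risk
    return "on target"


for measured in (0.9999, 0.9992, 0.9985):
    spent = budget_spent(measured, 0.999)
    verdict = review_slo(measured, 0.999)
    print(f"measured {measured:.2%}: {spent:.0%} of budget spent, {verdict}")
```

With a 99.9% target, 99.99% measured availability spends about a tenth of the budget, while 99.85% spends about one and a half times the budget, which matches the two cases described above.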

Putting It All Together: File Structure

A production-grade observability plan lives in version control alongside the service code. Here is the recommended layout:

checkout-service/
├── slo/
│   ├── checkout-availability.yaml     # OpenSLO definition
│   └── checkout-latency.yaml
├── alerts/
│   ├── checkout-slo-alerts.yaml       # Multi-window burn rate alerts
│   └── checkout-dependencies.yaml    # Payment processor, DB, cache
├── dashboards/
│   ├── checkout-slo.json              # Grafana dashboard JSON model
│   └── checkout-golden-signals.json
└── docs/
    └── observability-plan.md          # Living on-call reference document

# Validate alert YAML syntax before merging:
promtool check rules alerts/checkout-slo-alerts.yaml

# Validate PromQL expressions against a real Prometheus instance:
promtool query instant http://prometheus:9090 \
  'histogram_quantile(0.99, sum by(le)(rate(http_server_request_duration_seconds_bucket{job="checkout-service"}[5m])))'

# Deploy Grafana dashboards via Terraform:
# resource "grafana_dashboard" "checkout_slo" {
#   config_json = file("${path.module}/dashboards/checkout-slo.json")
#   folder      = grafana_folder.checkout.id
# }

This is the standard you hold yourself to as a professional DevOps engineer. An observable service is not one that has Prometheus installed — it is one where, when an incident happens at 2 AM, the on-call engineer opens a dashboard and knows exactly what is broken and why within five minutes. Building that capability intentionally, before incidents happen, is what separates operational excellence from fire-fighting.