Capstone: A Big-Tech Production Platform

Reliability: SRE Practice & DR

Reliability is not a property you build in; it is a discipline you operate. By the time your platform reaches production load, the Kubernetes layer, Terraform IaC, and observability stack from earlier lessons are in place. This lesson addresses how senior SRE and platform teams at Google, Netflix, and Atlassian actually run the system once it is live: structuring on-call so it scales without burning out engineers, using error budgets as an engineering decision tool rather than a compliance checkbox, designing your disaster-recovery tiers with precision, and then proving those tiers are real through game days.

On-Call Design at Big-Tech Scale

The canonical failure mode for a growing platform team is a flat on-call rotation that receives every alert. Within six months, every engineer in the rotation has been paged with a P0 at 2 AM by a flaky sidecar metric that has fired 40 times without any real user impact. Responders stop caring. Alerts become noise, and noise hides real incidents.

The architecture that works is tiered on-call with hard ownership boundaries:

Tier 1 — Platform on-call (infrastructure): Owns the data plane: Kubernetes node health, networking, storage, cluster autoscaler, certificate expiry. Carries a 15-minute SLA to acknowledge P1. Typically a 5–7 person weekly rotation across time zones. Pages only on symptoms proven to affect real traffic (high 5xx rate on ingress, node not-ready cascade, etcd latency spike).
Tier 2 — Service on-call (per team): Each product team owns on-call for their services. They are paged by Alertmanager routes that scope to their namespace. They escalate to Tier 1 only when the root cause is platform infrastructure, not application code.
Tier 3 — Management escalation: Automated escalation after 30 minutes of unacknowledged P0 from Tier 1 and 2. PagerDuty escalation policies handle this without human intervention.
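The ownership boundaries above can be sketched as a small routing function. This is only an illustration of the decision logic, not how Alertmanager or PagerDuty is configured: the namespace names, the alert shape, and the returned tier labels are all hypothetical.

```javascript
// Hypothetical sketch of tiered on-call routing.
// Namespace names and the alert object shape are illustrative assumptions.
const PLATFORM_NAMESPACES = new Set(['kube-system', 'ingress', 'monitoring']);
const ESCALATION_MINUTES = 30; // unacknowledged P0 goes to management after this

function routeAlert(alert) {
  // alert: { namespace: string, severity: 'P0' | 'P1' | ..., minutesUnacked: number }
  if (alert.severity === 'P0' && alert.minutesUnacked >= ESCALATION_MINUTES) {
    return 'tier3-management';
  }
  if (PLATFORM_NAMESPACES.has(alert.namespace)) {
    return 'tier1-platform';
  }
  // Everything else belongs to the team that owns the namespace.
  return `tier2-service:${alert.namespace}`;
}

console.log(routeAlert({ namespace: 'payments', severity: 'P1', minutesUnacked: 0 }));
// prints "tier2-service:payments"
```

In a real setup the same boundaries would live in Alertmanager routes and PagerDuty escalation policies; the function only makes the precedence explicit (escalation first, then platform ownership, then the owning team).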

Alert quality is the whole game. Every alert in the rotation must meet the "2 AM test": if it pages someone at 2 AM, is there a runbook, is it actionable within 10 minutes, and did it represent actual user harm in the last 90 days? Alerts that fail this test are either silenced, converted to daily digest notifications, or deleted. Google SRE reports that teams who enforce alert curation reduce mean time to acknowledge (MTTA) by 40–60% within one quarter.
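One way to make the 2 AM test repeatable is to encode it as a check that runs over an alert inventory. The sketch below assumes a hypothetical inventory record with `runbookUrl`, `minutesToFirstAction`, and `lastUserHarmAt` fields; none of these names come from a real tool.

```javascript
// Hypothetical sketch of the "2 AM test" as a predicate over alert metadata.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function passesTwoAmTest(alert, now = new Date()) {
  const hasRunbook = Boolean(alert.runbookUrl);
  const actionable = alert.minutesToFirstAction <= 10;
  const lastHarm = alert.lastUserHarmAt ? new Date(alert.lastUserHarmAt) : null;
  // The alert must have corresponded to real user harm within the last 90 days.
  const recentHarm = lastHarm !== null && (now - lastHarm) / MS_PER_DAY <= 90;
  return hasRunbook && actionable && recentHarm;
}

const now = new Date('2024-06-01T00:00:00Z');
console.log(passesTwoAmTest({
  runbookUrl: 'https://runbooks.internal/payment-api/high-error-rate',
  minutesToFirstAction: 5,
  lastUserHarmAt: '2024-05-01T00:00:00Z',
}, now)); // prints true
```

Alerts for which the predicate returns false are the candidates for silencing, conversion to a digest, or deletion.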

Use multi-window, multi-burn-rate alerts for SLOs. A burn rate of 14× sustained for 1 hour consumes about 2% of a 30-day error budget; left unchecked, it would exhaust the whole budget in roughly two days. A burn rate of 1× over 6 hours is a slow leak worth a ticket but not a page. Prometheus alerting rules that fire only on elevated burn rate, and only when both a long and a short window agree, eliminate the majority of transient false positives that plague threshold-based alerting.

# Prometheus Operator: multi-burn-rate SLO alert (PrometheusRule, evaluated by Prometheus and routed by Alertmanager)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-payment-api
  namespace: monitoring
spec:
  groups:
  - name: slo.payment-api.burnrate
    rules:
    # Fast burn — page immediately (2% budget in 1 h)
    - alert: PaymentAPIFastBurn
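      # sum() on both sides: without it the division matches series label-for-label
      # (5xx series divided by themselves) and the ratio is always 1.
      expr: |
        (
          sum(rate(http_requests_total{job="payment-api",code=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="payment-api"}[1h]))
        ) > (14 * 0.001)
        and
        (
          sum(rate(http_requests_total{job="payment-api",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="payment-api"}[5m]))
        ) > (14 * 0.001)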
      for: 2m
      labels:
        severity: critical
        team: payments
      annotations:
        summary: "Payment API fast error-budget burn"
        runbook: "https://runbooks.internal/payment-api/high-error-rate"
    # Slow burn: ticket, not page (1x burn sustained over 6 h, i.e. error rate above the 0.1% SLO allowance)
    - alert: PaymentAPISlowBurn
      expr: |
        (
          sum(rate(http_requests_total{job="payment-api",code=~"5.."}[6h]))
          /
          sum(rate(http_requests_total{job="payment-api"}[6h]))
        ) > (1 * 0.001)
      for: 60m
      labels:
        severity: warning
        team: payments
      annotations:
        summary: "Payment API slow error-budget burn"
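
The thresholds in the rule above follow from simple arithmetic. The sketch below assumes a 30-day budget window and the 99.9% SLO implied by the `0.001` factor; the function names are illustrative and not part of any library.

```javascript
// Burn-rate arithmetic for an SLO, assuming a 30-day budget window.

// Hours until the whole budget is gone if the burn rate is sustained.
function hoursToExhaustion(burnRate, windowDays = 30) {
  return (windowDays * 24) / burnRate;
}

// Fraction of the budget consumed by burning at `burnRate` for `alertWindowHours`.
function budgetConsumedFraction(burnRate, alertWindowHours, windowDays = 30) {
  return (burnRate * alertWindowHours) / (windowDays * 24);
}

// Error-rate threshold that corresponds to a burn rate for a given SLO target.
function errorRateThreshold(burnRate, sloTarget) {
  return burnRate * (1 - sloTarget);
}

console.log(hoursToExhaustion(14).toFixed(1));                   // prints "51.4"
console.log((budgetConsumedFraction(14, 1) * 100).toFixed(2));   // prints "1.94"
console.log((budgetConsumedFraction(1, 6) * 100).toFixed(2));    // prints "0.83"
console.log(errorRateThreshold(14, 0.999).toFixed(3));           // prints "0.014"
```

A 14× burn therefore uses about 2% of the budget per hour and would empty it in a little over two days, while a 1× burn over 6 hours uses under 1%, which is why one pages and the other only raises a ticket.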

Error Budgets as an Engineering Tool

An SLO without an error budget is a metric. An SLO with an error budget is a decision engine. The error budget answers one question in every engineering conversation: do we have the reliability headroom to ship this change, or do we need to invest in reliability first?

The operational mechanics that make error budgets work in practice:

Measure from the client, not the server. Server-side metrics miss DNS failures, CDN errors, and mobile-network TCP timeouts. The canonical source of truth for SLO measurement is synthetic monitoring (Blackbox Exporter or Checkly hitting your endpoints from five geographic regions) combined with Real User Monitoring (RUM) data from your frontend. Review at least monthly whether your measurement methodology captures what the user actually experiences.
Budget is shared across all causes of failure. An incident caused by a botched Terraform apply and an incident caused by a DNS provider outage both consume the same error budget. This is intentional. It prevents teams from arguing about blame and focuses energy on reducing all causes of unavailability equally.
Budget policy is a written document, reviewed quarterly. The policy must answer: when budget drops below 10%, what changes? At Google the answer is: all feature releases to that service require an SRE sign-off; reliability work takes priority over feature work. Without a written, enforced policy, error budgets become theatre.
Track budget consumption at the sprint level. A Grafana dashboard with a 28-day rolling error budget burn rate per service, visible to both SREs and product managers, is the single most effective alignment tool a platform team can ship. Disagreements about reliability investment resolve themselves when the data is in the room.
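
The budget mechanics above reduce to a short calculation. The sketch below is a minimal illustration, assuming request counts for the rolling window are already available; the field names and the `releaseGate` labels are hypothetical, and the 10% threshold is the policy trigger mentioned in the text.

```javascript
// Minimal sketch: remaining error budget over a rolling window, plus a policy gate.
function errorBudgetStatus({ totalRequests, failedRequests, sloTarget, policyThreshold = 0.10 }) {
  // Failures the SLO permits over the window.
  const allowedFailures = totalRequests * (1 - sloTarget);
  // Assumption: with no traffic in the window, nothing has been consumed.
  const consumed = allowedFailures > 0 ? failedRequests / allowedFailures : 0;
  const remaining = Math.max(0, 1 - consumed);
  return {
    allowedFailures,
    consumed,
    remaining,
    // Below the policy threshold, feature releases need SRE sign-off.
    releaseGate: remaining < policyThreshold ? 'sre-signoff-required' : 'normal',
  };
}

const status = errorBudgetStatus({
  totalRequests: 10_000_000, // requests in the 28-day window (illustrative)
  failedRequests: 9_200,     // failures from all causes, whatever their origin
  sloTarget: 0.999,
});
console.log((status.remaining * 100).toFixed(1), status.releaseGate);
// prints "8.0 sre-signoff-required"
```

Note that `failedRequests` is a single number: a botched deploy and a provider outage draw from the same pool, which matches the shared-budget rule.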

Error budget alerts ≠ incident alerts. Configure a weekly Slack digest that posts each service's budget consumption for the prior 7 days. Reserve paging for the burn-rate thresholds above. Mixing budget-tracking notifications with incident pages causes responders to tune out both.

Disaster Recovery Tier Design

DR tier selection is a cost/recovery trade-off expressed as two numbers: RTO (Recovery Time Objective — how long until service is restored) and RPO (Recovery Point Objective — how much data loss is acceptable). At big-tech companies, different services have different answers, and the platform must support all of them.

DR tiers by RTO, RPO, and relative cost — most platforms run Tier 0 for payments, Tier 1 for core APIs, Tier 2/3 for internal tooling.

The critical insight is that not every service needs Tier 0. The payment processing API and authentication service warrant active-active multi-region. The internal analytics dashboard and admin portal do not. Applying Tier 0 uniformly doubles your infrastructure spend with no user-facing benefit for low-criticality workloads. Maintain a service criticality register — a spreadsheet or Confluence page updated quarterly — that maps each service to a DR tier and records the business justification. Without this register, DR tier decisions drift by engineer preference rather than business requirement.

Data replication strategy by tier:

Tier 0 (Active-Active): Aurora Global Database (asynchronous storage-level replication, typically <1 s lag to the secondary region), or CockroachDB for globally distributed writes. The hard problem is write conflicts; most teams resolve this by routing writes to a single primary region with automatic failover, not true multi-primary. Redis Cluster replication is async and not suitable as the only data store for Tier 0; treat it as an auxiliary cache, not a source of truth.
Tier 1 (Warm Standby): Aurora Global with async replication to a scaled-down read replica cluster in the DR region. Promote the replica to writer on failover. Failover time is 5–15 minutes including DNS propagation. Use Route 53 health checks with evaluate_target_health = true on your alias records (Route 53 alias records are not CNAMEs) for automatic DNS failover.
Tier 2 (Pilot Light): Continuous database backups to S3 (point-in-time recovery enabled on Aurora). The Terraform configuration for the DR region already exists, so running terraform apply -var="environment=dr" provisions the full stack in <30 minutes. No nodes running means near-zero standby cost.
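
The service criticality register described earlier can be kept honest with a small audit. The sketch below assumes the register is exported as a list of records; the field names, the sample services' review dates, and the 92-day staleness limit (an approximation of "quarterly") are illustrative assumptions.

```javascript
// Hypothetical audit of a service criticality register.
const MS_PER_DAY = 24 * 60 * 60 * 1000;
const VALID_TIERS = new Set([0, 1, 2, 3]);

function auditRegister(entries, now, maxReviewAgeDays = 92) {
  const findings = [];
  for (const e of entries) {
    if (!VALID_TIERS.has(e.tier)) {
      findings.push(`${e.service}: unknown DR tier ${e.tier}`);
    }
    if (!e.justification || e.justification.trim() === '') {
      findings.push(`${e.service}: missing business justification`);
    }
    const ageDays = (now - new Date(e.reviewedAt)) / MS_PER_DAY;
    // Written as a negation so an unparseable date (NaN) also counts as stale.
    if (!(ageDays <= maxReviewAgeDays)) {
      findings.push(`${e.service}: review is stale`);
    }
  }
  return findings;
}

const register = [
  { service: 'payment-api', tier: 0, justification: 'Revenue path', reviewedAt: '2024-05-15' },
  { service: 'admin-portal', tier: 3, justification: '', reviewedAt: '2023-11-01' },
];
console.log(auditRegister(register, new Date('2024-06-30')));
// prints [ 'admin-portal: missing business justification', 'admin-portal: review is stale' ]
```

Running a check like this on a schedule is one way to stop tier assignments from drifting between quarterly reviews.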

Game Days: Proving Your DR Is Real

A DR plan that has never been executed under pressure is a fiction. Game days are the engineering discipline of running controlled failure scenarios in production (or a production-like environment) to verify that RTO and RPO claims are true. They also surface the operational muscle memory — who calls whom, which runbook page has the right command — that only develops through practice.

Game day programme structure at big-tech companies:

Quarterly full DR failover: Simulate a full region loss. Cut traffic to primary region (update Route 53 weights to 0), promote Aurora replica, verify all services start in the DR region within the stated RTO, measure actual data loss vs. RPO target. Run this in a maintenance window the first time; run it without a maintenance window after you have two clean executions.
Monthly chaos experiments: Narrower scope. Kill a random pod in a critical namespace, drain a node from the on-call's home AZ, inject 200 ms latency on the payment service's outbound connections. Use Chaos Mesh or AWS Fault Injection Simulator. Scope must be constrained — define the steady-state hypothesis, inject the fault, observe, and roll back within 30 minutes. Never start a chaos experiment without a rollback procedure that takes <5 minutes.
Weekly synthetic failure on staging: Automated, unattended. A k6 load test combined with a scripted fault injection (kill the primary Redis node, restart a database proxy). Pass/fail result reported to Slack. Prevents regression between game days.

# Chaos Mesh — inject 200ms latency on payment service egress
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-egress-latency
  namespace: chaos-engineering
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - payments
    labelSelectors:
      app: payment-api
  delay:
    latency: "200ms"
    correlation: "25"
    jitter: "50ms"
  direction: to        # Chaos Mesh accepts to/from/both; "to" affects traffic leaving the selected pods
  duration: "10m"

// k6 load test to observe SLO behaviour during the experiment
// Run alongside the chaos manifest:
//   k6 run --env BASE_URL=https://api.internal payment-slo-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';

const errorRate = new Rate('errors');

export const options = {
  stages: [
    { duration: '2m', target: 200 },
    { duration: '6m', target: 200 },
    { duration: '2m', target: 0 },
  ],
  thresholds: {
    http_req_duration: ['p(99)<600'],   // 600ms P99 — SLO allows 200ms added latency
    errors: ['rate<0.001'],              // Error budget: 99.9% success rate
  },
};

export default function () {
  const res = http.post(
    `${__ENV.BASE_URL}/v1/payments/validate`,
    JSON.stringify({ amount: 1000, currency: 'USD' }),
    { headers: { 'Content-Type': 'application/json' } }
  );
  errorRate.add(res.status !== 200);
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(0.5);
}

Post-game day process: Every game day produces a written report within 24 hours. The report covers the steady-state hypothesis, what was injected, what was observed, whether RTO and RPO were met, and, critically, the action items. Action items that come from game days are reliability work, and they take priority in the next sprint under the error-budget policy. This is the feedback loop that makes the platform measurably more reliable over time rather than just feeling more reliable.
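
The RTO/RPO verdict in that report can be computed rather than argued about. The sketch below is a minimal illustration; the record shape and the sample numbers are hypothetical, not measurements from a real exercise.

```javascript
// Hypothetical sketch: compare game-day measurements against the stated targets.
function evaluateGameDay(run) {
  const rtoMet = run.measuredRtoMinutes <= run.targetRtoMinutes;
  const rpoMet = run.measuredDataLossSeconds <= run.targetRpoSeconds;
  const actionItems = [];
  if (!rtoMet) {
    actionItems.push(
      `Recovery took ${run.measuredRtoMinutes} min against an RTO of ${run.targetRtoMinutes} min`
    );
  }
  if (!rpoMet) {
    actionItems.push(
      `Lost ${run.measuredDataLossSeconds} s of data against an RPO of ${run.targetRpoSeconds} s`
    );
  }
  return { scenario: run.scenario, rtoMet, rpoMet, passed: rtoMet && rpoMet, actionItems };
}

const report = evaluateGameDay({
  scenario: 'full region failover',
  targetRtoMinutes: 15,
  targetRpoSeconds: 60,
  measuredRtoMinutes: 22,
  measuredDataLossSeconds: 4,
});
console.log(report.passed, report.actionItems);
// prints false [ 'Recovery took 22 min against an RTO of 15 min' ]
```

Each generated action item is a candidate for the next sprint's reliability work under the error-budget policy.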

Never run chaos experiments without an explicit owner and a written rollback plan in the room. A poorly scoped chaos injection at Twitch in 2021 killed a Kafka broker in the wrong Kubernetes namespace — the "staging" namespace was backed by the same physical broker cluster as production. The blast radius was a 45-minute partial outage affecting live stream ingest. Scope experiments to isolated namespaces; verify namespace-to-infra mappings before injecting faults; always run with a human who has rollback authority watching the dashboard in real time.

The Reliability Flywheel

On-call, error budgets, DR tiers, and game days are not independent practices. They form a flywheel: game days surface reliability gaps that consume error budget; budget policy converts those gaps into sprint work; that work improves alert quality and DR confidence; better alerts make on-call sustainable; a sustainable on-call team runs better game days. The platform team that runs this loop quarterly becomes measurably more reliable every 90 days. The platform team that treats any one of these as a compliance exercise — running game days on paper, carrying alerts they never prune, writing DR plans they never test — stalls at the reliability ceiling defined by their worst untested assumption.