Observability Foundations

Alerting Philosophy

An alert wakes someone up. That is a serious act. Every time an alert fires at 3 AM and the on-call engineer investigates only to find nothing actionable, you have made an implicit statement: our system's noise matters more than your sleep and your judgment. Multiply that by dozens of engineers and hundreds of alerts, and you have the most common failure mode of observability programs at scale — alert fatigue. Teams stop trusting alerts, start ignoring them, and the first time a real incident fires, nobody responds in time.

Google's Site Reliability Engineering book devotes an entire chapter to this problem. The core principle that emerges: every page must be urgent, actionable, and customer-visible. If an alert does not meet all three criteria, it should be a dashboard warning or a ticket, not a page. This lesson teaches how to build an alerting system that your on-call team trusts and that catches real problems before users notice them.

Symptom-Based Alerting

The most important design decision in alerting is to alert on symptoms, not causes. A symptom is something the user experiences: slow responses, errors, unavailable features. A cause is an internal system state: high CPU, a disk filling up, a pod in CrashLoopBackOff. The mistake most teams make is alerting on causes, which leads to two failure modes:

False positives: CPU at 95% for 2 minutes does not always mean users are impacted. Maybe it is a batch job. Maybe auto-scaling is already reacting. You page an engineer for nothing.
False negatives: A silent memory leak causes the recommendation engine to serve stale results without any CPU or error rate spike. No cause-based alert fires. Users are degraded for hours.

Symptom-based alerting inverts this. Instead of "alert when CPU > 80%", you ask "alert when p99 latency exceeds the SLO threshold" or "alert when error budget burn rate is too high." The internal cause becomes an input to investigation, not the trigger for waking someone up.
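
To make the contrast concrete, here is a minimal sketch of a symptom check. It looks only at what users experienced in a window of requests (latency and error responses) and never at CPU or memory. The `Request` type, the function names, and the default thresholds (500ms p99, 0.5% errors, borrowed from the checkout example later in this lesson) are illustrative assumptions, not part of any monitoring library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_s: float  # time the user waited for the response, in seconds
    status: int       # HTTP status code returned to the user


def p99(values: List[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) using integer math
    return ordered[rank - 1]


def symptom_alerts(window: List[Request],
                   latency_slo_s: float = 0.5,
                   error_slo: float = 0.005) -> List[str]:
    """Return the user-visible symptoms present in this window of requests."""
    if not window:
        return []
    symptoms = []
    if p99([r.latency_s for r in window]) > latency_slo_s:
        symptoms.append("latency")
    error_rate = sum(r.status >= 500 for r in window) / len(window)
    if error_rate > error_slo:
        symptoms.append("errors")
    return symptoms
```

A CPU spike that leaves latency and errors untouched produces an empty list here, which is the point: the cause shows up during investigation, not as the trigger.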

Key principle: The Four Golden Signals (from Google SRE) define the universal symptom space for web services: Latency (how slow), Traffic (how much load), Errors (how many failures), and Saturation (how full/overloaded). Alert on degradations in these four signals, not on the internal causes that might produce them. For non-web workloads (databases, queues, batch jobs) the signals differ, but the principle — alert on what the user experiences — holds.

Severity Levels and the Escalation Contract

Not every alert deserves a 3 AM phone call. A robust alerting architecture defines a clear severity taxonomy and enforces a routing contract for each level. At most mature companies the tiers look like this:

P1 (Critical / Page immediately): Users are completely unable to use the product, or data integrity is at risk. Revenue impact is active. Requires immediate human response regardless of time. Example: the checkout service is returning 100% errors, or the database primary is down and automatic failover has not completed.
P2 (High / Page during business hours, wake if prolonged): A significant portion of users are degraded. SLO breach is imminent or occurring. Requires response within 30 minutes. Example: p99 latency at 3× the SLO threshold for more than 10 minutes, or payment success rate down 15%.
P3 (Medium / Ticket, respond next business day): A non-critical path is degraded or a capacity limit is approaching. No immediate user impact, but left unaddressed it could escalate. Example: background job queue depth trending toward its limit, or a specific API endpoint showing elevated errors that affect less than 0.5% of users.
P4 (Low / Dashboard warning): Informational. Worth watching but requires no action today. Never sends a notification; it is visible only to someone actively looking at dashboards.

The contract is sacred: if P1 fires and the on-call does not respond within 5 minutes, escalate automatically to the team lead. PagerDuty, OpsGenie, and VictorOps all support this escalation policy out of the box. The moment P1 ever fires for a non-emergency, the team will start routing P1 alerts to silence — and that is the beginning of an incident response breakdown.
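
The routing contract and the 5-minute escalation rule can be sketched as plain data plus one predicate. This is only an illustration of the logic: the action names in `ROUTES` and the function `needs_escalation` are hypothetical, and in practice the paging tool's own escalation policy would do this work.

```python
from datetime import datetime, timedelta
from typing import Optional

# Severity tier -> notification action (action names are illustrative).
ROUTES = {
    "P1": "page",                 # immediate page, any hour
    "P2": "page_business_hours",  # page in hours, wake only if prolonged
    "P3": "ticket",               # next business day
    "P4": "dashboard",            # never notifies anyone
}

ESCALATION_TIMEOUT = timedelta(minutes=5)


def needs_escalation(severity: str,
                     fired_at: datetime,
                     acked_at: Optional[datetime],
                     now: datetime) -> bool:
    """True when a P1 is still unacknowledged after the escalation timeout."""
    if severity != "P1":
        return False
    if acked_at is not None:
        return False
    return now - fired_at >= ESCALATION_TIMEOUT
```

Keeping the mapping in one small table makes the contract easy to review: anyone can see at a glance which tiers are allowed to wake a human.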

Runbooks: The Alert Is Not Complete Without One

An alert without a runbook is an alarm with no instructions. When an engineer is woken at 3 AM, their cognitive capacity is reduced, their stress is elevated, and they may be junior enough to have never seen this failure mode before. A runbook is the bridge between "alert fired" and "incident resolved."

Every production alert must link directly to its runbook. The link goes in the alert annotation. The runbook must answer, in order:

What does this alert mean? Plain English: what user-visible symptom is occurring or at risk?
What is the immediate triage step? The first command or dashboard to check. Answerable in under 2 minutes.
What are the likely root causes? Ordered by historical frequency. Each cause links to its specific remediation.
What are the safe remediations? Exact commands to run, flagged as safe vs. destructive. Destructive steps require explicit confirmation.
When to escalate? If still unresolved after N minutes, who to call next, and what info to provide them.

Runbooks live in version control alongside the alert rules. When an alert rule changes, its runbook is updated in the same PR. Stale runbooks that reference old commands or non-existent services are worse than no runbook — they waste time and erode trust.
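
Because rules and runbooks change in the same PR, the "every alert links to a runbook" requirement can be checked automatically. The sketch below assumes the rule files have already been parsed into dictionaries shaped like Prometheus alerting rules (for example by a YAML loader); the function name `missing_runbooks` is hypothetical.

```python
from typing import Dict, List


def missing_runbooks(rules: List[Dict]) -> List[str]:
    """Names of alert rules whose annotations lack a non-empty runbook_url."""
    missing = []
    for rule in rules:
        url = rule.get("annotations", {}).get("runbook_url", "")
        if not url.strip():
            missing.append(rule.get("alert", "<unnamed>"))
    return missing
```

A CI job could fail the build whenever this list is non-empty, so an alert without instructions never reaches production.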

Pro practice: The best runbooks at companies like Stripe and GitHub are living documents — after every incident that hits a given alert, the SRE or on-call engineer updates the runbook with what they actually found, which triage steps were wasteful, and which new root cause they discovered. This turns each incident into an institutional knowledge gain rather than a one-time event. Some teams encode this as a policy: you cannot close an incident ticket without updating the runbook for every alert that fired.

# Prometheus alerting rule with runbook annotation and severity label
# File: monitoring/alerts/checkout-service.yml

groups:
  - name: checkout-service
    rules:
      - alert: CheckoutErrorBudgetBurnHigh
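        # sum() aggregates across series so the ratio is service-wide.
        # Without it, each 5xx series is divided by itself and the ratio is always 1.
        expr: |
          (
            sum(rate(http_requests_total{job="checkout", status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{job="checkout"}[1h]))
          ) > 0.005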
        for: 5m
        labels:
          severity: critical
          team: payments
          service: checkout
        annotations:
          summary: "Checkout error rate burning SLO budget at high rate"
          description: "Error rate {{ $value | humanizePercentage }} exceeds 0.5% SLO threshold for 5m."
          runbook_url: "https://runbooks.internal/checkout/high-error-rate"
          dashboard_url: "https://grafana.internal/d/checkout-overview"

      - alert: CheckoutLatencyP99High
        # sum by (le) merges buckets across instances into one service-wide p99.
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout"}[10m]))
          ) > 0.8
        for: 10m
        labels:
          severity: high
          team: payments
          service: checkout
        annotations:
          summary: "Checkout p99 latency above 800ms"
          description: "p99={{ $value | humanizeDuration }} is above the 800ms alert threshold (SLO target: 500ms)."
          runbook_url: "https://runbooks.internal/checkout/latency-p99-high"

# Alertmanager routing: severity → receiver
# File: monitoring/alertmanager/alertmanager.yml
# Fragment: every receiver named here must also be defined under a top-level receivers: key.
# matchers replaces the deprecated match syntax (Alertmanager 0.22+).

route:
  receiver: default-slack
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-payments
      continue: false
    - matchers:
        - severity="high"
      receiver: slack-payments-oncall
    - matchers:
        - severity="medium"
      receiver: jira-ticket-creator

Alert Fatigue: The Silent Killer of On-Call Culture

Alert fatigue is not a metaphor; it is a measurable phenomenon. Hospital alarm systems (where alarm fatigue has been linked to patient deaths) and cloud operations teams show the same pattern: once a large share of alerts is non-actionable (roughly half is a useful rule of thumb), on-call engineers develop automatic suppression behavior. They acknowledge alerts without reading them. They silence PagerDuty for the first hour after waking because they expect it to be noise. When the real incident fires, the response is slow.

The causes of alert fatigue in production systems:

Alerting on causes, not symptoms: leads to high false-positive rates for non-user-impacting conditions.
Thresholds set too aggressively: "alert if p99 ever exceeds 200ms" fires constantly in a system where the SLO is 500ms.
No "for" duration on the rule: a single metric spike lasting 30 seconds pages an engineer at 3 AM.
Alert proliferation without review: engineers add alerts when they find a new problem, but never remove alerts when the underlying issue is fixed or the service is decommissioned. Alert count grows monotonically.
Flapping alerts: an alert oscillates between firing and resolving every few minutes, sending a new notification each time.

The cure is a regular alert review process. Every quarter, pull the alert firing history and categorize each alert into: (a) fired and led to a user-impacting incident, (b) fired and was noise, (c) did not fire during an incident it should have caught. Category (b) alerts are candidates for deletion, threshold increase, or conversion to a ticket. Category (c) alerts reveal gaps in coverage.
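
The three review categories can be computed mechanically once each firing is tagged with the incident it led to (if any). The sketch below is one possible shape for that bookkeeping; the `Firing` record, the `expected` mapping from incident to the alerts that should have caught it, and the category keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Firing:
    alert: str
    incident_id: Optional[str]  # None means the firing led to no incident


def review(firings: List[Firing],
           expected: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Sort alerts into the review categories (a), (b), and (c).

    expected maps incident id -> alerts that should have fired for it.
    """
    led_to_incident = {f.alert for f in firings if f.incident_id}
    noise = {f.alert for f in firings if not f.incident_id}

    # Which alerts actually fired for each incident.
    fired_for: Dict[str, Set[str]] = {}
    for f in firings:
        if f.incident_id:
            fired_for.setdefault(f.incident_id, set()).add(f.alert)

    missed: Set[str] = set()
    for incident, alerts in expected.items():
        missed |= alerts - fired_for.get(incident, set())

    return {"a_incident": led_to_incident, "b_noise": noise, "c_missed": missed}
```

An alert can land in both (a) and (b) when it is sometimes real and sometimes noise; that usually points at a threshold or duration to tune rather than an alert to delete.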

[Figure: Alert flow from metric signal through routing to on-call response and post-incident review, the feedback loop that eliminates fatigue over time.]

Practical Alert Hygiene at Scale

Beyond the philosophy, production alerting requires operational discipline. These are rules commonly followed by SRE teams running large production systems:

Every alert has an owner. An alert with no team label gets silenced and removed. Ownership is enforced by the routing config — if there is no route for a label, the alert goes to a catch-all that opens a ticket against the platform team to find the owner.
Alerts must have a "for" duration. Never fire on a single evaluation. Minimum for: 2m for critical, for: 5m for everything else. As a rough estimate, this alone eliminates 60-70% of false positives from transient spikes.
Use multi-window burn rate for SLO alerts. A single threshold on a 5-minute window misses slow burns. Alert when the 1-hour burn rate is fast AND the 5-minute burn rate confirms it is still ongoing. This is the algorithm in Google's SRE Workbook: it catches fast burns early and slow burns before budget exhaustion, with very low false-positive rates.
Suppress during maintenance windows. Alertmanager silences and time_intervals (referenced from a route's mute_time_intervals) mute notifications during planned downtime, and inhibit_rules suppress derivative alerts while a parent alert is firing. Forgetting this causes 50+ alerts during a scheduled maintenance, training the team to ignore them.
Track your alert signal-to-noise ratio. Export a metric: alerts_total{actionable="true|false"}. If actionable rate drops below 70%, schedule an alert audit sprint.
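
The arithmetic behind multi-window burn rates is short enough to check by hand. A burn rate is the observed error ratio divided by the error budget (1 minus the SLO), and the share of budget consumed is the burn rate times the window length divided by the SLO period. The sketch below assumes a 30-day period; all function names are illustrative.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def budget_consumed(rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the period's error budget spent by burning at `rate` for the window."""
    return rate * window_hours / (period_days * 24)


def should_alert(long_ratio: float, short_ratio: float,
                 slo: float, threshold: float) -> bool:
    """Multi-window check: the long window shows the burn is significant,
    and the short window confirms it is still happening."""
    return (burn_rate(long_ratio, slo) > threshold
            and burn_rate(short_ratio, slo) > threshold)
```

For example, `budget_consumed(14.4, 1)` comes out at about 0.02, which is where the "14.4x burns 2% of the monthly budget in 1 hour" figure comes from. The short window is what lets the alert resolve quickly once the errors stop.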

# Multi-window SLO burn rate alert (Google SRE Workbook pattern)
# Catches fast burns AND slow burns with minimal false positives
# Assumes SLO = 99.9% (0.1% error budget), 30-day window.

groups:
  - name: slo-burn-rate
    rules:
      # Fast burn: 14.4x rate — burns 2% of monthly budget in 1 hour
      - alert: SLOBurnRateFast
        # sum() makes both ratios service-wide; without it each 5xx series is divided by itself.
        expr: |
          (
            sum(rate(http_requests_total{job="api", status=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="api"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "API fast error budget burn — page immediately"
          runbook_url: "https://runbooks.internal/api/slo-burn-fast"

      # Slow burn: 6x rate, burns 5% of monthly budget in 6 hours
      # (the SRE Workbook value for 6h/30m windows; 3x would take a full day to burn 10%)
      - alert: SLOBurnRateSlow
        expr: |
          (
            sum(rate(http_requests_total{job="api", status=~"5.."}[6h]))
            / sum(rate(http_requests_total{job="api"}[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api", status=~"5.."}[30m]))
            / sum(rate(http_requests_total{job="api"}[30m]))
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: high
          page: "false"
        annotations:
          summary: "API slow error budget burn — investigate today"
          runbook_url: "https://runbooks.internal/api/slo-burn-slow"

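# Alertmanager inhibition: suppress pod-level alerts while the node-level alert is firing
# Prevents an alert storm when a node goes down
# Both the source and target alerts must carry a matching node label for the rule to apply.
inhibit_rules:
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - job="kubernetes-pods"
    equal: ['node']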

Production pitfall: Never page on predicted future state (e.g., "disk will be full in 4 hours based on current rate") unless your team has a proven track record of acting on it. Predictive alerts sound appealing but have high false-positive rates: the rate changes, the prediction expires, and the engineer who investigates at 2 AM finds 40% disk usage and goes back to sleep angry. Use predictive signals in dashboards and daily digests, not pages. Alert on the current symptom, not the forecast.
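
A small sketch shows why such forecasts are fragile. It uses naive linear extrapolation from two disk-usage samples (similar in spirit to, but much simpler than, Prometheus's predict_linear); the function name and the sample values are illustrative assumptions.

```python
from typing import Optional


def hours_until_full(used_then: float, used_now: float,
                     hours_between: float, capacity: float = 1.0) -> Optional[float]:
    """Linear forecast of hours until the disk is full, or None if usage is not growing.

    Usage values are fractions of capacity (0.40 means 40% full).
    """
    growth_per_hour = (used_now - used_then) / hours_between
    if growth_per_hour <= 0:
        return None
    return (capacity - used_now) / growth_per_hour
```

During a batch write that takes usage from 30% to 40% in 15 minutes, the forecast is roughly 1.5 hours until full. Fifteen minutes later, with the job finished and usage flat at 40%, the same function returns None. A page sent on the first reading would have woken someone for a disk that was never at risk.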

Building the Culture, Not Just the Config

Alerting philosophy is ultimately a team agreement, not a Prometheus YAML file. The config enforces the philosophy, but the philosophy must be agreed upon first. This means having explicit conversations about: what counts as an emergency, what is acceptable to let burn through business hours, and what the on-call rotation's implicit social contract is. Teams that skip this conversation end up with engineers who individually tune their own alert sensitivity — some muting everything, others paging for every blip — and the system as a whole becomes unpredictable.

Institutionalize this as a quarterly alerting review: pull the last 90 days of alert history, calculate the actionable rate per alert, identify the top-10 noisiest alerts by volume, and spend one sprint reducing them. This feedback loop, applied consistently, produces an alert system that the team trusts — which means a team that responds when it matters.
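
The quarterly numbers are easy to produce once each firing in the history is marked actionable or not. The sketch below assumes the history has been exported as (alert name, was actionable) pairs; that export format and the function name `noisiest` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def noisiest(history: Iterable[Tuple[str, bool]],
             top_n: int = 10) -> List[Tuple[str, int, float]]:
    """Rank alerts by firing volume and report each one's actionable rate.

    history: one (alert_name, was_actionable) pair per firing.
    Returns (alert_name, firings, actionable_rate), most frequent first.
    """
    fired: Counter = Counter()
    actionable: Counter = Counter()
    for name, was_actionable in history:
        fired[name] += 1
        if was_actionable:
            actionable[name] += 1
    # Sort by volume descending, then by name so ties are deterministic.
    ranked = sorted(fired.items(), key=lambda item: (-item[1], item[0]))
    return [(name, count, actionable[name] / count) for name, count in ranked[:top_n]]
```

Alerts that sit near the top of this list with a low actionable rate are the natural backlog for the review sprint.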