Incident Management & On-Call

Severity Levels & Escalation

When production breaks, the first question every on-call engineer must answer is "how bad is this?" Severity levels give the whole company a shared vocabulary to answer that question in seconds, and escalation policies define who gets woken up when the answer is "very bad." Together they are the skeleton of incident response. Without them, teams waste the first ten minutes of a SEV 1 arguing about whether it is a SEV 1.

Why a Standard Taxonomy Matters

Severity levels are not bureaucracy. They drive concrete decisions: which runbook to open, how many engineers to page, whether to call the CEO, and what the recovery-time SLA is. Google, PagerDuty, Atlassian, and many other SRE organizations use some variant of a numbered scale, commonly SEV 1 to SEV 4 (or P0 to P3). The labels and the number of levels vary by company, but the logic is the same: severity encodes customer impact and business risk, not technical complexity.

The Standard Four-Level Scale

The table below maps the levels used at most large-scale tech companies. Learn this pattern cold — interviewers and incident commanders both expect you to cite it instantly.

| Level | Customer Impact | Acknowledge SLA | Who Gets Paged |
| --- | --- | --- | --- |
| SEV 1 (Critical) | Full outage or data loss. All users affected. Revenue at risk. | < 5 min | On-call + incident commander + senior eng + VP/exec + customer comms team |
| SEV 2 (High) | Major feature broken or significant user subset down. No viable workaround. | < 15 min | On-call + incident commander + product stakeholder (exec optional) |
| SEV 3 (Medium) | Degraded performance or partial feature outage. Workaround exists. | < 30 min | On-call engineer only (may rope in domain expert if not resolved in 30 min) |
| SEV 4 (Low) | Minor bug or cosmetic issue. Edge-case users affected. No SLA impact. | Next business day | Ticket filed; handled during business hours, no after-hours page. |

Acknowledge SLA = time from the alert firing until the first human acknowledges the incident.
Standard four-level severity ladder used at most large-scale tech companies — each level drives specific response times and escalation paths.
Severity is about customer impact, not engineering difficulty. A tricky but invisible bug in an internal reporting pipeline is SEV 4. A one-line config error that drops all payments is SEV 1. Never let engineering complexity inflate the severity of an incident that customers cannot feel.
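
The ladder can be sketched as a small lookup, which makes the "impact, not difficulty" rule concrete. This is a minimal illustration, not a standard: the `Impact` fields, the `classify` function, and the exact decision order are assumptions made for this example, and only the acknowledge SLAs come from the table.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Impact:
    """Customer-facing facts only; nothing about how hard the fix is."""
    full_outage_or_data_loss: bool = False
    major_feature_broken: bool = False
    degraded_or_partial: bool = False
    workaround_exists: bool = False


# Acknowledge SLA in minutes, taken from the table; None = next business day.
ACK_SLA_MINUTES = {
    Severity.SEV1: 5,
    Severity.SEV2: 15,
    Severity.SEV3: 30,
    Severity.SEV4: None,
}


def classify(impact: Impact) -> Severity:
    """Map customer impact to a severity level (hypothetical decision order)."""
    if impact.full_outage_or_data_loss:
        return Severity.SEV1
    if impact.major_feature_broken and not impact.workaround_exists:
        return Severity.SEV2
    if impact.major_feature_broken or impact.degraded_or_partial:
        return Severity.SEV3
    return Severity.SEV4


# A one-line config error that drops all payments vs. an invisible internal bug.
payments_down = Impact(full_outage_or_data_loss=True)
internal_reporting_bug = Impact()
print(classify(payments_down).name, classify(internal_reporting_bug).name)
```

Note that `Impact` has no field for engineering effort, so a hard-to-debug but invisible problem cannot raise the level.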

Writing a Severity Definition That Sticks

Vague definitions cause arguments at the worst possible time. Each SEV level needs three things: a measurable customer impact threshold, a time-to-acknowledge SLA, and an explicit escalation target. Here is a real-world definition for SEV 1 that you can put into your runbook today:

    # SEV1: Critical Incident Definition
    # Trigger on ANY of the following:
    conditions:
      - error_rate > 10%               # across all requests, sustained > 2 min
      - p99_latency > 10s              # on any public-facing endpoint
      - payment_success_rate < 95%
      - auth_login_success_rate < 90%
      - full_region_unavailable: true
      - data_loss_confirmed: true

    response:
      ack_sla: 5m          # page escalates if no ACK within 5 minutes
      bridge_open: true    # war-room Slack channel + Zoom bridge opened immediately
      exec_notify: true    # VP Engineering + on-call PM notified
      status_page: true    # public status page updated within 10 min

    # Once you declare SEV1, do NOT downgrade mid-incident.
    # Downgrade only during postmortem if impact is confirmed lower.
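
A definition like this only helps if it can be evaluated mechanically. The sketch below checks a metrics snapshot against the same thresholds. The `MetricSnapshot` type and `sev1_triggers` function are hypothetical names for this example, and the "sustained > 2 min" qualifier is assumed to be handled upstream by whatever produces the snapshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSnapshot:
    error_rate: float               # fraction of all requests failing, 0.0-1.0
    p99_latency_s: float            # worst public-facing endpoint, in seconds
    payment_success_rate: float     # 0.0-1.0
    auth_login_success_rate: float  # 0.0-1.0
    full_region_unavailable: bool = False
    data_loss_confirmed: bool = False


def sev1_triggers(m: MetricSnapshot) -> list[str]:
    """Return the name of every SEV1 condition the snapshot meets.

    Any non-empty result means SEV1, because the definition says ANY.
    """
    checks = {
        "error_rate > 10%": m.error_rate > 0.10,
        "p99_latency > 10s": m.p99_latency_s > 10.0,
        "payment_success_rate < 95%": m.payment_success_rate < 0.95,
        "auth_login_success_rate < 90%": m.auth_login_success_rate < 0.90,
        "full_region_unavailable": m.full_region_unavailable,
        "data_loss_confirmed": m.data_loss_confirmed,
    }
    return [name for name, hit in checks.items() if hit]


snapshot = MetricSnapshot(
    error_rate=0.02,
    p99_latency_s=0.8,
    payment_success_rate=0.90,
    auth_login_success_rate=0.99,
)
print(sev1_triggers(snapshot))
```

Returning the list of matched conditions, rather than a bare boolean, gives the incident commander the reason for the declaration in the first page.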

Escalation Policies and Paging Trees

An escalation policy is the sequence of humans that an alert system walks through when an incident is not acknowledged. PagerDuty, Opsgenie, and VictorOps (now Splunk On-Call) all model this as a paging tree: if person A does not acknowledge in N minutes, page person B; if B does not respond, page person C; and so on until the policy runs out of levels.

A well-designed paging tree has three layers:

  1. Primary on-call — the engineer currently holding the pager for their service. Paged first for every incident.
  2. Secondary on-call — the backup, typically the previous week's primary. Paged if primary does not ACK within the SEV SLA window.
  3. Manager / domain lead — paged only for SEV 1 or SEV 2 that goes unacknowledged past both levels, or immediately for declared SEV 1. They never fix the incident — they remove blockers for the people who do.
Separate the technical escalation tree from the communication escalation tree. Engineers page up for help. Communications (status page, executive updates, customer emails) is a parallel track owned by the incident commander or comms role. Conflating both into one tree means engineers are writing customer emails while the system is still down.

Here is a PagerDuty escalation policy expressed as code using the Terraform PagerDuty provider — exactly how infrastructure-as-code shops version-control their on-call configs:

    resource "pagerduty_escalation_policy" "payments_sev1" {
      name      = "Payments SEV1 Escalation"
      num_loops = 2 # repeat the chain 2 more times after the first pass

      # Level 1: primary on-call
      rule {
        escalation_delay_in_minutes = 5
        target {
          type = "user_reference"
          id   = pagerduty_user.on_call_primary.id
        }
      }

      # Level 2: secondary on-call
      rule {
        escalation_delay_in_minutes = 5
        target {
          type = "user_reference"
          id   = pagerduty_user.on_call_secondary.id
        }
      }

      # Level 3: payments engineering lead
      rule {
        escalation_delay_in_minutes = 5
        target {
          type = "user_reference"
          id   = pagerduty_user.payments_eng_lead.id
        }
      }

      # Level 4: VP Engineering
      rule {
        escalation_delay_in_minutes = 10
        target {
          type = "user_reference"
          id   = pagerduty_user.vp_engineering.id
        }
      }
    }

In PagerDuty, num_loops is the number of times the policy repeats after it reaches the end of the chain, so num_loops = 2 allows up to three passes through all four levels. Each pass takes 5 + 5 + 5 + 10 = 25 minutes, so an unacknowledged incident stops escalating after roughly 75 minutes. Once the loops are exhausted nobody else is paged, so pair the policy with a fallback, for example a "nobody responded" notification to a dedicated Slack channel, so a human can intervene manually.
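
The timing arithmetic is easy to get wrong, so it helps to compute it. The sketch below walks a chain of escalation delays for a given number of passes. `escalation_timeline` is a hypothetical helper, and how a particular num_loops value translates into a pass count is an assumption you should confirm against the provider documentation before relying on the totals.

```python
def escalation_timeline(delays_min, passes):
    """Return (pages, stop_minute) for an incident that nobody acknowledges.

    delays_min: escalation delay of each rule, in chain order.
    passes: total number of times the chain is walked (assumption: num_loops
            has already been translated into a pass count).
    """
    pages = []  # (minute_offset, pass_number, level)
    minute = 0
    for pass_number in range(1, passes + 1):
        for level, delay in enumerate(delays_min, start=1):
            pages.append((minute, pass_number, level))
            minute += delay  # wait this long before moving to the next rule
    return pages, minute


delays = [5, 5, 5, 10]  # primary, secondary, engineering lead, VP
for passes in (2, 3):
    pages, stop = escalation_timeline(delays, passes)
    print(f"{passes} passes: {len(pages)} pages, chain exhausted at +{stop} min")
```

With these delays, two passes exhaust the chain at 50 minutes and three passes at 75 minutes, which is a long time for a SEV 1 to go unacknowledged either way.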

Escalation Diagram: From Alert to Exec

Paging tree escalation flow for a SEV1 incident:

  1. Alert fires (SEV1) in PagerDuty / Opsgenie.
  2. Primary on-call: ACK within 5 min, then handles the incident.
  3. No ACK: secondary on-call (+5 min), the backup engineer, is paged.
  4. No ACK: engineering lead (+5 min), a senior engineer or team lead, is paged.
  5. No ACK: VP Engineering (+10 min) is paged.

Parallel comms track (runs from minute zero of SEV1; the incident commander owns this, NOT the on-call engineer):

  • Status page updated
  • Incident channel opened
  • Exec summary posted
  • Customer email drafted
SEV 1 paging tree: technical escalation (left) runs in parallel with the communications track (right), both triggered from minute zero.

Escalation Anti-Patterns to Avoid

Even with a well-designed policy, teams fall into predictable traps:

  • Severity inflation — labelling everything SEV 1 so nothing gets ignored. This destroys the signal. If your on-call team sees three SEV 1 alerts a week and only one is real, they will start ignoring them.
  • Severity deflation — engineers under-declaring to avoid waking up their manager. Encode in your culture that declaring SEV 1 is never punished when done in good faith.
  • No automatic escalation — relying on the on-call engineer to manually call for help. Engineers in crisis mode forget. Automate it: if the incident ticket has not moved to "Mitigating" within 20 minutes of a SEV 1 declaration, have your incident tooling re-page the secondary automatically.
  • Escalating too early up the management chain — paging VPs for SEV 3 creates alert fatigue at the leadership level and erodes trust in your monitoring.
Severity re-assessment during an incident is mandatory. An incident that starts as SEV 3 can become SEV 1 within minutes if root cause analysis reveals that the blast radius is larger than first estimated. Build a checkpoint into your incident process: re-assess severity every 15 minutes for active incidents. Failing to upgrade severity delays the escalation and leaves you under-resourced.
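
Both timers above, the 15-minute severity checkpoint and the 20-minute "Mitigating" deadline for SEV 1, can be expressed as one periodic check. This is a sketch under stated assumptions: `actions_due`, the action names, and the status strings are hypothetical, and wiring the result to a real paging API is left out.

```python
from datetime import datetime, timedelta

REASSESS_EVERY = timedelta(minutes=15)       # severity checkpoint interval
MITIGATION_DEADLINE = timedelta(minutes=20)  # SEV 1 should be "Mitigating" by then


def actions_due(now, declared_at, last_reassessed_at, severity, status):
    """Return the follow-up actions an active incident needs right now."""
    due = []
    if now - last_reassessed_at >= REASSESS_EVERY:
        due.append("reassess_severity")
    stalled = status not in ("Mitigating", "Resolved")
    if severity == 1 and stalled and now - declared_at >= MITIGATION_DEADLINE:
        due.append("repage_secondary")
    return due


declared = datetime(2024, 1, 1, 12, 0)
check_time = declared + timedelta(minutes=21)
print(actions_due(check_time, declared, declared, severity=1, status="Investigating"))
```

Running this from a scheduler every minute or so removes the dependency on a stressed engineer remembering to ask for help.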

Tying Severity to SLOs

The most mature organizations derive severity automatically from their Service Level Objectives. If a PromQL alert fires and the error budget burn rate exceeds a threshold, the severity is computed, not guessed. Example multi-window burn-rate alert at SEV 1 level:

    # Prometheus alerting rule: auto-assigns SEV1 when the error ratio is critical.
    # Fires when the 5xx ratio exceeds 10% over both the 1h and the 5m window.
    # How much monthly error budget that burns per hour depends on your SLO target,
    # so derive the 0.10 threshold from the SLO rather than copying it.
    groups:
      - name: slo.payments
        rules:
          - alert: PaymentsSLOBurnRateCritical
            # sum() on both sides is required: without it the division matches
            # each 5xx series against itself and always yields 1.
            expr: |
              (
                sum(rate(http_requests_total{job="payments",code=~"5.."}[1h]))
                /
                sum(rate(http_requests_total{job="payments"}[1h]))
              ) > 0.10
              and
              (
                sum(rate(http_requests_total{job="payments",code=~"5.."}[5m]))
                /
                sum(rate(http_requests_total{job="payments"}[5m]))
              ) > 0.10
            for: 2m
            labels:
              severity: SEV1
              team: payments
              escalation_policy: payments_sev1
            annotations:
              summary: "Payments error rate {{ $value | humanizePercentage }} (SEV1)"
              runbook_url: "https://wiki.internal/runbooks/payments-high-error-rate"
              dashboard_url: "https://grafana.internal/d/payments-overview"

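Notice the escalation_policy label. Alertmanager can route on it to a receiver whose PagerDuty service is attached to the escalation policy of the same name, so the right paging tree fires automatically. The mapping is not built in: you configure it once in the Alertmanager routing tree, and after that no human has to decide who to call.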
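
A burn-rate target such as "X% of the monthly error budget in one hour" only becomes an error-ratio threshold once an SLO target is fixed. The sketch below shows that arithmetic. The function names, the 30-day budget period, and the 99.9% SLO in the usage lines are assumptions chosen for illustration, not values taken from the rule above.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(error_ratio, slo_target, window_hours, period_hours=30 * 24):
    """Fraction of the period's error budget used if the ratio holds for the window."""
    return burn_rate(error_ratio, slo_target) * window_hours / period_hours


def error_ratio_threshold(budget_fraction, slo_target, window_hours, period_hours=30 * 24):
    """Error ratio that burns budget_fraction of the budget within window_hours."""
    required_burn_rate = budget_fraction * period_hours / window_hours
    return required_burn_rate * (1.0 - slo_target)


slo = 0.999  # assumed 99.9% availability SLO over 30 days
threshold = error_ratio_threshold(0.05, slo, window_hours=1)
print(f"5% of budget in 1h at SLO {slo}: alert above {threshold:.2%} errors")
print(f"10% errors for 1h consumes {budget_consumed(0.10, slo, 1):.1%} of the budget")
```

Under the assumed 99.9% SLO, burning 5% of a 30-day budget in one hour corresponds to a burn rate of about 36 and an error ratio of about 3.6%, so a fixed 10% threshold would fire later than that target implies.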

Version-control your escalation policies alongside your alerting rules. When a team member leaves or a rotation changes, a Terraform plan shows the diff. On-call configs that live only in the PagerDuty UI are one forgotten update away from silently paging the wrong person.

Severity levels and escalation policies are load-bearing infrastructure. They are not documents that sit in a wiki — they are executable contracts between your monitoring stack, your on-call engineers, and your customers. Treat them with the same discipline you apply to your Kubernetes manifests: version-controlled, tested, and reviewed before they change.