Anatomy of an Incident
Every production incident — from a brief latency spike that auto-healed in seconds to a multi-hour outage that made the news — follows the same underlying lifecycle. At Google, Amazon, Netflix, and every serious engineering organisation, the ability to understand exactly where you are in that lifecycle at any given moment is the difference between a team that resolves incidents predictably and a team that thrashes. This lesson maps that lifecycle in detail, connecting each phase to the tooling, human decisions, and failure modes you will encounter in production.
The Six Phases of an Incident
The incident lifecycle is not a rigid checklist — it is a mental model that keeps everyone oriented. Phases can overlap, compress under pressure, or temporarily reverse (you thought you had root cause and then learned you did not). Knowing the model is what lets you notice when you are drifting.
Phase 1: Detection
An incident begins the moment user impact starts — not when the first alert fires. This distinction is critical. There is always a detection gap: the interval between when the system degraded and when someone noticed. Closing this gap is the first and most leveraged reliability investment you can make.
Detection sources fall into three categories, in rough order of reliability:
- Synthetic monitoring: probes you control, running continuously from outside your system (the Prometheus Blackbox Exporter, Pingdom, Datadog Synthetics). These detect failures from the user's perspective and do not share your infrastructure's failure modes. If your load balancer crashes and takes your internal monitoring with it, synthetics still fire.
- Metric-based alerts: threshold or anomaly alerts on the four golden signals (latency, traffic, errors, saturation), typically implemented as Prometheus alerting rules routed through Alertmanager, Datadog monitors, or CloudWatch alarms. For resource-level views, the USE method (Utilisation, Saturation, Errors) complements these signals. Metric alerts fire fast but need good SLO-based thresholds: alerting on CPU at 80% catches almost nothing meaningful, whereas alerting on SLO error-budget burn rate catches what users feel.
- User reports: the worst detection mechanism. By the time a user reports an issue via support ticket, you have already missed your TTD target by minutes or hours. User reports are a signal that your monitoring has a gap.
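To make burn-rate alerting concrete, here is a minimal sketch of a two-window burn-rate check. The `Window` type, the function names, and the default threshold of 14.4 (the rate at which 2% of a 30-day error budget is spent in one hour) are illustrative assumptions, not settings taken from this lesson or from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one lookback window (hypothetical type)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if window.total == 0:
        return 0.0
    error_ratio = window.errors / window.total
    budget = 1.0 - slo_target  # allowed error ratio, e.g. about 0.001 for 99.9%
    return error_ratio / budget


def should_page(long_w: Window, short_w: Window, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window shows the problem is significant; the short window shows
    it is still happening. The threshold is an assumed example value.
    """
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)
```

Requiring both windows to exceed the threshold is one common way to avoid paging on a spike that has already ended; tune the windows and threshold to your own SLO period.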
Phase 2: Triage
Triage is the 2-to-5-minute window after detection in which you answer three questions: How bad is it? How many users or systems are affected? Who owns the response? The output of triage is a severity level (on your organisation's S0/S1/S2/S3 or P0/P1/P2 scale) and an assigned incident commander. Getting this wrong is expensive: under-triaging a critical outage wastes precious minutes, while over-triaging a minor blip burns people out and erodes response discipline.
Effective triage uses your dashboards and logs, not intuition. A trained on-call engineer looks at the golden signals first — latency, error rate, throughput — and then at the scope: is this one region, one availability zone, one service, or a cascade? Cross-referencing the error rate spike with a recent deploy (git log --oneline -20 or your deployment tracking tool) takes thirty seconds and often gives you 80% of the answer.
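The deploy cross-check described above can be sketched as a small filter over deployment records. The `Deploy` type, the `suspect_deploys` function, and the 30-minute lookback are hypothetical; a real deployment tracker will expose its own record format.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Deploy(NamedTuple):
    """One deployment record (hypothetical shape)."""
    service: str
    version: str
    finished_at: datetime


def suspect_deploys(deploys: List[Deploy], spike_start: datetime,
                    lookback: timedelta = timedelta(minutes=30)) -> List[Deploy]:
    """Return deploys that finished within `lookback` before the spike, newest first."""
    window_start = spike_start - lookback
    candidates = [d for d in deploys
                  if window_start <= d.finished_at <= spike_start]
    return sorted(candidates, key=lambda d: d.finished_at, reverse=True)
```

An empty result does not rule out a deploy as the cause (slow-burn regressions exist), but a non-empty one is usually the first thing worth checking.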
Phase 3: Communication
Communication begins in parallel with triage, not after it. This surprises engineers who think "I should understand the problem before saying anything." At Google and most big-tech companies, the convention is the opposite: open an incident channel immediately, post a brief initial assessment ("investigating elevated error rates on payments API, likely related to 14:32 deploy, stand by for update in 10 minutes"), and update on a cadence. Silence in an incident channel is worse than an uncertain update — stakeholders fill silence with worst-case assumptions, which triggers escalation chains that distract your engineers from fixing the problem.
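As a rough illustration of the "post early, commit to a cadence" convention, the sketch below formats a first incident-channel message that always states when the next update is due. The function name, message wording, and 10-minute default are assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import Optional


def initial_update(now: datetime, summary: str,
                   suspected_cause: Optional[str] = None,
                   cadence: timedelta = timedelta(minutes=10)) -> str:
    """Build a first incident-channel post that commits to a next update time."""
    next_update = now + cadence
    parts = [f"[{now:%H:%M}] Investigating {summary}."]
    if suspected_cause:
        # Label the cause as unconfirmed so readers do not treat it as fact.
        parts.append(f"Possibly related to {suspected_cause} (unconfirmed).")
    parts.append(f"Next update by {next_update:%H:%M}.")
    return " ".join(parts)
```

Putting the next-update time in every message is what keeps the channel from going silent: the commitment is visible to stakeholders and to the responder who made it.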
Phase 4: Mitigation
Mitigation and root cause analysis are separate activities, and confusing them is one of the most common production mistakes. Mitigation means stopping user impact as fast as possible, by any means available. Root cause analysis comes later. A team that spends 40 minutes tracing the exact cause of a database deadlock while users cannot check out is making the wrong tradeoff — roll back the deploy first, then investigate.
The canonical mitigation toolkit in order of preference:
- Rollback or revert: if a deploy caused the issue, roll it back. This is the fastest and most reliable mitigation for a large class of incidents. Your deployment pipeline must be able to complete a rollback in under two minutes for this to be effective.
- Feature flag / circuit breaker: disable the specific feature or dependency that is failing without a full redeploy. LaunchDarkly, Statsig, or a simple database-backed flag can cut scope in seconds.
- Traffic shifting: redirect traffic away from the failing region or service version. Weighted routing at layer 7 (Istio VirtualService route weights, Kubernetes Gateway API HTTPRoute backend weights, or ALB weighted target groups) gives you this without a redeploy.
- Horizontal scaling: if the cause is resource saturation (not a bug), adding capacity buys time, for example with kubectl scale deployment payments --replicas=20. This is a temporising measure, not a fix.
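The order of preference in the toolkit above can be written down as a simple decision function. This is a sketch only: the `IncidentContext` fields, the returned labels, and the "escalate" fallback are assumptions made for illustration, and a real runbook would weigh more signals than four booleans.

```python
from dataclasses import dataclass


@dataclass
class IncidentContext:
    """Facts gathered during triage (hypothetical fields)."""
    recent_deploy: bool           # a deploy landed shortly before impact began
    feature_flag_available: bool  # the failing code path sits behind a flag
    scoped_to_one_region: bool    # impact is confined to one region or version
    resource_saturated: bool      # CPU, memory, or connections are exhausted


def pick_mitigation(ctx: IncidentContext) -> str:
    """Return the first applicable mitigation, checked in order of preference."""
    if ctx.recent_deploy:
        return "rollback"
    if ctx.feature_flag_available:
        return "disable_flag"
    if ctx.scoped_to_one_region:
        return "shift_traffic"
    if ctx.resource_saturated:
        return "scale_out"
    # Nothing in the standard toolkit applies; assumed fallback for this sketch.
    return "escalate"
```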
Phase 5: Resolution
An incident is resolved when two conditions are met: user-facing SLOs have returned to target levels, and the mitigation is stable (not just "seems OK for now"). Resolution is a deliberate declaration, not just the moment the alerts clear. The incident commander explicitly closes the incident, records the end time (critical for calculating MTTR and error budget consumption), and hands off any remaining work to normal engineering channels.
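One way to make "stable, not just OK for now" checkable is to require the SLO to hold for a run of consecutive samples before the incident commander declares resolution. The function name, the per-minute sampling, and the default of 15 samples below are illustrative assumptions.

```python
from typing import Sequence


def is_resolved(error_ratios: Sequence[float], slo_error_ratio: float,
                stable_samples: int = 15) -> bool:
    """True when the most recent `stable_samples` readings are all within the SLO.

    `error_ratios` is assumed to be ordered oldest to newest, e.g. one
    reading per minute since mitigation was applied.
    """
    if len(error_ratios) < stable_samples:
        return False  # not enough evidence yet to call it stable
    recent = error_ratios[-stable_samples:]
    return all(ratio <= slo_error_ratio for ratio in recent)
```

A check like this supports the declaration but does not replace it: the incident commander still closes the incident explicitly and records the end time.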
Avoid the antipattern of "soft closing" — leaving the incident channel open with no declared owner while engineers quietly keep working on it. This obscures your true MTTR metrics and leaves stakeholders uncertain about the system state.
Phase 6: Postmortem
The postmortem is where the lifecycle closes the loop. A blameless postmortem — written within 48-72 hours while memory is fresh — documents the full timeline, the contributing factors (not a single "root cause," because complex systems rarely have one), and a set of action items with owners and due dates. The goal is not to assign fault but to make the system more resilient and the team better prepared. Postmortems are covered in depth in Lesson 7; what matters here is understanding that the postmortem is not optional overhead — it is the mechanism that converts incidents from pure cost into organisational learning.
Key Metrics: TTD, TTM, TTR, MTBF
Every phase of the lifecycle corresponds to a measurement your team should be tracking. These are the industry-standard incident health metrics used at Google, AWS, and Stripe:
- Time to Detect (TTD) — from first user impact to when the on-call engineer is engaged. Target: under 5 minutes for P0/S0. Improved by better alerting and synthetic monitoring.
- Time to Mitigate (TTM) — from engagement to when user impact stops. Target: under 30 minutes for P0. Improved by runbooks, fast rollback tooling, and circuit breakers.
- Time to Resolve (TTR) — from engagement to full system health and incident closure. Can be hours or days if the mitigation was a workaround and the real fix takes time.
- Mean Time Between Failures (MTBF) — average time between incidents of comparable severity. Improved by reliability engineering work surfaced in postmortems.
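The definitions above translate directly into timestamp arithmetic. The sketch below assumes you record four timestamps per incident; the `IncidentTimeline` type and field names are hypothetical, and `mtbf` uses start-to-start gaps between incidents as a simplification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class IncidentTimeline:
    """Key timestamps for one incident (hypothetical record shape)."""
    impact_start: datetime  # first user impact
    engaged: datetime       # on-call engineer engaged
    mitigated: datetime     # user impact stopped
    resolved: datetime      # incident formally closed

    @property
    def ttd(self) -> timedelta:
        return self.engaged - self.impact_start

    @property
    def ttm(self) -> timedelta:
        return self.mitigated - self.engaged

    @property
    def ttr(self) -> timedelta:
        return self.resolved - self.engaged


def mtbf(impact_starts: List[datetime]) -> timedelta:
    """Average gap between consecutive incident start times."""
    if len(impact_starts) < 2:
        raise ValueError("need at least two incidents to compute MTBF")
    ordered = sorted(impact_starts)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)
```

Note that TTD starts at first user impact, not at the first alert, so `impact_start` usually has to be reconstructed from graphs during the postmortem.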
The Incident Lifecycle in Context
The phases above describe the mechanics of a single incident. In practice, your on-call rotation is managing the aggregate of all incidents over a rolling window, measured against your error budget. A team that can reliably detect in under 3 minutes, mitigate in under 20 minutes, and run blameless postmortems with completed action items will, over time, reduce incident frequency and severity — compounding the same way that good engineering compounds. This tutorial covers the tools and practices that make each phase faster and more reliable. The first step is knowing the map.