Incident Management & On-Call

Incident Command

When a production system goes down, the natural human response is for everyone to jump in at once: Slack fills with questions, multiple engineers start poking at the same database, someone deploys a hotfix nobody reviewed, and the incident drags on for hours longer than it should. This chaos is not unique to small teams. It can happen at companies the size of Google, Netflix, or AWS whenever an incident lacks a clear command structure. The Incident Command System (ICS) exists precisely to prevent this, and understanding it can be the difference between a coordinated response that mitigates in 20 minutes and a free-for-all that drags into a war room at 3 AM.

Where ICS Comes From and Why DevOps Adopted It

ICS was developed in the 1970s by California wildfire-fighting agencies after a series of catastrophic fires. Reviews of the response found that the failures stemmed less from a lack of resources or tactics than from coordination breakdowns between agencies: incompatible radio frequencies, no unified command, multiple people issuing contradictory orders to the same personnel. Sound familiar?

FEMA later formalized ICS as part of the US National Incident Management System (NIMS). Google's SRE team studied it, stripped out the firefighting-specific terminology, and adapted the core structure for software incidents. PagerDuty, Atlassian, and many other SRE organizations now publish their own variants, but the underlying structure is nearly identical: one Incident Commander, clearly separated functional roles, a single source of truth for status, and strict communications discipline.

ICS is a coordination protocol, not a blame hierarchy. The Incident Commander need not be the most senior engineer in the room. They are whoever currently fills that role, possibly a mid-level engineer or a newly trained IC. Seniority and command authority during an incident are explicitly separated because conflating them pulls senior engineers into coordination work when their value lies in deep technical diagnosis.

The Three Core Roles

Every SEV-1 (or more severe) incident should have these three roles filled by distinct people from the moment the incident is declared; smaller incidents can combine roles, as described under Scaling the Structure for Incident Size. Once all three roles are in play, one person must never hold more than one of them simultaneously. That is the single most common structural failure in real incidents.

Incident Commander (IC). The IC owns the incident end-to-end. They do not fix anything. Their entire job is to maintain a shared mental model of the incident across all participants, ensure every action taken has an owner and a deadline, decide when to escalate or de-escalate severity, authorize high-risk mitigation steps (e.g., "yes, take the payments service down to drain the bad connections"), and drive the incident toward resolution. The IC is the last person to leave the incident bridge. They write the initial draft of the postmortem. In Slack-based incidents, the IC pins the status message and controls the incident channel. The IC must be trained and the role must be rotated on call — you cannot improvise an IC during a SEV-1.

Communications Lead (Comms). The Comms lead is the IC's voice to the outside world. They own all external and cross-functional communication: status page updates, stakeholder notifications, executive escalations, and customer-facing messaging. The Comms lead reads the incident channel and translates technical developments into plain-language status updates on a fixed cadence (typically every 15–30 minutes during active incidents). They shield the IC and Ops lead from communication interruptions. PagerDuty's Incident Response documentation splits this work between a Customer Liaison and an Internal Liaison. Some organizations combine it with the Scribe role, whose primary function is documenting rather than broadcasting, though best practice separates the two when the incident is large enough.

Operations Lead (Ops Lead). The Ops lead is the technical coordinator. They direct the subject-matter experts (SMEs) who are actually diagnosing and remediating. The Ops lead translates the IC's strategic decisions ("we need to identify root cause within 20 minutes") into specific technical tasks with owners ("@alice, check the DB replica lag; @bob, pull the trace IDs for the failing requests"). The Ops lead maintains the technical timeline — what was tried, what was ruled out, what is currently being tested — and surfaces blockers to the IC. They do not do hands-on remediation themselves; that is the SMEs' job.
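
The rule that every action has an owner and a deadline can be made concrete with a small tracker. The following is a minimal sketch, not a standard schema: the `Task` fields and the `overdue` helper are illustrative names chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Task:
    """One unit of incident work (illustrative fields, not a standard schema)."""
    description: str
    owner: str          # a single named person, e.g. "@alice"
    due: datetime       # when the Ops lead expects a result
    done: bool = False


def overdue(tasks: list[Task], now: datetime) -> list[Task]:
    """Return open tasks whose deadline has passed, earliest deadline first."""
    late = [t for t in tasks if not t.done and t.due < now]
    return sorted(late, key=lambda t: t.due)


start = datetime(2025, 6, 11, 14, 30)
tasks = [
    Task("Check DB replica lag", "@alice", start + timedelta(minutes=15)),
    Task("Pull trace IDs for failing requests", "@bob", start + timedelta(minutes=10)),
]
tasks[0].done = True  # @alice has reported back

for t in overdue(tasks, now=start + timedelta(minutes=20)):
    print(f"{t.owner}: '{t.description}' was due at {t.due:%H:%M}")
```

An overdue task is exactly the kind of blocker the Ops lead would surface to the IC.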

The Command Structure in Practice

[Figure: Incident Command Structure. The Incident Commander (owns the incident end-to-end) sits above the Communications Lead (status page, stakeholders, executives) and the Operations Lead (directs SMEs, keeps the technical timeline). The Operations Lead directs SMEs such as Infra (DB, networking) and App (service owners). The Communications Lead feeds the public status page and internal stakeholders (exec, sales, support). All roles read and post in a single shared Slack channel, #incident-2025-06-11. Solid lines show the command/reporting chain; the others show communication channels with no command authority.]
The Incident Command structure: one IC, one Comms lead, one Ops lead, with SMEs directed by the Ops lead. All roles share a single incident channel.

Communication Discipline: The Incident Channel

ICS does not work without strict communication discipline. All incident communication happens in a single, purpose-created Slack channel (or equivalent). The rules are not suggestions — they are protocol:

  • Every status update is prefixed with a timestamp and role: [IC 14:32] Root cause identified: Redis connection pool exhausted. Ops lead is coordinating drain + restart.
  • Questions to specific people use @mention and include a deadline: @alice — DB lag reading by 14:45?
  • Hypotheses and half-formed thoughts go to a separate thread or voice bridge — the main channel is for facts, decisions, and status updates only.
  • The Comms lead posts externally-facing updates as pinned messages: [COMMS 14:35] Status page updated: investigating elevated error rates on checkout. ETA: 15 min.
  • Nobody troubleshoots in the incident channel. The Ops lead assigns tasks; SMEs report results back.
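
The message format above is simple enough to generate and check mechanically. The sketch below assumes three role tags (`IC`, `COMMS`, `OPS`); the tag set and both function names are hypothetical and should be adapted to local conventions.

```python
import re
from datetime import datetime

ROLES = {"IC", "COMMS", "OPS"}  # assumed tags; adapt to your own role names

# Matches "[ROLE HH:MM] text" on a single line.
_STATUS_RE = re.compile(r"^\[(?P<role>[A-Z]+) (?P<time>\d{2}:\d{2})\] (?P<text>.+)$")


def format_status(role: str, when: datetime, text: str) -> str:
    """Build a status line such as '[IC 14:32] Root cause identified'."""
    role = role.upper()
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    return f"[{role} {when:%H:%M}] {text.strip()}"


def parse_status(line: str):
    """Return (role, 'HH:MM', text), or None if the line is not a status update."""
    m = _STATUS_RE.match(line.strip())
    if m is None or m["role"] not in ROLES:
        return None
    return m["role"], m["time"], m["text"]


line = format_status("ic", datetime(2025, 6, 11, 14, 32),
                     "Root cause identified: Redis connection pool exhausted.")
print(line)
print(parse_status("maybe it's DNS?"))  # not a status update
```

A bot could use a check like `parse_status` to nudge unprefixed messages out of the main channel and into a thread.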

PagerDuty and Jira Service Management both support automated incident channel creation. When an alert fires and an incident is declared, the integration creates a Slack channel named after the date and service (for example, #incident-YYYYMMDD-service), posts the alert context as the first message, and optionally pages the IC. Set this up so you are not manually creating channels during a SEV-1 at 2 AM.

    # PagerDuty Slack integration: auto-create incident channel via webhook
    # (Store this in your PagerDuty service extension config)
    # Values in {{ ... }} are template placeholders filled in by your tooling.

    # 1. Create the incident channel.
    POST https://slack.com/api/conversations.create
    Authorization: Bearer xoxb-YOUR-BOT-TOKEN
    Content-Type: application/json

    {
      "name": "incident-{{ trigger_time_yyyymmdd }}-{{ service_slug }}",
      "is_private": false
    }

    # 2. Immediately post context to the new channel.
    POST https://slack.com/api/chat.postMessage
    Authorization: Bearer xoxb-YOUR-BOT-TOKEN
    Content-Type: application/json

    {
      "channel": "{{ new_channel_id }}",
      "text": ":rotating_light: *SEV-{{ severity }}* | {{ incident_title }}\n*IC:* Needs assignment\n*Comms:* Needs assignment\n*Ops Lead:* Needs assignment\n*Status:* Investigating\n*Dashboard:* {{ grafana_url }}\n*Runbook:* {{ runbook_url }}"
    }

    # 3. Pin the message so the status is always visible.
    POST https://slack.com/api/pins.add
    Authorization: Bearer xoxb-YOUR-BOT-TOKEN
    Content-Type: application/json

    {
      "channel": "{{ new_channel_id }}",
      "timestamp": "{{ context_message_ts }}"
    }
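
For teams that script this themselves rather than relying on a vendor integration, the same create, post, pin sequence might look like the following in Python. This is a sketch under stated assumptions: `channel_name`, `slack_call`, and `open_incident_channel` are hypothetical helper names, the bot token needs the relevant Slack scopes, and the `call` parameter is injected so the sequence can be exercised without network access.

```python
import json
import re
import urllib.request
from datetime import datetime


def channel_name(service: str, when: datetime) -> str:
    """Build a Slack-safe name: lowercase letters, digits, hyphens; max 80 chars."""
    slug = re.sub(r"[^a-z0-9]+", "-", service.lower()).strip("-")
    return f"incident-{when:%Y%m%d}-{slug}"[:80]


def slack_call(token: str, method: str, payload: dict) -> dict:
    """POST a JSON payload to a Slack Web API method and return the decoded reply."""
    req = urllib.request.Request(
        f"https://slack.com/api/{method}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def open_incident_channel(call, service: str, severity: int,
                          title: str, when: datetime) -> str:
    """Create the channel, post the context message, pin it; return the channel ID.

    `call(method, payload)` performs one Slack API request and returns its reply.
    """
    created = call("conversations.create",
                   {"name": channel_name(service, when), "is_private": False})
    if not created.get("ok"):
        raise RuntimeError(f"conversations.create failed: {created.get('error')}")
    channel_id = created["channel"]["id"]

    text = (f":rotating_light: *SEV-{severity}* | {title}\n"
            "*IC:* Needs assignment\n*Comms:* Needs assignment\n"
            "*Ops Lead:* Needs assignment\n*Status:* Investigating")
    posted = call("chat.postMessage", {"channel": channel_id, "text": text})
    if not posted.get("ok"):
        raise RuntimeError(f"chat.postMessage failed: {posted.get('error')}")

    # Pin the context message so the status is always visible.
    call("pins.add", {"channel": channel_id, "timestamp": posted["ts"]})
    return channel_id
```

In production the injected callable would be something like `lambda method, payload: slack_call(token, method, payload)`; in a game day or unit test it can be a stub that records the calls.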

Handing Off the IC Role

Long incidents (anything over two to three hours) require IC handoffs. Fatigue degrades decision quality faster than most engineers admit — the IC who has been running a 4-hour SEV-1 at midnight is not making the same quality decisions they were in hour one. Handoffs must be explicit and structured. The outgoing IC provides a verbal or written brief covering: current hypothesis for root cause, what has been tried and ruled out, current status of all active tasks and their owners, any time-boxed commitments made to stakeholders, and the next decision point the incoming IC needs to make. This brief takes five minutes and is worth every second. Undocumented handoffs ("you're in charge now, good luck") are how incidents grow from four hours to eight hours.

The most dangerous moment in a long incident is the handoff. The outgoing IC holds the complete mental model of the incident. The incoming IC holds none of it. Without a structured handoff, the incoming IC spends 30-45 minutes reconstructing context that already exists — during which the incident progresses unsupervised. Organizations that run multiple IC handoffs per year should maintain a written handoff template in their incident runbook and require it to be completed before the outgoing IC leaves the bridge.
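
One way to enforce a completed handoff template is to treat the brief as a small record and refuse the handoff while any field is blank. The sketch below mirrors the five items listed for the outgoing IC's brief; the class and field names are illustrative, not taken from any particular tool.

```python
from dataclasses import dataclass, fields


@dataclass
class HandoffBrief:
    """The five items the outgoing IC covers before leaving the bridge."""
    root_cause_hypothesis: str = ""
    tried_and_ruled_out: str = ""
    active_tasks_and_owners: str = ""
    stakeholder_commitments: str = ""
    next_decision_point: str = ""

    def missing(self) -> list[str]:
        """Names of fields that are still blank, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def ready(self) -> bool:
        """True only when every field has been filled in."""
        return not self.missing()


brief = HandoffBrief(root_cause_hypothesis="Redis connection pool exhausted")
if not brief.ready():
    print("Handoff blocked, still missing:", ", ".join(brief.missing()))
```

Even as a plain checklist in a runbook, the same rule applies: the outgoing IC stays until nothing is missing.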

Scaling the Structure for Incident Size

Not every incident needs all three roles filled by separate people. ICS scales:

  • SEV-3 / minor: IC only. The IC doubles as their own comms lead (posts a single status update) and directs the one or two engineers involved directly.
  • SEV-2 / significant: IC + Comms lead. The Ops lead role is light — the IC can informally coordinate the small set of SMEs.
  • SEV-1 / critical: All three roles, strictly separated. Multiple SME sub-groups may be spun up (e.g., a separate DB sub-group and a separate network sub-group, each with their own coordinator who reports to the Ops lead).
  • SEV-0 / existential: Full ICS plus a separate Executive Briefing lead who manages VP/C-level stakeholders, and potentially a Legal/PR lead if the incident involves data exposure.
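
The scaling rules above can be encoded as a lookup table plus a staffing check. This is a sketch under two assumptions: the role keys (`ic`, `comms`, `ops`, `exec_briefing`) are hypothetical names, and every role required at a given severity is expected to be held by a different person. The optional Legal/PR lead for SEV-0 is left out.

```python
# Roles that must be staffed at each severity level (keys are illustrative).
REQUIRED_ROLES = {
    3: ["ic"],
    2: ["ic", "comms"],
    1: ["ic", "comms", "ops"],
    0: ["ic", "comms", "ops", "exec_briefing"],
}


def staffing_problems(severity: int, assignment: dict[str, str]) -> list[str]:
    """Return human-readable problems with a role assignment; empty means OK."""
    problems = []
    required = REQUIRED_ROLES[severity]

    # Every required role needs a named holder.
    for role in required:
        if not assignment.get(role):
            problems.append(f"{role} is unassigned")

    # No person may hold more than one required role.
    holders = [assignment[r] for r in required if assignment.get(r)]
    for person in sorted(set(holders)):
        if holders.count(person) > 1:
            problems.append(f"{person} holds more than one role")
    return problems


print(staffing_problems(1, {"ic": "@dana", "comms": "@eli", "ops": "@dana"}))
print(staffing_problems(3, {"ic": "@dana"}))
```

When an incident is escalated, rerunning the check against the new severity shows which roles need to be staffed before work continues.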

The cardinal rule across all sizes: whoever is IC must be explicitly declared and known to everyone in the channel. The first message in any incident channel should be: [IC: @username] Incident declared. Comms: @username. Ops: @username. Status: investigating. If that message does not exist, you do not have ICS — you have a headless incident.

    # Runbook snippet: IC declaration template (paste into incident channel immediately)
    # Keep this as a Slack snippet or PD conference template

    [IC: @your-name] *Incident declared - {{ service }} - SEV-{{ severity }}*
    Comms Lead: @comms-name (or: unassigned - who can take this?)
    Ops Lead: @ops-name (or: unassigned)
    Current hypothesis: {{ initial_hypothesis }}
    Active mitigations: none yet
    Next check-in: {{ now + 15 minutes }}
    Status page: {{ url }}
    Runbook: {{ url }}
    Timeline doc: {{ google_doc_or_confluence_url }}
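
If the declaration is posted by a bot or a chat command rather than pasted by hand, the template can be rendered in a few lines. The function below is a hypothetical sketch: its name, parameters, and the 15-minute default check-in are assumptions drawn from the template, and the URL fields are omitted for brevity.

```python
from datetime import datetime, timedelta
from typing import Optional

UNASSIGNED = "unassigned - who can take this?"


def declaration(ic: str, service: str, severity: int, now: datetime,
                comms: Optional[str] = None, ops: Optional[str] = None,
                hypothesis: str = "unknown", checkin_minutes: int = 15) -> str:
    """Render the first message for the incident channel."""
    next_checkin = now + timedelta(minutes=checkin_minutes)
    return "\n".join([
        f"[IC: {ic}] *Incident declared - {service} - SEV-{severity}*",
        f"Comms Lead: {comms or UNASSIGNED}",
        f"Ops Lead: {ops or UNASSIGNED}",
        f"Current hypothesis: {hypothesis}",
        "Active mitigations: none yet",
        f"Next check-in: {next_checkin:%H:%M}",
    ])


msg = declaration("@dana", "checkout", 1, datetime(2025, 6, 11, 14, 32), comms="@eli")
print(msg)
```

Rendering an explicit "unassigned" line for an empty role keeps the gap visible in the channel instead of silently omitting it.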

The Incident Command System is not bureaucracy for its own sake. Every element of the structure — the clear IC, the separated Comms and Ops leads, the single channel, the explicit declarations — exists because it has a documented history of reducing time-to-resolution and preventing the coordination failures that turn recoverable incidents into extended outages. Train your team, rotate the IC role through the on-call rotation, and run regular incident game days to keep the muscle memory sharp before you need it in production.