Communication During Incidents
When a production system is on fire, the engineering work of fixing it is only half the job. The other half is communication — keeping stakeholders informed, coordinating the response team, and maintaining trust with users. Poor incident communication is one of the most common reasons post-mortems cite "confusion" and "delayed resolution." At Google, PagerDuty, and Cloudflare, incident comms is treated as a distinct discipline with clear ownership, structured cadence, and dedicated tooling.
The Three Communication Channels
Every incident runs three parallel communication streams. Conflating them is a common pitfall that slows resolution:
- Internal operational channel — the incident war room (Slack/Teams channel, Zoom bridge). This is where engineers share raw findings, debate hypotheses, run commands, and coordinate actions. It is noisy by design.
- Internal stakeholder channel — regular updates to Engineering leadership, Product, Customer Success, and Legal. Concise, no jargon, action-oriented. They do not need to know which replica set lost quorum; they need to know impact, ETA, and what they should tell their contacts.
- External customer channel — the public status page. Crafted language, no blame, factual about impact scope, updated on a fixed cadence.
Status Pages: Architecture and Content
A status page is your external communication contract. It must answer three questions: Is anything broken right now? What is the impact? When will it be fixed?
Hosted options (Statuspage.io, Better Stack) are preferred over self-hosted tools such as Cachet because a hosted page stays up when your own infrastructure is down. Configure your status page to auto-update component status from your alerting pipeline; do not rely on humans to flip the toggle when they are already triaging.
Status update language follows a strict formula at production-grade companies. Each external update must contain: time (UTC always), current status (Investigating / Identified / Monitoring / Resolved), impact scope (what % of users, which features, which regions), and a next-update time. Never promise an ETA you cannot keep — update with "continuing to investigate" rather than silence.
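The formula above can be enforced in code so that an update missing a required field never reaches the page. The following is a minimal sketch, not any status page vendor's API: the `StatusUpdate` type, its field names, and the output layout are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# The four states named in the text.
ALLOWED_STATUSES = ("Investigating", "Identified", "Monitoring", "Resolved")


@dataclass
class StatusUpdate:
    """Hypothetical container for one external status page post."""
    posted_at: datetime                   # must be timezone-aware
    status: str                           # one of ALLOWED_STATUSES
    impact: str                           # who and what is affected
    next_update_in: Optional[timedelta]   # may be None only when Resolved


def render_status_update(update: StatusUpdate) -> str:
    """Validate the required fields and render the post with UTC times."""
    if update.status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {update.status!r}")
    if update.posted_at.tzinfo is None:
        raise ValueError("posted_at must be timezone-aware")
    if not update.impact.strip():
        raise ValueError("impact scope is required")

    posted_utc = update.posted_at.astimezone(timezone.utc)
    lines = [
        f"[{posted_utc:%Y-%m-%d %H:%M} UTC] {update.status}",
        f"Impact: {update.impact}",
    ]
    if update.status != "Resolved":
        # Every non-final post commits to a next-update time, not an ETA.
        if update.next_update_in is None:
            raise ValueError("next_update_in is required until Resolved")
        next_utc = posted_utc + update.next_update_in
        lines.append(f"Next update by {next_utc:%H:%M} UTC")
    return "\n".join(lines)
```

Note that the sketch commits only to a next-update time and has no ETA field at all, which is one way to make the "never promise an ETA you cannot keep" rule hard to break by accident.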
Stakeholder Update Cadence
Internal stakeholders need a different rhythm than the public. A cadence commonly used by SRE teams:
- P0 (site down / data loss risk) — immediate page to VP/CTO, then updates every 15 minutes until resolved.
- P1 (major feature degraded) — update every 30 minutes to Engineering director and Customer Success lead.
- P2 (minor degradation, workaround exists) — update every hour; Customer Success notified once at start and once at resolution.
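The cadence above is simple enough to encode as data, which lets a bot or dashboard flag an overdue update. This is a small sketch under the intervals listed above; the names `UPDATE_INTERVALS`, `next_update_due`, and `is_overdue` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Update intervals per severity, taken from the list above.
UPDATE_INTERVALS = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(minutes=30),
    "P2": timedelta(hours=1),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return the time by which the next stakeholder update must be sent."""
    return last_update + UPDATE_INTERVALS[severity]


def is_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    """True once the interval for this severity has fully elapsed."""
    return now > next_update_due(severity, last_update)


if __name__ == "__main__":
    last = datetime(2024, 1, 5, 14, 0, tzinfo=timezone.utc)
    now = datetime(2024, 1, 5, 14, 20, tzinfo=timezone.utc)
    print(is_overdue("P0", last, now))  # 20 minutes since a P0 update
    print(is_overdue("P1", last, now))  # 20 minutes since a P1 update
```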
Send stakeholder updates to a dedicated Slack channel (#incident-updates) using a consistent template so recipients can scan the history quickly.
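One possible shape for such a template is sketched below. The field set (impact, current action, ETA, what to tell contacts, next update) is an assumption drawn from what stakeholders are said to need earlier in this article, and the function name and layout are hypothetical. The sketch only builds the message text; posting it to Slack is left out.

```python
from datetime import datetime, timezone


def format_stakeholder_update(
    severity: str,
    title: str,
    impact: str,
    current_action: str,
    eta: str,
    talking_point: str,
    next_update: datetime,
) -> str:
    """Build a fixed-layout, jargon-free update for #incident-updates."""
    next_utc = next_update.astimezone(timezone.utc)
    return "\n".join(
        [
            f"[{severity}] {title}",
            f"Impact: {impact}",
            f"Current action: {current_action}",
            f"ETA: {eta}",
            f"What to tell contacts: {talking_point}",
            f"Next update: {next_utc:%H:%M} UTC",
        ]
    )


if __name__ == "__main__":
    print(
        format_stakeholder_update(
            severity="P0",
            title="Checkout unavailable",
            impact="All users see errors at payment",
            current_action="Rolling back the latest payment-service release",
            eta="Unknown, still investigating",
            talking_point="We are aware and working on a fix",
            next_update=datetime(2024, 1, 5, 14, 45, tzinfo=timezone.utc),
        )
    )
```

Keeping the field order fixed is the point: a reader scrolling the channel can compare the same line across successive updates.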
Internal War Room Discipline
The operational channel gets chaotic fast. Three practices prevent it from becoming useless noise:
- Thread every hypothesis. Do not paste 50-line stack traces in the main channel. Thread them. The main channel should read as a timeline of decisions.
- Use a bot to pin actions. Every action taken gets logged: /inc action "rolling back payment-service to v2.4.0" @devops-oncall. This creates an audit trail and feeds the post-mortem timeline automatically.
- Silence irrelevant voices. The incident commander (IC) enforces a "speak only if you have data or can take an action" rule. Managers asking "what is the ETA" in the war room should be redirected to #incident-updates.
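The action-logging bot can be sketched as a parser that turns each command into a timestamped timeline entry. This is an illustration only: the grammar is inferred from the single example above (a quoted description followed by one @owner), and `ActionEntry` and `parse_action_command` are hypothetical names rather than any real bot's API.

```python
import shlex
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ActionEntry:
    """One row of the incident timeline."""
    at: datetime
    description: str
    owner: str


def parse_action_command(text: str, now: Optional[datetime] = None) -> ActionEntry:
    """Parse '/inc action "<description>" @<owner>' into a timeline entry."""
    # shlex.split keeps the quoted description together as one token.
    parts = shlex.split(text)
    if (
        len(parts) != 4
        or parts[:2] != ["/inc", "action"]
        or not parts[3].startswith("@")
    ):
        raise ValueError('expected: /inc action "<description>" @<owner>')
    return ActionEntry(
        at=now or datetime.now(timezone.utc),
        description=parts[2],
        owner=parts[3][1:],  # drop the leading "@"
    )


if __name__ == "__main__":
    timeline: List[ActionEntry] = []
    timeline.append(
        parse_action_command(
            '/inc action "rolling back payment-service to v2.4.0" @devops-oncall'
        )
    )
    for entry in timeline:
        print(f"{entry.at:%H:%M} UTC  {entry.owner}: {entry.description}")
```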
Communication Architecture: End-to-End Flow
Resolving and Closing the Comms Loop
When the incident is resolved, every channel needs a closure update — not just the status page. A common failure is resolving the incident in the war room but forgetting to post a "resolved" message in #incident-updates, leaving executives thinking the outage is still ongoing. The IC checklist at resolution:
- Update status page to Resolved, summarize impact duration and root cause in plain language.
- Post final message in #incident-updates with duration, user impact count, and next steps (post-mortem date).
- Send a customer email if the SLA breach threshold was crossed (typically: P0 outage over 15 minutes affecting more than 1% of users).
- Close the war room channel and archive it with an /inc close bot command that timestamps the resolution.
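The customer email rule in the checklist is a three-part condition that is easy to misjudge at the end of a long incident. A minimal sketch follows, assuming the "typical" thresholds quoted above; the function and constant names are hypothetical, and real thresholds come from your own SLA.

```python
from datetime import timedelta

# Thresholds quoted in the checklist above; adjust to your own SLA.
MIN_DURATION = timedelta(minutes=15)
MIN_AFFECTED_PERCENT = 1


def customer_email_required(
    severity: str,
    duration: timedelta,
    affected_users: int,
    total_users: int,
) -> bool:
    """True when a P0 lasted over 15 minutes and hit more than 1% of users."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    # Integer comparison avoids floating-point rounding at the 1% boundary.
    over_share = affected_users * 100 > total_users * MIN_AFFECTED_PERCENT
    return severity == "P0" and duration > MIN_DURATION and over_share
```

All three conditions are strict, matching the wording "over 15 minutes" and "more than 1%": an outage of exactly 15 minutes or exactly 1% of users does not trigger the email in this sketch.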
Avoiding Communication Anti-Patterns
The anti-patterns that appear in post-mortems repeatedly:
- Silent updates — going 45 minutes without a status page post because engineers are deep in triage. Users assume you do not know what is happening. Automate the "still investigating" post on a 15-minute cron if no human posts.
- Technical jargon in external updates — "Cassandra compaction storm causing GC pressure" means nothing to a customer. Translate: "Database performance issues are slowing down search results."
- Premature "resolved" status — declaring resolved before you have verified with synthetic monitoring and real user metrics. A false resolution update followed by a second "still investigating" destroys trust faster than a single long outage.
- Over-communicating in the war room — managers pasting encouragement, non-responders asking for updates. Every message is a notification that pulls an engineer out of flow state.
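The automated "still investigating" fallback from the first anti-pattern can be sketched as a function that a scheduled job calls on each run. This assumes the job knows the time of the last status page post and the current status; `fallback_post` and the message wording are hypothetical, and actually publishing the text is left to whatever status page API you use.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Longest allowed gap between status page posts, per the text above.
MAX_SILENCE = timedelta(minutes=15)


def fallback_post(
    status: str,
    last_post_at: datetime,
    now: datetime,
    max_silence: timedelta = MAX_SILENCE,
) -> Optional[str]:
    """Return a holding message if nobody has posted recently, else None."""
    if status == "Resolved":
        return None  # nothing to say once the incident is closed
    if now - last_post_at < max_silence:
        return None  # a human posted recently enough
    now_utc = now.astimezone(timezone.utc)
    next_utc = now_utc + max_silence
    return (
        f"[{now_utc:%H:%M} UTC] {status}: we are continuing to investigate. "
        f"Next update by {next_utc:%H:%M} UTC."
    )
```

Returning None when a human has posted recently means the automation only fills silence and never talks over the incident commander.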
Production-grade incident communication is a learned skill. The cadence, the language, the channel discipline — these are as engineerable as any system component. Document your comms runbook alongside your technical runbooks, and practice it in game days.