Incident Tooling
Tooling does not replace process, but the right tools remove friction at exactly the moments when friction is most expensive — the first minutes of a P0 when every second of confusion costs money and user trust. At companies like Stripe, GitHub, and Cloudflare, incident tooling is treated as a first-class engineering investment, not an afterthought. This lesson covers the three pillars of modern incident tooling: paging and on-call management (PagerDuty / Opsgenie), ChatOps integration (incident bots and Slack workflows), and automated incident timelines.
PagerDuty and Opsgenie: More Than a Pager
PagerDuty and Opsgenie (now part of Atlassian) are the two dominant on-call management platforms at enterprise scale. Both share the same conceptual model: services receive alerts from monitoring systems, escalation policies define who gets notified and in what order, and schedules define who is on-call at any moment. Understanding the data model is essential because getting it wrong creates silent alert gaps — the production failure mode where an alert fires but nobody receives it.
The most critical PagerDuty configuration to get right is the service integration key and the escalation policy. A common production failure is a team that creates a service, points all their alerts at it, but forgets to attach an escalation policy — alerts fire, deduplicate, and silently drop with no human ever notified. Always verify by sending a test alert through the full chain.
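The missing-policy failure described above can also be caught with a local configuration audit before any alert fires. The following is a minimal sketch, assuming a simplified in-memory model: the `Service` and `EscalationPolicy` classes and the `find_alert_gaps` helper are hypothetical and are not PagerDuty or Opsgenie API objects. In practice the data would be exported from the platform's API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EscalationPolicy:
    name: str
    # Each level is the list of responders notified at that escalation step.
    levels: list[list[str]] = field(default_factory=list)


@dataclass
class Service:
    name: str
    escalation_policy: Optional[EscalationPolicy] = None


def find_alert_gaps(services: list[Service]) -> dict[str, str]:
    """Return {service name: reason} for services whose alerts reach nobody."""
    gaps: dict[str, str] = {}
    for svc in services:
        policy = svc.escalation_policy
        if policy is None:
            gaps[svc.name] = "no escalation policy attached"
        elif not any(policy.levels):
            # Covers both "no levels at all" and "every level is empty".
            gaps[svc.name] = "escalation policy has no responders"
    return gaps


primary = EscalationPolicy("payments-primary", levels=[["alice"], ["bob", "carol"]])
empty = EscalationPolicy("legacy", levels=[[]])
services = [
    Service("payments-api", primary),
    Service("checkout-web"),         # policy was never attached
    Service("legacy-batch", empty),  # policy exists but notifies nobody
]
gaps = find_alert_gaps(services)
```

A check like this complements the end-to-end test alert rather than replacing it: it finds configuration gaps, but only a real test alert proves that notifications are actually delivered.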
Key PagerDuty / Opsgenie Patterns at Big-Tech Scale
Alert grouping and noise reduction. At scale, a single infrastructure failure can produce hundreds of alerts per minute. Both platforms offer grouping: PagerDuty's "Intelligent Alert Grouping" (machine learning based) and Opsgenie's "Alert Policies" merge related alerts into one incident. Without grouping, an on-call engineer receives 200 pages in 60 seconds for what is ultimately one dead database — alert fatigue sets in within minutes and the engineer starts acking without reading.
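To make the effect of grouping concrete, here is a simple rule-based sketch. It is not how PagerDuty's Intelligent Alert Grouping or Opsgenie's Alert Policies work internally. It only assumes that each alert carries a hypothetical `group_key` field (for example, the failing dependency) and merges alerts that share a key and arrive within a time window of each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since the start of the storm
    group_key: str    # hypothetical field, e.g. the failing dependency
    summary: str


def group_alerts(alerts: list[Alert], window_seconds: float = 300.0) -> list[list[Alert]]:
    """Merge alerts that share group_key and arrive within window_seconds
    of the previous alert in the same group."""
    groups: list[list[Alert]] = []
    latest: dict[str, list[Alert]] = {}  # group_key -> most recent group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.group_key)
        if current is not None and alert.timestamp - current[-1].timestamp <= window_seconds:
            current.append(alert)
        else:
            current = [alert]
            groups.append(current)
            latest[alert.group_key] = current
    return groups


# 200 alerts in under 60 seconds from one dead database, plus one unrelated alert.
storm = [Alert(i * 0.3, "db-primary", f"query timeout #{i}") for i in range(200)]
storm.append(Alert(30.0, "cdn-edge", "elevated 5xx"))
incidents = group_alerts(storm)
```

With this rule the on-call engineer sees two incidents instead of 201 pages. The choice of `group_key` carries most of the weight: too coarse and unrelated failures merge, too fine and the storm returns.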
Maintenance windows. Suppress alerts during planned maintenance. Forgetting to open a maintenance window before a database upgrade is the single most common cause of unnecessary P1 pages at mid-to-large organisations. Both platforms support recurring windows and API-driven creation so your deploy pipelines can open and close windows automatically.
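The suppression logic behind a maintenance window is small enough to sketch. The example below assumes a hypothetical `MaintenanceWindow` record and `should_page` helper. Real windows are created through the platform's UI or API, which is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class MaintenanceWindow:
    service: str
    start: datetime
    end: datetime

    def covers(self, service: str, at: datetime) -> bool:
        # Half-open interval: the window ends exactly at `end`.
        return service == self.service and self.start <= at < self.end


def should_page(service: str, fired_at: datetime, windows: list[MaintenanceWindow]) -> bool:
    """Page only if no maintenance window covers this service at this time."""
    return not any(w.covers(service, fired_at) for w in windows)


upgrade_start = datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc)
windows = [MaintenanceWindow("orders-db", upgrade_start, upgrade_start + timedelta(hours=1))]

during = should_page("orders-db", upgrade_start + timedelta(minutes=20), windows)
after = should_page("orders-db", upgrade_start + timedelta(hours=2), windows)
other = should_page("payments-api", upgrade_start + timedelta(minutes=20), windows)
```

Scoping the window to a single service is a deliberate choice in this sketch: an unrelated service that breaks during the upgrade should still page.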
Response plays (PagerDuty) / Notification policies (Opsgenie). Pre-defined response workflows that auto-add responders, create a conference bridge, and post to Slack when a P0 fires. This eliminates the "who do I call?" scramble by making the right team assembly automatic.
POST https://events.pagerduty.com/integration/<KEY>/send every 60 seconds from a cron that runs independently of your main stack.
ChatOps: Incident Bots and Slack Integration
ChatOps is the practice of driving operational workflows through a chat platform — creating incidents, running diagnostics, executing remediations, and posting status updates, all from within Slack (or Teams). At companies like GitHub, Shopify, and LinkedIn, the chat channel is the incident war room: everything that happens during an incident is visible in one scrollable thread, creating an automatic audit trail and keeping distributed teams synchronized.
The core ChatOps capabilities for incident management are:
- Incident declaration: A slash command (/incident declare payments-api-down sev1) creates the PagerDuty incident, creates a dedicated Slack channel, invites the on-call engineer and relevant stakeholders, and posts the first status update, all in one action.
- Runbook lookup: /runbook payments high-error-rate posts the relevant runbook link and key commands directly into the channel, so engineers do not have to leave the incident context to search a wiki.
- Status page updates: /statuspage update major_outage "Investigating elevated error rates on checkout" pushes to your Atlassian Statuspage or Cachet instance without leaving Slack.
- Escalation: /page @backend-team This is a P0, need immediate help triggers PagerDuty and pulls additional responders without anyone needing to know the team's rotation schedule.
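The first step of any of these commands is turning the slash-command text into a structured request. Below is a minimal sketch for the declare command, using only the standard library. The `DeclareIncident` type, the set of valid severities, and the `inc-` channel naming scheme are assumptions made for illustration. A real bot would receive the text from its chat framework and then call the paging and chat APIs.

```python
import shlex
from dataclasses import dataclass

VALID_SEVERITIES = {"sev1", "sev2", "sev3"}  # assumed severity scheme


@dataclass(frozen=True)
class DeclareIncident:
    name: str
    severity: str

    @property
    def channel_name(self) -> str:
        # Hypothetical naming convention for the dedicated incident channel.
        return f"inc-{self.name}"


def parse_incident_command(text: str) -> DeclareIncident:
    """Parse the text after '/incident', e.g. 'declare payments-api-down sev1'."""
    parts = shlex.split(text)
    if len(parts) != 3 or parts[0] != "declare":
        raise ValueError("usage: /incident declare <name> <severity>")
    _, name, severity = parts
    severity = severity.lower()
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unknown severity {severity!r}")
    return DeclareIncident(name=name.lower(), severity=severity)


cmd = parse_incident_command("declare payments-api-down sev1")
```

Rejecting malformed input with a usage message matters more here than in most tools: an engineer typing under pressure needs an immediate, specific correction, not a silent failure.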
The most widely deployed incident bots in large organisations are Rootly, FireHydrant, and custom bots built on the Slack Bolt SDK. All three follow the same model: they listen for slash commands, call the PagerDuty / Opsgenie API, manage channel lifecycle, and drive status updates.
A bot that starts with channels:write and chat:write often ends up with files:write, users:read, and admin.conversations:write by the time it reaches production. Audit your bot's OAuth scopes quarterly. A compromised bot token with admin scopes is a significant security incident, far worse than an alert gap.
Incident Timelines: The Automatic Audit Trail
An incident timeline is a chronological record of every meaningful event during an incident: when the alert fired, when the IC was paged and acknowledged, when each hypothesis was tested, when the rollback happened, when SLOs recovered, when the incident was closed. At Google and Stripe, this timeline is populated automatically from tool integrations and serves as the primary input for postmortem writing.
The timeline is valuable for three reasons:
- Postmortem accuracy: Human memory degrades fast under stress. Engineers consistently mis-remember the sequence and timing of events by 10-30% within 24 hours. An automatic timeline eliminates this distortion.
- Metrics calculation: TTD, TTM, and TTR are computed from timeline timestamps. Manual TTR reporting is almost always optimistic — teams tend to remember resolution as earlier than it was.
- Pattern analysis: Across tens or hundreds of incidents, timelines reveal systemic patterns: which team consistently takes 45 minutes to acknowledge, which service is always involved in cascades, which runbook step is always skipped.
Modern platforms (Rootly, FireHydrant, PagerDuty Operations Cloud) auto-populate timelines by integrating with PagerDuty (alert timestamps), Slack (message timestamps), GitHub (deploy events), and your observability stack (when SLOs crossed thresholds). The result is a timeline accurate to the second, requiring zero manual effort during the incident itself.
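The merge-and-measure step can be sketched in a few lines. The example assumes a hypothetical `TimelineEvent` record and event kinds (`impact_started`, `alert_fired`, `mitigated`, `resolved`). It also assumes that TTD, TTM, and TTR are all measured from the start of user impact, which is one common convention but not the only one. The event data is invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime
    source: str  # e.g. "pagerduty", "slack", "github", "observability"
    kind: str    # hypothetical event type
    detail: str = ""


def merge_timeline(*sources: list[TimelineEvent]) -> list[TimelineEvent]:
    """Combine per-tool event lists into one chronological timeline."""
    return sorted((e for src in sources for e in src), key=lambda e: e.at)


def first(timeline: list[TimelineEvent], kind: str) -> datetime:
    """Timestamp of the earliest event of the given kind."""
    return next(e.at for e in timeline if e.kind == kind)


def incident_metrics(timeline: list[TimelineEvent]) -> dict[str, timedelta]:
    start = first(timeline, "impact_started")
    return {
        "ttd": first(timeline, "alert_fired") - start,
        "ttm": first(timeline, "mitigated") - start,
        "ttr": first(timeline, "resolved") - start,
    }


t0 = datetime(2024, 3, 1, 14, 0, tzinfo=timezone.utc)
observability = [
    TimelineEvent(t0, "observability", "impact_started", "SLO burn began"),
    TimelineEvent(t0 + timedelta(minutes=47), "observability", "resolved", "SLO recovered"),
]
pagerduty = [
    TimelineEvent(t0 + timedelta(minutes=4), "pagerduty", "alert_fired"),
    TimelineEvent(t0 + timedelta(minutes=6), "pagerduty", "acknowledged"),
]
github = [TimelineEvent(t0 + timedelta(minutes=31), "github", "mitigated", "rollback deployed")]

timeline = merge_timeline(observability, pagerduty, github)
metrics = incident_metrics(timeline)
```

Because every metric is derived from tool-generated timestamps, none of them depends on anyone's recollection of when the rollback "felt" finished.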
Choosing and Integrating Your Toolchain
The canonical production incident toolchain at a well-run organisation looks like this: Prometheus / Datadog → PagerDuty / Opsgenie → Slack (incident bot) → Statuspage → Rootly / FireHydrant (timeline + postmortem). Each tool does one thing well and passes context to the next via webhooks and APIs. The key integration points are:
- Monitoring → PagerDuty: Use the Events v2 API, not v1. v2 supports dedup keys, severity from the alert payload, and alert grouping. Set dedup_key to a stable identifier (service + alert name) to prevent duplicate incidents on flapping alerts.
- PagerDuty → Slack: Use PagerDuty's native Slack integration or a webhook to your incident bot. The bot should auto-create a channel when an incident is triggered, not when it is acknowledged; delay here costs minutes.
- Slack → Status page: Every status page update should be a single command, not a three-step manual process. Engineers under pressure skip steps; automation does not.
- All tools → Timeline aggregator: Rootly, FireHydrant, or your own timeline service subscribes to webhooks from all sources and merges them into a single chronological view keyed on incident ID.
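For the Monitoring → PagerDuty integration point, the sketch below builds an Events v2 trigger event with a stable dedup_key derived from service and alert name. The helper names, the `service:alert` key format, and the example values are assumptions. The `send_event` function shows one way to POST the event with the standard library, but it is not called here and needs a real routing key to work.

```python
import json
import urllib.request

EVENTS_V2_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key: str, service: str, alert_name: str,
                        summary: str, severity: str) -> dict:
    """Build an Events v2 trigger event with a stable dedup_key."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Same service + alert name -> same key, so a flapping alert
        # updates the existing incident instead of opening a new one.
        "dedup_key": f"{service}:{alert_name}",
        "payload": {"summary": summary, "source": service, "severity": severity},
    }


def send_event(event: dict) -> int:
    """POST the event and return the HTTP status. Not called in this sketch."""
    request = urllib.request.Request(
        EVENTS_V2_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Two firings of the same flapping alert share one dedup_key.
first_fire = build_trigger_event("<KEY>", "payments-api", "high-error-rate",
                                 "Error rate above 5%", "critical")
second_fire = build_trigger_event("<KEY>", "payments-api", "high-error-rate",
                                  "Error rate above 7%", "critical")
```

Keeping volatile values such as the current error rate out of the dedup_key is the point of the design: they belong in the summary, where they can change without splitting the incident.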