Incident Management & On-Call

Incident Tooling

Tooling does not replace process, but the right tools remove friction at exactly the moments when friction is most expensive — the first minutes of a P0 when every second of confusion costs money and user trust. At companies like Stripe, GitHub, and Cloudflare, incident tooling is treated as a first-class engineering investment, not an afterthought. This lesson covers the three pillars of modern incident tooling: paging and on-call management (PagerDuty / Opsgenie), ChatOps integration (incident bots and Slack workflows), and automated incident timelines.

The goal of incident tooling: Reduce cognitive load during the worst moments. Every second an engineer spends figuring out how to page someone, where the incident channel is, or what happened five minutes ago is a second not spent fixing the problem. Good tooling makes the right action the easiest action.

PagerDuty and Opsgenie: More Than a Pager

PagerDuty and Opsgenie (now part of Atlassian) are the two dominant on-call management platforms at enterprise scale. Both share the same conceptual model: services receive alerts from monitoring systems, escalation policies define who gets notified and in what order, and schedules define who is on-call at any moment. Understanding the data model is essential because getting it wrong creates silent alert gaps — the production failure mode where an alert fires but nobody receives it.
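The shared data model can be sketched in a few lines of Python. This is a simplified illustration, not either vendor's implementation: the class names (`Shift`, `Schedule`, `EscalationPolicy`) and the example users are hypothetical, and real escalation advances through levels on acknowledgement timeouts rather than only when a level is empty. It shows how a schedule gap at every level yields a silent alert gap.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Shift:
    user: str
    start: datetime
    end: datetime  # exclusive


@dataclass
class Schedule:
    shifts: list[Shift] = field(default_factory=list)

    def on_call(self, at: datetime) -> Optional[str]:
        """Return the user on call at `at`, or None if the schedule has a gap."""
        for shift in self.shifts:
            if shift.start <= at < shift.end:
                return shift.user
        return None


@dataclass
class EscalationPolicy:
    levels: list[Schedule]  # consulted in order

    def first_responder(self, at: datetime) -> Optional[str]:
        """Return the first user any level can reach; None means nobody is paged."""
        for schedule in self.levels:
            user = schedule.on_call(at)
            if user is not None:
                return user
        return None


def utc(day: int, hour: int) -> datetime:
    return datetime(2025, 3, day, hour, tzinfo=timezone.utc)


primary = Schedule([Shift("alice", utc(14, 0), utc(14, 12))])
secondary = Schedule([Shift("bob", utc(14, 0), utc(15, 0))])
policy = EscalationPolicy([primary, secondary])

print(policy.first_responder(utc(14, 6)))   # alice
print(policy.first_responder(utc(14, 18)))  # bob (primary schedule has a gap)
print(policy.first_responder(utc(15, 6)))   # None (silent alert gap)
```

Running a check like `first_responder` across every hour of the coming week is one inexpensive way to surface coverage gaps before an alert finds them.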

PagerDuty alert routing: from monitoring alert source through service deduplication, escalation policy, and on-call schedule to the engineer who gets paged.

The most critical PagerDuty configuration to get right is the pairing of a service's integration key with a working escalation policy. A common production failure is a team that creates a service and points all their alerts at it, but attaches an escalation policy whose schedule has coverage gaps or whose targets have no notification rules: alerts fire, deduplicate, and escalate to nobody, so no human is ever notified. Always verify by sending a test alert through the full chain.

# PagerDuty: create a service via the REST API (Terraform-managed in production)
# This is the imperative equivalent of what the PagerDuty Terraform provider does declaratively

curl -s -X POST https://api.pagerduty.com/services \
  -H "Authorization: Token token=$PD_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "service": {
      "name": "payments-api",
      "description": "Payments service — owns checkout and billing flows",
      "escalation_policy": {
        "id": "P3QLXYZ",
        "type": "escalation_policy_reference"
      },
      "alert_creation": "create_alerts_and_incidents",
      "alert_grouping_parameters": {
        "type": "intelligent"
      },
      "acknowledgement_timeout": 600,
      "auto_resolve_timeout": 14400
    }
  }'

# Opsgenie equivalent: create the service via Terraform
# (provider: opsgenie/opsgenie)
# resource "opsgenie_service" "payments" {
#   name      = "payments-api"
#   team_id   = opsgenie_team.backend.id
# }

Key PagerDuty / Opsgenie Patterns at Big-Tech Scale

Alert grouping and noise reduction. At scale, a single infrastructure failure can produce hundreds of alerts per minute. Both platforms offer grouping: PagerDuty's "Intelligent Alert Grouping" (machine learning based) and Opsgenie's "Alert Policies" merge related alerts into one incident. Without grouping, an on-call engineer receives 200 pages in 60 seconds for what is ultimately one dead database — alert fatigue sets in within minutes and the engineer starts acking without reading.
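To make the effect of grouping concrete, here is a minimal sketch of key-plus-time-window grouping. It is a deliberately simple rule, not PagerDuty's machine-learning grouping or Opsgenie's policy engine; the `group_alerts` function, the `Incident` class, and the five-minute window are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    key: str
    first_seen: float
    alerts: list[str] = field(default_factory=list)


def group_alerts(alerts, window_seconds=300.0):
    """Group (timestamp, group_key, summary) tuples, sorted by timestamp.

    Alerts sharing a key within `window_seconds` of the incident's first
    alert are merged into that incident; later alerts open a new one.
    """
    open_incidents: dict[str, Incident] = {}
    incidents: list[Incident] = []
    for ts, key, summary in alerts:
        current = open_incidents.get(key)
        if current is None or ts - current.first_seen > window_seconds:
            current = Incident(key=key, first_seen=ts)
            open_incidents[key] = current
            incidents.append(current)
        current.alerts.append(summary)
    return incidents


# 200 alerts in 200 seconds from one dead database, plus one unrelated alert.
alerts = [(float(i), "db-primary", f"connection refused from app-{i}") for i in range(200)]
alerts.append((30.0, "cdn", "edge 5xx spike"))
alerts.sort(key=lambda a: a[0])

incidents = group_alerts(alerts)
print(len(incidents))            # 2
print(len(incidents[0].alerts))  # 200
```

The on-call engineer sees two incidents instead of 201 pages, which is the whole point of grouping.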

Maintenance windows. Suppress alerts during planned maintenance. Forgetting to open a maintenance window before a database upgrade is a common cause of unnecessary P1 pages at mid-to-large organisations. Both platforms support recurring windows and API-driven creation, so your deploy pipelines can open and close windows automatically.
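A deploy pipeline step might build the window request like this. The sketch assumes the request body shape of PagerDuty's `POST /maintenance_windows` REST endpoint as commonly documented; the helper name `maintenance_window_payload` and the service ID `PABC123` are hypothetical, so check the field names against the current API reference before relying on them.

```python
from datetime import datetime, timedelta, timezone


def maintenance_window_payload(service_ids, start, duration_minutes, description):
    """Build a request body for a PagerDuty maintenance window (assumed shape)."""
    end = start + timedelta(minutes=duration_minutes)
    return {
        "maintenance_window": {
            "type": "maintenance_window",
            "start_time": start.isoformat(),
            "end_time": end.isoformat(),
            "description": description,
            "services": [
                {"id": sid, "type": "service_reference"} for sid in service_ids
            ],
        }
    }


start = datetime(2025, 3, 14, 2, 0, tzinfo=timezone.utc)
payload = maintenance_window_payload(["PABC123"], start, 45, "Postgres minor upgrade")
print(payload["maintenance_window"]["end_time"])  # 2025-03-14T02:45:00+00:00
```

Giving the window an explicit end time means a pipeline that crashes mid-deploy cannot leave alerts suppressed indefinitely.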

Response plays (PagerDuty) / Notification policies (Opsgenie). Pre-defined response workflows that auto-add responders, create a conference bridge, and post to Slack when a P0 fires. This eliminates the "who do I call?" scramble by making the right team assembly automatic.

Always use a dead-man's switch for your alerting pipeline itself. Opsgenie's Heartbeats feature expects a periodic ping from your Alertmanager; if the ping stops arriving, Opsgenie alerts you. PagerDuty setups commonly get the same effect by routing an always-firing alert (for example, the Watchdog alert that ships with kube-prometheus) to a heartbeat-monitoring service that pages through PagerDuty when the signal goes quiet. Without this, a crashed Alertmanager is invisible: your monitoring is down and you do not know it. With Opsgenie, configure it as a ping to https://api.opsgenie.com/v2/heartbeats/<NAME>/ping every 60 seconds from a cron job that runs independently of your main stack.
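The receiving side of a dead-man's switch is simple enough to sketch. This is an illustration of the idea only, not how any vendor implements it; the `DeadMansSwitch` class and the 180-second timeout are assumptions, and the injected clock exists so the example runs deterministically.

```python
import time


class DeadMansSwitch:
    """Reports expiry when no ping has arrived within `timeout_seconds`."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_ping = clock()

    def ping(self):
        """Record a heartbeat from the monitored alerting pipeline."""
        self._last_ping = self._clock()

    def expired(self):
        """True when the pipeline has been silent for longer than the timeout."""
        return self._clock() - self._last_ping > self.timeout_seconds


# Simulated clock so the example does not need to sleep.
now = [0.0]
switch = DeadMansSwitch(timeout_seconds=180, clock=lambda: now[0])

now[0] = 60.0
switch.ping()             # heartbeat arrives on schedule
now[0] = 200.0
print(switch.expired())   # False: 140 seconds since the last ping
now[0] = 300.0
print(switch.expired())   # True: 240 seconds of silence, page someone
```

The timeout is set to a few missed intervals rather than one, so a single delayed ping does not page anyone.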

ChatOps: Incident Bots and Slack Integration

ChatOps is the practice of driving operational workflows through a chat platform — creating incidents, running diagnostics, executing remediations, and posting status updates, all from within Slack (or Teams). At companies like GitHub, Shopify, and LinkedIn, the chat channel is the incident war room: everything that happens during an incident is visible in one scrollable thread, creating an automatic audit trail and keeping distributed teams synchronized.

The core ChatOps capabilities for incident management are:

Incident declaration: A slash command (/incident declare payments-api-down sev1) creates the PagerDuty incident, creates a dedicated Slack channel, invites the on-call engineer and relevant stakeholders, and posts the first status update — all in one action.
Runbook lookup: /runbook payments high-error-rate posts the relevant runbook link and key commands directly into the channel, so engineers do not have to leave the incident context to search a wiki.
Status page updates: /statuspage update major_outage "Investigating elevated error rates on checkout" pushes to your Atlassian Statuspage or Cachet instance without leaving Slack.
Escalation: /page @backend-team This is a P0, need immediate help triggers PagerDuty and pulls additional responders without anyone needing to know the team's rotation schedule.
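Each of the commands above starts with the bot turning free-form slash-command text into a structured request. A minimal sketch of that parsing step for the declare command follows; the `IncidentCommand` type, the accepted severity labels, and the `sev2` default are assumptions for illustration rather than any particular bot's behaviour.

```python
import shlex
from dataclasses import dataclass

VALID_SEVERITIES = {"sev1", "sev2", "sev3"}


@dataclass
class IncidentCommand:
    action: str
    title: str
    severity: str


def parse_incident_command(text, default_severity="sev2"):
    """Parse text such as 'declare payments-api-down sev1'.

    The trailing severity is optional; quoted titles may contain spaces.
    """
    tokens = shlex.split(text)
    if len(tokens) < 2:
        raise ValueError("usage: declare <title> [sev1|sev2|sev3]")
    action, rest = tokens[0], tokens[1:]
    severity = default_severity
    if len(rest) > 1 and rest[-1].lower() in VALID_SEVERITIES:
        severity = rest.pop().lower()
    return IncidentCommand(action=action, title=" ".join(rest), severity=severity)


print(parse_incident_command("declare payments-api-down sev1"))
print(parse_incident_command('declare "checkout latency" SEV1'))
print(parse_incident_command("declare payments-api-down"))
```

Rejecting malformed input with a usage message matters more here than in most tools, because the person typing is under pressure and needs an immediate, specific correction.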

The most widely deployed incident bots in large organisations are Rootly, FireHydrant, and custom bots built on the Slack Bolt SDK. All three follow the same model: they listen for slash commands, call the PagerDuty / Opsgenie API, manage channel lifecycle, and drive status updates.

# Slack Bolt (Python): minimal incident bot skeleton
# Production bots add retries, DB persistence, and status page integration

import os

import requests
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])
PD_TOKEN = os.environ["PD_API_TOKEN"]
PD_SERVICE_ID = os.environ["PD_SERVICE_ID"]
# Assumed extra env var: PagerDuty requires a valid user email in the From header
PD_FROM_EMAIL = os.environ["PD_FROM_EMAIL"]

@app.command("/incident")
def declare_incident(ack, command, client, say, respond):
    ack()  # acknowledge first; Slack expects a reply within 3 seconds
    parts = command["text"].split()     # e.g. "declare payments-api-down sev1"
    if parts and parts[0] == "declare":
        parts = parts[1:]               # drop the subcommand
    if not parts:
        respond("Usage: /incident declare <title> [sev]")
        return
    title = parts[0]
    sev = parts[1] if len(parts) > 1 else "sev2"

    # 1. Create PagerDuty incident
    pd_resp = requests.post(
        "https://api.pagerduty.com/incidents",
        headers={
            "Authorization": f"Token token={PD_TOKEN}",
            "Content-Type": "application/json",
            "From": PD_FROM_EMAIL,
        },
        json={"incident": {"type": "incident", "title": title, "service": {"id": PD_SERVICE_ID, "type": "service_reference"}, "urgency": "high"}},
        timeout=10,
    )
    pd_resp.raise_for_status()          # surface API errors instead of a KeyError below
    incident = pd_resp.json()["incident"]
    inc_id = incident["id"]
    inc_url = incident["html_url"]

    # 2. Create dedicated Slack channel and invite the person who declared
    chan = client.conversations_create(name=f"inc-{inc_id.lower()}")
    chan_id = chan["channel"]["id"]
    client.conversations_invite(channel=chan_id, users=command["user_id"])

    # 3. Post initial message
    client.chat_postMessage(channel=chan_id, text=f":rotating_light: *{sev.upper()} Incident Declared*\n*Title:* {title}\n*PD:* {inc_url}\n*IC:* <@{command['user_id']}>\n*Status:* Investigating")
    say(f"Incident declared. Channel: <#{chan_id}> | PD: {inc_url}")

if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()

Production pitfall — bot token scope creep. Incident bots accumulate Slack scopes over time as features are added. A bot that starts with channels:write and chat:write often ends up with files:write, users:read, and admin.conversations:write by the time it reaches production. Audit your bot's OAuth scopes quarterly. A compromised bot token with admin scopes is a significant security incident — far worse than an alert gap.
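A quarterly audit can be as small as comparing the scopes a token actually holds against an allowlist. The sketch below assumes you already have the granted scopes as a comma-separated string (Slack Web API responses generally report them in an `x-oauth-scopes` header); the allowlist contents and the `audit_scopes` helper are hypothetical.

```python
# Scopes this bot is expected to need; adjust to your own bot's feature set.
ALLOWED_SCOPES = {"channels:manage", "chat:write", "commands"}


def audit_scopes(granted, allowed=ALLOWED_SCOPES):
    """Return the granted scopes that are not on the allowlist, sorted."""
    granted_set = {scope.strip() for scope in granted.split(",") if scope.strip()}
    return sorted(granted_set - set(allowed))


unexpected = audit_scopes("chat:write,commands,files:write,admin.conversations:write")
print(unexpected)  # ['admin.conversations:write', 'files:write']
```

Run as a scheduled job, a non-empty result becomes a ticket rather than something discovered after a token leak.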

Incident Timelines: The Automatic Audit Trail

An incident timeline is a chronological record of every meaningful event during an incident: when the alert fired, when the IC was paged and acknowledged, when each hypothesis was tested, when the rollback happened, when SLOs recovered, when the incident was closed. At Google and Stripe, this timeline is populated automatically from tool integrations and serves as the primary input for postmortem writing.

The timeline is valuable for three reasons:

Postmortem accuracy: Human memory degrades fast under stress. Within a day, engineers routinely misremember both the order and the timing of events. An automatic timeline removes the need to reconstruct them from memory.
Metrics calculation: Time to detect (TTD), time to mitigate (TTM), and time to resolve (TTR) are computed from timeline timestamps. Manual TTR reporting is almost always optimistic, because teams tend to remember resolution as earlier than it was.
Pattern analysis: Across tens or hundreds of incidents, timelines reveal systemic patterns: which team consistently takes 45 minutes to acknowledge, which service is always involved in cascades, which runbook step is always skipped.
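The metrics calculation is plain timestamp arithmetic once the timeline exists. This sketch assumes all three metrics (time to detect, time to mitigate, time to resolve) are measured from the start of user impact; some teams measure TTM and TTR from detection instead, so treat the definitions, the event names, and the sample timestamps as illustrative.

```python
from datetime import datetime


def incident_metrics(timeline):
    """Compute TTD, TTM, and TTR in minutes from ISO 8601 timeline timestamps."""
    t = {name: datetime.fromisoformat(ts) for name, ts in timeline.items()}
    started = t["impact_started"]

    def minutes_since_start(event):
        return (t[event] - started).total_seconds() / 60

    return {
        "ttd_minutes": minutes_since_start("alert_fired"),
        "ttm_minutes": minutes_since_start("mitigated"),
        "ttr_minutes": minutes_since_start("resolved"),
    }


timeline = {
    "impact_started": "2025-03-14T14:30:00+00:00",
    "alert_fired": "2025-03-14T14:34:00+00:00",
    "mitigated": "2025-03-14T14:43:00+00:00",
    "resolved": "2025-03-14T15:05:00+00:00",
}
metrics = incident_metrics(timeline)
print(metrics)  # {'ttd_minutes': 4.0, 'ttm_minutes': 13.0, 'ttr_minutes': 35.0}
```

Because the inputs are tool-generated timestamps rather than recollections, the same function can be run over every past incident to produce trend data.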

Modern platforms (Rootly, FireHydrant, PagerDuty Operations Cloud) auto-populate timelines by integrating with PagerDuty (alert timestamps), Slack (message timestamps), GitHub (deploy events), and your observability stack (when SLOs crossed thresholds). The result is a timeline accurate to the second, requiring zero manual effort during the incident itself.

# FireHydrant REST API: add a manual timeline event during an incident
# (Auto-events come from integrations; manual entries are for decisions/hypotheses)
# Confirm the endpoint path and "type" values against FireHydrant's current API reference

curl -X POST "https://api.firehydrant.io/v1/incidents/$INCIDENT_ID/milestone_updates" \
  -H "Authorization: Bearer $FH_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "occurred_at": "2025-03-14T14:42:00Z",
    "body": "Hypothesis: deploy abc123 introduced N+1 query in checkout flow. Rolling back now.",
    "type": "hypothesis"
  }'

# PagerDuty: add a note (appears in incident timeline and postmortem)
curl -X POST "https://api.pagerduty.com/incidents/$PD_INCIDENT_ID/notes" \
  -H "Authorization: Token token=$PD_API_TOKEN" \
  -H "From: oncall@example.com" \
  -H "Content-Type: application/json" \
  -d '{
    "note": {
      "content": "Rolled back to v1.4.2 at 14:43 UTC. Monitoring error rate — expect recovery within 3 minutes."
    }
  }'

Choosing and Integrating Your Toolchain

The canonical production incident toolchain at a well-run organisation looks like this: Prometheus / Datadog → PagerDuty / Opsgenie → Slack (incident bot) → Statuspage → Rootly / FireHydrant (timeline + postmortem). Each tool does one thing well and passes context to the next via webhooks and APIs. The key integration points are:

Monitoring → PagerDuty: Use the Events v2 API, not v1. v2 supports dedup keys, severity from the alert payload, and alert grouping. Set dedup_key to a stable identifier (service + alert name) to prevent duplicate incidents on flapping alerts.
PagerDuty → Slack: Use PagerDuty's native Slack integration or a webhook to your incident bot. The bot should auto-create a channel when an incident is triggered, not when it is acknowledged — delay here costs minutes.
Slack → Status page: Every status page update should be a single command, not a three-step manual process. Engineers under pressure skip steps; automation does not.
All tools → Timeline aggregator: Rootly, FireHydrant, or your own timeline service subscribes to webhooks from all sources and merges them into a single chronological view keyed on incident ID.
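For the first integration point, the stable dedup key is what keeps a flapping alert from opening a new incident on every flap. Below is a minimal sketch of building an Events API v2 trigger event; the top-level fields (`routing_key`, `event_action`, `dedup_key`, `payload`) follow the v2 event format, while the `build_trigger_event` helper, the `service/alert-name` key convention, and the sample values are assumptions.

```python
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, service, alert_name, summary, severity="critical"):
    """Build an Events API v2 trigger event with a stable dedup key.

    The dedup key depends only on service and alert name, so repeated firings
    of the same alert update one incident instead of creating new ones.
    """
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(VALID_SEVERITIES)}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": f"{service}/{alert_name}",
        "payload": {
            "summary": summary,
            "source": service,
            "severity": severity,
        },
    }


first = build_trigger_event("R0UTINGKEY", "payments-api", "HighErrorRate", "5xx rate 12%")
second = build_trigger_event("R0UTINGKEY", "payments-api", "HighErrorRate", "5xx rate 15%")
print(first["dedup_key"] == second["dedup_key"])  # True
```

Each event would be sent as JSON to https://events.pagerduty.com/v2/enqueue. Keeping volatile values such as the current error rate out of the dedup key is the design choice that matters; they belong in the summary.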

Start with PagerDuty + Slack, then iterate. The temptation at early-stage companies is to build a custom incident bot before the process is defined. Resist it. Use PagerDuty's native Slack integration and a simple channel-naming convention for six months. By then you will know exactly which automations save real time versus which look good in a demo. Build or adopt a full bot only when the friction points are empirically clear.