Incident Management & On-Call

Learning from Incidents


A postmortem documents what happened. A learning program makes sure it never has to happen again — and that every near-miss and trend surfaces before it becomes a page. The difference between an organisation that improves and one that just survives is what it does in the weeks and months after the meeting room clears.

The Incident Review Cycle

A structured review cycle converts raw postmortem data into engineering decisions. At big-tech companies this typically runs on three cadences:

Weekly incident sync (30 min). On-call leads and SREs walk every incident opened in the past seven days. New postmortems are triaged: action items are assigned owners and deadlines, duplicates are linked.
Monthly trend review (60 min). Engineering managers and staff engineers look across incidents. Which service had the most pages? Which error class recurred? Is MTTR (mean time to recovery) improving or regressing? Charts replace gut feel.
Quarterly resilience planning (half-day). Leadership prioritises reliability investments for the next quarter — chaos experiments, architecture changes, runbook automation — based on trend data, not heroics.

The weekly sync is cheap and prevents action items from expiring silently. Most organisations skip it, and the ones that skip it are the ones that get paged for the same incident three months later.

Quantifying Trends with SQL and PromQL

Your incident management tool (PagerDuty, Opsgenie, FireHydrant) exposes an API or database you can query. Pair that with Prometheus alert history to see patterns a human reviewer would miss.

Query your incident database (FireHydrant / custom table) for the top recurring failure modes over 90 days. The query below uses MySQL 8 syntax; adjust the interval and regex functions for other databases:

-- Top 10 recurring incident titles over the last 90 days
SELECT
  REGEXP_REPLACE(title, '[0-9]+', 'N') AS pattern,
  COUNT(*)                             AS occurrences,
  AVG(duration_minutes)                AS avg_duration_min,
  SUM(CASE WHEN severity = 'SEV1' THEN 1 ELSE 0 END) AS sev1_count
FROM incidents
WHERE started_at >= NOW() - INTERVAL 90 DAY
GROUP BY pattern
ORDER BY occurrences DESC
LIMIT 10;
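
If your tool only offers an API or a JSON export rather than SQL access, the same grouping can be sketched in Python. This is a minimal illustration, assuming each incident is a dict with `title`, `duration_minutes` and `severity` keys (hypothetical field names mirroring the columns above) and that the caller has already filtered to the last 90 days.

```python
import re
from collections import defaultdict


def top_recurring(incidents, limit=10):
    """Group incidents by title, with every run of digits replaced by 'N'."""
    groups = defaultdict(list)
    for inc in incidents:
        pattern = re.sub(r"[0-9]+", "N", inc["title"])
        groups[pattern].append(inc)

    rows = []
    for pattern, members in groups.items():
        rows.append({
            "pattern": pattern,
            "occurrences": len(members),
            "avg_duration_min": sum(m["duration_minutes"] for m in members) / len(members),
            "sev1_count": sum(1 for m in members if m["severity"] == "SEV1"),
        })
    # Most frequent patterns first, as in the SQL ORDER BY.
    rows.sort(key=lambda r: r["occurrences"], reverse=True)
    return rows[:limit]


# Hypothetical sample data for illustration only.
sample = [
    {"title": "db-42 connection timeout", "duration_minutes": 30, "severity": "SEV1"},
    {"title": "db-7 connection timeout", "duration_minutes": 50, "severity": "SEV2"},
    {"title": "cache node 3 evicted", "duration_minutes": 10, "severity": "SEV3"},
]

for row in top_recurring(sample):
    print(row)
```

Replacing digit runs with `N` is what lets "db-42" and "db-7" collapse into one pattern; titles that differ in wording rather than numbers will still be counted separately.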

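Track MTTR regression in Prometheus. Alertmanager's built-in metrics count alerts rather than time them, so this query assumes a separate exporter that records how long each alert fired as a histogram:

# PromQL: 90th-percentile alert firing duration per alertname over a 28-day window,
# used here as a proxy for MTTR.
# Assumes an exporter that publishes firing duration as the histogram
# alert_duration_seconds; the metric name depends on the exporter you run.

histogram_quantile(0.90,
  sum by (le, alertname) (
    rate(alert_duration_seconds_bucket[28d])
  )
)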

Add incident_count, mttr_seconds, and error_budget_consumed to your team dashboard and review them at the monthly trend review. When a bar chart says "database timeouts: 14 incidents in 90 days", nobody argues about whether a connection-pool audit is worth scheduling.
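
To make "is MTTR improving or regressing?" concrete, here is a rough sketch that computes mean time to recovery per calendar month and flags months that got worse. It assumes incidents carry ISO 8601 `started_at` and `resolved_at` strings (hypothetical field names), and the 10% tolerance is an arbitrary illustrative threshold, not a recommendation.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mttr_by_month(incidents):
    """Return {"YYYY-MM": mean resolution time in seconds}, keyed by start month."""
    buckets = defaultdict(list)
    for inc in incidents:
        started = datetime.fromisoformat(inc["started_at"])
        resolved = datetime.fromisoformat(inc["resolved_at"])
        buckets[started.strftime("%Y-%m")].append((resolved - started).total_seconds())
    return {month: mean(durations) for month, durations in sorted(buckets.items())}


def regressions(mttr, tolerance=0.10):
    """Months whose MTTR exceeds the previous month's by more than `tolerance`."""
    months = list(mttr)
    return [cur for prev, cur in zip(months, months[1:])
            if mttr[cur] > mttr[prev] * (1 + tolerance)]


# Hypothetical sample data for illustration only.
sample = [
    {"started_at": "2024-01-05T10:00:00", "resolved_at": "2024-01-05T10:30:00"},
    {"started_at": "2024-01-20T08:00:00", "resolved_at": "2024-01-20T08:50:00"},
    {"started_at": "2024-02-11T14:00:00", "resolved_at": "2024-02-11T15:00:00"},
]

trend = mttr_by_month(sample)
print(trend)
print(regressions(trend))
```

With only a handful of incidents per month a single long outage dominates the mean, so treat a flagged month as a prompt to look at the underlying incidents rather than as a verdict.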

Near-Misses: Your Cheapest Learning Signal

A near-miss is an event that triggered an alert, was caught before user impact, and was resolved without opening a formal incident. At Google and Netflix these are tracked just as seriously as real incidents, because they reveal the same failure modes without the cost of a user-facing outage.

Create a lightweight near-miss log in your incident tool or a shared doc. Fields: date, service, what failed, what caught it, corrective action. Review them at the weekly sync. If the same near-miss appears three times, open a postmortem even though no users were affected.
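
The log and the "three times" rule can be sketched as follows. The field names follow the list above; treating two entries as "the same near-miss" when service and failure description match exactly is a simplifying assumption, and real logs would likely need a tag or category instead of free text.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class NearMiss:
    day: date
    service: str
    what_failed: str
    caught_by: str
    corrective_action: str


def needs_postmortem(log, threshold=3):
    """Return (service, what_failed) pairs seen at least `threshold` times."""
    counts = Counter((nm.service, nm.what_failed) for nm in log)
    return sorted(key for key, n in counts.items() if n >= threshold)


# Hypothetical entries for illustration only.
log = [
    NearMiss(date(2024, 3, 1), "checkout", "connection pool near limit",
             "saturation alert", "restarted worker"),
    NearMiss(date(2024, 3, 9), "checkout", "connection pool near limit",
             "saturation alert", "raised pool size"),
    NearMiss(date(2024, 3, 15), "search", "disk 90% full",
             "disk alert", "pruned old indices"),
    NearMiss(date(2024, 3, 22), "checkout", "connection pool near limit",
             "on-call noticed", "none yet"),
]

print(needs_postmortem(log))
```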

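A practical way to surface near-misses automatically: route low-severity alerts to a dedicated receiver with resolved notifications enabled, so a downstream service can log every alert that fired and then resolved on its own without a human acknowledgement. Alertmanager routes on labels, not on how long an alert fired, so the "resolved quickly" check has to happen in the webhook receiver.

# Alertmanager route: send warning/info alerts to a "near-miss" receiver
# instead of silently dropping them. The receiver gets both the firing and
# the resolved notification and can work out how long the alert lasted.

route:
  receiver: pagerduty-critical   # default receiver; its definition is omitted from this fragment
  routes:
    - matchers:
        - severity =~ "warning|info"
      continue: false              # do not evaluate sibling routes after this match
      receiver: near-miss-logger   # posts to a Slack #near-misses channel + appends to DB

receivers:
  - name: near-miss-logger
    webhook_configs:
      - url: https://hooks.example.com/near-miss
        send_resolved: true        # also notify when the alert clears
        http_config:
          bearer_token_file: /var/run/secrets/webhook-token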
Incident Trends Dashboard

The diagram below shows the data flow from raw incidents into the review cycle that produces resilience investments.

Figure: The incident learning loop. Raw events aggregate into trend data, flow through three review cadences, and exit as concrete engineering work.

Resilience Investments: Turning Data into Engineering Work

Trend data should translate directly into a prioritised backlog of reliability work. Use a simple scoring model so the conversation with product management is data-driven rather than emotional:

Frequency score — incidents per quarter for this failure class.
Impact score — average error-budget minutes burned per occurrence.
Detection gap — time between failure onset and alert firing (if > 5 min, instrument better).
Mitigation maturity — is the runbook automated or manual? Manual runbooks score higher priority.
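
One possible way to combine the four scores into a single ranking is sketched below. The lesson does not prescribe a formula, so the weighting here (frequency times impact, boosted by 1.5 for a detection gap over five minutes and by 1.25 for a manual runbook) is purely an assumption for illustration, as are the class and field names.

```python
from dataclasses import dataclass


@dataclass
class FailureClass:
    name: str
    incidents_per_quarter: int     # frequency score
    avg_budget_minutes: float      # impact score: error-budget minutes per occurrence
    detection_gap_min: float       # minutes from failure onset to alert firing
    runbook_automated: bool        # mitigation maturity


def priority(fc):
    """Hypothetical weighting: expected budget burn, boosted for slow detection
    and for manual runbooks."""
    score = fc.incidents_per_quarter * fc.avg_budget_minutes
    if fc.detection_gap_min > 5:
        score *= 1.5
    if not fc.runbook_automated:
        score *= 1.25
    return score


# Hypothetical backlog for illustration only.
backlog = [
    FailureClass("connection-pool exhaustion", 6, 20.0, 8.0, False),
    FailureClass("bad deploy rollback", 3, 45.0, 2.0, True),
    FailureClass("cert expiry", 1, 60.0, 12.0, False),
]

ranked = sorted(backlog, key=priority, reverse=True)
for fc in ranked:
    print(f"{priority(fc):7.1f}  {fc.name}")
```

Whatever weights you pick, agree on them before scoring real failure classes, so the ranking is not tuned after the fact to favour a preferred project.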

Common resilience investments that come out of this analysis: circuit-breaker adoption, retry/backoff tuning, database read-replica promotion automation, alert deduplication, chaos game days targeting the top failure mode.

Do not let action items live only in a postmortem doc. Every action item must have an owner, a deadline, and a ticket in your issue tracker, or it will not get done. Review open action items at every weekly sync and escalate anything that is more than two sprints past its deadline.

Sharing Learning Across Teams

Incident learning compounds when it crosses team boundaries. Effective practices at scale:

Incident digest newsletter. A weekly internal email with three anonymised incident summaries, the root cause, and what changed. Links to full postmortems. Takes 20 minutes to write; read by hundreds of engineers.
Reliability guild meetings. Monthly cross-team SRE/on-call meeting where one team presents a postmortem and the rest ask questions. Engineers remember stories better than dashboards.
Searchable postmortem library. Tag every postmortem with affected services, error classes, and contributing factors. A new engineer investigating a cascade failure should be able to search "connection pool exhaustion" and find three previous postmortems with solutions. Confluence search works; a dedicated tool like incident.io is better.

Anonymise service names and individuals in widely circulated digests but keep them in the internal postmortem. The goal is learning, not blame, but investigators need accurate context.

Closing the Loop: Did We Actually Get Better?

Every resilience investment should have a measurable target before work starts. "Improve database reliability" is not measurable. "Reduce connection-pool-exhaustion incidents from 6/quarter to 0 within two quarters" is.

At the next quarterly review, compare actual incident counts against those targets. If a fix did not move the metric, either the root cause was mis-diagnosed or the fix was incomplete — both are valuable learnings that feed back into the next postmortem.
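
The quarterly comparison can be as simple as the sketch below. It assumes each investment recorded a baseline and a target incident count per quarter; the `Target` class, the three verdict labels and the sample numbers are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class Target:
    failure_class: str
    baseline_per_quarter: int   # incident count when the work was approved
    target_per_quarter: int     # count the investment was meant to reach


def review(targets, actual_counts):
    """Classify each target as 'met', 'improved' or 'no improvement'."""
    results = {}
    for t in targets:
        actual = actual_counts.get(t.failure_class, 0)
        if actual <= t.target_per_quarter:
            verdict = "met"
        elif actual < t.baseline_per_quarter:
            verdict = "improved"
        else:
            verdict = "no improvement"
        results[t.failure_class] = (actual, verdict)
    return results


# Hypothetical targets and counts for illustration only.
targets = [
    Target("connection-pool exhaustion", baseline_per_quarter=6, target_per_quarter=0),
    Target("alert storm", baseline_per_quarter=10, target_per_quarter=4),
]
actual_counts = {"connection-pool exhaustion": 2, "alert storm": 11}

outcome = review(targets, actual_counts)
for name, (actual, verdict) in outcome.items():
    print(f"{name}: {actual} incidents this quarter ({verdict})")
```

A "no improvement" verdict is the case the paragraph above describes: it is the signal to revisit the diagnosis rather than to repeat the same fix.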

This is the SRE feedback loop in its most honest form: measure, invest, measure again. Done consistently, it converts a reactive on-call burden into a proactively engineered system.