Learning from Incidents
A postmortem documents what happened. A learning program works to keep it from happening again, and to surface near-misses and trends before they become a page. The difference between an organisation that improves and one that just survives is what it does in the weeks and months after the meeting room clears.
The Incident Review Cycle
A structured review cycle converts raw postmortem data into engineering decisions. At big-tech companies this typically runs on three cadences:
- Weekly incident sync (30 min). On-call leads and SREs walk every incident opened in the past seven days. New postmortems are triaged: action items are assigned owners and deadlines, duplicates are linked.
- Monthly trend review (60 min). Engineering managers and staff engineers look across incidents. Which service had the most pages? Which error class recurred? Is MTTR improving or regressing? Charts replace gut feel.
- Quarterly resilience planning (half-day). Leadership prioritises reliability investments for the next quarter — chaos experiments, architecture changes, runbook automation — based on trend data, not heroics.
Quantifying Trends with SQL and PromQL
Your incident management tool (PagerDuty, Opsgenie, FireHydrant) exposes an API or database you can query. Pair that with Prometheus alert history to see patterns a human reviewer would miss.
Query your incident database (FireHydrant / custom table) for the top recurring failure modes over 90 days:
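A minimal sketch of such a query, run from Python against an in-memory SQLite database so it is self-contained. The `incidents` table, its columns, and the sample rows are assumptions made for illustration; a real incident store will use a different schema and SQL dialect.

```python
import sqlite3

# Hypothetical schema and sample rows; adapt the names to your own incident store.
def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE incidents ("
        "id INTEGER PRIMARY KEY, service TEXT, failure_class TEXT, opened_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO incidents (service, failure_class, opened_at) VALUES (?, ?, ?)",
        [
            ("checkout", "database timeout", "2024-05-01"),
            ("checkout", "database timeout", "2024-05-20"),
            ("search", "database timeout", "2024-06-10"),
            ("search", "bad deploy", "2024-06-12"),
            ("api", "certificate expiry", "2024-04-15"),
            ("checkout", "bad deploy", "2024-01-15"),  # older than 90 days
        ],
    )
    return conn

# Count incidents per failure class in the 90 days ending at :as_of.
# Dates are ISO 8601 strings, so string comparison matches date order.
TOP_FAILURE_MODES = """
SELECT failure_class, COUNT(*) AS incident_count
FROM incidents
WHERE opened_at >= date(:as_of, '-90 days')
  AND opened_at <= :as_of
GROUP BY failure_class
ORDER BY incident_count DESC, failure_class
LIMIT :limit
"""

def top_failure_modes(conn, as_of, limit=5):
    """Return (failure_class, incident_count) rows, most frequent first."""
    rows = conn.execute(TOP_FAILURE_MODES, {"as_of": as_of, "limit": limit})
    return rows.fetchall()
```

Calling `top_failure_modes(build_demo_db(), "2024-06-30")` ranks the sample failure classes by frequency and ignores the January row, which falls outside the window.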
Track MTTR regression in Prometheus (some Alertmanager exporters record alert duration as a metric):
Add incident_count, mttr_seconds, and error_budget_consumed to your team dashboard and review them at the monthly trend review. When a bar chart says "database timeouts: 14 incidents in 90 days", nobody argues about whether a connection-pool audit is worth scheduling.
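Whatever the metric source, the regression check itself can be sketched in Python. This assumes incident open and resolve timestamps have been exported as ISO 8601 strings; the function names and the 10% tolerance are illustrative choices, not part of any tool.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def mttr_by_month(incidents):
    """Mean time to resolve in seconds, keyed by the 'YYYY-MM' of the open time.

    `incidents` is an iterable of (opened_at, resolved_at) ISO 8601 strings.
    """
    durations = defaultdict(list)
    for opened_at, resolved_at in incidents:
        opened = datetime.fromisoformat(opened_at)
        resolved = datetime.fromisoformat(resolved_at)
        durations[opened.strftime("%Y-%m")].append((resolved - opened).total_seconds())
    # Sort so that months appear in chronological order.
    return {month: mean(values) for month, values in sorted(durations.items())}

def regressions(mttr, tolerance=0.10):
    """Months whose MTTR exceeds the previous month's by more than `tolerance`."""
    months = list(mttr)
    return [
        cur
        for prev, cur in zip(months, months[1:])
        if mttr[cur] > mttr[prev] * (1 + tolerance)
    ]

# Hypothetical sample data for illustration only.
SAMPLE_INCIDENTS = [
    ("2024-04-02T10:00:00", "2024-04-02T10:30:00"),
    ("2024-04-18T09:00:00", "2024-04-18T10:00:00"),
    ("2024-05-07T14:00:00", "2024-05-07T14:30:00"),
    ("2024-06-03T08:00:00", "2024-06-03T09:00:00"),
    ("2024-06-21T16:00:00", "2024-06-21T16:40:00"),
]
```

A tolerance avoids flagging month-to-month noise; with few incidents per month, a median or a longer window may be a more stable choice than the mean.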
Near-Misses: Your Cheapest Learning Signal
A near-miss is an event that triggered an alert, was caught before user impact, and was resolved without opening a formal incident. At Google and Netflix these are tracked just as seriously as real incidents, because they reveal the same failure modes without the cost of user impact.
Create a lightweight near-miss log in your incident tool or a shared doc. Fields: date, service, what failed, what caught it, corrective action. Review them at the weekly sync. If the same near-miss appears three times, open a postmortem even though no users were affected.
A practical way to surface near-misses automatically: flag alerts that fired and self-resolved within your SLO window without a human acknowledgement.
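One way to express that filter, as a sketch over exported alert history. The `AlertEvent` record and its field names are hypothetical; map them to whatever your alerting or paging tool actually exports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class AlertEvent:
    # Hypothetical record shape, not the export format of any specific tool.
    name: str
    fired_at: datetime
    resolved_at: Optional[datetime]   # None while the alert is still firing
    acknowledged_by: Optional[str]    # None if no human acknowledged it

def near_misses(events, window):
    """Alerts that resolved on their own within `window`, with no human ack."""
    return [
        e
        for e in events
        if e.resolved_at is not None
        and e.acknowledged_by is None
        and e.resolved_at - e.fired_at <= window
    ]

# Hypothetical sample events for illustration only.
T0 = datetime(2024, 6, 1, 12, 0)
SAMPLE_EVENTS = [
    AlertEvent("HighLatency", T0, T0 + timedelta(minutes=3), None),
    AlertEvent("DiskFull", T0, T0 + timedelta(minutes=4), "alice"),
    AlertEvent("PoolExhausted", T0, T0 + timedelta(minutes=45), None),
    AlertEvent("QueueBacklog", T0, None, None),
]
```

With a 10-minute window, only `HighLatency` qualifies: `DiskFull` was acknowledged, `PoolExhausted` took too long to clear, and `QueueBacklog` is still firing.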
Incident Trends Dashboard
The diagram below shows the data flow from raw incidents into the review cycle that produces resilience investments.
Resilience Investments: Turning Data into Engineering Work
Trend data should translate directly into a prioritised backlog of reliability work. Use a simple scoring model so the conversation with product management is data-driven rather than emotional:
- Frequency score — incidents per quarter for this failure class.
- Impact score — average error-budget minutes burned per occurrence.
- Detection gap — time between failure onset and alert firing (if > 5 min, instrument better).
- Mitigation maturity — is the runbook automated or manual? Manual runbooks score higher priority.
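The four factors above can be combined into a single priority score. The sketch below is one possible formula, not a standard: frequency times impact, boosted when detection is slow or the runbook is manual. The two multipliers are assumptions chosen for illustration; only the 5-minute detection threshold comes from the list above.

```python
from dataclasses import dataclass

DETECTION_GAP_THRESHOLD_MIN = 5.0  # from the list above
GAP_MULTIPLIER = 1.5               # assumption: weight for slow detection
MANUAL_MULTIPLIER = 1.5            # assumption: weight for manual runbooks

@dataclass
class FailureClass:
    name: str
    incidents_per_quarter: int     # frequency score
    avg_budget_minutes: float      # impact score per occurrence
    detection_gap_minutes: float   # onset-to-alert delay
    runbook_automated: bool        # mitigation maturity

def priority_score(fc):
    """Error-budget minutes burned per quarter, weighted by the two gaps."""
    score = fc.incidents_per_quarter * fc.avg_budget_minutes
    if fc.detection_gap_minutes > DETECTION_GAP_THRESHOLD_MIN:
        score *= GAP_MULTIPLIER
    if not fc.runbook_automated:
        score *= MANUAL_MULTIPLIER
    return score

def rank(failure_classes):
    """Highest-priority failure class first."""
    return sorted(failure_classes, key=priority_score, reverse=True)

# Hypothetical sample data for illustration only.
SAMPLE_CLASSES = [
    FailureClass("certificate expiry", 1, 40.0, 2.0, True),
    FailureClass("database timeout", 6, 10.0, 8.0, False),
    FailureClass("bad deploy", 4, 15.0, 3.0, False),
]
```

Multiplying frequency by impact keeps the base score in error-budget minutes per quarter, a unit product managers can compare directly against feature work.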
Common resilience investments that come out of this analysis: circuit-breaker adoption, retry/backoff tuning, database read-replica promotion automation, alert deduplication, chaos game days targeting the top failure mode.
Sharing Learning Across Teams
Incident learning compounds when it crosses team boundaries. Effective practices at scale:
- Incident digest newsletter. A weekly internal email with three anonymised incident summaries, the root cause, and what changed. Links to full postmortems. Takes 20 minutes to write; read by hundreds of engineers.
- Reliability guild meetings. Monthly cross-team SRE/on-call meeting where one team presents a postmortem and the rest ask questions. Engineers remember stories better than dashboards.
- Searchable postmortem library. Tag every postmortem with affected services, error classes, and contributing factors. A new engineer investigating a cascade failure should be able to search "connection pool exhaustion" and find three previous postmortems with solutions. Confluence search works; a dedicated tool like incident.io is better.
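To make the tagging idea concrete, here is a small in-memory tag index. It is a sketch of the data structure only; the class name and postmortem IDs are hypothetical, and a real library would sit on top of a wiki or incident tool's own search.

```python
from collections import defaultdict

class PostmortemIndex:
    """Maps normalised tags to the set of postmortem IDs carrying them."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def add(self, postmortem_id, tags):
        for tag in tags:
            # Normalise so "Connection Pool Exhaustion " matches lowercase queries.
            self._by_tag[tag.strip().lower()].add(postmortem_id)

    def search(self, *tags):
        """Return sorted IDs of postmortems tagged with all of the given tags."""
        if not tags:
            return []
        sets = [self._by_tag.get(t.strip().lower(), set()) for t in tags]
        return sorted(set.intersection(*sets))

# Hypothetical sample entries for illustration only.
def build_demo_index():
    index = PostmortemIndex()
    index.add("PM-101", ["checkout", "connection pool exhaustion"])
    index.add("PM-117", ["search", "Connection Pool Exhaustion", "retry storm"])
    index.add("PM-130", ["checkout", "bad deploy"])
    return index
```

Searching the demo index for "connection pool exhaustion" returns both PM-101 and PM-117; adding "checkout" as a second tag narrows the result to PM-101.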
Closing the Loop: Did We Actually Get Better?
Every resilience investment should have a measurable target before work starts. "Improve database reliability" is not measurable. "Reduce connection-pool-exhaustion incidents from 6/quarter to 0 within two quarters" is.
At the next quarterly review, compare actual incident counts against those targets. If a fix did not move the metric, either the root cause was mis-diagnosed or the fix was incomplete — both are valuable learnings that feed back into the next postmortem.
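The comparison can be as simple as the sketch below. The `Target` record and the three status labels are assumptions for illustration; the point is that each investment carries a baseline and a target that the quarterly review checks against actual counts.

```python
from dataclasses import dataclass

@dataclass
class Target:
    # Hypothetical record: one measurable goal per resilience investment.
    failure_class: str
    baseline_per_quarter: int   # incident count before the work started
    target_per_quarter: int     # incident count the work is meant to reach

def review(targets, actual_counts):
    """Return (failure_class, status) pairs for the quarterly review.

    `actual_counts` maps failure class to incidents observed this quarter;
    a missing class counts as zero incidents.
    """
    results = []
    for t in targets:
        actual = actual_counts.get(t.failure_class, 0)
        if actual <= t.target_per_quarter:
            status = "met"
        elif actual < t.baseline_per_quarter:
            status = "improving"
        else:
            status = "not moved"  # revisit the diagnosis or the fix
        results.append((t.failure_class, status))
    return results

# Hypothetical sample data for illustration only.
SAMPLE_TARGETS = [
    Target("connection pool exhaustion", 6, 0),
    Target("certificate expiry", 2, 0),
    Target("bad deploy", 4, 2),
]
SAMPLE_ACTUALS = {"connection pool exhaustion": 2, "bad deploy": 5}
```

Items that come back as "not moved" are the ones to take into the next postmortem discussion.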
This is the SRE feedback loop in its most honest form: measure, invest, measure again. Done consistently, it converts a reactive on-call burden into a proactively engineered system.