Incident Management & On-Call

On-Call Done Right

On-call is the beating heart of incident response — and the most common source of engineer burnout in the industry. Done poorly, a single on-call rotation can destroy team morale, erode reliability through fatigued decision-making, and drive away your best engineers. Done right, it becomes a manageable part of the job that produces compounding reliability improvements over time. The difference is almost entirely structural, not about individual toughness.

This lesson covers the mechanics and culture of sustainable on-call: how rotations are designed, how shifts are handed off without information loss, how engineers are compensated fairly, and how you measure and control paging load so that it stays humane. These are not abstract ideals — they are practices that Google, PagerDuty, Stripe, and Shopify run in production at scale.

Rotation Design: Who Gets Paged and When

A rotation is the schedule that determines which engineer is the primary on-call responder at any given moment. Every rotation design involves tradeoffs between coverage quality, engineer load, and operational complexity.

Follow-the-sun rotations are the gold standard for global teams. Engineers in each time zone cover business hours for their region, so nobody is paged at 3 AM. A typical implementation has three regions — Americas, EMEA, APAC — with overlapping handoff windows of 30–60 minutes. The requirement is a team distributed across those zones. Many companies achieve this organically as they scale; smaller teams often cannot.
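As a rough sketch of the follow-the-sun idea, the functions below map a point in time to the region holding the pager and report whether a handoff overlap is in progress. The region names come from the text; the shift boundaries (APAC from 00:00, EMEA from 08:00, Americas from 16:00 UTC) and the 30-minute overlap are illustrative assumptions, not a recommended schedule.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical shift starts in UTC; real boundaries depend on where the team sits.
SHIFT_STARTS_UTC = [(0, "APAC"), (8, "EMEA"), (16, "Americas")]
HANDOFF_OVERLAP = timedelta(minutes=30)  # assumed, from the 30-60 minute range


def _shift_start_hour(utc_now: datetime) -> int:
    """Start hour of the shift that contains utc_now."""
    return max(start for start, _ in SHIFT_STARTS_UTC if start <= utc_now.hour)


def primary_region(now: datetime) -> str:
    """Region holding the pager at `now` (must be timezone-aware)."""
    utc_now = now.astimezone(timezone.utc)
    return dict(SHIFT_STARTS_UTC)[_shift_start_hour(utc_now)]


def in_handoff_window(now: datetime) -> bool:
    """True during the first HANDOFF_OVERLAP of a shift, when both regions are online."""
    utc_now = now.astimezone(timezone.utc)
    shift_start = utc_now.replace(
        hour=_shift_start_hour(utc_now), minute=0, second=0, microsecond=0
    )
    return utc_now - shift_start < HANDOFF_OVERLAP
```

Keeping the boundaries in one table makes the schedule easy to review and to change when a region gains or loses coverage.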

Weekly rotations are the most common pattern for single-timezone or small teams. One engineer is primary for seven days, a second is secondary (escalation target if primary does not respond within a defined SLA, typically five minutes). Weekly rotations keep the cognitive context window manageable — you have a week to internalize system state — but seven consecutive days of primary is psychologically draining on high-paging services. The mitigation is to ensure no engineer is primary more often than one week in every four to six, and to aggressively reduce page volume.
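A weekly primary/secondary rotation can be generated with a simple round-robin. The sketch below is one possible convention, assumed here for illustration: the secondary is the next engineer in the list, so they become primary the following week and carry context forward. The engineer names are placeholders.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start_monday, weeks):
    """Return a list of (week_start, primary, secondary) tuples in round-robin order."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # Assumed convention: this week's secondary is next week's primary.
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start_monday + timedelta(weeks=week), primary, secondary))
    return schedule


if __name__ == "__main__":
    for week_start, primary, secondary in weekly_rotation(
        ["alice", "bob", "carol", "dave"], date(2025, 10, 13), 5
    ):
        print(week_start, "primary:", primary, "secondary:", secondary)
```

With four engineers, each person is primary one week in four, which sits at the upper edge of the frequency suggested above.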

Weekend splitting is a refinement many teams adopt: the seven-day rotation is broken into a weekday block (Mon–Fri) and a weekend block (Sat–Sun), with different engineers covering each. Weekend pages have a higher psychological cost because they interrupt genuine off-time. Splitting lets the team compensate weekend on-call more heavily and distributes that cost more fairly.

Minimum viable rotation size: Google's SRE book puts the minimum for a single-site rotation at eight engineers, assuming week-long shifts with a primary and a secondary on each. At that size an engineer is primary one week in eight and holds some on-call role one week in four. Below six engineers, sustainable on-call is essentially impossible without aggressive automation, because every engineer is primary too frequently to recover between shifts. If your team is smaller than six, reducing page volume through alerting improvements is not optional; it is existential.
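The arithmetic behind these thresholds is short enough to sketch. It assumes week-long shifts assigned round-robin, with one primary and one secondary per shift; the function names are illustrative.

```python
def primary_fraction(team_size: int) -> float:
    """Fraction of weeks an engineer spends as primary in a round-robin rotation."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return 1 / team_size


def on_call_fraction(team_size: int, roles_per_shift: int = 2) -> float:
    """Fraction of weeks an engineer holds any on-call role (primary or secondary)."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return min(1.0, roles_per_shift / team_size)


if __name__ == "__main__":
    for size in (4, 6, 8):
        print(size, f"{primary_fraction(size):.1%}", f"{on_call_fraction(size):.1%}")
```

Under these assumptions a team of eight spends a quarter of its weeks in some on-call role, while a team of four spends half.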

The Handoff: Engineering Shift Continuity

A rotation without a disciplined handoff is just a random assignment of pager duty. The outgoing on-call engineer holds context that is not in any runbook: which services are currently degraded and why, which alerts are silenced and until when, which deployments are in-flight, and which investigation threads are open. Transferring that context cleanly is an engineering discipline, not a courtesy.

Production-grade handoffs follow a structured format. The outgoing engineer writes a handoff note — typically in a shared doc, Slack channel, or the on-call management tool — covering: active incidents and their current status, silenced alerts (with expiry times and the reason for silencing), recent deployments in the last 24 hours, known degraded components, and any tickets or postmortem action items that became relevant during the shift. The incoming engineer reads and acknowledges the note before the handoff completes. A brief 15-minute synchronous call is strongly preferred over an async message for any shift that had significant activity.

# PagerDuty CLI: list open (triggered or acknowledged) incidents before handoff
# --statuses is a repeatable flag, not a comma-separated list
pd incident list --statuses=triggered --statuses=acknowledged --limit=20

# List silenced alert rules in Alertmanager (Prometheus stack)
curl -s http://alertmanager:9093/api/v2/silences \
  | jq '.[] | select(.status.state=="active") | {id, comment, endsAt, matchers}'

# Typical handoff doc template (Markdown, stored in runbook repo)
# --- HANDOFF: 2025-10-14 09:00 UTC ---
# Outgoing: alice@   Incoming: bob@
#
# Active incidents:
#   INC-4821: checkout service P50 latency elevated (~1.2s vs 0.4s baseline)
#   Status: investigating, suspect DB connection pool exhaustion
#   Slack thread: #incident-4821
#
# Silenced alerts:
#   - FrontendErrorRate (silenced until 2025-10-14 12:00 UTC)
#     Reason: known canary deploy in progress, owner: carol@
#
# Recent deploys (last 24h):
#   - payments-service v2.41.0 @ 22:10 UTC yesterday (rollback ready)
#   - auth-service v1.18.3 @ 06:30 UTC today (healthy)
#
# Open threads:
#   - Redis cluster rebalancing started Fri, expected to complete by EOD
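
The handoff fields listed above can also be checked mechanically before the shift changes hands. The following is a minimal sketch, not a prescribed tool: the `HandoffNote` and `Silence` types and the `handoff_problems` function are hypothetical names, and the checks only cover the rules stated in the text (silences carry an expiry and a reason, and the incoming engineer acknowledges the note).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Silence:
    alert: str
    expires_at: Optional[datetime]  # None models a silence with no expiry recorded
    reason: str
    owner: str


@dataclass
class HandoffNote:
    outgoing: str
    incoming: str
    active_incidents: List[str] = field(default_factory=list)
    silences: List[Silence] = field(default_factory=list)
    recent_deploys: List[str] = field(default_factory=list)
    open_threads: List[str] = field(default_factory=list)
    acknowledged_by_incoming: bool = False


def handoff_problems(note: HandoffNote, now: datetime) -> List[str]:
    """Return human-readable reasons why the handoff is not yet complete."""
    problems = []
    for silence in note.silences:
        if silence.expires_at is None:
            problems.append(f"silence on {silence.alert} has no expiry")
        elif silence.expires_at <= now:
            problems.append(f"silence on {silence.alert} already expired")
        if not silence.reason.strip():
            problems.append(f"silence on {silence.alert} has no reason")
    if not note.acknowledged_by_incoming:
        problems.append("incoming engineer has not acknowledged the note")
    return problems
```

An empty result means the note satisfies the structural rules; it says nothing about whether the content is accurate, which is what the synchronous call is for.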

Compensation: Paying for On-Call Fairly

On-call involves real opportunity cost: engineers cannot travel freely, must keep a laptop nearby, must abstain from alcohol during primary shifts, and are mentally preoccupied even when not actively paged. Compensation structures that ignore this fact create resentment and attrition. Structures that acknowledge it create buy-in.

There are three common models. Flat stipend: a fixed weekly payment for each primary shift, regardless of page volume. It is straightforward, predictable, and easy to administer. Typical range at mid-size tech companies is $200–500/week for primary; secondary is often unpaid or $50–100/week. Per-incident pay: engineers are paid per page or per acknowledged incident above a baseline. This aligns compensation more closely with actual burden, but requires a careful definition of what counts. Comp time: engineers who are paged outside business hours earn time off in return, often at a 1:1 or 1.5:1 ratio (at 1:1, one hour off for each hour spent responding at night or on a weekend). Comp time is common in organizations where budget is constrained but scheduling flexibility is high.
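To compare the three models side by side, a small sketch helps. The function names and the example rates below are illustrative assumptions; only the structure of each model comes from the text.

```python
def flat_stipend(primary_weeks: int, weekly_rate: float) -> float:
    """Fixed payment per primary week, independent of page volume."""
    return primary_weeks * weekly_rate


def per_incident_pay(incidents: int, baseline: int, rate_per_incident: float) -> float:
    """Payment only for incidents above the agreed baseline."""
    return max(0, incidents - baseline) * rate_per_incident


def comp_time_hours(off_hours_worked: float, ratio: float = 1.0) -> float:
    """Time off earned for hours spent responding outside business hours."""
    return off_hours_worked * ratio


if __name__ == "__main__":
    # Hypothetical week: one primary shift, 7 incidents, 4 hours of night work.
    print(flat_stipend(1, 300.0))           # stipend at an assumed $300/week
    print(per_incident_pay(7, 3, 50.0))     # assumed baseline of 3, $50 per incident
    print(comp_time_hours(4, ratio=1.5))    # hours of time off at 1.5:1
```

Running the same week through each function makes the tradeoff visible: the stipend is constant, while the other two scale with the actual burden.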

The most important compensation principle at big-tech companies is that on-call should never be free. When on-call is uncompensated, there is no financial signal that high page volume is costly to the organization — which removes a key incentive for management to invest in reliability improvements. A team with a $3,000/week on-call budget that is regularly spent has a much stronger business case for a $50,000 engineering project to reduce pages than a team where on-call is "just part of the job."
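The business case in that example is plain arithmetic. The sketch below annualizes the weekly budget and estimates a payback period; the `reduction` parameter (the share of on-call spend the project removes) is a hypothetical input, since the text gives no figure for it.

```python
def annual_on_call_cost(weekly_cost: float) -> float:
    """Yearly on-call spend for a given weekly budget."""
    return weekly_cost * 52


def payback_weeks(project_cost: float, weekly_cost: float, reduction: float) -> float:
    """Weeks until a project pays for itself, given the fraction of spend it removes."""
    if not 0 < reduction <= 1:
        raise ValueError("reduction must be in (0, 1]")
    return project_cost / (weekly_cost * reduction)


if __name__ == "__main__":
    print(annual_on_call_cost(3000))               # yearly spend at $3,000/week
    print(round(payback_weeks(50000, 3000, 0.5)))  # assumes the project halves the spend
```

At $3,000 per week the annual spend is $156,000, so a $50,000 project is small by comparison even under conservative assumptions about how much it helps.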

Google and Stripe practice: At Google, SREs receive on-call compensation as part of their role definition, and the on-call load is formally capped (see below). At Stripe, on-call stipends are used alongside the toil-reduction mandate: if a team's on-call burden exceeds the cap for two consecutive quarters, the team is required to produce and execute a remediation plan, with engineering time ring-fenced for it. The compensation structure creates accountability; the remediation requirement creates action.

Sustainable Paging Load: The Numbers That Matter

Sustainable paging load is the most operationally concrete concept in this lesson. Google's SRE book is explicit: a primary on-call engineer should receive no more than two actionable pages per twelve-hour shift. "Actionable" means the page requires investigation and a decision, not just acknowledging a flap that auto-resolved. The book states the limit in terms of incidents and derives it from workload rather than toughness: fully handling one incident (root-cause analysis, remediation, and follow-up such as the postmortem) is estimated to take about six hours on average, so two incidents fill a twelve-hour shift.

In practice, measuring paging load requires tooling. You need to track: total pages per engineer per week, pages during business hours vs. outside, mean time to acknowledge (MTTA), mean time to resolve (MTTR), pages that required escalation, and pages that auto-resolved before human action (these should be silenced at the source, not acknowledged by humans). Most teams instrument this in PagerDuty, OpsGenie, or a Prometheus-based dashboard.

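# Prometheus: weekly count of PagerDuty notifications per on-call engineer
# Assumes an oncall_user label is attached by your own alerting pipeline;
# Alertmanager does not add one to this metric. The metric counts every
# notification attempt and has no status label; failed attempts are counted
# separately in alertmanager_notifications_failed_total.

sum by (oncall_user) (
  increase(alertmanager_notifications_total{
    integration="pagerduty"
  }[7d])
)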

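# PagerDuty API: weekly incident count per user (bash + jq)
# Caveats: resolved incidents are returned with an empty assignments list, so
# this counts only incidents that are still open, and limit=100 is a single
# page of results (use offset while the response reports more=true).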
SINCE=$(date -u -d "7 days ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null \
       || date -u -v-7d +%Y-%m-%dT%H:%M:%SZ)
UNTIL=$(date -u +%Y-%m-%dT%H:%M:%SZ)

curl -s "https://api.pagerduty.com/incidents?since=${SINCE}&until=${UNTIL}&limit=100" \
  -H "Authorization: Token token=YOUR_TOKEN" \
  -H "Accept: application/vnd.pagerduty+json;version=2" \
  | jq '[.incidents[] | .assignments[].assignee.summary] | group_by(.) | map({user: .[0], count: length}) | sort_by(-.count)'

# Alert: if any user exceeds 15 pages/week, flag for rotation review
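
Once page records are exported from whichever tool is in use, the metrics listed above reduce to a few aggregations. The sketch below assumes a hypothetical `Page` record and a 09:00-17:00 weekday definition of business hours; neither is a PagerDuty or Opsgenie data model.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List, Optional


@dataclass
class Page:
    engineer: str
    triggered_at: datetime
    acknowledged_at: Optional[datetime]  # None if nobody acknowledged it
    resolved_at: datetime
    auto_resolved: bool
    escalated: bool


def is_business_hours(ts: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """Assumed definition: Monday to Friday, start_hour inclusive to end_hour exclusive."""
    return ts.weekday() < 5 and start_hour <= ts.hour < end_hour


def _minutes(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def paging_summary(pages: List[Page]) -> dict:
    """Aggregate the paging-load metrics for one reporting window."""
    per_engineer = {}
    for page in pages:
        per_engineer[page.engineer] = per_engineer.get(page.engineer, 0) + 1
    acked = [p for p in pages if p.acknowledged_at is not None]
    return {
        "pages_per_engineer": per_engineer,
        "off_hours_pages": sum(not is_business_hours(p.triggered_at) for p in pages),
        "mtta_minutes": mean(_minutes(p.triggered_at, p.acknowledged_at) for p in acked) if acked else None,
        "mttr_minutes": mean(_minutes(p.triggered_at, p.resolved_at) for p in pages) if pages else None,
        "escalated": sum(p.escalated for p in pages),
        "auto_resolved": sum(p.auto_resolved for p in pages),
    }
```

Unacknowledged pages are excluded from MTTA rather than counted as zero, so a run of auto-resolving alerts does not make response times look better than they are.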

Figure: follow-the-sun rotation. Coverage passes to engineers in the next time zone so that nobody is paged overnight, and thirty-minute overlaps allow a live handoff call rather than a blind transfer.

Alert Fatigue: The Silent Reliability Killer

Alert fatigue occurs when engineers receive so many pages, especially noisy, low-signal, or auto-resolving ones, that they begin to treat all alerts as background noise. The outcome is delayed responses to real incidents, habituation to failure states, and eventually a missed page that turns into a major outage. Alert fatigue is not a character flaw; it is a predictable human response to sustained noise.

Preventing alert fatigue requires active alerting hygiene practices. Every alert that pages should have three properties: it should be actionable (there is a specific thing a human must do right now), it should be urgent (it cannot wait until business hours), and it should be unique (no other alert already captures this failure mode). Alerts that do not meet all three criteria should be either converted to dashboard warnings, demoted to tickets, or deleted. "Just in case" alerts are alert fatigue in disguise.
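The three-property test lends itself to a simple review script. In the sketch below, the `AlertRule` type is hypothetical, and the mapping from a failed criterion to a destination (duplicate alerts are deleted, non-urgent ones become tickets, non-actionable ones become dashboard warnings) is one assumed policy; the text only says that such alerts should go to one of those three places.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    actionable: bool  # a human must do something specific right now
    urgent: bool      # it cannot wait until business hours
    unique: bool      # no other alert already covers this failure mode


def triage(rule: AlertRule) -> str:
    """Return 'page', 'ticket', 'dashboard', or 'delete' for an alert rule."""
    if rule.actionable and rule.urgent and rule.unique:
        return "page"
    if not rule.unique:
        return "delete"      # assumed policy: duplicates add only noise
    if rule.actionable:
        return "ticket"      # actionable but not urgent: handle in business hours
    return "dashboard"       # nothing to do right now: keep it visible, not paging
```

Running every paging rule through a check like this during a quarterly review gives a concrete list of candidates for demotion.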

Production pitfall — the noise spiral: Alert fatigue creates a feedback loop. High page volume causes engineers to silence alerts hastily rather than fix the root cause. Hasty silences mask real problems. Real problems compound. Eventually a masked problem causes a major incident, and the postmortem reveals that the alert was silenced six weeks ago and never revisited. The fix: silences must have an owner, an expiry time (never "indefinite"), and a linked issue tracking the root cause elimination. Alertmanager enforces expiry natively; use it.
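The owner, expiry, and linked-issue rules can be audited against the silence objects that Alertmanager's `/api/v2/silences` endpoint returns (fields `createdBy`, `comment`, `startsAt`, `endsAt`). This is a minimal sketch: the seven-day maximum and the ticket-reference pattern are assumed team policies, not Alertmanager behavior.

```python
import re
from datetime import datetime, timedelta
from typing import List

MAX_SILENCE = timedelta(days=7)  # assumed team policy, not an Alertmanager limit
ISSUE_PATTERN = re.compile(r"[A-Z]+-\d+|https?://\S+")  # assumed ticket-reference format


def parse_ts(value: str) -> datetime:
    """Parse an RFC 3339 timestamp, normalising a trailing 'Z' for fromisoformat."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def silence_problems(silence: dict) -> List[str]:
    """Return policy violations for one silence object."""
    problems = []
    if not silence.get("createdBy", "").strip():
        problems.append("no owner")
    if not ISSUE_PATTERN.search(silence.get("comment", "")):
        problems.append("no linked issue")
    duration = parse_ts(silence["endsAt"]) - parse_ts(silence["startsAt"])
    if duration > MAX_SILENCE:
        problems.append("silence longer than policy allows")
    return problems
```

Feeding the active silences through `silence_problems` on a schedule turns "never revisited" into a visible, recurring report.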

On-Call Onboarding: Shadowing Before Paging

A new engineer should never be thrown onto primary on-call without preparation. The standard big-tech pattern is a shadowing period — typically two to four weeks — during which the new engineer is added to the rotation in a "shadow" role: they receive all the same pages as the primary, but a senior engineer holds the actual primary responsibility. The shadow is expected to investigate alongside the primary, write their own hypotheses, and suggest actions — but the senior makes the final call. This approach ensures that by the time the new engineer rotates to primary, they have seen a representative sample of the failure modes they will face.

Shadowing should be paired with a structured runbook review, a production environment walkthrough, and at least one simulated incident (a fire drill using a staging environment or a controlled chaos experiment). The goal is that the first real incident a new on-call engineer handles alone should feel familiar, not novel.

On-call health metrics to track weekly: (1) Pages per engineer per shift — target <3 actionable; (2) MTTA (mean time to acknowledge) — target <5 min for P1; (3) Percentage of pages that auto-resolved before human action — should be eliminated at the source; (4) Percentage of shifts with zero pages ("quiet shifts") — a healthy rotation has many; (5) Engineer satisfaction score — quarterly survey asking specifically about on-call burden; score below 7/10 is an engineering red flag, not an HR issue.
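The targets in that list can be turned into a weekly check. The sketch below assumes the metrics have already been computed and are passed in as a dictionary with the hypothetical keys shown; the thresholds are the ones stated above. The quiet-shift percentage has no numeric target in the text, so it is left out of the checks.

```python
from typing import Dict, List


def on_call_health_flags(metrics: Dict[str, float]) -> List[str]:
    """Return the health targets that this week's metrics miss."""
    flags = []
    if metrics["actionable_pages_per_shift"] >= 3:
        flags.append("paging load at or above 3 actionable pages per shift")
    if metrics["p1_mtta_minutes"] >= 5:
        flags.append("P1 MTTA at or above 5 minutes")
    if metrics["auto_resolved_pct"] > 0:
        flags.append("some pages auto-resolved before human action")
    if metrics["satisfaction_score"] < 7:
        flags.append("on-call satisfaction below 7/10")
    return flags
```

An empty list means every stated target was met for the week.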

When On-Call Is Unsustainable: The Escalation

Sometimes paging load exceeds the cap despite best efforts: a new service launch, a traffic spike, or a cascading failure creates an unsustainable burst. The engineering response to this is structured, not heroic. First, escalate immediately: the incident commander or engineering manager must be informed when any engineer exceeds the page cap, because they have the authority to pull in additional engineers, defer non-critical deploys, or deliberately degrade the service (for example, by shedding load or disabling non-critical features) to reduce load. Second, document every excess page in real time, not as a blame trail, but as data that feeds the postmortem and the remediation roadmap. Third, treat sustained exceedance (more than two consecutive weeks above cap) as a service reliability incident, not an operational inconvenience. It gets a postmortem, an action list, and dedicated engineering time.
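The "more than two consecutive weeks above cap" rule is easy to automate from weekly page counts. In this sketch the function names are illustrative and the cap is a parameter; the value of 15 pages per week in the usage example is taken from the rotation-review threshold mentioned earlier in the lesson.

```python
from typing import Sequence


def longest_streak_above_cap(weekly_pages: Sequence[int], cap: int) -> int:
    """Length of the longest run of consecutive weeks with more pages than the cap."""
    longest = current = 0
    for count in weekly_pages:
        current = current + 1 if count > cap else 0
        longest = max(longest, current)
    return longest


def is_reliability_incident(weekly_pages: Sequence[int], cap: int, max_weeks: int = 2) -> bool:
    """True when the cap was exceeded for more than max_weeks consecutive weeks."""
    return longest_streak_above_cap(weekly_pages, cap) > max_weeks


if __name__ == "__main__":
    history = [10, 16, 17, 18, 9]  # hypothetical weekly page counts for one rotation
    print(is_reliability_incident(history, cap=15))
```

A single bad week resets nothing and triggers nothing; only an unbroken run longer than the threshold is treated as an incident.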

The final point is the hardest, politically: if an organization consistently ignores on-call overload, the most experienced engineers, the ones who know the systems best and can fix the problems, will leave first. They have options. What remains is a team that has lost much of its institutional knowledge and is even less capable of reducing page volume, which creates a reliability death spiral. Sustainable on-call is not a perk; it is a prerequisite for long-term system reliability.