On-Call Done Right
On-call is the beating heart of incident response — and the most common source of engineer burnout in the industry. Done poorly, a single on-call rotation can destroy team morale, erode reliability through fatigued decision-making, and drive away your best engineers. Done right, it becomes a manageable part of the job that produces compounding reliability improvements over time. The difference is almost entirely structural, not about individual toughness.
This lesson covers the mechanics and culture of sustainable on-call: how rotations are designed, how shifts are handed off without information loss, how engineers are compensated fairly, and how you measure and control paging load so that it stays humane. These are not abstract ideals — they are practices that Google, PagerDuty, Stripe, and Shopify run in production at scale.
Rotation Design: Who Gets Paged and When
A rotation is the schedule that determines which engineer is the primary on-call responder at any given moment. Every rotation design involves tradeoffs between coverage quality, engineer load, and operational complexity.
Follow-the-sun rotations are the gold standard for global teams. Engineers in each time zone cover business hours for their region, so nobody is paged at 3 AM. A typical implementation has three regions — Americas, EMEA, APAC — with overlapping handoff windows of 30–60 minutes. The requirement is a team distributed across those zones. Many companies achieve this organically as they scale; smaller teams often cannot.
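As a rough illustration of the three-region pattern, the lookup below maps a timestamp to the region that should hold the pager. The window boundaries are hypothetical UTC hours chosen for the example, not values from any particular company, and the handoff overlap is left out for brevity.

```python
from datetime import datetime, timezone

# Hypothetical coverage windows as UTC hours [start, end); adjust to your regions.
REGION_WINDOWS_UTC = {
    "APAC": (0, 8),
    "EMEA": (8, 16),
    "AMERICAS": (16, 24),
}

def region_on_call(moment: datetime) -> str:
    """Return the region whose coverage window contains the given moment."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    hour = moment.astimezone(timezone.utc).hour
    for region, (start, end) in REGION_WINDOWS_UTC.items():
        if start <= hour < end:
            return region
    raise ValueError(f"no region covers UTC hour {hour}")
```

Keeping the windows in UTC avoids daylight-saving surprises in the lookup itself; a real schedule would still need to decide whether the local business hours or the UTC boundaries move when clocks change.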
Weekly rotations are the most common pattern for single-timezone or small teams. One engineer is primary for seven days, a second is secondary (escalation target if primary does not respond within a defined SLA, typically five minutes). Weekly rotations keep the cognitive context window manageable — you have a week to internalize system state — but seven consecutive days of primary is psychologically draining on high-paging services. The mitigation is to ensure no engineer is primary more often than one week in every four to six, and to aggressively reduce page volume.
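The primary-to-secondary escalation rule can be sketched in a few lines. This is a minimal illustration, not how any specific paging tool implements it; the function name and the five-minute constant are assumptions taken from the "typically five minutes" figure above.

```python
from datetime import timedelta
from typing import Optional

# Assumed acknowledgement window; the text calls five minutes typical.
ACK_TIMEOUT = timedelta(minutes=5)

def page_target(elapsed: timedelta, acknowledged: bool,
                primary: str, secondary: str) -> Optional[str]:
    """Return who should be paged now, or None once someone has acknowledged."""
    if acknowledged:
        return None
    # Before the window expires the primary owns the page; after it, escalate.
    return primary if elapsed < ACK_TIMEOUT else secondary
```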
Weekend splitting is a refinement many teams adopt: the seven-day rotation is broken into a weekday block (Mon–Fri) and a weekend block (Sat–Sun), with different engineers covering each. Weekend pages have a higher psychological cost because they interrupt genuine off-time. Splitting lets the team compensate weekend on-call more heavily and distributes that cost more fairly.
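One way to generate a weekly rotation with a weekend split is shown below. It is a sketch under simple assumptions: a fixed roster, round-robin order, and a weekend primary offset by half the roster so that nobody covers both blocks of the same week. The function and field names are hypothetical.

```python
from datetime import date, timedelta

def build_schedule(engineers, first_monday, weeks):
    """Assign a weekday (Mon-Fri) and a weekend (Sat-Sun) primary for each week."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers to split weekends")
    if first_monday.weekday() != 0:
        raise ValueError("first_monday must be a Monday")
    n = len(engineers)
    offset = n // 2  # always between 1 and n-1, so the two primaries differ
    schedule = []
    for week in range(weeks):
        monday = first_monday + timedelta(weeks=week)
        schedule.append({
            "weekday_start": monday,
            "weekday_primary": engineers[week % n],
            "weekend_start": monday + timedelta(days=5),  # Saturday
            "weekend_primary": engineers[(week + offset) % n],
        })
    return schedule
```

With a roster of four, each engineer is weekday primary one week in four, which sits at the tight end of the "one week in every four to six" guideline; a larger roster loosens it.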
The Handoff: Engineering Shift Continuity
A rotation without a disciplined handoff is just a random assignment of pager duty. The outgoing on-call engineer holds context that is not in any runbook: which services are currently degraded and why, which alerts are silenced and until when, which deployments are in-flight, and which investigation threads are open. Transferring that context cleanly is an engineering discipline, not a courtesy.
Production-grade handoffs follow a structured format. The outgoing engineer writes a handoff note — typically in a shared doc, Slack channel, or the on-call management tool — covering: active incidents and their current status, silenced alerts (with expiry times and the reason for silencing), recent deployments in the last 24 hours, known degraded components, and any tickets or postmortem action items that became relevant during the shift. The incoming engineer reads and acknowledges the note before the handoff completes. A brief 15-minute synchronous call is strongly preferred over an async message for any shift that had significant activity.
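The handoff note described above can be modelled as a small record with an explicit acknowledgement step. The class and field names here are illustrative rather than taken from any on-call tool, and the rule for when a synchronous call is needed is an assumption (any active incident or degraded component counts as "significant activity").

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class SilencedAlert:
    name: str
    reason: str
    expires_at: datetime

@dataclass
class HandoffNote:
    outgoing: str
    incoming: str
    active_incidents: List[str] = field(default_factory=list)
    silenced_alerts: List[SilencedAlert] = field(default_factory=list)
    recent_deployments: List[str] = field(default_factory=list)  # last 24 hours
    degraded_components: List[str] = field(default_factory=list)
    relevant_tickets: List[str] = field(default_factory=list)
    acknowledged_by: Optional[str] = None

    def acknowledge(self, engineer: str) -> None:
        """Record that the incoming engineer has read the note."""
        if engineer != self.incoming:
            raise ValueError("only the incoming engineer can acknowledge")
        self.acknowledged_by = engineer

    def is_complete(self) -> bool:
        """The handoff completes only after the incoming engineer acknowledges."""
        return self.acknowledged_by == self.incoming

    def needs_sync_call(self) -> bool:
        # Assumed definition of "significant activity" for this sketch.
        return bool(self.active_incidents or self.degraded_components)
```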
Compensation: Paying for On-Call Fairly
On-call involves real opportunity cost: engineers cannot travel freely, must keep a laptop nearby, must abstain from alcohol during primary shifts, and are mentally preoccupied even when not actively paged. Compensation structures that ignore this fact create resentment and attrition. Structures that acknowledge it create buy-in.
There are three common models. Flat stipend: a fixed weekly payment for each primary shift, regardless of page volume. It is straightforward, predictable, and easy to administer. A typical range at mid-size tech companies is $200–500/week for primary; secondary is often unpaid or paid $50–100. Per-incident pay: engineers are paid per page or per acknowledged incident above a baseline. This aligns compensation more closely with actual burden, but requires a careful definition of what counts. Comp time: engineers who are paged outside business hours earn time off in return, often at a 1:1 or 1.5:1 ratio (an hour or an hour and a half off for each hour spent responding during nights or weekends). Comp time is common in organizations where budget is constrained but scheduling flexibility is high.
The most important compensation principle is that on-call should never be free. When on-call is uncompensated, there is no financial signal that high page volume is costly to the organization, which removes a key incentive for management to invest in reliability improvements. A team whose $3,000/week on-call budget is regularly spent (roughly $156,000 a year) has a much stronger business case for a $50,000 engineering project to reduce pages than a team where on-call is "just part of the job."
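The three models reduce to simple arithmetic, sketched below. The function names and all rates used in the usage note are hypothetical, and the comp-time function assumes the ratio applies to hours actually spent responding outside business hours.

```python
def flat_stipend(primary_weeks: int, weekly_rate: float) -> float:
    """Fixed payment per primary week, independent of page volume."""
    return primary_weeks * weekly_rate

def per_incident_pay(pages: int, baseline: int, rate_per_page: float) -> float:
    """Pay only for pages above the agreed baseline."""
    return max(0, pages - baseline) * rate_per_page

def comp_time_hours(off_hours_worked: float, ratio: float = 1.5) -> float:
    """Time off earned for off-hours response work at the given ratio."""
    return off_hours_worked * ratio
```

For example, two primary weeks at an assumed $300/week give `flat_stipend(2, 300)` = 600, seven pages against a baseline of three at an assumed $50 each give `per_incident_pay(7, 3, 50)` = 200, and four off-hours hours at 1.5:1 give `comp_time_hours(4)` = 6.0 hours off.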
Sustainable Paging Load: The Numbers That Matter
Sustainable paging load is the most operationally concrete concept in this lesson. Google's SRE book sets a specific ceiling: at most two incidents per twelve-hour on-call shift. The figure comes from a time budget: the book estimates that handling one incident end to end, including root-cause analysis, remediation, and follow-up such as the postmortem, takes about six hours on average, so two incidents fill a shift. "Actionable" means the page requires investigation and a decision, not just acknowledging a flap that auto-resolved. Treat the ceiling as a hard cap rather than a soft guideline: an engineer working a third complex incident in half a day is likely fatigued and has no time left to follow up properly on the first two.
In practice, measuring paging load requires tooling. You need to track: total pages per engineer per week, pages during business hours vs. outside, mean time to acknowledge (MTTA), mean time to resolve (MTTR), pages that required escalation, and pages that auto-resolved before human action (these should be silenced at the source, not acknowledged by humans). Most teams instrument this in PagerDuty, OpsGenie, or a Prometheus-based dashboard.
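If the paging tool can export raw page records, the metrics listed above can be computed directly. The sketch below assumes a hypothetical `Page` record and a simple definition of business hours (09:00 to 17:00, Monday to Friday, in the timestamp's own timezone); it is not the export format of PagerDuty, OpsGenie, or any other product.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List, Optional

@dataclass
class Page:
    engineer: str
    fired_at: datetime
    acked_at: Optional[datetime]   # None if no human acknowledged it
    resolved_at: datetime
    escalated: bool = False
    auto_resolved: bool = False    # cleared before any human action

def is_off_hours(t: datetime) -> bool:
    # Assumed business hours: 09:00-17:00, Monday to Friday.
    return t.weekday() >= 5 or not (9 <= t.hour < 17)

def _mean_minutes(deltas) -> Optional[float]:
    return mean(d.total_seconds() for d in deltas) / 60 if deltas else None

def summarize(pages: List[Page]) -> dict:
    """Aggregate one reporting period of pages into the core load metrics."""
    # MTTA and MTTR only make sense for pages a human actually handled.
    handled = [p for p in pages if not p.auto_resolved and p.acked_at is not None]
    per_engineer = {}
    for p in pages:
        per_engineer[p.engineer] = per_engineer.get(p.engineer, 0) + 1
    return {
        "pages_per_engineer": per_engineer,
        "off_hours_pages": sum(1 for p in pages if is_off_hours(p.fired_at)),
        "mtta_minutes": _mean_minutes([p.acked_at - p.fired_at for p in handled]),
        "mttr_minutes": _mean_minutes([p.resolved_at - p.fired_at for p in handled]),
        "escalated": sum(1 for p in pages if p.escalated),
        "auto_resolved": sum(1 for p in pages if p.auto_resolved),
    }
```

A non-zero `auto_resolved` count is the list of candidates to silence at the source.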
Alert Fatigue: The Silent Reliability Killer
Alert fatigue occurs when engineers receive so many pages — especially noisy, low-signal, or auto-resolving ones — that they begin to treat all alerts as background noise. The outcome is delayed responses to real incidents, habituation to failure states, and eventual desensitization that produces the exact miss that causes a major outage. Alert fatigue is not a character flaw; it is a predictable physiological response to sustained noise.
Preventing alert fatigue requires active alerting hygiene practices. Every alert that pages should have three properties: it should be actionable (there is a specific thing a human must do right now), it should be urgent (it cannot wait until business hours), and it should be unique (no other alert already captures this failure mode). Alerts that do not meet all three criteria should be either converted to dashboard warnings, demoted to tickets, or deleted. "Just in case" alerts are alert fatigue in disguise.
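The three-property test can be written as a small triage function. The text does not say which failing property maps to which outcome, so the mapping below is one reasonable assumption: duplicates are deleted, non-actionable alerts become dashboard warnings, and actionable but non-urgent alerts become tickets.

```python
from enum import Enum

class Disposition(Enum):
    PAGE = "page"
    TICKET = "ticket"
    DASHBOARD = "dashboard"
    DELETE = "delete"

def triage(actionable: bool, urgent: bool, unique: bool) -> Disposition:
    """Decide what to do with an alert; only alerts meeting all three criteria page."""
    if not unique:
        return Disposition.DELETE      # another alert already covers this failure
    if not actionable:
        return Disposition.DASHBOARD   # nothing for a human to do right now
    if not urgent:
        return Disposition.TICKET      # real work, but it can wait for business hours
    return Disposition.PAGE
```

Running every existing alert rule through a check like this during a periodic review is a cheap way to surface "just in case" alerts.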
On-Call Onboarding: Shadowing Before Paging
A new engineer should never be thrown onto primary on-call without preparation. The standard big-tech pattern is a shadowing period — typically two to four weeks — during which the new engineer is added to the rotation in a "shadow" role: they receive all the same pages as the primary, but a senior engineer holds the actual primary responsibility. The shadow is expected to investigate alongside the primary, write their own hypotheses, and suggest actions — but the senior makes the final call. This approach ensures that by the time the new engineer rotates to primary, they have seen a representative sample of the failure modes they will face.
Shadowing should be paired with a structured runbook review, a production environment walkthrough, and at least one simulated incident (a fire drill using a staging environment or a controlled chaos experiment). The goal is that the first real incident a new on-call engineer handles alone should feel familiar, not novel.
When On-Call Is Unsustainable: The Escalation
Sometimes paging load exceeds the cap despite best efforts — a new service launch, a traffic spike, or a cascading failure creates an unsustainable burst. The engineering response to this is structured, not heroic. First, escalate immediately: the incident commander or engineering manager must be informed when any engineer exceeds the page cap, because they have the authority to pull in additional engineers, defer non-critical deploys, or trigger a service degradation to reduce load. Second, document every excess page in real time — not as a blame trail, but as data that funds the postmortem and the remediation roadmap. Third, treat sustained exceedance (more than two consecutive weeks above cap) as a service reliability incident, not an operational inconvenience. It gets a postmortem, an action list, and dedicated engineering time.
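The "more than two consecutive weeks above cap" trigger is easy to automate once weekly page counts are tracked. In this sketch the weekly cap is a hypothetical parameter that each team derives from its own per-shift limit and shift pattern.

```python
from typing import Sequence

def sustained_exceedance(weekly_pages: Sequence[int], weekly_cap: int,
                         max_consecutive: int = 2) -> bool:
    """True if pages exceed the cap for more than max_consecutive weeks in a row."""
    streak = 0
    for count in weekly_pages:
        # A week at or under the cap resets the streak.
        streak = streak + 1 if count > weekly_cap else 0
        if streak > max_consecutive:
            return True
    return False
```

With an assumed cap of 28, `sustained_exceedance([10, 30, 31, 29], 28)` is true because three weeks in a row are over the cap, while `[30, 31, 10, 30]` is not, since the quiet third week breaks the run.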
The final point is the hardest, politically: if an organization consistently ignores on-call overload, the most experienced engineers, the ones who know the systems best and can fix the problems, will leave first. They have options. What remains is a team that has lost much of its institutional knowledge and is even less capable of reducing page volume, which creates a reliability death spiral. Sustainable on-call is not a perk; it is a prerequisite for long-term system reliability.