Toil & Automation
Google's SRE book introduced a word that immediately resonated with every operations engineer who read it: toil. Not because it was new — every ops team had been drowning in it for years — but because naming it gave teams permission to treat it as a problem worth solving systematically. This lesson defines toil precisely, explains why Google's 50% rule exists and how it is enforced, and shows the practical automation patterns that top-tier SRE teams use to eliminate it at the source.
Defining Toil: What It Is and What It Is Not
Toil is not simply "work I dislike." It has a precise, operational definition that distinguishes it from valuable engineering work. Toil is work that is:
- Manual: requires a human to execute each time — no automation runs it.
- Repetitive: the same sequence of actions is performed again and again across time or incidents.
- Automatable: a machine could perform the task if someone wrote the code.
- Reactive and interrupt-driven: triggered by a ticket, a page, or a user request rather than scheduled engineering work.
- Tactical, not strategic: it produces no enduring value. When you finish, the service is in the same state it was in before the problem appeared, and nothing prevents the problem from recurring. The on-call runbook exists because this task will need to be done again.
- Scaling linearly with service growth: as request volume or fleet size doubles, the volume of this work also doubles, unless eliminated.
Examples of toil in a mature production environment: manually restarting pods that OOM-kill, rotating credentials by hand via copy-paste into a secrets manager, provisioning new database read replicas through a UI every time a service team requests one, responding to Slack pings asking "can you check why service X is slow?", manually trimming a disk that fills every three days on a known schedule, and reviewing a certificate expiry spreadsheet every Monday.
Work that is not toil, even though it feels painful: debugging a novel production failure (non-repetitive, requires human judgment), designing a new alerting schema (produces enduring value), writing the automation that eliminates a toil task (engineering work that permanently reduces future toil), and performing a production readiness review (one-time strategic work).
The 50% Rule: Engineering Time Belongs to Engineering
Google's SRE model mandates that no SRE should spend more than 50% of their working time on toil. The remaining 50% must be spent on project work: automation, reliability improvements, tooling, and capacity planning — work that permanently reduces future toil or improves service health.
This is not a soft aspiration. It is an operational contract between SRE teams and the product teams they support. If an SRE team consistently exceeds the 50% toil cap:
- The SRE manager escalates to product engineering leadership — the product team owns the fix.
- The SRE team temporarily stops accepting new services into their portfolio until the root cause is addressed.
- In extreme cases, the SRE team "returns" a service to the product team to operate themselves (the "hand-back" mechanism).
The 50% rule exists because of a mathematical reality: if SRE toil grows linearly with service growth and SREs spend 100% of their time on it, you need to hire an SRE for every N servers or requests. This is the Ops model Google was trying to escape. The 50% cap forces automation investment before a service's toil overwhelms the team.
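The linear-scaling argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the per-server toil figure and the weekly budget are assumed numbers, not measurements from any real team.

```python
def sres_needed(servers: int, toil_minutes_per_server_week: int,
                toil_budget_minutes_per_sre_week: int) -> int:
    """Headcount needed to absorb all toil, rounded up to whole people."""
    total_toil = servers * toil_minutes_per_server_week
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-total_toil // toil_budget_minutes_per_sre_week)

# Assumed figures: 3 minutes of toil per server per week, and a pure-ops
# team that spends its full 40-hour week (2400 minutes) on toil.
for servers in (4_000, 8_000, 16_000):
    print(servers, sres_needed(servers, 3, 2_400))
```

With per-server toil held constant, doubling the fleet doubles the headcount. The only way to break that relationship is to lower the per-server toil figure, which is what the protected engineering time is for.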
Measuring Toil: You Cannot Reduce What You Cannot Count
Before automating anything, quantify toil. SRE teams track toil systematically. At minimum, instrument your on-call rotation to capture three data points per ticket or page: the category (type of task), the time spent (including context-switch cost), and whether it is automatable. Most teams do this with a label in their ticketing system and a weekly review.
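As a minimal sketch of what the weekly review might compute, the snippet below aggregates the three data points per category and ranks categories by automatable time. The `ToilEntry` type, the category names, and the sample numbers are hypothetical; in practice the entries would come from an export of your ticketing system.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class ToilEntry:
    category: str       # type of task, e.g. "pod-restart"
    minutes: int        # time spent, including context-switch cost
    automatable: bool   # could a machine have done this?

def summarize(entries):
    """Group entries by category, sorted by automatable minutes (highest first)."""
    totals = defaultdict(lambda: {"count": 0, "minutes": 0, "automatable_minutes": 0})
    for entry in entries:
        row = totals[entry.category]
        row["count"] += 1
        row["minutes"] += entry.minutes
        if entry.automatable:
            row["automatable_minutes"] += entry.minutes
    return sorted(totals.items(),
                  key=lambda item: item[1]["automatable_minutes"],
                  reverse=True)

def toil_fraction(entries, team_minutes_available):
    """Share of the team's available time that went to toil."""
    return sum(entry.minutes for entry in entries) / team_minutes_available

week = [
    ToilEntry("pod-restart", 20, True),
    ToilEntry("pod-restart", 25, True),
    ToilEntry("replica-provisioning", 60, True),
    ToilEntry("novel-debugging", 90, False),
]
for category, row in summarize(week):
    print(category, row)
```

Sorting by automatable minutes rather than raw count puts the best automation candidates at the top of the review, and `toil_fraction` gives the number to compare against the 50% cap.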
Automating Toil Away: Patterns and Tools
The goal is not to automate for automation's sake — it is to eliminate the human decision loop from tasks where the right action is deterministic. There are four canonical patterns used at big-tech companies:
1. Runbook-to-Bot Conversion. An on-call runbook that says "if service X OOM-kills, SSH in and restart the pod" is a bot waiting to be written. The progression: runbook (fully manual) → script invoked manually by on-call → script triggered automatically by the alert → no alert at all because the condition is self-healing. Most teams stop at step 2. SRE teams push to step 3 or 4.
2. Self-Healing Kubernetes Controllers. In Kubernetes environments, the correct place to encode recovery logic is a custom controller or an operator, not an on-call runbook. If a pod consistently OOM-kills at a predictable memory footprint, the right fix is a VPA (Vertical Pod Autoscaler) recommendation applied automatically, not a weekly manual increase. If a node pool becomes degraded, node auto-repair (offered by managed services such as GKE) or an equivalent remediation controller should drain and replace the unhealthy nodes without human involvement; Cluster Autoscaler on its own only adds and removes nodes in response to scheduling demand.
3. Event-Driven Remediation with Lambda/Cloud Functions. CloudWatch Alarm → EventBridge rule → Lambda function that calls the AWS API to take corrective action. This pattern is used heavily for RDS failovers, EC2 instance recovery, and ECS task restarts. Because the function acts through the AWS API, CloudTrail records the management calls it makes, so a baseline audit trail is automatic.
4. Policy Engines as Toil Eliminators. A surprising amount of toil comes from humans manually enforcing policies: "make sure all S3 buckets have versioning enabled," "ensure all new Lambda functions have a DLQ." These are deterministic rules. Push them into a policy engine (AWS Config Rules + auto-remediation, OPA/Gatekeeper for Kubernetes, HashiCorp Sentinel for Terraform) and the enforcement becomes continuous and automatic.
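Pattern 3 can be sketched as a Lambda handler that receives a CloudWatch alarm state-change event from EventBridge and forces a new ECS deployment. This is a sketch under stated assumptions, not a production implementation: the `ALARM_TARGETS` mapping, the alarm name, and the cluster and service names are hypothetical, and the optional `ecs` parameter exists only so the handler can be exercised without AWS credentials.

```python
import json
import logging

logger = logging.getLogger(__name__)

# Hypothetical mapping from alarm name to the ECS service it protects.
ALARM_TARGETS = {
    "checkout-unhealthy-tasks": {"cluster": "prod", "service": "checkout"},
}

def handler(event, context=None, ecs=None):
    """React to a 'CloudWatch Alarm State Change' event delivered by EventBridge."""
    detail = event.get("detail", {})
    alarm = detail.get("alarmName")
    state = detail.get("state", {}).get("value")

    # Act only on known alarms that have entered the ALARM state.
    if state != "ALARM" or alarm not in ALARM_TARGETS:
        return {"action": "none", "alarm": alarm}

    if ecs is None:
        import boto3  # available in the Lambda Python runtime
        ecs = boto3.client("ecs")

    target = ALARM_TARGETS[alarm]
    # Replace the service's running tasks with fresh ones.
    ecs.update_service(cluster=target["cluster"],
                       service=target["service"],
                       forceNewDeployment=True)

    result = {"action": "force_new_deployment", "alarm": alarm, **target}
    logger.info(json.dumps(result))  # structured record of why we acted
    return result
```

Keeping the alarm-to-target mapping explicit means an unknown alarm results in no action rather than a guess, which matters once the function has permission to change production.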
The Automation Trap: When Automation Creates New Toil
Automation is not free. Badly designed automation introduces its own class of toil: the automation breaks in unexpected ways, fires at the wrong time, requires manual intervention to reset, generates noisy false-positive alerts, and becomes a system that itself needs to be operated. Big-tech SREs guard against this with three practices:
- Automation must have a dry-run mode. Every remediation bot should support a --dry-run flag that logs what it would do without taking action. This allows safe testing before enabling live execution. A bot that goes live without ever having run in dry-run mode can take down production.
- Automation must be idempotent. If the remediation fires three times in rapid succession because of a flapping alert, the end state should be the same as if it had fired once. Non-idempotent automation (e.g., appending to a config file on each run) is worse than the toil it replaced.
- Every automated action must be logged and auditable. When an automated system touches production, the audit trail must be at least as complete as it would be for a human. Log the action, the trigger, the timestamp, and the outcome to CloudTrail, a structured log sink, or both.
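The three practices above fit together in a few lines. The following sketch uses a plain dictionary as a stand-in for real service configuration and a list as a stand-in for a log sink; the function name, field names, and trigger string are hypothetical.

```python
import time

def ensure_memory_limit(config, limit_mib, *, trigger, audit_log, dry_run=True):
    """Set config['memory_limit_mib'] to limit_mib, safely and repeatably."""
    current = config.get("memory_limit_mib")
    if current == limit_mib:
        outcome = "noop"            # idempotent: already in the desired state
    elif dry_run:
        outcome = "would_change"    # dry run: report the change, do not apply it
    else:
        config["memory_limit_mib"] = limit_mib
        outcome = "changed"

    # Audit every invocation, including no-ops and dry runs.
    record = {
        "action": "ensure_memory_limit",
        "trigger": trigger,
        "timestamp": time.time(),
        "before": current,
        "desired": limit_mib,
        "outcome": outcome,
        "dry_run": dry_run,
    }
    audit_log.append(record)
    return record

config = {"memory_limit_mib": 512}
audit = []
ensure_memory_limit(config, 1024, trigger="alert:oom-kill", audit_log=audit)
for _ in range(3):  # a flapping alert fires the remediation three times
    ensure_memory_limit(config, 1024, trigger="alert:oom-kill",
                        audit_log=audit, dry_run=False)
print([r["outcome"] for r in audit])
# → ['would_change', 'changed', 'noop', 'noop']
```

The remediation declares a desired state instead of performing a relative change such as "add 512 MiB", which is what makes repeated firings harmless. Defaulting `dry_run` to `True` means live execution has to be requested explicitly.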
Measuring Toil Reduction: Closing the Loop
Automation investments need to be justified and validated. After deploying an automation, measure: how many manual interventions per week did this eliminate? What is the estimated engineering-hours saved per quarter? Did the toil-to-engineering ratio improve? These numbers go into the SRE team's quarterly review and justify further investment.
A practical way to track automation efficacy is to expose a Prometheus counter of interventions, labeled by task and by whether the intervention was automated or manual.
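One possible shape for such a metric is sketched below using only the standard library, rendering the counter in the Prometheus text exposition format. The metric name `toil_interventions_total` and its `task` and `mode` labels are assumptions chosen for illustration; a real service would more likely use a Prometheus client library than hand-rendered text.

```python
from collections import Counter

class InterventionMetrics:
    """In-memory counter of interventions, keyed by (task, mode)."""

    def __init__(self):
        self._counts = Counter()

    def record(self, task, mode):
        """mode is 'automated' or 'manual'."""
        self._counts[(task, mode)] += 1

    def automation_ratio(self, task):
        """Fraction of interventions for `task` handled without a human."""
        automated = self._counts[(task, "automated")]
        manual = self._counts[(task, "manual")]
        total = automated + manual
        return automated / total if total else 0.0

    def render(self):
        """Render all series in the Prometheus text exposition format."""
        lines = [
            "# HELP toil_interventions_total Interventions by task and mode.",
            "# TYPE toil_interventions_total counter",
        ]
        for (task, mode), count in sorted(self._counts.items()):
            lines.append(
                f'toil_interventions_total{{task="{task}",mode="{mode}"}} {count}'
            )
        return "\n".join(lines) + "\n"

metrics = InterventionMetrics()
for _ in range(3):
    metrics.record("pod-restart", "automated")
metrics.record("pod-restart", "manual")
print(metrics.render())
print(metrics.automation_ratio("pod-restart"))
```

Once scraped, a query such as `sum(increase(toil_interventions_total{mode="manual"}[7d]))` would give the manual interventions per week, the first of the numbers the quarterly review asks for.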
The toil-and-automation discipline is the feedback loop that makes an SRE team self-improving. Every incident that results in a manual action is a candidate for automation. Every automation deployed is time returned to the team for reliability engineering. Run this loop consistently and the team's operational burden decreases even as the services under their care grow.