Recording & Alerting Rules
Raw PromQL gives you expressive power, but two critical capabilities live outside the query editor: recording rules — which precompute expensive expressions and materialize them as first-class metrics — and alerting rules — which continuously evaluate conditions and route actionable notifications to on-call engineers. Together they are the operational backbone of any production Prometheus deployment.
Why Recording Rules Exist
Prometheus evaluates every query from scratch each time it is issued. A complex expression, such as a 5-minute rate aggregated over thousands of time series, can consume significant CPU and memory on every dashboard load. When that same expression feeds three dashboards, two alert rules, and an SLO calculation, Prometheus evaluates it dozens of times per minute, and almost all of that work is redundant.
A recording rule tells Prometheus: compute this expression once per evaluation interval, and store the result as a new metric. Any subsequent query against that metric reads a handful of precomputed series instead of running a full aggregation pass. At large scale, recording rules stop being an optimisation and become a practical necessity.
Recording Rule File Structure
Rules live in YAML files loaded by the rule_files stanza in prometheus.yml. A file can hold multiple groups; each group has its own interval (defaults to global evaluation_interval).
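As a minimal sketch of that wiring, the fragment below shows how prometheus.yml might point at rule files. The rules/ directory and the 15s intervals are assumptions chosen for illustration, not values taken from any particular deployment.

```yaml
# prometheus.yml (fragment)
global:
  scrape_interval: 15s       # assumed value
  evaluation_interval: 15s   # default interval for every rule group

rule_files:
  - "rules/*.yml"            # hypothetical directory; glob patterns are accepted
```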
The naming convention level:metric:operations is the official Prometheus recommendation (also called the recording rule naming scheme). level is the aggregation scope (job, instance, cluster), metric is the base metric name, and operations describes what was done (rate5m, p99_5m, ratio1h). Following this convention makes rules discoverable and prevents naming collisions across teams.
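A rule file following this convention could look like the sketch below. The file name, group name, and the http_requests_total metric with its status label are assumptions used only to illustrate the level:metric:operations pattern.

```yaml
# rules/recording.yml (hypothetical file name)
groups:
  - name: http_recording_rules
    interval: 30s   # optional; falls back to the global evaluation_interval
    rules:
      # level = job, metric = http_requests, operations = rate5m
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Share of 5xx responses per job over the last 5 minutes
      - record: job:http_requests_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
```

Dashboards and alerts can then query job:http_requests:rate5m directly instead of repeating the aggregation.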
Alerting Rules — The Anatomy of a Good Alert
An alerting rule is a PromQL expression evaluated once per evaluation interval. Each element (time series) in the result vector becomes an alert instance, which starts in the Pending state. Once the expression has kept returning that element for the whole for duration, the alert fires (Firing) and Prometheus hands it off to Alertmanager.
The for Clause: Why It Matters
The for duration is your primary defence against alert storms from transient spikes. A CPU spike that lasts 15 seconds should not page anyone at 3am. Setting for: 5m means the condition must be consistently true for five full minutes, filtering out noise while still catching sustained degradation.
Avoid for: 0s on high-traffic metrics. Every hiccup (a single slow scrape, a brief GC pause, a rolling restart) will trigger the alert. Production alert rules on rate-based metrics should have a minimum for of 2–5 minutes. The exception is up == 0 (instance reachability), which can reasonably use for: 1m.
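The two for settings discussed above might be written as in the sketch below. It assumes node_exporter's node_cpu_seconds_total metric is being scraped; the group name, alert names, and the 0.9 threshold are illustrative choices.

```yaml
groups:
  - name: for_clause_examples   # hypothetical group name
    rules:
      # Sustained CPU saturation: the condition must hold for 5 minutes,
      # so a 15-second spike never leaves the Pending state.
      - alert: HighCpuUsage
        expr: |
          1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
        for: 5m

      # Reachability: a shorter window is reasonable here.
      - alert: InstanceDown
        expr: up == 0
        for: 1m
```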
Labels and Annotations — Operational Quality
Labels on alert rules are merged into the alert's label set used by Alertmanager for routing. severity: page vs severity: ticket is the most common dimension; adding team enables per-team routing to separate PagerDuty services.
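On the Alertmanager side, routing on those labels could be sketched roughly as follows. The receiver names and the payments team are hypothetical, and the routing keys are placeholders to be supplied from your own PagerDuty services.

```yaml
# alertmanager.yml (fragment)
route:
  receiver: default-tickets          # fallback for anything not matched below
  group_by: ['alertname', 'job']
  routes:
    # Pages for the payments team go to that team's PagerDuty service.
    - matchers:
        - severity="page"
        - team="payments"
      receiver: payments-pagerduty
    # Every other page goes to a shared on-call rotation.
    - matchers:
        - severity="page"
      receiver: platform-pagerduty

receivers:
  - name: default-tickets
  - name: payments-pagerduty
    pagerduty_configs:
      - routing_key: <payments-routing-key>   # placeholder
  - name: platform-pagerduty
    pagerduty_configs:
      - routing_key: <platform-routing-key>   # placeholder
```

Routes are tried in order and the first match wins by default, so the more specific team route is listed first.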
Annotations carry human-readable context: summary (one-liner, appears in Slack), description (detailed, links to the runbook), and optionally runbook_url. Use Go template syntax to embed label values ({{ $labels.job }}) and the current value ({{ $value }}) directly in the notification.
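runbook_url is a widely used convention rather than a special field: Alertmanager treats it like any other annotation, and whether it shows up as a clickable link depends on the receiver and its notification template. Treat a missing runbook as a deployment blocker, the same way you treat a missing test.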
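Put together, a rule carrying routing labels and templated annotations might look like this sketch. The metric, the team name, the 5% threshold, and the runbook URL are assumptions for illustration.

```yaml
groups:
  - name: api_alerts   # hypothetical group name
    rules:
      - alert: HighErrorRatio
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: page      # consumed by Alertmanager routing
          team: payments      # hypothetical team name
        annotations:
          summary: "High 5xx ratio on {{ $labels.job }}"
          description: "{{ $labels.job }} is returning {{ $value | humanizePercentage }} errors over the last 5 minutes."
          runbook_url: "https://runbooks.example.com/high-error-ratio"   # placeholder URL
```

The annotation values are quoted because an unquoted value starting with {{ would be parsed by YAML as a flow mapping rather than a string.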
Validating and Reloading Rules
Prometheus ships promtool — use it in CI to validate rule files before they reach production.
Run promtool test rules test/alert_tests.yml in your pipeline. Then reload Prometheus at runtime — no restart required — with either a SIGHUP or the HTTP reload endpoint (when --web.enable-lifecycle is set): curl -X POST http://localhost:9090/-/reload.
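A unit test file for promtool test rules might look like the sketch below. It assumes a rule file at the hypothetical path shown, containing an InstanceDown alert with expr up == 0, for: 1m, a severity: page label, and no annotations; adjust the expectations to match your real rule.

```yaml
# test/alert_tests.yml
rule_files:
  - ../rules/alerts.yml   # hypothetical path to the rules under test

evaluation_interval: 1m

tests:
  - interval: 1m          # spacing between the synthetic samples below
    input_series:
      # Target is up for two samples, then down from the 2m mark onward.
      - series: 'up{job="api", instance="api-1:9090"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      # At 2m the condition has only just become true, so nothing is firing yet.
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts: []
      # By 4m the condition has held for longer than the 1m for duration.
      - eval_time: 4m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: page
              job: api
              instance: "api-1:9090"
```

Tests like this make the for duration part of the contract: changing it in the rule file without updating the test should fail the pipeline.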
Production Pitfalls
Missing the recording rule for an alert expression: If your alert references a raw expensive query rather than a precomputed recording rule, under high load the evaluation may time out and the alert will silently not fire — precisely when you need it most. Always back critical alerts with recording rules.
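One way to back an alert with a recording rule is to keep both in the same group, as in this sketch. Rules within a group are evaluated sequentially, so the alert reads the value recorded in the same evaluation cycle. The metric and rule names here are assumptions.

```yaml
groups:
  - name: api_slo   # hypothetical group name
    interval: 30s
    rules:
      # The expensive aggregation runs once per interval...
      - record: job:http_requests_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # ...and the alert only compares the precomputed series to a threshold.
      - alert: HighErrorRatio
        expr: job:http_requests_errors:ratio_rate5m > 0.05
        for: 5m
        labels:
          severity: page
```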
Label cardinality explosion: Adding a high-cardinality label (e.g. user_id) to a recording rule produces one stored time series per user. That is usually catastrophic. Aggregate unbounded labels away and keep recording rules at job or service granularity (instance at most), never at user or request granularity.
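The difference can be seen in a sketch like the one below, which assumes a hypothetical api_requests_total counter carrying a user_id label. The problematic rule is left commented out for comparison.

```yaml
groups:
  - name: cardinality_safe_rules   # hypothetical group name
    rules:
      # Avoid: keeps user_id, so one series is stored per user.
      # - record: job_user:api_requests:rate5m
      #   expr: sum by (job, user_id) (rate(api_requests_total[5m]))

      # Prefer: aggregate to job level, which drops user_id entirely.
      - record: job:api_requests:rate5m
        expr: sum by (job) (rate(api_requests_total[5m]))
```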
Evaluation interval vs. scrape interval mismatch: If you set a rule group interval shorter than the scrape interval for its underlying metrics, the rule will re-evaluate against the same data repeatedly, wasting CPU without producing new information. Keep rule intervals at or above the scrape interval.