Prometheus & Grafana

Recording & Alerting Rules

Raw PromQL gives you expressive power, but two critical capabilities live outside the query editor. Recording rules precompute expensive expressions and store the results as first-class metrics. Alerting rules continuously evaluate conditions and send firing alerts to Alertmanager, which routes actionable notifications to on-call engineers. Together they are the operational backbone of any production Prometheus deployment.

Why Recording Rules Exist

Prometheus evaluates a query every time it is requested: on each dashboard refresh, each alert evaluation, and each API call. A complex expression such as a 5-minute rate aggregated over thousands of time series can consume significant CPU and memory on every dashboard load. When that same expression feeds three dashboards, two alert rules, and an SLO calculation, Prometheus evaluates it dozens of times per minute, and nearly all of that work is redundant.

A recording rule tells Prometheus: compute this expression once per evaluation interval, and store the result as a new time series. Any subsequent query against that metric is a cheap read of precomputed samples instead of a full aggregation pass. At large scale, with thousands of targets and many dashboard users, recording rules are less an optimisation than a necessity.

When to create a recording rule: any PromQL expression that (1) takes more than ~100ms to evaluate, (2) is used in more than one place, or (3) feeds an alert that fires on a pre-aggregated value. Grafana's query inspector reports how long each query request took, which makes slow expressions easy to spot.

Recording Rule File Structure

Rules live in YAML files loaded by the rule_files stanza in prometheus.yml. A file can hold multiple groups; each group has its own interval (defaults to global evaluation_interval).

    # prometheus.yml (partial)
    global:
      evaluation_interval: 15s

    rule_files:
      - "rules/*.yml"

    # rules/request_rates.yml
    groups:
      - name: request_rates        # logical grouping name
        interval: 30s              # override global if needed
        rules:
          # Precompute per-job, per-status 5-min request rate
          - record: job:http_requests_total:rate5m
            expr: |
              sum by (job, status) (
                rate(http_requests_total[5m])
              )

          # Precompute error ratio for SLO dashboards
          - record: job:http_errors:ratio5m
            expr: |
              sum by (job) (
                rate(http_requests_total{status=~"5.."}[5m])
              )
              /
              sum by (job) (
                rate(http_requests_total[5m])
              )

          # p99 latency pre-aggregated per job
          - record: job:http_request_duration_seconds:p99_5m
            expr: |
              histogram_quantile(
                0.99,
                sum by (job, le) (
                  rate(http_request_duration_seconds_bucket[5m])
                )
              )

The naming convention level:metric:operations is the official Prometheus recommendation (also called the recording rule naming scheme). level is the aggregation scope (job, instance, cluster), metric is the base metric name, and operations describes what was done (rate5m, p99_5m, ratio1h). Following this convention makes rules discoverable and prevents naming collisions across teams.
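As a hedged illustration of the naming scheme, the sketch below derives a job-level series from an instance-level one. The file name, the group name naming_example, and the choice of http_request_duration_seconds_count as the base metric are assumptions made for this example. Rules inside a group run sequentially, so the second rule can read the series recorded by the first.

```yaml
# rules/naming_example.yml (hypothetical file, for illustration only)
groups:
  - name: naming_example
    rules:
      # level = instance: one series per (job, instance)
      - record: instance:http_request_duration_seconds_count:rate5m
        expr: |
          sum by (job, instance) (
            rate(http_request_duration_seconds_count[5m])
          )

      # level = job: built on the instance-level series recorded above
      - record: job:http_request_duration_seconds_count:rate5m
        expr: |
          sum by (job) (
            instance:http_request_duration_seconds_count:rate5m
          )
```

Reading the names left to right tells you which labels survive (the level), what was measured (the metric), and how it was transformed (the operations).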

Alerting Rules — The Anatomy of a Good Alert

An alerting rule is a PromQL expression evaluated at every rule evaluation interval. Each time series the expression returns becomes an alert instance in the Pending state. Once an instance has stayed active for the full for duration, it moves to Firing and Prometheus sends it to Alertmanager. If for is omitted, the alert fires on the first evaluation that returns a result.

    # rules/slo_alerts.yml
    groups:
      - name: slo_alerts
        rules:
          - alert: HighErrorRate
            # Use the precomputed recording rule: fast and consistent
            expr: job:http_errors:ratio5m > 0.01
            for: 5m                  # must stay true for 5 min before firing
            labels:
              severity: page         # routed to PagerDuty in Alertmanager
              team: backend
            annotations:
              summary: "High error rate on {{ $labels.job }}"
              description: |
                Error ratio is {{ $value | humanizePercentage }} for job {{ $labels.job }}.
                Runbook: https://wiki.internal/runbooks/high-error-rate

          - alert: LatencyP99Breach
            expr: job:http_request_duration_seconds:p99_5m > 0.5
            for: 10m
            labels:
              severity: page
              team: backend
            annotations:
              summary: "p99 latency > 500ms on {{ $labels.job }}"
              description: |
                p99 latency is {{ $value | humanizeDuration }} on {{ $labels.job }}.
                Runbook: https://wiki.internal/runbooks/high-latency

          - alert: ServiceDown
            expr: up == 0
            for: 1m
            labels:
              severity: page
            annotations:
              summary: "Instance {{ $labels.instance }} is down"
              description: "Job {{ $labels.job }} on {{ $labels.instance }} has been unreachable for over 1 minute."
Figure: Alert lifecycle. Inactive (expression returns no results) → Pending (waiting out the for duration) → Firing (alert sent to Alertmanager) → Alertmanager (route, inhibit, silence, group) → Notification (PagerDuty, Slack). When the expression is no longer true, the alert resolves and returns to Inactive.

The for Clause — Why It Matters

The for duration is your primary defence against alert storms from transient spikes. A CPU spike that lasts 15 seconds should not page anyone at 3am. Setting for: 5m means the condition must be true at every evaluation for five full minutes; a single evaluation where it is false returns the alert to Inactive and restarts the clock. This filters out noise while still catching sustained degradation.

Do not set for: 0s on high-traffic metrics. Every hiccup — a single slow scrape, a brief GC pause, a rolling restart — will trigger the alert. Production alert rules on rate-based metrics should have a minimum for of 2–5 minutes. The exception is up == 0 (instance reachability), which can reasonably use for: 1m.

Labels and Annotations — Operational Quality

Labels on alert rules are merged into the alert's label set used by Alertmanager for routing. severity: page vs severity: ticket is the most common dimension; adding team enables per-team routing to separate PagerDuty services.
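To show how these labels drive routing, here is a minimal sketch of an Alertmanager routing tree. The receiver names (default-slack, backend-pagerduty, ticket-queue) are hypothetical, and the integration settings each receiver would need are left out, so treat this as a shape to adapt rather than a working notification setup.

```yaml
# alertmanager.yml (partial, hypothetical receiver names)
route:
  receiver: default-slack            # fallback when no child route matches
  group_by: [alertname, job]
  routes:
    # Pages for the backend team go to that team's PagerDuty service
    - matchers:
        - severity="page"
        - team="backend"
      receiver: backend-pagerduty
    # Anything marked as a ticket goes to a lower-urgency queue
    - matchers:
        - severity="ticket"
      receiver: ticket-queue

receivers:
  # Integration settings (pagerduty_configs, slack_configs, ...) are omitted.
  - name: default-slack
  - name: backend-pagerduty
  - name: ticket-queue
```

Child routes are checked in order and the first match wins by default, so more specific matchers belong higher in the list.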

Annotations carry human-readable context: summary (one-liner, appears in Slack), description (detailed, links to the runbook), and optionally runbook_url. Use Go template syntax to embed label values ({{ $labels.job }}) and the current value ({{ $value }}) directly in the notification.

Runbook links in every alert. An alert without a runbook is a puzzle handed to a sleepy engineer. Annotations are free-form key-value pairs, and runbook_url is a widely used convention rather than a field Alertmanager treats specially; whether it shows up as a clickable link depends on the receiver and its notification template. Treat a missing runbook as a deployment blocker, the same way you treat a missing test.
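A minimal sketch of an alert that carries its runbook link in a dedicated runbook_url annotation, separate from the description text. The group name annotation_example is an assumption for this example; the expression and the wiki URL are reused from the rules shown earlier in this lesson.

```yaml
# rules/annotation_example.yml (hypothetical file, for illustration only)
groups:
  - name: annotation_example
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.01
        for: 5m
        labels:
          severity: page
          team: backend
        annotations:
          # One-liner shown as the notification title
          summary: "High error rate on {{ $labels.job }}"
          # Longer context with the templated current value
          description: "Error ratio is {{ $value | humanizePercentage }} for job {{ $labels.job }}."
          # Kept separate so notification templates can render it as a link
          runbook_url: https://wiki.internal/runbooks/high-error-rate
```

Keeping the URL in its own annotation lets a notification template reference it directly instead of parsing it out of the description.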

Validating and Reloading Rules

Prometheus ships promtool — use it in CI to validate rule files before they reach production.

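    # Validate all rule files
    promtool check rules rules/*.yml

    # Unit-test alert expressions (promtool test rules)
    # test/alert_tests.yml
    rule_files:
      # The alert reads a recorded series, so load the recording rules too
      - ../rules/request_rates.yml
      - ../rules/slo_alerts.yml

    evaluation_interval: 1m

    tests:
      - interval: 1m
        input_series:
          # Errors and successes both keep growing, giving a steady 5% error ratio
          - series: 'http_requests_total{job="api",status="500"}'
            values: '0+50x15'
          - series: 'http_requests_total{job="api",status="200"}'
            values: '0+950x15'
        alert_rule_test:
          - eval_time: 5m
            alertname: HighErrorRate
            exp_alerts: []          # still pending, not yet 5m
          - eval_time: 10m
            alertname: HighErrorRate
            exp_alerts:
              - exp_labels:
                  job: api
                  severity: page
                  team: backend
                # Annotations are compared as well, so they must match the rendered text
                exp_annotations:
                  summary: "High error rate on api"
                  description: |
                    Error ratio is 5% for job api.
                    Runbook: https://wiki.internal/runbooks/high-error-rate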

Run promtool test rules test/alert_tests.yml in your pipeline. Then reload Prometheus at runtime — no restart required — with either a SIGHUP or the HTTP reload endpoint (when --web.enable-lifecycle is set): curl -X POST http://localhost:9090/-/reload.
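One way to wire both promtool commands into a pipeline is sketched below as a GitHub Actions workflow. The choice of CI system, the workflow file name, and running promtool from the prom/prometheus container image are all assumptions for this example; any CI runner with promtool on its path can run the same two commands directly.

```yaml
# .github/workflows/prometheus-rules.yml (hypothetical CI job)
name: prometheus-rules
on:
  pull_request:
    paths:
      - "rules/**"
      - "test/**"

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Syntax and semantic checks on every rule file
      - name: Check rule syntax
        run: |
          docker run --rm -v "$PWD:/work" -w /work \
            --entrypoint promtool prom/prometheus \
            check rules rules/*.yml

      # Unit tests for alert expressions
      - name: Run rule unit tests
        run: |
          docker run --rm -v "$PWD:/work" -w /work \
            --entrypoint promtool prom/prometheus \
            test rules test/alert_tests.yml
```

Failing the pull request on either step keeps a malformed rule file from ever reaching the reload endpoint.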

Production Pitfalls

Missing the recording rule for an alert expression: If your alert references a raw expensive query rather than a precomputed recording rule, under high load the evaluation may time out and the alert will silently not fire — precisely when you need it most. Always back critical alerts with recording rules.

Label cardinality explosion: Adding a high-cardinality label (e.g. user_id) to a recording rule produces one stored time series per user. That is usually catastrophic. Keep recording rules at job or service granularity where possible, use instance-level rules only when a dashboard or alert needs them, and never record per-user or per-request series.
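The difference is easiest to see side by side. In the sketch below, api_requests_total and its user_id label are hypothetical names chosen for illustration; the commented-out rule is the pattern to avoid.

```yaml
# rules/cardinality_example.yml (hypothetical metric and label names)
groups:
  - name: cardinality_example
    rules:
      # Avoid: keeps user_id, so Prometheus stores one new series per user per job
      # - record: job_user:api_requests_total:rate5m
      #   expr: sum by (job, user_id) (rate(api_requests_total[5m]))

      # Prefer: aggregate user_id away before the result is stored
      - record: job:api_requests_total:rate5m
        expr: |
          sum by (job) (
            rate(api_requests_total[5m])
          )
```

With sum by (job), the number of stored series is bounded by the number of jobs, no matter how many users appear in the raw metric.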

Evaluation interval vs. scrape interval mismatch: If you set a rule group interval shorter than the scrape interval for its underlying metrics, the rule will re-evaluate against the same data repeatedly, wasting CPU without producing new information. Keep rule intervals at or above the scrape interval.
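A sketch of keeping the two intervals aligned follows. The job name batch-workers, its target address, and the metric batch_jobs_processed_total are hypothetical. The two YAML documents belong in separate files and are shown together, separated by ---, only for comparison.

```yaml
# prometheus.yml (partial): this job is scraped every 60s
global:
  scrape_interval: 30s
  evaluation_interval: 30s
scrape_configs:
  - job_name: batch-workers                  # hypothetical job
    scrape_interval: 60s                     # per-job override of the global value
    static_configs:
      - targets: ["batch-worker-1:9100"]     # hypothetical target
---
# rules/batch_rates.yml: group interval matches the 60s scrape interval
groups:
  - name: batch_rates
    interval: 60s                            # no faster than new samples arrive
    rules:
      - record: job:batch_jobs_processed_total:rate5m
        expr: |
          sum by (job) (
            rate(batch_jobs_processed_total{job="batch-workers"}[5m])
          )
```

Had the group inherited the 30s global evaluation interval, every second evaluation would have recomputed the rate over exactly the same samples.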