Prometheus & Grafana

Alertmanager

Prometheus fires alerts — it evaluates alerting rules and marks them as pending or firing. But Prometheus itself does not send emails, page on-call engineers, or post to Slack. That responsibility belongs to Alertmanager: a dedicated daemon that receives alert notifications from one or many Prometheus servers, applies routing logic, deduplicates, groups, suppresses, and fans out to the right people through the right channel at the right time. Understanding Alertmanager deeply is what separates a system that pages you fifty times during a single outage from one that sends a single, well-described ticket to the right team at 2 AM.

Core Concepts Before the Config

Alertmanager operates on alerts pushed by Prometheus over HTTP. Each alert carries a set of labels (the labels of the series that triggered the rule, plus any labels the rule adds and the server's external_labels), annotations, a generator URL, and start and end timestamps. Alertmanager's job is to answer four questions for every batch of incoming alerts:

Where does it go? — The routing tree maps label sets to receivers.
When does it go? — Grouping and group_wait / group_interval / repeat_interval control timing to avoid notification storms.
Should it be suppressed? — Silences and inhibition rules suppress redundant or expected noise.
Who gets it? — Receivers define the actual integration (PagerDuty, Slack, email, OpsGenie, webhook).
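As a rough illustration of what Alertmanager receives, the sketch below builds one alert in the JSON shape accepted by the /api/v2/alerts endpoint (labels, annotations, generatorURL, startsAt). The label values, the summary text, and the Prometheus URL are made up for the example.

```python
import json
from datetime import datetime, timezone


def build_alert(labels, annotations, generator_url, starts_at):
    """Return one alert object in the shape accepted by POST /api/v2/alerts."""
    return {
        "labels": dict(labels),              # identity of the alert; drives routing and grouping
        "annotations": dict(annotations),    # free-form context; not used for routing
        "generatorURL": generator_url,       # link back to the expression that fired
        "startsAt": starts_at.isoformat(),   # RFC 3339 timestamp
    }


example = build_alert(
    labels={"alertname": "NodeDown", "severity": "critical", "cluster": "us-east-1"},
    annotations={"summary": "Node has been unreachable for 5 minutes"},
    generator_url="http://prometheus.example:9090/graph",  # hypothetical URL
    starts_at=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
)

# Prometheus sends a JSON array, so a batch is simply a list of these objects.
body = json.dumps([example], indent=2)
print(body)
```

Everything that routing, grouping, inhibition, and silencing do later is decided from the labels object alone; annotations only surface in the notification text.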

The Routing Tree

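Alertmanager's routing configuration is a tree of route nodes. Each node carries matchers (exact or regex label matches), a receiver name, and optional timing overrides. Every alert enters at the root route, which must match everything, and is then tested against the child routes in order. It descends into the first child that matches and stops there unless that child sets continue: true, in which case later siblings are tested as well. An alert that matches no child is handled by the parent node, so the deepest matching node decides the receiver.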
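The tree walk can be modelled in a few lines. This is a simplified sketch, not Alertmanager's code: the Route class and resolve function are names chosen for the example, and only exact label matches are supported. The receiver names mirror the example configuration in this lesson.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    matchers: dict = field(default_factory=dict)  # exact label matches only, for brevity
    continue_: bool = False                        # mirrors `continue: true|false`
    routes: list = field(default_factory=list)

    def matches(self, labels):
        return all(labels.get(k) == v for k, v in self.matchers.items())


def resolve(route, labels):
    """Return the receivers that an alert with these labels is delivered to."""
    if not route.matches(labels):
        return []
    found = []
    for child in route.routes:
        hit = resolve(child, labels)
        found.extend(hit)
        if hit and not child.continue_:
            break  # first matching sibling wins unless it sets continue
    # No child matched: the current node handles the alert itself.
    return found or [route.receiver]


tree = Route("default-receiver", routes=[
    Route("pagerduty-infra", {"severity": "critical", "team": "infra"}),
    Route("slack-platform", {"severity": "warning", "team": "platform"}),
    Route("pagerduty-dba", {"component": "database"}),
])

alert = {"severity": "critical", "team": "infra", "component": "database"}
print(resolve(tree, alert))   # infra route matches first and stops the walk
tree.routes[0].continue_ = True
print(resolve(tree, alert))   # with continue, the database route also matches
```

Note how sibling order matters: a critical infra alert on a database never reaches the DBA rotation unless the infra route sets continue.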

The group_by key is critical and often misunderstood. It tells Alertmanager which label dimensions define a "group" for the purpose of batching. If you group by [alertname, cluster, env], all alerts sharing those three label values fire as a single notification, even if they differ on pod or instance. This prevents ten simultaneous pod restarts from generating ten separate pages.
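A minimal sketch of the batching effect, assuming ten hypothetical pod alerts that differ only in their pod label. The group_alerts helper is illustrative and only models how the group key is formed.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels (missing labels count as empty)."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Ten pods restarting in the same cluster and environment (hypothetical label values).
alerts = [
    {"alertname": "PodRestarting", "cluster": "us-east-1", "env": "production", "pod": f"api-{i}"}
    for i in range(10)
]

coarse = group_alerts(alerts, ["alertname", "cluster", "env"])
fine = group_alerts(alerts, ["alertname", "cluster", "env", "pod"])
print(len(coarse), "notification(s) when grouping by alertname, cluster, env")
print(len(fine), "notification(s) when pod is added to group_by")
```

Adding a high-cardinality label such as pod or instance to group_by quietly turns one page back into ten.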

# alertmanager.yml — production-grade routing example
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/T.../B.../xxx'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'env']
  group_wait: 30s        # wait before sending the first notification
  group_interval: 5m     # wait before sending new alerts added to an existing group
  repeat_interval: 4h    # re-notify if an alert is still firing after this interval

  routes:
    # Critical infra alerts → PagerDuty P1 immediately
    - matchers:
        - severity="critical"
        - team="infra"
      receiver: 'pagerduty-infra'
      group_wait: 10s
      repeat_interval: 30m
      continue: false

    # Warn-level alerts for the platform team → Slack only
    - matchers:
        - severity="warning"
        - team="platform"
      receiver: 'slack-platform'
      group_by: ['alertname', 'namespace']
      repeat_interval: 8h

    # Database alerts → separate on-call rotation
    - matchers:
        - component="database"
      receiver: 'pagerduty-dba'
      group_wait: 15s
      repeat_interval: 1h

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts-general'
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        # Double quotes so that \n is a real newline rather than a literal backslash-n
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"

  - name: 'pagerduty-infra'
    pagerduty_configs:
      - routing_key: '<PAGERDUTY_INTEGRATION_KEY>'
        # severity is not in group_by, so read it from CommonLabels rather than GroupLabels
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'

  - name: 'slack-platform'
    slack_configs:
      - channel: '#platform-alerts'
        send_resolved: true

  - name: 'pagerduty-dba'
    pagerduty_configs:
      - routing_key: '<DBA_ROTATION_KEY>'

Key idea — group_wait vs group_interval: group_wait is a buffer time at the birth of a new group, giving Prometheus a chance to fire related alerts before Alertmanager sends the first notification. group_interval governs subsequent flushes to that same group when new alerts join it. Setting group_wait too short causes notification storms during cascading failures; too long delays the first page. 30 seconds is a sane default for most production environments.
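The interplay of the three timers is easier to see on a timeline. The function below is a deliberately simplified model of one alert group, written for this lesson: it assumes alerts never resolve and that a flush sends only when the group gained alerts or repeat_interval has elapsed since the last send. Times are in seconds.

```python
def notification_times(arrivals, group_wait, group_interval, repeat_interval, horizon):
    """Return the times (seconds) at which one alert group would notify.

    arrivals: times at which alerts join the group; all of them stay firing.
    """
    arrivals = sorted(arrivals)
    sent = []
    notified = 0                      # alerts included in the last notification
    tick = arrivals[0] + group_wait   # the first flush waits group_wait
    while tick <= horizon:
        present = sum(1 for t in arrivals if t <= tick)
        new_alerts = present > notified
        due_repeat = bool(sent) and tick - sent[-1] >= repeat_interval
        if not sent or new_alerts or due_repeat:
            sent.append(tick)
            notified = present
        tick += group_interval        # later flushes happen every group_interval
    return sent


# Values from the example config: 30s, 5m, 4h, expressed in seconds.
burst = notification_times([0, 5, 20, 100], 30, 300, 4 * 3600, horizon=1000)
steady = notification_times([0], 30, 300, 4 * 3600, horizon=15000)
print(burst)
print(steady)
```

In the burst case the three alerts that arrive within the first 30 seconds share one notification, and the straggler at 100 seconds waits for the next group_interval flush instead of paging immediately.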

Inhibition Rules

Inhibition is the Alertmanager feature most engineers under-use and most wish they had configured earlier. An inhibition rule says: "if alert A is firing with these labels, suppress any alert B that matches these other labels." This is indispensable for preventing symptom noise when the root cause alert is already firing.

The canonical example: a NodeDown alert fires. Seconds later, twenty PodCrashLooping and HighLatency alerts fire from that same node. Without inhibition, your on-call engineer gets twenty-one pages. With an inhibition rule that says "suppress everything on the same node label when NodeDown is firing," they get one. The inhibition source and target must share matching label values (defined in equal) for the suppression to apply.

inhibit_rules:
  # Suppress pod-level alerts when the entire node is down
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - severity=~"warning|critical"
    equal: ['cluster', 'node']

  # Suppress warning-level alerts when a critical alert exists for the same service
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'namespace', 'service']

  # Suppress customer-facing latency alerts during a known deployment window
  - source_matchers:
      - alertname="DeploymentInProgress"
    target_matchers:
      - alertname="HighP99Latency"
    equal: ['cluster', 'namespace']
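To make the role of equal concrete, here is a small model of the inhibition check. It is a sketch of the semantics under simplifying assumptions (only = and =~ matchers, no resolved alerts), and the rule dictionary layout and function names are invented for the example.

```python
import re


def matches(matchers, labels):
    """matchers is a list of (name, op, value); op is "=" or "=~" (anchored regex)."""
    for name, op, value in matchers:
        actual = labels.get(name, "")
        if op == "=" and actual != value:
            return False
        if op == "=~" and re.fullmatch(value, actual) is None:
            return False
    return True


def is_inhibited(target, firing, rule):
    """True if some firing alert inhibits `target` under `rule`."""
    if not matches(rule["target"], target):
        return False
    target_is_also_source = matches(rule["source"], target)
    for source in firing:
        if not matches(rule["source"], source):
            continue
        # Missing labels compare as empty strings, so two alerts that both
        # lack an `equal` label still count as equal on it.
        if any(source.get(n, "") != target.get(n, "") for n in rule["equal"]):
            continue
        # Guard: an alert matching both sides must not mute itself or its peers.
        if target_is_also_source and matches(rule["target"], source):
            continue
        return True
    return False


node_rule = {
    "source": [("alertname", "=", "NodeDown")],
    "target": [("severity", "=~", "warning|critical")],
    "equal": ["cluster", "node"],
}
node_down = {"alertname": "NodeDown", "severity": "critical", "cluster": "c1", "node": "n1"}
pod_same = {"alertname": "PodCrashLooping", "severity": "warning", "cluster": "c1", "node": "n1"}
pod_other = {"alertname": "PodCrashLooping", "severity": "warning", "cluster": "c1", "node": "n2"}
firing = [node_down, pod_same, pod_other]

for alert in firing:
    print(alert["alertname"], alert["node"], "inhibited:", is_inhibited(alert, firing, node_rule))
```

Only the pod alert on the failed node is muted. The alert on node n2 still pages, and NodeDown does not suppress itself even though it matches the target matchers.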

Pro practice: Model inhibition rules as a directed graph. The source alert is the "cause," the target alerts are the "symptoms." Draw this out when designing your alerting hierarchy. It prevents the situation where an inhibition rule accidentally swallows a real, independent alert because the equal list was too short to pin the rule to one failure domain, or because it named a label that neither alert carries (a label absent on both sides compares as equal).

Silences

Silences are temporary, label-based mutes applied through the Alertmanager UI or API. They are the correct tool for planned maintenance windows and known-bad periods: you silence a set of labels for a duration, and all matching alerts are suppressed without any changes to routing or inhibition config. Silences carry a creator, a comment, and an expiry — they are auditable.

The amtool CLI is the production-grade way to manage silences programmatically, particularly inside maintenance automation scripts:

# amtool ships in the Alertmanager release tarball alongside the server binary.
# It takes the server address from --alertmanager.url (or from alertmanager.url
# in its config file), so keep the address in one variable and pass it explicitly.
export ALERTMANAGER_URL=http://alertmanager.monitoring.svc:9093

# Create a 2-hour silence for a database maintenance window
amtool --alertmanager.url="$ALERTMANAGER_URL" silence add \
  --comment="Scheduled DB failover window" \
  --author="ops-bot" \
  --duration=2h \
  alertname=~"Database.*" env="production" cluster="us-east-1"

# List active silences
amtool --alertmanager.url="$ALERTMANAGER_URL" silence query

# Expire a silence immediately by its ID
amtool --alertmanager.url="$ALERTMANAGER_URL" silence expire 4b4f9c7a-81d2-4e8e-a9b3-xxxxxxxxxxxx

# Check current alert status (useful in runbooks); matchers are positional arguments
amtool --alertmanager.url="$ALERTMANAGER_URL" alert query alertname=NodeDown

# Validate the Alertmanager config before reloading
amtool check-config /etc/alertmanager/alertmanager.yml

Production pitfall — silence creep: Silences with generous durations ("let me silence this for a week") are routinely forgotten. Real alerts are then muted indefinitely. Enforce a team policy: silences expire in hours, not days. For recurrent maintenance, automate silence creation and expiry via CI/CD rather than leaving them as open-ended manual entries.
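One way to enforce such a policy in automation is to build the silence request yourself and refuse long durations before anything reaches Alertmanager. The sketch below prepares the JSON body for the /api/v2/silences endpoint. The 8-hour cap, the build_silence helper, and the fixed timestamp are assumptions for the example, not Alertmanager behavior.

```python
import json
from datetime import datetime, timedelta, timezone

MAX_SILENCE = timedelta(hours=8)  # hypothetical team policy, not an Alertmanager limit


def build_silence(matchers, duration, author, comment, now=None):
    """Build the JSON body for POST /api/v2/silences, rejecting over-long durations."""
    if duration > MAX_SILENCE:
        raise ValueError(f"silence of {duration} exceeds policy maximum of {MAX_SILENCE}")
    start = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": is_regex, "isEqual": True}
            for name, value, is_regex in matchers
        ],
        "startsAt": start.isoformat(),
        "endsAt": (start + duration).isoformat(),
        "createdBy": author,
        "comment": comment,
    }


silence = build_silence(
    matchers=[("alertname", "Database.*", True), ("env", "production", False)],
    duration=timedelta(hours=2),
    author="ops-bot",
    comment="Scheduled DB failover window",
    now=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
)
print(json.dumps(silence, indent=2))
```

A pipeline step would POST this body with a Content-Type of application/json and store the returned silence ID so that a later step can expire it when the maintenance finishes early.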

The Alert Lifecycle: End-to-End Path

Understanding the complete path an alert travels helps you debug missed pages and duplicate notifications — two of the most common Alertmanager complaints at scale.

The complete alert path: Prometheus fires → Alertmanager routes → suppression check (silences + inhibition) → notification pipeline fans out to on-call integrations.

On-Call Integrations in Production

At big-tech scale, the PagerDuty and OpsGenie integrations carry the most operational weight. A few patterns that matter:

Severity mapping: Map the Prometheus severity label onto PagerDuty's event severity, which accepts critical, error, warning, or info, through the pagerduty_configs.severity template field. Priority and urgency (P1 immediate page versus P3 business hours) are then decided on the PagerDuty side by the service and rules that receive the event.
Deduplication keys: Alertmanager sends a stable dedup_key derived from the alert group's key (the matching route plus the group_by label values), so repeated notifications for the same group update the existing PagerDuty incident rather than opening a new one. This is automatic, but it only works correctly if your group_by labels are stable across alert firings.
Runbook links in annotations: Always include a runbook_url annotation on your alerting rules. Surface it in the notification template so the on-call engineer lands on the correct runbook from the first click.
Webhook receivers: For custom routing logic beyond what Alertmanager's tree supports, a webhook receiver posting to a small Lambda or Cloud Run function gives you arbitrary routing power — useful for multi-tenant products where the alert needs to fan out to the correct customer-specific channel.
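The webhook pattern can be sketched as a pure function that takes the JSON body Alertmanager posts to a webhook receiver and splits its alerts by tenant. The tenant label, the channel names, and the fan_out helper are hypothetical. The fields read from the payload (alerts, status, labels, annotations) are part of the webhook format.

```python
from collections import defaultdict

# Hypothetical mapping from a `tenant` label value to a destination channel.
TENANT_CHANNELS = {"acme": "#acme-alerts", "globex": "#globex-alerts"}
FALLBACK_CHANNEL = "#alerts-unrouted"


def fan_out(payload, channels=TENANT_CHANNELS):
    """Split one webhook payload into per-channel lists of message lines."""
    messages = defaultdict(list)
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        channel = channels.get(labels.get("tenant"), FALLBACK_CHANNEL)
        line = f'[{alert.get("status", "unknown").upper()}] {labels.get("alertname", "?")}'
        if annotations.get("runbook_url"):
            line += f' (runbook: {annotations["runbook_url"]})'
        messages[channel].append(line)
    return dict(messages)


sample = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighLatency", "tenant": "acme"},
            "annotations": {"runbook_url": "https://runbooks.example/high-latency"},
        },
        {"status": "resolved", "labels": {"alertname": "DiskFull"}, "annotations": {}},
    ],
}
for channel, lines in fan_out(sample).items():
    print(channel, lines)
```

Keeping the routing decision in a pure function like this makes it easy to unit test before wiring it into a Lambda or Cloud Run handler. Alerts without a known tenant fall through to a catch-all channel rather than being dropped.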

High Availability for Alertmanager

A single Alertmanager is a single point of failure for your entire alerting chain. Alertmanager supports a gossip-based HA cluster: run three instances and list all three in the alerting.alertmanagers section of every Prometheus server. Do not put them behind a load balancer, because each instance must receive every alert. The instances gossip their notification log and silences to each other (using HashiCorp's memberlist library), so that under normal conditions only one of them sends each notification. The critical flag is --cluster.peer:

# Run three Alertmanager peers — each knows the others via --cluster.peer
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.peer=alertmanager-2:9094 \
  --web.listen-address=0.0.0.0:9093

# In Prometheus configuration, list all Alertmanager instances
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-0:9093
            - alertmanager-1:9093
            - alertmanager-2:9093
      timeout: 10s

# Kubernetes: deploy as a StatefulSet for stable DNS names
# alertmanager-0.alertmanager.monitoring.svc.cluster.local
# Use --cluster.peer based on the StatefulSet pod index

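Key idea (gossip dedup): In HA mode, all three Alertmanager instances receive the same alert from Prometheus. There is no leader election. Each instance waits a delay proportional to its position in the cluster before sending, and skips the send if the gossiped notification log shows that a peer has already delivered it. If the first instance is down, the next one sends after its delay. The design favors duplicates over silence: during a network partition or a slow gossip round you may receive the same notification more than once, but you should not miss it. Because there is no quorum, an odd instance count is not required; run at least two, and three if you want to tolerate a failure during a rolling restart.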
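A toy model of how peers avoid sending the same notification twice, assuming each peer delays by its cluster position multiplied by the peer timeout (15 seconds here, the default of --cluster.peer-timeout) and checks a shared notification log before sending. Positions are treated as fixed for simplicity, whereas a real cluster recomputes them as membership changes. The simulate function is purely illustrative.

```python
PEER_TIMEOUT = 15  # seconds; mirrors the default of --cluster.peer-timeout


def simulate(peers_up, gossip_healthy):
    """Return [(peer_position, send_time_s)] for one notification.

    peers_up[i] says whether the peer at cluster position i is running.
    gossip_healthy says whether peers can see each other's notification log.
    """
    shared_log = False  # True once some peer has recorded the send
    sends = []
    for position, up in enumerate(peers_up):
        if not up:
            continue
        wait = position * PEER_TIMEOUT  # later positions wait longer
        if gossip_healthy and shared_log:
            continue  # a peer already delivered this notification; skip
        sends.append((position, wait))
        shared_log = True
    return sends


print(simulate([True, True, True], gossip_healthy=True))    # one send, no delay
print(simulate([False, True, True], gossip_healthy=True))   # next peer sends after its wait
print(simulate([True, True, True], gossip_healthy=False))   # partition: every peer sends
```

The third case is why receivers should tolerate duplicates: when peers cannot see each other's log, each one sends rather than risk nobody sending.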

Validating and Debugging

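Two tools every Alertmanager operator should have muscle-memory for: amtool check-config (validates the configuration file and any referenced templates before a reload) and the /api/v2/alerts endpoint (shows the alerts Alertmanager currently holds, useful in runbooks). To debug routing, run amtool config routes show to print the routing tree, and amtool config routes test with a label set, for example severity=critical team=infra, to see exactly which receiver it would reach. The Alertmanager UI at :9093 shows the loaded configuration and cluster status on its Status page, and the routing tree editor on prometheus.io visualizes a pasted configuration. This is invaluable when debugging why an alert went to the wrong channel.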
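For runbook scripts, the alerts endpoint accepts label matchers as repeated filter parameters plus boolean flags such as active, silenced, and inhibited. The helper below only builds the URL, so it can be checked without a running server. The alerts_url name and the base URL are assumptions for the example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def alerts_url(base_url, matchers, active=True, silenced=False, inhibited=False):
    """Build a GET URL for /api/v2/alerts with label matchers as `filter` params."""
    params = [("filter", m) for m in matchers]
    params += [
        ("active", str(active).lower()),
        ("silenced", str(silenced).lower()),
        ("inhibited", str(inhibited).lower()),
    ]
    return f"{base_url.rstrip('/')}/api/v2/alerts?{urlencode(params)}"


url = alerts_url(
    "http://alertmanager.monitoring.svc:9093/",
    ['alertname="NodeDown"', 'cluster="us-east-1"'],
)
print(url)
print(parse_qs(urlsplit(url).query))
```

Fetching this URL returns a JSON array of alerts, each with its labels, annotations, and a status object that lists the silences or alerts currently suppressing it.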