Prometheus & Grafana

Scaling Prometheus

18 min Lesson 9 of 32

Scaling Prometheus

A single Prometheus server can comfortably scrape tens of thousands of targets and handle millions of active time-series — until it cannot. At large organisations, a monolithic Prometheus hits three walls simultaneously: memory (all active series live in RAM), disk (local TSDB is not designed for multi-year retention), and blast radius (one server's outage blinds every alert). This lesson covers the production patterns used to break through those walls: federation, remote write, and purpose-built long-term storage systems such as Thanos and Grafana Mimir.

The Scaling Problem in Numbers

Each active time-series costs roughly 2–3 KB of RAM in Prometheus 2.x. At 10 million active series — common in a large Kubernetes fleet — that is 20–30 GB of heap before you account for WAL, query cardinality, and label overhead. Local TSDB compaction keeps disk manageable for 15 days of retention, but compliance often demands 13 months. These two constraints alone force a distributed architecture.

Federation

Prometheus federation lets one global Prometheus scrape pre-aggregated metrics from several local Prometheus instances via their /federate HTTP endpoint. Each local server scrapes its own shard of targets; the global server pulls only the recording-rule results and critical alert metrics — not raw series.

# global-prometheus.yml — federation scrape config
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 60s
    honor_labels: true          # keep original job/instance labels
    metrics_path: /federate
    params:
      match[]:
        - '{__name__=~"job:.*"}'        # recording-rule rollups only
        - 'up'
        - 'ALERTS{alertstate="firing"}'
    static_configs:
      - targets:
          - 'prometheus-dc1:9090'
          - 'prometheus-dc2:9090'
          - 'prometheus-dc3:9090'

Federation does not solve retention. It reduces global cardinality by pulling aggregates, but every local server still stores its own raw data locally. For long-term storage you need remote write.

Federation is the right choice when you want a single pane of glass for cross-datacenter alerts without duplicating raw data. Set honor_labels: true so the global Prometheus keeps the job and instance labels from the originating server rather than overwriting them with the federation job name.

Remote Write

Remote write streams every sample from a local Prometheus to an external storage backend over a persistent HTTP/2 connection using a snappy-compressed protobuf payload. The WAL-based queue makes it durable: samples are buffered to disk if the remote is unavailable and replayed on reconnect (up to --storage.tsdb.retention.time worth of WAL).

# prometheus.yml — remote write to Thanos Receive or Mimir
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    remote_timeout: 30s
    queue_config:
      capacity: 10000           # samples buffered in memory per shard
      max_shards: 200           # parallelism ceiling
      min_shards: 1
      max_samples_per_send: 2000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5s
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_gc_.*'       # drop noisy runtime metrics
        action: drop

Tune shards with the remote write dashboard. Prometheus ships a built-in Grafana dashboard (ID 15032) that shows queue depth, shard utilisation, and sample age. If prometheus_remote_storage_shards is always at max_shards, increase the ceiling or reduce ingestion volume with write_relabel_configs.

Thanos — Architecture Deep Dive

Thanos is the most widely deployed open-source solution for globally available, long-term Prometheus storage. It is composed of small, independently deployed components rather than a single binary:

Sidecar — runs next to each Prometheus pod; uploads completed 2-hour TSDB blocks to object storage (S3, GCS, Azure Blob) and exposes a gRPC Store API so Thanos Query can reach local data.
Query — a Prometheus-compatible query engine that fans queries out to all registered Store API endpoints (Sidecars, Store Gateways, Rulers) and deduplicates overlapping replicas.
Store Gateway — serves queries against blocks in object storage; supports index caching (Memcached/Redis) to avoid re-reading metadata on every request.
Compactor — runs as a singleton; merges and downsamples blocks in object storage (5m and 1h resolutions) to keep long-range queries fast.
Ruler — evaluates recording and alerting rules globally across all shards, removing the need for per-shard federation.
Receive — accepts remote_write from Prometheus (Thanos without Sidecars) and writes to object storage directly; used in push-based topologies.

Thanos scaled architecture: Sidecars upload TSDB blocks to object storage; Thanos Query fans out to Sidecars (recent data) and the Store Gateway (historical data); the Compactor downsamples blocks for efficient long-range queries.

Grafana Mimir — When Thanos Is Not Enough

Thanos Sidecar depends on Prometheus staying up to serve recent data. Grafana Mimir (open-source, Apache 2.0) is a fully disaggregated, horizontally scalable TSDB built on Cortex that accepts remote_write and runs every component — Ingester, Querier, Compactor, Store Gateway, Ruler — as independently scaling microservices behind a single write and query path. It is the choice when you need millions of series per second ingestion or multi-tenant isolation (each tenant's data is namespaced).

# mimir.yaml — minimal monolithic-mode config (single binary, good for <50 M series)
target: all
multitenancy_enabled: false

blocks_storage:
  backend: s3
  s3:
    bucket_name: mimir-blocks
    endpoint: s3.us-east-1.amazonaws.com
  tsdb:
    dir: /data/tsdb

compactor:
  data_dir: /data/compactor

store_gateway:
  sharding_enabled: false

ingester:
  ring:
    replication_factor: 3
    kvstore:
      store: memberlist

limits:
  max_global_series_per_user: 0    # 0 = unlimited
  ingestion_rate: 100000           # samples/sec per tenant

Choosing Between Thanos and Mimir

Thanos Sidecar — best when you already have Prometheus and need object-storage retention with minimal operational change. Great up to ~50 M active series across all shards.
Thanos Receive — when you want push semantics (no Sidecar) and can tolerate its replication complexity.
Mimir — greenfield at scale; superior write throughput, native multi-tenancy, and better horizontal scaling. Google and Meta run Cortex/Mimir derivatives at hundreds of billions of active series.

Compactor must run as a singleton. Whether using Thanos or Mimir, running two Compactor instances against the same object-storage bucket simultaneously will corrupt the block metadata. Use a distributed lock (Thanos uses object-storage itself for locking via a thanos.shipper.meta.json file) and ensure your deployment does not spawn overlapping compactor jobs.

Cardinality as the Root Cause of Scaling Pain

Before throwing more infrastructure at a Prometheus problem, always audit cardinality first. A single high-cardinality label (like request_id or user_id) can create millions of series that no amount of federation or sharding will make cheap. Use the TSDB admin API to find the offending metrics:

# Top 10 metric names by series count
curl -s http://localhost:9090/api/v1/status/tsdb \
  | python3 -m json.tool \
  | grep -A2 '"seriesCountByMetricName"'

# Or in PromQL — cardinality per job
sort_desc(count by (job) ({__name__!=""}))

Drop at the source, not at the store. Use metric_relabel_configs in Prometheus scrape jobs to drop high-cardinality series before they are ingested. Dropping at the remote_write layer wastes RAM and WAL space on the local server. Dropping at the storage backend wastes network. Drop as early as possible.

Production Checklist for Scaled Prometheus

Set --storage.tsdb.retention.time=15d on local Prometheus; rely on object storage for the rest.
Enable WAL compression: --storage.tsdb.wal-compression (saves ~30% WAL disk).
Size Prometheus memory at roughly 3 bytes × active series × 2 (headroom for peaks).
Stagger scrape intervals across shards to avoid synchronised load spikes on exporters.
Monitor the monitors: expose Thanos Sidecar and Store Gateway metrics to a separate, lightweight Prometheus instance dedicated to your observability stack.
Test object-storage connectivity regularly — a silent S3 permission error can let WAL fill the disk before anyone notices.