Scaling Prometheus
Scaling Prometheus
A single Prometheus server can comfortably scrape tens of thousands of targets and handle millions of active time-series — until it cannot. At large organisations, a monolithic Prometheus hits three walls simultaneously: memory (all active series live in RAM), disk (local TSDB is not designed for multi-year retention), and blast radius (one server's outage blinds every alert). This lesson covers the production patterns used to break through those walls: federation, remote write, and purpose-built long-term storage systems such as Thanos and Grafana Mimir.
The Scaling Problem in Numbers
Each active time-series costs roughly 2–3 KB of RAM in Prometheus 2.x. At 10 million active series — common in a large Kubernetes fleet — that is 20–30 GB of heap before you account for WAL, query cardinality, and label overhead. Local TSDB compaction keeps disk manageable for 15 days of retention, but compliance often demands 13 months. These two constraints alone force a distributed architecture.
Federation
Prometheus federation lets one global Prometheus scrape pre-aggregated metrics from several local Prometheus instances via their /federate HTTP endpoint. Each local server scrapes its own shard of targets; the global server pulls only the recording-rule results and critical alert metrics — not raw series.
Federation is the right choice when you want a single pane of glass for cross-datacenter alerts without duplicating raw data. Set honor_labels: true so the global Prometheus keeps the job and instance labels from the originating server rather than overwriting them with the federation job name.
Remote Write
Remote write streams every sample from a local Prometheus to an external storage backend over a persistent HTTP/2 connection using a snappy-compressed protobuf payload. The WAL-based queue makes it durable: samples are buffered to disk if the remote is unavailable and replayed on reconnect (up to --storage.tsdb.retention.time worth of WAL).
prometheus_remote_storage_shards is always at max_shards, increase the ceiling or reduce ingestion volume with write_relabel_configs.
Thanos — Architecture Deep Dive
Thanos is the most widely deployed open-source solution for globally available, long-term Prometheus storage. It is composed of small, independently deployed components rather than a single binary:
- Sidecar — runs next to each Prometheus pod; uploads completed 2-hour TSDB blocks to object storage (S3, GCS, Azure Blob) and exposes a gRPC Store API so Thanos Query can reach local data.
- Query — a Prometheus-compatible query engine that fans queries out to all registered Store API endpoints (Sidecars, Store Gateways, Rulers) and deduplicates overlapping replicas.
- Store Gateway — serves queries against blocks in object storage; supports index caching (Memcached/Redis) to avoid re-reading metadata on every request.
- Compactor — runs as a singleton; merges and downsamples blocks in object storage (5m and 1h resolutions) to keep long-range queries fast.
- Ruler — evaluates recording and alerting rules globally across all shards, removing the need for per-shard federation.
- Receive — accepts remote_write from Prometheus (Thanos without Sidecars) and writes to object storage directly; used in push-based topologies.
Grafana Mimir — When Thanos Is Not Enough
Thanos Sidecar depends on Prometheus staying up to serve recent data. Grafana Mimir (open-source, Apache 2.0) is a fully disaggregated, horizontally scalable TSDB built on Cortex that accepts remote_write and runs every component — Ingester, Querier, Compactor, Store Gateway, Ruler — as independently scaling microservices behind a single write and query path. It is the choice when you need millions of series per second ingestion or multi-tenant isolation (each tenant's data is namespaced).
Choosing Between Thanos and Mimir
- Thanos Sidecar — best when you already have Prometheus and need object-storage retention with minimal operational change. Great up to ~50 M active series across all shards.
- Thanos Receive — when you want push semantics (no Sidecar) and can tolerate its replication complexity.
- Mimir — greenfield at scale; superior write throughput, native multi-tenancy, and better horizontal scaling. Google and Meta run Cortex/Mimir derivatives at hundreds of billions of active series.
thanos.shipper.meta.json file) and ensure your deployment does not spawn overlapping compactor jobs.
Cardinality as the Root Cause of Scaling Pain
Before throwing more infrastructure at a Prometheus problem, always audit cardinality first. A single high-cardinality label (like request_id or user_id) can create millions of series that no amount of federation or sharding will make cheap. Use the TSDB admin API to find the offending metrics:
metric_relabel_configs in Prometheus scrape jobs to drop high-cardinality series before they are ingested. Dropping at the remote_write layer wastes RAM and WAL space on the local server. Dropping at the storage backend wastes network. Drop as early as possible.
Production Checklist for Scaled Prometheus
- Set
--storage.tsdb.retention.time=15don local Prometheus; rely on object storage for the rest. - Enable WAL compression:
--storage.tsdb.wal-compression(saves ~30% WAL disk). - Size Prometheus memory at roughly 3 bytes × active series × 2 (headroom for peaks).
- Stagger scrape intervals across shards to avoid synchronised load spikes on exporters.
- Monitor the monitors: expose Thanos Sidecar and Store Gateway metrics to a separate, lightweight Prometheus instance dedicated to your observability stack.
- Test object-storage connectivity regularly — a silent S3 permission error can let WAL fill the disk before anyone notices.