Logging Architecture at Scale
Logging Architecture at Scale
Every system emits logs. A single microservice on a laptop produces a manageable stream. A Kubernetes cluster running 400 services across three regions produces billions of log lines per day, at rates that can spike to hundreds of megabytes per second. The architecture you use to collect, transport, store, and query those logs is not a detail — it is a first-class infrastructure concern that directly determines whether your on-call team can investigate a production incident in minutes or hours.
This lesson builds the mental model for centralized logging at scale: what the four stages of the pipeline actually do, where each stage can fail under load, and how top-tier engineering organizations think about the tradeoffs at each boundary.
Why Centralized Logging Exists
Without a centralized logging system, debugging a distributed system means SSH-ing into individual nodes, grepping local files, and hoping the relevant log landed on the machine you happen to be looking at. At scale this is operationally impossible: pods are ephemeral (killed after an OOM, recreated by the scheduler), nodes can be replaced by auto-scaling, and the request you are investigating may have touched twenty services across five namespaces.
Centralized logging solves this by funneling all log output into a single queryable backend. You ask one query, you get the complete picture. The engineering challenge is doing this reliably, without the logging pipeline itself becoming a point of failure, and without the cost spiraling out of control.
Stage 1: Collect
Collection is the act of capturing raw log output from wherever it originates: stdout/stderr of a container, a file on disk written by a daemon, a syslog socket, or an application SDK that writes structured events directly. The collector runs close to the source — as a DaemonSet in Kubernetes, as an agent sidecar, or as a library embedded in the application.
Production-grade collectors (Fluent Bit, Promtail, the OpenTelemetry Collector, Vector) do more than forward bytes. They perform parsing (turning unstructured text into structured fields), enrichment (adding Kubernetes metadata: pod name, namespace, node, container image tag), filtering (dropping health-check noise before it hits the network), and buffering (holding data in memory or on disk when the downstream is temporarily unavailable).
The collection layer is where many organizations make their first expensive mistake: choosing a heavy agent (Logstash as a DaemonSet) that consumes 500 MB of RAM per node, or deploying no buffering so that a transport outage causes log loss. Fluent Bit is the current best-practice choice for Kubernetes: written in C, it runs in under 50 MB of RSS and has native Kubernetes metadata enrichment built in.
Mem_Buf_Limit and storage.type filesystem in Fluent Bit. Without a filesystem buffer, a downstream outage drops logs. With it, the agent spools to disk and replays when the transport recovers. Google, Datadog, and AWS all run buffered collection pipelines for exactly this reason.Stage 2: Transport
The transport layer moves log data from the collectors to the storage backend. At small scale this is a direct HTTP/gRPC push from each agent. At production scale — thousands of nodes, gigabytes per second — a dedicated message broker decouples the collector from the store.
The canonical transport for large-scale logging is Apache Kafka. Collectors publish to Kafka topics partitioned by service or namespace. Downstream consumers (indexers, the storage ingestion layer) read from those topics at their own pace. Kafka provides durability (log segments on disk, configurable retention), backpressure isolation (a slow indexer does not cause the collector to back up and OOM), and replay (if the indexer has a bug and drops a day of data, you can replay from Kafka).
For smaller organizations or those already on AWS, Kinesis Data Streams serves the same role. For Grafana Loki, the transport can be a direct push via Promtail or Grafana Alloy (small scale) or Kafka (large scale). OpenTelemetry Collector pipelines can fan out to multiple backends simultaneously — a powerful pattern for organizations migrating between storage systems.
Stage 3: Store
The storage layer indexes and persists log data so it can be queried efficiently. The two dominant open-source choices reflect fundamentally different design philosophies:
- Elasticsearch (OpenSearch) — a full-text inverted index over every field. Extremely fast for keyword search and aggregations across arbitrary fields. High write cost: every field is parsed and indexed at ingest time, consuming significant CPU and storage (typically 2–5x the raw log size on disk). The right choice when engineers need ad-hoc field-level queries across semi-structured data.
- Grafana Loki — label-indexed, chunk-compressed storage. Only the labels (metadata: app, namespace, level, region) are indexed; log content is stored in compressed chunks and scanned at query time via a LogQL pipeline filter. 10–20x cheaper than Elasticsearch per GB. The right choice when your logs are structured and your query patterns are label-first. Designed to work natively alongside Prometheus and Grafana Tempo.
A third option gaining traction in high-volume analytics contexts is ClickHouse: a columnar OLAP database that can ingest millions of events per second and execute aggregation queries in milliseconds, with very high compression ratios. Used by Cloudflare and ByteDance for petabyte-scale log analytics.
Stage 4: Query
The query layer is how humans and automated systems interact with stored logs. It encompasses the UI (Kibana for Elasticsearch, Grafana Explore for Loki), the query language (KQL/Lucene or LogQL), and the alerting integration (Grafana Alerting rules backed by log queries).
At big-tech scale, query patterns split into two categories with very different requirements:
- Interactive incident investigation — a human running ad-hoc queries, wanting results in under 3 seconds. Requires a hot index, a powerful query frontend that caches and parallelizes queries, and good default time range scoping (last 15 minutes, not all time).
- Automated alerting — a rules engine evaluating a LogQL or KQL expression every 60 seconds and firing an alert when a condition is met. These must be deterministic, low-cardinality queries that are cheap to evaluate repeatedly. Route them through a dedicated query path to avoid competing with interactive queries during an incident.
Pipeline Failure Modes
Understanding where the pipeline fails under load is as important as understanding its normal operation. The most common production failure modes:
- Collector OOM: A log burst (exception storm, verbose debug logging left on in prod) overwhelms the collector\'s in-memory buffer. Without
Mem_Buf_Limitand filesystem buffering, the collector pod is OOM-killed and logs are lost until it restarts. Fix: filesystem buffer + limit set just below the collector\'s memory request. - Kafka consumer lag: The storage ingestion layer falls behind the Kafka producer (collectors). Lag accumulates. If Kafka retention is set too short (e.g., 24 hours), unconsumed segments are deleted before the indexer can process them. Fix: set Kafka retention to at least 48–72 hours for logging topics; alert on consumer group lag.
- Elasticsearch shard hotspots: All logs from a time window land in the same Elasticsearch shard because the index pattern is time-based and the shard count is too low. That shard becomes a write bottleneck. Fix: size shards to 30–50 GB max; use ILM rollover policies to create new shards at the right cadence.
- Query storm during incident: When a major outage hits, ten engineers simultaneously run broad queries ("show me all logs from the last hour across all services"). This can overload the query layer and make the logging system unavailable exactly when it is needed. Fix: query frontend with result caching and per-user rate limits; always scope queries by label before opening the time range.
Choosing Your Stack
The right stack depends on your scale and query patterns, but the 2025 industry defaults look like this:
- Kubernetes-native, cost-sensitive: Grafana Alloy (collector) → direct push or Kafka → Grafana Loki (store) → Grafana (query). Lowest operational cost; native integration with Prometheus and Tempo for correlated observability.
- Mixed workloads, rich ad-hoc search: Fluent Bit (collector) → Kafka → Elasticsearch/OpenSearch (store) → Kibana (query). More expensive but more flexible for unstructured log content.
- High-volume analytics, data engineering team: OTel Collector → Kafka → ClickHouse (store) → Grafana or custom SQL tooling. Best compression and aggregation performance at petabyte scale.
The remaining lessons in this tutorial build each component of this stack in depth — starting with the structured logging standards that make all of it queryable in the first place.