Logging at Scale: ELK & Loki

Logging Architecture at Scale

18 min Lesson 1 of 28

Logging Architecture at Scale

Every system emits logs. A single microservice on a laptop produces a manageable stream. A Kubernetes cluster running 400 services across three regions produces billions of log lines per day, at rates that can spike to hundreds of megabytes per second. The architecture you use to collect, transport, store, and query those logs is not a detail — it is a first-class infrastructure concern that directly determines whether your on-call team can investigate a production incident in minutes or hours.

This lesson builds the mental model for centralized logging at scale: what the four stages of the pipeline actually do, where each stage can fail under load, and how top-tier engineering organizations think about the tradeoffs at each boundary.

Why Centralized Logging Exists

Without a centralized logging system, debugging a distributed system means SSH-ing into individual nodes, grepping local files, and hoping the relevant log landed on the machine you happen to be looking at. At scale this is operationally impossible: pods are ephemeral (killed after an OOM, recreated by the scheduler), nodes can be replaced by auto-scaling, and the request you are investigating may have touched twenty services across five namespaces.

Centralized logging solves this by funneling all log output into a single queryable backend. You ask one query, you get the complete picture. The engineering challenge is doing this reliably, without the logging pipeline itself becoming a point of failure, and without the cost spiraling out of control.

Key idea: The logging pipeline has four distinct stages — Collect, Transport, Store, Query. Each stage has different latency, reliability, and cost characteristics. Designing each stage correctly is what separates a logging system that holds up during an incident from one that falls over exactly when you need it most.

Stage 1: Collect

Collection is the act of capturing raw log output from wherever it originates: stdout/stderr of a container, a file on disk written by a daemon, a syslog socket, or an application SDK that writes structured events directly. The collector runs close to the source — as a DaemonSet in Kubernetes, as an agent sidecar, or as a library embedded in the application.

Production-grade collectors (Fluent Bit, Promtail, the OpenTelemetry Collector, Vector) do more than forward bytes. They perform parsing (turning unstructured text into structured fields), enrichment (adding Kubernetes metadata: pod name, namespace, node, container image tag), filtering (dropping health-check noise before it hits the network), and buffering (holding data in memory or on disk when the downstream is temporarily unavailable).

The collection layer is where many organizations make their first expensive mistake: choosing a heavy agent (Logstash as a DaemonSet) that consumes 500 MB of RAM per node, or deploying no buffering so that a transport outage causes log loss. Fluent Bit is the current best-practice choice for Kubernetes: written in C, it runs in under 50 MB of RSS and has native Kubernetes metadata enrichment built in.

Pro practice: Always set Mem_Buf_Limit and storage.type filesystem in Fluent Bit. Without a filesystem buffer, a downstream outage drops logs. With it, the agent spools to disk and replays when the transport recovers. Google, Datadog, and AWS all run buffered collection pipelines for exactly this reason.

Stage 2: Transport

The transport layer moves log data from the collectors to the storage backend. At small scale this is a direct HTTP/gRPC push from each agent. At production scale — thousands of nodes, gigabytes per second — a dedicated message broker decouples the collector from the store.

The canonical transport for large-scale logging is Apache Kafka. Collectors publish to Kafka topics partitioned by service or namespace. Downstream consumers (indexers, the storage ingestion layer) read from those topics at their own pace. Kafka provides durability (log segments on disk, configurable retention), backpressure isolation (a slow indexer does not cause the collector to back up and OOM), and replay (if the indexer has a bug and drops a day of data, you can replay from Kafka).

For smaller organizations or those already on AWS, Kinesis Data Streams serves the same role. For Grafana Loki, the transport can be a direct push via Promtail or Grafana Alloy (small scale) or Kafka (large scale). OpenTelemetry Collector pipelines can fan out to multiple backends simultaneously — a powerful pattern for organizations migrating between storage systems.

Production pitfall: If your transport layer has no backpressure mechanism, a storage outage cascades upstream. Collectors fill their in-memory buffers and start dropping logs. You experience the outage, open the logging system to investigate, and find no logs for the period in question. Design the transport to absorb storage brownouts: Kafka or a queue with disk-backed consumer-side buffering is the standard solution.

The four-stage centralized logging pipeline. The dashed arrow shows a small-scale direct path; at production scale a message broker decouples collection from storage to absorb load spikes.

Stage 3: Store

The storage layer indexes and persists log data so it can be queried efficiently. The two dominant open-source choices reflect fundamentally different design philosophies:

Elasticsearch (OpenSearch) — a full-text inverted index over every field. Extremely fast for keyword search and aggregations across arbitrary fields. High write cost: every field is parsed and indexed at ingest time, consuming significant CPU and storage (typically 2–5x the raw log size on disk). The right choice when engineers need ad-hoc field-level queries across semi-structured data.
Grafana Loki — label-indexed, chunk-compressed storage. Only the labels (metadata: app, namespace, level, region) are indexed; log content is stored in compressed chunks and scanned at query time via a LogQL pipeline filter. 10–20x cheaper than Elasticsearch per GB. The right choice when your logs are structured and your query patterns are label-first. Designed to work natively alongside Prometheus and Grafana Tempo.

A third option gaining traction in high-volume analytics contexts is ClickHouse: a columnar OLAP database that can ingest millions of events per second and execute aggregation queries in milliseconds, with very high compression ratios. Used by Cloudflare and ByteDance for petabyte-scale log analytics.

Storage tiering: At scale, storing everything in a hot index is prohibitively expensive. Production logging stacks tier data: hot (last 7 days, fast SSD, full query speed) → warm (8–30 days, cheaper storage, slightly slower) → cold (31–365 days, object storage like S3/GCS, query via selective reload). Elasticsearch ILM policies and Loki\'s S3 backend with ruler implement this automatically.

Stage 4: Query

The query layer is how humans and automated systems interact with stored logs. It encompasses the UI (Kibana for Elasticsearch, Grafana Explore for Loki), the query language (KQL/Lucene or LogQL), and the alerting integration (Grafana Alerting rules backed by log queries).

At big-tech scale, query patterns split into two categories with very different requirements:

Interactive incident investigation — a human running ad-hoc queries, wanting results in under 3 seconds. Requires a hot index, a powerful query frontend that caches and parallelizes queries, and good default time range scoping (last 15 minutes, not all time).
Automated alerting — a rules engine evaluating a LogQL or KQL expression every 60 seconds and firing an alert when a condition is met. These must be deterministic, low-cardinality queries that are cheap to evaluate repeatedly. Route them through a dedicated query path to avoid competing with interactive queries during an incident.

# ── Fluent Bit DaemonSet: production-grade Kubernetes config ─────────────────

[SERVICE]
    Flush         5
    Log_Level     warn
    storage.type  filesystem          # spill to disk when downstream is slow
    storage.path  /var/log/flb-storage/

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    multiline.parser  cri             # handle CRI multi-line log lines
    Tag               kube.*
    Refresh_Interval  10
    Mem_Buf_Limit     50MB
    storage.type      filesystem

[FILTER]
    Name                kubernetes
    Match               kube.*
    Merge_Log           On            # parse nested JSON into top-level fields
    Keep_Log            Off
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On

[FILTER]
    Name    grep
    Match   kube.*
    Exclude log (ELB-HealthChecker|kube-probe)   # drop noise before transport

[OUTPUT]
    Name           kafka
    Match          *
    Brokers        kafka-0.kafka:9092,kafka-1.kafka:9092,kafka-2.kafka:9092
    Topics         logs.production
    rdkafka.queue.buffering.max.ms   100
    rdkafka.compression.codec        lz4

# ── LogQL query examples (Grafana Loki) ──────────────────────────────────────

# 1. Stream all error logs from the checkout service (last 15 min):
{app="checkout", namespace="production"} |= "level=error"

# 2. Parse JSON logs and filter by HTTP status code:
{app="api-gateway"} | json | status_code >= 500

# 3. Count errors per pod over time (for a Grafana panel):
sum by (pod) (
  rate({app="checkout"} | json | level="error" [5m])
)

# 4. Full-text search for a specific trace ID across all services:
{namespace="production"} |= "trace_id=abc123def456"
# → returns every log line, any service, that touched this trace

# 5. Alert rule — fire if checkout errors exceed 5/min for 2 minutes:
# (Grafana Alerting → Alert rule → type: Loki)
sum(rate({app="checkout"} | json | level="error" [1m])) > 5

Pipeline Failure Modes

Understanding where the pipeline fails under load is as important as understanding its normal operation. The most common production failure modes:

Collector OOM: A log burst (exception storm, verbose debug logging left on in prod) overwhelms the collector\'s in-memory buffer. Without Mem_Buf_Limit and filesystem buffering, the collector pod is OOM-killed and logs are lost until it restarts. Fix: filesystem buffer + limit set just below the collector\'s memory request.
Kafka consumer lag: The storage ingestion layer falls behind the Kafka producer (collectors). Lag accumulates. If Kafka retention is set too short (e.g., 24 hours), unconsumed segments are deleted before the indexer can process them. Fix: set Kafka retention to at least 48–72 hours for logging topics; alert on consumer group lag.
Elasticsearch shard hotspots: All logs from a time window land in the same Elasticsearch shard because the index pattern is time-based and the shard count is too low. That shard becomes a write bottleneck. Fix: size shards to 30–50 GB max; use ILM rollover policies to create new shards at the right cadence.
Query storm during incident: When a major outage hits, ten engineers simultaneously run broad queries ("show me all logs from the last hour across all services"). This can overload the query layer and make the logging system unavailable exactly when it is needed. Fix: query frontend with result caching and per-user rate limits; always scope queries by label before opening the time range.

Pro practice: Treat the logging pipeline as a critical system, not a best-effort sidecar. Run it with the same SLO rigor you apply to your product services. At minimum: collector DaemonSet with PodDisruptionBudget, Kafka with replication factor 3, storage backend with HA topology, and an alert on "no logs received from namespace X in the last 5 minutes" — so you know when the pipeline itself is broken before an on-call engineer discovers it during an incident.

Choosing Your Stack

The right stack depends on your scale and query patterns, but the 2025 industry defaults look like this:

Kubernetes-native, cost-sensitive: Grafana Alloy (collector) → direct push or Kafka → Grafana Loki (store) → Grafana (query). Lowest operational cost; native integration with Prometheus and Tempo for correlated observability.
Mixed workloads, rich ad-hoc search: Fluent Bit (collector) → Kafka → Elasticsearch/OpenSearch (store) → Kibana (query). More expensive but more flexible for unstructured log content.
High-volume analytics, data engineering team: OTel Collector → Kafka → ClickHouse (store) → Grafana or custom SQL tooling. Best compression and aggregation performance at petabyte scale.

The remaining lessons in this tutorial build each component of this stack in depth — starting with the structured logging standards that make all of it queryable in the first place.