Grafana Loki
Elasticsearch indexes every token in every log line. That brute-force approach is powerful but expensive: a 1 TB/day log estate typically needs 3–5x that in Elasticsearch storage (indexes, replicas, segment merges) and a cluster that costs thousands of dollars a month just to keep warm. Grafana Loki was designed by the Grafana Labs team to answer a different question: what if you stored logs the way Prometheus stores metrics? Index only the labels, compress the raw text, and query lazily at read time. The result is dramatically lower cost with a trade-off in raw query speed that is almost never the bottleneck for real incident investigation.
This lesson takes you deep into Loki's architecture, its query language LogQL, and the operational situations where Loki wins — so you can choose the right tool and run it correctly in production.
Label-Based Indexing vs Full-Text Indexing
In Elasticsearch, every log line is parsed at ingest time: every word, every number, every field becomes an entry in an inverted index. This makes arbitrary full-text search instant, but it also means the index can easily be larger than the raw data. CPU burns at ingest to tokenize and segment-merge; heap burns at query time to load posting lists.
Loki takes the opposite approach. At ingest time it extracts only the labels you declare: fixed key-value pairs attached by the log shipper, such as app="api-gateway", env="production", namespace="payments". These labels define a stream. Within each stream, log lines are bundled into compressed chunks (gzip by default; Snappy, LZ4, and zstd are also supported) and written to object storage (S3, GCS, Azure Blob). The label index lives in a small store: originally Cassandra or BoltDB, now the TSDB index, which has been the recommended index type since Loki 2.8. At query time, Loki first identifies matching streams by label, then decompresses only the relevant chunks and applies any filter expressions against the raw text in memory. Think of it as a label-gated grep over compressed logs.
Labels are for selecting which streams to read; filter expressions (|=, |~, parser pipelines) are for finding what inside those streams. If your query touches too many streams, it is slow. If a stream has too many lines per second, chunks grow large and decompression becomes the bottleneck. Both are label design problems.
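To make the "label-gated grep" idea concrete, here is a toy model in Python. It is only a sketch of the concept, not how Loki is implemented: the class name TinyLogStore and its methods are invented for illustration, chunks are plain gzip blobs, and the "index" is an in-memory dictionary.

```python
import gzip
from collections import defaultdict


class TinyLogStore:
    """Toy model: a label index that points at gzip-compressed chunks."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of compressed chunks
        self.streams = defaultdict(list)

    def push(self, labels, lines):
        """Compress a batch of lines and file it under its label set."""
        key = frozenset(labels.items())
        chunk = gzip.compress("\n".join(lines).encode("utf-8"))
        self.streams[key].append(chunk)

    def query(self, matchers, needle):
        """Select streams by label first, then grep the decompressed text."""
        wanted = set(matchers.items())
        hits = []
        for key, chunks in self.streams.items():
            if not wanted <= key:
                continue  # stream rejected by the index; nothing is decompressed
            for chunk in chunks:
                for line in gzip.decompress(chunk).decode("utf-8").split("\n"):
                    if needle in line:
                        hits.append(line)
        return hits


store = TinyLogStore()
store.push({"app": "api-gateway", "env": "production"},
           ["GET /health 200", "POST /pay 500 timeout"])
store.push({"app": "worker", "env": "production"}, ["job timeout"])

narrow = store.query({"app": "api-gateway"}, "timeout")
broad = store.query({"env": "production"}, "timeout")
```

The narrow query decompresses one chunk; the broad one decompresses both. The cost of a query is driven by how many chunks the label matchers let through, which is why label design dominates performance.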
Loki Architecture
A production Loki deployment runs in microservices mode with independently scalable components:
- Distributor — receives push requests from Promtail/Alloy/Fluent Bit via the /loki/api/v1/push HTTP endpoint. Validates labels, enforces rate limits, and fans out to Ingesters via a consistent-hash ring.
- Ingester — buffers open chunks in memory and records incoming writes in a write-ahead log (WAL) on local disk. Flushes sealed chunks to object storage and writes the index to the TSDB store. Scale horizontally; keep the WAL on local SSD for durability.
- Querier — handles /loki/api/v1/query_range requests. It fetches data from both the in-memory Ingesters (recent data) and object storage (older data), merges results, and applies LogQL pipelines.
- Query Frontend — shards long time-range queries, caches results, and queues requests for fairness. Always deploy this in front of Queriers at scale.
- Ruler — evaluates metric-type LogQL rules (alerting and recording) on a schedule, identical in concept to Prometheus rules.
- Compactor — periodically merges small TSDB index shards and enforces retention via delete markers on object storage.
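For a feel of what the Distributor receives, the sketch below builds the JSON body of a push request in Python. The body shape (a "streams" list whose entries carry a "stream" label map and "values" pairs of nanosecond timestamp string and log line) follows Loki's HTTP push API; the helper name build_push_payload and the sample labels are illustrative assumptions, and nothing is sent over the network.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build the JSON body for POST /loki/api/v1/push (one stream)."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [timestamp, line]; the timestamp is Unix epoch
    # nanoseconds encoded as a string. Consecutive lines get increasing
    # timestamps so their order is preserved.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


payload = build_push_payload(
    {"app": "api-gateway", "env": "production"},
    ["GET /health 200", "POST /pay 500"],
    now_ns=1_700_000_000_000_000_000,
)
body = json.dumps(payload)  # send with Content-Type: application/json
```

Every distinct label map in "stream" is a separate stream, so the shipper's label choices made here decide how many streams the Ingesters have to hold open.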
LogQL — Loki's Query Language
LogQL is deliberately modeled after PromQL. Every query starts with a log stream selector — a set of label matchers in curly braces — followed by optional pipeline stages that filter and transform log lines.
The key stages in a pipeline are:
- | json / | logfmt / | pattern / | regexp — parse the raw line into fields
- |= "string" / != "string" / |~ "regex" / !~ "regex" — line filters (fast because they match the raw line text without parsing it; place them before any parser stage)
- | field_name = "value" — label filters on extracted fields
- | line_format "template" — reshape the output line using Go template syntax
- | unwrap field_name — extract a numeric field for metric aggregation (rate, avg_over_time, quantile_over_time)
{app="checkout", env="production"} scans far fewer chunks than {env="production"} alone. A query touching more than ~200 streams at once will be slow regardless of how efficient the filter pipeline is.
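As an illustration of how a selector and pipeline stages compose, the following Python sketch assembles a LogQL query string and the matching query_range URL. The helper names, the host loki.example, and the status field are assumptions made for the example; the helper does not escape quotes inside label values, so treat it as a sketch rather than a client library.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_logql(selector, line_contains=None, parser=None, label_filter=None):
    """Compose a stream selector followed by optional pipeline stages."""
    matchers = ", ".join(f'{k}="{v}"' for k, v in selector.items())
    query = "{" + matchers + "}"
    if line_contains:
        query += f' |= "{line_contains}"'  # cheap line filter goes first
    if parser:
        query += f" | {parser}"  # e.g. json or logfmt
    if label_filter:
        query += f" | {label_filter}"  # filter on a parsed field
    return query


def query_range_url(base_url, query, start_ns, end_ns, limit=100):
    """Build a GET URL for /loki/api/v1/query_range."""
    params = {"query": query, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


query = build_logql(
    {"app": "checkout", "env": "production"},
    line_contains="timeout",
    parser="json",
    label_filter="status >= 500",
)
url = query_range_url(
    "http://loki.example:3100",  # hypothetical host
    query,
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_600_000_000_000,
)
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The resulting query reads left to right: two label matchers narrow the streams, the line filter discards most lines cheaply, and only the survivors are parsed as JSON and filtered on status.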
Label Design: The Most Important Operational Decision
Labels in Loki behave identically to labels in Prometheus: every unique combination of label values creates a new stream. Too many streams — high cardinality — destroys Loki's cost advantage because the TSDB index explodes and chunk files become tiny (low compression ratio). The canonical antipatterns are:
- Using a user_id, request_id, or trace_id as a label. These belong inside the log line, extracted at query time with a parser pipeline.
- Using the pod name or deployment hash as a label in Kubernetes — use pod sparingly and prefer app/component.
- Encoding severity as a label (e.g., level="error") when the field is already inside the JSON payload — parse it with | json | level="error" at query time instead.
A safe starting set for a Kubernetes environment is: cluster, namespace, app, env. Everything else — hostname, pod name, log level, trace ID — lives in the log line and is extracted via parser pipelines.
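The effect of one bad label is easy to see with a little arithmetic. The Python sketch below counts distinct label-value combinations, which is how streams are defined; the records and the function name count_streams are made up for illustration.

```python
def count_streams(records, label_names):
    """Count unique label-value combinations, i.e. the number of streams."""
    return len({tuple(r.get(name) for name in label_names) for r in records})


apps = ["api", "worker"]
namespaces = ["payments", "search"]

# 1000 synthetic log records, each with its own request ID.
records = [
    {
        "cluster": "prod-1",
        "namespace": namespaces[i % 2],
        "app": apps[(i // 2) % 2],
        "request_id": f"req-{i}",
    }
    for i in range(1000)
]

safe = count_streams(records, ["cluster", "namespace", "app"])
risky = count_streams(records, ["cluster", "namespace", "app", "request_id"])
```

With the safe label set the 1000 records fall into 4 streams. Adding request_id as a label turns the same records into 1000 streams, each holding a single line.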
Deploying Loki with Promtail on Kubernetes
When Loki Wins (and When It Does Not)
Loki wins when:
- You already run Grafana and Prometheus and want a unified observability stack without a separate Elasticsearch team.
- Log volume is high (hundreds of GB/day) and cost is a constraint — S3 storage is 10–30x cheaper than Elasticsearch EBS volumes per GB.
- Queries follow a known pattern: "show me logs from this service in this namespace during this incident window." Label-first retrieval is fast for this workload.
- You need to correlate logs with metrics and traces in a single Grafana dashboard — Grafana's Explore view links Loki log lines directly to Tempo traces via traceID fields.
Loki struggles when:
- You need sub-second arbitrary full-text search across all logs simultaneously (compliance tooling, forensic investigation of unknown patterns). Elasticsearch is faster here.
- Your team's workflow is built around Kibana's UI — LogQL has a learning curve and Grafana Explore is not a drop-in replacement.
- You have unstructured, inconsistently formatted logs from legacy systems where parsing is unreliable — the full-text index of Elasticsearch is more forgiving.
Retention is enforced only when retention_enabled: true is set in the Compactor config AND the Compactor component is actually running. Many teams discover after 90 days that their S3 bucket has grown without bound. Always verify with aws s3 ls s3://my-loki-chunks --recursive --summarize | tail -2 after a week of operation.
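The total size alone does not say whether old data is being deleted. One rough way to go further, sketched below in Python, is to parse the full aws s3 ls --recursive listing and look at the age of the oldest object. The function name, the sample listing, and the 90-day retention figure are assumptions for illustration; the timestamps in the listing are treated as naive datetimes in the same time zone as "now".

```python
from datetime import datetime


def oldest_object_age_days(ls_output, now):
    """Return the age in days of the oldest object in `aws s3 ls` output."""
    oldest = None
    for line in ls_output.splitlines():
        parts = line.split(None, 3)  # date, time, size, key
        if len(parts) < 4:
            continue  # blank lines and the "Total ..." summary lines
        try:
            stamp = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # not an object line
        if oldest is None or stamp < oldest:
            oldest = stamp
    return None if oldest is None else (now - oldest).days


sample = """\
2024-01-01 00:00:00    1048576 fake/chunk-a
2024-05-20 12:30:00    2097152 fake/chunk-b

Total Objects: 2
   Total Size: 3145728
"""

age = oldest_object_age_days(sample, now=datetime(2024, 6, 1))
retention_days = 90  # assumed retention period
retention_looks_broken = age is not None and age > retention_days
```

If the oldest object is much older than the configured retention period, that suggests the Compactor is not deleting anything and its configuration and logs are worth checking.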
Alerting with Loki Ruler
The Loki Ruler evaluates LogQL metric expressions on a schedule and fires Prometheus-compatible alerts. Define rules in the same YAML format as Prometheus:
Rules are loaded via the Ruler API or stored in object storage (S3 prefix). Alerts route through Alertmanager, identical to Prometheus. This lets you maintain a single alert routing tree for both metric and log-based alerts.
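As a sketch of what such a rule group looks like, the Python below builds one as a dictionary using the Prometheus rule fields (groups, rules, alert, expr, for, labels, annotations) with a LogQL metric expression in expr. The alert name, threshold, and label values are invented for the example. The output is printed as JSON, which YAML parsers generally accept, so it can serve as a starting point for a rule file.

```python
import json


def error_rate_alert(app, threshold_per_sec, window="5m", hold="10m"):
    """Build a rule group that alerts on the rate of lines containing "error"."""
    expr = f'sum(rate({{app="{app}"}} |= "error" [{window}])) > {threshold_per_sec}'
    return {
        "groups": [
            {
                "name": f"{app}-log-alerts",
                "rules": [
                    {
                        "alert": "HighErrorLogRate",  # hypothetical alert name
                        "expr": expr,
                        "for": hold,  # condition must hold this long before firing
                        "labels": {"severity": "page"},
                        "annotations": {
                            "summary": f"{app} is logging errors faster than {threshold_per_sec}/s"
                        },
                    }
                ],
            }
        ]
    }


rules = error_rate_alert("checkout", 10)
print(json.dumps(rules, indent=2))
```

The only part that differs from a Prometheus rule is the expression: a stream selector and line filter wrapped in rate() and sum(), evaluated by the Ruler instead of a PromQL engine.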