Querying & Investigating with Logs
Knowing how to ingest and store logs is the foundation. Knowing how to query them under pressure — during an active incident at 3 AM, when the CEO is asking what broke — is the operational skill that separates senior engineers from everyone else. This lesson is a deep dive into effective log search: the query languages, the debugging workflows, and the habits that let you move from "something is wrong" to "root cause identified" in minutes rather than hours.
Two Query Languages: LogQL and KQL
Your choice of storage backend determines your query language. Loki uses LogQL; Elasticsearch, when queried through Kibana, uses KQL (Kibana Query Language) or the full Lucene syntax. Both follow the same conceptual pattern: first restrict the search space with label or index filters, then apply content filters over the narrowed result set. Doing it in the wrong order, scanning all content before filtering by label, is a common cause of slow queries at scale.
LogQL (Grafana Loki)
A LogQL query has two parts: a mandatory stream selector (curly-brace label matchers) and an optional log pipeline (pipe-delimited stages). The stream selector is evaluated against the index, so it is cheap. The pipeline stages run over the raw chunk data, so they are expensive and should see as little data as possible.
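The selector-then-pipeline structure can be sketched in Python. This is only an illustration: build_logql is a hypothetical helper (not part of Loki or any client library), the label names and values are assumptions, and it does no escaping of quotes inside values.

```python
from typing import Mapping, Sequence


def build_logql(
    labels: Mapping[str, str],
    line_filters: Sequence[str] = (),
    stages: Sequence[str] = (),
) -> str:
    """Assemble a LogQL log query: stream selector first, then pipeline stages."""
    if not labels:
        # The stream selector is the mandatory part of the query.
        raise ValueError("a stream selector needs at least one label matcher")
    selector = "{" + ", ".join(f'{k}="{v}"' for k, v in labels.items()) + "}"
    # Cheap substring filters go before parser stages such as json.
    pipeline = [f'|= "{text}"' for text in line_filters]
    pipeline += [f"| {stage}" for stage in stages]
    return " ".join([selector, *pipeline])


query = build_logql(
    {"namespace": "checkout", "env": "production"},
    line_filters=["timeout"],
    stages=["json", 'level="error"'],
)
print(query)
```

The resulting string keeps the index-backed label matchers at the front and defers JSON parsing until after the substring filter has discarded most lines.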
Tip: put the most selective matcher first, for example namespace="checkout" before env="production". The more chunks the stream selector prunes, the faster every subsequent stage runs. This idea is often called "filter pushdown", and it applies to every log query engine.
KQL (Kibana / Elasticsearch)
KQL is a simplified query language layered over Lucene. It maps intuitively to field-level searches, ranges, and boolean logic. For power users, the full Lucene syntax (enabled via the toggle in Kibana Discover) adds fuzzy matching, proximity searches, and wildcard patterns. Both are translated into Elasticsearch DSL queries internally.
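As a rough sketch of what a structured-field search looks like once it reaches Elasticsearch, the Python below builds a bool query that uses only filter context. The field names (service.name, log.level, @timestamp) are assumptions in the style of the Elastic Common Schema, to_filter_query is a hypothetical helper, and the DSL that Kibana actually generates for a given KQL expression may differ in detail.

```python
import json


def to_filter_query(service: str, level: str, gte: str, lte: str) -> dict:
    """Build an Elasticsearch bool query that only uses filter context."""
    return {
        "query": {
            "bool": {
                # filter clauses do not contribute to relevance scoring.
                "filter": [
                    # term queries assume these fields are mapped as keyword.
                    {"term": {"service.name": service}},
                    {"term": {"log.level": level}},
                    {"range": {"@timestamp": {"gte": gte, "lte": lte}}},
                ]
            }
        }
    }


body = to_filter_query("checkout", "error", "now-15m", "now")
print(json.dumps(body, indent=2))
```

A comparable KQL expression, under the same field-name assumptions, would be service.name : "checkout" and log.level : "error", with the time range supplied by the Kibana time picker.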
Tip: in the Elasticsearch DSL, filter clauses (as opposed to must clauses) do not affect relevance scoring and can be cached by Elasticsearch, so they are typically much faster. Prefer filter for all structured-field queries. Only use must (which scores) when you need full-text relevance ranking, which is rare in operational log search.
The Log-Driven Debugging Workflow
A structured debugging workflow is not optional at big-tech scale — it is what stops you from thrashing randomly through logs while MTTR climbs. Top-tier SRE teams follow a consistent pattern regardless of the logging backend:
- Scope: Set the time window to just before the alert fired. Lock in the relevant service labels. Query cost grows with the volume of data scanned, so start narrow and expand only if needed.
- Volume: Run an error-rate query aggregated over time (e.g., rate({app="checkout"} | json | level="error" [1m])). Identify the exact minute the error rate spiked. This prevents you from wasting time on unrelated noise in the same window.
- Filter: Drill into the spike minute. Apply keyword and field filters to isolate the error class. Look at a handful of raw log lines: the actual error message, stack trace, or upstream host is almost always visible here.
- Correlate: Take a trace_id, request_id, or user_id from one of the failing log lines. Query across all services for that identifier. This reconstructs the full request path and reveals which service actually injected the fault, versus which services are downstream victims.
- RCA: Establish the timeline: when did the first anomalous log appear? What changed (deploy, config push, traffic spike, certificate expiry)? What is the blast radius (how many users/requests were affected)? Document this for the postmortem.
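The Volume and Correlate steps can be illustrated on a handful of in-memory records. This is a toy sketch, not a replacement for the backend queries: the log records, field names, and the helpers spike_minute and correlate are all invented for illustration.

```python
from collections import Counter

# Hypothetical structured log records from two services.
logs = [
    {"ts": "2024-05-01T03:01:10Z", "service": "checkout", "level": "info", "trace_id": "a1", "msg": "order accepted"},
    {"ts": "2024-05-01T03:02:05Z", "service": "payments", "level": "error", "trace_id": "b2", "msg": "upstream connect timeout"},
    {"ts": "2024-05-01T03:02:07Z", "service": "checkout", "level": "error", "trace_id": "b2", "msg": "payment call failed"},
    {"ts": "2024-05-01T03:02:30Z", "service": "checkout", "level": "error", "trace_id": "c3", "msg": "payment call failed"},
    {"ts": "2024-05-01T03:03:01Z", "service": "checkout", "level": "error", "trace_id": "d4", "msg": "payment call failed"},
]


def spike_minute(records):
    """Volume step: return the minute (YYYY-MM-DDTHH:MM) with the most errors."""
    per_minute = Counter(r["ts"][:16] for r in records if r["level"] == "error")
    return per_minute.most_common(1)[0][0] if per_minute else None


def correlate(records, trace_id):
    """Correlate step: all lines for one trace, across services, in time order."""
    matching = [r for r in records if r["trace_id"] == trace_id]
    # ISO 8601 timestamps in one fixed format sort correctly as strings.
    return sorted(matching, key=lambda r: r["ts"])


spike = spike_minute(logs)
path = correlate(logs, "b2")
print(spike)
print([(r["service"], r["msg"]) for r in path])
```

In this sample the earliest line for trace b2 comes from payments, which points at payments as the likely origin of the fault and at checkout as a downstream victim.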
Cross-Service Correlation with Trace IDs
The most powerful capability in a modern observability stack is jumping from a log line to the full distributed trace and back. This works only if every log line emitted by every service carries a trace_id that matches the trace recorded by Tempo or Jaeger. In practice this means your application framework (OpenTelemetry SDK, Spring Sleuth, etc.) must inject the active trace context into the log MDC/context map, and your structured log format must emit it as a top-level JSON field.
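A minimal sketch of the logging side, using only the Python standard library, is shown here. It assumes the trace ID is available in a context variable; current_trace_id, TraceIdFilter, and JsonFormatter are hypothetical names, and in a real service the ID would come from the tracing SDK rather than being set by hand.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace ID of the current request.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every record passing through the handler."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object with trace_id as a top-level field."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.addFilter(TraceIdFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.error("payment call failed")

line = json.loads(stream.getvalue())
print(line)
```

Because trace_id is a top-level JSON field, a query can extract it with a JSON parser stage and match it against the trace backend without any regex on the message text.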
In Grafana, the Loki data source can be configured with a derived field that turns every trace_id value in a log line into a clickable link that opens Grafana Tempo at that exact trace. This eliminates the manual copy-paste step and is how top-tier teams achieve sub-five-minute MTTR on complex distributed failures.
Alerting from Log Queries
A log query that you run manually during an incident is only half the value. The same query, running on a schedule, becomes a proactive alert that pages you before a user files a ticket. Both Grafana and Kibana support log-based alerting rules. In Grafana, a LogQL metric query can back a standard alert rule; in Kibana, Alerting rules evaluate KQL/ES|QL queries on a configurable schedule.
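The idea of a scheduled query that only fires after the condition has held for a while can be sketched as follows. This is a simplified model and not how Grafana or Kibana implement alerting: should_fire is a hypothetical helper, the samples are invented, and one sample per minute is assumed so that three consecutive samples approximate a three-minute pending period.

```python
from typing import Sequence


def should_fire(samples: Sequence[float], threshold: float, pending_evals: int) -> bool:
    """Fire only if the last `pending_evals` samples all exceed the threshold."""
    if len(samples) < pending_evals:
        return False  # not enough history yet to satisfy the pending period
    return all(value > threshold for value in samples[-pending_evals:])


# Hypothetical errors-per-second readings, one per scheduled evaluation.
error_rates = [0.1, 4.2, 0.2, 3.9, 4.5, 5.1]

print(should_fire(error_rates[:3], threshold=1.0, pending_evals=3))  # brief blip
print(should_fire(error_rates, threshold=1.0, pending_evals=3))      # sustained
```

The single reading of 4.2 does not fire because it is surrounded by healthy samples, while the last three readings stay above the threshold and do.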
Tip: set for: 3m in Grafana alert rules to require the condition to be true continuously for three minutes before firing. This single setting eliminates many spurious pages.
Common Query Anti-Patterns
These are the mistakes that make incident investigations slow and logging systems expensive:
- Querying without a time bound. "All time" queries in Loki scan every chunk ever written. Always set a time range in the Grafana time picker or pass start/end to the Loki HTTP API explicitly.
- Using regex when a string match suffices. |= "error" is a simple byte-scan; |~ "err.*r" invokes the RE2 engine. On gigabytes of logs the difference can be a 10–20× increase in query time. Only use regex when the pattern genuinely requires it.
- High-cardinality labels in Loki. Adding user_id or request_id as Loki stream labels creates millions of streams and collapses query performance. Put high-cardinality data in the log line body (parsed with | json), not the label set.
- Forgetting to check for parser errors. When LogQL parses a JSON log line that is malformed, it sets the __error__ label. Including | __error__="" in rate queries ensures you are counting real events, not parse failures masquerading as gaps in your data.
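Two of the points above, an explicit time bound and the __error__ check, can be combined in a small sketch that builds a request URL for Loki's query_range endpoint. The host name, the app label, and the helper query_range_url are assumptions for illustration, and the code only constructs the URL without sending a request.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlencode, urlparse


def query_range_url(base_url: str, query: str, end: datetime,
                    window: timedelta, limit: int = 100) -> str:
    """Build a Loki query_range URL with an explicit start and end."""
    start = end - window
    params = {
        "query": query,
        # Timestamps are sent as Unix epoch nanoseconds (second resolution here).
        "start": int(start.timestamp()) * 1_000_000_000,
        "end": int(end.timestamp()) * 1_000_000_000,
        "limit": limit,
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


# Error rate that excludes lines the json parser could not parse.
query = 'sum(rate({app="checkout"} | json | __error__="" | level="error" [1m]))'

url = query_range_url(
    "http://loki.example.internal:3100",  # hypothetical Loki address
    query,
    end=datetime(2024, 5, 1, 3, 5, tzinfo=timezone.utc),
    window=timedelta(minutes=15),
)
parsed = parse_qs(urlparse(url).query)
print(url)
```

Keeping the window as an explicit argument makes it harder to issue an unbounded query by accident, since every call has to state how far back it looks.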