Logging at Scale: ELK & Loki

The ELK Stack

The ELK Stack (Elasticsearch, Logstash, and Kibana) is one of the most widely deployed open-source centralized logging solutions in the industry. Large organisations such as Netflix, LinkedIn, Uber, and Goldman Sachs run ELK at petabyte scale. Understanding its architecture is the foundation for operational log work at that scale.

The Three-Component Architecture

Each component has one job, and the separation is intentional. Breaking that boundary is one of the most common causes of production ELK incidents.

Elasticsearch — a distributed, inverted-index search and analytics engine. It stores logs, indexes every field, and executes queries across terabytes in milliseconds. Logs are written to time-stamped indices (logs-app-2025.06.11). Each index is sharded across nodes and replicated for fault-tolerance.
Logstash (or Elastic Ingest Nodes) — the ingestion and transformation layer. It receives raw log streams, parses them into structured JSON (using Grok, Dissect, CSV, JSON filters), enriches fields (GeoIP, user-agent, DNS), and forwards to Elasticsearch. It also acts as a buffer under backpressure.
Kibana — the visualisation and query interface. Engineers use the Discover tab to run KQL (Kibana Query Language) searches, build dashboards with aggregation charts, and set alerting rules. In production it is also the entry point for APM and Fleet management.

ELK Stack: logs flow from sources through lightweight shippers, into Logstash for parsing and enrichment, and finally into an Elasticsearch cluster queried by Kibana.

Elasticsearch: What Operators Must Know

Elasticsearch uses an inverted index: every token in every field is indexed at write time. This is why full-text search across billions of log lines can be sub-second, but it also means every write pays an indexing cost in CPU and I/O, and disk usage can grow to 3–5× the raw log volume. Each index is divided into primary shards (write targets) and replica shards (read scaling and HA). Poor shard sizing is the most common cause of Elasticsearch cluster degradation in production.
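
To make the inverted-index idea concrete, here is a minimal Python sketch. It is a toy model, not how Lucene is implemented: tokenisation is a plain whitespace split, and the function names and sample log lines are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, *terms):
    """Return ids of documents that contain every term (AND semantics)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical log lines keyed by document id.
logs = {
    1: "GET /checkout 500 timeout",
    2: "GET /health 200 ok",
    3: "POST /checkout 500 upstream error",
}
index = build_inverted_index(logs)
hits = search(index, "500", "/checkout")  # ids of lines containing both tokens
```

The work happens at write time (every token is recorded), so a query only intersects small posting sets instead of scanning every line. That is the trade-off described above: fast reads paid for with extra write work and storage.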

Oversharding kills clusters. Elastic's own guidance is to aim for shards of roughly 10–50 GB each. Thousands of tiny shards (a common mistake when daily indices are created with default settings) bloat the cluster state that the master nodes must manage and add per-shard overhead on the data nodes, causing slow searches and eventually out-of-memory crashes. Use ILM (Index Lifecycle Management) and set number_of_shards explicitly in your index template.
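
The sizing arithmetic can be sketched in a few lines of Python. The 40 GB target, the helper names, and the index pattern are assumptions chosen for illustration; the returned dictionary follows the shape of a composable index template body, but check it against your Elasticsearch version before using it.

```python
import math

def primary_shards(daily_gb, target_shard_gb=40, rollover_days=1):
    """Primary shards needed so each shard lands near target_shard_gb."""
    index_size_gb = daily_gb * rollover_days
    return max(1, math.ceil(index_size_gb / target_shard_gb))

def index_template(pattern, daily_gb):
    """Build an index template body with an explicit shard count."""
    return {
        "index_patterns": [pattern],
        "template": {
            "settings": {
                "number_of_shards": primary_shards(daily_gb),
                "number_of_replicas": 1,  # assumed: one replica per primary
            }
        },
    }

# Hypothetical workload: 200 GB/day of nginx access logs in daily indices.
template = index_template("nginx-access-*", daily_gb=200)
```

A service that writes only 2 GB/day gets a single primary shard from the same function, which is the point: the shard count should follow data volume rather than a default.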

Logstash: Ingestion Pipeline

A Logstash pipeline has three sections: input (where logs come from), filter (how they are parsed and enriched), and output (where they go). The most important filter is Grok, which matches free-text log lines against named regex patterns.

# /etc/logstash/conf.d/nginx.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [fields][log_type] == "nginx_access" {
    grok {
      match => {
        "message" => '%{IPORHOST:client_ip} - %{USER:ident} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{DATA:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status:int} %{NUMBER:bytes:int} "%{DATA:referrer}" "%{DATA:user_agent}"'
      }
    }
    date {
      match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
      target => "@timestamp"
    }
    geoip {
      source => "client_ip"
      target => "geoip"
    }
    mutate {
      remove_field => ["message", "timestamp", "agent"]
    }
  }
  if "_grokparsefailure" in [tags] {
    drop {}
  }
}

output {
  elasticsearch {
    hosts => ["https://es-node-1:9200", "https://es-node-2:9200"]
    index => "nginx-access-%{+YYYY.MM.dd}"
    user => "logstash_writer"
    password => "${LOGSTASH_ES_PASSWORD}"
    ssl_certificate_authorities => ["/etc/logstash/certs/ca.crt"]
  }
}
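
To see what the grok and date filters above do to a single line, the same extraction can be approximated in plain Python. This is a sketch only: the regular expression is a hand-written stand-in for the Grok patterns (it is looser than IPORHOST or HTTPDATE), the sample line is invented, and the month parsing assumes an English locale.

```python
import re
from datetime import datetime

# Hand-written approximation of the Grok pattern in the pipeline above.
ACCESS_RE = re.compile(
    r'(?P<client_ip>\S+) - (?P<ident>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<request>.*?) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<status>\d+) (?P<bytes>\d+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'
)

def parse_access_line(line):
    """Return a structured event, or None when the line does not match."""
    match = ACCESS_RE.match(line)
    if match is None:
        return None  # the pipeline would tag this _grokparsefailure
    event = match.groupdict()
    event["status"] = int(event["status"])  # mirrors :int in the Grok pattern
    event["bytes"] = int(event["bytes"])
    parsed = datetime.strptime(event.pop("timestamp"), "%d/%b/%Y:%H:%M:%S %z")
    event["@timestamp"] = parsed.isoformat()  # mirrors the date filter
    return event

sample = ('203.0.113.7 - - [11/Jun/2025:10:15:32 +0000] '
          '"GET /cart HTTP/1.1" 200 512 "-" "curl/8.0"')
event = parse_access_line(sample)
```

Running a handful of real log lines through a local check like this before deploying a pattern is a cheap way to catch parse failures, which matters here because the pipeline drops unparsed events.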

Prefer Elastic Ingest Pipelines for simple parsing. Logstash is a heavyweight JVM process — it needs 1–4 GB of heap and careful tuning. For straightforward field extraction and enrichment, Elasticsearch's built-in Ingest Node pipelines (run server-side, defined via the API) eliminate the Logstash hop entirely and reduce operational complexity. Reserve Logstash for complex routing logic, multiple output destinations, or heavy transformation workloads.
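
As a rough illustration, the nginx parsing shown earlier could be expressed as an ingest pipeline definition. The sketch below only builds the request body as a Python dictionary; the function name is hypothetical, and the body would be sent to the ingest pipeline API (PUT _ingest/pipeline/<name>) with whatever client you already use. Verify processor options against the documentation for your Elasticsearch version.

```python
import json

# Same Grok expression as the Logstash example, kept as a raw string.
NGINX_PATTERN = (
    r'%{IPORHOST:client_ip} - %{USER:ident} \[%{HTTPDATE:timestamp}\] '
    r'"%{WORD:method} %{DATA:request} HTTP/%{NUMBER:http_version}" '
    r'%{NUMBER:status:int} %{NUMBER:bytes:int} '
    r'"%{DATA:referrer}" "%{DATA:user_agent}"'
)

def nginx_ingest_pipeline():
    """Build an ingest pipeline body that mirrors the Logstash filters."""
    return {
        "description": "Parse nginx access lines (illustrative sketch)",
        "processors": [
            {"grok": {"field": "message", "patterns": [NGINX_PATTERN]}},
            {"date": {
                "field": "timestamp",
                "formats": ["dd/MMM/yyyy:HH:mm:ss Z"],
                "target_field": "@timestamp",
            }},
            {"geoip": {"field": "client_ip", "target_field": "geoip"}},
            {"remove": {"field": ["message", "timestamp"]}},
        ],
    }

body = json.dumps(nginx_ingest_pipeline(), indent=2)
```

The processors run in list order, so the remove step comes last, after the fields it deletes have been consumed.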

Kibana: Querying and Dashboards

Kibana connects to Elasticsearch via its REST API and provides the Discover view for ad-hoc investigation and the Dashboard editor for persistent visualisations. The query language is KQL (Kibana Query Language), a simplified syntax over Elasticsearch DSL.

# KQL examples in Kibana Discover
# Find all 5xx errors from the checkout service
status >= 500 and service.name : "checkout-api"

# Errors from any checkout-api pod in the last 15 minutes (set the range in the time picker)
# Wildcards must be unquoted in KQL; inside quotes the * is matched literally
kubernetes.pod.name : checkout-api-* and log.level : "error"

# Find slow requests (latency > 2000 ms) NOT from internal health checks
duration_ms > 2000 and NOT http.url : "/health"

# Approximate Elasticsearch DSL equivalent of the first KQL query (similar to what Kibana builds internally)
POST /nginx-access-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "status": { "gte": 500 } } },
        { "term":  { "service.name": "checkout-api" } }
      ]
    }
  },
  "sort": [{ "@timestamp": "desc" }],
  "size": 200
}
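
When the same search has to run from a script or an alerting job, the request body can be assembled programmatically. A minimal sketch follows; the function name and defaults are assumptions, and the body mirrors the DSL shown above rather than the exact query Kibana generates.

```python
def errors_query(service, min_status=500, size=200):
    """Build a search body: status >= min_status for one service, newest first."""
    return {
        "query": {
            "bool": {
                # filter clauses skip relevance scoring and can be cached
                "filter": [
                    {"range": {"status": {"gte": min_status}}},
                    {"term": {"service.name": service}},
                ]
            }
        },
        "sort": [{"@timestamp": "desc"}],
        "size": size,
    }

body = errors_query("checkout-api")
```

Using filter rather than must suits log search, where a document either matches or does not and ranking by relevance adds nothing.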

Production Failure Modes

Three failure modes account for the majority of ELK production incidents:

Heap pressure and GC pauses. Elasticsearch nodes are JVM processes. When heap usage stays above roughly 75%, garbage collection can cause multi-second pauses that slow ingestion and query response. Size the heap at no more than 50% of available RAM, and keep it at or below about 31 GB so the JVM can still use compressed OOPs. Treat jvm.mem.heap_used_percent (reported by the node stats API) as a critical metric and export it to your monitoring system, for example Prometheus.
Split-brain. In Elasticsearch 7+ this is largely solved by the quorum-based cluster coordination subsystem that replaced Zen Discovery: during a network partition, the minority side can no longer elect its own master. Nodes can still form separate clusters if discovery.seed_hosts or cluster.initial_master_nodes is misconfigured. Always run an odd number of master-eligible nodes (3 or 5), and set cluster.initial_master_nodes only for the first bootstrap of a new cluster.
Logstash pipeline stall. If Elasticsearch pushes back (full disk, circuit breaker open), the Logstash queue fills and the pipeline stops accepting input. With the default in-memory queue, events held in memory are lost if Logstash crashes or restarts, and shippers that cannot buffer may drop lines. Always enable the persistent queue (queue.type: persisted) in production and size it for at least 30 minutes of peak ingestion volume.
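
The three rules of thumb above reduce to simple arithmetic, sketched here in Python. The function names and the example workload are hypothetical; the 50% and 31 GB figures and the 30-minute window come from the text, and the resulting queue size is the kind of value you would give to Logstash's queue.max_bytes setting.

```python
COMPRESSED_OOPS_CAP_GB = 31

def recommended_heap_gb(ram_gb):
    """Half of RAM, capped so the JVM keeps using compressed OOPs."""
    return min(ram_gb // 2, COMPRESSED_OOPS_CAP_GB)

def master_quorum(master_eligible):
    """Votes needed to elect a master: a strict majority."""
    return master_eligible // 2 + 1

def tolerated_master_failures(master_eligible):
    """Master-eligible nodes that can fail while a quorum remains."""
    return master_eligible - master_quorum(master_eligible)

def queue_bytes(peak_events_per_sec, avg_event_bytes, minutes=30):
    """Disk needed to buffer `minutes` of peak ingestion."""
    return peak_events_per_sec * avg_event_bytes * minutes * 60

heap = recommended_heap_gb(64)           # 32 GB would exceed the cap
failures_3 = tolerated_master_failures(3)
failures_4 = tolerated_master_failures(4)
needed = queue_bytes(20_000, 500)        # assumed: 20k events/s at 500 bytes
```

Three and four master-eligible nodes tolerate the same number of failures, which is why the advice is to use an odd count: the fourth node adds cost without adding resilience.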

Hot-warm-cold architecture. At scale, running all data on fast NVMe nodes is cost-prohibitive. Elasticsearch supports tiered storage via ILM policies: logs live on hot nodes (SSD) for the first 7 days, roll to warm nodes (larger HDDs, reduced replicas) for 30 days, then move to cold (read-only, searchable snapshots in S3) for long-term retention. This is a common architecture once ingestion exceeds roughly 50 GB/day.
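
The tiering schedule can be expressed as a small function. This sketch assumes the 30 warm days start when the 7 hot days end, so an index turns cold at day 37; the function name and parameters are made up for illustration, and a real deployment would encode the same thresholds as min_age values in an ILM policy.

```python
def tier_for_age(age_days, hot_days=7, warm_days=30):
    """Return the storage tier an index of the given age should live on."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < hot_days:
        return "hot"
    if age_days < hot_days + warm_days:
        return "warm"
    return "cold"

# Where do indices of various ages sit under the schedule above?
placement = {age: tier_for_age(age) for age in (0, 6, 7, 36, 37, 365)}
```

Tightening hot_days is usually the first lever for cost, since the hot tier is the most expensive storage per gigabyte.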

Security Baseline

Never run ELK without security enabled. Before 8.0, a default Elasticsearch installation accepted unauthenticated requests, which has led to thousands of publicly exposed clusters. Since Elasticsearch 8.0, security (TLS + basic auth) is enabled by default. For existing deployments:

Enable TLS on all inter-node and client-to-node communication (xpack.security.enabled: true, xpack.security.transport.ssl.enabled: true for inter-node traffic, xpack.security.http.ssl.enabled: true for client traffic).
Use dedicated service accounts with minimal privileges (logstash_writer role: only write to specific index patterns; kibana_read_only for dashboard consumers).
Never expose the Elasticsearch HTTP port (9200) to the internet — place it behind a VPC security group or firewall, accessible only from the ingest tier and Kibana.
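
The least-privilege roles mentioned above might look like the following role definitions, built here as Python dictionaries. The exact privilege lists are an assumption for illustration (Elastic's documented Logstash writer role grants more, for example index template and ILM management), so treat this as a starting point to compare against the security documentation, not a recommended final policy. The bodies would be sent to the role API (/_security/role/<name>).

```python
def logstash_writer_role(index_patterns):
    """Role body that can create and append to matching indices only."""
    return {
        "cluster": ["monitor"],
        "indices": [{
            "names": list(index_patterns),
            # assumed minimal set: create indices and append documents
            "privileges": ["create_index", "create_doc", "auto_configure"],
        }],
    }

def kibana_read_only_role(index_patterns):
    """Role body for dashboard consumers: read access, no writes."""
    return {
        "indices": [{
            "names": list(index_patterns),
            "privileges": ["read", "view_index_metadata"],
        }],
    }

writer = logstash_writer_role(["nginx-access-*"])
reader = kibana_read_only_role(["nginx-access-*"])
```

Neither role can delete documents or indices, so a leaked Logstash credential cannot be used to erase log history.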