Logging at Scale: ELK & Loki

Elasticsearch for Operators

Running Elasticsearch in production is not the same as running it for a demo. At scale, the difference between a cluster that holds up under a 200 GB/day ingest spike and one that falls over comes down to four things you must own as an operator: index design, shard strategy, mapping discipline, and Index Lifecycle Management (ILM). This lesson is your operator's field guide to all four, plus the cluster-health signals that tell you when something is about to go wrong.

Indices and What They Actually Are

An index in Elasticsearch is the logical namespace for a collection of related JSON documents. Under the hood it is a set of one or more Apache Lucene indices. Every write goes to a primary shard, which then replicates to its replica shard(s). Reads are served from either — this is how Elasticsearch delivers both durability and read scalability from the same cluster.

For logs, you almost never create a single monolithic index. Instead you use a data stream, which is a sequence of time-stamped backing indices managed automatically. A write to the data stream name always lands on the current write index (the most recent backing index). Older backing indices stop receiving new documents; ILM rolls the write index over when it hits its thresholds and eventually deletes the old backing indices.

Data streams are the right model for logs. Do not create bare indices for time-series log data in 2025. Data streams give you automatic rollover, ILM integration, and the @timestamp sorting guarantee that log queries rely on.
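To make the write-index idea concrete, here is a minimal toy model in Python of how a data stream accumulates backing indices. It assumes the `.ds-<stream>-<yyyy.MM.dd>-<generation>` naming form for backing indices; the stream name `logs-app-prod` and the class itself are hypothetical and only illustrate the bookkeeping, not how Elasticsearch implements it.

```python
from datetime import date


def backing_index_name(data_stream: str, generation: int, created: date) -> str:
    """Build a name of the form .ds-<stream>-<yyyy.MM.dd>-<6-digit generation>."""
    return f".ds-{data_stream}-{created:%Y.%m.%d}-{generation:06d}"


class DataStreamModel:
    """Toy model: an append-only list of backing indices; the last is the write index."""

    def __init__(self, name: str, created: date):
        self.name = name
        self.backing = [backing_index_name(name, 1, created)]

    @property
    def write_index(self) -> str:
        # Writes addressed to the stream name always land here.
        return self.backing[-1]

    def rollover(self, today: date) -> str:
        # A rollover appends a new backing index; older ones stop taking writes.
        generation = len(self.backing) + 1
        self.backing.append(backing_index_name(self.name, generation, today))
        return self.write_index


stream = DataStreamModel("logs-app-prod", date(2025, 1, 1))  # hypothetical stream name
stream.rollover(date(2025, 1, 2))
print(stream.backing)
```

After one rollover the model holds two backing indices, and only the second one is the write target.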

Shards: the Unit of Scale and the Source of Most Production Pain

A shard is a single Lucene index. Every index has a fixed number of primary shards set at creation time; changing it means reindexing or using the shrink or split APIs, both of which write a new index. Each primary has a configurable number of replica shards. The rules of thumb used at large Elasticsearch deployments are:

  • Target 20–50 GB per shard. Shards outside this range are problematic: too small and the overhead of coordinating thousands of shards degrades query latency; too large and segment merges stall, recovery after a node failure takes too long, and heap pressure spikes.
  • One replica minimum in production. With zero replicas, any node failure takes that node's shards offline, turns the cluster red, and risks permanent data loss if the node's disk cannot be recovered.
  • Total shard count drives heap usage. Every shard carries mapping and segment metadata that lives in JVM heap on the node hosting it, and the master node tracks every shard in the cluster state. The per-shard cost is small, but a cluster with 100,000 shards will run into heap pressure and slow cluster-state updates even on 64 GB nodes.

Shard proliferation is the #1 cause of Elasticsearch cluster failures in production. Fluentd or Logstash writing a new index per service per day creates thousands of shards within weeks. Always use ILM with rollover conditions to prevent runaway shard growth.
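The arithmetic behind that warning is worth doing once. The sketch below compares steady-state shard counts for size-based rollover against an index-per-service-per-day scheme. The inputs (200 GB/day of indexed data, a 40 GB target shard, 150 services, 30-day retention) are illustrative assumptions, and the model ignores compression and any later shrink.

```python
import math


def shards_with_rollover(daily_gb: float, target_shard_gb: float,
                         replicas: int, retention_days: int) -> int:
    """Steady-state shard count when indices roll over at a target shard size."""
    primaries_per_day = math.ceil(daily_gb / target_shard_gb)
    return primaries_per_day * (1 + replicas) * retention_days


def shards_per_service_per_day(services: int, primaries_per_index: int,
                               replicas: int, retention_days: int) -> int:
    """Steady-state shard count when every service gets a fresh index each day."""
    return services * primaries_per_index * (1 + replicas) * retention_days


# Assumed workload: 200 GB/day, 40 GB shards, 1 replica, 30 days retained.
print(shards_with_rollover(200, 40, 1, 30))        # 300
# Assumed fleet: 150 services, 1 primary per daily index, 1 replica, 30 days.
print(shards_per_service_per_day(150, 1, 1, 30))   # 9000
```

Same data, same retention, thirty times the shard count: the per-service scheme pays the per-shard overhead on thousands of shards that are far below the target size.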

To inspect shard allocation right now:

    # Check overall cluster shard counts and sizes
    GET /_cat/shards?v&h=index,shard,prirep,state,docs,store,node&s=store:desc

    # Find the biggest shards (good for spotting over-grown shards).
    # _cat/shards has no row-limit parameter, so read the top rows of the sorted output.
    GET /_cat/shards?v&h=index,shard,prirep,store,node&s=store:desc

    # See unassigned shards and why they are unassigned
    GET /_cluster/allocation/explain
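If you pull the same data programmatically, checking it against the 20–50 GB rule of thumb is a few lines of Python. This sketch assumes the rows come from `GET /_cat/shards?format=json&bytes=b`, with `store` arriving as a byte count in a string (or null for unassigned shards). The sample rows and index names are made up for illustration.

```python
GB = 1024 ** 3


def classify_primaries(rows, min_gb=20, max_gb=50):
    """Bucket primary shards by size; replicas and unassigned shards are skipped."""
    report = {"too_small": [], "ok": [], "too_large": []}
    for row in rows:
        if row.get("prirep") != "p" or row.get("store") is None:
            continue
        size_gb = int(row["store"]) / GB
        if size_gb < min_gb:
            bucket = "too_small"
        elif size_gb > max_gb:
            bucket = "too_large"
        else:
            bucket = "ok"
        report[bucket].append((row["index"], row["shard"], round(size_gb, 1)))
    return report


# Hypothetical rows shaped like _cat/shards JSON output.
sample_rows = [
    {"index": "logs-a", "shard": "0", "prirep": "p", "store": str(3 * GB)},
    {"index": "logs-b", "shard": "0", "prirep": "p", "store": str(38 * GB)},
    {"index": "logs-b", "shard": "0", "prirep": "r", "store": str(38 * GB)},
    {"index": "logs-c", "shard": "0", "prirep": "p", "store": str(72 * GB)},
    {"index": "logs-d", "shard": "0", "prirep": "p", "store": None},
]

report = classify_primaries(sample_rows)
print(report)
```

Expect the current write index to show up under `too_small` until it fills; the entries worth acting on are rolled-over indices that stayed small and anything under `too_large`.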

Mappings: Schema-on-Write for Search

Elasticsearch is often described as "schema-free," but that is misleading. What it actually does is dynamic mapping: the first document that contains a field fixes that field's type for every subsequent document in the index. This is dangerous for log pipelines. If a field is first mapped as a long from one service's logs and another service then sends a non-numeric string in the same field, the second document fails with a mapping error: Elasticsearch rejects it with a 400, and depending on your ingest configuration the shipper may drop it silently.
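The first-document-wins behaviour is easy to see in a deliberately simplified Python model. `ToyIndex` below is hypothetical and handles only integers and strings: it is not Elasticsearch's mapper, but it reproduces the one property that matters here, which is that the outcome depends on arrival order.

```python
def infer_type(value):
    """Toy type inference: integers become 'long', everything else 'text'."""
    if isinstance(value, int) and not isinstance(value, bool):
        return "long"
    return "text"


def accepts(mapped_type, value):
    """Can a value be stored in a field that already has this mapped type?"""
    if mapped_type == "long":
        if isinstance(value, bool):
            return False
        if isinstance(value, int):
            return True
        if isinstance(value, str):
            try:
                int(value)  # numeric strings such as "200" can be coerced
                return True
            except ValueError:
                return False
        return False
    return True  # a text field takes any scalar in this toy model


class ToyIndex:
    def __init__(self):
        self.mapping = {}
        self.rejected = []

    def index(self, doc):
        """Return 201 if the document fits the mapping, 400 if it conflicts."""
        new_fields = {}
        for field, value in doc.items():
            mapped = self.mapping.get(field)
            if mapped is None:
                new_fields[field] = infer_type(value)
            elif not accepts(mapped, value):
                self.rejected.append(doc)
                return 400
        self.mapping.update(new_fields)
        return 201


idx = ToyIndex()
print(idx.index({"status": 200}))        # first document fixes status as long
print(idx.index({"status": "timeout"}))  # conflicts with the long mapping
```

Reverse the two documents and both are accepted, because the field is mapped as text first. That order dependence is exactly why explicit mappings beat hoping the right service logs first.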

The production pattern is to define an explicit index template with component templates that lock down the fields you know about and configure sensible defaults for the rest. Use dynamic: strict on objects where you want hard enforcement, or dynamic: true with a dynamic_templates block that catches unexpected string fields and maps them as keyword, so that new fields get one predictable type instead of whatever Elasticsearch would auto-detect.

    # Create a component template for timestamp + log level fields
    PUT /_component_template/logs-base-mappings
    {
      "template": {
        "mappings": {
          "dynamic_templates": [
            {
              "strings_as_keyword": {
                "match_mapping_type": "string",
                "mapping": { "type": "keyword", "ignore_above": 1024 }
              }
            }
          ],
          "properties": {
            "@timestamp": { "type": "date" },
            "log.level": { "type": "keyword" },
            "message": { "type": "text", "norms": false },
            "host.name": { "type": "keyword" },
            "service.name": { "type": "keyword" },
            "trace.id": { "type": "keyword" },
            "http.response.status_code": { "type": "short" }
          }
        },
        "settings": {
          "number_of_shards": 2,
          "number_of_replicas": 1,
          "index.codec": "best_compression",
          "index.refresh_interval": "5s"
        }
      }
    }

    # Create the index template that ties it to a data stream
    PUT /_index_template/logs-app
    {
      "index_patterns": ["logs-app-*"],
      "data_stream": {},
      "composed_of": ["logs-base-mappings"],
      "priority": 200
    }

Set norms: false on the message field. Norms store per-document field-length normalization data for relevance scoring, which is useless for log queries where you care about exact matches, not BM25 ranking. Disabling them saves ~1 byte per document per field, which compounds to gigabytes in high-volume clusters.

Index Lifecycle Management (ILM)

ILM is the policy engine that moves indices through defined phases — hot → warm → cold → frozen → delete — as they age. For a logging cluster this is non-negotiable: without ILM, disks fill up and you are deleting indices manually at 3 AM.

The key transition triggers are rollover conditions on the hot phase: when the write index reaches a maximum age, a maximum size, or a maximum document count, ILM creates a new write index and the old one stops receiving writes and starts moving toward warm. The min_age of every later phase is measured from that rollover. Typical production thresholds at a mid-size company: rollover at max_age: 1d or max_primary_shard_size: 40gb, move to warm after 2 days (force merge to 1 segment and shrink shards), move to cold after 7 days (mount as a searchable snapshot on object storage), delete after 30–90 days (regulatory window).

[Figure: ILM Phase Transitions for a Log Data Stream. HOT (Day 0: primary shards, active writes) → rollover → WARM (Day 1-2: force-merge, shrink shards) → age / size → COLD (Day 7: searchable snapshot, object storage) → retention → DELETE (Day 30-90: policy expiry, compliance window).]
ILM moves log indices from hot (active writes) through warm (merged, read-only) and cold (snapshot on object storage) to deletion, keeping storage costs predictable.
    # Define a production ILM policy for application logs.
    # The policy only takes effect once an index references it, for example via
    # "index.lifecycle.name": "logs-app-policy" in the template settings.
    PUT /_ilm/policy/logs-app-policy
    {
      "policy": {
        "phases": {
          "hot": {
            "min_age": "0ms",
            "actions": {
              "rollover": { "max_age": "1d", "max_primary_shard_size": "40gb" },
              "set_priority": { "priority": 100 }
            }
          },
          "warm": {
            "min_age": "2d",
            "actions": {
              "set_priority": { "priority": 50 },
              "readonly": {},
              "forcemerge": { "max_num_segments": 1 },
              "shrink": { "number_of_shards": 1 }
            }
          },
          "cold": {
            "min_age": "7d",
            "actions": {
              "set_priority": { "priority": 0 },
              "searchable_snapshot": { "snapshot_repository": "s3-log-archive" }
            }
          },
          "delete": {
            "min_age": "30d",
            "actions": { "delete": {} }
          }
        }
      }
    }
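When an index seems stuck, it helps to know which phase it should be in. The Python sketch below maps an index's age since rollover to the phase the policy above implies, using the same 2d / 7d / 30d thresholds. The function name is hypothetical, and the model assumes ages are measured from rollover. It ignores the fact that ILM acts on a polling interval, so real transitions lag slightly.

```python
from datetime import timedelta

# Phase thresholds copied from the policy above, checked from oldest to youngest.
PHASE_MIN_AGES = [
    ("delete", timedelta(days=30)),
    ("cold", timedelta(days=7)),
    ("warm", timedelta(days=2)),
    ("hot", timedelta(0)),
]


def expected_phase(age_since_rollover: timedelta) -> str:
    """Return the phase a rolled-over index should have reached at this age."""
    for phase, min_age in PHASE_MIN_AGES:
        if age_since_rollover >= min_age:
            return phase
    raise ValueError("age must not be negative")


for days in (0.5, 2, 6, 7, 45):
    print(days, expected_phase(timedelta(days=days)))
```

Comparing this expectation with what `GET <index>/_ilm/explain` reports is a quick way to spot an index whose lifecycle has stalled on a failed step.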

Reading Cluster Health

Elasticsearch exposes cluster health as green / yellow / red. Red means at least one primary shard is unassigned — some data is unavailable and writes to that shard are rejected. Yellow means all primaries are assigned but at least one replica shard is unassigned — the cluster is fully functional but has no redundancy for those shards. Green means everything is assigned and healthy.
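The colour rules reduce to a short decision function. This Python sketch is a simplified model, not the server's implementation: it takes `(prirep, state)` pairs like those in `_cat/shards` output and assumes a shard counts as active when its state is STARTED or RELOCATING.

```python
ACTIVE_STATES = {"STARTED", "RELOCATING"}  # assumption: both count as active


def cluster_status(shards) -> str:
    """Derive green / yellow / red from (prirep, state) pairs."""
    status = "green"
    for prirep, state in shards:
        if state in ACTIVE_STATES:
            continue
        if prirep == "p":
            return "red"      # an inactive primary means data is unavailable
        status = "yellow"     # an inactive replica only costs redundancy
    return status


print(cluster_status([("p", "STARTED"), ("r", "STARTED")]))
print(cluster_status([("p", "STARTED"), ("r", "UNASSIGNED")]))
print(cluster_status([("p", "UNASSIGNED"), ("r", "UNASSIGNED")]))
```

The early return captures the key asymmetry: one unavailable primary outweighs any number of healthy shards.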

In a production logging cluster, a sustained yellow state during a rolling restart is expected and acceptable. Red is always an incident. The fastest triage path is:

  1. Check GET /_cluster/health?pretty — look at unassigned_shards and active_shards_percent_as_number.
  2. Run GET /_cluster/allocation/explain — Elasticsearch will tell you exactly why a shard is unassigned (disk watermark, no eligible nodes, etc.).
  3. If disk watermarks are the cause, the default thresholds are 85% (low: no new shards are allocated to the node, apart from primaries of newly created indices), 90% (high: Elasticsearch moves shards away from the node), 95% (flood stage: every index with a shard on the node becomes read-only). Check with GET /_cat/nodes?v&h=name,diskUsed,diskAvail,diskTotal,diskUsedPercent.

Monitor jvm.mem.heap_used_percent on every node. Above 85% heap usage, the JVM garbage collector can start causing multi-second stop-the-world pauses. This is the second-most-common cause of production Elasticsearch incidents after shard proliferation. Alert at 75% heap; page at 85%.
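Both sets of thresholds translate directly into alerting logic. The Python sketch below classifies a node by the default disk watermarks and the heap levels given above. The function names and the sample node readings are hypothetical, and the thresholds are parameters because clusters often override the defaults.

```python
def disk_stage(used_percent, low=85.0, high=90.0, flood=95.0):
    """Map disk usage to the watermark stage it has crossed."""
    if used_percent >= flood:
        return "flood_stage"
    if used_percent >= high:
        return "high"
    if used_percent >= low:
        return "low"
    return "ok"


def heap_action(heap_used_percent, alert_at=75.0, page_at=85.0):
    """Map JVM heap usage to the response suggested in this lesson."""
    if heap_used_percent >= page_at:
        return "page"
    if heap_used_percent >= alert_at:
        return "alert"
    return "ok"


# Hypothetical per-node readings: (disk used %, heap used %).
nodes = {"es-hot-1": (62.0, 71.0), "es-hot-2": (91.5, 78.0), "es-warm-1": (96.0, 88.0)}
for name, (disk, heap) in nodes.items():
    print(name, disk_stage(disk), heap_action(heap))
```

In practice the inputs would come from node stats or a metrics exporter; the point is that each threshold maps to a distinct operational response rather than a single generic warning.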

Key Operator Habits

  • Always test a new ILM policy against a non-production data stream before applying it to production indices — the shrink action is destructive and cannot be undone.
  • Pin your index template priority (use values like 200 for your templates) so that Elastic's built-in templates at priority 100 do not accidentally override yours.
  • Use GET /_data_stream/logs-app-* to see which backing indices exist and which is the current write index — this is the first thing to check when logs stop arriving.
  • Avoid _forcemerge on hot indices. It is a heavily I/O-bound operation that competes with active indexing and can destabilize a node under load.