Logging at Scale: ELK & Loki

Project: A Centralized Logging Platform

Every concept from the preceding nine lessons converges here. You have learned how logs move through a pipeline, how to emit structured events, how to configure the ELK stack and Grafana Loki, how Elasticsearch handles shards and retention, how Fluent Bit and Promtail ship logs in Kubernetes, how to write LogQL and Kibana KQL queries, and how to balance cost against compliance. This capstone project ties all of it together by walking you through the complete design of a centralized logging platform for a realistic microservices estate: the kind of system you would sketch on a whiteboard in an interview for a principal-level SRE role, or propose in an architecture review at a major cloud-native company.

The Problem Domain

You are joining a fintech company that runs 60 microservices across three Kubernetes clusters (us-east-1, eu-west-1, ap-southeast-1). Current state: each service writes to stdout, kubectl logs is the only investigation tool, and there is no retention beyond the 72-hour node buffer. A P0 incident last quarter took four hours to diagnose because the relevant pod had been rescheduled and its logs were gone. Leadership has approved a centralized logging platform. Your job is to design and implement it.

Requirements gathered from stakeholders: full-text search on all logs within 30 seconds of emission; 90-day retention with fast (sub-5-second) queries for the last 7 days; PCI-DSS compliance, meaning card numbers (PANs) and CVVs must never reach the storage layer; alerting on error-rate spikes; cost under $4,000/month at 200 GB/day ingestion; and a single query interface for all three regions.
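The budget and retention requirements imply a few numbers worth keeping in view for the rest of the design. The sketch below works them out. It assumes a 30-day month and treats stored volume as equal to ingested volume, which ignores replicas, compression, and index overhead.

```python
# Back-of-envelope figures derived from the stated requirements.
INGEST_GB_PER_DAY = 200
RETENTION_DAYS = 90
BUDGET_USD_PER_MONTH = 4_000
DAYS_PER_MONTH = 30  # assumption for the estimate

monthly_ingest_gb = INGEST_GB_PER_DAY * DAYS_PER_MONTH
budget_per_ingested_gb = BUDGET_USD_PER_MONTH / monthly_ingest_gb
retained_gb = INGEST_GB_PER_DAY * RETENTION_DAYS

print(f"monthly ingest: {monthly_ingest_gb} GB")
print(f"budget per ingested GB: ${budget_per_ingested_gb:.2f}")
print(f"steady-state retained volume: {retained_gb / 1000:.0f} TB")
```

Roughly two thirds of a dollar per ingested gigabyte has to cover collection, transport, indexing, and 90 days of storage, which is why the tiering decisions later in the lesson matter.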

Choosing the Right Stack

At 200 GB/day, both ELK and Loki are viable. The decision hinges on query patterns. Fintech incident investigation is almost always field-based ("show me all transactions where user_id=U123 failed in the last hour"), which maps directly to Elasticsearch's inverted index. Loki indexes only labels: every query must start with a label selector, and an ad-hoc field search has to parse each matching line at query time with a pipeline such as | json | user_id="U123", which is slower at this volume. Decision: ELK for primary storage, with Loki as a lightweight secondary store for infrastructure/system logs. OpenTelemetry Collector acts as the unified collection and routing layer.

Key principle: Choose the storage backend to match your dominant query pattern. ELK wins on full-text and field search. Loki wins on cost per GB and native Grafana integration for ops logs, provided label cardinality stays low. Many large organizations run both.

Pipeline Architecture

The diagram below shows the complete end-to-end pipeline. Every microservice pod emits structured JSON to stdout. Fluent Bit DaemonSets collect and enrich with Kubernetes metadata. The OpenTelemetry Collector receives from all three regional Fluent Bit fleets, applies PCI redaction, and fans out to Kafka (for durability and replay) and directly to a regional Loki instance (for infra logs). Logstash consumers read from Kafka and write to a globally replicated Elasticsearch cluster. Kibana and Grafana provide the query interfaces.

End-to-end centralized logging pipeline: three regional clusters feed a global OTel Collector that redacts PCI data, fans out to Kafka for durability, then Logstash indexes into Elasticsearch. Infrastructure logs also flow to a regional Loki instance.

PCI Redaction at the Collector Layer

The hardest compliance requirement is ensuring card numbers never reach Elasticsearch. The right place to enforce this is at the OpenTelemetry Collector, not in the application — application-level redaction is easy to miss across 60 services, and a single missed log line is a compliance violation. The OTel Collector's transform processor applies OTTL (OpenTelemetry Transformation Language) rules before any data is forwarded.

## otelcol-config.yaml — PCI redaction in the transform processor
processors:
  transform/redact_pci:
    log_statements:
      - context: log
        statements:
          ## Redact unformatted 13-16 digit PANs (Visa, Mastercard 51-55, Amex, Discover).
          ## Not covered: Mastercard 2-series BINs and PANs written with spaces or hyphens.
          - replace_pattern(body, "\\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|6(?:011|5[0-9]{2})[0-9]{12})\\b", "***REDACTED-PAN***")
          ## Mask the value of the card.cvv attribute, if present
          - replace_pattern(attributes["card.cvv"], ".*", "***")
          ## Delete attributes named track1 or track2 (raw magnetic-stripe data)
          - delete_matching_keys(attributes, "^track[12]$")

  ## After redaction, hash user IDs for pseudonymisation
  transform/pseudonymise:
    log_statements:
      - context: log
        statements:
          - set(attributes["user.id.hash"], SHA256(attributes["user.id"])) where attributes["user.id"] != nil
          - delete_key(attributes, "user.id")

service:
  pipelines:
    logs:
      receivers: [otlp, fluentforward]
      processors: [memory_limiter, transform/redact_pci, transform/pseudonymise, batch]
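      ## As written, every log record goes to both exporters. Sending only
      ## infrastructure logs to Loki needs a second pipeline or a routing connector.
      exporters: [kafka, loki]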
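
To make the effect of the transform/pseudonymise statements concrete, here is a rough Python equivalent that acts on a plain dictionary of attributes. The function name is hypothetical, and the sketch models only the attribute rewrite, not the Collector itself.

```python
import hashlib

def pseudonymise(attributes: dict) -> dict:
    """Replace user.id with its SHA-256 hex digest, mirroring the two OTTL statements."""
    out = dict(attributes)  # copy so the caller's dict is left untouched
    user_id = out.pop("user.id", None)
    if user_id is not None:
        digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
        out["user.id.hash"] = digest
    return out

record = {"user.id": "U123", "service": "payments"}
print(pseudonymise(record))
```

The same input always yields the same hash, so investigators can still correlate events for one user. An unsalted hash of a short, guessable ID can be reversed by enumeration, so a keyed hash (HMAC) is a common hardening step, at the cost of managing the key.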

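Production pitfall: Regex-based redaction has false negatives; a PAN written with spaces or hyphens slips past the pattern above. Always combine it with secondary controls: Elasticsearch field-level security that excludes any field matching *.card* or *.pan*, and periodic audit queries such as GET logs-*/_search?q=body:/4[0-9]{12}([0-9]{3})?/ that alert if a PAN pattern ever lands in the index. The expression must be wrapped in forward slashes for the query string syntax to treat it as a regular expression. PCI compliance calls for defense in depth, not single-layer redaction.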
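
The false-negative problem is easy to demonstrate offline. The sketch below reuses the collector's brand pattern in Python's re module and adds a looser audit pattern with a Luhn checksum to cut down on false positives. The looser pattern, the function names, and the sample log lines are illustrative assumptions, not part of the pipeline configuration.

```python
import re

# Same brand patterns as the collector rule: unformatted 13-16 digit PANs.
PAN_RE = re.compile(
    r"\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}"
    r"|3[47][0-9]{13}|6(?:011|5[0-9]{2})[0-9]{12})\b"
)
# Hypothetical audit pattern: 13-19 digits with optional space/hyphen separators.
LOOSE_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact(line: str) -> str:
    """Apply the primary redaction rule."""
    return PAN_RE.sub("***REDACTED-PAN***", line)

def audit_hits(line: str) -> list:
    """Return digit runs that look like PANs and pass the Luhn check."""
    hits = []
    for match in LOOSE_RE.finditer(line):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(match.group())
    return hits

print(redact("charge failed for 4111111111111111"))
print(redact("charge failed for 4111 1111 1111 1111"))      # slips through
print(audit_hits("charge failed for 4111 1111 1111 1111"))  # caught by the audit
```

The second line comes back unredacted, which is exactly the gap the secondary audit is meant to close. The Luhn check keeps ordinary long numbers such as order IDs from flooding the audit with false alarms.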

Kafka Topic Design

At 200 GB/day (~2.3 MB/s average, with spikes to 20 MB/s), a single Kafka topic would work but creates operational coupling. The production pattern for a fintech estate is three topics, split by criticality: logs.critical (error/fatal, 48-hour Kafka retention, 12 partitions), logs.standard (info/warn, 24-hour retention, 24 partitions), and logs.debug (debug, disabled in prod by default, 6-hour retention, 6 partitions). Logstash runs a separate consumer group per topic, so a backlog or replay on logs.debug cannot delay indexing of logs.critical or push debug noise through the expensive indexing path.
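
The throughput figure and the level-to-topic mapping can be checked with a few lines of Python. This is a minimal sketch: the topic table copies the values from the paragraph above, while the function names and the fallback for unknown levels are assumptions.

```python
SECONDS_PER_DAY = 86_400

def avg_rate_mb_per_s(gb_per_day: float) -> float:
    """Average ingest rate in MB/s, using decimal units (1 GB = 1000 MB)."""
    return gb_per_day * 1000 / SECONDS_PER_DAY

# Topic layout from the text: name -> (levels, retention hours, partitions)
TOPICS = {
    "logs.critical": ({"error", "fatal"}, 48, 12),
    "logs.standard": ({"info", "warn"}, 24, 24),
    "logs.debug": ({"debug"}, 6, 6),
}

def topic_for(level: str) -> str:
    """Route a log level to its topic; unknown levels go to logs.standard (assumption)."""
    level = level.lower()
    for topic, (levels, _retention_hours, _partitions) in TOPICS.items():
        if level in levels:
            return topic
    return "logs.standard"

print(round(avg_rate_mb_per_s(200), 1))
print(topic_for("ERROR"))
```

In the real pipeline this routing decision would live in the collector or exporter configuration; the function only illustrates the mapping.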

Elasticsearch Index Strategy

Use ILM (Index Lifecycle Management) with three data phases plus a delete phase. The hot phase runs on SSD-backed nodes, holds 7 days of data, and uses one primary shard per 30 GB of daily volume (so at 200 GB/day: ~7 primary shards per daily index). The warm phase moves indices to general-purpose storage after 7 days and force-merges segments to 1 per shard (a read-only optimization). The cold phase moves to S3-backed searchable snapshots after 30 days, keeping data queryable at object-storage prices; searchable snapshots require an Enterprise-tier license, which has to be counted against the budget. At day 90, ILM deletes the index. Because the policy uses rollover, each phase's min_age is measured from the rollover time, not from index creation.

## ILM policy — logs-policy.json
PUT _ilm/policy/fintech-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_primary_shard_size": "30gb", "max_age": "1d" },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "s3-logs-repo" },
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
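
The shard count and the volume held in each tier follow directly from the phase boundaries. The sketch below derives them, assuming index size roughly equals raw ingest and counting primaries only; real figures depend on mappings, compression, and replica count.

```python
import math

DAILY_GB = 200
TARGET_SHARD_GB = 30
HOT_END, COLD_START, DELETE_AT = 7, 30, 90  # phase boundaries in days

# One primary shard per 30 GB of daily volume, rounded up.
primary_shards = math.ceil(DAILY_GB / TARGET_SHARD_GB)

# Number of daily indices resident in each tier at steady state.
tier_days = {
    "hot": HOT_END,
    "warm": COLD_START - HOT_END,
    "cold": DELETE_AT - COLD_START,
}
tier_gb = {tier: days * DAILY_GB for tier, days in tier_days.items()}

print(f"primary shards per daily index: {primary_shards}")
for tier, gb in tier_gb.items():
    print(f"{tier}: {tier_days[tier]} days, about {gb / 1000:.1f} TB of primaries")
```

Two thirds of the retained data sits in the cold tier, so the price of snapshot storage dominates the long-term cost.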

Alerting on Error Rate Spikes

Kibana Alerting (or Elasticsearch Watcher in older stacks) drives the operational alerting layer. The canonical alert for a fintech estate is a per-service error rate threshold: if a service emits more than 50 level:error events in any 5-minute window, page the on-call engineer. A secondary alert fires when the ratio of errors to total events crosses 5% — this catches degradation that is still within absolute thresholds but signals a systemic problem.
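
The two conditions are simple enough to state as code. The sketch below evaluates one service's counts for a single 5-minute window; the function and alert names are hypothetical, and in production the counts would come from an Elasticsearch aggregation rather than function arguments.

```python
ABSOLUTE_THRESHOLD = 50   # level:error events per 5-minute window
RATIO_THRESHOLD = 0.05    # errors as a fraction of all events

def evaluate_window(error_count: int, total_count: int) -> list:
    """Return the names of the alerts that fire for one service in one window."""
    fired = []
    if error_count > ABSOLUTE_THRESHOLD:
        fired.append("absolute")
    if total_count > 0 and error_count / total_count > RATIO_THRESHOLD:
        fired.append("ratio")
    return fired

print(evaluate_window(60, 10_000))  # busy service: many errors, low ratio
print(evaluate_window(10, 100))     # quiet service: few errors, high ratio
```

The second call shows why the ratio rule exists: ten errors is far below the absolute threshold, yet a 10% error rate on a low-traffic service is a real degradation.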

Pair these Elasticsearch alerts with a Grafana dashboard backed by Loki for infrastructure logs: node filesystem saturation, Fluent Bit buffer overflow events, and Kafka consumer lag on logs.critical. Kafka consumer lag is the most important pipeline health signal — if Logstash falls behind, logs will still arrive eventually (Kafka holds them), but incident investigators will not see recent events.

Deployment Checklist

When this design goes through a production readiness review, the checklist covers: Fluent Bit storage.type filesystem enabled on every node; OTel Collector deployed as a Deployment (not DaemonSet) with HPA based on collector queue length; Kafka replication factor 3 with min.insync.replicas=2; Elasticsearch cross-cluster replication (CCR) from us-east-1 to eu-west-1 for the most recent 24 hours of hot-tier data (disaster recovery); Kibana Spaces combined with roles that carry index and document-level privileges, so that each team sees only its own namespace's logs (RBAC); quarterly PCI audit query scheduled as a Watcher; and runbook links embedded in every Kibana alert.

Pro practice: Store all Kibana dashboards, index templates, ILM policies, and Watcher definitions as files in a git repository (JSON for Elasticsearch resources, exported NDJSON for Kibana saved objects) and apply them via the Elasticsearch and Kibana APIs during CI/CD. This gives you version history for your observability configuration: when an alert starts misfiring, you can git blame who changed the threshold and why. Grafana has the same capability via its provisioning directory.

You have now designed a production-grade, PCI-compliant, multi-region centralized logging platform: structured emission, enriched collection, durable transport, cost-optimized tiered storage, compliance-by-default redaction at the pipeline layer, and actionable alerting. This is the architecture a senior SRE at a major bank or payments company would be proud to put their name on.