Logging at Scale: ELK & Loki

Kubernetes Logging Patterns

18 min Lesson 7 of 28

Kubernetes has no built-in cluster-level mechanism for persisting or forwarding pod logs; the kubelet keeps a few rotated files on each node and nothing more. The platform intentionally leaves logging as an operator concern, which means every team must make deliberate architectural choices about how logs are collected, what format they are emitted in, and how multi-line events are reconstructed before they reach the storage backend. This lesson covers three production-grade patterns (node-level agents, stdout discipline, and multi-line handling) that together form the foundation of a Kubernetes logging implementation, from a 10-node startup cluster to multi-thousand-node fleets.

How the Container Runtime Handles Logs

When a container writes to stdout or stderr, the container runtime (containerd or CRI-O in virtually all production clusters today) captures those bytes. The kubelet does not handle the bytes itself; it tells the runtime, through the Container Runtime Interface (CRI), where the log file should live. The runtime writes each line to a log file under /var/log/pods/<namespace>_<pod-name>_<uid>/<container-name>/0.log, in a format called CRI log format:

# CRI log format — one physical line per entry
# <RFC3339Nano timestamp> <stream> <flags> <log body>

2025-07-14T09:23:01.482831200Z stdout F {"level":"info","ts":1752486181.4,"msg":"request processed","path":"/api/v1/orders","duration_ms":12}
2025-07-14T09:23:01.483100000Z stdout F {"level":"error","ts":1752486181.4,"msg":"db connection failed","error":"context deadline exceeded"}

# 'F' = full line (complete log entry)
# 'P' = partial line (fragment of one long line split by the runtime; more bytes follow)

This CRI-wrapped file is what log shippers actually tail. /var/log/containers/ contains symlinks pointing to these files, named <pod>_<namespace>_<container>-<container-id>.log — the symlink naming convention is what most shipper configurations reference. Understanding this indirection is critical: when you configure your DaemonSet to watch /var/log/containers/*.log, you are following symlinks to the real CRI log files, and your shipper must know how to strip the CRI wrapper before parsing the log body.
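To make the wrapper concrete, here is a minimal sketch of what "stripping the CRI wrapper" amounts to. The names CriEntry and parse_cri_line are made up for this illustration; real shippers implement the same split internally.

```python
from dataclasses import dataclass


@dataclass
class CriEntry:
    timestamp: str   # RFC 3339 timestamp, kept as the original string
    stream: str      # "stdout" or "stderr"
    partial: bool    # True for the 'P' flag, False for 'F'
    body: str        # the application's own output


def parse_cri_line(line: str) -> CriEntry:
    """Split one physical CRI log line into its four fields."""
    # The body may contain spaces itself, so split at most three times.
    timestamp, stream, flag, body = line.rstrip("\n").split(" ", 3)
    if stream not in ("stdout", "stderr"):
        raise ValueError(f"unexpected stream: {stream!r}")
    if flag not in ("F", "P"):
        raise ValueError(f"unexpected flag: {flag!r}")
    return CriEntry(timestamp, stream, flag == "P", body)
```

Only after this split does it make sense to hand the body to a JSON parser; feeding the whole physical line to a JSON parser fails on the timestamp prefix.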

Node-Level Agent Pattern (DaemonSet)

The standard Kubernetes logging architecture places a lightweight log-shipping agent on every node as a DaemonSet. The DaemonSet controller keeps one agent pod on each eligible node: new nodes automatically get an agent, and the pod is cleaned up when its node is removed from the cluster. (A plain kubectl drain leaves DaemonSet pods in place, so the agent keeps shipping while workloads are evicted.) This pattern is preferred over per-pod sidecars at scale because a single agent can multiplex the logs of dozens of pods running on the same node, amortizing CPU and memory costs.

One Fluent Bit DaemonSet pod per node tails all container logs from hostPath mounts and forwards them to a centralized backend.

The DaemonSet agent accesses log files via hostPath volumes. The required mounts are:

/var/log — CRI log files and pod log symlinks
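/var/lib/docker/containers (legacy Docker runtime only, where the entries under /var/log/pods are symlinks into this directory; containerd and CRI-O write the log files directly under /var/log/pods and need no extra mount)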
/run/log/journal — systemd journal for node-level daemon logs (kubelet, containerd itself)
/var/lib/fluent-bit — the agent state directory (offset registry); must be a hostPath so it survives agent pod restarts

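The state hostPath is non-negotiable. Fluent Bit's tail input records file offsets in the SQLite file named by its DB option. If that file lives on an emptyDir, it survives a container restart inside the same pod (an OOM kill, for example) but is deleted whenever the pod itself is replaced: a DaemonSet rollout, an eviction, or a manual delete. The new pod then starts with no offsets and, depending on Read_from_Head, either replays every log file on the node from the beginning (flooding your storage backend with duplicates and potentially triggering index capacity alerts) or skips whatever was written while it was down. Mount /var/lib/fluent-bit as a hostPath and point DB at a file inside it so the offsets persist across pod replacements.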

# Fluent Bit DaemonSet — hostPath mounts (production YAML fragment)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit                # must match spec.selector.matchLabels
    spec:
      serviceAccountName: fluent-bit   # needs get/list/watch pods
      tolerations:
        - operator: Exists             # run on ALL nodes incl. control-plane
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.1.9
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
            limits:
              cpu: 500m
              memory: 128Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluentbit-state
              mountPath: /var/lib/fluent-bit   # offset registry — hostPath!
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluentbit-state
          hostPath:
            path: /var/lib/fluent-bit
            type: DirectoryOrCreate

Stdout Discipline: Why It Matters and How to Enforce It

The entire node-level agent pattern depends on a fundamental contract: all application logs must go to stdout/stderr, never to files inside the container filesystem. This contract exists because the container filesystem is ephemeral: when a container is restarted, or its pod is deleted or rescheduled, the writable layer disappears and takes any file-based logs with it. The kubelet-managed log files under /var/log/pods/ survive container restarts (up to a configurable rotation limit) precisely because the container runtime writes them on the node, outside the container.

In practice, stdout discipline means three things:

Log to stdout/stderr only. No file appenders, no logging.FileHandler, no /app/logs/*.log. Configure your frameworks accordingly: in Spring Boot, leave logging.file.name and logging.file.path unset (console output is the default); in Python, attach a logging.StreamHandler(sys.stdout); in Go, build the logger with slog.NewJSONHandler(os.Stdout, nil).
Emit structured JSON on a single line per event. One log entry = one line. The CRI and all shippers treat newlines as event boundaries. A multi-line JSON dump to stdout breaks this contract and requires expensive reassembly (discussed below).
Never log to both stdout and a file. Dual logging creates duplicate events in your backend and inflates costs. Worse, ops teams learn to check one or the other, not both — critical context ends up in the wrong place during an incident.

Enforce stdout discipline at the platform level. Add a CI gate (OPA/Conftest policy or a custom linter) that rejects Dockerfile layers that create /app/logs directories or install log rotation daemons. Pair this with an admission policy (OPA Gatekeeper or Kyverno, for example) that denies emptyDir volumes named logs; the built-in Pod Security admission controller only enforces the Pod Security Standards and cannot express a rule like this. These controls work best when the platform team owns them rather than leaving them to individual application developers.
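As a rough illustration of the "custom linter" option, the sketch below scans Dockerfile text for the two smells named above. The rule list and the function name lint_dockerfile are assumptions made for this example, not an established tool, and the line-by-line scan does not follow backslash continuations.

```python
import re

# Hypothetical rule set: patterns that suggest file-based logging in an image.
FORBIDDEN = [
    (re.compile(r"mkdir\s+(-p\s+)?/app/logs\b"), "creates /app/logs"),
    (re.compile(r"\b(apt-get|apk|yum|dnf)\b.*\blogrotate\b"), "installs logrotate"),
]


def lint_dockerfile(text: str) -> list[str]:
    """Return one finding per offending line; an empty list means the file passes."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, reason in FORBIDDEN:
            if pattern.search(line):
                findings.append(f"line {lineno}: {reason}")
    return findings
```

A CI job could call this on every changed Dockerfile and fail the build when the returned list is non-empty.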

The kubelet enforces log rotation on the CRI-managed files: by default, a log is rotated at 10 MiB and at most 5 files are kept per container (containerLogMaxSize and containerLogMaxFiles in the kubelet configuration, or the older --container-log-max-size and --container-log-max-files flags). Your node-level agent must be configured to follow rotated files (inode tracking, not filename tracking) or you will miss the tail of each rotation. Fluent Bit does this correctly by default via its inotify-based tail implementation.
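The difference between inode tracking and filename tracking is easy to see in a small simulation. This is only a sketch of the idea, not how Fluent Bit is implemented; read_new_bytes and demo_rotation are illustrative names, and the rotated file name is a stand-in for whatever name the kubelet actually assigns.

```python
import os
import tempfile


def read_new_bytes(path: str, offsets: dict[int, int]) -> bytes:
    """Read whatever was appended to `path` since the last call.

    Offsets are keyed by inode, so a renamed (rotated) file keeps its
    position and a fresh file created under the old name starts at zero.
    """
    inode = os.stat(path).st_ino
    with open(path, "rb") as f:
        f.seek(offsets.get(inode, 0))
        data = f.read()
        offsets[inode] = f.tell()
    return data


def demo_rotation() -> list[bytes]:
    """Simulate one rotation and return what each read saw."""
    offsets: dict[int, int] = {}
    seen = []
    with tempfile.TemporaryDirectory() as d:
        current = os.path.join(d, "0.log")
        rotated = os.path.join(d, "0.log.1")   # hypothetical rotated name
        with open(current, "wb") as f:
            f.write(b"a\n")
        seen.append(read_new_bytes(current, offsets))
        with open(current, "ab") as f:
            f.write(b"b\n")                    # written just before rotation
        os.rename(current, rotated)            # rotation: same inode, new name
        with open(current, "wb") as f:
            f.write(b"c\n")                    # new file, new inode
        seen.append(read_new_bytes(rotated, offsets))
        seen.append(read_new_bytes(current, offsets))
    return seen
```

Had the offsets been keyed by file name, the read of the new 0.log would have started at the old file's offset, and the line written just before rotation would never have been read.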

Multi-Line Log Handling

Multi-line logs are one of the most common sources of silent data corruption in Kubernetes logging pipelines. A Java stack trace, a Python traceback, a Go panic dump, or a pretty-printed JSON blob all span multiple lines of stdout. The container runtime writes each of those lines as a separate, complete log entry (flag F); the P (partial) flag appears only when a single line is too long and the runtime has to split it. If your shipper does not reassemble these entries into a single logical event, your backend receives dozens of disconnected one-liners instead of one coherent stack trace, and alerting rules that search for Exception in the log body find the first line but not the context on lines 2–20.

There are two distinct reassembly problems and they must both be solved:

CRI partial-line reassembly (P/F flags). The runtime splits very long lines (longer than 16 KB by default in containerd) across multiple log file entries, marking every fragment except the last with the P flag. Fluent Bit's cri multiline parser handles this automatically. Promtail handles it via its cri pipeline stage. This layer is about raw byte reassembly, not semantic understanding.
Application-level multi-line reassembly (stack traces, panics). The CRI correctly wrote each line as a separate entry (each with F), but they represent a single logical error event. Your shipper must detect the pattern — typically "starts with a timestamp = new event; a line that does not start with a timestamp = continuation" — and merge them before forwarding.
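The two layers can be sketched in a few lines each. This is a simplified model under stated assumptions: entries arrive as (flag, body) pairs from a single stream, a new event is recognised by a leading ISO timestamp, and there is no flush timeout. The function names are illustrative.

```python
import re


def join_partials(entries):
    """Layer 1: merge CRI 'P' fragments into the 'F' entry that ends them.

    `entries` is an iterable of (flag, body) pairs in file order.
    """
    buffer = []
    for flag, body in entries:
        buffer.append(body)
        if flag == "F":
            yield "".join(buffer)
            buffer = []
    if buffer:  # trailing fragment with no closing 'F'
        yield "".join(buffer)


# Assumed convention: a new event starts with an ISO-8601 date and time.
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def group_events(lines):
    """Layer 2: attach lines without a leading timestamp to the previous event."""
    current = []
    for line in lines:
        if EVENT_START.match(line) and current:
            yield "\n".join(current)
            current = []
        current.append(line)
    if current:
        yield "\n".join(current)
```

Chaining them as group_events(join_partials(entries)) mirrors the required order: byte-level reassembly first, then semantic grouping on the restored lines.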

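# Fluent Bit: application-level multi-line for Java stack traces
# /etc/fluent-bit/parsers.conf

[MULTILINE_PARSER]
    name          java_multiline
    type          regex
    # Emit an incomplete group after 2 s of silence.
    flush_timeout 2000

    # A new event starts with an ISO timestamp (2025-07-14T09:23:01...)
    rule "start_state"    "/(^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})/"    "java_st"
    # Continuation: indented 'at ...' and '... N more' lines, or an unindented 'Caused by:'
    rule "java_st"        "/^(\s+at |\s+\.\.\. \d+ more|Caused by:)/"  "java_st"

# /etc/fluent-bit/fluent-bit.conf  (INPUT and FILTER sections)
# Comments are kept on their own lines; classic-mode parsing reads the rest of a line as the value.
[INPUT]
    Name              tail
    Path              /var/log/containers/order-service*.log
    # Layer 1: CRI reassembly. Parsers listed here are tried as alternatives, not chained.
    multiline.parser  cri
    # kube.* expands to the file path, which the kubernetes filter needs to identify the pod.
    Tag               kube.*
    # Offset database on the hostPath-backed state directory (file name is illustrative).
    DB                /var/lib/fluent-bit/tail-containers.db
    Mem_Buf_Limit     32MB
    # Requires storage.path to be set in the [SERVICE] section.
    storage.type      filesystem

# Layer 2: application-level reassembly on the 'log' field produced by the cri parser.
[FILTER]
    Name                  multiline
    Match                 kube.*
    multiline.key_content log
    multiline.parser      java_multiline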

The flush_timeout parameter is critical in production: if the application crashes mid-stack-trace, Fluent Bit will wait for this duration before emitting the incomplete group rather than holding it indefinitely. Set it to 2–5 seconds. Too short and you split events during GC pauses; too long and you delay alert firing during an outage.

The best multi-line solution is to eliminate the problem entirely. Configure your application logging framework to emit stack traces as a single-line JSON string (exception.stack_trace field). Log4j2 JSON layout, Logback's logstash-logback-encoder, Python's python-json-logger, and Go's log/slog with a JSON handler all do this natively. When the stack trace is a JSON string value rather than literal newlines, the CRI writes exactly one log file entry per event and multi-line reassembly becomes unnecessary. This is the approach used by Netflix, Uber, and Shopify.
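For a Python service, the same effect can be had with the standard library alone. The sketch below is one possible formatter, not the python-json-logger implementation; the class name, the logger name "orders", and the exact field set are choices made for this example.

```python
import json
import logging
import sys


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object on one physical line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "level": record.levelname.lower(),
            "ts": record.created,
            "msg": record.getMessage(),
        }
        if record.exc_info:
            # formatException returns the traceback with literal newlines;
            # json.dumps escapes them as \n inside the string value.
            event["exception.stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(event)


def build_logger(stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout and nowhere else."""
    logger = logging.getLogger("orders")       # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)    # no file handler
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling build_logger().exception("db connection failed") inside an except block then produces a single line whose exception.stack_trace value holds the whole traceback.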

Kubernetes Metadata Enrichment

Raw CRI logs carry only a timestamp, the stream name, and the log body: no pod name, no namespace, no deployment, no container image tag. The shipper must join this metadata at collection time. Fluent Bit's built-in kubernetes filter and Promtail's kubernetes_sd_configs both obtain it from the Kubernetes API server; Fluent Bit can instead query the local kubelet's pod endpoint (https://<NODE_IP>:10250/pods) when Use_Kubelet is enabled. Either way, standard labels are attached to every record:

# Fluent Bit: kubernetes filter adds pod metadata to every log record
# Comments are kept on their own lines; classic-mode parsing reads the rest of a line as the value.
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    # Parse JSON logs into top-level fields, then discard the raw 'log' field.
    Merge_Log           On
    Keep_Log            Off
    # Attach pod labels (app, version, team); skip annotations, usually high-cardinality noise.
    Labels              On
    Annotations         Off
    # Honour the pod annotations fluentbit.io/parser and fluentbit.io/exclude: "true".
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On

The K8S-Logging.Exclude option is a powerful escape hatch: pods that generate high-volume, low-value logs (health-check aggregators, metrics scrapers) can opt out of collection entirely by setting the annotation fluentbit.io/exclude: "true" under metadata.annotations in their pod template. This is a much cheaper filter than processing and then dropping events downstream.
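Conceptually, enrichment is a join between the identity encoded in the log file name and a cache of pod metadata. The sketch below models that join and the exclude check. The `pods` dictionary shape and the function names are assumptions for illustration; the output keys follow the naming the kubernetes filter uses (pod_name, namespace_name, container_name, labels).

```python
import re
from typing import Optional

# Symlink name: <pod>_<namespace>_<container>-<container-id>.log
NAME_RE = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_"
    r"(?P<container>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)


def parse_log_filename(filename: str) -> dict:
    """Recover pod identity from a /var/log/containers/ file name."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a container log symlink name: {filename!r}")
    return match.groupdict()


def enrich(record: dict, filename: str, pods: dict) -> Optional[dict]:
    """Attach pod metadata to a record, or return None if the pod opted out.

    `pods` stands in for a cache of API-server responses keyed by
    (namespace, pod name); its shape is an assumption of this sketch.
    """
    ident = parse_log_filename(filename)
    pod = pods.get((ident["namespace"], ident["pod"]), {})
    if pod.get("annotations", {}).get("fluentbit.io/exclude") == "true":
        return None
    return {
        **record,
        "kubernetes": {
            "pod_name": ident["pod"],
            "namespace_name": ident["namespace"],
            "container_name": ident["container"],
            "labels": pod.get("labels", {}),
        },
    }
```

Note that a missing cache entry yields a record with empty labels rather than an error, which is the same quiet degradation to watch for in a real pipeline when the metadata lookup fails.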

RBAC for the DaemonSet service account. The Fluent Bit service account needs get, list, and watch on pods and namespaces cluster-wide. In clusters with strict RBAC, the most common DaemonSet failure mode is a silent metadata enrichment failure: Fluent Bit logs kube-filter: API call failed at warn level but continues shipping — logs arrive at the backend with no namespace or pod-name labels, making them unfilterable. Always verify the ClusterRoleBinding is present after deploying the DaemonSet.

When to Use Sidecar Containers Instead

The DaemonSet pattern handles the large majority of Kubernetes logging needs. The exception is an application that cannot be modified to write to stdout: legacy JVM apps that write to rolling files, databases that write their server logs to files on a data volume, or processes that mix application logs with audit logs that must be shipped to a different backend. In these cases, a sidecar log shipper co-located in the same pod can tail the shared volume and forward to the appropriate destination. The tradeoff is resource overhead: each pod now carries its own Fluent Bit or Vector process, increasing per-pod CPU and memory requirements. At scale (thousands of pods), this cost is significant. Prefer fixing the application to write to stdout over deploying sidecar shippers.