Retention, Cost & Compliance
At small scale, logging feels free. At production scale — millions of containers, hundreds of services, thousands of requests per second — logging becomes one of the largest line items in your infrastructure budget. A busy microservices platform can generate 10 TB of raw logs per day. Storing all of it forever in hot Elasticsearch is financial suicide. Keeping none of it invites regulatory fines and makes post-mortems impossible. The discipline of retention engineering is knowing exactly what to keep, for how long, at what fidelity, and at what cost.
Tiered Retention: The Three-Zone Model
Production logging platforms at large tech companies typically implement a three-tier architecture (hot, warm, and cold), modeled after storage tiering. Each tier trades off cost, query speed, and retention duration differently.
In Elasticsearch, tier movement is automated via Index Lifecycle Management (ILM). In Loki, the rough equivalent is the compactor's retention settings combined with object storage lifecycle policies. The key insight is that you almost never need to query logs older than 30 days at sub-second latency: those queries are for audits and forensics, where waiting 30 seconds is acceptable.
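As a rough sketch of what an ILM policy for this kind of tiering might look like, the Python snippet below builds the JSON body for a hot/warm/cold/delete policy. The age thresholds (7, 30, and 365 days), the rollover limits, and the repository name logs-s3-repo are illustrative assumptions rather than recommended values; rollover, forcemerge, searchable_snapshot, and delete are standard ILM actions.

```python
import json


def build_ilm_policy(snapshot_repo: str = "logs-s3-repo") -> dict:
    """Return an ILM policy body; all thresholds here are hypothetical."""
    return {
        "policy": {
            "phases": {
                # Hot: actively written and queried; roll over daily or at 50 GB.
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": "1d",
                            "max_primary_shard_size": "50gb",
                        }
                    }
                },
                # Warm: read-mostly; merge segments to reduce overhead.
                "warm": {
                    "min_age": "7d",
                    "actions": {"forcemerge": {"max_num_segments": 1}},
                },
                # Cold: mount the index from a snapshot repository.
                "cold": {
                    "min_age": "30d",
                    "actions": {
                        "searchable_snapshot": {
                            "snapshot_repository": snapshot_repo
                        }
                    },
                },
                # Delete: drop the index once the retention period ends.
                "delete": {"min_age": "365d", "actions": {"delete": {}}},
            }
        }
    }


policy = build_ilm_policy()
print(json.dumps(policy, indent=2))
```

The printed JSON is the kind of body you would send to the ILM API (PUT _ilm/policy/<policy-name>). The snapshot repository must already be registered in the cluster.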
The searchable_snapshot action in the cold phase is key: Elasticsearch mounts the index from a snapshot in S3 rather than restoring full primary and replica copies to local disk. You pay S3 prices (~$0.023/GB/mo) instead of EBS prices (~$0.10/GB/mo) for that data, and queries still work; they are just slower. This is how teams cut their Elasticsearch storage bill by 60-80% without losing queryability.
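To make the price gap concrete, here is a back-of-the-envelope calculation using the two per-GB prices quoted above. The 100 TB cold data volume is a hypothetical figure chosen only for illustration, and the sketch ignores request, retrieval, and compute costs.

```python
S3_PRICE_PER_GB_MONTH = 0.023   # approximate price quoted in the text
EBS_PRICE_PER_GB_MONTH = 0.10   # approximate price quoted in the text


def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Storage-only monthly cost in dollars."""
    return size_gb * price_per_gb_month


def savings_fraction(old_price: float, new_price: float) -> float:
    """Fraction of the old cost saved by moving to the new price."""
    return (old_price - new_price) / old_price


cold_data_gb = 100_000  # hypothetical: 100 TB of logs older than 30 days

ebs_cost = monthly_storage_cost(cold_data_gb, EBS_PRICE_PER_GB_MONTH)
s3_cost = monthly_storage_cost(cold_data_gb, S3_PRICE_PER_GB_MONTH)
saving = savings_fraction(EBS_PRICE_PER_GB_MONTH, S3_PRICE_PER_GB_MONTH)

print(f"EBS: ${ebs_cost:,.0f}/mo  S3: ${s3_cost:,.0f}/mo  saving: {saving:.0%}")
```

On these assumptions the per-GB saving is about 77%. The saving on the total bill depends on how much of your data actually sits in the cold tier, which is consistent with the 60-80% range.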
Sampling Noisy Logs
Not every log line has equal value. A healthy payment API that logs every successful transaction at INFO level produces enormous volume with near-zero diagnostic value. Log sampling is the practice of keeping only a statistical fraction of repetitive, low-value log lines while retaining 100% of high-signal logs (errors, warnings, slow requests, security events).
There are two mainstream sampling strategies:
- Head-based sampling: The decision is made at the start of the request — keep 1 in 100 requests unconditionally. Simple to implement in the shipper or SDK, but you may drop the 1-in-a-million request that caused a production bug.
- Tail-based sampling: Buffer the request, wait for the outcome, then decide: always keep errors and slow requests, sample the rest. More CPU- and memory-intensive at the collector, because in-flight requests must be buffered, but far more intelligent. This is the Google and Netflix model.
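Whichever strategy you use, attach the sampling rate to every sampled log line as a label (for example, sampling_rate="0.01"). When you query sampled data and want to extrapolate true counts, divide the observed count by the sampling rate: 250 lines observed at a rate of 0.01 represent roughly 25,000 real events. Without this label, your dashboards will silently undercount and nobody will know why.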
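The two strategies and the rate label can be sketched as follows. This is a minimal illustration, not a production sampler: it assumes each record is already a completed request summary (a real tail-based collector has to buffer spans or log lines until the request finishes), and the field names level, status, and duration_ms as well as the 1000 ms slow threshold are hypothetical.

```python
import random
import zlib

SAMPLE_RATE = 0.01   # keep roughly 1 in 100 low-value requests
SLOW_MS = 1000       # hypothetical threshold for a "slow" request


def head_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Head-based: decide up front, using only the trace ID.

    Hashing the ID keeps the decision consistent across services.
    """
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < rate * 10_000


def tail_sample(record: dict, rate: float = SAMPLE_RATE, rng=random.random):
    """Tail-based: decide once the outcome is known.

    Returns the record annotated with its sampling rate, or None to drop it.
    """
    high_signal = (
        record.get("level") in {"ERROR", "WARN"}
        or record.get("status", 200) >= 500
        or record.get("duration_ms", 0) >= SLOW_MS
    )
    if high_signal:
        return {**record, "sampling_rate": "1.0"}  # always kept
    if rng() < rate:
        return {**record, "sampling_rate": str(rate)}
    return None


def estimate_true_count(observed: int, rate: float) -> float:
    """Extrapolate the real event count from a sampled count."""
    return observed / rate
```

Passing rng as a parameter is only a convenience for testing; in a collector you would use the default random source.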
PII Redaction: Protecting User Data in Logs
Logs are a privacy minefield. Engineers frequently log HTTP request bodies, headers, or database query results that contain personally identifiable information (PII) and other sensitive data: email addresses, phone numbers, credit card numbers, session tokens, passwords. Under GDPR, PCI-DSS, and HIPAA, storing this data unredacted in your logging platform can be a compliance violation that results in regulatory fines. The correct approach is to redact at the point of collection, before the data ever reaches your storage layer.
PII redaction should be enforced in the collector pipeline rather than left to the application alone, because you cannot trust every developer to sanitize every log call correctly. Applications should still avoid logging sensitive fields; a defense-in-depth model treats the collector as the last line of defense.
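A collector-side redaction step might look roughly like the sketch below. The regular expressions, the key names (password, passwd, token, session_id), and the placeholder strings are illustrative assumptions; real pipelines usually need patterns tuned to their own log formats, and pattern matching alone will not catch every form of PII.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13-19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
# key=value secrets; the key names are hypothetical examples.
SECRET_RE = re.compile(
    r"\b(password|passwd|token|session_id)=([^\s&]+)", re.IGNORECASE
)


def _luhn_ok(candidate: str) -> bool:
    """Luhn checksum, used to avoid masking arbitrary long numbers."""
    digits = [int(c) for c in candidate if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def _mask_card(match: re.Match) -> str:
    text = match.group(0)
    return "[REDACTED_CARD]" if _luhn_ok(text) else text


def redact(line: str) -> str:
    """Mask secrets, email addresses, and card numbers in one log line."""
    line = SECRET_RE.sub(r"\1=[REDACTED]", line)
    line = EMAIL_RE.sub("[REDACTED_EMAIL]", line)
    line = CARD_RE.sub(_mask_card, line)
    return line


print(redact("user=bob@example.com card=4111 1111 1111 1111 password=hunter2"))
```

The Luhn check is a design choice to reduce false positives: order IDs and timestamps that merely look like long digit strings are left intact, while valid card numbers are masked.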
Compliance Frameworks and What They Actually Require
Different regulations have different log retention mandates. As a DevOps engineer operating in regulated environments, you need to know these numbers without looking them up:
- PCI-DSS v4.0 (Requirement 10.5.1): Audit logs must be retained for at least 12 months, with the most recent 3 months immediately available for analysis. Requirement 10.3 additionally requires that logs be protected against modification.
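- HIPAA: 6 years is the commonly applied retention period for audit logs of PHI access, derived from the Security Rule's six-year documentation retention requirement. Encryption at rest is an addressable safeguard that you should treat as required in practice. Access logs must record who accessed what and when.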
- SOC 2 Type II: No mandated duration, but auditors typically expect 12 months of logs covering the audit period with tamper evidence to prove log integrity.
- GDPR: No minimum retention requirement. Instead, the storage limitation principle and the right to erasure create an effective maximum: you cannot store logs containing personal data indefinitely. Implement a deletion workflow for your log stores, not just your databases.
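Implementing S3 Object Lock in compliance mode, which provides WORM (Write Once Read Many) storage, or the GCS equivalent (a locked bucket retention policy), helps satisfy the tamper-evidence requirement for PCI and SOC 2. Once a log object is written and locked in compliance mode, not even a root-level cloud administrator can delete or overwrite it before the retention period expires. Governance mode is weaker: users with the right permission can bypass the lock.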
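One way to turn the numbers above into configuration is sketched below: take the strictest minimum across the frameworks that apply to a log stream, then build a default retention rule for the bucket. The framework keys and the conversion of months and years to days are assumptions made for illustration; the dictionary mirrors the shape of the S3 ObjectLockConfiguration structure.

```python
# Minimum retention per framework, in days (from the list above).
MIN_RETENTION_DAYS = {
    "pci_dss": 365,     # 12 months
    "hipaa": 6 * 365,   # 6 years
    "soc2": 365,        # typical auditor expectation, not a mandate
    "gdpr": 0,          # no minimum; GDPR constrains the maximum instead
}


def required_retention_days(frameworks) -> int:
    """Return the strictest minimum among the applicable frameworks."""
    unknown = set(frameworks) - set(MIN_RETENTION_DAYS)
    if unknown:
        raise ValueError(f"unknown frameworks: {sorted(unknown)}")
    return max((MIN_RETENTION_DAYS[f] for f in frameworks), default=0)


def object_lock_configuration(days: int) -> dict:
    """Default-retention rule in compliance mode for a log bucket."""
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": days}},
    }


days = required_retention_days({"pci_dss", "soc2"})
print(object_lock_configuration(days))
```

With boto3, a dictionary of this shape is what you would pass as ObjectLockConfiguration to put_object_lock_configuration on a bucket that has Object Lock enabled. Test with a short retention period first, because compliance-mode locks cannot be shortened or removed once applied.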
Cost Optimization Levers in Practice
When your logging bill is too high, work through these levers in order — ranked by effort-to-impact ratio:
- Raise log level thresholds in production. Switching from DEBUG to INFO in production typically reduces volume by 50-80% with near-zero engineering effort.
- Drop high-volume, low-value log classes at the shipper. Health-check endpoints, Kubernetes liveness probes, and static asset requests are the top offenders; filter them before they reach your backend.
- Compress aggressively. Loki compresses chunks with gzip or Snappy. Elasticsearch's best_compression codec (DEFLATE) saves 20-40% versus the default. Enable it on all warm and cold indices.
- Right-size your hot tier. Instrument how often on-call engineers query logs older than 7 days during incidents. If the answer is rarely, shrink the hot window from 30 days to 7 days.
- Disable dynamic field mapping in Elasticsearch. Every new field you log becomes a mapped field by default. Disable dynamic mapping and explicitly map only the fields you query on; unmapped fields are still stored in _source but do not consume high-cardinality term memory.
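The first two levers and the two Elasticsearch settings can be sketched as follows. The record fields (level, path), the noisy path prefixes, and the mapped field names are hypothetical; index.codec set to best_compression and dynamic set to false are standard Elasticsearch index options.

```python
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}
MIN_LEVEL = "INFO"  # production threshold
# Hypothetical prefixes for health checks, probes, and static assets.
NOISY_PATH_PREFIXES = ("/healthz", "/livez", "/readyz", "/static/")


def should_ship(record: dict) -> bool:
    """Decide at the shipper whether a record is worth forwarding."""
    level = LEVELS.get(str(record.get("level", "INFO")).upper(), LEVELS["INFO"])
    if level < LEVELS[MIN_LEVEL]:
        return False  # below the production log level
    path = record.get("path", "")
    # Drop routine probe and asset traffic, but keep warnings and errors.
    if path.startswith(NOISY_PATH_PREFIXES) and level < LEVELS["WARN"]:
        return False
    return True


# Index template body: heavier compression, explicit mappings only.
INDEX_TEMPLATE_BODY = {
    "settings": {"index": {"codec": "best_compression"}},
    "mappings": {
        "dynamic": False,  # new fields stay in _source but are not indexed
        "properties": {
            "@timestamp": {"type": "date"},
            "level": {"type": "keyword"},
            "service": {"type": "keyword"},
            "message": {"type": "text"},
        },
    },
}
```

Keeping WARN and ERROR records from health endpoints is deliberate: a failing probe is exactly the kind of signal you want during an incident.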
Retention engineering is not a one-time setup task. Build a monthly review cadence: check your top 10 highest-volume log sources, verify redaction coverage, and reconcile actual retention costs against your targets. Logging platforms drift toward expensive over time as teams add new services and nobody removes old noisy sources.
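For the monthly review, a small helper like the one below can rank sources by volume. It assumes you can export (source, size_in_bytes) pairs from your shipper or backend metrics; the source names in the example are made up.

```python
from collections import Counter


def top_sources(records, n: int = 10):
    """Sum bytes per source and return the n largest as (source, bytes)."""
    volume = Counter()
    for source, size_bytes in records:
        volume[source] += size_bytes
    return volume.most_common(n)


sample = [("checkout", 500), ("auth", 200), ("checkout", 300), ("search", 100)]
for source, total in top_sources(sample, n=2):
    print(f"{source}: {total} bytes")
```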