Observability Stack
When a production incident fires at 03:00, the difference between a 5-minute resolution and a 2-hour outage is almost always the quality of your observability stack. Metrics tell you something is wrong; traces tell you where; logs tell you why. At big-tech scale these three signals must be architected as a unified platform — not bolted-on tools — with retention policies, cardinality budgets, and SLO-driven alerting wired together before a single service ships to production.
The Three Pillars at Scale
Every observability system is constrained by the same three axes: ingestion throughput, query latency, and retention cost. The architectural choices below are driven by those constraints, not by vendor preferences.
- Metrics: Prometheus + Thanos (or Mimir). A single Prometheus instance fails at roughly 1 M active time series on commodity hardware. Beyond that you need either Thanos (sidecar model, stores to object storage, global query layer) or Grafana Mimir (microservices, horizontally scalable). Google-scale deployments use Monarch; at mid-big-tech (10k–50k pods) Mimir with S3 backend and 13-month retention is the current canonical choice. Scrape interval 15 s for infrastructure metrics, 30 s for application metrics; never go lower than 10 s, because a shorter interval multiplies sample volume and storage cost (the series count stays the same) without real signal improvement.
- Logs: OpenTelemetry Collector → Loki (or OpenSearch). Structured JSON logs only. Every log line emits trace_id, service.name, env, and severity. At >500 GB/day, Loki's chunk store on S3 with a 30-day hot tier and 1-year cold tier costs roughly 70 % less than an Elasticsearch cluster of equivalent query performance. Log sampling is legitimate at >10k req/s per service; sample DEBUG at 1 %, INFO at 10 %, WARN/ERROR at 100 %, and always propagate trace context so sampled logs stay correlated.
- Traces: OpenTelemetry SDK → Tempo (or Jaeger). Tail-based sampling is mandatory at scale. Head-based sampling (sample at ingress) throws away the traces of slow and errored requests, which are exactly the ones you need. Tempo 2.x + trace-pipeline sampling keeps 100 % of error traces, 100 % of P99+ latency traces, and a configurable tail for normal traffic. Typical ratio: 1 % baseline tail + 100 % error/latency capture.
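The tail-sampling rule in the Traces bullet (keep every error, keep every trace at or above the P99 latency, keep a 1 % baseline of the rest) can be sketched in a few lines of Python. This is an illustration of the decision logic only, not the OpenTelemetry Collector's or Tempo's implementation; the names TraceSummary, keep_trace, and p99_ms are hypothetical, and the P99 threshold is assumed to be supplied by the caller.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Facts that are only known once the whole trace has been collected."""
    trace_id: str
    has_error: bool
    duration_ms: float


def keep_trace(trace: TraceSummary, p99_ms: float, baseline_rate: float = 0.01) -> bool:
    """Tail-sampling decision: all errors, all slow traces, plus a baseline share."""
    if trace.has_error:
        return True
    if trace.duration_ms >= p99_ms:
        return True
    # Hash the trace ID into [0, 1) so every collector replica reaches the
    # same decision for the same trace (assumption: SHA-256 is good enough
    # as a uniform hash for this purpose).
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return bucket < baseline_rate


traces = [
    TraceSummary("a1", has_error=True, duration_ms=12.0),    # kept: error
    TraceSummary("b2", has_error=False, duration_ms=950.0),  # kept: slow
    TraceSummary("c3", has_error=False, duration_ms=20.0),   # kept only if in baseline
]
kept = [t.trace_id for t in traces if keep_trace(t, p99_ms=800.0)]
```

Deriving the baseline decision from the trace ID rather than from a random number keeps the outcome stable across retries and replicas, which is one reasonable design choice among several.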
Architecture: Signal Flow Diagram
SLO Design and the Error Budget
An SLO without an error budget is just a number. The budget is the operational lever: when it is healthy you ship features; when it is burning you freeze the release pipeline and focus engineering on reliability. The alert hierarchy follows the burn-rate model from the Google SRE Workbook (the "Alerting on SLOs" chapter), which you should treat as read-only specification at this point:
- Page (P0) alert: burn rate > 14.4× for 1 minute. At this rate 2 % of the 30-day error budget is consumed every hour and the entire budget is gone in about 50 hours. Wake the on-call immediately.
- Ticket (P1) alert: burn rate > 6× for 5 minutes. Budget gone in 5 days. Fix during business hours today.
- Burn-rate warning: burn rate > 1× for 1 hour. Budget is shrinking; create a task, no pager needed.
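The arithmetic behind these thresholds is simple enough to check by hand: a burn rate of 1× spends the budget exactly over the SLO window, so time to exhaustion is the window divided by the burn rate. A minimal Python sketch, assuming a 30-day window and using hypothetical function names:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9 % SLO
    return error_ratio / budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is spent if the burn rate stays constant."""
    return window_days * 24 / rate


def budget_spent(rate: float, elapsed_hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget consumed after elapsed_hours at this rate."""
    return rate * elapsed_hours / (window_days * 24)


for rate in (14.4, 6.0, 1.0):
    days = hours_to_exhaustion(rate) / 24
    print(f"{rate:>5}x burn -> budget exhausted in {days:.1f} days")
```

For a 99.9 % SLO, an observed error ratio of 1.44 % corresponds to a 14.4× burn, which spends 2 % of the monthly budget per hour.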
High-cardinality labels such as user_id, request_id, and trace_id can explode a 100k-series Prometheus into 50 M series overnight. Enforce label value cardinality budgets with the Mimir cardinality API and reject metrics at the collector level using the filter processor. At Uber, a single high-cardinality metric from an SDK change caused a $250k/month infrastructure overspend before it was caught by a cardinality alarm.
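A cardinality budget boils down to counting distinct values per label and flagging any label that exceeds a limit. The sketch below shows that check in plain Python over an in-memory list of label sets; it does not call the Mimir cardinality API or the collector's filter processor, and the function names and the budget of 100 are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def label_cardinality(series: Iterable[Mapping[str, str]]) -> dict[str, int]:
    """Count the distinct values seen for each label across a set of series."""
    values: dict[str, set[str]] = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}


def over_budget(series: Iterable[Mapping[str, str]], budget: int) -> list[str]:
    """Return the label names whose distinct-value count exceeds the budget."""
    counts = label_cardinality(series)
    return sorted(name for name, n in counts.items() if n > budget)


# One series per user: 'user_id' has 1000 distinct values, the others have one.
sample = [
    {"service": "checkout", "status": "200", "user_id": f"u{i}"}
    for i in range(1000)
]
print(over_budget(sample, budget=100))  # ['user_id']
```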
Production Prometheus + Alertmanager Config
OpenTelemetry Collector: Tail Sampling Config
SLO Architecture Diagram
Alertmanager Routing and Notification Strategy
Raw Prometheus alerts routed directly to Slack or PagerDuty without grouping create alert fatigue within weeks. The correct pattern is: group by SLO and cluster, inhibit lower-severity alerts when a critical fires on the same service, and deduplicate within a 5-minute group window. The Alertmanager config below encodes that pattern for the capstone platform.
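Independently of any particular Alertmanager configuration, the grouping and inhibition logic described above can be illustrated in plain Python. This is a sketch of the behavior only, not Alertmanager's implementation or config syntax; the alert fields (slo, cluster, service, severity) and the severity value "critical" are assumptions, and the 5-minute deduplication window is left out for brevity.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Bundle alerts that share the same (slo, cluster) into one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert["slo"], alert["cluster"])].append(alert)
    return dict(groups)


def inhibit(alerts):
    """Drop lower-severity alerts for any service that has a critical firing."""
    critical_services = {a["service"] for a in alerts if a["severity"] == "critical"}
    return [
        a for a in alerts
        if a["severity"] == "critical" or a["service"] not in critical_services
    ]


firing = [
    {"slo": "availability", "cluster": "eu-1", "service": "checkout", "severity": "critical"},
    {"slo": "availability", "cluster": "eu-1", "service": "checkout", "severity": "warning"},
    {"slo": "latency", "cluster": "eu-1", "service": "search", "severity": "warning"},
]
notifications = group_alerts(inhibit(firing))
```

With this input the checkout warning is suppressed by the checkout critical, while the search warning still produces its own notification group.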
Every alert must carry a runbook_url annotation pointing to a live, maintained runbook before it can fire in production. Alerts without runbooks are disabled. This is the single highest-leverage reliability practice: the engineer who gets paged at 03:00 needs the first three diagnostic commands, the expected failure modes, and the rollback procedure, not the source code. Codify this as a CI check in your alerting rule repository.
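A CI check of this kind can be very small. The sketch below assumes the rule file has already been parsed into a Python dict with the usual Prometheus layout (groups, each with a list of rules, each alerting rule with an optional annotations map); the function name is hypothetical, and verifying that the URL actually resolves to a maintained page is out of scope here.

```python
def alerts_missing_runbook(rule_file: dict) -> list[str]:
    """Return names of alerting rules that lack a non-empty runbook_url annotation."""
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:  # recording rules use "record"; skip them
                continue
            url = (rule.get("annotations") or {}).get("runbook_url", "")
            if not str(url).strip():
                missing.append(rule["alert"])
    return missing


# Hypothetical rule file content, as it would look after YAML parsing.
rules = {
    "groups": [
        {
            "name": "slo-burn",
            "rules": [
                {"alert": "FastBurn", "expr": "vector(1)",
                 "annotations": {"runbook_url": "https://runbooks.example/fast-burn"}},
                {"alert": "SlowBurn", "expr": "vector(1)", "annotations": {}},
                {"record": "job:errors:rate5m", "expr": "vector(0)"},
            ],
        }
    ]
}
print(alerts_missing_runbook(rules))  # ['SlowBurn']
```

In a pipeline, a non-empty result would fail the build so that the rule cannot be merged without its runbook.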
Retention, Cost, and Operational Hygiene
Observability infrastructure typically runs at 8–15 % of total cloud spend for companies that instrument thoroughly. Keeping that figure sustainable requires deliberate cost engineering:
- Metrics: 13-month retention in Mimir (covers year-over-year capacity comparisons). Raw resolution for 7 days; 5-minute downsampling for 30 days; 1-hour downsampling beyond that. Downsampling reduces storage 40×.
- Logs: 30-day hot tier in Loki (frequent access); 1-year cold tier in S3 Glacier with a 24h restore SLA. Enforce log_retention_days per namespace via Loki ruler policies; debug logs from a batch job should not cost the same as payment service error logs.
- Traces: Tempo with S3 backend, 14-day retention for full traces. Error traces: 90 days. Slow traces: 30 days. Normal baseline: 7 days. This asymmetry reflects how investigations actually work: nobody needs a normal 200ms trace from 6 weeks ago.
- Dashboards: Standardize on RED dashboards (Rate, Error, Duration) for every service. The first Grafana view any on-call engineer opens should answer: is this service healthy right now? Sprawling 40-panel dashboards with no clear hierarchy slow incident response.
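The asymmetric trace retention from the list above (errors 90 days, slow traces 30 days, normal baseline 7 days) can be expressed as a small policy function. This is a sketch, not Tempo configuration; the function name is hypothetical, and giving errors precedence over slowness when a trace is both is an assumption.

```python
def trace_retention_days(has_error: bool, is_slow: bool) -> int:
    """Retention in days for a trace, by class; errors take precedence over slow."""
    if has_error:
        return 90
    if is_slow:
        return 30
    return 7


def is_expired(age_days: float, has_error: bool, is_slow: bool) -> bool:
    """True once a trace is older than the retention for its class."""
    return age_days > trace_retention_days(has_error, is_slow)


# A normal trace from 6 weeks ago is gone; an error trace of the same age is kept.
print(is_expired(42, has_error=False, is_slow=False))  # True
print(is_expired(42, has_error=True, is_slow=False))   # False
```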
By the time this observability stack is fully operational, every service in the capstone platform is instrumented, every SLO has a burn-rate alert wired to PagerDuty, and the platform team can answer the three incident questions — what broke, where it broke, and why — within a 5-minute MTTD target.