Observability Foundations

Monitoring vs Observability

18 min Lesson 1 of 28

You have spent earlier tutorials building systems that run: containerized services on Kubernetes, infrastructure provisioned by Terraform, pipelines that promote code from commit to production. Now comes the question that every senior engineer eventually confronts: how do you actually know what those systems are doing — not just right now, but when something goes wrong at 2 AM in a way you have never seen before?

The distinction between monitoring and observability drives everything in this tutorial. The two ideas are related but fundamentally different, and conflating them is one of the most expensive mistakes you can make in a large-scale system.

Monitoring: Known-Unknowns

Monitoring is the practice of collecting and alerting on a predefined set of signals that you already believe to be important. You decide in advance: "I care about CPU utilization, HTTP 5xx rate, queue depth, and p99 latency." You set thresholds. When a threshold is crossed, you get paged.

Monitoring answers questions you already know to ask. The mental model is a checklist: is CPU okay? Is error rate okay? Is disk okay? If every item on the checklist passes, you declare the system healthy. If one item fails, you know which dashboard to open.
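
The checklist model can be sketched in a few lines of Python. This is only an illustration: the metric names, thresholds, and snapshot values below are assumptions chosen to mirror the examples in the text, not values from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """One predefined assertion about a single signal."""
    metric: str
    limit: float  # alert when the observed value exceeds this


# Hypothetical thresholds, decided in advance by whoever built the checklist.
CHECKS = [
    Check("cpu_utilization", 0.80),
    Check("http_5xx_ratio", 0.001),
    Check("p99_latency_ms", 500.0),
    Check("queue_depth", 10_000),
]


def evaluate(snapshot):
    """Return the names of the checks whose threshold is exceeded.

    A metric missing from the snapshot is treated as 0 and therefore passes.
    """
    return [c.metric for c in CHECKS if snapshot.get(c.metric, 0.0) > c.limit]


# Invented readings for one point in time.
snapshot = {
    "cpu_utilization": 0.42,
    "http_5xx_ratio": 0.0005,
    "p99_latency_ms": 340.0,
    "queue_depth": 12_500,
}

failing = evaluate(snapshot)
print(failing or "all checks pass")
```

Anything that is not listed in `CHECKS` is invisible to this loop: the system is declared healthy as long as the anticipated signals stay within bounds.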

Monitoring is excellent at detecting known failure modes — the failures you have seen before, the ones you anticipated when you designed the system. A database connection pool exhausted, a memory leak that manifests as a steady RSS climb, a downstream service timing out and causing latency spikes. These are your known-unknowns: you do not know when they will happen, but you know that they can happen, so you instrumented for them.

Key idea: Monitoring tells you when a system deviates from known-good behavior. It is a detection system for failure modes you anticipated in advance. Google's SRE practice adds a related rule, alerting on symptoms rather than causes: page on user-visible signals such as error rate and latency rather than on internal conditions such as a full disk or high CPU, because symptoms are what actually hurt users.

The Observability Gap: Unknown-Unknowns

Here is the hard truth: in a sufficiently complex distributed system, the failures that really hurt you are the ones you did not anticipate. A rare combination of input parameters that exposes a code path never hit in load testing. A network partition between two specific availability zones that only occurs under a particular traffic pattern. A third-party API that begins returning stale data silently, causing your recommendation engine to degrade without any error rate spike. These are unknown-unknowns — you did not know the failure mode existed, so you did not instrument for it, and your monitoring tells you nothing useful.

Observability is the property of a system that lets you understand its internal state by examining its external outputs — without needing to know in advance what questions you will ask. The term comes from control theory: a system is "observable" if you can determine its internal state from its outputs alone. Applied to software engineering, it means your system emits enough data — metrics, logs, traces — that you can reason backward from an unexpected symptom to its root cause, even for failure modes you have never seen before.

The critical difference: with monitoring you get an alert and then look at predefined dashboards. With observability you get an alert and then ask new questions of your telemetry data to explore what is actually happening. You slice by user ID, by region, by version, by request path — whatever the data leads you to — until you narrow the causal chain.

[Figure: Monitoring vs Observability: Known vs Unknown Unknowns. Monitoring panel (known-unknowns): fixed checks for CPU < 80%, HTTP 5xx < 0.1%, p99 latency < 500ms, and queue depth > 10k (alert); silent data staleness in the recommendation engine triggers no alert and has no dashboard. Observability panel (unknown-unknowns): ad hoc questions such as "What changed for user_id=X?", "Which AZ is slow for this API?", "Trace this request end-to-end", and "Which version introduced this?"; slicing by model_version finds the root cause of the silent staleness in 12 min.]
Monitoring catches anticipated failures via fixed thresholds; observability lets you ask new questions to find failures you never anticipated.

A Concrete Production Example

Consider a checkout service at an e-commerce company. Your monitoring tells you: error rate is 0.05%, p99 latency is 340ms, CPU is at 42%. All green. But revenue has been down 18% for the past six hours. This is a classic observability gap: no monitored metric crossed a threshold, yet the system is deeply broken.

With a fully observable system you can ask: which segment of users is failing? Break it down by country, and Brazil shows a checkout completion rate of 11% versus the usual 89%. Drill into traces for Brazilian requests, and you find that the payment gateway call is timing out after 28 seconds while a try/catch silently swallows the timeout and returns a fake success. The exception is logged, but nobody built a monitor on the error count for that specific payment provider.

You found this in minutes because the data was there and you could query it freely. Without that data, you would have spent hours looking at the green dashboards wondering why revenue was down.
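
The drill-down in this example amounts to grouping raw checkout events by a dimension nobody alerted on. The sketch below assumes each event is a dict with hypothetical `country`, `provider`, and `completed` fields, and uses a tiny invented sample rather than real data.

```python
from collections import defaultdict


def completion_rate_by(events, dimension):
    """Group events by an arbitrary field and return {value: completion rate}."""
    totals = defaultdict(int)
    completed = defaultdict(int)
    for event in events:
        key = event[dimension]
        totals[key] += 1
        if event["completed"]:
            completed[key] += 1
    return {key: completed[key] / totals[key] for key in totals}


# Invented sample data, only for illustration.
events = (
    [{"country": "BR", "provider": "pay_a", "completed": False}] * 8
    + [{"country": "BR", "provider": "pay_a", "completed": True}] * 1
    + [{"country": "US", "provider": "pay_b", "completed": True}] * 9
    + [{"country": "US", "provider": "pay_b", "completed": False}] * 1
)

rates = completion_rate_by(events, "country")
# Print the worst-performing segment first.
for country, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{country}: {rate:.0%} completed")
```

The same function works for any field the events carry (`provider`, app version, and so on), which is the point: the question is chosen at investigation time, not at instrumentation time.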

Pro practice: At Netflix, Uber, and Stripe the guiding question for on-call engineers is not "which alert fired?" but "what changed recently?" Observability tools let you correlate a degradation with a deployment, a config change, or a traffic shift — none of which would trigger a traditional threshold alert. Build your telemetry stack with this exploratory workflow in mind from day one.
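
The "what changed recently?" question can be approximated by listing change events that landed shortly before a degradation began. This is a rough sketch: the change-log structure, the one-hour lookback window, and every entry below are assumptions made up for illustration.

```python
from datetime import datetime, timedelta


def recent_changes(changes, degradation_start, lookback=timedelta(hours=1)):
    """Return change events inside the lookback window before the degradation, newest first."""
    window_start = degradation_start - lookback
    hits = [c for c in changes if window_start <= c["at"] <= degradation_start]
    return sorted(hits, key=lambda c: c["at"], reverse=True)


# Hypothetical change log; in practice this would be fed by deploy and config tooling.
changes = [
    {"at": datetime(2024, 5, 1, 9, 0), "kind": "deploy", "what": "checkout v1.41"},
    {"at": datetime(2024, 5, 1, 13, 40), "kind": "config", "what": "payment timeout raised"},
    {"at": datetime(2024, 5, 1, 13, 55), "kind": "deploy", "what": "checkout v1.42"},
    {"at": datetime(2024, 5, 1, 15, 10), "kind": "deploy", "what": "search v7.3"},
]

suspects = recent_changes(changes, degradation_start=datetime(2024, 5, 1, 14, 0))
for change in suspects:
    print(change["at"].isoformat(), change["kind"], change["what"])
```

A time-window match like this only produces suspects, not a root cause; it narrows where to look next in the traces and logs.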

Why the Distinction Matters Now

In a monolith with ten servers, monitoring was often sufficient. You had a small number of components, well-understood failure modes, and engineers who knew every code path. In a microservices system on Kubernetes with 200 services, thousands of pods, and hundreds of deploys per day, the number of possible failure states is astronomically higher. Monitoring alone cannot keep up.

Two trends have made this non-optional at production scale:

  1. Cardinality explosion: Modern systems emit high-cardinality data such as request IDs, user IDs, trace IDs, feature flags, and experiment variants. Traditional monitoring tools (check-based systems like Nagios, time-series systems like early Prometheus) were not designed to query across these dimensions freely. Purpose-built observability backends are.
  2. The shift-left of production ownership: As "you build it, you run it" becomes the norm, product engineers own their own on-call. They are not domain experts in every failure mode — they need tools that let them investigate freely, not just check dashboards built by someone else months ago.
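
To make the cardinality point concrete, here is a back-of-the-envelope sketch. It assumes a label-based time-series model in which each distinct combination of label values is stored as its own series; the label names and counts are invented.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on distinct time series for one metric: product of label value counts."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request counter.
low_cardinality = {"service": 200, "status_class": 5, "region": 4}
with_user_id = {**low_cardinality, "user_id": 1_000_000}

print(f"without user_id: {max_series(low_cardinality):,} series")
print(f"with user_id:    {max_series(with_user_id):,} series")
```

This is an upper bound, since only combinations that actually occur are stored, but it shows why a single high-cardinality label changes the storage and query problem by orders of magnitude.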

The Three Pillars (Preview)

Observability is typically implemented through three complementary signal types, which the next lesson covers in depth:

  • Metrics — aggregated numeric measurements over time (request rate, error count, latency percentiles). Efficient to store, great for alerting, limited cardinality.
  • Logs — structured event records emitted at runtime. High detail, arbitrary fields, expensive at scale without careful management.
  • Traces — records of a single request's journey across multiple services. Essential for latency attribution and dependency mapping in distributed systems.

None of these alone is sufficient. A mature observability practice uses all three and correlates them — you get an alert from a metric, jump to the related traces, and inspect the log lines from the spans that look anomalous. The tooling that makes this correlation seamless (Honeycomb, Grafana + Tempo + Loki, Datadog APM) is what separates a production observability stack from a collection of dashboards.
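
The metric-to-trace-to-log hop described above depends on a shared identifier. The sketch below assumes every trace and log record carries a `trace_id` field; the data and field names are invented, and real backends perform this join for you.

```python
def slowest_traces(traces, service, limit=1):
    """Pick the slowest traces that touched the alerting service."""
    touched = [t for t in traces if service in t["services"]]
    return sorted(touched, key=lambda t: t["duration_ms"], reverse=True)[:limit]


def logs_for(logs, trace_id):
    """Return the log records that carry the given trace ID."""
    return [record for record in logs if record.get("trace_id") == trace_id]


# Invented telemetry; all field names are assumptions.
alert = {"metric": "p99_latency_ms", "service": "checkout"}
traces = [
    {"trace_id": "a1", "services": ["checkout", "payments"], "duration_ms": 310},
    {"trace_id": "b2", "services": ["checkout", "payments"], "duration_ms": 28_050},
    {"trace_id": "c3", "services": ["search"], "duration_ms": 45},
]
logs = [
    {"trace_id": "a1", "level": "info", "msg": "payment authorized"},
    {"trace_id": "b2", "level": "error", "msg": "payment gateway timeout"},
    {"level": "info", "msg": "cache warmed"},  # no trace ID, so it cannot be correlated
]

# Metric alert -> slowest trace for that service -> log lines for that trace.
for trace in slowest_traces(traces, alert["service"]):
    for record in logs_for(logs, trace["trace_id"]):
        print(trace["trace_id"], record["level"], record["msg"])
```

Note the third log record: without a trace ID it drops out of the correlation entirely, which is why consistent ID propagation matters more than the choice of backend.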

# A quick illustration of the difference in practice.

# Monitoring approach: Prometheus alert rule (known threshold)
# prometheus/alerts.yml
groups:
  - name: checkout-service
    rules:
      - alert: HighErrorRate
        # sum() on both sides so the 5xx rate is divided by the total rate;
        # without it the division matches series label-for-label, including status.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 2 minutes"

# This fires when error rate exceeds 1%. It tells you NOTHING about:
#   - which users are affected
#   - which downstream service is responsible
#   - whether this is a new failure mode or a known one

# Observability approach: structured log + trace ID so you can query.
# Application emits (Go example; log is assumed to be a structured key/value logger):
#   log.Info("checkout_attempt",
#       "trace_id", span.SpanContext().TraceID().String(),
#       "user_id", userID,
#       "country", country,
#       "payment_provider", provider,
#       "duration_ms", elapsed.Milliseconds(),
#       "success", success,
#       "error", errStr,
#   )

# Now in your log backend (Loki / Splunk / Datadog Logs) you can ask (LogQL shown):
#   sum by (country) (rate({app="checkout"} | json | success="false" [10m]))
# → failed checkout attempts per second, per country; in the checkout scenario above,
#   Brazil would stand far above every other country

Production pitfall: Many teams declare themselves "observable" because they have Prometheus metrics and Grafana dashboards. That is monitoring, not observability. The test is simple: when your last unexpected production incident occurred, could you answer "who was affected, which code path failed, and what was different about those requests?" in under 15 minutes using your existing tooling? If not, you have monitoring, not observability.

Shifting the Mental Model

The practical takeaway is a shift in how you think about your systems and your relationship to their failure modes. Monitoring asks: is this component within expected parameters? Observability asks: what is this system actually doing, and why? Monitoring is a set of assertions. Observability is a capability for inquiry.

Building toward the latter requires making deliberate decisions at every layer of your stack: how your applications emit data, how your infrastructure is instrumented, what tooling you use to store and query telemetry, and how your team develops the practice of exploratory debugging against live data. The remaining lessons in this tutorial build each of those layers systematically.
