Instrumenting Applications
Observability does not happen automatically. A running process emits nothing useful until an engineer deliberately adds instrumentation — code that generates metrics, traces, and logs as a first-class concern, not an afterthought. At companies like Google, Meta, and Stripe, instrumentation is treated as a production readiness criterion: a service cannot ship without defined SLI metrics, structured log emitters, and at least a basic trace context propagated through every request path.
This lesson covers three interlocked disciplines: choosing and wiring up the right instrumentation libraries, naming your signals so they stay maintainable at scale, and keeping cardinality under control so your observability stack does not bankrupt you or fall over under load.
Instrumentation Libraries: OpenTelemetry First
The industry has converged on OpenTelemetry (OTel) as the vendor-neutral instrumentation standard. OTel ships SDKs for Go, Java, Python, Node.js, .NET, Ruby, Rust, PHP, and more. It produces all three signal types (metrics, traces, logs) through a single API surface and exports to any backend — Prometheus, Jaeger, Tempo, Datadog, Honeycomb — via the OTel Collector without changing application code.
Before OTel you had to choose: StatsD for metrics, OpenTracing for traces, a bespoke log library. That fragmentation meant every team built its own glue. OTel ends that. The correct default in 2025 is:
- Metrics — OTel SDK or the Prometheus client library (both are fine; Prometheus client is more battle-tested for pure metrics).
- Traces — OTel SDK with OTLP export. Do not use Jaeger's native SDK — it is deprecated.
- Logs — OTel Logs SDK (still maturing) or a structured logger (zerolog, zap, logrus) that emits JSON piped through the OTel Collector.
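Manual instrumentation covers what the libraries cannot infer on their own, such as adding a span attribute like db.tenant_id or recording a business metric like orders.checkout.total_usd.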
A minimal Python service instrumented with OTel looks like this. Install the SDK and the OTLP exporter, then initialise before any framework code runs:
Naming Conventions
A metric name is a contract. Once dashboards, alerts, and runbooks reference http_requests_total, you cannot silently rename it without breaking half your on-call tooling. Big-tech companies enforce naming through linter rules in CI and schema registries. Follow these conventions from day one:
- Prometheus / OTel convention: {namespace}_{subsystem}_{name}_{unit}. Units go at the end in plural snake_case: _seconds, _bytes, _total (for counters). Example: checkout_payment_requests_total, checkout_payment_latency_seconds.
- No abbreviations in the base name. http_req_dur_s will confuse the engineer paged at 3 AM. Write http_server_request_duration_seconds.
- Suffix rules: counters end in _total; histograms and summaries take no type suffix after the unit (Prometheus appends _bucket, _sum, _count automatically); gauges describe the thing they measure (_bytes, _connections, _queue_depth).
- Span names (traces): use verb noun, for example GET /orders/{id}, db.query orders, kafka.publish payment-events. Do not include variable values in the name; put them in span attributes.
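Some of the metric naming rules above can be checked mechanically in CI. The following is a rough sketch of such a check, not an official Prometheus or OTel linter. The function name, the rule set, and the abbreviation denylist are all assumptions made for illustration.

```python
import re

# Assumed rules, distilled from the conventions above.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Hypothetical denylist; a real one would be agreed on by the team.
ABBREVIATIONS = {"req", "resp", "dur", "cnt", "msg", "s"}


def lint_metric_name(name: str, kind: str) -> list[str]:
    """Return a list of naming problems for a metric of the given kind.

    kind is one of "counter", "gauge", "histogram", "summary".
    """
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("name must be lowercase snake_case")
    for token in name.split("_"):
        if token in ABBREVIATIONS:
            problems.append(f"avoid abbreviation '{token}'")
    if kind == "counter" and not name.endswith("_total"):
        problems.append("counters must end in _total")
    if kind != "counter" and name.endswith("_total"):
        problems.append("only counters may end in _total")
    if kind in ("histogram", "summary") and name.endswith(("_bucket", "_sum", "_count")):
        problems.append("do not add _bucket/_sum/_count; Prometheus appends them")
    return problems


print(lint_metric_name("checkout_payment_requests_total", "counter"))  # []
print(lint_metric_name("http_req_dur_s", "histogram"))
```

A check like this only catches structural mistakes. Whether a name is meaningful still needs a human reviewer or a schema registry.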
OTel's semantic conventions define standard attribute names such as http.request.method, db.system, messaging.destination.name, and rpc.method. Using them means your data is compatible with off-the-shelf dashboards and correlations across any OTel-aware backend. Always prefer a semantic convention attribute over an ad-hoc one.
Cardinality Discipline
Cardinality is the number of unique time-series a metric generates. A metric with labels method (5 values) × status_code (10 values) × route (200 values) produces 10,000 series. Add user_id (1,000,000 users) and you have ten billion series — Prometheus explodes, Datadog sends you a six-figure invoice.
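The arithmetic is a plain product of per-label value counts, which a short sketch makes concrete. The helper name series_count is illustrative. The result is a worst-case upper bound, since not every label combination necessarily occurs in practice.

```python
import math


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series: the product of per-label value counts."""
    return math.prod(label_cardinalities.values())


bounded = {"method": 5, "status_code": 10, "route": 200}
print(series_count(bounded))  # 10000

# Adding one unbounded label multiplies the total by its value count.
unbounded = {**bounded, "user_id": 1_000_000}
print(series_count(unbounded))  # 10000000000
```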
High-cardinality values — user IDs, trace IDs, email addresses, free-form strings, UUIDs, IP addresses — must never appear as metric label values. They belong in trace span attributes and structured log fields, where the storage model handles them efficiently.
A common failure mode is adding labels={"customer_id": customer_id} to a counter. Each new customer creates a new series. After a marketing campaign drives signups, Prometheus OOM-crashes and alerting goes dark exactly when you need it most. Add metric label validation to your PR checklist and code review practice.
The correct approach is to bound every label to a small, finite set. For HTTP routes, normalise path parameters: /orders/12345 becomes the label value /orders/{id}. For status codes, group into classes (2xx, 4xx, 5xx) if you have many distinct codes. For anything else, ask: "Will this label value set grow unboundedly?" If yes, move it to traces.
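The two normalisations described above can be sketched as small helper functions. The names and the patterns (numeric segments and UUIDs treated as path parameters) are assumptions for illustration. Where a web framework exposes the matched route template, using that directly is usually more reliable than pattern matching.

```python
import re

# Hypothetical patterns: purely numeric segments and UUIDs count as path parameters.
_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def normalise_route(path: str) -> str:
    """Replace variable path segments with {id} so the label set stays bounded."""
    segments = path.split("?", 1)[0].split("/")  # drop the query string first
    return "/".join(
        "{id}" if _NUMERIC.match(s) or _UUID.match(s) else s for s in segments
    )


def status_class(status_code: int) -> str:
    """Collapse an HTTP status code into its class, e.g. 404 -> 4xx."""
    return f"{status_code // 100}xx"


print(normalise_route("/orders/12345"))  # /orders/{id}
print(status_class(503))                 # 5xx
```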
Instrumentation Placement: The Four Golden Signals
Instrument every service boundary using the four golden signals popularised by Google SRE: latency, traffic, errors, and saturation. At a minimum, every HTTP/gRPC handler, every outbound database or cache call, and every background job should emit all four. Use a middleware or interceptor pattern so instrumentation is automatic and consistent; never rely on individual developers remembering to add it per endpoint.
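The wrapper pattern can be illustrated with a decorator. This sketch deliberately uses plain in-memory dictionaries as a stand-in for a real metrics library, so the registry names (REQUESTS, ERRORS, LATENCIES, IN_FLIGHT) and the example handler are hypothetical. In-flight call count is used here as a simple proxy for saturation, and the code is not thread-safe.

```python
import time
from collections import defaultdict
from functools import wraps

# Stand-in for a real metrics backend: plain dictionaries keyed by handler name.
REQUESTS = defaultdict(int)    # traffic: total calls
ERRORS = defaultdict(int)      # errors: calls that raised
LATENCIES = defaultdict(list)  # latency: observed durations in seconds
IN_FLIGHT = defaultdict(int)   # saturation proxy: calls currently executing


def instrumented(handler_name):
    """Wrap a handler so all four signals are recorded on every call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            REQUESTS[handler_name] += 1
            IN_FLIGHT[handler_name] += 1
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                ERRORS[handler_name] += 1
                raise  # record the failure, then let the caller handle it
            finally:
                LATENCIES[handler_name].append(time.perf_counter() - start)
                IN_FLIGHT[handler_name] -= 1
        return wrapper
    return decorator


@instrumented("get_order")
def get_order(order_id):
    # Hypothetical handler used only to demonstrate the decorator.
    if order_id < 0:
        raise ValueError("invalid order id")
    return {"id": order_id}
```

In a real service the same logic would usually live in framework middleware or a gRPC interceptor and write to a counter, a histogram, and a gauge rather than to dictionaries.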
Testing Your Instrumentation
Instrumentation bugs are silent — the service runs fine but emits no data, or emits wrong data. Add these checks to your local development and CI pipeline:
- Use OTel's in-memory exporter in unit tests to assert that specific spans and metrics are produced by a code path.
- Run a local OTel Collector with a debug exporter to print all received telemetry to stdout during development.
- Add a /metrics or /debug/metrics endpoint and curl it as a smoke test in your CI pipeline after deploying to staging.
- Set up an alert on absent(up{job="my-service"}) in Prometheus, so that if scraping stops, you hear about it immediately.
Monitor the telemetry pipeline itself: track the Collector's otelcol_exporter_send_failed_spans_total and alert on it. Many teams discover telemetry gaps only during an incident, when they need the data most.
Instrumentation is the foundation everything else rests on. Get the library choice, naming, and cardinality boundaries right upfront — retrofitting them across a microservices estate with 200 services is an expensive multi-quarter project. Build the habit now: every new service ships with instrumentation before it ships with a single feature.