Instrumenting with OTel
OpenTelemetry gives you two distinct paths to add tracing to a service: auto-instrumentation, which works with zero code changes, and manual instrumentation, where you write explicit SDK calls to create spans, attach attributes, and record events. In production systems at Google-scale, you almost always use both together — auto for the boilerplate (HTTP, database, gRPC), manual for the business-critical logic that the framework cannot see. This lesson teaches you exactly when to use each, and how to do it right.
Auto-Instrumentation: Zero-Code Observability
Auto-instrumentation works by patching well-known libraries at load time. The OTel agent (JVM, Python, Node.js) or SDK hooks intercept calls to popular frameworks — Django, Flask, Express, Spring Boot, gRPC, psycopg2, redis-py — and automatically create spans with sensible defaults. You get spans for every HTTP request, every DB query, and every outbound call without touching application code.
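Python example: enable auto-instrumentation with zero code change by launching the service through the opentelemetry-instrument wrapper, for example opentelemetry-instrument python app.py (app.py is a placeholder for your own entrypoint; the opentelemetry-distro package and the instrumentation packages for your libraries must already be installed).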
That single command wraps the process. Every Flask route handler, every SQLAlchemy query, every Redis call now emits spans — with zero diff to your application source. The agent injects the W3C traceparent header into all outbound HTTP calls automatically, ensuring traces propagate across service boundaries.
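To make the propagation step concrete, here is a minimal sketch of the W3C traceparent header format (version 00) that the agent writes on outbound calls. The helper names build_traceparent and parse_traceparent are hypothetical and exist only for illustration; in a real service the OTel propagator does this for you.

```python
import re

# W3C Trace Context, version 00: "00-<trace-id>-<parent-id>-<flags>",
# where trace-id is 32 lowercase hex chars, parent-id is 16, flags is 2.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Render a traceparent header value (hypothetical helper)."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(header: str):
    """Return the header's fields as a dict, or None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_hex, span_hex, flags_hex = match.groups()
    trace_id, span_id = int(trace_hex, 16), int(span_hex, 16)
    if trace_id == 0 or span_id == 0:
        return None  # all-zero IDs are invalid in the spec
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags_hex, 16) & 0x01),
    }


header = build_traceparent(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7)
print(header)
```

A downstream service that reads this header starts its own spans under the same trace ID, which is what stitches the two services into one trace.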
What auto-instrumentation cannot tell you is that a checkout span involved a high-value order, or that a cache miss triggered an expensive fallback. That is where manual instrumentation earns its keep.
Manual Instrumentation: Annotating What Matters
Manual instrumentation lets you create spans for arbitrary code blocks, attach structured key-value attributes, and record point-in-time span events. Think of a span as a stopwatch around a unit of work; attributes as labels on that stopwatch; and events as timestamped notes taken while the clock is running.
Here is a realistic Python example for a payment processing function — the kind of code where understanding latency breakdown actually matters in production:
This gives you a trace waterfall that shows the full payments.process span, with the fraud check duration visible as the gap between its events, and the nested gateway.charge child span carrying Stripe-specific metadata. Every attribute becomes a filter in your Jaeger or Tempo UI, so a question like "Show me all traces where payment.amount_cents > 100000 and fraud.decision == allow that took over 2 seconds" is a single query.
Attributes: Designing for Queryability
Attributes are the core of trace-driven debugging. OTel defines semantic conventions — a shared vocabulary that makes spans from any service, in any language, look the same in your backend. Follow them religiously.
Key semantic conventions to memorise (from opentelemetry-semantic-conventions):
- http.request.method, http.response.status_code, url.path: HTTP spans
- db.system, db.name, db.operation.name, db.query.text: database spans
- messaging.system, messaging.destination.name: Kafka, SQS, RabbitMQ
- rpc.system, rpc.service, rpc.method: gRPC spans
- service.name, service.version, deployment.environment: resource attributes (set once, apply to all spans from that process)
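Resource attributes describe the process that emits the spans. Set them once, via the OTEL_RESOURCE_ATTRIBUTES environment variable or the Resource SDK class. Span attributes describe one specific operation. Never repeat resource data on every span: it bloats storage and is redundant, because the backend already associates the resource with every span from that process.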
Span Events vs Logs: The Right Mental Model
A span event is a structured, timestamped log entry that is automatically correlated to a trace — you get a precise timeline of what happened inside a span without needing log correlation by hand. Use events for: state transitions inside a span, cache hit/miss decisions, retry attempts, significant branch points. Use a regular log (with trace_id injected) for high-volume operational noise that should not be part of the trace payload.
Never put personally identifiable information (PII) in span attributes or events. Record only opaque identifiers (user.id, order.id) and look up PII separately in your application database when needed. This supports GDPR data minimisation and is also a security boundary: your tracing backend is typically readable by far more people than your production database.
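A simple guard for this rule is an allowlist applied before attributes reach a span. The sketch below is one possible shape, not a standard OTel feature; ALLOWED_KEYS and safe_attributes are hypothetical names.

```python
# Hypothetical allowlist of attribute keys that are safe to export.
ALLOWED_KEYS = frozenset({"user.id", "order.id", "payment.amount_cents"})


def safe_attributes(candidate: dict) -> dict:
    """Keep only allowlisted keys so stray PII never reaches the span."""
    return {key: value for key, value in candidate.items() if key in ALLOWED_KEYS}


raw = {
    "user.id": "u-42",
    "user.email": "jane@example.com",  # PII: must not be exported
    "order.id": "o-123",
}
print(safe_attributes(raw))
```

The filtered dictionary can then be passed to span.set_attributes(...). An allowlist fails closed: a newly added key stays out of traces until someone approves it.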
Initialising the SDK: The Tracer Provider
Before any span can be emitted, your process must configure a TracerProvider with an exporter and resource. Do this once at application startup — in your main(), WSGI entrypoint, or framework bootstrap. In Python with OTLP gRPC export to the Collector:
The BatchSpanProcessor buffers spans in memory and flushes them asynchronously — the only option safe for production. The synchronous SimpleSpanProcessor blocks the calling thread on every export and should be used only in tests or CLIs.
Putting It Together: What Big-Tech Teams Actually Do
At companies operating hundreds of microservices, a common pattern is: auto-instrumentation handles the transport layer, manual instrumentation annotates business operations, and a shared internal library encapsulates the SDK bootstrap so every service is configured consistently. Teams define a company-wide attribute taxonomy (approved user.*, order.*, payment.* keys) enforced by lint rules on the tracer calls. New engineers do not write raw OTel SDK calls. They use the internal wrapper, which already embeds the right resource attributes, the right exporter endpoint, and the right sampling configuration. That consistency is what makes traces useful at scale: you can write a single Jaeger query across 300 services and get coherent results because every span follows the same attribute schema.
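The taxonomy check can be sketched as a small function that a wrapper library or lint rule might call. The approved prefixes come from the example taxonomy above; the names APPROVED_PREFIXES and unapproved_keys are hypothetical, and a real rule set would likely also admit the OTel semantic-convention keys.

```python
# Prefixes taken from the example taxonomy: user.*, order.*, payment.*
APPROVED_PREFIXES = ("user.", "order.", "payment.")


def unapproved_keys(attributes: dict) -> list:
    """Return the attribute keys that fall outside the approved taxonomy."""
    return sorted(
        key for key in attributes if not key.startswith(APPROVED_PREFIXES)
    )


attrs = {"user.id": "u-42", "orderId": "o-123", "payment.amount_cents": 125000}
violations = unapproved_keys(attrs)
print(violations)
```

A wrapper could raise in tests and only log a warning in production, so a naming mistake fails the build rather than breaking a request.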