Distributed Tracing & OpenTelemetry

Instrumenting with OTel

OpenTelemetry gives you two distinct paths to add tracing to a service: auto-instrumentation, which works with zero code changes, and manual instrumentation, where you write explicit SDK calls to create spans, attach attributes, and record events. In production systems at Google-scale, you almost always use both together — auto for the boilerplate (HTTP, database, gRPC), manual for the business-critical logic that the framework cannot see. This lesson teaches you exactly when to use each, and how to do it right.

Auto-Instrumentation: Zero-Code Observability

Auto-instrumentation works by patching well-known libraries at load time. The OTel agent (JVM, Python, Node.js) or SDK hooks intercept calls to popular frameworks — Django, Flask, Express, Spring Boot, gRPC, psycopg2, redis-py — and automatically create spans with sensible defaults. You get spans for every HTTP request, every DB query, and every outbound call without touching application code.

Python example — enabling auto-instrumentation with zero code change:

    # Install the OTel Python agent and Flask instrumentation
    pip install opentelemetry-distro opentelemetry-exporter-otlp
    opentelemetry-bootstrap -a install  # auto-detects installed libs and installs their instrumentation

    # Run your Flask app under the agent
    opentelemetry-instrument \
        --service_name order-service \
        --exporter_otlp_endpoint http://otel-collector:4317 \
        --exporter_otlp_protocol grpc \
        python app.py

That single command wraps the process. Every Flask route handler, every SQLAlchemy query, every Redis call now emits spans, with zero diff to your application source, provided the bootstrap step installed the matching instrumentation packages. Instrumented HTTP clients (for example requests or urllib3) inject the W3C traceparent header into outbound calls automatically, so traces propagate across service boundaries.
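To make the propagation step concrete, here is a minimal sketch of the W3C traceparent header format (version, trace ID, parent span ID, flags). The helper names are hypothetical and this is not the OTel propagator itself, which also handles tracestate and future versions; it only illustrates what travels on the wire.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version "00" only: 32-hex trace-id, 16-hex parent-id, 2-hex trace-flags.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def format_traceparent(trace_id: str, parent_id: str, sampled: bool) -> str:
    """Render the header value for an outbound request."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def new_child_header(incoming: TraceParent) -> str:
    """Keep the trace ID, mint a fresh span ID for the outbound call."""
    return format_traceparent(
        incoming.trace_id, secrets.token_hex(8), incoming.sampled
    )
```

The trace ID stays constant across every hop while the parent ID changes per span, which is what lets a backend stitch spans from different services into one trace.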

Auto-instrumentation is your baseline, not your ceiling. It handles the infrastructure layer: HTTP servers, DB drivers, message consumers. It knows nothing about your business logic — it cannot tell you that the checkout span involved a high-value order, or that a cache miss triggered an expensive fallback. That is where manual instrumentation earns its keep.

Manual Instrumentation: Annotating What Matters

Manual instrumentation lets you create spans for arbitrary code blocks, attach structured key-value attributes, and record point-in-time span events. Think of a span as a stopwatch around a unit of work; attributes as labels on that stopwatch; and events as timestamped notes taken while the clock is running.

Here is a realistic Python example for a payment processing function — the kind of code where understanding latency breakdown actually matters in production:

    from opentelemetry import trace
    from opentelemetry.trace import SpanKind, StatusCode

    # run_fraud_check, PaymentBlockedError, and the stripe wrapper are
    # application code assumed to be defined elsewhere.
    tracer = trace.get_tracer("payments.service", "1.4.2")


    def process_payment(order_id: str, amount_cents: int, payment_method: str) -> dict:
        with tracer.start_as_current_span(
            "payments.process",
            kind=SpanKind.INTERNAL,
        ) as span:
            # --- Attributes: structured facts about this unit of work ---
            span.set_attribute("order.id", order_id)
            span.set_attribute("payment.method", payment_method)
            span.set_attribute("payment.amount_cents", amount_cents)
            span.set_attribute("payment.currency", "USD")

            # --- Span Event: record a timestamped moment ---
            span.add_event("fraud_check.start")
            fraud_result = run_fraud_check(order_id, amount_cents)
            span.add_event("fraud_check.complete", {
                "fraud.score": fraud_result.score,
                "fraud.decision": fraud_result.decision,
            })

            if fraud_result.decision == "block":
                span.set_status(StatusCode.ERROR, "Payment blocked by fraud check")
                span.set_attribute("payment.blocked", True)
                raise PaymentBlockedError(order_id)

            # Child span for the gateway call (auto-instrumented HTTP, or manual)
            with tracer.start_as_current_span("gateway.charge") as gw_span:
                gw_span.set_attribute("gateway.provider", "stripe")
                response = stripe.charge(order_id, amount_cents)
                gw_span.set_attribute("gateway.charge_id", response.charge_id)

            span.set_attribute("payment.success", True)
            return {"charge_id": response.charge_id}

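This gives you a trace waterfall that shows the full payments.process span, with the fraud check duration visible as the delta between its two events, and the nested gateway.charge child span with Stripe-specific metadata. Span attributes become filters in your Jaeger or Tempo UI, so a question like "Show me all traces where payment.amount_cents > 100000 and fraud.decision == allow that took over 2 seconds" can be answered directly, as far as the backend's query language supports numeric comparisons. Note that fraud.decision is recorded on an event here, not on the span; not every backend indexes event attributes, so also set it as a span attribute if you need to filter on it.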

Attributes: Designing for Queryability

Attributes are the core of trace-driven debugging. OTel defines semantic conventions — a shared vocabulary that makes spans from any service, in any language, look the same in your backend. Follow them religiously.

OTel Span Anatomy: attributes, events, status, parent link

    payments.process
      Attributes:
        order.id = "ord-9821"
        payment.amount_cents = 45000
        payment.method = "card"
        http.status_code = 200
      Span Events (timestamped):
        t+0ms    fraud_check.start
        t+43ms   fraud_check.complete { score: 0.12 }
        t+48ms   gateway.charge.start
        t+310ms  gateway.charge.complete { id: ch_xyz }
      Child span: gateway.charge
        gateway.provider = "stripe"
        gateway.charge_id = "ch_xyz"
      Span Status:
        UNSET (default, no error)
        ERROR (set explicitly on failure)

Attributes persist for the span lifetime; events are timestamped snapshots mid-span.
Anatomy of an OTel span — attributes label the work, events capture moments within it, child spans model sub-operations.

Key semantic conventions to memorise (from opentelemetry-semantic-conventions):

  • http.request.method, http.response.status_code, url.path — HTTP spans
  • db.system.name, db.namespace, db.operation.name, db.query.text — database spans (older semantic-convention releases used db.system and db.name)
  • messaging.system, messaging.destination.name — Kafka, SQS, RabbitMQ
  • rpc.system, rpc.service, rpc.method — gRPC spans
  • service.name, service.version, deployment.environment — resource attributes (set once, apply to all spans from that process)

Resource attributes vs span attributes. Resource attributes describe the process (service name, version, host, k8s pod). Set them once at SDK initialisation via OTEL_RESOURCE_ATTRIBUTES or the Resource SDK class. Span attributes describe this specific operation. Never repeat resource data on every span: it bloats storage and is redundant, since your backend joins them automatically.
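The OTEL_RESOURCE_ATTRIBUTES variable holds a comma-separated list of key=value pairs, with special characters in values percent-encoded. The sketch below shows how such a string maps to a dictionary; parse_resource_attributes is a hypothetical helper for illustration only, since the SDK does its own parsing when it builds the Resource.

```python
from typing import Dict
from urllib.parse import unquote


def parse_resource_attributes(raw: str) -> Dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict, skipping malformed pairs."""
    attributes: Dict[str, str] = {}
    for pair in raw.split(","):
        key, separator, value = pair.strip().partition("=")
        key = key.strip()
        if not separator or not key:
            continue  # no '=' or empty key: ignore this entry
        # Values may be percent-encoded (for example %2C for a comma).
        attributes[key] = unquote(value.strip())
    return attributes


example = "service.version=2.3.1,deployment.environment=production,team=payments%2Ccore"
parsed = parse_resource_attributes(example)
```

Because these values are attached once per process, a deployment can change them through the environment without any code change.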

Span Events vs Logs: The Right Mental Model

A span event is a structured, timestamped log entry that is automatically correlated to a trace — you get a precise timeline of what happened inside a span without needing log correlation by hand. Use events for: state transitions inside a span, cache hit/miss decisions, retry attempts, significant branch points. Use a regular log (with trace_id injected) for high-volume operational noise that should not be part of the trace payload.
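For the "regular log with trace_id injected" side of that split, a logging filter is one way to stamp every record with the current trace ID. This is a standard-library sketch under assumptions: current_trace_id is a hypothetical context variable standing in for the active span context, which in a real service would be read from the OTel API instead.

```python
import contextvars
import io
import logging
from typing import List

# Stand-in for the active trace context; "-" means "no trace in progress".
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("orders.demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def demo() -> List[str]:
    """Log once inside a trace and once outside, returning the lines."""
    buffer = io.StringIO()
    logger = build_logger(buffer)
    token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    try:
        logger.info("cache refreshed")
    finally:
        current_trace_id.reset(token)
    logger.info("idle housekeeping")
    return buffer.getvalue().splitlines()
```

The high-volume log line stays out of the trace payload but can still be joined to the trace by searching the log store for the same ID.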

Do not store sensitive data in attributes or events. Span attributes are exported to your tracing backend (Jaeger, Tempo, or a vendor SaaS) and are often retained for days or weeks. Never put PII (email addresses, full names, payment card numbers, SSNs) in a span attribute. Store an opaque ID (user.id, order.id) and look up PII separately in your application database when needed. This supports GDPR data-minimisation obligations and is also a security boundary: your tracing backend is typically accessible to far more people and systems than your production database.
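One way to enforce this rule is to pass attributes through a scrubbing step before they reach a span. The sketch below is illustrative only: scrub_attributes, DENIED_KEYS, and the simple email pattern are hypothetical choices, not part of the OTel SDK, and a real deployment would need a broader set of detectors.

```python
import re
from typing import Any, Dict

# Keys that must never be exported (hypothetical denylist).
DENIED_KEYS = frozenset({"user.email", "user.full_name", "payment.card_number"})

# Deliberately simple pattern: something@something.tld
_EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

REDACTED = "[redacted]"


def scrub_attributes(attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Drop denied keys and mask string values that look like emails."""
    clean: Dict[str, Any] = {}
    for key, value in attributes.items():
        if key in DENIED_KEYS:
            continue  # drop the attribute entirely
        if isinstance(value, str) and _EMAIL_RE.search(value):
            clean[key] = REDACTED  # keep the key, hide the value
        else:
            clean[key] = value
    return clean


safe = scrub_attributes({
    "user.id": "u-1842",
    "user.email": "alice@example.com",
    "order.note": "contact bob@example.com",
    "payment.amount_cents": 45000,
})
```

Opaque identifiers such as user.id pass through untouched, so traces remain queryable without carrying the personal data itself.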

Initialising the SDK: The Tracer Provider

Before any span can be emitted, your process must configure a TracerProvider with an exporter and resource. Do this once at application startup — in your main(), WSGI entrypoint, or framework bootstrap. In Python with OTLP gRPC export to the Collector:

    import os

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION

    resource = Resource.create({
        SERVICE_NAME: "order-service",
        SERVICE_VERSION: "2.3.1",
        "deployment.environment": "production",
        "k8s.pod.name": os.environ.get("POD_NAME", "unknown"),
    })

    exporter = OTLPSpanExporter(
        endpoint="http://otel-collector.observability.svc:4317",
        insecure=True,  # plaintext gRPC; only appropriate on a trusted in-cluster network
    )

    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    # Now any library instrumentation and trace.get_tracer() calls use this provider
    tracer = trace.get_tracer(__name__)

The BatchSpanProcessor buffers spans in memory and flushes them asynchronously — the only option safe for production. The synchronous SimpleSpanProcessor blocks the calling thread on every export and should be used only in tests or CLIs.
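The difference between the two processors comes down to how many export calls the application pays for. The toy model below illustrates only the batching idea: ToyBatchBuffer is a hypothetical class, and unlike the real BatchSpanProcessor it has no background thread, timer-based flush, or bounded queue.

```python
from typing import Callable, List


class ToyBatchBuffer:
    """Collect finished spans and export them in groups."""

    def __init__(
        self,
        export: Callable[[List[str]], None],
        max_batch_size: int = 512,
    ) -> None:
        self._export = export
        self._max_batch_size = max_batch_size
        self._pending: List[str] = []

    def on_end(self, span_name: str) -> None:
        """Called when a span finishes; exports only when the batch is full."""
        self._pending.append(span_name)
        if len(self._pending) >= self._max_batch_size:
            self.flush()

    def flush(self) -> None:
        """Export whatever is buffered, for example at shutdown."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._export(batch)


exported_batches: List[List[str]] = []
buffer = ToyBatchBuffer(exported_batches.append, max_batch_size=3)
for index in range(7):
    buffer.on_end(f"span-{index}")
buffer.flush()  # drain the remainder, as a shutdown hook would
```

Seven spans cost three export calls here, whereas a per-span synchronous export would cost seven, each one blocking the code path that ended the span.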

Putting It Together: What Big-Tech Teams Actually Do

At companies operating hundreds of microservices, a common pattern is: auto-instrumentation handles the transport layer, manual instrumentation annotates business operations, and a shared internal library encapsulates the SDK bootstrap so every service is configured consistently. Teams define a company-wide attribute taxonomy (approved user.*, order.*, payment.* keys) enforced by lint rules on the tracer calls. New engineers do not write raw OTel SDK calls; they use the internal wrapper that already embeds the right resource attributes, the right exporter endpoint, and the right sampling configuration. That consistency is what makes traces useful at scale: you can write a single Jaeger query across 300 services and get coherent results because every span follows the same attribute schema.
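A taxonomy check of this kind can be very small. The sketch below assumes the approved prefixes named above (user., order., payment.) and a hypothetical find_unapproved_keys helper; a real lint rule would more likely inspect the source code of tracer calls than a list of keys at runtime.

```python
from typing import Iterable, List, Sequence

# Approved attribute namespaces, taken from the taxonomy described above.
APPROVED_PREFIXES = ("user.", "order.", "payment.")


def find_unapproved_keys(
    keys: Iterable[str],
    approved_prefixes: Sequence[str] = APPROVED_PREFIXES,
) -> List[str]:
    """Return the keys that fall outside the approved namespaces, sorted."""
    prefixes = tuple(approved_prefixes)  # str.startswith needs a tuple
    return sorted(key for key in keys if not key.startswith(prefixes))


violations = find_unapproved_keys(
    ["order.id", "payment.method", "checkoutTotal", "usr.id"]
)
```

Running such a check in CI or inside the internal wrapper catches misspelled or ad hoc keys before they fragment queries across services.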