Service Mesh: Istio & Linkerd

Mesh Observability

One of the most underappreciated gifts of a service mesh is what it gives you for free the moment you inject sidecars: a complete, consistent observability plane that requires zero application changes. Every pod that gains an Envoy proxy automatically emits the four Google SRE golden signals — latency, traffic, errors, and saturation — for every service-to-service edge in your topology. No SDK instrumentation. No per-team OTel configuration. No code reviews for forgotten metric registrations. The mesh sees every TCP byte and HTTP exchange and knows who sent it, who received it, whether it succeeded, and how long it took.

In production, this matters enormously. At Lyft, the company that originally created Envoy, uniform telemetry from the proxy layer was one of the primary justifications for the multi-year investment. Before the mesh, each team instrumented metrics differently. With Istio, every meshed service automatically exposes istio_requests_total, istio_request_duration_milliseconds, and istio_tcp_sent_bytes_total with a consistent label schema. P99 cross-service latency becomes a first-class observable with no per-engineer work.

The Golden Signals from the Sidecar

Envoy (Istio) and the linkerd2-proxy (Linkerd) both emit the four golden signals at the L4 and L7 layers automatically. The key Prometheus metrics you will use every day in production:

Traffic (request rate): istio_requests_total — a counter with labels source_workload, destination_workload, response_code, request_protocol. Derive RPS with rate(istio_requests_total[1m]).
Errors: Filter istio_requests_total by response_code=~"5.." for server errors, or response_code=~"4..|5.." to include client errors. A response_code!~"2.." filter also counts 1xx and 3xx responses, which are usually not failures. The mesh also reports transport-layer failures (connection refused, upstream timeout, circuit-breaker open) through the response_flags label on the same metric.
Latency: istio_request_duration_milliseconds_bucket is a histogram exposing the full distribution. Use histogram_quantile(0.99, sum by (le, source_workload, destination_workload) (rate(istio_request_duration_milliseconds_bucket[5m]))) for p99 per workload pair. This is the latency of the proxied request as seen by the sidecar, which includes application processing time.
Saturation: Derived from envoy_cluster_upstream_cx_active (active connections) versus envoy_cluster_upstream_cx_overflow (connection-pool overflow). Rising overflow is one of the earliest warning signs of overload, often visible before latency degrades. Istio's default proxy configuration exposes only a minimal set of Envoy stats, so you may need to enable these envoy_cluster_* metrics with proxyStatsMatcher.
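
To make the p99 query less of a black box, here is a minimal Python sketch of the linear interpolation that histogram_quantile performs over cumulative buckets. The function name and the sample bucket counts are hypothetical, and the sketch skips edge cases that Prometheus itself handles (for example a missing +Inf bucket or non-monotonic counts).

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound
    and ending with the +Inf bucket, mirroring the `le` label series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bound;
                # report that bound rather than infinity.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative request counts per latency bucket (milliseconds).
sample = [(50.0, 900.0), (100.0, 980.0), (250.0, 995.0), (math.inf, 1000.0)]
p99 = histogram_quantile(0.99, sample)
```

With these counts the 990th request falls inside the 100 to 250 ms bucket, so the estimate is interpolated between those two bounds. The result can never be more precise than the bucket boundaries allow, which is worth remembering when reading p99 panels.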

Key idea (label consistency at zero cost): Every metric from every sidecar carries a standardized label set: source_workload, destination_workload, source_namespace, destination_namespace, destination_service, request_protocol, response_code, response_flags. The response_flags label is Envoy-specific and invaluable because it encodes the reason Envoy attached to a failed response: UH (no healthy upstream), UT (upstream request timeout), UC (upstream connection termination), UF (upstream connection failure), URX (upstream retry limit exceeded), DC (downstream connection termination). Filtering on response_flags in a PromQL alert can distinguish "your service is returning 500s" from "Envoy is timing out waiting for your service."
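
The distinction between application failures and mesh-level failures can be sketched in a few lines of Python. This is an illustrative classifier, not part of Istio: the function name and return strings are hypothetical, and it assumes that the response_flags label holds "-" when no flag is set and a comma-separated list when several are.

```python
# Envoy response flags discussed in the text, mapped to their meaning.
MESH_FAILURE_FLAGS = {
    "UH": "no healthy upstream",
    "UF": "upstream connection failure",
    "UC": "upstream connection termination",
    "UT": "upstream request timeout",
    "URX": "upstream retry limit exceeded",
    "DC": "downstream connection termination",
}

def classify_failure(response_code, response_flags):
    """Label one istio_requests_total series as a mesh or application failure."""
    # Assumption: "-" means no flag; multiple flags are comma-separated.
    flags = [f for f in response_flags.split(",") if f and f != "-"]
    reasons = [MESH_FAILURE_FLAGS[f] for f in flags if f in MESH_FAILURE_FLAGS]
    if reasons:
        return "mesh: " + ", ".join(reasons)
    if response_code.startswith("5"):
        return "application: upstream returned " + response_code
    return "other"
```

A 503 with UH points at the mesh (no endpoint to route to), while a 500 with no flag came from the application itself. The same split is what a response_flags matcher achieves in a PromQL alert.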

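The standard Prometheus scrape configuration for Istio uses pod annotations. In each injected pod, the Istio agent serves merged metrics on port 15020, while Envoy's own Prometheus stats endpoint listens on port 15090 (port name http-envoy-prom). The Istio Prometheus integration scrapes the annotated port; the PodMonitor below, for use with the Prometheus Operator, scrapes the Envoy endpoint on 15090 directly.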

# PodMonitor — scrape Istio sidecar metrics via Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: istio-proxy
  namespace: istio-system
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      security.istio.io/tlsMode: istio    # every injected pod gets this
  namespaceSelector:
    any: true                             # scrape all namespaces
  podMetricsEndpoints:
  - port: http-envoy-prom                 # port 15090 (Envoy Prometheus stats endpoint)
    path: /stats/prometheus
    scheme: http
    interval: 15s
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_label_app]
      targetLabel: app
    - sourceLabels: [__meta_kubernetes_namespace]
      targetLabel: namespace

For Linkerd, no per-pod scrape configuration is needed: each linkerd2-proxy serves metrics on its admin port 4191, and the Prometheus instance installed by the linkerd-viz extension discovers and scrapes the proxies automatically. Linkerd also publishes pre-built Grafana dashboards that render the golden signals out of the box; recent Linkerd releases expect you to install Grafana separately.

The Service Graph and Kiali

Golden-signal metrics give you numbers; Kiali gives you a live service graph that maps those numbers onto your topology. Kiali is the official Istio observability UI. It queries Prometheus, Jaeger/Zipkin, and the Kubernetes API simultaneously to render a real-time dependency graph where every edge is annotated with RPS, error rate, p99 latency, and the mTLS status of the connection.

The mesh observability stack: sidecars emit metrics and traces at the data plane; Prometheus, OTel Collector, and the Kubernetes API collect them; Grafana, Kiali, and Jaeger surface them to engineers.

Pro practice (use Kiali for topology, not alerting): Kiali is a visualization tool. It is invaluable during incident investigation (it instantly shows which edges are red) and during rollouts (watch p99 rise on a canary in real time). But do not build paging alerts on Kiali data. All alerting should be PromQL rules evaluated against Prometheus metrics directly, because Kiali's internal cache has a configurable staleness and is not designed for sub-minute alert evaluation. The pattern big platforms use: Kiali for exploration, Prometheus Alertmanager for paging.

Distributed Tracing Integration

The mesh handles the hardest part of distributed tracing: it automatically generates trace spans for sampled proxied requests and understands the W3C traceparent (or Zipkin B3) headers that carry trace context between services. You do not need to instrument your application code to get inter-service spans; the Envoy sidecars create them. What you do need from the application is a single behavior: propagate the incoming trace headers on every outgoing call. The inbound sidecar reads the headers, creates a server span, and hands the request to your application. When the application makes an outgoing call with those headers attached, the outbound sidecar creates a child span and injects updated headers into the request it forwards. If your service code swallows the headers (does not forward them), the outbound sidecar has no context to continue and starts a new, disconnected trace.

Production pitfall (the one-line app change you cannot skip): Many teams deploy the mesh, see traces in Jaeger, and assume everything is working. Then they file a bug: "traces are broken, I can only see two hops." The root cause is almost always the same: a service is not forwarding the trace headers. With Istio, the headers to propagate are x-request-id, x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, x-b3-flags, and x-ot-span-context (or just traceparent if you switch to W3C mode). The fix is simple, but it requires touching every service. Do it systematically during the mesh rollout; leaving it until after means a retrofit project across dozens of teams.
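
The propagation step itself is small. The following Python sketch shows one way to copy the trace headers from an incoming request onto an outgoing call; the helper name propagation_headers is hypothetical, and how you obtain the incoming headers depends on your web framework.

```python
# Headers listed in the text, lower-cased because HTTP header names
# are case-insensitive.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
    "traceparent",  # used when the mesh runs in W3C mode
)

def propagation_headers(incoming_headers):
    """Return only the trace headers present on the incoming request."""
    lowered = {name.lower(): value for name, value in incoming_headers.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}

# Hypothetical incoming request headers.
incoming = {
    "Content-Type": "application/json",
    "X-Request-Id": "req-123",
    "X-B3-TraceId": "463ac35c9f6413ad48485a3953bb6124",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
}
outgoing = {"accept": "application/json", **propagation_headers(incoming)}
```

Merging the result into the headers of every outgoing HTTP or gRPC call is the whole change. Putting the helper in a shared client wrapper or middleware keeps individual teams from having to remember it.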

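Sending traces from Istio to a Jaeger (or OTel Collector) backend is configured in two places. The MeshConfig in the istio ConfigMap sets the global sampling rate and declares the available tracing providers; a Telemetry API resource then selects a provider and can override the sampling rate per namespace or workload: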

# istio ConfigMap: enable tracing with 1% head-based sampling
# kubectl -n istio-system edit configmap istio
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio
  namespace: istio-system
data:
  mesh: |
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 1.0          # 1% head-based; override per namespace with Telemetry CR
        zipkin:
          address: otel-collector.observability:9411   # Zipkin-compatible receiver on your OTel Collector
    extensionProviders:
    - name: otel-tracing
      opentelemetry:
        service: otel-collector.observability.svc.cluster.local
        port: 4317             # OTLP gRPC

---
# Per-namespace Telemetry resource — 100% sampling for the staging namespace
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: staging-full-trace
  namespace: staging
spec:
  tracing:
  - providers:
    - name: otel-tracing
    randomSamplingPercentage: 100.0

In production at 10,000 RPS, a 1% head-based sampling rate produces 100 traces per second, which is more than enough for latency analysis and incident debugging. The critical upgrade for production is tail-based sampling: keep 100% of error traces and slow traces (duration above a threshold) and keep fast, successful traces at 1%. Because the decision is made after a trace completes, the proxies must export every span to the collector, and the collector decides what to store. The tail-sampling processor in the OTel Collector, placed in front of a backend such as Tempo or Jaeger, implements this, ensuring you do not miss failure traces while keeping storage costs bounded.
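
The arithmetic and the tail-sampling decision can be written out as a short Python sketch. Both function names, the 500 ms threshold, and the way randomness is passed in are assumptions made for illustration; a real collector applies equivalent policies through its own configuration.

```python
def expected_traces_per_second(rps, sampling_percent):
    """Head-based sampling: a fixed share of requests is traced."""
    return rps * sampling_percent / 100.0

def keep_trace(has_error, duration_ms, slow_threshold_ms,
               baseline_percent, random_percent):
    """Tail-based decision, made after the whole trace has been collected.

    random_percent is a uniform draw in [0, 100), for example
    random.random() * 100, passed in so the decision is testable.
    """
    if has_error:
        return True                      # always keep failures
    if duration_ms > slow_threshold_ms:
        return True                      # always keep slow traces
    return random_percent < baseline_percent  # sample the rest

head_rate = expected_traces_per_second(10_000, 1.0)
```

At 10,000 RPS and 1%, expected_traces_per_second gives the 100 traces per second quoted above. The keep_trace function shows why tail sampling needs the complete trace first: neither the error status nor the total duration is known when the first span starts.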

Grafana Dashboards: the Canonical Stack

The Istio project ships pre-built Grafana dashboards that you import from the grafana.com catalog (or directly from the istio/istio GitHub repo under samples/addons/grafana). The four dashboards you will use in production:

Istio Mesh Dashboard (ID 7639): the global view, with total RPS, global error rate, and p50/p90/p99 latency across the entire mesh. The first screen you open during an incident.
Istio Service Dashboard (ID 7636): the per-service drilldown, with inbound RPS by source, outbound RPS by destination, error rate broken down by response code and Envoy response flag, and latency histograms. Sufficient for most incident root-cause investigations.
Istio Workload Dashboard (ID 7630): the workload-level view, useful when multiple deployments serve the same service (canary analysis, multi-version traffic splits).
Istio Control Plane Dashboard (ID 7645): control-plane health, including istiod CPU/memory, xDS push rate, and config distribution latency. Essential for diagnosing mesh-layer problems as opposed to application problems. The related Istio Performance Dashboard (ID 11829) tracks the resource usage of the proxies and istiod.

# Core PromQL queries for mesh observability — use these in Grafana panels and alerts

# --- TRAFFIC: requests per second, per destination service ---
sum(rate(istio_requests_total{reporter="destination"}[1m])) by (destination_service_name)

# --- ERRORS: 5xx ratio by service (response_code!~"2.." would also count 1xx/3xx/4xx) ---
sum(rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m])) by (destination_service_name)
/
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# --- LATENCY: p99 for a specific destination service ---
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{
    reporter="destination",
    destination_service_name="payments"
  }[5m])) by (le)
)

# --- SATURATION: upstream connection overflow (early overload signal) ---
sum(rate(envoy_cluster_upstream_cx_overflow[5m])) by (pod, cluster_name)

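# --- ENVOY RESPONSE FLAGS: split error reasons ---
# reporter="source": requests that never reach the destination sidecar (e.g. UH)
# are only recorded by the client-side proxy
sum(rate(istio_requests_total{
  reporter="source",
  response_flags=~"UH|UT|UC|URX|UF"
}[5m])) by (destination_service_name, response_flags)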

The reporter="destination" filter is important: each request is recorded twice, once by the source sidecar (reporter="source") and once by the destination sidecar (reporter="destination"). Filtering on destination avoids double-counting and gives you the latency as seen by the receiving end, which is the usual view for service-side SLOs. Use reporter="source" when you specifically need the client-perceived latency including network transit time, or when you need to count requests that failed before reaching the destination sidecar (for example, those flagged UH).
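
The double-counting effect is easy to see with a toy example. The Python sketch below uses hypothetical per-series request rates for a single workload pair, as they might come back from a rate(istio_requests_total[1m]) query; the function name total_rps and the workload names are illustrative.

```python
def total_rps(samples, reporter=None):
    """Sum request rates, optionally restricted to one reporter label value."""
    return sum(
        s["rps"] for s in samples
        if reporter is None or s["reporter"] == reporter
    )

# One edge (checkout -> payments) seen from both sidecars.
samples = [
    {"reporter": "source", "source_workload": "checkout",
     "destination_workload": "payments", "rps": 50.0},
    {"reporter": "destination", "source_workload": "checkout",
     "destination_workload": "payments", "rps": 50.0},
]

unfiltered = total_rps(samples)               # counts the edge twice
filtered = total_rps(samples, "destination")  # counts it once
```

Without the reporter filter the edge appears to carry twice its real traffic, which is exactly the mistake an unfiltered sum() in PromQL makes.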

Linkerd observability (same signals, different shape): Linkerd emits the same golden signals via its own metric names: request_total, response_latency_ms_bucket, tcp_open_connections. Linkerd publishes a set of Grafana dashboards, and the Linkerd Viz extension adds a CLI: linkerd viz stat deploy gives you a terminal table of success rate, RPS, and latency percentiles per deployment. linkerd viz edges po shows the mTLS status of every pod-to-pod edge. linkerd viz tap deploy/payments --to deploy/database streams live request details (method, path, response code, latency) without a full trace backend. This tap feature is the Linkerd counterpart to Envoy's access logs and is invaluable during development.

Connecting Metrics, Traces, and Logs: Exemplars

The final piece of production mesh observability is exemplar linkage: the ability to click on a spike in a Grafana latency panel and jump directly to a representative trace. Prometheus has supported exemplar storage as an opt-in feature since version 2.26: a histogram bucket sample can carry a trace_id label alongside the observation. Envoy 1.24+ emits exemplars when tracing is enabled.
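
To show what "a bucket carrying a trace_id" looks like on the wire, here is a small Python sketch that extracts the exemplar from one line of OpenMetrics-style exposition text. The sample line, the trace ID, and the parse_exemplar helper are made up for illustration; real scrapers use a full OpenMetrics parser rather than a regular expression.

```python
import re

# An exemplar follows the sample value: "# {labels} value [timestamp]".
EXEMPLAR_RE = re.compile(
    r'#\s*\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+(?P<timestamp>\S+))?\s*$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_exemplar(line):
    """Return the exemplar attached to a metric line, or None if absent."""
    match = EXEMPLAR_RE.search(line)
    if match is None:
        return None
    timestamp = match.group("timestamp")
    return {
        "labels": dict(LABEL_RE.findall(match.group("labels"))),
        "value": float(match.group("value")),
        "timestamp": float(timestamp) if timestamp else None,
    }

# Hypothetical exposition line for one latency bucket.
line = ('istio_request_duration_milliseconds_bucket{le="250"} 995 '
        '# {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 231.0 1700000000.0')
exemplar = parse_exemplar(line)
```

The parsed exemplar holds the observed latency (231.0 ms in this made-up line) together with the trace ID that produced it, which is the link Grafana follows when you click a dot on the panel.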

With Grafana 9+, a latency panel backed by a Prometheus histogram with exemplars shows scatter dots overlaid on the quantile line. Each dot is a real request with a real trace ID. Clicking a dot opens Jaeger or Tempo filtered to that trace. This closes the loop: you see the p99 spike in your SLO dashboard, click the worst exemplar, land on the trace waterfall that shows exactly which service and which downstream call drove the spike. No grep, no log correlation, no context switching between four tools.

At organizations running hundreds of services — Airbnb, Pinterest, Shopify — the exemplar linkage pattern is considered mandatory for any latency SLO dashboard. The mesh provides the trace IDs for free; the only requirement is Prometheus scraping with exemplar support (--enable-feature=exemplar-storage flag on Prometheus) and a Grafana data source configured to use them.