Service Mesh: Istio & Linkerd

Traffic Management

Traffic management is the core value proposition of a service mesh. Without a mesh, traffic routing lives inside your application code or in coarse-grained load-balancer rules — both are inflexible and operationally dangerous at scale. Istio externalises every routing decision into two Kubernetes custom resources: VirtualService and DestinationRule. Understanding the boundary between them, and how they compose, is the prerequisite to doing anything sophisticated in the data plane — canaries, blue/green, circuit breaking, fault injection, and header-based routing all depend on this pair.

The Two-Resource Model

It helps to think of these resources as two distinct layers of abstraction:

  • VirtualService — the routing layer. Answers the question "where does this request go?" It matches on HTTP method, URI prefix, headers, source namespace, or query parameters, and forwards matched traffic to one or more destinations. A destination references a Kubernetes service plus an optional subset label.
  • DestinationRule — the policy layer. Answers the question "how should traffic behave once it reaches this destination?" It defines subsets (pod label selectors that group versions of a workload), load-balancing algorithms, connection pool limits, and outlier ejection (circuit breaking). The subset names referenced by a VirtualService must be declared in the DestinationRule for that host.
Key coupling to understand: A VirtualService route that names a subset (e.g. subset: v2) fails if no DestinationRule declares that subset for the same host. Envoy has no endpoints to forward to, so callers typically see HTTP 503 responses rather than an obvious configuration error. This is one of the most common misconfigurations when teams first adopt Istio.
[Figure: VirtualService and DestinationRule composition. The client's sidecar proxy applies the VirtualService (match on URI/headers, 90% v1 / 10% v2 weights, retries and timeouts, fault injection), then the DestinationRule (subsets v1 and v2 selected by the version label, LEAST_CONN load balancing, circuit breaker), before the request reaches Pod v1 (90%) or Pod v2 (10%). VirtualService routes; DestinationRule defines subsets and policies.]
Traffic flows from the client sidecar through the VirtualService routing layer, then through DestinationRule subsets to the correct pod version.

VirtualService: Routing Rules in Depth

A VirtualService applies to traffic destined for a given host (which maps to a Kubernetes service name). The http array is evaluated top-to-bottom; the first matching rule wins. This ordering matters — put more specific matches (header-based canary) before weight-based catch-alls, or your specific rules will never fire.

    # virtualservice-checkout.yaml
    apiVersion: networking.istio.io/v1
    kind: VirtualService
    metadata:
      name: checkout
      namespace: production
    spec:
      hosts:
        - checkout  # Kubernetes service name (short form within namespace)
      http:
        # Rule 1: internal testers get v2 always (header-based canary)
        - match:
            - headers:
                x-canary-user:
                  exact: "true"
          route:
            - destination:
                host: checkout
                subset: v2
              weight: 100
        # Rule 2: 10% of remaining traffic to v2 (weighted canary)
        - route:
            - destination:
                host: checkout
                subset: v1
              weight: 90
            - destination:
                host: checkout
                subset: v2
              weight: 10
          timeout: 5s
          retries:
            attempts: 3
            perTryTimeout: 2s
            retryOn: 5xx,reset,connect-failure,retriable-4xx

Key fields to know in production:

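  • timeout: the overall timeout for the request, covering the initial attempt and any retries. HTTP timeouts are disabled by default in current Istio releases (early releases defaulted to 15s). Set this explicitly; relying on the default leads to silent latency budget overruns when a downstream service degrades.
  • retries.retryOn: comma-separated Envoy retry conditions. 5xx alone is dangerous if your POST endpoints are not idempotent, because the upstream may already have applied the change before it failed; prefer connect-failure,reset,retriable-4xx for mutation paths. Note that reset can also fire after the upstream has started processing a request, so the strictest mutation paths may want to drop it as well.
  • match.sourceLabels: route based on the caller's pod labels, not just the request headers. Useful when a batch job and the user-facing API share a service name but need different routing policies.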
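As a rough sketch of how these fields might combine, the following variant of the checkout VirtualService gives a batch caller its own timeout and restricts retries on POST requests. The app: batch-export label and the specific timeout values are hypothetical, and the stricter retryOn list for POST is one possible choice, not an Istio recommendation. It is meant as a replacement for the http section of the existing resource, not as a second VirtualService for the same host.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: checkout
  namespace: production
spec:
  hosts:
    - checkout
  http:
    # Calls from pods labelled app=batch-export (hypothetical label):
    # longer overall timeout, no retries configured here.
    - match:
        - sourceLabels:
            app: batch-export
      route:
        - destination:
            host: checkout
            subset: v1
      timeout: 30s
    # Mutating requests: retry only on conditions where the request
    # is unlikely to have reached the application (assumed policy).
    - match:
        - method:
            exact: POST
      route:
        - destination:
            host: checkout
            subset: v1
      timeout: 5s
      retries:
        attempts: 2
        perTryTimeout: 2s
        retryOn: connect-failure,refused-stream
    # Everything else: broader retry conditions for idempotent reads.
    - route:
        - destination:
            host: checkout
            subset: v1
      timeout: 5s
      retries:
        attempts: 3
        perTryTimeout: 1500ms
        retryOn: 5xx,reset,connect-failure
```

Rule order matters here as well: the sourceLabels match comes first, so a POST from the batch job takes the first rule and never reaches the POST-specific one.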

DestinationRule: Subsets and Load Balancing

The DestinationRule for the same checkout service declares the subsets the VirtualService referenced, and sets per-subset (or global) policies:

    # destinationrule-checkout.yaml
    apiVersion: networking.istio.io/v1
    kind: DestinationRule
    metadata:
      name: checkout
      namespace: production
    spec:
      host: checkout
      trafficPolicy:
        loadBalancer:
          # Better than ROUND_ROBIN for heterogeneous pod latency.
          # Recent Istio releases mark LEAST_CONN as deprecated in favour of LEAST_REQUEST.
          simple: LEAST_CONN
        connectionPool:
          http:
            http2MaxRequests: 1000
            http1MaxPendingRequests: 100
          tcp:
            maxConnections: 200
        outlierDetection:  # Passive circuit breaking: eject unhealthy pods
          consecutiveGatewayErrors: 5
          interval: 10s
          baseEjectionTime: 30s
          maxEjectionPercent: 50  # Never eject more than half the subset, to prevent cascade
      subsets:
        - name: v1
          labels:
            version: v1
          trafficPolicy:
            loadBalancer:
              simple: ROUND_ROBIN  # Override global policy for v1 specifically
        - name: v2
          labels:
            version: v2
          # Inherits global LEAST_CONN
Set maxEjectionPercent explicitly. The default is 10%, which is safe for large pools but amounts to less than one pod for a small subset (e.g. 3 pods), so whether anything can be ejected at the default depends on how the Envoy version in use rounds that limit. A common choice is 50% for critical services, with 100% reserved for stateless read replicas where partial failure is acceptable.
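The snippet below sketches how a small canary subset could carry its own outlier-detection settings instead of inheriting the global ones. It assumes a two-pod v2 subset, and it restates every outlierDetection field on the subset on the assumption that a subset-level block replaces the top-level block, not merging with it field by field. The threshold values are illustrative only.

```yaml
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: checkout
  namespace: production
spec:
  host: checkout
  trafficPolicy:
    outlierDetection:
      consecutiveGatewayErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
      trafficPolicy:
        outlierDetection:
          consecutiveGatewayErrors: 3   # eject a misbehaving canary pod sooner (assumed value)
          interval: 10s
          baseEjectionTime: 30s
          maxEjectionPercent: 50        # intended to allow one of the two canary pods to be ejected
```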

Canary Deployments with Traffic Splitting

A canary releases a new version to a controlled percentage of traffic before a full rollout. Istio's weighted routing makes this trivially precise — unlike a Kubernetes Deployment rollout that can only approximate percentages via pod replica ratios (10% requires a 9:1 replica ratio, meaning 10 pods minimum).

Production canary workflow:

  1. Deploy v2 as a separate Deployment with version: v2 labels. Keep the replica count at 1–2 for the canary — Istio controls percentages, not pod counts.
  2. Update the DestinationRule to declare the v2 subset.
  3. Update the VirtualService to split traffic: start at 1–5%, monitor error rate and p99 latency, then advance in steps (10%, 25%, 50%, 100%).
  4. At 100%, delete the v1 Deployment and remove the weight split from the VirtualService.
    # Automate canary progression with kubectl patch (or GitOps: update the manifest in git)
    # Step: advance canary from 10% to 25%
    kubectl patch virtualservice checkout -n production --type=json \
      -p='[
        {"op":"replace","path":"/spec/http/1/route/0/weight","value":75},
        {"op":"replace","path":"/spec/http/1/route/1/weight","value":25}
      ]'

    # Verify the live weights via istioctl
    istioctl proxy-config routes deploy/checkout-v2 -n production \
      --name 80 \
      -o json | jq '.[].virtualHosts[].routes[].route.weightedClusters'
Keep weights summing to 100. Recent Istio releases treat weights as relative proportions (each destination receives its weight divided by the sum of all weights), while older releases rejected routes whose weights did not total 100. A total of 100 behaves the same under both rules and keeps each weight readable as a percentage. Always validate before applying: istioctl analyze -f virtualservice-checkout.yaml runs Istio's configuration analyzers against the file and catches many common configuration errors before they reach the cluster.

Header-Based Routing for Dark Launches

Weight-based canaries expose the new version to real users proportionally. Header-based routing is a complementary strategy: a specific request header (set by internal tooling, a feature flag SDK, or a cookie) routes the bearer to v2 while 100% of normal traffic stays on v1. This is called a dark launch — the version is technically in production but invisible to ordinary users.

This pattern is reportedly used at companies such as Netflix and Uber for multi-region validation: internal employees hitting production from a corporate network get a forwarded header that routes them onto canary versions of many services simultaneously, exercising the new code with real production requests without risk to end users.

    # Header-based routing: route employees to v2 via cookie
    # (an alternative to having the ingress or API gateway inject
    # x-canary-user: "true" for known employee IPs)
    - match:
        - headers:
            cookie:
              regex: ".*employee_beta=1.*"
      route:
        - destination:
            host: checkout
            subset: v2
    # Mirror traffic to v2 for passive testing (no user impact)
    - route:
        - destination:
            host: checkout
            subset: v1
          weight: 100
      mirror:
        host: checkout
        subset: v2
      mirrorPercentage:
        value: 100.0  # Send a copy of ALL v1 traffic to v2 in the background

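The mirror field sends a fire-and-forget copy of matched requests to the shadow destination. Responses are discarded, so users see only v1 responses. Mirrored requests arrive with -shadow appended to the Host/Authority header, which lets v2 tell them apart from live traffic. They are still real requests, though: any writes they trigger on v2 still happen, so mirror only read paths or point v2 at isolated data stores. This lets you validate v2 against real production traffic shapes (bursty patterns, unusual payloads, edge-case headers) before shifting any real user to it.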
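Mirroring every request doubles the load generated for the checkout service, so a lower sampling rate is often a gentler starting point. The sketch below is a variant of the checkout VirtualService that mirrors roughly one request in ten; the 10.0 value is an arbitrary illustration, and the resource is meant to stand in for the existing one, not to be applied alongside it.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: checkout
  namespace: production
spec:
  hosts:
    - checkout
  http:
    - route:
        # All user-visible responses still come from v1.
        - destination:
            host: checkout
            subset: v1
          weight: 100
      # Copies of matched requests go to v2; responses are discarded.
      mirror:
        host: checkout
        subset: v2
      mirrorPercentage:
        value: 10.0   # mirror about 10% of requests (illustrative value)
```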

Operational Validation

After applying any VirtualService or DestinationRule change, validate the configuration has propagated to all Envoy sidecars before calling the rollout complete:

    # Check that istiod has distributed the latest config to all proxies in the namespace
    istioctl proxy-status -n production

    # If a proxy shows STALE, inspect its xDS config directly
    istioctl proxy-config cluster deploy/frontend-v1 -n production | grep checkout

    # Run the mesh-level analyzer against the whole namespace
    istioctl analyze -n production

    # Confirm actual routing weights from Envoy's perspective (not just the custom resource)
    istioctl proxy-config routes deploy/frontend-v1 -n production \
      --name 80 \
      -o json | jq '.[] | select(.name=="80") | .virtualHosts[] | select(.name | contains("checkout"))'
Applying the resource is not proof the routing changed. kubectl apply succeeding only means the API server accepted the resource. Istiod must then translate it to xDS and push it to every Envoy proxy. At high pod counts (1,000+) this propagation can take 5–30 seconds. istioctl proxy-status shows the sync state; wait for all proxies to show SYNCED before running smoke tests.

Production Failure Modes

Common production incidents caused by VirtualService and DestinationRule misconfiguration:

  • Missing DestinationRule subset: VirtualService references subset: v2 but DestinationRule has no matching entry, so Envoy has no endpoints for the route and requests fail, typically with HTTP 503. Run istioctl analyze in the CD pipeline before merging.
  • Namespace scope mismatch: VirtualService in namespace A targeting a service in namespace B requires the FQDN (checkout.production.svc.cluster.local), not the short name. Short names resolve within the VirtualService's own namespace only.
  • Retry storms: Aggressive retries on a degraded downstream multiply QPS by up to one plus the retry count. A service handling 10,000 RPS with 3 retries on 5xx can generate up to 40,000 RPS to the failing downstream (each original request plus three retries) when every attempt fails. Pair retries with circuit breaking and exponential backoff, and never set retryOn: 5xx on non-idempotent endpoints.
  • Weight-based split during pod restarts: If the v2 Deployment has zero ready pods and traffic still has weight > 0 going to its subset, every request in that percentage hits a 503. Always ensure the target Deployment is healthy before incrementing its weight.
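To illustrate the namespace scope point, here is a minimal sketch of a VirtualService that lives in a different namespace from the service it routes to. The storefront namespace and the resource name are hypothetical. The sketch assumes that the DestinationRule declaring subset v1 is exported from the production namespace and that no other VirtualService for this host is visible to storefront workloads.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: checkout-from-storefront   # hypothetical name
  namespace: storefront            # hypothetical caller namespace
spec:
  # Limit this rule to sidecars in the storefront namespace.
  exportTo:
    - "."
  hosts:
    # The short name "checkout" would resolve to
    # checkout.storefront.svc.cluster.local here, so use the FQDN.
    - checkout.production.svc.cluster.local
  http:
    - route:
        - destination:
            host: checkout.production.svc.cluster.local
            subset: v1
      timeout: 5s
```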