Service Mesh: Istio & Linkerd

Traffic Management

Traffic management is the core value proposition of a service mesh. Without a mesh, traffic routing lives inside your application code or in coarse-grained load-balancer rules — both are inflexible and operationally dangerous at scale. Istio externalises every routing decision into two Kubernetes custom resources: VirtualService and DestinationRule. Understanding the boundary between them, and how they compose, is the prerequisite to doing anything sophisticated in the data plane — canaries, blue/green, circuit breaking, fault injection, and header-based routing all depend on this pair.

The Two-Resource Model

It helps to think of these resources as two distinct layers of abstraction:

VirtualService — the routing layer. Answers the question "where does this request go?" It matches on HTTP method, URI prefix, headers, source namespace, or query parameters, and forwards matched traffic to one or more destinations. A destination references a Kubernetes service plus an optional subset label.
DestinationRule — the policy layer. Answers the question "how should traffic behave once it reaches this destination?" It defines subsets (pod label selectors that group versions of a workload), load-balancing algorithms, connection pool limits, and outlier ejection (circuit breaking). The subset names referenced by a VirtualService must be declared in the DestinationRule for that host.

Key coupling to understand: A VirtualService route that names a subset (e.g. subset: v2) fails if no DestinationRule declares that subset for the same host. Envoy has no cluster to forward to, so requests matching that route are rejected with HTTP 503 rather than falling back to another version. This is one of the most common misconfigurations when teams first adopt Istio.

Traffic flows from the client sidecar through the VirtualService routing layer, then through DestinationRule subsets to the correct pod version.

VirtualService: Routing Rules in Depth

A VirtualService applies to traffic destined for a given host (which maps to a Kubernetes service name). The http array is evaluated top-to-bottom; the first matching rule wins. This ordering matters — put more specific matches (header-based canary) before weight-based catch-alls, or your specific rules will never fire.

# virtualservice-checkout.yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: checkout
  namespace: production
spec:
  hosts:
    - checkout                # Kubernetes service name (short form within namespace)
  http:
    # Rule 1: internal testers get v2 always (header-based canary)
    - match:
        - headers:
            x-canary-user:
              exact: "true"
      route:
        - destination:
            host: checkout
            subset: v2
          weight: 100

    # Rule 2: 10% of remaining traffic to v2 (weighted canary)
    - route:
        - destination:
            host: checkout
            subset: v1
          weight: 90
        - destination:
            host: checkout
            subset: v2
          weight: 10
      timeout: 5s
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: 5xx,reset,connect-failure,retriable-4xx

Key fields to know in production:

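timeout: overall timeout for the request, including all retries. Current Istio releases leave HTTP timeouts disabled by default (early releases defaulted to 15s), so set this explicitly; relying on the default leads to silent latency budget overruns when a downstream service degrades.
retries.retryOn: comma-separated Envoy retry conditions. 5xx alone is dangerous if your POST endpoints are not idempotent, because the upstream may have processed the request before failing. Prefer connect-failure,refused-stream,retriable-4xx for mutation paths, and add reset only when the handler is idempotent, since a reset can arrive after the upstream has started processing the request.
match.sourceLabels: route based on the caller's pod labels, not just the request headers. Useful when a batch job and the user-facing API call the same service but need different routing policies.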
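
As a rough illustration of how these fields combine, the sketch below routes by caller and by HTTP method. It is a standalone alternative to the manifest above, not something to apply alongside it. The app: batch-reconciler label and every timeout and attempt value are assumptions chosen for illustration.

```yaml
# Sketch only: the batch-reconciler label and all timeout values are assumptions
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: checkout
  namespace: production
spec:
  hosts:
    - checkout
  http:
    # Calls from the (hypothetical) batch job: longer budget, retries disabled
    - match:
        - sourceLabels:
            app: batch-reconciler
      route:
        - destination:
            host: checkout
            subset: v1
      timeout: 30s
      retries:
        attempts: 0
    # Mutating requests: retry only failures that occur before the app sees the request
    - match:
        - method:
            exact: POST
      route:
        - destination:
            host: checkout
            subset: v1
      timeout: 5s
      retries:
        attempts: 2
        perTryTimeout: 2s
        retryOn: connect-failure,refused-stream
    # Remaining requests (assumed here to be idempotent reads) may also retry on 5xx
    - route:
        - destination:
            host: checkout
            subset: v1
      timeout: 5s
      retries:
        attempts: 3
        perTryTimeout: 1s
        retryOn: 5xx,connect-failure,refused-stream
```

Because the http array is first-match-wins, the caller-specific rule sits first and the catch-all sits last.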

DestinationRule: Subsets and Load Balancing

The DestinationRule for the same checkout service declares the subsets the VirtualService referenced, and sets per-subset (or global) policies:

# destinationrule-checkout.yaml
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: checkout
  namespace: production
spec:
  host: checkout
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST       # Replaces the deprecated LEAST_CONN; favours less-loaded pods
    connectionPool:
      http:
        http2MaxRequests: 1000
        http1MaxPendingRequests: 100
      tcp:
        maxConnections: 200
    outlierDetection:             # Passive circuit breaking: eject unhealthy pods
      consecutiveGatewayErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50      # Never eject more than half the subset, to prevent cascades
  subsets:
    - name: v1
      labels:
        version: v1
      trafficPolicy:
        loadBalancer:
          simple: ROUND_ROBIN     # Override global policy for v1 specifically
    - name: v2
      labels:
        version: v2
      # Inherits global LEAST_REQUEST

Set maxEjectionPercent explicitly. The default is 10%, which is reasonable for large pools but leaves almost no headroom in a small subset: 10% of 3 pods is less than one pod, and whether Envoy still ejects a single host in that case depends on the Envoy version. A common choice is 50% for critical services, reserving 100% for stateless read replicas where partial failure is acceptable.

Canary Deployments with Traffic Splitting

A canary releases a new version to a controlled percentage of traffic before a full rollout. Istio's weighted routing makes this trivially precise — unlike a Kubernetes Deployment rollout that can only approximate percentages via pod replica ratios (10% requires a 9:1 replica ratio, meaning 10 pods minimum).

Production canary workflow:

Deploy v2 as a separate Deployment with version: v2 labels. Keep the replica count at 1–2 for the canary — Istio controls percentages, not pod counts.
Update the DestinationRule to declare the v2 subset.
Update the VirtualService to split traffic: start at 1–5%, monitor error rate and p99 latency, then advance in steps (10%, 25%, 50%, 100%).
At 100%, remove the weight split from the VirtualService so all traffic routes to v2, then delete the v1 Deployment once you no longer need it for rollback.
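
The first step of this workflow might look like the following sketch. The image reference, container port, and health endpoint are placeholders, and it assumes the checkout Service selects pods on app: checkout only, so that both versions back the same service.

```yaml
# Sketch only: image, port, and probe path are placeholders
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-v2
  namespace: production
spec:
  replicas: 2                  # Small on purpose; Istio weights set the traffic share
  selector:
    matchLabels:
      app: checkout
      version: v2
  template:
    metadata:
      labels:
        app: checkout          # Matched by the Service selector (assumed to be app: checkout)
        version: v2            # Matched by the v2 subset in the DestinationRule
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:2.0.0   # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:      # Keeps unready pods out of the v2 subset's endpoints
            httpGet:
              path: /healthz   # hypothetical health endpoint
              port: 8080
```

The version label is what ties the pod to the subset, so a typo there produces the same empty-subset failure as a missing DestinationRule entry.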

# Automate canary progression with kubectl patch (or GitOps — update the manifest in git)
# Step: advance canary from 10% to 25%
kubectl patch virtualservice checkout -n production --type=json \
  -p='[
    {"op":"replace","path":"/spec/http/1/route/0/weight","value":75},
    {"op":"replace","path":"/spec/http/1/route/1/weight","value":25}
  ]'

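# Verify the live weights via istioctl, from a client proxy that calls checkout
# (route name 80 assumes the checkout service listens on port 80)
istioctl proxy-config routes deploy/frontend-v1 -n production \
  --name 80 \
  -o json | jq '.[].virtualHosts[].routes[].route.weightedClusters | select(. != null)'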

Keep weights summing to 100. Older Istio releases reject a VirtualService at admission when the weights on a route do not total 100, while recent releases treat each weight as a relative proportion (weight divided by the sum of all weights on the route). Totalling 100 behaves the same on both and keeps the numbers readable as percentages. Always validate before applying: istioctl analyze -f virtualservice-checkout.yaml catches many common configuration errors, such as a route to an undeclared subset, before they reach the cluster.

Header-Based Routing for Dark Launches

Weight-based canaries expose the new version to real users proportionally. Header-based routing is a complementary strategy: a specific request header (set by internal tooling, a feature flag SDK, or a cookie) routes the bearer to v2 while 100% of normal traffic stays on v1. This is called a dark launch — the version is technically in production but invisible to ordinary users.

At Netflix and Uber this pattern is used for multi-region validation: internal employees hitting production from a corporate network get a forwarded header that routes them to canary versions of hundreds of services simultaneously, exercising the new code against real production dependencies without risk to end users.

# Header-based routing: route employees to v2 via cookie
# Assumes internal tooling sets the employee_beta=1 cookie for employees
- match:
    - headers:
        cookie:
          # Anchored so that employee_beta=10 does not match
          regex: '(^|.*;\s*)employee_beta=1(;.*|$)'
  route:
    - destination:
        host: checkout
        subset: v2

# Mirror traffic to v2 for passive testing (no user impact)
- route:
    - destination:
        host: checkout
        subset: v1
      weight: 100
  mirror:
    host: checkout
    subset: v2
  mirrorPercentage:
    value: 100.0    # Send a copy of ALL v1 traffic to v2 in the background

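The mirror field sends a fire-and-forget copy of matched requests to the shadow destination. Responses are discarded, so users see only v1 responses, and mirrored requests carry a Host/Authority header with -shadow appended so v2 can tell them apart. This lets you validate v2 against real production traffic shapes (bursty patterns, unusual payloads, edge-case headers) before shifting any real user to it. Mirrored requests still execute in v2, so make sure writes with side effects (payments, emails, inventory changes) are stubbed or pointed at isolated backends before mirroring a service like checkout.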

Operational Validation

After applying any VirtualService or DestinationRule change, validate the configuration has propagated to all Envoy sidecars before calling the rollout complete:

# Check that istiod has distributed the latest config to all proxies in the namespace
istioctl proxy-status -n production

# If a proxy shows STALE, inspect its xDS config directly
istioctl proxy-config cluster deploy/frontend-v1 -n production | grep checkout

# Run the mesh-level analyzer against the whole namespace
istioctl analyze -n production

# Confirm actual routing weights from Envoy's perspective (not just the CRD)
istioctl proxy-config routes deploy/frontend-v1 -n production \
  --name 80 \
  -o json | jq '.[] | select(.name=="80") | .virtualHosts[] | select(.name | contains("checkout"))'

The CRD being applied is not proof the routing changed. kubectl apply succeeding only means the API server accepted the resource. Istiod must then translate it to xDS and push it to every Envoy proxy. At high pod counts (1,000+) this propagation can take 5–30 seconds. istioctl proxy-status shows the sync state — wait for all proxies to show SYNCED before running smoke tests.

Production Failure Modes

The most common production incidents caused by VirtualService and DestinationRule misconfiguration, ranked by frequency in public post-mortems:

Missing DestinationRule subset: VirtualService references subset: v2 but DestinationRule has no matching entry, so requests on that route fail with HTTP 503. Run istioctl analyze in the CD pipeline before merging.
Namespace scope mismatch: VirtualService in namespace A targeting a service in namespace B requires the FQDN (checkout.production.svc.cluster.local), not the short name. Short names resolve within the VirtualService's own namespace only.
Retry storms: Aggressive retries on a degraded downstream multiply QPS by up to one plus the retry count. A service handling 10,000 RPS with 3 retries on 5xx can generate up to 40,000 RPS (the original attempt plus three retries each) against the failing downstream. Pair retries with circuit breaking and exponential backoff, and never set retryOn: 5xx on non-idempotent endpoints.
Weight-based split during pod restarts: If the v2 Deployment has zero ready pods and traffic still has weight > 0 going to its subset, every request in that percentage hits a 503. Always ensure the target Deployment is healthy before incrementing its weight.
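
To make the namespace-scope and retry-storm items concrete, here is a minimal sketch of a VirtualService that lives outside the service's namespace. The storefront namespace, the resource name, and the timeout and retry values are hypothetical. It assumes the v1 subset is declared by the DestinationRule in production, and that you want the rule to affect only callers in its own namespace.

```yaml
# Sketch only: the "storefront" namespace, resource name, and values are hypothetical
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: checkout-from-storefront
  namespace: storefront
spec:
  exportTo:
    - "."                                      # Apply only to callers in this namespace
  hosts:
    - checkout.production.svc.cluster.local    # FQDN; the short name would resolve inside "storefront"
  http:
    - route:
        - destination:
            host: checkout.production.svc.cluster.local
            subset: v1                         # Must be declared in a DestinationRule for this host
      timeout: 3s
      retries:
        attempts: 1                            # Worst case is 2x load on a failing service, not 4x
        perTryTimeout: 1s
        retryOn: connect-failure,refused-stream
```

Capping attempts at 1 bounds the amplification in the retry-storm example at twice the base request rate.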