Service Mesh: Istio & Linkerd

Resilience Policies in the Mesh

One of the original promises of a service mesh was shifting resilience logic out of application libraries and into the infrastructure plane. Before Istio, every service team copy-pasted retry and circuit-breaker code from frameworks like Hystrix, Resilience4j, or Polly — with subtly different configurations, inconsistent timeouts, and no unified observability. The mesh makes these policies declarative, uniform, and observable across every service in the cluster without a single line of application code changing.

But that same uniformity is dangerous if you configure it carelessly. Misconfigured retries cause request storms that amplify a partial outage into a total one. Circuit breakers tuned too aggressively shed traffic that would have succeeded. This lesson covers the four resilience mechanisms Istio exposes — retries, timeouts, circuit breaking, and outlier detection — and how senior engineers configure them for production workloads at scale.

Retries

Retries belong in the mesh for idempotent, read-heavy paths: GET requests, cached lookups, read replicas. They should almost never be configured on writes without application-level idempotency guarantees (unique request IDs, deduplication tokens). Published guidance from Google's SRE book and the AWS Builders' Library is to keep retry counts low and always cap them with a budget.

Istio retries are configured on a VirtualService. The key fields are attempts (the number of retries allowed on top of the initial request), perTryTimeout (the timeout for each individual try, including the first), and retryOn (which conditions trigger a retry). The retryOn field is critical: the default covers connect-failure, refused-stream, unavailable, and cancelled plus a status-code condition that has varied across Istio releases, so check the API reference for your version and make sure you understand what each condition means in your traffic profile.

# VirtualService: retry policy for a read-heavy catalog service
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: catalog-vs
  namespace: production
spec:
  hosts:
    - catalog.production.svc.cluster.local
  http:
    - route:
        - destination:
            host: catalog.production.svc.cluster.local
            port:
              number: 8080
      timeout: 5s            # total envelope; should exceed (1 + attempts) * perTryTimeout
      retries:
        attempts: 2          # retries on top of the initial try, so 3 tries in total
        perTryTimeout: 1.5s  # (1 + 2) * 1.5s = 4.5s  <  5s envelope
        retryOn: connect-failure,refused-stream,unavailable,503

Retry amplification: With 100 clients each sending 10 RPS, a brief 503 spike causes those clients to retry. Because attempts counts retries on top of the original request, attempts: 3 can momentarily quadruple the RPS seen by the backend, from 1,000 to as much as 4,000, exactly when it is already struggling. Google SRE literature calls this a retry storm. Mitigate it by (1) keeping attempts at 2-3 maximum, (2) pairing retries with exponential backoff and jitter at the client (Envoy inserts only a short jittered delay between retries, so any longer backoff has to come from the application), and (3) setting a retry budget at the load balancer tier so retries never exceed 10% of total requests.
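The amplification arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope model, not a simulation of Envoy. The function names and the failure_ratio and budget parameters are made up for illustration. It assumes attempts counts retries on top of the first try (as the Istio API reference describes the field) and that every retry fails too, which is the worst case.

```python
def worst_case_backend_rps(clients: int, rps_per_client: float,
                           retries: int, failure_ratio: float = 1.0) -> float:
    """Upper bound on backend RPS during a failure spike.

    failure_ratio is the share of first tries that fail; each failed request
    is assumed to use all of its retries (worst case).
    """
    base_rps = clients * rps_per_client
    retry_rps = base_rps * failure_ratio * retries
    return base_rps + retry_rps


def rps_with_retry_budget(clients: int, rps_per_client: float,
                          retries: int, failure_ratio: float = 1.0,
                          budget: float = 0.10) -> float:
    """Same bound, but total retries are capped at `budget` times base traffic."""
    base_rps = clients * rps_per_client
    wanted_retry_rps = base_rps * failure_ratio * retries
    return base_rps + min(wanted_retry_rps, base_rps * budget)


if __name__ == "__main__":
    # 100 clients at 10 RPS each, every first try failing.
    print(worst_case_backend_rps(100, 10, retries=3))
    print(worst_case_backend_rps(100, 10, retries=2))
    print(rps_with_retry_budget(100, 10, retries=3))
```

Under these assumptions the bound is 4,000 RPS with three retries, 3,000 RPS with two, and 1,100 RPS once a 10% retry budget is enforced, which is why the budget matters more than the per-request retry count.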

Timeouts

Timeouts are the most impactful single knob in a distributed system. An absent timeout means a hung upstream can hold a thread, a connection pool slot, and a goroutine indefinitely — cascading resource exhaustion that turns a single slow pod into a cluster-wide incident. The mesh enforces timeouts even when the application forgot them.

The timeout field in VirtualService is the total request timeout including all retry attempts. A common mistake is setting timeout: 3s with attempts: 3 and perTryTimeout: 2s: the tries alone could need up to 8s, so the 3s envelope expires during the second try and the remaining retries never run. Istio does not warn about this. Always verify: total_timeout >= (1 + attempts) * perTryTimeout + overhead.
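A quick way to sanity-check an envelope before applying a VirtualService is to do the arithmetic explicitly. The sketch below is illustrative only: the helper names are hypothetical, it assumes attempts counts retries on top of the initial try and that perTryTimeout applies to every try, and it ignores the short backoff Envoy inserts between retries unless you pass it as overhead_s.

```python
def worst_case_tries_seconds(retries: int, per_try_timeout_s: float,
                             overhead_s: float = 0.0) -> float:
    """Time needed if the first try and every retry all run to their timeout."""
    return (1 + retries) * per_try_timeout_s + overhead_s


def envelope_covers_retries(timeout_s: float, retries: int,
                            per_try_timeout_s: float,
                            overhead_s: float = 0.0) -> bool:
    """True when the total timeout leaves room for every configured try."""
    needed = worst_case_tries_seconds(retries, per_try_timeout_s, overhead_s)
    return timeout_s >= needed


def tries_that_can_start(timeout_s: float, retries: int,
                         per_try_timeout_s: float) -> int:
    """How many tries can begin before the envelope expires (backoff ignored)."""
    started = 0
    elapsed = 0.0
    while started < 1 + retries and elapsed < timeout_s:
        started += 1
        elapsed += per_try_timeout_s
    return started


if __name__ == "__main__":
    # The mistaken config from the text: timeout 3s, attempts 3, perTryTimeout 2s.
    print(worst_case_tries_seconds(3, 2.0))
    print(envelope_covers_retries(3.0, 3, 2.0))
    print(tries_that_can_start(3.0, 3, 2.0))
```

For the mistaken configuration the tries would need 8 seconds, the check fails, and only two of the four possible tries can even start before the envelope expires.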

At production scale, different traffic classes need different timeout profiles. Istio lets you match on headers, URI prefixes, or source labels to apply per-route policies — for example, 500ms for synchronous API calls and 30s for batch export endpoints on the same service.

# Timeout tiers on the same VirtualService using match conditions
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: export-service-vs
spec:
  hosts:
    - export-service
  http:
    - match:
        - uri:
            prefix: /v1/export/batch   # long-running export path
      timeout: 30s
      route:
        - destination:
            host: export-service
            port:
              number: 8080
    - route:                           # default: all other routes
        - destination:
            host: export-service
            port:
              number: 8080
      timeout: 800ms

Circuit Breaking

Circuit breaking is a connection-layer control. Unlike retries and timeouts, which act on individual requests, a circuit breaker limits the concurrency and pending queue depth for a destination. In Istio this is configured via a DestinationRule with a trafficPolicy.connectionPool block.

The key parameters are: http1MaxPendingRequests / http2MaxRequests (queue/concurrent request caps), maxRequestsPerConnection (forces reconnect, avoids stale long-lived connections), and maxRetries (caps retry concurrency at the connection-pool level, independent of the VirtualService retry count).

# DestinationRule: circuit breaker for the payments service
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-dr
  namespace: production
spec:
  host: payments.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # max HTTP/1.1 and TCP connections to the destination, per client proxy
        connectTimeout: 250ms
        tcpKeepalive:
          time: 300s
          interval: 75s
      http:
        http2MaxRequests: 500        # max concurrent requests (HTTP/1.1 and HTTP/2, despite the name)
        http1MaxPendingRequests: 50  # queue depth before 503
        maxRequestsPerConnection: 25 # recycle connections; avoids stale long-lived connections
        maxRetries: 10               # pool-level retry cap
    outlierDetection:                # companion to circuit breaking (see below)
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

What "circuit open" means in Envoy: Unlike Hystrix, Envoy/Istio has no discrete CLOSED/OPEN/HALF-OPEN state machine. Instead, when http1MaxPendingRequests is exceeded, Envoy immediately returns a 503 with the UO (upstream overflow) response flag for new requests without forwarding them. You see this in metrics as upstream_rq_pending_overflow. The "circuit" closes again automatically as pending requests drain; there is no explicit state transition to monitor.
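The counter-based behavior described above can be illustrated with a toy model. This is not Envoy's implementation: PendingLimiter and its methods are hypothetical names, and the model tracks only a single pending-request cap. It shows that requests are rejected purely because the cap is reached and are admitted again as soon as the queue drains.

```python
class PendingLimiter:
    """Toy pending-request cap: no OPEN/HALF-OPEN state, just a counter."""

    def __init__(self, max_pending: int) -> None:
        self.max_pending = max_pending
        self.pending = 0
        self.overflow = 0  # plays the role of upstream_rq_pending_overflow

    def admit(self) -> bool:
        """Queue a request, or reject it immediately if the cap is reached."""
        if self.pending >= self.max_pending:
            self.overflow += 1
            return False
        self.pending += 1
        return True

    def complete(self, n: int = 1) -> None:
        """Mark n queued requests as finished, freeing capacity."""
        self.pending = max(0, self.pending - n)


if __name__ == "__main__":
    limiter = PendingLimiter(max_pending=2)
    print([limiter.admit() for _ in range(3)])  # third request overflows
    limiter.complete()                          # one request drains
    print(limiter.admit())                      # capacity is available again
    print(limiter.overflow)
```

The only state is the current queue depth, which is why the overflow counter is the thing to monitor rather than a state transition.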

Outlier Detection

Outlier detection is the mesh equivalent of passive health checking. It watches actual traffic responses from each individual upstream pod and ejects hosts that exhibit anomalous behavior from the load-balancing pool for a cooling-off period. This is fundamentally different from circuit breaking: circuit breaking limits traffic to the whole service, while outlier detection ejects individual bad pods from the subset.

This matters enormously in large fleets. A Deployment with 50 replicas where 2 pods have a memory leak causing them to return 5xx on every 3rd request sends 4% of traffic to a bad pod and fails about 1.3% of all requests (2/50 × 1/3): not enough for an alert, but enough to violate a tight SLO. Outlier detection can eject those two pods while the other 48 serve clean traffic, provided the failures arrive in runs long enough to reach the consecutive-error threshold, since each success resets the counter. The mesh does the work that would otherwise require a human to correlate per-pod error rates in Prometheus and manually delete the bad pods.
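Two pieces of this reasoning are easy to check numerically: the fleet-wide error ratio, and whether a given failure pattern would ever reach a consecutive-error threshold. The sketch below uses hypothetical helper names and assumes traffic is spread evenly across pods and that any non-5xx response resets the consecutive counter.

```python
from typing import Iterable


def fleet_error_ratio(total_pods: int, bad_pods: int,
                      bad_pod_failure_ratio: float) -> float:
    """Share of all requests that fail, with traffic spread evenly over pods."""
    return (bad_pods / total_pods) * bad_pod_failure_ratio


def trips_consecutive_threshold(responses: Iterable[int], threshold: int) -> bool:
    """True if the status codes ever contain `threshold` 5xx responses in a row."""
    streak = 0
    for status in responses:
        if 500 <= status <= 599:
            streak += 1
            if streak >= threshold:
                return True
        else:
            streak = 0  # any success resets the counter
    return False


if __name__ == "__main__":
    # 2 bad pods out of 50, each failing one request in three.
    print(round(fleet_error_ratio(50, 2, 1 / 3), 4))

    every_third = [200, 200, 503] * 100   # strictly periodic failures
    burst = [200] + [503] * 5             # five failures back to back
    print(trips_consecutive_threshold(every_third, threshold=5))
    print(trips_consecutive_threshold(burst, threshold=5))
```

A pod that fails strictly every third request never accumulates five consecutive errors, while a short burst does. That is worth keeping in mind when picking consecutive5xxErrors for intermittent failure modes.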

Figure: Outlier detection ejects Pod D after 5 consecutive 5xx errors; healthy pods absorb all traffic; Pod D returns to the pool after the base ejection window expires.

The critical parameter interaction to understand is maxEjectionPercent. This is a safety cap: if your detection thresholds are too tight and start ejecting healthy pods under a load spike (which also generates transient errors), Istio will not eject more than this percentage of the pool. Set it too high (100%) and you risk ejecting the entire fleet. A production-safe value is 50% for most services; lower it to 25-30% for stateful or singleton services.

# Tuned outlier detection for a stateful service (database proxy, cache layer)
# More conservative: longer intervals, higher consecutive threshold
outlierDetection:
  consecutive5xxErrors: 10          # needs sustained failures, not a blip
  consecutiveGatewayErrors: 5       # L4/L7 gateway errors (502, 503, 504)
  interval: 30s                     # evaluation window
  baseEjectionTime: 60s             # minimum ejection duration
  maxEjectionPercent: 25            # never eject more than 1 in 4 hosts
  minHealthPercent: 50              # below 50% healthy hosts, outlier detection is disabled and all hosts receive traffic
  splitExternalLocalOriginErrors: true  # differentiate local-origin vs upstream errors

Combining circuit breaking + outlier detection: These two mechanisms are complementary, not redundant. Configure them together on every DestinationRule for critical services. Circuit breaking (connectionPool) protects against concurrency overload of the service as a whole. Outlier detection protects against individual bad instances poisoning your success rate. Both appearing in the same DestinationRule block, as shown in the payments example above, is the standard production pattern at big-tech shops running Istio at scale.

Observing Resilience Policy Behavior

Policies configured but never verified are a liability. After every change to retry/timeout/circuit-breaker config, validate behavior through Envoy's admin API and Prometheus metrics. The most useful signals are:

envoy_cluster_upstream_rq_pending_overflow — requests dropped by circuit breaker (connection pool overflow).
envoy_cluster_outlier_detection_ejections_active — currently ejected host count; alert if this is non-zero for more than 2 minutes.
envoy_cluster_upstream_rq_retry and envoy_cluster_upstream_rq_retry_success — retry volume and success rate. A retry success rate below 60% means you are retrying failures that are not transient.
istio_requests_total{response_flags="UO"} — upstream overflow; direct signal that circuit breaking is actively shedding traffic.

# Check Envoy circuit-breaker stats on the sidecar of a payments client
# (the limits are enforced by the caller's proxy; no port-forward needed if you have kubectl exec)
kubectl exec -n production deploy/frontend -c istio-proxy -- \
  curl -s http://localhost:15000/clusters | grep payments | grep -E "cx_active|rq_active|rq_pending"

# Check current ejections for a service
kubectl exec -n production deploy/frontend -c istio-proxy -- \
  curl -s http://localhost:15000/clusters | grep payments | grep outlier

# Prometheus query: retry rate over last 5m (as fraction of total requests)
# rate(envoy_cluster_upstream_rq_retry[5m]) / rate(envoy_cluster_upstream_rq_total[5m])
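
The retry thresholds mentioned in this lesson (retries at no more than 10% of requests, retry success rate of at least 60%) can be turned into a small check over counter deltas. The function below is a hypothetical sketch: the inputs stand for increases in the request, retry, and retry-success counters over some window, however you choose to collect them.

```python
def retry_signals(total_requests: float, retries: float, retry_successes: float,
                  budget: float = 0.10, min_success: float = 0.60) -> dict:
    """Summarize retry health from counter increases over one window."""
    retry_ratio = retries / total_requests if total_requests else 0.0
    success_ratio = retry_successes / retries if retries else 0.0
    return {
        "retry_ratio": retry_ratio,
        "retry_success_ratio": success_ratio,
        # Retries are consuming more than the budgeted share of traffic.
        "over_budget": retry_ratio > budget,
        # Most retries fail, so the underlying errors are probably not transient.
        "retrying_non_transient": retries > 0 and success_ratio < min_success,
    }


if __name__ == "__main__":
    print(retry_signals(total_requests=1000, retries=150, retry_successes=60))
    print(retry_signals(total_requests=1000, retries=50, retry_successes=45))
```

The first call represents a window where retries make up 15% of traffic and only 40% of them succeed, so both flags are raised; the second stays within both thresholds.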

At companies running Istio across thousands of services, resilience policies are managed in a central GitOps repository and rendered as VirtualService / DestinationRule templates per service tier, not hand-crafted per team. Default profiles (strict/relaxed/off) exist for service tiers, and teams opt into a stricter profile for payment-critical paths. This prevents the configuration drift that makes debugging production incidents so painful when every service has slightly different retry semantics.
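
As a rough illustration of tier-based rendering, the sketch below builds a DestinationRule as a plain Python dict from a named profile. Everything here is hypothetical: the TIER_PROFILES table, its numeric values, and the render_destination_rule helper are invented for the example and do not reflect any particular company's tooling. Only the DestinationRule field names come from the examples earlier in this lesson.

```python
import copy
import json

# Hypothetical per-tier defaults; real values would live in the GitOps repo.
TIER_PROFILES = {
    "strict": {
        "connectionPool": {
            "http": {"http1MaxPendingRequests": 50, "http2MaxRequests": 500},
        },
        "outlierDetection": {
            "consecutive5xxErrors": 5,
            "interval": "10s",
            "baseEjectionTime": "30s",
            "maxEjectionPercent": 50,
        },
    },
    "relaxed": {
        "outlierDetection": {
            "consecutive5xxErrors": 10,
            "interval": "30s",
            "baseEjectionTime": "60s",
            "maxEjectionPercent": 25,
        },
    },
    "off": None,  # no trafficPolicy is rendered for this tier
}


def render_destination_rule(service: str, namespace: str, tier: str) -> dict:
    """Build a DestinationRule manifest (as a dict) for one service and tier."""
    if tier not in TIER_PROFILES:
        raise ValueError(f"unknown tier: {tier}")
    rule = {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": f"{service}-dr", "namespace": namespace},
        "spec": {"host": f"{service}.{namespace}.svc.cluster.local"},
    }
    profile = TIER_PROFILES[tier]
    if profile is not None:
        # Deep copy so per-service edits never leak back into the shared profile.
        rule["spec"]["trafficPolicy"] = copy.deepcopy(profile)
    return rule


if __name__ == "__main__":
    print(json.dumps(render_destination_rule("payments", "production", "strict"), indent=2))
```

Keeping the profiles in one table means a change to the strict tier is a single reviewed diff rather than an edit repeated across every service's manifest.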