Service Mesh: Istio & Linkerd

Resilience Policies in the Mesh

One of the original promises of a service mesh was shifting resilience logic out of application libraries and into the infrastructure plane. Before Istio, every service team copy-pasted retry and circuit-breaker code from frameworks like Hystrix, Resilience4j, or Polly — with subtly different configurations, inconsistent timeouts, and no unified observability. The mesh makes these policies declarative, uniform, and observable across every service in the cluster without a single line of application code changing.

But that same uniformity is dangerous if you configure it carelessly. Misconfigured retries cause request storms that amplify a partial outage into a total one. Circuit breakers tuned too aggressively shed traffic that would have succeeded. This lesson covers the four resilience mechanisms Istio exposes — retries, timeouts, circuit breaking, and outlier detection — and how senior engineers configure them for production workloads at scale.

Retries

Retries belong in the mesh for idempotent, read-heavy paths: GET requests, cached lookups, read replicas. They should almost never be configured on writes without application-level idempotency guarantees (unique request IDs, deduplication tokens). Published guidance from Google (the SRE book's chapter on handling overload) and Amazon (the Builders' Library article on timeouts, retries, and backoff with jitter) is to be conservative with retry counts and always cap them with a budget.

Istio retries are configured on a VirtualService. The key fields are attempts (the maximum number of retries, on top of the initial try), perTryTimeout (the timeout for each individual try), and retryOn (the conditions that trigger a retry). The retryOn field is critical: the documented default in recent Istio releases is connect-failure,refused-stream,unavailable,cancelled,503 (check the reference for your version), which is reasonable, but you must understand what each condition means in your traffic profile.

```yaml
# VirtualService: retry policy for a read-heavy catalog service
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: catalog-vs
  namespace: production
spec:
  hosts:
    - catalog.production.svc.cluster.local
  http:
    - route:
        - destination:
            host: catalog.production.svc.cluster.local
            port:
              number: 8080
      timeout: 5s            # total envelope across the initial try and all retries
      retries:
        attempts: 3          # up to 3 retries after the initial try (4 tries total)
        perTryTimeout: 1.2s  # 4 * 1.2s = 4.8s < 5s envelope
        retryOn: connect-failure,refused-stream,unavailable,503
```
Retry amplification: With 100 clients each sending 10 RPS, a brief 503 spike causes those clients to retry. Because attempts counts retries on top of the original try, attempts: 3 can momentarily quadruple the RPS seen by the backend, from 1,000 to as much as 4,000, exactly when it is already struggling. Google SRE literature calls this a retry storm. Mitigate it by (1) keeping attempts at 2-3 maximum, (2) pairing mesh retries with exponential backoff and jitter at the client (Envoy only inserts a short jittered backoff between retries, starting around 25ms, which is too brief to relieve an overloaded backend), and (3) setting a retry budget at the load balancer tier so retries never exceed 10% of total requests.
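
The amplification arithmetic can be sketched in a few lines of Python. This is a rough model, not a description of Envoy internals: it assumes every try fails independently with the same probability, that every failure matches retryOn, and that attempts counts retries on top of the initial try (as the Istio field is documented). The function names are illustrative.

```python
def expected_tries(failure_rate: float, attempts: int) -> float:
    """Expected tries per request, counting the initial try.

    Retry k only happens if the previous k tries all failed, so the
    expectation is 1 + p + p^2 + ... + p^attempts.
    """
    return sum(failure_rate ** k for k in range(attempts + 1))


def backend_rps(clients: int, rps_per_client: float,
                failure_rate: float, attempts: int) -> float:
    """Request rate the backend sees once mesh retries are included."""
    return clients * rps_per_client * expected_tries(failure_rate, attempts)


def within_retry_budget(retries: float, total: float, budget: float = 0.10) -> bool:
    """True when retries stay at or below `budget` as a fraction of all requests."""
    return total > 0 and retries / total <= budget


if __name__ == "__main__":
    # 100 clients at 10 RPS each, attempts: 3, at increasing failure rates.
    for p in (0.0, 0.1, 0.5, 1.0):
        print(f"failure_rate={p:.1f} -> {backend_rps(100, 10, p, attempts=3):.0f} RPS")
```

With the numbers from the callout (100 clients at 10 RPS, attempts: 3), the model gives 1,000 RPS when nothing fails and 4,000 RPS when every try fails, which is why a retry budget matters most during a full outage.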

Timeouts

Timeouts are the most impactful single knob in a distributed system. An absent timeout means a hung upstream can hold a thread, a connection pool slot, and a goroutine indefinitely — cascading resource exhaustion that turns a single slow pod into a cluster-wide incident. The mesh enforces timeouts even when the application forgot them.

The timeout field in VirtualService is the total request timeout including all retry attempts. A common mistake is setting timeout: 3s with attempts: 3 and perTryTimeout: 2s: the initial try plus three retries could need 8s, so the math does not add up and the 3s envelope silently cuts the retries short. Always verify: total_timeout >= (attempts + 1) * perTryTimeout + overhead.
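
A small check like the following could run in CI against rendered manifests. It is a sketch under stated assumptions: attempts counts retries after the initial try, and the overhead term is approximated as one backoff interval per retry with a 250ms ceiling (an assumed value based on Envoy's default backoff cap; adjust it for your setup). The function names are hypothetical.

```python
def worst_case_seconds(attempts: int, per_try_timeout: float,
                       max_backoff: float = 0.25) -> float:
    """Upper bound on wall-clock time if every try runs to its timeout.

    tries = initial try + `attempts` retries; one backoff pause precedes
    each retry. `max_backoff` is an assumed ceiling, not a value read
    from the mesh configuration.
    """
    tries = attempts + 1
    return tries * per_try_timeout + attempts * max_backoff


def fits_envelope(timeout: float, attempts: int, per_try_timeout: float,
                  max_backoff: float = 0.25) -> bool:
    """True when the overall timeout can accommodate every configured try."""
    return worst_case_seconds(attempts, per_try_timeout, max_backoff) <= timeout


if __name__ == "__main__":
    # The misconfiguration described above: timeout 3s, attempts 3, perTryTimeout 2s.
    print(worst_case_seconds(3, 2.0), fits_envelope(3.0, 3, 2.0))
    # A configuration with headroom: timeout 6s, attempts 2, perTryTimeout 1.5s.
    print(worst_case_seconds(2, 1.5), fits_envelope(6.0, 2, 1.5))
```

A failing check does not mean the config is rejected by Istio; it means some of the configured retries can never run to completion inside the envelope.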

At production scale, different traffic classes need different timeout profiles. Istio lets you match on headers, URI prefixes, or source labels to apply per-route policies — for example, 500ms for synchronous API calls and 30s for batch export endpoints on the same service.

```yaml
# Timeout tiers on the same VirtualService using match conditions
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: export-service-vs
spec:
  hosts:
    - export-service
  http:
    - match:
        - uri:
            prefix: /v1/export/batch   # long-running export path
      timeout: 30s
      route:
        - destination:
            host: export-service
            port:
              number: 8080
    - route:                           # default: all other routes
        - destination:
            host: export-service
            port:
              number: 8080
      timeout: 800ms
```

Circuit Breaking

Circuit breaking is a connection-layer control. Unlike retries and timeouts, which act on individual requests, a circuit breaker limits the concurrency and pending queue depth for a destination. In Istio this is configured via a DestinationRule with a trafficPolicy.connectionPool block.

The key parameters are: http1MaxPendingRequests / http2MaxRequests (queue/concurrent request caps), maxRequestsPerConnection (forces reconnect, avoids stale long-lived connections), and maxRetries (caps retry concurrency at the connection-pool level, independent of the VirtualService retry count).

```yaml
# DestinationRule: circuit breaker for the payments service
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-dr
  namespace: production
spec:
  host: payments.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100           # L4 connection cap per Envoy worker
        connectTimeout: 250ms
        tcpKeepalive:
          time: 300s
          interval: 75s
      http:
        http2MaxRequests: 500         # concurrent HTTP/2 streams
        http1MaxPendingRequests: 50   # queue depth before 503
        maxRequestsPerConnection: 25  # recycle connections; avoids head-of-line blocking
        maxRetries: 10                # pool-level retry cap
    outlierDetection:                 # companion to circuit breaking (see below)
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```
What "circuit open" means in Envoy: Unlike Hystrix's half-open state machine, Envoy/Istio does not have a discrete CLOSED/OPEN/HALF-OPEN circuit. Instead, when http1MaxPendingRequests is exceeded, Envoy immediately returns 503 with the UO (upstream overflow) response flag for new requests without forwarding them. You see this in metrics as upstream_rq_pending_overflow. The "circuit" closes again automatically as pending requests drain; there is no explicit state transition to monitor.
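
The absence of a state machine can be illustrated with a toy admission model. This is not Envoy's implementation; it is a minimal sketch, with hypothetical names, of a pending-request limit that rejects on overflow and admits again as soon as a slot frees up.

```python
class PendingQueue:
    """Toy model of a pending-request limit such as http1MaxPendingRequests."""

    def __init__(self, max_pending: int) -> None:
        self.max_pending = max_pending
        self.pending = 0
        self.overflow = 0  # plays the role of upstream_rq_pending_overflow

    def offer(self) -> bool:
        """Try to queue one request; False means an immediate local 503."""
        if self.pending >= self.max_pending:
            self.overflow += 1
            return False
        self.pending += 1
        return True

    def complete(self, count: int = 1) -> None:
        """Mark `count` queued requests as finished, freeing their slots."""
        self.pending = max(0, self.pending - count)


if __name__ == "__main__":
    queue = PendingQueue(max_pending=2)
    print([queue.offer() for _ in range(3)])  # third request overflows
    queue.complete()                          # one slot drains
    print(queue.offer(), queue.overflow)      # admitted again, no cool-down
```

There is no timer and no probe request in this model: the only thing that decides whether a request is shed is the queue depth at the moment it arrives.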

Outlier Detection

Outlier detection is the mesh equivalent of passive health checking. It watches actual traffic responses from each individual upstream pod and ejects hosts that exhibit anomalous behavior from the load-balancing pool for a cooling-off period. This is fundamentally different from circuit breaking: circuit breaking limits traffic to the whole service, while outlier detection ejects individual bad pods from the subset.

This matters enormously in large fleets. In a Deployment with 50 replicas, 2 pods with a memory leak that return 5xx on nearly every request will fail roughly 4% of all requests (2 of 50 pods under even load balancing): not enough to trip a coarse alert, but enough to violate an SLO. Outlier detection silently ejects those two pods while the other 48 serve clean traffic. The mesh does the work that would otherwise require a human to correlate per-pod error rates in Prometheus and manually delete or restart the bad pods. Note that the consecutive-error thresholds only trip on unbroken runs of failures: a pod that fails every third request never accumulates five consecutive 5xx responses and stays in the pool.

[Figure: Outlier Detection: Ejecting Unhealthy Pods from Load Balancing Pool. The client's Envoy sidecar runs the outlier detector, which tracks 5xx responses and latency per upstream host. Pods A, B, and C are healthy; Pod D is ejected for 30s and receives no traffic.]
Outlier detection ejects Pod D after 5 consecutive 5xx errors; healthy pods absorb all traffic; Pod D is re-probed after the base ejection window expires.

The critical parameter interaction to understand is maxEjectionPercent. This is a safety cap: if your detection thresholds are too tight and start ejecting healthy pods under a load spike (which also generates transient errors), Istio will not eject beyond this percentage of the pool. Set it too high (100%) and you risk ejecting the entire fleet. A production-safe value is 50% for most services; lower it to 25-30% for stateful or singleton services.

```yaml
# Tuned outlier detection for a stateful service (database proxy, cache layer)
# More conservative: longer intervals, higher consecutive threshold
outlierDetection:
  consecutive5xxErrors: 10              # needs sustained failures, not a blip
  consecutiveGatewayErrors: 5           # gateway errors (502, 503, 504)
  interval: 30s                         # time between ejection analysis sweeps
  baseEjectionTime: 60s                 # minimum ejection duration
  maxEjectionPercent: 25                # never eject more than 1 in 4 hosts
  minHealthPercent: 50                  # below 50% healthy, detection is disabled and all hosts get traffic
  splitExternalLocalOriginErrors: true  # differentiate local-origin vs upstream errors
```
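
How the consecutive-error threshold and the ejection cap interact can be sketched with a simplified model. It is not Envoy's algorithm: it ignores the interval sweep, baseEjectionTime, and re-admission, and it assumes the cap rounds down but always allows at least one ejection. Class and method names are hypothetical.

```python
class OutlierDetector:
    """Simplified consecutive-5xx ejection with a maxEjectionPercent cap."""

    def __init__(self, hosts: list[str], consecutive_5xx: int = 5,
                 max_ejection_percent: int = 50) -> None:
        self.threshold = consecutive_5xx
        # Assumption: round down, but always permit at least one ejection.
        self.max_ejected = max(1, len(hosts) * max_ejection_percent // 100)
        self.streak = {host: 0 for host in hosts}
        self.ejected: set[str] = set()

    def record(self, host: str, status: int) -> bool:
        """Record one response; return True if this response ejects the host."""
        if status >= 500:
            self.streak[host] += 1
        else:
            self.streak[host] = 0  # any success resets the streak
        if (self.streak[host] >= self.threshold
                and host not in self.ejected
                and len(self.ejected) < self.max_ejected):
            self.ejected.add(host)
            return True
        return False


if __name__ == "__main__":
    detector = OutlierDetector(["a", "b", "c", "d"], consecutive_5xx=5,
                               max_ejection_percent=50)
    for host in ("a", "b", "c"):
        for _ in range(5):
            detector.record(host, 503)
    print(sorted(detector.ejected))  # the cap keeps the third failing host in the pool
```

Because a success resets the streak, a host that fails only intermittently is never ejected in this model, which is one reason to watch per-pod error rates even with outlier detection enabled.
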
Combining circuit breaking + outlier detection: These two mechanisms are complementary, not redundant. Configure them together on every DestinationRule for critical services. Circuit breaking (connectionPool) protects against concurrency overload of the service as a whole. Outlier detection protects against individual bad instances poisoning your success rate. Both appearing in the same DestinationRule block, as shown in the payments example above, is the standard production pattern at big-tech shops running Istio at scale.

Observing Resilience Policy Behavior

Policies configured but never verified are a liability. After every change to retry/timeout/circuit-breaker config, validate behavior through Envoy's admin API and Prometheus metrics. The most useful signals are:

  • envoy_cluster_upstream_rq_pending_overflow — requests dropped by circuit breaker (connection pool overflow).
  • envoy_cluster_outlier_detection_ejections_active — currently ejected host count; alert if this is non-zero for more than 2 minutes.
  • envoy_cluster_upstream_rq_retry and envoy_cluster_upstream_rq_retry_success — retry volume and success rate. A retry success rate below 60% means you are retrying failures that are not transient.
  • istio_requests_total{response_flags="UO"} — upstream overflow; direct signal that circuit breaking is actively shedding traffic.
```shell
# Check Envoy circuit-breaker stats on the sidecar of a client that calls payments
# (the limits are enforced on the calling side; no port-forward needed if you have kubectl exec)
kubectl exec -n production deploy/frontend -c istio-proxy -- \
  curl -s http://localhost:15000/clusters | grep payments | grep -E "cx_active|rq_active|rq_pending"

# Check current ejections for a service
kubectl exec -n production deploy/frontend -c istio-proxy -- \
  curl -s http://localhost:15000/clusters | grep payments | grep outlier

# Prometheus query: retry rate over last 5m (as fraction of total requests)
# rate(envoy_cluster_upstream_rq_retry[5m]) / rate(envoy_cluster_upstream_rq_total[5m])
```
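
The thresholds mentioned in this lesson (retries capped at 10% of total requests, retry success rate of at least 60%) can be turned into a simple check over scraped counter rates. This is a sketch with hypothetical names; how the three rates are fetched from Prometheus is left out.

```python
from dataclasses import dataclass


@dataclass
class RetryStats:
    total_requests: float   # e.g. rate of envoy_cluster_upstream_rq_total
    retries: float          # e.g. rate of envoy_cluster_upstream_rq_retry
    retry_successes: float  # e.g. rate of envoy_cluster_upstream_rq_retry_success


def assess(stats: RetryStats, max_retry_ratio: float = 0.10,
           min_retry_success: float = 0.60) -> list[str]:
    """Return human-readable findings; an empty list means nothing to flag."""
    findings = []
    if stats.total_requests > 0 and stats.retries / stats.total_requests > max_retry_ratio:
        findings.append("retry volume exceeds budget")
    if stats.retries > 0 and stats.retry_successes / stats.retries < min_retry_success:
        findings.append("retries mostly fail: errors are probably not transient")
    return findings


if __name__ == "__main__":
    print(assess(RetryStats(total_requests=1000, retries=50, retry_successes=45)))
    print(assess(RetryStats(total_requests=1000, retries=200, retry_successes=60)))
```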

At companies running Istio across thousands of services, resilience policies are managed in a central GitOps repository and rendered as VirtualService / DestinationRule templates per service tier rather than hand-crafted per team. Default profiles (strict/relaxed/off) exist for service tiers, and teams opt into a stricter profile for payment-critical paths. This prevents the configuration drift that makes debugging production incidents so painful when every service has slightly different retry semantics.
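
Rendering a manifest from a tier profile might look like the sketch below. The profile names come from the paragraph above; the values, the naming scheme, and the function are hypothetical. The "off" profile sets attempts: 0 on the assumption that this disables Istio's default retries. The output is JSON, which kubectl accepts alongside YAML.

```python
import json

RETRY_ON = "connect-failure,refused-stream,unavailable,503"

# Hypothetical tier profiles: illustrative values only.
PROFILES = {
    "strict": {"timeout": "2s", "attempts": 2, "perTryTimeout": "600ms"},
    "relaxed": {"timeout": "10s", "attempts": 3, "perTryTimeout": "2s"},
    "off": None,  # no mesh-level timeout, retries disabled
}


def render_virtual_service(service: str, namespace: str, tier: str,
                           port: int = 8080) -> dict:
    """Build a VirtualService manifest for one service from its tier profile."""
    profile = PROFILES[tier]  # an unknown tier raises KeyError on purpose
    host = f"{service}.{namespace}.svc.cluster.local"
    http_route = {
        "route": [{"destination": {"host": host, "port": {"number": port}}}],
    }
    if profile is None:
        http_route["retries"] = {"attempts": 0}
    else:
        http_route["timeout"] = profile["timeout"]
        http_route["retries"] = {
            "attempts": profile["attempts"],
            "perTryTimeout": profile["perTryTimeout"],
            "retryOn": RETRY_ON,
        }
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": f"{service}-vs", "namespace": namespace},
        "spec": {"hosts": [host], "http": [http_route]},
    }


if __name__ == "__main__":
    manifest = render_virtual_service("catalog", "production", "strict")
    print(json.dumps(manifest, indent=2))
```

Keeping the profiles in one table means a change to the strict tier is a single reviewed commit rather than an edit to every team's manifest.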