Service Mesh: Istio & Linkerd

Project: Mesh a Microservices App

This capstone project wires together every concept from the tutorial into one end-to-end walkthrough: installing Istio on a real Kubernetes cluster, onboarding a multi-service application, enforcing mutual TLS across all communication, deploying a canary release with precise weight-based traffic splitting, layering resilience policies, and confirming everything via the observability stack. The goal is a production-grade procedure you can run verbatim or adapt to your own systems.

Target Application: Online Boutique

Google's Online Boutique (microservices-demo) is the canonical mesh test application: eleven polyglot services (Go, Python, C#, Java, Node.js), realistic gRPC and HTTP traffic, and a frontend that exercises every path. We will mesh it, enforce mTLS, then perform a canary rollout of the productcatalogservice from v1 to v2.

All commands assume Istio 1.22, Kubernetes 1.29+, and istioctl on your PATH. A single-node k3s cluster on a 4-core / 8 GB machine is sufficient for this walkthrough. Online Boutique manifests live at github.com/GoogleCloudPlatform/microservices-demo.

Step 1 — Install Istio with a Production-Grade IstioOperator

Use the default profile but override three settings critical for production: structured JSON access logs so your pipeline can parse them without regex, 5% trace sampling (raise to 100% during incidents), and explicit resource requests on istiod so it is not evicted under pressure.

    cat > istio-install.yaml <<EOF
    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    metadata:
      name: production
    spec:
      profile: default
      meshConfig:
        accessLogFile: /dev/stdout
        # JSON encoding emits one structured object per request
        accessLogEncoding: JSON
        accessLogFormat: '{"ts":"%START_TIME%","method":"%REQ(:METHOD)%","path":"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%","status":"%RESPONSE_CODE%","ms":"%DURATION%","upstream":"%UPSTREAM_CLUSTER%","trace":"%REQ(X-B3-TRACEID)%"}'
        defaultConfig:
          tracing:
            sampling: 5.0
      components:
        pilot:
          k8s:
            resources:
              requests: {cpu: 200m, memory: 256Mi}
              limits: {cpu: "1", memory: 512Mi}
    EOF

    istioctl install -f istio-install.yaml -y
    istioctl verify-install
    kubectl get pods -n istio-system

Step 2 — Inject Sidecars: Label the Namespace

Enable automatic sidecar injection on the application namespace, then deploy Online Boutique. Every pod will start with two containers: the app and the Envoy sidecar.

    kubectl create namespace boutique
    kubectl label namespace boutique istio-injection=enabled

    # Deploy Online Boutique (strip out its bundled service mesh config if present)
    kubectl apply -n boutique \
      -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml

    # Confirm 2/2 READY for every pod (app + istio-proxy)
    kubectl get pods -n boutique
    kubectl get pod -n boutique -l app=frontend -o jsonpath='{.items[0].spec.containers[*].name}'
    # Expected output: server istio-proxy
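
The READY check above can be automated. The following is a minimal sketch, assuming the default column layout of kubectl get pods --no-headers (pod name first, READY as ready/total second). The function name all_meshed is hypothetical; it treats any pod with fewer than two containers, or with containers that are not all ready, as a failure.

```shell
#!/usr/bin/env bash
# all_meshed (hypothetical helper): reads `kubectl get pods --no-headers` rows on stdin.
# Succeeds only if at least one row is present and every pod reports N/N ready with N >= 2
# (the application container plus the istio-proxy sidecar).
all_meshed() {
  awk '
    {
      split($2, r, "/")                      # READY column, for example "2/2"
      if (r[1] + 0 != r[2] + 0 || r[2] + 0 < 2) bad = 1
    }
    END { exit (bad || NR == 0) ? 1 : 0 }
  '
}

# Usage against a live cluster (not run here):
#   kubectl get pods -n boutique --no-headers | all_meshed && echo "all pods meshed"
```

Pods that are still starting will fail the check, so in practice it would be retried until the rollout settles.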

Step 3 — Enforce Strict mTLS Across the Namespace

By default Istio runs in PERMISSIVE mode — it accepts both plain-text and mTLS traffic. This is useful during onboarding but must be tightened before you claim the namespace is secure. A namespace-scoped PeerAuthentication resource switches all services to STRICT mode: plain-text connections are rejected at the sidecar, not the application.

Figure: mTLS strict mode, with mutual certificate exchange between the Envoy sidecars of the frontend and catalog pods. istiod acts as the certificate authority (CA); PeerAuthentication is STRICT in namespace boutique.
Strict mTLS: both sidecars present SPIFFE X.509 certificates issued by istiod; plain-text traffic is dropped at the receiving proxy.

    # Lock down the entire namespace to STRICT mTLS
    kubectl apply -n boutique -f - <<EOF
    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: default
      namespace: boutique
    spec:
      mtls:
        mode: STRICT
    EOF

    # Verify: call a service from a pod WITHOUT a sidecar (simulates a rogue pod).
    # The pod runs in the default namespace, which is not labeled for injection.
    # -S makes curl print the error even though -s silences progress output;
    # --rm -i streams the pod's output to your terminal and cleans the pod up afterwards.
    kubectl run test-no-mesh --image=curlimages/curl --restart=Never --rm -i \
      --command -- curl -sS http://productcatalogservice.boutique:3550/
    # Expected: curl: (56) Recv failure: Connection reset by peer

    # Verify the effective mTLS mode from INSIDE the mesh.
    # curl output cannot show this: the sidecars, not the application, perform the TLS handshake.
    FRONTEND_POD=$(kubectl get pod -n boutique -l app=frontend -o jsonpath='{.items[0].metadata.name}')
    istioctl x describe pod "$FRONTEND_POD" -n boutique

Run istioctl x describe pod <pod> -n boutique to inspect how the mesh sees a specific workload. The output reports the effective PeerAuthentication mode along with the DestinationRules and VirtualServices that apply to the pod, which is far faster than reading raw Envoy config. The older istioctl authn tls-check command is not available in Istio 1.22; it was removed in an earlier release.
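
Because a namespace usually passes through PERMISSIVE before it reaches STRICT, it can help to generate both manifests from one template. This is a minimal sketch; the function name peer_authentication_manifest is hypothetical, and it only prints YAML, which you would pipe to kubectl apply yourself.

```shell
#!/usr/bin/env bash
# peer_authentication_manifest (hypothetical helper): prints a namespace-wide
# PeerAuthentication for the given namespace and mTLS mode.
peer_authentication_manifest() {
  local ns="$1" mode="$2"
  case "$mode" in
    PERMISSIVE|STRICT) ;;
    *) echo "mode must be PERMISSIVE or STRICT" >&2; return 2 ;;
  esac
  cat <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: ${ns}
spec:
  mtls:
    mode: ${mode}
EOF
}

# Usage against a live cluster (not run here):
#   peer_authentication_manifest boutique PERMISSIVE | kubectl apply -f -
#   ...confirm traffic flows with sidecars injected, then:
#   peer_authentication_manifest boutique STRICT | kubectl apply -f -
```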

Step 4 — Canary Traffic Split for productcatalogservice

Label the existing Deployment version: v1, then deploy a v2 Deployment (with your new image). The DestinationRule defines two subsets. The VirtualService starts at 95/5 and you shift weight over multiple rollout stages.

    # 1. Patch the existing deployment to carry a version label
    kubectl patch deployment productcatalogservice -n boutique \
      --type=json \
      -p='[{"op":"add","path":"/spec/template/metadata/labels/version","value":"v1"}]'

    # 2. Deploy the v2 variant (same image, env var NEW_FEATURES=true simulates the change)
    kubectl apply -n boutique -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: productcatalogservice-v2
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: productcatalogservice
          version: v2
      template:
        metadata:
          labels:
            app: productcatalogservice
            version: v2
        spec:
          serviceAccountName: productcatalogservice
          containers:
          - name: server
            image: gcr.io/google-samples/microservices-demo/productcatalogservice:v0.10.0
            env:
            - name: NEW_FEATURES
              value: "true"
            ports:
            - containerPort: 3550
            resources:
              requests: {cpu: 100m, memory: 64Mi}
    EOF

    # 3. DestinationRule: declare v1 and v2 subsets
    kubectl apply -n boutique -f - <<EOF
    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: productcatalogservice
      namespace: boutique
    spec:
      host: productcatalogservice
      trafficPolicy:
        connectionPool:
          http:
            http1MaxPendingRequests: 100
            http2MaxRequests: 1000
        outlierDetection:
          consecutive5xxErrors: 3
          interval: 10s
          baseEjectionTime: 30s
          maxEjectionPercent: 50
      subsets:
      - name: v1
        labels:
          version: v1
      - name: v2
        labels:
          version: v2
    EOF

    # 4. VirtualService: start at 95% v1 / 5% v2
    kubectl apply -n boutique -f - <<EOF
    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: productcatalogservice
      namespace: boutique
    spec:
      hosts:
      - productcatalogservice
      http:
      - route:
        - destination:
            host: productcatalogservice
            subset: v1
          weight: 95
        - destination:
            host: productcatalogservice
            subset: v2
          weight: 5
        timeout: 3s
        retries:
          attempts: 2
          perTryTimeout: 1s
          retryOn: gateway-error,connect-failure,retriable-4xx
    EOF
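
Each rollout stage changes only the two route weights, which must always sum to 100. A minimal sketch of a helper that builds the JSON patch for a given v2 weight follows. The name weight_patch is hypothetical, and the paths assume the VirtualService above, where route 0 is v1 and route 1 is v2.

```shell
#!/usr/bin/env bash
# weight_patch (hypothetical helper): prints a JSON patch that sets the v2 weight
# to the given percentage and gives the remainder to v1.
# Assumes route index 0 is v1 and index 1 is v2, as in the VirtualService above.
weight_patch() {
  local v2="$1"
  case "$v2" in
    ''|*[!0-9]*) echo "weight must be an integer from 0 to 100" >&2; return 2 ;;
  esac
  v2=$((10#$v2))   # force base 10 so input such as "08" is not read as octal
  if [ "$v2" -gt 100 ]; then
    echo "weight must be an integer from 0 to 100" >&2
    return 2
  fi
  printf '[{"op":"replace","path":"/spec/http/0/route/0/weight","value":%d},' "$((100 - v2))"
  printf '{"op":"replace","path":"/spec/http/0/route/1/weight","value":%d}]\n' "$v2"
}

# Usage against a live cluster (not run here):
#   kubectl patch virtualservice productcatalogservice -n boutique \
#     --type=json -p "$(weight_patch 20)"
```

Deriving the v1 weight from the v2 weight, rather than passing both, removes the chance of applying a pair that does not total 100.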

Step 5 — Traffic Shift Automation and Rollback

In production, weight changes are driven by CI/CD (Argo Rollouts or Flux with a Kustomize patch) rather than manual kubectl patch commands. For this walkthrough, observe error rate and latency in Grafana or Kiali after each shift before advancing. A simple bash snippet encodes the promotion logic:

    #!/usr/bin/env bash
    # canary-promote.sh: query Prometheus; promote or roll back
    set -euo pipefail

    # In-cluster address; from a workstation, port-forward Prometheus and use localhost instead
    PROM="http://prometheus.istio-system:9090"
    SVC="productcatalogservice"
    NS="boutique"
    WEIGHTS=(5 20 50 80 100)

    error_rate() {
      # Fraction of v2 requests answered with 5xx over the last 2 minutes
      curl -sG "${PROM}/api/v1/query" \
        --data-urlencode "query=sum(rate(istio_requests_total{destination_service_name=\"${SVC}\",destination_version=\"v2\",response_code=~\"5..\"}[2m])) / sum(rate(istio_requests_total{destination_service_name=\"${SVC}\",destination_version=\"v2\"}[2m]))" \
        | python3 -c "import sys,json; d=json.load(sys.stdin); print(float(d['data']['result'][0]['value'][1]) if d['data']['result'] else 0)"
    }

    rollback() {
      echo "ERROR RATE TOO HIGH: rolling back to v1"
      kubectl patch virtualservice productcatalogservice -n "${NS}" \
        --type=json \
        -p='[{"op":"replace","path":"/spec/http/0/route/0/weight","value":100},
             {"op":"replace","path":"/spec/http/0/route/1/weight","value":0}]'
      exit 1
    }

    for WEIGHT in "${WEIGHTS[@]}"; do
      echo "Setting v2 weight to ${WEIGHT}%"
      kubectl patch virtualservice productcatalogservice -n "${NS}" \
        --type=json \
        -p="[{\"op\":\"replace\",\"path\":\"/spec/http/0/route/0/weight\",\"value\":$((100-WEIGHT))},
             {\"op\":\"replace\",\"path\":\"/spec/http/0/route/1/weight\",\"value\":${WEIGHT}}]"
      sleep 120  # observe for 2 minutes
      ERR=$(error_rate)
      echo "v2 error rate: ${ERR}"
      python3 -c "import sys; sys.exit(1) if float('${ERR}') > 0.01 else sys.exit(0)" || rollback
    done

    echo "Canary complete: v2 is serving 100% of traffic"
    # Keep the v1 Deployment running until v2 has held 100% for a full SLO window.
    # A Deployment cannot be renamed by patching metadata.name (the field is immutable);
    # once the window has passed, scale v1 to zero or delete it as a separate step.

Never delete the v1 Deployment until the VirtualService has been at 100% v2 for at least one full SLO window (typically 30 minutes to 1 hour). Deleting v1 early means a rollback requires redeploying v1 first, which loses the fast-rollback property that makes canary deployments safe.
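
The promotion gate reduces to one comparison: roll back when the observed v2 error rate exceeds the budget. The sketch below isolates that decision so it can be checked without a cluster. The name should_rollback is hypothetical, and the default threshold of 0.01 mirrors the 1% used in the script above.

```shell
#!/usr/bin/env bash
# should_rollback (hypothetical helper): exit status 0 when the error rate
# (a fraction such as 0.02) is strictly greater than the threshold.
should_rollback() {
  local rate="$1" threshold="${2:-0.01}"
  awk -v r="$rate" -v t="$threshold" 'BEGIN { exit (r + 0 > t + 0) ? 0 : 1 }'
}

# Example decision for a sampled value; in the script above the value
# would come from the Prometheus query.
if should_rollback 0.02; then
  echo "rollback"
else
  echo "promote"
fi
```

Keeping the comparison in one small function makes the threshold easy to change per service and easy to exercise in a dry run.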

Step 6 — Apply Resilience Policies

mTLS and canary splitting are in place. Now apply the resilience layer: circuit breaking via outlierDetection (already on the DestinationRule above), and a rate-limit for the frontend using an EnvoyFilter with the local rate-limit filter. Also add a per-request timeout on the most latency-sensitive downstream: the checkoutservice.

    # Timeout and retry policy for checkoutservice
    kubectl apply -n boutique -f - <<EOF
    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: checkoutservice
      namespace: boutique
    spec:
      hosts:
      - checkoutservice
      http:
      - route:
        - destination:
            host: checkoutservice
        timeout: 10s
        retries:
          attempts: 1
          perTryTimeout: 8s
          retryOn: reset,connect-failure
    EOF

    # Local rate-limit on the frontend (1000 req/min per frontend pod).
    # The token bucket lives in each sidecar, so the limit is per proxy instance, not per source IP.
    kubectl apply -n boutique -f - <<EOF
    apiVersion: networking.istio.io/v1alpha3
    kind: EnvoyFilter
    metadata:
      name: frontend-ratelimit
      namespace: boutique
    spec:
      workloadSelector:
        labels:
          app: frontend
      configPatches:
      - applyTo: HTTP_FILTER
        match:
          context: SIDECAR_INBOUND
          listener:
            filterChain:
              filter:
                name: envoy.filters.network.http_connection_manager
                subFilter:
                  name: envoy.filters.http.router
        patch:
          operation: INSERT_BEFORE
          value:
            name: envoy.filters.http.local_ratelimit
            typed_config:
              "@type": type.googleapis.com/udpa.type.v1.TypedStruct
              type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
              value:
                stat_prefix: http_local_rate_limiter
                token_bucket:
                  max_tokens: 1000
                  tokens_per_fill: 1000
                  fill_interval: 60s
                filter_enabled:
                  runtime_key: local_rate_limit_enabled
                  default_value: {numerator: 100, denominator: HUNDRED}
                filter_enforced:
                  runtime_key: local_rate_limit_enforced
                  default_value: {numerator: 100, denominator: HUNDRED}
    EOF
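
The token bucket settings translate into an average rate and a burst size. As a rough sketch of the arithmetic (the function name sustained_rps is hypothetical): the bucket holds at most max_tokens and gains tokens_per_fill every fill_interval, so the long-run rate is tokens_per_fill divided by the interval in seconds.

```shell
#!/usr/bin/env bash
# sustained_rps (hypothetical helper): long-run requests per second admitted by a
# token bucket that adds <tokens_per_fill> tokens every <fill_interval_seconds>.
sustained_rps() {
  local tokens_per_fill="$1" fill_interval_seconds="$2"
  awk -v n="$tokens_per_fill" -v s="$fill_interval_seconds" \
    'BEGIN { if (s + 0 <= 0) exit 2; printf "%.2f\n", n / s }'
}

# Values from the EnvoyFilter above: 1000 tokens every 60 seconds.
sustained_rps 1000 60
```

With the values above this prints 16.67, so each proxy admits roughly 17 requests per second on average, with bursts of up to 1000 requests when the bucket is full.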

Step 7 — Validate with Kiali and Prometheus

With traffic flowing and policies applied, open Kiali to confirm the topology is correctly represented and that mTLS padlocks appear on every edge. Then run a synthetic load test with k6 and watch the canary split in Grafana.

    # Kiali, Prometheus, and Grafana are not installed by istioctl install;
    # this step assumes the addons from the Istio release's samples/addons directory are deployed.
    # Open Kiali (port-forward if not using an ingress)
    kubectl port-forward svc/kiali -n istio-system 20001:20001 &
    # Visit http://localhost:20001; every edge should show a lock icon (mTLS)

    # Generate sustained load (k6 minimal script); replace FRONTEND_IP with the frontend's address
    cat > load.js <<EOF
    import http from 'k6/http';
    import { sleep } from 'k6';
    export const options = { vus: 50, duration: '5m' };
    export default function () {
      http.get('http://FRONTEND_IP/');
      sleep(0.1);
    }
    EOF
    k6 run load.js

    # In Prometheus: verify canary traffic split
    # Proportion of requests reaching v2:
    #   sum(rate(istio_requests_total{destination_service_name="productcatalogservice",destination_version="v2"}[1m]))
    #   /
    #   sum(rate(istio_requests_total{destination_service_name="productcatalogservice"}[1m]))
    # Should return ~0.05 at the initial 5% split

    # Check mTLS is enforced (no plain-text traffic on any edge).
    # The label is destination_workload_namespace, and connection_security_policy
    # is only populated on server-side (reporter="destination") metrics:
    #   sum(rate(istio_requests_total{reporter="destination",connection_security_policy="none",destination_workload_namespace="boutique"}[5m]))
    # Should return 0 or an empty result

Pin the Kiali and Grafana dashboards to your incident response runbook. During a canary incident the single most valuable panel is Request Success Rate by Version — it shows v1 and v2 side by side in real time, making the blast radius immediately visible without grepping logs.
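
The split check can also be scripted against the Prometheus HTTP API, whose instant-query responses carry each sample as a [timestamp, "value"] pair. The sketch below parses such a response and compares the value with the expected share. The names prom_scalar and within_tolerance are hypothetical, the 0.02 tolerance is an arbitrary illustration, and the usage lines assume Prometheus is port-forwarded to localhost:9090.

```shell
#!/usr/bin/env bash
# prom_scalar (hypothetical helper): reads a Prometheus instant-query JSON response
# on stdin and prints the first sample value, or 0 when the result set is empty.
prom_scalar() {
  python3 -c '
import json, sys
result = json.load(sys.stdin)["data"]["result"]
print(float(result[0]["value"][1]) if result else 0)
'
}

# within_tolerance (hypothetical helper): succeeds when |observed - expected| <= tolerance.
within_tolerance() {
  awk -v o="$1" -v e="$2" -v t="$3" \
    'BEGIN { d = o - e; if (d < 0) d = -d; exit (d <= t + 0) ? 0 : 1 }'
}

# Usage against a live Prometheus (not run here); QUERY holds the v2 share expression:
#   share=$(curl -sG "http://localhost:9090/api/v1/query" \
#     --data-urlencode "query=${QUERY}" | prom_scalar)
#   within_tolerance "$share" 0.05 0.02 && echo "split looks right"
```

A tolerance is needed because weighted routing is probabilistic: over a short window the observed share drifts around the configured weight.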

Step 8 — AuthorizationPolicy: Zero-Trust East-West

mTLS proves identity; it does not restrict which identities may talk to which services. Add AuthorizationPolicy resources to enforce least-privilege: only the services that genuinely need to call productcatalogservice are allowed to. All other callers receive HTTP 403.

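    # Allow only frontend, recommendationservice, and checkoutservice to reach productcatalogservice
    # (checkoutservice looks up product details when it places an order)
    kubectl apply -n boutique -f - <<EOF
    apiVersion: security.istio.io/v1beta1
    kind: AuthorizationPolicy
    metadata:
      name: productcatalogservice-allow
      namespace: boutique
    spec:
      selector:
        matchLabels:
          app: productcatalogservice
      action: ALLOW
      rules:
      - from:
        - source:
            principals:
            - cluster.local/ns/boutique/sa/frontend
            - cluster.local/ns/boutique/sa/recommendationservice
            - cluster.local/ns/boutique/sa/checkoutservice
    ---
    # Deny-all baseline: an ALLOW policy with no rules matches nothing, so every workload
    # in the namespace rejects requests unless another ALLOW policy admits them.
    # Add ALLOW policies for the remaining services before relying on this.
    apiVersion: security.istio.io/v1beta1
    kind: AuthorizationPolicy
    metadata:
      name: deny-all
      namespace: boutique
    spec: {}   # no selector (all workloads in the namespace), no rules (nothing allowed)
    EOF

    # Test: frontend can reach catalog. Port 3550 speaks gRPC, so a plain HTTP GET may not
    # return a normal 200; any response other than 403 means the policy admitted the call.
    kubectl exec -n boutique deploy/frontend -c server -- \
      wget -qO- http://productcatalogservice:3550/ 2>&1 | head -1

    # Test: cartservice CANNOT reach catalog (should 403)
    kubectl exec -n boutique deploy/cartservice -c server -- \
      wget -qO- http://productcatalogservice:3550/ 2>&1 | head -1
    # Expected: wget: server returned error: HTTP/1.1 403 Forbidden
    # If an image ships without wget, send the same request from a test pod
    # that runs under the relevant service account.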
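
Principals in an AuthorizationPolicy follow the pattern <trust-domain>/ns/<namespace>/sa/<service-account>, written without the spiffe:// prefix. A small sketch for building them consistently follows; spiffe_principal is a hypothetical name, and cluster.local is assumed as the default trust domain.

```shell
#!/usr/bin/env bash
# spiffe_principal (hypothetical helper): prints the principal string that an
# AuthorizationPolicy expects for a service account in a namespace.
spiffe_principal() {
  local ns="$1" sa="$2" trust_domain="${3:-cluster.local}"
  printf '%s/ns/%s/sa/%s\n' "$trust_domain" "$ns" "$sa"
}

# Two of the callers allowed in the policy above.
for sa in frontend recommendationservice; do
  spiffe_principal boutique "$sa"
done
```

A typo in a principal fails closed (the caller is simply denied), so generating the strings from the service account names is a cheap way to avoid confusing 403s.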

Production Lessons from This Project

Applying a service mesh to a real application surfaces several lessons that do not appear in documentation:

  • Inject before you lock down. Switch namespaces to PERMISSIVE first, confirm all traffic flows with sidecars injected, then move to STRICT. Flipping to STRICT before injection is complete silently breaks inter-service calls.
  • DestinationRule must exist before VirtualService references it. Apply the DestinationRule first; otherwise Envoy has no subset to route to and returns 503.
  • Canary rollback must be a one-command operation. Your GitOps pipeline should have a make rollback target that sets VirtualService weights to 100/0 in under 30 seconds. Measure and rehearse this before any canary goes to 5%.
  • AuthorizationPolicy deny-all breaks jobs and init containers. Apply namespace-wide deny-all incrementally, starting with the most critical services, and always check Kubernetes Jobs and CronJobs — they often use service accounts that are absent from the allow rules.
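  • Sidecar resource budget. In large clusters (thousands of pods), sidecar overhead adds up. Treat 50–100 MB of RAM and 0.1–0.2 CPU cores per Envoy sidecar as planning estimates: memory grows with the number of services each proxy must know about, and CPU scales with request rate rather than being consumed at idle. Budget this into your cluster capacity model before enabling mesh-wide injection.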
At this scale — thousands of pods, dozens of services — the mesh config surface becomes its own reliability risk. Adopt istioctl analyze as a mandatory CI step on every mesh config change. It catches missing DestinationRules, conflicting VirtualServices, and invalid AuthorizationPolicies before they reach the cluster.
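
As a sketch of that CI step: istioctl analyze can check manifest files without contacting a cluster when run with --use-kube=false. The wrapper name lint_mesh_config and the ISTIOCTL override variable are hypothetical conveniences, and the mesh/ directory in the usage line is an assumption about repository layout.

```shell
#!/usr/bin/env bash
# lint_mesh_config (hypothetical helper): runs istioctl analyze on local manifest
# files only. ISTIOCTL may be set to point at a specific binary; it defaults to
# the istioctl found on PATH.
lint_mesh_config() {
  if [ "$#" -eq 0 ]; then
    echo "usage: lint_mesh_config <file.yaml>..." >&2
    return 2
  fi
  "${ISTIOCTL:-istioctl}" analyze --use-kube=false "$@"
}

# Usage in a CI job (not run here); a non-zero exit fails the pipeline:
#   lint_mesh_config mesh/*.yaml
```

Analyzing files alone cannot see resources that exist only in the cluster, so some teams also run the analyzer against a staging cluster after the file-based check passes.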

You have now completed the full Service Mesh tutorial: from first principles through architecture, traffic management, mTLS, resilience, observability, Linkerd, operations, and this capstone. The patterns here — sidecar injection, SPIFFE-based identity, declarative traffic routing, and zero-trust authorization — are the baseline for any serious microservices platform in 2025 and beyond.
