Service Mesh: Istio & Linkerd

Project: Mesh a Microservices App

This capstone project wires together every concept from the tutorial into one end-to-end walkthrough: installing Istio on a real Kubernetes cluster, onboarding a multi-service application, enforcing mutual TLS across all communication, deploying a canary release with precise weight-based traffic splitting, layering resilience policies, and confirming everything via the observability stack. The goal is a production-grade procedure you can run verbatim or adapt to your own systems.

Target Application: Online Boutique

Google's Online Boutique (microservices-demo) is the canonical mesh test application: eleven polyglot services (Go, Python, C#, Java, Node.js), realistic gRPC and HTTP traffic, and a frontend that exercises every path. We will mesh it, enforce mTLS, then perform a canary rollout of the productcatalogservice from v1 to v2.

All commands assume Istio 1.22, Kubernetes 1.29+, and istioctl on your PATH. A single-node k3s cluster on a 4-core / 8 GB machine is sufficient for this walkthrough. Online Boutique manifests live at github.com/GoogleCloudPlatform/microservices-demo.

Step 1 — Install Istio with a Production-Grade IstioOperator

Use the default profile but override three settings critical for production: structured JSON access logs so your pipeline can parse them without regex, 5% trace sampling (raise to 100% during incidents), and explicit resource requests on istiod so it is not evicted under pressure.

cat > istio-install.yaml <<EOF
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: production
spec:
  profile: default
  meshConfig:
    accessLogFile: /dev/stdout
    accessLogEncoding: JSON
    accessLogFormat: |
      {"ts":"%START_TIME%","method":"%REQ(:METHOD)%",
       "path":"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
       "status":"%RESPONSE_CODE%","ms":"%DURATION%",
       "upstream":"%UPSTREAM_CLUSTER%",
       "trace":"%REQ(X-B3-TRACEID)%"}
    defaultConfig:
      tracing:
        sampling: 5.0
  components:
    pilot:
      k8s:
        resources:
          requests: {cpu: 200m, memory: 256Mi}
          limits:   {cpu: "1",  memory: 512Mi}
EOF

istioctl install -f istio-install.yaml -y
istioctl verify-install
kubectl get pods -n istio-system
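
Once sidecars are emitting logs, it is worth confirming that each line really parses as JSON and carries the fields declared in the format above. The following is a minimal sketch, not part of Istio: the function name and the sample line are hypothetical, and only the key names come from the manifest.

```python
import json

# Keys taken from the accessLogFormat in the IstioOperator manifest above.
EXPECTED_KEYS = {"ts", "method", "path", "status", "ms", "upstream", "trace"}


def parse_access_log(line: str) -> dict:
    """Parse one sidecar access-log line and check it carries the expected keys."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not a JSON log line: {exc}") from exc
    if not isinstance(record, dict):
        raise ValueError("log line is JSON but not an object")
    missing = EXPECTED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return record


# Hypothetical sample line, shaped like the format above (values are made up).
SAMPLE = (
    '{"ts":"2024-05-01T12:00:00.000Z","method":"POST",'
    '"path":"/hipstershop.ProductCatalogService/ListProducts",'
    '"status":"200","ms":"4","upstream":"inbound|3550||","trace":"-"}'
)

print(parse_access_log(SAMPLE)["method"])  # POST
```

In practice the input would come from something like kubectl logs -n boutique deploy/frontend -c istio-proxy --tail=1, fed to the function one line at a time.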

Step 2 — Inject Sidecars: Label the Namespace

Enable automatic sidecar injection on the application namespace, then deploy Online Boutique. Every pod will start with two containers: the app and the Envoy sidecar.

kubectl create namespace boutique
kubectl label namespace boutique istio-injection=enabled

# Deploy Online Boutique from the upstream release manifest
kubectl apply -n boutique \
  -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml

# Confirm 2/2 READY for every pod (app + istio-proxy)
kubectl get pods -n boutique
kubectl get pod -n boutique -l app=frontend -o jsonpath='{.items[0].spec.containers[*].name}'
# Expected output: server istio-proxy
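
Scanning the READY column by eye gets tedious with eleven services. As a rough sketch, the check can be automated against the JSON form of the pod list (kubectl get pods -n boutique -o json). The function name and the sample pod names below are hypothetical; the container name istio-proxy is the one shown above.

```python
def pods_missing_sidecar(pod_list: dict, sidecar: str = "istio-proxy") -> list:
    """Return the names of pods in a `kubectl get pods -o json` document
    that have no container called `sidecar`."""
    missing = []
    for pod in pod_list.get("items", []):
        spec = pod.get("spec", {})
        names = {c["name"] for c in spec.get("containers", [])}
        # If native sidecars are enabled, the proxy is listed as an init container.
        names |= {c["name"] for c in spec.get("initContainers", [])}
        if sidecar not in names:
            missing.append(pod["metadata"]["name"])
    return missing


# Hypothetical, heavily trimmed pod list for illustration.
sample = {
    "items": [
        {"metadata": {"name": "frontend-abc"},
         "spec": {"containers": [{"name": "server"}, {"name": "istio-proxy"}]}},
        {"metadata": {"name": "redis-cart-xyz"},
         "spec": {"containers": [{"name": "redis"}]}},
    ]
}
print(pods_missing_sidecar(sample))  # ['redis-cart-xyz']
```

Pods that were running before the namespace was labeled show up in this list until they are restarted, because injection happens only at pod creation.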

Step 3 — Enforce Strict mTLS Across the Namespace

By default Istio runs in PERMISSIVE mode — it accepts both plain-text and mTLS traffic. This is useful during onboarding but must be tightened before you claim the namespace is secure. A namespace-scoped PeerAuthentication resource switches all services to STRICT mode: plain-text connections are rejected at the sidecar, not the application.

Strict mTLS: both sidecars present SPIFFE X.509 certificates issued by istiod; plain-text traffic is dropped at the receiving proxy.

# Lock down the entire namespace to STRICT mTLS
kubectl apply -n boutique -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: boutique
spec:
  mtls:
    mode: STRICT
EOF

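# Verify: curl a service from a pod WITHOUT a sidecar (simulates a rogue pod).
# The pod runs in the default namespace, which is assumed to have no injection
# label, so the receiving sidecar should reset the plain-text connection.
# --rm -i attaches to the pod so the curl output is actually shown; -sS keeps errors visible.
kubectl run test-no-mesh --image=curlimages/curl --restart=Never --rm -i \
  --command -- curl -sS http://productcatalogservice.boutique:3550/
# Expected: curl: (56) Recv failure: Connection reset by peer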

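# Verify mTLS from INSIDE the mesh. Do not curl from the istio-proxy container:
# traffic from the proxy's own UID bypasses the sidecar's redirect rules, so it
# would be sent as plain text. Inspect the effective policy on a server pod instead:
istioctl x describe pod -n boutique \
  "$(kubectl get pod -n boutique -l app=productcatalogservice -o jsonpath='{.items[0].metadata.name}')"

The istioctl authn tls-check command that appears in older guides has been removed from istioctl. On Istio 1.22, istioctl x describe pod <pod> -n boutique summarizes the PeerAuthentication, DestinationRule, and VirtualService configuration that applies to a workload, which is far faster than reading raw Envoy config.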
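
The namespace-scoped policy above is one of three levels at which a mode can be set. As a simplified sketch of how the effective mode is resolved (the function names are hypothetical, and only the STRICT and PERMISSIVE modes used in this lesson are modeled): the narrowest policy that sets a mode wins, in the order workload, namespace, mesh, with PERMISSIVE as the fallback.

```python
from typing import Optional


def effective_mtls_mode(workload: Optional[str] = None,
                        namespace: Optional[str] = None,
                        mesh: Optional[str] = None) -> str:
    """Return the mode of the narrowest PeerAuthentication that sets one."""
    for mode in (workload, namespace, mesh):
        if mode is not None and mode != "UNSET":
            return mode
    return "PERMISSIVE"  # Istio's default when nothing is configured


def accepts_connection(mode: str, client_uses_mtls: bool) -> bool:
    """Would the receiving sidecar accept this connection? (STRICT/PERMISSIVE only)"""
    if mode == "STRICT":
        return client_uses_mtls
    if mode == "PERMISSIVE":
        return True
    raise ValueError(f"mode not modeled in this sketch: {mode}")


mode = effective_mtls_mode(namespace="STRICT")
print(mode, accepts_connection(mode, client_uses_mtls=False))  # STRICT False
```

One consequence worth noting: a workload-level PeerAuthentication set to PERMISSIVE overrides the namespace-wide STRICT policy for that workload, so audit workload-scoped policies before declaring the namespace locked down.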

Step 4 — Canary Traffic Split for productcatalogservice

Label the existing Deployment version: v1, then deploy a v2 Deployment (with your new image). The DestinationRule defines two subsets. The VirtualService starts at 95/5 and you shift weight over multiple rollout stages.

# 1. Patch the existing deployment to carry a version label
kubectl patch deployment productcatalogservice -n boutique \
  --type=json \
  -p='[{"op":"add","path":"/spec/template/metadata/labels/version","value":"v1"}]'

# 2. Deploy the v2 variant (same image, env var NEW_FEATURES=true simulates the change)
kubectl apply -n boutique -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: productcatalogservice-v2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: productcatalogservice
      version: v2
  template:
    metadata:
      labels:
        app: productcatalogservice
        version: v2
    spec:
      serviceAccountName: productcatalogservice
      containers:
      - name: server
        image: gcr.io/google-samples/microservices-demo/productcatalogservice:v0.10.0
        env:
        - name: NEW_FEATURES
          value: "true"
        ports:
        - containerPort: 3550
        resources:
          requests: {cpu: 100m, memory: 64Mi}
EOF

# 3. DestinationRule — declare v1 and v2 subsets
kubectl apply -n boutique -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: productcatalogservice
  namespace: boutique
spec:
  host: productcatalogservice
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
EOF

# 4. VirtualService — start at 95% v1 / 5% v2
kubectl apply -n boutique -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: productcatalogservice
  namespace: boutique
spec:
  hosts:
  - productcatalogservice
  http:
  - route:
    - destination:
        host: productcatalogservice
        subset: v1
      weight: 95
    - destination:
        host: productcatalogservice
        subset: v2
      weight: 5
    timeout: 3s
    retries:
      attempts: 2
      perTryTimeout: 1s
      retryOn: gateway-error,connect-failure,retriable-4xx
EOF
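
Weight-based splitting is probabilistic per request, so the observed share only converges on the configured weight as traffic volume grows. The sketch below simulates the 95/5 split with a seeded random generator; it illustrates the statistics only and is not how Envoy selects a destination internally. The function name is hypothetical.

```python
import random
from collections import Counter


def simulate_split(weights: dict, requests: int, seed: int = 0) -> Counter:
    """Pick a subset for each request in proportion to its weight."""
    rng = random.Random(seed)
    subsets = list(weights)
    picks = rng.choices(subsets, weights=[weights[s] for s in subsets], k=requests)
    return Counter(picks)


total = 10_000
counts = simulate_split({"v1": 95, "v2": 5}, requests=total)
print(f"v2 share over {total} requests: {counts['v2'] / total:.3f}")
```

At low request volumes the v2 share can sit well away from 5%, which matters when a short observation window is used to judge the canary's error rate.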

Step 5 — Traffic Shift Automation and Rollback

In production, weight changes are driven by CI/CD (Argo Rollouts or Flux with a Kustomize patch) rather than manual kubectl patch commands. For this walkthrough, observe error rate and latency in Grafana or Kiali after each shift before advancing. A simple bash snippet encodes the promotion logic:

#!/usr/bin/env bash
# canary-promote.sh — query Prometheus; promote or rollback
set -euo pipefail

PROM="http://prometheus.istio-system:9090"
SVC="productcatalogservice"
NS="boutique"
WEIGHTS=(5 20 50 80 100)

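error_rate() {
  # fraction of 5xx from v2 in the last 2 minutes (prints 0 when there is no data);
  # takes no arguments, so nothing is read from $1 (which would abort under set -u)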
  curl -sG "${PROM}/api/v1/query" \
    --data-urlencode "query=sum(rate(istio_requests_total{destination_service_name=\"${SVC}\",destination_version=\"v2\",response_code=~\"5..\"}[2m])) / sum(rate(istio_requests_total{destination_service_name=\"${SVC}\",destination_version=\"v2\"}[2m]))" \
    | python3 -c "import sys,json; d=json.load(sys.stdin); print(float(d['data']['result'][0]['value'][1]) if d['data']['result'] else 0)"
}

rollback() {
  echo "ERROR RATE TOO HIGH — rolling back to v1"
  kubectl patch virtualservice productcatalogservice -n ${NS} \
    --type=json \
    -p='[{"op":"replace","path":"/spec/http/0/route/0/weight","value":100},
         {"op":"replace","path":"/spec/http/0/route/1/weight","value":0}]'
  exit 1
}

for WEIGHT in "${WEIGHTS[@]}"; do
  echo "Setting v2 weight to ${WEIGHT}%"
  kubectl patch virtualservice productcatalogservice -n ${NS} \
    --type=json \
    -p="[{\"op\":\"replace\",\"path\":\"/spec/http/0/route/0/weight\",\"value\":$((100-WEIGHT))},
         {\"op\":\"replace\",\"path\":\"/spec/http/0/route/1/weight\",\"value\":${WEIGHT}}]"
  sleep 120   # observe for 2 minutes
  ERR=$(error_rate)
  echo "v2 error rate: ${ERR}"
  python3 -c "import sys; sys.exit(1) if float('${ERR}') > 0.01 else sys.exit(0)" || rollback
done

echo "Canary complete: v2 is serving 100% of traffic"
# Keep the v1 Deployment in place for fast rollback. Deployment names are
# immutable, so v2 cannot be renamed with a patch; retire v1 later by deleting it
# (or scaling it to zero) once the observation window has passed.

Never delete the v1 Deployment until the VirtualService has been at 100% v2 for at least one full observation window (typically 30 minutes to 1 hour). Deleting v1 early means a rollback requires redeploying v1 and waiting for its pods to become ready, which loses the fast-rollback property that makes canary deployments safe.
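
The promotion script treats an empty Prometheus result as an error rate of 0, and a 0/0 division comes back as NaN, which also passes the threshold check. Both cases mean "no traffic reached v2", not "v2 is healthy". A hedged sketch of a stricter decision step follows; the function names and the "hold" outcome are hypothetical additions, while the response shape is the standard instant-query JSON.

```python
import math
from typing import Optional


def v2_error_rate(response: dict) -> Optional[float]:
    """Extract the scalar from a Prometheus instant-query response.
    Returns None when there is no usable data (empty result or NaN)."""
    result = response.get("data", {}).get("result", [])
    if not result:
        return None
    value = float(result[0]["value"][1])
    return None if math.isnan(value) else value


def decide(rate: Optional[float], threshold: float = 0.01) -> str:
    """Map an error rate to an action; missing data never promotes."""
    if rate is None:
        return "hold"
    return "rollback" if rate > threshold else "promote"


# Hypothetical responses for illustration.
empty = {"status": "success", "data": {"resultType": "vector", "result": []}}
bad = {"status": "success",
       "data": {"resultType": "vector",
                "result": [{"metric": {}, "value": [1714564800, "0.02"]}]}}
print(decide(v2_error_rate(empty)), decide(v2_error_rate(bad)))  # hold rollback
```

Holding at the current weight when data is missing keeps a quiet period, or a broken metrics pipeline, from silently promoting the canary to 100%.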

Step 6 — Apply Resilience Policies

mTLS and canary splitting are in place. Now apply the resilience layer: circuit breaking via outlierDetection (already on the DestinationRule above), and a rate-limit for the frontend using an EnvoyFilter with the local rate-limit filter. Also add a per-request timeout on the most latency-sensitive downstream: the checkoutservice.

# Timeout and retry policy for checkoutservice
kubectl apply -n boutique -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkoutservice
  namespace: boutique
spec:
  hosts:
  - checkoutservice
  http:
  - route:
    - destination:
        host: checkoutservice
    timeout: 10s
    retries:
      attempts: 1
      perTryTimeout: 8s
      retryOn: reset,connect-failure
EOF
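
The timeout and retry fields interact: timeout bounds the whole request, including retries, while perTryTimeout bounds each try. The arithmetic below is a sketch under two stated assumptions: attempts counts retries on top of the initial try, and retry backoff is ignored. The function names are hypothetical.

```python
def worst_case_seconds(attempts: int, per_try_timeout_s: float) -> float:
    """Time needed if the initial try and every retry all run to perTryTimeout."""
    return (1 + attempts) * per_try_timeout_s


def time_left_for_last_try(timeout_s: float, attempts: int,
                           per_try_timeout_s: float) -> float:
    """Seconds the final retry can run before the route-level timeout fires."""
    used_before_last = attempts * per_try_timeout_s
    return max(0.0, min(per_try_timeout_s, timeout_s - used_before_last))


# checkoutservice: timeout 10s, attempts 1, perTryTimeout 8s
print(worst_case_seconds(1, 8), time_left_for_last_try(10, 1, 8))  # 16 2
# productcatalogservice (Step 4): timeout 3s, attempts 2, perTryTimeout 1s
print(worst_case_seconds(2, 1), time_left_for_last_try(3, 2, 1))   # 3 1
```

For checkoutservice the single retry gets at most about 2 seconds after a first try that times out, because the 10-second route timeout expires before a second full 8-second try can finish. Whether that is acceptable is a design choice; the point is to size the two values together.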

# Local rate-limit on the frontend (1000 req/min per frontend pod; the token
# bucket lives in each proxy instance and is not keyed by source IP)
kubectl apply -n boutique -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: frontend-ratelimit
  namespace: boutique
spec:
  workloadSelector:
    labels:
      app: frontend
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
            subFilter:
              name: envoy.filters.http.router
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          "@type": type.googleapis.com/udpa.type.v1.TypedStruct
          type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          value:
            stat_prefix: http_local_rate_limiter
            token_bucket:
              max_tokens: 1000
              tokens_per_fill: 1000
              fill_interval: 60s
            filter_enabled:
              runtime_key: local_rate_limit_enabled
              default_value: {numerator: 100, denominator: HUNDRED}
            filter_enforced:
              runtime_key: local_rate_limit_enforced
              default_value: {numerator: 100, denominator: HUNDRED}
EOF
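
The three token_bucket fields are easier to reason about with a small model. The class below is a simplified sketch of interval-based refill using the values from the EnvoyFilter above; it is not Envoy's implementation, and the real filter's refill timing may differ between Envoy versions.

```python
class TokenBucket:
    """Simplified token bucket: starts full, refills in whole intervals."""

    def __init__(self, max_tokens: int, tokens_per_fill: int, fill_interval_s: float):
        self.max_tokens = max_tokens
        self.tokens_per_fill = tokens_per_fill
        self.fill_interval_s = fill_interval_s
        self.tokens = max_tokens
        self.last_fill = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        fills = int((now - self.last_fill) // self.fill_interval_s)
        if fills > 0:
            self.tokens = min(self.max_tokens,
                              self.tokens + fills * self.tokens_per_fill)
            self.last_fill += fills * self.fill_interval_s
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(max_tokens=1000, tokens_per_fill=1000, fill_interval_s=60)
admitted = sum(bucket.allow(0.5) for _ in range(1200))  # burst in the first second
print(admitted, bucket.allow(30.0), bucket.allow(60.0))  # 1000 False True
```

Under this model a burst can spend the whole minute's budget in the first second, after which every request is rejected until the next fill. A smaller fill_interval with proportionally fewer tokens_per_fill gives a smoother limit at the same average rate.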

Step 7 — Validate with Kiali and Prometheus

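With traffic flowing and policies applied, open Kiali to confirm the topology is correctly represented and that mTLS padlocks appear on every edge. Kiali, Prometheus, and Grafana are not part of the default profile; install them first from the samples/addons directory of the Istio release. Then run a synthetic load test with k6 and watch the canary split in Grafana.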

# Open Kiali (port-forward if not using an ingress)
kubectl port-forward svc/kiali -n istio-system 20001:20001 &
# Visit http://localhost:20001 — every edge should show a lock icon (mTLS)

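# Generate sustained load (k6 minimal script).
# FRONTEND_IP must be an address that enters through the mesh, such as the Istio
# ingress gateway: under STRICT mTLS, plain-text requests sent directly to the
# frontend pods are rejected. Expect HTTP 429 responses once the local rate limit
# from Step 6 is exhausted.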
cat > load.js <<EOF
import http from 'k6/http';
import { sleep } from 'k6';
export const options = { vus: 50, duration: '5m' };
export default function () {
  http.get('http://FRONTEND_IP/');
  sleep(0.1);
}
EOF
k6 run load.js

# In Prometheus — verify canary traffic split
# Proportion of requests reaching v2:
# sum(rate(istio_requests_total{destination_service_name="productcatalogservice",destination_version="v2"}[1m]))
#   /
# sum(rate(istio_requests_total{destination_service_name="productcatalogservice"}[1m]))
# Should return ~0.05 at the initial 5% split

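# Check mTLS is enforced (no plain-text traffic on any edge). Use rate() so that
# requests counted before STRICT mode was applied do not skew the result:
# sum(rate(istio_requests_total{connection_security_policy="none",destination_namespace="boutique"}[5m]))
# Should return 0 or no data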

Pin the Kiali and Grafana dashboards to your incident response runbook. During a canary incident the single most valuable panel is Request Success Rate by Version — it shows v1 and v2 side by side in real time, making the blast radius immediately visible without grepping logs.

Step 8 — AuthorizationPolicy: Zero-Trust East-West

mTLS proves identity; it does not restrict which identities may talk to which services. Add AuthorizationPolicy resources to enforce least-privilege: only the services that genuinely need to call productcatalogservice are allowed to. All other callers receive HTTP 403.

# Allow only the known callers of productcatalogservice: frontend,
# recommendationservice, and checkoutservice (which looks up product prices)
kubectl apply -n boutique -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: productcatalogservice-allow
  namespace: boutique
spec:
  selector:
    matchLabels:
      app: productcatalogservice
  action: ALLOW
  rules:
  - from:
    - source:
        principals:
        - cluster.local/ns/boutique/sa/frontend
        - cluster.local/ns/boutique/sa/recommendationservice
        - cluster.local/ns/boutique/sa/checkoutservice
---
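# Deny-all baseline: an ALLOW policy with no rules matches nothing, so any request
# not matched by another ALLOW policy is rejected. Every other workload in the
# namespace now needs its own ALLOW policy to keep receiving traffic.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: boutique
spec: {}  # no selector = applies to all workloads in the namespace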
EOF

# Both tests assume the container image ships wget.
# Test: frontend can reach catalog. Port 3550 speaks gRPC, so a plain HTTP GET is
# unlikely to return 200; what matters is that the response is NOT 403 Forbidden.
kubectl exec -n boutique deploy/frontend -c server -- \
  wget -qO- http://productcatalogservice:3550/ 2>&1 | head -1

# Test: cartservice CANNOT reach catalog (should 403)
kubectl exec -n boutique deploy/cartservice -c server -- \
  wget -qO- http://productcatalogservice:3550/ 2>&1 | head -1
# Expected: wget: server returned error: HTTP/1.1 403 Forbidden
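
The interaction between a targeted ALLOW policy and a namespace-wide deny-all can be surprising, so here is a simplified model of the ALLOW evaluation only (DENY and CUSTOM actions, and rule fields other than principals, are left out). The function names and the dictionary layout are hypothetical.

```python
def applies_to(policy: dict, workload_labels: dict) -> bool:
    """An empty selector applies to every workload in the namespace."""
    return all(workload_labels.get(k) == v for k, v in policy["selector"].items())


def is_allowed(policies: list, workload_labels: dict, principal: str) -> bool:
    """If no ALLOW policy applies to the workload, traffic is allowed.
    Otherwise at least one applicable policy must match the caller."""
    applicable = [p for p in policies if applies_to(p, workload_labels)]
    if not applicable:
        return True
    return any(principal in p["principals"] for p in applicable)


FRONTEND = "cluster.local/ns/boutique/sa/frontend"
CART = "cluster.local/ns/boutique/sa/cartservice"

policies = [
    # One of the allowed callers from the policy above, for illustration.
    {"name": "productcatalogservice-allow",
     "selector": {"app": "productcatalogservice"},
     "principals": [FRONTEND]},
    # Empty selector and no rules: matches every workload, allows nobody.
    {"name": "deny-all", "selector": {}, "principals": []},
]

catalog = {"app": "productcatalogservice"}
print(is_allowed(policies, catalog, FRONTEND))                 # True
print(is_allowed(policies, catalog, CART))                     # False
print(is_allowed(policies, {"app": "cartservice"}, FRONTEND))  # False
```

The last line is the one to watch: once deny-all exists, the frontend can no longer reach cartservice either, because no ALLOW policy has been written for it yet.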

Production Lessons from This Project

Applying a service mesh to a real application surfaces several lessons that are easy to miss in the documentation:

Inject before you lock down. Switch namespaces to PERMISSIVE first, confirm all traffic flows with sidecars injected, then move to STRICT. Flipping to STRICT before injection is complete silently breaks inter-service calls.
DestinationRule must exist before VirtualService references it. Apply the DestinationRule first; otherwise Envoy has no subset to route to and returns 503.
Canary rollback must be a one-command operation. Your GitOps pipeline should have a make rollback target that sets VirtualService weights to 100/0 in under 30 seconds. Measure and rehearse this before any canary goes to 5%.
AuthorizationPolicy deny-all breaks jobs and init containers. Apply namespace-wide deny-all incrementally, starting with the most critical services, and always check Kubernetes Jobs and CronJobs — they often use service accounts that are absent from the allow rules.
Sidecar resource budget. In large clusters (thousands of pods), per-sidecar overhead adds up. Plan for roughly 50–100 MB of RAM and 0.1–0.2 CPU cores per Envoy sidecar (by default the injector requests 100m CPU and 128Mi of memory for each one), and budget this into your cluster capacity model before enabling mesh-wide injection.

At this scale — thousands of pods, dozens of services — the mesh config surface becomes its own reliability risk. Adopt istioctl analyze as a mandatory CI step on every mesh config change. It catches missing DestinationRules, conflicting VirtualServices, and invalid AuthorizationPolicies before they reach the cluster.

You have now completed the full Service Mesh tutorial: from first principles through architecture, traffic management, mTLS, resilience, observability, Linkerd, operations, and this capstone. The patterns here — sidecar injection, SPIFFE-based identity, declarative traffic routing, and zero-trust authorization — are the baseline for any serious microservices platform in 2025 and beyond.