mTLS & Mesh Security
Zero-trust networking inside a Kubernetes cluster is not a luxury — it is a hard production requirement at any organization that has passed a SOC 2 or PCI audit, or that runs multi-tenant workloads on shared infrastructure. A service mesh enforces zero-trust transparently: every connection is mutually authenticated and encrypted, authorization is declared as policy, and none of this requires a single line of application code to change. This lesson unpacks how Istio implements that guarantee and where it breaks down under production pressure.
Why mTLS at the Mesh Layer?
Without a mesh, east-west traffic inside a cluster is plaintext. A compromised pod can sniff traffic to any service it can reach, forge source IPs, and impersonate other workloads. Kubernetes NetworkPolicy can block Layer-3/Layer-4 flows, but it cannot verify identity at the application layer. mTLS solves the identity problem: both sides present X.509 certificates, the connection is rejected if either side cannot prove its identity, and all bytes are encrypted with TLS (TLS 1.2 is Istio's default minimum; TLS 1.3 is used where both proxies negotiate it).
Istio encodes workload identity in the SPIFFE standard: each pod gets a certificate with a Subject Alternative Name (SAN) of the form spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>. Istiod acts as the SPIFFE-compliant CA, minting short-lived certificates (default 24-hour TTL, configurable down to minutes) that are automatically rotated by the sidecar before expiry.
PeerAuthentication: Enforcing mTLS
Istio ships in PERMISSIVE mode by default — it accepts both plaintext and mTLS traffic so you can roll out incrementally. For production, you must flip to STRICT. The PeerAuthentication resource controls this at mesh, namespace, or per-workload granularity.
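As a minimal sketch of two of those scopes, the manifests below assume the mesh root namespace is istio-system and use a hypothetical namespace called legacy that still has uninjected clients. The resource names are illustrative.

```yaml
# Mesh-wide: a PeerAuthentication with no selector in the root namespace
# (assumed to be istio-system) applies to every workload in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Namespace-scoped: "legacy" is a hypothetical namespace that is still
# being migrated. This policy takes precedence over the mesh-wide one
# for workloads in that namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: legacy
spec:
  mtls:
    mode: PERMISSIVE
```

The narrower scope wins (workload over namespace over mesh), so a temporary override like the second document should be tracked and removed once the migration is done.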
Run istioctl x authz check <pod> and kubectl get peerauthentication -A regularly. A namespace-scoped PERMISSIVE silently overrides the mesh default, and you will not notice until an audit or breach.
Authorization Policies: Layer-7 Access Control
AuthorizationPolicy is Istio's firewall at the application layer. Unlike NetworkPolicy (L3/L4), it can match on HTTP method, path, headers, JWT claims, and the SPIFFE principal of the caller. Policies are evaluated in a fixed order: CUSTOM policies first (if any exist), then DENY, then ALLOW. A request is denied if any DENY policy matches, or if no ALLOW policy matches when at least one ALLOW policy applies to the workload.
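A common way to put this into practice is an allow-nothing baseline plus narrowly scoped ALLOW policies. The sketch below assumes a hypothetical orders namespace, an app: orders pod label, a frontend service account, and the default cluster.local trust domain. None of these names come from a real deployment.

```yaml
# Baseline: an ALLOW policy with an empty spec matches no request, so
# everything in the namespace is denied unless another ALLOW policy matches.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-nothing
  namespace: orders            # hypothetical namespace
spec: {}
---
# Permit only GET on /api/orders/* from the frontend service account.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-frontend-read
  namespace: orders
spec:
  selector:
    matchLabels:
      app: orders              # hypothetical pod label
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE principal, written without the spiffe:// prefix
        principals: ["cluster.local/ns/frontend/sa/frontend"]
    to:
    - operation:
        methods: ["GET"]
        paths: ["/api/orders/*"]
```

Matching on principals only works when the connection is mTLS, because the principal is read from the peer certificate.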
JWT Authentication with RequestAuthentication
For north-south traffic (ingress), combine RequestAuthentication (validates the JWT signature) with AuthorizationPolicy (enforces which claims are allowed). RequestAuthentication does not reject requests without a token — it only rejects requests with an invalid token. The DENY or ALLOW logic in AuthorizationPolicy is what enforces presence.
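The pairing might look like the following sketch. The issuer and JWKS URLs are placeholders, and the selector assumes the ingress gateway pods carry the istio: ingressgateway label in istio-system; adjust both for your installation.

```yaml
# Step 1: validate any JWT that is presented. Requests without a token
# still pass this stage.
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: ingress-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway    # assumed gateway label
  jwtRules:
  - issuer: "https://issuer.example.com"                         # placeholder
    jwksUri: "https://issuer.example.com/.well-known/jwks.json"  # placeholder
---
# Step 2: enforce presence. A request with no valid token has no request
# principal, so it matches this DENY rule.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: ingress-require-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  action: DENY
  rules:
  - from:
    - source:
        notRequestPrincipals: ["*"]
```

Claim-level checks can be layered on top with a when condition on request.auth.claims in a separate ALLOW policy.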
Certificate Rotation and CA Pluggability
Istiod's built-in CA is fine for single-cluster development, but production clusters at scale use an external CA. Options:
- Intermediate CA signing: give Istiod a corporate-signed intermediate cert; it issues workload certs chained to your PKI. Supply it as the cacerts secret in istio-system (istio-ca-secret is where Istiod stores its self-generated CA when no cacerts secret is present).
- cert-manager integration: use the istio-csr agent (cert-manager project) to forward all workload CSRs to cert-manager, which can be backed by Vault, AWS ACM PCA, or any RFC 5280 CA.
- SPIRE: for multi-cluster or multi-platform identity, replace Istiod's CA entirely with a SPIRE server federation. All trust domains are managed centrally; workloads on VMs, bare metal, and Kubernetes all get consistent SPIFFE IDs.
Workload certificate TTL is set through meshConfig.defaultConfig.proxyMetadata.SECRET_TTL. Shorter TTLs increase Istiod load: at a 1-hour TTL with 1,000 pods, Istiod handles roughly 1,000 / 3,600 ≈ 0.28 cert renewals per second, well within its capacity. At 100,000 pods the same TTL means roughly 28 renewals per second, so plan for Istiod HA with resource tuning.
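For reference, a sketch of where that setting lives, assuming the mesh is installed with istioctl and an IstioOperator overlay (Helm users would set the same keys under meshConfig in their values). The 1h value is only an example.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        # Lifetime the sidecar agent requests for its workload certificate.
        # Example value only; the default is 24h.
        SECRET_TTL: 1h
```

Because this is proxy configuration, existing pods typically need a restart before they pick up the new TTL.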
Production Failure Modes
The most common mTLS incidents in production:
- Injected pod calling uninjected pod: the caller originates mTLS, but the target has no sidecar to terminate it. The connection hangs or returns a TLS error. Fix: inject the target pod, or use a DestinationRule with tls.mode: DISABLE for that specific host. For the reverse direction (an uninjected client calling a STRICT workload), add a per-workload PERMISSIVE override.
- Node-level health probes: kubelet calls liveness/readiness probes directly, without a sidecar. Istio handles this through the sidecar injector's rewriteAppHTTPProbers setting, which rewrites HTTP probes so the sidecar agent answers them. Confirm it is enabled: kubectl -n istio-system get configmap istio-sidecar-injector -o yaml | grep rewriteAppHTTPProbe.
- Policy not applying: AuthorizationPolicy selectors use pod labels; a typo silently makes the policy match nothing. Always verify with istioctl x authz check <pod-name> -n <namespace>.
- Clock skew breaking JWT validation: JWT nbf/exp checks require synchronized clocks. NTP drift > 60 seconds causes spurious 401s. Monitor node clock skew; Kubernetes already recommends NTP but does not enforce it.
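For the first failure mode, the DestinationRule fix could be sketched as follows. The host legacy-db.legacy.svc.cluster.local and the payments namespace are hypothetical; the point is that the exemption is scoped to one host rather than the whole mesh.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: legacy-db-plaintext
  namespace: payments          # hypothetical namespace of the calling workloads
spec:
  # Hypothetical uninjected service that cannot terminate mTLS.
  host: legacy-db.legacy.svc.cluster.local
  trafficPolicy:
    tls:
      mode: DISABLE            # callers send plaintext to this host only
```

Treat a rule like this as a temporary exception: traffic to that host is unencrypted and unauthenticated until the target is injected.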
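Metrics scrapers that run outside the mesh hit the same failure under STRICT mode: set a port-level PERMISSIVE PeerAuthentication on scrape ports, or inject them into the mesh first.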
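A port-level exemption for a scrape port might look like this sketch. The payments namespace, the app: payments-api label, and port 9090 are assumptions for illustration; portLevelMtls requires a selector and refers to the workload's container port.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: payments-api-metrics
  namespace: payments          # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: payments-api        # hypothetical pod label
  mtls:
    mode: STRICT               # every other port stays STRICT
  portLevelMtls:
    9090:                      # assumed metrics port
      mode: PERMISSIVE
```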
Verifying the Security Posture
Trust but verify: the mesh generates the metadata needed to audit its own security posture.
A mature mesh security posture at production scale means: mesh-wide STRICT mTLS, deny-all as the default in every namespace, ALLOW policies version-controlled in Git alongside the application manifests, certificate rotation under four hours, and Kiali (or a custom Prometheus query on istio_requests_total{connection_security_policy="mutual_tls"}) confirming >99.9% of intra-cluster calls are mTLS-encrypted at all times.
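The 99.9% target can be turned into an alert. The following is a sketch of a Prometheus rule file, assuming Istio's standard metrics are scraped; the group name, alert name, 5-minute window, and 15-minute hold are illustrative choices. It filters on reporter="destination" because the receiving proxy is the side that knows whether the connection arrived over mutual TLS.

```yaml
groups:
- name: mesh-mtls-posture                  # hypothetical group name
  rules:
  - alert: MeshMtlsCoverageBelowTarget     # hypothetical alert name
    # Share of destination-reported requests that arrived over mutual TLS.
    expr: |
      sum(rate(istio_requests_total{reporter="destination",connection_security_policy="mutual_tls"}[5m]))
        /
      sum(rate(istio_requests_total{reporter="destination"}[5m]))
        < 0.999
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Less than 99.9% of in-mesh requests are using mutual TLS"
```

This covers HTTP and gRPC traffic only; plain TCP services would need an equivalent rule on Istio's TCP metrics.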