Cloud & Kubernetes Security Hardening

Zero Trust Architecture


The perimeter model is dead. For decades, security engineering assumed that anything inside the corporate network was safe and anything outside was hostile. That assumption failed the moment employees started carrying laptops to coffee shops, the moment SaaS apps began holding more crown jewels than the data centre, and the moment a single phished credential gave an attacker the keys to the entire internal network. Google learned this lesson firsthand in Operation Aurora (an intrusion campaign that began in 2009 and was publicly disclosed in January 2010): attackers who compromised a single Windows workstation moved laterally across the corporate network to source repositories and user accounts worldwide. Once the perimeter was crossed, the internal network offered little resistance.

Google's response was BeyondCorp — a complete rearchitecture of how employees access internal resources, published openly starting in 2014 and deployed at Google scale by 2017. The core insight is simple: network location must never be treated as a proxy for trust. Every access decision must be made based on identity, device state, and context — regardless of whether the request originates from a home office, an airport, or a datacenter rack sitting three feet from the server it is calling.

This lesson covers the three pillars that make Zero Trust concrete: identity-aware access, mutual TLS everywhere, and the BeyondCorp access-proxy pattern. You will leave with working configs and a clear mental model of how top-tier engineering organisations implement these in production Kubernetes and cloud environments.

Pillar 1: Identity-Aware Access

In a Zero Trust world, identity is the control plane. Every principal — human user, service account, Kubernetes pod, CI runner — must carry a verified, short-lived credential. Access decisions happen at the resource boundary, not the network edge.

The practical implications for a production Kubernetes environment:

Short-lived credentials everywhere. AWS IAM Roles for Service Accounts (IRSA), GKE Workload Identity, and Azure AD Workload Identity all allow pods to obtain time-limited cloud credentials via OIDC token exchange — replacing the disastrous practice of mounting long-lived access keys as Secrets.
Per-workload identity at the pod level. Kubernetes ServiceAccounts are the unit of identity. Each ServiceAccount should be bound to exactly the permissions it needs, nothing more. The default ServiceAccount in every namespace has no bound permissions by default in modern clusters — keep it that way and create dedicated accounts per workload.
Context-aware policy. An identity token is not enough on its own. At Google, BeyondCorp also evaluates device posture (is the device managed? is the OS up to date?), request time, geolocation, and risk signals. In cloud-native environments, Open Policy Agent (OPA) lets you encode this logic as Rego policies: Gatekeeper evaluates them on every Kubernetes admission request, and OPA running as a sidecar or external authoriser can evaluate them on application-level authorisation requests.

The OIDC chain: Kubernetes issues signed OIDC tokens to pods via the projected service account token volume. The cloud provider's STS (AWS STS, Google STS) verifies the token's signature against the public keys published at the cluster's OIDC discovery endpoint and exchanges it for a short-lived cloud credential. The exchange is initiated from inside the pod, typically by the cloud SDK, and the projected token is rotated automatically, so no long-lived static secret ever needs to be stored in the cluster.
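
To make the chain more tangible, the following Python sketch decodes the claims of a service account token without verifying its signature, which is handy when debugging why an exchange is rejected. The sample token is fabricated for illustration (the claim values and the `make_sample_token` helper are assumptions, not output from a real cluster); signature verification remains the job of the STS.

```python
import base64
import json
import time


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding JWTs strip off."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def b64url_encode(data: bytes) -> str:
    """Encode bytes as unpadded base64url, the JWT segment encoding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_claims(token: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    return json.loads(b64url_decode(parts[1]))


def make_sample_token(claims: dict) -> str:
    """Hypothetical helper: build an unsigned stand-in token for the demo."""
    header = {"alg": "RS256", "typ": "JWT"}
    return ".".join([
        b64url_encode(json.dumps(header).encode()),
        b64url_encode(json.dumps(claims).encode()),
        b64url_encode(b"not-a-real-signature"),
    ])


now = int(time.time())
sample_claims = {
    # Illustrative values modelled on the IRSA example in this lesson.
    "iss": "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E",
    "sub": "system:serviceaccount:production:my-app",
    "aud": ["sts.amazonaws.com"],
    "iat": now,
    "exp": now + 3600,
}
token = make_sample_token(sample_claims)
claims = decode_claims(token)
print("subject:", claims["sub"])
print("lifetime (s):", claims["exp"] - claims["iat"])
```

The `iss`, `sub`, and `aud` claims are the ones a trust policy matches against, so a mismatch in any of them is usually the first thing to check.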

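Enabling IRSA on an EKS cluster requires registering the cluster's OIDC issuer as an IAM OIDC identity provider (a one-time step per cluster), creating an IAM role with the right trust policy, and annotating the ServiceAccount: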

# Create the IAM role with its trust policy
# (substitute your account ID and cluster OIDC issuer URL).
# The aud condition restricts the token to the STS audience;
# the sub condition pins the role to one ServiceAccount.
aws iam create-role \
  --role-name my-app-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E:aud":
            "sts.amazonaws.com",
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E:sub":
            "system:serviceaccount:production:my-app"
        }
      }
    }]
  }'

# Annotate the Kubernetes ServiceAccount
kubectl annotate serviceaccount my-app \
  -n production \
  eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/my-app-role

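The StringEquals condition on the OIDC subject is critical: it scopes the role assumption to exactly one ServiceAccount in one namespace. Omitting the condition, or replacing it with a StringLike match on a broad pattern such as system:serviceaccount:*:*, would allow any ServiceAccount in the cluster, and therefore any pod, to assume the role.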
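
The difference between an exact match and a wildcard match on the subject claim can be sketched in a few lines of Python. This is a simplified model of the two condition operators, not IAM's actual evaluation engine; the function names and the second subject string are hypothetical.

```python
import re


def string_equals(expected: str, value: str) -> bool:
    """Model of an exact, case-sensitive match (StringEquals-style)."""
    return expected == value


def string_like(pattern: str, value: str) -> bool:
    """Model of a wildcard match (StringLike-style): * matches any run
    of characters and ? matches exactly one character."""
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\?", ".")
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


intended = "system:serviceaccount:production:my-app"
other = "system:serviceaccount:dev:untrusted-job"  # hypothetical workload

# Exact match: only the intended ServiceAccount passes.
print(string_equals(intended, intended), string_equals(intended, other))

# Broad wildcard: every ServiceAccount in the cluster passes.
broad = "system:serviceaccount:*:*"
print(string_like(broad, intended), string_like(broad, other))

# Namespace-scoped wildcard: narrower, but still every account in production.
scoped = "system:serviceaccount:production:*"
print(string_like(scoped, intended), string_like(scoped, other))
```

Even the namespace-scoped pattern grants the role to workloads that may be created later, which is why pinning the full subject is the safer default.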

Pillar 2: Mutual TLS Everywhere

Even with identity tokens in place, network traffic between services is still vulnerable to man-in-the-middle attacks if it runs over plain TCP or one-way TLS. Mutual TLS (mTLS) requires both sides of every connection to present a valid certificate issued by a trusted Certificate Authority. This means every service is cryptographically authenticated before a single byte of application data is exchanged.
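
What "both sides present a certificate" means for the server can be illustrated with Python's standard ssl module. This is a minimal sketch, assuming PEM files whose paths are passed in by the caller (the function name and parameters are hypothetical); the only difference from an ordinary TLS server is that client certificate verification is switched on.

```python
import ssl
from typing import Optional


def build_mtls_server_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server-side TLS context that demands a client certificate.

    The file arguments are optional only so the policy settings can be
    inspected without real key material; a real server needs all three.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # One-way TLS leaves this at CERT_NONE: the server proves who it is,
    # the client does not. CERT_REQUIRED makes the handshake fail unless
    # the client presents a certificate that chains to a trusted CA.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


ctx = build_mtls_server_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)
```

Every service would also need the mirror-image client configuration plus rotation of all three files, which is the operational burden the next paragraph refers to.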

In a Kubernetes cluster, implementing mTLS manually is impractical: it requires every application team to manage certificates, rotate them, and handle expired certs correctly. The usual answer is a service mesh such as Istio, Linkerd, or Cilium's eBPF-based service mesh. Istio and Linkerd run a sidecar proxy alongside each pod that intercepts all inbound and outbound traffic and handles the mTLS handshake transparently; Cilium enforces authentication in a kernel-level eBPF datapath instead of a per-pod sidecar.

Figure: Zero Trust request flow. Every request passes an identity-aware access proxy; inside the mesh all pod-to-pod traffic uses mTLS with SPIFFE/SVID certificates enforced by AuthorizationPolicy.

Istio enforces mTLS cluster-wide with a single PeerAuthentication resource, and then restricts which identities may communicate with an AuthorizationPolicy:

# Enforce STRICT mTLS across the entire mesh (no plaintext allowed)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # applies mesh-wide
spec:
  mtls:
    mode: STRICT

---
# Allow only the order-service ServiceAccount to call the payment-service
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payment-allow-orders
  namespace: production
spec:
  selector:
    matchLabels:
      app: payment-service
  action: ALLOW
  rules:
  - from:
    - source:
        principals:
        - "cluster.local/ns/production/sa/order-service"
    to:
    - operation:
        methods: ["POST"]
        paths: ["/api/v1/charge"]
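
The ALLOW semantics of the policy above can be modelled in Python to make the evaluation order explicit. This is a deliberately simplified sketch: it handles exact matches only and ignores DENY and CUSTOM actions, wildcards, and namespaces, all of which the real Istio engine supports. The class and function names, and the `inventory-service` principal, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    principals: Tuple[str, ...]
    methods: Tuple[str, ...]
    paths: Tuple[str, ...]


@dataclass(frozen=True)
class Request:
    principal: str  # identity taken from the caller's mTLS certificate
    method: str
    path: str


def rule_matches(rule: Rule, req: Request) -> bool:
    """A rule matches only if source AND operation both match."""
    return (
        req.principal in rule.principals
        and req.method in rule.methods
        and req.path in rule.paths
    )


def is_allowed(allow_rules: Sequence[Rule], req: Request) -> bool:
    """No ALLOW policy on the workload: allow everything.
    At least one ALLOW policy: allow only requests matching some rule."""
    if not allow_rules:
        return True
    return any(rule_matches(rule, req) for rule in allow_rules)


payment_rules = [
    Rule(
        principals=("cluster.local/ns/production/sa/order-service",),
        methods=("POST",),
        paths=("/api/v1/charge",),
    )
]

good = Request("cluster.local/ns/production/sa/order-service", "POST", "/api/v1/charge")
wrong_caller = Request("cluster.local/ns/production/sa/inventory-service", "POST", "/api/v1/charge")
wrong_method = Request("cluster.local/ns/production/sa/order-service", "GET", "/api/v1/charge")

for req in (good, wrong_caller, wrong_method):
    print(req.principal.rsplit("/", 1)[-1], req.method, is_allowed(payment_rules, req))
```

The empty-list branch is the subtle part: a workload with no policy is wide open, and the first ALLOW policy silently turns everything it does not mention into a denial.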

Do not start with PERMISSIVE mode in production. Istio's default mTLS mode is PERMISSIVE — it accepts both plaintext and mTLS connections, which means an attacker who compromises a non-mesh workload can still talk to mesh services without a certificate. Set STRICT immediately when you install Istio, before any workloads are deployed. Retrofitting STRICT onto a running cluster that has plaintext traffic is painful — services break and the blast radius is large.

Pillar 3: The BeyondCorp Access-Proxy Pattern

BeyondCorp flips the traditional VPN model on its head. Instead of a VPN gateway that checks "is this device on the right network?" and then grants broad access, BeyondCorp deploys an access proxy in front of every internal application. The proxy makes an authorisation decision per request, based on three factors:

Who are you? — verified identity from an IdP (Google Workspace, Okta, Azure AD). No anonymous access, ever.
What device are you using? — device inventory service checks: is this a managed device? Does it have the latest OS patches? Is disk encryption enabled? Does the MDM show any recent security alerts?
What are you asking for? — the specific resource and action, matched against a policy that defines which groups and device trust levels can access it.
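
The three questions above map naturally onto a small decision function. The Python sketch below is an illustrative model, not the logic of any particular product: the trust tiers, field names, and the payroll policy are all assumptions chosen to mirror the factors listed.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import FrozenSet, Optional, Tuple


class DeviceTrust(IntEnum):
    UNTRUSTED = 0
    BASIC = 1
    FULLY_TRUSTED = 2


@dataclass(frozen=True)
class Device:
    managed: bool
    os_patched: bool
    disk_encrypted: bool
    open_security_alerts: int = 0


@dataclass(frozen=True)
class ResourcePolicy:
    allowed_groups: FrozenSet[str]
    min_device_trust: DeviceTrust


def device_trust(device: Device) -> DeviceTrust:
    """Derive a trust tier from device inventory signals."""
    if not device.managed or device.open_security_alerts > 0:
        return DeviceTrust.UNTRUSTED
    if device.os_patched and device.disk_encrypted:
        return DeviceTrust.FULLY_TRUSTED
    return DeviceTrust.BASIC


def authorize(
    user_groups: Optional[FrozenSet[str]],
    device: Device,
    policy: ResourcePolicy,
) -> Tuple[bool, str]:
    """Evaluate one request; user_groups is None when no identity was verified."""
    if user_groups is None:
        return False, "no verified identity"
    if not (user_groups & policy.allowed_groups):
        return False, "identity not in an allowed group"
    if device_trust(device) < policy.min_device_trust:
        return False, "device trust level too low"
    return True, "allowed"


payroll = ResourcePolicy(frozenset({"finance"}), DeviceTrust.FULLY_TRUSTED)
laptop = Device(managed=True, os_patched=True, disk_encrypted=True)
stale_laptop = Device(managed=True, os_patched=False, disk_encrypted=True)

print(authorize(frozenset({"finance"}), laptop, payroll))
print(authorize(frozenset({"finance"}), stale_laptop, payroll))
print(authorize(None, laptop, payroll))
```

Returning the reason alongside the decision is a deliberate choice: it is what ends up in the access log and what an engineer reads when a legitimate user is blocked.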

Commercial implementations of this pattern include Google Cloud IAP (Identity-Aware Proxy), Cloudflare Access, and AWS Verified Access. All three terminate external TLS, validate identity via OIDC or SAML, can factor in device posture signals, and forward requests to your backend only if policy allows it. Each forwarded request carries a signed JWT header, so the backend only needs to verify that signature (and should always do so) rather than reimplement authentication itself.

Combine the proxy with short-lived certificates (SPIFFE/SPIRE). SPIFFE (Secure Production Identity Framework For Everyone) is a set of open specifications for workload identity; its identity documents are called SVIDs and are most commonly X.509 certificates. SPIRE is the reference implementation: it issues SVIDs to every workload and rotates them automatically, typically with lifetimes on the order of an hour. When Istio is configured to use SPIRE as its CA, every pod in the mesh holds a cryptographic identity that is tied to its Kubernetes ServiceAccount, namespace, and trust domain. This identity is independent of network location: it travels with the workload whether it runs in GKE, EKS, or an on-premises VM, making it the foundation of a genuine multi-cloud Zero Trust architecture.
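
A workload identity of this kind is carried as a SPIFFE ID, a URI of the form spiffe://trust-domain/path. The Python sketch below parses the path layout that Istio uses for Kubernetes workloads (ns/NAMESPACE/sa/SERVICEACCOUNT), the same layout as the principal in the AuthorizationPolicy earlier. The type and function names are hypothetical, and other SPIFFE deployments are free to use different path schemes.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class WorkloadIdentity(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(spiffe_id: str) -> WorkloadIdentity:
    """Split an Istio-style SPIFFE ID into its identity components."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or not parsed.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    segments = parsed.path.strip("/").split("/")
    if len(segments) != 4 or segments[0] != "ns" or segments[2] != "sa":
        raise ValueError(f"unexpected path layout: {parsed.path!r}")
    return WorkloadIdentity(
        trust_domain=parsed.netloc,
        namespace=segments[1],
        service_account=segments[3],
    )


identity = parse_spiffe_id("spiffe://cluster.local/ns/production/sa/order-service")
print(identity)
```

Nothing in the ID refers to an IP address, node, or cloud provider, which is what lets the same identity follow the workload across environments.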

Production Failure Modes

Zero Trust architectures introduce a class of production incidents that teams used to perimeter-based networks may never have encountered. Know them before they hit you on call:

Certificate expiry cascade. If the mesh CA cert or a SPIRE root cert expires unnoticed, every service-to-service call in the cluster fails with TLS handshake errors simultaneously. Monitor cert expiry as a first-class SLI. Alert at 30 days, page at 7 days.
OIDC issuer unreachable. If the EKS OIDC discovery endpoint is temporarily unreachable, pods trying to obtain IRSA credentials fail. Build retry logic with exponential backoff and use eks.amazonaws.com/token-expiration annotations to extend token lifetimes to tolerable levels.
AuthorizationPolicy deny-by-default locking out legitimate traffic. When no AuthorizationPolicy selects a workload, Istio allows every request to it. The moment any ALLOW policy selects that workload, the default flips: requests that match no ALLOW rule are denied with HTTP 403. Teams that apply a policy for one caller and forget to cover the other legitimate ones, for example health check probes (which come from the kubelet, not the Kubernetes API server) or metrics scrapers, will trigger cascading failures on the next rollout.
Clock skew breaking JWT validation. OIDC tokens carry iat and exp claims. A node whose clock drifts beyond the validator's tolerance (commonly a few minutes; five is a typical figure) will cause token validation to fail. Enforce NTP synchronisation (chrony or timesyncd) on every node as a non-negotiable cluster hygiene requirement.
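
Two of these failure modes reduce to simple time arithmetic, sketched below in Python. The severity thresholds follow the 30-day and 7-day figures given for certificate expiry; the 300-second leeway is an assumption matching a five-minute tolerance, and real validators may use a different value. Function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

ALERT_DAYS = 30  # open a ticket
PAGE_DAYS = 7    # wake someone up


def expiry_severity(not_after: datetime, now: datetime) -> str:
    """Classify a certificate by the time left before it expires."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= timedelta(days=PAGE_DAYS):
        return "page"
    if remaining <= timedelta(days=ALERT_DAYS):
        return "alert"
    return "ok"


def token_time_valid(iat: int, exp: int, now: int, leeway_seconds: int = 300) -> bool:
    """Check iat/exp against the validator's clock, tolerating some skew.

    All values are Unix timestamps in seconds.
    """
    return (iat - leeway_seconds) <= now <= (exp + leeway_seconds)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
for days in (45, 20, 3, -1):
    print(days, expiry_severity(now + timedelta(days=days), now))

# Token issued at t=1000 with a one-hour lifetime.
print(token_time_valid(iat=1000, exp=4600, now=900))  # validator clock 100 s behind
print(token_time_valid(iat=1000, exp=4600, now=600))  # validator clock 400 s behind
```

A validator whose clock runs behind sees a freshly issued token as coming from the future, which is why skew breaks new tokens first rather than old ones.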

Zero Trust is a journey, not a switch. Google took four years to migrate 60,000+ employees off the corporate VPN to BeyondCorp. Start with your most sensitive services, instrument everything with structured access logs, and expand incrementally. Every access log entry should record: who (identity), what (resource + action), why (policy that matched), and from where (device + IP). That log is your audit trail and your debugging surface simultaneously.
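
A structured access log entry with those four dimensions might look like the following Python sketch. The field names and sample values are hypothetical; the point is that each entry is a single machine-parseable JSON object rather than free text.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def access_log_entry(
    *,
    identity: str,
    resource: str,
    action: str,
    matched_policy: str,
    decision: str,
    device_id: str,
    source_ip: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialise one access decision as a single JSON log line."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    entry = {
        "timestamp": timestamp,
        "who": {"identity": identity},
        "what": {"resource": resource, "action": action},
        "why": {"policy": matched_policy, "decision": decision},
        "from_where": {"device_id": device_id, "source_ip": source_ip},
    }
    return json.dumps(entry, sort_keys=True)


line = access_log_entry(
    identity="cluster.local/ns/production/sa/order-service",
    resource="/api/v1/charge",
    action="POST",
    matched_policy="payment-allow-orders",
    decision="allow",
    device_id="node-a1",       # hypothetical device identifier
    source_ip="10.0.12.34",    # hypothetical address
    now=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
print(line)
```

Recording the matched policy by name is what makes the same log usable for both audit ("who could reach payments?") and debugging ("which rule denied this call?").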