Cloud & Kubernetes Security Hardening

Kubernetes Hardening: Cluster

Pod-level controls stop a compromised workload from escaping its sandbox. Cluster-level hardening is a different discipline: it protects the control plane itself — the API server, etcd, the scheduler, and the admission pipeline. An attacker who reaches the control plane does not need to escape any container; they can create new Pods, read every Secret, and backdoor cluster objects at will. The blast radius is the entire cluster, across every tenant namespace.

This lesson covers the four control-plane pillars that production security teams audit first: API server exposure, RBAC least-privilege review, audit logging, and etcd encryption at rest. Each one has well-known default insecurities that teams commonly inherit from convenience configurations or cloud-provider defaults that prioritized ease of setup over security posture.

API Server Exposure: Shrink the Attack Surface

The Kubernetes API server is the single entry point to the cluster. Every kubectl command, every controller reconciliation loop, and every webhook call goes through it. Leaving it internet-reachable is equivalent to exposing your database's management port to the public internet — scanners find it within minutes, and once found, it becomes a persistent brute-force and exploit target.

In managed offerings (EKS, GKE, AKS) the API server runs in the provider's VPC and is fronted by a load balancer with a public IP by default. The first hardening step is to disable public access and restrict the endpoint to the cluster's own VPC plus specific corporate CIDR ranges used by CI runners and engineers.

# EKS: restrict API server to private VPC access only
# (run from a machine with AWS credentials and kubectl already configured)
aws eks update-cluster-config \
  --name production-cluster \
  --resources-vpc-config \
    endpointPrivateAccess=true,endpointPublicAccess=false

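# If you must keep public access (e.g. no VPN), allow only specific public CIDRs.
# Private ranges such as 10.0.0.0/8 do not belong here: traffic to the public
# endpoint arrives from public addresses, and EKS rejects reserved ranges.
aws eks update-cluster-config \
  --name production-cluster \
  --resources-vpc-config \
    endpointPublicAccess=true,\
publicAccessCidrs="198.51.100.0/24,203.0.113.42/32"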

# Verify the resulting configuration
aws eks describe-cluster --name production-cluster \
  --query "cluster.resourcesVpcConfig"
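The verification step can also be turned into a mechanical check. The following is a minimal sketch rather than an AWS-provided tool: classify_endpoint is a hypothetical helper name, and it assumes you pass it the endpointPublicAccess value and a comma-separated list of the publicAccessCidrs taken from cluster.resourcesVpcConfig.

```shell
#!/usr/bin/env bash
# classify_endpoint PUBLIC_ACCESS CIDRS
#   PUBLIC_ACCESS: value of endpointPublicAccess (True/true/False/false)
#   CIDRS:         comma-separated publicAccessCidrs
# Hypothetical helper for illustration; prints a one-line verdict.
classify_endpoint() {
  public=$1
  cidrs=$2
  case "$public" in
    True|true) ;;
    *) echo "OK: public endpoint disabled"; return 0 ;;
  esac
  # Split the CIDR list on commas and look for the open-to-the-world range.
  if printf '%s\n' "$cidrs" | tr ',' '\n' | grep -qx '0\.0\.0\.0/0'; then
    echo "FAIL: public endpoint open to 0.0.0.0/0"
    return 1
  fi
  echo "WARN: public endpoint restricted to $cidrs"
}
```

One way to obtain the two arguments, assuming the AWS CLI text output format, is to add --output text to the describe-cluster call and narrow the query to "cluster.resourcesVpcConfig.[endpointPublicAccess, join(',', publicAccessCidrs)]". A non-zero return makes the function usable as a pipeline gate.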

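Beyond network exposure, the API server itself should be hardened at the flag level. Self-managed clusters (kubeadm, k3s, Rancher) inherit whatever flags the installer sets. On kubeadm clusters the flags live in the static Pod manifest /etc/kubernetes/manifests/kube-apiserver.yaml; other distributions pass the same flags through their own configuration. The most important flags to audit are: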

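--anonymous-auth=false: disables the system:anonymous user, so unauthenticated requests are rejected before they reach the authorization layer.
--insecure-port=0: disables the legacy unauthenticated HTTP port (default 8080). The port has been deprecated since 1.10. From 1.20 the flag may only be set to 0 because insecure serving is always off, and in 1.24 the flag was removed, so passing it there prevents the API server from starting. Set it explicitly on older clusters and drop it on 1.24+.
--enable-admission-plugins: must include NodeRestriction (prevents kubelets from modifying other nodes' objects) and AlwaysPullImages (forces the image pull policy to Always, so registry credentials are checked on every Pod start and a Pod cannot use a private image merely because it is already cached on the node).
--authorization-mode=Node,RBAC: never AlwaysAllow, which authorizes every request without checking permissions.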
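A flag audit like the one described above can be scripted with plain grep. This is a minimal sketch under stated assumptions: audit_apiserver_flags is a hypothetical helper, it expects the path to a kubeadm-style manifest, and it deliberately skips --insecure-port because the right setting depends on the Kubernetes version.

```shell
#!/usr/bin/env bash
# audit_apiserver_flags MANIFEST
# Hypothetical helper: greps a kube-apiserver manifest for the flags
# discussed above. Prints one line per problem; returns non-zero on any.
audit_apiserver_flags() {
  local manifest=$1 rc=0 authz plugins plugin

  if ! grep -Eq -- '--anonymous-auth=false([^[:alnum:]]|$)' "$manifest"; then
    echo "FAIL: --anonymous-auth=false not set"; rc=1
  fi

  authz=$(grep -Eo -- '--authorization-mode=[A-Za-z,]+' "$manifest" | head -n 1)
  case "$authz" in
    *AlwaysAllow*) echo "FAIL: --authorization-mode includes AlwaysAllow"; rc=1 ;;
    *RBAC*) ;;
    *) echo "FAIL: RBAC missing from --authorization-mode"; rc=1 ;;
  esac

  # Wrap the plugin list in commas so each name matches as a whole entry.
  plugins=$(grep -Eo -- '--enable-admission-plugins=[A-Za-z,]+' "$manifest" | head -n 1)
  for plugin in NodeRestriction AlwaysPullImages; do
    case ",${plugins#*=}," in
      *",$plugin,"*) ;;
      *) echo "FAIL: admission plugin $plugin not enabled"; rc=1 ;;
    esac
  done

  if [ "$rc" -eq 0 ]; then echo "PASS: audited flags look correct"; fi
  return "$rc"
}
```

On a kubeadm control-plane node this would be invoked as audit_apiserver_flags /etc/kubernetes/manifests/kube-apiserver.yaml. It complements kube-bench rather than replacing it.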

Production pitfall — the CIS Benchmark drift: Managed clusters (EKS, GKE) manage the API server flags for you and apply sensible defaults, but they do not give you direct flag access. Run the kube-bench CIS scanner against your cluster to detect which checks pass and which are provider-managed. Never assume a managed cluster is fully hardened out of the box — endpoint exposure and RBAC configuration remain your responsibility.

RBAC Least-Privilege Review

Role-Based Access Control (RBAC) in Kubernetes is expressive — and easy to misuse. The three most dangerous patterns seen repeatedly in production security reviews are:

Wildcard verbs or resources: verbs: ["*"] or resources: ["*"] in a ClusterRole gives the subject permission to do anything to everything. This appears frequently in CI service accounts because it was "the simplest way to make the pipeline work."
Cluster-scoped bindings for namespace-scoped work: A ClusterRoleBinding attaches a role to a subject across every namespace — present and future. Most application service accounts should use RoleBindings scoped to a single namespace.
Default service account misuse: Every Pod gets the default service account in its namespace if no serviceAccountName is specified. If anything has been bound to that default service account, every Pod in the namespace inherits those permissions.

# Audit: find all ClusterRoleBindings and their subjects
kubectl get clusterrolebindings -o json \
  | jq -r '.items[] | "\(.metadata.name)\t\(.roleRef.name)\t\(.subjects[]?.name)"'

# Audit: find any role granting wildcard verbs or resources
kubectl get clusterroles,roles -A -o json \
  | jq -r '
    .items[] |
    # any() yields a single boolean per role, so each match is printed once
    select(any(.rules[]?; any(.verbs[]?; . == "*") or any(.resources[]?; . == "*"))) |
    "\(.metadata.namespace // "cluster")\t\(.metadata.name)"
  '

# Check what a specific service account can do in a given namespace
# (--list evaluates one namespace at a time; repeat per namespace as needed)
kubectl auth can-i --list \
  --as=system:serviceaccount:production:my-app-sa \
  --namespace=production

The remediation pattern is to build roles from first principles: enumerate the exact API groups, resources, and verbs the workload actually needs. A typical web application deployment controller needs get, list, watch, and patch on Deployments only in its own namespace — nothing more. Use the kubectl auth can-i --list output as your baseline, then prune everything not exercised in production.
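As an illustration of that remediation pattern, the sketch below renders a namespaced Role and RoleBinding that grant only get, list, watch, and patch on Deployments. The function name render_deploy_role and the "-deployer" naming suffix are assumptions made for this example, not conventions from Kubernetes itself.

```shell
#!/usr/bin/env bash
# render_deploy_role NAMESPACE SERVICE_ACCOUNT
# Hypothetical helper: prints a least-privilege Role and RoleBinding as YAML.
render_deploy_role() {
  ns=$1
  sa=$2
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ${sa}-deployer
  namespace: ${ns}
rules:
  # Only the verbs the workload exercises, on one resource in one API group
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${sa}-deployer
  namespace: ${ns}
subjects:
  - kind: ServiceAccount
    name: ${sa}
    namespace: ${ns}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ${sa}-deployer
EOF
}
```

A cautious way to try it is render_deploy_role production my-app-sa | kubectl apply --dry-run=server -f -, which validates the objects against the API server without persisting them.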

Pro practice — automated RBAC analysis: Tools like rbac-tool (kubectl rbac-tool lookup <subject>) and rakkess produce access matrices that show exactly what each subject can do to each resource. Run these in your audit pipeline and diff the output across releases to catch privilege creep before it reaches production.
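The release-over-release diff can be done with standard text tools once each run's output is saved to a file. A minimal sketch, assuming one permission entry per line in each snapshot; new_permissions is a hypothetical helper name, and the snapshot format is whatever your chosen tool prints.

```shell
#!/usr/bin/env bash
# new_permissions BASELINE CURRENT
# Hypothetical helper: prints lines present in CURRENT but absent from
# BASELINE, i.e. permissions that appeared since the last release.
new_permissions() {
  baseline_sorted=$(mktemp)
  current_sorted=$(mktemp)
  sort -u "$1" > "$baseline_sorted"
  sort -u "$2" > "$current_sorted"
  # comm -13 suppresses lines unique to file 1 and lines common to both.
  comm -13 "$baseline_sorted" "$current_sorted"
  rm -f "$baseline_sorted" "$current_sorted"
}
```

For example, save kubectl auth can-i --list --as=system:serviceaccount:production:my-app-sa -n production to a file on each release and compare the previous file with the new one. Any output is a candidate for review before the release proceeds.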

Kubernetes Audit Logging

Audit logs are the answer to: "Who changed that ClusterRoleBinding at 2 AM, and what else did they touch?" Without audit logs you are flying blind during an incident — you cannot determine what credentials were used, which resources were accessed, or whether persistence was established. This is the number-one forensic gap in clusters that security teams encounter after a breach.

The API server supports four audit levels (None, Metadata, Request, and RequestResponse), applied via an audit policy file. The production pattern is to log Metadata for high-volume read paths (reduces storage cost) and RequestResponse for mutation operations (writes, exec, port-forward), while keeping Secret payloads out of the log entirely.

# /etc/kubernetes/audit-policy.yaml — production-grade audit policy
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - RequestReceived   # skip the initial receive event; only log responses
rules:
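  # Log Secret access at Metadata only: records who touched which Secret,
  # but never a request or response body. No verbs filter, so Secret writes
  # match here first and cannot fall through to the RequestResponse rule below.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]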

  # Log all mutations at RequestResponse level (includes request and response body)
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete", "deletecollection"]

  # Log exec/attach/port-forward at RequestResponse — critical for forensics
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach", "pods/portforward"]

  # Log auth-related events at RequestResponse
  - level: RequestResponse
    resources:
      - group: "rbac.authorization.k8s.io"
        resources: ["clusterroles", "clusterrolebindings", "roles", "rolebindings"]

  # Everything else at Metadata (who, what resource, result code — no bodies)
  - level: Metadata

Apply the policy by adding flags to the API server manifest and shipping logs to a centralized SIEM (Splunk, Elastic, or a cloud-native service like AWS CloudWatch or GCP Cloud Audit Logs). The two flags are --audit-log-path=/var/log/kubernetes/audit.log and --audit-policy-file=/etc/kubernetes/audit-policy.yaml. For managed clusters, audit logging is enabled per-provider: on EKS, enable the audit log type in the cluster logging configuration; on GKE, Admin Activity and Data Access audit logs are configurable under Cloud Audit Logs.
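Once the log exists, the "who changed that ClusterRoleBinding" question becomes a filter over JSON lines. The sketch below assumes the log file backend with its default JSON format and the audit.k8s.io/v1 Event fields (verb, user.username, objectRef, requestReceivedTimestamp); rbac_binding_changes is a hypothetical helper name, and jq must be installed.

```shell
#!/usr/bin/env bash
# rbac_binding_changes: read audit events (one JSON object per line) on stdin
# and print timestamp, user, verb, and object name for every write to a
# ClusterRoleBinding. Hypothetical helper name; requires jq.
rbac_binding_changes() {
  jq -r '
    select(.objectRef.resource == "clusterrolebindings")
    | select(.verb == "create" or .verb == "update"
             or .verb == "patch" or .verb == "delete")
    | [.requestReceivedTimestamp, .user.username, .verb, .objectRef.name]
    | @tsv
  '
}
```

On a control-plane node this could be run as rbac_binding_changes < /var/log/kubernetes/audit.log. Filtering the same log by the user name it reports then answers the second half of the question: what else that identity touched.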

Figure: The four cluster hardening pillars. All control-plane traffic flows through the API server, which is protected by RBAC, audited, and backed by an encrypted etcd store, all behind a private network boundary.

etcd Encryption at Rest

etcd is the Kubernetes data store: it contains every Secret, every ConfigMap, every object in the cluster. By default, Kubernetes Secrets are stored in etcd unencrypted; the base64 you see in API output is an encoding, not encryption. Anyone with read access to the etcd data volume (a compromised etcd node, a snapshot restored to a dev machine, or a misconfigured backup bucket) can recover every Secret in the cluster without needing any key.

etcd encryption at rest is configured via an EncryptionConfiguration file, referenced in the API server with --encryption-provider-config. Among the local-key providers, aescbc (AES-CBC with PKCS#7 padding and a 32-byte key, with no separate integrity check) appears in most examples, but the Kubernetes documentation no longer recommends it because CBC is vulnerable to padding-oracle attacks; secretbox (XSalsa20-Poly1305) is authenticated and faster. For production at scale, use the KMS provider (AWS KMS, Google Cloud KMS, Azure Key Vault) so the data-encryption key is itself envelope-encrypted by a key held in the external KMS, and the master key never resides in the cluster.

# /etc/kubernetes/encryption-config.yaml
# AES-CBC provider for Secrets and ConfigMaps
# Generate a 32-byte key: head -c 32 /dev/urandom | base64
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
      - configmaps
    providers:
      - aescbc:
          keys:
            - name: key1
              # Replace with output of: head -c 32 /dev/urandom | base64
              secret: <BASE64_ENCODED_32_BYTE_KEY>
      - identity: {}    # fallback: reads existing unencrypted objects

# After applying the config and restarting the API server,
# force-rewrite ALL existing secrets so they are encrypted at rest:
kubectl get secrets --all-namespaces -o json \
  | kubectl replace -f -

# Verify: read a secret directly from etcd to confirm it is no longer plaintext
# (run on the etcd node — requires etcdctl and the etcd certs)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  get /registry/secrets/production/my-secret | hexdump -C | head
# etcdctl prints the key name first; the value that follows should begin with
# "k8s:enc:aescbc:v1:key1:" and must NOT contain readable Secret data
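The same check can be made scriptable instead of visual. A minimal sketch, assuming the raw stored value arrives on stdin; value_is_encrypted is a hypothetical helper name, and it only tests for the "k8s:enc:" prefix that the API server writes in front of encrypted values.

```shell
#!/usr/bin/env bash
# value_is_encrypted: reads a raw etcd value on stdin and succeeds only if
# it starts with the "k8s:enc:" prefix. Hypothetical helper name.
value_is_encrypted() {
  # Take the first 8 bytes; drop NULs so binary data is safe to compare.
  prefix=$(head -c 8 | tr -d '\000')
  [ "$prefix" = "k8s:enc:" ]
}
```

To feed it only the value, add --print-value-only to the etcdctl get command and pipe the result into value_is_encrypted. The exit status can then drive an alert or a failing pipeline step.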

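Key rotation matters: To rotate, add a second key entry at the top of the keys list inside the provider (new key first, old key second). The API server encrypts writes with the first key and can still decrypt with any key in the list. If you run several API servers, first add the new key as the second entry and restart them all, so every instance can decrypt with it before any instance starts writing with it; then move it to the top and restart again. After forcing a rewrite of all Secrets, remove the old key and restart the API server once more. The idea is similar to managed KMS key rotation, where older key material stays available for decryption while new writes use the new key.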
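The intermediate file used during a rotation, with the new key listed first, can be sketched as follows. The helper name render_rotation_config and the key name key2 are assumptions for illustration; key1 matches the example configuration above, and both arguments are base64-encoded 32-byte keys.

```shell
#!/usr/bin/env bash
# render_rotation_config NEW_KEY_B64 OLD_KEY_B64
# Hypothetical helper: prints an EncryptionConfiguration in which the new key
# is first (used for writes) and the old key second (still used for reads).
render_rotation_config() {
  new_key=$1
  old_key=$2
  cat <<EOF
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
      - configmaps
    providers:
      - aescbc:
          keys:
            - name: key2
              secret: ${new_key}
            - name: key1
              secret: ${old_key}
      - identity: {}
EOF
}
```

A new key can be generated inline, for example render_rotation_config "$(head -c 32 /dev/urandom | base64)" "$OLD_KEY", where OLD_KEY holds the existing key. Treat the rendered file like the key itself and restrict who can read it.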

Putting It Together: The Cluster Hardening Checklist

Each of these controls addresses a distinct failure mode. They compound: RBAC without audit logs means you cannot detect when a misconfigured binding is exploited. Audit logs without etcd encryption mean log entries reference secrets that are stored in plaintext on disk. Run all four together as a cohesive posture, and validate them continuously with tools like kube-bench (CIS benchmark scanner), Trivy (misconfiguration scanner with Kubernetes support), and Falco (runtime rule engine that can alert on suspicious API server activity in real time).

Pro practice — automate the audit: Wire kube-bench into your CI/CD pipeline as a post-deploy step. Set it to fail the pipeline if any FAIL-level CIS checks appear for your cluster profile (EKS, GKE, or generic). This prevents hardening regressions from shipping silently — a misconfigured admission plugin or a new ClusterRoleBinding with wildcard verbs gets caught before it reaches production.
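A pipeline gate of that kind can be as small as counting failing lines. This sketch assumes the plain-text kube-bench report, in which each check is printed on a line starting with its status in square brackets, such as [FAIL]; gate_on_kube_bench is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# gate_on_kube_bench: reads a plain-text kube-bench report on stdin and
# returns non-zero if any check line starts with [FAIL].
# Hypothetical helper name.
gate_on_kube_bench() {
  # grep -c prints 0 and exits 1 when nothing matches; "|| true" keeps the count.
  fails=$(grep -c '^\[FAIL\]' || true)
  if [ "$fails" -gt 0 ]; then
    echo "kube-bench: $fails failing checks"
    return 1
  fi
  echo "kube-bench: no failing checks"
}
```

In a pipeline step this would typically read kube-bench run | gate_on_kube_bench, with whatever benchmark or target options suit your cluster profile. Provider-managed checks may need an explicit exclusion list so they do not block every deploy.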