Compliance & Policy as Code

Gatekeeper & Kyverno

18 min Lesson 5 of 27

Every Kubernetes cluster is a policy enforcement surface. Without guardrails, developers can deploy containers running as root, images pulled from untrusted registries, or workloads with no resource limits that starve other pods. The Kubernetes admission control mechanism was designed for exactly this: every create, update, or delete request to the API server passes through a chain of admission controllers, including admission webhooks, before the object is persisted to etcd. Two projects dominate production policy enforcement at this layer: OPA Gatekeeper (backed by the Open Policy Agent engine, with policies written in Rego) and Kyverno (Kubernetes-native, with policies written as YAML resources). Knowing when to reach for each, and the failure modes of both, is essential when running clusters at scale.

How Kubernetes Admission Works

When kubectl apply hits the API server, the request flows through authentication, authorization (RBAC), and then two classes of admission webhooks: Mutating (can rewrite the object) and Validating (can only allow or deny). Both Gatekeeper and Kyverno register themselves as these webhooks. If the webhook is unavailable and failurePolicy is set to Fail, the request is rejected; this fail-closed behavior is the safe choice for production. If it is set to Ignore, the policy is silently bypassed whenever the webhook is down. Do not assume which one you have: Gatekeeper's default installation fails open (Ignore), while Kyverno's default is Fail, so check the webhook configurations in your cluster and choose deliberately.
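
To make these settings concrete, here is roughly what a validating webhook registration looks like. This is an illustrative sketch, not the object either tool generates: the names example-policy-webhook and policy-system and the /validate path are hypothetical, and both Gatekeeper and Kyverno create and manage their own webhook configurations.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy-webhook          # hypothetical name
webhooks:
  - name: validate.policy.example.com   # hypothetical; must be a fully qualified name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail                 # Fail = reject when the webhook is unreachable; Ignore = admit anyway
    timeoutSeconds: 10                  # how long the API server waits before applying failurePolicy (max 30)
    clientConfig:
      service:
        name: example-policy-webhook    # hypothetical Service in front of the webhook pods
        namespace: policy-system        # hypothetical namespace
        path: /validate                 # hypothetical path
      # caBundle omitted: a real configuration needs the CA that signed the webhook's serving certificate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
    namespaceSelector:                  # skip namespaces the webhook must never block
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system", "policy-system"]
```

Excluding the webhook's own namespace matters with failurePolicy: Fail: otherwise a webhook that is down can block the very pods that would bring it back up.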

[Figure: Kubernetes admission controller request flow. kubectl apply → API server → authn/authz → mutating webhooks → validating webhooks → persisted to etcd. OPA Gatekeeper (ConstraintTemplate + Rego) and Kyverno (ClusterPolicy YAML) plug in at the webhook stage; a denial returns a 403 to the caller. failurePolicy: Fail is the safe setting, Ignore carries a bypass risk.]
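Kubernetes admission request flow: mutating webhooks run first, then validating webhooks. Gatekeeper and Kyverno both register validating webhooks; Kyverno also registers a mutating webhook for its mutate rules, and Gatekeeper registers one when its mutation feature (Assign, AssignMetadata, ModifySet) is enabled.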

OPA Gatekeeper — ConstraintTemplates and Constraints

Gatekeeper extends Kubernetes with two custom resource types. A ConstraintTemplate defines a reusable policy schema backed by a Rego rule. A Constraint is an instance of that template applied to specific resource types and namespaces. The separation means platform teams own templates; product teams can instantiate constraints with different parameters.

Install Gatekeeper with Helm and apply a constraint that blocks containers running as root:

    # Install Gatekeeper
    helm repo add gatekeeper https://open-policy-agent.github.io/gatekeeper/charts
    helm repo update
    helm install gatekeeper gatekeeper/gatekeeper \
      --namespace gatekeeper-system --create-namespace \
      --set replicas=3 \
      --set controllerManager.resources.requests.memory=512Mi

    # ConstraintTemplate: defines the Rego rule and CRD schema
    cat <<'EOF' | kubectl apply -f -
    apiVersion: templates.gatekeeper.sh/v1
    kind: ConstraintTemplate
    metadata:
      name: k8snoroot
    spec:
      crd:
        spec:
          names:
            kind: K8sNoRoot
      targets:
        - target: admission.k8s.gatekeeper.sh
          rego: |
            package k8snoroot

            # Note: these rules inspect spec.containers only. Pod-level
            # securityContext and initContainers are not checked.

            # Violation when a container does not set runAsNonRoot: true
            violation[{"msg": msg}] {
              container := input.review.object.spec.containers[_]
              not container.securityContext.runAsNonRoot
              msg := sprintf("Container %v must set runAsNonRoot: true", [container.name])
            }

            # Violation when a container explicitly requests UID 0
            violation[{"msg": msg}] {
              container := input.review.object.spec.containers[_]
              container.securityContext.runAsUser == 0
              msg := sprintf("Container %v must not run as UID 0", [container.name])
            }
    EOF

    # Gatekeeper creates the K8sNoRoot CRD asynchronously. If the next apply
    # fails with "no matches for kind", wait a few seconds and retry.

    # Constraint: enforce that template on all pods in non-system namespaces
    cat <<'EOF' | kubectl apply -f -
    apiVersion: constraints.gatekeeper.sh/v1beta1
    kind: K8sNoRoot
    metadata:
      name: deny-root-containers
    spec:
      enforcementAction: deny
      match:
        kinds:
          - apiGroups: [""]
            kinds: ["Pod"]
        excludedNamespaces:
          - kube-system
          - gatekeeper-system
    EOF

enforcementAction modes: deny blocks the request; warn allows it but returns a warning in the API response (useful during rollout); dryrun records violations without blocking. Recorded violations are visible under status.violations in the output of kubectl get k8snoroot deny-root-containers -o yaml. Gatekeeper caps that list per constraint (20 entries by default), so read status.totalViolations for the full count. Always start with dryrun in production, confirm that the violation count drops to zero, then switch to deny.
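
As a sketch of that rollout, a second constraint can instantiate the same K8sNoRoot template in dryrun mode for a single namespace before anything is enforced. The constraint name and the payments namespace are hypothetical.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sNoRoot
metadata:
  name: audit-root-containers-payments   # hypothetical name
spec:
  enforcementAction: dryrun              # record violations, never block
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["payments"]             # hypothetical namespace being onboarded
```

Because constraints are separate objects from the template, each team or namespace can move from dryrun to warn to deny on its own schedule.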

Kyverno — Kubernetes-Native Policy

Kyverno treats Kubernetes resources as the policy language itself. Policies are YAML documents that pattern-match on incoming resource manifests using match/exclude blocks, then apply validate, mutate, or generate rules. Engineers who already know Kubernetes manifests can read and write Kyverno policies without learning Rego, which lowers the barrier to contributing and reviewing policy beyond the platform team.

    # Install Kyverno
    helm repo add kyverno https://kyverno.github.io/kyverno/
    helm repo update
    # Chart 3.x sets admission controller replicas as shown; chart 2.x used --set replicaCount=3
    helm install kyverno kyverno/kyverno \
      --namespace kyverno --create-namespace \
      --set admissionController.replicas=3

    # ClusterPolicy: validate that all pods set resource limits
    cat <<'EOF' | kubectl apply -f -
    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: require-resource-limits
      annotations:
        policies.kyverno.io/title: Require Resource Limits
        policies.kyverno.io/severity: medium
    spec:
      validationFailureAction: Enforce   # Audit = allow the request and only report the violation
      background: true                   # also scan existing resources
      rules:
        - name: check-limits
          match:
            any:
              - resources:
                  kinds: [Pod]
          exclude:
            any:
              - resources:
                  namespaces: [kube-system, kyverno]
          validate:
            message: "All containers must define CPU and memory limits."
            pattern:
              spec:
                containers:
                  - name: "*"
                    resources:
                      limits:
                        memory: "?*"
                        cpu: "?*"
    EOF

    # Kyverno mutation: auto-inject a default security context if absent
    cat <<'EOF' | kubectl apply -f -
    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: add-default-securitycontext
    spec:
      rules:
        - name: add-secctx
          match:
            any:
              - resources:
                  kinds: [Pod]
          mutate:
            patchStrategicMerge:
              spec:
                containers:
                  - (name): "*"
                    securityContext:
                      +(runAsNonRoot): true
                      +(allowPrivilegeEscalation): false
                      +(readOnlyRootFilesystem): true
    EOF

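The +(field) syntax in Kyverno mutation means "add this field only if it does not already exist", so it will not override explicit values set by the developer. Apps that already configure their security contexts are left untouched. Apps that omit these fields do receive the defaults, and a workload that relied on running as root or writing to its root filesystem can fail to start once they are injected, so run an Audit policy first to see which workloads are affected before applying the mutation fleet-wide.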
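
A small before/after sketch of that behavior, assuming the add-default-securitycontext policy above is installed. The pod, container names, and images are hypothetical, and fields defaulted by the API server are omitted from the result.

```yaml
# Incoming Pod: "legacy" sets runAsNonRoot explicitly, "api" sets nothing
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
    - name: legacy
      image: registry.example.com/legacy:1.0
      securityContext:
        runAsNonRoot: false
    - name: api
      image: registry.example.com/api:1.0
---
# Pod as admitted after the mutation
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
    - name: legacy
      image: registry.example.com/legacy:1.0
      securityContext:
        runAsNonRoot: false               # explicit value kept, not overridden
        allowPrivilegeEscalation: false   # added because it was absent
        readOnlyRootFilesystem: true      # added because it was absent
    - name: api
      image: registry.example.com/api:1.0
      securityContext:                    # whole block added
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
```

Note that the "legacy" container still ends up with a read-only root filesystem, which is exactly the kind of side effect worth checking before a fleet-wide rollout.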

Gatekeeper vs Kyverno — When to Choose Each

Both are CNCF projects and both are production-grade. The right choice depends on your team and use case:

  • Choose Gatekeeper if you already use OPA in your stack (Terraform, application authorization), want a single policy language across all enforcement points, or need highly complex logic that benefits from Rego's datalog-style evaluation. The constraint template / constraint split also maps well to a platform-team-owns-policy, product-team-instantiates model.
  • Choose Kyverno if your platform team wants policies that any Kubernetes-literate engineer can read and modify, or if you need built-in mutation and generate capabilities without writing a separate mutating webhook. Kyverno also ships a policy library of 200+ ready-to-use policies at kyverno.io/policies.

Production HA setup: both tools require multiple replicas behind a Kubernetes Service. A single-replica policy webhook is a cluster-wide single point of failure: if the pod crashes during a deployment and failurePolicy: Fail is set, every pod creation that the webhook matches is blocked. Run at least 3 replicas spread across nodes with podAntiAffinity, and use a PodDisruptionBudget with minAvailable: 2 to protect against node drains. Both Helm charts expose settings for replicas, affinity, and disruption budgets, so prefer configuring these through chart values over hand-written objects.
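
The HA guidance above can be sketched with two plain Kubernetes objects. The app: policy-webhook label and the policy-system namespace are hypothetical placeholders; the real pod labels depend on the chart and version you install, so check them with kubectl get pods --show-labels first.

```yaml
# PodDisruptionBudget: a node drain may evict at most one of three webhook pods
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: policy-webhook-pdb        # hypothetical name
  namespace: policy-system        # hypothetical namespace
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: policy-webhook         # hypothetical label; must match the webhook pods
---
# Fragment of the webhook Deployment's pod template (spec.template.spec):
# never schedule two webhook replicas on the same node
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: policy-webhook   # hypothetical label
        topologyKey: kubernetes.io/hostname
```

The required form of anti-affinity leaves a replica Pending when there are fewer eligible nodes than replicas; preferredDuringSchedulingIgnoredDuringExecution is the softer alternative for small clusters.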

Audit Mode and Policy Exceptions

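Both tools support scanning existing resources (not just new ones) against policies. In Gatekeeper, the audit controller periodically evaluates existing resources against every constraint regardless of its enforcementAction; set enforcementAction: dryrun so that nothing is blocked, and query violations with kubectl get constraint -o jsonpath='{.items[*].status.violations}'. In Kyverno, set validationFailureAction: Audit with background: true and check PolicyReport resources (a report format defined by the Kubernetes Policy Working Group): kubectl get polr -A for namespaced reports and kubectl get cpolr for cluster-scoped ones.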

When a specific workload legitimately needs an exception, such as a legacy app that truly must run as root, both tools provide scoped exclusion mechanisms. In Gatekeeper, add the namespace to the constraint's spec.match.excludedNamespaces or narrow the match with spec.match.labelSelector. In Kyverno, use the exclude block with a label selector; Kyverno also offers a dedicated PolicyException resource, which may need to be enabled depending on the version. Document every exception in a comment in the policy YAML, commit it to git, and review exceptions quarterly, because unchecked exceptions accumulate into a compliance liability.
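
A sketch of a label-scoped exception in each tool, reusing the deny-root-containers constraint from earlier in the lesson. The policy.example.com/allow-root label and the ticket reference are hypothetical.

```yaml
# Gatekeeper: the constraint only matches pods WITHOUT the exception label
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sNoRoot
metadata:
  name: deny-root-containers
spec:
  enforcementAction: deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces: ["kube-system", "gatekeeper-system"]
    # Exception: legacy app must run as root (ticket ID hypothetical: SEC-0000)
    labelSelector:
      matchExpressions:
        - key: policy.example.com/allow-root   # hypothetical label
          operator: DoesNotExist
---
# Kyverno: fragment of a rule, placed next to its match block
exclude:
  any:
    - resources:
        kinds: [Pod]
        selector:
          matchLabels:
            policy.example.com/allow-root: "true"   # hypothetical label
```

A label-based exception is only as strong as the controls on who can set that label, so it is worth pairing with RBAC or a second policy that restricts where the label may appear.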

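Webhook timeout is a production outage vector: Kubernetes defaults admission webhook timeouts to 10 seconds (the maximum is 30). Kyverno uses 10 seconds by default, while Gatekeeper's Helm chart sets a shorter timeout (3 seconds for its validating webhook), so check timeoutSeconds on the webhook configurations in your cluster. If the webhook pods are under heavy load and response time exceeds this threshold, the API server treats it as a failure and applies failurePolicy. In a cluster under load during a deploy, this can cascade: new pods cannot start, which increases load on existing pods, which increases webhook latency further. Monitor webhook p99 latency using the duration histograms (gatekeeper_validation_request_duration_seconds for Gatekeeper, kyverno_admission_review_duration_seconds for Kyverno, or apiserver_admission_webhook_admission_duration_seconds as seen from the API server), and size CPU requests and limits so that webhook pods are never throttled.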
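
One way to watch for this is an alert on webhook latency as measured by the API server, which covers both tools at once. The following Prometheus rule file is a minimal sketch: it assumes Prometheus already scrapes the API server's metrics, and the 1-second threshold, the 10-minute window, and the alert name are hypothetical values to tune against your configured timeout.

```yaml
groups:
  - name: admission-webhook-latency
    rules:
      - alert: AdmissionWebhookSlow            # hypothetical alert name
        # p99 latency per webhook, computed from the API server's histogram
        expr: |
          histogram_quantile(0.99,
            sum by (name, le) (
              rate(apiserver_admission_webhook_admission_duration_seconds_bucket[5m])
            )
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Admission webhook {{ $labels.name }} p99 latency is above 1s"
```

Alerting well below the configured timeoutSeconds gives time to add replicas or CPU before the API server starts applying failurePolicy.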
