Compliance & Policy as Code

Drift Detection & Continuous Compliance

Policy as Code defines what your infrastructure should look like. Drift detection answers the harder question: does it still look that way, right now? In production, the gap between desired state and actual state widens constantly — engineers apply manual hotfixes, cloud providers update default settings, auto-scaling creates resources that were never in Terraform, and vendor APIs silently change behavior. Continuous compliance closes this loop by detecting, alerting, and optionally remediating drift before it becomes a security finding or an audit failure.

What Drift Is and Why It Happens

Drift is any deviation between the declared desired state (Terraform configuration, Kubernetes manifest, golden AMI, OPA policy) and the live resource. Three root causes recur in production environments:

Emergency changes: An on-call engineer applies a manual AWS Console change at 2 AM to stop a production outage. The change is never codified.
Automation side-effects: Auto-scaling, Lambda warm-up, Kubernetes node auto-provisioner, and RDS parameter group updates all create or mutate resources outside of IaC control.
External mutation: A cloud provider deprecates a TLS version, changes a default encryption setting, or rotates underlying hardware, altering the effective state of your resource without any action on your part.

Drift is not only an infrastructure problem; it is a compliance problem. Audit frameworks such as SOC 2, ISO 27001, and PCI DSS expect your environment to match your documented controls. Silent drift means your attestations no longer describe the running system, which is arguably worse than having no attestation at all because it creates false assurance.

Terraform Drift Detection in Practice

Terraform's terraform plan is the simplest drift detector for IaC-managed resources. It changes nothing: it refreshes its view of each managed resource through the provider API and reports the difference between your configuration and what actually exists. In CI/CD pipelines, the idiomatic pattern is a scheduled drift plan job that uses -detailed-exitcode and alerts on exit code 2 (changes present, meaning drift). Exit code 1 means the plan itself failed and should fail the job rather than be reported as drift.

# .github/workflows/drift-detection.yml
name: Terraform Drift Detection

on:
  schedule:
    - cron: '0 */4 * * *'   # every 4 hours
  workflow_dispatch:

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # OIDC — no long-lived credentials
      contents: read

    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/TerraformReadOnly
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.8.0

      - name: Terraform Init
        run: terraform init -input=false
        working-directory: ./infra

      - name: Detect Drift
        id: plan
        # exit 0 = no changes, 1 = error, 2 = drift detected
        run: |
          set +e
          terraform plan -detailed-exitcode -input=false -out=drift.tfplan
          echo "exit_code=$?" >> "$GITHUB_OUTPUT"
        working-directory: ./infra

      - name: Fail on Plan Error
        # Without this step a failed plan (exit 1) would look like a clean run
        if: steps.plan.outputs.exit_code == '1'
        run: exit 1

      - name: Alert on Drift
        if: steps.plan.outputs.exit_code == '2'
        run: |
          curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-type: application/json' \
            --data '{"text":":rotating_light: Terraform drift detected in prod. Review: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"}'
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK_URL }}

Never run terraform apply automatically as a response to drift. First, verify that the drift is unintentional — it might be a legitimate out-of-band change that should be imported, not overwritten. Blind auto-apply can delete production resources that were intentionally created outside IaC scope.
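To judge whether drift is intentional, a reviewer first needs to see which resources the plan would touch. The sketch below assumes the saved plan has been exported with terraform show -json drift.tfplan and groups resource addresses by proposed action, using the resource_changes entries of that JSON. The function name summarize_plan and the sample data are illustrative and not part of the workflow above.

```python
# Sketch: group the resource addresses in a Terraform plan by proposed action.
# Input is the parsed output of `terraform show -json <planfile>`.


def summarize_plan(plan: dict) -> dict:
    """Return {action: [resource addresses]} for every non-trivial change."""
    summary = {}
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing would change for this resource
        # Terraform encodes a replacement as a delete/create pair (either order).
        key = "replace" if set(actions) == {"create", "delete"} else actions[0]
        summary.setdefault(key, []).append(rc["address"])
    return summary


# Hypothetical sample data in the shape of the plan JSON.
sample_plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
        {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
        {"address": "aws_instance.app", "change": {"actions": ["delete", "create"]}},
    ]
}

print(summarize_plan(sample_plan))
```

Posting this summary alongside the alert lets the on-call engineer see at a glance whether the drift is a small in-place update or something that would replace or delete a resource.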

AWS Config for Continuous Resource Compliance

Terraform drift detection only covers resources that Terraform manages. AWS Config covers the supported resource types its configuration recorder is set to track, regardless of how they were created: by other tools, through the AWS Console, or by AWS itself. Config continuously records configuration changes and evaluates them against rules you define. When a resource falls out of compliance, Config marks it NON_COMPLIANT and emits a compliance-change event that can route to Security Hub, SNS, or a Lambda remediation function.

Continuous compliance pipeline: AWS Config evaluates live resources against rules, routes findings to Security Hub, triggers EventBridge for auto-remediation via Lambda, and pages on-call teams.
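The routing step in that pipeline can be sketched as a small pure function. This is a minimal illustration, assuming the event arrives from EventBridge with the detail-type "Config Rules Compliance Change" and the field names shown in the sample; verify them against a real event from your account. The RULE_ACTIONS mapping, the action names, and the sample IDs are hypothetical.

```python
# Sketch: decide what to do with an AWS Config compliance-change event.
# The rule-to-action mapping is hypothetical; adapt it to your own rules.

RULE_ACTIONS = {
    "encrypted-volumes": "remediate",  # safe to hand to automation
    "required-tags": "ticket",         # needs a human to pick the right owner
}
DEFAULT_ACTION = "page"  # unmapped rules go to on-call so nothing is dropped


def route_config_event(event: dict) -> str:
    """Return 'ignore', 'remediate', 'ticket', or 'page' for one event."""
    if event.get("detail-type") != "Config Rules Compliance Change":
        return "ignore"
    detail = event.get("detail", {})
    result = detail.get("newEvaluationResult", {})
    if result.get("complianceType") != "NON_COMPLIANT":
        return "ignore"  # resource is compliant or the rule does not apply
    return RULE_ACTIONS.get(detail.get("configRuleName"), DEFAULT_ACTION)


# Hypothetical sample event, trimmed to the fields the function reads.
sample_event = {
    "source": "aws.config",
    "detail-type": "Config Rules Compliance Change",
    "detail": {
        "configRuleName": "encrypted-volumes",
        "resourceType": "AWS::EC2::Volume",
        "resourceId": "vol-0123456789abcdef0",
        "newEvaluationResult": {"complianceType": "NON_COMPLIANT"},
    },
}

print(route_config_event(sample_event))
```

Keeping the decision in a pure function makes the routing policy easy to unit test before it is wired into a Lambda handler.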

Two widely used AWS Config rules at scale are required-tags (checks that resources of supported types carry the tag keys you specify, which supports cost allocation and ownership tracking) and encrypted-volumes (flags attached EBS volumes that are not encrypted). Both are AWS-managed rules that require zero Lambda authoring.

# Terraform: deploy a Config rule with auto-remediation via SSM Automation
resource "aws_config_config_rule" "encrypted_volumes" {
  name = "encrypted-volumes"

  source {
    owner             = "AWS"
    source_identifier = "ENCRYPTED_VOLUMES"
  }

  depends_on = [aws_config_configuration_recorder.main]
}

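# Remediation: when NON_COMPLIANT, invoke an SSM Automation document.
# NOTE: an existing EBS volume cannot be encrypted in place; a runbook has to
# snapshot it, copy the snapshot with encryption, and swap in a new volume.
# Confirm the document named below exists in your account and Region, or point
# target_id at your own Automation document.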
resource "aws_config_remediation_configuration" "encrypt_volumes" {
  config_rule_name = aws_config_config_rule.encrypted_volumes.name
  target_type      = "SSM_DOCUMENT"
  target_id        = "AWSConfigRemediation-EncryptUnencryptedVolume"

  automatic                  = true
  maximum_automatic_attempts = 3
  retry_attempt_seconds      = 60

  parameter {
    name           = "AutomationAssumeRole"
    static_value   = aws_iam_role.config_remediation.arn
  }
  parameter {
    name           = "VolumeId"
    resource_value = "RESOURCE_ID"
  }
}

Kubernetes: Reconciliation as Drift Control

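In Kubernetes, drift is handled differently because controllers continuously reconcile live state toward declared state. If you manually kubectl edit a Deployment that is managed by Flux or Argo CD, the GitOps controller detects the discrepancy and can revert the resource to the state declared in Git. How quickly that happens depends on configuration: Flux reapplies on each reconciliation interval you set, and Argo CD reverts live changes only when automated sync with self-heal is enabled (otherwise it just marks the application OutOfSync). With those settings in place, GitOps is the strongest drift-control mechanism for Kubernetes workloads.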
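The reconcile-or-report behavior can be illustrated with a toy model. This is a conceptual sketch only, not how Flux or Argo CD are implemented: it treats a resource as a flat dictionary, compares only the fields declared in Git, and either reverts them or just reports the difference. The names diff_state, reconcile, and self_heal are hypothetical.

```python
# Toy model of GitOps reconciliation: compare declared fields with live state,
# then either revert the drift (self-heal) or only report it.


def diff_state(desired: dict, live: dict) -> dict:
    """Return {field: (desired_value, live_value)} for each drifted field.

    Only fields present in `desired` are compared, so server-populated
    fields in `live` (such as a UID) are not treated as drift.
    """
    return {
        key: (value, live.get(key))
        for key, value in desired.items()
        if live.get(key) != value
    }


def reconcile(desired: dict, live: dict, self_heal: bool):
    """Return (new_live_state, drift). Reverts drift only when self_heal is set."""
    drift = diff_state(desired, live)
    if drift and self_heal:
        return {**live, **desired}, drift  # declared fields overwrite live ones
    return dict(live), drift


desired = {"replicas": 3, "image": "web:1.4.2"}
live = {"replicas": 5, "image": "web:1.4.2", "uid": "abc-123"}  # manual scale-up

healed, drift = reconcile(desired, live, self_heal=True)
print(drift)   # which fields drifted
print(healed)  # live state after the revert

reported, _ = reconcile(desired, live, self_heal=False)
print(reported)  # unchanged: drift is reported but not reverted
```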

For cases where drift must be detected rather than automatically reverted (e.g., RBAC or NetworkPolicy objects that someone deleted manually), use kubectl diff -f manifests/ in a scheduled CI job to compare live cluster state against the manifests in Git.

#!/bin/bash
# Scheduled drift check for Kubernetes: compare Git manifests with the live cluster.
# Run via a CronJob in a management cluster or a CI pipeline.
# Expects SLACK_WEBHOOK to be set in the environment.
set -e

KUBECONFIG=/run/secrets/kubeconfig
MANIFEST_DIR=./k8s/production

echo "=== Kubernetes Drift Report $(date -u) ==="

# kubectl diff exits 0 if identical, 1 if there are differences, >1 on error.
# The pipeline's status is tee's, so set -e does not abort here; the real
# kubectl exit code is read from PIPESTATUS.
kubectl --kubeconfig="$KUBECONFIG" diff -f "$MANIFEST_DIR" | tee /tmp/drift-report.txt
EXIT_CODE=${PIPESTATUS[0]}

case "$EXIT_CODE" in
  0)
    echo "No drift detected."
    ;;
  1)
    echo "DRIFT DETECTED: sending alert"
    # Ship to your alerting pipeline
    curl -s -X POST "$SLACK_WEBHOOK" \
      -H 'Content-type: application/json' \
      -d "{\"text\":\"⚠️ Kubernetes drift detected in *production*. Review drift-report artifact.\"}"
    exit 1
    ;;
  *)
    # Do not report "no drift" when the comparison itself failed
    echo "kubectl diff failed with exit code $EXIT_CODE" >&2
    exit "$EXIT_CODE"
    ;;
esac

Some SRE teams treat every unreconciled diff as a severity-3 incident that must be closed within 24 hours. The discipline is not the tooling; it is the cultural expectation that a diverged environment is an active risk, not background noise. Bake that expectation into your on-call runbooks.

Continuous Compliance Reporting

Drift detection without reporting is incomplete. Auditors need to see a time-series record: when was the environment last checked, what was found, who remediated it, and how long did the non-compliant state persist? Build this into your pipeline from day one:

Compliance score over time: Track the ratio of compliant-to-total resources in a time-series metric. AWS Config's compliance timeline and Security Hub's findings trends expose this natively. For Kubernetes, tools like Falco and Polaris export Prometheus metrics you can graph in Grafana.
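Mean Time to Remediation (MTTR): Track the average time from a finding opening to its closure, and alert when any single finding stays open longer than your SLA (e.g., critical = 24 h, high = 72 h). Auditors commonly ask for this evidence in SOC 2 Type II reviews, which assess controls over a period rather than at a point in time.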
Change traceability: Every remediation action must be tied to a ticket (Jira, Linear) and linked to the specific finding. Lambda remediation functions should write a structured log record with the resource ID, finding ID, action taken, and timestamp to CloudWatch Logs, where it can be queried with CloudWatch Logs Insights.
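The three reporting items above reduce to a few small calculations. The sketch below is one possible shape, assuming findings are available as dictionaries with id, severity, opened_at, and an optional closed_at; those field names, the function names, the ticket ID, and the sample data are all hypothetical. The SLA values are the examples from the list.

```python
# Sketch: compliance score, MTTR, SLA breaches, and a structured remediation log.
import json
from datetime import datetime, timedelta, timezone

SLA = {"critical": timedelta(hours=24), "high": timedelta(hours=72)}


def compliance_score(compliant: int, total: int) -> float:
    """Ratio of compliant resources to total resources (1.0 when there are none)."""
    return 1.0 if total == 0 else compliant / total


def mean_time_to_remediate(findings):
    """Average open-to-close duration over closed findings, or None if none closed."""
    durations = [f["closed_at"] - f["opened_at"] for f in findings if f.get("closed_at")]
    return sum(durations, timedelta()) / len(durations) if durations else None


def sla_breaches(findings, now):
    """IDs of findings still open for longer than the SLA for their severity."""
    return [
        f["id"]
        for f in findings
        if not f.get("closed_at")
        and f["severity"] in SLA
        and now - f["opened_at"] > SLA[f["severity"]]
    ]


def remediation_log_record(resource_id, finding_id, action, ticket, when):
    """One JSON line per remediation; stdout from Lambda lands in CloudWatch Logs."""
    return json.dumps(
        {"resource_id": resource_id, "finding_id": finding_id, "action": action,
         "ticket": ticket, "timestamp": when.isoformat()},
        sort_keys=True,
    )


def utc(day, hour):
    return datetime(2024, 1, day, hour, tzinfo=timezone.utc)


findings = [
    {"id": "f-1", "severity": "critical", "opened_at": utc(8, 12), "closed_at": utc(8, 18)},
    {"id": "f-2", "severity": "high", "opened_at": utc(5, 12), "closed_at": utc(6, 12)},
    {"id": "f-3", "severity": "critical", "opened_at": utc(9, 6)},   # open for 30 h
    {"id": "f-4", "severity": "high", "opened_at": utc(9, 12)},      # open for 24 h
]
now = utc(10, 12)

print(compliance_score(3, 4))
print(mean_time_to_remediate(findings))
print(sla_breaches(findings, now))
print(remediation_log_record("vol-0123456789abcdef0", "f-3", "encrypt-volume", "OPS-1234", now))
```

Emitting the score and MTTR on every scheduled run, rather than computing them at audit time, is what produces the time-series record auditors ask for.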

Continuous compliance is a feedback loop, not a point-in-time gate. The goal is to reduce the dwell time of non-compliant state — the window between drift occurring and it being remediated — to minutes rather than days. That is the metric that matters to auditors and to your security posture.