Drift Detection & Continuous Compliance
Policy as Code defines what your infrastructure should look like. Drift detection answers the harder question: does it still look that way, right now? In production, the gap between desired state and actual state widens constantly — engineers apply manual hotfixes, cloud providers update default settings, auto-scaling creates resources that were never in Terraform, and vendor APIs silently change behavior. Continuous compliance closes this loop by detecting, alerting, and optionally remediating drift before it becomes a security finding or an audit failure.
What Drift Is and Why It Happens
Drift is any deviation between the declared desired state (Terraform state file, Kubernetes manifest, golden AMI, OPA policy) and the live resource. There are three root causes you will see repeatedly in production environments:
- Emergency changes: An on-call engineer applies a manual AWS Console change at 2 AM to stop a production outage. The change is never codified.
- Automation side-effects: Auto-scaling, Lambda warm-up, Kubernetes node auto-provisioner, and RDS parameter group updates all create or mutate resources outside of IaC control.
- External mutation: A cloud provider deprecates a TLS version, changes a default encryption setting, or rotates underlying hardware, altering the effective state of your resource without any action on your part.
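To make the definition concrete, here is a minimal sketch that treats both the declared state and the live resource as flat dictionaries and reports every attribute that differs. The function name detect_drift and the security-group-style attributes are hypothetical, chosen only for illustration; real tools compare much richer, nested state.

```python
from typing import Any, Dict, Tuple


def detect_drift(
    desired: Dict[str, Any], actual: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {attribute: (desired_value, actual_value)} for every mismatch.

    An attribute missing on one side is reported with None on that side,
    so an absent key and an explicit None are treated the same here.
    """
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: a manual hotfix opened the ingress range and
# added a description that was never codified.
desired_state = {"ingress_cidr": "10.0.0.0/8", "port": 443, "encrypted": True}
live_state = {
    "ingress_cidr": "0.0.0.0/0",
    "port": 443,
    "encrypted": True,
    "description": "hotfix",
}

for attribute, (want, have) in detect_drift(desired_state, live_state).items():
    print(f"{attribute}: declared={want!r} live={have!r}")
```

Each of the three root causes above shows up the same way in such a comparison: the live value no longer matches the declared one, whoever or whatever changed it.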
Terraform Drift Detection in Practice
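Terraform's terraform plan is the simplest drift detector for IaC-managed resources. A plan changes nothing: it refreshes state from the live provider API and reports the difference between your configuration and what actually exists (add -refresh-only to compare only the recorded state against reality). In CI/CD pipelines, the idiomatic pattern is a scheduled drift job that runs the plan with -detailed-exitcode, which returns 0 when nothing has changed, 1 on error, and 2 when changes are present. Alert on exit code 2 as drift, and treat exit code 1 as a failed check rather than as drift.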
Do not run terraform apply automatically as a response to drift. First, verify that the drift is unintentional: it might be a legitimate out-of-band change that should be imported or codified, not overwritten. Blind auto-apply can revert intentional changes or destroy and recreate production resources that were deliberately modified outside IaC.
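The scheduled drift job described above could be sketched roughly as follows. This is an illustration under assumptions, not a definitive pipeline: it assumes terraform is on the PATH and that terraform init has already run in the working directory, and the helper names (classify_plan_exit, run_drift_check, alert_message) are hypothetical. It only reports; it never applies.

```python
import subprocess
from pathlib import Path
from typing import Optional, Tuple

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_NO_CHANGES, PLAN_ERROR, PLAN_CHANGES = 0, 1, 2


def classify_plan_exit(code: int) -> str:
    """Map a plan exit code to 'in_sync', 'drift', or 'error'."""
    if code == PLAN_NO_CHANGES:
        return "in_sync"
    if code == PLAN_CHANGES:
        return "drift"
    return "error"  # exit code 1, or anything unexpected


def run_drift_check(workdir: Path) -> Tuple[str, str]:
    """Run a plan (no apply) and return (status, combined output)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=False,  # non-zero exit codes are expected and handled below
    )
    return classify_plan_exit(result.returncode), result.stdout + result.stderr


def alert_message(status: str, workdir: Path) -> Optional[str]:
    """Build the alert text, or None when there is nothing to report."""
    if status == "drift":
        return f"Drift detected in {workdir}: review the plan before applying."
    if status == "error":
        return f"Drift check failed in {workdir}: the plan did not complete."
    return None
```

A scheduler (cron, or a scheduled CI pipeline) would call run_drift_check for each workspace and pass any non-None alert_message to whatever notification channel the team uses. Keeping drift and error as separate statuses prevents an expired credential from being reported as drift.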
AWS Config for Continuous Resource Compliance
Terraform drift detection only covers resources that Terraform manages. AWS Config has a wider view: in every region where its configuration recorder is enabled, it records the supported resource types regardless of whether they were created by other tools, by the AWS Console, or by AWS itself. Config continuously records configuration changes and evaluates them against rules you define. When a resource falls out of compliance, Config marks it NON_COMPLIANT, and that result can be routed to Security Hub, sent as an SNS notification, or used to trigger a Lambda remediation function.
Two AWS Config rules that pay off quickly at scale are required-tags (enforce cost allocation and ownership tagging on the resource types the rule supports) and encrypted-volumes (detect attached EBS volumes that are not encrypted). Both are AWS-managed rules that require zero Lambda authoring.
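As a rough sketch, the two managed rules could be enabled with boto3's put_config_rule call. The tag keys (owner, cost-center), the rule names, and the scoped resource types below are assumptions chosen for illustration, and the helper names managed_rule and deploy are hypothetical. The sketch also assumes a configuration recorder is already running in the target region.

```python
import json
from typing import Dict, List, Optional


def managed_rule(
    name: str,
    identifier: str,
    resource_types: List[str],
    params: Optional[Dict[str, str]] = None,
) -> dict:
    """Build the ConfigRule structure for an AWS-managed rule."""
    rule = {
        "ConfigRuleName": name,
        "Source": {"Owner": "AWS", "SourceIdentifier": identifier},
        "Scope": {"ComplianceResourceTypes": resource_types},
    }
    if params:
        # Config expects rule parameters as a JSON-encoded string.
        rule["InputParameters"] = json.dumps(params)
    return rule


RULES = [
    managed_rule(
        "required-tags",
        "REQUIRED_TAGS",
        ["AWS::EC2::Instance", "AWS::S3::Bucket"],  # assumed scope
        {"tag1Key": "owner", "tag2Key": "cost-center"},  # assumed tag keys
    ),
    managed_rule("encrypted-volumes", "ENCRYPTED_VOLUMES", ["AWS::EC2::Volume"]),
]


def deploy(rules: List[dict]) -> None:
    """Create or update each rule in the current account and region."""
    import boto3  # imported lazily so the builders work without AWS access

    client = boto3.client("config")
    for rule in rules:
        client.put_config_rule(ConfigRule=rule)
```

Calling deploy(RULES) with suitable credentials would register both rules. In practice many teams declare the same rules in Terraform or CloudFormation so that the compliance rules themselves are under version control.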
Kubernetes: Reconciliation as Drift Control
In Kubernetes, drift is handled differently because the control plane continuously reconciles. If you manually kubectl edit a Deployment that is managed by Flux or ArgoCD, the GitOps controller will detect the discrepancy on its next sync cycle and revert the resource to the state declared in Git. How quickly this happens depends on the configured sync interval, typically somewhere between seconds and a few minutes, and in ArgoCD the automatic revert only happens when automated sync with self-heal is enabled. This makes GitOps one of the strongest drift-control mechanisms for Kubernetes workloads.
For cases where drift must be detected rather than automatically reverted (e.g., RBAC or NetworkPolicy objects that someone deleted manually), use kubectl diff -f manifests/ in a scheduled CI job to compare live cluster state against the manifests in Git.
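A scheduled job wrapping kubectl diff might look like the sketch below. It relies on the exit codes kubectl diff documents (0 for no differences, 1 for differences found, greater than 1 for an error) and assumes kubectl is on the PATH with a context pointing at the target cluster. The helper names are hypothetical.

```python
import subprocess
from typing import Tuple


def classify_diff_exit(code: int) -> str:
    """Map a `kubectl diff` exit code to 'in_sync', 'drift', or 'error'."""
    if code == 0:
        return "in_sync"
    if code == 1:
        return "drift"
    return "error"  # >1 means kubectl or the diff program itself failed


def run_kubectl_diff(manifest_dir: str) -> Tuple[str, str]:
    """Compare live cluster state with the manifests; never applies anything."""
    result = subprocess.run(
        ["kubectl", "diff", "-f", manifest_dir],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 is the normal "drift found" signal
    )
    # On success or drift the diff is on stdout; on error the cause is on stderr.
    detail = result.stdout if result.returncode in (0, 1) else result.stderr
    return classify_diff_exit(result.returncode), detail
```

An object that was deleted from the cluster but still exists in Git appears in the diff as a resource to be created, so the manually deleted RBAC or NetworkPolicy case is reported as drift rather than silently ignored.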
Continuous Compliance Reporting
Drift detection without reporting is incomplete. Auditors need to see a time-series record: when was the environment last checked, what was found, who remediated it, and how long did the non-compliant state persist? Build this into your pipeline from day one:
- Compliance score over time: Track the ratio of compliant to total resources as a time-series metric. AWS Config's compliance timeline and Security Hub's findings trends expose this natively. For Kubernetes, findings from tools like Falco and Polaris can be exported as Prometheus metrics and graphed in Grafana.
- Mean Time to Remediation (MTTR): Track the average time from detection to fix, and alert when any single finding stays open longer than your SLA (e.g., critical = 24 h, high = 72 h). This is a key metric for SOC 2 Type II audits.
- Change traceability: Every remediation action must be tied to a ticket (Jira, Linear) and linked to the specific finding. Lambda remediation functions should write a structured log record with the resource ID, finding ID, action taken, and timestamp to CloudWatch Logs, where it can be queried with CloudWatch Logs Insights.
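The three reporting items above can be sketched in a few small functions. The Finding type, the field names, and the SLA table are assumptions for illustration (the SLA values simply reuse the 24 h and 72 h examples from the list); a real pipeline would populate findings from AWS Config, Security Hub, or a cluster scanner.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Example SLA thresholds per severity (assumed values).
SLA = {"critical": timedelta(hours=24), "high": timedelta(hours=72)}


@dataclass
class Finding:
    finding_id: str
    resource_id: str
    severity: str
    opened_at: datetime
    closed_at: Optional[datetime] = None  # None while still open


def compliance_score(compliant: int, total: int) -> float:
    """Ratio of compliant to total resources; an empty inventory scores 1.0."""
    return 1.0 if total == 0 else compliant / total


def mean_time_to_remediation(findings: List[Finding]) -> Optional[timedelta]:
    """Average open-to-close duration over remediated findings only."""
    durations = [f.closed_at - f.opened_at for f in findings if f.closed_at]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def sla_breaches(findings: List[Finding], now: datetime) -> List[Finding]:
    """Open findings that have outlived the SLA for their severity."""
    return [
        f
        for f in findings
        if f.closed_at is None
        and f.severity in SLA
        and now - f.opened_at > SLA[f.severity]
    ]


def remediation_record(f: Finding, action: str, ticket: str, at: datetime) -> str:
    """One JSON log line tying a remediation to its finding and ticket."""
    return json.dumps(
        {
            "finding_id": f.finding_id,
            "resource_id": f.resource_id,
            "action": action,
            "ticket": ticket,
            "timestamp": at.isoformat(),
        },
        sort_keys=True,
    )
```

Emitting the score as a metric on every scheduled check, and printing the remediation record from the remediation function so that it lands in its log stream, gives auditors both the trend and the per-change trail.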