Advanced Terraform & IaC Patterns

Drift, Imports & Brownfield IaC

One of the most uncomfortable realities in infrastructure engineering is that the cloud does not wait for your Terraform. Engineers click through the console to fix an outage at 2 AM, a security team patches a security group rule directly via the AWS CLI, and an auto-scaling event creates resources Terraform never knew about. Over time, the gap between what Terraform thinks exists and what actually exists, called configuration drift, grows silently until it causes a production incident. This lesson covers how mature platform teams detect drift proactively, how to import existing (brownfield) infrastructure into Terraform control, and how to run a full brownfield adoption campaign without breaking production.

What Is Drift?

Drift is any difference between what Terraform has recorded in state (and declared in configuration) and what actually exists in the cloud provider. It falls into three categories:

Attribute drift — a resource exists in both state and the cloud but a property was changed out-of-band: someone bumped an EC2 instance type from t3.medium to t3.large in the console.
Missing resource drift — Terraform state says a resource exists but it was deleted outside Terraform (accidental console delete, or the resource was replaced by another process).
Unmanaged resource drift — a resource exists in the cloud that Terraform has never seen: a manually created S3 bucket, a legacy RDS instance, a security group from the pre-IaC era.

Drift is a correctness problem, not just an aesthetic one. A drifted security group rule may be the only thing allowing your monitoring agents to reach prod. When Terraform next applies, it reverts that rule as part of an otherwise routine change, and your on-call dashboard goes dark. This is why drift detection must be automated, not manual.

Detecting Drift: The Tools

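Terraform provides two native mechanisms for drift detection. The first is terraform plan -refresh-only, introduced in Terraform 0.15.4. It refreshes every resource in state against the provider API and generates a plan showing only what changed in the real world, without proposing any configuration changes. The output is a pure diff of reality vs. state. Because it only inspects resources that are already in state, it catches attribute drift and missing resources but not unmanaged resources; finding those requires an inventory tool such as AWS Config.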

# Refresh-only plan — safe read-only operation
terraform plan -refresh-only -out=drift.tfplan

# Review the drift report
terraform show drift.tfplan

# Optionally apply to update state to match reality
# (does NOT change real infrastructure — only updates the state file)
terraform apply drift.tfplan

The second mechanism is the -detailed-exitcode flag, which can be combined with -refresh-only. With it, terraform plan exits with code 2 if there are changes (either config changes or drift), code 0 if there are no changes, and code 1 on error. This exit code is the hook for automation.

Automated drift detection pipeline: a scheduled CI job runs refresh-only plans every 6 hours and fires alerts when drift is found.

Automating Drift Detection in CI

The production pattern is a scheduled CI job — not a human-triggered one. The job runs every 6 hours, covers every root module, and pages the on-call team when drift is found. In GitHub Actions:

# .github/workflows/drift-detection.yml
name: Drift Detection

on:
  schedule:
    - cron: '0 */6 * * *'
  workflow_dispatch:

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # needed to assume the IAM role via GitHub OIDC
      contents: read
    strategy:
      matrix:
        environment: [production, staging]
        layer: [layer1-foundation, layer2-platform]
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/terraform-read-only
          aws-region: us-east-1

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.8.x
          # Disable the wrapper script so the shell sees Terraform's own exit code
          terraform_wrapper: false

      - name: Terraform Init
        working-directory: ${{ matrix.layer }}/${{ matrix.environment }}
        run: terraform init -input=false

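      - name: Detect Drift
        id: drift
        working-directory: ${{ matrix.layer }}/${{ matrix.environment }}
        run: |
          set +e
          terraform plan -refresh-only -detailed-exitcode \
            -out=drift.tfplan -no-color 2>&1 | tee drift-output.txt
          # After a pipeline, $? is tee's status; PIPESTATUS[0] is terraform's
          EXIT_CODE=${PIPESTATUS[0]}
          echo "exit_code=${EXIT_CODE}" >> "$GITHUB_OUTPUT"
          # 0 = no drift, 2 = drift (alerted below), 1 = plan failed
          if [ "${EXIT_CODE}" -eq 1 ]; then exit 1; fi
          exit 0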

      - name: Post Drift Alert
        if: steps.drift.outputs.exit_code == '2'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {"text": ":rotating_light: Drift in ${{ matrix.layer }}/${{ matrix.environment }}"}
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_DRIFT_WEBHOOK }}

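Use a read-only IAM role for drift detection. The drift job only reads the provider API, so give it a role with ReadOnlyAccess and read access to the S3 state backend. If the backend uses state locking, the role also needs write access to the lock, or the job must run with -lock=false. A read-only role limits the blast radius if the scheduled job is ever compromised, and prevents a runaway job from making unintended changes.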
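A minimal sketch of such a role, assuming the account already has a GitHub OIDC identity provider registered. The role name terraform-read-only comes from the workflow above; the repository name acme/infrastructure is a hypothetical placeholder.

```hcl
# Sketch only: assumes a GitHub OIDC provider already exists in the account.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

# Trust policy: only workflows from one repository may assume the role.
data "aws_iam_policy_document" "drift_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:acme/infrastructure:*"] # hypothetical repository
    }
  }
}

resource "aws_iam_role" "terraform_read_only" {
  name               = "terraform-read-only"
  assume_role_policy = data.aws_iam_policy_document.drift_assume.json
}

# AWS-managed read-only policy; it grants no write or delete actions.
resource "aws_iam_role_policy_attachment" "read_only" {
  role       = aws_iam_role.terraform_read_only.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}
```

Scoping the trust policy to a single repository keeps other workflows in the organization from borrowing the role.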

Importing Existing Resources

When you discover an unmanaged resource — one that exists in the cloud but has no Terraform state — you have two options: delete and recreate via Terraform (disruptive for production), or import it into state (preferred). terraform import reads the real resource from the provider API and writes it into the state file. It does not generate HCL for you — you must write the matching configuration first, then import, then iterate until the plan diff is zero.

Terraform 1.5 introduced the import block, which makes imports declarative, code-reviewable, and reproducible — the modern preferred approach over the CLI command.

# Modern approach: import block (Terraform 1.5+)
# imports.tf

import {
  to = aws_s3_bucket.audit_logs
  id = "acme-audit-logs-prod"
}

import {
  to = aws_security_group.legacy_app
  id = "sg-0a1b2c3d4e5f"
}

# Terraform 1.5+ can also generate HCL for import targets that have no
# resource block yet (experimental; always review the output):
#   terraform plan -generate-config-out=generated.tf
# Review and clean up generated.tf, confirm the plan, then run:
#   terraform apply

# Classic CLI import (still valid, less auditable):
#   terraform import aws_s3_bucket.audit_logs acme-audit-logs-prod

Import does not validate your HCL against reality. After importing, always run terraform plan and check for a zero-diff result. If the plan shows changes, Terraform will apply those changes on the next run — potentially overwriting production configuration. Fix every attribute mismatch before you let automation apply.
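As an illustration of why the zero-diff check matters, here is a sketch of a resource block for the legacy security group imported above. The name, description, and variable are hypothetical values; in practice they must be copied from the real resource.

```hcl
# Hypothetical input; in practice this is the VPC the real group lives in.
variable "vpc_id" {
  type = string
}

resource "aws_security_group" "legacy_app" {
  # name and description cannot be changed in place on a security group.
  # If either differs from the real resource, the plan proposes a replacement.
  name        = "legacy-app"                   # assumed; copy the real name
  description = "Legacy app tier, pre-IaC era" # assumed; must match exactly
  vpc_id      = var.vpc_id

  # Rules are omitted here for brevity.
}
```

If the first plan after import shows a replacement for this resource, fix the HCL until the plan is empty rather than applying.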

The Brownfield Adoption Playbook

Adopting Terraform for an existing production environment (brownfield IaC) is one of the most high-stakes operations a platform team performs. At large companies this is a months-long campaign. The safe pattern is strangler-fig adoption: import resources one layer at a time, run in plan-only mode for weeks before enabling apply, and maintain a rollback path at every step.

Inventory and triage. Use AWS Config, the AWS CLI with describe-* commands, or a tool like terraformer to enumerate every resource in the account. Categorize: managed vs. unmanaged. Prioritize by risk — import networking last (highest blast radius), start with stateless compute.
Write HCL before importing. Write the resource configuration, commit it, get it reviewed. An import can be undone with terraform state rm, but once the state file tracks the resource, a mistake in HCL that is then applied can destroy or replace the real resource.
Import in a branch, plan in CI. Create a PR with the import block and HCL. The CI plan run will show the diff. Merge only when the plan is zero-diff (or all diffs are acceptable, non-destructive attribute normalization like tag casing).
Enable apply gradually. Start with plan-only CI runs for 2 weeks on the newly imported resources. Verify drift detection finds no unexpected changes. Only then open the apply gate.
Document every import. Add a comment in HCL or a git commit message explaining why this resource was brownfield-imported: the original ticket, who created the resource manually, and when it was imported. This institutional memory prevents future teams from thinking the resource is safe to delete.
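When a layer contains many similar resources, the import step of the playbook can be written once with for_each, which import blocks support from Terraform 1.7. A sketch with hypothetical bucket names (only acme-audit-logs-prod appears earlier in this lesson):

```hcl
# Map of logical key => real bucket name. The second entry is a placeholder.
locals {
  legacy_buckets = {
    audit   = "acme-audit-logs-prod"
    exports = "acme-exports-prod" # hypothetical
  }
}

# One import per map entry (requires Terraform 1.7+).
import {
  for_each = local.legacy_buckets
  to       = aws_s3_bucket.legacy[each.key]
  id       = each.value
}

# Brownfield import: record the ticket, original owner, and import date here.
resource "aws_s3_bucket" "legacy" {
  for_each = local.legacy_buckets
  bucket   = each.value
}
```

Keeping the map in one place makes the PR diff small and lets reviewers see exactly which real resources enter state.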

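Use lifecycle { prevent_destroy = true } on every brownfield-imported resource for the first 90 days. This guard makes any plan that would destroy the resource fail, which protects a production database that predates IaC from an accidental terraform destroy or a misconfigured for_each. It only works while the resource block is present: deleting the block from configuration removes the guard with it. Drop the guard only after the team has full confidence in the HCL configuration.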
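A short sketch of the guard on an imported resource, reusing the audit-log bucket from the import example. The comment fields are placeholders to fill in, not real data.

```hcl
# Brownfield import. Ticket: <fill in>. Created manually by: <fill in>.
# Imported on: <fill in>.
resource "aws_s3_bucket" "audit_logs" {
  bucket = "acme-audit-logs-prod"

  lifecycle {
    # Any plan that would destroy or replace this bucket fails with an error.
    # The value must be a literal; it cannot come from a variable.
    prevent_destroy = true
  }
}
```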

The lifecycle.ignore_changes Escape Hatch

Some attributes are legitimately managed outside Terraform: the desired count of an ECS service (managed by autoscaling), the AMI of an EC2 instance (managed by an AMI-baking pipeline), the password of an RDS instance (rotated by AWS Secrets Manager). For these, use ignore_changes to prevent Terraform from reverting the external change:

resource "aws_ecs_service" "api" {
  name            = "payments-api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 3

  lifecycle {
    ignore_changes = [
      desired_count,          # managed by Application Auto Scaling
      task_definition,        # updated by deploy pipeline, not Terraform
    ]
  }
}

resource "aws_db_instance" "postgres" {
  identifier = "acme-prod-postgres"
  # ... other config ...

  lifecycle {
    ignore_changes  = [password]   # rotated by Secrets Manager
    prevent_destroy = true         # never allow accidental deletion
  }
}

ignore_changes should be used sparingly and always documented with a comment explaining why the attribute is managed externally. Overuse leads to silent configuration creep where Terraform stops being the source of truth for important attributes.

Terraform Moved Blocks for Safe Refactoring

When you refactor HCL — renaming a resource, moving it into a module, changing a count to for_each — Terraform by default sees the old resource as destroyed and the new one as created. For production resources this means a delete/recreate cycle. The moved block, introduced in Terraform 1.1, tells Terraform that the same real resource is now referenced by a different address:

# Before: resource "aws_instance" "web" { ... }
# After refactor: resource "aws_instance" "web_servers" { ... }

# Add to moves.tf to prevent destroy/recreate:
moved {
  from = aws_instance.web
  to   = aws_instance.web_servers
}

# For module moves:
moved {
  from = aws_security_group.legacy_sg
  to   = module.networking.aws_security_group.app
}

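After applying, the moved block can be kept permanently as documentation of the refactor history, or removed once every state that uses the configuration has applied the move. Removing it earlier means any workspace that has not yet applied will plan a destroy and recreate. A common practice is to keep moved blocks for one release cycle and remove them in a follow-up PR; for shared modules with external consumers, keep them longer.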

Production Failure Modes

The most common brownfield disasters follow predictable patterns. Understanding them protects you:

Import then apply without plan review. The imported HCL has a wrong attribute (wrong vpc_id, wrong engine_version). The next terraform apply replaces the resource. Always get a zero-diff plan before enabling apply.
Drifted security groups silently reverted. A security team added a hotfix rule to a security group after a DDoS incident. Terraform reverts it on the next apply. The rule was the only thing blocking the attack vector. Now the attack resumes. Run drift detection before every apply.
Mass resource destruction from for_each key change. A brownfield resource was imported with count. Someone refactors to for_each with string keys. Terraform sees all count-indexed resources as destroyed and creates new for_each ones. Use moved blocks to map old addresses to new ones.
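The count to for_each failure mode can be avoided with one moved block per old index. A sketch, assuming two instances and hypothetical keys blue and green:

```hcl
# Hypothetical input for the example.
variable "ami_id" {
  type = string
}

# Before the refactor this resource used count = 2, producing
# aws_instance.web[0] and aws_instance.web[1].
resource "aws_instance" "web" {
  for_each      = toset(["blue", "green"]) # hypothetical keys
  ami           = var.ami_id
  instance_type = "t3.medium"

  tags = {
    Name = "web-${each.key}"
  }
}

# Map each numeric index to its new string key so nothing is recreated.
moved {
  from = aws_instance.web[0]
  to   = aws_instance.web["blue"]
}

moved {
  from = aws_instance.web[1]
  to   = aws_instance.web["green"]
}
```

With these blocks in place, the plan should report the two addresses as moved rather than as two destroys and two creates.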

Drift detection and brownfield import are the two skills that separate teams that use Terraform for compliance theater from teams that use it as the genuine operational source of truth. Master both, automate drift detection from day one, and your infrastructure state will remain trustworthy even in the messiest production environments.