Terraform Fundamentals

Remote State & Backends

In the previous lesson you learned what Terraform state is and why it exists. By default, Terraform writes that state to a file named terraform.tfstate on your local disk. That works for learning — and for absolutely nothing else. The moment a second engineer touches the same infrastructure, or a CI runner executes a plan, local state causes split-brain: two operators each hold a different view of reality, and the next apply can silently destroy resources the other person created. Remote backends solve this by storing state in a shared, durable location and — critically — by adding a locking mechanism so that only one operation can mutate state at a time.

This lesson covers the two backends you will encounter in nearly every production organisation (S3 with DynamoDB locking, and HTTP backends), how state locking works and what happens when it fails, and how to handle the sensitive data that Terraform inevitably writes into state.

Why Remote State Is Non-Negotiable in Teams

Local state breaks in three distinct ways that each take a painful incident to learn:

No sharing: A second engineer cloning the repo has no state file. Their first terraform plan shows every resource as "will be created" — infrastructure that already exists in the cloud.
No locking: Two CI jobs running simultaneously can both read the same state, both compute a plan, and then both write back — with the second write silently overwriting the first. Resources get orphaned with no record in state.
No durability: A laptop drive failure or a corrupted .git repo that someone force-pushed state into means the state is gone. Reconciling what Terraform thinks exists versus what the cloud actually has is a multi-day forensic exercise.

Industry standard: Remote state is standard practice for teams running Terraform in production. The state backend is typically provisioned before any other infrastructure; it is a prerequisite, not an afterthought. Because Terraform cannot store state in a backend that does not exist yet (a chicken-and-egg problem), the backend itself usually sits outside the configuration it serves and is bootstrapped once by a platform team script or a separate root module.

The S3 + DynamoDB Backend

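This is the de facto standard for AWS-based infrastructure. State is stored as a JSON object in an S3 bucket (with versioning and server-side encryption enabled). Locking is provided by a DynamoDB table whose partition key is a string attribute named LockID. When a Terraform operation starts, it writes a lock item to DynamoDB using a conditional put that succeeds only if no item with that LockID exists; when it finishes (success or failure), it deletes the item. Any concurrent operation that tries to write the same lock item gets a DynamoDB conditional-check failure, and Terraform exits with an error rather than proceeding without the lock. Terraform 1.10 and later can also lock natively in S3 through the use_lockfile backend argument; this lesson uses the DynamoDB table, which is still the setup you are most likely to meet in existing codebases.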
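
The property that makes this lock safe is create-if-absent: the write must fail when the item already exists. A minimal local sketch of the same idea, using the shell's noclobber option as a stand-in for DynamoDB's conditional put. The helper names acquire_lock and release_lock and the lock-file approach are assumptions for illustration, not how Terraform itself is implemented.

```shell
#!/usr/bin/env bash
# Sketch only: a lock file created under noclobber plays the role of the
# DynamoDB lock item. acquire_lock / release_lock are hypothetical helpers.

# acquire_lock LOCK_PATH HOLDER
# Succeeds only if LOCK_PATH does not exist yet (create-if-absent).
acquire_lock() {
  local lock_path="$1" holder="$2"
  # noclobber makes ">" refuse to overwrite an existing file, so a second
  # caller fails instead of silently replacing the first caller's record.
  ( set -o noclobber; printf '%s\n' "$holder" > "$lock_path" ) 2>/dev/null
}

# release_lock LOCK_PATH
# Removes the lock record, like deleting the DynamoDB item after an operation.
release_lock() {
  rm -f -- "$1"
}
```

The second caller fails fast and the first caller's record is left untouched, which mirrors the "exit with an error rather than proceed without the lock" behaviour described above.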

# 1. Bootstrap the backend infrastructure (do this ONCE, manually or via a separate root module)
aws s3api create-bucket \
  --bucket acme-terraform-state-prod \
  --region us-east-1

# Enable versioning — essential for state history and rollback
aws s3api put-bucket-versioning \
  --bucket acme-terraform-state-prod \
  --versioning-configuration Status=Enabled

# Enable server-side encryption (AES-256)
aws s3api put-bucket-encryption \
  --bucket acme-terraform-state-prod \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "AES256"
      }
    }]
  }'

# Block all public access — state buckets must NEVER be public
aws s3api put-public-access-block \
  --bucket acme-terraform-state-prod \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

# Create the DynamoDB lock table
aws dynamodb create-table \
  --table-name acme-terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1

With the bucket and table provisioned, configure the backend in your Terraform root module. Backend configuration lives in a terraform {} block and cannot reference variables or locals — the values must be static strings. This is intentional: Terraform needs to resolve the backend before it can evaluate anything else in the configuration.

# backend.tf  — root module backend configuration
terraform {
  required_version = ">= 1.6"

  backend "s3" {
    bucket         = "acme-terraform-state-prod"
    key            = "services/api-gateway/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                        # enforce SSE even if bucket default is unset
    dynamodb_table = "acme-terraform-locks"      # locking table
    profile        = "prod"                      # AWS CLI profile (omit in CI; use IAM roles)

    # The backend talks to S3 over HTTPS by default. Enforce TLS on the bucket side
    # with a bucket policy that denies requests where aws:SecureTransport is false.
    # workspace_key_prefix = "env"              # only relevant when Terraform workspaces are in use
  }
}

# After editing backend config, ALWAYS run:
#   terraform init
# Terraform will detect the new backend and offer to migrate existing local state.
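
Because the backend block cannot read variables, values that differ per environment are commonly supplied through partial configuration: settings passed at init time with -backend-config are merged with the backend block. A minimal sketch, assuming a hypothetical helper write_backend_config, an illustrative file name backend.hcl, and a per-environment bucket and key layout that are assumptions rather than part of the lesson's setup:

```shell
#!/usr/bin/env bash
# write_backend_config ENV OUT_FILE
# Writes a partial S3 backend configuration for one environment.
# The "acme-terraform-state-<env>" bucket naming and the key layout
# are illustrative assumptions.
write_backend_config() {
  local env="$1" out_file="$2"
  cat > "$out_file" <<EOF
bucket         = "acme-terraform-state-${env}"
key            = "services/api-gateway/${env}/terraform.tfstate"
region         = "us-east-1"
dynamodb_table = "acme-terraform-locks"
encrypt        = true
EOF
}

# Usage (not executed here):
#   write_backend_config prod backend.hcl
#   terraform init -backend-config=backend.hcl
```

With this approach the backend "s3" block in backend.tf can stay nearly empty, and the same root module can be initialised against different state locations without editing any .tf file.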

Key-naming convention: The key parameter is the S3 object path within the bucket. A flat naming scheme (prod.tfstate) becomes unmaintainable at scale. Use a hierarchy that mirrors your service tree: <team>/<service>/<environment>/terraform.tfstate. Many organisations also separate network, compute, and data tiers into distinct state files so a broken compute module cannot corrupt the network state. This is the "state isolation" principle and it is one of the most impactful structural decisions you will make on a Terraform project.
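
A naming convention only helps if every pipeline builds keys the same way. A minimal sketch of a helper that produces keys in the <team>/<service>/<environment>/terraform.tfstate layout; state_key is a hypothetical name and the validation rules are an assumption about what a team might want to reject:

```shell
#!/usr/bin/env bash
# state_key TEAM SERVICE ENVIRONMENT
# Prints the S3 object key <team>/<service>/<environment>/terraform.tfstate.
# Rejects empty segments and segments containing "/" so that one state
# cannot end up nested inside another by accident.
state_key() {
  local segment
  if [ "$#" -ne 3 ]; then
    echo "usage: state_key TEAM SERVICE ENVIRONMENT" >&2
    return 2
  fi
  for segment in "$@"; do
    case "$segment" in
      ""|*/*)
        echo "invalid segment: '$segment'" >&2
        return 1
        ;;
    esac
  done
  printf '%s/%s/%s/terraform.tfstate\n' "$1" "$2" "$3"
}
```

For example, state_key platform api-gateway prod prints platform/api-gateway/prod/terraform.tfstate, while an empty or slash-containing segment is refused with a non-zero exit status.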

State Locking: How It Works and What to Do When It Breaks

Every Terraform command that could modify state (apply, destroy, state mv, import) acquires a lock before starting. So does plan: it takes the lock by default, and you can skip it with -lock=false only at the risk of planning against state that changes mid-run. Purely read-only commands such as output and show do not take the lock. The lock record stored in DynamoDB contains a lock ID, the operation type, the user and hostname holding the lock, the Terraform version, and a creation timestamp.

# Example: what a DynamoDB lock item looks like (retrieved with AWS CLI)
aws dynamodb get-item \
  --table-name acme-terraform-locks \
  --key '{"LockID": {"S": "acme-terraform-state-prod/services/api-gateway/terraform.tfstate"}}' \
  --region us-east-1

# Output (when a lock is held):
# {
#   "Item": {
#     "LockID":   { "S": "acme-terraform-state-prod/services/api-gateway/terraform.tfstate" },
#     "Info":     { "S": "{\"ID\":\"a1b2c3d4...\",\"Operation\":\"OperationTypeApply\",
#                          \"Who\":\"ci-runner@github-actions\",\"Version\":\"1.7.5\",
#                          \"Created\":\"2025-03-15T14:22:10.411Z\",\"Path\":\"...\"}" }
#   }
# }

# --- STALE LOCK RECOVERY ---
# A lock can become stale if a CI runner crashes mid-apply or a laptop loses network
# connectivity. Terraform will refuse to run until the lock is cleared.

# Step 1: Identify the lock ID from the error message Terraform printed,
#         OR from the DynamoDB item above.

# Step 2: Force-unlock (requires human judgment — verify the locking process is truly dead)
terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890

# Step 3: Verify the state file is not corrupt after an interrupted apply
terraform plan     # should show only legitimate drift, not phantom changes

Never force-unlock a live operation. If you run terraform force-unlock while another apply is genuinely in progress, you have removed the only concurrency guard. The next operation will read stale state, compute an incorrect plan, and could delete or recreate resources that the first operation was in the middle of modifying. Always confirm the locking process is dead (check CI job status, ping the engineer) before unlocking. At minimum, wait 10 minutes past the lock timestamp.
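
The "wait at least 10 minutes past the lock timestamp" rule can be encoded as a guard in a runbook script. A minimal sketch, assuming the timestamps have already been converted to Unix epoch seconds; lock_is_stale is a hypothetical helper, and passing this age check is never a substitute for confirming that the locking process is dead:

```shell
#!/usr/bin/env bash
# lock_is_stale CREATED_EPOCH NOW_EPOCH [MIN_AGE_MINUTES]
# Returns 0 if the lock is strictly older than MIN_AGE_MINUTES (default 10),
# and 1 otherwise. Both timestamps are Unix epoch seconds.
lock_is_stale() {
  local created="$1" now="$2" min_age_minutes="${3:-10}"
  local age_seconds=$(( now - created ))
  [ "$age_seconds" -gt $(( min_age_minutes * 60 )) ]
}

# Usage (not executed here):
#   if lock_is_stale "$created_epoch" "$(date +%s)"; then
#     echo "Lock is older than 10 minutes; verify the holder is dead, then force-unlock."
#   fi
```

With GNU date, the Created field of the lock item can usually be converted with date -u -d "$created" +%s; other date implementations use different flags, so treat that conversion as an assumption about your environment.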

HTTP Backends (GitLab, Terraform Cloud, Custom)

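The HTTP backend is a generic interface: Terraform fetches state with GET, writes it with POST, and purges it with DELETE against any HTTP server that implements the protocol. Locking is optional and uses separate lock and unlock addresses; the default methods are LOCK and UNLOCK, which is why the GitLab configuration in this section overrides them with POST and DELETE. GitLab ships a managed Terraform state backend built on this protocol (states are named and scoped to a project), which makes it the default choice for organisations already on GitLab. HCP Terraform (formerly Terraform Cloud) also stores state remotely, but it is configured through the cloud block or the older remote backend rather than the http backend, and it exposes a richer API.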

# GitLab-managed Terraform state backend
# GitLab provides a per-project HTTP endpoint. CI jobs authenticate with the job token;
# local runs typically use a personal or project access token instead.

terraform {
  backend "http" {
    address        = "https://gitlab.example.com/api/v4/projects/42/terraform/state/production"
    lock_address   = "https://gitlab.example.com/api/v4/projects/42/terraform/state/production/lock"
    unlock_address = "https://gitlab.example.com/api/v4/projects/42/terraform/state/production/lock"
    lock_method    = "POST"
    unlock_method  = "DELETE"
    retry_wait_min = 5

    # Credentials are passed via environment variables, NOT hardcoded here:
    # TF_HTTP_USERNAME  = "gitlab-ci-token"
    # TF_HTTP_PASSWORD  = "$CI_JOB_TOKEN"      # injected by GitLab CI automatically
  }
}

# In .gitlab-ci.yml, the init step passes credentials via env vars:
# variables:
#   TF_HTTP_USERNAME: "gitlab-ci-token"
#   TF_HTTP_PASSWORD: $CI_JOB_TOKEN
#
# terraform init      # reads backend config and authenticates
# terraform plan -out plan.tfplan
# terraform apply plan.tfplan
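
The three GitLab URLs differ only in the project ID and state name, so CI scripts often derive them rather than hardcoding them. A minimal sketch, assuming a hypothetical helper gitlab_backend_args that prints one -backend-config argument per line; the URL layout is the one shown in the configuration above:

```shell
#!/usr/bin/env bash
# gitlab_backend_args BASE_URL PROJECT_ID STATE_NAME
# Prints the -backend-config arguments for a GitLab-managed state, one per line.
gitlab_backend_args() {
  local base_url="${1%/}" project_id="$2" state_name="$3"   # strip one trailing slash
  local address="${base_url}/api/v4/projects/${project_id}/terraform/state/${state_name}"
  local setting
  for setting in \
    "address=${address}" \
    "lock_address=${address}/lock" \
    "unlock_address=${address}/lock" \
    "lock_method=POST" \
    "unlock_method=DELETE"
  do
    printf '%s\n' "-backend-config=${setting}"
  done
}

# Usage in a CI job (not executed here; CI_SERVER_URL and CI_PROJECT_ID are
# GitLab's predefined variables):
#   mapfile -t args < <(gitlab_backend_args "$CI_SERVER_URL" "$CI_PROJECT_ID" production)
#   terraform init "${args[@]}"
```

When the addresses are passed this way, the backend "http" block in the configuration can be left empty, and the credentials still come from TF_HTTP_USERNAME and TF_HTTP_PASSWORD.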

Figure: S3 remote state backend. The engineer acquires a DynamoDB lock before reading or writing state; the CI runner is blocked until the lock is released. Resources are provisioned via the AWS API while state tracks them in S3.

Sensitive Data in State: The Production Reality

Terraform state is not a simple inventory. It stores all attributes of every managed resource, including the ones the provider marks as sensitive. An RDS instance whose master password is set in configuration has that password written to state in plaintext. An IAM access key resource writes its secret in plaintext. A TLS private key resource writes the private key. This is not a Terraform bug; it is a consequence of how Terraform manages infrastructure: it must know the current value of each attribute to decide whether anything needs to change.

Encrypt state at rest: Always enable S3 SSE (AES256 or AWS KMS with a customer-managed key). For highly regulated environments, use a KMS CMK so you can audit and rotate the encryption key independently.
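Restrict access with IAM: Only the roles that run Terraform should have access to the state bucket and lock table. In practice that means s3:ListBucket on the bucket, s3:GetObject and s3:PutObject on the state object, and dynamodb:DescribeTable, dynamodb:GetItem, dynamodb:PutItem, and dynamodb:DeleteItem on the lock table (DeleteItem is what releases the lock). Engineers should NOT have direct S3 access to production state; they should interact via CI pipelines only.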
Never commit state to git: Add *.tfstate and *.tfstate.backup to .gitignore on every Terraform project. Use git-secrets or a pre-commit hook to block accidental commits.
Use sensitive = true on outputs: Mark any output that contains a secret value as sensitive. Terraform will redact it from CLI output, including the rendered plan, but the value is still stored in plaintext in state and in any saved plan file. Sensitivity in Terraform is a UX guard, not a security boundary.

# Marking outputs as sensitive so they are redacted in CLI output
output "db_password" {
  value     = aws_db_instance.main.password
  sensitive = true    # Terraform prints "(sensitive value)" instead of the actual password
}

output "api_key" {
  value     = aws_iam_access_key.deployer.secret
  sensitive = true
}

# Retrieving a sensitive output explicitly (necessary for passing to other tools)
terraform output -raw db_password     # bypasses the redaction; pipe carefully
terraform output -json | jq -r '.db_password.value'

# Best practice: never store long-lived secrets in Terraform state at all.
# Use aws_secretsmanager_secret or aws_ssm_parameter (SecureString) as the store,
# and reference them by ARN/path from application config — not by value through state.
#
# Example: generate a random password and store it in Secrets Manager
resource "random_password" "db" {
  length  = 32
  special = true
}

resource "aws_secretsmanager_secret" "db_password" {
  name = "prod/api-db/password"
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = random_password.db.result
}

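# The RDS instance is given the same generated password. The value is still written
# to state; Secrets Manager is where applications should read it from, and where it is rotated.
# To keep the password out of state entirely, recent AWS provider versions offer
# manage_master_user_password = true, which lets RDS manage the secret itself.
resource "aws_db_instance" "main" {
  # ... other required arguments (engine, instance_class, and so on) omitted for brevity
  password = random_password.db.result
}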
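
The "never commit state to git" rule from the list above can be enforced mechanically. A minimal sketch of the check a pre-commit hook could run, assuming a hypothetical helper find_state_files that reads candidate paths on stdin; the hook wiring is shown only as a comment:

```shell
#!/usr/bin/env bash
# find_state_files
# Reads file paths on stdin (one per line) and prints those that look like
# Terraform state: *.tfstate, *.tfstate.backup, and other *.tfstate.* variants.
# Returns 0 if at least one state file was found, 1 if none were.
find_state_files() {
  local path found=1
  while IFS= read -r path; do
    case "$path" in
      *.tfstate|*.tfstate.*)
        printf '%s\n' "$path"
        found=0
        ;;
    esac
  done
  return "$found"
}

# In .git/hooks/pre-commit, after defining the function (not executed here):
#   if git diff --cached --name-only | find_state_files; then
#     echo "Refusing to commit Terraform state files" >&2
#     exit 1
#   fi
```

A hook like this complements .gitignore rather than replacing it: .gitignore stops accidental staging, while the hook catches files that were added with git add -f.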

State encryption with KMS (production hardening): For production workloads, keep encrypt = true and add a KMS customer-managed key by setting kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/mrk-..." in your backend config. This gives you CloudTrail-audited key usage, key rotation, and the ability to revoke access to all historical state by disabling the key, a control that SSE with S3-managed keys (AES256) cannot provide.

Migrating Between Backends

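When you change the backend configuration (for example, moving from local to S3, or changing the S3 key), Terraform detects the change on the next terraform init. Depending on the version and the kind of change, it either prompts you to copy existing state to the new backend or stops and asks you to re-run with -migrate-state (copy the state across) or -reconfigure (ignore the existing state). Always run terraform plan immediately after migration to confirm the migrated state matches what the cloud actually has. Provided nothing has drifted in the meantime, a successful migration shows zero planned changes.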

# After editing backend.tf to point to a new S3 bucket or key:
terraform init -migrate-state

# Terraform will print:
#   Initializing the backend...
#   Do you want to copy existing state to the new backend? (yes/no)
# Type "yes" to migrate.

# Immediately verify the migration was clean:
terraform plan
# Expected output: "No changes. Your infrastructure matches the configuration."

# If you see unexpected changes, the state was not migrated cleanly.
# Stop immediately, restore from the previous backend, and investigate.
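
In automation, the "zero planned changes" check does not need to parse plan output: terraform plan -detailed-exitcode exits with 0 when there are no changes, 1 on error, and 2 when changes are present. A minimal sketch, with describe_plan_exit as a hypothetical helper that maps those codes to a verdict:

```shell
#!/usr/bin/env bash
# describe_plan_exit EXIT_CODE
# Maps the exit code of `terraform plan -detailed-exitcode` to a verdict.
# Returns 0 only when the plan reported no changes.
describe_plan_exit() {
  case "$1" in
    0)
      echo "clean: no changes planned"
      return 0
      ;;
    2)
      echo "suspect: plan contains changes, investigate before applying"
      return 1
      ;;
    *)
      echo "error: terraform plan failed (exit code $1)"
      return 1
      ;;
  esac
}

# Usage after a migration (not executed here):
#   terraform plan -detailed-exitcode -input=false >/dev/null
#   describe_plan_exit "$?"
```

Treating exit code 2 as a failure is a deliberate choice for the post-migration check only; in a normal delivery pipeline a plan with changes is the expected case.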

Summary

Remote state backends are the operational foundation of any team-based Terraform workflow. The S3 + DynamoDB combination provides object storage durability, versioned history, encryption at rest, and atomic locking, which together address the sharing, locking, and durability failures of local state. The HTTP backend (as used by GitLab) offers the same model through a generic protocol, and hosted services such as HCP Terraform provide it as a managed offering. Understanding state locking (how it is acquired, how to safely recover from stale locks, and why concurrent operations without locking cause data corruption) is knowledge that separates engineers who use Terraform from engineers who operate it safely at scale. Finally, treating state as a sensitive artifact (encrypting it, restricting access, never committing it to git) is not optional: your state file is a partial dump of every secret your infrastructure holds.