Structuring Terraform at Scale
At a startup you can get away with a single main.tf and a shared state file. At 50 engineers across 10 product teams managing six AWS accounts, that approach collapses within weeks: state lock contention, blast-radius failures, and on-call incidents caused by the wrong team touching the wrong resource. This lesson teaches the repo layout, environment separation, and layered state strategy that top-tier engineering organizations use to keep infrastructure changes safe, reviewable, and autonomous across teams.
The Core Constraint: Blast Radius
Every structural decision in large-scale Terraform flows from one question: if this terraform apply goes wrong, what is the maximum damage? A monolithic root module that manages networking, IAM, RDS, and Kubernetes in one state file can take down production with a single misplaced count. The remedy is state isolation — splitting infrastructure into layers where each layer is a separate Terraform root module with its own backend state.
Blast radius is the maximum damage a single terraform apply can cause. Design your layers so that losing any single state file affects only one logical scope (e.g., one app's compute, not the entire VPC).
The Three-Layer Model
A widely used pattern divides infrastructure into three layers, applied from the bottom up. Each layer may reference outputs only from layers below it, never sideways or upward.
Lower layers expose outputs (VPC IDs, subnet IDs, cluster endpoints) via terraform_remote_state or, preferably at scale, an SSM Parameter Store pattern, so app teams never need read access to the foundation state file.
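As a rough illustration of the terraform_remote_state option, an upper-layer root module might read foundation outputs as sketched below. The bucket name, state key, region, and output name (private_subnet_ids) are hypothetical placeholders rather than values from this lesson, and the sketch assumes the foundation layer declares an output with that name.

```hcl
# Hypothetical example: an app-layer root module reading outputs
# from the foundation layer's state. All names are placeholders.
data "terraform_remote_state" "foundation" {
  backend = "s3"

  config = {
    bucket = "example-org-terraform-state"                  # assumed bucket name
    key    = "layer1/network/production/terraform.tfstate"  # assumed state key
    region = "us-east-1"                                    # assumed region
  }
}

# Consume an output exposed by the lower layer.
# Assumes the foundation module defines: output "private_subnet_ids"
locals {
  private_subnet_ids = data.terraform_remote_state.foundation.outputs.private_subnet_ids
}
```

Note that this approach requires the consuming team to have read access to the foundation state object, which is exactly the coupling the SSM-based alternative avoids.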
Monorepo vs. Polyrepo
Both models work. The decision is organizational, not technical.
- Monorepo: all Terraform in one repo, directories per layer/environment. Great for discoverability and atomic cross-layer PRs. Requires strong CODEOWNERS rules and per-path CI triggers so a change to layer1/ does not trigger plan for every app module.
- Polyrepo: each layer (or team) owns its own repo. Natural security boundary: product teams literally cannot see foundation HCL. Harder to trace cross-layer dependencies. Common at large enterprises with separate security-compliance ownership of foundation infra.
Canonical Monorepo Layout
The layout below is production-hardened across hundreds of AWS environments. Every path is deliberate:
Environment Separation Strategies
There are three approaches to separating environments in Terraform. Understanding the trade-offs prevents costly migrations later.
- Directory-per-environment (shown above): each environment is a separate root module directory with its own backend.tf and .tfvars. This is the safest and most explicit approach. You cannot accidentally apply staging config to production. The cost: some HCL duplication, mitigated by shared modules.
- Workspaces: one root module, multiple named workspaces, one state file per workspace. Works for small, truly identical environments (dev/test). Breaks down when environments diverge: different instance sizes, different subnets, different DNS zones. Avoid for production/staging at scale: the temptation to add terraform.workspace == "production" ? ... : ... conditionals metastasizes into unmaintainable code.
- Separate accounts: the AWS Well-Architected standard for regulated industries. Production, staging, sandbox, and security-tooling each live in separate AWS accounts linked under AWS Organizations. Each account has its own layer1 root module. This is the gold standard for SaaS companies with SOC 2 or PCI requirements.
A terraform destroy run against staging with a shared state file has deleted production RDS instances at multiple companies. State isolation is non-negotiable. One backend key = one environment.
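To make the contrast between directory-per-environment and workspaces concrete, here is a hedged sketch. The variable name, the file paths in the comments, and the instance classes are illustrative assumptions, not part of the lesson's layout.

```hcl
# variables.tf (the same declaration lives in each environment's root module)
variable "db_instance_class" {
  type        = string
  description = "RDS instance class for this environment"
}

# Directory-per-environment: each directory supplies its own value.
#
#   environments/staging/terraform.tfvars (hypothetical path):
#     db_instance_class = "db.t3.medium"
#
#   environments/production/terraform.tfvars (hypothetical path):
#     db_instance_class = "db.r6g.xlarge"

# Workspace style, for comparison: the value is chosen by a conditional
# on the workspace name. One such line is harmless; dozens of them are
# what becomes hard to maintain.
locals {
  db_instance_class_by_workspace = terraform.workspace == "production" ? "db.r6g.xlarge" : "db.t3.medium"
}
```

With the first style, the difference between environments is visible as a plain diff between two small files, and no expression in the module needs to know which environment it is running in.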
Backend Configuration at Scale
At scale, every team configures their S3 backend with the same three non-negotiables: versioning, encryption, and DynamoDB locking. The backend key must encode the layer, service name, and environment so state files are self-describing:
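A minimal sketch of such a backend block follows. The bucket name, region, and lock table name are assumptions for illustration; the key follows the layer/service/environment convention described above, using the payments-api service as the example.

```hcl
terraform {
  backend "s3" {
    # Assumed bucket name. Versioning is a property of the bucket itself
    # and must be enabled there; it is not a backend argument.
    bucket = "example-org-terraform-state"

    # layer / service / environment, so the key is self-describing.
    key = "layer3-apps/payments-api/production/terraform.tfstate"

    region         = "us-east-1"              # assumed region
    encrypt        = true                     # encrypt the state object at rest
    dynamodb_table = "terraform-state-locks"  # assumed lock table name
  }
}
```

Recent Terraform releases also support S3-native locking through the backend's use_lockfile argument; check the backend documentation for the version you run before choosing between the two locking mechanisms.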
The SSM parameter pattern is superior to terraform_remote_state for cross-team consumption because it decouples state file access from value consumption. The foundation team writes outputs to SSM; app teams read from SSM. No IAM permissions to the foundation S3 bucket are needed for app teams, and the foundation can refactor its internals without changing the SSM key contracts.
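The write/read split might look roughly like the sketch below. The parameter path is a hypothetical key contract, and the writer half assumes an aws_vpc.main resource already exists in the foundation module. The two halves belong to separate root modules and are shown together only for comparison.

```hcl
# --- Foundation root module (writer) ---
# Assumes a resource "aws_vpc" "main" is defined elsewhere in this module.
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/infra/production/network/vpc_id" # hypothetical key contract
  type  = "String"
  value = aws_vpc.main.id
}

# --- App root module (reader), a separate configuration ---
data "aws_ssm_parameter" "vpc_id" {
  name = "/infra/production/network/vpc_id"
}

resource "aws_security_group" "app" {
  name = "payments-api"
  # The provider marks this value as sensitive, so plan output redacts it.
  vpc_id = data.aws_ssm_parameter.vpc_id.value
}
```

The app team's role needs only ssm:GetParameter on the agreed path. As long as the foundation team keeps the parameter name stable, it can rename or restructure the underlying resources freely.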
CODEOWNERS and Per-Path CI
The directory layout only enforces team boundaries if your CI/CD system enforces them too. A production-grade setup combines:
- .github/CODEOWNERS: layer1/ requires approval from @infra-platform; layer3-apps/payments-api/ requires @team-payments.
- Per-path CI triggers: GitHub Actions path filters (on.push.paths or on.pull_request.paths) or Atlantis per-directory plans, so only the affected root modules run terraform plan on each PR.
- Protected branch rules: no direct pushes to main; plan output posted as a PR comment; terraform apply runs only after merge on the CI runner, never from a developer laptop against production.
Never run terraform apply from a developer laptop against production. CI runners should hold the production credentials; developers hold only read-only roles that allow terraform plan. This single policy prevents the most common class of human-error production incidents.
Common Failure Modes
At this stage teams routinely make three structural mistakes:
- God module: one module that creates everything. The module.app call takes 80 inputs and manages 400 resources. Refactoring it mid-flight is a multi-week state surgery project. Decompose early.
- Hardcoded account IDs in shared modules: shared modules should never reference specific account IDs or region strings. Pass them as variables. A module with a hardcoded 123456789012 is impossible to reuse across accounts.
- Missing state locking: two engineers run terraform apply simultaneously, the second overwrites the first's state, and you lose track of which resources Terraform knows about. Always configure a DynamoDB lock table. It costs pennies and prevents catastrophic state corruption.
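The second failure mode is the easiest to illustrate. The sketch below shows one way a shared module could take the account ID and region as inputs instead of embedding them; the variable names, the validation rule, and the ARN being built are all illustrative assumptions.

```hcl
# Shared module: nothing account- or region-specific is baked in.
variable "account_id" {
  type        = string
  description = "AWS account ID the caller is deploying into"

  validation {
    condition     = can(regex("^[0-9]{12}$", var.account_id))
    error_message = "account_id must be a 12-digit AWS account ID."
  }
}

variable "aws_region" {
  type        = string
  description = "AWS region the caller is deploying into"
}

locals {
  # Built from inputs rather than a literal such as 123456789012.
  # Assumes the standard "aws" partition.
  log_group_arn_prefix = "arn:aws:logs:${var.aws_region}:${var.account_id}:log-group:"
}

output "log_group_arn_prefix" {
  value = local.log_group_arn_prefix
}
```

Each environment's root module then passes its own values when calling the module. A caller that prefers not to store the account ID in its own configuration could look it up with the aws_caller_identity data source and pass the result in.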
The layout and discipline established here are the foundation on which the rest of this tutorial builds: workspaces, advanced modules, testing, and Terragrunt all assume you already have clean layer separation and isolated state per environment.