Terraform Fundamentals

Data Sources & References

18 min Lesson 7 of 30

Every non-trivial Terraform configuration needs to read the world before it can change it. You need the ID of the latest Amazon Linux AMI before launching an EC2 instance. You need the ARN of a certificate managed by another team before attaching it to your load balancer. You need the CIDR blocks of a VPC that was created six months ago — long before your module existed. This is exactly what data sources solve.

A data source is a read-only query against a provider's API or state. Declare it with a data block, reference its attributes exactly as you would a resource, and Terraform builds the correct dependency edge automatically. Understanding data sources — and how Terraform's resource graph uses them — is what separates engineers who write toy configurations from engineers who manage production infrastructure at scale.

The data Block

The data block has the same structure as resource: a type, a local name, and a body of query arguments (filters, tags, owners, and so on). When every argument is known, the provider resolves the query during planning and exposes every attribute of the matching object as data.<type>.<name>.<attribute>.

# Fetch the most recent Amazon Linux 2023 AMI owned by Amazon
data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

# Use it — Terraform resolves the AMI ID at plan time
resource "aws_instance" "web" {
  ami           = data.aws_ami.al2023.id
  instance_type = "t3.medium"

  tags = { Name = "web-server" }
}

# --- Look up an existing VPC by tag (created outside Terraform) ---
data "aws_vpc" "shared" {
  tags = {
    Name        = "platform-vpc"
    Environment = var.environment
  }
}

# Use the VPC data source to look up the IDs of its private subnets
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.shared.id]
  }

  tags = { Tier = "private" }
}

resource "aws_db_subnet_group" "main" {
  name       = "main-db-subnet-group"
  subnet_ids = data.aws_subnets.private.ids
}

Data sources are read during the plan phase whenever their arguments are known. If the queried object does not exist yet (for example, a secret that another Terraform stack will create later), the plan fails. Design your stacks so that data sources target objects that already exist. Within a single configuration, a reference to the managed resource or a depends_on defers the read until apply. Across stacks depends_on cannot help, so apply the stacks in ordered stages.
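
The difference between a plan-time read and a deferred read can be sketched as follows. This is a minimal illustration, not part of the lesson's stack: the parameter names "/platform/vpc-id" and "/app/endpoint" are hypothetical, and the first lookup assumes another team has already created that parameter.

```hcl
# Hypothetical parameter owned by another team. The name is a literal,
# so Terraform reads it during plan; the plan fails if it does not exist.
data "aws_ssm_parameter" "platform_vpc_id" {
  name = "/platform/vpc-id"
}

# A parameter managed by this configuration.
resource "aws_ssm_parameter" "endpoint" {
  name  = "/app/endpoint"
  type  = "String"
  value = "https://api.example.com"
}

# This data source references the managed resource directly. While that
# resource has pending changes (for example, on first creation), Terraform
# defers the read to the apply phase instead of failing the plan.
data "aws_ssm_parameter" "endpoint" {
  name = aws_ssm_parameter.endpoint.name
}
```

In the plan output, attributes of the deferred data source appear as "(known after apply)", which is a quick way to confirm which reads were postponed.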

Common Data Source Patterns

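The three patterns below appear on almost every project: (1) looking up dynamic IDs and ARNs such as AMIs and certificates, (2) reading outputs from another Terraform state file via terraform_remote_state, and (3) reading the caller's account ID and region so that names and ARNs are never hardcoded. Looking up shared VPC topology managed by a platform team, as in the aws_vpc example above, works the same way as pattern 1.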

# Pattern 1 — Dynamic certificate lookup (ACM)
data "aws_acm_certificate" "api" {
  domain      = "api.example.com"
  statuses    = ["ISSUED"]
  most_recent = true
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.api.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = data.aws_acm_certificate.api.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.api.arn
  }
}

# Pattern 2 — Read platform team's remote state
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "mycompany-terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

# Now access any output the network stack exported
locals {
  vpc_id          = data.terraform_remote_state.network.outputs.vpc_id
  private_subnets = data.terraform_remote_state.network.outputs.private_subnet_ids
}

# Pattern 3 — Caller identity (useful for naming and IAM)
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}

locals {
  account_id = data.aws_caller_identity.current.account_id
  region     = data.aws_region.current.name # AWS provider v6+ deprecates "name" in favor of "region"
  # Build a full ARN without hardcoding account/region
  log_group_arn = "arn:aws:logs:${local.region}:${local.account_id}:log-group:/app/web"
}
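
Pattern 2 only works if the producing stack declares root-level outputs, because terraform_remote_state exposes nothing else. The following sketch shows both sides under stated assumptions: the resource names aws_vpc.main and aws_subnet.private (created with count) are hypothetical stand-ins for whatever the network stack actually defines, and the two halves live in separate configurations.

```hcl
# --- In the network stack (the producer) ---
# Only root-module outputs are visible through terraform_remote_state.
output "vpc_id" {
  value = aws_vpc.main.id # assumed resource name
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id # assumes aws_subnet.private uses count
}

# --- In the consuming stack ---
# The remote-state output is referenced like any other attribute.
resource "aws_security_group" "app" {
  name_prefix = "app-"
  vpc_id      = data.terraform_remote_state.network.outputs.vpc_id
}
```

If the producer renames or removes an output, the consumer's next plan fails on the missing attribute, so treat these outputs as a versioned interface between teams.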

Implicit Dependencies and the Resource Graph

Terraform does not execute resources top-to-bottom. Instead, it builds a directed acyclic graph (DAG) — the resource graph — by parsing every reference in your configuration. When resource B references resource_a.foo.id, Terraform draws an edge from A to B, guaranteeing A is created before B. This happens automatically from references; you almost never need to state it explicitly.

The node types you will meet most often are provider nodes (initialize the AWS, GCP, or Vault API client), resource nodes (real infrastructure objects), and data nodes (read-only queries); variables, locals, outputs, and module calls are nodes in the graph as well. Terraform walks the graph in parallel: independent nodes run concurrently; dependent nodes wait. On a large configuration this parallelism, capped by -parallelism=N (default 10), is what keeps Terraform fast despite managing hundreds of resources.
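
A minimal sketch of how references become edges, using resources chosen only for brevity (the bucket and topic names are placeholders):

```hcl
resource "aws_s3_bucket" "logs" {
  bucket = "example-logs-bucket" # placeholder name
}

# The reference to aws_s3_bucket.logs.id creates an implicit edge:
# the bucket is always created before its versioning configuration.
resource "aws_s3_bucket_versioning" "logs" {
  bucket = aws_s3_bucket.logs.id

  versioning_configuration {
    status = "Enabled"
  }
}

# No reference to either resource above, so this node has no edge to them
# and Terraform is free to create it concurrently with the bucket.
resource "aws_sns_topic" "alerts" {
  name = "alerts"
}
```

Reordering these three blocks in the file changes nothing; only the references determine the order of operations.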

Figure: Terraform resource graph for a typical web stack. Data nodes (blue) are queried first; resource nodes (green) are created in dependency order; the RDS instance waits for both the subnet group and the EC2 instance.

Inspecting the Graph

You can materialize the graph at any time with terraform graph, which outputs DOT format. Pipe it into Graphviz to produce a PNG and review node ordering before a high-risk apply:

# Render the dependency graph as a PNG (requires graphviz)
terraform graph | dot -Tpng -o graph.png

# Inspect just the plan-phase graph (data reads + create actions)
terraform graph -type=plan | dot -Tsvg -o plan-graph.svg

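# In large configs, keep only the nodes and edges that mention specific resources.
# grep drops the "digraph" wrapper, so re-add it to keep the DOT input valid.
{ echo 'digraph G {'; terraform graph | grep -E '(aws_db_instance|aws_db_subnet_group)'; echo '}'; } | dot -Tpng -o db-subgraph.png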

Use terraform graph during code review on every large PR. A missing dependency edge means two resources that should be sequential will run in parallel, causing intermittent race-condition failures that are very difficult to reproduce. The graph makes invisible ordering assumptions explicit, which is why some platform teams include graph review in their Terraform module acceptance checklist.

Explicit Dependencies with depends_on

Terraform's graph infers dependencies from references, but not from side effects. If resource B requires that resource A has been applied — even though B does not directly reference any of A's attributes — you must express that with depends_on. The classic example is an IAM role policy that must propagate before a Lambda function can execute.

resource "aws_iam_role_policy_attachment" "lambda_exec" {
  role       = aws_iam_role.lambda.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

resource "aws_lambda_function" "processor" {
  function_name = "event-processor"
  role          = aws_iam_role.lambda.arn
  # ...

  # IAM changes propagate asynchronously — the function can fail
  # to start if the policy is not fully attached yet.
  # depends_on forces Terraform to wait even though we already reference
  # aws_iam_role.lambda.arn (which does NOT guarantee the attachment exists).
  depends_on = [aws_iam_role_policy_attachment.lambda_exec]
}

# depends_on also works on data sources: it defers the read until
# the listed resource has been created or updated:
data "aws_secretsmanager_secret_version" "db_pass" {
  secret_id = aws_secretsmanager_secret.db.id

  depends_on = [aws_secretsmanager_secret_version.db_initial]
}

Overusing depends_on defeats Terraform's parallelism. Every unnecessary edge serializes work that could run concurrently and inflates your apply time. More critically, depends_on on a module forces every resource inside that module to wait — even resources with no logical relationship to the dependency. Add explicit edges only when Terraform genuinely cannot infer the ordering from references alone.
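
Where a suitable attribute exists, a reference can often replace depends_on. The sketch below is one illustration of that idea, not the lesson's Lambda example: the role, policy, and profile names are placeholders, and it relies on the attachment resource exposing its role argument as a readable attribute.

```hcl
resource "aws_iam_role" "app" {
  name = "app-instance-role" # placeholder name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "ssm" {
  role       = aws_iam_role.app.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_instance_profile" "app" {
  name = "app-instance-profile" # placeholder name

  # Reading the role name through the attachment, rather than through
  # aws_iam_role.app, makes the profile wait for the attachment without
  # an explicit depends_on.
  role = aws_iam_role_policy_attachment.ssm.role
}
```

This keeps the ordering visible in the data flow itself. When no such attribute exists, as with the Lambda function above that needs the role ARN, depends_on remains the right tool.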

Production Failure Mode: Stale Data Sources

Data sources are re-evaluated on every plan and apply. If the upstream object changes between two applies (for example, a security team rotates the ACM certificate or a platform team changes VPC CIDR allocations), your next plan will see the new value. This is usually correct, but it can cause surprises: a new AMI ID returned by most_recent = true forces replacement of every EC2 instance that references it. In production, pin the AMI by adding a name_regex or name filter that matches one specific patch level, or filter on the image-id or on a tag set by your golden-image pipeline. Never use most_recent = true on AMIs in production without a tested rollback plan.
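
One way to pin the image is sketched below. The variable pinned_ami_name and the resource name web_pinned are hypothetical; the value would be the exact name of the release your team has tested.

```hcl
variable "pinned_ami_name" {
  type        = string
  description = "Exact name of the AMI release that has been tested and approved."
}

data "aws_ami" "pinned" {
  owners = ["amazon"]

  # most_recent is deliberately omitted: an exact name should match a
  # single image, and the lookup fails if the filter matches more than
  # one instead of silently selecting a newer release.
  filter {
    name   = "name"
    values = [var.pinned_ami_name]
  }
}

resource "aws_instance" "web_pinned" {
  ami           = data.aws_ami.pinned.id
  instance_type = "t3.medium"
}
```

With this layout an AMI upgrade becomes an explicit, reviewable change to one variable value rather than a side effect of whenever the next plan happens to run.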

In the next lesson you will explore the meta-arguments count, for_each, and lifecycle, which let you express iteration and resource lifecycle policies inside a single block, eliminating repetition at scale.