Cloud Architecture & Landing Zones

Network Architecture at Org Scale

When a startup grows into a multi-team, multi-account AWS organization, the naive approach — one VPC per workload, each peered ad-hoc — collapses under its own weight. Peering connections form an O(n²) mesh, route tables become unmanageable, and security teams lose visibility into inter-service traffic. The enterprise answer is a deliberate hub-and-spoke topology built around a centralized Transit Gateway, shared service VPCs, and a single choke point for internet egress.

This lesson covers the three pillars of org-scale networking: the hub-and-spoke model, shared VPCs via AWS Resource Access Manager (RAM), and centralized egress through an inspection VPC. You will leave with Terraform and AWS CLI patterns you can run in a real organization.

The Hub-and-Spoke Model

In a hub-and-spoke design, a central Transit Gateway (TGW) acts as the hub. Every spoke VPC — one per account, per environment, or per business unit — connects to the TGW via a TGW attachment. Spokes never peer with each other directly; all traffic transits the hub. This gives you:

Linear attachment scaling — AWS TGW supports up to 5,000 VPC attachments per gateway; no peering mesh.
Centralized route control — TGW Route Tables define which spokes can reach which. Isolated RTs prevent prod/dev cross-contamination without per-VPC ACL duplication.
Transitive routing — spoke A can reach spoke B only if the TGW RT permits it. VPC peering lacks this; TGW enables it natively.
Inspection insertion — you can steer all East–West or North–South traffic through a centralized firewall VPC without changing any spoke.
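
The route-table isolation described above can be sketched in Terraform. This is a minimal illustration, not the lesson's reference implementation: the variable names and the prod/dev/shared-services split are assumptions, and it presumes the TGW was created with default association and propagation disabled.

```hcl
# Hypothetical inputs: none of these variable names come from the lesson.
variable "transit_gateway_id" { type = string }
variable "prod_attachment_id" { type = string }
variable "dev_attachment_id" { type = string }
variable "shared_services_attachment_id" { type = string }

locals {
  # Environment attachments that must stay isolated from each other.
  isolated = {
    prod = var.prod_attachment_id
    dev  = var.dev_attachment_id
  }
}

# One TGW route table per isolated environment.
resource "aws_ec2_transit_gateway_route_table" "env" {
  for_each           = local.isolated
  transit_gateway_id = var.transit_gateway_id
  tags               = { Name = "${each.key}-rt" }
}

# Each environment attachment looks up routes in its own table.
resource "aws_ec2_transit_gateway_route_table_association" "env" {
  for_each                       = local.isolated
  transit_gateway_attachment_id  = each.value
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.env[each.key].id
}

# Only the shared-services attachment propagates into the environment tables,
# so prod never learns a route to dev, and dev never learns a route to prod.
resource "aws_ec2_transit_gateway_route_table_propagation" "shared_into_env" {
  for_each                       = local.isolated
  transit_gateway_attachment_id  = var.shared_services_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.env[each.key].id
}
```

For replies to flow, the shared-services attachment would need its own route table with propagations from both environment attachments; that half is left out here for brevity.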

AWS Transit Gateway vs VPC Peering: Peering is free (data transfer charges still apply) and has lower latency. TGW adds ~$0.05/attachment-hour plus $0.02/GB. At org scale the operational savings — one route table, one firewall policy, one logging stream — far outweigh the cost. At fewer than ~5 VPCs, peering may still be the right call.

Hub-and-spoke topology: all VPCs attach to a central Transit Gateway; internet egress is funneled through a single inspection VPC.

Terraform: Transit Gateway and Spoke Attachment

The following Terraform creates a TGW in the network account, shares it with the organization through RAM, and then attaches a spoke VPC owned by a workload account. Multi-account reference solutions such as Landing Zone Accelerator on AWS follow a similar pattern.

# network-account/tgw.tf
resource "aws_ec2_transit_gateway" "main" {
  description                     = "Org-wide Transit Gateway"
  amazon_side_asn                 = 64512
  default_route_table_association = "disable"  # We manage RTs explicitly
  default_route_table_propagation = "disable"
  auto_accept_shared_attachments  = "enable"   # RAM-shared spokes auto-accept

  tags = {
    Name = "org-tgw"
    Env  = "shared"
  }
}

# Share the TGW across the entire org via RAM
resource "aws_ram_resource_share" "tgw" {
  name                      = "tgw-org-share"
  allow_external_principals = false
}

resource "aws_ram_resource_association" "tgw" {
  resource_arn       = aws_ec2_transit_gateway.main.arn
  resource_share_arn = aws_ram_resource_share.tgw.arn
}

resource "aws_ram_principal_association" "org" {
  principal          = "arn:aws:organizations::MGMT_ACCOUNT_ID:organization/o-ORGID"
  resource_share_arn = aws_ram_resource_share.tgw.arn
}

# --- In the workload (spoke) account ---
resource "aws_ec2_transit_gateway_vpc_attachment" "spoke" {
  transit_gateway_id = "tgw-0abc1234"   # ID from network account (or data source)
  vpc_id             = aws_vpc.prod.id
  subnet_ids         = aws_subnet.private[*].id

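  # The default route table association/propagation arguments are omitted on
  # purpose: the provider cannot configure them for a RAM-shared TGW. The
  # network account associates this attachment with a TGW route table instead.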

  tags = { Name = "prod-spoke-attach" }
}
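
Rather than hard-coding the TGW ID in the spoke account, the attachment could read it from a data source. The sketch below assumes the workload account can see exactly one RAM-shared Transit Gateway owned by the network account; the variable and output names are hypothetical.

```hcl
# Hypothetical input: the account ID of the network account that owns the TGW.
variable "network_account_id" { type = string }

# Look up the RAM-shared Transit Gateway by owner. The data source expects
# a single match, so add more filters if the account can see several.
data "aws_ec2_transit_gateway" "shared" {
  filter {
    name   = "owner-id"
    values = [var.network_account_id]
  }

  filter {
    name   = "state"
    values = ["available"]
  }
}

# Use this value for transit_gateway_id on the spoke attachment.
output "shared_transit_gateway_id" {
  value = data.aws_ec2_transit_gateway.shared.id
}
```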

Shared VPCs via Resource Access Manager

A Shared VPC (also called VPC Sharing) lets you own one VPC in a central network account and share individual subnets into multiple AWS accounts via RAM. Workload accounts launch EC2 instances and ECS tasks directly into the shared subnets — they never need their own VPC or NAT Gateway. This pattern dramatically reduces NAT costs at scale.

The central account retains control of routing, NACLs, and flow logs.
Each participant account controls security groups within the shared subnet — they cannot modify the route table.
Service Control Policies (SCPs) in AWS Organizations can prevent participant accounts from creating their own VPCs entirely, enforcing the shared model.
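
As an illustration of the last point, an SCP that blocks VPC creation might look like the following. It is a sketch under assumptions: the policy name and the `workloads_ou_id` variable are hypothetical, the network account is assumed to live outside the targeted OU, and the configuration would be applied from the organization's management account.

```hcl
# Hypothetical input: the OU that contains the workload accounts.
variable "workloads_ou_id" { type = string }

resource "aws_organizations_policy" "deny_vpc_creation" {
  name        = "deny-vpc-creation"
  description = "Workload accounts must deploy into shared subnets"
  type        = "SERVICE_CONTROL_POLICY"

  # Deny both explicit VPC creation and re-creation of the default VPC.
  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid      = "DenyVpcCreation"
      Effect   = "Deny"
      Action   = ["ec2:CreateVpc", "ec2:CreateDefaultVpc"]
      Resource = "*"
    }]
  })
}

# Attach to the workloads OU only, so the network account is unaffected.
resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.deny_vpc_creation.id
  target_id = var.workloads_ou_id
}
```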

# Share a specific subnet from the network account with every account in an OU
aws ram create-resource-share \
  --name "shared-private-subnet-useast1a" \
  --resource-arns "arn:aws:ec2:us-east-1:NETWORK_ACCT:subnet/subnet-0abc1234" \
  --principals "arn:aws:organizations::MGMT_ACCT:ou/o-ORG/ou-ROOT-OUID" \
  --permission-arns "arn:aws:ram::aws:permission/AWSRAMDefaultPermissionSubnet"

# Verify from the workload account: shared subnets are listed, but OwnerId is the network account
aws ec2 describe-subnets \
  --filters "Name=owner-id,Values=NETWORK_ACCT" \
  --query "Subnets[*].{ID:SubnetId,CIDR:CidrBlock,AZ:AvailabilityZone}"
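
From the participant side, the same lookup can be done in Terraform, and it also shows the division of control: the participant creates its own security group inside a VPC it does not own. This is a sketch, assuming at least one subnet has been shared; the variable and resource names are hypothetical.

```hcl
# Hypothetical input: the account ID of the network account.
variable "network_account_id" { type = string }

# Subnets visible to this account but owned by the network account.
data "aws_subnets" "shared" {
  filter {
    name   = "owner-id"
    values = [var.network_account_id]
  }
}

# Read one shared subnet to learn which VPC it belongs to.
data "aws_subnet" "first" {
  id = sort(data.aws_subnets.shared.ids)[0]
}

# The security group is owned by the participant account even though
# the VPC (and its route tables) belong to the network account.
resource "aws_security_group" "app" {
  name        = "app-sg"
  description = "Participant-owned security group in the shared VPC"
  vpc_id      = data.aws_subnet.first.vpc_id

  egress {
    description = "HTTPS out"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```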

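Cost optimization at scale: With shared subnets you deploy one NAT Gateway per AZ in the network account, and every workload account egresses through it. At the us-east-1 list price of $0.045 per NAT Gateway-hour (about $33 per month), each account that no longer runs its own NAT Gateway saves about $33 per month per AZ. Per-GB processing charges apply either way and are not part of the saving. At 50 workload accounts across 3 AZs, sharing replaces 150 NAT Gateways with 3, which saves roughly $58,000 per year in hourly charges alone (147 × $32.85 × 12).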

Centralized Egress and Traffic Inspection

Allowing each spoke VPC to egress directly to the internet creates blind spots: no unified threat detection, no single FQDN allowlist, and firewall rules scattered across 50 accounts. The solution is to route all outbound internet traffic through a dedicated Egress VPC (sometimes called a Security VPC) that houses:

AWS Network Firewall (or a third-party NGFW) — stateful packet inspection, FQDN filtering, IDS/IPS.
NAT Gateways — a small pool of stable Elastic IPs that you can add to vendor allowlists.
VPC Flow Logs → S3 / CloudWatch — single stream for SIEM ingestion.

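TGW route tables wire this up. Each spoke VPC route table has a default route (0.0.0.0/0) pointing at the TGW. The TGW looks up the destination in the TGW route table associated with the spoke's attachment, where a 0.0.0.0/0 route forwards the packet to the Egress VPC attachment. Inside the Egress VPC, the firewall inspects the packet, the NAT Gateway translates the source address, and the internet gateway sends it out. Return traffic needs the reverse path: the TGW route table associated with the Egress VPC attachment must hold a route to each spoke CIDR.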
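
The spoke-side default route and the return path can be sketched as follows. All variable names are hypothetical, and the resources are shown in one file only for readability; in practice the `aws_route` would be applied in the spoke account and the TGW route table resources in the network account.

```hcl
# Hypothetical inputs: none of these variable names come from the lesson.
variable "transit_gateway_id" { type = string }
variable "spoke_private_route_table_id" { type = string }
variable "egress_attachment_id" { type = string }
variable "spoke_attachment_ids" { type = map(string) }

# Spoke VPC: anything without a more specific route goes to the TGW.
resource "aws_route" "spoke_default_to_tgw" {
  route_table_id         = var.spoke_private_route_table_id
  destination_cidr_block = "0.0.0.0/0"
  transit_gateway_id     = var.transit_gateway_id
}

# Return path: a TGW route table used by the Egress VPC attachment.
resource "aws_ec2_transit_gateway_route_table" "egress_rt" {
  transit_gateway_id = var.transit_gateway_id
  tags               = { Name = "egress-rt" }
}

resource "aws_ec2_transit_gateway_route_table_association" "egress" {
  transit_gateway_attachment_id  = var.egress_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.egress_rt.id
}

# Each spoke propagates its CIDR into the egress table, so replies
# coming back from the internet can find their way to the right spoke.
resource "aws_ec2_transit_gateway_route_table_propagation" "spokes_into_egress" {
  for_each                       = var.spoke_attachment_ids
  transit_gateway_attachment_id  = each.value
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.egress_rt.id
}
```

The Egress VPC's own subnet route tables also need routes for the spoke CIDRs that point back at the TGW; those are omitted here.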

# TGW Route Table for spoke accounts — default route to Egress VPC attachment
resource "aws_ec2_transit_gateway_route_table" "spoke_rt" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  tags               = { Name = "spoke-rt" }
}

resource "aws_ec2_transit_gateway_route" "default_egress" {
  destination_cidr_block         = "0.0.0.0/0"
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.egress.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.spoke_rt.id
}

# Associate all spoke attachments with this RT
resource "aws_ec2_transit_gateway_route_table_association" "spoke" {
  for_each                       = var.spoke_attachment_ids
  transit_gateway_attachment_id  = each.value
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.spoke_rt.id
}

# AWS Network Firewall in Egress VPC
resource "aws_networkfirewall_firewall_policy" "egress" {
  name = "org-egress-policy"
  firewall_policy {
    stateless_default_actions          = ["aws:forward_to_sfe"]
    stateless_fragment_default_actions = ["aws:forward_to_sfe"]
    stateful_rule_group_reference {
      resource_arn = aws_networkfirewall_rule_group.allowed_domains.arn
    }
  }
}

resource "aws_networkfirewall_rule_group" "allowed_domains" {
  capacity = 100
  name     = "fqdn-allowlist"
  type     = "STATEFUL"
  rule_group {
    rules_source {
      rules_source_list {
        generated_rules_type = "ALLOWLIST"
        target_types         = ["HTTP_HOST", "TLS_SNI"]
        targets = [
          ".amazonaws.com",
          ".github.com",
          ".docker.io",
          ".pypi.org",
          "registry.npmjs.org",
        ]
      }
    }
  }
}
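
One detail worth checking in a centralized deployment: domain list rules match traffic whose source falls inside the HOME_NET variable, which defaults to the CIDR of the VPC hosting the firewall. Traffic arriving from spokes would fall outside that default, so HOME_NET usually has to be widened. The sketch below shows one way to do that; the resource name, the `spoke_cidrs` variable, and the 10.0.0.0/8 default are assumptions for illustration.

```hcl
# Hypothetical input: the address space used by spoke VPCs.
variable "spoke_cidrs" {
  type    = list(string)
  default = ["10.0.0.0/8"]
}

resource "aws_networkfirewall_rule_group" "allowed_domains_central" {
  capacity = 100
  name     = "fqdn-allowlist-central"
  type     = "STATEFUL"

  rule_group {
    # Treat spoke address space as "home" so the allowlist applies to
    # traffic that originates outside the firewall's own VPC.
    rule_variables {
      ip_sets {
        key = "HOME_NET"
        ip_set {
          definition = var.spoke_cidrs
        }
      }
    }

    rules_source {
      rules_source_list {
        generated_rules_type = "ALLOWLIST"
        target_types         = ["HTTP_HOST", "TLS_SNI"]
        targets              = [".amazonaws.com"]
      }
    }
  }
}
```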

Production Failure Modes

Asymmetric routing breaks stateful firewalls. If a packet enters the Egress VPC from the TGW in one Availability Zone but the return traffic is delivered through a different AZ, it reaches a firewall endpoint that holds no state for the flow and is dropped. Always deploy NAT Gateways and Network Firewall endpoints in each AZ, and enable appliance mode on the Egress VPC's TGW attachment (appliance_mode_support = "enable") so the TGW keeps both directions of a flow in the same AZ. This single flag is the most common missed step in new org-scale network builds.
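
A minimal sketch of that attachment follows. It assumes the Egress VPC lives in the same account as the TGW (so the default route table arguments are configurable) and that one dedicated TGW subnet exists per AZ; the variable names are hypothetical.

```hcl
# Hypothetical inputs: none of these variable names come from the lesson.
variable "transit_gateway_id" { type = string }
variable "egress_vpc_id" { type = string }
variable "egress_tgw_subnet_ids" { type = list(string) } # one subnet per AZ

resource "aws_ec2_transit_gateway_vpc_attachment" "egress" {
  transit_gateway_id = var.transit_gateway_id
  vpc_id             = var.egress_vpc_id
  subnet_ids         = var.egress_tgw_subnet_ids

  # Keep both directions of a flow on the same AZ so the stateful
  # firewall endpoint always sees the whole conversation.
  appliance_mode_support = "enable"

  transit_gateway_default_route_table_association = false
  transit_gateway_default_route_table_propagation = false

  tags = { Name = "egress-attach" }
}
```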

TGW bandwidth limits: each VPC attachment has a bandwidth quota per Availability Zone (cited here as 50 Gbps burst; AWS has revised this figure over time, so check the current Transit Gateway quotas). High-throughput data pipelines (S3 bulk transfers, Spark on EMR) should use VPC endpoints inside the spoke, such as an S3 gateway endpoint, to bypass the TGW entirely.
CIDR overlap: plan your org-wide IP space before attaching spokes. The TGW accepts an attachment whose CIDR overlaps an existing one, but it does not propagate the overlapping route, so traffic to the new VPC silently fails to route. Use AWS IPAM (IP Address Manager) to allocate non-overlapping /16s to each OU.
DNS resolution across accounts: Route 53 Resolver endpoints in the Shared Services VPC, with forwarding rules shared via RAM, are the standard solution. Without them, private hosted zones in one account are invisible to workloads in another even when the network path exists.
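
The DNS item above can be sketched as an outbound Resolver endpoint, a forwarding rule, and a RAM share of that rule. Everything named here is an assumption: the `internal.example.com` domain, the variable names, and the idea that the rule forwards to inbound Resolver endpoint IPs in the Shared Services VPC.

```hcl
# Hypothetical inputs: none of these variable names come from the lesson.
variable "shared_services_subnet_ids" { type = list(string) } # at least two
variable "resolver_security_group_id" { type = string }
variable "inbound_resolver_ips" { type = list(string) }
variable "org_arn" { type = string }

resource "aws_route53_resolver_endpoint" "outbound" {
  name               = "shared-outbound"
  direction          = "OUTBOUND"
  security_group_ids = [var.resolver_security_group_id]

  # One endpoint IP per subnet for AZ redundancy.
  dynamic "ip_address" {
    for_each = var.shared_services_subnet_ids
    content {
      subnet_id = ip_address.value
    }
  }
}

# Forward queries for the internal domain to the central resolvers.
resource "aws_route53_resolver_rule" "internal" {
  name                 = "forward-internal"
  domain_name          = "internal.example.com"
  rule_type            = "FORWARD"
  resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id

  dynamic "target_ip" {
    for_each = var.inbound_resolver_ips
    content {
      ip = target_ip.value
    }
  }
}

# Share the rule with the organization so spoke accounts can use it.
resource "aws_ram_resource_share" "dns" {
  name                      = "dns-rules-org-share"
  allow_external_principals = false
}

resource "aws_ram_resource_association" "dns" {
  resource_arn       = aws_route53_resolver_rule.internal.arn
  resource_share_arn = aws_ram_resource_share.dns.arn
}

resource "aws_ram_principal_association" "dns" {
  principal          = var.org_arn
  resource_share_arn = aws_ram_resource_share.dns.arn
}
```

Each spoke account would then associate the shared rule with its VPC, for example with an `aws_route53_resolver_rule_association` resource.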

AWS IPAM for zero-overlap guarantees: Enable IPAM at the org level, define pools per region and per OU, and enforce allocation via SCP. When a new workload account needs a VPC, it calls aws ec2 allocate-ipam-pool-cidr — the system assigns a non-overlapping block automatically. This eliminates the most painful source of org-scale network re-architecture.
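
A compact Terraform sketch of the IPAM flow is shown below. For brevity it keeps the IPAM, the pool, and the VPC in one configuration; in an organization the pool would typically be shared with workload accounts through RAM. The region default, the 10.64.0.0/12 pool, and all resource names are assumptions.

```hcl
# Hypothetical input: the region IPAM operates in.
variable "region" {
  type    = string
  default = "us-east-1"
}

resource "aws_vpc_ipam" "org" {
  description = "Org-wide IPAM"

  operating_regions {
    region_name = var.region
  }
}

# A regional pool in the private scope, intended for one OU.
resource "aws_vpc_ipam_pool" "workloads" {
  address_family = "ipv4"
  ipam_scope_id  = aws_vpc_ipam.org.private_default_scope_id
  locale         = var.region
  description    = "Workloads OU pool"
}

# The address space the pool is allowed to hand out.
resource "aws_vpc_ipam_pool_cidr" "workloads" {
  ipam_pool_id = aws_vpc_ipam_pool.workloads.id
  cidr         = "10.64.0.0/12"
}

# The VPC asks for a /16 and IPAM picks a free, non-overlapping block.
resource "aws_vpc" "workload" {
  ipv4_ipam_pool_id   = aws_vpc_ipam_pool.workloads.id
  ipv4_netmask_length = 16

  # The pool must hold a provisioned CIDR before it can allocate.
  depends_on = [aws_vpc_ipam_pool_cidr.workloads]

  tags = { Name = "workload-vpc" }
}
```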

Key Takeaways

Hub-and-spoke via Transit Gateway gives you O(n) scaling, centralized routing control, and traffic inspection insertion without touching any spoke.
VPC Sharing (RAM) consolidates NAT Gateways and keeps network ownership in one account while letting dozens of teams deploy into shared subnets.
Centralized egress with AWS Network Firewall provides a single FQDN allowlist, unified flow logs, and a stable set of Elastic IPs — critical for compliance and incident response.
Enable appliance_mode_support on TGW attachments going to inspection appliances to prevent AZ-asymmetry firewall drops.
Use AWS IPAM from day one to eliminate CIDR overlap as the organization grows.