Cloud Architecture & Landing Zones

Hybrid Connectivity

Every organization that moves to the cloud still has data-center workloads — legacy mainframes, on-prem databases holding regulated data, manufacturing SCADA systems, or simply bandwidth-heavy workloads where egress cost makes pure-cloud uneconomical. Hybrid connectivity is the engineering discipline that joins those two worlds reliably, securely, and at the bandwidth and latency your applications actually need.

In this lesson we cover the two primary AWS hybrid-connectivity primitives — Site-to-Site VPN and AWS Direct Connect — along with the DNS plumbing that makes the entire hybrid estate feel like a single coherent namespace.

Site-to-Site VPN

AWS Site-to-Site VPN terminates IPsec tunnels between your on-premises Customer Gateway (CGW), a physical or software router, and an AWS Virtual Private Gateway (VGW) or Transit Gateway (TGW). AWS provisions two tunnels per VPN connection, each terminating in a different Availability Zone, for redundancy; your on-premises device must be able to bring up both and fail over between them.

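Throughput: each tunnel is rate-limited to ~1.25 Gbps. With ECMP over both tunnels (supported on a TGW with BGP routing, not on a VGW) the practical aggregate is ~2.5 Gbps per connection. TGW also supports ECMP across multiple VPN connections, so 2–4 connections yield roughly 5–10 Gbps in aggregate. ECMP balances per flow, so any single flow is still capped at one tunnel's ~1.25 Gbps.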
Latency: traverses the public internet — expect 20–80 ms additional latency depending on geography and ISP path quality. Unsuitable for latency-sensitive synchronous DB replication or real-time trading.
Cost: ~$0.05/hr per connection + $0.09/GB data transfer out. For moderate traffic (<1 TB/month) it is almost always cheaper than Direct Connect.
Use cases: branch offices, disaster-recovery standby, dev/test environments, initial migration runway before Direct Connect is provisioned.

Provisioning a VPN attachment on a Transit Gateway via Terraform:

# customer_gateway.tf
resource "aws_customer_gateway" "dc1" {
  bgp_asn    = 65001        # your on-prem ASN
  ip_address = "203.0.113.1" # your public egress IP
  type       = "ipsec.1"
  tags = { Name = "dc1-cgw" }
}

resource "aws_vpn_connection" "dc1_tgw" {
  customer_gateway_id = aws_customer_gateway.dc1.id
  transit_gateway_id  = aws_ec2_transit_gateway.core.id
  type                = "ipsec.1"
  static_routes_only  = false   # use BGP

  tunnel1_preshared_key = var.vpn_psk_tunnel1
  tunnel2_preshared_key = var.vpn_psk_tunnel2

  tags = { Name = "dc1-to-tgw" }
}

# Download the device config from the console or CLI after creation:
# aws ec2 describe-vpn-connections \
#   --vpn-connection-ids <id> \
#   --query 'VpnConnections[0].CustomerGatewayConfiguration'
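
The snippet above references aws_ec2_transit_gateway.core without defining it. A minimal sketch of that gateway with VPN ECMP enabled, plus extra VPN connections to scale aggregate throughput as described in the Throughput note, might look like the following. The AWS-side ASN, the connection count, and the resource name dc1_tgw_extra are illustrative assumptions rather than values from this lesson, and pre-shared keys are left for AWS to generate.

```hcl
# transit_gateway.tf (illustrative sketch, not a definitive layout)
resource "aws_ec2_transit_gateway" "core" {
  description      = "core hub"   # assumed description
  amazon_side_asn  = 64512        # assumed private ASN for the AWS side
  vpn_ecmp_support = "enable"     # let the TGW load-balance across VPN tunnels

  tags = { Name = "core-tgw" }
}

# Additional BGP-based VPN connections to the same customer gateway.
# Each connection adds two tunnels that the TGW can include in ECMP.
resource "aws_vpn_connection" "dc1_tgw_extra" {
  count = 3 # assumed; together with dc1_tgw this gives 4 connections

  customer_gateway_id = aws_customer_gateway.dc1.id
  transit_gateway_id  = aws_ec2_transit_gateway.core.id
  type                = "ipsec.1"
  static_routes_only  = false # ECMP relies on dynamic (BGP) routing

  tags = { Name = "dc1-to-tgw-extra-${count.index + 1}" }
}
```

ECMP only helps if the on-prem router also installs multiple equal-cost BGP paths (multipath), and it spreads flows rather than splitting a single flow across tunnels.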

BGP is mandatory at scale. Static routes require manual updates every time a CIDR is added in either estate; BGP propagates routes automatically. Use BGP ASNs in the private range (64512–65534) for your on-prem CGW unless you own a real public ASN.

AWS Direct Connect

Direct Connect (DX) is a dedicated Layer 2 circuit between your facility (or a colocation partner) and an AWS Direct Connect Location. Traffic never traverses the public internet, which means predictable latency, consistent throughput, and a simpler security posture (no internet-facing tunnel endpoints on the data path, although the circuit itself is not encrypted by default).

Port speeds: 1 Gbps, 10 Gbps, 100 Gbps. Sub-1Gbps (50/100/200/300/400/500 Mbps) is available via Hosted Connections from APN partners.
Latency: deterministic — typically 1–5 ms to the nearest AWS region from a co-located DX location.
Virtual Interfaces (VIFs): Private VIF → connects to a single VPC via a VGW, or to multiple VPCs via a Direct Connect Gateway; Transit VIF → connects to one or more TGWs through a Direct Connect Gateway (preferred at org scale: one DX feeds all VPCs attached to the TGW); Public VIF → reaches AWS public endpoints (S3, DynamoDB) over BGP using public IP prefixes.
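Resilience SLA: a single DX port is a single point of failure and should not carry production traffic on its own. AWS's high-resiliency model (99.9%) uses one connection at each of two DX locations; the maximum-resiliency model (99.99%) uses two connections at each of two DX locations, terminating on separate devices at both ends (four connections, each with its own BGP session).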
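
To make the Transit VIF path concrete, here is a hedged sketch that attaches an existing DX connection to the Transit Gateway through a Direct Connect Gateway. The variable dx_connection_id, the VLAN, the Direct Connect Gateway ASN, and the allowed prefix are all assumptions for illustration; aws_ec2_transit_gateway.core is the gateway referenced in the VPN example.

```hcl
# direct_connect.tf (illustrative sketch under stated assumptions)
variable "dx_connection_id" {
  description = "ID of an already-provisioned DX connection (hypothetical input)"
  type        = string
}

resource "aws_dx_gateway" "core" {
  name            = "core-dxgw"
  amazon_side_asn = 64513 # assumed private ASN, kept distinct from the TGW's ASN
}

resource "aws_dx_transit_virtual_interface" "dc1" {
  connection_id  = var.dx_connection_id
  dx_gateway_id  = aws_dx_gateway.core.id
  name           = "dc1-transit-vif"
  vlan           = 100    # assumed 802.1Q tag agreed with the provider
  address_family = "ipv4"
  bgp_asn        = 65001  # on-prem ASN, matching the customer gateway example
  mtu            = 8500   # jumbo frames; use 1500 if the on-prem path cannot match
}

# Associate the Direct Connect Gateway with the Transit Gateway and
# declare which AWS-side prefixes are advertised to on-premises.
resource "aws_dx_gateway_association" "core_tgw" {
  dx_gateway_id         = aws_dx_gateway.core.id
  associated_gateway_id = aws_ec2_transit_gateway.core.id
  allowed_prefixes      = ["10.0.0.0/8"] # assumed summary of the VPC address space
}
```

For the resilient designs described above, the same virtual interface block would be repeated once per DX connection, each with its own connection ID and VLAN.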

VPN uses the public internet (encrypted); Direct Connect uses a dedicated fiber circuit through an AWS DX Location — both terminate on the Transit Gateway.

Direct Connect alone is not encrypted. The physical circuit is private, but it is not cryptographically protected. For compliance regimes such as HIPAA, PCI-DSS, or FedRAMP High, add encryption on top: either MACsec on the DX connection itself or an IPsec Site-to-Site VPN running over it. MACsec (Layer 2 encryption, supported on 10G/100G dedicated connections) is the modern choice: line-rate hardware encryption without a software tunnel. MACsec protects only the hop between your device and the AWS device at the DX location, so end-to-end requirements may still call for IPsec or TLS.
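
As a rough illustration of the MACsec option, the sketch below requests a MACsec-capable dedicated connection and associates a key pair with it. The location code, connection name, and the two variables are hypothetical; in practice the CKN/CAK pair is generated by you and must also be configured on the on-prem device.

```hcl
# dx_macsec.tf (illustrative sketch; values are assumptions)
variable "macsec_ckn" {
  description = "Connectivity association key name, hex string (hypothetical input)"
  type        = string
}

variable "macsec_cak" {
  description = "Connectivity association key, hex string (hypothetical input)"
  type        = string
  sensitive   = true
}

resource "aws_dx_connection" "dc1_primary" {
  name            = "dc1-primary"
  bandwidth       = "10Gbps"
  location        = "EqDC2"          # assumed DX location code
  request_macsec  = true             # ask for a MACsec-capable port
  encryption_mode = "should_encrypt" # tighten to "must_encrypt" once keys work end to end
}

resource "aws_dx_macsec_key_association" "dc1_primary" {
  connection_id = aws_dx_connection.dc1_primary.id
  ckn           = var.macsec_ckn
  cak           = var.macsec_cak
}
```

Starting with should_encrypt is a cautious choice: must_encrypt takes the link down whenever the MACsec session cannot be established, which is the desired end state but unforgiving during initial bring-up.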

Choosing Between VPN and Direct Connect

In practice the choice is not binary — most production organizations run both:

Direct Connect as primary: production workloads, bulk data transfer, latency-sensitive services.
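VPN as backup / DR path: automatically takes over via BGP if DX goes down. On your routers, give VPN-learned routes a lower local-preference (a higher MED also de-prefers a path, since lower MED wins) so DX is preferred while it is up. On the AWS side, DX routes already win over VPN routes for the same prefix.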
VPN only: branch offices, small satellite sites, dev/test, or any site where DX lead time (typically 30–90 days for fiber provisioning) is not acceptable.

DNS Across the Hybrid Estate

Hybrid connectivity solves Layer 3 routing; DNS solves naming. Without DNS integration, engineers must hard-code IP addresses or maintain brittle hosts files, and neither is acceptable at production scale.

The AWS-native solution is Route 53 Resolver, which provides two kinds of managed endpoints that you place in a VPC:

Inbound Resolver Endpoint: ENIs in your VPC subnets that accept DNS queries from on-premises. Your on-prem resolver forwards queries for *.aws.internal (or any private zone) to these ENIs over DX or VPN.
Outbound Resolver Endpoint: ENIs that send DNS queries toward on-premises resolvers. A Resolver Rule says "forward queries for corp.example.com to 10.10.1.53".

# route53_resolver.tf — inbound + outbound endpoints + forwarding rule

resource "aws_route53_resolver_endpoint" "inbound" {
  name      = "hybrid-inbound"
  direction = "INBOUND"
  security_group_ids = [aws_security_group.r53_inbound.id]

  ip_address {
    subnet_id = aws_subnet.private_a.id
  }
  ip_address {
    subnet_id = aws_subnet.private_b.id
  }
}

resource "aws_route53_resolver_endpoint" "outbound" {
  name      = "hybrid-outbound"
  direction = "OUTBOUND"
  security_group_ids = [aws_security_group.r53_outbound.id]

  ip_address {
    subnet_id = aws_subnet.private_a.id
  }
  ip_address {
    subnet_id = aws_subnet.private_b.id
  }
}

resource "aws_route53_resolver_rule" "on_prem_forward" {
  domain_name          = "corp.example.com"
  name                 = "forward-to-on-prem"
  rule_type            = "FORWARD"
  resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id

  target_ip {
    ip   = "10.10.1.53"   # primary on-prem DNS
    port = 53
  }
  target_ip {
    ip   = "10.10.2.53"   # secondary on-prem DNS
    port = 53
  }
}

resource "aws_route53_resolver_rule_association" "core_vpc" {
  resolver_rule_id = aws_route53_resolver_rule.on_prem_forward.id
  vpc_id           = aws_vpc.core.id
}

# Share the rule with other accounts via RAM (Resource Access Manager).
# A principal association (not shown here) is still needed to name who receives the share.
resource "aws_ram_resource_share" "resolver_rules" {
  name                      = "resolver-rules"
  allow_external_principals = false
}

resource "aws_ram_resource_association" "rule" {
  resource_arn       = aws_route53_resolver_rule.on_prem_forward.arn
  resource_share_arn = aws_ram_resource_share.resolver_rules.arn
}
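
The endpoint resources above reference aws_security_group.r53_inbound and aws_security_group.r53_outbound without defining them. A minimal sketch of those groups follows; the on_prem_cidr variable and its default are assumptions chosen to cover the example DNS servers at 10.10.1.53 and 10.10.2.53.

```hcl
# resolver_security_groups.tf (illustrative sketch)
variable "on_prem_cidr" {
  description = "On-premises address range (assumed value)"
  type        = string
  default     = "10.10.0.0/16"
}

# Inbound endpoint: accept DNS queries arriving from on-premises.
resource "aws_security_group" "r53_inbound" {
  name   = "r53-resolver-inbound"
  vpc_id = aws_vpc.core.id

  ingress {
    description = "DNS over UDP from on-prem"
    from_port   = 53
    to_port     = 53
    protocol    = "udp"
    cidr_blocks = [var.on_prem_cidr]
  }

  ingress {
    description = "DNS over TCP from on-prem"
    from_port   = 53
    to_port     = 53
    protocol    = "tcp"
    cidr_blocks = [var.on_prem_cidr]
  }
}

# Outbound endpoint: allow DNS queries toward the on-prem resolvers.
resource "aws_security_group" "r53_outbound" {
  name   = "r53-resolver-outbound"
  vpc_id = aws_vpc.core.id

  egress {
    description = "DNS over UDP to on-prem"
    from_port   = 53
    to_port     = 53
    protocol    = "udp"
    cidr_blocks = [var.on_prem_cidr]
  }

  egress {
    description = "DNS over TCP to on-prem"
    from_port   = 53
    to_port     = 53
    protocol    = "tcp"
    cidr_blocks = [var.on_prem_cidr]
  }
}
```

Both UDP and TCP are opened because DNS falls back to TCP for responses that do not fit in a UDP datagram.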

Share Resolver Rules via RAM, not per-VPC. In a multi-account org, create your Resolver Endpoints and Rules in the Shared Services (or Network) account, then use AWS Resource Access Manager to share the rules with every spoke account, where each VPC associates the shared rule. This gives you one source of truth for forwarding configuration instead of dozens of per-VPC copies to keep in sync.
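
A resource share only takes effect once a principal is attached to it and the receiving VPCs associate the rule. The sketch below shows one way that could look; organization_arn, shared_resolver_rule_id, and spoke_vpc_id are hypothetical inputs, and the second resource is assumed to live in each spoke account's own configuration.

```hcl
# ram_sharing.tf, network account (illustrative sketch)
variable "organization_arn" {
  description = "ARN of the AWS Organization or OU to share with (hypothetical input)"
  type        = string
}

resource "aws_ram_principal_association" "org" {
  principal          = var.organization_arn
  resource_share_arn = aws_ram_resource_share.resolver_rules.arn
}

# resolver_association.tf, each spoke account (illustrative sketch)
variable "shared_resolver_rule_id" {
  description = "ID of the rule shared from the network account (hypothetical input)"
  type        = string
}

variable "spoke_vpc_id" {
  description = "ID of the spoke VPC that should use the rule (hypothetical input)"
  type        = string
}

resource "aws_route53_resolver_rule_association" "spoke" {
  resolver_rule_id = var.shared_resolver_rule_id
  vpc_id           = var.spoke_vpc_id
}
```

Sharing with an organization ARN assumes RAM sharing with AWS Organizations has been enabled; otherwise individual account IDs can be used as principals.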

Production Failure Modes to Know

BGP session flap: a misconfigured hold timer or an MTU mismatch on the DX circuit causes intermittent BGP resets. Confirm that the MTU matches end to end: a transit VIF supports 1500 or 8500 bytes and a private VIF supports 1500 or 9001 bytes, so enable jumbo frames only if every device on the path is configured for the same value.
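Asymmetric routing: when both VPN and DX are active, return traffic may take a different path than forward traffic. Stateful firewalls drop flows when they see only one direction, so align route preferences in both directions, make sure both paths traverse the same firewall state, and do not apply NAT on the VPN path when DX is primary.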
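DNS split-brain: an EC2 instance resolves a name to a private IP via Route 53, but the on-prem host resolves the same name to a public IP. This happens when on-prem resolvers do not forward the zone to the inbound endpoint, or when the private hosted zone is not associated with every relevant VPC (including the one hosting the inbound endpoint). Audit associations with aws route53 list-hosted-zones-by-vpc --vpc-id <vpc-id> --vpc-region <region>.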
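DX failover delay: without BFD (Bidirectional Forwarding Detection), a DX failure is detected only when the BGP hold timer expires, which takes 90 seconds with the default timer. AWS enables asynchronous BFD on every DX virtual interface, but it takes effect only once you configure it on your router. With BFD active, link failures are detected in under one second and trigger immediate BGP withdrawal.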
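
Several of these failure modes first show up as a connection or tunnel changing state, so it can help to alarm on that. The following is a sketch only: the SNS topic variable, the DX connection ID variable, and the alarm names are assumptions, while the metrics are the ConnectionState (AWS/DX) and TunnelState (AWS/VPN) CloudWatch metrics.

```hcl
# hybrid_alarms.tf (illustrative sketch)
variable "alerts_topic_arn" {
  description = "SNS topic for network alerts (hypothetical input)"
  type        = string
}

variable "dx_connection_id" {
  description = "ID of the DX connection to watch (hypothetical input)"
  type        = string
}

# ConnectionState is 1 when the DX connection is up and 0 when it is down.
resource "aws_cloudwatch_metric_alarm" "dx_down" {
  alarm_name          = "dx-connection-down"
  namespace           = "AWS/DX"
  metric_name         = "ConnectionState"
  dimensions          = { ConnectionId = var.dx_connection_id }
  statistic           = "Minimum"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  alarm_actions       = [var.alerts_topic_arn]
}

# TunnelState drops below 1 when a tunnel of the backup VPN is not established.
resource "aws_cloudwatch_metric_alarm" "vpn_tunnel_down" {
  alarm_name          = "vpn-tunnel-down"
  namespace           = "AWS/VPN"
  metric_name         = "TunnelState"
  dimensions          = { VpnId = aws_vpn_connection.dc1_tgw.id }
  statistic           = "Minimum"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  alarm_actions       = [var.alerts_topic_arn]
}
```

Alarming on the backup VPN matters as much as on DX: a standby path that has been silently down for weeks is only discovered when the primary fails.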

Test your failover before you need it. Simulate a DX failure quarterly by setting the DX BGP session to administrative down on the router and confirming that VPN takes over within your target RTO (usually <60 s with BFD). Document the test in your runbook.

Summary

Site-to-Site VPN is fast to provision and cost-effective for low-to-medium bandwidth; Direct Connect is the backbone for production workloads requiring consistent latency and high throughput, with MACsec or IPsec layered on top for regulated industries because DX is not encrypted by default. The two are complementary: DX as primary, VPN as warm standby. Route 53 Resolver bridges DNS across the hybrid estate without hard-coded IPs. Share Resolver Rules via RAM, enable BFD on DX sessions, and test failover on a schedule: these are the habits that separate hobby architectures from enterprise-grade hybrid networks.