Hybrid Connectivity
Most organizations that move to the cloud still run data-center workloads: legacy mainframes, on-prem databases holding regulated data, manufacturing SCADA systems, or simply bandwidth-heavy workloads where egress cost makes pure-cloud uneconomical. Hybrid connectivity is the engineering discipline that joins those two worlds reliably, securely, and at the bandwidth and latency your applications actually need.
In this lesson we cover the two primary AWS hybrid-connectivity primitives — Site-to-Site VPN and AWS Direct Connect — along with the DNS plumbing that makes the entire hybrid estate feel like a single coherent namespace.
Site-to-Site VPN
AWS Site-to-Site VPN runs IPsec tunnels between your on-premises customer gateway device (a physical or software router, represented in AWS by a Customer Gateway, or CGW, resource) and an AWS Virtual Private Gateway (VGW) or Transit Gateway (TGW). AWS provisions two tunnels per VPN connection, terminating in separate Availability Zones for redundancy; your on-premises device must be able to bring up both and fail over between them.
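- Throughput: each tunnel supports up to ~1.25 Gbps. ECMP is available only when the VPN terminates on a TGW with dynamic (BGP) routing; a VGW does not support ECMP. With ECMP, the two tunnels of one connection give ~2.5 Gbps aggregate, and the TGW can spread traffic across several VPN connections to scale further (for example, four connections for ~10 Gbps nominal). Any single flow is still capped at the per-tunnel limit.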
- Latency: traverses the public internet — expect 20–80 ms additional latency depending on geography and ISP path quality. Unsuitable for latency-sensitive synchronous DB replication or real-time trading.
- Cost: ~$0.05/hr per connection + $0.09/GB data transfer out. For moderate traffic (<1 TB/month) it is almost always cheaper than Direct Connect.
- Use cases: branch offices, disaster-recovery standby, dev/test environments, initial migration runway before Direct Connect is provisioned.
Provisioning a VPN attachment on a Transit Gateway via Terraform:
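A minimal sketch of that setup, assuming a hypothetical on-prem router at 203.0.113.10 (a documentation-range address) with private ASN 65010. The resource names, ASNs, and address are illustrative placeholders, not values from a real deployment:

```hcl
# Sketch only: names, ASNs, and the customer gateway address are placeholders.

resource "aws_ec2_transit_gateway" "hub" {
  description      = "hybrid-hub"
  amazon_side_asn  = 64512
  vpn_ecmp_support = "enable" # lets the TGW spread flows across VPN tunnels
}

resource "aws_customer_gateway" "onprem" {
  bgp_asn    = 65010          # ASN of the on-prem router (assumed)
  ip_address = "203.0.113.10" # public IP of the on-prem device (assumed)
  type       = "ipsec.1"
}

# Passing transit_gateway_id (instead of vpn_gateway_id) creates the
# VPN attachment on the Transit Gateway.
resource "aws_vpn_connection" "onprem_to_tgw" {
  customer_gateway_id = aws_customer_gateway.onprem.id
  transit_gateway_id  = aws_ec2_transit_gateway.hub.id
  type                = aws_customer_gateway.onprem.type
  static_routes_only  = false # dynamic (BGP) routing, needed for ECMP
}

# Outside IPs of the two AWS tunnel endpoints, for the on-prem router config.
output "tunnel_outside_addresses" {
  value = [
    aws_vpn_connection.onprem_to_tgw.tunnel1_address,
    aws_vpn_connection.onprem_to_tgw.tunnel2_address,
  ]
}
```

The on-prem device still has to be configured with both tunnels and a BGP session per tunnel; Terraform only creates the AWS side.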
AWS Direct Connect
Direct Connect (DX) is a dedicated Layer 2 circuit between your facility (or a colocation partner) and an AWS Direct Connect location. Traffic never traverses the public internet, which means predictable latency and consistent throughput. DX traffic is not encrypted by default, though: the private path removes internet exposure, but workloads that require encryption in transit still need MACsec (on supported dedicated ports) or IPsec over DX.
- Port speeds: 1 Gbps, 10 Gbps, 100 Gbps. Sub-1Gbps (50/100/200/300/400/500 Mbps) is available via Hosted Connections from APN partners.
- Latency: deterministic — typically 1–5 ms to the nearest AWS region from a co-located DX location.
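- Virtual Interfaces (VIFs): Private VIF → connects to a VPC through a VGW, either directly or via a Direct Connect gateway; Transit VIF → connects to a TGW through a Direct Connect gateway (preferred at org scale, since one DX feeds all VPCs attached to the TGW); Public VIF → reaches AWS public endpoints (S3, DynamoDB) over a BGP session on which AWS advertises its public prefixes.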
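- Resilience SLA: a single DX connection is a single point of failure and carries the weakest availability commitment. For the 99.99% tier, AWS's maximum-resiliency model calls for two connections at each of two DX locations, terminating on separate on-prem routers (four BGP sessions total). One connection at each of two locations is the high-resiliency model, which targets 99.9%.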
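The Transit VIF path described above can be sketched in Terraform as follows. This is an illustrative outline under stated assumptions, not a complete deployment: the variables, names, VLAN, ASNs, and the 10.0.0.0/8 prefix are all hypothetical, and it shows a single connection rather than a resilient pair.

```hcl
# Sketch only: every name, ASN, VLAN, and prefix below is a placeholder.

variable "dx_location" {
  type        = string
  description = "Direct Connect location code for the port"
}

variable "transit_gateway_id" {
  type        = string
  description = "ID of an existing Transit Gateway"
}

resource "aws_dx_connection" "primary" {
  name      = "dx-primary"
  bandwidth = "10Gbps"
  location  = var.dx_location
}

# The Direct Connect gateway sits between the transit VIF and the TGW.
# Its ASN must differ from the Transit Gateway's ASN.
resource "aws_dx_gateway" "hub" {
  name            = "dx-hub"
  amazon_side_asn = "64513"
}

resource "aws_dx_transit_virtual_interface" "primary" {
  connection_id  = aws_dx_connection.primary.id
  dx_gateway_id  = aws_dx_gateway.hub.id
  name           = "tvif-primary"
  vlan           = 4000
  address_family = "ipv4"
  bgp_asn        = 65010 # on-prem router ASN (assumed)
  mtu            = 8500  # jumbo frames on a transit VIF; default is 1500
}

# Associates the TGW with the DX gateway and sets which prefixes
# AWS advertises to on-premises over the transit VIF.
resource "aws_dx_gateway_association" "tgw" {
  dx_gateway_id         = aws_dx_gateway.hub.id
  associated_gateway_id = var.transit_gateway_id
  allowed_prefixes      = ["10.0.0.0/8"]
}
```

For the resilient designs, the connection and transit VIF would be repeated per port and per DX location, all pointing at the same Direct Connect gateway.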
Choosing Between VPN and Direct Connect
In practice the choice is not binary — most production organizations run both:
- Direct Connect as primary: production workloads, bulk data transfer, latency-sensitive services.
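- VPN as backup / DR path: BGP fails traffic over automatically if DX goes down. AWS already prefers DX over VPN for the same prefix, so advertise identical prefix lengths on both paths (a more specific prefix on the VPN would win). On your own routers, give VPN-learned routes a lower local-preference than DX-learned routes so outbound traffic also prefers DX.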
- VPN only: branch offices, small satellite sites, dev/test, or any site where DX lead time (typically 30–90 days for fiber provisioning) is not acceptable.
DNS Across the Hybrid Estate
Hybrid connectivity solves Layer 3 routing; DNS solves naming. Without DNS integration, engineers must hard-code IP addresses or maintain brittle hosts files, and neither is acceptable at production scale.
The AWS-native solution is Route 53 Resolver. For hybrid DNS you create two kinds of managed endpoint, typically in a shared networking VPC:
- Inbound Resolver Endpoint: ENIs in your VPC subnets that accept DNS queries from on-premises. Your on-prem resolver forwards queries for *.aws.internal (or any private zone) to these ENIs over DX or VPN.
- Outbound Resolver Endpoint: ENIs that make DNS queries toward on-premises resolvers. A Resolver Rule says "forward queries for corp.example.com to 10.0.1.53".
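The two endpoints and the forwarding rule can be sketched in Terraform roughly as below. The domain corp.example.com and the target 10.0.1.53 come from the example rule above; the variables, resource names, and the RAM share are assumptions for illustration.

```hcl
# Sketch only: variables and resource names are placeholders.

variable "vpc_id" { type = string }
variable "subnet_ids" { type = list(string) }  # at least two, in different AZs
variable "resolver_sg_id" { type = string }    # must allow TCP and UDP port 53
variable "share_principal" { type = string }   # account ID or organization ARN

# Inbound: on-prem resolvers send queries for AWS private zones to these ENIs.
resource "aws_route53_resolver_endpoint" "inbound" {
  name               = "hybrid-inbound"
  direction          = "INBOUND"
  security_group_ids = [var.resolver_sg_id]

  dynamic "ip_address" {
    for_each = var.subnet_ids
    content {
      subnet_id = ip_address.value # AWS picks a free IP in each subnet
    }
  }
}

# Outbound: VPC queries matching a forwarding rule leave through these ENIs.
resource "aws_route53_resolver_endpoint" "outbound" {
  name               = "hybrid-outbound"
  direction          = "OUTBOUND"
  security_group_ids = [var.resolver_sg_id]

  dynamic "ip_address" {
    for_each = var.subnet_ids
    content {
      subnet_id = ip_address.value
    }
  }
}

# "Forward queries for corp.example.com to 10.0.1.53".
resource "aws_route53_resolver_rule" "corp" {
  name                 = "forward-corp"
  domain_name          = "corp.example.com"
  rule_type            = "FORWARD"
  resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id

  target_ip {
    ip = "10.0.1.53" # on-prem DNS server; port defaults to 53
  }
}

# A rule only takes effect in VPCs it is associated with.
resource "aws_route53_resolver_rule_association" "corp" {
  resolver_rule_id = aws_route53_resolver_rule.corp.id
  vpc_id           = var.vpc_id
}

# Optional: share the rule with other accounts through AWS RAM.
resource "aws_ram_resource_share" "dns_rules" {
  name                      = "dns-forwarding-rules"
  allow_external_principals = false
}

resource "aws_ram_resource_association" "corp_rule" {
  resource_arn       = aws_route53_resolver_rule.corp.arn
  resource_share_arn = aws_ram_resource_share.dns_rules.arn
}

resource "aws_ram_principal_association" "org" {
  principal          = var.share_principal
  resource_share_arn = aws_ram_resource_share.dns_rules.arn
}
```

Accounts that receive the shared rule still need their own rule association for each VPC that should use it.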
Production Failure Modes to Know
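- BGP session flap: mismatched BGP timers or an MTU mismatch across the DX circuit can cause intermittent BGP resets. Confirm that the MTU configured on the virtual interface matches your router and every hop in between. The default is 1500 bytes; jumbo frames are opt-in, at 9001 bytes on a private VIF or 8500 bytes on a transit VIF, and are not supported on public VIFs.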
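- Asymmetric routing: when both VPN and DX are active, return traffic may take a different path than forward traffic. Stateful on-prem firewalls drop return packets for flows they never saw leave, so either keep routing symmetric with consistent BGP policy in both directions, or make sure both paths pass through the same firewall (or a cluster that shares state). Do not apply NAT on the VPN path when DX is primary.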
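- DNS split-brain: an EC2 instance resolves a name to a private IP via Route 53, but the on-prem host resolves the same name to a public IP. This happens when private hosted zones are not associated with all relevant VPCs (including the VPC that hosts the inbound endpoint) or when on-prem resolvers do not forward the zone. Audit which zones a VPC can see with aws route53 list-hosted-zones-by-vpc --vpc-id <vpc-id> --vpc-region <region>.
- DX failover delay: with default BGP timers, detecting a DX failure can take up to 90 seconds (the hold time) if BFD (Bidirectional Forwarding Detection) is not in use. AWS enables BFD on its side of every DX virtual interface, but it takes effect only once you configure it on your router. With BFD, link failures are detected in about one second and the BGP session is torn down immediately.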
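One way to guard against the split-brain case is to manage private hosted zone associations as code, so every VPC that should see the zone is listed in one place. A minimal sketch, assuming the zone and VPCs live in the same account; the variable names are hypothetical:

```hcl
# Sketch only: variable names are placeholders.

variable "private_zone_id" {
  type        = string
  description = "ID of the existing private hosted zone"
}

variable "extra_vpc_ids" {
  type        = set(string)
  description = "VPCs (beyond the zone's original VPC) that must resolve the zone"
}

# One association per VPC; adding a VPC ID to the set adds the association.
resource "aws_route53_zone_association" "extra" {
  for_each = var.extra_vpc_ids

  zone_id = var.private_zone_id
  vpc_id  = each.value
}
```

For VPCs in other accounts, the zone owner first has to authorize the association (the aws_route53_vpc_association_authorization resource) before the VPC owner can create it.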
Summary
Site-to-Site VPN is fast to provision and cost-effective for low-to-medium bandwidth; Direct Connect is the backbone for production workloads requiring consistent latency and high throughput. DX is not encrypted by default, so regulated industries should add MACsec where the port supports it, or IPsec over DX. The two are complementary: DX as primary, VPN as warm standby. Route 53 Resolver bridges DNS across the hybrid estate without hard-coded IPs. Share Resolver Rules via RAM, configure BFD on your DX routers, and test failover on a schedule. These are the habits that separate hobby architectures from enterprise-grade hybrid networks.