Network Troubleshooting on AWS
Production connectivity failures cost real money. An engineer at a top-tier company does not start by randomly toggling security group rules — they systematically capture evidence, use the purpose-built AWS diagnostic toolchain, and form a hypothesis before touching any resource. This lesson teaches that discipline: VPC Flow Logs for traffic forensics, VPC Reachability Analyzer for path-level logical analysis, and the mental model for the five categories of AWS connectivity failures you will encounter repeatedly in production.
The Diagnostic Pyramid
Before touching any control plane, always gather data. AWS offers three layers of network observability:
- Reachability Analyzer — logical path analysis. Instantly tells you whether a packet can reach its destination given the current resource configuration, without sending actual traffic.
- VPC Flow Logs — actual data-plane evidence. Captures metadata for every accepted and rejected flow through an ENI, subnet, or entire VPC.
- CloudWatch Metrics / Network Monitor — aggregate health signals (dropped packets, latency, error rates).
These are complementary, not substitutes. Reachability Analyzer tells you why traffic cannot reach a target by analyzing your configuration model. Flow Logs tell you what traffic actually traversed (or was dropped) on the wire. Use Analyzer first for fast logical validation; use Flow Logs to confirm real-world behavior or investigate intermittent issues.
VPC Reachability Analyzer
Reachability Analyzer performs a deterministic symbolic traversal of your AWS network configuration graph. It models every VPC component — security groups, NACLs, route tables, NAT gateways, VPN/TGW attachments, peering — and returns either a fully detailed forwarding path (with each hop annotated) or an exact explanation of the blocking component. It does not inject test packets; it reads your control-plane state.
Create a path using the CLI:
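A minimal sketch of that CLI workflow, assuming a TCP path and a default destination port of 5432. The function name analyze_path, the POLL_SECONDS variable, and the IDs in the usage line are hypothetical placeholders; the three aws ec2 subcommands are the Reachability Analyzer calls for creating a path, starting an analysis, and reading the result.

```shell
#!/usr/bin/env bash
# Sketch: create a path, run one analysis, and print the verdict.
# analyze_path and the IDs in the usage line are hypothetical placeholders.
analyze_path() {
  local src="$1" dst="$2" port="${3:-5432}" path_id analysis_id status

  # 1. Define the path (source and destination can be instances, ENIs, gateways).
  path_id="$(aws ec2 create-network-insights-path \
    --source "$src" --destination "$dst" \
    --protocol tcp --destination-port "$port" \
    --query 'NetworkInsightsPath.NetworkInsightsPathId' --output text)" || return 1

  # 2. Start an analysis of the current configuration for that path.
  analysis_id="$(aws ec2 start-network-insights-analysis \
    --network-insights-path-id "$path_id" \
    --query 'NetworkInsightsAnalysis.NetworkInsightsAnalysisId' --output text)" || return 1

  # 3. Poll until the analysis leaves the "running" state.
  status="running"
  while [ "$status" = "running" ]; do
    sleep "${POLL_SECONDS:-5}"
    status="$(aws ec2 describe-network-insights-analyses \
      --network-insights-analysis-ids "$analysis_id" \
      --query 'NetworkInsightsAnalyses[0].Status' --output text)" || return 1
  done

  # 4. Print reachability plus any explanation codes.
  aws ec2 describe-network-insights-analyses \
    --network-insights-analysis-ids "$analysis_id" \
    --query 'NetworkInsightsAnalyses[0].{Found:NetworkPathFound,Codes:Explanations[].ExplanationCode}' \
    --output json
}

# Usage (placeholder IDs):
#   analyze_path i-0123456789abcdef0 eni-0123456789abcdef0 5432
```

Each call to the function creates a new path object, so in practice you may prefer to create the path once and re-run only the analysis step after each configuration change.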
When NetworkPathFound is false, the Explanations array contains one or more objects with ExplanationCode values such as ENI_SG_RULES_MISMATCH (a security group has no matching rule), SUBNET_ACL_RESTRICTION (a network ACL denies the traffic), or NO_ROUTE_TO_DESTINATION (the route table has no applicable route). Each object identifies the blocking component by ID and ARN: the specific security group, the network ACL, the route table, the subnet. Having the blocking resource named removes most of the guesswork, and most of the finger-pointing, in a live incident.
VPC Flow Logs
Flow Logs capture a metadata record for each network flow: source/destination IP and port, protocol, bytes, packets, and — critically — the action field: ACCEPT or REJECT. They can be scoped to an ENI, a subnet, or an entire VPC. Records are delivered to CloudWatch Logs, S3, or Kinesis Data Firehose.
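As a rough illustration of how the action field is used, the sketch below tallies REJECT records by source, destination, and destination port. It assumes the default (version 2) record layout, where action is the 13th space-separated field; with a custom format the field positions shift. The helper name summarize_rejects is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: tally REJECT records by source, destination, and destination port.
# Assumes the default (version 2) record layout:
#   version account-id interface-id srcaddr dstaddr srcport dstport
#   protocol packets bytes start end action log-status
# summarize_rejects is a hypothetical helper name.
summarize_rejects() {
  awk '$13 == "REJECT" { hits[$4 " -> " $5 ":" $7]++ }
       END { for (k in hits) print hits[k], k }' "$@" | sort -rn
}

# Usage: pass downloaded log files, or pipe records on stdin.
#   summarize_rejects flowlog-*.log
```

The output is one line per rejected source/destination/port combination, most frequent first, which is usually enough to spot a missing rule at a glance.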
Enable flow logs at the VPC level with custom fields for richer forensics:
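One way this could look with the AWS CLI, delivering to S3 in Parquet with a 60-second aggregation interval. The helper name enable_vpc_flow_logs, the VPC ID, and the bucket ARN are placeholders, and the field selection is an assumption about what "richer forensics" should include; adjust it to your needs.

```shell
#!/usr/bin/env bash
# Sketch: VPC-level flow log delivered to S3 with a custom record format.
# enable_vpc_flow_logs is a hypothetical helper; pass your own VPC ID and bucket ARN.
# Single quotes keep the shell from expanding the ${...} field names.
FLOW_LOG_FORMAT='${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} ${srcport} ${dstport} ${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status} ${tcp-flags} ${pkt-srcaddr} ${pkt-dstaddr} ${flow-direction} ${traffic-path}'

enable_vpc_flow_logs() {
  local vpc_id="$1" bucket_arn="$2"
  aws ec2 create-flow-logs \
    --resource-type VPC \
    --resource-ids "$vpc_id" \
    --traffic-type ALL \
    --log-destination-type s3 \
    --log-destination "$bucket_arn" \
    --destination-options FileFormat=parquet \
    --max-aggregation-interval 60 \
    --log-format "$FLOW_LOG_FORMAT"
}

# Usage (placeholder values):
#   enable_vpc_flow_logs vpc-0123456789abcdef0 arn:aws:s3:::example-flow-logs/vpc/
```

Switching --traffic-type to REJECT reduces volume when only blocked traffic matters.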
Key fields to always include: ${tcp-flags} (a bitmask that distinguishes a SYN-only REJECT from an RST, which is useful for diagnosing asymmetric routing), ${traffic-path} (for egress traffic, identifies the path taken toward the destination, such as an internet gateway, a virtual private gateway, or a VPC peering connection), and ${pkt-srcaddr} / ${pkt-dstaddr} (the original packet-level addresses, unlike srcaddr/dstaddr, which show the address of the capturing network interface when traffic passes through an intermediate hop such as a NAT gateway). These distinctions matter significantly in multi-AZ and Transit Gateway topologies.
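The tcp-flags value is a number, so it helps to expand it into names when reading records. The sketch below does that with the standard TCP header bit values (FIN=1, SYN=2, RST=4, ACK=16); the helper name decode_tcp_flags is hypothetical. Flags seen during one aggregation interval are OR-ed together, so a single record can carry several.

```shell
#!/usr/bin/env bash
# Sketch: expand the numeric tcp-flags field into flag names.
# Bit values follow the TCP header: FIN=1, SYN=2, RST=4, ACK=16.
# decode_tcp_flags is a hypothetical helper name.
decode_tcp_flags() {
  local mask="$1" names=""
  [ $(( mask & 2 )) -ne 0 ]  && names="${names}SYN,"
  [ $(( mask & 16 )) -ne 0 ] && names="${names}ACK,"
  [ $(( mask & 1 )) -ne 0 ]  && names="${names}FIN,"
  [ $(( mask & 4 )) -ne 0 ]  && names="${names}RST,"
  names="${names%,}"          # drop the trailing comma
  echo "${names:-NONE}"
}

# Usage:
#   decode_tcp_flags 2    # SYN
#   decode_tcp_flags 18   # SYN,ACK
```

A REJECT record that decodes to SYN alone is a blocked connection attempt, while a REJECT carrying SYN,ACK suggests the reply path is the one being dropped.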
Anatomy of the Five Common Failure Categories
Most AWS connectivity incidents fall into one of five buckets. Knowing them makes your troubleshooting deterministic rather than exploratory:
- Security Group missing rule: the most common. A new microservice or port was not added to the relevant SG. Reachability Analyzer reports ENI_SG_RULES_MISMATCH immediately. Fix: add an inbound rule referencing the source SG ID (not a CIDR, where possible).
- NACL asymmetry: NACLs are stateless, so you need both an inbound rule and a matching outbound rule covering the ephemeral ports used by return traffic (1024–65535). A NACL that blocks outbound ephemeral responses looks, from the client side, like a connection that simply times out, even though the request arrived. Flow Logs will show REJECT on the outbound direction.
- Missing or blackholed route: a route table is missing a route to the target CIDR, or a route points to a deleted or stopped resource (e.g., a stopped NAT instance). Reachability Analyzer reports NO_ROUTE_TO_DESTINATION for a missing route and flags a blackhole route when the target no longer exists.
- MTU / fragmentation issues: common on paths through VPN or Transit Gateway attachments. The path has a lower MTU than the 9001-byte jumbo frames instances use inside a VPC (8500 across Transit Gateway attachments; 1500, minus IPsec overhead, over a Site-to-Site VPN). Full-size packets are dropped when path MTU discovery cannot work, so small requests succeed while bulk transfers hang. Symptom: connection established but data transfer stalls. Solution: tune MSS clamping on the VPN/TGW path or set net.ipv4.tcp_mtu_probing=1 on the instances. Flow Logs alone will not reveal this; you need active path-MTU probing.
- DNS resolution failures: technically not network-layer, but they account for a large fraction of "the service is unreachable" tickets. Typical causes are private hosted zones not associated with the correct VPC, Route 53 Resolver not forwarding to on-premises DNS, or enableDnsSupport / enableDnsHostnames disabled on the VPC. Always verify with dig or nslookup from inside the VPC before assuming a routing or firewall issue.
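For the MTU category, active probing can be sketched as a binary search over ping payload sizes with the Don't Fragment bit set. This assumes Linux iputils ping (the -M do option), IPv4, and that ICMP echo is permitted end to end; the helper name probe_path_mtu is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: binary-search the largest unfragmented IPv4 packet that reaches a host.
# Assumes Linux iputils ping (-M do sets Don't Fragment) and that ICMP echo
# is allowed end to end. probe_path_mtu is a hypothetical helper name.
probe_path_mtu() {
  local host="$1" lo=0 hi="${2:-8973}" mid
  # Sizes are ICMP payload bytes: 8973 + 28 bytes of IPv4/ICMP headers = 9001.
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi + 1) / 2 ))
    if ping -M do -c 1 -W 2 -s "$mid" "$host" >/dev/null 2>&1; then
      lo="$mid"             # this size got through, try larger
    else
      hi=$(( mid - 1 ))     # too large (or dropped), try smaller
    fi
  done
  if [ "$lo" -eq 0 ]; then
    echo "unreachable"      # no probe of any size succeeded
    return 1
  fi
  echo $(( lo + 28 ))       # payload plus headers = path MTU
}

# Usage (placeholder address):
#   probe_path_mtu 10.20.0.15
```

A result well below the interface MTU on both ends points at an intermediate hop, which is the cue to look at MSS clamping or tcp_mtu_probing.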
Troubleshooting Workflow Diagram
Practical Runbook: "Instance Cannot Reach RDS"
Here is the exact sequence a senior SRE follows when an application reports it cannot connect to its database:
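A hedged sketch of how the first steps of such a sequence might look when run from the affected instance: resolve the endpoint, attempt a TCP handshake, and only then move on to configuration analysis. The function names, the result labels, and the default port of 5432 (PostgreSQL) are assumptions for illustration; dig, timeout, and bash must be available on the instance.

```shell
#!/usr/bin/env bash
# Sketch of the first triage steps, run from the affected instance.
# rds_triage, find_eni_for_ip, and the result labels are hypothetical.
rds_triage() {
  local host="$1" port="${2:-5432}" ip

  # Step 1: DNS. No answer points at hosted-zone or resolver configuration.
  ip="$(dig +short "$host" | tail -n 1)"
  if [ -z "$ip" ]; then
    echo "DNS_FAIL $host"
    return 1
  fi

  # Step 2: TCP handshake with a hard 5-second limit.
  if timeout 5 bash -c "exec 3<>/dev/tcp/$ip/$port" 2>/dev/null; then
    echo "TCP_OK $ip:$port"
    return 0
  fi

  # Step 3: the handshake failed, so hand the resolved address to the
  # configuration and flow-log checks.
  echo "TCP_FAIL $ip:$port"
  return 2
}

# Find the ENI that owns the database's private IP; it can serve as the
# Reachability Analyzer destination.
find_eni_for_ip() {
  aws ec2 describe-network-interfaces \
    --filters "Name=addresses.private-ip-address,Values=$1" \
    --query 'NetworkInterfaces[0].NetworkInterfaceId' --output text
}

# Usage (placeholder endpoint):
#   rds_triage mydb.example.internal 5432
```

On TCP_FAIL, the next steps would be a Reachability Analyzer run from the instance to the ENI returned by find_eni_for_ip, followed by a Flow Logs query for REJECT records between the two addresses if the analyzer reports the path as reachable.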
Cost and Operational Considerations
Flow Logs at high traffic volumes generate significant CloudWatch Logs ingestion costs. Production best practices: send Flow Logs to S3 for cost efficiency (vended-log delivery to S3 is roughly $0.25/GB, versus $0.50/GB for CloudWatch Logs ingestion plus $0.03/GB per month for storage; these are us-east-1 list prices for the first pricing tier, so check current pricing for your Region), enable Parquet format for S3 delivery so Athena can query the logs without transformation, and scope to REJECT-only traffic unless you need full forensic coverage. Reachability Analyzer charges per analysis run ($0.10 per analysis), which is negligible compared to the engineering time it saves, so run it freely.