AWS Networking & Identity

Network Troubleshooting on AWS


Production connectivity failures cost real money. An experienced engineer does not start by randomly toggling security group rules. They capture evidence systematically, use the purpose-built AWS diagnostic toolchain, and form a hypothesis before touching any resource. This lesson teaches that discipline: VPC Flow Logs for traffic forensics, VPC Reachability Analyzer for path-level logical analysis, and the mental model for the five categories of AWS connectivity failures you will encounter repeatedly in production.

The Diagnostic Pyramid

Before touching any control plane, always gather data. AWS offers three layers of network observability:

  • Reachability Analyzer: logical path analysis. Tells you whether a packet can reach its destination given the current resource configuration, without sending actual traffic. An analysis runs asynchronously and typically completes within seconds to a few minutes.
  • VPC Flow Logs: actual data-plane evidence. Captures metadata for accepted and rejected flows through an ENI, subnet, or entire VPC. A few traffic types, such as queries to the Amazon DNS server and instance metadata requests, are not logged.
  • CloudWatch Metrics / Network Monitor: aggregate health signals (dropped packets, latency, error rates).

These are complementary, not substitutes. Reachability Analyzer tells you why traffic cannot reach a target by analyzing your configuration model. Flow Logs tell you what traffic actually traversed (or was dropped) on the wire. Use Analyzer first for fast logical validation; use Flow Logs to confirm real-world behavior or investigate intermittent issues.

VPC Reachability Analyzer

Reachability Analyzer performs a deterministic symbolic traversal of your AWS network configuration graph. It models every VPC component — security groups, NACLs, route tables, NAT gateways, VPN/TGW attachments, peering — and returns either a fully detailed forwarding path (with each hop annotated) or an exact explanation of the blocking component. It does not inject test packets; it reads your control-plane state.

Create a path using the CLI:

# Create a reachability path: EC2 instance to the RDS network interface in the same VPC.
# Reachability Analyzer takes an ENI (not a DB identifier) for an RDS endpoint;
# all resource IDs below are placeholders.
aws ec2 create-network-insights-path \
  --source i-0abc123456789def0 \
  --destination eni-0123456789abcdef0 \
  --protocol tcp \
  --destination-port 5432 \
  --region us-east-1

# Run the analysis (returns a NetworkInsightsAnalysisId)
aws ec2 start-network-insights-analysis \
  --network-insights-path-id nip-0a1b2c3d4e5f67890 \
  --region us-east-1

# Poll until Status = succeeded
aws ec2 describe-network-insights-analyses \
  --network-insights-analysis-ids nia-0a1b2c3d4e5f67890 \
  --region us-east-1 \
  --query 'NetworkInsightsAnalyses[0].{Status:Status,Reachable:NetworkPathFound,Explanations:Explanations}'

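When NetworkPathFound is false, the Explanations array contains one or more objects with ExplanationCode values such as ENI_SG_RULES_MISMATCH (a security group does not allow the traffic), SUBNET_ACL_RESTRICTION (a network ACL blocks it), or NO_ROUTE_TO_DESTINATION (no route matches the destination). Consult the Reachability Analyzer explanation code reference for the full list. Each object identifies the component involved by ID and ARN: the security group, the network ACL and rule number, the route table, or the subnet. Having the blocking resource named up front removes much of the finger-pointing in a live incident.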

Production pattern: Integrate Reachability Analyzer into your IaC pipeline. Run an analysis as a post-deployment smoke test whenever a security group or route table changes. A failed analysis can gate a Terraform apply or block a deployment pipeline stage before a customer ever hits the broken path.
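A pipeline gate of this kind could be sketched in shell as follows. The function names reachability_gate and fetch_analysis_status, the exit-code convention, and the ANALYSIS_ID variable are illustrative assumptions, not part of any AWS tooling. The status command is passed in as an argument so the polling loop can be exercised without AWS access.

```shell
# Sketch of a post-deployment reachability gate (names and exit codes are assumptions).
# $1: command that prints "<Status> <NetworkPathFound>", e.g. "succeeded True"
# $2: maximum number of polls (default 30)
# $3: seconds to sleep between polls (default 5)
# Returns 0 if reachable, 1 if not reachable, 2 if the analysis failed,
#         3 on timeout, 4 if the status command itself failed.
reachability_gate() {
  local fetch_cmd="$1" max_attempts="${2:-30}" delay="${3:-5}"
  local attempt=1 out status found
  while [ "$attempt" -le "$max_attempts" ]; do
    out=$("$fetch_cmd") || return 4
    # Split the whitespace-separated pair into the function's positional parameters.
    set -- $out
    status="$1"
    found="$2"
    case "$status" in
      succeeded)
        case "$found" in
          True|true) return 0 ;;
          *) return 1 ;;
        esac ;;
      failed) return 2 ;;
    esac
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 3
}

# Hypothetical status command for real use; the caller must set ANALYSIS_ID
# to the value returned by start-network-insights-analysis.
fetch_analysis_status() {
  aws ec2 describe-network-insights-analyses \
    --network-insights-analysis-ids "$ANALYSIS_ID" \
    --query 'NetworkInsightsAnalyses[0].[Status,NetworkPathFound]' \
    --output text
}
```

In a pipeline stage, a line such as `reachability_gate fetch_analysis_status || exit 1` would stop the stage whenever the path is blocked, the analysis fails, or polling times out.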

VPC Flow Logs

Flow Logs capture a metadata record for each network flow: source/destination IP and port, protocol, bytes, packets, and — critically — the action field: ACCEPT or REJECT. They can be scoped to an ENI, a subnet, or an entire VPC. Records are delivered to CloudWatch Logs, S3, or Kinesis Data Firehose.

Enable flow logs at the VPC level with custom fields for richer forensics:

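# Enable VPC Flow Logs to CloudWatch Logs with an extended field set
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0abc12345def67890 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /vpc/flow-logs/prod \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/FlowLogsRole \
  --log-format '${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} ${srcport} ${dstport} ${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status} ${vpc-id} ${subnet-id} ${instance-id} ${tcp-flags} ${pkt-srcaddr} ${pkt-dstaddr} ${flow-direction} ${traffic-path}' \
  --region us-east-1

# Query rejected traffic to a specific destination using CloudWatch Logs Insights.
# The auto-discovered flow log fields (srcAddr, dstAddr, ...) are documented for the
# default format, so with a custom format it is safer to parse @message explicitly.
# The 22 wildcards match the 22 fields of the format above, in order.
# Note: date -d is GNU date syntax; start-query returns a queryId for get-query-results.
aws logs start-query \
  --log-group-name /vpc/flow-logs/prod \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'parse @message "* * * * * * * * * * * * * * * * * * * * * *" as version, account_id, interface_id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start_time, end_time, action, log_status, vpc_id, subnet_id, instance_id, tcp_flags, pkt_srcaddr, pkt_dstaddr, flow_direction, traffic_path | filter action = "REJECT" and dstaddr = "10.0.4.22" | display @timestamp, srcaddr, dstaddr, dstport, action | sort @timestamp desc | limit 50' \
  --region us-east-1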
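When the records are already on disk (for example, a flow log file downloaded from S3), the same question can be answered without a query service. The following is a minimal sketch, assuming space-separated records whose first 14 fields follow the default order, which the custom format in this lesson preserves. The function name rejected_flows_to is hypothetical.

```shell
# Summarize rejected flows to one destination IP from records on stdin.
# Assumes the default field order for the first 14 fields:
#   srcaddr = field 4, dstaddr = field 5, dstport = field 7, action = field 13.
# Output: "srcaddr dstaddr dstport count", one line per distinct triple, sorted.
rejected_flows_to() {
  local dst_ip="$1"
  awk -v dst="$dst_ip" '
    $13 == "REJECT" && $5 == dst { count[$4 " " $5 " " $7]++ }
    END { for (key in count) print key, count[key] }
  ' | sort
}
```

For example, `rejected_flows_to 10.0.4.22 < flow-log-file.log` would list each source that was rejected when talking to that address. Header lines pass through harmlessly because their action column never equals REJECT.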

Key fields to always include: ${tcp-flags} (distinguishes a SYN-only REJECT from an RST, which is useful for diagnosing asymmetric routing), ${traffic-path} (for egress traffic, the path taken toward the destination, such as an internet gateway, a virtual private gateway, a VPC peering connection, or another resource in the same VPC), and ${pkt-srcaddr} / ${pkt-dstaddr} (the packet-level original addresses, which differ from srcaddr/dstaddr when traffic passes through an intermediate layer such as a NAT gateway). These distinctions matter significantly in multi-AZ and Transit Gateway topologies.
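The tcp-flags value is a bitmask, so reading it by eye is error-prone. A small decoder might look like the sketch below. It covers only the flags that Flow Logs document as supported (FIN = 1, SYN = 2, RST = 4, and SYN-ACK = 18, where ACK is reported only alongside SYN). The function name decode_tcp_flags is an assumption for illustration.

```shell
# Decode a Flow Logs tcp-flags bitmask into flag names.
# Flags can be OR-ed together within one aggregation interval, so a short
# connection may report several flags on a single record.
decode_tcp_flags() {
  local value="$1" names="" pair bit name
  for pair in 1:FIN 2:SYN 4:RST 16:ACK; do
    bit=${pair%%:*}
    name=${pair#*:}
    if [ $((value & bit)) -ne 0 ]; then
      names="${names:+$names,}$name"
    fi
  done
  # A value of 0 means no supported flag was recorded.
  printf '%s\n' "${names:-NONE}"
}
```

A REJECT record that decodes to SYN alone indicates a connection attempt that never got past the first packet, while RST indicates a connection that was reset after some exchange.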

Production pitfall: Flow Logs are delayed. Records are aggregated over a capture window (10 minutes by default, or 1 minute if you set the maximum aggregation interval to 60 seconds) and then take roughly 5 more minutes to publish to CloudWatch Logs or about 10 minutes to publish to S3. They are not a real-time firewall log. During an active incident, use Reachability Analyzer for fast configuration analysis. Flow Logs are your post-event evidence and forensics layer, not a live feed.

Anatomy of the Five Common Failure Categories

Most AWS connectivity incidents fall into one of five buckets. Knowing them makes your troubleshooting deterministic rather than exploratory:

  1. Security Group missing rule: the most common. A new microservice or port was not added to the relevant SG. Reachability Analyzer reports an explanation code such as ENI_SG_RULES_MISMATCH. Fix: add an inbound rule referencing the source SG ID (not CIDR where possible).
  2. NACL asymmetry: NACLs are stateless, so you must have both an inbound rule and a matching outbound rule for ephemeral ports (1024–65535 for return traffic). A NACL that blocks outbound ephemeral responses looks exactly like a one-way connection from the client side. Flow Logs will show REJECT on the outbound direction.
  3. Missing or blackholed route: a route table is missing a route to the target CIDR, or a route points to a deleted or stopped resource (e.g., a stopped NAT instance). Reachability Analyzer reports a code such as NO_ROUTE_TO_DESTINATION for a missing route or BAD_STATE_ROUTE for a route whose target is unavailable.
  4. MTU / fragmentation issues: common on paths through VPN or Transit Gateway attachments. The path MTU drops along the way, for example from 9001 inside a VPC to 8500 across a Transit Gateway, or to below 1500 on a Site-to-Site VPN once IPsec overhead is subtracted (the exact figure depends on the tunnel options). Small packets pass, so the TCP handshake succeeds, but full-size segments are dropped when the ICMP "fragmentation needed" messages that drive path-MTU discovery are blocked. Symptom: connection established but data transfer stalls. Solution: clamp the TCP MSS on the customer gateway device or on the instances, or set net.ipv4.tcp_mtu_probing=1 on the instances. Flow Logs alone will not reveal this; you need active path-MTU probing.
  5. DNS resolution failures: technically not a network-layer problem, but they account for a large fraction of "the service is unreachable" tickets. Typical causes are private hosted zones not associated with the correct VPC, Route 53 Resolver not forwarding to on-premises DNS, or enableDnsSupport / enableDnsHostnames disabled on the VPC. Always verify with dig or nslookup from inside the VPC before assuming a routing or firewall issue.
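For the MTU category above, the arithmetic behind MSS clamping and active path-MTU probing can be sketched as follows. The helper names are hypothetical. The constants assume IPv4 with no IP or TCP options: 20 bytes of IPv4 header, 20 bytes of TCP header, and 8 bytes of ICMP header.

```shell
# Largest TCP segment payload that fits in one IPv4 packet of the given MTU
# (MTU minus 20-byte IPv4 header minus 20-byte TCP header, no options).
mss_for_mtu() {
  echo $(( $1 - 40 ))
}

# ICMP echo payload size that exactly fills one IPv4 packet of the given MTU
# (MTU minus 20-byte IPv4 header minus 8-byte ICMP header).
ping_payload_for_mtu() {
  echo $(( $1 - 28 ))
}
```

On a Linux instance with iputils, `ping -M do -s "$(ping_payload_for_mtu 1500)" -c 3 <target>` sends packets with the don't-fragment bit set. If that size fails while a smaller payload succeeds, the path MTU is below 1500, and `mss_for_mtu` applied to the working MTU gives a candidate clamp value.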

Troubleshooting Workflow Diagram

AWS Network Troubleshooting Decision Flow

  1. Report: connectivity failure. Gather source, destination, port, and protocol.
  2. Run Reachability Analyzer (logical path, no traffic injected).
       • Blocked: inspect the explanation and fix the security group, NACL, or route.
       • Reachable: continue to step 3.
  3. Query VPC Flow Logs, filtering for REJECT on the source/destination IP and port.
       • REJECT: NACL asymmetry. Add the ephemeral return rule.
       • ACCEPT: investigate the application, DNS, or MTU (dig, tracepath, curl -v, MSS tuning).
  4. Issue resolved.

Structured decision flow: Reachability Analyzer → Flow Logs → application/DNS/MTU layer.
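The decision flow can also be encoded as a small helper for a runbook script. This is only a sketch of the branching described above: the function name next_step and the step labels it prints are assumptions chosen for illustration.

```shell
# Map the two observations in the decision flow to the next action.
# $1: Reachability Analyzer result (NetworkPathFound): "true" or "false"
# $2: action seen in Flow Logs for the flow: "REJECT", "ACCEPT", or empty if not yet queried
next_step() {
  local reachable="$1" flow_action="${2:-}"
  case "$reachable" in
    false|False)
      # Blocked: the Explanations array names the SG, NACL, or route to fix.
      echo "inspect-explanation"
      return 0 ;;
    true|True) ;;
    *)
      echo "usage: next_step <true|false> [REJECT|ACCEPT]" >&2
      return 2 ;;
  esac
  case "$flow_action" in
    REJECT) echo "check-nacl-return-rules" ;;
    ACCEPT) echo "investigate-app-dns-mtu" ;;
    *)      echo "query-flow-logs" ;;
  esac
}
```

For instance, `next_step true REJECT` points at the NACL return rules, while `next_step true ACCEPT` moves the investigation up to the application, DNS, and MTU layer.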

Practical Runbook: "Instance Cannot Reach RDS"

Here is the exact sequence a senior SRE follows when an application reports it cannot connect to its database:

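# 1. Confirm the instance and RDS identifiers, endpoint, and security groups
aws ec2 describe-instances --filters "Name=tag:Name,Values=app-server-prod" \
  --query 'Reservations[].Instances[].{ID:InstanceId,IP:PrivateIpAddress,SG:SecurityGroups[].GroupId}'
aws rds describe-db-instances --db-instance-identifier prod-postgres \
  --query 'DBInstances[0].{Endpoint:Endpoint.Address,Port:Endpoint.Port,SG:VpcSecurityGroups[].VpcSecurityGroupId}'

# 2. Run Reachability Analyzer (source=app ENI, dest=RDS ENI, port 5432)
# First find the ENI IDs. RDS-managed ENIs carry the description "RDSNetworkInterface",
# not the DB identifier, so narrow the match with the RDS security group from step 1
# (sg-0123456789abcdef0 is a placeholder). With Multi-AZ there may be several matches;
# confirm the ENI's private IP is the one the endpoint resolves to.
APP_ENI=$(aws ec2 describe-network-interfaces \
  --filters "Name=attachment.instance-id,Values=i-0abc123456789def0" \
  --query 'NetworkInterfaces[0].NetworkInterfaceId' --output text)
RDS_ENI=$(aws ec2 describe-network-interfaces \
  --filters "Name=description,Values=RDSNetworkInterface" "Name=group-id,Values=sg-0123456789abcdef0" \
  --query 'NetworkInterfaces[0].NetworkInterfaceId' --output text)
PATH_ID=$(aws ec2 create-network-insights-path \
  --source "$APP_ENI" --destination "$RDS_ENI" \
  --protocol tcp --destination-port 5432 \
  --tag-specifications 'ResourceType=network-insights-path,Tags=[{Key=incident,Value=INC-0042}]' \
  --query 'NetworkInsightsPath.NetworkInsightsPathId' --output text)
# Creating the path does not analyze it; start the analysis explicitly
aws ec2 start-network-insights-analysis --network-insights-path-id "$PATH_ID"

# 3. If Reachable=true, pull Flow Logs for the RDS ENI specifically.
# Replace the placeholder ENI ID with the value of $RDS_ENI. The parse pattern assumes
# the 22-field custom log format used when the flow log was created.
aws logs start-query \
  --log-group-name /vpc/flow-logs/prod \
  --start-time $(date -d '30 minutes ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'parse @message "* * * * * * * * * * * * * * * * * * * * * *" as version, account_id, interface_id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start_time, end_time, action, log_status, vpc_id, subnet_id, instance_id, tcp_flags, pkt_srcaddr, pkt_dstaddr, flow_direction, traffic_path | filter interface_id = "eni-0123456789abcdef0" | filter dstport = 5432 or srcport = 5432 | display @timestamp, srcaddr, dstaddr, srcport, dstport, action, tcp_flags | sort @timestamp desc | limit 100'

# 4. Test DNS resolution from inside the VPC (requires SSM Session Manager)
aws ssm start-session --target i-0abc123456789def0 \
  --document-name AWS-StartInteractiveCommand \
  --parameters command="dig prod-postgres.cluster-abc123.us-east-1.rds.amazonaws.com +short"
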
Tag your insights paths with incident IDs. Reachability Analyzer paths and analyses persist in your account. In regulated environments (SOC 2, PCI-DSS) this creates an automatic audit trail showing exactly what configuration existed at the time of an incident and what the logical path looked like. Store the analysis ID in your incident ticket.

Cost and Operational Considerations

Flow Logs at high traffic volumes generate significant CloudWatch Logs ingestion costs. Production best practices: send Flow Logs to S3 for cost efficiency (vended-log delivery to S3 is roughly $0.25/GB in the first pricing tier, versus about $0.50/GB for CloudWatch Logs ingestion plus $0.03/GB-month storage; these are us-east-1 list prices that change over time, so check the current CloudWatch pricing page), enable Parquet format for S3 delivery so Athena can query the logs efficiently without a transformation step (Parquet conversion carries an additional per-GB charge), and scope to REJECT-only traffic unless you need full forensic coverage. Reachability Analyzer charges per analysis run ($0.10 per analysis), which is negligible compared to the engineering time it saves, so run it freely.
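A quick back-of-the-envelope comparison can make the cost trade-off concrete. The sketch below simply multiplies a monthly volume by a per-unit rate. The function name monthly_cost is hypothetical, and any rate passed to it is an assumption to be replaced with the figure from the current AWS pricing page for your region.

```shell
# Estimate a monthly charge: units (GB ingested, or analyses run) times a per-unit USD rate.
# Both arguments are caller-supplied assumptions; nothing here queries AWS pricing.
monthly_cost() {
  awk -v units="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", units * rate }'
}
```

For example, `monthly_cost 2000 0.50` estimates the ingestion charge for 2,000 GB of flow logs at a $0.50/GB rate, and `monthly_cost 300 0.10` estimates 300 Reachability Analyzer runs at $0.10 each.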