Networking Essentials for DevOps

Project: Diagnose a Connectivity Incident

18 min Lesson 10 of 30

Every networking concept covered in this tutorial (TCP/IP layers, IP addressing, DNS, HTTP/HTTPS, TLS, load balancers, firewalls, NAT, and proxies) converges here. Real incidents do not arrive with labels. They arrive as a vague symptom: "the API is down," "deployments are broken," or "users in one region can't log in." This project walks through a complete, layered debugging session from first alert to confirmed root cause, the way an experienced SRE would run it.

The layered debugging principle: start at the lowest layer of the stack where you have evidence of a problem, resolve it or rule it out, then move up. Skipping layers wastes time and masks cascading failures. The sequence is: physical link and routing → IP reachability → DNS → TCP port → TLS → HTTP application → service logic.
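
As a rough illustration of the ladder, the shell sketch below records one ok/fail result per layer and reports the first layer that failed. The helper names check and first_failing_layer are hypothetical, invented for this example; the commands in the usage comment are only one possible choice of per-layer checks.

```shell
# Run one check quietly and emit "<layer> ok" or "<layer> fail".
# (check and first_failing_layer are illustrative names, not a real tool.)
check() {
  layer=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "$layer ok"
  else
    echo "$layer fail"
  fi
}

# Read "<layer> <status>" lines, lowest layer first, and print the first
# layer that did not pass, or "none" if every layer is clear.
first_failing_layer() {
  awk '$2 != "ok" { print $1; found = 1; exit }
       END { if (!found) print "none" }'
}

# Possible usage against the lesson's example host:
#   {
#     check ip   ping -c 2 api.example.com
#     check tcp  nc -z -w 3 api.example.com 443
#     check http curl -sf -o /dev/null https://api.example.com/health
#   } | first_failing_layer
```

Because the checks are listed lowest layer first, the first failure is where the investigation should start. A failed ping alone is weak evidence, since many hosts drop ICMP.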

The Incident

Your monitoring fires at 14:23 UTC. The alert reads: "api.example.com — 5xx error rate 34%, p99 latency 8 s, up from baseline 120 ms." You are the on-call engineer. Here is how you work through it systematically.

Layered debugging ladder: start at IP reachability, climb toward the application until the break is found.

Step 1 — Establish Scope (First 2 Minutes)

Before touching any command, answer three questions: Is this affecting all users or a subset? All regions or one? All endpoints or one path? This scoping determines your first hypothesis.

# Check your monitoring dashboards — what does the error breakdown look like?
# Typical Prometheus query to see error rate per pod or upstream:
# rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# Quickly check if the site is reachable from outside your network
curl -o /dev/null -s -w "HTTP %{http_code}  time_total=%{time_total}s  time_connect=%{time_connect}s\n" \
  https://api.example.com/health

# Check from multiple vantage points using a public probe (replace with your monitoring tool)
# ping from multiple regions via your uptime service (Pingdom, Checkly, etc.)

# Is the issue DNS? Resolve from an external nameserver
dig @8.8.8.8 api.example.com +short
dig @1.1.1.1 api.example.com +short

# Compare to what your internal resolver returns
dig api.example.com +short

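If the external resolvers return a different IP than expected, you have a DNS-layer problem: a bad record, a stale cached answer whose TTL has not yet expired, or a hijack. A mismatch between the internal and external answers is only a fault if you do not run split-horizon DNS on purpose. If all resolvers agree and return the expected IP, DNS is cleared for now; move on to IP reachability.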
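
Comparing resolver answers by eye is error-prone when a name has several A records, because resolvers may return them in a different order on each query. A minimal sketch of an order-insensitive comparison follows; same_answers is a hypothetical helper name, and the usage comment assumes the lesson's example host.

```shell
# Succeed when two resolver answers contain the same set of records,
# ignoring the order in which each resolver listed them.
same_answers() {
  a=$(printf '%s\n' "$1" | sort)
  b=$(printf '%s\n' "$2" | sort)
  [ "$a" = "$b" ]
}

# Possible usage:
#   google=$(dig @8.8.8.8 api.example.com +short)
#   cloudflare=$(dig @1.1.1.1 api.example.com +short)
#   same_answers "$google" "$cloudflare" || echo "resolvers disagree"
```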

Step 2 — IP Reachability and Routing

Can packets even reach the destination? A firewall rule change, a routing table update, or a cloud security group modification can black-hole traffic silently.

# Basic ICMP reachability (note: some hosts block ICMP — absence of response is not conclusive)
ping -c 4 api.example.com

# Trace the path — where do packets stop?
traceroute -n api.example.com          # Linux
tracert api.example.com               # Windows
mtr --report --report-cycles 10 api.example.com  # richer: combines ping + traceroute

# Check if the TCP port is open at the network level (bypasses application layer)
nc -zv api.example.com 443            # -z: scan mode, -v: verbose
# or
timeout 5 bash -c "echo > /dev/tcp/api.example.com/443" && echo "open" || echo "closed"

# From inside a Kubernetes cluster — check if the pod can reach its upstream
kubectl exec -it <pod-name> -- nc -zv upstream-svc 8080
kubectl exec -it <pod-name> -- curl -sf http://upstream-svc:8080/health

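Use mtr for intermittent routing problems. A single traceroute shows one path at one instant. mtr --report --report-cycles 60 sends 60 probes to each hop (about a minute at the default one-second interval) and reports per-hop packet loss percentages, which surfaces flapping routes and lossy middle-mile links that a single trace would miss entirely. Loss that shows up at one intermediate hop but not at the hops after it is usually ICMP rate limiting on that router, not real packet loss.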
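
The /dev/tcp port check earlier in this step prints "closed" for two different situations: a fast refusal and a silent timeout. The sketch below separates them by exit status. It relies on GNU timeout returning 124 when it has to kill the command; classify_tcp_exit is a hypothetical helper name.

```shell
# Map the exit status of
#   timeout N bash -c "echo > /dev/tcp/HOST/PORT"
# to a likely cause. GNU timeout exits with 124 when the time limit was hit,
# which here means the connection attempt was never answered.
classify_tcp_exit() {
  case "$1" in
    0)   echo "open" ;;
    124) echo "filtered" ;;  # silent drop: firewall, security group, black hole
    *)   echo "refused" ;;   # fast failure: RST from the host, or the name did not resolve
  esac
}

# Possible usage against the lesson's example host:
#   timeout 5 bash -c "echo > /dev/tcp/api.example.com/443" 2>/dev/null
#   classify_tcp_exit $?
```

A "filtered" result points toward firewall rules or routing, while "refused" usually means the host is reachable but nothing is listening on that port.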

Step 3 — DNS Deep Dive

DNS failures present in subtle ways: intermittent errors (partial cache poisoning), slow responses (resolver overload), or stale IPs after a failover that did not propagate because of a long TTL.

# Check the current TTL (second column of the answer, in seconds): is it low enough for fast failover?
dig api.example.com +noall +answer

# Trace the full DNS delegation chain from root
dig +trace api.example.com

# Check if DNSSEC validation is failing
dig api.example.com +dnssec +cd    # +cd disables validation; compare to without +cd

# Query each authoritative nameserver directly: do they all agree?
# NS records live at the zone apex (assumed here to be example.com), not on the api hostname.
for ns in $(dig NS example.com +short); do
  echo "=== $ns ==="
  dig "@$ns" api.example.com +short
done

# Look for negative caching (NXDOMAIN) on an intermediate resolver
dig api.example.com @your-internal-resolver +norecurse

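A common failure pattern: you fail over to a new load balancer IP, but the record's TTL is 3600 s (one hour), so resolvers worldwide keep serving the old IP for up to an hour. Production TTLs for critical records should be 60–300 s during normal operation. Lower them to 30 s before a planned failover, at least one full old TTL ahead of the change so that every cache has picked up the shorter value; lowering the TTL during the incident does nothing for answers that are already cached.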
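
To spot records that would make a failover slow, the following sketch filters a dig answer section for TTLs above a limit. flag_long_ttls is a hypothetical helper name, and the default limit of 300 s simply mirrors the upper bound suggested above.

```shell
# Print every record in a dig answer section whose TTL exceeds a limit.
# Input: lines from `dig NAME +noall +answer` (name, TTL, class, type, data).
# First argument: the limit in seconds (defaults to 300).
flag_long_ttls() {
  awk -v max="${1:-300}" \
    '$3 == "IN" && $2 + 0 > max + 0 { print $1, $4, "TTL=" $2 }'
}

# Possible usage:
#   dig api.example.com +noall +answer | flag_long_ttls 300
```

A recursive resolver reports the remaining TTL of its cached copy, which counts down between queries. To see the configured value, query one of the zone's authoritative nameservers directly.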

Step 4 — TLS and Certificate Inspection

If TCP connects but HTTPS fails, TLS is the suspect. Certificate expiry, missing intermediates, and hostname mismatches are the top three causes.

# One-liner: validity dates + SANs from a live host (-ext needs OpenSSL 1.1.1 or newer)
echo | openssl s_client -servername api.example.com \
  -connect api.example.com:443 2>/dev/null \
  | openssl x509 -noout -dates -ext subjectAltName

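# Is the full chain being served? Count PEM blocks (typically 2 or 3: leaf plus intermediates)
# </dev/null closes stdin so s_client exits instead of waiting for input
openssl s_client -servername api.example.com -connect api.example.com:443 \
  -showcerts </dev/null 2>/dev/null \
  | grep -c "BEGIN CERTIFICATE"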

# What TLS version and cipher suite is being negotiated?
curl -vI https://api.example.com 2>&1 | grep -E "TLS|SSL|cipher"

# Force TLS 1.2 to test compatibility (replace api.example.com with your host)
curl --tlsv1.2 --tls-max 1.2 -I https://api.example.com

# From inside the cluster — is the internal CA trusted?
curl --cacert /etc/ssl/certs/internal-ca.crt https://api.internal:8443/health
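
Since certificate expiry is one of the top causes named in this step, it can help to turn the notAfter date into a number of days. The sketch below assumes GNU date (for -d parsing) and the "notAfter=..." line format printed by openssl x509 -noout -enddate; days_until_expiry is a hypothetical helper name.

```shell
# Days left before a certificate expires, given the "notAfter=..." line that
# `openssl x509 -noout -enddate` prints. Requires GNU date for -d parsing.
# Optional second argument: "now" as a Unix timestamp, for repeatable runs.
days_until_expiry() {
  not_after=${1#notAfter=}
  now=${2:-$(date +%s)}
  expiry=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Possible usage against the lesson's example host:
#   line=$(echo | openssl s_client -servername api.example.com \
#            -connect api.example.com:443 2>/dev/null \
#          | openssl x509 -noout -enddate)
#   days_until_expiry "$line"
```

A result of zero or below means the certificate has expired or expires within a day.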

Step 5 — HTTP-Layer Inspection

TCP is up, TLS handshakes succeed, but the application returns 5xx. Now you are in the application layer. Check load balancer access logs, upstream health, and request headers.

# Verbose curl: shows request headers sent and response headers received
curl -vvI https://api.example.com/health 2>&1

# Send a realistic request — include auth headers that production clients send
curl -s -w "\n%{http_code} %{time_total}s" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Accept: application/json" \
  https://api.example.com/v1/users | tail -1

# Check nginx upstream response times in the access log.
# Assumes a custom log_format that ends with:
#   $upstream_addr $upstream_status $upstream_response_time
# (the default "combined" format has none of these fields)
tail -200 /var/log/nginx/access.log | awk '{print $NF}' | sort -n | tail -20

# For a load-balanced service: which backend is returning errors?
# With the format above, $upstream_addr is the third field from the end
grep " 502 " /var/log/nginx/access.log | awk '{print $(NF-2)}' | sort | uniq -c | sort -rn

# Check application logs directly
journalctl -u myapp --since "10 minutes ago" --no-pager | grep -i "error\|exception\|timeout"
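# --tail=-1: with a label selector, kubectl otherwise prints only the last 10 lines per pod
kubectl logs -l app=api --since=10m --tail=-1 --prefix | grep -i "error\|5[0-9][0-9]"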
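
The per-backend 502 count above can be wrapped in a small function that reads the log from standard input, which makes it easy to try on a saved excerpt. This is a sketch under one assumption: the nginx log_format ends with $upstream_addr $upstream_status $upstream_response_time. count_502_by_upstream is a hypothetical helper name.

```shell
# Count upstream 502 responses per backend address.
# Assumes each log line ends with:
#   $upstream_addr $upstream_status $upstream_response_time
# so the status is the second-to-last field and the address the third-to-last.
count_502_by_upstream() {
  awk 'NF >= 3 && $(NF-1) == "502" { hits[$(NF-2)]++ }
       END { for (addr in hits) print hits[addr], addr }' | sort -rn
}

# Possible usage:
#   count_502_by_upstream < /var/log/nginx/access.log
```

When nginx retries a request on a second backend, $upstream_addr and $upstream_status become comma-separated lists, which shifts the field positions; such lines need separate handling.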

Step 6 — Correlate with Recent Changes

When no layer yields a clear smoking gun, the most powerful question is: what changed? Most production incidents trace back to a recent deployment, configuration push, certificate rotation, dependency update, or infrastructure change, not to random hardware failure.

# Check recent deployments in Kubernetes
kubectl rollout history deployment/api

# What changed in the last 30 minutes? (git-based deploys)
git log --oneline --since="30 minutes ago"

# Did a firewall or security group rule change? (AWS example: prints the current rules;
# compare them with your expected baseline or the cloud audit log)
aws ec2 describe-security-groups --group-ids sg-12345 \
  --query "SecurityGroups[*].IpPermissions"

# Check systemd service recent restarts
systemctl status myapp --no-pager
journalctl -u myapp -n 50 --no-pager | grep "Started\|Stopped\|Failed"

# Certificate rotation check — when did the cert change?
echo | openssl s_client -connect api.example.com:443 2>/dev/null \
  | openssl x509 -noout -startdate

Rollback is often the fastest mitigation, not the root-cause fix. If a deployment correlates with the incident start time, roll back immediately to restore service — then investigate the root cause at leisure. Spending 20 minutes debugging a broken deployment while users are impacted is the wrong trade-off. Rollback first; post-mortem second.

Putting It Together — The Anatomy of a Real Finding

In this project scenario, the investigation reveals: dig @8.8.8.8 api.example.com returns the correct load balancer IP. TCP connects on port 443 in 2 ms. The TLS handshake succeeds. But grep " 502 " /var/log/nginx/access.log shows 100% of the 502s pointing to a single backend IP, 10.0.1.45.

That pod was restarted during a rolling deployment 12 minutes ago. Its health check endpoint was not returning 200 yet (the JVM was still warming up), but the load balancer had already added it back to the pool after only 5 s, before the application was actually ready.

The fix: increase initialDelaySeconds in the Kubernetes readiness probe from 5 to 30 seconds. Deploy, then watch the error rate drop to zero within 90 seconds of the new pod becoming ready.

# The readiness probe fix in the Deployment manifest (excerpt: selector, pod labels, and image omitted)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    spec:
      containers:
        - name: api
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30   # was 5 -- not enough for JVM warm-up
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 15

Document your findings in the incident ticket before closing. Write: the alert that fired, the exact commands you ran and their output, the layer at which you found the break, the root cause, the mitigation applied, and the permanent fix. This becomes the post-mortem input and trains the next on-call engineer. A common target at large engineering organizations is a written post-mortem for every P1/P2 incident, published to the team within 48 hours: no blame, just facts and action items.

The Debugging Toolkit at a Glance

Build muscle memory for these tools — they are your first responders at every layer:

Scope: curl -w "%{http_code}", uptime monitoring dashboards, error-rate graphs
IP / routing: ping, mtr, traceroute, ip route get
DNS: dig +trace, dig @resolver, checking TTLs, querying each authoritative NS
TCP / ports: nc -zv, ss -tlnp, telnet host port
TLS / certs: openssl s_client -showcerts, openssl x509 -dates -ext subjectAltName, curl -vI
HTTP / app: curl -vv, nginx/app access logs, upstream error codes, request-level tracing
Change correlation: deployment history, git log, cloud audit logs, cert rotation timestamps

Systematic layered debugging is the skill that separates senior engineers from juniors. Juniors guess; seniors instrument. The goal is to have evidence at each layer before moving to the next — so when you do find the break, you can prove it, not just suspect it.