Failure Modes of Distributed Systems
Failure Modes of Distributed Systems
Before you can break things on purpose, you must know how things actually break. Distributed systems fail in ways that are qualitatively different from single-process programs. A thread crash in a monolith is obvious — the process dies, an alert fires, the root cause is in one stack trace. In a distributed system, nodes fail independently, network links partition silently, clocks diverge, and the resulting failure surfaces only as mysteriously elevated latency or subtly wrong data — often hours after the root cause occurred. Understanding failure modes is the prerequisite to designing chaos experiments that expose real risk instead of theater.
The Eight Fallacies and What They Cost You
Each fallacy is an assumption that feels reasonable during local development and causes catastrophic surprises in production.
- The network is reliable. Packets are dropped, NICs fail, switches reboot, and cloud hypervisors migrate VMs mid-request. TCP retransmits hide many of these events, but retransmits add latency — a 5 ms call becomes a 2 s call under congestion. Every RPC call must tolerate failure and retry with exponential backoff plus jitter.
- Latency is zero. A service that calls a database, a cache, and two downstream APIs on every request accumulates latency: 20 ms + 35 ms + 15 ms + 40 ms = 110 ms minimum, before GC pauses and tail latency. Measure and budget every cross-process call; never assume it is negligible.
- Bandwidth is infinite. Fetching 50 MB of configuration on every startup, or fanning out a single request to 1,000 shards, is fine in a benchmark. In production during a rolling restart, it saturates uplinks, triggers congestion control, and causes cascading timeouts. Back-pressure, pagination, and streaming are production requirements, not optimizations.
- The network is secure. In a microservices mesh, services communicate over the same network segment as external traffic unless mTLS is enforced. An attacker who compromises one pod can move laterally without network policy. Security must be designed in — it cannot be assumed from the underlying substrate.
- Topology does not change. IP addresses change. Pods are rescheduled. Instances are replaced. A service that hard-codes a peer IP breaks silently the moment Kubernetes reschedules that peer. Use DNS or a service mesh — never embed addresses in config files.
- There is one administrator. In a microservices organization, 30 teams own 30 services with different deployment cadences and incident procedures. Assuming a single authority can coordinate a multi-team incident in real time is a fantasy. Design for autonomous runbooks and automated circuit breakers that do not require human coordination.
- Transport cost is zero. JSON serialization, TLS handshakes, and connection setup consume CPU and wall-clock time. A service making 10,000 small RPC calls per second can spend a significant fraction of its CPU on serialization alone. Batching, connection pooling, and binary protocols (gRPC/Protobuf) directly address this.
- The network is homogeneous. A request path typically crosses the Kubernetes overlay, the cloud VPC, possibly a VPN to another region, and the client's ISP. MTU differences cause fragmentation; each segment has a different loss profile. A single average latency figure masks this heterogeneity completely.
The Major Failure Classes in Production
Operational experience at companies running thousands of services has crystallized the failure landscape into a handful of recurring classes. Each class has a canonical signature in your observability signals and a specific blast radius profile that chaos experiments must be designed to surface.
1. Crash Failures
A process or node stops responding entirely. This is the most benign failure class — it is detectable (health checks fail, connections are refused) and consistent (the crashed component does nothing). The danger is not the crash itself but callers that hang indefinitely waiting for a response that will never arrive. Aggressive connection timeouts on every outbound call are the only reliable defense. In Kubernetes, a pod OOM-kill is a crash failure — the container is replaced, but in-flight requests are dropped.
2. Omission Failures
A node receives a request but produces no response — it silently drops messages. This is indistinguishable from a crash from the caller's perspective, except the server is still alive and consuming resources. Causes include overwhelmed receive buffers, kernel TCP backlog saturation, and buggy async code that loses tasks when a channel is full. Omission failures are dangerous because the server may be partially processing requests, producing inconsistent state that no caller detects until business logic fails.
3. Byzantine Failures
A node responds with incorrect data — wrong results, corrupted payloads, or responses that appear syntactically valid but are semantically wrong. A Byzantine node might return stale data from a failed cache flush, silently truncate a write, or return HTTP 200 to an operation that never persisted. This is the hardest failure class to detect: health checks pass, error rates are nominal, p99 latency is normal. The signature appears only in business metrics — wrong account balances, missing order line items, duplicate notifications. End-to-end data consistency probes are the chaos experiment category that targets this class specifically.
4. Timing Failures
A node responds correctly but outside the expected time window — too slowly. Slow responses cause upstream timeouts that the caller treats as failures, triggering retries that amplify load on an already-struggling downstream — the classic death spiral. Timing failures are the most common class at scale and are the primary target of latency-injection chaos experiments. The failure is not that the node is wrong; it is that its response arrives after the caller has already given up and retried, now executing the operation twice.
5. Network Partition
The network splits: some nodes can reach each other but cannot reach others. The CAP theorem guarantees that under partition you must choose between consistency (refuse writes that cannot be fully replicated) and availability (accept writes that may diverge). Most production systems — Kafka in async mode, DynamoDB with eventual consistency, most databases with async replicas — choose availability and tolerate temporary divergence. Chaos experiments that simulate partitions reveal whether divergence is handled gracefully or produces split-brain data corruption.
The Cascading Failure Pattern
In production, single-node failures rarely stay contained. The pattern that takes down services at scale is the cascading failure: a slow downstream causes upstream threads to block, exhausting the thread pool, causing a queue backup, triggering memory pressure, triggering GC pauses, making the upstream node itself slow — which causes its upstream to block. The failure propagates up the entire call stack. Netflix analyzed this pattern after their 2011 Christmas Eve outage and built Hystrix to address it; the Resilience4j patterns are the current-generation solution.
Resource Exhaustion and the Thundering Herd
Two additional failure modes deserve attention because they are reliably triggered by chaos experiments and by real incidents alike.
Resource exhaustion occurs when a shared resource — thread pool, connection pool, file descriptor limit, heap — is consumed entirely. The first sign is increased latency (queuing for the resource), not errors. By the time errors appear, the exhaustion has been building for minutes. ulimit -n, JVM heap, and database connection pool sizes are the most common exhaustion points in production microservices.
The thundering herd occurs when a large number of clients all retry at the same instant — typically after a cache miss, a deployment restart, or a brief network hiccup that caused synchronized timeouts. Without jitter in retry logic, thousands of clients hammer the recovering service simultaneously, pushing it back into failure. This is why every retry implementation must include randomized jitter, not just exponential backoff alone.
Why Failure Mode Taxonomy Matters for Chaos Design
Matching your chaos experiment to the right failure class is what separates a signal-generating experiment from noise. Killing a pod tests crash-failure resilience. Adding 200 ms of latency to a downstream call tests timing-failure propagation. Corrupting a subset of API responses tests Byzantine detection. Dropping packets between two availability zones tests partition handling.
Without this taxonomy, teams default to the easiest experiment — pod kills — and spend months building confidence in crash resilience while their system has undetected Byzantine bugs and no jitter in retries. The fallacies are the theoretical foundation; the failure classes are the targeting system for where your experiments should point.