Chaos Experiments: Application
Infrastructure chaos (killing nodes, draining pods) tests platform resilience. Application-layer chaos tests whether your service code handles the messy reality of distributed systems: slow dependencies, partial failures, and unexpected error responses. These are the faults that actually page you at 3 AM: not a dead host, but a payment gateway returning 503 on 60% of calls while the other 40% succeed.
This lesson covers three application-level fault classes: latency injection, dependency failures, and error injection. Each targets a distinct failure mode, and each requires a different experimental approach.
1. Latency Injection
Latency is the silent killer of distributed systems. A downstream service that takes 3s instead of 30ms does not return an error, so your code happily waits, holding a thread or goroutine, a database connection, and a pool slot for the whole duration. Under traffic, that compounds into a queue pile-up, then a cascade. This is the "gray failure" profile: everything appears up, but available capacity steadily drains.
The goal of latency injection is to verify that every network call in your service has a timeout, that the timeout fits inside the latency budget your SLO allows, and that a slow dependency does not exhaust your resources.
Injecting latency with Toxiproxy: Toxiproxy is a programmable TCP proxy purpose-built for chaos. You wire it between your service and a dependency, then toggle faults via its HTTP API without touching either service.
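As a rough sketch of what toggling a fault through that HTTP API can look like, the Python below builds the JSON bodies for creating a proxy and attaching a `latency` toxic. The API address `localhost:8474` is Toxiproxy's default port; the proxy name, the listen and upstream addresses, and the helper names are placeholders invented for illustration.

```python
import json
import urllib.request

# Toxiproxy's HTTP API listens on port 8474 by default; the host is an assumption.
TOXIPROXY_API = "http://localhost:8474"


def proxy_spec(name, listen, upstream):
    """Body for POST /proxies: accept on `listen`, forward to `upstream`."""
    return {"name": name, "listen": listen, "upstream": upstream, "enabled": True}


def latency_toxic(latency_ms, jitter_ms=0, toxicity=1.0):
    """Body for POST /proxies/{name}/toxics: delay data coming back from the upstream."""
    return {
        "name": "latency_downstream",
        "type": "latency",
        "stream": "downstream",
        "toxicity": toxicity,  # fraction of connections affected
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }


def call_api(method, path, body=None, base_url=TOXIPROXY_API):
    """Send one request to the Toxiproxy API and return the decoded JSON, if any."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(base_url + path, data=data, method=method)
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req, timeout=5) as resp:
        raw = resp.read()
    return json.loads(raw) if raw else None


def inject_latency(proxy_name, latency_ms, jitter_ms=0):
    """Attach the latency toxic to an existing proxy."""
    return call_api("POST", f"/proxies/{proxy_name}/toxics",
                    latency_toxic(latency_ms, jitter_ms))


def remove_latency(proxy_name):
    """Kill switch: delete the toxic so traffic flows normally again."""
    return call_api("DELETE", f"/proxies/{proxy_name}/toxics/latency_downstream")


# Example (needs a running Toxiproxy server; names and addresses are hypothetical):
#   call_api("POST", "/proxies", proxy_spec("payments", "0.0.0.0:18080", "payments:8080"))
#   inject_latency("payments", 3000, jitter_ms=200)
#   remove_latency("payments")
```

Pointing the service at the proxy's listen address instead of the real dependency is the only wiring change; the fault itself is added and removed at runtime.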
In Kubernetes with Istio: A VirtualService with a fault.delay block injects latency at the mesh layer, inside the sidecar proxy, without changing application code or deploying a separate fault proxy.
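A minimal sketch of such a manifest, generated from Python and printed as JSON (which `kubectl apply -f` accepts alongside YAML). The resource name `payments-delay`, the host `payments`, and the 50% / 3s values are assumptions chosen for illustration; the `fault.delay` fields follow Istio's VirtualService schema.

```python
import json


def delay_virtual_service(name, host, percent, fixed_delay):
    """Build a VirtualService that delays `percent`% of HTTP requests to `host`."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": name},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    "fault": {
                        "delay": {
                            "percentage": {"value": percent},
                            "fixedDelay": fixed_delay,
                        }
                    },
                    # Requests are still routed to the real service after the delay.
                    "route": [{"destination": {"host": host}}],
                }
            ],
        },
    }


# Hypothetical values: delay half of all calls to "payments" by 3 seconds.
manifest = delay_virtual_service("payments-delay", "payments", 50.0, "3s")
print(json.dumps(manifest, indent=2))
```

Deleting the VirtualService, or applying it again without the `fault` block, removes the delay.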
Apply this during a load test and observe: does your service respect its own timeout? Does it shed load gracefully, or does it queue requests until memory is exhausted?
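One pattern that lets a service shed load instead of queueing without bound is a concurrency cap (often called a bulkhead) around each dependency. The sketch below is an illustration under that assumption, not a prescribed design; `Bulkhead` and `BulkheadFull` are names invented here.

```python
import threading


class BulkheadFull(Exception):
    """Raised when the dependency already has the maximum number of calls in flight."""


class Bulkhead:
    """Caps concurrent calls to one dependency and rejects the excess immediately."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: if every slot is busy, shed the request
        # rather than letting it wait behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("dependency saturated; shedding load")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

Under injected latency, the in-flight calls hold their slots for longer, so a correctly sized cap should start rejecting requests quickly while memory and queue depth stay flat.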
2. Dependency Failures
Every service has dependencies: databases, caches, message queues, third-party APIs. Dependency failure experiments answer the question: "What does our service return when X is completely unavailable?"
The expected answer for non-critical dependencies is: the service degrades gracefully, returning a partial response or cached data, while logging and alerting. The expected answer for a critical dependency is: the service returns a meaningful error quickly (fast-fail), not after a 30-second timeout chain.
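The two expected behaviours can be sketched in a single request handler. Everything here is hypothetical: `build_page`, the product and recommendation fetchers, and the dictionary used as a cache stand in for whatever your service actually calls.

```python
import logging

log = logging.getLogger("degradation")


def build_page(product_id, fetch_product, fetch_recommendations, cached_recommendations):
    """Assemble a response from one critical and one non-critical dependency."""
    # Critical dependency: no fallback. Any failure propagates immediately,
    # so the caller gets a fast, explicit error.
    product = fetch_product(product_id)

    # Non-critical dependency: fall back to cached data and flag the response.
    try:
        recommendations = fetch_recommendations(product_id)
        degraded = False
    except (ConnectionError, TimeoutError) as exc:
        log.warning("recommendations unavailable, serving cached data: %s", exc)
        recommendations = cached_recommendations.get(product_id, [])
        degraded = True

    return {"product": product, "recommendations": recommendations, "degraded": degraded}
```

The `degraded` flag gives the experiment something concrete to assert on: during the outage it should be true on every response, and the warning log should be feeding an alert.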
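Simulating a dependency outage with Istio fault injection (HTTP abort): A VirtualService with a fault.abort block makes the sidecar return a chosen HTTP status, such as 503, for a configurable percentage of requests, so the dependency looks unavailable to its callers.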
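A minimal sketch of an abort manifest, generated from Python and printed as JSON for `kubectl apply -f`. The resource name `inventory-abort`, the host `inventory`, and the 100% / 503 values are assumptions for illustration; the `fault.abort` fields follow Istio's VirtualService schema.

```python
import json


def abort_virtual_service(name, host, percent, http_status):
    """Build a VirtualService that fails `percent`% of HTTP requests to `host`."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": name},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    "fault": {
                        "abort": {
                            "percentage": {"value": percent},
                            "httpStatus": http_status,
                        }
                    },
                    # Requests that are not aborted still reach the real service.
                    "route": [{"destination": {"host": host}}],
                }
            ],
        },
    }


# Hypothetical values: every call to "inventory" fails with 503.
manifest = abort_virtual_service("inventory-abort", "inventory", 100.0, 503)
print(json.dumps(manifest, indent=2))
```

Setting the percentage below 100 turns the same manifest into a partial-failure experiment.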
Run this against a non-critical read path and verify your service returns a cached or degraded response. Run it against a critical write path and confirm the error surfaces to the caller cleanly rather than silently dropping data.
3. Error Injection
Error injection goes beyond "service is down" and tests how your code responds to specific bad responses: 500 Internal Server Error, 429 Too Many Requests, malformed JSON, authentication failures, or truncated responses. Real production incidents often involve a dependency that is technically up but responding incorrectly — an expired certificate, a schema migration that broke the response format, or a rate-limit triggered by a noisy neighbour.
For HTTP services, Istio's fault.abort covers status codes. For more nuanced corruption — wrong content-type headers, partial payloads, oversized responses — use a purpose-built fault proxy or a service mesh extension.
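On the client side, the behaviour these experiments probe is how a response gets classified before any retry decision is made. The function below is a sketch with invented outcome labels, and it assumes `Retry-After` arrives as a whole number of seconds (the header may also carry an HTTP date, which this sketch ignores).

```python
import json


def classify_response(status, headers, body):
    """Map a raw HTTP response to an (outcome, detail) pair the caller can act on."""
    if status == 429:
        retry_after = headers.get("Retry-After", "")
        # Only the delay-in-seconds form is handled here.
        delay = float(retry_after) if retry_after.isdigit() else None
        return ("backoff", delay)
    if status in (401, 403):
        return ("auth_failure", None)
    if 500 <= status < 600:
        return ("retryable_error", None)
    if 200 <= status < 300:
        try:
            # A 2xx with a truncated or malformed body is still a failure.
            return ("ok", json.loads(body))
        except (ValueError, TypeError):
            return ("malformed_body", None)
    return ("client_error", None)
```

An error-injection run then checks that each outcome leads to the intended action: back off on 429, retry with a limit on 5xx, and never treat a malformed body as success.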
Chaos Monkey for Spring Boot (application-level SDK): For services where you own the code, in-process SDKs inject faults at the method level, without any network proxy. This is useful for testing retry logic in your own client code and for fault injection in environments where you cannot run a sidecar.
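The in-process idea can be illustrated with a small Python decorator. This is an analogue of the technique, not the Chaos Monkey for Spring Boot API; the `chaos` decorator, its parameters, and `InjectedFault` are all invented for this sketch.

```python
import functools
import random
import time


class InjectedFault(RuntimeError):
    """Marks failures raised by the chaos decorator rather than by real code."""


def chaos(enabled, error_rate=0.0, latency_s=0.0, rng=random.random, sleep=time.sleep):
    """Wrap a function so calls can be delayed or failed while `enabled()` is true."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if enabled():  # checked on every call, so it doubles as a kill switch
                if latency_s:
                    sleep(latency_s)
                if rng() < error_rate:
                    raise InjectedFault(f"injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Wrapping a client method this way lets a test drive the surrounding retry logic directly, for example by asserting that it gives up after a bounded number of `InjectedFault` errors.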
Connecting Experiments to Observability
An application chaos experiment is only as good as its observability. Before you inject a single fault, confirm you have:
- A defined steady state: concrete thresholds such as p99 latency < 200ms, error rate < 0.1%, queue depth < 50.
- Dashboards open: Your service's RED metrics (Rate, Errors, Duration) and the dependency's health.
- A kill switch ready: A single command or one-click action that removes the fault immediately.
- Alerting muted deliberately, or deliberately left on: Running experiments with real alerting active is a valid test of your on-call runbooks.
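The steady-state item in the checklist above can be written down as an executable check. The sketch below uses the example thresholds from the list; the metric names and the `violations` helper are assumptions, and the observed values would come from your own metrics system.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SteadyState:
    """Upper limits; a metric at or above its limit counts as a breach."""
    p99_latency_ms: float = 200.0
    error_rate: float = 0.001  # 0.1%
    queue_depth: float = 50


def violations(observed, limits=SteadyState()):
    """Return a description of every metric that is outside the steady state."""
    breaches = []
    for metric, limit in asdict(limits).items():
        value = observed[metric]
        if value >= limit:
            breaches.append(f"{metric}={value} (limit < {limit})")
    return breaches
```

Running the check before injection confirms the baseline is healthy; running it during and after the experiment records what the fault moved and whether the system recovered.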
After each experiment, document what broke, what held, and what needs fixing. The finding is not the failure — the finding is the gap between your assumed resilience and the measured reality.