Chaos Engineering & Resilience

Chaos Experiments: Application

Infrastructure chaos (killing pods, draining nodes) tests platform resilience. Application-layer chaos tests whether your service code handles the messy reality of distributed systems: slow dependencies, partial failures, and unexpected error responses. These are the faults that actually page you at 3 AM: not a dead host, but a payment gateway returning 503 on 60% of calls while the other 40% succeed.

This lesson covers three application-level fault classes: latency injection, dependency failures, and error injection. Each targets a distinct failure mode, and each requires a different experimental approach.

1. Latency Injection

Latency is the silent killer of distributed systems. A dependency that takes 3s instead of 30ms does not return an error. Your code simply waits, holding a goroutine, a database connection, and a thread-pool slot. Under traffic, that compounds into a queue pile-up, then a cascade. This is the "gray failure" profile: everything appears up, yet the service steadily runs out of capacity.

The goal of latency injection is to verify that every network call in your service has a timeout, that the timeout fits inside the latency budget your SLO allows, and that a slow dependency does not exhaust your resources.
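The property under test can be sketched in a few lines, assuming a plain TCP dependency. The local server below is only a stand-in for a dependency with injected latency, and `start_slow_server` and `call_with_timeout` are hypothetical names rather than part of any tool in this lesson.

```python
import socket
import threading
import time


def start_slow_server(delay_s):
    """Start a local TCP server that waits delay_s before replying.

    Stand-in for a dependency with injected latency. Returns the port.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen()

    def serve():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # listening socket was closed
            with conn:
                try:
                    conn.recv(16)        # read the request first
                    time.sleep(delay_s)  # the injected latency
                    conn.sendall(b"ok")
                except OSError:
                    pass  # the client gave up and closed the connection

    threading.Thread(target=serve, daemon=True).start()
    return server.getsockname()[1]


def call_with_timeout(port, timeout_s):
    """Call the dependency; return its reply, or "timeout" if it is too slow."""
    try:
        # The timeout covers the connect and each later send/recv.
        with socket.create_connection(("127.0.0.1", port), timeout=timeout_s) as sock:
            sock.sendall(b"ping")
            return sock.recv(16).decode()
    except socket.timeout:
        return "timeout"
```

Calling `call_with_timeout(start_slow_server(0.5), timeout_s=0.1)` gives up after about 100ms instead of holding the connection for the full half second, which is the behavior a latency experiment should confirm for every outbound call.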

The Tail Latency Trap: p99 latency from a dependency can be orders of magnitude higher than p50. If your timeout is set to "average + buffer," you will time out roughly the slowest 1% of requests. That sounds small, but at one million requests per second, 1% is 10,000 failed requests every second.
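The arithmetic behind the trap can be worked through with a small, made-up latency sample. The numbers below are illustrative assumptions, not measurements, and "average + buffer" is modeled here as twice the mean.

```python
def timed_out_fraction(latencies_ms, timeout_ms):
    """Fraction of requests whose latency exceeds the timeout."""
    slow = sum(1 for latency in latencies_ms if latency > timeout_ms)
    return slow / len(latencies_ms)


# Illustrative sample: 990 fast requests (30 ms) and 10 slow ones (3 s).
latencies_ms = [30] * 990 + [3000] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 59.7 ms
timeout_ms = 2 * mean_ms                         # "average + buffer"
fraction = timed_out_fraction(latencies_ms, timeout_ms)

requests_per_second = 1_000_000                  # assumed traffic level
timeouts_per_second = fraction * requests_per_second

print(f"mean={mean_ms} ms, timeout={timeout_ms} ms")
print(f"timed out: {fraction:.1%} of requests, "
      f"about {timeouts_per_second:.0f} per second")
```

The mean barely moves when the tail is slow, so a timeout derived from it says nothing about how many requests sit in that tail. Measuring the percentile you intend to cut off is a more reliable starting point.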

Injecting latency with Toxiproxy: Toxiproxy is a programmable TCP proxy purpose-built for chaos. You wire it between your service and a dependency, then toggle faults via its HTTP API without touching either service.

# Start Toxiproxy (Docker, for local or staging)
docker run --rm -d \
  --name toxiproxy \
  -p 8474:8474 \
  -p 5432:5432 \
  ghcr.io/shopify/toxiproxy

# Create a proxy: local:5432 → real Postgres at db:5432
# (flags placed before the proxy name)
toxiproxy-cli create \
  --listen 0.0.0.0:5432 \
  --upstream db:5432 \
  postgres

# Inject 800ms of latency (jitter ±200ms) on the downstream stream,
# i.e. on data flowing from Postgres back to your service
toxiproxy-cli toxic add \
  --type latency \
  --attribute latency=800 \
  --attribute jitter=200 \
  --toxicName slow_db \
  postgres

# Run your load test while the toxic is active, then remove it
toxiproxy-cli toxic remove --toxicName slow_db postgres
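Because the faults are toggled over HTTP, a test harness can add and remove the toxic itself. The sketch below assumes the Toxiproxy API is reachable at `localhost:8474` and that a proxy named `postgres` already exists. The helper names are hypothetical, and the context manager is one possible way to make sure the toxic is always removed.

```python
import json
import urllib.request
from contextlib import contextmanager

TOXIPROXY_URL = "http://localhost:8474"  # assumption: API port published locally


def latency_toxic_payload(name, latency_ms, jitter_ms):
    """Build the JSON body for a latency toxic on the downstream stream."""
    return {
        "name": name,
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,  # apply to every connection
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }


def build_request(method, path, body=None):
    """Build (but do not send) a request against the Toxiproxy API."""
    data = json.dumps(body).encode() if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(
        TOXIPROXY_URL + path, data=data, headers=headers, method=method
    )


@contextmanager
def injected_latency(proxy, name, latency_ms, jitter_ms):
    """Add a latency toxic for the duration of the with-block, then remove it."""
    add = build_request(
        "POST", f"/proxies/{proxy}/toxics",
        latency_toxic_payload(name, latency_ms, jitter_ms),
    )
    remove = build_request("DELETE", f"/proxies/{proxy}/toxics/{name}")
    urllib.request.urlopen(add, timeout=5).close()
    try:
        yield
    finally:
        # Runs even if the load test raises, so the fault never lingers.
        urllib.request.urlopen(remove, timeout=5).close()
```

A load test would then run inside `with injected_latency("postgres", "slow_db", 800, 200): ...`, which doubles as a simple kill switch when the block exits.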

In Kubernetes with Istio: A VirtualService with a fault.delay spec injects latency at the mesh layer, with no change to application code and no extra proxy to deploy.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments-chaos
spec:
  hosts:
    - payments-svc
  http:
    - fault:
        delay:
          percentage:
            value: 10          # 10% of requests
          fixedDelay: 2s       # 2-second added latency
      route:
        - destination:
            host: payments-svc

Apply this during a load test and observe: does your service respect its own timeout? Does it shed load gracefully, or does it queue requests until memory is exhausted?
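One way to shed load rather than queue it is to cap in-flight work and reject the excess immediately. This is a minimal sketch of that idea, with `LoadShedder` and `handle` as hypothetical names. A real service would more likely rely on the limits built into its server or framework.

```python
import threading


class LoadShedder:
    """Reject work beyond max_in_flight instead of queueing it."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self):
        # Non-blocking: returns False at once when every slot is taken.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def handle(shedder, work):
    """Run work if a slot is free; otherwise shed with a 503-style result."""
    if not shedder.try_acquire():
        return 503, "overloaded, retry later"
    try:
        return 200, work()
    finally:
        shedder.release()
```

With the delay fault active, slots stay occupied longer, so a service built this way should start returning fast 503s while its memory use stays flat. That is the graceful outcome the experiment is looking for.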

2. Dependency Failures

Every service has dependencies: databases, caches, message queues, third-party APIs. Dependency failure experiments answer the question: "What does our service return when X is completely unavailable?"

The expected answer for non-critical dependencies is: the service degrades gracefully, returning a partial response or cached data, while logging and alerting. The expected answer for a critical dependency is: the service returns a meaningful error quickly (fast-fail), not after a 30-second timeout chain.

Without a circuit breaker, a failed dependency stalls threads until the whole service cascades. With one, calls fail fast and the service degrades gracefully.
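For readers who have not built one, a circuit breaker can be sketched as a small state machine. The version below is a simplified, single-threaded illustration with assumed defaults (three consecutive failures, a 30-second reset). Production services would normally use a maintained library instead.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock          # injectable so tests can control time
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    def call(self, func):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Open: fail fast without touching the dependency.
                raise CircuitOpenError("dependency circuit is open")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            was_half_open = self._opened_at is not None
            if was_half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

During a dependency-failure experiment, the signal to look for is the switch from slow failures (each call waiting on the dependency) to immediate `CircuitOpenError` rejections once the threshold is reached.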

Simulating a dependency outage with Istio fault injection (HTTP abort):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: inventory-chaos
spec:
  hosts:
    - inventory-svc
  http:
    - fault:
        abort:
          percentage:
            value: 100         # total blackout
          httpStatus: 503
      route:
        - destination:
            host: inventory-svc

Run this against a non-critical read path and verify your service returns a cached or degraded response. Run it against a critical write path and confirm the error surfaces to the caller cleanly rather than silently dropping data.

Test fallback logic, not just error handling. Many teams have a circuit breaker configured but no fallback behind it. When the breaker opens, the service should return stale cached data, a sensible default, or a clear error, not a null pointer exception from an uninitialized cache.
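The fallback order described above can be sketched as a single function. `get_with_fallback` is a hypothetical helper, `fetch` stands in for whatever client call reaches the dependency, and the cache is a plain dictionary for illustration.

```python
def get_with_fallback(fetch, cache, key, default):
    """Return (value, source); source is "live", "stale-cache", or "default"."""
    try:
        value = fetch(key)
    except Exception:
        # Dependency failed: prefer stale data, then a safe default.
        if key in cache:
            return cache[key], "stale-cache"
        return default, "default"
    cache[key] = value  # refresh the cache on every successful call
    return value, "live"
```

The cold-cache branch is the one worth exercising in an experiment: it is the path that runs when the dependency is down before the first successful call, which is where an uninitialized cache tends to surface.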

3. Error Injection

Error injection goes beyond "service is down" and tests how your code responds to specific bad responses: 500 Internal Server Error, 429 Too Many Requests, malformed JSON, authentication failures, or truncated responses. Real production incidents often involve a dependency that is technically up but responding incorrectly — an expired certificate, a schema migration that broke the response format, or a rate-limit triggered by a noisy neighbour.

For HTTP services, Istio's fault.abort covers status codes. For more nuanced corruption — wrong content-type headers, partial payloads, oversized responses — use a purpose-built fault proxy or a service mesh extension.
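On the client side, error injection is checking something like the decision table below: each kind of bad response maps to a deliberate action. This is a sketch under stated assumptions. `classify_response` is a hypothetical helper, header lookup is treated as exact-case, and the one-second default delay is arbitrary.

```python
import json


def classify_response(status, headers, body):
    """Map a dependency response to ("ok", data), ("retry", delay_s), or ("fail", reason)."""
    if status == 429:
        # Honor Retry-After when it is a whole number of seconds.
        retry_after = headers.get("Retry-After", "")
        delay_s = float(retry_after) if retry_after.isdigit() else 1.0
        return "retry", delay_s
    if status in (502, 503, 504):
        return "retry", 1.0
    if status in (401, 403):
        return "fail", "authentication rejected"
    if status >= 400:
        return "fail", f"unexpected status {status}"
    try:
        data = json.loads(body)
    except ValueError:  # includes json.JSONDecodeError
        return "fail", "malformed or truncated JSON"
    if not isinstance(data, dict):
        return "fail", "unexpected JSON shape"
    return "ok", data
```

An experiment that injects a truncated body with a 200 status exercises the last two branches, which status-code faults alone never reach.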

Chaos Monkey for Spring Boot (application-level SDK): For services where you own the code, in-process SDKs inject faults at the method level, without any network proxy. This is useful for testing retry logic in your own client code and for fault injection in environments where you cannot run a sidecar.

# chaos-monkey-spring-boot: the library only loads when the
# chaos-monkey Spring profile is active
spring.profiles.active=chaos-monkey
management.endpoint.chaosmonkey.enabled=true
chaos.monkey.enabled=true

# Assault configuration: throw RuntimeException on roughly 1 in 5 calls
# (assaults.level=5) to any @Service-annotated bean
chaos.monkey.assaults.level=5
chaos.monkey.assaults.latencyActive=false
chaos.monkey.assaults.exceptionsActive=true
chaos.monkey.assaults.exception.type=java.lang.RuntimeException
chaos.monkey.assaults.exception.arguments[0].type=java.lang.String
chaos.monkey.assaults.exception.arguments[0].value=chaos: simulated error
chaos.monkey.watcher.service=true

Never run unguarded error injection in production. Always gate experiments behind a feature flag or a chaos-specific namespace. Injecting 503s at 100% to a critical path in prod — even briefly — can trigger real customer impact before your rollback completes. Start in staging, validate steady-state observability, then run production experiments with blast radius limited to 1-5% of traffic and an automated kill switch.
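The guard rails in that warning (an explicit flag, a small blast radius, a kill switch) can be sketched as an in-process injector. This is a language-neutral illustration in Python, not the Chaos Monkey for Spring Boot API. `ChaosConfig` and `chaos` are hypothetical names.

```python
import functools
import random


class ChaosConfig:
    """Runtime switch for an experiment; set enabled to False to stop injecting."""

    def __init__(self, enabled=False, failure_rate=0.01, seed=None):
        self.enabled = enabled            # off by default: opt in explicitly
        self.failure_rate = failure_rate  # blast radius, e.g. 0.01 for 1% of calls
        self._rng = random.Random(seed)

    def should_fail(self):
        return self.enabled and self._rng.random() < self.failure_rate


def chaos(config, exc_factory=lambda: RuntimeError("chaos: simulated error")):
    """Decorator that raises an injected error on a fraction of calls."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if config.should_fail():
                raise exc_factory()
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Because the wrapper consults `config` on every call, flipping `config.enabled` to `False` stops injection immediately without a redeploy, which is the property a kill switch needs.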

Connecting Experiments to Observability

An application chaos experiment is only as good as its observability. Before you inject a single fault, confirm you have:

A defined steady state: p99 latency < 200ms, error rate < 0.1%, queue depth < 50.
Dashboards open: Your service's RED metrics (Rate, Errors, Duration) and the dependency's health.
A kill switch ready: A single command or one-click action that removes the fault immediately.
Alerting handled deliberately: Mute alerts for the experiment window, or leave them active, since running with real alerting is a valid test of your on-call runbooks.
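The steady-state definition can also be written down as data, so that the experiment harness can check it before and during a run. The thresholds below copy the example values from the checklist. The metric names and the `steady_state_violations` helper are assumptions for illustration.

```python
STEADY_STATE = {
    "p99_latency_ms": 200.0,
    "error_rate": 0.001,   # 0.1%
    "queue_depth": 50,
}


def steady_state_violations(metrics, limits=STEADY_STATE):
    """Return the names of metrics that are missing or not below their limit."""
    violations = []
    for name, limit in limits.items():
        value = metrics.get(name)
        # A missing metric counts as a violation: no data, no experiment.
        if value is None or value >= limit:
            violations.append(name)
    return violations
```

A harness could refuse to start when this list is non-empty, and trigger the kill switch as soon as it becomes non-empty mid-run.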

After each experiment, document what broke, what held, and what needs fixing. The finding is not the failure — the finding is the gap between your assumed resilience and the measured reality.