Chaos Engineering & Resilience

Project: Design a Chaos Program

This final lesson puts everything from the tutorial into practice. You will design a complete, hypothesis-driven chaos program for a realistic production system — not a toy demo, but the kind of structured program that Netflix, Google, and Amazon run against their live infrastructure. The output is a chaos experiment plan document that any senior engineer on your team could pick up, execute, and learn from.

The Sample System

The target system is a multi-region e-commerce checkout service. It is a representative big-tech architecture that exercises every failure mode covered in this tutorial. The system has the following components:

API Gateway — NGINX ingress in two AWS regions (us-east-1, eu-west-1), fronted by Route 53 latency-based routing.
Checkout Service — Kubernetes Deployment (6 replicas per region), written in Go. Calls three downstream services: Inventory, Payment, and Fraud.
Inventory Service — stateless, 4 replicas per region, Redis-backed cache with a PostgreSQL read replica fallback.
Payment Service — calls an external Stripe API with a 2-second timeout and a circuit breaker.
Fraud Service — an ML inference service with p99 latency of 180 ms under normal load.
PostgreSQL — primary in us-east-1 with a synchronous standby in eu-west-1 managed by Patroni for automatic failover.
Redis Cluster — 3-shard, 1-replica-per-shard cluster used for session state and inventory cache.
Observability Stack — Prometheus/Grafana for metrics, Jaeger for traces, Loki for logs. SLO: 99.9% checkout success rate over a 28-day rolling window.
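
The SLO above implies a concrete error budget, which is what every experiment in this plan spends. A minimal sketch of the arithmetic follows; the function names and the request volume in the usage line are illustrative assumptions, not figures from the sample system.

```python
# Error budget implied by a 99.9% SLO over a 28-day rolling window.
# Function names and the example request volume are hypothetical.

SLO_TARGET = 0.999
WINDOW_DAYS = 28


def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of total unavailability the SLO tolerates in the window."""
    return (1.0 - slo) * window_days * 24 * 60


def error_budget_requests(slo: float, total_requests: int) -> int:
    """Failed checkouts the SLO tolerates for a given request volume."""
    return int(round((1.0 - slo) * total_requests))


if __name__ == "__main__":
    print(round(error_budget_minutes(SLO_TARGET, WINDOW_DAYS), 2))
    # Assumed volume: one million checkouts per day for 28 days.
    print(error_budget_requests(SLO_TARGET, 28 * 1_000_000))
```

At 99.9% over 28 days the budget is roughly 40 minutes of full outage, so a 15-minute experiment that degrades a 10% slice consumes only a small part of it if the rollback trigger is honored.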

Step 1: Define the Steady State

Every chaos experiment needs a measurable, unambiguous baseline. Vague steady states like "the system is healthy" are useless — you cannot detect deviation from a baseline you cannot measure. Define your steady state in terms of SLIs that map directly to the SLO:

# Steady-state SLI definitions (PromQL)

# 1. Checkout success rate (primary SLO signal)
sum(rate(checkout_requests_total{status="success"}[5m]))
/
sum(rate(checkout_requests_total[5m]))
# Target: >= 0.999 (99.9%)

# 2. p99 checkout latency
histogram_quantile(0.99, sum by (le) (rate(checkout_duration_seconds_bucket[5m])))
# Target: <= 2.0 seconds

# 3. Payment circuit breaker state
payment_circuit_breaker_state{state="open"}
# Target: == 0 (breaker closed)

# 4. Inventory cache hit rate
sum(rate(inventory_cache_hits_total[5m]))
/
sum(rate(inventory_cache_requests_total[5m]))
# Target: >= 0.85 (85% hit rate)

# 5. DB replication lag (for region failover experiments)
postgres_replication_lag_seconds{replica="eu-west-1"}
# Target: <= 5.0 seconds

Key principle: Your steady state is a CONTRACT. Write it in a shared document before the experiment, not after. If the experiment breaks it, you have found a real weakness. If the experiment does not break it, you have gained real confidence. Either outcome is valuable — the only failure is an experiment you cannot interpret because the baseline was undefined.
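
One way to make the contract executable is to encode the five targets as data and check observed values against them before, during, and after a run. The sketch below assumes the values have already been fetched from Prometheus; the dictionary keys are hypothetical short names for the five queries above, not real metric names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def at_least(observed: float, target: float) -> bool:
    return observed >= target


def at_most(observed: float, target: float) -> bool:
    return observed <= target


@dataclass(frozen=True)
class Sli:
    name: str                              # hypothetical key for one query
    target: float
    ok: Callable[[float, float], bool]     # (observed, target) -> in contract?


STEADY_STATE: List[Sli] = [
    Sli("checkout_success_rate", 0.999, at_least),
    Sli("checkout_p99_seconds", 2.0, at_most),
    Sli("payment_breaker_open", 0.0, at_most),   # gauge must stay at 0
    Sli("inventory_cache_hit_rate", 0.85, at_least),
    Sli("db_replication_lag_seconds", 5.0, at_most),
]


def violations(observed: Dict[str, float]) -> List[str]:
    """Return the names of SLIs that are missing or outside their target."""
    broken = []
    for sli in STEADY_STATE:
        value = observed.get(sli.name)
        # A missing measurement counts as a violation: an unmeasured
        # baseline cannot be used to interpret the experiment.
        if value is None or not sli.ok(value, sli.target):
            broken.append(sli.name)
    return broken
```

An empty list from `violations(...)` before injection is the precondition for starting; a non-empty list means the system is not at steady state and the experiment should be postponed.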

Step 2: The Experiment Backlog — Prioritized by Risk

Do not run experiments in the order that seems interesting. Run them in the order that addresses the highest business risk first. Score each experiment on two dimensions: likelihood (how often does this failure mode actually occur?) and impact (how bad is the user experience if it happens?). Multiply for a risk score. High-score experiments go first.

Chaos experiment risk matrix: prioritize experiments in the top-right quadrant (high likelihood, high business impact) first.
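
The likelihood-times-impact scoring can be sketched in a few lines. The 1 to 5 scales and the example scores below are assumptions chosen for illustration; the lesson does not prescribe a scale, so calibrate both against your own incident history.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (cosmetic) to 5 (direct revenue loss)

    @property
    def risk(self) -> int:
        return self.likelihood * self.impact


def prioritize(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Highest risk first; ties broken by impact, then by name."""
    return sorted(candidates, key=lambda c: (-c.risk, -c.impact, c.name))


if __name__ == "__main__":
    # Illustrative scores only, not measurements from the sample system.
    backlog = [
        Candidate("full region failover", likelihood=1, impact=5),
        Candidate("payment latency", likelihood=4, impact=5),
        Candidate("checkout pod kill", likelihood=4, impact=3),
    ]
    for c in prioritize(backlog):
        print(f"{c.risk:>2}  {c.name}")
```

Breaking ties by impact keeps a rare but catastrophic failure ahead of a frequent but harmless one when the products are equal.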

Step 3: Writing the Hypothesis Document

Each experiment must have a written hypothesis document before a single command is run. The format is: Given [system state], when [failure event], then [measurable outcome], because [resilience mechanism]. Here is the first experiment in full:

# EXPERIMENT-001: Payment Service Latency Injection
# Priority: P0 (High likelihood, direct revenue impact)
# Owner: payments-team
# Scheduled: Tuesday 2025-09-09 14:00 UTC (business hours)
# Blast radius: 10% of traffic in us-east-1 canary slice

## Hypothesis
Given: checkout service is operating at steady state (>= 99.9% success, p99 <= 2.0s)
When:  Payment Service response time degrades to 4 seconds (2x normal timeout) for 10% of calls
Then:  checkout success rate stays >= 99.5% (graceful degradation budget),
       payment_circuit_breaker_state flips to "open" within 30 seconds,
       users on affected slice receive a "try again shortly" message, NOT a 500
Because: the circuit breaker (threshold: 5 failures in 10s) should open and return
         a cached "payment unavailable" response; retry logic routes unaffected users normally

## Rollback trigger
- checkout success rate drops below 99.0% for > 60 seconds
- any manual observation of data corruption in payment_events table

## Rollback procedure
kubectl delete networkchaos experiment-001-payment-latency -n chaos-testing
# Deleting the NetworkChaos resource makes Chaos Mesh remove the injected delay; the circuit breaker should close within 30s

## Metrics to capture (save dashboard snapshot before + during + after)
- checkout_requests_total{status="success"} rate
- checkout_duration_seconds p99
- payment_circuit_breaker_state
- payment_client_timeout_total rate
- downstream_payment_latency_seconds p50/p99

## Chaos Mesh manifest
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: experiment-001-payment-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: fixed-percent   # selects a percentage of matching pods, not of requests
  value: "10"
  selector:
    namespaces: [production]
    labelSelectors:
      app: checkout-service
  direction: to
  target:
    selector:
      namespaces: [production]
      labelSelectors:
        app: payment-service
    mode: all
  delay:
    latency: "4000ms"
    correlation: "50"
    jitter: "500ms"
  duration: "15m"
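
The "Because" clause of the hypothesis rests on a breaker that opens after 5 failures in 10 seconds. The sketch below shows one way such a sliding-window breaker could behave, so the 30-second expectation can be reasoned about. It is not the Payment Service's actual implementation: the class name is hypothetical, the 30-second cooldown is an assumption taken from the rollback note, and the half-open probing that production breakers use is omitted.

```python
from collections import deque
from typing import Deque, Optional


class CircuitBreaker:
    """Opens after `threshold` failures within `window_s` seconds."""

    def __init__(self, threshold: int = 5, window_s: float = 10.0,
                 cooldown_s: float = 30.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s  # assumed value, not from the service
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Drop failures that have aged out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.threshold:
            self._opened_at = now

    def is_open(self, now: float) -> bool:
        if self._opened_at is None:
            return False
        if now - self._opened_at >= self.cooldown_s:
            # Cooldown elapsed: close again and forget old failures.
            self._opened_at = None
            self._failures.clear()
            return False
        return True
```

With a 2-second client timeout, each affected call takes at least 2 seconds to register as a failure, so the time to open depends on request concurrency on the affected pods. That dependency is worth checking against the 30-second prediction before the run.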

Big-tech standard: In mature chaos programs at large engineering organizations, every experiment typically has a named owner, a linked incident ticket, and a Slack channel pinned for the duration. The owner is authorized to pull the rollback trigger without waiting for approval. Chaos without a named human watching metrics in real time is production recklessness, not engineering discipline.
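
The first rollback trigger in EXPERIMENT-001 (success rate below 99.0% for more than 60 seconds) is precise enough to evaluate mechanically, which helps the owner act without debate. A minimal sketch, assuming success-rate samples arrive as (timestamp, value) pairs in time order; the function name and the sample data are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def rollback_time(
    samples: Iterable[Tuple[float, float]],
    floor: float = 0.990,
    sustain_s: float = 60.0,
) -> Optional[float]:
    """Return the timestamp at which rollback should fire, or None.

    `samples` holds (timestamp_seconds, success_rate) in time order.
    """
    breach_started: Optional[float] = None
    for ts, rate in samples:
        if rate < floor:
            if breach_started is None:
                breach_started = ts
            # Fire only once the breach has lasted strictly longer
            # than the sustain period.
            if ts - breach_started > sustain_s:
                return ts
        else:
            breach_started = None  # any recovery resets the clock
    return None


if __name__ == "__main__":
    # Made-up samples at 15-second intervals.
    run = [(0, 0.999), (15, 0.985), (30, 0.980), (45, 0.985),
           (60, 0.980), (75, 0.980), (90, 0.980)]
    print(rollback_time(run))
```

Resetting the clock on any recovered sample makes the trigger tolerant of brief dips; a stricter program might instead count cumulative seconds below the floor.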

Step 4: The Full Experiment Backlog (Summary Table)

A mature chaos program maintains a backlog of experiments ranked by risk score. Below is the complete experiment plan for this system, structured as you would present it in an engineering design review:

ID             | Failure Mode                       | Scope          | Risk | Hypothesis Summary
--------------------------------------------------------------------------------------------------------------------------
EXPERIMENT-001 | Payment API +4s latency            | 10% canary     | P0   | Circuit breaker opens; graceful degradation
EXPERIMENT-002 | Redis cluster: kill primary shard  | 1 of 3 shards  | P0   | Replica promotes; cache miss spike < 30s
EXPERIMENT-003 | Checkout pod kill (3 of 6)         | us-east-1 only | P1   | K8s reschedules; no SLO breach
EXPERIMENT-004 | PostgreSQL primary failover        | us-east-1      | P1   | Patroni promotes standby < 45s; no data loss
EXPERIMENT-005 | Fraud service 100% failure         | all traffic    | P1   | Checkout proceeds (fraud is non-blocking)
EXPERIMENT-006 | Availability zone network split    | us-east-1 AZ-b | P1   | Route 53 + ALB drain AZ-b; no SLO breach
EXPERIMENT-007 | CPU saturation (90%) on checkout   | 2 of 6 pods    | P2   | HPA triggers; scale-out within 60s
EXPERIMENT-008 | DNS resolution failure (Fraud)     | canary 5%      | P2   | Fraud timeout path; checkout not blocked
EXPERIMENT-009 | Memory exhaustion + OOMKill        | 1 pod          | P2   | K8s restarts pod; no traffic dropped
EXPERIMENT-010 | Regional failover (full us-east-1) | staging mirror | P3   | Route 53 fails over to eu-west-1 < 2min

Step 5: Run-Book Integration and Alerting Validation

Every chaos experiment is also an alerting validation test. As you inject the failure, verify that your on-call alerts fire within the expected SLA. If EXPERIMENT-003 kills 3 of 6 pods and your PagerDuty alert does not fire within 90 seconds, your alerting is broken — and that is just as important a finding as the resilience behavior itself.

After every experiment, update the run-book for the failure mode you tested. A run-book entry validated by a chaos experiment is worth ten times a run-book entry written from first principles. Include: observed TTD (time to detect), TTR (time to recover), which monitoring signals fired first, and what the on-call engineer should do that differs from theory.
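
The TTD and TTR figures for a run-book entry can be derived from three timestamps recorded during the run. The sketch below is one possible convention: it measures both from the moment of injection, which is an assumption (some teams measure recovery from detection instead), and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeline:
    injected_at: float     # failure injection started (epoch seconds)
    alert_fired_at: float  # first page reached the on-call engineer
    recovered_at: float    # steady state restored

    @property
    def ttd(self) -> float:
        """Time to detect, measured from injection."""
        return self.alert_fired_at - self.injected_at

    @property
    def ttr(self) -> float:
        """Time to recover, measured from injection (assumed convention)."""
        return self.recovered_at - self.injected_at


def alerting_ok(t: Timeline, max_ttd_s: float = 90.0) -> bool:
    """True if the alert fired after injection and within the allowed window."""
    return 0 <= t.ttd <= max_ttd_s
```

For EXPERIMENT-003, a timeline whose `ttd` exceeds 90 seconds fails `alerting_ok` and should be recorded as an alerting finding even if the pods rescheduled cleanly.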

Step 6: Documenting Findings and Tracking Improvements

For each experiment, record the outcome against the hypothesis. Three outcomes are possible:

Confirmed: The system behaved exactly as hypothesized. Document as evidence of resilience. Schedule re-run after next major dependency change.
Refuted — expected direction: The system failed worse than hypothesized (e.g., circuit breaker took 90s to open instead of 30s). File a bug with a severity matching the risk score. Block the next sprint until fixed.
Refuted — unexpected direction: The system failed in a way you did not predict (e.g., the circuit breaker opened but caused a cascade that took down Fraud as well). This is the most valuable outcome. Write a post-mortem, not just a bug ticket — something about your mental model of the system was fundamentally wrong.
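
The three outcomes above can be encoded so that every experiment record carries an explicit classification and follow-up. This is a sketch under assumptions: the two inputs (whether every prediction held, and a list of failures nobody predicted) are a hypothetical way to summarize a run, and the follow-up strings paraphrase the list above.

```python
from enum import Enum
from typing import Dict, List


class Outcome(Enum):
    CONFIRMED = "confirmed"
    REFUTED_EXPECTED = "refuted, expected direction"
    REFUTED_UNEXPECTED = "refuted, unexpected direction"


FOLLOW_UP: Dict[Outcome, str] = {
    Outcome.CONFIRMED: "document as evidence; re-run after next major dependency change",
    Outcome.REFUTED_EXPECTED: "file a bug with severity matching the risk score",
    Outcome.REFUTED_UNEXPECTED: "write a post-mortem; revisit the system model",
}


def classify(predictions_met: bool, unpredicted_failures: List[str]) -> Outcome:
    """Classify a run from its two summary facts."""
    # Any failure nobody predicted dominates, even if the stated
    # predictions technically held.
    if unpredicted_failures:
        return Outcome.REFUTED_UNEXPECTED
    if predictions_met:
        return Outcome.CONFIRMED
    return Outcome.REFUTED_EXPECTED
```

Checking for unpredicted failures first reflects the ordering of value in the list: a surprise is recorded as a surprise even when the hypothesized metrics stayed within bounds.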

Production pitfall: The most common chaos program failure is running experiments but not tracking the improvement work that comes out of them. If EXPERIMENT-002 finds that Redis shard promotion takes 45 seconds instead of the hypothesized 10 seconds, and that finding is not linked to a tracked engineering work item with a due date, the chaos program has produced entropy, not improvement. Every refuted hypothesis must generate a ticket with an owner, a due date, and a verification plan (which is usually running the same chaos experiment again after the fix).

Chaos Program Maturity — Where to Go Next

Chaos engineering maturity model: start with manual experiments, evolve to continuous production chaos over 12-18 months.

A healthy first-year chaos program for a 50-engineer organization looks like: eight manual experiments per quarter (Level 1), two Game Days per year (Level 2), and chaos gates on the most critical services in the CI pipeline by the end of year one (early Level 3). Continuous production chaos (Level 4) requires mature observability, a blameless culture, and years of accumulated experiment results that build organizational confidence. Do not skip the levels.

Capstone action: Take this experiment plan and adapt it to a system you operate right now. Identify your five riskiest failure modes using the risk matrix, write the hypothesis document for the first experiment, and schedule it for next Tuesday at 2 pm. The most important step in any chaos program is the first experiment, not because the result will be perfect, but because it makes the practice real. A good standard to adopt is that no new service is called production-ready until at least one chaos experiment has been run against it.