Chaos Engineering & Resilience

Chaos Tooling

18 min Lesson 6 of 27

Chaos engineering is only as rigorous as the tooling that executes it. Running kill -9 in a shell script is not chaos engineering; it is an outage. Professional chaos tooling provides blast-radius controls, automated steady-state checks, audit logs, rollback hooks, and integration with your CI/CD and observability stack. This lesson surveys the four tools that dominate production chaos programs in 2025: Chaos Monkey (the Netflix lineage that started it all), Litmus (the Kubernetes-native CNCF incubating project), Chaos Mesh (the more feature-rich Kubernetes operator that originated at PingCAP), and AWS Fault Injection Service (the managed option for AWS-heavy organizations). You will leave knowing not only what each tool does, but also when to reach for each one and how to operate it safely.

Chaos Monkey & the Netflix Lineage

Netflix open-sourced Chaos Monkey in 2012 as a service that randomly terminated EC2 instances during business hours. The philosophy was blunt: if you are afraid of random instance termination, you have not built the redundancy your SLAs require, so fix that before a hurricane does it for you. Chaos Monkey began as part of the Simian Army suite; version 2.0 (2016) was rewritten in Go as a standalone project that works through Spinnaker, and the rest of the Simian Army has since been retired. Netflix later built ChAP (Chaos Automation Platform) for more targeted failure-injection experiments, but the core idea is unchanged.

Netflix runs Chaos Monkey in production continuously. Each service team opts in (or is opted in by policy). Terminations target instance groups such as AWS Auto Scaling groups and are governed by per-application settings; in the open-source release these include the mean and minimum time between terminations, how instances are grouped, and an exceptions list, which together keep any one group from being hit repeatedly in quick succession. Each termination is recorded, so it can be correlated with dashboards built on Atlas (Netflix's open-source dimensional time-series telemetry platform) and the impact is visible immediately.

The key lesson from the Netflix lineage: chaos must run continuously and automatically, not just during game days. Netflix teams stopped fearing random termination precisely because it happens every week — they are forced to build for it. Run chaos on a schedule, not as a one-off exercise.

The open-source chaos-monkey Go binary requires Spinnaker: it is configured through a config file plus per-application settings stored in Spinnaker, and it terminates instances through Spinnaker. For teams not on Netflix infrastructure the mental model transfers directly: terminate random pods in your deployments during business hours on a recurring cron, alert on it, and measure mean-time-to-recover. The Kubernetes equivalent is a simple CronJob that runs kubectl get pods --field-selector=status.phase=Running -n <ns> --no-headers | shuf -n 1 | awk '{print $1}' | xargs kubectl delete pod -n <ns>. Note that the pipeline must start with kubectl get pods; starting it with kubectl delete pod would delete every running pod in the namespace. It is crude, but effective as a starting point before you need experiment-level controls.
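The selection step of that one-liner can be sketched in Python as a pure function. This is an illustrative sketch only: the function name pick_victim, the min_healthy floor, and the pod names are assumptions introduced here, not part of Chaos Monkey or kubectl. The floor is an extra safety guard that the shell pipeline does not have.

```python
import random
from typing import Optional, Sequence


def pick_victim(
    running_pods: Sequence[str],
    min_healthy: int,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return one pod name to delete, or None if deleting would breach min_healthy.

    min_healthy is a hypothetical safety floor: the number of pods that must
    remain running after the deletion.
    """
    rng = rng or random.Random()
    # Refuse to act when removing one pod would drop below the floor.
    if len(running_pods) - 1 < min_healthy:
        return None
    return rng.choice(list(running_pods))


if __name__ == "__main__":
    pods = ["api-7d9f-a", "api-7d9f-b", "api-7d9f-c"]  # hypothetical pod names
    print(pick_victim(pods, min_healthy=2))       # prints one of the three names
    print(pick_victim(pods[:2], min_healthy=2))   # prints None: only 1 pod would remain
```

Passing a seeded random.Random makes the choice reproducible, which helps when you want to replay a run during a post-incident review.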

Litmus: CNCF-Incubating Kubernetes-Native Chaos

Litmus (now LitmusChaos, promoted from the CNCF Sandbox to Incubating status in 2022) is the reference platform for Kubernetes chaos. It defines each fault as a ChaosExperiment custom resource and binds it to a target workload with a ChaosEngine custom resource. The control plane (ChaosCenter, called Litmus Portal before the 2.0 release) runs in-cluster and provides a UI, scheduling, RBAC, and GitOps integration. Experiments are versioned, sharable ChaosExperiment resources backed by a public hub at hub.litmuschaos.io.

A typical Litmus setup installs the operator and its three custom resource definitions (ChaosExperiment, ChaosEngine, and ChaosResult), then references pre-built experiments by name. The engine spec wires the experiment to a target workload, injects steady-state probes, and records a verdict in the ChaosResult: pass (steady state held throughout) or fail (SLO breached during the experiment window).

# Install Litmus 3.x via Helm
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install chaos litmuschaos/litmus \
  --namespace=litmus --create-namespace \
  --set portal.frontend.service.type=ClusterIP

# Apply a pre-built pod-delete experiment from the ChaosHub
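# Quote the URL so the shell does not treat "?" as a glob character.
# The ChaosExperiment must live in the same namespace as the ChaosEngine that uses it.
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/3.0.0?file=charts/generic/pod-delete/experiment.yaml" \
  -n payments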

# ChaosEngine targeting the payments service
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payments-pod-delete
  namespace: payments
spec:
  engineState: active
  appinfo:
    appns: payments
    applabel: "app=payments-api"
    appkind: deployment
  chaosServiceAccount: litmus-admin   # must exist in the payments namespace
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"          # seconds
            - name: CHAOS_INTERVAL
              value: "20"          # delete every N seconds
            - name: FORCE
              value: "false"       # graceful delete (SIGTERM first)
            - name: PODS_AFFECTED_PERC
              value: "50"          # target up to 50% of matching pods
        # Steady-state hypothesis: payments P99 latency stays under 300 ms.
        # Probes are declared under experiments[].spec.probe in the ChaosEngine schema.
        probe:
          - name: prom-latency-probe
            type: promProbe
            mode: Continuous
            runProperties:
              probeTimeout: 5s     # Litmus 3.x takes duration strings; 2.x takes bare integers
              interval: 10s
              retry: 2
            promProbe/inputs:
              endpoint: "http://prometheus.monitoring:9090"
              query: 'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="payments"}[1m])) by (le))'
              comparator:
                type: float
                criteria: "<="
                value: "0.3"

Probes are the most important part of a Litmus experiment. Without a promProbe or httpProbe verifying your SLO throughout the chaos window, the experiment tells you nothing — it just deletes pods. Always define at least one continuous probe tied to a real user-facing metric.
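To make the pass/fail logic of a continuous probe concrete, here is a minimal Python sketch of how a comparator such as criteria "<=" with value 0.3 could be applied to a series of metric samples. It is a simplification written for this lesson, not Litmus source code: the names CRITERIA and continuous_verdict are hypothetical, and real probes also handle timeouts and retries, which this sketch ignores.

```python
import operator
from typing import Callable, Dict, Iterable

# Comparison functions keyed by the operator strings used in a probe comparator.
CRITERIA: Dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def continuous_verdict(samples: Iterable[float], criteria: str, threshold: float) -> str:
    """Return "pass" only if at least one sample exists and every sample meets the comparator."""
    compare = CRITERIA[criteria]
    seen_any = False
    for value in samples:
        seen_any = True
        if not compare(value, threshold):
            return "fail"  # a single breach during the chaos window fails the run
    # No samples means no evidence that the steady state held, so treat it as a failure.
    return "pass" if seen_any else "fail"


if __name__ == "__main__":
    # Hypothetical P99 latency samples (seconds) collected during the chaos window.
    print(continuous_verdict([0.12, 0.25, 0.30], "<=", 0.3))  # pass
    print(continuous_verdict([0.12, 0.45, 0.20], "<=", 0.3))  # fail
```

Treating an empty sample set as a failure is a deliberate design choice in this sketch: an experiment whose probe returned no data has verified nothing.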

Litmus integrates natively with Argo Workflows for pipeline-driven chaos (run experiment → await verdict → gate the next deployment stage). It also supports chaos scheduling via ChaosSchedule CRDs, enabling recurring experiments that mirror the Chaos Monkey always-on philosophy.

Chaos Mesh: Feature-Rich Kubernetes Chaos Operator

Chaos Mesh, built by PingCAP (the company behind TiDB) and now a CNCF incubating project, takes a broader scope than Litmus. Where Litmus focuses on experiment composition and the ChaosHub ecosystem, Chaos Mesh ships a richer set of built-in fault types and a more granular network simulation API, including precise bandwidth throttling, packet reordering, and partition by label selector.

Chaos Mesh architecture: the Dashboard writes CRDs, the Controller Manager reconciles them, and the privileged Chaos Daemon DaemonSet injects faults at the kernel/network level on each node.

The Chaos Daemon runs as a privileged DaemonSet with host PID and network namespace access. This is what makes network-level faults — real tc netem rules, iptables drops, bandwidth caps — accurate rather than simulated at the application layer. It also means you must audit the daemon RBAC carefully in multi-tenant clusters.

# Install Chaos Mesh via Helm (Kubernetes 1.26+)
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace=chaos-mesh --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock

# NetworkChaos: inject 200 ms latency + 20 ms jitter into the inventory service
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: inventory-latency
  namespace: inventory
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - inventory
    labelSelectors:
      app: inventory-api
  delay:
    latency: "200ms"
    jitter: "20ms"
    correlation: "25"      # % correlation between consecutive packets
  duration: "5m"
  direction: to            # affects outbound traffic from selected pods

---
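# Schedule: kill one random pod every 90 seconds
# (Chaos Mesh 2.x uses a Schedule resource; the inline scheduler field was removed in 2.0)
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: inventory-pod-kill
  namespace: inventory
spec:
  schedule: "@every 90s"
  type: PodChaos
  concurrencyPolicy: Forbid   # do not start a new run while one is still active
  historyLimit: 5
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - inventory
      labelSelectors:
        app: inventory-api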

Never run Chaos Mesh with mode: all on a production namespace without a duration limit and an automated rollback policy. The controller manager respects the duration field and removes injected faults automatically at expiry — but if the chaos-mesh namespace itself is disrupted, you need a manual cleanup path. Always test the pause and delete flows in staging before prod.
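One way to enforce the rule above before anything reaches the cluster is a small pre-apply check in CI. The sketch below assumes the manifest has already been parsed into a Python dict (for example by a YAML loader). The function name lint_chaos_manifest is hypothetical, and the second rule (flagging a selector with no labelSelectors) is an extra illustrative check added here, not a Chaos Mesh feature.

```python
from typing import Any, Dict, List


def lint_chaos_manifest(manifest: Dict[str, Any]) -> List[str]:
    """Return warnings for risky settings; an empty list means none were found."""
    warnings: List[str] = []
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    spec = manifest.get("spec", {})

    # Rule from the lesson: mode "all" must always carry a duration limit.
    if spec.get("mode") == "all" and not spec.get("duration"):
        warnings.append(f"{name}: mode 'all' without a duration limit")

    # Illustrative extra rule: without labelSelectors the selector may match
    # every pod in the selected namespaces.
    if not spec.get("selector", {}).get("labelSelectors"):
        warnings.append(f"{name}: selector has no labelSelectors")

    return warnings


if __name__ == "__main__":
    risky = {
        "kind": "NetworkChaos",
        "metadata": {"name": "inventory-latency"},
        "spec": {
            "action": "delay",
            "mode": "all",
            "selector": {"labelSelectors": {"app": "inventory-api"}},
        },
    }
    for warning in lint_chaos_manifest(risky):
        print(warning)  # inventory-latency: mode 'all' without a duration limit
```

A check like this only catches mistakes in the manifest; it does not replace testing the pause and delete flows in staging.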

AWS Fault Injection Service (FIS)

AWS FIS, generally available since March 2021, is the managed control plane for infrastructure-level chaos on AWS. It operates on IAM principals: you grant FIS an execution role that allows specific fault actions, and most actions need no cluster-level RBAC (the EKS pod actions are the exception, since they run through a Kubernetes service account). This makes it the natural choice for AWS-heavy shops that want chaos without running an in-cluster operator.

FIS models experiments as ExperimentTemplates. Each template defines: a set of actions (each an AWS fault primitive), targets (filtered by tags, ARNs, or random percent), stop conditions (CloudWatch Alarms that halt the experiment automatically), and an IAM role. The stop conditions are the critical safety primitive — wire them to your error-rate or latency alarms so FIS self-aborts if the blast radius exceeds expectations.

# FIS ExperimentTemplate (AWS CLI JSON) — CPU stress on 25% of tagged EC2 instances
aws fis create-experiment-template --cli-input-json '{
  "description": "CPU stress on api-server fleet",
  "targets": {
    "ApiServers": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {"Role": "api-server", "Env": "prod"},
      "selectionMode": "PERCENT(25)"
    }
  },
  "actions": {
    "cpu-stress": {
      "actionId": "aws:ssm:send-command",
      "description": "stress-ng 60s via SSM",
      "parameters": {
        "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-CPU-Stress",
        "documentParameters": "{\"DurationSeconds\":\"60\",\"CPU\":\"0\",\"InstallDependencies\":\"True\"}",
        "duration": "PT2M"
      },
      "targets": {"Instances": "ApiServers"}
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:api-error-rate-high"
    }
  ],
  "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole",
  "tags": {"Purpose": "chaos-cpu-resilience"}
}'

# Start the experiment
aws fis start-experiment --experiment-template-id EXT123EXAMPLE

FIS ships a broad catalog of built-in fault actions, including EC2 instance stop, reboot, and termination; CPU, memory, and network-latency stress through pre-built SSM documents; EKS pod deletion and pod-level stress; RDS failover; EBS volume I/O pause; subnet connectivity disruption; and Spot interruption simulation. On-host faults run through AWS Systems Manager (SSM), so beyond the SSM Agent there is no separate agent binary for you to manage.

Always use FIS stop conditions tied to real CloudWatch alarms, not placeholder alarms. Scope the experiment role to minimum privilege: only the specific fis: and target-service actions the experiment needs (for example, ssm:SendCommand for the template above). Audit it with IAM Access Analyzer before your first prod run. FIS can also send experiment logs to CloudWatch Logs and S3. Enable this; you need the audit trail for post-incident reviews.
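These safety rules can be checked mechanically before a template is created. The following Python sketch inspects a template dict shaped like the JSON passed to create-experiment-template. The function name check_fis_template and the exact set of rules are assumptions made for illustration; it only checks that an alarm ARN is present and cannot tell whether the alarm is a meaningful one.

```python
from typing import Any, Dict, List


def check_fis_template(template: Dict[str, Any]) -> List[str]:
    """Return a list of safety problems found in an FIS experiment template dict."""
    problems: List[str] = []

    # At least one stop condition must point at a CloudWatch alarm.
    # A stop condition with source "none" provides no automatic abort.
    alarms = [
        cond for cond in template.get("stopConditions", [])
        if cond.get("source") == "aws:cloudwatch:alarm" and cond.get("value")
    ]
    if not alarms:
        problems.append("no CloudWatch alarm stop condition")

    # Flag targets that select every matching resource instead of a subset.
    for name, target in template.get("targets", {}).items():
        if target.get("selectionMode") == "ALL":
            problems.append(f"target {name} selects ALL matching resources")

    # Experiment logging is configured through logConfiguration.
    if not template.get("logConfiguration"):
        problems.append("experiment logging is not configured")

    return problems


if __name__ == "__main__":
    template = {
        "targets": {"ApiServers": {"selectionMode": "PERCENT(25)"}},
        "stopConditions": [{
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:api-error-rate-high",
        }],
    }
    print(check_fis_template(template))  # ['experiment logging is not configured']
```

Run against the CPU-stress template from this lesson, the check reports only the missing log configuration, which matches the advice to enable experiment logging before a production run.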

Choosing the Right Tool

In practice, large organizations use more than one: FIS for AWS-native infrastructure faults (RDS failover, Spot interruption), Litmus or Chaos Mesh for Kubernetes application faults (pod kill, network partition, CPU/memory pressure at the container level). Litmus wins on ecosystem (ChaosHub, Argo integration, GitOps). Chaos Mesh wins on network fault granularity. FIS wins on zero operational overhead for AWS customers. The Chaos Monkey philosophy — always-on, automated, continuous — applies regardless of which tool implements it.