Configuration Management with Ansible

Ansible at Scale


Running a playbook against ten servers feels effortless. Running the same playbook against ten thousand servers — across multiple datacenters, cloud accounts, and network segments, under strict change-window constraints — is a fundamentally different engineering problem. This lesson covers the strategies, tuning knobs, and orchestration tooling that separate a hobby Ansible setup from a production-grade fleet automation platform.

Understanding Forks: Parallelism in Ansible

By default, Ansible processes only 5 hosts in parallel (the forks setting). That default is deliberately conservative and too low for most production fleets. Each fork is a worker process on the control node that handles one host at a time, so forks caps how many hosts (and therefore SSH connections) are active simultaneously.

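# ansible.cfg: tune for your control node's CPU and open-file limits.
# Comments sit on their own lines: ansible.cfg accepts only ";" for inline comments,
# so a trailing "# ..." would become part of the value.
[defaults]
# Parallel worker processes, one host each; raise toward 100-200 on well-provisioned control nodes.
forks = 50
# Disables SSH host key verification. Convenient for ephemeral hosts, risky elsewhere.
host_key_checking = False
# Fewer SSH operations per task; requires requiretty to be disabled in sudoers on targets.
pipelining = True
# Gather facts only for hosts that have no valid cached facts.
gathering = smart
# The redis cache plugin needs the redis Python package on the control node.
fact_caching = redis
fact_caching_connection = localhost:6379:0
fact_caching_timeout = 86400

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o ServerAliveInterval=10
retries = 3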

Three settings do the most work at scale:

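forks: raise to 50-200 for cloud fleets. The practical ceiling is set by the control node's CPU, its open-file limit (ulimit -n), and available memory. Each fork is a separate worker process, so measure per-fork memory on your own node rather than assuming a fixed figure.
pipelining = True: pipes the module to the remote Python interpreter over the existing SSH session instead of copying it to a temporary file first, which removes several SSH operations per task. On a 500-host playbook this can cut total run time substantially (30-40% is often quoted), but the gain depends on network latency and task mix.
fact_caching: together with gathering = smart, skips fact gathering for hosts whose facts are still in the cache. With Redis, facts survive across playbook runs for fact_caching_timeout seconds (86400 above, i.e. 24 hours).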
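
Caching helps most when the initial gather is cheap as well. The play below is a minimal sketch of trimming the fact set with the gather_subset play keyword; the app_servers group and the debug task are illustrative assumptions, not part of any particular setup.

```yaml
# Sketch only: the host group and the debug task are illustrative.
- name: Configure app servers with a trimmed fact set
  hosts: app_servers
  gather_facts: true
  gather_subset:
    - "!all"      # drop the default full collection (the minimal set is still gathered)
    - network     # add back only what this play actually reads
  tasks:
    - name: Show the primary IPv4 address taken from (possibly cached) facts
      ansible.builtin.debug:
        msg: "{{ ansible_facts.default_ipv4.address | default('unknown') }}"
```

With gathering = smart and a cache backend configured, hosts whose facts are still cached skip the gather step entirely; the subset only affects hosts that have to be gathered.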

Execution Strategies

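Ansible ships several built-in strategy plugins (linear, free, host_pinned, and debug) that change how tasks are distributed across hosts. The serial play keyword is not a strategy: it splits the play into batches and works on top of whichever strategy is in use.

Three patterns matter most at scale: linear (the safe default, where every host finishes a task before any host starts the next), free (maximum throughput, where each host runs through the play as fast as it can), and serial/rolling batches (production deploys with a controlled blast radius).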

# Rolling deploy — update 10% of the fleet at a time
- name: Rolling application deploy
  hosts: app_servers
  serial:
    - "10%"   # first batch: 10% of hosts
    - "25%"   # second batch: 25%
    - "100%"  # remaining hosts
  max_fail_percentage: 5   # abort if more than 5% of a batch fails
  strategy: free           # within each batch, hosts run independently

  pre_tasks:
    - name: Remove host from load balancer
      uri:
        url: "http://lb.internal/api/drain/{{ inventory_hostname }}"
        method: POST
      delegate_to: localhost

  tasks:
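    - name: Download application artifact from S3
      # unarchive cannot read s3:// URLs, so fetch the object first.
      # Needs the amazon.aws collection plus boto3 and AWS credentials on the target.
      amazon.aws.s3_object:
        bucket: artifacts
        object: "app-{{ version }}.tar.gz"
        dest: "/tmp/app-{{ version }}.tar.gz"
        mode: get

    - name: Unpack the artifact that is now on the target
      ansible.builtin.unarchive:
        src: "/tmp/app-{{ version }}.tar.gz"
        dest: /opt/app
        remote_src: true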

    - name: Restart service
      ansible.builtin.systemd:
        name: app
        state: restarted

  post_tasks:
    - name: Re-add host to load balancer
      uri:
        url: "http://lb.internal/api/enable/{{ inventory_hostname }}"
        method: POST
      delegate_to: localhost

Set max_fail_percentage: 0 for zero-tolerance rollouts (any failure stops the entire play). The percentage is evaluated per batch, so use a graduated serial list to canary-test on a small batch before touching the bulk of the fleet.
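
A zero-tolerance canary can be sketched as follows. The batch sizes, port 8080, and the /healthz path are assumptions made for illustration; substitute whatever health check your application actually exposes.

```yaml
# Sketch only: batch sizes and the health endpoint are hypothetical.
- name: Canary rollout that stops on the first failed host
  hosts: app_servers
  become: true
  serial:
    - 1          # single canary host
    - "20%"
    - "100%"
  max_fail_percentage: 0
  tasks:
    - name: Restart service
      ansible.builtin.systemd:
        name: app
        state: restarted

    - name: Wait until the application reports healthy
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/healthz"   # hypothetical endpoint
        status_code: 200
      register: health
      until: health.status == 200
      retries: 10
      delay: 6
      delegate_to: localhost
      become: false
```

Because the health check is a task in the same batch, a canary that never becomes healthy fails the batch and the remaining hosts are not touched.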

AWX and Ansible Automation Platform (Controller)

The command-line ansible-playbook workflow does not scale organisationally. Who ran the last playbook? Against which hosts? With what variables? Did it succeed? Can a non-engineer trigger it safely? These questions are unanswerable without a control plane. That control plane is AWX (the open-source upstream) or Red Hat Ansible Automation Platform / Automation Controller (the enterprise product).

Key capabilities AWX adds over raw CLI:

RBAC — teams get permissions to specific job templates, not shell access to the control node.
Credential management — SSH keys, vault passwords, cloud credentials stored encrypted in the AWX database; never exposed to operators running jobs.
Job templates — a named, versioned combination of playbook + inventory + credentials + extra vars. Anyone with access can launch it; no CLI knowledge required.
Surveys — web forms that prompt operators for variables (e.g. target environment, version) before launching a job. Safe, auditable variable injection.
Workflow job templates — directed acyclic graphs (DAGs) of job templates: "run hardening, then deploy, then smoke tests; if smoke tests fail, run rollback."
Audit log — every job stores its full stdout, the user who launched it, timestamps, and outcome. Essential for compliance (SOC 2, PCI-DSS).
Scheduling — cron-driven execution for nightly compliance enforcement.

In practice, many large engineering teams treat AWX job templates as the only sanctioned way to run Ansible in production. Engineers who need a one-off run open a ticket that triggers a parameterised job template rather than SSHing to the control node. This eliminates the "works on my laptop" problem and provides a complete change audit trail.
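
Job templates themselves can be kept in version control and applied with the awx.awx collection. The following is a sketch under several assumptions: the organisation, project, inventory, and credential names are hypothetical and must already exist in AWX, and the controller address and token are expected to come from the collection's usual environment variables or config file. Check the parameter names against the collection version you have installed.

```yaml
# Sketch only: every quoted name below is a hypothetical object in your AWX instance.
- name: Define the rolling deploy job template as code
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Create or update the job template
      awx.awx.job_template:
        name: "Rolling application deploy"
        organization: "Platform"          # hypothetical
        project: "fleet-automation"       # hypothetical
        playbook: "deploy.yml"            # hypothetical path inside the project
        inventory: "production-aws"       # hypothetical
        credentials:
          - "prod-ssh-key"                # hypothetical
        job_type: run
        survey_enabled: true
        survey_spec:
          name: ""
          description: ""
          spec:
            - question_name: "Application version"
              variable: version
              type: text
              required: true
        state: present
```

The survey question feeds the version variable that a deploy playbook would consume, so operators never edit extra vars by hand.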

Dynamic Inventory at Scale

Static inventory/hosts.ini files are unmaintainable beyond a few dozen hosts. At scale, inventory must be pulled dynamically from the source of truth — your cloud provider, CMDB, or service registry.

# inventory/aws_ec2.yml — AWS dynamic inventory plugin
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
  - eu-west-1
filters:
  tag:Env: production
  instance-state-name: running
keyed_groups:
  - key: tags.Role
    prefix: role
  - key: placement.availability_zone
    prefix: az
compose:
  ansible_host: public_ip_address   # or private_ip_address for VPN-connected fleets
  ansible_user: "'ec2-user'"

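# Run: ansible-inventory -i inventory/aws_ec2.yml --graph
# Illustrative, abridged output. Real output nests every group under @all,
# and host names depend on the plugin's hostnames setting.
# @role_webserver:
#   |--10.0.1.5
#   |--10.0.1.6
# @az_us_east_1a:
#   |--10.0.1.5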
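
The groups produced by keyed_groups can be combined with host patterns. As a minimal sketch, the play below targets only hosts that are in both role_webserver and az_us_east_1a; the connectivity check stands in for whatever work you would actually do there.

```yaml
# Sketch only: group names follow the keyed_groups prefixes defined above.
- name: Touch web servers in a single availability zone
  hosts: "role_webserver:&az_us_east_1a"   # intersection of the two generated groups
  serial: "25%"
  gather_facts: false
  tasks:
    - name: Verify that Ansible can reach and execute on each host
      ansible.builtin.ping:
```

Run it with the same inventory file, for example ansible-playbook -i inventory/aws_ec2.yml followed by the playbook name.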

When Ansible Is the Wrong Tool: Immutable Infrastructure

Ansible excels at configuring mutable servers — machines that live long enough to warrant ongoing management. But the modern cloud trend is immutable infrastructure: bake a machine image once, deploy it, and replace rather than modify it when a change is needed.

Use Ansible for configuration management when:

You manage long-lived VMs (database servers, legacy on-prem nodes, bare metal).
Startup time matters and baking a new AMI for every change is too slow.
The system cannot be replaced without data migration (stateful services).

Prefer immutable images (Packer + Ansible, Docker, AMI-based ASG) when:

You run stateless application tiers — web servers, API nodes, workers.
You want identical behaviour in dev, staging, and prod (the image is the artefact).
Your threat model requires that production hosts have no SSH access whatsoever.
You already use Kubernetes or ECS — containers are the immutable unit; host config is minimal.

The best-practice hybrid: use Packer + Ansible to bake golden AMIs. Ansible configures the base OS hardening, common agents (CloudWatch, SSM), and runtime dependencies once during image build. The running fleet never receives Ansible pushes — it is replaced. Ansible still manages the small set of long-lived stateful hosts that cannot follow the immutable pattern.
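
A bake playbook for this hybrid might look like the sketch below. It assumes an Amazon Linux style base image (package and service names differ on other distributions) and that Packer's Ansible provisioner runs it against the temporary build instance; the package list is a hypothetical baseline.

```yaml
# Sketch only: package and service names assume an Amazon Linux style image.
- name: Bake golden image baseline
  hosts: all            # Packer's Ansible provisioner supplies the single build host
  become: true
  vars:
    baseline_packages:  # hypothetical baseline; adjust to your image
      - chrony
      - amazon-cloudwatch-agent
  tasks:
    - name: Install baseline packages
      ansible.builtin.package:
        name: "{{ baseline_packages }}"
        state: present

    - name: Disable SSH password authentication
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?PasswordAuthentication'
        line: 'PasswordAuthentication no'
        validate: '/usr/sbin/sshd -t -f %s'

    - name: Enable time synchronisation at boot
      ansible.builtin.service:
        name: chronyd
        enabled: true
```

Nothing here starts or restarts services for the running fleet: the output is an image, and instances launched from it come up already configured.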

Never run ansible-playbook without a change-management gate in production. An untested playbook can fire against thousands of hosts in seconds. Always test against a staging inventory group first, use --check (dry-run) + --diff to preview changes, and restrict production job template launch permissions to senior engineers or require a second approval in AWX.
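
One lightweight way to encode such a gate in the playbook itself is an assertion that refuses to proceed without a change reference. The change_ticket variable is a hypothetical name; in AWX it could be a required survey field.

```yaml
# Sketch only: change_ticket is a hypothetical variable supplied at launch time.
- name: Guarded production change
  hosts: app_servers
  gather_facts: false
  pre_tasks:
    - name: Refuse to run without a change ticket
      ansible.builtin.assert:
        that:
          - change_ticket is defined
          - change_ticket | string | length > 0
        fail_msg: "Pass -e change_ticket=<id> (or fill in the survey) before running against production."
      run_once: true
      delegate_to: localhost
  tasks:
    - name: Placeholder for the real change
      ansible.builtin.debug:
        msg: "Applying change {{ change_ticket }} to {{ inventory_hostname }}"
```

Preview the same play first with ansible-playbook --check --diff against a staging inventory; the assertion runs in check mode too, so a missing ticket is caught before any real run.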

Performance Profiling

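When a large playbook is slow, instrument it before tuning blindly. The profile_tasks and profile_roles callback plugins live in the ansible.posix collection (bundled with the community ansible package, but not with ansible-core alone) and add negligible overhead, so they can stay enabled in normal runs.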

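# ansible.cfg: enable timing callbacks (fully qualified names; requires the ansible.posix collection)
[defaults]
callbacks_enabled = ansible.posix.profile_tasks, ansible.posix.profile_roles, ansible.posix.timer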

# After the run you will see output like:
# ============================================================
# gather_facts ----------------------------------- 48.23s
# apt: install packages -------------------------- 32.11s
# template: nginx.conf --------------------------- 4.87s
# ============================================================
# Total time: 85.21s

# Quick wins once you know the slow tasks:
# 1. Cache facts (fact_caching = redis, shown earlier)
# 2. Use `async` + `poll` for long-running tasks (e.g. package installs)
# 3. Raise forks
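# 4. Consider the Mitogen strategy plugin (its documentation claims roughly 1.25x-7x; it keeps SSH
#    as the transport but replaces per-task shell and file-copy steps with a persistent Python RPC channel)

# pip install mitogen    (the ansible_mitogen plugins ship inside the mitogen package)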
# ansible.cfg:
# [defaults]
# strategy_plugins = /path/to/ansible_mitogen/plugins/strategy
# strategy         = mitogen_linear

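On very large fleets, the Mitogen strategy plugin is sometimes adopted for its speed improvement. It replaces the SSH-then-shell-then-Python bootstrap that Ansible repeats for every task with a persistent Python channel to each host, which can cut per-task overhead from hundreds of milliseconds to a few milliseconds per host, depending on latency and workload. The trade-offs are an additional dependency, support that lags behind new Ansible releases (check Mitogen's supported-version list before upgrading), and occasional compatibility issues with community modules that use unusual Python. Always test in staging first.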
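
The async + poll quick win listed above can be sketched as a fire-and-check pair of tasks. The package name and the timing values are illustrative assumptions.

```yaml
# Sketch only: nginx and the timeouts are placeholders for your own long-running task.
- name: Long-running install without holding the SSH connection open
  hosts: app_servers
  become: true
  tasks:
    - name: Start the package install in the background
      ansible.builtin.package:
        name: nginx
        state: present
      async: 900        # allow the job up to 15 minutes
      poll: 0           # return immediately instead of waiting
      register: install_job

    - name: Wait for the background install to finish
      ansible.builtin.async_status:
        jid: "{{ install_job.ansible_job_id }}"
      register: install_result
      until: install_result.finished
      retries: 60       # 60 checks x 15 s matches the 900 s budget above
      delay: 15
```

Other tasks can be placed between the two steps so that useful work continues while the install runs; the status check must run as the same user that started the job, which the play-level become ensures here.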