Configuration Management with Ansible

Ansible at Scale

Running a playbook against ten servers feels effortless. Running the same playbook against ten thousand servers — across multiple datacenters, cloud accounts, and network segments, under strict change-window constraints — is a fundamentally different engineering problem. This lesson covers the strategies, tuning knobs, and orchestration tooling that separate a hobby Ansible setup from a production-grade fleet automation platform.

Understanding Forks: Parallelism in Ansible

By default, Ansible processes only 5 hosts in parallel (the forks setting). That default is deliberately conservative and is too low for most production fleets. The forks value caps the number of parallel worker processes, and therefore the number of hosts (and SSH connections) the control node handles at once.

    # ansible.cfg: tune for your control node's CPU and open-file limits
    # Comments sit on their own lines: ansible.cfg only treats ";" as an inline
    # comment marker, so "forks = 50 # ..." would corrupt the value.
    [defaults]
    # parallel worker processes; raise to 100-500 on powerful control nodes
    forks = 50
    host_key_checking = False
    # fewer SSH round-trips per task; requires requiretty to be disabled in sudoers on targets
    pipelining = True
    # gather facts only for hosts that have no valid cached facts
    gathering = smart
    fact_caching = redis
    fact_caching_connection = localhost:6379:0
    # cache lifetime in seconds (24 hours)
    fact_caching_timeout = 86400

    [ssh_connection]
    ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o ServerAliveInterval=10
    retries = 3

Three settings do the most work at scale:

  • forks — raise to 50-200 for cloud fleets. The practical ceiling is the control node's open-file limit (ulimit -n) and available memory (~10 MB per fork).
  • pipelining = True — bundles the module upload and execution into one SSH call instead of three. On a 500-host playbook this can cut total run time by 30-40%.
  • fact_caching — avoids re-running the gather_facts task on every play. With Redis, facts survive across playbook runs for fact_caching_timeout seconds.
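
Fact caching reduces how often facts are gathered; a play can also reduce how much is gathered each time. The following is a minimal sketch, assuming a hypothetical webservers group and a play that needs only network facts. It skips the implicit full gather and calls the setup module with a narrow subset.

```yaml
- name: Configure web tier with a reduced fact set
  hosts: webservers            # hypothetical group name
  gather_facts: false          # skip the implicit full fact gather
  tasks:
    - name: Gather only the minimal and network facts this play needs
      ansible.builtin.setup:
        gather_subset:
          - "!all"             # drop everything except the minimal set
          - network            # then add network facts back

    - name: Show the default IPv4 address from the gathered facts
      ansible.builtin.debug:
        var: ansible_default_ipv4.address
```

Whether this is worthwhile depends on the profile of your own runs; if fact gathering is not among the slowest tasks, the default gather is simpler to maintain.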

Execution Strategies

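Ansible's strategy plugins change how tasks are distributed across hosts. The two that matter most at scale are linear (the default) and free. Rolling batches are not a strategy plugin: they come from the serial play keyword, which splits the play's hosts into batches and can be combined with a strategy.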

Figure: Ansible execution strategies comparison.

  • linear (default): task 1 runs on all hosts, then task 2, then task 3. Ansible waits for all hosts before starting the next task. Safe and predictable.
  • free: each host works through its tasks independently (host A may finish tasks 1-3 while host B is still on task 1). Maximum speed, no ordering guarantees.
  • serial (rolling): batch 1 (10%) runs tasks 1-3, then batch 2 (10%), then batch 3 (10%). Capacity stays up and the blast radius is limited.

Caption: linear (safe default), free (maximum throughput), and serial/rolling (production deploys with controlled blast radius). Note that serial is a play keyword rather than a strategy plugin.
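
    # Rolling deploy: start with 10% of the fleet, then widen the batches
    - name: Rolling application deploy
      hosts: app_servers
      serial:
        - "10%"    # first batch: 10% of hosts
        - "25%"    # second batch: 25%
        - "100%"   # remaining hosts
      # Abort if more than 5% of a batch fails. max_fail_percentage is only
      # honoured by linear-derived strategies, so the play keeps the default
      # linear strategy instead of setting strategy: free.
      max_fail_percentage: 5
      pre_tasks:
        - name: Remove host from load balancer
          ansible.builtin.uri:
            url: "http://lb.internal/api/drain/{{ inventory_hostname }}"
            method: POST
          delegate_to: localhost
      tasks:
        # unarchive cannot fetch s3:// URLs, so download the artifact first
        # (amazon.aws.s3_object needs boto3 and AWS credentials on the target)
        - name: Download application artifact from S3
          amazon.aws.s3_object:
            bucket: artifacts
            object: "app-{{ version }}.tar.gz"
            dest: "/tmp/app-{{ version }}.tar.gz"
            mode: get
        - name: Deploy new application artifact
          ansible.builtin.unarchive:
            src: "/tmp/app-{{ version }}.tar.gz"
            dest: /opt/app
            remote_src: true
        - name: Restart service
          ansible.builtin.systemd:
            name: app
            state: restarted
      post_tasks:
        - name: Re-add host to load balancer
          ansible.builtin.uri:
            url: "http://lb.internal/api/enable/{{ inventory_hostname }}"
            method: POST
          delegate_to: localhost
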
Set max_fail_percentage: 0 for zero-tolerance rollouts (any failure stops the entire play). The threshold must be exceeded, not merely reached, before the play aborts. Use a graduated serial list to canary-test a small batch before touching the bulk of the fleet.
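
A zero-tolerance canary rollout can be sketched as follows. This is an illustration under assumptions, not a drop-in play: the health endpoint (port 8080, path /healthz) is hypothetical, and the deploy tasks themselves are omitted.

```yaml
- name: Zero-tolerance canary rollout
  hosts: app_servers
  serial:
    - 1          # a single canary host first
    - "10%"      # then a small batch
    - "100%"     # then everything that remains
  max_fail_percentage: 0   # any failed host aborts the play
  tasks:
    # ... deploy tasks for the batch would go here ...

    - name: Verify the service is healthy before moving to the next batch
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/healthz"   # hypothetical endpoint
        status_code: 200
      delegate_to: localhost
      register: health
      until: health is succeeded
      retries: 5
      delay: 3
```

If the canary host fails its health check after all retries, the play stops before the 10% batch is touched.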

AWX and Ansible Automation Platform (Controller)

The command-line ansible-playbook workflow does not scale organisationally. Who ran the last playbook? Against which hosts? With what variables? Did it succeed? Can a non-engineer trigger it safely? These questions are unanswerable without a control plane. That control plane is AWX (the open-source upstream) or Red Hat Ansible Automation Platform / Automation Controller (the enterprise product).

Key capabilities AWX adds over raw CLI:

  • RBAC — teams get permissions to specific job templates, not shell access to the control node.
  • Credential management — SSH keys, vault passwords, cloud credentials stored encrypted in the AWX database; never exposed to operators running jobs.
  • Job templates — a named, versioned combination of playbook + inventory + credentials + extra vars. Anyone with access can launch it; no CLI knowledge required.
  • Surveys — web forms that prompt operators for variables (e.g. target environment, version) before launching a job. Safe, auditable variable injection.
  • Workflow job templates — directed acyclic graphs (DAGs) of job templates: "run hardening, then deploy, then smoke tests; if smoke tests fail, run rollback."
  • Audit log — every job stores its full stdout, the user who launched it, timestamps, and outcome. Essential for compliance (SOC 2, PCI-DSS).
  • Scheduling — cron-driven execution for nightly compliance enforcement.

In practice, many large engineering teams treat AWX job templates as the only sanctioned way to run Ansible in production. Engineers who need a one-off run open a ticket that triggers a parameterised job template instead of connecting to the control node over SSH. This eliminates the "works on my laptop" problem and provides a complete change audit trail.
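
Launching a job template can itself be automated. The sketch below uses the job_launch module from the awx.awx collection. The template name and the version value are hypothetical, and it assumes that controller credentials are supplied through the collection's environment variables (such as CONTROLLER_HOST and CONTROLLER_OAUTH_TOKEN, whose names can vary between collection versions) and that the template is configured to accept extra variables at launch.

```yaml
- name: Launch a sanctioned deploy through the controller
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Launch the job template and wait for it to finish
      awx.awx.job_launch:
        job_template: "Deploy app"     # hypothetical template name
        extra_vars:
          version: "1.4.2"             # hypothetical value, normally a survey answer
        wait: true
        timeout: 1800                  # seconds to wait before giving up
      register: deploy_job

    - name: Report the job id and final status
      ansible.builtin.debug:
        msg: "Job {{ deploy_job.id }} finished with status {{ deploy_job.status }}"
```

The job still runs inside AWX, so RBAC, credential isolation, and the audit log apply exactly as they would for a launch from the web UI.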

Dynamic Inventory at Scale

Static inventory/hosts.ini files are unmaintainable beyond a few dozen hosts. At scale, inventory must be pulled dynamically from the source of truth — your cloud provider, CMDB, or service registry.

    # inventory/aws_ec2.yml: AWS dynamic inventory plugin
    plugin: amazon.aws.aws_ec2
    regions:
      - us-east-1
      - eu-west-1
    filters:
      tag:Env: production
      instance-state-name: running
    keyed_groups:
      - key: tags.Role
        prefix: role
      - key: placement.availability_zone
        prefix: az
    compose:
      # or private_ip_address for VPN-connected fleets
      ansible_host: public_ip_address
      ansible_user: "'ec2-user'"

    # Run: ansible-inventory -i inventory/aws_ec2.yml --graph
    # Abbreviated, illustrative output:
    # role_webserver
    #   |--10.0.1.5
    #   |--10.0.1.6
    # az_us_east_1a
    #   |--10.0.1.5
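
The groups generated by keyed_groups can be used directly in host patterns. As a small sketch, assuming the role_webserver and az_us_east_1a groups from the sample output above exist in your inventory, the following play targets only the web servers in one availability zone.

```yaml
- name: Patch web servers in a single availability zone
  hosts: "role_webserver:&az_us_east_1a"   # intersection of the two generated groups
  serial: "25%"
  tasks:
    - name: Confirm which hosts were selected and why
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} is in {{ group_names | join(', ') }}"
```

Because the groups are derived from tags and placement at run time, a newly launched instance with the right tags is picked up on the next run without any inventory edit.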

When Ansible Is the Wrong Tool: Immutable Infrastructure

Ansible excels at configuring mutable servers — machines that live long enough to warrant ongoing management. But the modern cloud trend is immutable infrastructure: bake a machine image once, deploy it, and replace rather than modify it when a change is needed.

Use Ansible for configuration management when:

  • You manage long-lived VMs (database servers, legacy on-prem nodes, bare metal).
  • Startup time matters and baking a new AMI for every change is too slow.
  • The system cannot be replaced without data migration (stateful services).

Prefer immutable images (Packer + Ansible, Docker, AMI-based ASG) when:

  • You run stateless application tiers — web servers, API nodes, workers.
  • You want identical behaviour in dev, staging, and prod (the image is the artefact).
  • Your threat model requires that production hosts have no SSH access whatsoever.
  • You already use Kubernetes or ECS — containers are the immutable unit; host config is minimal.

The best-practice hybrid: use Packer + Ansible to bake golden AMIs. Ansible configures the base OS hardening, common agents (CloudWatch, SSM), and runtime dependencies once during image build. The running fleet never receives Ansible pushes; it is replaced. Ansible still manages the small set of long-lived stateful hosts that cannot follow the immutable pattern.
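
A bake-time playbook for this hybrid might look like the sketch below. It assumes Packer's Ansible provisioner runs the play against the temporary build instance, and that the listed agent packages are available in the base image's repositories. The package names and the single hardening step are illustrative, not a complete baseline.

```yaml
- name: Bake the baseline into a golden image
  hosts: all        # assumption: the only host is Packer's temporary build instance
  become: true
  tasks:
    - name: Install common agents
      ansible.builtin.package:
        name:
          - amazon-cloudwatch-agent   # assumption: present in the image's package repos
          - amazon-ssm-agent
        state: present

    - name: Disable SSH password authentication as one hardening step
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?PasswordAuthentication'
        line: 'PasswordAuthentication no'
        validate: '/usr/sbin/sshd -t -f %s'   # reject the edit if sshd cannot parse it
```

No handler restarts sshd here because the instance is shut down and snapshotted after the build; the setting takes effect when instances boot from the image.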

Never run ansible-playbook in production without a change-management gate. An untested playbook can fire against thousands of hosts in seconds. Always test against a staging inventory group first, use --check (dry-run) + --diff to preview changes, and restrict production job template launch permissions to senior engineers or require a second approval in AWX.
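
One lightweight way to enforce the preview step inside a playbook is a guard task. This is a minimal sketch: the production group name and the change_approved variable are hypothetical, and a guard like this complements rather than replaces approval controls in AWX.

```yaml
- name: Guarded production change
  hosts: production            # hypothetical group name
  gather_facts: false
  tasks:
    - name: Refuse to apply changes unless this is a dry run or explicitly approved
      ansible.builtin.assert:
        that:
          - ansible_check_mode or (change_approved | default(false) | bool)
        fail_msg: "Preview with --check --diff first, then re-run with -e change_approved=true"
      run_once: true

    # ... the real change tasks would follow here ...
```

A run with --check --diff passes the guard and only previews changes. A real run fails immediately unless the operator passes the hypothetical change_approved variable.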

Performance Profiling

When a large playbook is slow, instrument it before tuning blindly. The profile_tasks, profile_roles, and timer callback plugins are part of the ansible.posix collection (bundled with the ansible community package) and add negligible overhead.

    # ansible.cfg: enable timing callbacks
    [defaults]
    callbacks_enabled = ansible.posix.profile_tasks, ansible.posix.profile_roles, ansible.posix.timer

    # After the run you will see output like:
    # ============================================================
    # gather_facts ----------------------------------- 48.23s
    # apt: install packages -------------------------- 32.11s
    # template: nginx.conf --------------------------- 4.87s
    # ============================================================
    # Total time: 85.21s

    # Quick wins once you know the slow tasks:
    # 1. Cache facts (fact_caching = redis, shown earlier)
    # 2. Use `async` + `poll` for long-running tasks (e.g. package installs)
    # 3. Raise forks
    # 4. Use the Mitogen strategy plugin (3-7x speedup claimed; keeps a
    #    persistent Python RPC channel to each host over a single SSH connection)
    #    pip install mitogen
    #    ansible.cfg:
    #      [defaults]
    #      strategy_plugins = /path/to/ansible_mitogen/plugins/strategy
    #      strategy = mitogen_linear

Some large fleets adopt the Mitogen strategy plugin for its speed improvement. It replaces the per-task SSH, shell, and Python bootstrap with a persistent Python channel to each host, cutting per-task overhead from roughly 200 ms to roughly 5 ms per host. The trade-off is an additional dependency, support that can lag behind new Ansible releases, and occasional compatibility issues with community modules that use unusual Python, so always test in staging first.
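
The async and poll quick win listed above can be sketched as follows. The example assumes Debian-family targets in the app_servers group; the timeout and retry values are illustrative and should be sized to your own slowest host.

```yaml
- name: Run a long package upgrade without holding the connection open
  hosts: app_servers
  become: true
  tasks:
    - name: Start the upgrade in the background
      ansible.builtin.apt:
        upgrade: dist
        update_cache: true
      async: 1800        # let the job run for up to 30 minutes
      poll: 0            # return immediately instead of waiting
      register: upgrade_job

    - name: Wait for the background upgrade to finish
      ansible.builtin.async_status:
        jid: "{{ upgrade_job.ansible_job_id }}"
      register: upgrade_result
      until: upgrade_result.finished
      retries: 60        # 60 checks, 30 seconds apart
      delay: 30
```

With poll: 0 the first task returns as soon as the job is started, so other tasks can run between the two steps and a dropped SSH session does not kill the upgrade.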