Configuration Management with Ansible

Project: Configure a Fleet


Everything you have built in this tutorial — inventory design, ad-hoc commands, playbooks, variables and Jinja2 templates, conditionals and loops, roles, Ansible Vault, and scaling strategies — converges in this capstone project. You will design and deploy a complete, production-grade Ansible codebase that configures three distinct server groups: web (NGINX reverse proxy), app (Node.js application), and database (PostgreSQL). This is the structure you will encounter at real companies running dozens to hundreds of nodes.

The goal is not just working configuration — it is maintainable, auditable, reusable code that a new team member can understand in thirty minutes and that survives the chaos of real production: rotating secrets, node additions, OS upgrades, and security remediation under pressure.

Project Layout: Roles-First Structure

Real fleets are organized around roles, not flat playbooks. The canonical structure below separates concerns cleanly: the site-level site.yml is the entry point; per-environment inventories isolate prod from staging; reusable roles live under roles/; secrets are encrypted in group_vars.

fleet-config/
├── ansible.cfg                    # Project-scoped defaults
├── site.yml                       # Master playbook: applies all roles
├── inventories/
│   ├── prod/
│   │   ├── hosts.ini              # Prod host definitions
│   │   └── group_vars/
│   │       ├── all/
│   │       │   ├── vars.yml       # Shared vars (non-secret)
│   │       │   └── vault.yml      # Encrypted secrets (ansible-vault)
│   │       ├── web/
│   │       │   └── vars.yml       # Web-group-specific vars
│   │       ├── app/
│   │       │   └── vars.yml
│   │       └── db/
│   │           └── vars.yml
│   └── staging/
│       └── ...                    # Mirrors prod layout
├── roles/
│   ├── base/                      # Applied to ALL hosts
│   │   ├── tasks/main.yml
│   │   ├── handlers/main.yml
│   │   └── templates/
│   │       └── sshd_config.j2
│   ├── web/
│   │   ├── tasks/main.yml
│   │   ├── handlers/main.yml
│   │   └── templates/
│   │       └── nginx.conf.j2
│   ├── app/
│   │   ├── tasks/main.yml
│   │   ├── handlers/main.yml
│   │   └── templates/
│   │       └── app.env.j2
│   └── db/
│       ├── tasks/main.yml
│       ├── handlers/main.yml
│       └── templates/
│           └── pg_hba.conf.j2
└── requirements.yml               # Galaxy collection dependencies

Why inventories/prod/group_vars/all/vault.yml instead of a root-level vault? Keeping the vault inside the inventory directory means running ansible-playbook -i inventories/prod site.yml automatically loads the correct secrets for that environment. A single root-level vault would be shared across all environments, making accidental cross-environment secret exposure a real risk. Per-inventory vault files eliminate the mistake at the structural level.
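A common companion to per-inventory vault files is to reference each encrypted value from the plaintext vars file, so that variable names stay searchable even though their values are encrypted. The sketch below assumes a vault_ prefix convention; the vault_-prefixed names and jwt_secret are illustrative and do not come from this project, while db_app_password is the variable the db role consumes.

```yaml
# inventories/prod/group_vars/all/vars.yml (plaintext, safe to grep)
# Assumption: every secret is stored in vault.yml under a "vault_" prefix
# and re-exported here under the name the roles actually use.
db_app_password: "{{ vault_db_app_password }}"
jwt_secret: "{{ vault_jwt_secret }}"          # hypothetical variable name

# inventories/prod/group_vars/all/vault.yml (encrypted with ansible-vault)
# would define the matching keys, for example:
#   vault_db_app_password: <secret value>
#   vault_jwt_secret: <secret value>
```

With this indirection, a search for db_app_password lands in vars.yml and points to the vault key that backs it, without anyone needing to decrypt the file first.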

Inventory: Grouping the Fleet

The hosts.ini file defines three server groups plus a compound group fleet that spans all of them. The compound group is used by the base role — applied to every node regardless of function.

# inventories/prod/hosts.ini
[web]
web-01.prod.example.com ansible_user=deploy
web-02.prod.example.com ansible_user=deploy

[app]
app-01.prod.example.com ansible_user=deploy
app-02.prod.example.com ansible_user=deploy
app-03.prod.example.com ansible_user=deploy

[db]
db-primary.prod.example.com    ansible_user=deploy db_role=primary
db-replica-01.prod.example.com ansible_user=deploy db_role=replica
db-replica-02.prod.example.com ansible_user=deploy db_role=replica

# Compound group: all managed hosts
[fleet:children]
web
app
db
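The groups defined above map one-to-one onto the group_vars directories in the project layout. As a rough sketch of what those files might hold, the fragment below defines the non-secret variables that the roles in this project reference. Every value shown is a placeholder assumption, not something the project prescribes.

```yaml
# inventories/prod/group_vars/all/vars.yml
# Shared by every tier: app_port is read by the web template and the app role.
app_port: 3000                       # assumption: port the Node.js app listens on

# inventories/prod/group_vars/web/vars.yml
web_server_name: www.example.com     # hypothetical public hostname

# inventories/prod/group_vars/app/vars.yml
app_user: nodeapp                    # hypothetical service account
app_dir: /opt/nodeapp                # hypothetical install directory
app_service_name: nodeapp            # hypothetical systemd unit name

# inventories/prod/group_vars/db/vars.yml
db_name: appdb                       # hypothetical database name
db_app_user: appuser                 # hypothetical database role
```

Because app_port is needed by two tiers, it sits under all/ rather than being duplicated in web/ and app/.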

The Base Role: Hardening Every Node

The base role enforces the OS-level baseline that every node in the fleet must share — regardless of whether it runs NGINX, Node.js, or PostgreSQL. This is where you enforce SSH hardening, auditd, sysctl tuning, NTP, and the shared monitoring agent. Applying a universal baseline via a compound group prevents the class of incident where a production database has a weaker SSH configuration than the web tier because someone forgot to apply the hardening playbook to it.

# roles/base/tasks/main.yml
---
- name: Ensure essential packages are installed
  ansible.builtin.package:
    name:
      - curl
      - vim
      - htop
      - auditd        # named "audit" on RHEL-family systems
      - chrony
      - fail2ban
    state: present

- name: Harden SSH daemon config
  ansible.builtin.template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: "0600"
    validate: /usr/sbin/sshd -t -f %s
  notify: Restart sshd

- name: Set sysctl parameters for network performance
  ansible.posix.sysctl:
    name: "{{ item.key }}"
    value: "{{ item.value }}"
    state: present
    reload: true
  loop:
    - { key: net.core.somaxconn, value: "65535" }
    - { key: net.ipv4.tcp_tw_reuse, value: "1" }
    - { key: vm.swappiness, value: "10" }

- name: Ensure auditd is running and enabled
  ansible.builtin.service:
    name: auditd
    state: started
    enabled: true

- name: Deploy deploy user authorized key
  ansible.posix.authorized_key:
    user: deploy
    key: "{{ deploy_ssh_public_key }}"
    exclusive: true   # removes any other keys from the deploy user's authorized_keys

# roles/base/handlers/main.yml
---
- name: Restart sshd
  ansible.builtin.service:
    name: sshd        # the unit is named "ssh" on Debian/Ubuntu
    state: restarted

The validate parameter on the sshd_config template is non-negotiable in production. Without it, a mistake or typo in the template can render a syntactically invalid sshd_config to disk, the handler restarts sshd, sshd refuses to start, and you are locked out of the node. The validate: /usr/sbin/sshd -t -f %s line runs sshd -t (config test) against the rendered file before moving it to /etc/ssh/sshd_config. If the test fails, the task fails and the old config stays in place. This single line prevents a whole class of production SSH lockouts.
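The base tasks install chrony and fail2ban but only start auditd explicitly. If the baseline is meant to guarantee time sync and brute-force protection as well, a task along the following lines could be appended to roles/base/tasks/main.yml. This is a sketch, not part of the original role, and the unit names are assumptions that vary by distribution.

```yaml
# Possible addition to roles/base/tasks/main.yml
- name: Ensure time sync and fail2ban are running and enabled
  ansible.builtin.service:
    name: "{{ item }}"
    state: started
    enabled: true
  loop:
    - chronyd      # assumption: RHEL-family unit name; Debian/Ubuntu use "chrony"
    - fail2ban
```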

The Web Role: NGINX Reverse Proxy

The web role installs NGINX and deploys a Jinja2-rendered virtual host configuration. The template uses group variables to populate upstream app server addresses dynamically — no hardcoded IPs in any template.

# roles/web/tasks/main.yml
---
- name: Install NGINX
  ansible.builtin.package:
    name: nginx
    state: present

- name: Remove default NGINX site
  ansible.builtin.file:
    path: /etc/nginx/sites-enabled/default
    state: absent
  notify: Reload nginx

# No validate: on this task. The rendered file is a fragment (upstream and
# server blocks only), so "nginx -t -c %s" would treat it as a complete
# configuration and always fail. The full config is tested further down.
- name: Deploy main NGINX config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/sites-available/app.conf
    owner: root
    group: root
    mode: "0644"
  notify: Reload nginx

- name: Enable app site
  ansible.builtin.file:
    src: /etc/nginx/sites-available/app.conf
    dest: /etc/nginx/sites-enabled/app.conf
    state: link
  notify: Reload nginx

- name: Test the complete NGINX configuration before any reload
  ansible.builtin.command: /usr/sbin/nginx -t
  changed_when: false

- name: Ensure NGINX is running and enabled
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true

# roles/web/handlers/main.yml
---
- name: Reload nginx
  ansible.builtin.service:
    name: nginx
    state: reloaded

The corresponding Jinja2 template builds the upstream block from the inventory, which is one of the most useful patterns in fleet configuration management. When you add or remove an app server in the inventory, the next playbook run regenerates the NGINX config; you never touch the template:

{# roles/web/templates/nginx.conf.j2 #}
upstream app_cluster {
{% for host in groups['app'] %}
    server {{ hostvars[host]['ansible_default_ipv4']['address'] }}:{{ app_port }};
{% endfor %}
}

server {
    listen 80;
    listen [::]:80;
    server_name {{ web_server_name }};

    location / {
        proxy_pass http://app_cluster;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;
    }

    location /health {
        access_log off;
        return 200 "ok\n";
        add_header Content-Type text/plain;
    }
}
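The template reads ansible_default_ipv4 from the app hosts, so those facts must exist even when the run targets only the web tier (for example with --limit web). One way to guarantee that, sketched below under the assumption that these tasks are placed in roles/web/tasks/main.yml ahead of the template task, is to gather facts by delegation and then assert on them. The task names and the failure message are illustrative.

```yaml
# Possible pre-flight tasks for roles/web/tasks/main.yml
- name: Gather network facts from every app host
  ansible.builtin.setup:
    gather_subset:
      - network
  delegate_to: "{{ item }}"
  delegate_facts: true          # store the facts on the app host, not the web host
  loop: "{{ groups['app'] }}"
  run_once: true

- name: Fail early if any app host lacks a default IPv4 address
  ansible.builtin.assert:
    that:
      - hostvars[item].ansible_default_ipv4 is defined
      - hostvars[item].ansible_default_ipv4.address is defined
    fail_msg: "No default IPv4 fact for {{ item }}; cannot render the upstream block."
  loop: "{{ groups['app'] }}"
  run_once: true
```

The assert turns a confusing template error into a message that names the host missing the fact.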

The App Role: Node.js Application Server

The app role installs Node.js, deploys the application from a versioned artifact, and manages the systemd unit that keeps it running. The application's secrets (database password, JWT secret) are injected at deploy time from Ansible Vault into a .env file that is never committed to source control.

# roles/app/tasks/main.yml
---
# Importing the key alone does not enable the NodeSource repository;
# the repository itself must also be configured on the host.
- name: Add NodeSource GPG key
  ansible.builtin.rpm_key:
    key: https://rpm.nodesource.com/gpgkey/ns-operations@nodesource.com.gpg.key
    state: present

- name: Install Node.js 20 LTS
  ansible.builtin.package:
    name: nodejs
    state: present

- name: Create app user
  ansible.builtin.user:
    name: "{{ app_user }}"
    system: true
    shell: /sbin/nologin
    create_home: false

- name: Create app directory
  ansible.builtin.file:
    path: "{{ app_dir }}"
    state: directory
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    mode: "0755"

- name: Deploy application environment file
  ansible.builtin.template:
    src: app.env.j2
    dest: "{{ app_dir }}/.env"
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    mode: "0600"   # secrets: readable and writable by the owner only
  notify: Restart app

- name: Deploy systemd unit
  ansible.builtin.template:
    src: app.service.j2
    dest: "/etc/systemd/system/{{ app_service_name }}.service"
    owner: root
    group: root
    mode: "0644"
  notify:
    - Reload systemd
    - Restart app

- name: Ensure app service is running and enabled
  ansible.builtin.service:
    name: "{{ app_service_name }}"
    state: started
    enabled: true

The .env template draws from Vault-encrypted variables. File mode 0600 restricts reading and writing to the file's owner (app_user) and root, so other unprivileged accounts and processes on the same host cannot read the secrets.
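The app tasks notify two handlers, Reload systemd and Restart app, that are not shown. A minimal sketch of the handlers file they imply follows; the handler names are taken from the notify lines, and the bodies are an assumption about what they would reasonably do.

```yaml
# roles/app/handlers/main.yml (sketch)
---
# Handlers run in the order they are defined in this file, not in the order
# they were notified, so the daemon reload is listed before the restart.
- name: Reload systemd
  ansible.builtin.systemd:
    daemon_reload: true

- name: Restart app
  ansible.builtin.service:
    name: "{{ app_service_name }}"
    state: restarted
```

Defining the reload first means a changed unit file is picked up by systemd before the service is restarted against it.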

The DB Role: Idempotent PostgreSQL Setup

The database role is the most sensitive. It must handle the primary/replica distinction (captured as a host variable in the inventory), initialize PostgreSQL only once (not on every playbook run), and configure pg_hba.conf to allow app-tier connections while blocking everything else.

# roles/db/tasks/main.yml
---
- name: Install PostgreSQL 16
  ansible.builtin.package:
    name:
      - postgresql16-server
      - postgresql16
    state: present

- name: Check if PostgreSQL is already initialized
  ansible.builtin.stat:
    path: /var/lib/pgsql/16/data/PG_VERSION
  register: pg_initialized

- name: Initialize PostgreSQL (first run only)
  ansible.builtin.command: /usr/pgsql-16/bin/postgresql-16-setup initdb
  when: not pg_initialized.stat.exists
  notify: Start postgresql

- name: Deploy pg_hba.conf
  ansible.builtin.template:
    src: pg_hba.conf.j2
    dest: /var/lib/pgsql/16/data/pg_hba.conf
    owner: postgres
    group: postgres
    mode: "0600"
  notify: Reload postgresql

- name: Ensure postgresql is running and enabled
  ansible.builtin.service:
    name: postgresql-16
    state: started
    enabled: true

- name: Create application database
  community.postgresql.postgresql_db:
    name: "{{ db_name }}"
    state: present
  become_user: postgres
  when: db_role == 'primary'

- name: Create application database user
  community.postgresql.postgresql_user:
    name: "{{ db_app_user }}"
    password: "{{ db_app_password }}"   # from vault.yml
    state: present
  become_user: postgres
  no_log: true                          # keep the password out of task output
  when: db_role == 'primary'

# Database-level grants go through postgresql_privs; a MySQL-style
# "dbname.*:ALL" privilege string is not valid for PostgreSQL.
- name: Grant the app user all privileges on the application database
  community.postgresql.postgresql_privs:
    login_db: "{{ db_name }}"
    roles: "{{ db_app_user }}"
    type: database
    privs: ALL
    state: present
  become_user: postgres
  when: db_role == 'primary'
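The db tasks notify Start postgresql and Reload postgresql, so the role needs a handlers file along the lines sketched below. The same sketch also shows an alternative to the stat-plus-when guard: the command module's creates argument skips the task once the named file exists. Both parts are illustrative, assuming the same PGDG paths and unit name used in the tasks above.

```yaml
# roles/db/handlers/main.yml (sketch)
---
- name: Start postgresql
  ansible.builtin.service:
    name: postgresql-16
    state: started

- name: Reload postgresql
  ansible.builtin.service:
    name: postgresql-16
    state: reloaded

# Alternative for roles/db/tasks/main.yml: one task instead of stat + when.
# "creates" makes the command a no-op once PG_VERSION exists.
# - name: Initialize PostgreSQL (first run only)
#   ansible.builtin.command:
#     cmd: /usr/pgsql-16/bin/postgresql-16-setup initdb
#     creates: /var/lib/pgsql/16/data/PG_VERSION
#   notify: Start postgresql
```

Handlers run at the end of the play, so the "Ensure postgresql is running and enabled" task is what actually starts the server on a first run; the Start postgresql handler is then a harmless no-op.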

The Master Playbook: Bringing It All Together

The site.yml orchestrates everything. It applies the base role to the entire fleet group first, then applies tier-specific roles in separate plays. The ordering matters: the base hardening must be in place before any service is installed. Note that the plays below configure the database tier last; if the application must be able to reach the database as soon as it starts, move the db play ahead of the app play.

# site.yml
---
- name: Apply baseline hardening to all fleet nodes
  hosts: fleet
  become: true
  gather_facts: true
  roles:
    - base

- name: Configure web tier (NGINX reverse proxy)
  hosts: web
  become: true
  gather_facts: true
  roles:
    - web

- name: Configure app tier (Node.js application)
  hosts: app
  become: true
  gather_facts: true
  roles:
    - app

- name: Configure database tier (PostgreSQL)
  hosts: db
  become: true
  gather_facts: true
  roles:
    - db

[Figure: role-based architecture for the web, app, and database tiers. The control node runs ansible-playbook site.yml against three tiers: web (roles base + web; web-01, web-02; NGINX reverse proxy), app (roles base + app; app-01 to app-03; Node.js workers), and db (roles base + db; one read-write primary and two read-only replicas; PostgreSQL 16). Internet traffic arrives over HTTP 80/443, NGINX proxies to the app tier on port 3000, and the app tier connects to PostgreSQL on port 5432. Plays in site.yml: play 1 applies the base role to fleet, play 2 the web role, play 3 the app role, play 4 the db role.]
Fleet architecture: a single site.yml applies the base role to all nodes, then delegates tier-specific roles to web, app, and db groups in sequence.
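If the application needs the database to be reachable when it starts, the app play can check that explicitly instead of relying on play order alone. The sketch below adds a pre_tasks check to the app play; db_primary_host is a hypothetical variable (for example, set in group_vars to the primary's hostname), and port 5432 is the PostgreSQL default.

```yaml
# Variant of the app play in site.yml (sketch)
- name: Configure app tier (Node.js application)
  hosts: app
  become: true
  gather_facts: true
  pre_tasks:
    - name: Wait until the primary database accepts TCP connections
      ansible.builtin.wait_for:
        host: "{{ db_primary_host }}"   # hypothetical variable, not defined in this project
        port: 5432
        timeout: 30                     # seconds before the task fails
  roles:
    - app
```

The check runs from each app host, so it also verifies that the network path and pg_hba.conf-independent port access between the tiers are in place. It is only meaningful when the db play has already run, either earlier in site.yml or in a previous deployment.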

Running the Fleet: Full Deploy Workflow

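With the codebase complete, the deployment workflow follows a strict three-stage pattern: validate, preview, apply. Never run site.yml against production without a dry-run pass. The last two commands below are targeted variants of the same workflow for single-tier changes and incident triage.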

# 1. Install Galaxy collection dependencies first
ansible-galaxy collection install -r requirements.yml

# 2. Syntax check: catches YAML and playbook structure errors before touching hosts
#    (variables are not evaluated, so undefined variables are not caught here)
ansible-playbook -i inventories/prod site.yml --syntax-check

# 3. Dry-run against ALL prod hosts; review every "changed" task
ansible-playbook -i inventories/prod site.yml \
  --ask-vault-pass \
  --check --diff

# 4. Live run: apply to staging first, then prod
ansible-playbook -i inventories/staging site.yml --ask-vault-pass
ansible-playbook -i inventories/prod site.yml --ask-vault-pass

# 5. Target a single tier for incremental changes (e.g., NGINX config update)
#    --tags web only selects tasks that carry that tag; with --limit web,
#    facts for the app hosts must come from a fact cache or a delegated setup task
ansible-playbook -i inventories/prod site.yml \
  --ask-vault-pass \
  --limit web \
  --tags web

# 6. Emergency limit: target a single host during incident triage
ansible-playbook -i inventories/prod site.yml \
  --ask-vault-pass \
  --limit web-01.prod.example.com \
  --tags web \
  --check --diff
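Step 1 reads requirements.yml, which the project layout lists but does not show. Based only on the modules used in the roles above (ansible.posix.sysctl, ansible.posix.authorized_key, and the community.postgresql modules), a minimal version might look like the following. Version pins are deliberately left out here and would normally be added for reproducible installs.

```yaml
# requirements.yml (sketch)
---
collections:
  - name: ansible.posix           # sysctl, authorized_key
  - name: community.postgresql    # postgresql_db, postgresql_user
```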

Production Failure Modes and How to Prevent Them

Fleet-scale Ansible runs expose failure modes you will not encounter in single-host tests. Each one below is a real production incident class, with the mitigation built into this project's structure.

  • Partial fleet apply on network error. With the default linear strategy, Ansible runs each task across all targeted hosts in parallel, bounded by the forks setting. A transient SSH timeout on three nodes marks them UNREACHABLE and drops them from the rest of the play, while the other nodes are still configured. Use serial: "20%" in high-blast-radius plays so failures are bounded to one batch, set max_fail_percentage so a bad batch stops the rollout, and check the failed and unreachable host counts in AWX before re-running.
  • Vault password mismatch between environments. Staging and prod vaults use different passwords. If --vault-password-file points to the wrong file, decryption fails with an error rather than yielding a wrong value, but the message does not tell you which environment's password you supplied. Use separate --vault-id labels: --vault-id prod@~/.vault-prod ties each password source to a named vault ID, so it is explicit which password belongs to which environment, provided the vault files were encrypted with the same label.
  • Template rendering failing on missing facts. The Jinja2 {% for host in groups['app'] %} loop reads hostvars[host]['ansible_default_ipv4']['address'], which exists only if facts were gathered for the app hosts in the current run or are available from a fact cache. If the app hosts were skipped, for example under --limit web or with gather_facts: false, the template task fails with an undefined-variable error; if groups['app'] is empty, the loop renders an empty upstream block, which NGINX rejects. Gather facts for every host whose hostvars a template reads, and add an assert task that checks those facts are defined before rendering.
  • Handler not firing after a failed task. Handlers only fire if the play completes without failure on that host. A task failure mid-play suppresses all pending handlers — meaning a template was deployed but the service was never reloaded. Use force_handlers: true at the play level for service-reload handlers to ensure a best-effort reload even on partial failure, and monitor service health separately.
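Several of the mitigations above are play-level keywords. As a sketch of how they combine, the web play from site.yml could be written as follows. The 20% batch size comes from the text; the max_fail_percentage value is an illustrative choice, not a recommendation from the original project.

```yaml
# Variant of the web play in site.yml (sketch)
- name: Configure web tier (NGINX reverse proxy)
  hosts: web
  become: true
  gather_facts: true
  serial: "20%"              # roll out in batches of 20% of the group (at least one host)
  max_fail_percentage: 0     # illustrative: abort the rollout if any host in a batch fails
  force_handlers: true       # run already-notified handlers even if a later task fails
  roles:
    - web
```

With only two web hosts, a 20% batch works out to one host at a time, so a failure on web-01 stops the play before web-02 is touched.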
Tag every task and role for surgical targeting. Tag the base role tasks with tags: [base, hardening], web tasks with tags: [web, nginx], and so on. In production you will almost never re-run the full site.yml — you will run --tags nginx to push an NGINX config change, or --tags hardening to respond to a CVE. Tagging is what separates an Ansible codebase you can use confidently in production from one that terrifies your team because every change requires running everything.
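One low-effort way to apply the tagging scheme described above is to tag roles where they are listed in site.yml; a tag set on a role entry is inherited by every task in that role. The sketch below shows two of the four plays with the tag names mentioned in the text.

```yaml
# site.yml excerpt with role-level tags (sketch)
- name: Apply baseline hardening to all fleet nodes
  hosts: fleet
  become: true
  roles:
    - role: base
      tags: [base, hardening]

- name: Configure web tier (NGINX reverse proxy)
  hosts: web
  become: true
  roles:
    - role: web
      tags: [web, nginx]
```

Finer-grained tags, such as one tag per config file, would still need to be set on individual tasks inside the roles.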

Continuous Enforcement: Scheduling via AWX

In production, manual ansible-playbook runs are for one-off changes and incident response. Continuous baseline enforcement is handled by scheduling the full site.yml run every 30 minutes in AWX. Every run is logged, its output is searchable, and failed hosts trigger PagerDuty alerts via AWX's notification system. This approximates the drift-correction guarantees of a pull-based tool — without the operational complexity of managing a Puppet infrastructure.
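Schedules are usually created in the AWX web interface, but they can also be managed as code. The sketch below uses the schedule module from the awx.awx collection and assumes that a job template running site.yml already exists, that the collection is installed, and that controller connection details are supplied separately (for example through environment variables). The schedule name, job template name, and start date are hypothetical.

```yaml
# Sketch: manage the 30-minute enforcement schedule as code
- name: Schedule continuous baseline enforcement in AWX
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Run the fleet job template every 30 minutes
      awx.awx.schedule:
        name: fleet-baseline-every-30-min         # hypothetical schedule name
        unified_job_template: fleet-site-prod     # hypothetical job template name
        rrule: "DTSTART:20250101T000000Z RRULE:FREQ=MINUTELY;INTERVAL=30"
        state: present
```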

The playbook you have built in this project is the complete pattern used by infrastructure teams at scale. From this foundation you can extend it: add a monitoring role that deploys the Prometheus node exporter, a logship role that configures Fluent Bit, a certs role that rotates TLS certificates via Vault PKI — all following the same role structure, the same inventory convention, and the same dry-run-first discipline.