Configuration Management with Ansible

Project: Configure a Fleet

Everything you have built in this tutorial — inventory design, ad-hoc commands, playbooks, variables and Jinja2 templates, conditionals and loops, roles, Ansible Vault, and scaling strategies — converges in this capstone project. You will design and deploy a complete, production-grade Ansible codebase that configures three distinct server groups: web (NGINX reverse proxy), app (Node.js application), and database (PostgreSQL). This is the structure you will encounter at real companies running dozens to hundreds of nodes.

The goal is not just working configuration — it is maintainable, auditable, reusable code that a new team member can understand in thirty minutes and that survives the chaos of real production: rotating secrets, node additions, OS upgrades, and security remediation under pressure.

Project Layout: Roles-First Structure

Real fleets are organized around roles, not flat playbooks. The canonical structure below separates concerns cleanly: the site-level site.yml is the entry point; per-environment inventories isolate prod from staging; reusable roles live under roles/; secrets are encrypted in group_vars.

fleet-config/
├── ansible.cfg                     # Project-scoped defaults
├── site.yml                        # Master playbook — applies all roles
├── inventories/
│   ├── prod/
│   │   ├── hosts.ini               # Prod host definitions
│   │   └── group_vars/
│   │       ├── all/
│   │       │   ├── vars.yml        # Shared vars (non-secret)
│   │       │   └── vault.yml       # Encrypted secrets (ansible-vault)
│   │       ├── web/
│   │       │   └── vars.yml        # Web-group-specific vars
│   │       ├── app/
│   │       │   └── vars.yml
│   │       └── db/
│   │           └── vars.yml
│   └── staging/
│       └── ...                     # Mirrors prod layout
├── roles/
│   ├── base/                       # Applied to ALL hosts
│   │   ├── tasks/main.yml
│   │   ├── handlers/main.yml
│   │   └── templates/
│   │       └── sshd_config.j2
│   ├── web/
│   │   ├── tasks/main.yml
│   │   ├── handlers/main.yml
│   │   └── templates/
│   │       └── nginx.conf.j2
│   ├── app/
│   │   ├── tasks/main.yml
│   │   ├── handlers/main.yml
│   │   └── templates/
│   │       ├── app.env.j2
│   │       └── app.service.j2
│   └── db/
│       ├── tasks/main.yml
│       ├── handlers/main.yml
│       └── templates/
│           └── pg_hba.conf.j2
└── requirements.yml                # Galaxy collection dependencies
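
The requirements.yml at the bottom of the tree can be sketched as follows. This is a minimal version, assuming the only non-builtin modules are the ones the roles in this project call; version pins are omitted because the original does not specify any.

```yaml
# requirements.yml (sketch: lists only the collections used by this project's roles)
---
collections:
  - name: ansible.posix          # sysctl and authorized_key modules (base role)
  - name: community.postgresql   # postgresql_db and postgresql_user modules (db role)
```

The community.postgresql modules also need a PostgreSQL Python driver such as psycopg2 on the managed database hosts; installing that package is not shown here.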

Why inventories/prod/group_vars/all/vault.yml instead of a root-level vault? Keeping the vault inside the inventory directory means running ansible-playbook -i inventories/prod site.yml automatically loads the correct secrets for that environment. A single root-level vault would be shared across all environments, making accidental cross-environment secret exposure a real risk. Per-inventory vault files eliminate the mistake at the structural level.
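
A common way to organize the vars.yml and vault.yml pair is to keep every variable name visible in the plain file and point it at a vault_-prefixed twin in the encrypted file. The sketch below assumes that convention; the vault_ names and all values are hypothetical placeholders, not taken from the project.

```yaml
# inventories/prod/group_vars/all/vars.yml (plain text, safe to grep and review)
---
app_port: 3000                                   # hypothetical value
db_name: appdb                                   # hypothetical value
db_app_user: app                                 # hypothetical value
db_app_password: "{{ vault_db_app_password }}"   # indirection into the encrypted file

# inventories/prod/group_vars/all/vault.yml (shown decrypted; encrypt it with ansible-vault)
---
vault_db_app_password: "change-me"               # placeholder, never a real secret
```

With this layout a reviewer can see which variables exist and where they come from without decrypting anything, and roles keep referring to db_app_password as they do in the tasks shown in this lesson.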

Inventory: Grouping the Fleet

The hosts.ini file defines three server groups plus a compound group fleet that spans all of them. The compound group is used by the base role — applied to every node regardless of function.

# inventories/prod/hosts.ini

[web]
web-01.prod.example.com ansible_user=deploy
web-02.prod.example.com ansible_user=deploy

[app]
app-01.prod.example.com ansible_user=deploy
app-02.prod.example.com ansible_user=deploy
app-03.prod.example.com ansible_user=deploy

[db]
db-primary.prod.example.com  ansible_user=deploy  db_role=primary
db-replica-01.prod.example.com ansible_user=deploy db_role=replica
db-replica-02.prod.example.com ansible_user=deploy db_role=replica

# Compound group: all managed hosts
[fleet:children]
web
app
db

The Base Role: Hardening Every Node

The base role enforces the OS-level baseline that every node in the fleet must share — regardless of whether it runs NGINX, Node.js, or PostgreSQL. This is where you enforce SSH hardening, auditd, sysctl tuning, NTP, and the shared monitoring agent. Applying a universal baseline via a compound group prevents the class of incident where a production database has a weaker SSH configuration than the web tier because someone forgot to apply the hardening playbook to it.

# roles/base/tasks/main.yml
---
- name: Ensure essential packages are installed
  ansible.builtin.package:
    name:
      - curl
      - vim
      - htop
      - auditd       # Debian/Ubuntu name; the RHEL-family package is "audit"
      - chrony
      - fail2ban
    state: present

- name: Harden SSH daemon config
  ansible.builtin.template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: "0600"
    validate: /usr/sbin/sshd -t -f %s
  notify: Restart sshd

- name: Set sysctl parameters for network and memory tuning
  ansible.posix.sysctl:
    name: "{{ item.key }}"
    value: "{{ item.value }}"
    state: present
    reload: true
  loop:
    - { key: net.core.somaxconn,       value: "65535" }
    - { key: net.ipv4.tcp_tw_reuse,    value: "1" }
    - { key: vm.swappiness,             value: "10" }

- name: Ensure auditd is running and enabled
  ansible.builtin.service:
    name: auditd
    state: started
    enabled: true

- name: Deploy deploy user authorized key
  ansible.posix.authorized_key:
    user: deploy
    key: "{{ deploy_ssh_public_key }}"
    exclusive: true

# roles/base/handlers/main.yml
---
- name: Restart sshd
  ansible.builtin.service:
    name: sshd
    state: restarted

The validate parameter on the sshd_config template is non-negotiable in production. Without it, a Jinja2 rendering error or a typo in the template deploys a syntactically invalid sshd_config to disk, the handler restarts sshd, sshd refuses to start, and you are locked out of the node. The validate: /usr/sbin/sshd -t -f %s line runs sshd -t (config test) against the rendered file before moving it to /etc/ssh/sshd_config. If the test fails, the task fails and the old config stays in place. Note that the test only catches syntax errors: a valid config that locks out the deploy user (for example through an overly strict AllowUsers line) still passes, so review the rendered diff as well.

The Web Role: NGINX Reverse Proxy

The web role installs NGINX and deploys a Jinja2-rendered virtual host configuration. The template builds the upstream server list from the inventory's app group and each host's gathered facts, with the port and server name supplied by group variables, so no IP address is hardcoded in any template.

# roles/web/tasks/main.yml
---
- name: Install NGINX
  ansible.builtin.package:
    name: nginx
    state: present

- name: Remove default NGINX site
  ansible.builtin.file:
    path: /etc/nginx/sites-enabled/default
    state: absent
  notify: Reload nginx

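# sites-available/sites-enabled is the Debian/Ubuntu layout; on RHEL-family
# hosts the equivalent drop-in directory is /etc/nginx/conf.d/.
- name: Deploy app site config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/sites-available/app.conf
    owner: root
    group: root
    mode: "0644"
    # No validate here: this file is an http-context fragment (upstream + server),
    # so "nginx -t -c %s" would reject it when loaded as a standalone main config.
  notify: Reload nginx

- name: Enable app site
  ansible.builtin.file:
    src: /etc/nginx/sites-available/app.conf
    dest: /etc/nginx/sites-enabled/app.conf
    state: link
  notify: Reload nginx

# Test the complete configuration; a failure here stops the play for this host
# before the reload handler runs.
- name: Validate the full NGINX configuration
  ansible.builtin.command: /usr/sbin/nginx -t
  changed_when: false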

- name: Ensure NGINX is running and enabled
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true

# roles/web/handlers/main.yml
---
- name: Reload nginx
  ansible.builtin.service:
    name: nginx
    state: reloaded

The corresponding Jinja2 template builds the upstream block from the live inventory, one of the most useful patterns in fleet configuration management. When you add or remove an app server in the inventory, the next playbook run regenerates the NGINX configuration; you never touch the template:

{# roles/web/templates/nginx.conf.j2 #}
upstream app_cluster {
{% for host in groups['app'] %}
    server {{ hostvars[host]['ansible_default_ipv4']['address'] }}:{{ app_port }};
{% endfor %}
}

server {
    listen 80;
    listen [::]:80;
    server_name {{ web_server_name }};

    location / {
        proxy_pass         http://app_cluster;
        proxy_set_header   Host              $host;
        proxy_set_header   X-Real-IP         $remote_addr;
        proxy_set_header   X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header   X-Forwarded-Proto $scheme;
        proxy_connect_timeout 5s;
        proxy_read_timeout    60s;
    }

    location /health {
        access_log off;
        default_type text/plain;
        return 200 "ok\n";
    }
}
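
One caveat with the template above: hostvars only holds facts for hosts that were contacted during the current run (or restored from a fact cache). A run restricted to the web group never gathers facts from the app hosts, so the loop would reference undefined values. The sketch below is one possible way to handle that, assuming the tasks are added to the web role ahead of the template task; the task names and their placement are hypothetical.

```yaml
# roles/web/tasks/main.yml (hypothetical addition, placed before the template task)
- name: Gather network facts from app hosts that have none yet
  ansible.builtin.setup:
    gather_subset:
      - network                       # includes ansible_default_ipv4
  delegate_to: "{{ item }}"
  delegate_facts: true                # store the facts under the app host, not the web host
  loop: "{{ groups['app'] }}"
  when: hostvars[item]['ansible_default_ipv4'] is not defined

- name: Fail early if any app host still lacks a default IPv4 fact
  ansible.builtin.assert:
    that:
      - hostvars[item]['ansible_default_ipv4'] is defined
    fail_msg: "No network facts for {{ item }}; the upstream block cannot be rendered"
  loop: "{{ groups['app'] }}"
```

Delegated fact gathering connects to each app host from the control node, so those hosts must be reachable even when the run is limited to the web tier. Fact caching is an alternative when that is not acceptable.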

The App Role: Node.js Application Server

The app role installs Node.js, deploys the application from a versioned artifact, and manages the systemd unit that keeps it running. The application's secrets (database password, JWT secret) are injected at deploy time from Ansible Vault into an .env file that is never committed to source control.

# roles/app/tasks/main.yml
---
- name: Add NodeSource GPG key
  ansible.builtin.rpm_key:
    key: https://rpm.nodesource.com/gpgkey/ns-operations@nodesource.com.gpg.key
    state: present

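# Assumes the NodeSource 20.x repository is already defined on the host.
# Importing the GPG key alone does not add a repository, so without one this
# task installs whatever nodejs version the distribution ships.
- name: Install Node.js 20 LTS
  ansible.builtin.package:
    name: nodejs
    state: present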

- name: Create app user
  ansible.builtin.user:
    name: "{{ app_user }}"
    system: true
    shell: /sbin/nologin
    create_home: false

- name: Create app directory
  ansible.builtin.file:
    path: "{{ app_dir }}"
    state: directory
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    mode: "0755"

- name: Deploy application environment file
  ansible.builtin.template:
    src: app.env.j2
    dest: "{{ app_dir }}/.env"
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    mode: "0600"   # secrets: owner-read only
  notify: Restart app

- name: Deploy systemd unit
  ansible.builtin.template:
    src: app.service.j2
    dest: /etc/systemd/system/{{ app_service_name }}.service
    owner: root
    group: root
    mode: "0644"
  notify:
    - Reload systemd
    - Restart app

- name: Ensure app service is running and enabled
  ansible.builtin.service:
    name: "{{ app_service_name }}"
    state: started
    enabled: true
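
The tasks above notify two handlers, Reload systemd and Restart app, that are not listed in the lesson. A minimal sketch of roles/app/handlers/main.yml, assuming the handler names match the notify entries exactly:

```yaml
# roles/app/handlers/main.yml (sketch)
---
- name: Reload systemd
  ansible.builtin.systemd:
    daemon_reload: true        # re-read unit files after app.service.j2 changes

- name: Restart app
  ansible.builtin.service:
    name: "{{ app_service_name }}"
    state: restarted
```

Handlers run in the order they are defined in this file, not in the order they are notified, so Reload systemd is listed first to make sure the new unit file is loaded before the restart.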

The .env template draws from Vault-encrypted variables. Mode 0600 with app_user as owner means only that user (and root) can read the file, so other unprivileged accounts and processes on the same host cannot read the secrets.

The DB Role: Idempotent PostgreSQL Setup

The database role is the most sensitive. It must handle the primary/replica distinction (captured as a host variable in the inventory), initialize PostgreSQL only once (not on every playbook run), and configure pg_hba.conf to allow app-tier connections while blocking everything else.

# roles/db/tasks/main.yml
---
- name: Install PostgreSQL 16
  ansible.builtin.package:
    name:
      - postgresql16-server
      - postgresql16
    state: present

- name: Check if PostgreSQL is already initialized
  ansible.builtin.stat:
    path: /var/lib/pgsql/16/data/PG_VERSION
  register: pg_initialized

- name: Initialize PostgreSQL (first run only)
  ansible.builtin.command: /usr/pgsql-16/bin/postgresql-16-setup initdb
  when: not pg_initialized.stat.exists
  notify: Start postgresql

- name: Deploy pg_hba.conf
  ansible.builtin.template:
    src: pg_hba.conf.j2
    dest: /var/lib/pgsql/16/data/pg_hba.conf
    owner: postgres
    group: postgres
    mode: "0600"
  notify: Reload postgresql

- name: Ensure postgresql is running and enabled
  ansible.builtin.service:
    name: postgresql-16
    state: started
    enabled: true

- name: Create application database
  community.postgresql.postgresql_db:
    name: "{{ db_name }}"
    state: present
  become_user: postgres
  when: db_role == 'primary'

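- name: Create application database user
  community.postgresql.postgresql_user:
    name: "{{ db_app_user }}"
    password: "{{ db_app_password }}"    # from vault.yml
    state: present
  become_user: postgres
  no_log: true                           # keep the password out of task output
  when: db_role == 'primary'

# Privileges are granted separately: "db.*:ALL" is MySQL syntax, not a valid
# PostgreSQL privilege string.
- name: Grant the application user all privileges on the application database
  community.postgresql.postgresql_privs:
    database: "{{ db_name }}"            # also accepted as login_db
    type: database
    roles: "{{ db_app_user }}"
    privs: ALL
  become_user: postgres
  when: db_role == 'primary'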
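
The db tasks notify Start postgresql and Reload postgresql, which the lesson does not list. A minimal sketch of roles/db/handlers/main.yml, assuming the same postgresql-16 service name used in the tasks:

```yaml
# roles/db/handlers/main.yml (sketch)
---
- name: Start postgresql
  ansible.builtin.service:
    name: postgresql-16
    state: started

- name: Reload postgresql
  ansible.builtin.service:
    name: postgresql-16
    state: reloaded            # pg_hba.conf changes take effect on reload, no restart needed
```

Using a reload rather than a restart for pg_hba.conf changes keeps existing client connections open while the new access rules are applied.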

The Master Playbook: Bringing It All Together

The site.yml orchestrates everything. It applies the base role to the entire fleet group first, then applies tier-specific roles in separate plays. The ordering matters: base hardening must be in place before any service is installed, the database must be reachable before the app servers start, and facts for the app hosts must already be gathered (the base play does this) before the web tier renders its upstream block. Each role carries tags so that a single tier can be targeted with --tags.

# site.yml
---
- name: Apply baseline hardening to all fleet nodes
  hosts: fleet
  become: true
  gather_facts: true
  roles:
    - role: base
      tags: [base, hardening]

- name: Configure database tier (PostgreSQL)
  hosts: db
  become: true
  gather_facts: true
  roles:
    - role: db
      tags: [db, postgresql]

- name: Configure app tier (Node.js application)
  hosts: app
  become: true
  gather_facts: true
  roles:
    - role: app
      tags: [app, nodejs]

- name: Configure web tier (NGINX reverse proxy)
  hosts: web
  become: true
  gather_facts: true
  roles:
    - role: web
      tags: [web, nginx]

Fleet architecture: a single site.yml applies the base role to all nodes, then applies tier-specific roles to the db, app, and web groups in sequence.

Running the Fleet: Full Deploy Workflow

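With the codebase complete, the deployment workflow follows a strict three-stage pattern: validate, preview, apply. Never run site.yml against production without a dry-run pass, and remember that check mode skips command tasks such as the PostgreSQL initdb step, so a clean dry run narrows the risk without eliminating it.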

# 1. Install Galaxy collection dependencies first
ansible-galaxy collection install -r requirements.yml

# 2. Syntax check: catches YAML and playbook structure errors before touching hosts
#    (it does not evaluate variables, so undefined variables are not detected here)
ansible-playbook -i inventories/prod site.yml --syntax-check

# 3. Dry-run against ALL prod hosts — review every "changed" task
ansible-playbook -i inventories/prod site.yml \
  --ask-vault-pass \
  --check --diff

# 4. Live run — apply to staging first, then prod
ansible-playbook -i inventories/staging site.yml --ask-vault-pass
ansible-playbook -i inventories/prod site.yml --ask-vault-pass

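# 5. Target a single tier for incremental changes (e.g., NGINX config update).
#    --tags only selects tasks or roles that carry the tag, and the NGINX template
#    still needs facts for the app hosts (from a fact cache or delegated gathering).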
ansible-playbook -i inventories/prod site.yml \
  --ask-vault-pass \
  --limit web \
  --tags web

# 6. Emergency limit: target a single host during incident triage
ansible-playbook -i inventories/prod site.yml \
  --ask-vault-pass \
  --limit web-01.prod.example.com \
  --tags web \
  --check --diff

Production Failure Modes and How to Prevent Them

Fleet-scale Ansible runs expose failure modes you will not encounter in single-host tests. Each one below is a real production incident class, with the mitigation built into this project's structure.

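Partial fleet apply on network error. With the default linear strategy, Ansible runs each task across all hosts in the play in parallel, up to forks hosts at a time. A transient SSH timeout on three nodes marks them UNREACHABLE and drops them from the rest of the run, while the other nodes are configured normally. Set serial: "20%" on high-blast-radius plays so changes roll out in bounded batches, optionally with max_fail_percentage so a bad batch aborts the play, and check the play recap (or the failed-host count on the job in AWX) before treating the run as complete.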
Vault password mismatch between environments. Staging and prod vaults use different passwords. If --vault-password-file points at the wrong file, Ansible cannot decrypt the vault and the run aborts with a decryption error; vault payloads are integrity-checked, so a wrong password never yields garbage values. The failure is loud but easy to misdiagnose when several password files are in play. Use separate --vault-id labels: --vault-id prod@~/.vault-prod makes the password source explicit on the command line, and files encrypted with a label record it in their header.
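Template rendering failing on missing facts. The {% for host in groups['app'] %} loop reads hostvars[host]['ansible_default_ipv4']['address'], which exists only if facts have been gathered for every app host during the current run or are available from a fact cache. The full site.yml satisfies this because the base play gathers facts across the fleet first, but a run restricted with --limit web, or one where fact gathering is disabled for the plays covering the app hosts, leaves those facts undefined and the template task fails with an undefined-variable error. Gather facts for every host a template references, enable fact caching if you routinely use --limit, and add an assert task that checks the required facts are defined before rendering.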
Handler not firing after a failed task. By default, a task failure on a host stops the play for that host and its pending handlers are skipped, so a template can be deployed without the service ever being reloaded. The next run does not repair this on its own: the template task reports "ok" because the file is already in place, so the handler is never notified again. Set force_handlers: true at the play level (or pass --force-handlers) so notified handlers still run on hosts that failed, and monitor service health separately.
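
The play-level settings named in the failure modes above can be sketched together as follows. This is an illustrative excerpt applied to the app play; the specific values are assumptions, not recommendations from the original project.

```yaml
# site.yml (excerpt): hypothetical rollout settings on the app play
- name: Configure app tier (Node.js application)
  hosts: app
  become: true
  gather_facts: true
  serial: "20%"              # roll out in batches (always at least one host per batch)
  max_fail_percentage: 0     # abort the play as soon as any host in a batch fails
  force_handlers: true       # run already-notified handlers even on hosts that failed
  roles:
    - app
```

With serial set, handlers are flushed at the end of each batch, so a broken change surfaces on the first batch before it reaches the remaining hosts.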

Tag every task and role for surgical targeting. Tag the base role tasks with tags: [base, hardening], web tasks with tags: [web, nginx], and so on. In production you will almost never re-run the full site.yml — you will run --tags nginx to push an NGINX config change, or --tags hardening to respond to a CVE. Tagging is what separates an Ansible codebase you can use confidently in production from one that terrifies your team because every change requires running everything.
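
As a sketch of the tagging advice above: tags can be attached to individual tasks inside a role, and a tag placed on a role entry in a play is inherited by every task in that role. The excerpt repeats the SSH hardening task from the base role; the extra ssh tag is a hypothetical finer-grained addition.

```yaml
# roles/base/tasks/main.yml (excerpt): task-level tags
- name: Harden SSH daemon config
  ansible.builtin.template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: "0600"
    validate: /usr/sbin/sshd -t -f %s
  notify: Restart sshd
  tags: [base, hardening, ssh]   # "ssh" is a hypothetical tag for illustration
```

Running ansible-playbook with --list-tags prints the tags each play defines, which is a cheap check before a targeted run.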

Continuous Enforcement: Scheduling via AWX

In production, manual ansible-playbook runs are for one-off changes and incident response. Continuous baseline enforcement is handled by scheduling the full site.yml run every 30 minutes in AWX. Every run is logged, its output is searchable, and failed hosts trigger PagerDuty alerts via AWX's notification system. This approximates the drift-correction guarantees of a pull-based tool — without the operational complexity of managing a Puppet infrastructure.

The codebase you have built in this project follows a pattern widely used by infrastructure teams operating at scale. From this foundation you can extend it: add a monitoring role that deploys the Prometheus node exporter, a logship role that configures Fluent Bit, a certs role that rotates TLS certificates via HashiCorp Vault PKI (a separate tool from Ansible Vault), all following the same role structure, the same inventory convention, and the same dry-run-first discipline.