Everything you have built in this tutorial — inventory design, ad-hoc commands, playbooks, variables and Jinja2 templates, conditionals and loops, roles, Ansible Vault, and scaling strategies — converges in this capstone project. You will design and deploy a complete, production-grade Ansible codebase that configures three distinct server groups: web (NGINX reverse proxy), app (Node.js application), and database (PostgreSQL). This is the structure you will encounter at real companies running dozens to hundreds of nodes.
The goal is not just working configuration — it is maintainable, auditable, reusable code that a new team member can understand in thirty minutes and that survives the chaos of real production: rotating secrets, node additions, OS upgrades, and security remediation under pressure.
Project Layout: Roles-First Structure
Real fleets are organized around roles, not flat playbooks. The canonical structure separates concerns cleanly: site.yml is the single entry point; per-environment inventories under inventories/ isolate prod from staging; reusable roles live under roles/; and secrets are encrypted in each inventory's group_vars.
Why inventories/prod/group_vars/all/vault.yml instead of a root-level vault? Keeping the vault inside the inventory directory means running ansible-playbook -i inventories/prod site.yml automatically loads the correct secrets for that environment. A single root-level vault would be shared across all environments, making accidental cross-environment secret exposure a real risk. Per-inventory vault files eliminate the mistake at the structural level.
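One common way to apply this convention, sketched here under assumptions, is to pair the encrypted vault.yml with a plaintext vars.yml in the same directory (Ansible loads every file under group_vars/all/). The vars.yml file, the variable names, and the placeholder values are all hypothetical and are not taken from the project.

```yaml
# inventories/prod/group_vars/all/vars.yml
# Plaintext indirection layer: reviewers can see which secrets exist
# without being able to read them. Names are illustrative assumptions.
db_password: "{{ vault_db_password }}"
jwt_secret: "{{ vault_jwt_secret }}"
---
# inventories/prod/group_vars/all/vault.yml
# Shown as it looks BEFORE encryption with `ansible-vault encrypt`.
# Placeholder values only; real values never live in plaintext in git.
vault_db_password: "CHANGE_ME"
vault_jwt_secret: "CHANGE_ME"
```

Because both files sit inside inventories/prod, a staging run never sees them: staging would carry its own pair under inventories/staging.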
Inventory: Grouping the Fleet
The hosts.ini file defines three server groups plus a compound group fleet that spans all of them. The compound group is used by the base role — applied to every node regardless of function.
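The grouping described above can be sketched as follows. The text uses an INI-format hosts.ini; this sketch expresses the same structure in Ansible's YAML inventory format instead. Only web-01.prod.example.com appears in this tutorial; every other hostname, and the pg_role host variable used to mark primary and replica, is an assumption made for illustration.

```yaml
# inventories/prod/hosts.yml (YAML equivalent of the hosts.ini grouping)
all:
  children:
    fleet:                 # compound group: every node, used by the base role
      children:
        web:
          hosts:
            web-01.prod.example.com:
            web-02.prod.example.com:   # hypothetical
        app:
          hosts:
            app-01.prod.example.com:   # hypothetical
            app-02.prod.example.com:   # hypothetical
        db:
          hosts:
            db-01.prod.example.com:    # hypothetical
              pg_role: primary         # assumed host variable name
            db-02.prod.example.com:    # hypothetical
              pg_role: replica
```

Nesting web, app, and db as children of fleet means a host added to any tier is automatically covered by the baseline play.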
The Base Role: Fleet-Wide Baseline
The base role enforces the OS-level baseline that every node in the fleet must share, whether it runs NGINX, Node.js, or PostgreSQL. This is where you enforce SSH hardening, auditd, sysctl tuning, NTP, and the shared monitoring agent. Applying a universal baseline via a compound group prevents the class of incident where a production database has a weaker SSH configuration than the web tier because someone forgot to apply the hardening playbook to it.
The validate parameter on the sshd_config template is non-negotiable in production. Without it, a typo in the template or a bad variable value deploys a syntactically invalid sshd_config to disk, the handler restarts sshd, sshd refuses to start, and you are locked out of the node. The validate: /usr/sbin/sshd -t -f %s line runs sshd -t (config test) against the rendered temporary file before moving it to /etc/ssh/sshd_config. If the test fails, the task fails and the old config stays in place. This single line prevents an entire class of production SSH lockouts.
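A minimal sketch of the task this paragraph describes might look like the following. The template filename, the handler name, and the file layout are assumptions; the validate line is the one quoted in the text. The service is called sshd on RHEL-family systems and ssh on Debian and Ubuntu, so the handler name would need adjusting per distribution.

```yaml
# roles/base/tasks/main.yml (excerpt; filenames are assumed)
- name: Deploy hardened sshd_config
  ansible.builtin.template:
    src: sshd_config.j2            # hypothetical template name
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: "0600"
    # %s is replaced with the path of the rendered temporary file;
    # the file is only moved into place if this command exits 0.
    validate: /usr/sbin/sshd -t -f %s
  notify: Restart sshd
  tags: [base, hardening]

# roles/base/handlers/main.yml
- name: Restart sshd
  ansible.builtin.service:
    name: sshd                     # "ssh" on Debian/Ubuntu
    state: restarted
```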
The Web Role: NGINX Reverse Proxy
The web role installs NGINX and deploys a Jinja2-rendered virtual host configuration. The template uses group variables to populate upstream app server addresses dynamically — no hardcoded IPs in any template.
The corresponding Jinja2 template builds the upstream block dynamically from the live inventory, which is one of the most useful patterns in fleet configuration management. NGINX is reconfigured automatically on the next run whenever you add or remove an app server in the inventory; you never touch the template.
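The template itself is not reproduced here, but the same inventory lookup can be sketched on the task side. In this hedged example the list of app addresses is computed once and handed to the template as app_upstreams; that variable name, the template filename, the destination path, and the handler name are all assumptions.

```yaml
# roles/web/tasks/main.yml (excerpt; names and paths are assumed)
- name: Render the NGINX virtual host with upstreams derived from inventory
  ansible.builtin.template:
    src: vhost.conf.j2                   # hypothetical template name
    dest: /etc/nginx/conf.d/app.conf     # hypothetical path
    owner: root
    group: root
    mode: "0644"
  vars:
    # One address per host in the app group, taken from gathered facts.
    # Requires facts for every app host to exist in this run.
    app_upstreams: >-
      {{ groups['app']
      | map('extract', hostvars, ['ansible_default_ipv4', 'address'])
      | list }}
  notify: Reload nginx
  tags: [web, nginx]
```

Inside the template, a loop over app_upstreams would then emit one server line per address, so the template stays free of hardcoded IPs.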
The App Role: Node.js Application
The app role installs Node.js, deploys the application from a versioned artifact, and manages the systemd unit that keeps it running. The application's secrets (database password, JWT secret) are injected at deploy time from Ansible Vault into an .env file that is never committed to source control.
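The .env template draws from Vault-encrypted variables. Setting the file's owner to app_user and its mode to 0600 means only that user (and root) can read it, which keeps the secrets away from other accounts and processes on the same host. File permissions do not protect Ansible's own output, so also set no_log: true on the task to keep the rendered values out of run logs and --diff output.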
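As a sketch of that task, assuming app_user and app_dir are variables defined in the role's defaults (app_dir, the template filename, and the handler name are hypothetical):

```yaml
# roles/app/tasks/main.yml (excerpt; variable and file names are assumed)
- name: Render the application .env from Vault-backed variables
  ansible.builtin.template:
    src: env.j2                    # hypothetical template name
    dest: "{{ app_dir }}/.env"     # app_dir is an assumed variable
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    mode: "0600"                   # owner read/write only
  no_log: true                     # keep secret values out of task output
  notify: Restart app
  tags: [app]
```

The trade-off of no_log is that a failing task reports less detail, so it is usually applied only to the tasks that actually handle secrets.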
The DB Role: Idempotent PostgreSQL Setup
The database role is the most sensitive. It must handle the primary/replica distinction (captured as a host variable in the inventory), initialize PostgreSQL only once (not on every playbook run), and configure pg_hba.conf to allow app-tier connections while blocking everything else.
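The two idempotency concerns above can be sketched as follows. This is an illustration under assumptions, not the role's actual task list: pg_bin_dir, pg_data_dir, pg_role, the template filename, and the handler name are hypothetical, and some distribution packages (Debian's, for example) initialize a cluster on install, in which case the first task would simply be skipped.

```yaml
# roles/db/tasks/main.yml (excerpt; variable and file names are assumed)
- name: Initialize the PostgreSQL data directory only once
  ansible.builtin.command:
    cmd: "{{ pg_bin_dir }}/initdb -D {{ pg_data_dir }}"
    # initdb writes PG_VERSION into the data directory, so its presence
    # makes this task a no-op on every later run.
    creates: "{{ pg_data_dir }}/PG_VERSION"
  become: true
  become_user: postgres
  when: pg_role == 'primary'       # replicas are typically seeded from the primary
  tags: [db]

- name: Render pg_hba.conf allowing only app-tier clients
  ansible.builtin.template:
    src: pg_hba.conf.j2            # hypothetical template name
    dest: "{{ pg_data_dir }}/pg_hba.conf"
    owner: postgres
    group: postgres
    mode: "0600"
  notify: Reload postgresql
  tags: [db]
```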
The Master Playbook: site.yml
The site.yml playbook orchestrates everything. It applies the base role to the entire fleet group first, then applies tier-specific roles in separate plays. The ordering matters: the base hardening must be in place before any service is installed, and the database must be reachable before the app servers start, so the db play runs before the app play and the web tier, which proxies to the app servers, comes last.
# site.yml
---
- name: Apply baseline hardening to all fleet nodes
  hosts: fleet
  become: true
  gather_facts: true
  roles:
    - base

# The database comes first so it is reachable when the app tier starts.
- name: Configure database tier (PostgreSQL)
  hosts: db
  become: true
  gather_facts: true
  roles:
    - db

- name: Configure app tier (Node.js application)
  hosts: app
  become: true
  gather_facts: true
  roles:
    - app

- name: Configure web tier (NGINX reverse proxy)
  hosts: web
  become: true
  gather_facts: true
  roles:
    - web
Fleet architecture: a single site.yml applies the base role to all nodes, then applies tier-specific roles to the db, app, and web groups in sequence.
Running the Fleet: Full Deploy Workflow
With the codebase complete, the deployment workflow follows a strict three-stage pattern: validate, preview, apply. That means a syntax check, then a dry run, then the live run. Never run site.yml against production without a dry-run pass.
# 1. Install Galaxy collection dependencies first
ansible-galaxy collection install -r requirements.yml

# 2. Syntax check: catches YAML and playbook structure errors before touching hosts.
#    It does not evaluate variables; undefined ones surface only at run time,
#    including during the --check pass in step 3.
ansible-playbook -i inventories/prod site.yml --syntax-check

# 3. Dry-run against ALL prod hosts: review every "changed" task
ansible-playbook -i inventories/prod site.yml \
  --ask-vault-pass \
  --check --diff

# 4. Live run: apply to staging first, then prod
ansible-playbook -i inventories/staging site.yml --ask-vault-pass
ansible-playbook -i inventories/prod site.yml --ask-vault-pass

# 5. Target a single tier for incremental changes (e.g., NGINX config update)
ansible-playbook -i inventories/prod site.yml \
  --ask-vault-pass \
  --limit web \
  --tags web

# 6. Emergency limit: target a single host during incident triage
ansible-playbook -i inventories/prod site.yml \
  --ask-vault-pass \
  --limit web-01.prod.example.com \
  --tags web \
  --check --diff
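Step 1 assumes a requirements.yml at the repository root. The project's actual dependency list is not shown in this section, so the following is only a sketch of the file's shape; the two collections named are real, but whether this codebase needs them (for example ansible.posix for sysctl tuning) is an assumption.

```yaml
# requirements.yml (illustrative; the real dependency list may differ)
collections:
  - name: ansible.posix            # assumed: sysctl and similar OS modules
  - name: community.postgresql     # assumed: PostgreSQL management modules
```

Pinning each entry with a version field is a common follow-up so that staging and prod runs resolve the same collection releases.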
Production Failure Modes and How to Prevent Them
Fleet-scale Ansible runs expose failure modes you will not encounter in single-host tests. Each one below is a real production incident class, with the mitigation built into this project's structure.
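Partial fleet apply on network error. With the default linear strategy, Ansible runs each task across the hosts in a batch in parallel (up to the forks limit) before moving on to the next task. A transient SSH timeout on three nodes marks them UNREACHABLE and drops them from the rest of the run, while the other nodes get fully configured. Use serial: "20%" in high-blast-radius plays so failures are bounded to one batch, and check the failed and unreachable host counts in the AWX job summary before rolling the change out further.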
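A play-level sketch of that mitigation, using the db play as the example. The serial value is the one from the text; pairing it with max_fail_percentage is an additional assumption on my part rather than something this project prescribes.

```yaml
# site.yml (excerpt): rolling update with a bounded blast radius
- name: Configure database tier (PostgreSQL)
  hosts: db
  become: true
  gather_facts: true
  serial: "20%"              # process the group in batches of 20% of its hosts
  max_fail_percentage: 0     # assumed addition: stop the play if any host in a batch fails
  roles:
    - db
```

With serial set, the whole play (including handlers) completes on one batch before the next batch begins, so a bad change stops after the first slice of hosts instead of reaching all of them.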
Vault password mismatch between environments. Staging and prod vaults use different passwords. If you use --vault-password-file pointing to the wrong file, decryption fails outright: Vault authenticates the ciphertext with an HMAC, so a wrong password never yields garbage values. The error, however, does not tell you which environment's password you supplied, and with several unlabeled password files the mix-up is easy to repeat. Use separate --vault-id labels: --vault-id prod@~/.vault-prod makes the password source explicit and ties each encrypted file to the label it was encrypted with.
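Template rendering failing on missing facts. A Jinja2 {% for host in groups['app'] %} loop that references hostvars[host]['ansible_default_ipv4']['address'] only works if facts for every app host are available in the current run. If fact gathering was disabled (gather_facts: false) for the app group, or the run used --limit web so the app hosts were never contacted, the lookup is undefined. By default the template task then fails with an undefined-variable error; if the template masks the lookup with a default filter, it silently renders an empty or incomplete upstream block instead. Gather facts for every host whose hostvars a template references (or enable fact caching), and add an assert task that verifies the upstream data is present before rendering.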
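A guard of that kind can be sketched as two assert tasks placed before the template task in the web role. The task wording and messages are assumptions; the fact path is the one named in the text.

```yaml
# roles/web/tasks/main.yml (excerpt, placed before the template task)
- name: Verify the app group is not empty
  ansible.builtin.assert:
    that:
      - groups['app'] | default([]) | length > 0
    fail_msg: "Inventory group 'app' is empty; refusing to render an empty upstream block"

- name: Verify facts exist for every app host
  ansible.builtin.assert:
    that:
      - hostvars[item]['ansible_default_ipv4'] is defined
    fail_msg: "No gathered facts for {{ item }}; run without --limit or enable fact caching"
  loop: "{{ groups['app'] }}"
```

Failing here, before the template runs, leaves the existing NGINX configuration untouched and names the host whose facts are missing.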
Handler not firing after a failed task. Handlers only fire if the play completes without failure on that host. A task failure mid-play suppresses all pending handlers — meaning a template was deployed but the service was never reloaded. Use force_handlers: true at the play level for service-reload handlers to ensure a best-effort reload even on partial failure, and monitor service health separately.
Tag every task and role for surgical targeting. Tag the base role tasks with tags: [base, hardening], web tasks with tags: [web, nginx], and so on. In production you will almost never re-run the full site.yml — you will run --tags nginx to push an NGINX config change, or --tags hardening to respond to a CVE. Tagging is what separates an Ansible codebase you can use confidently in production from one that terrifies your team because every change requires running everything.
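One low-effort way to get this coverage, sketched here with the tag names from the paragraph above, is to tag the roles where site.yml includes them. Tags set on a role entry are inherited by every task in that role, which complements per-task tags inside the role.

```yaml
# site.yml (excerpt): tags applied at the role level
- name: Apply baseline hardening to all fleet nodes
  hosts: fleet
  become: true
  roles:
    - role: base
      tags: [base, hardening]

- name: Configure web tier (NGINX reverse proxy)
  hosts: web
  become: true
  roles:
    - role: web
      tags: [web, nginx]
```

With this in place, a run with --tags nginx executes only the web role's tasks, while fact gathering still runs because Ansible always executes it regardless of tag selection.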
Continuous Enforcement: Scheduling via AWX
In production, manual ansible-playbook runs are for one-off changes and incident response. Continuous baseline enforcement is handled by scheduling the full site.yml run every 30 minutes in AWX. Every run is logged, its output is searchable, and failed hosts trigger PagerDuty alerts via AWX's notification system. This approximates the drift-correction guarantees of a pull-based tool — without the operational complexity of managing a Puppet infrastructure.
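The 30-minute schedule can itself be kept in code. The following is a sketch using the awx.awx collection's schedule module; the job template name, the schedule name, and the start date are hypothetical, and the connection settings are assumed to come from environment variables such as CONTROLLER_HOST and CONTROLLER_OAUTH_TOKEN, whose exact names depend on the collection version.

```yaml
# awx_schedule.yml (hypothetical helper playbook, run from a control node)
- name: Register continuous baseline enforcement in AWX
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Run the fleet job template every 30 minutes
      awx.awx.schedule:
        name: fleet-baseline-every-30-min          # assumed schedule name
        unified_job_template: fleet-site-prod      # assumed job template name
        # iCal recurrence rule: every 30 minutes from the (arbitrary) start date
        rrule: "DTSTART:20250101T000000Z RRULE:FREQ=MINUTELY;INTERVAL=30"
        state: present
```

Keeping the schedule in a playbook means the enforcement interval is reviewed and versioned like the rest of the codebase rather than set by hand in the AWX UI.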
The playbook you have built in this project is the complete pattern used by infrastructure teams at scale. From this foundation you can extend it: add a monitoring role that deploys the Prometheus node exporter, a logship role that configures Fluent Bit, a certs role that rotates TLS certificates via Vault PKI — all following the same role structure, the same inventory convention, and the same dry-run-first discipline.