Project: A Cost Optimization Program
This lesson is the capstone of the FinOps tutorial. You will work through a realistic scenario, a $480,000/month AWS bill for a mid-size SaaS platform: you will walk every line of that bill with a structured audit methodology and produce a concrete savings roadmap with effort tiers, dollar estimates, and a 12-month delivery calendar. This is the exact exercise a FinOps practitioner runs when joining a new organisation or when cloud spend starts growing faster than revenue.
The Sample Bill — Anatomy of $480k/Month
The platform is a B2B SaaS product serving 8,000 tenants in us-east-1 and eu-west-1. The engineering org is 120 engineers across 14 product squads. The bill has never been systematically reviewed. Current state:
- EC2 & Auto Scaling: $198,000 (41%). 420 production instances, all on-demand. Sizes range from t3.large to c6i.8xlarge. No Savings Plans, no RIs. Average utilisation reported by CloudWatch is 22% CPU across the fleet.
- RDS & Aurora: $87,000 (18%). 38 Aurora clusters (MySQL-compatible). 11 clusters are db.r6g.8xlarge running multi-AZ. No reserved instances. 6 clusters have not received a single write query in the past 14 days (dev/test environments not shut down on nights/weekends).
- Data Transfer: $62,000 (13%), the single largest line item nobody examined. $41,000 is cross-AZ transfer. $14,000 is inter-region replication to eu-west-1 for tenants that actually sit entirely in us-east-1.
- S3: $34,000 (7%). 1.2 PB stored. No Intelligent-Tiering, no lifecycle rules. CloudWatch shows 85% of objects have not been accessed in over 90 days.
- CloudWatch Logs: $29,000 (6%) — default infinite retention on 340 log groups. 60% of ingestion is debug-level logs from a Java service that should have been switched to INFO in production 18 months ago.
- NAT Gateways: $24,000 (5%) — 12 NAT gateways across 4 VPCs. $19,000 of that is data-processing charges from S3 and DynamoDB traffic routing through NAT instead of VPC endpoints.
- Other (EBS snapshots, ELBs, ECR, Lambda, SQS, SNS): $46,000 (10%)
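Before auditing, it is worth confirming that the line items actually add up to the headline figure. The following is a minimal sketch that recomputes each share from the dollar amounts listed above; the dictionary keys are shortened labels chosen for illustration.

```python
# Monthly line items from the sample bill, in US dollars.
BILL = {
    "EC2 & Auto Scaling": 198_000,
    "RDS & Aurora": 87_000,
    "Data Transfer": 62_000,
    "S3": 34_000,
    "CloudWatch Logs": 29_000,
    "NAT Gateways": 24_000,
    "Other": 46_000,
}


def shares(bill):
    """Return each line item's share of the total as a whole-number percentage."""
    total = sum(bill.values())
    return {name: round(100 * cost / total) for name, cost in bill.items()}


if __name__ == "__main__":
    print(f"Total: ${sum(BILL.values()):,}/month")
    for name, pct in shares(BILL).items():
        print(f"{name:<20} {pct:>3}%")
```

The rounded shares match the percentages quoted in the list, and the total reconciles to $480,000.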
Phase 1 — Audit: Ask the Right Questions Before Touching Anything
The worst mistake in a bill audit is immediately clicking "purchase Reserved Instances" or deleting resources without understanding causality. A structured audit follows this sequence:
- Verify tag coverage. Run aws resourcegroupstaggingapi get-resources --tag-filters Key=team and measure what fraction of resources carry the mandatory team, env, and service tags. In the sample bill, tag coverage is 38%, meaning 62% of spend is unallocated. Fix tagging before analysing anything else, or your findings will be meaningless.
- Export 90 days of Cost Explorer data. Export at DAILY granularity grouped by SERVICE, USAGE_TYPE, and the team tag. Load into a spreadsheet or a BigQuery/Athena table. Look for monotonic growth lines (cost that grows every day without a corresponding feature launch), step-function spikes (cost that jumped suddenly, usually a new workload or a misconfiguration), and flat lines on large amounts (committed resources sitting idle).
- Cross-reference with CloudWatch metrics. For EC2, pull average CPUUtilization, NetworkIn, and NetworkOut over 90 days. Anything averaging below 10% CPU is a right-sizing or termination candidate. For RDS, pull DatabaseConnections: an Aurora cluster with zero connections over 14 days is almost certainly a dev environment that never got a shutdown schedule.
- Map data flows. Use VPC Flow Logs aggregated in Athena to identify the top-10 source/destination pairs by byte count. This is the only way to understand your $62,000 data transfer bill without guessing. The cross-AZ traffic almost always comes from a small number of chatty services that were deployed without AZ affinity.
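The tag-coverage step can be sketched as a small function over the tagging API output. This is an illustration only: it assumes the input is a list of entries shaped like the ResourceTagMappingList items returned by get-resources (a ResourceARN plus a list of Key/Value pairs), and it counts resources rather than dollars, so the result approximates spend coverage only if spend is spread evenly. The helper names are hypothetical.

```python
REQUIRED_TAGS = ("team", "env", "service")


def missing_tags(resource, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or empty on one resource."""
    present = {tag["Key"] for tag in resource.get("Tags", []) if tag.get("Value")}
    return [key for key in required if key not in present]


def tag_coverage(resources, required=REQUIRED_TAGS):
    """Fraction of resources (0.0 to 1.0) that carry every required tag."""
    if not resources:
        return 0.0
    compliant = sum(1 for r in resources if not missing_tags(r, required))
    return compliant / len(resources)


if __name__ == "__main__":
    # Hypothetical sample; real input would be collected from paginated
    # get-resources calls made WITHOUT a tag filter, so that resources
    # missing the team tag are also counted.
    sample = [
        {"ResourceARN": "arn:aws:ec2:us-east-1:111122223333:instance/i-aaa",
         "Tags": [{"Key": "team", "Value": "billing"},
                  {"Key": "env", "Value": "prod"},
                  {"Key": "service", "Value": "invoicer"}]},
        {"ResourceARN": "arn:aws:ec2:us-east-1:111122223333:instance/i-bbb",
         "Tags": [{"Key": "team", "Value": "billing"}]},
    ]
    print(f"coverage: {tag_coverage(sample):.0%}")
    for resource in sample:
        print(resource["ResourceARN"], "missing:", missing_tags(resource))
```

Listing the missing keys per resource, not just the overall ratio, gives each squad a concrete fix list.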
The Savings Roadmap — Tiered by Effort and Time to Value
A savings roadmap is not a wish list. Every initiative needs a dollar estimate, an effort estimate, an owner, and a completion date. The tiers below reflect real-world implementation difficulty and the typical organisational friction involved.
Tier 0 — No-Risk Deletions (Week 1–2, zero engineering effort):
- Terminate the 6 idle Aurora dev clusters: saves $14,400/month. These have zero connections. Snapshot them first (aws rds create-db-cluster-snapshot), then delete. Add a Lambda + EventBridge scheduler to auto-stop dev clusters at 18:00 weekdays and restart at 08:00; this pattern saves $8–12k/month on its own by eliminating 65% of dev environment runtime.
- Set CloudWatch Logs retention to 30 days on all log groups (90 days for audit-sensitive groups). Switch the Java service log level to INFO: saves $17,400/month combined. This is a single aws logs put-retention-policy call per log group, scriptable in 20 minutes.
- Delete unattached EBS volumes, unused Elastic IPs, and idle load balancers found in the "Other" category: estimated $8,000–12,000/month.
Tier 1 — Quick Architectural Fixes (Weeks 2–6, 1–2 engineers each):
- VPC Endpoints for S3 and DynamoDB: removing the $19,000/month of NAT Gateway data-processing charges is the single highest-ROI fix in the bill. Gateway-type VPC Endpoints are free, and traffic that uses them no longer routes through NAT. Implementation: one Terraform module, one PR, one apply. Saves ~$19,000/month.
- Enable S3 Intelligent-Tiering: 1.2 PB with 85% cold objects. Objects not accessed for 90 days move to the Archive Instant Access tier, which reduces storage cost from ~$0.023/GB to ~$0.004/GB per month. Net saving after monitoring fee: ~$18,000/month. Single S3 Batch Operations job to tag objects.
- Fix cross-region replication scope: $14,000/month of inter-region transfer for US-only tenants. Scope replication to EU-domiciled tenants only. Saves ~$11,000/month (some legitimate EU tenants remain).
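The Intelligent-Tiering estimate can be reproduced with a few lines of arithmetic. This sketch uses the per-GB prices quoted above and treats 1.2 PB as 1,200,000 GB. The object count (500 million) and the monitoring fee ($0.0025 per 1,000 objects per month) are assumptions made for illustration, since the bill does not state them; check both against your own inventory and current pricing.

```python
def intelligent_tiering_saving(stored_gb, cold_fraction, hot_price, cold_price,
                               object_count, monitoring_fee_per_1000=0.0025):
    """Estimate the net monthly saving from moving cold data to a cheaper tier.

    gross      = cold GB x (hot price - cold price)
    monitoring = per-object monitoring charge (assumed rate, see lead-in)
    """
    gross = stored_gb * cold_fraction * (hot_price - cold_price)
    monitoring = object_count / 1000 * monitoring_fee_per_1000
    return gross - monitoring


if __name__ == "__main__":
    net = intelligent_tiering_saving(
        stored_gb=1_200_000,       # 1.2 PB, decimal GB
        cold_fraction=0.85,        # objects untouched for 90+ days
        hot_price=0.023,           # USD per GB-month, from the text
        cold_price=0.004,          # USD per GB-month, from the text
        object_count=500_000_000,  # hypothetical; not given in the bill
    )
    print(f"Estimated net saving: ${net:,.0f}/month")
```

Under these assumptions the result lands close to the ~$18,000/month figure; a bucket with many small objects would see a larger monitoring charge and a smaller net saving.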
Tier 2 — Right-Sizing (Weeks 4–10, 0.5 FTE for 6 weeks):
- 22% average CPU across 420 instances means significant over-provisioning. AWS Compute Optimizer generates right-sizing recommendations with ML-derived confidence scores. Conservative approach: only action HIGH-confidence recommendations, moving instances down one size class. Expected reduction: 20–30% of instance costs. On $198,000, a 25% reduction is $49,500/month. Use instance scheduler for non-production to add another $15–20k.
- Right-sizing must happen before you buy Savings Plans — committing to over-provisioned instance types locks in the waste at a discount.
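The conservative filter described above can be sketched as follows. The Recommendation record and its fields are hypothetical stand-ins for whatever your export of right-sizing recommendations contains; the confidence labels follow the wording in the text, not a specific API field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    instance_id: str
    confidence: str          # "HIGH", "MEDIUM" or "LOW", following the text
    current_monthly: float   # on-demand cost of the current size, USD
    proposed_monthly: float  # on-demand cost one size class down, USD


def actionable(recommendations):
    """Keep only HIGH-confidence recommendations that actually reduce cost."""
    return [r for r in recommendations
            if r.confidence == "HIGH" and r.proposed_monthly < r.current_monthly]


def monthly_saving(recommendations):
    """Sum the cost difference across the given recommendations."""
    return sum(r.current_monthly - r.proposed_monthly for r in recommendations)


def fleet_estimate(monthly_spend, low_pct=20, high_pct=30):
    """Bracket the expected saving using whole-number percentages."""
    return monthly_spend * low_pct / 100, monthly_spend * high_pct / 100


if __name__ == "__main__":
    recs = [  # hypothetical instances and prices
        Recommendation("i-aaa", "HIGH", 1000.0, 500.0),
        Recommendation("i-bbb", "MEDIUM", 800.0, 400.0),
        Recommendation("i-ccc", "HIGH", 300.0, 300.0),
    ]
    chosen = actionable(recs)
    print([r.instance_id for r in chosen], monthly_saving(chosen))
    low, high = fleet_estimate(198_000)
    print(f"Fleet-level bracket: ${low:,.0f} to ${high:,.0f} per month")
```

The fleet-level bracket is a sanity check on the per-instance total: if the actionable recommendations sum to far less than 20% of EC2 spend, the HIGH-confidence filter may be too strict for the first pass.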
Tier 3 — Commitment Discounts (Month 2–3, FinOps lead + finance sign-off):
- After right-sizing, the EC2 fleet will cost approximately $148,500/month on-demand. Pull the trailing 30-day minimum hourly spend (post-right-sizing): this becomes the safe 3-year Compute Savings Plan commitment. The 11 Aurora r6g.8xlarge multi-AZ clusters are stable, so purchase 3-year Partial Upfront RDS Reserved Instances for them. Combined estimated saving at 55–65% off on-demand: $60,000–75,000/month.
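One way to turn the trailing-minimum rule into a number is sketched below. It assumes you have hourly on-demand-equivalent spend for the eligible compute, and it converts the floor into a commitment by applying the plan discount, because a Savings Plan commitment is expressed in dollars per hour at the discounted rate. The sample figures and the 730-hour month are illustrative assumptions.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly estimates


def safe_commitment(hourly_on_demand_spend, discount):
    """Hourly commitment that covers only the lowest hour observed.

    hourly_on_demand_spend: hourly costs at on-demand rates, USD
    discount: plan discount as a fraction, e.g. 0.5 for 50% off
    """
    if not hourly_on_demand_spend:
        raise ValueError("need at least one hourly data point")
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    floor = min(hourly_on_demand_spend)
    return floor * (1 - discount)


def monthly_saving(hourly_on_demand_spend, discount):
    """Saving on the covered floor only; usage above it stays on-demand."""
    return min(hourly_on_demand_spend) * discount * HOURS_PER_MONTH


if __name__ == "__main__":
    hourly = [100.0, 120.0, 90.0, 110.0]  # hypothetical post-right-sizing data
    print(f"Commit ${safe_commitment(hourly, 0.5):.2f}/hour")
    print(f"Saves about ${monthly_saving(hourly, 0.5):,.0f}/month")
```

Committing to the floor rather than the average leaves headroom: the plan stays fully utilised even in the quietest hour, at the cost of leaving some peak usage on-demand.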
The 12-Month Savings Calendar
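Sequencing matters. Doing commitment discounts before right-sizing wastes money. Fixing data transfer before understanding traffic patterns can break replication. The recommended execution order follows the tiers above: Tier 0 deletions in weeks 1–2, Tier 1 architectural fixes in weeks 2–6, Tier 2 right-sizing in weeks 4–10, and Tier 3 commitment purchases in months 2–3, once the right-sized fleet has a stable baseline.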
Unit Economics: Closing the Loop
A savings roadmap that stops at "we reduced the bill" misses half the value. Mature FinOps connects cloud cost to business metrics. For a B2B SaaS, the key ratio is cost per tenant per month. At $480k/month serving 8,000 tenants, cost-per-tenant is $60. After the 12-month programme, the same 8,000 tenants cost $34 each per month (about $272k/month in total), a 43% improvement that, if revenue is growing, means significantly improved gross margin.
Instrument this in your observability stack. Emit a daily metric cloud.cost_per_tenant to your Grafana/Datadog dashboard, plotted alongside revenue_per_tenant and gross_margin_pct. When cost-per-tenant starts rising without a corresponding feature investment, something went wrong — new workload without right-sizing, a data pipeline whose volume grew unexpectedly, a service that lost its Spot coverage after a spot-interruption failure. Catching these signals at the unit-economics level is faster than waiting for the monthly bill review.
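The unit-economics figures can be checked, and the daily metric computed, with a sketch like the one below. The drift check and its 5% tolerance are illustrative assumptions, not a recommended threshold; publishing the value to Grafana or Datadog is left out because it depends on your metrics client.

```python
def cost_per_tenant(monthly_cost, tenants):
    """Cloud cost divided by active tenants, in USD per tenant per month."""
    if tenants <= 0:
        raise ValueError("tenants must be positive")
    return monthly_cost / tenants


def improvement_pct(before, after):
    """Percentage reduction from before to after."""
    return 100 * (before - after) / before


def is_drifting(history, tolerance_pct=5.0):
    """True when the latest value exceeds the mean of earlier values
    by more than tolerance_pct (an assumed, tunable threshold)."""
    if len(history) < 2:
        return False
    *earlier, latest = history
    baseline = sum(earlier) / len(earlier)
    return latest > baseline * (1 + tolerance_pct / 100)


if __name__ == "__main__":
    before = cost_per_tenant(480_000, 8_000)
    after = cost_per_tenant(272_000, 8_000)
    print(f"${before:.0f} -> ${after:.0f} per tenant "
          f"({improvement_pct(before, after):.0f}% lower)")
    print("drifting:", is_drifting([34.0, 34.0, 34.0, 37.0]))
```

Comparing against a trailing mean rather than the previous day avoids alerting on single-day noise such as a month-end batch job.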
Governance: Preventing Regression
The most common failure mode of a FinOps programme is a 6-month sprint that achieves great results, followed by a 12-month slow drift back to the original spend as the organisation grows and nobody enforces the new patterns. Prevention requires three structural controls:
- Infracost in every Terraform PR. Cost diff is a required CI check, not optional. A PR that adds $5,000/month or more of new spend without a Jira ticket linking to a business justification is blocked until an engineer explicitly overrides it. This is exactly the same pattern as a security scanner blocking a PR with a critical CVE.
- Monthly FinOps reviews with team-level showback. Each squad sees their cost trend on the same slide deck as their SLO performance. Cost spikes get the same attention as error rate spikes.
- Tagging enforcement via AWS Config Rules / SCPs. Any resource created without the mandatory tags automatically triggers a remediation action that tags it with team=untagged and alerts the FinOps lead. Resources still tagged team=untagged after 7 days are eligible for auto-deletion in non-production accounts.
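The PR cost gate described in the first control reduces to a small decision function. The sketch below assumes the CI job has already extracted the monthly cost delta from the cost-diff tool's output; the CostDiff record, its field names, and the ticket format are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD_USD = 5_000.0  # monthly increase that requires justification


@dataclass(frozen=True)
class CostDiff:
    monthly_delta: float               # USD per month added by the PR
    ticket: Optional[str] = None       # e.g. a linked Jira key
    override_by: Optional[str] = None  # engineer who explicitly overrode


def gate(diff, threshold=THRESHOLD_USD):
    """Return (allowed, reason) for a PR's cost diff."""
    if diff.monthly_delta < threshold:
        return True, "below threshold"
    if diff.ticket:
        return True, f"justified by {diff.ticket}"
    if diff.override_by:
        return True, f"overridden by {diff.override_by}"
    return False, "blocked: spend increase needs a ticket or an explicit override"


if __name__ == "__main__":
    for diff in (
        CostDiff(1_200.0),
        CostDiff(7_500.0),
        CostDiff(7_500.0, ticket="FIN-123"),     # hypothetical ticket key
        CostDiff(7_500.0, override_by="alice"),  # hypothetical engineer
    ):
        print(diff.monthly_delta, gate(diff))
```

Returning the reason alongside the verdict lets the CI job post it as a PR comment, so an override leaves an audit trail.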
Your Deliverables
As the engineer who owns the cost optimization programme, your output by end of month one should be: a one-page executive summary with four numbers (current spend, annualised saving opportunity, 12-month plan, and cost-per-tenant before/after); a Terraform module implementing VPC endpoints and S3 Intelligent-Tiering; a Jira epic with one ticket per Tier 0 and Tier 1 initiative, each with dollar estimates in the acceptance criteria; and a Grafana dashboard with cloud.cost_per_tenant, Savings Plan utilisation percentage, and top-5 services by spend. That artefact set is how you demonstrate FinOps maturity at the senior engineer and staff engineer level.