Project: A Progressive Delivery Plan
Every concept in this tutorial — canary releases, feature flags, rollback, automated analysis, expand-contract — exists to solve a single problem: how do you ship a change to a critical service without causing an outage? This capstone lesson walks you through designing a complete, production-grade progressive delivery system for a high-stakes service: a checkout payment processor handling thousands of transactions per minute.
You will design the entire system end-to-end: the canary pipeline, the flag hierarchy, the SLO-gated promotion stages, the database migration strategy, and the rollback playbook. This is broadly the approach that companies such as Stripe, Amazon, and Google have described for releases on their most critical paths.
Step 1 — Map the Risk Surface
Before writing a single YAML file, enumerate what can go wrong. For each risk, you will assign a mitigation control from your progressive delivery toolkit:
- New routing logic rejects valid cards — mitigation: canary at 1%, automated error-rate gate, feature flag kill switch.
- Latency regression in the new engine — mitigation: p99 latency SLO gate; rollback if p99 > 200 ms for 5 min.
- Database schema incompatible between old and new code — mitigation: expand-contract migration, dual-write period.
- New dynamic currency conversion causes rounding errors — mitigation: currency-conversion flag starts OFF; only users who opt in see it; A/B experiment with statistical significance gate before full rollout.
- Third-party FX rate API is down — mitigation: ops toggle kill switch disables dynamic conversion, falls back to static rates instantly.
Step 2 — Design the Flag Hierarchy
You need three distinct flags for this release, each with a different lifecycle and owner:
- payments.new-router.enabled: Release toggle. Controls whether traffic goes through the new routing engine at all. This is the top-level kill switch. Default: false. Removal date: 30 days after 100% rollout.
- payments.dynamic-currency.enabled: Experiment toggle. Controls whether dynamic FX rates are offered to the user. Starts at 0%. Gated on statistical significance (p-value < 0.05, minimum 10,000 conversions per variant). Removal date: 14 days after winner declared.
- payments.new-router.circuit-open: Ops toggle. When true, immediately routes 100% of traffic back to the legacy engine regardless of any other flag state. This is the emergency circuit breaker, not the same as a rollback: it acts in seconds. Long-lived; owned by the payments on-call rotation permanently.
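The statistical gate on the experiment toggle can be sketched in Python as follows. The thresholds (p-value below 0.05, at least 10,000 conversions per variant) come from the flag definition above; the choice of a pooled two-proportion z-test and the function names are assumptions made for illustration, not something the plan prescribes.

```python
import math

ALPHA = 0.05               # p-value threshold from the flag definition
MIN_CONVERSIONS = 10_000   # minimum conversions per variant, also from the flag definition


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test (assumed test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # degenerate rates (all 0% or all 100%): no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    # For a standard normal, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def experiment_gate_passes(conv_a: int, n_a: int, conv_b: int, n_b: int) -> bool:
    """True only when both variants have enough conversions AND p < ALPHA."""
    if min(conv_a, conv_b) < MIN_CONVERSIONS:
        return False
    return two_proportion_p_value(conv_a, n_a, conv_b, n_b) < ALPHA
```

Checking the sample-size floor before the p-value keeps an early, noisy result from declaring a winner prematurely.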
Evaluate payments.new-router.circuit-open before any other flag. If it is true, short-circuit to the legacy path immediately and do not evaluate downstream flags. This ensures the kill switch has sub-10 ms effect at evaluation time, even under extreme load.
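A minimal Python sketch of that evaluation order is shown below. The `FlagState` and `Route` types and the `choose_route` function are hypothetical names, and the sketch assumes the flag values have already been resolved for the current request. It also assumes dynamic currency conversion is only offered on the new router, which the plan implies but does not state outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlagState:
    circuit_open: bool               # payments.new-router.circuit-open
    new_router_enabled: bool         # payments.new-router.enabled, resolved for this request
    dynamic_currency_enabled: bool   # payments.dynamic-currency.enabled, resolved for this request


@dataclass(frozen=True)
class Route:
    engine: str            # "legacy" or "new"
    dynamic_currency: bool


def choose_route(flags: FlagState) -> Route:
    # 1. Ops toggle first: when open, nothing downstream is consulted.
    if flags.circuit_open:
        return Route(engine="legacy", dynamic_currency=False)
    # 2. Release toggle decides which engine handles the request.
    if not flags.new_router_enabled:
        return Route(engine="legacy", dynamic_currency=False)
    # 3. Experiment toggle only matters once the new engine is selected (assumption).
    return Route(engine="new", dynamic_currency=flags.dynamic_currency_enabled)
```

Because the circuit-breaker check is the first branch, a request takes the legacy path after a single boolean test, whatever the other two flags say.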
Step 3 — Design the Canary Pipeline
The canary pipeline is the scaffolding that moves payments.new-router.enabled from 0% to 100% safely. Each stage has explicit entry criteria (SLOs that must be passing) and exit criteria (SLOs that, if violated, trigger automatic rollback).
Encode this pipeline as an Argo Rollouts manifest so the orchestration is version-controlled and reproducible.
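The control flow that such a manifest encodes can be modelled in a few lines of Python. This is a sketch of the stage-by-stage logic, not an Argo Rollouts manifest. The stage weights are hypothetical (only the 1% starting point appears in Step 1), and `slos_healthy` is a placeholder callback standing in for the real SLO check.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical stage weights (percent of traffic). Only the 1% starting
# point comes from the risk list in Step 1; the rest are placeholders.
STAGE_WEIGHTS: Tuple[int, ...] = (1, 5, 25, 50, 100)


def run_canary(
    weights: Sequence[int],
    slos_healthy: Callable[[int], bool],
) -> Tuple[str, int]:
    """Walk the stages in order; return (outcome, final traffic weight)."""
    current = 0
    for weight in weights:
        current = weight  # stand-in for shifting `weight` percent to the canary
        if not slos_healthy(current):
            # Exit criteria violated: send all traffic back to stable.
            return ("rolled_back", 0)
    return ("promoted", current)
```

For example, `run_canary(STAGE_WEIGHTS, lambda w: w < 25)` models a regression that only appears at 25% traffic: the first two stages pass and the third triggers the rollback.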
Step 4 — Wire the Automated Analysis
The SLO gates are enforced by an AnalysisTemplate that queries your observability stack every 5 minutes. If any metric breaches its threshold, Argo Rollouts halts the rollout and the on-call engineer is paged.
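The decision such a gate makes can be sketched in Python. The 200 ms p99 limit sustained over the window comes from Step 1. The 1% error-rate ceiling, the `Sample` type, and the three verdict strings are assumptions for illustration, and the window is assumed to hold one sample per minute for the last five minutes.

```python
from dataclasses import dataclass
from typing import Sequence

P99_LIMIT_MS = 200.0     # from Step 1: roll back if p99 > 200 ms for 5 min
MAX_ERROR_RATE = 0.01    # hypothetical ceiling; the plan does not give a number


@dataclass(frozen=True)
class Sample:
    p99_ms: float
    error_rate: float    # fraction of failed requests, 0.0 to 1.0


def gate_verdict(window: Sequence[Sample]) -> str:
    """Return "pass", "fail", or "inconclusive" for one analysis window."""
    if not window:
        return "inconclusive"  # no data is not the same as healthy
    # Assumed policy: any single error-rate breach fails the gate.
    if any(s.error_rate > MAX_ERROR_RATE for s in window):
        return "fail"
    # Latency fails only when the breach is sustained across the whole window.
    if all(s.p99_ms > P99_LIMIT_MS for s in window):
        return "fail"
    return "pass"
```

Treating an empty window as inconclusive rather than passing avoids promoting a canary during an observability outage.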
Step 5 — Design the Database Migration
The new router requires a new column fx_rate_snapshot on the payment_transactions table. This is a critical-path table with 50 million rows, so you cannot run a blocking ALTER TABLE in production. Use the expand-contract pattern in three phases:
- Expand (Deploy N): Add the column as nullable with no default. This is a metadata-only change on PostgreSQL, and on MySQL 8.0.12 and later with ALGORITHM=INSTANT, so it requires no downtime. Old code ignores the column; new code writes to it. Both versions run side by side during the canary window.
- Migrate (Deploy N): Run a background job in a feature-flagged worker that backfills the column for existing rows in batches of 1,000, sleeping 10 ms between batches to avoid I/O pressure on the primary.
- Contract (Deploy N+1, post-100% rollout): After the old code is fully retired, add the NOT NULL constraint and remove the dual-write logic. This is the cleanup deploy; ship it no sooner than 72 hours after full rollout so that an emergency rollback cannot break the column contract.
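The Migrate phase can be sketched as a batched backfill loop. The example below uses Python's built-in sqlite3 module only so that it runs anywhere; a production job would target the real database driver. The `id` primary key and the `default_rate` backfill value are assumptions, since the plan does not say what value existing rows should receive.

```python
import sqlite3
import time


def backfill_fx_rate_snapshot(
    conn: sqlite3.Connection,
    default_rate: float,          # hypothetical value for pre-existing rows
    batch_size: int = 1000,       # batch size from the Migrate phase
    pause_s: float = 0.010,       # 10 ms sleep between batches
) -> int:
    """Fill NULL fx_rate_snapshot values in small batches; return rows updated."""
    total = 0
    while True:
        # Pick the next batch of rows that still need a value.
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM payment_transactions "
            "WHERE fx_rate_snapshot IS NULL ORDER BY id LIMIT ?",
            (batch_size,),
        )]
        if not ids:
            return total
        conn.executemany(
            "UPDATE payment_transactions SET fx_rate_snapshot = ? WHERE id = ?",
            [(default_rate, row_id) for row_id in ids],
        )
        conn.commit()             # one short transaction per batch
        total += len(ids)
        time.sleep(pause_s)       # give the primary room between batches
```

Selecting on `fx_rate_snapshot IS NULL` makes the job safe to stop and restart: rows written by the new code, or by an earlier run, are skipped automatically.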
Step 6 — The Rollback Playbook
A rollback plan written during an incident is a bad plan. Write it before you deploy and link it from the PR description. For this service, there are three response tiers, each with its own trigger:
- Tier 1: Automatic (0 to 60 seconds). Argo Rollouts detects an SLO breach via the AnalysisTemplate and on-call is paged. A failed analysis aborts the rollout automatically; if the rollout is left Paused instead (for example, after an inconclusive analysis), the on-call engineer runs kubectl argo rollouts abort payment-router to shift 100% of traffic back to the stable pods. No code change is required. The new-router flag remains at its current percentage in case investigation reveals a false alarm.
- Tier 2: Circuit breaker (0 to 10 seconds with streaming). If the automated rollback is too slow or unavailable, the on-call engineer flips payments.new-router.circuit-open to true in the flag console. Traffic returns to the legacy path within one polling cycle (default: 30 seconds on the SDK; use streaming SSE for near-instant propagation). This is faster than a Kubernetes rollout because it requires no pod restart.
- Tier 3: Full revert (30 to 120 minutes). For catastrophic failures where the new binary itself is corrupted or the container cannot start, revert the Git SHA and trigger a fresh CI build. This path is rare and slow; the first two tiers should handle the vast majority of incidents.
Step 7 — Linking It All Together
A progressive delivery plan is not just technical configuration — it is a communication protocol for your entire organization. Before you merge the first PR for this release, the following artefacts must exist and be linked from the PR description:
- Risk surface document — the table from Step 1, reviewed by the team lead and the on-call rotation.
- Flag inventory entry — all three flags registered in the flag management console with owner, removal date, and description.
- Rollout YAML in Git — the Argo Rollouts manifest is version-controlled next to the service code, not managed ad-hoc.
- AnalysisTemplate in Git — SLO thresholds are code-reviewed, not set by one engineer in a UI at midnight.
- Rollback runbook linked from PagerDuty — when the on-call engineer is woken at 2 AM, the runbook is the first link in the PagerDuty alert.
- Scheduled cleanup tickets — three Jira tickets created on day one: remove the release flag, apply the NOT NULL constraint, delete the legacy routing code. Each has a due date and an owner.
This is the standard that top-tier engineering organizations hold themselves to. It takes investment to build, but the payoff is measured in incidents that never happen — the invisible successes that define world-class reliability engineering.