Deployment Strategies & Progressive Delivery

Project: A Progressive Delivery Plan

Every concept in this tutorial (canary releases, feature flags, rollback, automated analysis, expand-contract) exists to solve a single problem: how do you ship a change to a critical service without causing an outage? This capstone lesson walks you through designing a complete, production-grade progressive delivery system for a high-stakes service: a checkout payment processor handling thousands of transactions per second.

You will design the entire system end-to-end: the canary pipeline, the flag hierarchy, the SLO-based promotion gates, the database migration strategy, and the rollback playbook. This is the kind of approach that large engineering organizations such as Stripe, Amazon, and Google are known for applying to their most critical paths.

Scenario: Your team is replacing the legacy payment routing engine with a new one that supports dynamic currency conversion. The service processes 3,000 TPS at peak. A one-minute outage costs $180,000 in lost revenue. Your job is to design the delivery plan — not just the happy path, but every failure mode and the exact response to each.
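
As a rough worked example of why the canary percentages in this plan matter, the sketch below scales the scenario's $180,000-per-minute figure by the share of traffic exposed to the new engine. It assumes loss is proportional to traffic share and that a failing canary fails every request it serves, which is a simplification. The function name `exposure_per_minute` is illustrative.

```python
# Worst-case revenue at risk per minute for a given canary weight.
# Assumption: loss scales linearly with the share of traffic on the new engine.
FULL_OUTAGE_COST_PER_MIN = 180_000  # dollars per minute, from the scenario


def exposure_per_minute(canary_percent: int) -> float:
    """Dollars per minute at risk if the canary fails every request it serves."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return FULL_OUTAGE_COST_PER_MIN * canary_percent / 100


if __name__ == "__main__":
    for percent in (1, 10, 100):
        at_risk = exposure_per_minute(percent)
        print(f"{percent:>3}% canary: ${at_risk:,.0f} per minute at risk")
```

Under these assumptions a completely broken 1% canary puts $1,800 per minute at risk, against $180,000 per minute for an unstaged release.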

Step 1 — Map the Risk Surface

Before writing a single YAML file, enumerate what can go wrong. For each risk, you will assign a mitigation control from your progressive delivery toolkit:

  • New routing logic rejects valid cards — mitigation: canary at 1%, automated error-rate gate, feature flag kill switch.
  • Latency regression in the new engine — mitigation: p99 latency SLO gate; rollback if p99 > 200 ms for 5 min.
  • Database schema incompatible between old and new code — mitigation: expand-contract migration, dual-write period.
  • New dynamic currency conversion causes rounding errors — mitigation: currency-conversion flag starts OFF; only users who opt in see it; A/B experiment with statistical significance gate before full rollout.
  • Third-party FX rate API is down — mitigation: ops toggle kill switch disables dynamic conversion, falls back to static rates instantly.
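
The statistical significance gate on the currency-conversion experiment can be sketched as follows. This is a minimal illustration that assumes a pooled two-proportion z-test, which the plan itself does not prescribe; it uses the thresholds this plan sets for the experiment (p-value < 0.05, at least 10,000 conversions per variant). The function names are hypothetical.

```python
import math

MIN_CONVERSIONS_PER_VARIANT = 10_000  # threshold from this plan
ALPHA = 0.05                          # p-value threshold from this plan


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test (an assumed test choice)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / std_err
    # P(|Z| > |z|) for a standard normal variable
    return math.erfc(abs(z) / math.sqrt(2))


def experiment_gate_passes(conv_a: int, n_a: int, conv_b: int, n_b: int) -> bool:
    """True only if both variants have enough conversions and the result is significant."""
    if min(conv_a, conv_b) < MIN_CONVERSIONS_PER_VARIANT:
        return False
    return two_proportion_p_value(conv_a, n_a, conv_b, n_b) < ALPHA


if __name__ == "__main__":
    # control: 10,000 of 100,000 convert; treatment: 10,600 of 100,000 convert
    print(experiment_gate_passes(10_000, 100_000, 10_600, 100_000))
```

A passing gate only says the difference is unlikely to be noise; whether the treatment is the winner still depends on the direction of the difference and on the rounding-error checks described above.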

Step 2 — Design the Flag Hierarchy

You need three distinct flags for this release, each with a different lifecycle and owner:

  1. payments.new-router.enabled (release toggle). Controls whether traffic goes through the new routing engine at all. This is the top-level kill switch. Default: false. Removal date: 30 days after 100% rollout.
  2. payments.dynamic-currency.enabled (experiment toggle). Controls whether dynamic FX rates are offered to the user. Starts at 0%. Gated on statistical significance (p-value < 0.05, minimum 10,000 conversions per variant). Removal date: 14 days after winner declared.
  3. payments.new-router.circuit-open (ops toggle). When true, immediately routes 100% of traffic back to the legacy engine regardless of any other flag state. This is the emergency circuit breaker. It is not the same as a rollback: it acts in seconds. Long-lived; owned permanently by the payments on-call rotation.

Flag ordering matters. Evaluate the circuit breaker flag first in every request, before any other flag. If it is true, short-circuit to the legacy path immediately and do not evaluate downstream flags. This keeps the kill-switch path to a single flag evaluation, so it adds well under 10 ms per request even under extreme load; how quickly a flip takes effect still depends on how fast the flag change propagates to the service.
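
The evaluation order described above can be sketched as a small routing function. The `Flags` dataclass and `choose_route` are hypothetical names, the flag values are assumed to have been fetched from your flag SDK already, and the sketch assumes dynamic currency conversion is only reachable through the new router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flags:
    """Snapshot of the three flags for one request (hypothetical container)."""
    circuit_open: bool = False              # payments.new-router.circuit-open
    new_router_enabled: bool = False        # payments.new-router.enabled
    dynamic_currency_enabled: bool = False  # payments.dynamic-currency.enabled


def choose_route(flags: Flags) -> str:
    """Pick the code path for a request, checking the circuit breaker first."""
    # 1. Ops toggle wins over everything else: short-circuit to legacy.
    if flags.circuit_open:
        return "legacy"
    # 2. Release toggle decides whether the new engine is used at all.
    if not flags.new_router_enabled:
        return "legacy"
    # 3. Experiment toggle is only consulted on the new-router path.
    if flags.dynamic_currency_enabled:
        return "new-router+dynamic-fx"
    return "new-router"


if __name__ == "__main__":
    everything_on = Flags(circuit_open=True, new_router_enabled=True,
                          dynamic_currency_enabled=True)
    print(choose_route(everything_on))  # circuit breaker overrides the other flags
```

Keeping the breaker check as the first statement means a mistake in the downstream flag logic cannot prevent the fallback to the legacy path.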

Step 3 — Design the Canary Pipeline

The canary pipeline is the scaffolding that moves payments.new-router.enabled from 0% to 100% safely. Each stage has explicit entry criteria (SLOs that must be passing) and exit criteria (SLOs that, if violated, trigger automatic rollback).

[Figure: Canary pipeline with SLO promotion gates and automatic rollback to stable via the circuit-breaker flag. Stages: Deploy (0% traffic) -> Gate 1 (smoke ok) -> Canary 1% (30 min bake) -> Gate 2 (err < 0.1%, p99 < 200 ms) -> Canary 10% (60 min bake) -> Gate 3 (err < 0.05%, p99 < 150 ms) -> Full 100% (stable). A failed gate triggers auto-rollback: circuit-open flag = true.]

Encode this pipeline as an Argo Rollouts manifest so the orchestration is version-controlled and reproducible:

# payment-router-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-router
  namespace: payments
spec:
  replicas: 50
  selector:
    matchLabels:
      app: payment-router
  template:
    metadata:
      labels:
        app: payment-router
    spec:
      containers:
        - name: payment-router
          image: registry.internal/payment-router:2.4.0
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
  strategy:
    canary:
      analysis:
        templates:
          - templateName: payment-router-slo
        startingStep: 1  # begin analysis from step 1 (1% canary)
        args:
          - name: service-name
            value: payment-router
      canaryService: payment-router-canary
      stableService: payment-router-stable
      trafficRouting:
        istio:
          virtualService:
            name: payment-router-vsvc
            routes:
              - primary
      steps:
        - setWeight: 1    # 1% canary
        - pause: {duration: 30m}
        - setWeight: 10   # 10% canary
        - pause: {duration: 1h}
        - setWeight: 100  # full promotion
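
To make the promotion logic concrete, here is a minimal sketch of the gate decision at each canary weight. It uses the thresholds shown in the pipeline figure (error rate below 0.1% and p99 below 200 ms to leave the 1% stage, error rate below 0.05% and p99 below 150 ms to leave the 10% stage). The `Gate` type and `decide` function are hypothetical and only model the decision; they are not how Argo Rollouts is implemented.

```python
from typing import Dict, NamedTuple, Tuple


class Gate(NamedTuple):
    name: str
    max_error_rate: float  # fraction of requests, e.g. 0.001 == 0.1%
    max_p99_ms: float
    next_weight: int       # canary weight to set if the gate passes


# Keyed by the canary weight currently being baked.
GATES: Dict[int, Gate] = {
    1: Gate("Gate 2", 0.001, 200.0, 10),
    10: Gate("Gate 3", 0.0005, 150.0, 100),
}


def decide(current_weight: int, error_rate: float, p99_ms: float) -> Tuple[str, int]:
    """Return ("promote", next_weight) if both SLOs hold, else ("rollback", 0)."""
    gate = GATES[current_weight]
    if error_rate < gate.max_error_rate and p99_ms < gate.max_p99_ms:
        return ("promote", gate.next_weight)
    return ("rollback", 0)


if __name__ == "__main__":
    print(decide(1, error_rate=0.0004, p99_ms=120.0))
    print(decide(10, error_rate=0.0007, p99_ms=120.0))
```

Note that an error rate of 0.07% passes the gate after the 1% stage but fails the tighter gate after the 10% stage, which is the point of tightening thresholds as exposure grows.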

Step 4 — Wire the Automated Analysis

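The SLO gates are enforced by an AnalysisTemplate that queries your observability stack every 5 minutes. If a metric fails more measurements than its failureLimit allows, the analysis run fails and Argo Rollouts aborts the rollout, shifting traffic back to the stable version; if the result is only inconclusive, the rollout pauses instead. Paging the on-call engineer is handled by your alerting or notification setup: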

# payment-router-analysis.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payment-router-slo
  namespace: payments
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 5m
      failureLimit: 1  # tolerates one failed measurement; a second one fails the analysis
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"5..",
              version="canary"
            }[5m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              version="canary"
            }[5m]))
      successCondition: result[0] < 0.001  # < 0.1% error rate
    - name: p99-latency-ms
      interval: 5m
      failureLimit: 2  # tolerates two failed measurements; a third one fails the analysis
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(
                http_request_duration_seconds_bucket{
                  service="{{args.service-name}}",
                  version="canary"
                }[5m]
              )) by (le)
            ) * 1000
      successCondition: result[0] < 200  # p99 < 200 ms
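
The interaction between successCondition and failureLimit is easy to misread, so here is a small mental model in code. It assumes the documented rule that an analysis fails once the number of failed measurements exceeds failureLimit, and it deliberately ignores inconclusive results, query errors, and measurement counts. The function `analysis_outcome` is hypothetical and is not the controller's actual logic.

```python
from typing import Iterable


def analysis_outcome(measurements: Iterable[float], threshold: float,
                     failure_limit: int) -> str:
    """Return "Failed" once failed measurements exceed failure_limit, else "Successful"."""
    failures = 0
    for value in measurements:
        # Mirrors a successCondition of the form: result[0] < threshold
        if not value < threshold:
            failures += 1
            if failures > failure_limit:
                return "Failed"
    return "Successful"


if __name__ == "__main__":
    error_rates = [0.0002, 0.0020, 0.0003]  # one breach of the 0.1% threshold
    print(analysis_outcome(error_rates, threshold=0.001, failure_limit=0))
    print(analysis_outcome(error_rates, threshold=0.001, failure_limit=1))
```

In this model a failure limit of 0 fails the analysis on the first breach, while a limit of 1 tolerates a single breach. Pick the value per metric according to how noisy the metric is and how long you are willing to leave a bad canary in place.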

Step 5 — Design the Database Migration

The new router requires a new column fx_rate_snapshot on the payment_transactions table. This is a critical-path table with 50 million rows. You cannot do a blocking ALTER TABLE in production. Use the expand-contract pattern across three deployments:

  • Expand (Deploy N): Add the column as NULLABLE with no default. On MySQL 8.0 (InnoDB) this can run with ALGORITHM=INSTANT; on PostgreSQL, adding a nullable column without a default is a metadata-only change that needs only a brief lock. Neither rewrites the table. Old code ignores the column. New code writes to it. Both versions run simultaneously during the canary window.
  • Migrate (Deploy N+1): Run a background job via a feature-flagged worker that backfills the column for existing rows in batches of 1,000 rows with a 10 ms sleep between batches to avoid I/O pressure on the primary.
  • Contract (Deploy N+2, post-100% rollout): After the old code is fully retired, add the NOT NULL constraint and remove the dual-write logic. This is the cleanup deploy. Ship it no sooner than 72 hours after full rollout to allow for emergency rollback without breaking the column contract.

Never rename a column in a single deploy. Rename = add new column + dual-write + migrate data + drop old column, each in a separate deploy with a bake period between steps. A rename in one migration can lock the table and breaks any old code that still reads the old column name, and old and new code running side by side is exactly what you have during a canary window.
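
The batched backfill from the Migrate step can be sketched as below. This is an illustration only: it uses an in-memory SQLite database as a stand-in for the production MySQL or PostgreSQL primary, the `id` and `amount_cents` columns are invented for the demo, and the backfill value of 1.0 is a placeholder assumption because the plan does not say what legacy rows should contain. The batched UPDATE syntax will need adapting to your database's dialect.

```python
import sqlite3
import time


def backfill_fx_rate(conn: sqlite3.Connection, default_rate: float = 1.0,
                     batch_size: int = 1000, pause_s: float = 0.010) -> int:
    """Fill NULL fx_rate_snapshot values in small batches; return rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE payment_transactions SET fx_rate_snapshot = ? "
            "WHERE id IN (SELECT id FROM payment_transactions "
            "             WHERE fx_rate_snapshot IS NULL ORDER BY id LIMIT ?)",
            (default_rate, batch_size),
        )
        conn.commit()  # commit per batch so locks are held only briefly
        if cur.rowcount == 0:
            return total  # nothing left to backfill
        total += cur.rowcount
        time.sleep(pause_s)  # breathing room between batches


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payment_transactions ("
        "id INTEGER PRIMARY KEY, amount_cents INTEGER, fx_rate_snapshot REAL)"
    )
    conn.executemany(
        "INSERT INTO payment_transactions (amount_cents) VALUES (?)",
        [(100,)] * 2500,
    )
    print(backfill_fx_rate(conn), "rows backfilled")
```

Because each batch only selects rows that are still NULL, the job is safe to stop and restart, and it never overwrites values that new code has already written.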

Step 6 — The Rollback Playbook

A rollback plan written during an incident is a bad plan. Write it before you deploy and link it from the PR description. For this service, there are three response tiers, each with its own trigger:

  • Tier 1, automatic (0 to 60 seconds after a failed measurement): Argo Rollouts detects the SLO breach via the AnalysisTemplate, marks the analysis run as failed, aborts the rollout, and shifts 100% of traffic back to the stable pods. If the analysis is only inconclusive, the rollout pauses instead; the on-call engineer, paged by your alerting setup, can then run kubectl argo rollouts abort payment-router manually. No code change is required. The new-router flag remains at its current percentage in case investigation reveals a false alarm.
  • Tier 2, circuit breaker (0 to 10 seconds with streaming): If automated rollback is too slow or unavailable, the on-call engineer flips payments.new-router.circuit-open to true in the flag console. With polling, traffic returns to the legacy path within one polling cycle (default: 30 seconds on the SDK); use streaming SSE for near-instant propagation. This is faster than a Kubernetes rollout because it requires no pod restart.
  • Tier 3, full revert (30 to 120 minutes): For catastrophic failures where the new binary itself is corrupted or the container cannot start, revert the Git SHA and trigger a fresh CI build. This path is rare and slow; the first two tiers should handle the vast majority of incidents.

# Rollback runbook commands (paste into the incident doc)

# --- Tier 1: Abort the Argo rollout (shift traffic to stable pods) ---
kubectl argo rollouts abort payment-router -n payments
kubectl argo rollouts status payment-router -n payments  # verify stable

# --- Tier 2: Flip the circuit breaker via CLI (LaunchDarkly example) ---
# Requires LaunchDarkly CLI: brew install launchdarkly/tap/ld
# Illustrative syntax: confirm the install command and subcommand against
# your flag provider's CLI documentation before relying on it in an incident.
ld flag update \
  --project payments-prod \
  --flag payments.new-router.circuit-open \
  --value true \
  --environment production

# --- Verify: watch error rate drop in real time ---
watch -n 5 'kubectl exec -n monitoring deploy/prometheus -- \
  promtool query instant http://localhost:9090 \
  "sum(rate(http_requests_total{service=\"payment-router\",status=~\"5..\"}[1m]))"'

# --- Tier 3: Full git revert + redeploy ---
git revert HEAD --no-edit
git push origin main  # CI triggers and deploys the reverted image automatically

Step 7 — Linking It All Together

A progressive delivery plan is not just technical configuration — it is a communication protocol for your entire organization. Before you merge the first PR for this release, the following artefacts must exist and be linked from the PR description:

  • Risk surface document — the list from Step 1, reviewed by the team lead and the on-call rotation.
  • Flag inventory entry — all three flags registered in the flag management console with owner, removal date, and description.
  • Rollout YAML in Git — the Argo Rollouts manifest is version-controlled next to the service code, not managed ad-hoc.
  • AnalysisTemplate in Git — SLO thresholds are code-reviewed, not set by one engineer in a UI at midnight.
  • Rollback runbook linked from PagerDuty — when the on-call engineer is woken at 2 AM, the runbook is the first link in the PagerDuty alert.
  • Scheduled cleanup tickets — three Jira tickets created on day one: remove the release flag, apply the NOT NULL constraint, delete the legacy routing code. Each has a due date and an owner.
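
As one way to keep the cleanup dates honest, the sketch below derives each flag's removal due date from the windows stated in the flag hierarchy (30 days after 100% rollout for the release toggle, 14 days after a winner is declared for the experiment toggle, no removal date for the long-lived ops toggle). The dictionary and helper functions are hypothetical, and the milestone dates in the demo are made up.

```python
from datetime import date, timedelta
from typing import Dict, Optional

# Days between the flag's milestone and its removal date; None means long-lived.
REMOVAL_WINDOWS_DAYS: Dict[str, Optional[int]] = {
    "payments.new-router.enabled": 30,         # after 100% rollout
    "payments.dynamic-currency.enabled": 14,   # after winner declared
    "payments.new-router.circuit-open": None,  # permanent ops toggle
}


def removal_due(flag: str, milestone: date) -> Optional[date]:
    """Date by which the flag should be removed, or None for long-lived flags."""
    window = REMOVAL_WINDOWS_DAYS[flag]
    return None if window is None else milestone + timedelta(days=window)


def is_overdue(flag: str, milestone: date, today: date) -> bool:
    """True if the flag has a removal date and that date has passed."""
    due = removal_due(flag, milestone)
    return due is not None and today > due


if __name__ == "__main__":
    full_rollout = date(2024, 1, 1)  # example milestone, not from the plan
    print(removal_due("payments.new-router.enabled", full_rollout))
    print(is_overdue("payments.new-router.enabled", full_rollout, date(2024, 2, 1)))
```

A check like this could run in CI or a scheduled job so that an overdue flag fails loudly instead of relying on someone remembering the ticket.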

The real measure of a progressive delivery plan is how boring the deploy is. If your post-deploy runbook says "watch the dashboard for 30 minutes," you have not automated enough. The goal is that the engineer who ships the change can walk away after hitting merge: the pipeline handles promotion, gates, and rollback automatically. Humans get involved only when something genuinely unexpected happens that requires judgment.

This is the standard that top-tier engineering organizations hold themselves to. It takes investment to build, but the payoff is measured in incidents that never happen — the invisible successes that define world-class reliability engineering.