Deployment Strategies & Progressive Delivery

Project: A Progressive Delivery Plan

Every concept in this tutorial (canary releases, feature flags, rollback, automated analysis, expand-contract) exists to solve a single problem: how do you ship a change to a critical service without causing an outage? This capstone lesson walks you through designing a complete, production-grade progressive delivery system for a high-stakes service: a checkout payment processor handling thousands of transactions per second.

You will design the entire system end-to-end: the canary pipeline, the flag hierarchy, the SLO promotion gates, the database migration strategy, and the rollback playbook. This is broadly how large engineering organizations such as Stripe, Amazon, and Google are known to treat releases on their most critical paths.

Scenario: Your team is replacing the legacy payment routing engine with a new one that supports dynamic currency conversion. The service processes 3,000 TPS at peak. A one-minute outage costs $180,000 in lost revenue. Your job is to design the delivery plan — not just the happy path, but every failure mode and the exact response to each.
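
The scenario numbers can be turned into a rough blast-radius estimate per canary stage. The sketch below is only arithmetic on the figures above; it assumes that loss scales linearly with the share of traffic affected, and the function names are illustrative.

```python
# Scenario figures from the lesson.
PEAK_TPS = 3_000
OUTAGE_COST_PER_MINUTE = 180_000  # USD, full outage at peak


def implied_revenue_per_transaction() -> float:
    """Revenue per transaction implied by the scenario numbers."""
    transactions_per_minute = PEAK_TPS * 60
    return OUTAGE_COST_PER_MINUTE / transactions_per_minute


def worst_case_loss_per_minute(canary_percent: float) -> float:
    """Loss per minute if every request routed to the canary fails.

    Assumption: loss scales linearly with the share of traffic affected.
    """
    return OUTAGE_COST_PER_MINUTE * canary_percent / 100


if __name__ == "__main__":
    print(f"implied revenue per transaction: ${implied_revenue_per_transaction():.2f}")
    for percent in (1, 10, 100):
        loss = worst_case_loss_per_minute(percent)
        print(f"{percent:>3}% canary: ${loss:,.0f} per minute")
```

Under that assumption, a canary that fails every request at 1% of traffic costs a hundredth of a full outage per minute, which is the economic argument for starting small.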

Step 1 — Map the Risk Surface

Before writing a single YAML file, enumerate what can go wrong. For each risk, you will assign a mitigation control from your progressive delivery toolkit:

New routing logic rejects valid cards — mitigation: canary at 1%, automated error-rate gate, feature flag kill switch.
Latency regression in the new engine — mitigation: p99 latency SLO gate; rollback if p99 > 200 ms for 5 min.
Database schema incompatible between old and new code — mitigation: expand-contract migration, dual-write period.
New dynamic currency conversion causes rounding errors — mitigation: currency-conversion flag starts OFF; only users who opt in see it; A/B experiment with statistical significance gate before full rollout.
Third-party FX rate API is down — mitigation: ops toggle kill switch disables dynamic conversion, falls back to static rates instantly.

Step 2 — Design the Flag Hierarchy

You need three distinct flags for this release, each with a different lifecycle and owner:

payments.new-router.enabled — Release toggle. Controls whether traffic goes through the new routing engine at all. This is the top-level kill switch. Default: false. Removal date: 30 days after 100% rollout.
payments.dynamic-currency.enabled — Experiment toggle. Controls whether dynamic FX rates are offered to the user. Starts at 0%. Gated on statistical significance (p-value < 0.05, minimum 10,000 conversions per variant). Removal date: 14 days after winner declared.
payments.new-router.circuit-open — Ops toggle. When true, immediately routes 100% of traffic back to the legacy engine regardless of any other flag state. This is the emergency circuit breaker, not the same as a rollback — it acts in seconds. Long-lived; owned by the payments on-call rotation permanently.
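
The significance gate on the experiment toggle can be sketched as a small check. The lesson does not say which statistical test to use, so this example assumes a two-sided two-proportion z-test; the function names and the sample counts are hypothetical, and only the thresholds (p-value < 0.05, 10,000 conversions per variant) come from the flag definition above.

```python
import math

MIN_CONVERSIONS_PER_VARIANT = 10_000
ALPHA = 0.05


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variance at all: nothing to distinguish
    z = (conv_b / n_b - conv_a / n_a) / std_err
    # For a standard normal, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def experiment_gate_passes(conv_a: int, n_a: int, conv_b: int, n_b: int) -> bool:
    """True only when both variants have enough data and the difference is significant."""
    enough_data = min(conv_a, conv_b) >= MIN_CONVERSIONS_PER_VARIANT
    return enough_data and two_proportion_p_value(conv_a, n_a, conv_b, n_b) < ALPHA


if __name__ == "__main__":
    # Hypothetical counts: control vs. dynamic-currency variant.
    print(experiment_gate_passes(10_000, 100_000, 10_600, 100_000))
```

The gate only says that the variants differ; whether the new variant is the winner still depends on the direction of the difference and on guardrail metrics such as refund or dispute rates.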

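Flag ordering matters. Evaluate the circuit breaker flag first in every request, before any other flag. If it is true, short-circuit to the legacy path immediately and do not evaluate downstream flags. Server-side flag SDKs typically evaluate flags from an in-memory cache, so this check adds well under 10 ms per request even under extreme load. How quickly a flip reaches every instance is a separate question that depends on the SDK's update mechanism (polling or streaming).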
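
A minimal sketch of that evaluation order follows. The flag keys are the ones defined above; `decide_route`, `RoutingDecision`, and the `is_enabled` callback are hypothetical names standing in for your flag SDK, and the sketch assumes dynamic currency conversion is only offered on the new engine.

```python
from dataclasses import dataclass
from typing import Callable

CIRCUIT_OPEN = "payments.new-router.circuit-open"
NEW_ROUTER = "payments.new-router.enabled"
DYNAMIC_CURRENCY = "payments.dynamic-currency.enabled"


@dataclass(frozen=True)
class RoutingDecision:
    engine: str             # "legacy" or "new"
    dynamic_currency: bool  # whether dynamic FX rates are offered


def decide_route(is_enabled: Callable[[str], bool]) -> RoutingDecision:
    """Evaluate flags in kill-switch-first order for a single request."""
    # 1. Ops toggle first: when the circuit is open, skip every downstream flag.
    if is_enabled(CIRCUIT_OPEN):
        return RoutingDecision(engine="legacy", dynamic_currency=False)
    # 2. Release toggle: is this request routed through the new engine at all?
    if not is_enabled(NEW_ROUTER):
        return RoutingDecision(engine="legacy", dynamic_currency=False)
    # 3. Experiment toggle: only meaningful once the request is on the new engine.
    return RoutingDecision(engine="new", dynamic_currency=is_enabled(DYNAMIC_CURRENCY))


if __name__ == "__main__":
    flags = {CIRCUIT_OPEN: False, NEW_ROUTER: True, DYNAMIC_CURRENCY: False}
    print(decide_route(lambda key: flags[key]))
```

Passing the lookup as a callback keeps the ordering logic testable without a live flag service, and makes it easy to assert that an open circuit never touches the other two flags.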

Step 3 — Design the Canary Pipeline

The canary pipeline is the scaffolding that moves payments.new-router.enabled from 0% to 100% safely. Each stage has explicit entry criteria (SLOs that must be passing) and exit criteria (SLOs that, if violated, trigger automatic rollback).

Figure: Canary pipeline with SLO promotion gates and automatic rollback to stable via the circuit-breaker flag.

Encode this pipeline as an Argo Rollouts manifest so the orchestration is version-controlled and reproducible:

# payment-router-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-router
  namespace: payments
spec:
  replicas: 50
  selector:
    matchLabels:
      app: payment-router
  template:
    metadata:
      labels:
        app: payment-router
    spec:
      containers:
        - name: payment-router
          image: registry.internal/payment-router:2.4.0
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
  strategy:
    canary:
      analysis:
        templates:
          - templateName: payment-router-slo
        startingStep: 1       # 0-based step index: analysis starts at the first pause, once the 1% weight is set
        args:
          - name: service-name
            value: payment-router
      canaryService: payment-router-canary
      stableService: payment-router-stable
      trafficRouting:
        istio:
          virtualService:
            name: payment-router-vsvc
            routes:
              - primary
      steps:
        - setWeight: 1         # 1% canary
        - pause: {duration: 30m}
        - setWeight: 10        # 10% canary
        - pause: {duration: 1h}
        - setWeight: 100       # full promotion

Step 4 — Wire the Automated Analysis

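The SLO gates are enforced by an AnalysisTemplate that queries your observability stack every 5 minutes. If a metric exceeds its failure limit, the analysis run fails and Argo Rollouts aborts the rollout, shifting traffic back to the stable version. Configure a notification trigger or an alert rule so that the on-call engineer is paged when this happens: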

# payment-router-analysis.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payment-router-slo
  namespace: payments
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 5m
      failureLimit: 1          # tolerates one failed measurement; a second failure fails the metric
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"5..",
              version="canary"
            }[5m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              version="canary"
            }[5m]))
      successCondition: result[0] < 0.001   # < 0.1% error rate

    - name: p99-latency-ms
      interval: 5m
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99, sum(rate(
              http_request_duration_seconds_bucket{
                service="{{args.service-name}}",
                version="canary"
              }[5m]
            )) by (le)) * 1000
      successCondition: result[0] < 200     # p99 < 200 ms
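
To make the failure limits concrete, here is a simplified Python model of how a metric accumulates failed measurements. It assumes the documented Argo Rollouts semantics, in which failureLimit is the number of failed measurements tolerated before the metric is marked Failed. The function name and the sample values are illustrative, and the real controller also tracks errors and inconclusive results.

```python
from typing import Callable, Iterable


def metric_fails(
    measurements: Iterable[float],
    success_condition: Callable[[float], bool],
    failure_limit: int,
) -> bool:
    """Return True once failed measurements exceed failure_limit."""
    failures = 0
    for value in measurements:
        if not success_condition(value):
            failures += 1
            if failures > failure_limit:
                return True
    return False


def error_rate_ok(rate: float) -> bool:
    return rate < 0.001  # < 0.1% error rate


def p99_ok(latency_ms: float) -> bool:
    return latency_ms < 200  # p99 < 200 ms


if __name__ == "__main__":
    # Illustrative samples, one per 5-minute interval.
    print(metric_fails([0.0005, 0.002], error_rate_ok, failure_limit=1))
    print(metric_fails([0.002, 0.0005, 0.003], error_rate_ok, failure_limit=1))
    print(metric_fails([250, 250, 250], p99_ok, failure_limit=2))
```

Under this model, with a 5-minute interval the latency gate needs three bad measurements to fail, so it reacts to a sustained regression rather than a single slow window.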

Step 5 — Design the Database Migration

The new router requires a new column fx_rate_snapshot on the payment_transactions table. This is a critical-path table with 50 million rows. You cannot do a blocking ALTER TABLE in production. Use the expand-contract pattern across three deployments:

Expand (Deploy N): Add the column as NULLABLE with no default. On PostgreSQL this is a metadata-only change, and on MySQL 8.0.12+ (InnoDB) it can run with ALGORITHM=INSTANT, so neither rewrites the table. Old code ignores the column. New code writes to it. Both versions run simultaneously during the canary window.
Migrate (Deploy N+1): Run a background job via a feature-flagged worker that backfills the column for existing rows in batches of 1,000, sleeping 10 ms between batches to avoid I/O pressure on the primary.
Contract (Deploy N+2, post-100% rollout): After the old code is fully retired, add the NOT NULL constraint and remove the dual-write logic. This is the cleanup deploy: ship it no sooner than 72 hours after full rollout to allow for emergency rollback without breaking the column contract.
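
The backfill step can be sketched as a small batched worker. This example uses Python's built-in sqlite3 module purely so that it runs anywhere; a production worker would use the driver for your actual database. The function name is hypothetical, and so is the placeholder rate of 1.0, since the lesson does not say what value historical rows should receive.

```python
import sqlite3
import time

BATCH_SIZE = 1_000
SLEEP_SECONDS = 0.010  # 10 ms pause between batches


def backfill_fx_rate_snapshot(
    conn: sqlite3.Connection,
    placeholder_rate: float = 1.0,  # hypothetical value for historical rows
    batch_size: int = BATCH_SIZE,
    sleep_seconds: float = SLEEP_SECONDS,
) -> int:
    """Fill NULL fx_rate_snapshot values in small batches; return rows updated."""
    total = 0
    while True:
        cursor = conn.execute(
            """
            UPDATE payment_transactions
               SET fx_rate_snapshot = ?
             WHERE id IN (
                   SELECT id
                     FROM payment_transactions
                    WHERE fx_rate_snapshot IS NULL
                    ORDER BY id
                    LIMIT ?)
            """,
            (placeholder_rate, batch_size),
        )
        conn.commit()  # one short transaction per batch
        if cursor.rowcount == 0:
            return total
        total += cursor.rowcount
        time.sleep(sleep_seconds)  # give the primary room to breathe


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE payment_transactions "
        "(id INTEGER PRIMARY KEY, fx_rate_snapshot REAL)"
    )
    db.executemany(
        "INSERT INTO payment_transactions (fx_rate_snapshot) VALUES (?)",
        [(None,)] * 2_500,
    )
    print(backfill_fx_rate_snapshot(db))
```

Because each batch selects only rows that are still NULL, the job is safe to stop and restart, and it never overwrites values that the new code has already written.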

Never rename a column in a single deploy. Rename = add new column + dual-write + migrate data + drop old column, each in a separate deploy with a bake period between steps. A rename in one migration locks the table and breaks old code reading the old column name simultaneously — which is exactly what you have during a canary window.

Step 6 — The Rollback Playbook

A rollback plan written during an incident is a bad plan. Write it before you deploy and link it from the PR description. For this service, there are three response tiers, each with its own trigger:

Tier 1: Automatic (0 to 60 seconds after the analysis fails). Argo Rollouts detects the SLO breach via the AnalysisTemplate, marks the analysis run Failed, and aborts the rollout, shifting 100% of traffic back to the stable pods. If the run ends Inconclusive instead, the rollout pauses and the on-call engineer runs kubectl argo rollouts abort payment-router to do the same manually. No code change is required. The payments.new-router.enabled flag stays at its current percentage in case investigation reveals a false alarm. Detection itself can take one or more 5-minute analysis intervals, which is why Tier 2 exists.
Tier 2: Circuit breaker (0 to 10 seconds with streaming, up to 30 seconds with polling). If automated rollback is too slow or unavailable, the on-call engineer flips payments.new-router.circuit-open to true in the flag console. Traffic returns to the legacy path within one polling cycle (default: 30 seconds on the SDK); use streaming (SSE) for near-instant propagation. This is faster than a Kubernetes rollout because it changes no pods and no traffic routing rules.
Tier 3: Full revert (30 to 120 minutes). For catastrophic failures where the new binary itself is corrupted or the container cannot start, revert the offending commit and trigger a fresh CI build. This path is rare and slow; the first two tiers should handle the vast majority of incidents.

# Rollback runbook commands (paste into incident doc as-is)

# --- Tier 1: Abort the Argo rollout (shift traffic to stable pods) ---
kubectl argo rollouts abort payment-router -n payments
kubectl argo rollouts status payment-router -n payments   # verify stable

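# --- Tier 2: Flip the circuit breaker via CLI (LaunchDarkly example) ---
# CLI name, install command, and flags vary by vendor and version;
# verify this invocation against your flag CLI's help output before relying on it.
# Requires the LaunchDarkly CLI: brew install launchdarkly/tap/ld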
ld flag update \
  --project payments-prod \
  --flag payments.new-router.circuit-open \
  --value true \
  --environment production

# --- Verify: watch error rate drop in real time ---
watch -n 5 'kubectl exec -n monitoring deploy/prometheus -- \
  promtool query instant http://localhost:9090 \
  "sum(rate(http_requests_total{service=\"payment-router\",status=~\"5..\"}[1m]))"'

# --- Tier 3: Full git revert + redeploy ---
# Assumes the bad change is the latest commit on main; otherwise pass its SHA instead of HEAD
git revert HEAD --no-edit
git push origin main
# CI triggers and deploys the reverted image automatically

Step 7 — Linking It All Together

A progressive delivery plan is not just technical configuration — it is a communication protocol for your entire organization. Before you merge the first PR for this release, the following artefacts must exist and be linked from the PR description:

Risk surface document — the table from Step 1, reviewed by the team lead and the on-call rotation.
Flag inventory entry — all three flags registered in the flag management console with owner, removal date, and description.
Rollout YAML in Git — the Argo Rollouts manifest is version-controlled next to the service code, not managed ad-hoc.
AnalysisTemplate in Git — SLO thresholds are code-reviewed, not set by one engineer in a UI at midnight.
Rollback runbook linked from PagerDuty — when the on-call engineer is woken at 2 AM, the runbook is the first link in the PagerDuty alert.
Scheduled cleanup tickets — three Jira tickets created on day one: remove the release flag, apply the NOT NULL constraint, delete the legacy routing code. Each has a due date and an owner.

The real measure of a progressive delivery plan is how boring the deploy is. If your post-deploy runbook says "watch the dashboard for 30 minutes," you have not automated enough. The goal is that the engineer who ships the change can walk away after hitting merge — the pipeline handles promotion, gates, and rollback automatically. Humans get involved only when something genuinely unexpected happens that requires judgment.

This is the standard that top-tier engineering organizations hold themselves to. It takes investment to build, but the payoff is measured in incidents that never happen — the invisible successes that define world-class reliability engineering.