Deployment Strategies & Progressive Delivery

Rollbacks & Roll-Forward

No deployment strategy eliminates failures — they all reduce the blast radius. When a bad deploy slips through your canary analysis, your feature flag gates, or your smoke tests, two recovery paths exist: roll back (restore the previous known-good state) or roll forward (ship a targeted fix as the next release). Choosing the wrong path under pressure is one of the most costly operational mistakes you can make. This lesson codifies the mechanics, the decision framework, and the safety nets that let you recover in minutes rather than hours.

Why Rollback Is Not Free

Many teams treat rollback as a guaranteed escape hatch. It is not. A rollback is another deployment — it carries its own risk, its own migration window, and its own potential for failure. The assumptions that make rollback fast and safe are:

  • The previous artifact still exists and is immutable. Container registries, S3 deployment buckets, and Helm chart repositories must retain old versions. If you allow tag mutation (e.g., re-pushing to :latest), you have no artifact to roll back to.
  • The database schema is backward-compatible with the previous code. If the current release ran an additive migration (new column, new table), rolling back the code is safe — the old code simply ignores the new column. If the release ran a destructive migration (drop column, rename column, change type), rollback is catastrophically unsafe unless you have already applied the Expand-Contract pattern.
  • External state has not advanced beyond the point of no return. If the new code sent emails, billed customers, or published events to Kafka, rolling back the service does not undo those side-effects. Plan compensating transactions or accept the drift.

The most dangerous moment in an incident is when an engineer has shipped a destructive ALTER TABLE migration and then tries to roll back. If the column has been dropped in production, no number of kubectl rollout undo commands will bring it back. Reverting a database change must be the last resort of a rollback plan, not the first instinct. Apply migrations in Expand-Contract order: add the new schema before deploying the code that uses it, and remove the old schema only once the previous release is no longer a rollback target.
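
The additive versus destructive distinction above can be checked mechanically before a release is approved. The following is a minimal sketch, not a complete SQL analysis: classify_migration is a hypothetical helper, and its keyword list (DROP COLUMN/TABLE, RENAME, ALTER COLUMN ... TYPE) is an assumption that will miss some destructive statements and flag some harmless ones, such as a comment that mentions DROP TABLE.

```shell
# Hypothetical helper: classify a SQL migration read from stdin as
# "destructive" or "additive" using a deliberately small keyword list.
classify_migration() {
  # -i: case-insensitive, -E: extended regex, -q: exit status only
  if grep -iEq 'drop[[:space:]]+(column|table)|rename[[:space:]]+(column|to)|alter[[:space:]]+column[^;]*type'; then
    echo "destructive"
  else
    echo "additive"
  fi
}

# Usage (the file name is illustrative):
#   classify_migration < migrations/0042_add_currency.sql
```

A pipeline step could refuse to mark a release as rollback-safe whenever the result is destructive, which forces the Expand-Contract conversation to happen before the deploy rather than during the incident.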

Fast Rollback Mechanics by Platform

Each deployment target has its own rollback primitive. Knowing all of them — and their latencies — is essential.

Kubernetes: kubectl rollout undo

Kubernetes keeps the last ten ReplicaSet revisions by default (controlled by revisionHistoryLimit). A rollback scales a retained ReplicaSet back up instead of building or pushing anything new, so it is fast, typically completing in under 60 seconds for a small Deployment when the nodes still have the old image cached:

    # Inspect available revisions
    kubectl rollout history deployment/payments-api -n prod

    # Roll back to the immediately previous revision
    kubectl rollout undo deployment/payments-api -n prod

    # Roll back to a specific revision (e.g., revision 3)
    kubectl rollout undo deployment/payments-api -n prod --to-revision=3

    # Watch the rollback progress
    kubectl rollout status deployment/payments-api -n prod --timeout=120s

    # Verify the active image after rollback
    kubectl get deployment payments-api -n prod \
      -o jsonpath='{.spec.template.spec.containers[0].image}'

Set revisionHistoryLimit: 5 in your Deployment spec explicitly — the default of 10 wastes etcd storage at scale. Five revisions gives you enough rollback headroom while staying lean. In a GitOps world (ArgoCD, Flux), you roll back by reverting the Git commit; kubectl rollout undo should be reserved for break-glass emergencies only, because it creates drift between Git and the cluster.
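
For break-glass use, the commands above can be wrapped so that the rollback is verified rather than assumed. This is a sketch under stated assumptions: rollback_deployment is a hypothetical name, and the check assumes the bad deploy changed the image of the first container. A rollback that only reverts configuration would be reported as a failure by this check.

```shell
# Hypothetical break-glass wrapper: undo, wait, then confirm the image changed.
# Usage: rollback_deployment <deployment> <namespace>
rollback_deployment() {
  deploy="$1"
  ns="$2"
  jsonpath='{.spec.template.spec.containers[0].image}'

  # Record the image that is running before the rollback
  before=$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath="$jsonpath") || return 1

  kubectl rollout undo "deployment/$deploy" -n "$ns" || return 1
  kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=120s || return 1

  # Read the image again; it should differ if the rollback took effect
  after=$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath="$jsonpath") || return 1

  if [ "$before" = "$after" ]; then
    echo "rollback did not change the image ($after)" >&2
    return 1
  fi
  echo "rolled back: $before -> $after"
}

# Usage: rollback_deployment payments-api prod
```

In a GitOps setup, follow such a rollback with a Git revert so that the repository and the cluster agree again.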

Helm: Rolling Back a Chart Release

Helm 3 stores each release revision in a Kubernetes Secret by default. helm rollback re-applies the chart and values recorded for the target revision and records the result as a new revision in the history:

    # List Helm release history
    helm history payments-api -n prod

    # Roll back to the previous release revision
    helm rollback payments-api -n prod

    # Roll back to a specific release revision with a 2-minute timeout
    helm rollback payments-api 3 -n prod --timeout 2m0s --wait

    # Verify the rolled-back release
    helm status payments-api -n prod

AWS ECS: Task Definition Rollback

    # List recent task definition revisions
    aws ecs list-task-definitions \
      --family-prefix payments-api \
      --sort DESC \
      --query 'taskDefinitionArns[0:5]' \
      --output table

    # Update the ECS service to run the previous task definition (e.g., revision 47)
    aws ecs update-service \
      --cluster prod \
      --service payments-api \
      --task-definition payments-api:47 \
      --force-new-deployment \
      --region us-east-1

    # Wait for the service to stabilize
    aws ecs wait services-stable \
      --cluster prod \
      --services payments-api \
      --region us-east-1
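
The revision number in the example above (47) has to come from somewhere. One option, sketched here, is to derive it from the task definition the service is currently running. previous_task_def is a hypothetical helper, and it assumes that revision N-1 is still registered and was the last good release, which is not guaranteed if several revisions were registered in quick succession.

```shell
# Hypothetical helper: derive the prior revision from a task definition
# ARN or a "family:revision" string.
previous_task_def() {
  ref="${1##*/}"          # strip "arn:aws:ecs:...:task-definition/" if present
  family="${ref%:*}"      # text before the last colon
  revision="${ref##*:}"   # text after the last colon

  case "$revision" in
    ''|*[!0-9]*)
      echo "not a family:revision reference: $1" >&2
      return 1
      ;;
  esac

  if [ "$revision" -le 1 ]; then
    echo "no revision before $ref" >&2
    return 1
  fi
  echo "$family:$((revision - 1))"
}

# Usage:
#   current=$(aws ecs describe-services --cluster prod --services payments-api \
#     --query 'services[0].taskDefinition' --output text)
#   aws ecs update-service --cluster prod --service payments-api \
#     --task-definition "$(previous_task_def "$current")"
```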

The Decision Framework: Roll Back or Roll Forward?

Rollback vs Roll-Forward Decision Tree

    Production incident triggered by new deploy
      Destructive DB migration ran in this deploy?
        YES: Roll forward (DB cannot go back)
        NO:  Root cause identified and fix is < 30 min away?
          YES: Roll forward (ship the fix fast)
          NO:  Previous artifact is immutable and DB-safe?
            YES: Roll back (fastest recovery)
            NO:  Escalate (engage incident bridge, manual triage)

Decision tree: roll back vs. roll forward based on DB state, fix readiness, and artifact integrity.

The framework above distills to three questions asked in sequence:

  1. Has the database moved forward destructively? If yes, rollback is off the table — the old code expects a schema that no longer exists. Roll forward with a targeted fix.
  2. Do you know what is broken, and can you fix it fast? If yes (a missing nil-check, a wrong environment variable, a bad config value), it is often faster and safer to ship the fix than to coordinate a rollback — especially when traffic is already flowing through the new code path and your canary infrastructure is warmed up.
  3. Is the previous artifact intact and DB-safe? If both answers are yes, roll back immediately. Every minute of degraded service costs real money and real user trust. If either answer is no, escalate to the incident bridge for manual triage.
At Google, the default posture for most services is roll forward. Services are designed so that any version N+1 can be fixed by a version N+2 hotfix release rather than returning to N-1. This philosophy is enforced structurally: all schema changes are backward-compatible by policy, all releases are one atomic diff from HEAD, and the CI pipeline for a hotfix branch runs in under 10 minutes. Design your system so roll-forward is always a viable option.
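
The three questions of the decision framework can be encoded directly, which makes the order of evaluation explicit and gives a runbook something unambiguous to call. A minimal sketch; decide_recovery and its yes/no arguments are illustrative names, not part of any tool.

```shell
# Sketch of the three-question decision tree. Each argument is "yes" or "no".
# Usage: decide_recovery <destructive_migration> <fix_ready> <artifact_safe>
decide_recovery() {
  destructive_migration="$1"   # did this deploy run a destructive DB migration?
  fix_ready="$2"               # root cause known and fix under ~30 minutes away?
  artifact_safe="$3"           # previous artifact immutable and DB-compatible?

  if [ "$destructive_migration" = "yes" ]; then
    echo "roll-forward"        # the old code expects a schema that no longer exists
  elif [ "$fix_ready" = "yes" ]; then
    echo "roll-forward"        # shipping the fix is the faster path
  elif [ "$artifact_safe" = "yes" ]; then
    echo "roll-back"
  else
    echo "escalate"            # no safe automatic path: manual triage
  fi
}
```

For example, decide_recovery no no yes prints roll-back, while decide_recovery yes no yes prints roll-forward because the first question short-circuits the rest.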

GitOps Rollback — The Production-Safe Pattern

In a GitOps environment (ArgoCD, Flux), the Git repository is the source of truth. The correct rollback is a Git revert, not a kubectl command, because imperative kubectl changes create drift between the cluster state and the declared state in Git:

    # GitOps rollback: revert the offending commit
    git log --oneline -5
    # a3f9c12 deploy: bump payments-api to v2.4.1   <-- bad deploy
    # 8e1b047 deploy: bump payments-api to v2.4.0   <-- last good state
    # ...

    # Revert the bad commit
    git revert a3f9c12 --no-edit

    # Push; ArgoCD / Flux will auto-sync within the poll interval (<=3 min)
    git push origin main

    # For immediate recovery, trigger a manual sync in ArgoCD
    argocd app sync payments --prune --force

    # Or, via the ArgoCD CLI, redeploy an entry from the app's deployment history
    # (history IDs are listed by: argocd app history payments)
    # argocd app rollback payments <history-id>

The argocd app rollback command redeploys a revision recorded in the application's deployment history, identified by its history ID. It is an escape hatch for when the reverted commit has not yet propagated, and ArgoCD refuses it while automated sync is enabled on the application. In steady state, a git revert + push is the canonical path because it produces an auditable change in Git history and does not leave the cluster in a state that ArgoCD considers "OutOfSync."

Deployment Safety Nets — Preventing the Need for Rollback

The best rollback is the one you never need. Three safety nets drastically reduce rollback frequency:

1. Automated Smoke Tests in the Deploy Pipeline

    # GitHub Actions: smoke test gate after a Helm upgrade
    - name: Deploy to staging
      run: |
        helm upgrade --install payments-api ./charts/payments \
          --namespace staging \
          --set image.tag=${{ github.sha }} \
          --wait --timeout 3m

    - name: Run smoke tests
      env:
        # Assumes a repository secret with this name exists
        SMOKE_TEST_TOKEN: ${{ secrets.SMOKE_TEST_TOKEN }}
      run: |
        # Wait for the rollout to complete
        kubectl rollout status deployment/payments-api -n staging --timeout=90s

        # Hit the health endpoint and fail fast if it does not return 200
        HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
          https://payments-staging.internal/health)
        [ "$HTTP_STATUS" = "200" ] || { echo "Smoke test FAILED: HTTP $HTTP_STATUS"; exit 1; }

        # Hit a critical business endpoint
        HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
          -H "Authorization: Bearer $SMOKE_TEST_TOKEN" \
          https://payments-staging.internal/api/v1/ping)
        [ "$HTTP_STATUS" = "200" ] || { echo "API smoke test FAILED: HTTP $HTTP_STATUS"; exit 1; }

    - name: Promote to production (only if smoke tests pass)
      if: success()
      run: |
        helm upgrade --install payments-api ./charts/payments \
          --namespace prod \
          --set image.tag=${{ github.sha }} \
          --wait --timeout 5m

2. Progressive Traffic Shifting with Automatic Abort

Argo Rollouts (or Flagger) can be configured to abort and roll back automatically when error rate or latency thresholds are breached during a canary promotion. The rollout spec encodes the safety net directly:

    # argo-rollout.yaml: automatic rollback on SLO breach
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: payments-api
      namespace: prod
    spec:
      replicas: 20
      revisionHistoryLimit: 3
      # selector and pod template omitted for brevity
      strategy:
        canary:
          steps:
            - setWeight: 5
            - pause: { duration: 5m }
            - setWeight: 25
            - pause: { duration: 10m }
            - setWeight: 100
          analysis:
            templates:
              - templateName: payments-success-rate
            startingStep: 1
            args:
              - name: service-name
                value: payments-api-canary
    ---
    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: payments-success-rate
      namespace: prod
    spec:
      args:
        - name: service-name
      metrics:
        - name: success-rate
          interval: 1m
          successCondition: result[0] >= 0.995
          failureLimit: 2   # tolerate 2 failed measurements; the 3rd fails the analysis
          provider:
            prometheus:
              address: http://prometheus.monitoring:9090
              query: |
                sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[2m])) /
                sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))

When the number of failed measurements exceeds failureLimit (with failureLimit: 2, that is the third failure, and the failures need not be consecutive), Argo Rollouts aborts the rollout: it sets the canary weight to 0, shifts all traffic back to the stable ReplicaSet, and marks the rollout as Degraded. No human action is required; the system rolls itself back.
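
To make the abort arithmetic concrete, the following sketch replays a series of success-rate measurements against a threshold and a failure limit. It is a simplified model of the behavior, assuming the analysis fails once the count of failed measurements exceeds the limit. It is not Argo Rollouts code, and it ignores error and inconclusive results. analysis_verdict is a hypothetical name.

```shell
# Simplified model of how a metric's measurements map to a verdict.
# Usage: analysis_verdict <threshold> <failure_limit> <measurement>...
analysis_verdict() {
  threshold="$1"
  failure_limit="$2"
  shift 2
  failed=0

  for value in "$@"; do
    # awk performs the floating-point comparison that POSIX sh lacks;
    # it exits 0 (true) when the measurement is below the threshold.
    if awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v + 0 < t + 0) }'; then
      failed=$((failed + 1))
    fi
    # The verdict flips as soon as failures exceed the limit
    if [ "$failed" -gt "$failure_limit" ]; then
      echo "abort"
      return 0
    fi
  done
  echo "continue"
}
```

With a threshold of 0.995 and a limit of 2, the series 0.999 0.990 0.991 0.999 prints continue (two failures are tolerated), while 0.990 0.999 0.991 0.980 prints abort on the third failed measurement.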

3. Immutable Artifacts and Tag Discipline

    # Bad practice: mutable tags allow silent rollback failures
    docker push myregistry.io/payments-api:latest   # BAD: latest is mutable

    # Correct practice: every build tagged with the git SHA
    IMAGE_TAG=$(git rev-parse --short HEAD)
    docker build -t myregistry.io/payments-api:${IMAGE_TAG} .
    docker push myregistry.io/payments-api:${IMAGE_TAG}

    # Also push a semantic version tag (immutable once the release is cut)
    docker tag myregistry.io/payments-api:${IMAGE_TAG} myregistry.io/payments-api:v2.4.1
    docker push myregistry.io/payments-api:v2.4.1

    # In AWS ECR, enable image tag immutability via CLI
    aws ecr put-image-tag-mutability \
      --repository-name payments-api \
      --image-tag-mutability IMMUTABLE \
      --region us-east-1

Keep your last five production-deployed images pinned in the registry with a prod-pinned lifecycle policy that prevents deletion. This guarantees that even if your CI system purges old images during cleanup, the five most recent production versions are always available for an emergency rollback without a rebuild. ECR lifecycle policies and Harbor retention rules both support this pattern.
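
Tag discipline can also be enforced at deploy time by rejecting references that could silently change. The sketch below is an assumption about naming convention only: is_immutable_ref is a hypothetical guard that accepts digests, semantic version tags, and 7 to 40 character hex SHA tags. It cannot prove that the registry actually forbids overwriting a tag, which is what the ECR immutability setting is for.

```shell
# Hypothetical pre-deploy guard: succeed only for image references that
# follow an immutable naming convention.
is_immutable_ref() {
  case "$1" in
    *@sha256:*) return 0 ;;   # pinned by digest
  esac

  # Text after the last colon; for an untagged reference this is not a
  # valid tag and fails the pattern below.
  tag="${1##*:}"

  # Accept vMAJOR.MINOR.PATCH (the "v" is optional) or a hex git SHA
  printf '%s\n' "$tag" |
    grep -Eq '^(v?[0-9]+\.[0-9]+\.[0-9]+|[0-9a-f]{7,40})$'
}

# Usage:
#   is_immutable_ref "$IMAGE" || { echo "refusing mutable image: $IMAGE" >&2; exit 1; }
```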

Roll-Forward in Practice — The Hotfix Pipeline

Rolling forward requires a hotfix pipeline that is faster than your standard release. At big-tech companies, hotfix pipelines are a first-class concern with dedicated tooling:

  • Branch from the deployed SHA, not from HEAD. HEAD may already include unreleased changes. Check out the exact commit that is running in production, apply the single targeted fix, and promote that.
  • Run a minimal test suite. Smoke tests + the failing test case for the bug. Not the full 45-minute test suite — that is for standard releases. The hotfix CI job should complete in under 10 minutes.
  • Bypass staging for true P0 incidents — with explicit approval. If the production error rate is 40% and the fix is a one-line nil-check, waiting for a full staging promotion cycle costs real users. Structure your pipeline to allow a gated bypass with a required approval from the on-call engineering manager.
  • Deploy via the standard progressive strategy. Even a hotfix should use canary or blue-green — just with compressed step durations (1 minute at 5%, 2 minutes at 25%, then 100%).
Document your rollback and roll-forward runbooks in your incident management tool (PagerDuty Runbooks, Confluence, Notion) and link them from your alert playbooks. During an incident, engineers operate under stress with reduced cognitive capacity. A runbook reduces mean time to recovery (MTTR) by eliminating the need to recall the exact commands from memory — they should be executable from the runbook directly, copy-paste ready.
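
As an example of a copy-paste-ready runbook entry, the first hotfix step (branch from the deployed SHA, apply one targeted fix) might look like the sketch below. It assumes the fix already exists as a single commit elsewhere in the repository, for instance on main. create_hotfix_branch and the branch name in the usage line are hypothetical.

```shell
# Hypothetical helper: build a hotfix branch from the SHA running in
# production and apply exactly one fix commit to it.
# Usage: create_hotfix_branch <deployed-sha> <fix-sha> <branch-name>
create_hotfix_branch() {
  deployed_sha="$1"
  fix_sha="$2"
  branch="$3"

  # Refuse to continue if the deployed commit is unknown to this clone
  git cat-file -e "${deployed_sha}^{commit}" || return 1

  # Start from what production is running, not from HEAD
  git checkout -b "$branch" "$deployed_sha" || return 1

  # -x records the original commit hash in the message for auditability
  git cherry-pick -x "$fix_sha" || return 1

  # Show what the hotfix adds on top of the deployed commit
  git log --oneline "${deployed_sha}..${branch}"
}

# Usage: create_hotfix_branch 8e1b047 <fix-commit-sha> hotfix/nil-check
```

The final git log line should list exactly one commit. Anything more means unreleased work has leaked into the hotfix.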