Deployment Strategies & Progressive Delivery

Database Changes & Expand-Contract

18 min Lesson 8 of 28

Database Changes & Expand-Contract

Every progressive delivery technique — canary releases, blue-green deployments, feature flags — assumes that your application code can be swapped independently of your database schema. In reality, the database is a shared, stateful resource that both the old and new versions of your service must read and write simultaneously during any rolling or canary deploy. A schema change that is incompatible with the running version will corrupt data or crash pods the moment it runs. This is the root cause of more production incidents than most engineers realise.

The Expand-Contract pattern (also called parallel-change or multi-phase migration) is the industry-standard solution. It breaks every schema change into a sequence of backward-compatible steps so that any number of application versions can safely coexist at any point in the migration.

Why a Naive Migration Breaks Rolling Deploys

Imagine renaming a column user_name to full_name with a single ALTER TABLE RENAME COLUMN. The moment that migration runs on the production database, every pod still running the old application version — which reads user_name — begins throwing errors. With 100 pods and a 5-minute rolling update, you have 5 minutes of partial outage guaranteed. The same risk applies to DROP COLUMN, NOT NULL additions, and foreign-key constraint changes.

Core principle: A database migration that runs before a code deploy must be backward-compatible with the old code. A migration that runs after a code deploy must be backward-compatible with the new code. The Expand-Contract pattern enforces this by design.

The Three Phases of Expand-Contract

The three phases of Expand-Contract: each phase ships as its own deploy, keeping the schema compatible with all live application versions.

Phase 1 — Expand: Add Without Breaking

The first migration adds the new structure while keeping the old structure intact. For a column rename, this means adding the new column as nullable. The new application version writes to both columns and reads from the old one. The old application version ignores the new column entirely. No code is broken.

-- Phase 1 migration (runs BEFORE new app code deploys)
-- Rename user_name -> full_name safely

ALTER TABLE users ADD COLUMN full_name VARCHAR(255) NULL;

-- Optionally copy existing data immediately for small tables (<1M rows)
UPDATE users SET full_name = user_name WHERE full_name IS NULL;

-- Application code at this point writes both columns:
-- INSERT INTO users (user_name, full_name, ...) VALUES (?, ?, ...)
-- UPDATE users SET user_name = ?, full_name = ? WHERE id = ?
-- SELECT user_name FROM users WHERE id = ?   -- still reads OLD column

Phase 2 — Migrate: Backfill in Batches

For large tables, a single UPDATE users SET full_name = user_name will lock the table for minutes and cause a production incident. The safe pattern is a batched backfill — update rows in chunks of 1,000–10,000 with a short sleep between batches to avoid replication lag and I/O saturation. This runs as a background job or a one-off script, not as part of the migration file itself.

#!/bin/bash
# batch-backfill.sh — safe batched column backfill (runs on live prod DB)
# Tune BATCH_SIZE and SLEEP_MS for your table size and replication lag tolerance

BATCH_SIZE=5000
SLEEP_MS=200   # 200ms between batches
LAST_ID=0

while true; do
  ROWS=$(mysql -u root -p"${DB_PASSWORD}" esb1995 -sN -e "
    UPDATE users
    SET full_name = user_name
    WHERE id > ${LAST_ID}
      AND full_name IS NULL
    ORDER BY id
    LIMIT ${BATCH_SIZE};
    SELECT ROW_COUNT();
  ")

  echo "Backfilled ${ROWS} rows from id > ${LAST_ID}"

  if [ "${ROWS}" -eq 0 ]; then
    echo "Backfill complete."
    break
  fi

  LAST_ID=$(mysql -u root -p"${DB_PASSWORD}" esb1995 -sN -e "
    SELECT MAX(id) FROM users WHERE full_name IS NOT NULL;
  ")

  sleep "0.${SLEEP_MS}"
done

-- After backfill completes, add NOT NULL constraint + index in a separate migration:
ALTER TABLE users
  MODIFY COLUMN full_name VARCHAR(255) NOT NULL,
  ADD INDEX idx_users_full_name (full_name);

Ghost / pt-online-schema-change for large tables. For tables over 10 million rows, use gh-ost (GitHub) or pt-online-schema-change (Percona Toolkit). These tools apply schema changes online by creating a shadow table, replaying binlog events, and doing an atomic table swap — zero downtime even on billion-row tables. Used by GitHub, Shopify, and Airbnb in production.

Phase 3 — Contract: Remove the Old Structure

Only after every application pod is on the new version — confirmed by checking that zero pods still query user_name — do you run the contract migration. This removes the old column. From this point forward, the schema is lean and the migration is complete.

-- Phase 3 migration (runs AFTER 100% of pods are on new code)
-- Verify first: grep your codebase and application logs for any reference to user_name

-- Check slow-query log for the old column name before dropping:
SELECT * FROM performance_schema.events_statements_summary_by_digest
WHERE DIGEST_TEXT LIKE '%user_name%'
  AND LAST_SEEN > NOW() - INTERVAL 7 DAY;

-- If zero results, safe to drop:
ALTER TABLE users DROP COLUMN user_name;

-- Always test the DROP on a staging replica first:
-- mysqldump prod_users | mysql staging_users
-- ALTER TABLE staging_users.users DROP COLUMN user_name;

Applying Expand-Contract to Other Schema Change Types

The same three-phase logic applies to every dangerous DDL operation:

Adding a NOT NULL column — Expand: add as nullable with a default. Migrate: backfill all nulls. Contract: add NOT NULL constraint.
Splitting a column (e.g. address → city + country) — Expand: add both new columns. Migrate: parse and copy data. Contract: drop old column.
Adding a foreign key constraint — Expand: add column without constraint. Migrate: fix orphaned rows. Contract: add constraint with VALIDATE CONSTRAINT.
Changing a column type (e.g. INT → BIGINT) — Add a shadow column of the new type; dual-write; swap reads; drop old column.

Never add a NOT NULL column without a default in a single migration on a live table. PostgreSQL will rewrite the entire table row-by-row and hold an exclusive lock for the duration — minutes on a large table. MySQL/MariaDB with the InnoDB engine does an online DDL for some changes, but not all. Always check ALGORITHM=INPLACE, LOCK=NONE compatibility in the MySQL docs before assuming a change is lock-free.

Managing Migrations in a CI/CD Pipeline

Migrations must be version-controlled, reviewed, and automated. The standard pattern is to store migrations in the repo, run them in CI against a copy of the schema, and gate the deploy on migration success. In Kubernetes environments, migrations are typically run as a Job or an initContainer that completes before the new Deployment pods start.

# Kubernetes Job pattern — run migration before rolling out new app pods
# migration-job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-v2-3-0
  namespace: production
  labels:
    app: myapp
    version: "2.3.0"
spec:
  backoffLimit: 0          # fail fast; do not retry a broken migration
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: myapp:2.3.0
          command: ["php", "artisan", "migrate", "--force"]
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
          resources:
            requests:
              cpu: "100m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"

# In your GitOps pipeline (ArgoCD / Flux):
# 1. Apply the Job and wait for completion before syncing the Deployment.
# 2. If the Job fails, the sync is aborted — pods on old code keep serving traffic.
# 3. On success, ArgoCD syncs the Deployment for the rolling update.

Use a migration lock. If multiple CI jobs or pods can start simultaneously, two migration processes may attempt to run the same migration concurrently. Tools like Flyway and Liquibase use an advisory lock on the database to serialise migration runs. Laravel uses the migrations lock column for the same purpose. Never roll your own migration runner without a distributed lock.

Observability During a Live Migration

Always monitor the following during and after a migration deploy: query latency on the affected table (watch for lock waits), replication lag on read replicas (bulk writes can cause lag spikes), and error rates from the application. Set up a pre-built Grafana dashboard for each major migration. If replication lag exceeds your SLA threshold, pause the backfill job immediately.