We are still cooking the magic in the way!
Database Changes & Expand-Contract
Database Changes & Expand-Contract
Every progressive delivery technique — canary releases, blue-green deployments, feature flags — assumes that your application code can be swapped independently of your database schema. In reality, the database is a shared, stateful resource that both the old and new versions of your service must read and write simultaneously during any rolling or canary deploy. A schema change that is incompatible with the running version will corrupt data or crash pods the moment it runs. This is the root cause of more production incidents than most engineers realise.
The Expand-Contract pattern (also called parallel-change or multi-phase migration) is the industry-standard solution. It breaks every schema change into a sequence of backward-compatible steps so that any number of application versions can safely coexist at any point in the migration.
Why a Naive Migration Breaks Rolling Deploys
Imagine renaming a column user_name to full_name with a single ALTER TABLE RENAME COLUMN. The moment that migration runs on the production database, every pod still running the old application version — which reads user_name — begins throwing errors. With 100 pods and a 5-minute rolling update, you have 5 minutes of partial outage guaranteed. The same risk applies to DROP COLUMN, NOT NULL additions, and foreign-key constraint changes.
The Three Phases of Expand-Contract
Phase 1 — Expand: Add Without Breaking
The first migration adds the new structure while keeping the old structure intact. For a column rename, this means adding the new column as nullable. The new application version writes to both columns and reads from the old one. The old application version ignores the new column entirely. No code is broken.
Phase 2 — Migrate: Backfill in Batches
For large tables, a single UPDATE users SET full_name = user_name will lock the table for minutes and cause a production incident. The safe pattern is a batched backfill — update rows in chunks of 1,000–10,000 with a short sleep between batches to avoid replication lag and I/O saturation. This runs as a background job or a one-off script, not as part of the migration file itself.
gh-ost (GitHub) or pt-online-schema-change (Percona Toolkit). These tools apply schema changes online by creating a shadow table, replaying binlog events, and doing an atomic table swap — zero downtime even on billion-row tables. Used by GitHub, Shopify, and Airbnb in production.
Phase 3 — Contract: Remove the Old Structure
Only after every application pod is on the new version — confirmed by checking that zero pods still query user_name — do you run the contract migration. This removes the old column. From this point forward, the schema is lean and the migration is complete.
Applying Expand-Contract to Other Schema Change Types
The same three-phase logic applies to every dangerous DDL operation:
- Adding a NOT NULL column — Expand: add as nullable with a default. Migrate: backfill all nulls. Contract: add
NOT NULLconstraint. - Splitting a column (e.g.
address→city+country) — Expand: add both new columns. Migrate: parse and copy data. Contract: drop old column. - Adding a foreign key constraint — Expand: add column without constraint. Migrate: fix orphaned rows. Contract: add constraint with
VALIDATE CONSTRAINT. - Changing a column type (e.g.
INT→BIGINT) — Add a shadow column of the new type; dual-write; swap reads; drop old column.
ALGORITHM=INPLACE, LOCK=NONE compatibility in the MySQL docs before assuming a change is lock-free.
Managing Migrations in a CI/CD Pipeline
Migrations must be version-controlled, reviewed, and automated. The standard pattern is to store migrations in the repo, run them in CI against a copy of the schema, and gate the deploy on migration success. In Kubernetes environments, migrations are typically run as a Job or an initContainer that completes before the new Deployment pods start.
migrations lock column for the same purpose. Never roll your own migration runner without a distributed lock.
Observability During a Live Migration
Always monitor the following during and after a migration deploy: query latency on the affected table (watch for lock waits), replication lag on read replicas (bulk writes can cause lag spikes), and error rates from the application. Set up a pre-built Grafana dashboard for each major migration. If replication lag exceeds your SLA threshold, pause the backfill job immediately.