Databases in Production

Zero-Downtime Schema Migrations

22 min Lesson 5 of 30

Zero-Downtime Schema Migrations

Every production database schema eventually needs to change. Adding a column, renaming a table, dropping an index, or splitting a field — these are routine operations during product development. The naive approach is to stop the application, apply the migration, then restart. That is acceptable for a personal project with zero users. At big-tech scale, where SLOs demand 99.9%+ availability and traffic peaks at millions of requests per minute, any schema change that locks a table for seconds triggers a customer-facing incident.

Zero-downtime schema migration is not a single tool — it is a discipline built on three interlocking ideas: the expand-contract pattern, Online DDL (database engine support for non-blocking structural changes), and migration tooling integrated into CI/CD pipelines. Master all three and you can evolve your database schema continuously, safely, without coordinating maintenance windows.

Why Naive Migrations Kill Availability

MySQL and PostgreSQL both take table-level metadata locks during DDL operations. An ALTER TABLE orders ADD COLUMN loyalty_points INT on a 500-million-row table can hold an exclusive lock for minutes. Every query hitting that table — reads and writes — queues behind the lock. Your connection pool fills. The application returns 503s. The on-call engineer gets paged at 2 AM.

Even "fast" DDL has hidden traps. Adding a column with a DEFAULT in older MySQL (pre-8.0) rewrites the entire table. Renaming a column breaks any application code that still uses the old name — and in a rolling deployment, old pods and new pods run simultaneously. This is the core problem expand-contract solves.

The rolling-deployment trap: During a Kubernetes rolling update, old and new application pods run concurrently for 30–120 seconds. If your migration renames a column from user_id to account_id, old pods write to user_id (now missing), new pods write to account_id. Both sets of writes fail. Data loss and errors occur even though each individual piece looks correct in isolation.

The Expand-Contract Pattern

Expand-contract (also called parallel-change or blue-green schema migration) decomposes every breaking schema change into at least three deployment phases, each of which is backward-compatible with the live application code:

Expand: Add the new structure alongside the old. Add account_id as a nullable column. Deploy code that writes to both columns and reads from the old one. The old column is still authoritative; no data is lost.
Migrate data: Backfill account_id from user_id in batches (never a single UPDATE on the whole table). After backfill, deploy code that reads from account_id. Both columns are now in sync.
Contract: Once 100% of pods are on the new code and metrics confirm no reads from the old column, drop user_id. This is now a safe, fast DDL because no live code references it.

Expand-Contract: each phase is independently deployable and backward-compatible with the previous application version.

The key insight is that the database schema and the application code version are decoupled. You never need to deploy both simultaneously. Each phase is independently rollable, and because old and new app versions are always compatible with the current schema, a rollback at any phase is safe.

Always start expand-contract with a nullable column, never NOT NULL with a default. A NOT NULL DEFAULT 0 column in MySQL < 8.0 performs a full table rewrite. In PostgreSQL 11+, a NOT NULL DEFAULT with a constant is instant (no rewrite). Know your engine version before choosing the DDL form.

Online DDL: Engine-Level Non-Blocking Changes

Modern database engines have built-in mechanisms to perform structural changes without locking out concurrent reads and writes. Understanding these is essential for predicting whether a migration will be safe or will cause an outage.

MySQL / InnoDB Online DDL (MySQL 5.6+, 8.0): Most ALTER TABLE operations support ALGORITHM=INPLACE, LOCK=NONE, which applies changes in place while continuing to serve traffic. Not all operations support this — dropping a primary key or changing the column charset still requires a full copy. Always test with EXPLAIN FORMAT=TREE ALTER TABLE ... or check the MySQL documentation matrix.

pt-online-schema-change (pt-osc) and gh-ost are tools that implement online DDL at the application layer for cases where the engine cannot do it natively, or where you need more control (e.g., throttling, pausable migrations). gh-ost (GitHub's tool) is the industry standard for large MySQL tables — it uses MySQL's binary log stream to replay writes to a shadow table with zero triggers, then atomically renames tables at cutover.

PostgreSQL has the CONCURRENTLY modifier for index operations: CREATE INDEX CONCURRENTLY builds the index without an exclusive lock, at the cost of taking longer. PostgreSQL 12+ added CREATE MATERIALIZED VIEW CONCURRENTLY. However, plain ALTER TABLE ADD COLUMN NOT NULL without a server-side default still requires a full table scan in older versions.

-- MySQL: check whether an ALTER will use INPLACE or COPY algorithm
-- Run this before deploying the migration in production
ALTER TABLE orders
  ADD COLUMN loyalty_points INT NULL DEFAULT 0,
  ALGORITHM=INPLACE, LOCK=NONE;

-- If the above fails with "ALGORITHM=INPLACE not supported", fall back to gh-ost:
-- gh-ost will replicate via binlog and do a low-impact cutover
gh-ost \
  --user="admin" \
  --password="$DB_PASS" \
  --host="primary.prod.internal" \
  --database="orders" \
  --table="orders" \
  --alter="ADD COLUMN loyalty_points INT NULL DEFAULT 0" \
  --allow-on-master \
  --chunk-size=1000 \
  --max-lag-millis=1500 \
  --throttle-control-replicas="replica1.prod.internal,replica2.prod.internal" \
  --execute

-- PostgreSQL: add index without locking
CREATE INDEX CONCURRENTLY idx_orders_user_id
  ON orders (user_id)
  WHERE status = 'active';

-- PostgreSQL: safe nullable column add (instant in PG 11+)
ALTER TABLE orders ADD COLUMN loyalty_points INT;
UPDATE orders SET loyalty_points = 0 WHERE loyalty_points IS NULL; -- batched externally
ALTER TABLE orders ALTER COLUMN loyalty_points SET NOT NULL;

Batched Backfill: The Safest Way to Move Data

During the Expand phase you may need to populate a new column from existing data. Never run UPDATE orders SET account_id = user_id — on a large table this takes an exclusive row lock on every row simultaneously, saturates replication lag, and runs for hours. Use a batched backfill loop instead:

#!/usr/bin/env bash
# backfill_account_id.sh — run as a one-off Kubernetes Job or cron
# Processes 1000 rows per iteration with a short sleep to throttle I/O

BATCH=1000
SLEEP=0.1   # seconds between batches
MAX_ID=$(mysql -u admin -p"$DB_PASS" -sNe "SELECT MAX(id) FROM orders" orders_db)
CURSOR=0

while [ "$CURSOR" -lt "$MAX_ID" ]; do
  NEXT=$((CURSOR + BATCH))
  mysql -u admin -p"$DB_PASS" orders_db <<SQL
    UPDATE orders
    SET account_id = user_id
    WHERE id > $CURSOR AND id <= $NEXT
      AND account_id IS NULL;
SQL
  echo "Backfilled rows $CURSOR to $NEXT"
  CURSOR=$NEXT
  sleep "$SLEEP"
done
echo "Backfill complete."

Monitor replication lag during backfill. Even batched UPDATEs replicate to read replicas and can cause lag to spike. Set a lag threshold (e.g., pause if replica lag > 2 s) using SHOW SLAVE STATUS or pt-heartbeat. gh-ost has this built in via --max-lag-millis.

Migration Tooling in CI/CD Pipelines

Ad-hoc, manually-run migrations are a source of drift, errors, and undocumented state. The industry standard is to version-control every migration and run them automatically through the deployment pipeline, with guardrails that enforce safety at each stage.

Common migration tools by ecosystem:

Flyway (Java, SQL) — file-based versioned migrations (V1__add_column.sql). Integrates with Maven, Gradle, GitHub Actions. Has a dryRun mode for pre-flight review.
Liquibase (Java, XML/YAML/SQL) — richer than Flyway; supports rollback scripts, preconditions, and change-set labels for environment targeting.
Atlas (Go, HCL) — schema-as-code; generates a migration plan from schema diff, integrates with Terraform, and has a cloud-based linting API.
golang-migrate — CLI + Go library; idiomatic for Go microservices with up/down migration files.
Alembic (Python) — the SQLAlchemy migration tool; auto-generates migrations by diffing ORM models against the live schema.

A production-grade CI pipeline for database migrations follows this sequence:

# .github/workflows/deploy.yml (excerpt)
# This pipeline runs migrations BEFORE new application pods are deployed
# (expand phase). The app's rolling update comes after.

jobs:
  migrate:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      # 1. Lint: reject dangerous operations before they run
      - name: Atlas migration lint
        run: |
          atlas migrate lint \
            --dev-url "mysql://root:password@localhost:3306/dev" \
            --dir "file://migrations" \
            --format "{{ range .Files }}{{ .Name }}: {{ range .Reports }}{{ .Text }}{{ end }}{{ end }}"

      # 2. Dry-run on a production clone (RDS snapshot restore)
      - name: Flyway dry-run on prod clone
        run: |
          flyway \
            -url="jdbc:mysql://$PROD_CLONE_HOST:3306/orders" \
            -user="$DB_USER" \
            -password="$DB_PASS" \
            -dryRunOutput="/tmp/migration-preview.sql" \
            migrate
          cat /tmp/migration-preview.sql  # reviewed in PR

      # 3. Apply to production (app not yet deployed)
      - name: Apply migration
        run: |
          flyway \
            -url="jdbc:mysql://$PROD_DB_HOST:3306/orders" \
            -user="$DB_USER" \
            -password="$DB_PASS" \
            migrate

  deploy-app:
    needs: migrate   # app deploy only starts after migration succeeds
    runs-on: ubuntu-latest
    steps:
      - name: Rolling deploy
        run: kubectl rollout restart deployment/orders-service -n production

Never auto-apply destructive migrations. DROP TABLE, DROP COLUMN, and TRUNCATE in automated pipelines are high-risk. Gate them with a manual approval step in GitHub Actions (environment: production with required reviewers), or deploy them as a separate PR that triggers a separate, manually-approved pipeline run.

Schema Linting and the Safety Checklist

Before any migration reaches production, a linter should automatically flag patterns that are known to cause incidents. Atlas's migrate lint command, Squawk (PostgreSQL-specific), and GitHub's skeema diff can all run in CI to block dangerous SQL. Patterns to reject automatically:

Adding a NOT NULL column without a server-side default (full table rewrite or requires existing data to be backfilled first).
Creating an index without CONCURRENTLY (PostgreSQL) or verifying ALGORITHM=INPLACE (MySQL).
ALTER TABLE RENAME COLUMN without a prior expand phase (will break old application pods during rolling deploy).
Any migration that modifies a table larger than N GB without an explicit approval label.
Missing down-migration (V2__undo.sql) — every migration must be reversible, or the rollback plan requires code freeze.

Zero-downtime schema migrations are one of the most underestimated skills in production engineering. The patterns here — expand-contract, online DDL, batched backfill, and pipeline-integrated linting — are how teams at Google, Stripe, and GitHub ship database changes dozens of times per day without a maintenance window. Internalise the expand-contract lifecycle first; every other technique follows naturally from it.