Zero-Downtime Schema Migrations
Zero-Downtime Schema Migrations
Every production database schema eventually needs to change. Adding a column, renaming a table, dropping an index, or splitting a field — these are routine operations during product development. The naive approach is to stop the application, apply the migration, then restart. That is acceptable for a personal project with zero users. At big-tech scale, where SLOs demand 99.9%+ availability and traffic peaks at millions of requests per minute, any schema change that locks a table for seconds triggers a customer-facing incident.
Zero-downtime schema migration is not a single tool — it is a discipline built on three interlocking ideas: the expand-contract pattern, Online DDL (database engine support for non-blocking structural changes), and migration tooling integrated into CI/CD pipelines. Master all three and you can evolve your database schema continuously, safely, without coordinating maintenance windows.
Why Naive Migrations Kill Availability
MySQL and PostgreSQL both take table-level metadata locks during DDL operations. An ALTER TABLE orders ADD COLUMN loyalty_points INT on a 500-million-row table can hold an exclusive lock for minutes. Every query hitting that table — reads and writes — queues behind the lock. Your connection pool fills. The application returns 503s. The on-call engineer gets paged at 2 AM.
Even "fast" DDL has hidden traps. Adding a column with a DEFAULT in older MySQL (pre-8.0) rewrites the entire table. Renaming a column breaks any application code that still uses the old name — and in a rolling deployment, old pods and new pods run simultaneously. This is the core problem expand-contract solves.
user_id to account_id, old pods write to user_id (now missing), new pods write to account_id. Both sets of writes fail. Data loss and errors occur even though each individual piece looks correct in isolation.
The Expand-Contract Pattern
Expand-contract (also called parallel-change or blue-green schema migration) decomposes every breaking schema change into at least three deployment phases, each of which is backward-compatible with the live application code:
- Expand: Add the new structure alongside the old. Add
account_idas a nullable column. Deploy code that writes to both columns and reads from the old one. The old column is still authoritative; no data is lost. - Migrate data: Backfill
account_idfromuser_idin batches (never a singleUPDATEon the whole table). After backfill, deploy code that reads fromaccount_id. Both columns are now in sync. - Contract: Once 100% of pods are on the new code and metrics confirm no reads from the old column, drop
user_id. This is now a safe, fast DDL because no live code references it.
The key insight is that the database schema and the application code version are decoupled. You never need to deploy both simultaneously. Each phase is independently rollable, and because old and new app versions are always compatible with the current schema, a rollback at any phase is safe.
NOT NULL DEFAULT 0 column in MySQL < 8.0 performs a full table rewrite. In PostgreSQL 11+, a NOT NULL DEFAULT with a constant is instant (no rewrite). Know your engine version before choosing the DDL form.
Online DDL: Engine-Level Non-Blocking Changes
Modern database engines have built-in mechanisms to perform structural changes without locking out concurrent reads and writes. Understanding these is essential for predicting whether a migration will be safe or will cause an outage.
MySQL / InnoDB Online DDL (MySQL 5.6+, 8.0): Most ALTER TABLE operations support ALGORITHM=INPLACE, LOCK=NONE, which applies changes in place while continuing to serve traffic. Not all operations support this — dropping a primary key or changing the column charset still requires a full copy. Always test with EXPLAIN FORMAT=TREE ALTER TABLE ... or check the MySQL documentation matrix.
pt-online-schema-change (pt-osc) and gh-ost are tools that implement online DDL at the application layer for cases where the engine cannot do it natively, or where you need more control (e.g., throttling, pausable migrations). gh-ost (GitHub's tool) is the industry standard for large MySQL tables — it uses MySQL's binary log stream to replay writes to a shadow table with zero triggers, then atomically renames tables at cutover.
PostgreSQL has the CONCURRENTLY modifier for index operations: CREATE INDEX CONCURRENTLY builds the index without an exclusive lock, at the cost of taking longer. PostgreSQL 12+ added CREATE MATERIALIZED VIEW CONCURRENTLY. However, plain ALTER TABLE ADD COLUMN NOT NULL without a server-side default still requires a full table scan in older versions.
Batched Backfill: The Safest Way to Move Data
During the Expand phase you may need to populate a new column from existing data. Never run UPDATE orders SET account_id = user_id — on a large table this takes an exclusive row lock on every row simultaneously, saturates replication lag, and runs for hours. Use a batched backfill loop instead:
UPDATEs replicate to read replicas and can cause lag to spike. Set a lag threshold (e.g., pause if replica lag > 2 s) using SHOW SLAVE STATUS or pt-heartbeat. gh-ost has this built in via --max-lag-millis.
Migration Tooling in CI/CD Pipelines
Ad-hoc, manually-run migrations are a source of drift, errors, and undocumented state. The industry standard is to version-control every migration and run them automatically through the deployment pipeline, with guardrails that enforce safety at each stage.
Common migration tools by ecosystem:
- Flyway (Java, SQL) — file-based versioned migrations (V1__add_column.sql). Integrates with Maven, Gradle, GitHub Actions. Has a
dryRunmode for pre-flight review. - Liquibase (Java, XML/YAML/SQL) — richer than Flyway; supports rollback scripts, preconditions, and change-set labels for environment targeting.
- Atlas (Go, HCL) — schema-as-code; generates a migration plan from schema diff, integrates with Terraform, and has a cloud-based linting API.
- golang-migrate — CLI + Go library; idiomatic for Go microservices with up/down migration files.
- Alembic (Python) — the SQLAlchemy migration tool; auto-generates migrations by diffing ORM models against the live schema.
A production-grade CI pipeline for database migrations follows this sequence:
DROP TABLE, DROP COLUMN, and TRUNCATE in automated pipelines are high-risk. Gate them with a manual approval step in GitHub Actions (environment: production with required reviewers), or deploy them as a separate PR that triggers a separate, manually-approved pipeline run.
Schema Linting and the Safety Checklist
Before any migration reaches production, a linter should automatically flag patterns that are known to cause incidents. Atlas's migrate lint command, Squawk (PostgreSQL-specific), and GitHub's skeema diff can all run in CI to block dangerous SQL. Patterns to reject automatically:
- Adding a
NOT NULLcolumn without a server-side default (full table rewrite or requires existing data to be backfilled first). - Creating an index without
CONCURRENTLY(PostgreSQL) or verifyingALGORITHM=INPLACE(MySQL). ALTER TABLE RENAME COLUMNwithout a prior expand phase (will break old application pods during rolling deploy).- Any migration that modifies a table larger than N GB without an explicit approval label.
- Missing down-migration (
V2__undo.sql) — every migration must be reversible, or the rollback plan requires code freeze.
Zero-downtime schema migrations are one of the most underestimated skills in production engineering. The patterns here — expand-contract, online DDL, batched backfill, and pipeline-integrated linting — are how teams at Google, Stripe, and GitHub ship database changes dozens of times per day without a maintenance window. Internalise the expand-contract lifecycle first; every other technique follows naturally from it.