Database Migrations Without Downtime: The Five-Phase Pattern
Digital Engineering
By Arjun Raghavan, Security & Systems Lead, BIPI · May 25, 2025 · 7 min read
Schema changes in production databases are the most common cause of unplanned downtime in operations we audit. Most teams know the pattern in theory: add the new column, dual-write, backfill, switch reads, drop the old column. In practice the order gets compressed, the timeline gets compressed, and at 3 AM someone is hand-editing a migration table to recover.
The five-phase pattern is not new. The discipline of running it correctly across releases is what separates teams that have planned outages once a year from teams that have unplanned outages every quarter.
Phase 1: add the new schema, do nothing else
Add the new column, table, or index. No application code reads or writes it yet. The migration is purely additive. In PostgreSQL, ALTER TABLE ... ADD COLUMN with no DEFAULT and no NOT NULL is metadata-only and effectively instant. On Postgres 11 and later, adding a column with a constant DEFAULT (with or without NOT NULL) is also metadata-only; a volatile DEFAULT such as random() still forces a rewrite. On older versions, any DEFAULT is the dangerous case: it triggers a full table rewrite, with the table locked the entire time.
The temptation is to combine 'add column' with 'populate column' in one migration. Resist it. On a big table, the populate step (or a DEFAULT rewrite on older Postgres) can take hours and hold a lock on the table for the duration.
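As a running example for the snippets below, assume a hypothetical orders table whose integer amount column is being widened into a new bigint column, amount_v2; none of these names come from a real schema. The phase-1 migration is one line:

    -- Phase 1, additive only: no DEFAULT, no NOT NULL, so this is a
    -- metadata-only change and returns immediately even on a huge table.
    ALTER TABLE orders ADD COLUMN amount_v2 bigint;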
Phase 2: dual-write
Application code now writes to both old and new schema on every write. Reads still come from the old schema. The new schema is being populated by writes only — historical data is still missing.
Dual-write is the phase that fails most often. The application code has to keep both columns in sync on every code path that writes. Easy to miss a path. The fix: a database trigger that mirrors writes from old to new. Triggers are slower but they cannot be bypassed by an application bug.
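A sketch of the trigger approach for the same hypothetical columns; the function and trigger names are invented for illustration:

    -- Mirror the old column into the new one on every write, so no
    -- application code path can forget the dual-write.
    CREATE OR REPLACE FUNCTION orders_mirror_amount() RETURNS trigger AS $$
    BEGIN
      NEW.amount_v2 := NEW.amount;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_dual_write
      BEFORE INSERT OR UPDATE ON orders
      FOR EACH ROW
      EXECUTE FUNCTION orders_mirror_amount();
      -- Postgres 11+; use EXECUTE PROCEDURE on older versions.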
Run dual-write for at least 24 hours before moving on. Watch for divergence — count rows where new column is NULL but old column is not. If that count grows, an application path is missing the dual-write.
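The divergence check is a single query, again with the hypothetical column names. Before backfill the absolute count includes every historical row, so the trend is the signal, not the number:

    -- Old column populated, new column not. Run this on a schedule;
    -- a growing count means some write path is missing the dual-write.
    SELECT count(*)
    FROM orders
    WHERE amount IS NOT NULL
      AND amount_v2 IS NULL;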
Phase 3: backfill historical data
Now copy historical data from old to new schema. The trick is doing it in batches that do not lock the table for long.
Pattern: a script that processes 1,000 to 10,000 rows per batch, with a 100 ms sleep between batches. Each batch is an UPDATE scoped to a tight ID range. Track progress in a meta table so the script can resume if it crashes. Run it during low-traffic hours, and monitor replication lag and IOPS throughout.
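A sketch of the batch loop as a PL/pgSQL DO block, with the same hypothetical names; a standalone script in any language works the same way. It assumes a backfill_progress meta table seeded with last_processed_id = 0, and committing inside DO requires Postgres 11+ with no surrounding transaction:

    DO $$
    DECLARE
      batch   integer := 5000;
      last_id bigint;
      max_id  bigint;
    BEGIN
      -- Resume from wherever the last run stopped.
      SELECT last_processed_id INTO last_id
        FROM backfill_progress WHERE job = 'orders_amount_v2';
      SELECT max(id) INTO max_id FROM orders;

      WHILE last_id < max_id LOOP
        UPDATE orders
           SET amount_v2 = amount
         WHERE id > last_id AND id <= last_id + batch
           AND amount_v2 IS NULL;   -- rows the dual-write already covered are skipped
        last_id := last_id + batch;
        UPDATE backfill_progress
           SET last_processed_id = last_id
         WHERE job = 'orders_amount_v2';
        COMMIT;                     -- release locks after every batch
        PERFORM pg_sleep(0.1);      -- 100 ms breather between batches
      END LOOP;
    END $$;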
Verify completeness when the script finishes: a SELECT count(*) WHERE new_column IS NULL should return zero, or at most rows created before dual-write started.
Phase 4: switch reads, keep dual-write
Application code now reads from the new schema. Writes still hit both. This is the phase where you discover whether your backfill was actually complete and your application paths actually populate the new schema correctly.
Roll out behind a feature flag. Start with 1% of traffic, watch error rates and query latency for 24 hours, then go to 10%, then 100%. The feature flag is your rollback button: if anything looks wrong, flip back to reading from the old schema. Dual-write keeps the new schema fresh during the rollback, so you can re-attempt.
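The flag itself lives in application code, but a database-side sanity check during the ramp is a direct comparison of the two columns (same hypothetical names):

    -- Any nonzero count means the new read path can return different
    -- data than the old one: halt the ramp and investigate.
    -- On very large tables, sample an ID range instead of scanning everything.
    SELECT count(*)
    FROM orders
    WHERE amount IS DISTINCT FROM amount_v2;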
Phase 5: stop dual-write, drop the old schema
After at least a week of stable reads on the new schema, remove the dual-write code. The application now touches only the new schema. Wait another week, then drop the old column or table. Postgres ALTER TABLE ... DROP COLUMN is metadata-only and effectively instant.
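The final migration, with the same hypothetical names, is as small as the first one; ship it on its own, after the waiting period:

    -- Metadata-only: the column is marked dropped, and its space is
    -- reclaimed by later vacuums and rewrites, not at drop time.
    ALTER TABLE orders DROP COLUMN amount;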
The temptation here is to skip the wait. Don't. The wait is your insurance against discovering at week three that some monthly reporting job still reads the old column and now has stale data.
When to compress the timeline
Two phases can run in parallel if the data is small (< 1M rows). Phase 1 + 2 can land in the same release. Phase 3 + 4 can land together. Phase 5 should always wait at least a week after phase 4 — that is the rollback insurance.
For tables over 100M rows, every phase needs explicit operational planning. Backfill alone might take 48 hours. Coordinate with the SRE team on read replicas, replication lag, and disk IOPS budgets.
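On the primary, per-replica lag is visible in pg_stat_replication (Postgres 10 and later); watching it during a long backfill tells you when to pause or slow the batch loop:

    -- One row per connected standby. replay_lag is the time between
    -- commit on the primary and replay on the standby.
    SELECT application_name, state, replay_lag
    FROM pg_stat_replication;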
Closing
Zero-downtime migrations are not a clever trick. They are a disciplined sequence: additive change first, dual-write, backfill, switch reads behind a flag, drop only after a stable wait. The teams that do it well have an explicit checklist for each phase and a 'no compress' rule. The teams that do not will tell you migrations are scary. They are not. Compressing them is.
Read more field notes, explore our services, or get in touch at info@bipi.in.