Zero-Downtime Database Migrations
TL;DR
Zero-downtime migrations use backward-compatible steps. Add columns nullable first. Deploy code that handles both schemas. Backfill data in batches. Add constraints. Never rename or drop in one step—use expand and contract.
Our last "5-minute migration window" turned into 45 minutes of downtime. ALTER TABLE on a 50M row table took longer than expected, locks piled up, queries timed out. We lost $30k in sales.
That was the last time we took downtime for a migration. Here's the multi-step approach that works while your database is under full production load.
Why Migrations Cause Downtime
Most migration problems come from three things:
-- 1. Table locks (ALTER TABLE acquires exclusive lock)
ALTER TABLE users ADD COLUMN verified BOOLEAN DEFAULT false;
-- Blocks ALL reads and writes until complete
-- On a 10M row table: potentially minutes
-- 2. Schema-code mismatch during rolling deploys
-- Old code expects 'user_name', new code uses 'username'
-- Migration runs → old servers still running → break
-- 3. Long-running data transforms
UPDATE users SET verified = true WHERE created_at < '2024-01-01';
-- 5M rows × lock time = everything piles up
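A cheap guard against the first failure mode: set a lock_timeout before any DDL, so a blocked ALTER TABLE gives up quickly and can be retried, instead of queueing behind a long transaction and stalling every query behind it. A minimal sketch (the helper name and 5-second default are my own):

```python
def guarded_ddl(statement: str, timeout: str = "5s") -> str:
    """Prefix DDL with a lock_timeout so a blocked ALTER TABLE fails fast
    (retryable) rather than piling up waiters behind its lock request."""
    return f"SET lock_timeout = '{timeout}';\n{statement}"
```

Run the returned SQL in one session; if the ALTER can't get its lock within the timeout, it errors out and traffic keeps flowing.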
The Key Principle: Backward-Compatible Steps
Every migration step must work with both old and new code running simultaneously. During a rolling deploy, some servers run old code, some run new.
Pattern 1: Adding a Column
-- BAD: NOT NULL DEFAULT on large table rewrites every row
ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW();
-- Long exclusive lock, potential timeout
-- GOOD: Three steps
-- Step 1: Add column as nullable (fast, no rewrite)
ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ;
-- Deploy: milliseconds, compatible with all code
-- Step 2: Backfill existing rows in batches
DO $$
DECLARE
  batch_size INT := 1000;
  last_id BIGINT := 0;
BEGIN
  LOOP
    WITH updated AS (
      UPDATE orders
      SET processed_at = created_at
      WHERE id IN (
        SELECT id FROM orders
        WHERE processed_at IS NULL AND id > last_id
        ORDER BY id
        LIMIT batch_size
      )
      RETURNING id
    )
    SELECT MAX(id) INTO last_id FROM updated;
    EXIT WHEN last_id IS NULL; -- MAX over zero updated rows is NULL: done
    PERFORM pg_sleep(0.01); -- Be gentle on the database
  END LOOP;
END $$;
-- Step 3: Add NOT NULL constraint (fast once no NULLs exist)
ALTER TABLE orders ALTER COLUMN processed_at SET NOT NULL;
Note: PostgreSQL 11+ adds columns with constant defaults instantly (stores default in catalog, not each row). For older versions or non-constant defaults, always go nullable first.
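To see why the keyset loop in Step 2 terminates and touches each row exactly once, here is a toy in-memory model of the same algorithm, with plain Python standing in for the SQL and the batch size shrunk for illustration:

```python
def backfill(rows, batch_size=2):
    """Toy model of the keyset backfill: fill processed_at from created_at
    one batch at a time, always advancing past the highest id updated."""
    last_id = 0
    while True:
        batch = sorted(
            (r for r in rows if r["processed_at"] is None and r["id"] > last_id),
            key=lambda r: r["id"],
        )[:batch_size]
        if not batch:
            break  # the SQL version's MAX(id) would be NULL here: done
        for r in batch:
            r["processed_at"] = r["created_at"]
        last_id = batch[-1]["id"]
    return rows

rows = [{"id": i, "created_at": f"t{i}", "processed_at": None} for i in range(1, 6)]
backfill(rows)
# Every row now has processed_at = created_at
```

Because the cursor only moves forward through the id order, the loop never re-scans rows it has already handled, which is what keeps each batch fast even on huge tables.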
Pattern 2: Renaming a Column
Never rename in one step. Old code breaks immediately.
-- BAD: Immediate breakage
ALTER TABLE users RENAME COLUMN user_name TO username;
-- All servers still running old code break instantly
Use the expand/contract pattern — 6 steps over multiple deployments:
Step 1: Add new column (keep old column)
Step 2: Deploy: write to BOTH columns
Step 3: Backfill old → new
Step 4: Deploy: read from new column
Step 5: Deploy: stop writing to old column
Step 6: Drop old column (weeks later)
-- Step 1: Add new column
ALTER TABLE users ADD COLUMN username TEXT;
-- Step 3: Backfill (batch this on large tables, as in Pattern 1)
UPDATE users SET username = user_name WHERE username IS NULL;
-- Step 6: Drop old column (after confirming nothing uses it)
ALTER TABLE users DROP COLUMN user_name;
// Step 2: Dual write (both old and new code can coexist)
async function updateUsername(userId, name) {
await db.execute(
'UPDATE users SET user_name = $1, username = $1 WHERE id = $2',
[name, userId]
);
}
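The read side of the cutover (Step 4) can be made tolerant while the backfill is still running. A sketch, assuming rows come back as dicts with these column names:

```python
def get_username(row):
    """Step 4 sketch: prefer the new column, fall back to the old one
    until the backfill is verified complete."""
    value = row.get("username")
    return value if value is not None else row.get("user_name")
```

This works whether or not a given row has been backfilled yet, so the read-path deploy doesn't have to wait for Step 3 to finish.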
Pattern 3: Adding an Index
-- BAD: Blocks writes for the entire build duration
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- On 10M rows: 30+ seconds, all writes blocked
-- GOOD: CONCURRENTLY (no write blocking)
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
-- Takes 2-3x longer to build, but causes no downtime
-- Cannot be run inside a transaction
-- For unique indexes:
CREATE UNIQUE INDEX CONCURRENTLY idx_users_email ON users(email);
-- Then enforce at the table level using the built index:
ALTER TABLE users ADD CONSTRAINT users_email_unique
UNIQUE USING INDEX idx_users_email;
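One caveat with CONCURRENTLY: if the build fails partway (a deadlock, or a duplicate row during a unique build), PostgreSQL leaves the index behind marked INVALID, and you must drop it before retrying. A query worth keeping in a maintenance script, held in a Python constant here to match the other snippets:

```python
# Finds indexes left INVALID by failed CREATE INDEX CONCURRENTLY runs.
FIND_INVALID_INDEXES = """
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
"""
```

Anything this query returns is taking up disk and slowing writes without serving reads; DROP INDEX it and rerun the CONCURRENTLY build.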
Pattern 4: Removing a Column
-- Step 1: Make column nullable (if it isn't already)
ALTER TABLE users ALTER COLUMN legacy_field DROP NOT NULL;
-- Step 2: Deploy: remove all code references to the column
-- Monitor for errors in production
-- Step 3: Drop the column (days or weeks later)
ALTER TABLE users DROP COLUMN legacy_field;
Search your codebase before dropping:
# Find references before dropping
grep -rn "legacy_field" src/ --include='*.js' --include='*.py' --include='*.go' --include='*.rb'
grep -r "legacy_field" migrations/
Pattern 5: Changing a Column Type
The hardest migration. Most type changes force a full table rewrite under an exclusive lock, so never change a type in place on a large table (a handful of changes, like widening a VARCHAR, are catalog-only and safe).
-- BAD: Full table rewrite, long exclusive lock
ALTER TABLE events ALTER COLUMN payload TYPE JSONB USING payload::JSONB;
-- GOOD: Parallel column
-- Step 1: Add new column
ALTER TABLE events ADD COLUMN payload_jsonb JSONB;
-- Step 2: Dual write (both columns)
-- Step 3: Backfill
UPDATE events SET payload_jsonb = payload::JSONB WHERE payload_jsonb IS NULL;
-- Step 4: Switch reads to new column
-- Step 5: Remove old column and rename the new one
ALTER TABLE events DROP COLUMN payload;
ALTER TABLE events RENAME COLUMN payload_jsonb TO payload;
-- Note: the rename breaks anything still reading payload_jsonb, so pair it
-- with a deploy that reads 'payload' (or simply keep the payload_jsonb name)
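Step 2's dual write is the same trick as in the rename pattern. A sketch, assuming a psycopg-style db handle and the column names used above:

```python
import json

def record_event(db, event_id, payload):
    """Step 2 sketch: write the same value to both the old TEXT column
    and the new JSONB column, so old and new read paths both stay fresh."""
    db.execute(
        "UPDATE events SET payload = %s, payload_jsonb = %s WHERE id = %s",
        (json.dumps(payload), json.dumps(payload), event_id),
    )
```

Keep this dual write in place until the backfill is done and reads have fully switched over; only then is it safe to stop writing the old column.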
Long-Running Migrations: Do Them in Batches
Never update millions of rows in a single statement:
# BAD - locks table for minutes
db.execute("""
UPDATE users
SET email_verified = true
WHERE created_at < '2024-01-01'
""")
# GOOD - batches with delays
import time
def backfill_in_batches():
    last_id = 0
    batch_size = 1000
    while True:
        # Grab the next batch of ids first, so we know exactly where it ends
        rows = db.fetchall("""
            SELECT id FROM users
            WHERE created_at < '2024-01-01'
              AND email_verified = false
              AND id > %s
            ORDER BY id
            LIMIT %s
        """, (last_id, batch_size))
        if not rows:
            break
        ids = [r[0] for r in rows]
        db.execute(
            "UPDATE users SET email_verified = true WHERE id = ANY(%s)",
            (ids,),
        )
        last_id = ids[-1]  # Keyset pagination: advance past this batch
        time.sleep(0.05)  # Give the database a breather
        print(f"Updated through id {last_id}")
Testing Migrations Before Production
#!/bin/bash
# Test against a copy of your production schema
# 1. Dump production schema (no data)
pg_dump --schema-only production_db > /tmp/schema.sql
# 2. Create test database
createdb migration_test
psql migration_test < /tmp/schema.sql
# 3. Time the migration
time psql migration_test < migrations/V10__add_column.sql
# 4. Verify schema looks correct
psql migration_test -c "\d users"
# 5. Drop test database
dropdb migration_test
Always do this before running on production. A migration that takes 2 minutes on a 10M row test table can easily take 20 minutes or more on your 100M row production table.
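A rough rule of thumb for that extrapolation, assuming the migration scales about linearly with row count (it often scales worse, so treat the result as a floor, not a promise):

```python
def estimated_minutes(test_minutes, test_rows, prod_rows):
    """Linear extrapolation from a test run; indexes, I/O contention, and
    lock waits on the live table usually push the real number higher."""
    return test_minutes * prod_rows / test_rows

estimated_minutes(2, 10_000_000, 100_000_000)  # 20.0
```

If the extrapolated number is longer than you're comfortable with, that's your cue to batch the work or split the migration into smaller steps before it ever touches production.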
The Bottom Line
Zero-downtime migrations take more steps, but each step is safe and reversible.
The rules:
- Add columns nullable first, add constraints after backfill
- Never rename in one step — use expand/contract over multiple deploys
- Use CREATE INDEX CONCURRENTLY; never create indexes on a live table without it
- Run data backfills in batches with sleeps between them
- Test migration timing on a schema copy before production
- Each step must work with both old and new application code running simultaneously
The expand/contract pattern is slower but never causes downtime. Take the extra time.