Zero-Downtime Database Migrations

TL;DR

Zero-downtime migrations use backward-compatible steps. Add columns nullable first. Deploy code that handles both schemas. Backfill data in batches. Add constraints. Never rename or drop in one step—use expand and contract.

Our last "5-minute migration window" turned into 45 minutes of downtime. ALTER TABLE on a 50M row table took longer than expected, locks piled up, queries timed out. We lost $30k in sales.

That was the last time we took downtime for a migration. Here's the multi-step approach that works while your database is under full production load.

Why Migrations Cause Downtime

Most migration problems come from three things:

-- 1. Table locks (ALTER TABLE acquires exclusive lock)
ALTER TABLE users ADD COLUMN verified BOOLEAN DEFAULT false;
-- Blocks ALL reads and writes until complete
-- On a 10M row table: potentially minutes

-- 2. Schema-code mismatch during rolling deploys
-- Old code expects 'user_name', new code uses 'username'
-- Migration runs → old servers still running → break

-- 3. Long-running data transforms
UPDATE users SET verified = true WHERE created_at < '2024-01-01';
-- 5M rows × lock time = everything piles up

The Key Principle: Backward-Compatible Steps

Every migration step must work with both old and new code running simultaneously. During a rolling deploy, some servers run old code, some run new.

Pattern 1: Adding a Column

-- BAD: NOT NULL DEFAULT on large table rewrites every row
ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW();
-- Long exclusive lock, potential timeout
-- GOOD: Three steps

-- Step 1: Add column as nullable (fast, no rewrite)
ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ;
-- Deploy: milliseconds, compatible with all code

-- Step 2: Backfill existing rows in batches
DO $$
DECLARE
    batch_size INT := 1000;
    last_id BIGINT := 0;
    max_batch_id BIGINT;
BEGIN
    LOOP
        WITH batch AS (
            SELECT id FROM orders
            WHERE processed_at IS NULL AND id > last_id
            ORDER BY id
            LIMIT batch_size
        ), updated AS (
            UPDATE orders o
            SET processed_at = o.created_at
            FROM batch
            WHERE o.id = batch.id
            RETURNING o.id
        )
        SELECT MAX(id) INTO max_batch_id FROM updated;

        EXIT WHEN max_batch_id IS NULL;
        last_id := max_batch_id;
        PERFORM pg_sleep(0.01);  -- Be gentle on the database
    END LOOP;
END $$;
-- Caveat: a DO block runs in a single transaction, so row locks accumulate
-- until it commits. On very large tables, run each batch as its own
-- transaction from an external script (see the Python version below).

-- Step 3: Add NOT NULL constraint (fast once no NULLs exist)
ALTER TABLE orders ALTER COLUMN processed_at SET NOT NULL;

Note: PostgreSQL 11+ adds columns with constant defaults instantly (stores default in catalog, not each row). For older versions or non-constant defaults, always go nullable first.
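The three steps reduce to a fixed sequence of statements, which makes them easy to template in a migration tool. A minimal sketch — `add_not_null_steps` and its arguments are hypothetical names, and a real migration would still run the backfill in batches rather than as one UPDATE:

```python
def add_not_null_steps(table, column, coltype, backfill_expr):
    """Ordered SQL for the nullable-first pattern (sketch:
    statement generation only; batching and execution omitted)."""
    return [
        f"ALTER TABLE {table} ADD COLUMN {column} {coltype}",
        f"UPDATE {table} SET {column} = {backfill_expr} WHERE {column} IS NULL",
        f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL",
    ]

steps = add_not_null_steps("orders", "processed_at", "TIMESTAMPTZ", "created_at")
```

Each statement ships as its own deploy: the constraint only goes out after monitoring confirms the backfill finished.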

Pattern 2: Renaming a Column

Never rename in one step. Old code breaks immediately.

-- BAD: Immediate breakage
ALTER TABLE users RENAME COLUMN user_name TO username;
-- All servers still running old code break instantly

Use the expand/contract pattern — 6 steps over multiple deployments:

Step 1: Add new column (keep old column)
Step 2: Deploy: write to BOTH columns
Step 3: Backfill old → new
Step 4: Deploy: read from new column
Step 5: Deploy: stop writing to old column
Step 6: Drop old column (weeks later)
-- Step 1: Add new column
ALTER TABLE users ADD COLUMN username TEXT;

// Step 2: Dual write (both old and new code can coexist)
async function updateUsername(userId, name) {
    await db.execute(
        'UPDATE users SET user_name = $1, username = $1 WHERE id = $2',
        [name, userId]
    );
}

-- Step 3: Backfill (batch this on large tables, as shown earlier)
UPDATE users SET username = user_name WHERE username IS NULL;

-- Step 6: Drop old column (after confirming nothing uses it)
ALTER TABLE users DROP COLUMN user_name;
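Between steps 3 and 4, some rows may not have the new column populated yet. Reads can tolerate that window by preferring the new column and falling back to the old one — a sketch, assuming the driver returns rows as dicts (`read_username` is a hypothetical helper, not part of the migration itself):

```python
def read_username(row):
    """Prefer the new column; fall back to the old one while
    the backfill is still in flight."""
    value = row.get("username")
    return value if value is not None else row.get("user_name")
```

Once the backfill is verified complete, step 4's deploy can drop the fallback and read `username` unconditionally.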

Pattern 3: Adding an Index

-- BAD: Blocks writes for the entire build duration
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- On 10M rows: 30+ seconds, all writes blocked

-- GOOD: CONCURRENTLY (no write blocking)
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
-- Takes 2-3x longer to build, but causes no downtime
-- Cannot be run inside a transaction
-- If the build fails, it leaves an INVALID index behind: drop it and retry
-- For unique indexes:
CREATE UNIQUE INDEX CONCURRENTLY idx_users_email ON users(email);

-- Then enforce at the table level using the built index:
ALTER TABLE users ADD CONSTRAINT users_email_unique
    UNIQUE USING INDEX idx_users_email;
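The failure mode is worth automating: when a CONCURRENTLY build errors out (a deadlock, or a duplicate value during a unique build), PostgreSQL leaves an INVALID index that must be dropped before retrying. A sketch of that retry loop — `execute` stands in for any callable that runs one statement in autocommit mode, since CONCURRENTLY can't run inside a transaction:

```python
def create_index_with_retry(execute, index_name, create_sql, max_attempts=3):
    """Run CREATE INDEX CONCURRENTLY; on failure, drop the leftover
    INVALID index and try again."""
    for _ in range(max_attempts):
        try:
            execute(create_sql)
            return True
        except Exception:
            # A failed concurrent build leaves an INVALID index; remove it
            execute(f"DROP INDEX IF EXISTS {index_name}")
    return False
```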

Pattern 4: Removing a Column

-- Step 1: Make column nullable (if it isn't already)
ALTER TABLE users ALTER COLUMN legacy_field DROP NOT NULL;

-- Step 2: Deploy: remove all code references to the column
-- Monitor for errors in production

-- Step 3: Drop the column (days or weeks later)
ALTER TABLE users DROP COLUMN legacy_field;

Search your codebase before dropping:

# Find references before dropping
grep -r "legacy_field" src/ --include="*.js" --include="*.py" --include="*.go" --include="*.rb"
grep -r "legacy_field" migrations/

Pattern 5: Changing a Column Type

The hardest migration. Never change type in-place for large tables.

-- BAD: Full table rewrite, long exclusive lock
ALTER TABLE events ALTER COLUMN payload TYPE JSONB USING payload::JSONB;

-- GOOD: Parallel column
-- Step 1: Add new column
ALTER TABLE events ADD COLUMN payload_jsonb JSONB;

-- Step 2: Dual write (both columns)
-- Step 3: Backfill (batched on large tables, as above)
UPDATE events SET payload_jsonb = payload::JSONB WHERE payload_jsonb IS NULL;

-- Step 4: Deploy: switch reads to the new column
-- Step 5: Remove old column. Careful: the rename breaks any code still
-- referencing payload_jsonb, so it needs its own expand/contract cycle
-- (or simply keep the new column name)
ALTER TABLE events DROP COLUMN payload;
ALTER TABLE events RENAME COLUMN payload_jsonb TO payload;
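During the dual-write window the old TEXT column and new JSONB column must stay in sync, and a write that succeeds on the TEXT column but fails the JSONB cast would strand rows. Validating once before writing both avoids that. A minimal sketch (`payload_params` is a hypothetical helper; Postgres casts the string to JSONB on insert):

```python
import json

def payload_params(payload_text):
    """Validate once, then return the values for (payload, payload_jsonb).
    Raises ValueError on malformed JSON instead of writing a bad row."""
    json.loads(payload_text)  # json.JSONDecodeError subclasses ValueError
    return (payload_text, payload_text)
```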

Long-Running Migrations: Do Them in Batches

Never update millions of rows in a single statement:

# BAD - one giant transaction, row locks held for minutes
db.execute("""
    UPDATE users
    SET email_verified = true
    WHERE created_at < '2024-01-01'
""")

# GOOD - batches with delays, one transaction per batch
import time

def backfill_in_batches():
    last_id = 0
    batch_size = 1000

    while True:
        rows = db.execute("""
            UPDATE users
            SET email_verified = true
            WHERE id IN (
                SELECT id FROM users
                WHERE created_at < '2024-01-01'
                  AND email_verified = false
                  AND id > %s
                ORDER BY id
                LIMIT %s
            )
            RETURNING id
        """, (last_id, batch_size)).fetchall()
        db.commit()  # release row locks between batches

        if not rows:
            break

        last_id = max(r[0] for r in rows)  # highest id in this batch
        time.sleep(0.05)  # Give the database a breather
        print(f"Updated through id {last_id}")
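The keyset pattern in that loop (always `id > last_id`, ordered, limited) is what keeps each batch cheap on huge tables: no OFFSET, no rescanning rows already updated. Its shape, simulated in memory over a plain list of ids:

```python
def keyset_batches(ids, batch_size):
    """Yield batches the way the SQL loop does: each batch is the next
    batch_size ids strictly greater than the last id processed."""
    remaining = sorted(ids)
    last_id = 0
    while True:
        batch = [i for i in remaining if i > last_id][:batch_size]
        if not batch:
            return
        yield batch
        last_id = batch[-1]
```

The same walk never revisits an id, so the backfill makes forward progress even if new matching rows are inserted behind the cursor (those need a final sweep or the dual-write step to cover them).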

Testing Migrations Before Production

#!/bin/bash
# Test against a copy of your production schema

# 1. Dump production schema (no data)
pg_dump --schema-only production_db > /tmp/schema.sql

# 2. Create test database
createdb migration_test
psql migration_test < /tmp/schema.sql

# 3. Time the migration
time psql migration_test < migrations/V10__add_column.sql

# 4. Verify schema looks correct
psql migration_test -c "\d users"

# 5. Drop test database
dropdb migration_test

Always do this before running on production. A schema-only copy verifies that the migration is valid and the resulting schema is correct, but it can't tell you how long the migration will take on real data. For timing, run it against a restored backup or a production-sized sample: a migration that takes 2 minutes on a 10M row copy will take roughly 20 minutes on your 100M row production table.

The Bottom Line

Zero-downtime migrations take more steps, but each step is safe and reversible.

The rules:

  • Add columns nullable first, add constraints after backfill
  • Never rename in one step — use expand/contract over multiple deploys
  • Use CREATE INDEX CONCURRENTLY — never create indexes without it
  • Run data backfills in batches with sleeps between them
  • Test migration timing on a schema copy before production
  • Each step must work with both old and new application code running simultaneously

The expand/contract pattern is slower but never causes downtime. Take the extra time.