Database Migrations Without Downtime: What I've Learned the Hard Way
The first time I ran a database migration in production, I brought the app down for six minutes. It was a Saturday, which limited the damage, but those six minutes taught me something important: database schema changes are one of the few things in web development that can genuinely break everything, immediately, for everyone.
Since then, I’ve developed a set of practices for running migrations safely. None of this is original; it’s drawn from conversations with more experienced engineers, blog posts by companies that have learned these lessons at scale, and my own collection of mistakes. But it’s the guide I wish someone had given me when I started deploying to production.
The Core Problem
Your application code and your database schema need to be compatible. When you deploy new code that expects a column called full_name, but the database still has separate first_name and last_name columns, things break.
The naive approach is: deploy migration, then deploy code. Or: deploy code, then deploy migration. Both create a window where the code and schema are incompatible. If your deployment takes seconds, the window is small. If something goes wrong and you need to roll back, the window can become very large.
The better approach is to make every migration backward-compatible. At no point during the process should the running application be incompatible with the current schema. This requires breaking large changes into smaller steps.
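Backward compatibility usually means the application tolerates both schema versions at once. A minimal sketch of what that looks like on the read path (the helper name and row shape are hypothetical, using the full_name example from above):

```javascript
// Hypothetical helper: prefer the new column, fall back to the old ones.
// This code is safe to deploy before, during, or after the schema change.
function displayName(row) {
  if (row.full_name != null) return row.full_name;
  return `${row.first_name} ${row.last_name}`;
}
```

Because this function works against either schema, deploy order stops mattering, and a rollback of either the code or the migration leaves the app in a working state.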
The Expand-Contract Pattern
The most reliable pattern for zero-downtime migrations is expand-contract (sometimes called parallel change). It works in three phases.
Phase 1: Expand. Add new columns, tables, or indexes alongside the existing ones. Don’t remove or rename anything. The old code continues to work because nothing it depends on has changed.
-- Phase 1: Add the new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
Phase 2: Migrate. Deploy new code that writes to both the old and new structures. Backfill existing data from the old structure to the new one. During this phase, both structures are maintained in parallel.
// Application code writes to both columns
async function updateUser(id, firstName, lastName) {
  const fullName = `${firstName} ${lastName}`;
  await db.query(
    `UPDATE users SET first_name = $1, last_name = $2, full_name = $3 WHERE id = $4`,
    [firstName, lastName, fullName, id]
  );
}
-- Backfill existing rows
UPDATE users SET full_name = CONCAT(first_name, ' ', last_name)
WHERE full_name IS NULL;
Phase 3: Contract. Once all data is migrated and the new code is verified working, deploy code that only uses the new structure. Then remove the old columns.
-- Phase 3: Remove old columns (only after code no longer uses them)
ALTER TABLE users DROP COLUMN first_name;
ALTER TABLE users DROP COLUMN last_name;
This pattern turns one scary migration into three safe ones. Each step is independently deployable and rollback-safe.
Dangerous Operations
Some database operations are inherently risky and need special handling.
Adding a column with a default value on large tables can lock the table for the duration of the operation in older PostgreSQL versions (pre-11). In PostgreSQL 11+, ALTER TABLE ADD COLUMN ... DEFAULT ... is fast because the default is stored in the catalogue rather than written to every row. But on MySQL before 8.0 (which added instant ADD COLUMN for InnoDB) or an older PostgreSQL, this can be a multi-minute table lock.
-- Safe in PostgreSQL 11+
ALTER TABLE users ADD COLUMN status VARCHAR(50) DEFAULT 'active';
-- If you're on older versions, add without default first, then set default separately
ALTER TABLE users ADD COLUMN status VARCHAR(50);
ALTER TABLE users ALTER COLUMN status SET DEFAULT 'active';
-- Then backfill existing rows in batches
Renaming a column breaks any code that references the old name. Never rename in place. Instead, add a new column, migrate data, update code, then drop the old column.
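In other words, a rename is a miniature expand-contract of its own. A sketch, assuming a hypothetical signup_date column being renamed to created_at:

```sql
-- 1. Expand: add the new column alongside the old one
ALTER TABLE users ADD COLUMN created_at TIMESTAMP;

-- 2. Migrate: deploy code that writes to both, then backfill
UPDATE users SET created_at = signup_date WHERE created_at IS NULL;

-- 3. Contract: once no deployed code reads signup_date, drop it
ALTER TABLE users DROP COLUMN signup_date;
```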
Adding a NOT NULL constraint to an existing column requires all existing rows to have a value. If any rows have NULL, the migration fails. Add the constraint only after backfilling all NULL values, and do it with NOT VALID first (in PostgreSQL) to avoid a full table scan during the constraint addition:
-- Step 1: Add constraint without validating existing rows (brief lock, no table scan)
ALTER TABLE users ADD CONSTRAINT users_email_not_null
CHECK (email IS NOT NULL) NOT VALID;
-- Step 2: Validate the constraint (scans table but doesn't block writes)
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;
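On PostgreSQL 12 and later, there's a useful follow-up: SET NOT NULL can use an existing validated CHECK constraint as proof that no NULLs exist, so it skips the full-table scan it would otherwise need. A sketch continuing the example above:

```sql
-- Step 3 (PostgreSQL 12+): the validated CHECK constraint proves there
-- are no NULLs, so this promotion skips the full-table scan
ALTER TABLE users ALTER COLUMN email SET NOT NULL;

-- The CHECK constraint is now redundant and can be dropped
ALTER TABLE users DROP CONSTRAINT users_email_not_null;
```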
Creating an index on a large table can lock writes for the duration. Use CREATE INDEX CONCURRENTLY in PostgreSQL:
-- This blocks writes:
CREATE INDEX idx_users_email ON users(email);
-- This doesn't:
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
The concurrent version takes longer but doesn’t lock the table. Always use it in production.
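Two caveats: CREATE INDEX CONCURRENTLY can't run inside a transaction block, and if the build fails partway (for example, on a duplicate value while building a unique index) it leaves behind an INVALID index that must be dropped before retrying:

```sql
-- Find indexes left invalid by a failed concurrent build
SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid;

-- Drop the half-built index and retry
DROP INDEX CONCURRENTLY idx_users_email;
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
```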
Backfilling Data
When you add a new column and need to populate it from existing data, don’t try to update all rows in a single query on a large table. A single UPDATE that touches millions of rows holds its row locks and one long-running transaction for the entire duration, generates a huge burst of write activity, and can bloat the table badly.
Instead, backfill in batches:
async function backfillFullName(batchSize = 1000) {
  let updated = 0;
  while (true) {
    const result = await db.query(`
      UPDATE users
      SET full_name = CONCAT(first_name, ' ', last_name)
      WHERE full_name IS NULL
        AND id IN (
          SELECT id FROM users WHERE full_name IS NULL LIMIT $1
        )
    `, [batchSize]);
    updated += result.rowCount;
    console.log(`Backfilled ${updated} rows`);
    if (result.rowCount < batchSize) break;
    // Small delay to reduce load
    await new Promise(resolve => setTimeout(resolve, 100));
  }
}
The batch approach takes longer end to end, but each transaction is short, so application queries are never blocked for long. The small delay between batches keeps the database from becoming saturated.
Migration Tools
For Node.js projects, I’ve used several migration tools. Here’s what I think of them.
Drizzle Kit generates migrations from your Drizzle schema definitions. It’s my current preference. The migrations are plain SQL files, which means you can review and modify them before running. The push-based workflow (where it compares your schema to the database and generates the diff) is fast for development but I always review the generated SQL before running in production.
Prisma Migrate works well within the Prisma ecosystem but is tightly coupled to Prisma’s schema format. If you’re using Prisma, it’s fine. If you’re not, it’s not worth adopting just for migrations.
node-pg-migrate is a lightweight, SQL-first migration tool for PostgreSQL. You write migrations as JavaScript files that contain SQL. It’s simple and gives you full control, which I like. It’s what I recommend for people who want to understand exactly what’s happening.
I’ve seen teams at team400.ai use automated migration diffing tools that compare staging and production schemas and flag potentially dangerous operations before they reach production. That kind of safety net is worth building into your deployment pipeline, especially if multiple developers are writing migrations concurrently.
Testing Migrations
Always test migrations against a copy of your production data, not just against a development database with 100 rows. A migration that runs instantly on a small dataset might take 20 minutes on a table with 10 million rows.
My testing workflow:
- Take a database dump from production (anonymised if it contains personal data).
- Restore it to a testing environment.
- Run the migration.
- Verify the migration completed successfully and the data looks correct.
- Time how long it took.
- Run the application’s test suite against the migrated database.
If the migration takes more than a few seconds, it needs optimisation or needs to be broken into smaller steps.
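The dump-and-restore steps are a few commands with standard PostgreSQL tooling. A sketch; the connection string, database name, and migration file path are placeholders:

```shell
# 1. Dump production (custom format compresses and supports pg_restore)
pg_dump --format=custom --no-owner --file=prod.dump "$PROD_DATABASE_URL"

# 2. Restore into a scratch database
createdb migration_test
pg_restore --no-owner --dbname=migration_test prod.dump

# 3. Run the migration and time it
time psql migration_test -f migrations/add_full_name.sql
```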
The Checklist
Before running any migration in production, I go through this checklist:
- Is the migration backward-compatible with the currently deployed code?
- If the migration fails halfway through, will the application still work?
- Can the migration be rolled back? Is there a rollback script?
- Has the migration been tested against production-scale data?
- Are there any table-locking operations? If so, what’s the expected lock duration?
- Is the migration running during low-traffic hours?
- Who else knows the migration is happening?
The last point sounds trivial, but it matters. Someone else on the team should know you’re running a migration so they don’t deploy conflicting changes simultaneously, and so someone can help if things go wrong.
Database migrations aren’t glamorous. But getting them right is the difference between a smooth deployment and a 3am incident response. Take them seriously.