Moving 3 Million Rows in Production Without Dropping a Request
Zero-downtime migrations require you to think forward and backward at the same time. Here's the expand-and-contract pattern that got us through it at Grab.
January 20, 2026
The database is live. Traffic is flowing. You need to change the schema.
This is one of those problems where the naive approach (ALTER TABLE, redeploy) works great in staging and causes a 3am incident in production. Here's how we approached a large migration at Grab without any downtime.
The Setup
We had a table with ~3 million rows that needed a new column added, backfilled, and an old column deprecated. Simple enough on paper. In production:
- The table was hit by multiple services
- Some services had long-running transactions that held locks
- We couldn't afford any read latency degradation
Expand-and-Contract
The pattern has three phases:
Phase 1: Expand (backward-compatible change)
Add the new column as nullable, with no constraints:
ALTER TABLE orders ADD COLUMN new_status_code varchar(50);
Deploy code that writes to both columns but reads from the old one. The system is now tolerant of both schema states.
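The dual-write step can be sketched in a few lines. This is a hypothetical illustration, not our actual service code: the helper names are invented, and sqlite3 stands in for Postgres so the snippet runs as-is.

```python
import sqlite3

# Phase 1 sketch: write to BOTH columns, read only from the old one.
# write_order_status / read_order_status are hypothetical helper names.
def write_order_status(conn, order_id, status):
    # Writing both columns keeps the data consistent under either
    # schema state, old or new.
    conn.execute(
        "UPDATE orders SET old_status_code = ?, new_status_code = ? WHERE id = ?",
        (status, status, order_id),
    )
    conn.commit()

def read_order_status(conn, order_id):
    # Reads stay on the old column until the backfill is verified.
    row = conn.execute(
        "SELECT old_status_code FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row[0] if row else None
```

Because every new write lands in both columns, the backfill only has to cover rows written before this deploy.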
Phase 2: Backfill
Backfill the new column in batches, never locking the table:
UPDATE orders
SET new_status_code = old_status_code
WHERE id BETWEEN $1 AND $2
AND new_status_code IS NULL;
We ran this in a Kubernetes Job with 10,000-row batches and a 50ms sleep between batches to avoid hammering Postgres. The full backfill took ~6 hours on a Saturday night.
Never run an UPDATE without a WHERE clause that bounds the rows touched. A full-table update holds row locks for the life of the transaction, and that will destroy your read latency.
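The batch loop the Job ran looked roughly like this. A minimal sketch with hypothetical names, using sqlite3 in place of Postgres so it is runnable; the real job used the $1/$2-parameterized query above.

```python
import sqlite3
import time

BATCH_SIZE = 10_000  # rows per batch, as in the Job described above
SLEEP_SECS = 0.05    # 50ms pause between batches

def backfill_range(conn, lo, hi):
    # Walk the primary-key range in fixed-size batches. Each batch is
    # its own short transaction, so no lock is held for long.
    start = lo
    while start <= hi:
        end = min(start + BATCH_SIZE - 1, hi)
        conn.execute(
            "UPDATE orders SET new_status_code = old_status_code "
            "WHERE id BETWEEN ? AND ? AND new_status_code IS NULL",
            (start, end),
        )
        conn.commit()           # commit per batch: locks stay short-lived
        time.sleep(SLEEP_SECS)  # back off so reads are never starved
        start = end + 1
```

The `new_status_code IS NULL` predicate makes the loop idempotent: if the job dies mid-run, restarting it from the beginning re-touches nothing that was already copied.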
Phase 3: Contract (remove old column)
Once the backfill is verified, flip the read path to the new column, then remove the old-column write from application code in a separate deploy. Only after both changes have been stable in production is it safe to drop the old column:
ALTER TABLE orders DROP COLUMN old_status_code;
Validation Before Cutover
Before we flipped reads to the new column, we ran a validation query:
SELECT COUNT(*)
FROM orders
WHERE new_status_code IS NULL
AND old_status_code IS NOT NULL;
When this returned 0, we were confident the backfill was complete.
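Wrapping that query in a small gate function makes the cutover check mechanical rather than eyeballed. A sketch with a hypothetical name, again runnable against sqlite3:

```python
import sqlite3

def backfill_complete(conn):
    # True only when no row still has a NULL new_status_code while the
    # old column holds a value, i.e. every eligible row was copied.
    (missing,) = conn.execute(
        "SELECT COUNT(*) FROM orders "
        "WHERE new_status_code IS NULL AND old_status_code IS NOT NULL"
    ).fetchone()
    return missing == 0
```

A check like this can run as the last step of the backfill job, or as a precondition in whatever script flips the read flag.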
Feature Flags
We used feature flags to control the read cutover independently of the deployment. This meant:
- Deploy new code (writes both columns, reads old)
- Verify backfill
- Flip feature flag to read new column (no redeploy needed)
- Monitor for 24h
- Remove old column and clean up feature flag
The flag meant we could roll back the read behavior instantly if something went wrong, without rolling back a deployment.
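The flag-gated read path is simple to sketch. Here the flag store is a plain dict; in production it would be whatever runtime feature-flag service you already have, which is what makes the cutover redeploy-free. Names are hypothetical, sqlite3 stands in for Postgres.

```python
import sqlite3

# Flipping this value at runtime switches every read; flipping it back
# is an instant rollback of read behavior, no deployment involved.
FLAGS = {"orders.read_new_status_column": False}

def read_order_status(conn, order_id):
    column = (
        "new_status_code"
        if FLAGS["orders.read_new_status_column"]
        else "old_status_code"
    )
    row = conn.execute(
        f"SELECT {column} FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row[0] if row else None
```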
What I'd Do Differently
The 6-hour Saturday backfill window was nerve-wracking. Next time I'd start the backfill during a lower-traffic period and add a progress-tracking table so we could see the exact completion percentage at any moment. Blind waiting is the enemy of calm migrations.
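That progress table doesn't need to be fancy. A hypothetical sketch of what I have in mind, with the backfill job writing one row per batch so a dashboard or a psql session can watch it converge:

```python
import sqlite3
import time

# Hypothetical progress table: one row appended after each batch, so
# "SELECT * FROM backfill_progress ORDER BY recorded_at DESC LIMIT 1"
# always shows how far along the job is.
def init_progress(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS backfill_progress ("
        "recorded_at REAL, last_id INTEGER, pct_done REAL)"
    )

def record_progress(conn, last_id, max_id):
    conn.execute(
        "INSERT INTO backfill_progress (recorded_at, last_id, pct_done) "
        "VALUES (?, ?, ?)",
        (time.time(), last_id, round(100.0 * last_id / max_id, 2)),
    )
    conn.commit()
```

The `last_id` column doubles as a checkpoint: a restarted job can resume from the highest recorded id instead of rescanning from row one.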