Moving 3 Million Rows in Production Without Dropping a Request
Zero-downtime migrations require you to think forward and backward at the same time. Here's the expand-and-contract pattern that got us through it at Grab.
January 20, 2026
The database is live. Traffic is flowing. You need to change the schema.
This is one of those problems where the naive approach (ALTER TABLE, redeploy) works great in staging and causes a 3am incident in production. Here's how we approached a large migration at Grab without any downtime.
The Setup
We had a table with ~3 million rows that needed a new column added, backfilled, and an old column deprecated. Simple enough on paper. In production:
- The table was hit by multiple services
- Some services had long-running transactions that held locks
- We couldn't afford any read latency degradation
Expand-and-Contract
The pattern has three phases:
Phase 1: Expand (backward-compatible change)
Add the new column as nullable, with no constraints:
ALTER TABLE orders ADD COLUMN new_status_code varchar(50);
Deploy code that writes to both columns but reads from the old one. The system is now tolerant of both schema states.
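The dual-write step can be sketched in a few lines. This is a hypothetical illustration, not our actual service code: the helper names are invented, and sqlite3 stands in for Postgres so the snippet runs as-is.

```python
import sqlite3

# Phase 1 sketch: write to BOTH columns, read only from the old one.
# write_order_status / read_order_status are hypothetical helper names.
def write_order_status(conn, order_id, status):
    # Writing both columns keeps the data consistent under either
    # schema state, old or new.
    conn.execute(
        "UPDATE orders SET old_status_code = ?, new_status_code = ? WHERE id = ?",
        (status, status, order_id),
    )
    conn.commit()

def read_order_status(conn, order_id):
    # Reads stay on the old column until the backfill is verified.
    row = conn.execute(
        "SELECT old_status_code FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row[0] if row else None
```

Because every new write lands in both columns, the backfill only has to cover rows written before this deploy.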
Phase 2: Backfill
Backfill the new column in batches, never locking the table:
UPDATE orders
SET new_status_code = old_status_code
WHERE id BETWEEN $1 AND $2
AND new_status_code IS NULL;
We ran this in a Kubernetes Job with 10,000-row batches and a 50ms sleep between batches to avoid hammering Postgres. The full backfill took ~6 hours on a Saturday night.
Never run an UPDATE without a WHERE clause that bounds the rows touched. A full-table update holds row locks for the life of the transaction, and that will destroy your read latency.
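The batch loop the Job ran looked roughly like this. A minimal sketch with hypothetical names, using sqlite3 in place of Postgres so it is runnable; the real job used the $1/$2-parameterized query above.

```python
import sqlite3
import time

BATCH_SIZE = 10_000  # rows per batch, as in the Job described above
SLEEP_SECS = 0.05    # 50ms pause between batches

def backfill_range(conn, lo, hi):
    # Walk the primary-key range in fixed-size batches. Each batch is
    # its own short transaction, so no lock is held for long.
    start = lo
    while start <= hi:
        end = min(start + BATCH_SIZE - 1, hi)
        conn.execute(
            "UPDATE orders SET new_status_code = old_status_code "
            "WHERE id BETWEEN ? AND ? AND new_status_code IS NULL",
            (start, end),
        )
        conn.commit()           # commit per batch: locks stay short-lived
        time.sleep(SLEEP_SECS)  # back off so reads are never starved
        start = end + 1
```

The `new_status_code IS NULL` predicate makes the loop idempotent: if the job dies mid-run, restarting it from the beginning re-touches nothing that was already copied.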
Phase 3: Contract (remove old column)
Once the backfill is verified, flip the read path to the new column, then remove the old-column write from application code in a separate deploy. Only after both changes have been stable in production is it safe to drop the old column:
ALTER TABLE orders DROP COLUMN old_status_code;
Validation Before Cutover
Before we flipped reads to the new column, we ran a validation query:
SELECT COUNT(*)
FROM orders
WHERE new_status_code IS NULL
AND old_status_code IS NOT NULL;
When this returned 0, we were confident the backfill was complete.
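Wrapping that query in a small gate function makes the cutover check mechanical rather than eyeballed. A sketch with a hypothetical name, again runnable against sqlite3:

```python
import sqlite3

def backfill_complete(conn):
    # True only when no row still has a NULL new_status_code while the
    # old column holds a value, i.e. every eligible row was copied.
    (missing,) = conn.execute(
        "SELECT COUNT(*) FROM orders "
        "WHERE new_status_code IS NULL AND old_status_code IS NOT NULL"
    ).fetchone()
    return missing == 0
```

A check like this can run as the last step of the backfill job, or as a precondition in whatever script flips the read flag.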
Feature Flags
We used feature flags to control the read cutover independently of the deployment. This meant:
- Deploy new code (writes both columns, reads old)
- Verify backfill
- Flip feature flag to read new column (no redeploy needed)
- Monitor for 24h
- Remove old column and clean up feature flag
The flag meant we could roll back the read behavior instantly if something went wrong, without rolling back a deployment.
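The flag-gated read path is simple to sketch. Here the flag store is a plain dict; in production it would be whatever runtime feature-flag service you already have, which is what makes the cutover redeploy-free. Names are hypothetical, sqlite3 stands in for Postgres.

```python
import sqlite3

# Flipping this value at runtime switches every read; flipping it back
# is an instant rollback of read behavior, no deployment involved.
FLAGS = {"orders.read_new_status_column": False}

def read_order_status(conn, order_id):
    column = (
        "new_status_code"
        if FLAGS["orders.read_new_status_column"]
        else "old_status_code"
    )
    row = conn.execute(
        f"SELECT {column} FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row[0] if row else None
```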
What I'd Do Differently
The 6-hour Saturday backfill window was nerve-wracking. Next time I'd start the backfill during a lower-traffic period and add a progress-tracking table so we could see the exact completion percentage at any moment. Blind waiting is the enemy of calm migrations.
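That progress table doesn't need to be fancy. A hypothetical sketch of what I have in mind, with the backfill job writing one row per batch so a dashboard or a psql session can watch it converge:

```python
import sqlite3
import time

# Hypothetical progress table: one row appended after each batch, so
# "SELECT * FROM backfill_progress ORDER BY recorded_at DESC LIMIT 1"
# always shows how far along the job is.
def init_progress(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS backfill_progress ("
        "recorded_at REAL, last_id INTEGER, pct_done REAL)"
    )

def record_progress(conn, last_id, max_id):
    conn.execute(
        "INSERT INTO backfill_progress (recorded_at, last_id, pct_done) "
        "VALUES (?, ?, ?)",
        (time.time(), last_id, round(100.0 * last_id / max_id, 2)),
    )
    conn.commit()
```

The `last_id` column doubles as a checkpoint: a restarted job can resume from the highest recorded id instead of rescanning from row one.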