Nearly every team will tell you their migrations are reversible. Try it on a Tuesday afternoon, with traffic on, in front of a watching CTO. The answer is usually "well, this one is, but…". The reason most migrations can't actually be rolled back is that nobody designs them that way. They design them to go forward, and then write def downgrade() because the framework asks for it.
We have spent a lot of nights on the wrong side of a migration that "should have" rolled back. The patterns below are the ones we now apply by default. They are not novel. They are the discipline that production-grade teams converge on after their second or third bad night. The point of this post is to spare you a few of them.
The constraint we hold throughout: any migration we ship has to be reversible without losing data, on a multi-instance fleet, under load. Not in theory. On a Tuesday.
Two-phase column additions
The single most common failure mode is the one-shot migration that adds a column, backfills it, and starts reading from it in the same deploy. It works on a laptop. It works in staging with one instance. It breaks the moment the fleet has more than one node and the rollout is gradual.
The pattern that holds up: five deploys, not one.
- Schema-only. Add the new column, nullable, no constraints. Old code keeps running unchanged. This deploy is trivially reversible because nothing reads or writes the new column yet.
- Dual-write. Deploy code that writes to both the old and the new column on every mutation. Reads still come from the old column. The new column starts catching up for new rows.
- Backfill. Run a separate, idempotent job that fills the new column for historical rows. Crucially, this is a job, not a migration — it can pause, resume, and be observed. We'll come back to this.
- Cut over reads. Deploy code that reads from the new column. Old column is still being written to as a safety net. If reading the new column blows up, you roll back this single deploy and the system is back to a known-good state in minutes.
- Retire the old column. Once the new column has been the read path long enough to trust, drop the old column in a final migration.
Five deploys for one logical change. The trade is that any of the four reversible deploys can be rolled back in isolation. The fifth — the drop — is the only step that genuinely commits, and by then you have weeks of evidence the new column is correct.
The discipline costs time per migration. It buys back the ability to deploy at 2pm instead of at 2am.
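The application side of deploys 1, 2 and 4 can be sketched in a few lines. This is a minimal illustration rather than our production code: the `users` table, the `name` and `display_name` columns, and the `read_from_new` switch are all hypothetical, and SQLite stands in for the real database so the sketch runs anywhere.

```python
import sqlite3

OLD_COLUMN = "name"           # hypothetical existing column
NEW_COLUMN = "display_name"   # hypothetical replacement column


def deploy_1_add_column(conn):
    # Schema-only: nullable, no constraints, nothing reads or writes it yet.
    conn.execute(f"ALTER TABLE users ADD COLUMN {NEW_COLUMN} TEXT")


def save_name(conn, user_id, value):
    # Deploy 2 onwards: every mutation writes both columns.
    conn.execute(
        f"UPDATE users SET {OLD_COLUMN} = ?, {NEW_COLUMN} = ? WHERE id = ?",
        (value, value, user_id),
    )


def load_name(conn, user_id, read_from_new):
    # Deploy 4 flips read_from_new to True; rolling back flips it to False.
    column = NEW_COLUMN if read_from_new else OLD_COLUMN
    row = conn.execute(
        f"SELECT {column} FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None
```

Because the write path keeps both columns in step, the read cutover is the only thing that changes in deploy 4, and undoing it touches no data.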
Renames are deletes-and-adds
Never rename a column in one migration. Not in Postgres, not in MySQL, not in anything that runs more than one instance of your application at once.
The reason is the fleet during a rolling deploy. Instance A is still on the old code and reads customer_name. Instance B has the new code and expects full_name. Whenever the rename itself runs, one of them is wrong: before it, the new instances query a column that does not exist yet; after it, the old instances query a column that no longer exists. For the few minutes the rollout takes, part of your fleet fails on every query that touches the column.
A rename is the two-phase column addition above, with one extra step: copy the data over during the dual-write phase. The "old column" is the original name. The "new column" is the new name. Five deploys, same as before. The rename is conceptually the most innocent change you can make and operationally one of the most dangerous, precisely because it looks so safe in a single-instance dev environment.
If a junior engineer on the team renames a column in a single migration, it's not a code review nit. It's a learning moment about what a deploy actually looks like in production.
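The failure is easy to reproduce on a laptop once you simulate the second instance. The sketch below assumes a hypothetical `customers` table and uses SQLite in place of the production database: it builds the schema as it looks just after a one-shot rename, then runs the query an old instance would still be sending.

```python
import sqlite3


def old_code_read(conn, customer_id):
    # What an instance on the previous release still runs.
    return conn.execute(
        "SELECT customer_name FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()


# The database as it looks right after the one-shot rename has run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO customers (id, full_name) VALUES (1, 'Ada Lovelace')")

try:
    old_code_read(conn, 1)
    old_code_failed = False
except sqlite3.OperationalError as exc:
    # The old instance cannot find the column it was built against.
    old_code_failed = True
    print(f"old instance: {exc}")
```

Nothing about the data is wrong here. The only problem is that two versions of the code disagree with one version of the schema.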
Backfills are separate from schema changes
A schema change that adds a column is, with the right flags, near-instantaneous and fully reversible. A backfill that updates millions of rows is neither. Treating them as the same deploy is how you end up forty minutes into a running migration with no idea whether to wait or kill it.
The rule we hold: schema migrations are additive and fast. Backfills are jobs.
A schema migration that takes more than a second or two on the production database is a red flag. Real backfills go in a queue (Celery, a one-off management command, whatever the team uses) with three properties: idempotent, resumable, observable. We want to see progress. We want to be able to pause it during a traffic spike. We want to know it finished, and finished correctly.
Concretely, this means a feature lands in three logical phases:
- Phase 1: schema migration — additive, fast, reversible.
- Phase 2: code change that uses the new shape, with the dual-write safety net.
- Phase 3: the backfill, run as a job, observed to completion.
Each phase is independently reversible. If the backfill turns out to be wrong, you stop the job, fix the logic, and re-run it. You do not roll back the schema change. You do not redeploy the application. The blast radius of a mistake stays scoped to the phase where the mistake lives.
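A backfill with those three properties does not need much machinery. The following is a sketch under stated assumptions, not a drop-in job: the `users` table, the `name` and `display_name` columns, and the `should_pause` hook are hypothetical, and SQLite stands in for the real database. In practice the loop would live inside whatever queue or management command the team already uses.

```python
import sqlite3


def backfill_batch(conn, batch_size=1000):
    """Fill display_name for one batch; return the number of rows updated."""
    # Only rows that still need work are selected, so re-running the job
    # (idempotent) or restarting it midway (resumable) is always safe.
    cur = conn.execute(
        """
        UPDATE users
           SET display_name = name
         WHERE id IN (
               SELECT id FROM users
                WHERE display_name IS NULL AND name IS NOT NULL
                ORDER BY id
                LIMIT ?)
        """,
        (batch_size,),
    )
    conn.commit()  # short transactions: each batch commits on its own
    return cur.rowcount


def run_backfill(conn, batch_size=1000, should_pause=lambda: False):
    """Run batches until nothing is left or should_pause() returns True."""
    total = 0
    while not should_pause():
        updated = backfill_batch(conn, batch_size)
        if updated == 0:
            break
        total += updated
        remaining = conn.execute(
            "SELECT COUNT(*) FROM users"
            " WHERE display_name IS NULL AND name IS NOT NULL"
        ).fetchone()[0]
        # Observable: progress is reported after every batch.
        print(f"backfilled={total} remaining={remaining}")
    return total
```

The `name IS NOT NULL` filter matters: without it, rows with nothing to copy would be selected on every pass and the loop would never finish.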
Constraint tightening is its own deploy
Adding NOT NULL to a column that was added two deploys ago feels like a continuation of the same change. It is not. It is a separate deploy, and it should be the last in the series, not bundled with the column addition.
Why this matters: when a constraint tightens, the surface area of "what can break" expands. Any code path that produces a NULL for that column — including code that was perfectly happy to do so a week ago — will now error. Including, often, code paths that nobody remembered existed.
The right ordering:
- Add the column, nullable, with a sensible default for new writes.
- Deploy application code that always populates the column.
- Backfill historical rows that were inserted before step 2.
- Only then add the NOT NULL constraint.
Each step is independently reversible. The constraint-tightening deploy is the riskiest one in the series, and it deserves its own deploy window precisely because it is the one most likely to surface an unknown unknown. If it fails, you roll back the constraint, the application still works, and you have a clean signal about what to fix.
The version of this rule we burn into junior engineers: a constraint is a deployment, not a migration.
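One cheap guard before the constraint deploy is to count the NULLs yourself rather than let the ALTER discover them. A minimal sketch, assuming a hypothetical `users.display_name` column and using SQLite as a stand-in; the helper names are ours for illustration and not part of any framework.

```python
import sqlite3


def null_count(conn, table, column):
    # Identifiers are interpolated only because they come from the migration
    # author, never from user input.
    return conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]


def assert_ready_for_not_null(conn, table, column):
    """Raise if any row would violate the constraint we are about to add."""
    remaining = null_count(conn, table, column)
    if remaining:
        raise RuntimeError(
            f"{remaining} rows in {table}.{column} are still NULL; "
            "finish the backfill before tightening the constraint"
        )
```

A zero count only describes existing rows. It says nothing about a code path that still writes NULL, which is why the constraint still gets its own deploy window.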
Foreign keys with NOT VALID then VALIDATE
The Postgres-specific case worth knowing. Adding a foreign key to a large table looks innocent, but Postgres holds a write-blocking lock on both tables while it checks every existing row, and on a big table that is long enough to take down the application. We have seen this happen. Twice.
The pattern that doesn't:
ALTER TABLE orders
ADD CONSTRAINT orders_customer_fk
FOREIGN KEY (customer_id) REFERENCES customers(id)
NOT VALID;
NOT VALID tells Postgres to enforce the constraint on new rows but not to scan existing rows. The lock is brief. The migration finishes in milliseconds.
Then, separately:
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;
VALIDATE CONSTRAINT takes a much weaker lock (SHARE UPDATE EXCLUSIVE), so the scan of existing rows runs while normal reads and writes continue. It can be run at a quiet time. If it fails, because some historical row has a dangling customer_id, you find out and fix the data, not the schema. The constraint is already protecting new writes.
Two migrations, one logical change, no production outage. This pattern generalises: any constraint that requires a table scan should be added with whatever the equivalent of "don't scan existing rows yet" is, and validated separately.
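Before running the validation step it can help to look for the dangling rows yourself, so the fix is planned rather than discovered. The query below is a sketch against the `orders` and `customers` tables from the example above; the helper name and the `limit` parameter are hypothetical, and SQLite is used only so the snippet is runnable as written.

```python
import sqlite3


def dangling_customer_ids(conn, limit=100):
    """Return (order_id, customer_id) pairs that point at no customer."""
    return conn.execute(
        """
        SELECT o.id, o.customer_id
          FROM orders AS o
          LEFT JOIN customers AS c ON c.id = o.customer_id
         WHERE o.customer_id IS NOT NULL  -- NULL references pass the constraint
           AND c.id IS NULL               -- no matching customer row
         ORDER BY o.id
         LIMIT ?
        """,
        (limit,),
    ).fetchall()
```

An empty result is a good sign that validation will pass. Rows can still change between the check and the validation, so treat it as a preview and not a guarantee.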
The migration that genuinely can't be rolled back
We are not going to claim every migration can be rolled back. Some are genuinely one-way. Dropping a column is one-way. Dropping a table is one-way. Collapsing two enum values into one is one-way, and lossy on top of that. Splitting one column into three after a parsing change is, in practice, one-way.
The discipline is not pretending these don't exist. It is naming them, in advance, in writing.
Our rule: any migration that is genuinely not reversible requires a short ADR before it merges. The ADR says, in three or four bullets:
- What changes, and what is lost.
- Why we are committing to it (i.e. why a two-phase migration isn't available).
- What the recovery plan is if the deploy goes wrong (usually: restore from backup, replay X hours of writes).
- The deploy window we are scheduling it in.
This is not paperwork for paperwork's sake. It is the moment where someone might notice that the migration didn't actually need to be one-way, and the team can pick the boring reversible path instead. About a third of the time, that's what happens.
When the ADR concludes the migration really is one-way, the team goes in with eyes open. The deploy gets a real window, real eyes on the dashboards, and a tested restore path. The cost of an irreversible migration is the planning around it, paid up front.
The trade-off
All of this is more work per migration. A feature that would have been one migration becomes three. A column rename becomes a five-deploy saga. The team has to slow down enough to think about which phase they're in.
The trade is that "the migration broke and we rolled back" stops being a Saturday emergency and becomes a five-minute incident in the middle of a normal workday. The cost is paid in small, predictable increments during planning. The cost is not paid in a single catastrophic night where the whole team is on a call trying to remember whether the backup is consistent and how long replay takes.
We have been on both sides of this trade. Paying in small increments is genuinely better. It is not even close.
Closing
Most teams discover this discipline after the bad night, in retrospect. The post-mortem is honest, the lessons are sharp, and the next migration is meticulous. The one after that is also meticulous. The third one is a bit less meticulous, and by the sixth migration the team is back to one-shot changes that "should be fine."
We try to learn this discipline from other people's bad nights instead. The patterns above aren't ours. They are the convergent answer that production-grade teams reach after enough painful Saturdays. The point is to skip the Saturdays and keep the answer.
If you take one thing from this post: separate schema changes, code changes, backfills, and constraint changes into different deploys, and you will recover most of the reversibility you thought you already had.


