What AI can vibecode — and what it can't ship to production

In 2026 anyone with an idea and a credit card can produce something that looks like a working application by lunchtime. Models will scaffold a Next.js storefront. They will write a passable Django backend. They will, if asked, even sketch a deployment pipeline and a Postgres schema. The artifact at the end is real, runnable, and impressive.

What it isn't, in almost every case, is production-grade. The gap between "this works on my laptop in front of a friendly audience" and "this works at 11pm on a Saturday when a real user with the wrong character encoding in their email address tries to log in" remains enormous. AI has narrowed many parts of software engineering. It has narrowed this part the least.

We have spent the last year taking AI-generated codebases and vibecoded MVPs to production for our clients. The list of things that consistently need to be fixed before launch is mostly the same every time. This is that list — and why we still find ourselves doing this work by hand.

1. Authentication that's actually safe

A model will give you a working login form. It will likely use a reputable library. It will probably even handle password hashing correctly. What it almost never does correctly, the first time, is:

Session invalidation on password change.
Rate-limiting on login and reset endpoints, with the correct fail-open vs fail-closed semantics.
Email enumeration prevention on signup and reset.
CSRF tokens scoped to the right operations.
A consistent permission model rather than per-route ad-hoc checks.
Multi-tenant isolation that is enforced at the database level, not in application code.

Each of these is a one-line search in OWASP and a multi-day fix in a real codebase. AI can patch them one at a time when asked, but won't proactively flag them.
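To make two of the items above concrete, here is a minimal sketch of a password-reset handler that is rate-limited, fails closed, and gives the same answer whether or not the account exists. Every name in it (SlidingWindowLimiter, request_password_reset, the known_emails set, the send_reset_email callback) is hypothetical and stands in for whatever your framework provides. The in-memory limiter is for illustration only; a multi-instance deploy would need a shared store, and this sketch does nothing about timing side channels.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory sliding-window limiter (illustrative; not shared across processes)."""

    def __init__(self, max_attempts, window_seconds, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock
        self._attempts = defaultdict(deque)

    def allow(self, key):
        now = self.clock()
        attempts = self._attempts[key]
        # Drop attempts that have aged out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False
        attempts.append(now)
        return True


GENERIC_RESET_MESSAGE = "If that address has an account, a reset link is on its way."


def request_password_reset(email, known_emails, limiter, send_reset_email):
    key = email.strip().lower()
    try:
        allowed = limiter.allow(key)
    except Exception:
        # Fail closed: a broken limiter must not turn into unlimited attempts.
        allowed = False
    if allowed and key in known_emails:
        send_reset_email(key)
    # Identical response for known, unknown, and rate-limited addresses,
    # so the endpoint cannot be used to enumerate accounts.
    return GENERIC_RESET_MESSAGE
```

Whether a given endpoint should fail closed (as here) or fail open is a product decision: refusing resets during a limiter outage is usually acceptable, while refusing all logins may not be.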

2. A database schema that survives the second feature

The most consistent failure mode of AI-built systems: schemas that look reasonable for the prototype's three use cases and become unworkable the moment the fourth case arrives.

What we typically rebuild:

Foreign keys instead of "we'll join in application code".
Real constraints — NOT NULL, CHECK, UNIQUE — instead of vibes.
Indexes designed against the actual query patterns, not against the model's guess about what might be queried.
Migrations that run cleanly forward and backward, including data migrations.
An honest accounting of what the JSON columns store, and a plan to evolve them.

The cost of getting the schema right before launch is ~one engineering week. The cost of getting it wrong is six months of incremental pain after the first hundred customers arrive.
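As an illustration of "real constraints instead of vibes", the sketch below declares foreign keys, NOT NULL, CHECK, and UNIQUE in the schema and lets the database reject bad rows. It uses SQLite from the Python standard library purely so that it runs anywhere; the table and column names (customers, orders, total_cents) and the helper functions are invented for the example, and a Postgres schema would differ in types and syntax details.

```python
import sqlite3

SCHEMA = """
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    status      TEXT NOT NULL CHECK (status IN ('pending', 'paid', 'refunded')),
    total_cents INTEGER NOT NULL CHECK (total_cents >= 0)
);
-- Index chosen for a hypothetical "orders for this customer by status" query.
CREATE INDEX orders_by_customer_status ON orders (customer_id, status);
"""


def open_db():
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn, sql, params):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this in place, an order for a customer that does not exist, a negative total, or an unknown status is refused by the database itself, regardless of which code path tried to write it.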

3. Observability that lets you debug at 2am

A prototype is debugged by adding print() statements and re-running locally. A production system is debugged by reading dashboards from a phone in the middle of the night.

Almost no AI-generated codebase ships with:

Structured logs with a request ID propagated end to end.
Metrics on the things that actually matter (latency p95/p99, error rate per endpoint, queue depth, slow queries).
Distributed traces that link a user-facing 500 to the database query responsible.
Alerts wired to a channel a human will actually read.

We build this in by default. The work is unglamorous, takes about a week, and is the single most-leveraged investment in production-readiness any system can make.
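For a sense of what "a request ID propagated end to end" means in code, here is a minimal sketch using only the Python standard library. The names (request_id_var, RequestIdFilter, JsonFormatter, build_logger, handle_request, load_user) are made up for the example; in a real service the ID would typically arrive in a request header and be set by middleware rather than by the handler itself.

```python
import contextvars
import json
import logging
import uuid

# Holds the current request's ID; "-" marks log lines emitted outside any request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    def filter(self, record):
        # Attach the current request ID to every record passing through the handler.
        record.request_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": record.request_id,
        })


def build_logger(stream, name="app"):
    handler = logging.StreamHandler(stream)
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def load_user(logger):
    # Deeper code logs normally; it never has to pass the request ID around.
    logger.info("loading user")


def handle_request(logger, incoming_request_id=None):
    token = request_id_var.set(incoming_request_id or uuid.uuid4().hex)
    try:
        logger.info("request started")
        load_user(logger)
        logger.info("request finished")
    finally:
        request_id_var.reset(token)
```

Every line logged while a request is being handled carries the same request_id field, which is what makes it possible to filter one failing request out of a busy log stream.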

4. Deploys that are routine and reversible

The AI-generated Dockerfile will probably work. What it won't tell you is:

How to roll back in two minutes when the next release breaks.
How to do a database migration in a way that's safe for a multi-instance deploy.
How to manage secrets without committing them.
How to keep dev / staging / prod consistent enough that "works in staging" means anything.
How to handle the build-on-Friday case where you want to deploy but it's also a public holiday in your AWS region.

The gap between "the container starts" and "we deploy to production on a Tuesday afternoon without anyone holding their breath" is six months of operational maturity. We compress it into ~two weeks of focused work.
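One small piece of this, managing secrets without committing them, can be sketched as follows: read secrets from the environment at startup, refuse to boot if any are missing, and keep them out of logs. The Settings class, the MissingSecretError exception, and the variable names DATABASE_URL and SESSION_SECRET are assumptions for the example, not a prescription; how the environment gets populated (a secrets manager, the orchestrator, CI) is outside what this shows.

```python
import os
from dataclasses import dataclass


class MissingSecretError(RuntimeError):
    """Raised at startup when a required setting is absent or empty."""


REQUIRED = ("DATABASE_URL", "SESSION_SECRET")


@dataclass(frozen=True, repr=False)
class Settings:
    database_url: str
    session_secret: str

    def __repr__(self):
        # Never include secret values in reprs, which tend to end up in logs.
        return "Settings(database_url=<hidden>, session_secret=<hidden>)"


def load_settings(env=None):
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        # Fail fast at boot instead of failing on the first request that needs it.
        raise MissingSecretError("missing required settings: " + ", ".join(missing))
    return Settings(
        database_url=env["DATABASE_URL"],
        session_secret=env["SESSION_SECRET"],
    )
```

Because the same loader runs in dev, staging, and prod, a missing value shows up as a failed deploy rather than as a 500 discovered by a user.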

5. Real security for the AI features themselves

For AI-enabled apps there is an extra layer of risk that AI itself rarely surfaces: the AI features are now part of the attack surface.

Prompt injection via user-uploaded documents.
Cross-tenant data leakage through shared embeddings.
Cost denial-of-service from unbounded queries.
Hallucinated outputs that are confidently wrong about safety-critical information.
Lack of audit trail for what the agent did and why.

These are not edge cases. Production AI features need traceability, scoping, refusal behaviour, evaluation, and cost ceilings — none of which the prototype usually has.
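A cost ceiling with an audit trail does not have to be elaborate. The sketch below, with hypothetical names throughout (TenantBudget, BudgetExceeded, charge), tracks token spend per tenant, refuses calls that would exceed a daily limit, and records every decision. It assumes you can estimate token usage before calling the model and that something else resets the counters each day; persistence and concurrency are left out.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a tenant's request would go over its daily token limit."""


@dataclass
class TenantBudget:
    daily_token_limit: int
    used: dict = field(default_factory=dict)       # tenant_id -> tokens spent today
    audit_log: list = field(default_factory=list)  # one entry per decision

    def charge(self, tenant_id, tokens, action):
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        spent = self.used.get(tenant_id, 0)
        allowed = spent + tokens <= self.daily_token_limit
        # Record refusals as well as approvals, so the trail explains both.
        self.audit_log.append({
            "tenant": tenant_id,
            "action": action,
            "tokens": tokens,
            "allowed": allowed,
        })
        if not allowed:
            raise BudgetExceeded(f"tenant {tenant_id} is over its daily token limit")
        self.used[tenant_id] = spent + tokens
```

Calling charge before each model request turns an unbounded query into a refused one, and keeps one tenant's spend from affecting another's.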

6. Performance budgets and the discipline to hold them

A prototype is fast because nobody is using it. A production system that doesn't measure p95 latency, query plans, and Core Web Vitals will quietly degrade until users start leaving.

Real fixes we've made on inherited AI-built systems include:

Replacing SELECT * queries with explicit projections.
Adding indexes that the prototype didn't need because there were ten rows.
Moving N+1 ORM patterns out of hot paths.
Adding proper HTTP caching for read-mostly endpoints.
Pre-rendering content that doesn't need to be dynamic.

None of this is intellectually difficult. All of it requires sustained attention and a baseline of measurement. AI does not provide either by default.
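The N+1 pattern is easiest to see by counting queries. The sketch below uses SQLite and a small counting wrapper so it runs with the standard library alone; the authors/posts tables and all function names are invented for the example, and an ORM would hide the per-row query behind an attribute access rather than an explicit call.

```python
import sqlite3


class CountingConnection:
    """Thin wrapper that counts how many statements are executed through it."""

    def __init__(self, conn):
        self._conn = conn
        self.query_count = 0

    def execute(self, sql, params=()):
        self.query_count += 1
        return self._conn.execute(sql, params)


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE posts (
            id        INTEGER PRIMARY KEY,
            author_id INTEGER NOT NULL REFERENCES authors(id),
            title     TEXT NOT NULL
        );
        INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
        INSERT INTO posts VALUES (1, 1, 'A'), (2, 2, 'B'), (3, 3, 'C'), (4, 1, 'D');
    """)
    return conn


def titles_n_plus_one(db):
    # One query for the posts, then one more query per post for its author.
    posts = db.execute("SELECT author_id, title FROM posts ORDER BY id").fetchall()
    result = []
    for author_id, title in posts:
        (name,) = db.execute(
            "SELECT name FROM authors WHERE id = ?", (author_id,)
        ).fetchone()
        result.append((title, name))
    return result


def titles_joined(db):
    # Same result in a single query, with an explicit projection instead of SELECT *.
    return db.execute(
        "SELECT p.title, a.name FROM posts p "
        "JOIN authors a ON a.id = p.author_id ORDER BY p.id"
    ).fetchall()
```

Both functions return the same rows; the difference is that the first one's query count grows with the number of posts, which is invisible with ten rows and expensive with ten thousand.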

What we actually do

When a client comes to us with an AI-built or vibecoded codebase that needs to become a real product, the engagement looks something like this:

Week 1 — audit. We read the code, run it, measure it. We catalogue the gaps against the categories above and rank them by risk.

Weeks 2-6 — closing the most dangerous gaps. Authentication, database schema, observability, deploys. The work nobody markets but everybody needs.

Week 7 onwards — feature work alongside the foundation. The product can now grow without rotting underneath. We continue shipping features at full speed while the foundations get steadily better.

Always — maintenance is an option, not a tax. Most clients choose a monthly retainer for the year after launch. We use it for ongoing improvements, performance work, and on-call. The system gets better with time, not worse.

What we are not arguing

We are not arguing AI-generated code is bad. The opposite: it is good enough that the bottleneck has moved from "can someone write the code at all" to "can someone make this code last in production". That second question is the interesting one now.

We are not arguing the work we do is glamorous. It isn't. It is reading audit reports, tightening migrations, adding indexes, writing runbooks. The reason we are happy doing it is that it is precisely the work that AI hasn't displaced — and probably won't, for as long as production systems need humans on call when they break.

We are not arguing every project needs to go through this. If your AI-generated prototype will only ever serve thirty users in an internal tool, it does not need the same treatment as a customer-facing platform. Right-size the engineering to the goal.

We are arguing this: the easy part of building software has gotten dramatically easier. The hard part has stayed exactly as hard, and the gap between the two has widened. Closing it is a job for an experienced team. We are that team. If your AI-built app is starting to creak, that's where we come in.
