Houdrik
Engineering
February 20, 2025 · 9 min read

The boring-infrastructure checklist we ship every project with

Every studio claims to build production-grade software. Production-grade is a checklist with sharp edges, not a vibe. Here's the one-page list we ship with every engagement.

Every studio claims to "build production-grade". The phrase has been worn so smooth it now means roughly nothing. Production-grade is not a vibe and it is not a tier on a pricing page. It is a checklist with sharp edges, and most of the items on it have nothing to do with the product the client is actually paying for.

We ship the same one-page checklist with every engagement. It is not exciting. It is not what the founder put on the roadmap. It is the boring foundation that decides whether the interesting work above it survives contact with real users. This post is that checklist, in paragraph form, with the reasoning we usually skip in the kickoff meeting because nobody wants to hear it.

Structured logs with a request ID, end to end

The first thing we add to any inherited codebase is structured logging with a request ID propagated from the entry point all the way through to the database query and any background job spawned along the way. Not "we use a logger". Not "we have a Sentry account". A JSON line per event with a stable schema, and a request ID that lets you reconstruct what happened to one specific user at one specific moment.

The reason is mundane. At 3am, with a vague bug report from a single customer, you have minutes — not hours — to find the right needle in the haystack. Greppable logs with a correlation ID turn that search from a half-day investigation into a two-minute query. The work to add this on day one is small. The work to add it retroactively, after six months of unstructured print statements, is a multi-week project that nobody will fund.
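
As a rough illustration of the idea, here is a minimal sketch in Python using only the standard library. The field names in the JSON payload and the function names (handle_request, run_query) are assumptions made for the example, not a prescribed schema. The request ID lives in a context variable, so every log line emitted while handling one request carries the same ID without passing it through each function call.

```python
import contextvars
import json
import logging
import uuid

# Holds the request ID for the current execution context (thread or asyncio task).
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "request_id": request_id_var.get(),
            "event": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create an 'app' logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("app")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def run_query(logger: logging.Logger) -> None:
    # Hypothetical data-access step; it never receives the ID explicitly.
    logger.info("db.query")


def handle_request(logger: logging.Logger, incoming_id=None) -> None:
    """Entry point: reuse the caller's ID if one was sent, else mint one."""
    token = request_id_var.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request.start")
        run_query(logger)
    finally:
        # Restore the previous value so IDs never leak between requests.
        request_id_var.reset(token)
```

Calling handle_request(build_logger(sys.stderr), "req-123") would emit two JSON lines that share the same request_id. A context variable does not cross process boundaries, so a background job running in a separate worker would need the ID included in its payload and set again on the other side.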

A real secrets store, not someone's .env

The number of production systems we have seen running on a .env file that lives on one developer's laptop and gets re-uploaded by hand every time it changes is genuinely depressing. This is not a secrets store. It is a rumour about a secret.

A real secrets store has access control, audit logs, rotation, and a way to revoke a leaked credential without redeploying every service that uses it. The specific tool matters less than the properties. We pick what fits the deployment target: cloud-native managers where the cloud is already in play, a self-hosted vault where it isn't. The non-negotiables are that no secret lives in a Git repository, no secret is shared on Slack, and rotation is a routine operation rather than a panic.
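
To make those properties concrete, here is a toy in-memory model in Python. It is an illustration only and should never hold a real secret: ToySecretStore and its method names are invented for this sketch and do not correspond to any real secrets manager's API. It shows the three behaviours that matter: reads are checked against an access list, every action lands in an audit log, and rotation changes what callers receive on their next read without touching the callers.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ToySecretStore:
    """In-memory stand-in for a real secrets manager (illustration only)."""

    _values: dict = field(default_factory=dict)    # name -> current value
    _acl: dict = field(default_factory=dict)       # name -> set of allowed readers
    audit_log: list = field(default_factory=list)  # (timestamp, caller, action, name)

    def _audit(self, caller: str, action: str, name: str) -> None:
        self.audit_log.append((datetime.now(timezone.utc), caller, action, name))

    def put(self, caller: str, name: str, value: str, allowed: set) -> None:
        """Store a secret and record which callers may read it."""
        self._values[name] = value
        self._acl[name] = set(allowed)
        self._audit(caller, "put", name)

    def get(self, caller: str, name: str) -> str:
        """Return the current value, or refuse and record the denial."""
        if caller not in self._acl.get(name, set()):
            self._audit(caller, "denied", name)
            raise PermissionError(f"{caller} may not read {name}")
        self._audit(caller, "read", name)
        return self._values[name]

    def rotate(self, caller: str, name: str, new_value: str) -> None:
        """Replace the value; readers pick it up on their next get()."""
        # A real store would also authorise the caller performing the rotation.
        self._values[name] = new_value
        self._audit(caller, "rotate", name)
```

The design point is that services fetch the secret at use time rather than baking it into a build or a file, which is what lets rotation and revocation happen without a redeploy.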

Auth that survives OWASP basics

Most authentication code we inherit handles the happy path. A user with the right password gets in. The unhappy paths are where the bodies are buried. Session invalidation when a password changes. Rate limits on login, password reset, and any endpoint that reveals whether an email exists, either in the response itself or through a measurable timing difference. CSRF tokens scoped to the operations that actually mutate state, not sprinkled randomly across every form. A consistent permission model rather than if user.is_admin checks scattered across forty views.
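
The rate-limit item is the easiest to sketch. Below is a minimal sliding-window limiter in Python, keyed by whatever identifies the caller (an email address or an IP, for instance). The class name, the thresholds, and the in-process dictionary are assumptions for illustration; a multi-instance deployment would need shared storage instead.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_attempts per key within any window_seconds span."""

    def __init__(self, max_attempts: int, window_seconds: float, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self._hits = defaultdict(deque)  # key -> timestamps of counted attempts

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Discard attempts that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_attempts:
            return False  # over the limit: the caller should reject the request
        hits.append(now)
        return True
```

A login handler would call allow() before checking the password and return a generic "too many attempts" response when it comes back False. What the handler does if the limiter's storage is unavailable (reject or let through) is a separate decision that deserves to be made explicitly.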

None of this is novel research. Every item is covered in the OWASP Cheat Sheet Series. We are not claiming insight here; we are claiming discipline. The checklist exists so that the discipline doesn't depend on whether the engineer doing the work happened to remember every category on the day they wrote the login form.
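
For the permission model, one possible shape is a single role-to-permission table plus a decorator, so that the rule lives in one place instead of in each view. This is a sketch under assumptions: the role names, permission strings, and the convention that a view receives the caller's role as its first argument are all hypothetical.

```python
from functools import wraps

# Single source of truth: role -> permissions it grants (hypothetical names).
ROLE_PERMISSIONS = {
    "admin": {"invoice.read", "invoice.refund", "user.manage"},
    "support": {"invoice.read", "invoice.refund"},
    "viewer": {"invoice.read"},
}


class Forbidden(Exception):
    """Raised when the caller's role lacks the required permission."""


def can(role: str, permission: str) -> bool:
    # Unknown roles get an empty permission set, so the default is deny.
    return permission in ROLE_PERMISSIONS.get(role, set())


def requires(permission: str):
    """Decorator that checks the permission before the view body runs."""
    def decorator(view):
        @wraps(view)
        def wrapper(user_role, *args, **kwargs):
            if not can(user_role, permission):
                raise Forbidden(permission)
            return view(user_role, *args, **kwargs)
        return wrapper
    return decorator


@requires("invoice.refund")
def refund_invoice(user_role: str, invoice_id: int) -> str:
    return f"refunded {invoice_id}"
```

Auditing who can do what then means reading one table rather than searching forty views.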

Schema with real constraints

A database schema without constraints is a suggestion. We have lost count of the codebases where the team eventually discovers, the hard way, that user_id in some table is sometimes NULL, sometimes an orphaned reference, and once — memorably — a string. The application code that depends on this data behaving sensibly is full of defensive checks that exist because the schema isn't doing its job.

We add NOT NULL where the value is genuinely required. Foreign keys where the relationship exists, with explicit ON DELETE semantics rather than Postgres's implicit default of NO ACTION. CHECK constraints for ranges and enums. UNIQUE indexes for things that are unique. The cost is some discomfort during migration. The benefit is that an entire category of "how did we get into this state?" bugs becomes structurally impossible.
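
A small runnable sketch of what those constraints buy. It uses Python's built-in sqlite3 module as a stand-in so it runs anywhere; the table and column names are invented for the example, and a Postgres schema would use the same constraint clauses with its own column types. Each bad write is rejected by the database itself, before any application code gets a chance to be defensive about it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    id       INTEGER PRIMARY KEY,
    user_id  INTEGER NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
    status   TEXT NOT NULL CHECK (status IN ('pending', 'paid', 'refunded')),
    quantity INTEGER NOT NULL CHECK (quantity > 0)
);
""")


def rejected(sql: str) -> bool:
    """Return True if the database refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# Valid rows go in without complaint.
conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders (user_id, status, quantity) VALUES (1, 'paid', 2)")

bad_writes = {
    "null user_id": "INSERT INTO orders (user_id, status, quantity) VALUES (NULL, 'paid', 1)",
    "orphaned user_id": "INSERT INTO orders (user_id, status, quantity) VALUES (999, 'paid', 1)",
    "unknown status": "INSERT INTO orders (user_id, status, quantity) VALUES (1, 'shipped', 1)",
    "duplicate email": "INSERT INTO users (id, email) VALUES (2, 'a@example.com')",
    "delete referenced user": "DELETE FROM users WHERE id = 1",
}
results = {label: rejected(sql) for label, sql in bad_writes.items()}
```

ON DELETE RESTRICT is only one of the explicit choices; CASCADE or SET NULL may be the right answer for a given relationship. The point is that someone decided.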

Migrations that run forward and backward

A migration that only runs forward is a one-way door. You ship it, and from that moment your ability to roll back the application is bounded by your ability to roll back the database — which, if the migration was destructive, is zero.

We write migrations that have a defined down path, and we test it. For destructive changes — dropping a column, narrowing a type — we use the expand-and-contract pattern: ship the additive change first, deploy the application that no longer needs the old column, then drop it in a follow-up migration. This takes longer than the shortcut. It also means that "roll back the deploy" remains a real option for the full window during which a regression might be detected.
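
Testing the down path can be as simple as a round-trip check: apply every migration, undo them all in reverse, and confirm the schema is back where it started. The sketch below does this against an in-memory SQLite database with Python's standard library. The migration contents and helper names are assumptions for illustration; a real project would run the same check through its migration tool against its actual database engine. It covers only the "defined and tested down path" part, not expand-and-contract sequencing.

```python
import sqlite3

# Each migration is an (up, down) pair of SQL scripts (hypothetical contents).
MIGRATIONS = [
    (
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);",
        "DROP TABLE users;",
    ),
    (
        "CREATE UNIQUE INDEX idx_users_email ON users (email);",
        "DROP INDEX idx_users_email;",
    ),
]


def schema(conn: sqlite3.Connection) -> list:
    """Snapshot of every schema object, in a stable order."""
    return conn.execute(
        "SELECT type, name, sql FROM sqlite_master ORDER BY type, name"
    ).fetchall()


def apply_up(conn: sqlite3.Connection, steps) -> None:
    for up, _down in steps:
        conn.executescript(up)


def apply_down(conn: sqlite3.Connection, steps) -> None:
    # Undo in reverse order, newest migration first.
    for _up, down in reversed(steps):
        conn.executescript(down)


def round_trip_ok(steps) -> bool:
    """True if down fully undoes up, and up can then be applied again."""
    conn = sqlite3.connect(":memory:")
    before = schema(conn)
    apply_up(conn, steps)
    migrated = schema(conn)
    apply_down(conn, steps)
    if schema(conn) != before:
        return False  # the down path left something behind
    apply_up(conn, steps)
    return schema(conn) == migrated
```

A check like round_trip_ok(MIGRATIONS) in CI catches the migration whose down path was written but never run. It compares schema only, so it says nothing about whether data survives the trip.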

Deploys reversible in under two minutes

The single most valuable property of a deployment pipeline is the ability to undo it. Not the ability to do it. Anyone can do it. The question is what happens when the release ships, the error rate spikes, and someone needs to make the new code stop being live, right now, before the next sprint demo turns into an apology.

A two-minute rollback isn't a feature you bolt on. It is a property of the whole pipeline: immutable artifacts, atomic switches between versions, database migrations that don't burn the bridge behind you, secrets and configuration that aren't entangled with the release. We design for this from day one because retrofitting it onto a pipeline that wasn't built for it is harder than building it correctly the first time.
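
One common way to get an atomic switch between immutable artifacts on a single host is a "current" symlink that points at a versioned release directory. The Python sketch below shows that mechanism only, under stated assumptions: a POSIX filesystem, releases unpacked into their own directories, and a hypothetical activate() helper. Container platforms and load balancers achieve the same property by different means.

```python
import os
from pathlib import Path


def activate(releases_dir: Path, version: str, current: Path) -> None:
    """Point the `current` symlink at releases_dir/version in one atomic step."""
    target = releases_dir / version
    if not target.is_dir():
        raise FileNotFoundError(target)
    # Build the new link under a temporary name in the same directory...
    tmp_link = current.with_name(current.name + ".tmp")
    if tmp_link.is_symlink():
        tmp_link.unlink()  # clear a leftover from an interrupted switch
    os.symlink(target, tmp_link)
    # ...then rename it over the live link. On POSIX the rename is atomic,
    # so readers see either the old release or the new one, never neither.
    os.replace(tmp_link, current)
```

Rolling back is the same call with the previous version: activate(releases, "v1", current). Because old release directories are never modified, the rollback takes as long as one rename, provided the database and configuration have not been changed in a way the old code cannot tolerate.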

On-call runbooks readable at 3am

The last item is the one most often skipped, and the one that pays the largest dividend the first time a real incident happens. A runbook is a short, blunt document that tells the engineer on call what to do when a specific alert fires. Not "investigate the issue". A list of commands, dashboards to check, and the three most likely causes ranked by frequency.

The test for a good runbook is brutal: hand it to an engineer who has never seen this system, at 3am, with no context, and see if they can resolve the alert without paging anyone else. Most runbooks fail this test on the first attempt. That's fine — the value is in the second attempt, after the incident, when the team rewrites the runbook against what they actually wished they'd known at 3am. Over a few iterations, the runbook becomes the institutional memory the team would otherwise have to keep entirely in human heads.

The trade-off, out loud

This is unglamorous work. It takes roughly two weeks of focused effort at the start of an engagement, and it ships zero features the founder can demo to a customer. We say this clearly in every kickoff, because the alternative is that the work gets quietly skipped and then re-litigated under outage pressure six months later, when the cost of doing it is three times higher and the people doing it are not sleeping.

The pitch isn't that this is fun. The pitch is that the bill comes due either way. You can pay it now, in calm conditions, with a team that has the context to do it right. Or you can pay it later, in an incident channel at midnight, with whoever happens to be available. We strongly recommend the first option, and we structure our engagements so it's the default.

And one note about AI

This is also the bar that AI-generated code does not currently meet on its own. A model will give you a plausible draft of any item on this list. The draft will be roughly 70% correct and 100% silently wrong about the 30% that matters — the ON DELETE clauses, the rate-limit fail-open semantics, the rollback path that quietly assumes a single-instance deploy. Closing that gap is exactly the work we are paid to do. It is, for the moment, the work that does not yet get vibecoded.
