Houdrik
Engineering
November 4, 2025 · 8 min read

The observability minimum that earns its keep

Enterprise vendors oversell observability and small teams underbuild it. There's a two-week setup that pays back forever, and a lot of stuff past that which only earns its keep at scale.

Observability is one of those topics where the industry conversation is wildly out of step with what most teams actually need. Enterprise vendors sell platforms priced for fleets of services and the headcount to run them. Small teams, reasonably suspicious of the bill, end up shipping with console.log and hoping nothing breaks on a Sunday. Both ends of the spectrum misjudge what a small system actually needs.

The honest read is that there is a two-week chunk of observability work that pays back across the entire life of the system, and a much larger pile of work past that point which only earns its keep once you have the scale to use it. The teams that do well are the ones that finish the first chunk before they need it and resist the second pile until they actually do.

Here's what the first chunk looks like, in the order we build it.

Structured logs with a request ID

This is the single highest-leverage change a small system can make, and most teams skip it because the prototype already "has logging". It doesn't, in the way that matters.

What you want is JSON output, one event per line, with a stable set of keys: timestamp, level, service, request_id, user_id where applicable, and the message. Every internal hop — HTTP call between services, background job kicked off from a request, database query worth tracing — carries the same request_id forward. The middleware that generates it is one file. The discipline of propagating it is the work.
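A minimal sketch of that middleware piece, using only the Python standard library. The names here (`request_id_var`, `start_request`, `outgoing_headers`, the `X-Request-ID` header, the `"api"` service name) are illustrative assumptions rather than a prescribed API; the point is only that the request ID is set once per request and every log line and downstream call picks it up from there.

```python
import contextvars
import json
import logging
import sys
import uuid
from datetime import datetime, timezone

# Holds the request ID for whatever request the current task is handling.
request_id_var = contextvars.ContextVar("request_id", default=None)

SERVICE_NAME = "api"  # hypothetical service name


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line with a stable key set."""

    def format(self, record):
        event = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": SERVICE_NAME,
            "request_id": request_id_var.get(),
            # user_id is optional; pass it via extra={"user_id": ...}.
            "user_id": getattr(record, "user_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(event)


def start_request(incoming_id=None):
    """Reuse the ID forwarded by the caller, otherwise mint a new one."""
    request_id = incoming_id or uuid.uuid4().hex
    request_id_var.set(request_id)
    return request_id


def outgoing_headers():
    """Headers to attach to downstream calls so the ID travels with them."""
    return {"X-Request-ID": request_id_var.get()}


def build_logger(stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("app")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger(sys.stdout)
    start_request()  # in a real service this runs once per incoming request
    log.info("checkout started", extra={"user_id": "u-42"})
```

Background jobs need the same treatment: pass the ID in the job payload and call `start_request(payload_id)` when the worker picks the job up, so the worker's log lines carry the ID of the request that enqueued it.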

The moment this is in place, the debugging experience changes character. A user reports that a checkout failed at 14:32. You grep request_id=... across every log stream and get the entire story in order: the request came in, hit the API, called the payments worker, the worker timed out waiting for an external service, the API returned 502, the frontend showed the wrong error toast. That story used to take an hour of correlating timestamps across three terminals. Now it's three seconds. Compound that across every incident the system will ever have, and the case for structured logs writes itself.
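To make the "entire story in order" concrete, here is a rough sketch of the same lookup done in code rather than with grep. It assumes every stream is JSON lines with `request_id` and an ISO 8601 `timestamp` key, all carrying a timezone offset; the function name `trace_request` and the sample events are made up for illustration.

```python
import json
from datetime import datetime


def trace_request(request_id, *streams):
    """Collect every event for one request across streams, oldest first."""
    events = []
    for lines in streams:
        for line in lines:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip stray non-JSON lines rather than fail
            if not isinstance(event, dict):
                continue
            if event.get("request_id") == request_id and "timestamp" in event:
                events.append(event)
    # Parse timestamps so ordering does not depend on string formatting.
    return sorted(events, key=lambda e: datetime.fromisoformat(e["timestamp"]))


if __name__ == "__main__":
    # Hypothetical log lines from two services.
    api_log = [
        '{"timestamp": "2025-01-01T14:32:00.100000+00:00", "service": "api", '
        '"request_id": "req-1", "message": "POST /checkout received"}',
        '{"timestamp": "2025-01-01T14:32:05.300000+00:00", "service": "api", '
        '"request_id": "req-1", "message": "returning 502"}',
    ]
    worker_log = [
        '{"timestamp": "2025-01-01T14:32:05.200000+00:00", "service": "payments-worker", '
        '"request_id": "req-1", "message": "timed out waiting for provider"}',
    ]
    for event in trace_request("req-1", api_log, worker_log):
        print(event["timestamp"], event["service"], event["message"])
```

A log aggregator with a query box does the same job. The sketch only shows that nothing cleverer than a filter and a sort is needed once the ID is on every line.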

Unstructured logs, by contrast, decay. They look fine on day one and become unusable the moment you have more than one service or more than one engineer.

One dashboard

Not twenty. One.

The mistake every team makes after they discover a metrics tool is to build a dashboard per service, per team, per concern, and then build a "main" dashboard that links to all of them. Nobody reads any of it. The dashboard that earns its keep is the one a tired engineer can pull up on a phone at 11pm and read in three seconds.

What goes on it: latency p95 and p99 per public endpoint, error rate per endpoint, queue depth for any async worker, count of slow queries in the last five minutes. Maybe one business metric if there's a single number that means "the product is working" — checkouts per minute, signups per hour, whatever the equivalent is. That's it.
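For reference, the latency and error-rate panels reduce to very little arithmetic. The sketch below uses the nearest-rank definition of a percentile and counts any 5xx status as an error; both are assumptions (metrics tools differ in how they interpolate percentiles and in what they count as an error), and the function names are ours.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(requests):
    """Summarize (endpoint, latency_ms, status_code) tuples per endpoint."""
    by_endpoint = defaultdict(list)
    for endpoint, latency_ms, status in requests:
        by_endpoint[endpoint].append((latency_ms, status))

    summary = {}
    for endpoint, rows in by_endpoint.items():
        latencies = [latency for latency, _ in rows]
        errors = sum(1 for _, status in rows if status >= 500)
        summary[endpoint] = {
            "p95_ms": percentile(latencies, 95),
            "p99_ms": percentile(latencies, 99),
            "error_rate": errors / len(rows),
        }
    return summary
```

With a low request count the p99 is effectively the slowest request in the window, which is worth remembering before reading too much into a spike on a quiet endpoint.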

The discipline is not in building the dashboard. It's in not adding the eighteen other panels that someone suggested in a meeting. Every panel you add makes it less likely the dashboard gets read at all. The team that can't read it in three seconds is the team that won't read it.

Alerts wired to a channel a human will read

Email folders don't count. Email folders are where alerts go to die. The channel needs to be one that a specific human checks within the window in which the alert is still useful: Slack, Discord, PagerDuty if you have an on-call rotation, even SMS for the hard cases.

The harder discipline, by an order of magnitude, is pruning. The first week after wiring alerts up, you will get flooded. The temptation is always to add more alerts because each one feels like it's catching something real. The actual job is to delete the ones that fire without an action attached — the ones that everyone learns to swipe away. An alert that has been swiped away three times in a row is no longer an alert. It is noise that will hide the real one when it arrives.

The bar we hold is: every alert maps to either a runbook step or "page a human". If neither applies, it isn't an alert; it's a dashboard panel.
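That bar is simple enough to check mechanically, for example in CI against whatever file holds the alert definitions. A small sketch, assuming alerts can be loaded into a structure like the hypothetical `AlertRule` below; the field names and example alerts are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    name: str
    runbook_step: Optional[str] = None  # pointer to a step in a runbook
    pages_human: bool = False


def split_alerts(rules):
    """Separate rules with an action attached from ones that belong on
    the dashboard instead."""
    keep, demote = [], []
    for rule in rules:
        if rule.runbook_step or rule.pages_human:
            keep.append(rule)
        else:
            demote.append(rule)
    return keep, demote


if __name__ == "__main__":
    rules = [
        AlertRule("db_connections_exhausted", runbook_step="database-down, step 1"),
        AlertRule("checkout_error_rate_high", pages_human=True),
        AlertRule("cpu_above_70_percent"),  # no action attached
    ]
    keep, demote = split_alerts(rules)
    print("alerts:", [r.name for r in keep])
    print("move to dashboard:", [r.name for r in demote])
```

Failing the build when the second list is non-empty is one way to keep the pruning from depending on anyone's memory.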

Two named runbooks

Write two runbooks before the first incident. "The database is down" and "deploys are broken". One page each. Plain prose, the steps you'd take, the commands you'd run, the people you'd call.

The reason it's two and not ten is that the first two cover most of the early-stage incidents you will actually see, and a runbook written before an incident is useful while one written during an incident is fiction. The point of writing them early isn't to be comprehensive. It's to force the team to confront the assumptions baked into the system while everyone is still calm enough to think.

The second-order benefit is that new engineers can read both pages on day one and understand more about how the system actually fails than three weeks of architecture meetings would teach them.

What we don't bother with

That's the minimum. Past it, there's a long list of things that genuinely pay off at scale and genuinely don't at small scale.

Distributed traces are the canonical example. OpenTelemetry is excellent technology. For a system with two services and a worker, it's overhead that buys you very little that structured logs with a shared request ID don't already give you. The day you have eight services and the request ID story stops fitting on one screen, traces become essential. Until then, they're a tax.

Custom metrics for every operation are similar. The instinct to instrument everything looks like rigour and is mostly clutter. Metrics earn their keep when you have enough traffic for the numbers to be statistically meaningful and enough engineers for someone to actually look at them.

Formal SLOs with error budgets are a tool for organisations large enough that the conversation "should we ship the feature or pay down reliability debt" needs a number to anchor it. For a five-person team, that conversation is a conversation. Adding SLO machinery to it doesn't make the answer better; it just adds ceremony.

None of this is an argument against any of these tools. It's an argument for sequencing. They pay off later, and "later" matters because the cost of adopting them too early is the cost of not finishing the four things above.

Why teams skip it

The reason teams skip the minimum is not that they don't understand its value. It's that nothing visibly breaks the day after launch. A system without structured logs, without a dashboard, without alerts, and without runbooks looks identical to one that has all four — until the first real incident.

They regret it on the day it does break. An incident that could have been answered in three clicks instead takes six hours of detective work, an apologetic email to customers, and a Monday morning conversation about why nobody saw it coming.

Two weeks of focused work, returned across the life of the system. We build it by default at the start of every engagement we take over. Nobody markets it. Everybody needs it.
