Houdrik
Ai
· May 3, 2026· 11 min read

Shipping a RAG system that survives production traffic

A working demo and a production RAG system are two completely different artifacts. Here are the eight things we now always do, learned the hard way across half a dozen engagements.

Cover · rag-system-production-traffic

A working RAG demo is one of the easiest things to build in 2026. You can have one in an afternoon: chunk some documents, embed them, store them, retrieve top-k, hand them to a language model, ship a chat UI. It will look impressive.

A production RAG system that survives a Monday morning of real traffic from real users with real expectations is a fundamentally different artifact. We have shipped half a dozen of them across the last two years. These are the eight habits we now treat as non-negotiable.

1. Pick an eval set on day three

You cannot improve what you cannot measure. The single highest-leverage piece of work in any RAG engagement is sitting down with the client and producing fifty to two hundred real questions, with the answers a human expert would consider correct.

This is also the work clients are most likely to push back on. "We don't have time, just build the thing." Refuse politely. Without an eval set, every "this feels better now" assertion is a vibe — and vibes don't scale to three different prompts, two different models, and four different retrieval strategies.

The eval set runs in CI. Every change to retrieval, prompts, model choice, or even tokeniser triggers a re-score. Trends matter more than absolute numbers, but we publish both. Faithfulness (did the answer cite the right document?) and precision (was the cited document actually relevant?) are the two we always track.

2. Hybrid retrieval almost always beats dense alone

Pure vector similarity is great for "find me semantically similar things". It is bad at finding documents that mention a specific product code, a specific person's name, or a specific error message. Real questions ask about both.

Our default is: dense (cosine on embeddings) + sparse (BM25 over a tsvector) + reciprocal rank fusion. Then a cross-encoder reranker over the top fifty results. The reranker is the most expensive step in the pipeline; we cache its output aggressively.

In every engagement we've measured, hybrid retrieval has outscored dense-only by ten to twenty points on our eval set. The cost is about 30% more compute per query and a slightly more complex codebase. Worth it every single time.

3. Chunking is product design, not preprocessing

The naive approach — split documents into fixed-size chunks with some overlap — works on toy corpora and fails on real ones. Real corpora have structure: support tickets have a customer turn, an agent turn, and a resolution; contracts have clauses and exhibits; documentation has sections and code blocks.

Spend the time to honour that structure. We've shipped chunkers that split on <h2> boundaries in HTML, on speaker turns in transcripts, on semantic boundaries detected by a smaller LLM in unstructured prose. The right chunking scheme on a given corpus moves eval scores more than any single prompt change.

The hardest case is long documents (legal, technical manuals) where the answer needs evidence from two different sections. Multi-vector indexing — storing both fine-grained passage embeddings and coarse-grained section embeddings — solves this elegantly. It is also painful to implement well. Budget for it.

4. Cite or refuse

Two failure modes haunt every RAG system: hallucinated answers, and confident wrong answers. They are not the same. Hallucination — the model invents content not present in retrieval — is easier to detect. Confident wrong is worse: the model regurgitates content that is in retrieval but is wrong for the question, and the user has no signal to distrust it.

Our universal mitigation: the model must cite specific source chunks for every claim it makes, and the UI must render those citations as links to the source. If the model can't cite, it must refuse with a templated message. We tune the refusal threshold per customer — too eager and the system feels useless, too lax and the trust drops.

The refusal rate is a KPI, not a bug. A 0% refusal rate on a 250-question eval is a red flag — it means the model is confidently fabricating on the questions it doesn't know.

5. Cost ceilings are an engineering concern, not a finance concern

LLM provider bills are easy to lose track of. We set hard ceilings on day one and design the architecture to honour them:

  • Aggressive embedding caching. Embeddings are immutable for a given chunk; cache them forever.
  • Response caching with semantic deduplication. A SHA of the prompt is the cheap version; a normalised embedding match is the slightly fancier version. Either way, 20-40% of read-mostly RAG traffic will hit the cache.
  • Model-tier fallback. When monthly spend crosses 80% of budget, queries fall back to a cheaper model. Mark cached "frontier" answers for that customer so we don't double-pay later.
  • Per-tenant ceilings. SaaS RAG systems must enforce them at the database level, not in application code.

We have cut inherited RAG bills by 60-80% on first engagement, every single time, with these four moves. Nobody had been paying attention.

6. Latency is dominated by retrieval, not generation

A common assumption: "the LLM call is the bottleneck." It often isn't. With streaming, the user sees first tokens within 500ms even on frontier models. What they don't see — but feel — is the 1.8 seconds before that, where retrieval is running.

We profile end-to-end latency with OpenTelemetry traces and find the slow step. Usually it's the cross-encoder reranker on a cold cache, or a Postgres query missing an index because someone added a filter and forgot to add the index. Sometimes it's the embedding step on incoming queries — the LLM provider's embedding API isn't always fast.

Optimise the cold path. Cache the hot one.

7. Tenant isolation in two places

Cross-tenant data leakage is the worst possible failure mode for a B2B RAG system. We enforce isolation in two independent places:

  • SQL-level. Every retrieval query carries a tenant ID in the WHERE clause. Not optional. Not a Python-level filter. The database refuses.
  • Prompt-level. The prompt-builder explicitly asserts that all retrieved chunks share the same tenant ID. Mismatched chunks → exception → return a generic error to the user.

Belt and braces. If either layer is bypassed, the other catches it.

8. Red-team your own system before customers do

Once a month, on a real production replica, we run a deliberate red-team:

  • Can we get the system to leak a chunk from another tenant?
  • Can we get it to follow a prompt-injection instruction hidden in a retrieved document?
  • Can we exhaust the rate limiter and degrade service?
  • Can we trigger an expensive code path with cheap inputs?

We log every finding, fix the urgent ones, schedule the rest. By month six in production a system that has been red-teamed regularly is dramatically more robust than one that hasn't.

The point

RAG is engineering. It is testable, measurable, debuggable. The temptation to treat it as alchemy — try a thing, see if it feels better, ship — produces systems that work in demos and fail in production. Treating it as engineering produces systems that survive Monday mornings.

The eight habits above are not exhaustive, and the right balance of effort across them depends on your corpus and your traffic shape. But the meta-habit — measure, then change one thing at a time, then measure again — is universal.

Got an app that needs to last?

Take it from prototype to production.

Reply within one business day. Vibecoded MVP, AI-built draft, half-finished project, or a working product that's starting to crack — all welcome.

Start a project