Cost ceilings for LLM features are an engineering problem

Most LLM features in production are run by teams who treat the monthly bill as a finance problem to escalate. The CFO asks a question, someone forwards a dashboard, the engineers shrug and promise to look at it next quarter. By the time anyone looks seriously, the bill is a multiple of what it should be and the system is too entangled to fix cheaply.

Cost is an engineering problem. It has engineering levers. It is one of the highest-leverage things to design from day one rather than discover at month three, and the difference between a team that designs for it and a team that doesn't is usually an order of magnitude on the bill, not a few percent.

Here is the lever set we reach for, in roughly the order it matters.

1. A per-request ceiling, set at the start

The first thing we do on any LLM engagement is name a per-request ceiling. Not a budget. A ceiling: the number above which a single user-facing call is considered a defect, not an expense.

The ceiling is derived backwards from eval data, not forwards from a finance team's guess. You take your eval set, you run the cheapest model that passes the quality bar, and you measure the actual distribution of tokens in and tokens out across realistic inputs. Then you set the ceiling at a healthy multiple of the p95: enough headroom for unusual inputs, tight enough that a runaway prompt or a context-window blowup trips an alarm instead of turning into ten thousand silent invocations.
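
A minimal sketch of that derivation, under stated assumptions: the per-token prices and the 3x headroom multiple are placeholders, not recommendations, and `derive_ceiling` is a hypothetical helper name. The samples are (tokens in, tokens out) pairs measured on the eval set.

```python
import math

# Hypothetical per-token prices in USD; substitute your provider's real rates.
PRICE_PER_INPUT_TOKEN = 0.50 / 1_000_000
PRICE_PER_OUTPUT_TOKEN = 1.50 / 1_000_000


def request_cost(tokens_in: int, tokens_out: int) -> float:
    """Cost of one call in USD under the assumed prices."""
    return tokens_in * PRICE_PER_INPUT_TOKEN + tokens_out * PRICE_PER_OUTPUT_TOKEN


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("need at least one measurement")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def derive_ceiling(samples: list[tuple[int, int]], headroom: float = 3.0) -> float:
    """Per-request ceiling = headroom multiple of the p95 measured per-request cost."""
    costs = [request_cost(t_in, t_out) for t_in, t_out in samples]
    return headroom * percentile(costs, 95)
```

The multiple is a judgement call. What matters is that it is computed from measured traffic and written down, so that a call above it can be flagged as a defect.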

Once the ceiling exists, every design decision is checked against it. Adding a retrieval step? Show me the worst-case token count. Letting the model write a full document? Cap the output tokens, hard, and degrade gracefully if it hits the cap. The ceiling is the thing the team designs backwards from. Without it you are doing exploratory shopping with a corporate card.
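
One way to make the check mechanical is a pre-flight function that turns the ceiling into a hard output-token cap. This is a sketch, not a prescribed implementation: `RequestBudget`, `output_cap` and `CeilingExceeded` are hypothetical names, and the prices are whatever your provider charges.

```python
from dataclasses import dataclass


class CeilingExceeded(Exception):
    """The input leaves no room for even one output token under the ceiling."""


@dataclass(frozen=True)
class RequestBudget:
    ceiling: float    # per-request ceiling, in any currency unit
    price_in: float   # assumed price per input token, same unit
    price_out: float  # assumed price per output token, same unit

    def output_cap(self, tokens_in: int, requested: int) -> int:
        """Largest output-token cap, at most `requested`, that fits under the ceiling."""
        remaining = self.ceiling - tokens_in * self.price_in
        affordable = int(remaining // self.price_out) if remaining > 0 else 0
        if affordable < 1:
            raise CeilingExceeded(
                f"{tokens_in} input tokens leave no room under ceiling {self.ceiling}"
            )
        return min(requested, affordable)
```

The caller passes the returned value as the provider's maximum-output-tokens setting and handles `CeilingExceeded` by trimming context or refusing the request, which is where the graceful degradation lives.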

2. A per-tenant monthly ceiling, configurable, with a graceful fallback

A per-request ceiling stops one runaway call. A per-tenant monthly ceiling stops one runaway customer.

Every multi-tenant LLM system we have seen at scale has had at least one tenant who, for entirely legitimate reasons, used ten or twenty times more than the median. Sometimes they are a power user. Sometimes a script. Sometimes an integration that nobody remembers writing. The right response is not to surprise them with a suspension; it is to fall back, smoothly, to a cheaper or local model once a healthy fraction of the monthly ceiling has been consumed.

The threshold matters more than the absolute number. Falling back at the last token is hostile. Falling back at the first token is theatre. Somewhere in between — well below the ceiling but well above the median — is a threshold that gives the user a degraded but functional experience and gives your team time to either upsell or investigate. The number should be configurable per tenant, because the answer for an enterprise plan is different from the answer for a free tier, and because the number will be wrong on day one.
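
As a sketch of the routing decision, assuming spend is already tracked per tenant: the model names and the 0.8 default fraction below are hypothetical, and what happens at the hard ceiling itself is left to product policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPolicy:
    monthly_ceiling_usd: float
    fallback_fraction: float = 0.8         # assumed default; tune per plan
    primary_model: str = "primary-model"   # hypothetical model name
    fallback_model: str = "cheap-model"    # hypothetical model name

    def __post_init__(self) -> None:
        if self.monthly_ceiling_usd <= 0:
            raise ValueError("monthly_ceiling_usd must be positive")
        if not 0.0 < self.fallback_fraction < 1.0:
            raise ValueError("fallback_fraction must be strictly between 0 and 1")


def choose_model(policy: TenantPolicy, spent_usd: float) -> str:
    """Route to the fallback model once the tenant crosses the configured fraction."""
    threshold = policy.monthly_ceiling_usd * policy.fallback_fraction
    if spent_usd >= threshold:
        return policy.fallback_model
    return policy.primary_model
```

Keeping the fraction on the policy object rather than in code is what makes it adjustable per tenant when the first guess turns out to be wrong.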

3. Aggressive caching of two specific things

Caching is the single highest-leverage thing in this list, and almost nobody does it well.

Embeddings, cache indefinitely. Embeddings are deterministic: the same input and the same model version produce the same vector every time, so there is no reason to recompute them. Hash the normalised input together with the model identifier, store the vector, and never embed the same string twice. On any meaningfully large corpus this alone is the difference between a credible bill and an absurd one.
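
A minimal in-memory sketch of that cache. The `embed` callable stands in for whatever embedding client you use, and the class name is hypothetical; a real deployment would back the dictionary with a persistent store. Normalisation here only collapses whitespace and applies Unicode NFC, which is a conservative assumption: anything more aggressive could change the vector.

```python
import hashlib
import unicodedata
from typing import Callable

Vector = list[float]


def normalise(text: str) -> str:
    """Unicode-normalise and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


class EmbeddingCache:
    def __init__(self, model_id: str, embed: Callable[[str], Vector]) -> None:
        self._model_id = model_id
        self._embed = embed                   # stand-in for a real embedding call
        self._store: dict[str, Vector] = {}   # swap for a persistent store in production
        self.misses = 0

    def _key(self, text: str) -> str:
        # The model id is part of the key so a model change never serves stale vectors.
        payload = f"{self._model_id}\x00{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, text: str) -> Vector:
        clean = normalise(text)
        key = self._key(clean)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(clean)
        return self._store[key]
```

The `misses` counter is there because the miss rate is the number that tells you how much of the embedding bill is left.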

Responses, cache for short TTLs. Response caching is harder because outputs aren't deterministic and inputs vary in subtle ways. But on read-mostly workloads ("summarise this page", "explain this concept", "answer this FAQ-shaped question") a cache TTL measured in minutes or hours has a disproportionate hit rate. Real production traffic is not uniformly distributed; a small number of queries account for a large fraction of the volume, and within the TTL you only need to pay the model once for each of them.
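
A sketch of a short-TTL response cache, assuming a single process and an in-memory dictionary; `TTLCache` and `get_or_compute` are hypothetical names, and `compute` stands in for the model call. The clock is injectable so expiry can be tested without sleeping.

```python
import time
from typing import Callable


class TTLCache:
    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry time, cached response)
        self._entries: dict[str, tuple[float, str]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], str]) -> str:
        """Return a fresh cached response, or call `compute` once and cache the result."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

This version never evicts expired entries it does not revisit, which is fine for a sketch and not for a long-running service.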

The cache-key surface needs care. You want the key to include the model, the prompt template version, the relevant retrieval context, and a normalised form of the user input — lowercased, whitespace-collapsed, with obvious PII either stripped or hashed before it touches the key. A SHA-256 over the normalised tuple is fine; it is one-way, it composes cleanly, and it doesn't leak anything to whoever inspects the cache later. Get the key surface right once and the cache pays for itself for the life of the system.
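
The key construction described above can be sketched as follows. The function and parameter names are hypothetical, PII stripping is assumed to have happened before this point, and retrieval context is represented by document identifiers in the order they appear in the prompt.

```python
import hashlib
import json
from typing import Sequence


def normalise_input(text: str) -> str:
    """Lowercase and collapse whitespace. Assumes PII was stripped or hashed upstream."""
    return " ".join(text.lower().split())


def cache_key(
    model: str,
    template_version: str,
    context_ids: Sequence[str],
    user_input: str,
) -> str:
    """SHA-256 over the normalised tuple that defines a cacheable request."""
    # JSON serialisation keeps field boundaries unambiguous, unlike plain concatenation.
    payload = json.dumps(
        [model, template_version, list(context_ids), normalise_input(user_input)],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Context order is deliberately preserved: if two requests feed the same documents to the model in a different order, the prompts differ, so the keys should too.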

4. Model selection as a tiering exercise

There is no single best model. There is a best model for a given task at a given quality bar, and that model is almost never the frontier one.

We tier the workload. Cheap, small models for non-critical extraction and classification: pulling structured fields out of a document, deciding which of three buckets a query belongs to, normalising user input. Frontier models for the steps that genuinely need reasoning: synthesis across multiple sources, multi-step planning, anything where a wrong answer is expensive. Fine-tuned or distilled models for narrow, high-volume cases where you have enough data to train and the task is well-defined enough that frontier-model generality is wasted.

The rule of thumb is: pick the cheapest model that passes the eval bar, and that is usually one tier below the frontier. Frontier models are correctly used as the fallback when something harder shows up, not as the default. A system that uses the frontier model for every step is almost always over-paying by a factor that is embarrassing once you measure it.
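
The rule of thumb reduces to a few lines once eval scores and per-request costs are measured. This is a sketch with hypothetical names; the scores and costs are inputs you supply from your own eval runs, not something the code can produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str                     # hypothetical model identifier
    cost_per_request_usd: float   # measured on the eval set
    eval_score: float             # measured on the eval set, higher is better


def cheapest_passing(candidates: list[Candidate], quality_bar: float) -> Candidate:
    """Pick the cheapest candidate whose eval score meets the quality bar."""
    passing = [c for c in candidates if c.eval_score >= quality_bar]
    if not passing:
        raise ValueError(f"no candidate reaches the quality bar of {quality_bar}")
    return min(passing, key=lambda c: c.cost_per_request_usd)
```

Running this per task rather than per system is what produces the tiers: extraction and classification usually resolve to a smaller model than synthesis does.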

5. Eval-driven, not vibe-driven

Everything above only works if you can measure quality. Without an eval set, the most you can ever say is: "the bill went down and the user complaints did not go up, that we know of, yet." The first half of that sentence is engineering. The second half is hope.

An eval set does not need to be elaborate. A few hundred representative inputs with known-good outputs, scored automatically where possible and reviewed by a human where not, is enough to make every cost lever above legible. You change the model — does the score move? You add a cache — does the score move? You shorten the prompt — does the score move? Without that loop you are guessing in public.
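
A harness for that loop can be very small. The sketch below assumes exact-match scoring after trimming and case-folding, which only suits tasks with a single right answer; the function names and the 0.01 tolerance are hypothetical, and `generate` stands in for the system under test.

```python
from typing import Callable

EvalCase = tuple[str, str]  # (input, known-good output)


def exact_match(expected: str, actual: str) -> float:
    """1.0 if the outputs match after trimming and case-folding, else 0.0."""
    return 1.0 if expected.strip().casefold() == actual.strip().casefold() else 0.0


def run_eval(
    cases: list[EvalCase],
    generate: Callable[[str], str],
    score: Callable[[str, str], float] = exact_match,
) -> float:
    """Mean score of `generate` over the eval set."""
    if not cases:
        raise ValueError("eval set is empty")
    return sum(score(expected, generate(prompt)) for prompt, expected in cases) / len(cases)


def regressed(baseline: float, candidate: float, tolerance: float = 0.01) -> bool:
    """True if the candidate score dropped by more than the tolerance."""
    return candidate < baseline - tolerance
```

Every cost change then becomes two calls to `run_eval`, one before and one after, and a call to `regressed` on the pair.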

This is the work AI itself most reliably skips. Almost every inherited LLM system we have audited has had no eval set, no regression suite, no quality baseline. Adding one is usually a few days of work and unlocks every other lever in this list.

What this actually buys you

On inherited systems we have cut LLM bills by more than half, and often by much more, without measurable quality regressions. The trick is not a clever trick. It is caring early, measuring honestly, and designing backwards from a ceiling instead of forwards from a default.

The work is not glamorous. It is reading prompt logs, building eval harnesses, tagging cache keys, configuring per-tenant fallbacks. It is the operations layer underneath the demo, and it is exactly the layer the demo never has.

Why the engineer has to do this, not the model

There is a tempting symmetry to the idea that AI should help you put a leash on AI cost. It mostly doesn't, and the reason is structural. The model cannot reliably reason about its own cost envelope, because it is the thing being budgeted. It will happily recommend a more capable model when the cheaper one would do. It will happily expand a prompt when a shorter one would score the same. It will not, on its own, build the eval set that would tell you either of those calls was wrong.

The engineering judgement has to come from outside the system. That is the job. Done well, it is the difference between an LLM feature that pays for itself and one that quietly eats the margin of the product it was supposed to improve.
