Multi-Model Architecture: Frontier Models for Reasoning, Small Models for Everything Else

May 22, 2026 · 5 min read

Tags: AI, LLM, Architecture, Cost Optimization

For our first months in production, every single LLM call in the product went through the most capable model available. Agent conversation? Frontier model. Generating a one-line title for a chat thread? Same frontier model. Classifying whether an uploaded image was a logo? Also the frontier model.

It worked, which is exactly the problem. Nothing forces you to fix an architecture that works, until you look at the invoice, or until latency complaints arrive for features that should feel instant. We looked at both. The fix was not "use a cheaper model." It was accepting that model selection is an architecture decision, the same family of decision as choosing between Postgres and a queue.

Product details anonymized. Real engineering patterns.

The workload audit

Before changing anything, we listed every place the product calls a model and sorted them by what the call actually requires. The pattern that emerged:

Tier 1, orchestration and reasoning. The agent conversation itself: understanding intent, planning multi-step work, choosing tools, composing SQL, handling ambiguity. Low volume (one conversation at a time per user), highest stakes, genuinely needs frontier-level capability. When the orchestrator picks the wrong tool, everything downstream is garbage.

Tier 2, bounded generation. Producing a performance narrative for a dashboard, drafting ad copy variants from a brand kit, summarizing a campaign's week. Well-scoped tasks with clear inputs and a defined shape of output. A mid-tier model handles these with output we could not tell apart from the frontier model's (we verified with blind comparisons on our eval set, not vibes).

Tier 3, high-frequency micro-tasks. Title generation, classification, extraction of structured fields, yes/no relevance checks, routing. Enormous volume, tiny individual stakes, latency-sensitive. Small, fast models handle these. Some of these calls eventually stopped being LLM calls at all: a regex or a lookup table replaced two of them, which is its own lesson.

The distribution surprised me: by call count, Tier 3 was the overwhelming majority of our traffic. By cost, before the change, Tier 1 conversations and Tier 3 micro-tasks were comparable lines on the invoice. We were paying frontier prices for work that a model costing a fraction as much does equally well.
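
The tally behind that kind of observation can be very small. Below is a minimal sketch of an audit summary in Python. The call-site names, volumes, and per-call costs are all invented for illustration and are not our real numbers; they are chosen only so the output has the same shape as what we saw (Tier 3 dominating call count while matching Tier 1 on cost).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallSite:
    name: str             # hypothetical call-site label
    tier: int             # 1 = reasoning, 2 = bounded generation, 3 = micro-task
    calls_per_day: int    # illustrative volume, not real data
    cost_per_call: float  # illustrative USD cost on the current model


def summarize(sites):
    """Return {tier: (share_of_calls, share_of_cost)} for the audited call sites."""
    calls = defaultdict(int)
    cost = defaultdict(float)
    for site in sites:
        calls[site.tier] += site.calls_per_day
        cost[site.tier] += site.calls_per_day * site.cost_per_call
    total_calls = sum(calls.values())
    total_cost = sum(cost.values())
    return {
        tier: (calls[tier] / total_calls, cost[tier] / total_cost)
        for tier in sorted(calls)
    }


# Made-up numbers: everything runs on the same frontier model, so the
# per-call cost differs only because the calls differ in size.
AUDIT = [
    CallSite("agent_conversation", 1, 1_000, 0.05),
    CallSite("dashboard_narrative", 2, 2_000, 0.005),
    CallSite("thread_title", 3, 40_000, 0.0005),
    CallSite("image_is_logo", 3, 60_000, 0.0005),
]

for tier, (call_share, cost_share) in summarize(AUDIT).items():
    print(f"Tier {tier}: {call_share:.0%} of calls, {cost_share:.0%} of cost")
```

A spreadsheet does the same job. The point is to look at call share and cost share side by side, per tier, before deciding anything.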

What the routing actually looks like

Nothing exotic. A model registry maps each task type (not each call site) to a model configuration: provider, model, fallback, max tokens, temperature. Code declares "this is a narrative_generation task" and the registry decides what runs it.
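
As a rough illustration, a registry of this kind could look like the Python sketch below. The provider names, model names, task names other than narrative_generation, and the resolve function are placeholders invented for the example, not our actual configuration. The optional override shows one way a caller could swap the model while keeping the rest of the task's settings.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    provider: str
    model: str
    fallback: Optional[str]  # model to try if the primary is unavailable
    max_tokens: int
    temperature: float


# Placeholder provider and model names; only the shape matters.
REGISTRY = {
    "agent_orchestration": ModelConfig("provider_a", "frontier-model", "mid-model", 4096, 0.2),
    "narrative_generation": ModelConfig("provider_a", "mid-model", "frontier-model", 512, 0.7),
    "title_generation": ModelConfig("provider_b", "small-model", "mid-model", 32, 0.3),
}


def resolve(task_type: str, model_override: Optional[str] = None) -> ModelConfig:
    """Look up the config for a task type, optionally swapping in another model."""
    try:
        config = REGISTRY[task_type]
    except KeyError:
        raise ValueError(f"unregistered task type: {task_type!r}") from None
    if model_override is not None:
        # Keep token limits and temperature; change only the model.
        return replace(config, model=model_override)
    return config
```

A call site then asks for resolve("narrative_generation") and never names a model itself, so re-tiering a task means editing one entry in REGISTRY.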

Two design choices paid for themselves repeatedly:

Centralize the mapping. When task-to-model decisions live in one file instead of scattered across call sites, swapping a tier after a new model release is a one-line change plus an eval run. We have re-tiered tasks four or five times as the model landscape shifted, and the registry made each migration boring, which is the highest compliment infrastructure can receive.

Eval per task, not per model. Each task type has its own small golden set. "Is model X good?" is not answerable. "Does model X produce acceptable narratives on our narrative cases?" is a question you can answer in minutes. Downgrading a task to a cheaper tier without its eval is gambling; with it, it is routine maintenance.
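
To make the idea concrete, here is a minimal per-task eval sketch in Python. It assumes each golden case carries its own programmatic acceptance check, which is a simplification: blind human comparison or a rubric would slot into the same place. GoldenCase, pass_rate, safe_to_downgrade, and the two fake model functions are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    is_acceptable: Callable[[str], bool]  # task-specific check on the output


def pass_rate(generate: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Fraction of golden cases whose output passes its own check."""
    passed = sum(1 for case in cases if case.is_acceptable(generate(case.prompt)))
    return passed / len(cases)


def safe_to_downgrade(candidate, baseline, cases, tolerance=0.0) -> bool:
    """True if the candidate's pass rate is within `tolerance` of the baseline's."""
    return pass_rate(candidate, cases) >= pass_rate(baseline, cases) - tolerance


# Stand-ins for real model calls so the sketch runs offline.
def fake_frontier(prompt: str) -> str:
    return prompt.title()[:60]


def fake_small(prompt: str) -> str:
    return prompt.title()[:60]


# A tiny golden set for a title-generation task (illustrative cases only).
TITLE_CASES = [
    GoldenCase("quarterly ad spend review", lambda out: 0 < len(out) <= 60),
    GoldenCase("why did ctr drop last week", lambda out: "\n" not in out),
]

print(safe_to_downgrade(fake_small, fake_frontier, TITLE_CASES))
```

The comparison is deliberately relative to the model currently serving the task, so the question being answered is "is the cheaper model at least as good here?" rather than "is it good in general?".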

There is also a user-facing wrinkle: for the main agent, we expose a model selector. Some users want maximum capability, others want speed. That only works because the architecture already treats the model as a parameter rather than a constant.

The async trick that mattered more than the routing

One change improved perceived performance more than all the model swaps combined: moving generation out of the request path.

Dashboard narratives used to generate on page load, and the user stared at a shimmer while a model wrote three insightful sentences. Now narratives generate asynchronously when fresh data lands, get cached, and the dashboard renders instantly with the pre-computed text. Regeneration happens in the background on data changes.

The general principle: an LLM call in a synchronous user path needs to justify itself. Most generation is predictable enough to do ahead of time. Users do not experience your model's latency; they experience your architecture's latency.
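
A stripped-down version of the pattern, sketched in Python with asyncio: the cache is a plain dict standing in for whatever store you use, and generate_narrative, on_data_refreshed, and render_dashboard are hypothetical names, with a short sleep in place of the real model call.

```python
import asyncio

# In-memory stand-in for a real cache or database table.
narrative_cache: dict[str, str] = {}


async def generate_narrative(dashboard_id: str, data: dict) -> str:
    """Placeholder for the slow model call."""
    await asyncio.sleep(0.01)  # simulates model latency
    return f"Spend was {data['spend']} this week."


async def on_data_refreshed(dashboard_id: str, data: dict) -> None:
    """Runs in the background when fresh data lands, never in the request path."""
    narrative_cache[dashboard_id] = await generate_narrative(dashboard_id, data)


def render_dashboard(dashboard_id: str) -> str:
    """Request path: a cache read only, no model call."""
    return narrative_cache.get(dashboard_id, "Narrative is being prepared.")


async def main() -> list[str]:
    before = render_dashboard("d1")                 # nothing cached yet
    await on_data_refreshed("d1", {"spend": 1200})  # background job
    after = render_dashboard("d1")                  # pre-computed text
    return [before, after]


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The request handler never awaits the model. The worst case for the user is a placeholder string on the very first load, not a spinner on every load.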

Honest costs of the approach

Multi-model is not free. You maintain prompt variants per tier, because a prompt tuned for a frontier model often underperforms when reused verbatim on a small one: smaller models want more explicit structure and fewer implied steps. Your eval surface multiplies. Fallback logic (provider outage, rate limits) needs testing per tier. And there is a constant temptation to over-engineer routing for calls that cost fractions of a cent, which is precisely why the audit matters, so you optimize the lines that show up on the invoice.
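
Two of those costs, per-tier prompt variants and fallback handling, can be sketched together. The Python below is an assumption-laden illustration: ProviderError stands in for whatever outage or rate-limit exception your provider SDK raises, the prompts are invented, and call_model is injected so the fallback path can be exercised in a test without a network.

```python
class ProviderError(Exception):
    """Stand-in for an outage or rate-limit error from a provider SDK."""


# One prompt variant per tier: the small-model variant spells out the rules.
PROMPTS = {
    "frontier": "Write a title for this thread:\n{text}",
    "small": (
        "Task: write a title.\n"
        "Rules: at most 8 words, no quotes, no trailing punctuation.\n"
        "Thread:\n{text}\n"
        "Title:"
    ),
}


def call_with_fallback(text, chain, call_model):
    """Try each (tier, model) pair in order; re-raise the last error if all fail."""
    if not chain:
        raise ValueError("empty fallback chain")
    last_error = None
    for tier, model in chain:
        # Each model gets the prompt variant written for its tier.
        prompt = PROMPTS[tier].format(text=text)
        try:
            return call_model(model, prompt)
        except ProviderError as err:
            last_error = err
    raise last_error
```

Note that the fallback model receives its own tier's prompt, not the one written for the model that failed. That detail is easy to miss and is exactly the kind of thing the per-tier fallback tests should cover.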

For us the trade was clearly worth it: a large cut in inference spend at equal output quality (per the evals), and Tier 3 features went from noticeably laggy to effectively instant.

The audit comes first

Start with the audit, not the router. One afternoon of listing every model call and asking "what does this actually require?" tells you whether you have a multi-model problem or just two expensive endpoints to fix.

And build the eval sets before the migration, not after. The entire approach rests on being able to say "the cheap model is good enough here" with evidence. Without that, tiering is just cost-cutting with extra steps, and the first quality regression will send everything back to the frontier model, this time with organizational scar tissue attached.
