Smart model routing is the key to AI governance (not just token savings)

The market is selling smart routing as a way to spend less on inference. We believe the operators who win will treat it as the place their policy lives — jurisdiction, customer tier, SLA, failover — and take the cost savings as a bonus.

If you're running an AI cloud, you probably have dozens of models serving traffic. A frontier model for the hard requests, a couple of open-weight workhorses for the rest, maybe a specialist for code or OCR or a particular language… and plenty of other experimental models that promised big things.

Someone has told you that you need "smart routing" to send each request to the cheapest model that can handle it.

They're right. It's also the least interesting thing routing does for you.

The cheapest model that works saves money, and it's the easy part

The price spread across the model menu is enormous and getting wider. Public API rates in 2026 run from around $0.10 per million tokens for the small, fast models up to $60 and beyond for frontier reasoning models.

Open-weight models keep landing within single digits of the closed frontier on performance benchmarks, while costing a tenth to a thirtieth as much per token. GPT-4-class capability that cost about $30 per million tokens in early 2023 now goes for under a dollar.

If you route the ‘easy’ 80% of your traffic to a small model and reserve the expensive one for the requests that truly need it, the savings can be enormous. LMSYS's RouteLLM work showed routing cut cost by around 85%, while holding 95% of GPT-4's quality.

So yes, do it. But using model routing purely to slash token costs misses some powerful value.

The router sees every request, so it's where policy belongs

The router or gateway is the one component in your multi-model workflow that inspects every request before it reaches a model, which makes it the natural home for your rules about what's allowed to happen.

Think about the decisions that an enterprise or an AI service provider might want to be enforced:

A request carrying personal or regulated data should only ever touch a model hosted in an approved jurisdiction.
A Platinum-tier customer gets the premium model with no cost ceiling.
An internal expense-categorization job gets the smallest model that clears the bar.
A legal workload routes to approved vendors only.
A request that blows its latency budget fails over to a faster provider.
When a tenant hits its spend cap, traffic spills to cheaper capacity instead of failing outright.

None of those are purely cost decisions. They're important policy decisions, and the question the router is answering stops being "which model is cheapest" and becomes "which model is allowed, available, and appropriate for this request, under this customer's policy, right now." That's a governance function.

From August, "which jurisdiction" stops being a preference

If you sell into regulated industries, governments, or your own national market, one of those policies just acquired legal teeth. The EU AI Act's high-risk obligations apply from 2 August 2026, with penalties reaching €35 million or 7% of global turnover at the top of the scale.

The compliance question has moved on, too: it's no longer only "where does the data sit" but "where is this inference processed, and who can see it while it happens." Self-hosted inference inside the right jurisdiction makes a whole category of cross-border-transfer paperwork disappear. And regulators have started penalizing missing controls, not only breaches, which means you need to log that a request was routed to a compliant model, not merely intend it to happen.

Routing as a resilience engine

When it comes to models, everyone benchmarks quality of output (coding, math, whatever the use case). Enterprises lose sleep over uptime. A provider outage at 2am should not become your outage, and a model that starts returning garbage or timing out should be routed around before your user or customer notices.

Routing is also a resilience layer: retry, fall back to a second deployment, switch provider if you have to, keep the same API to the caller, and carry on serving. The plumbing for this — fallback chains, cooldown on unhealthy backends, weighted failover within a model group before escalating across groups — is well-trodden ground in LiteLLM, which is the engine a lot of this is built on, including ours.

What you can do today

According to a show of hands at PlatformCon London this month, very few platform teams have implemented any kind of model routing gateway at all to date. But the technology is very much there. The proverbial low-hanging fruit is to route your easy traffic to small models, even if users are asking for Claude Opus on max effort for every request. The savings are immediate, the risk is low.

More strategically, embrace the idea of model routing as a way to enforce policy or governance. Jurisdiction, tier, SLA, approved vendors: express them as rules a non-engineer can read and an auditor can check, and make sure you can log every routing decision before August, not after your first compliance review.

Treat the router as your resilience layer from day one. We all hate it when Claude or ChatGPT goes down unexpectedly, and it happens too often. A single-provider dependency with no failover path is an outage you've agreed to in advance.

The operators who win the next few years won't be the ones with the cheapest route. They'll be the ones who can prove which model answered, why, in which jurisdiction, and under whose policy, and then bill for the confidence that gives their customers.

Routing is doing for models roughly what Kubernetes did for containers: you declare what you want, and the platform decides where and how to make it happen. The ‘cheapest model that works’ is just the first, smallest thing you'd ever ask it for.

‍

Jun 29, 2026