model: "auto" and Orbitrage runs your prompt through a six-step pipeline.
The goal: the cheapest model that can actually handle this task.
auto vs. pinning a model
Let Orbitrage route
Pass
auto (or router, default, orbitrage). The engine scores the
prompt and selects a model for you.Pin a specific model
Pass any concrete model id (e.g.
claude-sonnet-4-6, gpt-5.4,
DeepSeek-V4-Flash). Scoring is skipped — the request goes straight there.The six-step pipeline
Normalize
The request (chat, Responses, or legacy completion shape) is normalized to a
common internal form, so the rest of the pipeline is uniform.
Score
The prompt gets a difficulty score in
[0.05, 0.95]. Three signals compete:- ML classifier — a ~33M-param
bge-smallprompt-difficulty model. - 70+ heuristics — regex rules that lower the score for extraction, formatting, classification, and classic exercises; raise it for reasoning, debugging, strategy, and long/complex prompts.
- Explicit annotation — a caller-supplied priority overrides both.
Capability ceiling
If the call declares a capability type, the score is capped so trivial work
can’t escalate to an expensive tier. A
formatting task is capped low;
reasoning and planning are uncapped.Dial
A per-deployment dial (0.0–1.0) shifts tier thresholds. Lower =
conservative (stay cheap longer); higher = aggressive (escalate sooner).
Select tier + model
The score maps to a tier, then the engine picks a concrete model: a
vision-capable model when the prompt has images, a code-biased model for code,
the cheapest open model otherwise. Long prompts escalate automatically;
trivially simple code de-escalates.
Tiers
Models are grouped by capability and cost. Routing climbs only as high as the prompt needs.| Tier | For | Example models |
|---|---|---|
| basic | Formatting, classification, extraction, simple chat | gpt-5-nano, gpt-4o-mini, gpt-5.4-mini, llama-3.1-8b-instant |
| mid | Everyday chat and code, moderate reasoning | gpt-5.4-nano, gpt-4o, DeepSeek-V4-Flash, FW-MiniMax-M2.5 |
| high | Hard reasoning, serious code, long context | gpt-5.4, Kimi-K2.6, DeepSeek-V3.2, grok-4, MiniMax-M2.5 |
| frontier | The hardest tasks, when the dial or an annotation pushes there | claude-opus-4-8, claude-sonnet-4-6, gpt-5.5 |
| image | Image generation (separate endpoint) | gpt-image-2 |
Reading the routing decision
Every routed call records the model it chose and why. On the dashboard’s Routing page (and each span):- Requested → Routed to — the alias you sent vs. the model used
- Tier and priority score — what the prompt scored and where it landed
- Signals — the heuristics that fired (e.g.
code detected,long prompt) - Fallback chain — the models that would have been tried on failure
- Saved — the cost difference vs. a frontier baseline
X-Orbitrage-Overhead-Ms response header, so you can
see exactly how much latency Orbitrage added on top of the provider.
Forcing behavior
Always use one model
Always use one model
Pin the model id on every call. A concrete id is treated as an explicit pin
and skips scoring entirely.
Use your own provider credits
Use your own provider credits
Save a provider key on the Models page.
Matching models are forwarded with your key. See BYOK.
Bias the whole project cheaper or smarter
Bias the whole project cheaper or smarter
The operator dial shifts tier thresholds for your deployment. Lower it to keep
traffic on cheaper tiers; raise it to escalate sooner.