Skip to main content
Send model: "auto" and Orbitrage runs your prompt through a six-step pipeline. The goal: the cheapest model that can actually handle this task.

auto vs. pinning a model

Let Orbitrage route

Pass auto (or router, default, orbitrage). The engine scores the prompt and selects a model for you.

Pin a specific model

Pass any concrete model id (e.g. claude-sonnet-4-6, gpt-5.4, DeepSeek-V4-Flash). Scoring is skipped — the request goes straight there.
# Routed — cheapest capable model:
client.chat.completions.create(model="auto", messages=[...])

# Pinned — exactly this model, no scoring:
client.chat.completions.create(model="claude-sonnet-4-6", messages=[...])

The six-step pipeline

1

Normalize

The request (chat, Responses, or legacy completion shape) is normalized to a common internal form, so the rest of the pipeline is uniform.
2

Score

The prompt gets a difficulty score in [0.05, 0.95]. Three signals compete:
  • ML classifier — a ~33M-param bge-small prompt-difficulty model.
  • 70+ heuristics — regex rules that lower the score for extraction, formatting, classification, and classic exercises; raise it for reasoning, debugging, strategy, and long/complex prompts.
  • Explicit annotation — a caller-supplied priority overrides both.
A web-search signal (asking for live/recent data) bumps the score onto a capable tier.
3

Capability ceiling

If the call declares a capability type, the score is capped so trivial work can’t escalate to an expensive tier. A formatting task is capped low; reasoning and planning are uncapped.
4

Dial

A per-deployment dial (0.0–1.0) shifts tier thresholds. Lower = conservative (stay cheap longer); higher = aggressive (escalate sooner).
5

Select tier + model

The score maps to a tier, then the engine picks a concrete model: a vision-capable model when the prompt has images, a code-biased model for code, the cheapest open model otherwise. Long prompts escalate automatically; trivially simple code de-escalates.
6

Proxy + fallback

The request is proxied to the provider. On an infrastructure error (5xx, 429, connectivity), a fallback chain of 2–5 models is tried across providers. Client errors (4xx, content filters) do not cascade — they return immediately.

Tiers

Models are grouped by capability and cost. Routing climbs only as high as the prompt needs.
TierForExample models
basicFormatting, classification, extraction, simple chatgpt-5-nano, gpt-4o-mini, gpt-5.4-mini, llama-3.1-8b-instant
midEveryday chat and code, moderate reasoninggpt-5.4-nano, gpt-4o, DeepSeek-V4-Flash, FW-MiniMax-M2.5
highHard reasoning, serious code, long contextgpt-5.4, Kimi-K2.6, DeepSeek-V3.2, grok-4, MiniMax-M2.5
frontierThe hardest tasks, when the dial or an annotation pushes thereclaude-opus-4-8, claude-sonnet-4-6, gpt-5.5
imageImage generation (separate endpoint)gpt-image-2
See Models for the catalog, pricing, vision support, and context windows.

Reading the routing decision

Every routed call records the model it chose and why. On the dashboard’s Routing page (and each span):
  • Requested → Routed to — the alias you sent vs. the model used
  • Tier and priority score — what the prompt scored and where it landed
  • Signals — the heuristics that fired (e.g. code detected, long prompt)
  • Fallback chain — the models that would have been tried on failure
  • Saved — the cost difference vs. a frontier baseline
The gateway also returns an X-Orbitrage-Overhead-Ms response header, so you can see exactly how much latency Orbitrage added on top of the provider.

Forcing behavior

Pin the model id on every call. A concrete id is treated as an explicit pin and skips scoring entirely.
Save a provider key on the Models page. Matching models are forwarded with your key. See BYOK.
The operator dial shifts tier thresholds for your deployment. Lower it to keep traffic on cheaper tiers; raise it to escalate sooner.