Router

The Router finds where you’re overpaying. It replays a sample of a project’s recent production traffic against candidate models, has an LLM judge score each replay against what your current model actually produced, and only recommends a switch when a cheaper model holds your quality. Apply it and Orbitrage routes that workload to the new model at the gateway — your application code never changes. Open it at app.orbitrage.ai/router.

How it works

Sample

A weighted sample of the project’s recent production runs (whole traces, not just single calls) is frozen for the experiment. Only text traffic from the last 30 days is sampled.

Replay

The exact recorded inputs are re-sent to each candidate model. Recorded tool results are injected so multi-step runs replay faithfully. Replays are billed to your org credits like any inference and never appear in production analytics.

Judge

An LLM judge scores every replay against your production output on correctness, completeness, instruction-following and format — producing a parity % (quality vs. your current model = 100) and a verdict (better / parity / worse / unusable).

Report

Results are aggregated per candidate and assigned a confidence tier, and the recommendation gate below is then applied. You get one clear verdict per workload, plus a per-step routing plan.
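
The Judge and Report steps can be pictured with a small sketch. It is an illustration only: the `JudgedReplay` type, its field names, and the `summarise` helper are hypothetical, not part of any Orbitrage SDK. The formula behind the confidence tier is not documented on this page, so the sketch only reports the sample size and spread that such a tier would depend on.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# The four verdicts the judge can return, as listed above.
VERDICTS = ("better", "parity", "worse", "unusable")


@dataclass(frozen=True)
class JudgedReplay:
    """One judged replay (hypothetical shape, for illustration)."""
    candidate: str  # model the step was replayed on
    parity: float   # judge score, where the current model = 100
    verdict: str    # one of VERDICTS


def summarise(replays):
    """Group judged replays by candidate and compute simple aggregates."""
    by_candidate = {}
    for r in replays:
        if r.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {r.verdict!r}")
        by_candidate.setdefault(r.candidate, []).append(r)

    report = {}
    for candidate, rows in by_candidate.items():
        scores = [r.parity for r in rows]
        report[candidate] = {
            "n": len(rows),                   # sample size
            "mean_parity": mean(scores),      # average parity %
            "parity_stdev": pstdev(scores),   # spread of the scores
            "unusable_share": sum(r.verdict == "unusable" for r in rows) / len(rows),
        }
    return report
```

Calling `summarise` on a list of `JudgedReplay` records returns one summary dictionary per candidate model, which is the kind of input the recommendation gate would then check.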

The page

Three cards sit at the top, computed from your routing_steps over the last 90 days:

Card	Shows
Traffic distribution by model	Your top 5 models by request volume (share %). Hover a bar to see that model’s spend.
Frontier spend	Spend on closed frontier models (OpenAI, Anthropic, Google, xAI). These are BYOK (bring your own key), so this is what your provider billed you; it never touches Orbitrage credits.
Open-source spend	Spend on open-weight models (DeepSeek, Qwen, GLM, Mistral, …) — the models Orbitrage serves and debits from your credits.
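
As a rough sketch of how these cards could be derived from step records, assuming each record is a dictionary with hypothetical `model`, `provider`, and `cost_usd` keys (the real `routing_steps` schema is not shown on this page), and assuming any provider outside the closed-frontier list counts as open-weight:

```python
from collections import Counter, defaultdict

# Closed frontier providers named in the table above.
FRONTIER_PROVIDERS = {"openai", "anthropic", "google", "xai"}


def traffic_distribution(steps, top_n=5):
    """Top models by request volume, with share % and spend per model."""
    counts = Counter(s["model"] for s in steps)
    spend = defaultdict(float)
    for s in steps:
        spend[s["model"]] += s["cost_usd"]
    total = sum(counts.values())
    # most_common() yields nothing for empty input, so total is never 0 here.
    return [
        {"model": m, "share_pct": 100 * n / total, "spend_usd": spend[m]}
        for m, n in counts.most_common(top_n)
    ]


def spend_split(steps):
    """Split total spend into frontier and open-source buckets."""
    total = sum(s["cost_usd"] for s in steps)
    frontier = sum(
        s["cost_usd"] for s in steps if s["provider"] in FRONTIER_PROVIDERS
    )
    # Assumption: everything that is not a frontier provider is open-weight.
    return {"frontier_usd": frontier, "open_source_usd": total - frontier}
```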

Below them, the Experiments list shows every benchmark run with its outcome.

Running an experiment

Click New Experiment, pick a project, and set:

Field	Meaning
Models to test	`Open-source models` (cheapest) or `All models`.
Sample rate	% of the project’s recent traffic to replay.
Budget	A hard cap on what the run can spend — it stops before exceeding it.
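
The Budget field behaves like a hard cap that is checked before each replay. A minimal sketch of that idea follows, assuming the cost of a replay can be estimated up front. `estimate_cost` and `replay` are hypothetical callables supplied by the caller, not Orbitrage functions, and the real run may account for cost differently.

```python
def run_with_budget(steps, budget, estimate_cost, replay):
    """Replay steps in order, stopping before the budget would be exceeded."""
    spent = 0.0
    results = []
    for step in steps:
        projected = estimate_cost(step)
        if spent + projected > budget:
            break  # the next replay would cross the cap, so stop here
        results.append(replay(step))
        spent += projected
    return results, spent
```

Checking the projected total before each call, rather than after, is what keeps the run from overshooting the cap.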

The run executes in the background (you can leave the page) and is billed to org credits.

Outcomes

A finished experiment lands in one of these states — a completed run that finds no cheaper model is Optimal, not a failure:

Status	Meaning
Running	Benchmark in progress.
Suggested	A cheaper model held your accuracy — review and apply.
Optimal	No candidate held accuracy — your current routing is already well-optimised.
Low data	Not enough replayable traffic to reach a confident verdict.
Failed	The run errored out (see the panel for why).

The recommendation gate

Orbitrage only suggests a switch when it can stand behind it, and never issues a weak or hedged recommendation. All of the following must hold:

Quality parity ≥ 85% of your current model, and
Cost Δ ≤ −20% (the candidate is at least 20% cheaper), and
Confidence ≥ medium (a function of sample size and variance), with ≤ ~10% unusable verdicts.
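
The three conditions can be read as a single boolean check. The sketch below is an illustration, not Orbitrage's implementation: the function name and parameters are hypothetical, the tier names `low` and `high` are assumed (only `medium` is named here), and the unusable-verdict limit is treated as exactly 10% although it is stated as approximate.

```python
# Assumed ordering of confidence tiers; only "medium" is named on this page.
CONFIDENCE_ORDER = {"low": 0, "medium": 1, "high": 2}


def passes_gate(parity_pct, cost_delta_pct, confidence, unusable_share):
    """Return True when a candidate meets every gate condition.

    parity_pct:      quality vs. the current model (current model = 100)
    cost_delta_pct:  cost change vs. the current model (negative = cheaper)
    confidence:      "low", "medium", or "high" (assumed tier names)
    unusable_share:  fraction of replays judged unusable, from 0.0 to 1.0
    """
    return (
        parity_pct >= 85.0
        and cost_delta_pct <= -20.0
        and CONFIDENCE_ORDER[confidence] >= CONFIDENCE_ORDER["medium"]
        and unusable_share <= 0.10  # "~10%" treated as a hard 10% here
    )
```

For example, a candidate at 92% parity that is 35% cheaper with high confidence and 2% unusable verdicts passes, while the same candidate at only 10% cheaper does not.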

There’s also a quality-upgrade path: a model judged clearly better (≥ 40% of steps “better”, minimal regressions) at comparable cost is surfaced even without a cost drop.

Full benchmark details

Open any experiment to see the verdict, monthly-savings estimate, and the models-tested matrix (accuracy / cost Δ / latency Δ / errors vs. your current model). Expand Full benchmark details for:

Per-step routing plan — one verdict per recurring LLM call site (planner, tool-caller, summariser, …), grouped by system prompt + toolset.
Replayed runs — the actual baseline-vs-candidate outputs for each replayed step, with the judge’s note.
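
Grouping by system prompt + toolset, as the per-step routing plan does, can be sketched as hashing those two values into a call-site key. The record keys `system_prompt` and `tools` are hypothetical, and treating the toolset as order-insensitive is an assumption of this sketch.

```python
import hashlib
import json
from collections import defaultdict


def call_site_key(system_prompt, tool_names):
    """Stable short key for a (system prompt, toolset) pair."""
    payload = json.dumps(
        {"system": system_prompt, "tools": sorted(tool_names)},  # order-insensitive
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def group_by_call_site(steps):
    """Bucket step records that share a system prompt and toolset."""
    groups = defaultdict(list)
    for step in steps:
        key = call_site_key(step["system_prompt"], step["tools"])
        groups[key].append(step)
    return dict(groups)
```

Each resulting bucket corresponds to one recurring call site, such as a planner or summariser, that can receive its own verdict.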

Use Share (the export icon) to print a clean PDF snapshot of a single experiment.

Settings

Open Settings to control automatic benchmarking:

Setting	Meaning
Enable benchmarking	Master switch — off means no traffic is sampled or replayed.
Sample rate	% of traffic replayed as scheduled benchmarks.
Run benchmarks	Cadence — daily / weekly / biweekly / monthly.
Accuracy threshold	Maximum accuracy drop versus your current model at which a switch is still suggested.
Max spend per run	Never spend above this on a single benchmark run.

Replays are tagged source = playground so they’re billed and visible in the Router, but kept out of your production Overview, Traces, and analytics.
