The Router finds where you’re overpaying. It replays a sample of a project’s recent production traffic against candidate models, has an LLM judge score each replay against what your current model actually produced, and only recommends a switch when a cheaper model holds your quality. Apply it and Orbitrage routes that workload to the new model at the gateway — your application code never changes. Open it at app.orbitrage.ai/router.

How it works

1. Sample

A weighted sample of the project’s recent production runs (whole traces, not just single calls) is frozen for the experiment. Only text traffic from the last 30 days is eligible.

2. Replay

The exact recorded inputs are re-sent to each candidate model. Recorded tool results are injected so multi-step runs replay faithfully. Replays are billed to your org credits like any inference and never appear in production analytics.

3. Judge

An LLM judge scores every replay against your production output on correctness, completeness, instruction-following and format, producing a parity % (quality relative to your current model, which is fixed at 100) and a verdict (better / parity / worse / unusable).

4. Report

Results are aggregated per candidate with a confidence tier, then the recommendation gate below is applied. You get one clear verdict per workload, plus a per-step routing plan.
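
To make the Judge and Report steps concrete, here is a minimal sketch of how per-replay judge results might be rolled up per candidate. The exact aggregation Orbitrage uses is not documented here, so the mean parity and verdict shares below are assumptions, and `JudgedReplay` and `aggregate` are hypothetical names used only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean

# Verdict labels as listed in the Judge step.
VERDICTS = ("better", "parity", "worse", "unusable")


@dataclass
class JudgedReplay:
    candidate: str     # hypothetical model identifier
    parity_pct: float  # judge score, with the current model fixed at 100
    verdict: str       # one of VERDICTS


def aggregate(replays):
    """Group judged replays by candidate and summarise each group."""
    by_candidate = {}
    for replay in replays:
        by_candidate.setdefault(replay.candidate, []).append(replay)

    report = {}
    for candidate, rows in by_candidate.items():
        counts = Counter(row.verdict for row in rows)
        report[candidate] = {
            "n": len(rows),
            # Assumption: a plain mean; the real aggregate may be weighted.
            "mean_parity_pct": mean(row.parity_pct for row in rows),
            "verdict_share": {v: counts[v] / len(rows) for v in VERDICTS},
        }
    return report
```

The per-candidate sample size `n` is kept in the summary because the confidence tier is described as a function of sample size and variance.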

The page

Three cards sit at the top, computed from your routing_steps over the last 90 days:
| Card | Shows |
| --- | --- |
| Traffic distribution by model | Your top 5 models by request volume (share %). Hover a bar to see that model’s spend. |
| Frontier spend | Spend on closed, frontier-provider models (OpenAI, Anthropic, Google, …). |
| Open-source spend | Spend on open-source models (DeepSeek, Qwen, Llama, Mistral, …), your cheaper alternative. |
Below them, the Experiments list shows every benchmark run with its outcome.
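
As a rough illustration of what the three cards compute, the sketch below derives the same figures from a list of step records. The record keys (`model`, `family`, `cost_usd`) and the `summarise` function are hypothetical, not the real `routing_steps` schema, and the two family sets only contain the examples named in the table, so they are incomplete.

```python
from collections import Counter

# Partial lists taken from the examples in the table; the real sets are longer.
FRONTIER = {"openai", "anthropic", "google"}
OPEN_SOURCE = {"deepseek", "qwen", "llama", "mistral"}


def summarise(steps):
    """steps: iterable of dicts with hypothetical keys model, family, cost_usd."""
    steps = list(steps)
    volume = Counter(step["model"] for step in steps)
    total = sum(volume.values())

    # Top 5 models by request volume, as a share of all requests.
    top5 = [(model, round(100 * n / total, 1)) for model, n in volume.most_common(5)]

    frontier = sum(s["cost_usd"] for s in steps if s["family"] in FRONTIER)
    open_source = sum(s["cost_usd"] for s in steps if s["family"] in OPEN_SOURCE)
    return {
        "top5_share_pct": top5,
        "frontier_spend": frontier,
        "open_source_spend": open_source,
    }
```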

Running an experiment

Click New Experiment, pick a project, and set:
| Field | Meaning |
| --- | --- |
| Models to test | Open-source models (cheapest) or All models. |
| Sample rate | % of the project’s recent traffic to replay. |
| Budget | A hard cap on what the run can spend; it stops before exceeding it. |
The run executes in the background (you can leave the page) and is billed to org credits.
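
The hard budget cap can be pictured as a check made before each replay rather than after it. This is only a sketch: it assumes the cost of a replay can be estimated up front, and `replay_within_budget` is a hypothetical helper, not part of Orbitrage.

```python
def replay_within_budget(estimated_costs, budget):
    """Return (replays_run, total_spent) without ever exceeding budget.

    estimated_costs: per-replay cost estimates, in the same unit as budget.
    """
    spent = 0.0
    done = 0
    for cost in estimated_costs:
        if spent + cost > budget:
            break  # stop before the cap would be crossed
        spent += cost
        done += 1
    return done, spent
```

For example, with estimates of 0.25, 0.25, 0.25 and 0.5 and a budget of 1.0, the first three replays run and the fourth is skipped because it would push the total past the cap.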

Outcomes

An experiment is always in one of these states. A completed run that finds no cheaper model is Optimal, not a failure:
| Status | Meaning |
| --- | --- |
| Running | Benchmark in progress. |
| Suggested | A cheaper model held your accuracy; review and apply. |
| Optimal | No candidate held accuracy; your current routing is already well optimised. |
| Low data | Not enough replayable traffic to reach a confident verdict. |
| Failed | The run errored out (see the panel for why). |

The recommendation gate

Orbitrage only suggests a switch when it can stand behind it — never a hedged weak recommendation:
  • Quality parity ≥ 85% of your current model, and
  • Cost Δ ≤ −20% (at least 20% cheaper than your current model), and
  • Confidence ≥ medium (a function of sample size and variance), with ≤ ~10% unusable verdicts.
There’s also a quality-upgrade path: a model judged clearly better (≥ 40% of steps “better”, minimal regressions) at comparable cost is surfaced even without a cost drop.
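
The three cost-saving conditions above can be written as a single predicate. This is a minimal sketch, not Orbitrage's implementation: `passes_cost_gate` is a hypothetical name, the tier labels `low` and `high` are assumed (only `medium` is named above), and the approximate 10% limit is treated as exact. The quality-upgrade path is left out because its "minimal regressions" and "comparable cost" limits are not quantified here.

```python
# Assumed ordering of confidence tiers; only "medium" is named in the text.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def passes_cost_gate(parity_pct, cost_delta_pct, confidence, unusable_share):
    """True when a candidate meets every cost-saving condition.

    parity_pct:      quality vs. the current model (current model = 100)
    cost_delta_pct:  cost change vs. the current model (-20 means 20% cheaper)
    confidence:      "low", "medium" or "high" (assumed tier names)
    unusable_share:  fraction of replays judged unusable, from 0.0 to 1.0
    """
    return (
        parity_pct >= 85
        and cost_delta_pct <= -20
        and CONFIDENCE_RANK[confidence] >= CONFIDENCE_RANK["medium"]
        and unusable_share <= 0.10
    )
```

All conditions are joined with `and`, so a candidate that is much cheaper but lands at 84% parity is still rejected.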

Full benchmark details

Open any experiment to see the verdict, monthly-savings estimate, and the models-tested matrix (accuracy / cost Δ / latency Δ / errors vs. your current model). Expand Full benchmark details for:
  • Per-step routing plan — one verdict per recurring LLM call site (planner, tool-caller, summariser, …), grouped by system prompt + toolset.
  • Replayed runs — the actual baseline-vs-candidate outputs for each replayed step, with the judge’s note.
Use Share (the export icon) to print a clean PDF snapshot of a single experiment.
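
Grouping call sites by system prompt + toolset can be sketched as deriving a stable key from those two inputs. How Orbitrage actually builds the key is not documented here; the hashing scheme, the record keys (`system_prompt`, `tools`) and both function names below are assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def call_site_key(system_prompt, tool_names):
    """Stable key for a call site: system prompt plus an order-insensitive toolset."""
    # Sort and de-duplicate tool names so their order does not change the key.
    payload = system_prompt + "\x00" + "\x00".join(sorted(set(tool_names)))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def group_steps(steps):
    """steps: iterable of dicts with hypothetical keys system_prompt and tools."""
    groups = defaultdict(list)
    for step in steps:
        groups[call_site_key(step["system_prompt"], step["tools"])].append(step)
    return dict(groups)
```

With a key like this, a planner step and a summariser step end up in different groups even when they share a model, so each group can carry its own verdict.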

Settings

Open Settings to control automatic benchmarking:
| Setting | Meaning |
| --- | --- |
| Enable benchmarking | Master switch. When off, no traffic is sampled or replayed. |
| Sample rate | % of traffic replayed as scheduled benchmarks. |
| Run benchmarks | Cadence: daily / weekly / biweekly / monthly. |
| Accuracy threshold | Only suggest a switch when the accuracy drop stays within this limit. |
| Max spend per run | Never spend above this on a single benchmark run. |
Replays are tagged source = playground so they’re billed and visible in the Router, but kept out of your production Overview, Traces, and analytics.
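
If you export run records and want to reproduce this split yourself, the tag makes it a one-line filter. The sketch below assumes each record is a dict with a `source` key; that record shape and both helper names are hypothetical, and only the `playground` value comes from the text above.

```python
def production_only(runs):
    """Drop Router replays, keeping what production views would show."""
    return [run for run in runs if run.get("source") != "playground"]


def router_replays(runs):
    """Keep only the replays generated by Router experiments."""
    return [run for run in runs if run.get("source") == "playground"]
```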