How it works
Sample
A weighted sample of the project’s recent production runs (whole traces, not
just single calls) is frozen for the experiment. Text traffic only, last 30 days.
Replay
The exact recorded inputs are re-sent to each candidate model. Recorded tool
results are injected so multi-step runs replay faithfully. Replays are billed to
your org credits like any inference and never appear in production analytics.
Judge
An LLM judge scores every replay against your production output on correctness,
completeness, instruction-following and format — producing a parity % (quality
vs. your current model = 100) and a verdict (better / parity / worse / unusable).
The page
Three cards sit at the top, computed from yourrouting_steps over the last 90 days:
| Card | Shows |
|---|---|
| Traffic distribution by model | Your top 5 models by request volume (share %). Hover a bar to see that model’s spend. |
| Frontier spend | Spend on closed, frontier-provider models (OpenAI, Anthropic, Google, …). |
| Open-source spend | Spend on open-source models (DeepSeek, Qwen, Llama, Mistral, …) — your cheaper alternative. |
Running an experiment
Click New Experiment, pick a project, and set:| Field | Meaning |
|---|---|
| Models to test | Open-source models (cheapest) or All models. |
| Sample rate | % of the project’s recent traffic to replay. |
| Budget | A hard cap on what the run can spend — it stops before exceeding it. |
Outcomes
A finished experiment lands in one of these states — a completed run that finds no cheaper model is Optimal, not a failure:| Status | Meaning |
|---|---|
| Running | Benchmark in progress. |
| Suggested | A cheaper model held your accuracy — review and apply. |
| Optimal | No candidate held accuracy — your current routing is already well-optimised. |
| Low data | Not enough replayable traffic to reach a confident verdict. |
| Failed | The run errored out (see the panel for why). |
The recommendation gate
Orbitrage only suggests a switch when it can stand behind it — never a hedged weak recommendation:- Quality parity ≥ 85% of your current model, and
- Cost ≤ −20% (meaningfully cheaper), and
- Confidence ≥ medium (a function of sample size and variance), with ≤ ~10% unusable verdicts.
Full benchmark details
Open any experiment to see the verdict, monthly-savings estimate, and the models-tested matrix (accuracy / cost Δ / latency Δ / errors vs. your current model). Expand Full benchmark details for:- Per-step routing plan — one verdict per recurring LLM call site (planner, tool-caller, summariser, …), grouped by system prompt + toolset.
- Replayed runs — the actual baseline-vs-candidate outputs for each replayed step, with the judge’s note.
Settings
Open Settings to control automatic benchmarking:| Setting | Meaning |
|---|---|
| Enable benchmarking | Master switch — off means no traffic is sampled or replayed. |
| Sample rate | % of traffic replayed as scheduled benchmarks. |
| Run benchmarks | Cadence — daily / weekly / biweekly / monthly. |
| Accuracy threshold | Only suggest a switch within this accuracy drop. |
| Max spend per run | Never spend above this on a single benchmark run. |