Audio works through the standard OpenAI Audio API. Point your client at Orbitrage and each request is routed to a managed Deepgram model, with no BYOK or extra account needed. Orbitrage bills the call (per minute of audio for speech-to-text, per 1,000 characters for text-to-speech) and traces it alongside your chat and image calls.
Deepgram audio is included with Orbitrage as a managed service: your prepaid credits cover it at the provider rate plus the standard 2.5% infra markup. You can also bring your own audio provider via BYOK.

Speech-to-text (transcription)

Use the OpenAI transcription endpoint. Set model to a Deepgram speech model (default nova-3):
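import os, orbitrage
# Initialize Orbitrage with your API key before creating the OpenAI client.
orbitrage.init(os.environ["ORBITRAGE_API_KEY"], user_id="customer_42")

from openai import OpenAI
client = OpenAI()

# Transcribe a local audio file with the default Deepgram speech model.
with open("call.wav", "rb") as f:
    tx = client.audio.transcriptions.create(model="nova-3", file=f)
print(tx.text)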
The response is OpenAI-compatible:
{ "text": "Yeah, as much as it's worth celebrating…", "duration": 25.93, "model": "nova-3", "provider": "deepgram" }
Pass response_format="verbose_json" to receive Deepgram’s full payload (words, timestamps, confidence).
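As a sketch of what reading word-level timestamps could look like: the nested layout used here (results, channels, alternatives, words) follows Deepgram's pre-recorded transcription response. Whether Orbitrage returns that payload unchanged is an assumption, and both the `extract_words` helper and the sample payload are hypothetical, written only for illustration.

```python
from typing import Any


def extract_words(
    payload: dict[str, Any], min_confidence: float = 0.0
) -> list[tuple[str, float, float]]:
    """Return (word, start, end) tuples from a Deepgram-style payload.

    Assumes the layout results -> channels[0] -> alternatives[0] -> words.
    Returns an empty list if any level is missing.
    """
    channels = payload.get("results", {}).get("channels", [])
    if not channels:
        return []
    alternatives = channels[0].get("alternatives", [])
    if not alternatives:
        return []
    words = alternatives[0].get("words", [])
    return [
        (w["word"], w["start"], w["end"])
        for w in words
        if w.get("confidence", 0.0) >= min_confidence
    ]


# Hand-written sample payload (not real API output).
sample = {
    "results": {
        "channels": [
            {
                "alternatives": [
                    {
                        "transcript": "yeah as",
                        "words": [
                            {"word": "yeah", "start": 0.08, "end": 0.32, "confidence": 0.99},
                            {"word": "as", "start": 0.32, "end": 0.48, "confidence": 0.45},
                        ],
                    }
                ]
            }
        ]
    }
}

for word, start, end in extract_words(sample, min_confidence=0.5):
    print(f"{start:6.2f}-{end:6.2f}  {word}")
```

Filtering on confidence is optional; it is included because the verbose payload is described as carrying a confidence value per word.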

Text-to-speech

Use the OpenAI speech endpoint. Set model (or voice) to a Deepgram Aura voice (default aura-2-thalia-en). The audio streams back for low latency:
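# The OpenAI SDK requires a voice argument; here it names the same Aura voice as model.
# with_streaming_response writes the audio to disk as it arrives.
with client.audio.speech.with_streaming_response.create(
    model="aura-2-thalia-en",
    voice="aura-2-thalia-en",
    input="Hello from Orbitrage. This voice is managed for you.",
    response_format="mp3",
) as resp:
    resp.stream_to_file("hello.mp3")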
response_format maps to a Deepgram container/encoding: mp3 (default), wav, opus, flac, aac.
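To make the mapping concrete, here is a minimal sketch of how each response_format value might translate into Deepgram text-to-speech parameters. The `encoding` and `container` names come from Deepgram's speech API, but the exact pairs Orbitrage sends are an assumption, and `FORMAT_MAP` and `deepgram_params` are hypothetical names.

```python
# Hypothetical mapping from OpenAI response_format values to Deepgram
# encoding/container parameters. The pairs Orbitrage actually uses may differ.
FORMAT_MAP: dict[str, dict[str, str]] = {
    "mp3": {"encoding": "mp3"},
    "wav": {"encoding": "linear16", "container": "wav"},
    "opus": {"encoding": "opus", "container": "ogg"},
    "flac": {"encoding": "flac"},
    "aac": {"encoding": "aac"},
}


def deepgram_params(response_format: str = "mp3") -> dict[str, str]:
    """Return a copy of the Deepgram parameters for a response_format value."""
    try:
        return dict(FORMAT_MAP[response_format])
    except KeyError:
        supported = ", ".join(FORMAT_MAP)
        raise ValueError(
            f"unsupported response_format {response_format!r}; expected one of: {supported}"
        ) from None


print(deepgram_params("wav"))
```

Validating the format on the client side gives a clear error before any request is sent.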

Models

| Model | Type | Use |
| --- | --- | --- |
| nova-3 | Speech-to-text | Fastest accurate transcription (default) |
| nova-3-multilingual | Speech-to-text | 30+ languages |
| nova-3-medical | Speech-to-text | Clinical vocabulary |
| nova-2 | Speech-to-text | Cheaper general-purpose |
| aura-2-thalia-en | Text-to-speech | Natural English voice (default) |
| aura-2-*-en | Text-to-speech | Other Aura-2 voices |
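A client could use the table above to check a model name before sending a request. The sketch below assumes the listed names and the aura-2-*-en pattern are the complete set; `model_type` is a hypothetical helper, not part of Orbitrage, and `aura-2-example-en` is a placeholder rather than a real voice name.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns copied from the models table; assumed to be exhaustive.
MODEL_PATTERNS: list[tuple[str, str]] = [
    ("nova-3", "speech-to-text"),
    ("nova-3-multilingual", "speech-to-text"),
    ("nova-3-medical", "speech-to-text"),
    ("nova-2", "speech-to-text"),
    ("aura-2-*-en", "text-to-speech"),
]


def model_type(model: str) -> Optional[str]:
    """Return the audio type for a model name, or None if it is not listed."""
    for pattern, kind in MODEL_PATTERNS:
        if fnmatchcase(model, pattern):
            return kind
    return None


print(model_type("nova-3"))
print(model_type("aura-2-example-en"))  # placeholder voice name
```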

Billing & tracing

Every audio call records a routing_steps row with tier: "audio", provider: "deepgram", the model, the exact cost, and latency, so it appears in your dashboard analytics and in the workflow trajectory graph next to your chat, tool, and image calls. Speech-to-text is billed per minute of processed audio and text-to-speech per 1,000 characters synthesized; both include the 2.5% markup.
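The billing rule can be sketched as simple arithmetic. The provider rates below are placeholders, not real Deepgram prices, and the sketch assumes cost scales linearly with duration or character count with no rounding. How Orbitrage rounds, if at all, is not stated here.

```python
MARKUP = 0.025  # the standard 2.5% infra markup


def stt_cost(duration_seconds: float, rate_per_minute: float) -> float:
    """Estimated speech-to-text cost: minutes of audio x provider rate, plus markup."""
    return duration_seconds / 60 * rate_per_minute * (1 + MARKUP)


def tts_cost(num_chars: int, rate_per_1k_chars: float) -> float:
    """Estimated text-to-speech cost: thousands of characters x provider rate, plus markup."""
    return num_chars / 1000 * rate_per_1k_chars * (1 + MARKUP)


# Placeholder rates for illustration only.
HYPOTHETICAL_STT_RATE = 0.01  # per minute of audio
HYPOTHETICAL_TTS_RATE = 0.03  # per 1,000 characters

text = "Hello from Orbitrage. This voice is managed for you."
print(f"STT estimate: {stt_cost(25.93, HYPOTHETICAL_STT_RATE):.6f}")
print(f"TTS estimate: {tts_cost(len(text), HYPOTHETICAL_TTS_RATE):.6f}")
```

The exact cost recorded in the routing_steps row remains the authoritative figure; an estimate like this is only useful for budgeting ahead of a call.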