Audio and speech models

Text-to-speech, GPT audio chat, realtime sessions, transcription models, and voice settings.

Endpoint: http://omixa.cloud/api/v1/audio

Chirp

chirp

Chirp for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Minimum hold: $0.01

Chirp 2

chirp-2

Chirp 2 for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Minimum hold: $0.01

Chirp 3

chirp-3

Chirp 3 for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Minimum hold: $0.01

GPT Audio

gpt-audio

GPT Audio for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Context window: 128,000 tokens
Max output: 16,384 tokens
Input: $2.50 per 1M tokens
Output: $10.00 per 1M tokens
Minimum hold: $0.01
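
The per-token rates listed for gpt-audio can be turned into a rough request cost with simple arithmetic. The following is a minimal sketch, assuming input and output tokens are billed independently and linearly at the listed rates. The function name and the sample token counts are illustrative and not part of the API. How the minimum hold relates to the final charge is not stated on this page, so it is left out.

```python
from decimal import Decimal

# Rates for gpt-audio, copied from the catalog entry (USD per 1M tokens).
INPUT_PER_1M = Decimal("2.50")
OUTPUT_PER_1M = Decimal("10.00")
ONE_MILLION = Decimal(1_000_000)


def estimate_token_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request.

    Assumption: billing is linear in token counts at the listed rates.
    """
    return (
        Decimal(input_tokens) * INPUT_PER_1M
        + Decimal(output_tokens) * OUTPUT_PER_1M
    ) / ONE_MILLION


# Example: 12,000 input tokens and 3,000 output tokens.
cost = estimate_token_cost(12_000, 3_000)
print(f"${cost:.4f}")  # prints $0.0600
```

Decimal is used instead of float so that small per-token amounts do not pick up binary rounding error when summed across many requests.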

GPT Audio 1.5

gpt-audio-1.5

GPT Audio 1.5 for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Context window: 128,000 tokens
Max output: 16,384 tokens
Input: $2.50 per 1M tokens
Output: $10.00 per 1M tokens
Minimum hold: $0.01

GPT Audio Mini

gpt-audio-mini

GPT Audio Mini for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Context window: 128,000 tokens
Max output: 16,384 tokens
Input: $0.60 per 1M tokens
Output: $2.40 per 1M tokens
Minimum hold: $0.01

GPT Realtime

gpt-realtime

GPT Realtime for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Context window: 32,000 tokens
Max output: 4,096 tokens
Input: $4.00 per 1M tokens
Cached input: $0.40 per 1M tokens
Output: $16.00 per 1M tokens
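
The realtime entries add a separate cached-input rate. The sketch below shows how that rate could change an estimate for gpt-realtime. It assumes that cached tokens are a subset of the input tokens and are billed at the cached rate instead of the full input rate; this page does not confirm that billing rule. The function and parameter names are illustrative.

```python
from decimal import Decimal

# gpt-realtime rates from the catalog entry (USD per 1M tokens).
RATES = {
    "input": Decimal("4.00"),
    "cached_input": Decimal("0.40"),
    "output": Decimal("16.00"),
}


def realtime_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> Decimal:
    """Estimate USD cost for one session turn.

    Assumption: cached_tokens is the part of input_tokens served from cache,
    and it is billed at the cached rate rather than the full input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * RATES["input"]
        + cached_tokens * RATES["cached_input"]
        + output_tokens * RATES["output"]
    )
    return total / Decimal(1_000_000)


# 20,000 input tokens (15,000 of them cached) and 2,000 output tokens.
with_cache = realtime_cost(20_000, 15_000, 2_000)
without_cache = realtime_cost(20_000, 0, 2_000)
print(with_cache, without_cache)
```

Under these assumptions the cached variant comes to $0.058 against $0.112 without caching, which illustrates why long sessions with a stable prefix tend to benefit most from the cached rate.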

GPT Realtime 1.5

gpt-realtime-1.5

GPT Realtime 1.5 for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Context window: 32,000 tokens
Max output: 4,096 tokens
Input: $4.00 per 1M tokens
Cached input: $0.40 per 1M tokens
Output: $16.00 per 1M tokens

GPT Realtime 2

gpt-realtime-2

GPT Realtime 2 for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Context window: 32,000 tokens
Max output: 4,096 tokens
Input: $4.00 per 1M tokens
Cached input: $0.40 per 1M tokens
Output: $24.00 per 1M tokens

GPT Realtime Mini

gpt-realtime-mini

GPT Realtime Mini for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Context window: 32,000 tokens
Max output: 4,096 tokens
Input: $0.60 per 1M tokens
Cached input: $0.06 per 1M tokens
Output: $2.40 per 1M tokens

GPT-4o Audio Preview

gpt-4o-audio-preview

GPT-4o Audio Preview for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Context window: 128,000 tokens
Max output: 16,384 tokens
Input: $2.50 per 1M tokens
Output: $10.00 per 1M tokens
Minimum hold: $0.01

GPT-4o Mini Audio Preview

gpt-4o-mini-audio-preview

GPT-4o Mini Audio Preview for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Context window: 128,000 tokens
Max output: 16,384 tokens
Input: $0.15 per 1M tokens
Output: $0.60 per 1M tokens
Minimum hold: $0.01

GPT-4o Mini Realtime Preview

gpt-4o-mini-realtime-preview

GPT-4o Mini Realtime Preview for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Context window: 128,000 tokens
Max output: 4,096 tokens
Input: $0.60 per 1M tokens
Cached input: $0.30 per 1M tokens
Output: $2.40 per 1M tokens

GPT-4o Mini TTS

gpt-4o-mini-tts

GPT-4o Mini TTS for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Input: $0.60 per 1M tokens
Output: $12.00 per 1M tokens
Audio: $0.02 per minute

GPT-4o Mini Transcribe

gpt-4o-mini-transcribe

GPT-4o Mini Transcribe for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Input: $1.25 per 1M tokens
Output: $5.00 per 1M tokens
Audio: $0.003 per minute
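
Entries with an "audio per minute" rate can be estimated from the clip duration. This is a minimal sketch for gpt-4o-mini-transcribe, assuming fractional minutes are billed proportionally with no rounding up. The entry also lists per-token rates, and this page does not say which rate applies to a given request, so treat the result as a per-minute estimate only. The function name is illustrative.

```python
from decimal import Decimal

# gpt-4o-mini-transcribe audio rate from the catalog entry (USD per minute).
AUDIO_PER_MINUTE = Decimal("0.003")


def audio_minutes_cost(
    duration_seconds: int, rate_per_minute: Decimal = AUDIO_PER_MINUTE
) -> Decimal:
    """Estimate USD cost from audio duration.

    Assumption: partial minutes are billed proportionally, not rounded up.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    minutes = Decimal(duration_seconds) / Decimal(60)
    return minutes * rate_per_minute


# A 45-minute recording (2,700 seconds).
print(audio_minutes_cost(2_700))  # prints 0.135
```

Passing a different `rate_per_minute` reuses the same calculation for other per-minute entries, for example `Decimal("0.006")` for gpt-4o-transcribe.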

GPT-4o Realtime Preview

gpt-4o-realtime-preview

GPT-4o Realtime Preview for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Context window: 32,000 tokens
Max output: 4,096 tokens
Input: $5.00 per 1M tokens
Cached input: $2.50 per 1M tokens
Output: $20.00 per 1M tokens

GPT-4o Transcribe

gpt-4o-transcribe

GPT-4o Transcribe for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Input: $2.50 per 1M tokens
Output: $10.00 per 1M tokens
Audio: $0.006 per minute

GPT-4o Transcribe Diarize

gpt-4o-transcribe-diarize

GPT-4o Transcribe Diarize for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Input: $2.50 per 1M tokens
Output: $10.00 per 1M tokens
Minimum hold: $0.01

Gemini 2.0 Flash Live

gemini-2.0-flash-live-001

Gemini 2.0 Flash Live for speech, transcription, translation, or voice generation workflows.

Tags: Audio, Streaming, Tools
Context window: 1,048,576 tokens
Max output: 8,192 tokens
Input: $0.50 per 1M tokens
Output: $2.00 per 1M tokens
Audio: $0.018 per minute

Gemini 2.5 Flash Live Preview

gemini-2.5-flash-live-preview

Gemini 2.5 Flash Live Preview for speech, transcription, translation, or voice generation workflows.

Tags: Audio, Streaming, Tools
Context window: 1,048,576 tokens
Max output: 8,192 tokens
Input: $0.50 per 1M tokens
Output: $2.00 per 1M tokens
Audio: $0.018 per minute

Gemini 2.5 Flash TTS

gemini-2.5-flash-tts

Gemini 2.5 Flash TTS for speech, transcription, translation, or voice generation workflows.

Tags: Audio, Streaming
Streaming: supported
Reasoning controls: minimal, low, medium, high
Input: $0.50 per 1M tokens
Output: $10.00 per 1M tokens
Audio: $0.015 per minute

Gemini 2.5 Flash-Lite TTS Preview

gemini-2.5-flash-lite-preview-tts

Gemini 2.5 Flash-Lite TTS Preview for speech, transcription, translation, or voice generation workflows.

Tags: Audio, Streaming
Streaming: supported
Reasoning controls: minimal, low, medium, high
Input: $0.50 per 1M tokens
Output: $10.00 per 1M tokens
Audio: $0.015 per minute

Gemini 2.5 Pro TTS

gemini-2.5-pro-tts

Gemini 2.5 Pro TTS for speech, transcription, translation, or voice generation workflows.

Tags: Audio, Streaming
Streaming: supported
Reasoning controls: low, medium, high
Input: $1.00 per 1M tokens
Output: $20.00 per 1M tokens
Audio: $0.03 per minute

Gemini 3.1 Flash Live Preview

gemini-3.1-flash-live-preview

Gemini 3.1 Flash Live Preview for speech, transcription, translation, or voice generation workflows.

Tags: Audio, Streaming, Tools
Context window: 1,048,576 tokens
Max output: 8,192 tokens
Input: $0.75 per 1M tokens
Output: $4.50 per 1M tokens
Audio: $0.018 per minute

Gemini 3.1 Flash TTS Preview

gemini-3.1-flash-tts-preview

Gemini 3.1 Flash TTS Preview for speech, transcription, translation, or voice generation workflows.

Tags: Audio, Streaming
Streaming: supported
Reasoning controls: minimal, low, medium, high
Input: $1.00 per 1M tokens
Output: $20.00 per 1M tokens
Audio: $0.03 per minute
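
The Gemini TTS entries do not all accept the same reasoning controls: gemini-2.5-pro-tts lists no "minimal" level. A client-side check before sending a request could look like the sketch below. The lookup table is copied from the entries on this page, but the helper name is hypothetical, and how the chosen level is passed to the API is not documented here.

```python
# Reasoning controls per model ID, as listed in the catalog entries.
REASONING_LEVELS = {
    "gemini-2.5-flash-tts": ("minimal", "low", "medium", "high"),
    "gemini-2.5-flash-lite-preview-tts": ("minimal", "low", "medium", "high"),
    "gemini-2.5-pro-tts": ("low", "medium", "high"),
    "gemini-3.1-flash-tts-preview": ("minimal", "low", "medium", "high"),
}


def check_reasoning_level(model: str, level: str) -> str:
    """Return level unchanged if the catalog lists it for model, else raise.

    Hypothetical helper: it only mirrors the catalog and makes no API call.
    """
    allowed = REASONING_LEVELS.get(model)
    if allowed is None:
        raise KeyError(f"no reasoning controls listed for {model!r}")
    if level not in allowed:
        raise ValueError(
            f"{model} does not list {level!r}; choose one of {', '.join(allowed)}"
        )
    return level


print(check_reasoning_level("gemini-2.5-flash-tts", "minimal"))  # prints minimal
```

Calling `check_reasoning_level("gemini-2.5-pro-tts", "minimal")` raises `ValueError`, which surfaces the mismatch before any request is made.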

TTS

tts

TTS for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Audio: $0.02 per minute
Minimum hold: $0.01

TTS HD

tts-hd

TTS HD for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Audio: $0.02 per minute
Minimum hold: $0.01

Whisper

whisper

Whisper for speech, transcription, translation, or voice generation workflows.

Tags: Audio
Audio: $0.02 per minute
Minimum hold: $0.01