On May 7, 2026, OpenAI launched three new audio models inside its Realtime API — a step the company framed as moving live voice from demoware into an enterprise-ready surface. The headline model, GPT-Realtime-2, runs on GPT-5-class reasoning and sits alongside two specialist siblings: GPT-Realtime-Translate for live multilingual conversation, and GPT-Realtime-Whisper for streaming speech-to-text. The Realtime API itself has been generally available since August 28, 2025, when it graduated from beta alongside the original gpt-realtime model.
The release lands during a sprint of voice-agent launches across the industry, but OpenAI is leaning on raw capability rather than packaging. GPT-Realtime-2 ships with a 128K context window, up from 32K in the previous generation, and adds parallel tool calls, short spoken preambles like "let me check that," and recovery behavior when a task fails midway through a call.
What the three models do
GPT-Realtime-2 is positioned as the agentic voice model. According to OpenAI, it is designed to manage context across longer conversations, use tools, accept corrections, and avoid the rigid turn-taking that plagued earlier voice systems. The company reported a roughly 15-point improvement on internal audio benchmarks compared with the prior realtime model.
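The tool-use behavior described above maps onto the Realtime API's existing session configuration, where tools are declared over the WebSocket with a `session.update` event. A minimal sketch of that setup, with the caveat that the model identifier and the `lookup_order` tool are illustrative assumptions, not published values:

```python
import json

# Hypothetical session configuration for a voice agent with tool use.
# The model is typically selected at connection time (e.g. in the WebSocket
# URL); "gpt-realtime-2" is assumed here — check the model list for the
# actual identifier. The lookup_order tool is invented for illustration.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "You are a support agent. Briefly say what you are doing "
            "before calling a tool."
        ),
        "tools": [
            {
                "type": "function",
                "name": "lookup_order",
                "description": "Fetch an order's status by id.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            }
        ],
        "tool_choice": "auto",
    },
}

# The payload is sent as a JSON text frame over the API's WebSocket.
frame = json.dumps(session_update)
```

With parallel tool calls, the model could dispatch several such function calls in one turn while keeping the spoken preamble ("let me check that") flowing.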
GPT-Realtime-Translate handles live cross-lingual speech, accepting input in more than 70 languages and producing output in 13. OpenAI is pitching it for customer support, cross-border sales calls, education, and event interpretation. GPT-Realtime-Whisper, meanwhile, brings streaming transcription to the API for use cases like meeting capture, accessibility tooling, and call-center pipelines.
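For the transcription use cases, a client consumes a stream of JSON events and accumulates transcript text as segments complete. A sketch of such a handler, assuming an event shape like the Realtime API's existing input-audio transcription events (the dedicated GPT-Realtime-Whisper model may emit a different schema):

```python
import json

# Assumed event type, modeled on the Realtime API's input-audio
# transcription events; the new model's actual schema may differ.
TRANSCRIPT_DONE = "conversation.item.input_audio_transcription.completed"

def handle_event(raw_frame: str, transcript_parts: list[str]) -> None:
    """Append completed transcript segments from a WebSocket text frame."""
    event = json.loads(raw_frame)
    if event.get("type") == TRANSCRIPT_DONE:
        transcript_parts.append(event["transcript"])

# Simulated incoming frame, as a meeting-capture pipeline might receive it.
parts: list[str] = []
handle_event(json.dumps({
    "type": TRANSCRIPT_DONE,
    "transcript": "Hello, I'd like to check on my order.",
}), parts)
```

A call-center pipeline would run this handler per frame and flush `parts` to storage or a live caption view.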
Pricing and availability
Pricing reflects the heavier reasoning load on the flagship model. GPT-Realtime-2 is billed at $32 per million audio input tokens, $0.40 per million cached input tokens, and $64 per million audio output tokens. The two specialist models are metered by the minute: $0.034 per minute for Translate and $0.017 per minute for Whisper. All three models are available immediately through the Realtime API.
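The published rates make cost estimates straightforward. A back-of-envelope sketch using illustrative token counts (actual audio-token usage varies with codec and speech rate):

```python
# Published per-unit rates for GPT-Realtime-2, converted to $ per token.
RATE_IN = 32 / 1_000_000        # audio input tokens
RATE_CACHED = 0.40 / 1_000_000  # cached input tokens
RATE_OUT = 64 / 1_000_000       # audio output tokens

def realtime2_cost(in_tokens: int, cached_tokens: int, out_tokens: int) -> float:
    """Cost in dollars for one GPT-Realtime-2 session."""
    return in_tokens * RATE_IN + cached_tokens * RATE_CACHED + out_tokens * RATE_OUT

def per_minute_cost(minutes: float, rate: float) -> float:
    """Cost for the minute-metered Translate/Whisper models."""
    return minutes * rate

# Illustrative session: 40K uncached + 10K cached input, 30K output tokens.
flagship = realtime2_cost(40_000, 10_000, 30_000)  # ≈ $1.28 + $0.004 + $1.92

# A 20-minute call on each metered model.
translate = per_minute_cost(20, 0.034)  # ≈ $0.68
whisper = per_minute_cost(20, 0.017)    # ≈ $0.34
```

The gap is stark: the flagship session above costs several times a comparable-length metered call, which is why OpenAI is steering translation-only and transcription-only workloads to the specialist models.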
OpenAI also says it has embedded safeguards that automatically halt conversations when its content guidelines are violated, a response to ongoing concerns about voice cloning, fraud, and synthetic-call abuse.
Why it matters
Voice has lagged text in agentic AI, largely because realtime systems struggled to reason while staying in sync with a speaker. By compressing GPT-5-class reasoning, a 4x larger context window, and parallel tool use into a single conversational model, OpenAI is signaling that the next wave of voice agents will be able to handle support tickets, translate cross-border calls, and run live multi-step workflows without falling back to a script.
For enterprises sitting on top of legacy IVR stacks, the metered translation and transcription models are the cleaner upgrade path. For developers building on a Realtime API that has been GA for months, the new models remove the last capability-based excuse for keeping voice features in private beta — and tighten the competitive vise on rivals shipping voice products this quarter.



