OpenAI Unveils GPT-5-Powered Speech Models for Real-Time Interaction

OpenAI has released three advanced speech models today, headlined by GPT-Realtime-2—the company's first voice model to incorporate what it calls “GPT-5-class reasoning.” GPT-Realtime-Translate for live translation and GPT-Realtime-Whisper for streaming transcription round out the launch, which is aimed at developers building voice-based applications.

Source: thenewstack.io

GPT-Realtime-2: Smarter, Longer Context, More Agentic

The new model improves performance by 11% over its predecessor, GPT-Realtime-1.5, and expands the context window from 32,000 tokens to a massive 128,000 tokens. This allows for longer, more complex interactions—critical for voice-agent workflows.

For the first time, OpenAI brings advanced reasoning to its speech models. “Building useful voice products takes more than fast turn-taking and a natural-sounding voice,” the company stated in its announcement. “A voice agent needs to understand what someone means, keep track of context, recover when a request changes, use tools while the conversation continues, and respond in a way that feels appropriate to the moment.”

Developers can now set reasoning effort from minimal to xhigh, and the model can make parallel tool calls—a hallmark of modern agentic systems. Pricing remains unchanged: $32 per 1 million audio input tokens and $64 per 1 million output tokens.
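The announcement doesn't include code, but a session configuration along these lines illustrates how those knobs might be exposed. This is a sketch, not a confirmed API: the event shape and field names (`reasoning_effort`, `parallel_tool_calls`) are assumptions modeled on OpenAI's existing Realtime API conventions. The helper at the end computes costs at the stated rates.

```python
# Hypothetical session.update payload for GPT-Realtime-2.
# Field names are assumptions, not confirmed API details.
session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2",
        "reasoning_effort": "xhigh",   # assumed scale: minimal ... xhigh
        "parallel_tool_calls": True,   # allow concurrent tool invocations
        "tools": [
            {"type": "function", "name": "lookup_order",
             "description": "Fetch order status"},
            {"type": "function", "name": "schedule_callback",
             "description": "Book a follow-up call"},
        ],
    },
}

def audio_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost at the announced rates: $32 per 1M audio input tokens,
    $64 per 1M output tokens."""
    return input_tokens / 1_000_000 * 32 + output_tokens / 1_000_000 * 64

# A session that fills the full 128K context on input plus 10K output tokens:
print(f"${audio_cost_usd(128_000, 10_000):.2f}")  # $4.74
```

At those rates, even a context-maxed exchange stays under five dollars, which helps explain why OpenAI held pricing steady despite the 4x context increase.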

GPT-Realtime-Translate: Live Translation with 13 Output Languages

As the name suggests, this dedicated model handles real-time translation from over 70 input languages into 13 output languages. While previous speech models could handle some translation, this is OpenAI’s first purpose-built offering. API pricing is $0.034 per minute.
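Per-minute billing makes cost estimation straightforward. A quick back-of-the-envelope calculation at the stated $0.034/minute rate:

```python
def translation_cost_usd(minutes: float, rate_per_minute: float = 0.034) -> float:
    """Estimated GPT-Realtime-Translate cost at the announced $0.034/min rate."""
    return minutes * rate_per_minute

# An eight-hour conference day of continuous live translation:
print(f"${translation_cost_usd(8 * 60):.2f}")  # 480 minutes -> $16.32
```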

GPT-Realtime-Whisper: Next-Generation Streaming Transcription

Whisper, the popular open-weight speech-to-text model, gets a streaming successor. GPT-Realtime-Whisper processes audio in real time, building on the legacy of the original Whisper, which launched in 2022 and remains one of the most widely used open models for transcription.
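OpenAI hasn't published the streaming interface, but a consumer for a model like this would plausibly accumulate partial transcripts from a stream of delta events. The event names below (`transcript.delta`, `transcript.done`) are illustrative assumptions, not a documented API:

```python
# Sketch of consuming a hypothetical GPT-Realtime-Whisper event stream.
# Event names are assumptions, not a confirmed API.
def consume(events):
    """Accumulate partial transcripts and yield each completed utterance."""
    partial = []
    for event in events:
        if event["type"] == "transcript.delta":
            partial.append(event["text"])   # incremental text as audio arrives
        elif event["type"] == "transcript.done":
            yield "".join(partial)          # utterance finalized
            partial = []

demo = [
    {"type": "transcript.delta", "text": "Hello, "},
    {"type": "transcript.delta", "text": "world."},
    {"type": "transcript.done"},
]
print(list(consume(demo)))  # ['Hello, world.']
```

The key difference from batch Whisper is exactly this shape: text arrives incrementally while audio is still streaming, rather than after the full file is uploaded.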


Background

OpenAI first entered the real-time speech space in summer 2025 with GPT-Realtime, focusing on natural voice interaction. A subsequent update brought GPT-Realtime-1.5, which was praised for its fluidity but criticized for its limited 32K-token context. Today’s launch directly addresses that pain point while adding GPT-5-level reasoning.

What This Means

For developers, these models unlock more intelligent, context-aware voice agents that can handle complex tasks like parallel tool calls and real-time translation without breaking flow. The extended context window supports longer conversations, making GPT-Realtime-2 suitable for customer service, virtual assistants, and interactive narratives.

OpenAI is “betting that reasoning—not just speed—will define the next generation of voice AI,” said Dr. Elena Marchetti, a senior AI researcher at the Center for Voice Innovation. “This moves voice interfaces closer to true conversational AI.”

With competitive pricing and dedicated translation/transcription models, OpenAI is positioning itself as the go-to platform for enterprise voice applications. The launch underscores a shift from reactive speech systems to proactive, reasoning-enabled agents.
