OpenAI Adds GPT-Realtime-2 Voice Reasoning and Live Translation

According to OpenAI, its API now includes three realtime audio models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. OpenAI described GPT-Realtime-2 as "our first voice model with GPT-5-class reasoning," presenting it as able to handle harder requests and carry conversations forward naturally, and says it supports a 128k token context window. GPT-Realtime-Translate translates speech from 70+ input languages into 13 output languages in real time, and GPT-Realtime-Whisper provides streaming speech-to-text. OpenAI positions the models as building blocks for live translation, streaming transcription, and more interactive voice agents.
What happened
OpenAI announced the three models as additions to its developer API. Per the announcement, GPT-Realtime-2 handles more complex requests, carries conversations forward, and supports a 128k token context window; GPT-Realtime-Translate converts speech from 70+ input languages into 13 output languages while keeping pace with the speaker; and GPT-Realtime-Whisper performs live streaming speech-to-text.
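OpenAI has not published wire-level details for these models in the announcement quoted above. The sketch below assumes the event shapes of OpenAI's existing Realtime API (`session.update`, `input_audio_buffer.append`) carry over to the new models, which is an assumption; it only builds the JSON payloads a client would send, without opening a connection.

```python
import base64
import json

def session_update_event(instructions: str, voice: str = "alloy") -> dict:
    """Build a session.update event configuring the realtime session."""
    return {
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "voice": voice,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
        },
    }

def audio_append_event(pcm16_chunk: bytes) -> dict:
    """Wrap a raw PCM16 chunk as a base64 input_audio_buffer.append event."""
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    }

# Serialize events as they would be sent over a WebSocket.
events = [
    session_update_event("You are a live interpreter."),
    audio_append_event(b"\x00\x01" * 160),  # 10 ms of fake 16 kHz mono PCM16
]
wire = [json.dumps(e) for e in events]
```

In a real client these strings would be written to a WebSocket connected to the Realtime endpoint with the chosen model name.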
Editorial analysis - technical context
Models that combine large context windows with realtime audio must balance memory against latency. As a general industry pattern, enabling a 128k token context increases compute and memory demands for state tracking, long-form summarization, and retrieval integration. Realtime translation and streaming transcription also require tight integration between automatic speech recognition (ASR) pipelines and downstream reasoning layers; teams building such systems face a tradeoff between end-to-end models and cascaded ASR-plus-NLP architectures.
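One common way teams manage the memory side of a large context window is to trim conversation history to a token budget before each turn. A minimal sketch, using a crude character-based token estimate (a production system would use the model's actual tokenizer):

```python
from collections import deque

CONTEXT_BUDGET = 128_000  # token budget reported for GPT-Realtime-2

def rough_token_count(text: str) -> int:
    """Crude proxy: roughly 1 token per 4 characters of text."""
    return max(1, len(text) // 4)

def trim_history(turns: list[str], budget: int = CONTEXT_BUDGET) -> list[str]:
    """Keep the most recent turns whose combined estimate fits the budget."""
    kept: deque = deque()
    total = 0
    for turn in reversed(turns):  # walk newest-first
        cost = rough_token_count(turn)
        if total + cost > budget:
            break
        kept.appendleft(turn)  # preserve chronological order
        total += cost
    return list(kept)
```

Dropping whole turns from the oldest end is the simplest policy; summarizing evicted turns into a synopsis is a common refinement when older context still matters.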
Industry context
For developers and product teams, pairing higher reasoning capacity with long context in voice models lowers the friction for multi-turn, task-oriented voice apps such as meeting assistants, live interpreters, and interactive customer support. Deploying such functionality in production typically raises operational questions around inference cost, latency SLAs, and data governance, especially when audio is processed in the cloud.
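Cost per realtime minute is one of those operational variables. A back-of-the-envelope estimator, with placeholder prices rather than any published rates for these models:

```python
def realtime_minute_cost(
    audio_in_per_min_usd: float,
    audio_out_per_min_usd: float,
    duplex_ratio: float = 0.5,
) -> float:
    """
    Estimate cost per wall-clock minute of a voice session.
    duplex_ratio is the fraction of each minute the model spends speaking.
    Prices here are illustrative placeholders, not provider rates.
    """
    return audio_in_per_min_usd + duplex_ratio * audio_out_per_min_usd

# Example with made-up prices: $0.06/min in, $0.24/min out, 40% talk time.
est = realtime_minute_cost(0.06, 0.24, duplex_ratio=0.4)
```

Even a rough model like this helps teams compare a realtime voice agent against cascaded transcribe-then-respond pipelines before committing to an architecture.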
What to watch
Metrics and signals observers should follow include latency and throughput benchmarks for GPT-Realtime-2 under realistic audio loads, pricing and rate limits in the OpenAI API, language coverage and quality for GPT-Realtime-Translate in lower-resource languages, and tooling for session state management and streaming retries. Also watch third-party benchmarks and developer feedback on integration complexity and cost per realtime minute.
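Streaming retries, one of the tooling gaps flagged above, usually reduce to re-opening a dropped connection with exponential backoff and jitter. A hedged sketch; the `stream_fn` callable and its `ConnectionError` contract are assumptions for illustration:

```python
import random
import time

def with_retries(stream_fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """
    Call stream_fn, re-trying on ConnectionError with exponential
    backoff plus jitter. The sleep parameter is injectable for testing.
    """
    for attempt in range(max_attempts):
        try:
            return stream_fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            sleep(delay)
```

For realtime audio, the retry path also has to decide how much buffered audio to replay after reconnecting, which is where session state management tooling matters.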
Scoring Rationale
This is a material product update from OpenAI: combining GPT-5-class reasoning with a very large context window for realtime voice expands capabilities for live translation, transcription, and multi-turn voice agents. Practitioners will weigh latency, cost, and integration complexity.
