Hugging Face And Cerebras Launch Open Speech-To-Speech AI Pipeline
Hugging Face and Cerebras open-sourced a full speech-to-speech AI pipeline on July 1, 2026, chaining Nvidia's Parakeet for speech recognition, Google DeepMind's Gemma 4 31B running on Cerebras hardware for reasoning, and Alibaba's Qwen3-TTS for voice output. The stack already powers more than 9,000 Reachy Mini robots in production. According to Cerebras, Gemma 4 31B runs at 1,851 tokens per second on its chips, about 35 times faster than a typical GPU endpoint, cutting the multi-second stalls that make conversational AI feel unreliable even when average response times look fine. Every stage of the pipeline is open and swappable, so developers can substitute models without rebuilding the whole system. For practitioners, it is a concrete, reproducible reference architecture for voice agents and embodied AI that does not require locking into a single vendor's models or infrastructure.
Voice AI has been bottlenecked less by model quality than by latency variance: systems that respond quickly on average still stall for seconds often enough to break the feel of a live conversation. This release is a concrete demonstration that an entirely open, swappable stack, with no single vendor controlling any layer, can match the response times normally associated with closed, proprietary assistants. That matters for any team building voice agents, robots, or embodied AI that wants to avoid single-vendor lock-in on models or inference hardware.
What happened
Hugging Face and Cerebras published an open cascaded speech-to-speech pipeline on Hugging Face's official blog on July 1, 2026, alongside code in the open speech-to-speech GitHub repository. The pipeline chains four stages: voice activity detection with Silero VAD, speech recognition via Nvidia's Parakeet-TDT, language understanding and generation via Google DeepMind's Gemma 4 31B running on Cerebras inference hardware, and text-to-speech via Alibaba's Qwen3-TTS. The companies said the same pipeline already powers more than 9,000 Reachy Mini robots in active use, giving the release a production track record rather than a benchmark-only claim.
Technical context
Hugging Face and Cerebras designed the stack around P95 tail latency rather than median speed, arguing that occasional multi-second stalls, not average response time, are what make conversational AI feel unreliable. According to Cerebras's own June 29, 2026 announcement, Gemma 4 31B runs on its wafer-scale chips at a record 1,851 output tokens per second as measured by Artificial Analysis, about 35 times faster than a typical GPU endpoint, with first-token latency of 1.5 seconds. Cerebras says Gemma 4 scores comparably to Claude Haiku 4.5 on the Artificial Analysis Intelligence Index while running roughly 18 times faster on its hardware.
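To make the median-versus-tail distinction concrete, here is a minimal sketch in Python. The latency samples are invented for illustration and are not measurements from the release. The `percentile` and `generation_seconds` helpers and the 60-token reply length are likewise assumptions. The only figures taken from the article are the 1,851 tokens per second and the roughly 35x ratio.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def generation_seconds(num_tokens, tokens_per_second):
    """Time to stream num_tokens at a steady output rate (excludes first-token latency)."""
    return num_tokens / tokens_per_second


# Hypothetical per-turn response times in seconds: 18 fast turns and 2 stalls.
latencies_s = [0.4] * 18 + [3.5, 4.0]

median_s = statistics.median(latencies_s)  # 0.4: the typical turn looks fine
p95_s = percentile(latencies_s, 95)        # 3.5: the stalls show up in the tail

# Throughput arithmetic using the article's figures and an assumed 60-token reply.
CEREBRAS_TOK_S = 1851.0
GPU_TOK_S = CEREBRAS_TOK_S / 35            # implied by "about 35 times faster"
fast_s = generation_seconds(60, CEREBRAS_TOK_S)  # about 0.03 s
slow_s = generation_seconds(60, GPU_TOK_S)       # about 1.13 s

print(f"median={median_s:.2f}s p95={p95_s:.2f}s")
print(f"60-token reply: {fast_s:.3f}s vs {slow_s:.3f}s")
```

In this made-up sample only one turn in ten stalls, so the median stays at 0.4 seconds while P95 jumps to 3.5 seconds, which is why a median-only report can hide the stalls a user actually notices.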
For practitioners
Every stage of the pipeline (VAD, STT, LLM, and TTS) is independently swappable, and the speech-to-speech package, installable via pip, supports alternative backends including local Whisper checkpoints, MLX for Apple Silicon, and OpenAI-compatible APIs. Teams building voice agents or robotics products can use this as a reference architecture, benchmarking their own latency budgets against a documented, production-deployed baseline rather than a closed API's black-box numbers.
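The idea of independently swappable stages can be sketched as follows. This is an illustrative outline only, not the speech-to-speech package's actual API. `CascadedPipeline`, the stage type aliases, and the stand-in stage functions are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional

# Hypothetical stage signatures; real backends would wrap actual models.
VadFn = Callable[[bytes], bool]   # does this audio chunk contain speech?
SttFn = Callable[[bytes], str]    # audio -> transcript
LlmFn = Callable[[str], str]      # transcript -> reply text
TtsFn = Callable[[str], bytes]    # reply text -> audio


@dataclass
class CascadedPipeline:
    """Chains VAD -> STT -> LLM -> TTS; each stage is just a callable."""
    vad: VadFn
    stt: SttFn
    llm: LlmFn
    tts: TtsFn

    def respond(self, audio: bytes) -> Optional[bytes]:
        if not self.vad(audio):
            return None  # no speech detected, so skip the later stages
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Toy stand-ins so the sketch runs without any models or network access.
def energy_vad(audio: bytes) -> bool:
    return any(audio)  # treats any non-zero byte as "speech"


def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def echo_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(vad=energy_vad, stt=fake_stt, llm=echo_llm, tts=fake_tts)
print(pipeline.respond(b"hello"))

# Swapping one stage leaves the other three untouched.
shouting = replace(pipeline, llm=lambda text: text.upper())
print(shouting.respond(b"hello"))
```

Because each stage only has to honor a small signature, replacing the LLM backend (or any other stage) is a one-field change, which is the property the article attributes to the open stack.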
What to watch
Cerebras says multimodal support, which Gemma 4 introduces to its platform, will extend to additional models going forward. The Reachy Mini deployment count, already past 9,000 with community reports nearing 10,000, is a rough proxy for how quickly the stack is being adopted outside the demo.
Editorial analysis
The release fits a broader pattern in open-source AI where latency and reliability, not just benchmark scores, are becoming a competitive axis against closed voice-assistant products from major labs. Whether the approach generalizes beyond Cerebras's chips depends on other inference providers matching its tail-latency profile, which the companies have not documented for competing hardware.
Key Points
- Hugging Face and Cerebras open-sourced a full speech-to-speech AI pipeline built from Nvidia, Google DeepMind, and Alibaba open models running on Cerebras inference chips.
- The design targets P95 tail latency, since occasional multi-second stalls, not average speed, are what make voice AI conversations feel broken.
- The pipeline already powers over 9,000 Reachy Mini robots, giving practitioners a production-tested, vendor-agnostic reference architecture instead of a proprietary black box.
Scoring Rationale
Well-documented open-source milestone confirmed in separate announcements by Hugging Face and Cerebras, with concrete, verifiable performance data (1,851 tok/s, ~35x a typical GPU endpoint) and a genuine production deployment across 9,000+ Reachy Mini robots, not just a benchmark demo. Kept in the 'solid' band rather than 'major' because it is a reference-architecture and partnership showcase, not a new model or capability.

