Speed of Sound Brings Offline Voice Typing to Linux

Speed of Sound is a new open-source Linux desktop app that provides fast, private, offline voice typing by running Whisper and other on-device ASR models locally. The app inserts transcriptions directly into the focused application using XDG Desktop Portals, supports multiple activation methods (including a global shortcut), and ships with a lightweight multilingual Whisper model by default with optional downloads for higher accuracy. It also offers optional post-processing with LLMs and supports self-hosted inference backends such as vLLM, Ollama, and llama.cpp. Packaged on Flathub, Snap, AppImage, Deb, and RPM, the tool targets accessibility, rapid drafting, and privacy-conscious users on both X11 and Wayland desktops.
What happened
Speed of Sound, an open-source Linux desktop application, delivers fast, offline voice typing by running on-device automatic speech recognition and typing the result into any focused app. The project, authored by Antonio Zugaldia and distributed via Flathub and other packages, ships with a built-in `Whisper` model and now supports additional model families such as `Parakeet` and `Canary` to balance latency and accuracy. The app uses XDG Desktop Portals to simulate typing across GNOME, KDE, X11, and Wayland, and it includes optional LLM-based text polishing and self-hosted inference backends for more advanced workflows.
Technical details
Speed of Sound performs transcription locally using on-device ASR with an ONNX runtime stack and JVM bindings. The app defaults to a lightweight `Whisper` model for broad multilingual coverage and low resource usage, while offering downloads for larger models to improve accuracy. Supported ASR families and integration points include:
- •`Whisper`, `Parakeet`, and `Canary` model families for offline transcription
- •Self-hosted LLM/text-polishing backends: vLLM, Ollama, llama.cpp
- •Cloud provider options when local compute is constrained
- •Desktop integration via XDG Desktop Portals for typed output across X11 and Wayland
Activation and UX are intentionally simple: users start recording with an in-app button, a global keyboard shortcut (the developer documentation cites Super+Z as an example), or system-tray controls, then stop recording to commit the transcription into the active text field. The app supports a primary and a secondary language and allows supplying contextual hints such as custom vocabulary and writing style to improve recognition. Recent Flathub release notes for version 0.12.0 added `NVIDIA Canary` and `Parakeet` support, fixed non-Latin script typing error propagation, and improved portal support detection.
Context and significance
Desktop voice typing has lagged behind mobile because of latency, privacy, and integration limits. The 2022 release of Whisper catalyzed a wave of on-device ASR projects; Speed of Sound packs that momentum into a polished Linux-native experience. For practitioners this matters because it makes private, offline speech-to-text accessible without cloud dependencies, and because it embraces self-hosted model stacks that teams can integrate into reproducible workflows. The combination of packaging on Flathub, Snap, and traditional Linux packages plus ONNX-based models means the app runs on both x86_64 and aarch64 hardware, with clear upgrade paths to GPU-accelerated or server-backed inference for heavier workloads.
Practical trade-offs
The app is designed for discrete recordings rather than continuous streaming dictation; users must trigger and stop recordings which reduces complexity but limits hands-free workflows. Accuracy scales with model size and compute: the bundled multilingual Whisper is a sensible default, but professional transcription or noisy environments will require larger models or cloud/self-hosted inference. Optional LLM polishing introduces privacy and compute considerations when using external cloud LLMs; Speed of Sound explicitly supports self-hosted LLMs to retain local control.
What to watch
Expect incremental improvements: streaming/real-time wake-word capture, tighter GPU acceleration on aarch64, more robust personal vocabulary adaptation, and enterprise packaging. Also watch for third-party extensions that automate post-processing pipelines and integrations with note-taking or accessibility tools.
Bottom line
Speed of Sound is a pragmatic, privacy-first voice typing tool for Linux that leverages modern on-device ASR ecosystems. It lowers the barrier for practitioners and power users to add offline voice input into everyday desktop workflows while offering clear upgrade paths to higher-accuracy or cloud-backed setups.
Scoring Rationale
This release makes on-device voice typing practical and private for Linux users and practitioners, lowering integration friction and supporting self-hosted inference. It is a solid productivity and accessibility win, but not a frontier research or infrastructure breakthrough, so it rates as a useful tools-level story.
Practice with real Telecom & ISP data
90 SQL & Python problems · 15 industry datasets
250 free problems · No credit card
See all Telecom & ISP problemsStep-by-step roadmaps from zero to job-ready — curated courses, salary data, and the exact learning order that gets you hired.


