feat(stt): in-process Parakeet STT backend (Apple Silicon / MLX)#462
Open
nickmeinhold wants to merge 1 commit into
Open
feat(stt): in-process Parakeet STT backend (Apple Silicon / MLX)#462nickmeinhold wants to merge 1 commit into
nickmeinhold wants to merge 1 commit into
Conversation
Add a `parakeet://` STT provider that transcribes in-process via `parakeet-mlx` (NVIDIA Parakeet on Apple's MLX), with no HTTP hop. Unlike the whisper.cpp / mlx-audio / OpenAI providers, which are reached over an OpenAI-compatible endpoint, this runs inference directly in the process. Select it by adding `parakeet://local` to VOICEMODE_STT_BASE_URLS. It slots into the existing STT failover chain as just another ordered endpoint, so the recommended config keeps whisper/OpenAI as fallback for the language tail (Parakeet v3 covers English + 25 European languages, not e.g. Thai): VOICEMODE_STT_BASE_URLS=parakeet://local,http://127.0.0.1:2022/v1,https://api.openai.com/v1 On an M-series Mac this measured ~3.7x faster than the whisper.cpp server (~0.19s vs ~1.05s on a ~9s clip, in-process) at equal-or-better English accuracy. - provider_discovery: detect `parakeet://` and treat it as a local provider - stt_backends/parakeet.py: lazy cached model load + threaded transcribe, preserving the audio file's read position; deferred import so non-Mac / missing-dep installs cleanly fall through to the next endpoint - simple_failover: dispatch `parakeet` provider to the in-process backend, inside the existing try/except so failover + empty-detection + metrics hold - config: VOICEMODE_PARAKEET_MODEL (default mlx-community/parakeet-tdt-0.6b-v3) - pyproject: `parakeet` optional extra, arm64-gated (no effect off Apple Silicon) - tests: detection, transcribe (text + position + load-once cache), failover dispatch, and graceful fall-through on backend failure Apple Silicon only, opt-in. Install with: uv tool install voice-mode[parakeet] Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds an in-process Parakeet STT backend for Apple Silicon. Add
parakeet://localtoVOICEMODE_STT_BASE_URLSand VoiceMode transcribes with NVIDIA's Parakeet ASR (mlx-community/parakeet-tdt-0.6b-v3) running directly via MLX, with no HTTP hop.Unlike the whisper.cpp / mlx-audio / OpenAI providers (reached over an OpenAI-compatible endpoint), this resolves in-process. It still slots into the existing STT failover chain as just another ordered endpoint, so failover, metrics, and empty-speech detection all keep working unchanged.
This follows the precedent of the
kokoro-onnxTTS backend (#261), one provider added behind the existing abstraction.Why
On Apple Silicon, Parakeet is dramatically faster than the whisper.cpp server for English at equal-or-better accuracy. Measured on an M-series Mac, same 8.78s clip, model resident:
About 3.7x faster through a server, ~5.5x in-process, and it correctly transcribed coined proper nouns ("Imagineering") and "Parakeet" from live mic input in real-world testing.
Design
provider_discovery.py: detectparakeet://and treat it as a local provider.stt_backends/parakeet.py(new): lazy cached model load (weights load once, not per call) + transcription offloaded to a thread viaasyncio.to_thread; preserves the audio file's read position; deferred import so non-Mac / missing-dep installs cleanly fall through to the next endpoint.simple_failover.py: dispatch theparakeetprovider to the in-process backend, inside the existing try/except so the failover contract holds.config.py:VOICEMODE_PARAKEET_MODEL(defaultmlx-community/parakeet-tdt-0.6b-v3).pyproject.toml: newparakeetoptional extra, arm64-gated so it has no effect on Intel Macs / Linux / Windows.Config
Recommended (keeps whisper/OpenAI as fallback for the language tail; Parakeet v3 covers English plus 25 European languages, not e.g. Thai):
Apple Silicon only, opt-in. Install with the extra:
Tests
tests/test_parakeet_backend.py(5 tests): provider detection, transcription (text + position preservation + load-once caching), failover dispatch, and graceful fall-through when the backend raises (e.g. dep missing). All pass.test_provider_selectionneedsOPENAI_API_KEYand fails identically onmaster).Heads-up (separate from this PR): mlx-audio's STT server is currently broken for Parakeet
While testing, I tried routing Parakeet through
mlx_audio.server's/v1/audio/transcriptions(the "unified server" path) and hit two bugs in the current release, which is why this PR runs Parakeet in-process instead:get_model_categorysplits the model name on., soparakeet-tdt-0.6b-v3becomes moduleparakeet_tdt_0(truncated at the.6) and it looks undertts.models->ModuleNotFoundError: No module named 'mlx_audio.tts.models.parakeet_tdt_0'.-asr-fp16variant.mlx-community/whisper-large-v3-turboloads but throws"Processor not found"at decode;mlx-community/whisper-large-v3-turbo-asr-fp16works.Flagging in case it affects the mlx-audio install path (VM-1330). Happy to file upstream on mlx-audio if useful.
Built with Claude Code. Glad to adjust naming, the
parakeet://scheme, or the config surface to match how you'd want it to land.