feat(plugins): add FunASR self-hosted STT plugin #6129
Adds `livekit-plugins-funasr`: a non-streaming STT plugin backed by [FunASR](https://github.com/modelscope/FunASR) (SenseVoice / Paraformer / Fun-ASR-Nano), running fully locally with no cloud API. Strong on Chinese and 50+ languages; SenseVoice also returns language/emotion/event tags. Implements `STT._recognize_impl` (combine frames -> FunASR -> SpeechEvent) and declares `STTCapabilities(streaming=False)`, so LiveKit wraps it with a VAD StreamAdapter for real-time agents. Tested: transcribes a Chinese clip via the STT interface and returns a FINAL_TRANSCRIPT event. Resolves livekit#5897.

    def _ensure_model(self):
        if self._model is None:
            from funasr import AutoModel

            kwargs = dict(model=self._opts.model, device=self._opts.device, hub=self._opts.hub, disable_update=True)
            if self._vad_model:
                kwargs.update(vad_model=self._vad_model, vad_kwargs={"max_single_segment_time": 30000})
            logger.info("loading FunASR model %s on %s", self._opts.model, self._opts.device)
            self._model = AutoModel(**kwargs)
        return self._model

🔴 No thread-safety for lazy model init and concurrent inference in thread executor
_ensure_model() is called from _run() which executes in a thread pool via run_in_executor (stt.py:98). There is no lock guarding the check-then-set on self._model (stt.py:58), so concurrent _recognize_impl calls can race: two threads both see self._model is None, both load the model (wasting resources and time), and one loaded instance is silently discarded. More critically, after the model is initialized, concurrent _run invocations will call model.generate() simultaneously on the same FunASR/PyTorch model instance. PyTorch models are not thread-safe for forward passes (they share internal buffers), which can cause crashes (especially on CUDA) or silently produce incorrect transcription results.
Prompt for agents
The _ensure_model method and subsequent model.generate() call in _run are not protected by any lock, yet they run in a thread pool executor where concurrent execution is possible. Add a threading.Lock to the STT class (initialized in __init__) and acquire it in the _run function (or at minimum around _ensure_model and model.generate). This ensures: (1) the model is loaded exactly once, and (2) inference calls are serialized to avoid PyTorch thread-safety issues. For example, in __init__ add self._lock = threading.Lock(), then in _run wrap the body with 'with self._lock:'. Alternatively, separate initialization locking (which only needs to guard _ensure_model) from inference locking (which guards model.generate).
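A minimal sketch of the single-lock variant suggested above, using only the standard library. `FakeModel` and `LazySTT` are hypothetical stand-ins for `funasr.AutoModel` and the plugin's `STT` class, not code from this PR; the point is only that holding one `threading.Lock` around both the lazy load and `generate()` loads the model once and serializes inference.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class FakeModel:
    """Hypothetical stand-in for funasr.AutoModel; counts how often it is built."""

    instances = 0

    def __init__(self) -> None:
        FakeModel.instances += 1

    def generate(self, audio: bytes) -> str:
        return f"{len(audio)} bytes transcribed"


class LazySTT:
    """Hypothetical stand-in for the plugin's STT class."""

    def __init__(self) -> None:
        self._model: Optional[FakeModel] = None
        self._lock = threading.Lock()  # guards both model loading and inference

    def _ensure_model(self) -> FakeModel:
        # The caller must hold self._lock, so the check-then-set cannot race.
        if self._model is None:
            self._model = FakeModel()
        return self._model

    def run(self, audio: bytes) -> str:
        # One lock serializes the lazy load and generate() across executor threads.
        with self._lock:
            return self._ensure_model().generate(audio)


stt = LazySTT()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(stt.run, [b"\x00" * 320] * 32))
```

A single lock is the simplest option but also serializes all inference; splitting it into an init lock and an inference lock, as the prompt mentions, would be a refinement on top of this.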

    try:
        text = await asyncio.get_event_loop().run_in_executor(None, _run)
    except Exception as e:  # noqa: BLE001
        raise APIConnectionError() from e

🚩 Broad exception catch masks local errors as retriable APIConnectionError
At stt.py:99-100, all exceptions (including KeyError, ValueError, CUDA OOM, etc.) are caught and re-raised as APIConnectionError. The base class recognize() method at livekit-agents/livekit/agents/stt/stt.py:204-248 retries on APIError (parent of APIConnectionError). This means local model inference errors, which are not transient and won't resolve on retry, will be retried up to max_retry times, wasting time and obscuring the real error. While some other plugins follow a similar pattern, those are wrapping actual network calls where retries make sense. For a local model, a more targeted exception filter (e.g., only catching FunASR-specific errors) would be more appropriate. Not flagged as a bug because this pattern exists in other plugins, but it's worth noting.

    class STT(stt.STT):
        """FunASR self-hosted speech-to-text.

        Runs FunASR models (SenseVoice, Paraformer, Fun-ASR-Nano) locally - no cloud
        API. Non-streaming; LiveKit wraps it with a VAD StreamAdapter for agents.
        """

        def __init__(
            self,
            *,
            model: str = _DEFAULT_MODEL,
            language: str = "auto",
            device: str = "cpu",
            hub: str = "ms",
            use_itn: bool = True,
            vad_model: str | None = "fsmn-vad",
        ) -> None:
            super().__init__(capabilities=STTCapabilities(streaming=False, interim_results=False))
            self._opts = _STTOptions(model=model, language=language, device=device, hub=hub, use_itn=use_itn)
            self._vad_model = vad_model
            self._model = None

🚩 Missing model and provider property overrides
The base STT class at livekit-agents/livekit/agents/stt/stt.py:161-182 defines model and provider properties that return "unknown" by default, with docstrings explicitly stating plugins should override them. Other STT plugins (deepgram at stt.py:199-203, openai at stt.py:199-203, fal at stt.py:47-52) all override these properties. This FunASR plugin does not, meaning metrics emitted by the base class (at stt.py:220-221) will report model_name="unknown" and model_provider="unknown", reducing observability. This is not a correctness bug but an incomplete integration.
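For illustration, the override could look roughly like the sketch below. `BaseSTT` is a hypothetical stand-in that only mimics the "unknown" defaults described above, `FunASRSTT` is an illustrative name, and the `"FunASR"` provider string is an assumption rather than an established convention.

```python
class BaseSTT:
    """Hypothetical stand-in for the base STT class and its default properties."""

    @property
    def model(self) -> str:
        return "unknown"

    @property
    def provider(self) -> str:
        return "unknown"


class FunASRSTT(BaseSTT):
    """Illustrative subclass that reports its configured model and a provider name."""

    def __init__(self, model: str = "FunAudioLLM/SenseVoiceSmall") -> None:
        self._model_name = model

    @property
    def model(self) -> str:
        # Report the model identifier the plugin was configured with.
        return self._model_name

    @property
    def provider(self) -> str:
        # Assumed provider label; pick whatever matches the repo's naming.
        return "FunASR"
```

With overrides like these, anything that reads `model` and `provider` from the instance would see real values instead of the defaults.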
831964d to b68cd98

    except Exception as e:  # noqa: BLE001
        raise APIConnectionError() from e

🟡 Blanket Exception catch wraps non-transient local errors as APIConnectionError, causing futile retries
At lines 99-100, every exception (including KeyError, RuntimeError, torch.cuda.OutOfMemoryError, model-loading failures, etc.) is caught and re-raised as APIConnectionError. The base class recognize() method (livekit-agents/livekit/agents/stt/stt.py:204-248) catches APIError (parent of APIConnectionError) and retries up to conn_options.max_retry times (default 3). Since this is a local model with no network involved, none of these errors are transient connection problems; retrying an OOM or a model-loading failure 3 times is wasteful and delays the real error from surfacing. Other plugins (e.g., livekit-plugins-fal/livekit/plugins/fal/stt.py:84) catch only the provider-specific exception class.
Prompt for agents
The broad except Exception clause at line 99 catches all errors and wraps them as APIConnectionError, which causes the base class to retry them. For a local inference model, most errors (OOM, model load failure, bad audio format) are non-transient and should not be retried. Consider either: (1) narrowing the catch to only FunASR-specific exceptions that could be transient, or (2) re-raising non-transient errors directly without wrapping in APIConnectionError. You may want to import specific exception types from funasr if available, and let programming errors like KeyError/TypeError propagate naturally.
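A rough sketch of option (1), with stand-in classes so it runs on its own. `APIError` and `APIConnectionError` here only imitate the retriable hierarchy described above, `TransientBackendError` is a hypothetical placeholder for whatever FunASR-specific error might be worth retrying, and `recognize_with_retry` is a simplified imitation of the base-class retry loop, not its actual code.

```python
class APIError(Exception):
    """Stand-in for the retriable base error."""


class APIConnectionError(APIError):
    """Stand-in for the connection error the plugin raises."""


class TransientBackendError(Exception):
    """Hypothetical error type for failures that may succeed on retry."""


def run_inference(fn):
    try:
        return fn()
    except TransientBackendError as e:
        # Only errors known to be transient are wrapped and thus retried.
        raise APIConnectionError() from e
    # Anything else (MemoryError, KeyError, ...) propagates unchanged.


def recognize_with_retry(fn, max_retry: int = 3):
    # Simplified imitation of a retry loop that retries on APIError only.
    for attempt in range(max_retry + 1):
        try:
            return run_inference(fn)
        except APIError:
            if attempt == max_retry:
                raise


calls = {"oom": 0, "flaky": 0}


def oom():
    calls["oom"] += 1
    raise MemoryError("out of memory")


def flaky():
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TransientBackendError("busy")
    return "ok"
```

Under these assumptions `recognize_with_retry(flaky)` succeeds after a couple of retries, while `recognize_with_retry(oom)` surfaces the `MemoryError` on the first attempt instead of being retried.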

    return stt.SpeechEvent(
        type=SpeechEventType.FINAL_TRANSCRIPT,
        alternatives=[stt.SpeechData(text=text, language=str(lang))],
    )

🚩 Language "auto" is passed to SenseVoice model: valid but semantically lossy in response
When using the default SenseVoice model with default language "auto", line 91's condition "SenseVoice" in self._opts.model is always true, so gen_kwargs["language"] = "auto" is always set. SenseVoice supports this (it auto-detects language). However, the response at line 104 reports the language as "auto" in SpeechData.language, rather than the actually-detected language. FunASR's result object may contain detected language info that could be extracted. This isn't incorrect (the fal plugin also passes through the configured language), but it means downstream consumers can't know what language was actually spoken.
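One possible way to recover the detected language, sketched under an assumption: that the raw SenseVoice text starts with tags of the form `<|zh|><|NEUTRAL|><|Speech|><|withitn|>`. The `split_tags` helper and the `_LANGS` set are hypothetical, and whether the plugin still sees the raw tagged text at this point is not confirmed by the diff.

```python
import re

# Matches SenseVoice-style tags such as <|zh|> or <|NEUTRAL|> (assumed format).
_TAG = re.compile(r"<\|([^|]*)\|>")

# Assumption: tag values that denote a language rather than an emotion or event.
_LANGS = {"zh", "en", "yue", "ja", "ko"}


def split_tags(raw: str, fallback: str = "auto") -> tuple[str, str]:
    """Return (clean_text, language), falling back when no language tag is found."""
    tags = _TAG.findall(raw)
    lang = next((t for t in tags if t in _LANGS), fallback)
    text = _TAG.sub("", raw).strip()
    return text, lang
```

For example, `split_tags("<|zh|><|NEUTRAL|><|Speech|><|withitn|>你好。")` would yield the cleaned text together with `"zh"`, which could then be passed as `SpeechData.language` instead of the configured `"auto"`.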
Adds `livekit-plugins-funasr`, a non-streaming, self-hosted STT plugin backed by FunASR (SenseVoice / Paraformer / Fun-ASR-Nano). Runs fully locally, no cloud API; strong on Chinese + 50+ languages.

Design
- `STT._recognize_impl(buffer)`: combine frames -> FunASR `AutoModel.generate` -> `SpeechEvent(FINAL_TRANSCRIPT)`.
- `STTCapabilities(streaming=False)`, so LiveKit wraps it with a VAD `StreamAdapter` for real-time agents (same pattern as other non-streaming plugins).

Tested
On an H100: `STT(model='FunAudioLLM/SenseVoiceSmall', hub='hf', device='cuda')` transcribes a Chinese clip and returns a `FINAL_TRANSCRIPT` event with the correct text. Package imports + registers cleanly.

Usage

Resolves #5897. Happy to add CHANGELOG / CI wiring per your conventions; let me know what's needed.