feat(plugins): add FunASR self-hosted STT plugin #6129

Open
LauraGPT wants to merge 1 commit into
livekit:mainfrom
LauraGPT:feat/funasr-stt-plugin

@LauraGPT

Adds livekit-plugins-funasr, a non-streaming, self-hosted STT plugin backed by FunASR (SenseVoice / Paraformer / Fun-ASR-Nano). Runs fully locally with no cloud API; strong on Chinese and 50+ languages.

Design

  • Implements STT._recognize_impl(buffer) → combine frames → FunASR AutoModel.generate → SpeechEvent(FINAL_TRANSCRIPT).
  • Declares STTCapabilities(streaming=False), so LiveKit wraps it with a VAD StreamAdapter for real-time agents (same pattern as other non-streaming plugins).
  • Lazy model load; FunASR runs in an executor (non-blocking).
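The lazy-load and executor flow in the bullets above can be sketched as follows. This is a minimal illustration, not the plugin's code: `FakeModel`, `LazySTT`, and `recognize` are hypothetical stand-ins, and the real plugin would construct a FunASR `AutoModel` where the fake is created.

```python
from __future__ import annotations

import asyncio


class FakeModel:
    """Hypothetical stand-in for a FunASR AutoModel; not the real API."""

    def generate(self, audio: bytes) -> list[dict]:
        return [{"text": f"{len(audio)} bytes transcribed"}]


class LazySTT:
    def __init__(self) -> None:
        self._model = None  # nothing is loaded at construction time

    def _ensure_model(self) -> FakeModel:
        if self._model is None:
            # The real plugin would build the FunASR model here.
            self._model = FakeModel()
        return self._model

    async def recognize(self, frames: list[bytes]) -> str:
        audio = b"".join(frames)  # combine buffered frames into one clip

        def _run() -> str:
            # Blocking work: model load (first call only) plus inference.
            return self._ensure_model().generate(audio)[0]["text"]

        # Run in the default thread pool so the event loop stays responsive.
        return await asyncio.get_running_loop().run_in_executor(None, _run)


text = asyncio.run(LazySTT().recognize([b"\x00\x01", b"\x02\x03"]))
```

Because the model is created inside `_run`, the first call pays the load cost in a worker thread rather than on the event loop.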

Tested

On an H100: STT(model='FunAudioLLM/SenseVoiceSmall', hub='hf', device='cuda') transcribes a Chinese clip and returns a FINAL_TRANSCRIPT event with the correct text. Package imports + registers cleanly.

Usage

from livekit.plugins import funasr
stt = funasr.STT(model='iic/SenseVoiceSmall', device='cuda')          # ModelScope
stt = funasr.STT(model='FunAudioLLM/SenseVoiceSmall', hub='hf', device='cuda')  # HuggingFace

Resolves #5897. Happy to add CHANGELOG / CI wiring per your conventions; let me know what's needed.

Adds `livekit-plugins-funasr`: a non-streaming STT plugin backed by
[FunASR](https://github.com/modelscope/FunASR) (SenseVoice / Paraformer /
Fun-ASR-Nano), running fully locally with no cloud API. Strong on Chinese and
50+ languages; SenseVoice also returns language/emotion/event tags.

Implements `STT._recognize_impl` (combine frames -> FunASR -> SpeechEvent) and
declares `STTCapabilities(streaming=False)`, so LiveKit wraps it with a VAD
StreamAdapter for real-time agents.

Tested: transcribes a Chinese clip via the STT interface and returns a
FINAL_TRANSCRIPT event. Resolves livekit#5897.
@LauraGPT LauraGPT requested a review from a team as a code owner June 16, 2026 19:06
@CLAassistant

CLAassistant commented Jun 16, 2026

CLA assistant check
All committers have signed the CLA.

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 3 potential issues.

Comment on lines +57 to +66
def _ensure_model(self):
    if self._model is None:
        from funasr import AutoModel

        kwargs = dict(model=self._opts.model, device=self._opts.device, hub=self._opts.hub, disable_update=True)
        if self._vad_model:
            kwargs.update(vad_model=self._vad_model, vad_kwargs={"max_single_segment_time": 30000})
        logger.info("loading FunASR model %s on %s", self._opts.model, self._opts.device)
        self._model = AutoModel(**kwargs)
    return self._model

🔴 No thread-safety for lazy model init and concurrent inference in thread executor

_ensure_model() is called from _run() which executes in a thread pool via run_in_executor (stt.py:98). There is no lock guarding the check-then-set on self._model (stt.py:58), so concurrent _recognize_impl calls can race: two threads both see self._model is None, both load the model (wasting resources and time), and one loaded instance is silently discarded. More critically, after the model is initialized, concurrent _run invocations will call model.generate() simultaneously on the same FunASR/PyTorch model instance. PyTorch models are not thread-safe for forward passes (they share internal buffers), which can cause crashes (especially on CUDA) or silently produce incorrect transcription results.

Prompt for agents
The _ensure_model method and subsequent model.generate() call in _run are not protected by any lock, yet they run in a thread pool executor where concurrent execution is possible. Add a threading.Lock to the STT class (initialized in __init__) and acquire it in the _run function (or at minimum around _ensure_model and model.generate). This ensures: (1) the model is loaded exactly once, and (2) inference calls are serialized to avoid PyTorch thread-safety issues. For example, in __init__ add self._lock = threading.Lock(), then in _run wrap the body with 'with self._lock:'. Alternatively, separate initialization locking (which only needs to guard _ensure_model) from inference locking (which guards model.generate).
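A minimal sketch of the single-lock variant suggested above, assuming a stand-in model. `FakeModel` and `LockedRunner` are hypothetical names for illustration; only `threading.Lock` and `ThreadPoolExecutor` are real library APIs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeModel:
    """Hypothetical stand-in for the FunASR model; counts how often it is built."""

    load_count = 0

    def __init__(self) -> None:
        FakeModel.load_count += 1

    def generate(self, audio: int) -> str:
        return f"text:{audio}"


class LockedRunner:
    def __init__(self) -> None:
        self._model = None
        self._lock = threading.Lock()

    def run(self, audio: int) -> str:
        # One lock guards both the check-then-set and the inference call,
        # so the model is built once and generate() calls are serialized.
        with self._lock:
            if self._model is None:
                self._model = FakeModel()
            return self._model.generate(audio)


runner = LockedRunner()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(runner.run, range(32)))
```

Serializing inference trades throughput for safety; the alternative named in the prompt (one lock for initialization, another for `generate`) keeps the same guarantees while making the two concerns explicit.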

Comment on lines +97 to +100
try:
    text = await asyncio.get_event_loop().run_in_executor(None, _run)
except Exception as e: # noqa: BLE001
    raise APIConnectionError() from e

🚩 Broad exception catch masks local errors as retriable APIConnectionError

At stt.py:99-100, all exceptions (including KeyError, ValueError, CUDA OOM, etc.) are caught and re-raised as APIConnectionError. The base class recognize() method at livekit-agents/livekit/agents/stt/stt.py:204-248 retries on APIError (parent of APIConnectionError). This means local model inference errors, which are not transient and won't resolve on retry, will be retried up to max_retry times, wasting time and obscuring the real error. While some other plugins follow a similar pattern, those are wrapping actual network calls where retries make sense. For a local model, a more targeted exception filter (e.g., only catching FunASR-specific errors) would be more appropriate. Not flagged as a bug because this pattern exists in other plugins, but it's worth noting.
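The cost of the wrapping can be illustrated with a toy retry loop. Everything here is a hypothetical stand-in: the `APIError` classes, `recognize_with_retry`, and the "one initial attempt plus `max_retry` retries" policy are assumptions for illustration, not the base class's actual implementation.

```python
class APIError(Exception):
    """Hypothetical stand-in for the retriable error base class."""


class APIConnectionError(APIError):
    pass


def recognize_with_retry(fn, max_retry: int = 3):
    """Toy loop: retries only when fn raises APIError (assumed policy)."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return fn()
        except APIError:
            if attempts > max_retry:
                raise


calls = {"wrapped": 0, "direct": 0}


def wrapped_inference():
    calls["wrapped"] += 1
    try:
        raise ValueError("bad audio shape")  # a local, non-transient failure
    except Exception as e:
        raise APIConnectionError() from e  # now looks retriable


def direct_inference():
    calls["direct"] += 1
    raise ValueError("bad audio shape")  # propagates on the first attempt


for fn in (wrapped_inference, direct_inference):
    try:
        recognize_with_retry(fn)
    except Exception:
        pass
```

Under these assumptions the wrapped failure is attempted four times before surfacing, while the unwrapped one fails immediately with its original type.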

Comment on lines +35 to +55
class STT(stt.STT):
    """FunASR self-hosted speech-to-text.

    Runs FunASR models (SenseVoice, Paraformer, Fun-ASR-Nano) locally - no cloud
    API. Non-streaming; LiveKit wraps it with a VAD StreamAdapter for agents.
    """

    def __init__(
        self,
        *,
        model: str = _DEFAULT_MODEL,
        language: str = "auto",
        device: str = "cpu",
        hub: str = "ms",
        use_itn: bool = True,
        vad_model: str | None = "fsmn-vad",
    ) -> None:
        super().__init__(capabilities=STTCapabilities(streaming=False, interim_results=False))
        self._opts = _STTOptions(model=model, language=language, device=device, hub=hub, use_itn=use_itn)
        self._vad_model = vad_model
        self._model = None

🚩 Missing model and provider property overrides

The base STT class at livekit-agents/livekit/agents/stt/stt.py:161-182 defines model and provider properties that return "unknown" by default, with docstrings explicitly stating plugins should override them. Other STT plugins (deepgram at stt.py:199-203, openai at stt.py:199-203, fal at stt.py:47-52) all override these properties. This FunASR plugin does not, meaning metrics emitted by the base class (at stt.py:220-221) will report model_name="unknown" and model_provider="unknown", reducing observability. This is not a correctness bug but an incomplete integration.
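A minimal sketch of the override pattern, using a stand-in base class. `BaseSTT` and `FunASRSTT` are hypothetical, and the provider string "FunASR" is an assumed label; the real plugin would subclass `stt.STT` and follow whatever naming the other plugins use.

```python
class BaseSTT:
    """Hypothetical stand-in for the base STT class and its default labels."""

    @property
    def model(self) -> str:
        return "unknown"

    @property
    def provider(self) -> str:
        return "unknown"


class FunASRSTT(BaseSTT):
    def __init__(self, model: str = "iic/SenseVoiceSmall") -> None:
        self._model_name = model

    @property
    def model(self) -> str:
        # Report the configured model id instead of the default "unknown".
        return self._model_name

    @property
    def provider(self) -> str:
        return "FunASR"  # assumed label, chosen for illustration


s = FunASRSTT()
labels = (s.model, s.provider)
```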

@LauraGPT LauraGPT force-pushed the feat/funasr-stt-plugin branch from 831964d to b68cd98 Compare June 17, 2026 10:10

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 2 new potential issues.

Comment on lines +99 to +100
except Exception as e: # noqa: BLE001
    raise APIConnectionError() from e

🟡 Blanket Exception catch wraps non-transient local errors as APIConnectionError, causing futile retries

At lines 99-100, every exception (including KeyError, RuntimeError, torch.cuda.OutOfMemoryError, model-loading failures, etc.) is caught and re-raised as APIConnectionError. The base class recognize() method (livekit-agents/livekit/agents/stt/stt.py:204-248) catches APIError (parent of APIConnectionError) and retries up to conn_options.max_retry times (default 3). Since this is a local model with no network involved, none of these errors are transient connection problems; retrying an OOM or a model-loading failure 3 times is wasteful and delays the real error from surfacing. Other plugins (e.g., livekit-plugins-fal/livekit/plugins/fal/stt.py:84) catch only the provider-specific exception class.

Prompt for agents
The broad except Exception clause at line 99 catches all errors and wraps them as APIConnectionError, which causes the base class to retry them. For a local inference model, most errors (OOM, model load failure, bad audio format) are non-transient and should not be retried. Consider either: (1) narrowing the catch to only FunASR-specific exceptions that could be transient, or (2) re-raising non-transient errors directly without wrapping in APIConnectionError. You may want to import specific exception types from funasr if available, and let programming errors like KeyError/TypeError propagate naturally.
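Option (1) could look roughly like the sketch below. The `APIConnectionError` class is a stand-in, and the choice of `TimeoutError` and `ConnectionError` as the "transient" set is purely an assumption for illustration; which FunASR errors, if any, are worth retrying would need to be confirmed against the library.

```python
class APIConnectionError(Exception):
    """Hypothetical stand-in for the retriable error type."""


# Assumption: a placeholder set of errors treated as transient.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def run_inference(fn):
    try:
        return fn()
    except TRANSIENT_ERRORS as e:
        # Only these are wrapped, so only these would be retried upstream.
        raise APIConnectionError() from e
    # Anything else (KeyError, RuntimeError, ...) propagates unchanged.


def classify(fn) -> str:
    """Report how a failure from fn would surface to the caller."""
    try:
        run_inference(fn)
    except APIConnectionError:
        return "retriable"
    except Exception as e:
        return type(e).__name__
    return "ok"


def timeout():
    raise TimeoutError("model download stalled")


def oom():
    raise RuntimeError("CUDA out of memory")


outcomes = [classify(timeout), classify(oom), classify(lambda: "hello")]
```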

Comment on lines +102 to +105
return stt.SpeechEvent(
    type=SpeechEventType.FINAL_TRANSCRIPT,
    alternatives=[stt.SpeechData(text=text, language=str(lang))],
)

🚩 Language "auto" is passed to SenseVoice model: valid but semantically lossy in response

When using the default SenseVoice model with default language "auto", line 91's condition "SenseVoice" in self._opts.model is always true, so gen_kwargs["language"] = "auto" is always set. SenseVoice supports this (it auto-detects language). However, the response at line 104 reports the language as "auto" in SpeechData.language, rather than the actually-detected language. FunASR's result object may contain detected language info that could be extracted. This isn't incorrect (the fal plugin also passes through the configured language), but it means downstream consumers can't know what language was actually spoken.
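One way the detected language might be recovered is sketched below. It assumes the raw SenseVoice text starts with a language tag of the form `<|zh|>` followed by further `<|...|>` tags; that format, and the helper name `split_language`, are assumptions to verify against the actual FunASR result before relying on them.

```python
from __future__ import annotations

import re

# Assumed tag shapes: a leading 2-3 letter language code, then other tags.
_LANG_TAG = re.compile(r"^<\|([a-z]{2,3})\|>")
_ANY_TAG = re.compile(r"<\|[^|]*\|>")


def split_language(raw: str, fallback: str = "auto") -> tuple[str, str]:
    """Return (language, clean_text); use fallback when no tag is present."""
    m = _LANG_TAG.match(raw)
    lang = m.group(1) if m else fallback
    text = _ANY_TAG.sub("", raw).strip()  # drop every <|...|> tag
    return lang, text


detected = split_language("<|zh|><|NEUTRAL|><|Speech|><|woitn|>你好")
untagged = split_language("hello world")
```

With a helper like this, `SpeechData.language` could carry the detected code and keep "auto" only as the fallback when no tag is found.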

Development

Successfully merging this pull request may close these issues.

[Plugin] livekit-plugins-funasr - self-hosted STT with built-in diarization (170x realtime)
