feat(mtmd): add FunASR fbank and LFR audio encoding path #8
Merged
Add Kaldi-compatible fbank extraction and LFR frame stacking helpers for FunASR/SenseVoice-style models. Parse lfr_m/lfr_n from audio config and route SMT audio encoding through the FunASR frontend shape when enabled. Also support SMT audio backends with either hidden_states-only or hidden_states-plus-attention_mask inputs, and add FunASR architecture detection in the SMT vision server.
Pull request overview
Adds a FunASR/SenseVoice-style SMT audio encoding pipeline that computes Kaldi-like fbank features, applies LFR frame stacking, and routes audio through a frontend/backend ONNX split when audio_model.lfr_m > 0, while preserving the existing Qwen3-ASR mel/chunk path.
Changes:
- Extend SMT audio config parsing to support lfr_m/lfr_n and select an alternate FunASR encoding path when enabled.
- Add reusable helpers for Kaldi-style fbank extraction and LFR stacking in tools/mtmd/mtmd-audio.{h,cpp}.
- Relax backend ONNX signature handling to accept either {hidden_states} or {hidden_states, attention_mask}.
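The relaxed backend signature handling described in the last bullet could look roughly like the sketch below. The enum and function names are hypothetical and not taken from the PR; the sketch also assumes the input names are matched exactly and that hidden_states comes first.

```cpp
#include <string>
#include <vector>

// Hypothetical classification of a backend ONNX input signature.
enum class BackendInputs { HiddenOnly, HiddenPlusMask, Unsupported };

// Accepts {hidden_states} or {hidden_states, attention_mask}; anything else
// is reported as unsupported so the caller can fail with a clear message.
inline BackendInputs classify_backend_inputs(const std::vector<std::string> & names) {
    if (names.size() == 1 && names[0] == "hidden_states") {
        return BackendInputs::HiddenOnly;
    }
    if (names.size() == 2 && names[0] == "hidden_states" && names[1] == "attention_mask") {
        return BackendInputs::HiddenPlusMask;
    }
    return BackendInputs::Unsupported;
}
```

A caller would build the attention_mask tensor only in the HiddenPlusMask case and raise an error for Unsupported.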
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tools/server/server-smt-vision.cpp | Adds a FunASR architecture helper (currently unused). |
| tools/mtmd/smt-audio-wrapper.cpp | Adds config fields and implements the FunASR fbank+LFR frontend path and optional backend attention mask handling. |
| tools/mtmd/mtmd-audio.h | Declares new fbank and LFR helper APIs for MTMD audio. |
| tools/mtmd/mtmd-audio.cpp | Implements fbank extraction + LFR stacking utilities used by the new FunASR path. |
Comment on lines +617 to +619
    static bool arch_is_funasr(const std::string & arch_name) {
        return contains_icase(arch_name, "funasr");
    }
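The contains_icase helper is called above but its body is not part of the quoted diff. A minimal stand-in, assuming it means "ASCII case-insensitive substring search", might look like this (the name contains_icase_sketch is hypothetical):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical stand-in for contains_icase: true if `needle` occurs in
// `haystack` when ASCII case is ignored. An empty needle always matches.
inline bool contains_icase_sketch(const std::string & haystack, const std::string & needle) {
    if (needle.empty()) {
        return true;
    }
    auto it = std::search(haystack.begin(), haystack.end(), needle.begin(), needle.end(),
                          [](unsigned char a, unsigned char b) {
                              return std::tolower(a) == std::tolower(b);
                          });
    return it != haystack.end();
}
```

With this definition, the config value "FunASRNano" would match "funasr", while a Qwen3-ASR architecture name would not.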
Comment on lines +691 to +700
        const auto frontend_output_info = frontend_outputs[0].GetTensorTypeAndShapeInfo();
        if (frontend_output_info.GetElementType() != ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT) {
            throw std::runtime_error("SMT audio warmup frontend output must be float32");
        }
        auto shape = frontend_output_info.GetShape();
        warmup_t_out = (int) shape[1];
        warmup_hidden.resize((size_t) warmup_t_out * (size_t) d.config.d_model, 0.0f);
        const float * frontend_output = frontend_outputs[0].GetTensorData<float>();
        std::memcpy(warmup_hidden.data(), frontend_output, warmup_hidden.size() * sizeof(float));
    } else {
Comment on lines +822 to +826
    const auto frontend_shape = frontend_outputs[0].GetTensorTypeAndShapeInfo().GetShape();
    t_out = (int) frontend_shape[1];
    hidden_states.resize((size_t) t_out * (size_t) d.config.d_model);
    std::memcpy(hidden_states.data(), frontend_outputs[0].GetTensorData<float>(),
                hidden_states.size() * sizeof(float));
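Both snippets above index shape[1] and copy t_out * d_model floats without first confirming that the tensor really is [1, T, d_model]. A defensive variant might validate the shape before the memcpy, as in the sketch below. It is written against plain vectors so that it stands alone; `shape` and `data` stand in for the values returned by GetShape() and GetTensorData<float>(), and the function name is hypothetical. The element-type check from the warmup path would still be needed separately.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Copies a [1, T, d_model] float tensor into `out` and returns T.
// Throws if the shape does not match what the copy below assumes.
inline int copy_frontend_output(const std::vector<int64_t> & shape, const float * data,
                                int d_model, std::vector<float> & out) {
    if (data == nullptr || d_model <= 0) {
        throw std::runtime_error("frontend output: missing data or invalid d_model");
    }
    if (shape.size() != 3 || shape[0] != 1) {
        throw std::runtime_error("frontend output: expected shape [1, T, d_model]");
    }
    if (shape[1] <= 0 || shape[2] != (int64_t) d_model) {
        throw std::runtime_error("frontend output: T must be positive and last dim must equal d_model");
    }
    const size_t t_out = (size_t) shape[1];
    out.resize(t_out * (size_t) d_model);
    std::memcpy(out.data(), data, out.size() * sizeof(float));
    return (int) t_out;
}
```

Checking shape[2] against d_model is the important part: without it, a mismatched export would cause the memcpy to read past the end of the ONNX output buffer.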
Comment on lines +1177 to +1180
    if (n_samples == 0 || n_mel <= 0) {
        return false;
    }
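The guard above rejects empty input, but an input shorter than one analysis window also yields zero frames under Kaldi-style framing. The sketch below computes the frame count assuming Kaldi's snip_edges=true behaviour; whether the PR's helper uses that mode is an assumption, and the function name is hypothetical.

```cpp
#include <cstddef>

// Frame count for Kaldi-style framing with snip_edges=true (assumed here):
// only frames that fit entirely inside the signal are produced.
inline int fbank_num_frames(size_t n_samples, int frame_len, int hop_len) {
    if (frame_len <= 0 || hop_len <= 0 || n_samples < (size_t) frame_len) {
        return 0;
    }
    return 1 + (int) ((n_samples - (size_t) frame_len) / (size_t) hop_len);
}
```

With the config values from this PR (window_len 400, hop_len 160), one second of 16 kHz audio gives 1 + (16000 - 400) / 160 = 98 frames, and any clip shorter than 400 samples gives none.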
Comment on lines +1198 to +1201
    std::vector<float> window(frame_len);
    for (int i = 0; i < frame_len; i++) {
        window[i] = 0.54f - 0.46f * cosf(2.0f * (float) M_PI * i / frame_len);
    }
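One point worth checking in the loop above is the denominator. Kaldi's reference window code uses 2π / (frame_len - 1), which gives a symmetric Hamming window, whereas dividing by frame_len gives the periodic variant. For comparison, here is a sketch of the symmetric form; the function name is hypothetical, and whether the target FunASR frontend expects this exact window is an assumption to verify against reference features.

```cpp
#include <cmath>
#include <vector>

// Symmetric Hamming window: w[i] = 0.54 - 0.46 * cos(2*pi*i / (N - 1)).
// Computed in double and narrowed to float at the end.
inline std::vector<float> hamming_symmetric(int frame_len) {
    std::vector<float> window(frame_len > 0 ? (size_t) frame_len : 0);
    if (frame_len == 1) {
        window[0] = 1.0f;  // degenerate case, avoids division by zero
        return window;
    }
    const double kPi = 3.14159265358979323846;
    const double a = 2.0 * kPi / (double) (frame_len - 1);
    for (int i = 0; i < frame_len; i++) {
        window[i] = (float) (0.54 - 0.46 * std::cos(a * i));
    }
    return window;
}
```

For frame_len = 400 both endpoints evaluate to 0.08 in the symmetric form; in the periodic form the last sample is slightly larger than the first.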
Comment on lines +1242 to +1244
    if (n_frames <= 0 || lfr_n <= 0) {
        return false;
    }
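The guard above checks n_frames and lfr_n; lfr_m also needs to be positive for stacking to make sense. To make the LFR step concrete, here is a minimal sketch of frame stacking as it is commonly described for FunASR: left-pad with (lfr_m - 1) / 2 copies of the first frame, emit one stacked frame every lfr_n input frames, and repeat the last frame when the window runs past the end. The function name and signature are hypothetical and the PR's actual helper is not shown, so treat this as an illustration rather than the merged code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// feats: row-major [n_frames, n_mel]. On success, out is row-major
// [t_lfr, n_mel * lfr_m] with t_lfr = ceil(n_frames / lfr_n).
inline bool lfr_stack_sketch(const std::vector<float> & feats, int n_frames, int n_mel,
                             int lfr_m, int lfr_n, std::vector<float> & out, int & t_lfr) {
    if (n_frames <= 0 || n_mel <= 0 || lfr_m <= 0 || lfr_n <= 0) {
        return false;
    }
    if (feats.size() != (size_t) n_frames * (size_t) n_mel) {
        return false;
    }
    const int left_pad = (lfr_m - 1) / 2;
    t_lfr = (n_frames + lfr_n - 1) / lfr_n;
    out.assign((size_t) t_lfr * (size_t) lfr_m * (size_t) n_mel, 0.0f);
    for (int i = 0; i < t_lfr; i++) {
        for (int k = 0; k < lfr_m; k++) {
            // Position in the virtual left-padded sequence, mapped back to a
            // real frame: indices before 0 reuse the first frame, indices
            // past the end reuse the last frame.
            int src = i * lfr_n + k - left_pad;
            src = std::max(0, std::min(src, n_frames - 1));
            std::copy_n(feats.begin() + (size_t) src * (size_t) n_mel, n_mel,
                        out.begin() + ((size_t) i * (size_t) lfr_m + (size_t) k) * (size_t) n_mel);
        }
    }
    return true;
}
```

With num_mel_bins = 80 and lfr_m = 7 each stacked frame has 560 values, which matches the [1, T, 560] frontend input listed in the export notes.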
Comment on lines +1096 to +1101
    static void funasr_fft_inplace(float * data, int n) {
        for (int i = 1, j = 0; i < n; i++) {
            int bit = n >> 1;
            for (; j & bit; bit >>= 1) {
                j ^= bit;
            }
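The quoted lines are the start of a bit-reversal permutation, the first stage of an iterative radix-2 FFT; the rest of the function is cut off in the review view. For orientation, a complete sketch of that algorithm follows. It uses std::complex<float> rather than the raw float * layout of the PR (whose exact layout is not visible here), the function name is hypothetical, and it only accepts power-of-two lengths.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// In-place iterative radix-2 FFT (forward transform, no scaling).
// Returns false if the length is zero or not a power of two.
inline bool fft_radix2_sketch(std::vector<std::complex<float>> & a) {
    const size_t n = a.size();
    if (n == 0 || (n & (n - 1)) != 0) {
        return false;
    }
    // Bit-reversal permutation, same loop shape as the quoted snippet.
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) {
            j ^= bit;
        }
        j ^= bit;
        if (i < j) {
            std::swap(a[i], a[j]);
        }
    }
    // Butterfly passes over blocks of length 2, 4, ..., n.
    const double kPi = 3.14159265358979323846;
    for (size_t len = 2; len <= n; len <<= 1) {
        const double ang = -2.0 * kPi / (double) len;
        for (size_t i = 0; i < n; i += len) {
            for (size_t k = 0; k < len / 2; k++) {
                const std::complex<float> w((float) std::cos(ang * (double) k),
                                            (float) std::sin(ang * (double) k));
                const std::complex<float> u = a[i + k];
                const std::complex<float> v = a[i + k + len / 2] * w;
                a[i + k] = u + v;
                a[i + k + len / 2] = u - v;
            }
        }
    }
    return true;
}
```

Because the configured window is 400 samples, a radix-2 routine needs each frame zero-padded to the next power of two (512) before the transform, which is also what Kaldi does by default.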
cbfc6a2 merged into spacemit-com:spacemit-mtmd
10 of 11 checks passed
Overview
This PR adds FunASR/SenseVoice-style audio support to the SMT multimodal audio path.
The new path is selected by audio_model.lfr_m > 0 in the SMT config. It computes Kaldi-compatible fbank features, applies LFR frame stacking, runs the FunASR frontend ONNX model, then projects the result with the backend ONNX model before injecting the audio embeddings into the llama.cpp context.
This enables the Fun-ASR-Nano-2512 split deployment layout:
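Changes
- Add Kaldi-compatible fbank extraction and LFR frame stacking helpers in tools/mtmd/mtmd-audio.{h,cpp}.
- Parse lfr_m and lfr_n from audio_model config.
- Route SMT audio encoding through the FunASR frontend path when enabled, while keeping the existing Qwen3-ASR mel/chunk path unchanged.
- Support SMT audio backends with either of these input signatures:
  - hidden_states
  - hidden_states + attention_mask
- The backend output dimension is hidden_size, so the final audio embedding dimension must still match the text model embedding size.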
Model Contract
The tested Fun-ASR-Nano-2512 split config is:
    {
      "architectures": ["FunASRNano"],
      "audio_model": {
        "frontend_model_path": "frontend.onnx",
        "backend_model_path": "backend.onnx",
        "d_model": 512,
        "hidden_size": 1024,
        "num_mel_bins": 80,
        "sample_rate": 16000,
        "n_fft": 400,
        "window_len": 400,
        "hop_len": 160,
        "lfr_m": 7,
        "lfr_n": 6
      }
    }

hidden_size: 1024 is required because the backend ONNX output is injected as LLM input embeddings and must match the Qwen3-0.6B GGUF embedding dimension.
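The dimension constraints implied by this config can be expressed as a few load-time checks. The struct and function names below are hypothetical and only mirror the audio_model fields shown above; n_embd stands for the text model's embedding size as reported by the loaded GGUF.

```cpp
// Hypothetical mirror of the audio_model fields relevant to shape checks.
struct AudioCfgSketch {
    int d_model;
    int hidden_size;
    int num_mel_bins;
    int lfr_m;
    int lfr_n;
};

// The FunASR path is selected when lfr_m > 0.
inline bool use_funasr_path(const AudioCfgSketch & c) {
    return c.lfr_m > 0;
}

// Width of one LFR-stacked frame fed to the frontend ONNX model.
inline int lfr_feature_dim(const AudioCfgSketch & c) {
    return c.num_mel_bins * c.lfr_m;
}

// The backend output is injected as LLM input embeddings, so hidden_size
// has to equal the text model's embedding size.
inline bool embedding_dim_matches(const AudioCfgSketch & c, int n_embd) {
    return c.hidden_size == n_embd;
}
```

For the tested config, lfr_feature_dim gives 80 * 7 = 560, and embedding_dim_matches holds for an n_embd of 1024.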
Validation
Validated with the following end-to-end workflow.
1. Prepare the Fun-ASR-Nano-2512 model
2. Export split ONNX and HF LLM weights
The split export script will be attached to this PR as additional review
material.
This produces the SMT audio config, frontend/backend ONNX models, and extracted Qwen3-0.6B HF-format LLM weights under <FUNASR_SPLIT_DIR>.
3. Convert and quantize the text model
4. Build llama.cpp with SMT support
5. Run llama-server
Observed startup:
Benchmark
Accuracy summary from funasr-bq-q4km_accuracy_report.md, focusing on the 04_fast_666_16000hz_1ch.wav fast-speech sample:
For this sample, the FunASR split deployment is +2.20% over the qwen3asr-ddq40
baseline.
RTF on gold.wav:
The reported FunASR config uses quantized ONNX frontend/backend models (frontend.q.onnx, backend.q.onnx) and a Q4_K_M GGUF text model (qwen3-0.6b-q4km.gguf).
Additional information
The Fun-ASR-Nano-2512 export script used to prepare the split model is attached
to this PR for review:
export_split_model.py
The script exports:
- frontend.onnx: LFR features [1, T, 560] to hidden states [1, T, 512]
- backend.onnx: hidden states [1, T, 512] to audio embeddings [1, T, 1024]
Requirements
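- The existing Qwen3-ASR path is used when lfr_m is not set.
- Set lfr_m/lfr_n to enable the new path.
- Backends with or without attention_mask are accepted.
- The backend output dimension must match hidden_size.
- Validated through llama-server with OpenAI-compatible input_audio requests.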