feat(mtmd): add FunASR fbank and LFR audio encoding path by LFF28 · Pull Request #8 · spacemit-com/llama.cpp

LFF28 · 2026-06-15T06:07:08Z

Overview

This PR adds FunASR/SenseVoice-style audio support to the SMT multimodal audio path.

The new path is selected by audio_model.lfr_m > 0 in the SMT config. It computes
Kaldi-compatible fbank features, applies LFR frame stacking, runs the FunASR
frontend ONNX model, then projects the result with the backend ONNX model before
injecting the audio embeddings into the llama.cpp context.

This enables the Fun-ASR-Nano-2512 split deployment layout:

audio file
  -> Kaldi fbank + LFR [1, T, 560]
  -> frontend.onnx [1, T, 512]
  -> backend.onnx [1, T, 1024]
  -> GGUF Qwen3-0.6B text model
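
The LFR step in the pipeline above can be sketched as follows. This is a minimal illustration, not the helper added by this PR: the name lfr_stack is hypothetical, and the edge handling (replicating the first frame on the left and the last frame on the right) assumes the padding convention of the FunASR reference frontend. With 80 mel bins and lfr_m = 7, each stacked row has 80 * 7 = 560 values, which matches the [1, T, 560] frontend input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the PR's API): stack lfr_m consecutive fbank frames
// into one row, advancing lfr_n frames per output row. Frames outside the
// input are replaced by the first or last real frame (assumed convention).
std::vector<float> lfr_stack(const std::vector<float> & feats, int n_frames, int n_mel,
                             int lfr_m, int lfr_n) {
    assert(n_frames > 0 && n_mel > 0 && lfr_m > 0 && lfr_n > 0);
    assert(feats.size() == (size_t) n_frames * (size_t) n_mel);

    const int left_pad = (lfr_m - 1) / 2;                // virtual copies of frame 0
    const int t_lfr    = (n_frames + lfr_n - 1) / lfr_n; // ceil(n_frames / lfr_n)

    std::vector<float> out((size_t) t_lfr * (size_t) lfr_m * (size_t) n_mel);
    for (int i = 0; i < t_lfr; i++) {
        for (int k = 0; k < lfr_m; k++) {
            // Position in the virtually padded sequence, clamped to real frames.
            int src = i * lfr_n + k - left_pad;
            src     = std::max(0, std::min(src, n_frames - 1));
            std::copy_n(feats.begin() + (size_t) src * n_mel, n_mel,
                        out.begin() + ((size_t) i * lfr_m + k) * n_mel);
        }
    }
    return out;
}
```

For 10 input frames with lfr_m = 7 and lfr_n = 6, this yields 2 output rows: the first covers frames 0,0,0,0,1,2,3 and the second covers frames 3 through 9.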

Changes

- Add reusable Kaldi-compatible fbank extraction and LFR frame stacking helpers in
  tools/mtmd/mtmd-audio.{h,cpp}.
- Parse lfr_m and lfr_n from audio_model config.
- Route SMT audio encoding through the FunASR feature layout when LFR is enabled,
  while keeping the existing Qwen3-ASR mel/chunk path unchanged.
- Support SMT audio backend ONNX signatures with either:
  - hidden_states
  - hidden_states + attention_mask
- Add FunASR architecture detection in the SMT server glue.
- Keep backend output validation based on hidden_size, so the final audio
  embedding dimension must still match the text model embedding size.
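
The two accepted backend signatures could be told apart from the input names an ONNX session reports. The sketch below is independent of ONNX Runtime so that it stays self-contained; classify_backend_inputs and make_full_attention_mask are hypothetical names, and the int64 all-ones mask is an assumption about what a backend exported with attention_mask expects for a single unpadded sequence.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

enum class backend_signature { HIDDEN_ONLY, HIDDEN_AND_MASK };

// Hypothetical classifier over the backend model's input names.
backend_signature classify_backend_inputs(const std::vector<std::string> & names) {
    const auto has = [&names](const char * n) {
        return std::find(names.begin(), names.end(), n) != names.end();
    };
    if (names.size() == 1 && has("hidden_states")) {
        return backend_signature::HIDDEN_ONLY;
    }
    if (names.size() == 2 && has("hidden_states") && has("attention_mask")) {
        return backend_signature::HIDDEN_AND_MASK;
    }
    throw std::runtime_error("unsupported SMT audio backend ONNX signature");
}

// Assumed mask layout: shape [1, t_out], every position valid.
std::vector<int64_t> make_full_attention_mask(int t_out) {
    assert(t_out >= 0);
    return std::vector<int64_t>((size_t) t_out, 1);
}
```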

Model Contract

The tested Fun-ASR-Nano-2512 split config is:

{
  "architectures": ["FunASRNano"],
  "audio_model": {
    "frontend_model_path": "frontend.onnx",
    "backend_model_path": "backend.onnx",
    "d_model": 512,
    "hidden_size": 1024,
    "num_mel_bins": 80,
    "sample_rate": 16000,
    "n_fft": 400,
    "window_len": 400,
    "hop_len": 160,
    "lfr_m": 7,
    "lfr_n": 6
  }
}

hidden_size: 1024 is required because the backend ONNX output is injected as
LLM input embeddings and must match the Qwen3-0.6B GGUF embedding dimension.
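
The numbers in this config fit together as follows: a 400-sample window at 16 kHz is 25 ms, a 160-sample hop is 10 ms, and lfr_n = 6 means each LFR frame advances 60 ms. The sketch below works through the shape arithmetic. The struct and function names are hypothetical, and the fbank frame count assumes Kaldi's default snip_edges = true behaviour, which the PR text does not state.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the audio_model fields shown above.
struct audio_config_sketch {
    int d_model      = 512;
    int hidden_size  = 1024;
    int num_mel_bins = 80;
    int sample_rate  = 16000;
    int window_len   = 400;
    int hop_len      = 160;
    int lfr_m        = 7;
    int lfr_n        = 6;
};

// The FunASR path is selected by lfr_m > 0 (as described in the overview).
bool uses_funasr_path(const audio_config_sketch & c) { return c.lfr_m > 0; }

// Width of one stacked LFR row: 80 * 7 = 560 for the tested config.
int lfr_feature_dim(const audio_config_sketch & c) { return c.num_mel_bins * c.lfr_m; }

// Number of fbank frames, assuming snip_edges = true (only full windows).
int64_t fbank_frames(int64_t n_samples, const audio_config_sketch & c) {
    assert(c.window_len > 0 && c.hop_len > 0);
    if (n_samples < c.window_len) {
        return 0;
    }
    return 1 + (n_samples - c.window_len) / c.hop_len;
}

// Number of LFR rows: ceil(n_fbank / lfr_n).
int64_t lfr_frames(int64_t n_fbank, const audio_config_sketch & c) {
    assert(c.lfr_n > 0);
    return (n_fbank + c.lfr_n - 1) / c.lfr_n;
}

// The backend output width must equal the text model embedding size.
bool embedding_dim_matches(const audio_config_sketch & c, int n_embd_text) {
    return c.hidden_size == n_embd_text;
}
```

Under these assumptions, one second of audio (16000 samples) gives 98 fbank frames and 17 LFR rows.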

Validation

Validated with the following end-to-end workflow.

1. Prepare the Fun-ASR-Nano-2512 model

cd <WORK_DIR>

GIT_LFS_SKIP_SMUDGE=1 git clone \
  https://www.modelscope.cn/FunAudioLLM/Fun-ASR-Nano-2512.git \
  Fun-ASR-Nano-2512

cd Fun-ASR-Nano-2512
git lfs pull --include="model.pt,multilingual.tiktoken,Qwen3-0.6B/*"

2. Export split ONNX and HF LLM weights

The split export script will be attached to this PR as additional review
material.

cd <WORK_DIR>

python3 Fun-ASR-Nano-2512/export_split_model.py \
  --model-dir Fun-ASR-Nano-2512 \
  --output-dir Fun-ASR-Nano-2512/split \
  --opset-version 13

This produces the SMT audio config, frontend/backend ONNX models, and extracted
Qwen3-0.6B HF-format LLM weights under <FUNASR_SPLIT_DIR> (the --output-dir
above, here Fun-ASR-Nano-2512/split).

3. Convert and quantize the text model

cd <LLAMA_CPP_DIR>

python3 convert_hf_to_gguf.py \
  <FUNASR_SPLIT_DIR>/Qwen3-0.6B \
  --outtype f16 \
  --outfile <FUNASR_SPLIT_DIR>/qwen3-0.6b-f16.gguf

./build/bin/llama-quantize \
  <FUNASR_SPLIT_DIR>/qwen3-0.6b-f16.gguf \
  <FUNASR_SPLIT_DIR>/qwen3-0.6b-q4km.gguf \
  Q4_K_M

4. Build llama.cpp with SMT support

cd <LLAMA_CPP_DIR>

export SPACEMIT_ORT_DIR=<SPACEMIT_ORT_DIR>
export LD_LIBRARY_PATH=${SPACEMIT_ORT_DIR}/lib:${LD_LIBRARY_PATH:-}

cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CPU_RISCV64_SPACEMIT=ON \
  -DGGML_CPU_REPACK=OFF \
  -DLLAMA_CURL=OFF \
  -DGGML_RVV=ON \
  -DGGML_RV_ZVFH=ON \
  -DGGML_RV_ZFH=ON \
  -DGGML_RV_ZICBOP=ON \
  -DGGML_RV_ZIHINTPAUSE=ON \
  -DGGML_RV_ZBA=ON \
  -DCMAKE_INSTALL_PREFIX=build/installed \
  -DCMAKE_TOOLCHAIN_FILE=${PWD}/cmake/riscv64-spacemit-linux-gnu-gcc.cmake \
  -DLLAMA_SERVER_SMT_VISION=ON \
  -DSPACEMIT_ORT_DIR=${SPACEMIT_ORT_DIR}

cmake --build build --parallel "$(nproc)" --config Release
cmake --install build --config Release

5. Run llama-server

SPLIT_DIR=<FUNASR_SPLIT_DIR>

<LLAMA_CPP_DIR>/build/installed/bin/llama-server \
  -m ${SPLIT_DIR}/qwen3-0.6b-q4km.gguf \
  --media-backend smt \
  --smt-config-dir ${SPLIT_DIR} \
  -t 4 \
  --host 0.0.0.0 \
  --port 8080 \
  --warmup \
  --no-webui

Observed startup:

[SMT][audio] warmup frontend ONNX session (FunASR): .../frontend.onnx
[SMT][audio] warmup backend ONNX session: .../backend.onnx
loaded multimodal model (smt), '.../split'
server is listening on http://0.0.0.0:8080

Benchmark

Accuracy summary from funasr-bq-q4km_accuracy_report.md, focusing on the
fast-speech sample 04_fast_666_16000hz_1ch.wav:

Model/config	Reference chars	Edit distance	Char accuracy	Clear errors
funasr-bq-q4km	544	51	90.62%	17
qwen3asr-ddq40	544	-	88.42%	-

For this sample, the FunASR split deployment scores 2.20 percentage points
higher in char accuracy (90.62% vs. 88.42%) than the qwen3asr-ddq40 baseline.
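
The funasr-bq-q4km row is consistent with char accuracy = 1 - edit distance / reference chars, since 1 - 51/544 = 0.90625. The exact metric definition used by the report is not given here, so the sketch below is only an assumed reading. It computes a character-level Levenshtein distance over Unicode code points (std::u32string, so that CJK characters count as one character each) and the resulting accuracy; both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Character-level Levenshtein distance using two rolling DP rows.
size_t edit_distance(const std::u32string & ref, const std::u32string & hyp) {
    std::vector<size_t> prev(hyp.size() + 1), cur(hyp.size() + 1);
    for (size_t j = 0; j <= hyp.size(); j++) {
        prev[j] = j; // turning an empty reference prefix into hyp[0..j) needs j insertions
    }
    for (size_t i = 1; i <= ref.size(); i++) {
        cur[0] = i; // deleting i reference characters
        for (size_t j = 1; j <= hyp.size(); j++) {
            const size_t cost = ref[i - 1] == hyp[j - 1] ? 0 : 1;
            cur[j] = std::min({ prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost });
        }
        std::swap(prev, cur);
    }
    return prev[hyp.size()];
}

// Assumed metric: 1 - edits / reference length.
double char_accuracy(size_t ref_chars, size_t edits) {
    assert(ref_chars > 0);
    return 1.0 - (double) edits / (double) ref_chars;
}
```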

Real-time factor (RTF) on gold.wav (lower is faster):

Model/config	RTF
funasr-bq-q4km	0.5076
qwen3asr-ddq40	0.203

The reported FunASR config uses quantized ONNX frontend/backend models
(frontend.q.onnx, backend.q.onnx) and a Q4_K_M GGUF text model
(qwen3-0.6b-q4km.gguf).

Additional information

The Fun-ASR-Nano-2512 export script used to prepare the split model is attached
to this PR for review:

export_split_model.py

The script exports:

- frontend.onnx: LFR features [1, T, 560] to hidden states [1, T, 512]
- backend.onnx: hidden states [1, T, 512] to audio embeddings [1, T, 1024]
- Qwen3-0.6B LLM weights for GGUF conversion

Requirements

- The existing Qwen3-ASR SMT audio path remains available when lfr_m is not set.
- FunASR/SenseVoice-style configs can provide lfr_m/lfr_n to enable the new path.
- Backend ONNX models with and without attention_mask are accepted.
- Audio embeddings are still validated against the configured hidden_size.
- The Fun-ASR-Nano-2512 split model has been exercised through llama-server
  with OpenAI-compatible input_audio requests.

Add Kaldi-compatible fbank extraction and LFR frame stacking helpers for FunASR/SenseVoice-style models. Parse lfr_m/lfr_n from audio config and route SMT audio encoding through the FunASR frontend shape when enabled. Also support SMT audio backends with either hidden_states-only or hidden_states-plus-attention_mask inputs, and add FunASR architecture detection in the SMT vision server.

Copilot

Pull request overview

Adds a FunASR/SenseVoice-style SMT audio encoding pipeline that computes Kaldi-like fbank features, applies LFR frame stacking, and routes audio through a frontend/backend ONNX split when audio_model.lfr_m > 0, while preserving the existing Qwen3-ASR mel/chunk path.

Changes:

- Extend SMT audio config parsing to support lfr_m / lfr_n and select an alternate FunASR encoding path when enabled.
- Add reusable helpers for Kaldi-style fbank extraction and LFR stacking in tools/mtmd/mtmd-audio.{h,cpp}.
- Relax backend ONNX signature handling to accept either {hidden_states} or {hidden_states, attention_mask}.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File	Description
tools/server/server-smt-vision.cpp	Adds a FunASR architecture helper (currently unused).
tools/mtmd/smt-audio-wrapper.cpp	Adds config fields and implements the FunASR fbank+LFR frontend path and optional backend attention mask handling.
tools/mtmd/mtmd-audio.h	Declares new fbank and LFR helper APIs for MTMD audio.
tools/mtmd/mtmd-audio.cpp	Implements fbank extraction + LFR stacking utilities used by the new FunASR path.


+static bool arch_is_funasr(const std::string & arch_name) {
+    return contains_icase(arch_name, "funasr");
+}
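
The hunk above relies on a contains_icase helper that is not shown in this excerpt. A minimal stand-in with the same apparent intent (ASCII case-insensitive substring search) might look like the sketch below. The _sketch names are hypothetical, and the real helper in the codebase may behave differently, for example on empty needles.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical stand-in: true if needle occurs in haystack, ignoring ASCII case.
static bool contains_icase_sketch(const std::string & haystack, const std::string & needle) {
    if (needle.empty()) {
        return true; // assumed convention: the empty string is contained everywhere
    }
    const auto it = std::search(haystack.begin(), haystack.end(), needle.begin(), needle.end(),
                                [](unsigned char a, unsigned char b) {
                                    return std::tolower(a) == std::tolower(b);
                                });
    return it != haystack.end();
}

// Same shape as the helper in the hunk, built on the stand-in above.
static bool arch_is_funasr_sketch(const std::string & arch_name) {
    return contains_icase_sketch(arch_name, "funasr");
}
```

With this reading, the "FunASRNano" architecture string from the tested config matches.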


+            const auto frontend_output_info = frontend_outputs[0].GetTensorTypeAndShapeInfo();
+            if (frontend_output_info.GetElementType() != ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT) {
+                throw std::runtime_error("SMT audio warmup frontend output must be float32");
+            }
+            auto shape    = frontend_output_info.GetShape();
+            warmup_t_out  = (int) shape[1];
+            warmup_hidden.resize((size_t) warmup_t_out * (size_t) d.config.d_model, 0.0f);
+            const float * frontend_output = frontend_outputs[0].GetTensorData<float>();
+            std::memcpy(warmup_hidden.data(), frontend_output, warmup_hidden.size() * sizeof(float));
+        } else {


+        const auto frontend_shape = frontend_outputs[0].GetTensorTypeAndShapeInfo().GetShape();
+        t_out                     = (int) frontend_shape[1];
+        hidden_states.resize((size_t) t_out * (size_t) d.config.d_model);
+        std::memcpy(hidden_states.data(), frontend_outputs[0].GetTensorData<float>(),
+                    hidden_states.size() * sizeof(float));
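
The copy above sizes the destination from shape[1] and the configured d_model. One defensive pattern, shown here as a sketch and not as part of this PR, is to check the reported shape against the expected [1, T, d_model] layout before copying. checked_frontend_frames is a hypothetical name; it takes the shape as a std::vector<int64_t>, the type returned by GetShape() in the ONNX Runtime C++ API.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical check: returns T if shape is [1, T, d_model] with T > 0,
// otherwise throws. A dynamic (negative) dimension is rejected as well.
int checked_frontend_frames(const std::vector<int64_t> & shape, int d_model) {
    if (shape.size() != 3 || shape[0] != 1 || shape[1] <= 0 || shape[2] != d_model) {
        throw std::runtime_error("SMT audio frontend output must have shape [1, T, d_model]");
    }
    return (int) shape[1];
}
```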


+    if (n_samples == 0 || n_mel <= 0) {
+        return false;
+    }
+


+    std::vector<float> window(frame_len);
+    for (int i = 0; i < frame_len; i++) {
+        window[i] = 0.54f - 0.46f * cosf(2.0f * (float) M_PI * i / frame_len);
+    }
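
For comparison, Kaldi's feature-window code defines the Hamming window with frame_len - 1 in the denominator, which makes both endpoints equal to 0.08, whereas the hunk above divides by frame_len. If exact agreement with Kaldi features is the goal, the window could be written as in the sketch below. The function name is hypothetical, and computing in double before narrowing to float is a choice made here, not something taken from the PR.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Kaldi-style symmetric Hamming window: w[i] = 0.54 - 0.46 * cos(2*pi*i / (N - 1)).
std::vector<float> kaldi_hamming_window(int frame_len) {
    assert(frame_len > 1);
    const double two_pi = 2.0 * std::acos(-1.0);
    const double a      = two_pi / (double) (frame_len - 1);
    std::vector<float> window((size_t) frame_len);
    for (int i = 0; i < frame_len; i++) {
        window[(size_t) i] = (float) (0.54 - 0.46 * std::cos(a * (double) i));
    }
    return window;
}
```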


+    if (n_frames <= 0 || lfr_n <= 0) {
+        return false;
+    }


+static void funasr_fft_inplace(float * data, int n) {
+    for (int i = 1, j = 0; i < n; i++) {
+        int bit = n >> 1;
+        for (; j & bit; bit >>= 1) {
+            j ^= bit;
+        }
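
The hunk above is cut off inside the bit-reversal loop. For readers who want to see how such a loop fits into a complete transform, here is a self-contained iterative radix-2 FFT sketch. It is not the PR's funasr_fft_inplace: it uses std::complex<float> instead of a raw float buffer (whose layout is not visible in this excerpt), and fft_inplace_sketch is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// In-place forward FFT (Cooley-Tukey, decimation in time). Size must be a power of two.
void fft_inplace_sketch(std::vector<std::complex<float>> & a) {
    const size_t n = a.size();
    assert(n > 0 && (n & (n - 1)) == 0);

    // Bit-reversal permutation: j tracks the bit-reversed value of i.
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) {
            j ^= bit;
        }
        j ^= bit;
        if (i < j) {
            std::swap(a[i], a[j]);
        }
    }

    // Butterfly stages of length 2, 4, ..., n.
    const double pi = std::acos(-1.0);
    for (size_t len = 2; len <= n; len <<= 1) {
        const double ang = -2.0 * pi / (double) len;
        const std::complex<float> wlen((float) std::cos(ang), (float) std::sin(ang));
        for (size_t i = 0; i < n; i += len) {
            std::complex<float> w(1.0f, 0.0f);
            for (size_t k = 0; k < len / 2; k++) {
                const std::complex<float> u = a[i + k];
                const std::complex<float> v = a[i + k + len / 2] * w;
                a[i + k]           = u + v;
                a[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}
```

As a sanity check, a unit impulse transforms to an all-ones spectrum and a constant signal transforms to a single non-zero bin at index 0.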


github-actions Bot added examples server labels Jun 15, 2026

alex-spacemit requested a review from Copilot June 22, 2026 06:39

Copilot started reviewing on behalf of alex-spacemit June 22, 2026 06:40

Copilot AI reviewed Jun 22, 2026

alex-spacemit merged commit cbfc6a2 into spacemit-com:spacemit-mtmd Jun 22, 2026
10 of 11 checks passed

alex-spacemit merged 1 commit into spacemit-com:spacemit-mtmd from LFF28:spacemit-mtmd