Skip to content

feat(mtmd): add FunASR fbank and LFR audio encoding path#8

Merged
alex-spacemit merged 1 commit into
spacemit-com:spacemit-mtmdfrom
LFF28:spacemit-mtmd
Jun 22, 2026
Merged

feat(mtmd): add FunASR fbank and LFR audio encoding path#8
alex-spacemit merged 1 commit into
spacemit-com:spacemit-mtmdfrom
LFF28:spacemit-mtmd

Conversation

@LFF28

@LFF28 LFF28 commented Jun 15, 2026

Copy link
Copy Markdown

Overview

This PR adds FunASR/SenseVoice-style audio support to the SMT multimodal audio path.

The new path is selected by audio_model.lfr_m > 0 in the SMT config. It computes
Kaldi-compatible fbank features, applies LFR frame stacking, runs the FunASR
frontend ONNX model, then projects the result with the backend ONNX model before
injecting the audio embeddings into the llama.cpp context.

This enables the Fun-ASR-Nano-2512 split deployment layout:

audio file
  -> Kaldi fbank + LFR [1, T, 560]
  -> frontend.onnx [1, T, 512]
  -> backend.onnx [1, T, 1024]
  -> GGUF Qwen3-0.6B text model

Changes

  • Add reusable Kaldi-compatible fbank extraction and LFR frame stacking helpers in
    tools/mtmd/mtmd-audio.{h,cpp}.
  • Parse lfr_m and lfr_n from audio_model config.
  • Route SMT audio encoding through the FunASR feature layout when LFR is enabled,
    while keeping the existing Qwen3-ASR mel/chunk path unchanged.
  • Support SMT audio backend ONNX signatures with either:
    • hidden_states
    • hidden_states + attention_mask
  • Add FunASR architecture detection in the SMT server glue.
  • Keep backend output validation based on hidden_size, so the final audio
    embedding dimension must still match the text model embedding size.

Model Contract

The tested Fun-ASR-Nano-2512 split config is:

{
  "architectures": ["FunASRNano"],
  "audio_model": {
    "frontend_model_path": "frontend.onnx",
    "backend_model_path": "backend.onnx",
    "d_model": 512,
    "hidden_size": 1024,
    "num_mel_bins": 80,
    "sample_rate": 16000,
    "n_fft": 400,
    "window_len": 400,
    "hop_len": 160,
    "lfr_m": 7,
    "lfr_n": 6
  }
}

hidden_size: 1024 is required because the backend ONNX output is injected as
LLM input embeddings and must match the Qwen3-0.6B GGUF embedding dimension.

Validation

Validated with the following end-to-end workflow.

1. Prepare the Fun-ASR-Nano-2512 model

cd <WORK_DIR>

GIT_LFS_SKIP_SMUDGE=1 git clone \
  https://www.modelscope.cn/FunAudioLLM/Fun-ASR-Nano-2512.git \
  Fun-ASR-Nano-2512

cd Fun-ASR-Nano-2512
git lfs pull --include="model.pt,multilingual.tiktoken,Qwen3-0.6B/*"

2. Export split ONNX and HF LLM weights

The split export script will be attached to this PR as additional review
material.

cd <WORK_DIR>

python3 Fun-ASR-Nano-2512/export_split_model.py \
  --model-dir Fun-ASR-Nano-2512 \
  --output-dir Fun-ASR-Nano-2512/split \
  --opset-version 13

This produces the SMT audio config, frontend/backend ONNX models, and extracted
Qwen3-0.6B HF-format LLM weights under <FUNASR_SPLIT_DIR>.

3. Convert and quantize the text model

cd <LLAMA_CPP_DIR>

python3 convert_hf_to_gguf.py \
  <FUNASR_SPLIT_DIR>/Qwen3-0.6B \
  --outtype f16 \
  --outfile <FUNASR_SPLIT_DIR>/qwen3-0.6b-f16.gguf

./build/bin/llama-quantize \
  <FUNASR_SPLIT_DIR>/qwen3-0.6b-f16.gguf \
  <FUNASR_SPLIT_DIR>/qwen3-0.6b-q4km.gguf \
  Q4_K_M

4. Build llama.cpp with SMT support

cd <LLAMA_CPP_DIR>

export SPACEMIT_ORT_DIR=<SPACEMIT_ORT_DIR>
export LD_LIBRARY_PATH=${SPACEMIT_ORT_DIR}/lib:${LD_LIBRARY_PATH:-}

cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CPU_RISCV64_SPACEMIT=ON \
  -DGGML_CPU_REPACK=OFF \
  -DLLAMA_CURL=OFF \
  -DGGML_RVV=ON \
  -DGGML_RV_ZVFH=ON \
  -DGGML_RV_ZFH=ON \
  -DGGML_RV_ZICBOP=ON \
  -DGGML_RV_ZIHINTPAUSE=ON \
  -DGGML_RV_ZBA=ON \
  -DCMAKE_INSTALL_PREFIX=build/installed \
  -DCMAKE_TOOLCHAIN_FILE=${PWD}/cmake/riscv64-spacemit-linux-gnu-gcc.cmake \
  -DLLAMA_SERVER_SMT_VISION=ON \
  -DSPACEMIT_ORT_DIR=${SPACEMIT_ORT_DIR}

cmake --build build --parallel "$(nproc)" --config Release
cmake --install build --config Release

5. Run llama-server

SPLIT_DIR=<FUNASR_SPLIT_DIR>

<LLAMA_CPP_DIR>/build/installed/bin/llama-server \
  -m ${SPLIT_DIR}/qwen3-0.6b-q4km.gguf \
  --media-backend smt \
  --smt-config-dir ${SPLIT_DIR} \
  -t 4 \
  --host 0.0.0.0 \
  --port 8080 \
  --warmup \
  --no-webui

Observed startup:

[SMT][audio] warmup frontend ONNX session (FunASR): .../frontend.onnx
[SMT][audio] warmup backend ONNX session: .../backend.onnx
loaded multimodal model (smt), '.../split'
server is listening on http://0.0.0.0:8080

Benchmark

Accuracy summary from funasr-bq-q4km_accuracy_report.md, focusing on the
04_fast_666_16000hz_1ch.wav
fast-speech sample:

Model/config Reference chars Edit distance Char accuracy Clear errors
funasr-bq-q4km 544 51 90.62% 17
qwen3asr-ddq40 544 - 88.42% -

For this sample, the FunASR split deployment is +2.20% over the qwen3asr-ddq40
baseline.

RTF on gold.wav:

Model/config RTF
funasr-bq-q4km 0.5076
qwen3asr-ddq40 0.203

The reported FunASR config uses quantized ONNX frontend/backend models
(frontend.q.onnx, backend.q.onnx) and a Q4_K_M GGUF text model
(qwen3-0.6b-q4km.gguf).

Additional information

The Fun-ASR-Nano-2512 export script used to prepare the split model is attached
to this PR for review:

export_split_model.py

The script exports:

  • frontend.onnx: LFR features [1, T, 560] to hidden states [1, T, 512]
  • backend.onnx: hidden states [1, T, 512] to audio embeddings [1, T, 1024]
  • Qwen3-0.6B LLM weights for GGUF conversion

Requirements

  • The existing Qwen3-ASR SMT audio path remains available when lfr_m is not set.
  • FunASR/SenseVoice-style configs can provide lfr_m/lfr_n to enable the new path.
  • Backend ONNX models with and without attention_mask are accepted.
  • Audio embeddings are still validated against the configured hidden_size.
  • The Fun-ASR-Nano-2512 split model has been exercised through llama-server
    with OpenAI-compatible input_audio requests.

Add Kaldi-compatible fbank extraction and LFR frame stacking helpers for
FunASR/SenseVoice-style models. Parse lfr_m/lfr_n from audio config and route
SMT audio encoding through the FunASR frontend shape when enabled.

Also support SMT audio backends with either hidden_states-only or
hidden_states-plus-attention_mask inputs, and add FunASR architecture detection
in the SMT vision server.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a FunASR/SenseVoice-style SMT audio encoding pipeline that computes Kaldi-like fbank features, applies LFR frame stacking, and routes audio through a frontend/backed ONNX split when audio_model.lfr_m > 0, while preserving the existing Qwen3-ASR mel/chunk path.

Changes:

  • Extend SMT audio config parsing to support lfr_m / lfr_n and select an alternate FunASR encoding path when enabled.
  • Add reusable helpers for Kaldi-style fbank extraction and LFR stacking in tools/mtmd/mtmd-audio.{h,cpp}.
  • Relax backend ONNX signature handling to accept either {hidden_states} or {hidden_states, attention_mask}.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File Description
tools/server/server-smt-vision.cpp Adds a FunASR architecture helper (currently unused).
tools/mtmd/smt-audio-wrapper.cpp Adds config fields and implements the FunASR fbank+LFR frontend path and optional backend attention mask handling.
tools/mtmd/mtmd-audio.h Declares new fbank and LFR helper APIs for MTMD audio.
tools/mtmd/mtmd-audio.cpp Implements fbank extraction + LFR stacking utilities used by the new FunASR path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +617 to +619
static bool arch_is_funasr(const std::string & arch_name) {
return contains_icase(arch_name, "funasr");
}
Comment on lines +691 to +700
const auto frontend_output_info = frontend_outputs[0].GetTensorTypeAndShapeInfo();
if (frontend_output_info.GetElementType() != ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT) {
throw std::runtime_error("SMT audio warmup frontend output must be float32");
}
auto shape = frontend_output_info.GetShape();
warmup_t_out = (int) shape[1];
warmup_hidden.resize((size_t) warmup_t_out * (size_t) d.config.d_model, 0.0f);
const float * frontend_output = frontend_outputs[0].GetTensorData<float>();
std::memcpy(warmup_hidden.data(), frontend_output, warmup_hidden.size() * sizeof(float));
} else {
Comment on lines +822 to +826
const auto frontend_shape = frontend_outputs[0].GetTensorTypeAndShapeInfo().GetShape();
t_out = (int) frontend_shape[1];
hidden_states.resize((size_t) t_out * (size_t) d.config.d_model);
std::memcpy(hidden_states.data(), frontend_outputs[0].GetTensorData<float>(),
hidden_states.size() * sizeof(float));
Comment thread tools/mtmd/mtmd-audio.cpp
Comment on lines +1177 to +1180
if (n_samples == 0 || n_mel <= 0) {
return false;
}

Comment thread tools/mtmd/mtmd-audio.cpp
Comment on lines +1198 to +1201
std::vector<float> window(frame_len);
for (int i = 0; i < frame_len; i++) {
window[i] = 0.54f - 0.46f * cosf(2.0f * (float) M_PI * i / frame_len);
}
Comment thread tools/mtmd/mtmd-audio.cpp
Comment on lines +1242 to +1244
if (n_frames <= 0 || lfr_n <= 0) {
return false;
}
Comment thread tools/mtmd/mtmd-audio.cpp
Comment on lines +1096 to +1101
static void funasr_fft_inplace(float * data, int n) {
for (int i = 1, j = 0; i < n; i++) {
int bit = n >> 1;
for (; j & bit; bit >>= 1) {
j ^= bit;
}
@alex-spacemit alex-spacemit merged commit cbfc6a2 into spacemit-com:spacemit-mtmd Jun 22, 2026
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants