Eval bug: CUDA: llama-bench prefill crashes on MoE/SSM models (ggml_cuda_mul_mat_q / mm_ids_helper) — regression since a6cc43c28; llama-cli OK #24937

Description

@youyoulyz

Name and Version

llama-cli --version (repro build, commit 32120c1):

ggml_cuda_init: found 2 CUDA devices (Total VRAM: 35896 MiB):
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20047 MiB
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
version: 9668 (32120c1)
built with GNU 13.3.0 for Linux x86_64

Crash tool: llama-bench (same build tree; llama-bench does not support --version)

Known-good baseline for comparison:
version: 8857 (a6cc43c), same command prefill PASS on RTX 3080

Operating systems

Linux

GGML backends

CUDA

Hardware

3900X + 3080-20G/ 5060Ti 16GB (CUDA 13.2)

Models

https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF Q4_K_M Q8_0

Problem description & steps to reproduce

Eval bug: llama-bench prefill aborts on MoE models with CUDA error: invalid argument in ggml_cuda_mul_mat_q

Name and Version

Broken build (reproduced):

version: 9668 (32120c10e)
built with GNU 13.3.0 for Linux x86_64

Also present on latest master (code unchanged):

dec5ca5577d6042b4e870fadf4087c5b9b8d3a70 (2026-06-22)
# ggml/src/ggml-cuda/mmq.cu is byte-identical to 32120c10 for the MoE ids path

Known-good baseline:

a6cc43c286a2ebc429aa69b9a4d16de082cedb51 (2026-04-20)
version reports as 0 (unknown) on our pinned build

Operating systems

Linux (Ubuntu 24.04, kernel 6.x)

GGML backends

CUDA

Hardware

Reproduced on both:

| GPU | Compute capability | VRAM | Driver |
|---|---|---|---|
| NVIDIA GeForce RTX 3080 | 8.6 (Ampere) | 20 GB | 595.71.05 |
| NVIDIA GeForce RTX 5060 Ti | 12.0 (Blackwell) | 16 GB | 595.71.05 |

Build: -DGGML_CUDA=ON, CMAKE_CUDA_ARCHITECTURES=86;89;120a, Release.

Models

| Model | GGUF | Architecture | Reproduces |
|---|---|---|---|
| LFM2.5-8B-A1B | LFM2.5-8B-A1B-Q4_K_M.gguf (~5 GB) | MoE + SSM hybrid | Yes |
| Qwen3.6-35B-A3B | Qwen3.6-35B-A3B-UD-Q4_K_M.gguf (~22 GB) | Sparse MoE | Yes (same stack) |

Dense models (no crash with same protocol):

  • Meta-Llama-3.1-8B-Instruct Q4_K_M
  • Qwen2.5-7B-Instruct Q4_K_M
  • Gemma-4-12B-it Q4_K_M

Problem description & steps to reproduce

llama-bench aborts during prompt processing (prefill) on MoE / MoE+SSM models when -p > 0. The failure is in the CUDA MMQ path for expert-routed matrix multiplies (ids != nullptr), specifically right after launching ggml_cuda_launch_mm_ids_helper.

Important: The same binary and model work fine with:

  • llama-cli interactive generation (-p 64 -n 8 etc.)
  • llama-bench decode-only (-p 0 -n 128)

So this is not a general "MoE can't run on CUDA" failure — it is tied to llama-bench's prefill / prompt-eval path (or batch shape used there).

Minimal repro (LFM2.5-8B, single GPU)

# Fails — prefill only
llama-bench -m LFM2.5-8B-A1B-Q4_K_M.gguf \
  -p 512 -n 0 -ngl 99 -t 1 -r 1 --no-warmup

# Fails — standard bench config (prefill then decode)
llama-bench -m LFM2.5-8B-A1B-Q4_K_M.gguf \
  -p 512 -n 128 -ngl 99 -t 1 -r 5 --no-warmup -o json

# Works — decode-only (same binary, same model)
llama-bench -m LFM2.5-8B-A1B-Q4_K_M.gguf \
  -p 0 -n 128 -ngl 99 -t 1 -r 5 --no-warmup -o json

# Works — llama-cli
llama-cli -m LFM2.5-8B-A1B-Q4_K_M.gguf \
  -p 64 -n 8 -ngl 99 -t 4 --simple-io -st

Set CUDA_VISIBLE_DEVICES=0 to pin the run to a single GPU when testing on multi-GPU hosts.

Regression check

| Commit | Date | llama-bench -p 512 -n 0, LFM Q4_K_M |
|---|---|---|
| a6cc43c28 | 2026-04-20 | PASS (~2120 t/s prefill on RTX 3080) |
| 32120c10 | 2026-06-16 | FAIL (CUDA abort) |
| dec5ca55 (master) | 2026-06-22 | Expected FAIL (mmq.cu MoE ids block unchanged vs 32120c1) |

We have not bisected the exact introducing commit between a6cc43c28 and 32120c10 (~811 commits). The crash site in mmq.cu (MoE ids branch) is present in both commits; the regression may be in batch/ubatch construction, graph scheduling, or kernel launch parameters rather than the helper call itself.
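With roughly 811 commits in the range, a bisect would need about ten build-and-test steps. The sketch below is one possible way to automate that with `git bisect run`; it is not something we ran. The function names `bisect_status` and `bisect_step`, the `MODEL` variable, the `llama-bench` CMake target name and the `build/bin/llama-bench` output path are assumptions for illustration. The status mapping matters because an aborting process usually exits with a status above 127 (for example 134 for SIGABRT), which `git bisect run` treats as a request to stop the bisect rather than as a bad commit.

```shell
# Sketch only: helper names, MODEL, and the build/bin path are assumptions.

# Map a raw exit status to what `git bisect run` expects:
#   0 = good, 1..124 and 126..127 = bad, 125 = skip, anything else aborts the bisect.
bisect_status() {
    case "$1" in
        0) echo 0 ;;   # bench completed: commit is good
        *) echo 1 ;;   # any failure, including a 134 abort: commit is bad
    esac
}

# Build the current checkout and run the prefill-only repro once.
bisect_step() {
    # A commit that does not build tells us nothing about the crash, so skip it.
    cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release >/dev/null 2>&1 || return 125
    cmake --build build --target llama-bench -j >/dev/null 2>&1 || return 125

    ./build/bin/llama-bench -m "$MODEL" \
        -p 512 -n 0 -ngl 99 -t 1 -r 1 --no-warmup >/dev/null 2>&1
    rc=$?
    return "$(bisect_status "$rc")"
}

# Possible usage, with this file saved as bisect_step.sh (hypothetical name):
#   git bisect start 32120c10 a6cc43c28
#   git bisect run sh -c '. ./bisect_step.sh && bisect_step'
```

Each step includes a CUDA rebuild, so restricting CMAKE_CUDA_ARCHITECTURES to the GPU under test should shorten the cycle.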

Expected behavior

llama-bench -p 512 -n 128 should complete and print JSON timing rows for MoE models, as it does for dense models and as it did on a6cc43c28.

Actual behavior

Process aborts with CUDA error during prefill on MoE/SSM models.

Impact

  • llama-bench is unusable for MoE model prefill/decode sweeps on recent master (blocks quantization / tile / mmq_x benchmarking on LFM, Qwen3.6 MoE, gpt-oss, etc.)
  • llama-cli remains usable — we use it as a workaround for e2e timing
  • llama-bench -p 0 -n N remains usable for decode-only MoE benchmarks

Workarounds

  1. Pin to a6cc43c28 (or earlier) for MoE llama-bench prefill experiments
  2. Use llama-bench -p 0 -n 128 for MoE decode-only metrics on recent master
  3. Use llama-cli + logged tok/s for MoE e2e (less convenient than llama-bench -o json)

Possibly related issues

This report focuses specifically on llama-bench prefill regression with a pinned good/bad commit pair and dense-vs-MoE control on RTX 3080 + 5060 Ti.

Additional notes for maintainers

  • LFM2.5 decode path is dominated by mul_mat_vec_q (MMVQ), not MMQ — our separate MMVQ nwarps work is orthogonal to this crash
  • Crash occurs in prefill MMQ when routing experts (ids tensor present) — consistent with MoE FFN / gate paths during prompt eval
  • The SYCL MoE prefill fix ("SYCL: fix use-after-free bug with async memcpy in MoE prefill", #24676) does not touch CUDA mmq.cu

First Bad Commit

Likely first bad commit: 9725a313be0528214c4a02fed906ddaf7b3f712e (2026-04-25)

9725a313b  CUDA: reduce MMQ stream-k overhead (#22298)
Author: Johannes Gäßler
Date:   Sat Apr 25 14:15:03 2026 +0200

Confidence: high (mechanism confirmed; single-file revert rebuild inconclusive due to long CUDA rebuild cycle — see below).

What we verified (no full bisect rebuild needed)

| Build | Commit | -ub 512 (default) | -ub 9 | -ub 8 |
|---|---|---|---|---|
| Good (llama.cpp-bench) | a6cc43c28 | PASS | PASS | PASS |
| Bad (llama.cpp-nvidia) | 32120c10 | FAIL | FAIL | PASS |
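The matrix above can be reproduced with a small sweep over `-ub` values. This is a minimal sketch, not the exact script we used: `sweep_ubatch` is a hypothetical helper, and `LLAMA_BENCH` and `MODEL` are assumed environment overrides that default to the binary and model from the minimal repro.

```shell
# Sketch only: sweep_ubatch, LLAMA_BENCH and MODEL are illustrative names.
# Runs one prefill-only llama-bench pass per ubatch size and prints PASS or FAIL
# based on the exit status. Plain sh has no `local`, so bench/model/ub are globals.
sweep_ubatch() {
    bench=${LLAMA_BENCH:-llama-bench}
    model=${MODEL:-LFM2.5-8B-A1B-Q4_K_M.gguf}
    for ub in "$@"; do
        if "$bench" -m "$model" -p 512 -n 0 -ngl 99 -t 1 -r 1 \
               --no-warmup -ub "$ub" >/dev/null 2>&1; then
            echo "ub=$ub PASS"
        else
            echo "ub=$ub FAIL"
        fi
    done
}

# Possible usage on a multi-GPU host:
#   CUDA_VISIBLE_DEVICES=0 sweep_ubatch 8 9 512
```

Running the same sweep against both build trees (by pointing `LLAMA_BENCH` at each binary) gives one table row per build.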

The threshold is n_ubatch > 8, which matches MMVQ_MAX_BATCH_SIZE in the CUDA dispatcher:

  • ne2 ≤ 8 → MoE MMVQ path (ggml_cuda_mul_mat_vec_q with ids) → works on bad commit
  • ne2 > 8 → MoE MMQ path (ggml_cuda_mul_mat_q with ids, ggml_cuda_launch_mm_ids_helper) → crashes on bad commit

So this is not a llama-bench-only bug — any prefill with ubatch > 8 hits the broken MoE MMQ path. llama-bench defaults (-ub 512) just make it obvious.
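The cutoff described above can be written down as a tiny model. This is a sketch of the observed behaviour only, not the dispatcher source: `moe_cuda_path` is a hypothetical name, and the value 8 for `MMVQ_MAX_BATCH_SIZE` is taken from this report rather than copied from the ggml-cuda headers.

```shell
# Sketch only: models the PASS/FAIL boundary seen in the -ub sweep.
# The value 8 is taken from this report, not read from the ggml-cuda source.
MMVQ_MAX_BATCH_SIZE=8

# $1: ne2 as named in this report (it follows the ubatch size during prefill).
# Prints which MoE matmul path the report says is taken for that size.
moe_cuda_path() {
    if [ "$1" -le "$MMVQ_MAX_BATCH_SIZE" ]; then
        echo "mmvq"   # ggml_cuda_mul_mat_vec_q with ids: works on the bad commit
    else
        echo "mmq"    # ggml_cuda_mul_mat_q with ids: crashes on the bad commit
    fi
}
```

Under this model, `-ub 8` stays on the MMVQ path while `-ub 9` and the default `-ub 512` both land on the MMQ path, which matches the table.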

On a6cc43c28, the same MoE MMQ path works at -ub 9 and -ub 512, so the regression is in CUDA MMQ/MoE handling, not in batch construction per se.

Why 9725a313b is the prime suspect

Between a6cc43c28 (good) and 32120c10 (bad), only two commits touch ggml/src/ggml-cuda/mmq.cu / mmq.cuh:

| Commit | Date | Change |
|---|---|---|
| 9725a313b | 2026-04-25 | mmq.cuh only. Stream-k refactor: kbc/kbc_stop changed from int64_t to int32_t, new fastdiv/fastmodulo; MoE stream-k fixup (ids_dst, expert_bounds) rewritten |
| fc2b0053f | 2026-04-29 | mmq.cu NVFP4/MXFP4 naming; irrelevant for Q4_K_M on Ampere |

ggml/src/ggml-cuda/mmid.cu (the mm_ids_helper kernel) is byte-identical at both endpoints — the helper source did not change; the failure is in the MMQ stream-k path taken once batch exceeds MMVQ cutoff.

Parent (expected good): d1649047a  metal : optimize Metal Tensor API usage for GGML_OP_MUL_MAT (#20962)

What does not fix it

  • -fa off — still crashes (not flash-attn default change)
  • GGML_CUDA_DISABLE_FUSION=1 — still crashes
  • -ub 8 works, unlike the two options above (workaround; forces the MoE MMVQ path)

Relevant log output

Console (32120c1, RTX 3080):

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 20047 MiB):
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20047 MiB
/home/.../llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:103: CUDA error

Backtrace (32120c1, libggml-cuda.so):

ggml_cuda_error(...)
ggml_cuda_mul_mat_q(...)                    # ggml-cuda/mmq.cu
ggml_backend_cuda_graph_compute(...)
ggml_backend_sched_graph_compute_async(...)
llama_context::graph_compute(...)
llama_context::process_ubatch(...)
llama_context::decode

[04_bad_32120c10_ub9_blocking_crash.log](https://github.com/user-attachments/files/29242002/04_bad_32120c10_ub9_blocking_crash.log)
[03_good_a6cc43c28_ub512_pass.log](https://github.com/user-attachments/files/29242003/03_good_a6cc43c28_ub512_pass.log)
[02_bad_32120c10_ub8_pass.log](https://github.com/user-attachments/files/29242001/02_bad_32120c10_ub8_pass.log)
[01_bad_32120c10_ub512_crash.log](https://github.com/user-attachments/files/29242000/01_bad_32120c10_ub512_crash.log)

[01_bad_32120c10_ub512_crash.log](https://github.com/user-attachments/files/29242018/01_bad_32120c10_ub512_crash.log)
