llama-bench aborts during prompt processing (prefill) on MoE / MoE+SSM models when -p > 0. The failure is in the CUDA MMQ path for expert-routed matrix multiplies (ids != nullptr), specifically right after launching ggml_cuda_launch_mm_ids_helper.
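Important: The same binary and model work fine with:
llama-cli interactive generation (-p 64 -n 8 etc.)
llama-bench decode-only (-p 0 -n 128)
So this is not a general "MoE can't run on CUDA" failure; it is tied to llama-bench's prefill / prompt-eval path (or batch shape used there).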
Set CUDA_VISIBLE_DEVICES=0 when testing multi-GPU hosts.
Regression check
| Commit | Date | llama-bench -p 512 -n 0 LFM Q4_K_M |
| --- | --- | --- |
| a6cc43c28 | 2026-04-20 | PASS (~2120 t/s prefill on RTX 3080) |
| 32120c10 | 2026-06-16 | FAIL (CUDA abort) |
| dec5ca55 (master) | 2026-06-22 | Expected FAIL (mmq.cu MoE ids block unchanged vs 32120c1) |
We have not bisected the exact introducing commit between a6cc43c28 and 32120c10 (~811 commits). The crash site in mmq.cu (MoE ids branch) is present in both commits; the regression may be in batch/ubatch construction, graph scheduling, or kernel launch parameters rather than the helper call itself.
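For scale, a full bisect over a range this size is bounded by a small number of rebuilds. The sketch below only does the arithmetic; the function name is illustrative, and the figure is an upper-bound estimate for a plain binary search (git bisect may need one round more or fewer depending on history shape).

```python
import math

def bisect_steps(n_commits: int) -> int:
    """Estimate build-and-test rounds for a binary search over n_commits candidates."""
    return math.ceil(math.log2(n_commits))

# ~811 commits separate a6cc43c28 (good) and 32120c10 (bad) per this report.
print(bisect_steps(811))  # 10, since 2**9 = 512 < 811 <= 1024 = 2**10
```

With a long CUDA rebuild per round, about ten rounds is still a substantial cost, which is why narrowing the suspect set by file history is attractive.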
Expected behavior
llama-bench -p 512 -n 128 should complete and print JSON timing rows for MoE models, as it does for dense models and as it did on a6cc43c28.
Actual behavior
Process aborts with CUDA error during prefill on MoE/SSM models.
Impact
llama-bench is unusable for MoE model prefill/decode sweeps on recent master (blocks quantization / tile / mmq_x benchmarking on LFM, Qwen3.6 MoE, gpt-oss, etc.)
llama-cli remains usable — we use it as a workaround for e2e timing
llama-bench -p 0 -n N remains usable for decode-only MoE benchmarks
Workarounds
Pin to a6cc43c28 (or earlier) for MoE llama-bench prefill experiments
Use llama-bench -p 0 -n 128 for MoE decode-only metrics on recent master
Use llama-cli + logged tok/s for MoE e2e (less convenient than llama-bench -o json)
Likely first bad commit: 9725a313be0528214c4a02fed906ddaf7b3f712e (2026-04-25)
9725a313b CUDA: reduce MMQ stream-k overhead (#22298)
Author: Johannes Gäßler
Date: Sat Apr 25 14:15:03 2026 +0200
Confidence: high (mechanism confirmed; single-file revert rebuild inconclusive due to long CUDA rebuild cycle — see below).
What we verified (no full bisect rebuild needed)
| Build | Commit | -ub 512 (default) | -ub 9 | -ub 8 |
| --- | --- | --- | --- | --- |
| Good (llama.cpp-bench) | a6cc43c28 | PASS | PASS | PASS |
| Bad (llama.cpp-nvidia) | 32120c10 | FAIL | FAIL | PASS |
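The PASS/FAIL grid above could be reproduced with a small harness along these lines. This is a sketch, not the script used for this report: the helper names, the model path passed in, and the assumption that llama-bench is on PATH are all hypothetical. It uses only the llama-bench flags already mentioned in this issue (-p, -n, -ub, -o json) plus -m for the model.

```python
import os
import subprocess

def bench_cmd(model_path, ubatch, n_prompt=512, n_gen=0, binary="llama-bench"):
    """Build a prefill-only llama-bench command line for one ubatch size."""
    return [binary, "-m", model_path,
            "-p", str(n_prompt), "-n", str(n_gen),
            "-ub", str(ubatch), "-o", "json"]

def run_sweep(model_path, ubatches=(512, 9, 8), runner=subprocess.run):
    """Run each ubatch size once; a non-zero exit (including an abort) counts as FAIL."""
    # Pin to a single GPU, as recommended for multi-GPU hosts.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")
    results = {}
    for ub in ubatches:
        proc = runner(bench_cmd(model_path, ub), env=env,
                      capture_output=True, text=True)
        results[ub] = "PASS" if proc.returncode == 0 else "FAIL"
    return results
```

Calling run_sweep with the path to a local LFM2.5-8B-A1B-Q4_K_M.gguf on each build tree would yield one row of the table per build. The runner parameter exists only so the logic can be exercised without a GPU.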
The threshold is n_ubatch > 8, which matches MMVQ_MAX_BATCH_SIZE in the CUDA dispatcher:
ne2 ≤ 8 → MoE MMVQ path (ggml_cuda_mul_mat_vec_q with ids) → works on bad commit
ne2 > 8 → MoE MMQ path (ggml_cuda_mul_mat_q with ids, ggml_cuda_launch_mm_ids_helper) → crashes on bad commit
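The cutoff described above can be modelled in a few lines. This is a deliberately simplified illustration, not the actual dispatcher: the real CUDA code also considers quantization type and GPU architecture, and the constant's value of 8 is inferred here from the -ub 8 / -ub 9 boundary observed in this report.

```python
# Assumed value, inferred from the observed -ub 8 (PASS) / -ub 9 (FAIL) boundary.
MMVQ_MAX_BATCH_SIZE = 8

def moe_mul_mat_path(ne2: int) -> str:
    """Return which MoE matmul path a batch of ne2 tokens would take in this model."""
    return "MMVQ" if ne2 <= MMVQ_MAX_BATCH_SIZE else "MMQ"

for ub in (8, 9, 512):
    print(ub, moe_mul_mat_path(ub))
```

Under this model, -ub 8 stays on the MMVQ path while -ub 9 and the default -ub 512 both reach the MMQ path, matching the failures seen on the bad build.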
So this is not a llama-bench-only bug — any prefill with ubatch > 8 hits the broken MoE MMQ path. llama-bench defaults (-ub 512) just make it obvious.
On a6cc43c28, the same MoE MMQ path works at -ub 9 and -ub 512, so the regression is in CUDA MMQ/MoE handling, not in batch construction per se.
Why 9725a313b is the prime suspect
Between a6cc43c28 (good) and 32120c10 (bad), only two commits touch ggml/src/ggml-cuda/mmq.cu / mmq.cuh:
| Commit | Date | Change |
| --- | --- | --- |
| 9725a313b | 2026-04-25 | mmq.cuh only; stream-k refactor: kbc/kbc_stop from int64_t → int32_t, new fastdiv/fastmodulo; MoE stream-k fixup (ids_dst, expert_bounds) rewritten |
| fc2b0053f | 2026-04-29 | mmq.cu NVFP4/MXFP4 naming; irrelevant for Q4_K_M on Ampere |
ggml/src/ggml-cuda/mmid.cu (the mm_ids_helper kernel) is byte-identical at both endpoints — the helper source did not change; the failure is in the MMQ stream-k path taken once batch exceeds MMVQ cutoff.
Parent (expected good): d1649047a, "metal : optimize Metal Tensor API usage for GGML_OP_MUL_MAT (#20962)"
What does not fix it
-fa off — still crashes (not flash-attn default change)
Name and Version
llama-cli --version (repro build, commit 32120c1):
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 35896 MiB):
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20047 MiB
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
version: 9668 (32120c1)
built with GNU 13.3.0 for Linux x86_64
Crash tool: llama-bench (same build tree; llama-bench does not support --version)
Known-good baseline for comparison:
version: 8857 (a6cc43c), same command prefill PASS on RTX 3080
Operating systems
Linux
GGML backends
CUDA
Hardware
3900X + RTX 3080 20 GB / RTX 5060 Ti 16 GB (CUDA 13.2)
Models
https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF (Q4_K_M, Q8_0)
Problem description & steps to reproduce
Eval bug: llama-bench prefill aborts on MoE models with CUDA error: invalid argument in ggml_cuda_mul_mat_q
Name and Version
Broken build (reproduced):
Also present on latest master (code unchanged):
Known-good baseline:
Operating systems
Linux (Ubuntu 24.04, kernel 6.x)
GGML backends
CUDA
Hardware
Reproduced on both:
Build: -DGGML_CUDA=ON, CMAKE_CUDA_ARCHITECTURES=86;89;120a, Release.
Models
LFM2.5-8B-A1B-Q4_K_M.gguf (~5 GB)
Qwen3.6-35B-A3B-UD-Q4_K_M.gguf (~22 GB)
Dense models (no crash with same protocol):
Problem description & steps to reproduce
llama-bench aborts during prompt processing (prefill) on MoE / MoE+SSM models when -p > 0. The failure is in the CUDA MMQ path for expert-routed matrix multiplies (ids != nullptr), specifically right after launching ggml_cuda_launch_mm_ids_helper.
Important: The same binary and model work fine with:
llama-cli interactive generation (-p 64 -n 8 etc.)
llama-bench decode-only (-p 0 -n 128)
So this is not a general "MoE can't run on CUDA" failure; it is tied to llama-bench's prefill / prompt-eval path (or batch shape used there).
Minimal repro (LFM2.5-8B, single GPU)
Set CUDA_VISIBLE_DEVICES=0 when testing multi-GPU hosts.
Possibly related issues
ggml_cuda_mul_mat_q @ mmq.cu:~179, invalid argument via llama-server (open, regression/CUDA)
MUL_MAT_ID MMVQ invalid launch (open)
This report focuses specifically on the llama-bench prefill regression, with a pinned good/bad commit pair and a dense-vs-MoE control on RTX 3080 + 5060 Ti.
Additional notes for maintainers
mul_mat_vec_q (MMVQ), not MMQ; our separate MMVQ nwarps work is orthogonal to this crash
ids tensor present), consistent with MoE FFN / gate paths during prompt eval
mmq.cu
First Bad Commit
Likely first bad commit: 9725a313be0528214c4a02fed906ddaf7b3f712e (2026-04-25)
What does not fix it
-fa off: still crashes (not a flash-attn default change)
GGML_CUDA_DISABLE_FUSION=1: still crashes
-ub 8: works (workaround: forces the MMVQ MoE path)
Relevant log output
Console (32120c1, RTX 3080):
Backtrace (32120c1, libggml-cuda.so):