Eval bug: CUDA: llama-bench prefill crashes on MoE/SSM models (ggml_cuda_mul_mat_q / mm_ids_helper), regression since a6cc43c28; llama-cli OK #24937

### Name and Version

llama-cli --version (repro build, commit 32120c10):

ggml_cuda_init: found 2 CUDA devices (Total VRAM: 35896 MiB):
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20047 MiB
  Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
version: 9668 (32120c10e)
built with GNU 13.3.0 for Linux x86_64

Crash tool: llama-bench (same build tree; llama-bench does not support --version)

Known-good baseline for comparison:
version: 8857 (a6cc43c28), same command prefill PASS on RTX 3080

### Operating systems

Linux

### GGML backends

CUDA

### Hardware

3900X + 3080 20GB / 5060 Ti 16GB (CUDA 13.2)

### Models

https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF Q4_K_M Q8_0

### Problem description & steps to reproduce



## Eval bug: `llama-bench` prefill aborts on MoE models with `CUDA error: invalid argument` in `ggml_cuda_mul_mat_q`

### Name and Version

**Broken build (reproduced):**

```
version: 9668 (32120c10e)
built with GNU 13.3.0 for Linux x86_64
```

**Also present on latest master (code unchanged):**

```
dec5ca5577d6042b4e870fadf4087c5b9b8d3a70 (2026-06-22)
# ggml/src/ggml-cuda/mmq.cu is byte-identical to 32120c10 for the MoE ids path
```

**Known-good baseline:**

```
a6cc43c286a2ebc429aa69b9a4d16de082cedb51 (2026-04-20)
version reports as 0 (unknown) on our pinned build
```

### Operating systems

Linux (Ubuntu 24.04, kernel 6.x)

### GGML backends

CUDA

### Hardware

Reproduced on **both**:

| GPU | Compute capability | VRAM | Driver |
|-----|-------------------|------|--------|
| NVIDIA GeForce RTX 3080 | 8.6 (Ampere) | 20 GB | 595.71.05 |
| NVIDIA GeForce RTX 5060 Ti | 12.0 (Blackwell) | 16 GB | 595.71.05 |

Build: `-DGGML_CUDA=ON`, `CMAKE_CUDA_ARCHITECTURES=86;89;120a`, Release.

### Models

| Model | GGUF | Architecture | Reproduces |
|-------|------|--------------|------------|
| LFM2.5-8B-A1B | `LFM2.5-8B-A1B-Q4_K_M.gguf` (~5 GB) | MoE + SSM hybrid | **Yes** |
| Qwen3.6-35B-A3B | `Qwen3.6-35B-A3B-UD-Q4_K_M.gguf` (~22 GB) | Sparse MoE | **Yes** (same stack) |

Dense models (**no crash** with same protocol):

- Meta-Llama-3.1-8B-Instruct Q4_K_M
- Qwen2.5-7B-Instruct Q4_K_M
- Gemma-4-12B-it Q4_K_M

### Problem description & steps to reproduce

`llama-bench` aborts during **prompt processing (prefill)** on MoE / MoE+SSM models when `-p > 0`. The failure is in the CUDA **MMQ** path for **expert-routed** matrix multiplies (`ids != nullptr`), specifically right after launching `ggml_cuda_launch_mm_ids_helper`.

**Important:** The same binary and model work fine with:

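- `llama-cli` interactive generation (`-p 64 -n 8` etc.; note that `llama-cli` interprets `-p` as prompt text, so `-p 64` submits a short literal prompt rather than a 64-token prefill)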
- `llama-bench` **decode-only** (`-p 0 -n 128`)

So this is **not** a general "MoE can't run on CUDA" failure — it is tied to **`llama-bench`'s prefill / prompt-eval path** (or batch shape used there).

#### Minimal repro (LFM2.5-8B, single GPU)

```bash
# Fails — prefill only
llama-bench -m LFM2.5-8B-A1B-Q4_K_M.gguf \
  -p 512 -n 0 -ngl 99 -t 1 -r 1 --no-warmup

# Fails — standard bench config (prefill then decode)
llama-bench -m LFM2.5-8B-A1B-Q4_K_M.gguf \
  -p 512 -n 128 -ngl 99 -t 1 -r 5 --no-warmup -o json

# Works — decode-only (same binary, same model)
llama-bench -m LFM2.5-8B-A1B-Q4_K_M.gguf \
  -p 0 -n 128 -ngl 99 -t 1 -r 5 --no-warmup -o json

# Works — llama-cli
llama-cli -m LFM2.5-8B-A1B-Q4_K_M.gguf \
  -p 64 -n 8 -ngl 99 -t 4 --simple-io -st
```

Set `CUDA_VISIBLE_DEVICES=0` when testing multi-GPU hosts.

#### Regression check

| Commit | Date | `llama-bench -p 512 -n 0` LFM Q4_K_M |
|--------|------|--------------------------------------|
| `a6cc43c28` | 2026-04-20 | **PASS** (~2120 t/s prefill on RTX 3080) |
| `32120c10` | 2026-06-16 | **FAIL** (CUDA abort) |
| `dec5ca55` (master) | 2026-06-22 | **Expected FAIL** (`mmq.cu` MoE ids block unchanged vs 32120c10) |

We have **not** bisected the exact introducing commit between `a6cc43c28` and `32120c10` (~811 commits). The crash site in `mmq.cu` (MoE `ids` branch) is present in both commits; the regression may be in batch/ubatch construction, graph scheduling, or kernel launch parameters rather than the helper call itself.
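
Since the range has not been bisected, one way to automate it is `git bisect run` with a small helper. The sketch below is an outline under stated assumptions, not something we ran: the helper names `bisect_verdict` and `bisect_step`, the `build/bin/llama-bench` output path, the single-architecture build (`86` only, to shorten rebuilds), and the `MODEL` variable are all hypothetical choices. It maps any crash to exit code 1, because `git bisect run` stops bisecting on exit codes above 127 (for example 134 from an abort), and it returns 125 to skip commits that fail to build.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for `git bisect run`; names and paths are assumptions.
MODEL=${MODEL:-LFM2.5-8B-A1B-Q4_K_M.gguf}

# Translate build/bench exit statuses into git-bisect codes:
# 0 = good, 1 = bad, 125 = skip this commit.
bisect_verdict() {
  local build_rc=$1 bench_rc=$2
  if (( build_rc != 0 )); then
    echo 125
  elif (( bench_rc == 0 )); then
    echo 0
  else
    echo 1   # collapses e.g. 134 (SIGABRT); codes above 127 would abort the bisect
  fi
}

# Build the current checkout and run the failing prefill-only benchmark.
bisect_step() {
  local build_rc=0 bench_rc=0
  {
    cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 \
          -DCMAKE_BUILD_TYPE=Release &&
    cmake --build build --target llama-bench -j
  } >/dev/null 2>&1 || build_rc=$?
  if (( build_rc == 0 )); then
    CUDA_VISIBLE_DEVICES=0 build/bin/llama-bench -m "$MODEL" \
      -p 512 -n 0 -ngl 99 -t 1 -r 1 --no-warmup >/dev/null 2>&1 || bench_rc=$?
  fi
  return "$(bisect_verdict "$build_rc" "$bench_rc")"
}
```

With the helpers saved as, say, `bisect-step.sh` (an untracked file, so it survives checkouts) and `MODEL` set to an absolute path, a run could start with `git bisect start 32120c10 a6cc43c28 -- ggml/src/ggml-cuda` followed by `git bisect run bash -c 'source ./bisect-step.sh; bisect_step'`. Restricting the bisect to `ggml/src/ggml-cuda` assumes the regression lives in the CUDA backend; drop the path filter to cover batch construction and scheduling changes as well.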


### Expected behavior

`llama-bench -p 512 -n 128` should complete and print JSON timing rows for MoE models, as it does for dense models and as it did on `a6cc43c28`.

### Actual behavior

The process aborts with `CUDA error: invalid argument` in `ggml_cuda_mul_mat_q` during prefill on MoE and MoE+SSM models.

### Impact

- **`llama-bench` is unusable for MoE model prefill/decode sweeps** on recent master (blocks quantization / tile / mmq_x benchmarking on LFM, Qwen3.6 MoE, gpt-oss, etc.)
- **`llama-cli` remains usable** — we use it as a workaround for e2e timing
- **`llama-bench -p 0 -n N`** remains usable for **decode-only** MoE benchmarks

### Workarounds

1. Pin to **`a6cc43c28`** (or earlier) for MoE `llama-bench` prefill experiments
2. Use **`llama-bench -p 0 -n 128`** for MoE decode-only metrics on recent master
3. Use **`llama-cli`** + logged tok/s for MoE e2e (less convenient than `llama-bench -o json`)

### Possibly related issues

- #18996 — gpt-oss MoE, `ggml_cuda_mul_mat_q` @ mmq.cu:~179, `invalid argument` via **llama-server** (open, regression/CUDA)
- #23972 — LFM2.5-8B, different MMQ/mmid shared-memory failure on **llama-server** second turn (open)
- #24064 — MoE `MUL_MAT_ID` MMVQ invalid launch (open)
- PR #22252 / #22298 — MoE MMQ stream-k tuning (merged; different symptom)

This report focuses specifically on the **`llama-bench` prefill regression**, with a pinned good/bad commit pair and a dense-vs-MoE control on RTX 3080 + 5060 Ti.

### Additional notes for maintainers

- LFM2.5 decode path is dominated by **`mul_mat_vec_q` (MMVQ)**, not MMQ — our separate MMVQ `nwarps` work is orthogonal to this crash
- Crash occurs in **prefill MMQ** when routing experts (`ids` tensor present) — consistent with MoE FFN / gate paths during prompt eval
- SYCL MoE prefill fix in #24676 does not touch CUDA `mmq.cu`



### First Bad Commit

**Likely first bad commit: `9725a313be0528214c4a02fed906ddaf7b3f712e`** (2026-04-25)

```
9725a313b  CUDA: reduce MMQ stream-k overhead (#22298)
Author: Johannes Gäßler
Date:   Sat Apr 25 14:15:03 2026 +0200
```

**Confidence:** high (mechanism confirmed; single-file revert rebuild inconclusive due to long CUDA rebuild cycle — see below).

#### What we verified (no full bisect rebuild needed)

| Build | Commit | `-ub 512` (default) | `-ub 9` | `-ub 8` |
|-------|--------|---------------------|---------|---------|
| **Good** (`llama.cpp-bench`) | `a6cc43c28` | **PASS** | **PASS** | **PASS** |
| **Bad** (`llama.cpp-nvidia`) | `32120c10` | **FAIL** | **FAIL** | **PASS** |

The threshold is **`n_ubatch > 8`**, which matches `MMVQ_MAX_BATCH_SIZE` in the CUDA dispatcher:

- **`ne2 ≤ 8`** → MoE **MMVQ** path (`ggml_cuda_mul_mat_vec_q` with `ids`) → works on bad commit
- **`ne2 > 8`** → MoE **MMQ** path (`ggml_cuda_mul_mat_q` with `ids`, `ggml_cuda_launch_mm_ids_helper`) → crashes on bad commit

So this is **not** a llama-bench-only bug — any prefill with ubatch > 8 hits the broken MoE MMQ path. `llama-bench` defaults (`-ub 512`) just make it obvious.

On **`a6cc43c28`**, the same MoE MMQ path works at `-ub 9` and `-ub 512`, so the regression is in CUDA MMQ/MoE handling, not in batch construction per se.
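
The table above can be reproduced with a small sweep over `-ub` values. This is a minimal sketch, assuming the cutoff of 8 reported above; the function names `expected_path` and `sweep_ubatch` and the `LLAMA_BENCH` / `MODEL` variables are hypothetical and should be pointed at your own build tree and model file.

```bash
#!/usr/bin/env bash
# Sweep -ub values around the reported cutoff; binary and model paths are assumptions.
LLAMA_BENCH=${LLAMA_BENCH:-llama-bench}
MODEL=${MODEL:-LFM2.5-8B-A1B-Q4_K_M.gguf}
MMVQ_CUTOFF=8   # value this report gives for MMVQ_MAX_BATCH_SIZE

# Name the MoE kernel path this report associates with a given ubatch size.
expected_path() {
  if (( $1 <= MMVQ_CUTOFF )); then echo MMVQ; else echo MMQ; fi
}

# Run prefill-only llama-bench once per ubatch size and print PASS/FAIL
# based on the exit status (a CUDA abort yields a non-zero status).
sweep_ubatch() {
  local ub result
  for ub in "$@"; do
    if "$LLAMA_BENCH" -m "$MODEL" -p 512 -n 0 -ngl 99 -t 1 -r 1 \
         --no-warmup -ub "$ub" >/dev/null 2>&1; then
      result=PASS
    else
      result=FAIL
    fi
    echo "ub=$ub path=$(expected_path "$ub") $result"
  done
}
```

Calling `sweep_ubatch 8 9 512` (with `CUDA_VISIBLE_DEVICES=0` exported on multi-GPU hosts) prints one line per value. If the results above hold, only `ub=8` should report `PASS` on `32120c10`, while all three should pass on `a6cc43c28`.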

#### Why `9725a313b` is the prime suspect

Between `a6cc43c28` (good) and `32120c10` (bad), only **two commits** touch `ggml/src/ggml-cuda/mmq.cu` / `mmq.cuh`:

| Commit | Date | Change |
|--------|------|--------|
| **`9725a313b`** | 2026-04-25 | **`mmq.cuh` only** — stream-k refactor: `kbc`/`kbc_stop` from `int64_t` → `int32_t`, new `fastdiv`/`fastmodulo`; MoE stream-k fixup (`ids_dst`, `expert_bounds`) rewritten |
| `fc2b0053f` | 2026-04-29 | `mmq.cu` NVFP4/MXFP4 naming — irrelevant for Q4_K_M on Ampere |

`ggml/src/ggml-cuda/mmid.cu` (the `mm_ids_helper` kernel) is **byte-identical** at both endpoints: the helper source did not change, and the failure is in the **MMQ stream-k path** taken once the batch exceeds the MMVQ cutoff.

**Parent (expected good):** `d1649047a` — `metal : optimize Metal Tensor API usage for GGML_OP_MUL_MAT (#20962)`

#### What does *not* fix it

- `-fa off`: still crashes (so not caused by the flash-attn default change)
- `GGML_CUDA_DISABLE_FUSION=1`: still crashes

By contrast, `-ub 8` **works** (workaround: it forces the MMVQ MoE path).


### Relevant log output

**Console (32120c10, RTX 3080):**

```
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 20047 MiB):
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20047 MiB
/home/.../llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:103: CUDA error
```

**Backtrace (32120c10, `libggml-cuda.so`):**

```
ggml_cuda_error(...)
ggml_cuda_mul_mat_q(...)                    # ggml-cuda/mmq.cu
ggml_backend_cuda_graph_compute(...)
ggml_backend_sched_graph_compute_async(...)
llama_context::graph_compute(...)
llama_context::process_ubatch(...)
llama_context::decode
```

[04_bad_32120c10_ub9_blocking_crash.log](https://github.com/user-attachments/files/29242002/04_bad_32120c10_ub9_blocking_crash.log)
[03_good_a6cc43c28_ub512_pass.log](https://github.com/user-attachments/files/29242003/03_good_a6cc43c28_ub512_pass.log)
[02_bad_32120c10_ub8_pass.log](https://github.com/user-attachments/files/29242001/02_bad_32120c10_ub8_pass.log)
[01_bad_32120c10_ub512_crash.log](https://github.com/user-attachments/files/29242000/01_bad_32120c10_ub512_crash.log)

[01_bad_32120c10_ub512_crash.log](https://github.com/user-attachments/files/29242018/01_bad_32120c10_ub512_crash.log)
