Gemma-4 MTP: skip pooling/sampling epilogue for MTP graphs (fixes decode crash) + widen DKQ=512 TILE routing to all gqa_ratio#25
Open
PhilEgly wants to merge 2 commits into
…ers == n_layer

The GEMMA4 hparam-loading path already disables KV reuse when shared_kv_layers leaves no dedicated KV layers, but the GEMMA4_ASSISTANT path next to it does not. For 26B/31B assistants where block_count == shared_kv_layers == 4, this leaves hparams.n_layer_kv_from_start at 0 and downstream tensor-creation code hits a 0-length vector subscript (visible on Windows debug-iterators as "invalid vector subscript"; UB elsewhere). Mirrors the existing GEMMA4 protection a few lines above.

Reproduces with google/gemma-4-26B-A4B-it-assistant converted via convert_hf_to_gguf.py. Edge variants (E2B/E4B) and the new 2026-06-03 12B Unified assistant likely have different shared_kv_layers values that avoid this edge case, which is why current AtomicChat-published GGUFs do not exhibit it.
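The guard described in this commit can be sketched as follows. This is a minimal illustration, not the actual loader code: `kv_hparams_sketch` and `resolve_kv_layout` are hypothetical names, and the policy shown (fall back to every layer owning its KV when no dedicated layers remain) is an assumption about what "disables KV reuse" means here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the relevant hparams fields; not the real llama.cpp struct.
struct kv_hparams_sketch {
    uint32_t n_layer               = 0;
    uint32_t n_layer_kv_from_start = 0;     // number of leading layers that own their KV
    bool     kv_reuse              = false; // whether later layers share earlier KV
};

// Assumed policy: the first (n_layer - shared_kv_layers) layers own KV and the
// remaining layers reuse it. If that would leave zero dedicated layers, reuse is
// disabled so that n_layer_kv_from_start never ends up at 0.
inline kv_hparams_sketch resolve_kv_layout(uint32_t n_layer, uint32_t shared_kv_layers) {
    assert(n_layer > 0);

    kv_hparams_sketch hp;
    hp.n_layer = n_layer;

    if (shared_kv_layers == 0 || shared_kv_layers >= n_layer) {
        // No dedicated KV layers would remain: every layer keeps its own KV.
        hp.kv_reuse              = false;
        hp.n_layer_kv_from_start = n_layer;
    } else {
        hp.kv_reuse              = true;
        hp.n_layer_kv_from_start = n_layer - shared_kv_layers;
    }
    return hp;
}
```

With `block_count == shared_kv_layers == 4`, this sketch yields four dedicated KV layers and no reuse, rather than a zero-length layout that downstream code would then index into.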
…ode crash)

Gemma-4 MTP speculative decode crashed with a silent access violation on the first draft step.

Root cause: the speculative path calls llama_set_embeddings(ctx, true) so the main decode emits backbone hidden states for the drafter, and that flag stays set. llama_model::build_graph() runs a shared build_pooling()/build_sampling() epilogue on every graph; build_pooling only early-returns when !cparams.embeddings, so it ran against the MTP graph's t_embd (= h_post, the backbone projection, which has no pooling inputs) and dereferenced bad memory.

Fix: gate the epilogue on params.gtype != LLM_GRAPH_TYPE_MTP. The MTP graph is self-contained (own logits / h_post / on-device argmax) and must not get the main-decode pooling or backend-sampling layers. This also stops build_sampling from attaching backend samplers to MTP logits, which MTP never consumes.

Supporting changes found during the investigation:
- ggml-cuda/fattn.cu: route ALL DKQ=512 cases to the TILE kernel. The previous gqa_ratio<3 guard returned BEST_FATTN_KERNEL_NONE for Gemma-4 12B/26B/31B (gqa_ratio=8), which aborts in the MMA dispatcher. TILE supports DKQ=512 with ncols2 fallback. Required for those targets to reach the MTP path at all.
- llama-graph.cpp: guard llm_graph_input_embd::set_input on embd != null (mirrors the existing can_reuse() check) so gemma4's per-layer-token input, which reuses this class but only allocates tokens, can't deref a null embd.
- llama-context.cpp: re-reserve on a real set_embeddings() mode change (topology change); synchronize the main scheduler in decode_mtp_async before handing off so the MTP worker doesn't race the preceding decode's in-flight KV writes.
- convert_hf_to_gguf.py: register Gemma4UnifiedAssistantForCausalLM as an alias of Gemma4AssistantForCausalLM so newer assistant checkpoints convert.

Validated: E4B and 12B MTP speculative decode now run end-to-end (4/4 requests each, server healthy, draft tokens accepted).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
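The effect of the epilogue gate in this commit can be shown with a small stand-alone model. This is a sketch only: `graph_type_sketch`, `graph_params_sketch` and `build_graph_stages` are hypothetical names invented for illustration, and the stages are recorded as strings instead of building real ggml tensors.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the graph type and params; not the llama.cpp definitions.
enum graph_type_sketch { GRAPH_DEFAULT, GRAPH_MTP };

struct graph_params_sketch {
    graph_type_sketch gtype;
    bool              embeddings; // models cparams.embeddings, which stays true after MTP setup
};

// Returns the stages a graph would receive. The shared epilogue is skipped for
// MTP graphs regardless of the embeddings flag, which is the behaviour the
// commit describes.
inline std::vector<std::string> build_graph_stages(const graph_params_sketch & params) {
    std::vector<std::string> stages = {"body"};

    if (params.gtype != GRAPH_MTP) {
        // Pooling still early-returns when embeddings are off, as before.
        if (params.embeddings) {
            stages.push_back("pooling");
        }
        stages.push_back("sampling");
    }
    return stages;
}
```

An MTP graph with `embeddings == true` ends up with only the `"body"` stage, while a default graph still gets pooling and sampling appended.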
Summary
Gemma-4 MTP speculative decode crashed with a silent access violation (Windows 0xC0000005) on the first draft step, for both E4B and 12B targets. This PR fixes the crash and makes MTP speculative decode work end-to-end.

The crash was traced by binary-searching the decode path with flushed log probes on an E4B repro (E4B has gqa_ratio=2, so it clears the fattn DKQ=512 abort and reaches the real bug; the 12B's DKQ=512 abort masks it). Instrumentation proved the verify batch's set_inputs and graph_compute both complete with status 0 and decode() returns 0, so the crash is later, in MTP graph construction.

Root cause
The MTP speculative path calls llama_set_embeddings(ctx, true) so the main decode emits backbone hidden states for the drafter, and that flag stays set. llama_model::build_graph() runs a shared build_pooling() + build_sampling() epilogue on every graph. build_pooling() early-returns only if (!cparams.embeddings). With embeddings forced true, it runs against the MTP graph's t_embd (= h_post, the backbone projection, which has no pooling inputs) and dereferences bad memory.

A second latent bug rode along: build_sampling() would attach backend samplers to the MTP logits, which MTP never consumes (it does its own on-device argmax).

Fix
- src/llama-model.cpp: gate the shared epilogue: if (params.gtype != LLM_GRAPH_TYPE_MTP) { build_pooling(...); build_sampling(); }. The MTP graph is self-contained and must not receive the main-decode pooling/sampling layers. This is the actual unblock.
- ggml/src/ggml-cuda/fattn.cu: route all DKQ=512 cases to the TILE kernel. This generalizes the existing fix/cuda-mma-dkq512-fallback branch, whose gqa_ratio < 3 guard only covers E4B (gqa_ratio=2). Gemma-4 12B/26B/31B have gqa_ratio=8, fail that guard, and still hit the MMA GGML_ABORT. Routing all DKQ=512 to TILE covers them and the no-mask MTP cross-attention path. (Happy to rebase this onto fix/cuda-mma-dkq512-fallback if you'd prefer it land there.)
- src/llama-graph.cpp: guard llm_graph_input_embd::set_input on embd != null (mirrors the existing can_reuse() check), so gemma4's per-layer-token input, which reuses this class but only allocates tokens, can't deref a null embd.
- src/llama-context.cpp: re-reserve on a real set_embeddings() mode change (topology change); synchronize the main scheduler in decode_mtp_async before handoff so the MTP worker doesn't race the preceding decode's in-flight KV writes.
- convert_hf_to_gguf.py: register Gemma4UnifiedAssistantForCausalLM as an alias of Gemma4AssistantForCausalLM so newer assistant checkpoints convert.

Note on branch overlap
The build_graph epilogue guard logically belongs with the MTP work on feature/gemma-mtp, and the fattn.cu change generalizes fix/cuda-mma-dkq512-fallback. I've targeted the default branch with the combined set for reviewability; happy to split into per-branch PRs if that fits your workflow better.

Validation
Clean rebuild (all diagnostics stripped), 4/4 requests each, server healthy, clean shutdown:

- gemma-4-E4B-it-Q4_K_M + gemma-4-E4B-it-assistant.Q4_K_M: Hello! How can I help you today? 😊; 2, 3, 5; 17+26 => 43; draft acceptance 1/4–12/44.
- gemma-4-12b-it-IQ4_XS + converted gemma-4-12B-it-assistant-f16: exercises both the fattn DKQ=512 fix (gqa_ratio=8) and the pooling fix; 17+26 => 43; draft acceptance 22/36–49/58. (Empty content on some prompts is thinking-mode token-budget behavior, not a crash: 12B has thinking=1 and burns the 80-tok cap on <channel>thought.)

Tested on RTX 5070 Ti (Blackwell sm_120), CUDA 13.3, but the epilogue guard is not GPU-specific: it would affect any Gemma-4 MTP target on any backend.
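For reviewers, the fattn.cu routing change described above can be compared before and after with a simplified selector. This is a hedged sketch: the enum and both functions are hypothetical, the real selection logic in fattn.cu considers many more conditions, and returning the MMA kernel for every non-512 head size is a simplification made only for this example.

```cpp
// Hypothetical kernel identifiers; the real code uses its own best_fattn_kernel enum.
enum fattn_kernel_sketch { KERNEL_NONE, KERNEL_TILE, KERNEL_MMA };

// Behaviour as described for the previous guard: DKQ=512 is only routed to TILE
// when gqa_ratio < 3; other ratios get no kernel, which later aborts.
inline fattn_kernel_sketch pick_kernel_old(int dkq, int gqa_ratio) {
    if (dkq == 512) {
        return gqa_ratio < 3 ? KERNEL_TILE : KERNEL_NONE;
    }
    return KERNEL_MMA; // simplification: all other head sizes
}

// Behaviour after this PR: every DKQ=512 case goes to TILE, whatever the gqa_ratio.
inline fattn_kernel_sketch pick_kernel_new(int dkq, int /*gqa_ratio*/) {
    if (dkq == 512) {
        return KERNEL_TILE;
    }
    return KERNEL_MMA; // simplification: all other head sizes
}
```

In this model, E4B (gqa_ratio=2) is routed to TILE by both versions, while the 12B/26B/31B case (gqa_ratio=8) only gets a kernel from `pick_kernel_new`.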
🤖 Generated with Claude Code