Feature/qwen36 mtp#47
The speculative trunk (prefill, verify, rollback re-advance) previously ran only on the per-sequence linear-cache path, whose per-layer host/device syncs leave the GPU idle (~74% of spec decode in the GDN blocks). Models that implement IMtpBatchedSpeculativeModel now serve every trunk pass through ForwardBatch instead: paged KV via the sequence block table, per-slot GDN state with snapshot/restore for verify rollback, per-row hidden/logits capture for drafting and verification.

Speculation now rides the same kernels as the non-speculative batched baseline, keeps the sequence K/V in paged storage (prefix caching and concurrency transitions compose), and disarms gracefully when a concurrent batch interrupts.

- MtpSpeculativeExecution: IMtpSpecTrunk abstraction (LinearMtpTrunk keeps the standalone decoder and non-batched models on the old path)
- BatchExecutor: BatchedMtpTrunk + TryExecuteStepMtpBatchedTrunk routed before the fused/batched dispatch; shared step core for both trunks
- BatchedForwardContext: OverrideFlatTokens (drafted tokens are not in the sequence token list) + CaptureHiddenAll/CaptureLogitsAll
- Qwen35Model: SpecForwardBatched + per-slot GDN snapshot/restore (pointer copies into reused buffers; GetElementsAsFloat allocated ~3 MB per layer per verify step - measured 92 ms/step of GC churn vs 12 ms)

Measured on Qwen3.6-27B-UD-IQ2_XXS ggml_cuda (RTX 3080 Laptop), interleaved rounds, 128-token chat with default sampling: 3.01-3.07 tok/s vs baseline 2.72-2.73 (+11-13%); greedy stream stays a 64/64 exact match vs plain decode; engine tests assert the batched path is used exclusively while armed and the per-slot snapshot protocol holds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
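The Qwen35Model item above credits reused snapshot buffers with cutting GC churn during verify. A minimal sketch of that idea follows, assuming the state can be viewed as a flat float span. `SlotStateSnapshot` and its members are hypothetical names for illustration, not types from this PR.

```csharp
using System;

// Hypothetical stand-in for one slot's recurrent-state snapshot.
// The backing buffer is reused across verify steps, so steady-state
// Save/Restore calls perform copies only and allocate nothing.
public sealed class SlotStateSnapshot
{
    private float[] _buffer = Array.Empty<float>();
    private int _length;

    public bool HasSnapshot { get; private set; }

    // Copy the live state into the reused buffer; grow only if needed.
    public void Save(ReadOnlySpan<float> live)
    {
        if (_buffer.Length < live.Length)
            _buffer = new float[live.Length];
        live.CopyTo(_buffer);
        _length = live.Length;
        HasSnapshot = true;
    }

    // Copy the saved state back over the live state (verify rollback).
    public void Restore(Span<float> live)
    {
        if (!HasSnapshot)
            throw new InvalidOperationException("No snapshot has been saved.");
        if (live.Length != _length)
            throw new ArgumentException("Live state size differs from the snapshot.");
        _buffer.AsSpan(0, _length).CopyTo(live);
    }
}
```

The snapshot is taken before the verify pass mutates the state and is restored only when a draft is partially rejected; an accepted window simply lets the next `Save` overwrite the buffer.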
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4ff79e5a15
    private static void TruncateUnpublishedTail(SequenceState seq, int keepCount)
    {
        if (seq.OutputTokens.Count > keepCount)
            seq.OutputTokens.RemoveRange(keepCount, seq.OutputTokens.Count - keepCount);
Rewind computed-token state when truncating drafts
When EOS or the max-token cap lands inside an accepted speculative window, this removes the unpublished tokens from OutputTokens but leaves NumComputedTokens and the block table advanced to the full verified window. If that window crossed an additional block boundary (for example, a prompt ending at token 15 with an 8-token accepted window and EOS at the first emitted token), NotifyStop calls CacheFullBlocksForSequence, which hashes up to seq.NumComputedTokens. TokenAt can then read past the truncated output, so the finish path throws instead of completing the request.
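One possible shape for the fix is sketched below, under the assumption that the computed-token count must never exceed the tokens the sequence still holds. `SeqSketch` is a hypothetical stand-in for SequenceState, and trimming the block table is deliberately left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for SequenceState.
public sealed class SeqSketch
{
    public int PromptLength;
    public List<int> OutputTokens = new List<int>();
    public int NumComputedTokens;

    public int TotalTokens => PromptLength + OutputTokens.Count;
}

public static class TruncateSketch
{
    // Drop the unpublished tail, then rewind the computed-token counter
    // so later hashing never indexes past the surviving tokens.
    public static void TruncateUnpublishedTail(SeqSketch seq, int keepCount)
    {
        if (seq.OutputTokens.Count > keepCount)
            seq.OutputTokens.RemoveRange(keepCount, seq.OutputTokens.Count - keepCount);

        // Assumed invariant: NumComputedTokens <= TotalTokens.
        if (seq.NumComputedTokens > seq.TotalTokens)
            seq.NumComputedTokens = seq.TotalTokens;
    }
}
```

With a 15-token prompt, an 8-token accepted window, and EOS at the first emitted token, the counter rewinds from 23 to 16, which matches the tokens that remain readable.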
    // Batched steps clobber the linear-cache world the speculative
    // context depends on (e.g. per-slot GDN state swaps model-level
    // references), so speculation must re-arm from a fresh prefill.
    _mtpCtx = null;
Preserve pending sampled token before dropping MTP context
If a speculative step produced a PendingNextToken and the next scheduler iteration includes another sequence, this batched path clears _mtpCtx before that already-drawn token is emitted. The sequence then falls through to normal decoding and samples again from LastLogits; under non-greedy sampling this changes the token stream and loses the exact sample that verification already committed to emit.
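A minimal sketch of one way to keep the committed token: carry it out of the speculative state before disarming, and emit it before sampling anything new. `SpecStateSketch` and all of its members are hypothetical names, not the PR's types.

```csharp
using System;

// Hypothetical sketch: the pending token survives a disarm and is emitted
// before any fresh sample is drawn from the last logits.
public sealed class SpecStateSketch
{
    private bool _armed;
    private int? _pendingNextToken; // token verification already committed to emit
    private int? _carriedToken;     // pending token rescued at disarm time

    public void ArmWithPending(int pendingNextToken)
    {
        _armed = true;
        _pendingNextToken = pendingNextToken;
    }

    // Called when a batched step forces speculation to re-arm later.
    public void DisarmForBatchedStep()
    {
        if (_armed && _pendingNextToken.HasValue)
            _carriedToken = _pendingNextToken;
        _pendingNextToken = null;
        _armed = false;
    }

    // Emit the carried token first; fall back to sampling otherwise.
    public int NextToken(Func<int> sampleFromLastLogits)
    {
        if (_carriedToken.HasValue)
            return TakeCarried();
        return sampleFromLastLogits();
    }

    private int TakeCarried()
    {
        int token = _carriedToken.GetValueOrDefault();
        _carriedToken = null;
        return token;
    }
}
```

Emitting the carried token instead of resampling keeps a non-greedy stream identical to what verification already drew.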
TensorSharp Test Matrix: No report artifacts produced.
…ul + fused verify attention)
On the pure-C# CUDA backend a B-token forward cost ~B x a single-token forward,
so the MTP speculative verify batch never amortized: verifying B drafts cost
about B separate decodes, making --mtp-spec net-negative despite high acceptance.
Root cause (isolated with GdnDecodeBench TS_BENCH_MODE=matmul/scaling +
Qwen35Model.DebugTimeQuantMatmul): two CUDA paths re-do per-row work across the
verify rows.
1. k-quant matmuls (ffn_down Q2_K, lm_head Q5_K, ...) ran the scalar
ts_quant_matmul_f32 kernel that re-reads/re-dequantizes the whole weight row
once per output row -> flat ms/row. New ts_quant_matmul_batched_f32 kernel:
warp-per-column, 4-row tile (grid.y), decodes each weight ONCE and reuses it
across the tile's rows (mirrors the q8_0 4-row kernel). Wired into
CudaQuantizedOps.TryAddmmQuantizedToFloat32 for 2<=rows<=32, excluding
IQ2_XXS/Q4_0/Q8_0 which already amortize via their per-row dp4a/tiled
kernels (a batched IQ2_XXS variant measured 4-5x SLOWER). 2.7-2.9x on
ffn_down and the LM head; TS_CUDA_QMM_BATCHED=0 disables. Also speeds prompt
prefill / any batched k-quant matmul.
2. Verify-window attention (seqLen>1) fell to the slow legacy ExpandKVHeads +
separate softmax path on CUDA. Routed through
CudaFusedOps.TryGqaPrefillAttentionWithSinks (gated to text continuations,
kvLen<=8192, try/catch fallback). attn 39% -> 35% of spec time.
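The amortization described in item 1 can be illustrated on the CPU. The sketch below is a reference model only, not the CUDA kernel: it assumes a simplified int8-plus-block-scale format instead of the real k-quant layouts, and `QuantMatmulSketch` and its members are hypothetical names. The per-row version decodes every weight once per input row, while the tiled version decodes each weight once per 4-row tile.

```csharp
using System;

public static class QuantMatmulSketch
{
    public const int BlockSize = 32; // assumed block size for the toy format
    public static long DecodeCount;  // counts dequantizations for comparison

    private static float Decode(sbyte[] q, float[] scales, int index)
    {
        DecodeCount++;
        return q[index] * scales[index / BlockSize];
    }

    // x is rows x k (row-major); weights are cols x k; result is rows x cols.
    public static float[] PerRow(float[] x, int rows, int k,
                                 sbyte[] q, float[] scales, int cols)
    {
        var y = new float[rows * cols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
            {
                float acc = 0f;
                for (int i = 0; i < k; i++)
                    acc += x[r * k + i] * Decode(q, scales, c * k + i);
                y[r * cols + c] = acc;
            }
        return y;
    }

    // Decode each weight once and reuse it across the rows of a tile.
    public static float[] Tiled(float[] x, int rows, int k,
                                sbyte[] q, float[] scales, int cols, int tile = 4)
    {
        var y = new float[rows * cols];
        var acc = new float[tile];
        for (int r0 = 0; r0 < rows; r0 += tile)
        {
            int n = Math.Min(tile, rows - r0);
            for (int c = 0; c < cols; c++)
            {
                Array.Clear(acc, 0, tile);
                for (int i = 0; i < k; i++)
                {
                    float w = Decode(q, scales, c * k + i);
                    for (int t = 0; t < n; t++)
                        acc[t] += x[(r0 + t) * k + i] * w;
                }
                for (int t = 0; t < n; t++)
                    y[(r0 + t) * cols + c] = acc[t];
            }
        }
        return y;
    }
}
```

Both functions accumulate in the same order, so their outputs match exactly; only the number of decodes changes, dropping by the tile occupancy (3x for a 3-row verify batch).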
Result (RTX 3080 Laptop, 27B IQ2_XXS greedy): MTP decode 2.77 -> 4.41 tok/s
(+59%) on an English prompt, 4.03 -> 4.68 (+16%) on a Chinese prompt; output
token-identical to non-MTP greedy (64/64, 96/96). All 38 CudaBackendTests pass
(the Q4_K/Q5_K/Q6_K matmul test uses rows=3 and now covers the batched kernel).
GdnDecodeBench gains MtpSpecStats phase printing, matmul/scaling probe modes, a
TS_MTP_PROMPT override and a spec-vs-baseline token-stream check;
MtpSpeculativeDecoder exposes Stats for it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
MtpSnapshotRecurrentState/Restore drained the GatedDeltaNet state to host bytes (EnsureHostReadable DtoH per recurrent layer = 48 stalls) on every speculative verify step, but the snapshot is only consumed on a partial-rejection rollback (~1 per request on this model). On the sync-bound CUDA backend those DtoH stalls were almost entirely wasted.

Add a CUDA fast path that snapshots/restores _deltaStateTensor and _cudaGdnConvStateTensor device-to-device via Ops.Copy (async cuMemcpyDtoD; CopyDeviceFrom marks the destination device-modified, so the restored state is read directly by the recurrence kernel and never re-clobbered by a stale host mirror). The host-bytes path is kept for non-CUDA backends.

Snapshot phase 3.6% -> 0.1% (515 -> 17 ms over a 64-token greedy decode). Spec output stays token-identical to non-MTP greedy including the rollback step (64/64, 96/96). Cumulative CUDA MTP gain: 2.77 -> 4.46 tok/s on an English prompt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>