Feature/qwen36 mtp by zhongkaifu · Pull Request #47 · zhongkaifu/TensorSharp

zhongkaifu · 2026-06-12T23:21:46Z

No description provided.

The speculative trunk (prefill, verify, rollback re-advance) previously ran only on the per-sequence linear-cache path, whose per-layer host/device syncs leave the GPU idle (~74% of spec decode in the GDN blocks). Models that implement IMtpBatchedSpeculativeModel now serve every trunk pass through ForwardBatch instead: paged KV via the sequence block table, per-slot GDN state with snapshot/restore for verify rollback, per-row hidden/logits capture for drafting and verification. Speculation now rides the same kernels as the non-speculative batched baseline, keeps the sequence K/V in paged storage (prefix caching and concurrency transitions compose), and disarms gracefully when a concurrent batch interrupts.

- MtpSpeculativeExecution: IMtpSpecTrunk abstraction (LinearMtpTrunk keeps the standalone decoder and non-batched models on the old path)
- BatchExecutor: BatchedMtpTrunk + TryExecuteStepMtpBatchedTrunk routed before the fused/batched dispatch; shared step core for both trunks
- BatchedForwardContext: OverrideFlatTokens (drafted tokens are not in the sequence token list) + CaptureHiddenAll/CaptureLogitsAll
- Qwen35Model: SpecForwardBatched + per-slot GDN snapshot/restore (pointer copies into reused buffers; GetElementsAsFloat allocated ~3 MB per layer per verify step - measured 92 ms/step of GC churn vs 12 ms)

Measured on Qwen3.6-27B-UD-IQ2_XXS ggml_cuda (RTX 3080 Laptop), interleaved rounds, 128-token chat with default sampling: 3.01-3.07 tok/s vs baseline 2.72-2.73 (+11-13%); greedy stream stays a 64/64 exact match vs plain decode; engine tests assert the batched path is used exclusively while armed and the per-slot snapshot protocol holds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
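The snapshot/restore-into-reused-buffers idea from the Qwen35Model bullet can be sketched roughly as follows. This is a minimal host-side illustration only: SlotStateSnapshot is a hypothetical name, not a TensorSharp type, and the real code copies per-slot GDN state through tensor pointers rather than managed arrays.

```csharp
using System;

// Hypothetical stand-in for one recurrent layer's per-slot state snapshot.
// Not a TensorSharp type; it only illustrates reusing one buffer across steps.
public sealed class SlotStateSnapshot
{
    private float[] _saved = Array.Empty<float>();

    public bool HasSnapshot { get; private set; }

    // Copy the live state into a buffer that is reused across verify steps.
    public void Capture(ReadOnlySpan<float> live)
    {
        if (_saved.Length != live.Length)
            _saved = new float[live.Length]; // allocates only when the size changes
        live.CopyTo(_saved);
        HasSnapshot = true;
    }

    // Copy the saved state back over the live state after a rejected draft.
    public void Restore(Span<float> live)
    {
        if (!HasSnapshot)
            throw new InvalidOperationException("No snapshot captured.");
        if (live.Length != _saved.Length)
            throw new ArgumentException("State size changed since capture.");
        _saved.AsSpan().CopyTo(live);
    }
}
```

Because Capture only allocates when the state size changes, repeated verify steps over a same-sized state produce no new garbage, which is the property the commit message attributes the drop in GC churn to.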

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ff79e5a15

chatgpt-codex-connector · 2026-06-12T23:25:49Z

+        private static void TruncateUnpublishedTail(SequenceState seq, int keepCount)
+        {
+            if (seq.OutputTokens.Count > keepCount)
+                seq.OutputTokens.RemoveRange(keepCount, seq.OutputTokens.Count - keepCount);


Rewind computed-token state when truncating drafts

When EOS or the max-token cap lands inside an accepted speculative window, this removes the unpublished tokens from OutputTokens but leaves NumComputedTokens and the block table advanced to the full verified window. If that window crossed an additional block boundary (for example, a prompt ending at token 15 with an 8-token accepted window and EOS at the first emitted token), NotifyStop calls CacheFullBlocksForSequence, which hashes up to seq.NumComputedTokens. TokenAt can then read past the truncated output, so the finish path throws instead of completing the request.
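One way the suggested rewind could look is sketched below. It is an assumption-laden illustration, not the fix that landed: the SequenceState here is a simplified hypothetical with only the fields the comment mentions, the blockSize parameter is added for the example, and a real implementation would also have to return the dropped blocks to the block allocator.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified sequence state. Field names follow the review
// comment; the real SequenceState in TensorSharp may differ.
public sealed class SequenceState
{
    public List<int> PromptTokens { get; } = new List<int>();
    public List<int> OutputTokens { get; } = new List<int>();
    public List<int> BlockTable { get; } = new List<int>();
    public int NumComputedTokens { get; set; }
}

public static class DraftTruncation
{
    // blockSize is an assumed parameter (tokens per paged-KV block).
    public static void TruncateUnpublishedTail(SequenceState seq, int keepCount, int blockSize)
    {
        if (seq.OutputTokens.Count > keepCount)
            seq.OutputTokens.RemoveRange(keepCount, seq.OutputTokens.Count - keepCount);

        // Never claim more computed tokens than the sequence still contains.
        int totalTokens = seq.PromptTokens.Count + seq.OutputTokens.Count;
        if (seq.NumComputedTokens > totalTokens)
            seq.NumComputedTokens = totalTokens;

        // Drop block-table entries that no longer back any computed token.
        // (A real engine would also release these blocks to its allocator.)
        int blocksNeeded = (seq.NumComputedTokens + blockSize - 1) / blockSize;
        if (seq.BlockTable.Count > blocksNeeded)
            seq.BlockTable.RemoveRange(blocksNeeded, seq.BlockTable.Count - blocksNeeded);
    }
}
```

With a 15-token prompt, an 8-token window, a block size of 16, and keepCount of 1, the sketch leaves 16 computed tokens and a single block, so a later hash over NumComputedTokens stays inside the surviving token list.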

chatgpt-codex-connector · 2026-06-12T23:25:50Z

+            // Batched steps clobber the linear-cache world the speculative
+            // context depends on (e.g. per-slot GDN state swaps model-level
+            // references), so speculation must re-arm from a fresh prefill.
+            _mtpCtx = null;


Preserve pending sampled token before dropping MTP context

If a speculative step produced a PendingNextToken and the next scheduler iteration includes another sequence, this batched path clears _mtpCtx before that already-drawn token is emitted. The sequence then falls through to normal decoding and samples again from LastLogits; under non-greedy sampling this changes the token stream and loses the exact sample that verification already committed to emit.
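The concern above can be illustrated with a small sketch in which the already-drawn token is carried over before the context is dropped. All types and members here (MtpContext, SpecSequence, CarriedToken, Disarm, NextToken) are hypothetical names chosen for the example; only PendingNextToken comes from the review comment.

```csharp
using System;

// Hypothetical types for illustration; the real MTP context in the PR may differ.
public sealed class MtpContext
{
    public int? PendingNextToken { get; set; }
}

public sealed class SpecSequence
{
    public MtpContext MtpCtx { get; set; }

    public int? CarriedToken { get; private set; }

    // Disarm speculation, keeping any token that verification already sampled.
    public void Disarm()
    {
        if (MtpCtx?.PendingNextToken is int pending)
            CarriedToken = pending;
        MtpCtx = null;
    }

    // Emit the carried token if there is one; otherwise sample normally.
    public int NextToken(Func<int> sampleFromLastLogits)
    {
        if (CarriedToken is int token)
        {
            CarriedToken = null;
            return token;
        }
        return sampleFromLastLogits();
    }
}
```

The point of the design is that the sampler runs at most once per emitted position, so a non-greedy stream does not depend on whether a concurrent batch happened to interrupt speculation.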

github-actions · 2026-06-13T16:33:56Z

TensorSharp Test Matrix

No report artifacts produced.

…ul + fused verify attention)

On the pure-C# CUDA backend a B-token forward cost ~B x a single-token forward, so the MTP speculative verify batch never amortized: verifying B drafts cost about B separate decodes, making --mtp-spec net-negative despite high acceptance.

Root cause (isolated with GdnDecodeBench TS_BENCH_MODE=matmul/scaling + Qwen35Model.DebugTimeQuantMatmul): two CUDA paths re-do per-row work across the verify rows.

1. k-quant matmuls (ffn_down Q2_K, lm_head Q5_K, ...) ran the scalar ts_quant_matmul_f32 kernel that re-reads/re-dequantizes the whole weight row once per output row -> flat ms/row. New ts_quant_matmul_batched_f32 kernel: warp-per-column, 4-row tile (grid.y), decodes each weight ONCE and reuses it across the tile's rows (mirrors the q8_0 4-row kernel). Wired into CudaQuantizedOps.TryAddmmQuantizedToFloat32 for 2<=rows<=32, excluding IQ2_XXS/Q4_0/Q8_0 which already amortize via their per-row dp4a/tiled kernels (a batched IQ2_XXS variant measured 4-5x SLOWER). 2.7-2.9x on ffn_down and the LM head; TS_CUDA_QMM_BATCHED=0 disables. Also speeds prompt prefill / any batched k-quant matmul.

2. Verify-window attention (seqLen>1) fell to the slow legacy ExpandKVHeads + separate softmax path on CUDA. Routed through CudaFusedOps.TryGqaPrefillAttentionWithSinks (gated to text continuations, kvLen<=8192, try/catch fallback). attn 39% -> 35% of spec time.

Result (RTX 3080 Laptop, 27B IQ2_XXS greedy): MTP decode 2.77 -> 4.41 tok/s (+59%) on an English prompt, 4.03 -> 4.68 (+16%) on a Chinese prompt; output token-identical to non-MTP greedy (64/64, 96/96). All 38 CudaBackendTests pass (the Q4_K/Q5_K/Q6_K matmul test uses rows=3 and now covers the batched kernel).

GdnDecodeBench gains MtpSpecStats phase printing, matmul/scaling probe modes, a TS_MTP_PROMPT override and a spec-vs-baseline token-stream check; MtpSpeculativeDecoder exposes Stats for it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
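The "decode each weight once, reuse it across rows" loop order from item 1 can be shown with a CPU reference. This is only a sketch of the loop structure: it uses a simplified 8-bit block format (one float scale per 32 weights) rather than the actual Q2_K/Q5_K layouts, it is not the CUDA kernel, and the class and method names are made up for the example.

```csharp
using System;

// CPU reference only. Simplified block format: 32 sbyte weights share one
// float scale. This is NOT the k-quant layout or the kernel from the PR.
public static class BatchedQuantMatmul
{
    public const int BlockSize = 32;

    // weights: [cols, k], scales: [cols, k / BlockSize], x: [rows, k].
    // Returns y: [rows, cols], all row-major.
    public static float[] Multiply(sbyte[] weights, float[] scales, float[] x, int rows, int cols, int k)
    {
        if (k % BlockSize != 0)
            throw new ArgumentException("k must be a multiple of BlockSize.");

        var y = new float[rows * cols];
        var decoded = new float[BlockSize];
        int blocksPerCol = k / BlockSize;

        for (int c = 0; c < cols; c++)
        {
            for (int b = 0; b < blocksPerCol; b++)
            {
                // Dequantize this weight block once...
                float scale = scales[c * blocksPerCol + b];
                int wBase = c * k + b * BlockSize;
                for (int i = 0; i < BlockSize; i++)
                    decoded[i] = weights[wBase + i] * scale;

                // ...then reuse it for every input row in the batch.
                for (int r = 0; r < rows; r++)
                {
                    int xBase = r * k + b * BlockSize;
                    float acc = 0f;
                    for (int i = 0; i < BlockSize; i++)
                        acc += decoded[i] * x[xBase + i];
                    y[r * cols + c] += acc;
                }
            }
        }
        return y;
    }
}
```

In this ordering the dequantization cost is paid once per weight block regardless of the row count, whereas a per-row kernel pays it rows times; that is the amortization the commit message describes for the 2<=rows<=32 verify batches.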

MtpSnapshotRecurrentState/Restore drained the GatedDeltaNet state to host bytes (EnsureHostReadable DtoH per recurrent layer = 48 stalls) on every speculative verify step, but the snapshot is only consumed on a partial-rejection rollback (~1 per request on this model). On the sync-bound CUDA backend those DtoH stalls were almost entirely wasted.

Add a CUDA fast path that snapshots/restores _deltaStateTensor and _cudaGdnConvStateTensor device-to-device via Ops.Copy (async cuMemcpyDtoD; CopyDeviceFrom marks the destination device-modified, so the restored state is read directly by the recurrence kernel and never re-clobbered by a stale host mirror). The host-bytes path is kept for non-CUDA backends.

Snapshot phase 3.6% -> 0.1% (515 -> 17 ms over a 64-token greedy decode). Spec output stays token-identical to non-MTP greedy including the rollback step (64/64, 96/96). Cumulative CUDA MTP gain: 2.77 -> 4.46 tok/s on an English prompt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
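To see why the snapshot is consumed so rarely, it may help to look at a generic greedy verification step. The sketch below is the textbook longest-matching-prefix rule, not code from this PR; GreedyVerify, Accept, and NeedsRollback are hypothetical names, and it assumes the verify pass returns one greedy token per draft position plus one for the position after the last draft.

```csharp
using System;

// Generic greedy speculative verification; names are illustrative only.
public static class GreedyVerify
{
    // drafted[i] is the i-th draft token. targetArgmax[i] is the trunk's greedy
    // choice for that same position, and targetArgmax[drafted.Length] is its
    // choice for the position after the last draft.
    // Returns the number of accepted drafts; nextToken is emitted after them.
    public static int Accept(int[] drafted, int[] targetArgmax, out int nextToken)
    {
        if (targetArgmax.Length != drafted.Length + 1)
            throw new ArgumentException("Expected one more target row than drafts.");

        int accepted = 0;
        while (accepted < drafted.Length && drafted[accepted] == targetArgmax[accepted])
            accepted++;

        nextToken = targetArgmax[accepted];
        return accepted;
    }

    // The recurrent state was advanced through every draft during verify, so it
    // only has to be restored when at least one draft was rejected.
    public static bool NeedsRollback(int accepted, int draftCount)
    {
        return accepted < draftCount;
    }
}
```

When every draft is accepted the advanced state is already correct and the snapshot is simply discarded, so a cheap device-side capture with an occasional restore fits the access pattern better than an eager host copy on every step.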

zhongkaifu and others added 2 commits June 12, 2026 15:19

support Qwen3.6 mtp

14c2a83

chatgpt-codex-connector Bot reviewed Jun 12, 2026

Improve cuda backend performance on Qwen3.6

b12a097

zhongkaifu and others added 4 commits June 13, 2026 12:35

support gemma4 mtp

1fe65ce

improve MTP performance on cuda backend

0ab4d88

zhongkaifu wants to merge 7 commits into main from feature/Qwen36_mtp
