Feature/qwen36 mtp #47

Open
zhongkaifu wants to merge 7 commits into main from feature/Qwen36_mtp
@zhongkaifu (Owner)

No description provided.

zhongkaifu and others added 2 commits June 12, 2026 15:19
The speculative trunk (prefill, verify, rollback re-advance) previously ran
only on the per-sequence linear-cache path, whose per-layer host/device
syncs leave the GPU idle (~74% of spec decode in the GDN blocks). Models
that implement IMtpBatchedSpeculativeModel now serve every trunk pass
through ForwardBatch instead: paged KV via the sequence block table,
per-slot GDN state with snapshot/restore for verify rollback, per-row
hidden/logits capture for drafting and verification. Speculation now rides
the same kernels as the non-speculative batched baseline, keeps the
sequence K/V in paged storage (prefix caching and concurrency transitions
compose), and disarms gracefully when a concurrent batch interrupts.

- MtpSpeculativeExecution: IMtpSpecTrunk abstraction (LinearMtpTrunk keeps
  the standalone decoder and non-batched models on the old path)
- BatchExecutor: BatchedMtpTrunk + TryExecuteStepMtpBatchedTrunk routed
  before the fused/batched dispatch; shared step core for both trunks
- BatchedForwardContext: OverrideFlatTokens (drafted tokens are not in the
  sequence token list) + CaptureHiddenAll/CaptureLogitsAll
- Qwen35Model: SpecForwardBatched + per-slot GDN snapshot/restore
  (pointer copies into reused buffers; GetElementsAsFloat allocated ~3 MB
  per layer per verify step - measured 92 ms/step of GC churn vs 12 ms)

Measured on Qwen3.6-27B-UD-IQ2_XXS ggml_cuda (RTX 3080 Laptop), interleaved
rounds, 128-token chat with default sampling: 3.01-3.07 tok/s vs baseline
2.72-2.73 (+11-13%); greedy stream stays a 64/64 exact match vs plain
decode; engine tests assert the batched path is used exclusively while
armed and the per-slot snapshot protocol holds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
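
The snapshot/restore protocol with reused buffers described in the commit message can be sketched as follows. This is a minimal illustration, not the PR's code: `SlotStateSnapshot`, `Capture`, and `Restore` are hypothetical names, and plain `float[]` arrays stand in for the backend tensors that hold the real GDN state.

```csharp
using System;

// Hypothetical stand-in for one sequence slot's recurrent (GDN) state snapshot.
// The real state lives in backend tensors; float arrays are an assumption here.
public sealed class SlotStateSnapshot
{
    private readonly float[][] _saved;   // one reusable buffer per layer

    public bool IsArmed { get; private set; }

    public SlotStateSnapshot(int layerCount, int stateSize)
    {
        _saved = new float[layerCount][];
        for (int i = 0; i < layerCount; i++)
            _saved[i] = new float[stateSize];   // allocated once, reused every verify step
    }

    // Copy the live state into the preallocated buffers (no per-step allocation).
    public void Capture(float[][] live)
    {
        for (int i = 0; i < _saved.Length; i++)
            Array.Copy(live[i], _saved[i], _saved[i].Length);
        IsArmed = true;
    }

    // Roll the live state back to the captured values after a partial rejection.
    public void Restore(float[][] live)
    {
        if (!IsArmed)
            throw new InvalidOperationException("No snapshot captured.");
        for (int i = 0; i < _saved.Length; i++)
            Array.Copy(_saved[i], live[i], _saved[i].Length);
    }
}
```

The point of the sketch is that both directions copy into existing storage, so a verify step produces no new garbage for the collector, which is the cost the commit message attributes to `GetElementsAsFloat`.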

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ff79e5a15

Comment on lines +421 to +424
private static void TruncateUnpublishedTail(SequenceState seq, int keepCount)
{
    if (seq.OutputTokens.Count > keepCount)
        seq.OutputTokens.RemoveRange(keepCount, seq.OutputTokens.Count - keepCount);

P1: Rewind computed-token state when truncating drafts

When EOS or the max-token cap lands inside an accepted speculative window, this removes the unpublished tokens from OutputTokens but leaves NumComputedTokens and the block table advanced to the full verified window. If that window crossed an additional block boundary (for example, a prompt ending at token 15 with an 8-token accepted window and EOS at the first emitted token), NotifyStop calls CacheFullBlocksForSequence, which hashes up to seq.NumComputedTokens. TokenAt can then read past the truncated output, so the finish path throws instead of completing the request.
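
One possible shape of the fix is to clamp the computed-token count in the same helper that truncates the output. The sketch below is an assumption-laden illustration: `SequenceStateSketch`, `PromptLength`, and `TruncateSketch` are hypothetical, only `OutputTokens` and `NumComputedTokens` are names taken from the comment, and releasing any surplus block-table entries is deliberately left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified sequence state. Only OutputTokens and
// NumComputedTokens are named in the review comment; the rest is assumed.
public sealed class SequenceStateSketch
{
    public int PromptLength;
    public List<int> OutputTokens = new List<int>();
    public int NumComputedTokens;
}

public static class TruncateSketch
{
    public static void TruncateUnpublishedTail(SequenceStateSketch seq, int keepCount)
    {
        if (seq.OutputTokens.Count > keepCount)
            seq.OutputTokens.RemoveRange(keepCount, seq.OutputTokens.Count - keepCount);

        // Assumed invariant: the computed-token count never exceeds the number
        // of tokens that still exist, so later hashing cannot index past the end.
        int totalTokens = seq.PromptLength + seq.OutputTokens.Count;
        if (seq.NumComputedTokens > totalTokens)
            seq.NumComputedTokens = totalTokens;
    }
}
```

For the scenario in the comment, taking a 15-token prompt, 8 accepted tokens, `keepCount = 1`, and an assumed `NumComputedTokens` of 23, the sketch leaves one output token and clamps the count to 16. Whether the real engine should rewind to that value or one lower depends on its own accounting, which this sketch does not know.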

Comment on lines +517 to +520
// Batched steps clobber the linear-cache world the speculative
// context depends on (e.g. per-slot GDN state swaps model-level
// references), so speculation must re-arm from a fresh prefill.
_mtpCtx = null;

P2: Preserve pending sampled token before dropping MTP context

If a speculative step produced a PendingNextToken and the next scheduler iteration includes another sequence, this batched path clears _mtpCtx before that already-drawn token is emitted. The sequence then falls through to normal decoding and samples again from LastLogits; under non-greedy sampling this changes the token stream and loses the exact sample that verification already committed to emit.
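
A hedged sketch of the suggested behaviour: emit the already-sampled token before the context is discarded. `MtpContextSketch` and `FlushPending` are hypothetical; only `PendingNextToken` is named in the comment, and the sketch assumes the caller also prevents the stale logits from being sampled again for that position.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical speculative context; only PendingNextToken appears in the review comment.
public sealed class MtpContextSketch
{
    public int? PendingNextToken;
}

public static class MtpDisarmSketch
{
    // Appends the pending token (if any) to the output and clears it.
    // Returns true when a token was emitted.
    public static bool FlushPending(MtpContextSketch ctx, List<int> outputTokens)
    {
        if (ctx == null || !ctx.PendingNextToken.HasValue)
            return false;

        outputTokens.Add(ctx.PendingNextToken.Value);
        ctx.PendingNextToken = null;
        return true;
    }
}
```

Under these assumptions the batched path would call `FlushPending` immediately before `_mtpCtx = null;`, so the token that verification committed to is the one that reaches the stream under non-greedy sampling.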

@github-actions

TensorSharp Test Matrix

No report artifacts produced.

zhongkaifu and others added 4 commits June 13, 2026 12:35
…ul + fused verify attention)

On the pure-C# CUDA backend a B-token forward cost ~B x a single-token forward,
so the MTP speculative verify batch never amortized: verifying B drafts cost
about B separate decodes, making --mtp-spec net-negative despite high acceptance.

Root cause (isolated with GdnDecodeBench TS_BENCH_MODE=matmul/scaling +
Qwen35Model.DebugTimeQuantMatmul): two CUDA paths re-do per-row work across the
verify rows.
  1. k-quant matmuls (ffn_down Q2_K, lm_head Q5_K, ...) ran the scalar
     ts_quant_matmul_f32 kernel that re-reads/re-dequantizes the whole weight row
     once per output row -> flat ms/row. New ts_quant_matmul_batched_f32 kernel:
     warp-per-column, 4-row tile (grid.y), decodes each weight ONCE and reuses it
     across the tile's rows (mirrors the q8_0 4-row kernel). Wired into
     CudaQuantizedOps.TryAddmmQuantizedToFloat32 for 2<=rows<=32, excluding
     IQ2_XXS/Q4_0/Q8_0 which already amortize via their per-row dp4a/tiled
     kernels (a batched IQ2_XXS variant measured 4-5x SLOWER). 2.7-2.9x on
     ffn_down and the LM head; TS_CUDA_QMM_BATCHED=0 disables. Also speeds prompt
     prefill / any batched k-quant matmul.
  2. Verify-window attention (seqLen>1) fell to the slow legacy ExpandKVHeads +
     separate softmax path on CUDA. Routed through
     CudaFusedOps.TryGqaPrefillAttentionWithSinks (gated to text continuations,
     kvLen<=8192, try/catch fallback). attn 39% -> 35% of spec time.

Result (RTX 3080 Laptop, 27B IQ2_XXS greedy): MTP decode 2.77 -> 4.41 tok/s
(+59%) on an English prompt, 4.03 -> 4.68 (+16%) on a Chinese prompt; output
token-identical to non-MTP greedy (64/64, 96/96). All 38 CudaBackendTests pass
(the Q4_K/Q5_K/Q6_K matmul test uses rows=3 and now covers the batched kernel).

GdnDecodeBench gains MtpSpecStats phase printing, matmul/scaling probe modes, a
TS_MTP_PROMPT override and a spec-vs-baseline token-stream check;
MtpSpeculativeDecoder exposes Stats for it.
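
The reuse idea behind the batched k-quant kernel (decode each weight once, apply it to every row of a 4-row tile) can be illustrated with a CPU reference. This is only a sketch of the loop structure: the 8-bit block format, `BlockSize`, and all names below are assumptions for illustration and do not reproduce the CUDA kernel or any real k-quant layout.

```csharp
using System;

public static class TiledQuantMatmulSketch
{
    public const int BlockSize = 32;  // elements per quantization block (assumed)
    public const int TileRows = 4;    // activation rows that share one decoded weight

    public static long DecodeCount;   // counts weight decodes, to make the reuse visible

    // Hypothetical 8-bit block format: value = scales[block] * q.
    private static float Decode(sbyte[] q, float[] scales, int index)
    {
        DecodeCount++;
        return scales[index / BlockSize] * q[index];
    }

    // Per-row baseline: every output element re-decodes its whole weight row.
    public static float[] PerRow(float[] x, int rows, int k, sbyte[] w, float[] scales, int cols)
    {
        var y = new float[rows * cols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
            {
                float acc = 0f;
                for (int i = 0; i < k; i++)
                    acc += x[r * k + i] * Decode(w, scales, c * k + i);
                y[r * cols + c] = acc;
            }
        return y;
    }

    // Tiled variant: decode each weight once per tile and reuse it for up to TileRows rows.
    public static float[] Tiled(float[] x, int rows, int k, sbyte[] w, float[] scales, int cols)
    {
        var y = new float[rows * cols];
        var acc = new float[TileRows];
        for (int r0 = 0; r0 < rows; r0 += TileRows)
        {
            int tile = Math.Min(TileRows, rows - r0);
            for (int c = 0; c < cols; c++)
            {
                Array.Clear(acc, 0, TileRows);
                for (int i = 0; i < k; i++)
                {
                    float wv = Decode(w, scales, c * k + i);   // decoded once per tile
                    for (int t = 0; t < tile; t++)
                        acc[t] += x[(r0 + t) * k + i] * wv;
                }
                for (int t = 0; t < tile; t++)
                    y[(r0 + t) * cols + c] = acc[t];
            }
        }
        return y;
    }
}
```

In this reference the per-row loop performs `rows * cols * k` decodes, while the tiled loop performs `ceil(rows / 4) * cols * k`, so a 4-row verify batch needs a quarter of the decode work for the same result.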

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

MtpSnapshotRecurrentState/Restore drained the GatedDeltaNet state to host bytes
(EnsureHostReadable DtoH per recurrent layer = 48 stalls) on every speculative
verify step, but the snapshot is only consumed on a partial-rejection rollback
(~1 per request on this model). On the sync-bound CUDA backend those DtoH stalls
were almost entirely wasted.

Add a CUDA fast path that snapshots/restores _deltaStateTensor and
_cudaGdnConvStateTensor device-to-device via Ops.Copy (async cuMemcpyDtoD;
CopyDeviceFrom marks the destination device-modified, so the restored state is
read directly by the recurrence kernel and never re-clobbered by a stale host
mirror). The host-bytes path is kept for non-CUDA backends.

Snapshot phase 3.6% -> 0.1% (515 -> 17 ms over a 64-token greedy decode). Spec
output stays token-identical to non-MTP greedy including the rollback step
(64/64, 96/96). Cumulative CUDA MTP gain: 2.77 -> 4.46 tok/s on an English
prompt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>