QMoE CUDA: input validation, prepack cleanups, and packaging pipeline fix by tianleiwu · Pull Request #28607 · microsoft/onnxruntime

tianleiwu · 2026-05-21T05:54:07Z

Description

Follow-up to #28583. Addresses review feedback that landed after merge (input validation, redundant memset, dead branches in PrePackComputeBias) and fixes a pre-existing latent CUTLASS issue that surfaced as a packaging pipeline failure once MoE GEMM kernels were built with a multi-arch CMAKE_CUDA_ARCHITECTURES list spanning pre-Ampere and Ampere+ targets.

Summary of Changes

Packaging pipeline build fix

| File | Change |
| --- | --- |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/moe_cutlass_kernel.h` | Replace the unconditional `static_assert(false, ...)` in the pre-Ampere `#else` branch of `MoeFCGemm::operator()` with `CUTLASS_NOT_IMPLEMENTED()` plus a comment explaining why this is safe. |

Background: moe_gemm_kernels_*.cu instantiate MoeFCGemm through MoeGemmRunner<...>::dispatchToArch, which contains runtime (not constexpr) if (sm_ >= 80 && sm_ < 90) branches. NVCC therefore instantiates the kernel for every requested device target, including pre-Sm80 device compile passes. The old static_assert(false, ...) fired on those passes whenever CMAKE_CUDA_ARCHITECTURES contained any arch below 80 (e.g. the packaging pipeline list 52-real;61-real;75-real;86-real;89-real;90-virtual). Replacing it with CUTLASS_NOT_IMPLEMENTED() lets NVCC emit a runtime trap stub for pre-Sm80, while runtime dispatch in MoeGemmRunner::dispatchToArch() already guarantees sm_ >= 80 before the kernel is ever launched, so the stub is unreachable in practice.
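The interaction between per-arch compilation and runtime dispatch can be illustrated with a host-only C++ sketch. It is a simplified model, not the CUTLASS source: the template parameter `kCompileArch` stands in for `__CUDA_ARCH__`, and `KernelBodySketch` and `DispatchToArchSketch` are hypothetical names standing in for `MoeFCGemm::operator()` and `MoeGemmRunner::dispatchToArch`.

```cpp
#include <cassert>
#include <stdexcept>

// kCompileArch is a hypothetical stand-in for __CUDA_ARCH__: one instantiation
// models one device compile pass from the CMAKE_CUDA_ARCHITECTURES list.
template <int kCompileArch>
int KernelBodySketch() {
  if (kCompileArch >= 800 && kCompileArch < 900) {
    return 800;  // the real kernel would run the Sm80 GEMM path here
  }
  // A static_assert(false, ...) at this point would reject the whole pass at
  // compile time. A runtime trap compiles for every arch and only fails if
  // the stub is actually executed.
  throw std::logic_error("kernel not implemented for this architecture");
}

// Stand-in for the runtime dispatch. `sm` is only known at run time, so the
// kernel body has to compile for every arch in the list.
// Returns 800 when the Sm80 path ran, or -1 when nothing was launched.
inline int DispatchToArchSketch(int sm) {
  if (sm >= 80 && sm < 90) {
    return KernelBodySketch<800>();
  }
  return -1;  // the trap stub is never reached from here
}

// Force the pre-Ampere "compile passes" to exist, as a mixed arch list does.
template int KernelBodySketch<520>();
template int KernelBodySketch<750>();
```

In this model, `DispatchToArchSketch(75)` returns without touching the stub, while calling `KernelBodySketch<750>()` directly throws. That mirrors the claim that the stub exists in the binary but is unreachable through the dispatcher.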

Address PR #28583 post-merge review

| File | Change |
| --- | --- |
| `onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu` | Add `ValidateScaledZP4BitBatchedArgs` (positive `experts`/`n`/`k_blocks`, `experts ≤ 65535` for the `gridDim.z` limit) and call it from both `LaunchQMoEScaledZP4BitBatched` overloads. Matches the validation style of `LaunchQMoERepackFP4ColToRow`. |
| `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc` (`PrePackSwizzleBlockScales`) | Remove the redundant `cudaMemsetAsync` of the destination buffer. `QMoEBlockScaleInterleaveKernel`'s `(batch, row, col) -> offset` map is a bijection over the padded output extent and writes 0 for padded source positions, so every output byte is already written. Comment explains the invariant. |
| `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc` (`PrePackComputeBias`, 4-bit block-wise) | Add `ORT_ENFORCE` checks for positive shape dims and an `INT_MAX/2` bound on `packed_k_blocks` (parity with `PrePackSwizzleBlockScales` / `PrePackRepackFP4Weights`). Drop the shadowed `bool is_fp16 = is_fp16_; bool is_bf16 = !is_fp16_;` locals in favour of `is_fp16_`. Replace the dead-branch ternary `(is_fp16 \|\| is_bf16 ? 2 : 4)` with `sizeof(uint16_t)` and a clarifying comment, and remove the unreachable `else ORT_THROW(...)` (the QMoE type path is strictly FP16/BF16). |
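As a rough illustration of the launch-argument checks listed for `qmoe_kernels.cu`, the sketch below validates the three dimensions on the host. The helper name `ValidateBatchedArgsSketch`, its return convention, and the error strings are hypothetical and are not taken from the PR; only the constraints themselves (positive dims, `experts` within the 65535 `gridDim.z` limit) come from the description.

```cpp
#include <cassert>
#include <string>

// CUDA limits gridDim.y and gridDim.z to 65535 blocks each.
constexpr int kMaxGridDimZSketch = 65535;

// Hypothetical helper: returns an empty string when the arguments are valid,
// otherwise a message describing the first violated constraint.
inline std::string ValidateBatchedArgsSketch(int experts, int n, int k_blocks) {
  if (experts <= 0) return "experts must be positive";
  if (n <= 0) return "n must be positive";
  if (k_blocks <= 0) return "k_blocks must be positive";
  if (experts > kMaxGridDimZSketch) return "experts exceeds the gridDim.z limit";
  return std::string();
}
```

Checking before the launch turns an invalid grid configuration into an explicit error at the call site instead of a failed or silently skipped kernel launch.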

Testing

Built locally with CUDA 12.8 against the failing CI arch list (-DCMAKE_CUDA_ARCHITECTURES="52-real;61-real;75-real;86-real;89-real;90-virtual") and confirmed onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_kernels_bf16_bf16.cu.o compiles cleanly (only an sm_<75 deprecation warning, no static_assert failure).
Existing QMoE Python tests (onnxruntime/test/python/transformers/test_qmoe_cuda.py, test_qmoe_cpu.py) exercise the affected PrePackSwizzleBlockScales / PrePackComputeBias paths under --config Debug builds and continue to pass; the added ORT_ENFORCE checks only trigger on invalid shapes that are not produced by the supported QMoE input contract.
No behaviour change on supported devices: dispatchToArch already gates MoeFCGemm behind sm_ >= 80, so the new CUTLASS_NOT_IMPLEMENTED() stub is unreachable at runtime.

Motivation and Context

Once #28583 enabled the MoE GEMM kernels as part of the contrib CUDA build, packaging pipelines (which target a wide arch range to maximise GPU coverage) started failing on the pre-Ampere device compile passes. The kernel-side fix in this PR resolves the immediate breakage while keeping the cmake-level binary-size optimisation (per-kernel arch pinning, TensorRT-LLM style) as a follow-up — CMake's CUDA_ARCHITECTURES is target/directory-scoped only, so the proper way to restrict per-kernel archs is an OBJECT-library refactor, which is intentionally not in scope here.

Checklist

Tests added/updated (input validation covered by existing QMoE tests; the new ORT_ENFORCE checks fail loudly on out-of-contract shapes)
No documentation changes needed
No breaking changes
Local packaging-pipeline arch list verified to compile

…Batched via a shared ValidateScaledZP4BitBatchedArgs helper (positive experts/n/k_blocks plus experts ≤ 65535 for the gridDim.z limit), matching the validation style of LaunchQMoERepackFP4ColToRow.
2. moe_quantization.cc, PrePackSwizzleBlockScales: Removed the redundant cudaMemsetAsync and replaced it with a comment explaining why the kernel's bijective offset map makes pre-zeroing unnecessary.
3. moe_quantization.cc, PrePackComputeBias (4-bit block-wise branch):
   * Added ORT_ENFORCE checks for positive shape dims and INT_MAX overflow (parity with PrePackSwizzleBlockScales / PrePackRepackFP4Weights).
   * Dropped the shadowed bool is_fp16 = is_fp16_; bool is_bf16 = !is_fp16_; locals and used is_fp16_ directly.
   * Replaced the dead-branch ternary (is_fp16 || is_bf16 ? 2 : 4) with sizeof(uint16_t) and a clarifying comment.
   * Removed the unreachable else ORT_THROW(...) branch (is_fp16_ is strictly binary FP16/BF16; there is no FP32 input path through QMoE).
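The PrePackComputeBias checks can be sketched as follows. This is an illustration under stated assumptions, not the code in moe_quantization.cc: `EnforceSketch` is a hypothetical stand-in for ORT_ENFORCE, `BiasBytesSketch` is a hypothetical function, and the `experts * n` output extent is assumed purely to give the size computation something to multiply.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for ORT_ENFORCE: throws when the condition is false.
inline void EnforceSketch(bool cond, const char* msg) {
  if (!cond) throw std::invalid_argument(msg);
}

// Validates the shape and returns the byte size of an (assumed) experts x n
// buffer of 16-bit elements.
inline std::size_t BiasBytesSketch(std::int64_t experts, std::int64_t n,
                                   std::int64_t packed_k_blocks) {
  EnforceSketch(experts > 0 && n > 0 && packed_k_blocks > 0,
                "shape dims must be positive");
  // Bound taken from the PR description; the margin below INT_MAX leaves
  // headroom for later int arithmetic on the block count.
  EnforceSketch(packed_k_blocks <= INT_MAX / 2, "packed_k_blocks too large");
  // FP16 and BF16 are both 2 bytes wide, so a single sizeof(uint16_t)
  // replaces a ternary whose other arm could never be taken.
  return static_cast<std::size_t>(experts) * static_cast<std::size_t>(n) *
         sizeof(std::uint16_t);
}
```

For example, `BiasBytesSketch(2, 3, 4)` yields 12 bytes, while a zero dimension or an oversized `packed_k_blocks` throws before any size arithmetic happens.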

Copilot

Pull request overview

This PR is a follow-up refinement to CUDA QMoE/MoE contrib code: it tightens runtime argument validation and removes redundant prepack work, and it fixes a CUTLASS device-compilation issue that breaks mixed-architecture (CMAKE_CUDA_ARCHITECTURES) packaging builds.

Changes:

Fix mixed-arch CUDA builds by replacing an unconditional pre-Ampere static_assert(false) with a safe CUTLASS_NOT_IMPLEMENTED() trap stub in the MoE CUTLASS kernel.
Add input validation for LaunchQMoEScaledZP4BitBatched (positive dims + experts <= 65535) and invoke it from both overloads.
Remove redundant cudaMemsetAsync in PrePackSwizzleBlockScales and simplify/strengthen PrePackComputeBias 4-bit block-wise shape validation and type handling.
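Why a bijective offset map makes the pre-zeroing memset redundant can be shown with a small host-side model. The layout below (rows padded to a multiple of 4, groups of 4 rows interleaved per column) is a hypothetical example chosen for simplicity; it is not the layout used by QMoEBlockScaleInterleaveKernel, and `InterleaveSketch` is an invented name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct InterleaveResultSketch {
  std::vector<std::uint8_t> dst;  // padded output buffer
  std::vector<int> write_count;   // number of writes per output byte
};

// Hypothetical interleave: iterates over the PADDED extent, so each output
// offset is produced exactly once and padded rows receive an explicit zero.
inline InterleaveResultSketch InterleaveSketch(const std::vector<std::uint8_t>& src,
                                               int batches, int rows, int cols) {
  const int padded_rows = (rows + 3) / 4 * 4;
  const std::size_t batch_stride = static_cast<std::size_t>(padded_rows) * cols;
  const std::size_t total = batch_stride * batches;
  InterleaveResultSketch out;
  out.dst.assign(total, 0xAA);  // 0xAA models a buffer that was never memset
  out.write_count.assign(total, 0);
  for (int b = 0; b < batches; ++b) {
    for (int r = 0; r < padded_rows; ++r) {
      for (int c = 0; c < cols; ++c) {
        // (r / 4, c, r % 4) enumerates every offset of the batch exactly once.
        const std::size_t offset =
            b * batch_stride + (static_cast<std::size_t>(r / 4) * cols + c) * 4 + r % 4;
        const std::uint8_t value =
            (r < rows) ? src[(static_cast<std::size_t>(b) * rows + r) * cols + c] : 0;
        out.dst[offset] = value;
        ++out.write_count[offset];
      }
    }
  }
  return out;
}
```

Because every entry of `write_count` ends up at exactly 1 and padded positions hold 0, none of the initial 0xAA filler survives, which is the property that lets the destination buffer skip a separate zero-fill.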

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu | Adds centralized argument validation for the scaled-ZP batched kernel launch. |
| onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc | Removes redundant memset in block-scale swizzle prepack; adds stricter shape/range checks and simplifies the FP16/BF16 bias prepack path. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/moe_cutlass_kernel.h | Replaces a compile-time failure on pre-Ampere device passes with a runtime trap stub to allow mixed-arch compilation. |


tianleiwu added 2 commits May 20, 2026 22:49

Fix packaging pipeline

5b20f3f

tianleiwu requested a review from Copilot May 21, 2026 15:48

Copilot started reviewing on behalf of tianleiwu May 21, 2026 15:52

Copilot AI reviewed May 21, 2026


Comment thread onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu Outdated

tianleiwu added 2 commits May 21, 2026 16:57

Fix windows build

8df2ef1

Refine checks

087e485

tianleiwu requested review from apsonawane, hariharans29 and kunal-vaishnavi May 22, 2026 03:29

tianleiwu wants to merge 4 commits into main from tlwu/20260520/qmoe_cuda_fix
