QMoE CUDA: input validation, prepack cleanups, and packaging pipeline fix #28607

Open: tianleiwu wants to merge 4 commits into main from tlwu/20260520/qmoe_cuda_fix

tianleiwu (Contributor) commented May 21, 2026

Description

Follow-up to #28583. Addresses review feedback that landed after merge (input validation, redundant memset, dead branches in PrePackComputeBias) and fixes a pre-existing latent CUTLASS issue that surfaced as a packaging pipeline failure once MoE GEMM kernels were built with a multi-arch CMAKE_CUDA_ARCHITECTURES list spanning pre-Ampere and Ampere+ targets.

Summary of Changes

Packaging pipeline build fix

| File | Change |
|---|---|
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/moe_cutlass_kernel.h | Replace the unconditional static_assert(false, ...) in the pre-Ampere #else branch of MoeFCGemm::operator() with CUTLASS_NOT_IMPLEMENTED() plus a comment explaining why this is safe. |

Background: moe_gemm_kernels_*.cu instantiate MoeFCGemm through MoeGemmRunner<...>::dispatchToArch, which contains runtime (not constexpr) if (sm_ >= 80 && sm_ < 90) branches. NVCC therefore instantiates the kernel for every requested device target, including pre-Sm80 device compile passes. The old static_assert(false, ...) fired on those passes whenever CMAKE_CUDA_ARCHITECTURES contained any arch below 80 (e.g. the packaging pipeline list 52-real;61-real;75-real;86-real;89-real;90-virtual). Replacing it with CUTLASS_NOT_IMPLEMENTED() lets NVCC emit a runtime trap stub for pre-Sm80, while runtime dispatch in MoeGemmRunner::dispatchToArch() already guarantees sm_ >= 80 before the kernel is ever launched, so the stub is unreachable in practice.
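
The difference between the two failure modes can be sketched in plain C++. This is a host-only analogy under stated assumptions, not the CUTLASS code: `SKETCH_DEVICE_ARCH` stands in for `__CUDA_ARCH__` on a single device compile pass, and `SKETCH_NOT_IMPLEMENTED`, `SketchKernel`, and `sketch_dispatch` are hypothetical names introduced only for illustration.

```cpp
#include <stdexcept>

// Hypothetical stand-in for __CUDA_ARCH__ during one device compile pass
// (750 models an sm_75 pass from a mixed-arch build).
#ifndef SKETCH_DEVICE_ARCH
#define SKETCH_DEVICE_ARCH 750
#endif

// Hypothetical stand-in for a "not implemented" trap: it compiles on every
// pass and only fails if the code path is actually executed.
#define SKETCH_NOT_IMPLEMENTED() \
  throw std::logic_error("kernel body not implemented for this arch")

struct SketchKernel {
  int operator()(int x) const {
#if SKETCH_DEVICE_ARCH >= 800
    return 2 * x;  // placeholder for the real Ampere+ kernel body
#else
    // An unconditional static_assert(false, "...") here would fail this
    // compile pass even though the body is never meant to run.
    SKETCH_NOT_IMPLEMENTED();
#endif
  }
};

// Runtime (not constexpr) gate, so the compiler must still emit SketchKernel
// on every pass; `sm` is the device compute capability, e.g. 75 or 86.
inline bool sketch_dispatch(int sm, int x, int* out) {
  if (sm >= 80 && sm < 90) {
    *out = SketchKernel{}(x);
    return true;
  }
  return false;  // pre-Sm80 devices never reach the kernel body
}
```

With the default `SKETCH_DEVICE_ARCH` of 750 the stub compiles, and a call such as `sketch_dispatch(75, 21, &out)` returns `false` without touching the kernel body, which is the property the PR relies on.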

Address PR #28583 post-merge review

| File | Change |
|---|---|
| onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu | Add ValidateScaledZP4BitBatchedArgs (positive experts/n/k_blocks, experts ≤ 65535 for the gridDim.z limit) and call it from both LaunchQMoEScaledZP4BitBatched overloads. Matches the validation style of LaunchQMoERepackFP4ColToRow. |
| onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc (PrePackSwizzleBlockScales) | Remove the redundant cudaMemsetAsync of the destination buffer. QMoEBlockScaleInterleaveKernel's (batch, row, col) -> offset map is a bijection over the padded output extent and writes 0 for padded source positions, so every output byte is already written. Comment explains the invariant. |
| onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc (PrePackComputeBias, 4-bit block-wise) | Add ORT_ENFORCE checks for positive shape dims and an INT_MAX/2 bound on packed_k_blocks (parity with PrePackSwizzleBlockScales / PrePackRepackFP4Weights). Drop the shadowed bool is_fp16 = is_fp16_; bool is_bf16 = !is_fp16_; locals in favour of is_fp16_. Replace the dead-branch ternary (is_fp16 \|\| is_bf16 ? 2 : 4) with sizeof(uint16_t) and a clarifying comment, and remove the unreachable else ORT_THROW(...) (the QMoE type path is strictly FP16/BF16). |
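
The launch-argument validation described for qmoe_kernels.cu could look roughly like the sketch below. The real helper's signature and error reporting are not shown in this PR text, so `sketch_validate_batched_args` and its string-returning interface are assumptions; only the checks themselves (positive dims, experts ≤ 65535) come from the description.

```cpp
#include <string>

// Hypothetical sketch of the described checks; not the actual
// ValidateScaledZP4BitBatchedArgs. Returns an empty string when the
// arguments are acceptable, otherwise a short error message.
inline std::string sketch_validate_batched_args(int experts, int n, int k_blocks) {
  // CUDA caps gridDim.z at 65535; the description maps experts onto that axis.
  constexpr int kMaxGridDimZ = 65535;
  if (experts <= 0 || n <= 0 || k_blocks <= 0) {
    return "experts, n, and k_blocks must be positive";
  }
  if (experts > kMaxGridDimZ) {
    return "experts exceeds the gridDim.z limit of 65535";
  }
  return std::string();
}
```

Calling one shared check from every launch overload keeps the overloads from drifting apart, which appears to be the motivation for the single helper.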

Testing

  • Built locally with CUDA 12.8 against the failing CI arch list (-DCMAKE_CUDA_ARCHITECTURES="52-real;61-real;75-real;86-real;89-real;90-virtual") and confirmed onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_kernels_bf16_bf16.cu.o compiles cleanly (only an sm_<75 deprecation warning, no static_assert failure).
  • Existing QMoE Python tests (onnxruntime/test/python/transformers/test_qmoe_cuda.py, test_qmoe_cpu.py) exercise the affected PrePackSwizzleBlockScales / PrePackComputeBias paths under --config Debug builds and continue to pass; the added ORT_ENFORCE checks only trigger on invalid shapes that are not produced by the supported QMoE input contract.
  • No behaviour change on supported devices: dispatchToArch already gates MoeFCGemm behind sm_ >= 80, so the new CUTLASS_NOT_IMPLEMENTED() stub is unreachable at runtime.

Motivation and Context

Once #28583 enabled the MoE GEMM kernels as part of the contrib CUDA build, packaging pipelines (which target a wide arch range to maximise GPU coverage) started failing on the pre-Ampere device compile passes. The kernel-side fix in this PR resolves the immediate breakage while keeping the cmake-level binary-size optimisation (per-kernel arch pinning, TensorRT-LLM style) as a follow-up — CMake's CUDA_ARCHITECTURES is target/directory-scoped only, so the proper way to restrict per-kernel archs is an OBJECT-library refactor, which is intentionally not in scope here.

Checklist

  • Tests added/updated (input validation covered by existing QMoE tests; the new ORT_ENFORCE checks fail loudly on out-of-contract shapes)
  • No documentation changes needed
  • No breaking changes
  • Local packaging-pipeline arch list verified to compile

tianleiwu added 2 commits May 20, 2026 22:49
…Batched via a shared ValidateScaledZP4BitBatchedArgs helper (positive experts/n/k_blocks plus experts ≤ 65535 for the gridDim.z limit), matching the validation style of LaunchQMoERepackFP4ColToRow.

2. moe_quantization.cc (PrePackSwizzleBlockScales): Removed the redundant cudaMemsetAsync and replaced it with a comment explaining why the kernel's bijective offset map makes pre-zeroing unnecessary.
3. moe_quantization.cc (PrePackComputeBias, 4-bit block-wise branch):
   * Added ORT_ENFORCE checks for positive shape dims and INT_MAX overflow (parity with PrePackSwizzleBlockScales / PrePackRepackFP4Weights).
   * Dropped the shadowed bool is_fp16 = is_fp16_; bool is_bf16 = !is_fp16_; locals and now use is_fp16_ directly.
   * Replaced the dead-branch ternary (is_fp16 || is_bf16 ? 2 : 4) with sizeof(uint16_t) and a clarifying comment.
   * Removed the unreachable else ORT_THROW(...) branch (is_fp16_ is strictly binary, FP16 or BF16, and there is no FP32 input path through QMoE).
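
A minimal sketch of the PrePackComputeBias checks listed above, assuming a bias buffer of experts × n two-byte elements. The function name, the exceptions used in place of ORT_ENFORCE, and the buffer-size formula are all hypothetical; only the positive-dims check, the INT_MAX/2 bound on packed_k_blocks, and the sizeof(uint16_t) element size come from the PR text.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch; not the code in moe_quantization.cc.
inline std::size_t sketch_bias_buffer_bytes(std::int64_t experts, std::int64_t n,
                                            std::int64_t packed_k_blocks) {
  if (experts <= 0 || n <= 0 || packed_k_blocks <= 0) {
    throw std::invalid_argument("shape dims must be positive");
  }
  // Bound taken from the PR text; presumably it keeps a later doubling of
  // packed_k_blocks within int range (an assumption, not confirmed here).
  if (packed_k_blocks > INT_MAX / 2) {
    throw std::invalid_argument("packed_k_blocks is too large");
  }
  // FP16 and BF16 are both 2-byte types, so one element size covers both
  // and the old (is_fp16 || is_bf16 ? 2 : 4) ternary has no live else arm.
  constexpr std::size_t kElemBytes = sizeof(std::uint16_t);
  return static_cast<std::size_t>(experts) * static_cast<std::size_t>(n) * kElemBytes;
}
```

For example, `sketch_bias_buffer_bytes(2, 3, 4)` yields 12 bytes under the assumed formula, while a zero or negative dimension throws instead of silently sizing a buffer.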

Copilot AI left a comment

Pull request overview

This PR is a follow-up refinement to CUDA QMoE/MoE contrib code: it tightens runtime argument validation and removes redundant prepack work, and it fixes a CUTLASS device-compilation issue that breaks mixed-architecture (CMAKE_CUDA_ARCHITECTURES) packaging builds.

Changes:

  • Fix mixed-arch CUDA builds by replacing an unconditional pre-Ampere static_assert(false) with a safe CUTLASS_NOT_IMPLEMENTED() trap stub in the MoE CUTLASS kernel.
  • Add input validation for LaunchQMoEScaledZP4BitBatched (positive dims + experts <= 65535) and invoke it from both overloads.
  • Remove redundant cudaMemsetAsync in PrePackSwizzleBlockScales and simplify/strengthen PrePackComputeBias 4-bit block-wise shape validation and type handling.
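
The argument for dropping the memset can be illustrated with a toy interleave. The map below is NOT the real QMoEBlockScaleInterleaveKernel swizzle; `sketch_offset` and `sketch_interleave` are hypothetical, chosen only because the map is a bijection over a padded extent, which is the property the PR description relies on.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical interleave: rows are padded to a multiple of 4 and each group
// of 4 rows is stored column-major. Within a group, col * 4 + (row % 4)
// covers [0, 4 * cols) exactly once, so the whole map is a bijection.
inline std::size_t sketch_offset(std::size_t row, std::size_t col, std::size_t cols) {
  return (row / 4) * (4 * cols) + col * 4 + (row % 4);
}

// Writes every element of the padded output exactly once: source values where
// a source element exists, 0 in the padding. No pre-zeroing is needed.
inline std::vector<std::uint8_t> sketch_interleave(const std::vector<std::uint8_t>& src,
                                                   std::size_t rows, std::size_t cols) {
  const std::size_t padded_rows = (rows + 3) / 4 * 4;
  // The 0xAA sentinel models uninitialized device memory.
  std::vector<std::uint8_t> dst(padded_rows * cols, 0xAA);
  for (std::size_t r = 0; r < padded_rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      dst[sketch_offset(r, c, cols)] = (r < rows) ? src[r * cols + c] : 0;
    }
  }
  return dst;
}
```

If no sentinel byte survives in the output for a source that never contains 0xAA, every output position was written, and a preceding memset would have been redundant work.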

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu | Adds centralized argument validation for the scaled-ZP batched kernel launch. |
| onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc | Removes redundant memset in block-scale swizzle prepack; adds stricter shape/range checks and simplifies the FP16/BF16 bias prepack path. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/moe_cutlass_kernel.h | Replaces a compile-time failure on pre-Ampere device passes with a runtime trap stub to allow mixed-arch compilation. |

Comment thread onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu Outdated