QMoE CUDA: input validation, prepack cleanups, and packaging pipeline fix#28607
Open
tianleiwu wants to merge 4 commits into
Open
QMoE CUDA: input validation, prepack cleanups, and packaging pipeline fix#28607tianleiwu wants to merge 4 commits into
tianleiwu wants to merge 4 commits into
Conversation
…Batched via a shared ValidateScaledZP4BitBatchedArgs helper (positive experts/n/k_blocks plus experts ≤ 65535 for the gridDim.z limit), matching the validation style of LaunchQMoERepackFP4ColToRow. 2. moe_quantization.cc — PrePackSwizzleBlockScales — Removed the redundant cudaMemsetAsync and replaced it with a comment explaining why the kernel's bijective offset map makes pre-zeroing unnecessary. 3. moe_quantization.cc — PrePackComputeBias (4-bit block-wise branch): * Added ORT_ENFORCE checks for positive shape dims and INT_MAX overflow (parity with PrePackSwizzleBlockScales / PrePackRepackFP4Weights). * Dropped the shadowed bool is_fp16 = is_fp16_; bool is_bf16 = !is_fp16_; locals and use is_fp16_ directly. * Replaced the dead-branch ternary (is_fp16 || is_bf16 ? 2 : 4) with sizeof(uint16_t) and a clarifying comment. * Removed the unreachable else ORT_THROW(...) branch (since is_fp16_ is strictly binary FP16/BF16 — there's no FP32 input path through QMoE).
Contributor
There was a problem hiding this comment.
Pull request overview
This PR is a follow-up refinement to CUDA QMoE/MoE contrib code: it tightens runtime argument validation and removes redundant prepack work, and it fixes a CUTLASS device-compilation issue that breaks mixed-architecture (CMAKE_CUDA_ARCHITECTURES) packaging builds.
Changes:
- Fix mixed-arch CUDA builds by replacing an unconditional pre-Ampere
static_assert(false)with a safeCUTLASS_NOT_IMPLEMENTED()trap stub in the MoE CUTLASS kernel. - Add input validation for
LaunchQMoEScaledZP4BitBatched(positive dims +experts <= 65535) and invoke it from both overloads. - Remove redundant
cudaMemsetAsyncinPrePackSwizzleBlockScalesand simplify/strengthenPrePackComputeBias4-bit block-wise shape validation and type handling.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu | Adds centralized argument validation for the scaled-ZP batched kernel launch. |
| onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc | Removes redundant memset in block-scale swizzle prepack; adds stricter shape/range checks and simplifies the FP16/BF16 bias prepack path. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/moe_cutlass_kernel.h | Replaces a compile-time failure on pre-Ampere device passes with a runtime trap stub to allow mixed-arch compilation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Follow-up to #28583. Addresses review feedback that landed after merge (input validation, redundant memset, dead branches in
PrePackComputeBias) and fixes a pre-existing latent CUTLASS issue that surfaced as a packaging pipeline failure once MoE GEMM kernels were built with a multi-archCMAKE_CUDA_ARCHITECTURESlist spanning pre-Ampere and Ampere+ targets.Summary of Changes
Packaging pipeline build fix
onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/moe_cutlass_kernel.hstatic_assert(false, ...)in the pre-Ampere#elsebranch ofMoeFCGemm::operator()withCUTLASS_NOT_IMPLEMENTED()plus a comment explaining why this is safe.Background:
moe_gemm_kernels_*.cuinstantiateMoeFCGemmthroughMoeGemmRunner<...>::dispatchToArch, which contains runtime (notconstexpr)if (sm_ >= 80 && sm_ < 90)branches. NVCC therefore instantiates the kernel for every requested device target, including pre-Sm80 device compile passes. The oldstatic_assert(false, ...)fired on those passes wheneverCMAKE_CUDA_ARCHITECTUREScontained any arch below 80 (e.g. the packaging pipeline list52-real;61-real;75-real;86-real;89-real;90-virtual). Replacing it withCUTLASS_NOT_IMPLEMENTED()lets NVCC emit a runtime trap stub for pre-Sm80, while runtime dispatch inMoeGemmRunner::dispatchToArch()already guaranteessm_ >= 80before the kernel is ever launched, so the stub is unreachable in practice.Address PR #28583 post-merge review
onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cuValidateScaledZP4BitBatchedArgs(positiveexperts/n/k_blocks,experts ≤ 65535for thegridDim.zlimit) and call it from bothLaunchQMoEScaledZP4BitBatchedoverloads. Matches the validation style ofLaunchQMoERepackFP4ColToRow.onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc(PrePackSwizzleBlockScales)cudaMemsetAsyncof the destination buffer.QMoEBlockScaleInterleaveKernel's(batch, row, col) -> offsetmap is a bijection over the padded output extent and writes 0 for padded source positions, so every output byte is already written. Comment explains the invariant.onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc(PrePackComputeBias, 4-bit block-wise)ORT_ENFORCEchecks for positive shape dims and anINT_MAX/2bound onpacked_k_blocks(parity withPrePackSwizzleBlockScales/PrePackRepackFP4Weights). Drop the shadowedbool is_fp16 = is_fp16_; bool is_bf16 = !is_fp16_;locals in favour ofis_fp16_. Replace the dead-branch ternary(is_fp16 || is_bf16 ? 2 : 4)withsizeof(uint16_t)and a clarifying comment, and remove the unreachableelse ORT_THROW(...)(the QMoE type path is strictly FP16/BF16).Testing
-DCMAKE_CUDA_ARCHITECTURES="52-real;61-real;75-real;86-real;89-real;90-virtual") and confirmedonnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_kernels_bf16_bf16.cu.ocompiles cleanly (only ansm_<75deprecation warning, nostatic_assertfailure).onnxruntime/test/python/transformers/test_qmoe_cuda.py,test_qmoe_cpu.py) exercise the affectedPrePackSwizzleBlockScales/PrePackComputeBiaspaths under--config Debugbuilds and continue to pass; the addedORT_ENFORCEchecks only trigger on invalid shapes that are not produced by the supported QMoE input contract.dispatchToArchalready gatesMoeFCGemmbehindsm_ >= 80, so the newCUTLASS_NOT_IMPLEMENTED()stub is unreachable at runtime.Motivation and Context
Once #28583 enabled the MoE GEMM kernels as part of the contrib CUDA build, packaging pipelines (which target a wide arch range to maximise GPU coverage) started failing on the pre-Ampere device compile passes. The kernel-side fix in this PR resolves the immediate breakage while keeping the cmake-level binary-size optimisation (per-kernel arch pinning, TensorRT-LLM style) as a follow-up — CMake's
CUDA_ARCHITECTURESis target/directory-scoped only, so the proper way to restrict per-kernel archs is an OBJECT-library refactor, which is intentionally not in scope here.Checklist
ORT_ENFORCEchecks fail loudly on out-of-contract shapes)