AutoTune MoE kernel block sizes to accelerate inference #18551
Merged
Conversation
Change the fused_moe kernel config from (N=32, K=32, warps=2, stages=2) to (N=128, K=64, warps=4, stages=3). Benchmarked on A100 with Qwen3.5 MoE dimensions, this delivers a 33.6% reduction in MoE kernel time and a 19.6% overall wall-clock speedup, with zero impact on non-MoE kernels.
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18551
Note: Links to docs will display an error until the docs builds have completed. ❌ 3 New Failures, 118 Pending as of commit 5baa6b6 with merge base 186eb4b.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Replace hardcoded block sizes with @triton.autotune for both GEMM1 (_fused_moe_kernel) and GEMM2 (_fused_moe_silu_kernel). Each kernel gets its own set of 4 autotune configs derived from standalone benchmarking on A100 with Qwen3.5 MoE dimensions (M=1 decode, INT4 HQQ, group_size=128). GEMM1 configs are optimized for N=1024, K=2048; GEMM2 configs for N=2048, K=512.

Results vs baseline (N=32, K=32):
- GEMM1: 190.1 -> 124.5 us/call (-34.5%)
- GEMM2: 87.2 -> 48.2 us/call (-44.7%)
- Overall MoE: -37.7%
- E2E decode: 45.13 -> 53.86 tok/s (+19.3%)

Keeping configs to 4 per kernel avoids the AOTI fatbin OOM issue seen with larger autotune config sets.
Include the original default block sizes (N=32, K=32) in both GEMM1 and GEMM2 autotune candidate lists to prevent perf regression on hardware where smaller block sizes are optimal. Add GPU diagnostic output to test_model_e2e.sh to help investigate perf discrepancies between local and CI environments (GPU variant, memory bandwidth, CUDA version, etc.).
Print md5sum of exported model.pte and aoti_cuda_blob.ptd on CI, along with local reference checksums and PyTorch/Triton/torchao versions, to help diagnose cross-machine perf discrepancies.
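The diagnostic step described here can be sketched as a small shell helper (the function name is hypothetical; the artifact names `model.pte` and `aoti_cuda_blob.ptd` come from the PR, and missing files are skipped rather than failing the job):

```shell
# Sketch of a CI artifact-checksum helper (illustrative, not the PR's script).
print_artifact_checksums() {
  for artifact in "$@"; do
    if [ -f "$artifact" ]; then
      md5sum "$artifact"              # compare against local reference checksums
    else
      echo "$artifact: not found (skipping)"
    fi
  done
}

# Example: checksum the exported artifacts if present.
print_artifact_checksums model.pte aoti_cuda_blob.ptd
```

Matching checksums across machines rule out artifact drift, leaving environment differences (GPU variant, driver, library versions) as the explanation for perf gaps.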
Run a standalone benchmark sweep of (N, K, warps, stages) on the CI GPU during the Qwen3.5 MoE export job. This finds the optimal block sizes for the CI hardware (A100-SXM4-80GB) so we can use them as triton autotune candidates. The benchmark runs before export and is non-fatal — export proceeds even if the benchmark fails.
Run a standalone benchmark sweep of (N, K, warps, stages) on the CI GPU during the Qwen3.5 MoE export job. Uses the actual Triton kernels from executorch.backends.cuda to ensure consistency. Finds optimal block sizes for the CI hardware so we can use them as autotune candidates. Non-fatal — export proceeds even if benchmark fails.
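The sweep described in these commits can be sketched in pure Python with a stand-in timing function (the real script benchmarks the actual Triton kernels from executorch.backends.cuda; the candidate grids and cost model here are illustrative). Note the non-fatal wrapper: a sweep failure must not block the export job.

```python
import itertools

# Illustrative candidate grids for (BLOCK_N, BLOCK_K, num_warps, num_stages).
BLOCK_N = [8, 16, 32, 64, 128, 256]
BLOCK_K = [8, 16, 32, 64, 128, 256]
WARPS = [2, 4, 8]
STAGES = [2, 3, 4]

def time_kernel(n, k, w, s):
    """Stand-in for timing one kernel launch in microseconds (toy model)."""
    return (1024 / n) * (2048 / k) / (w + s)

def sweep():
    """Benchmark every (N, K, warps, stages) combo; skip configs that fail."""
    results = []
    for n, k, w, s in itertools.product(BLOCK_N, BLOCK_K, WARPS, STAGES):
        try:
            results.append(((n, k, w, s), time_kernel(n, k, w, s)))
        except Exception:
            continue  # some configs may fail to compile on a given GPU
    results.sort(key=lambda r: r[1])  # fastest first
    return results

def main():
    try:
        best_cfg, best_us = sweep()[0]
        print(f"best config {best_cfg}: {best_us:.1f} us")
    except Exception as e:
        # Non-fatal: export proceeds even if the benchmark fails.
        print(f"benchmark sweep failed, continuing: {e}")
```

Sorting once and printing the whole ranking (rather than only the winner) is what lets the top-k later be promoted to autotune candidates.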
…cripts

Replace autotune candidates with the top-5 configs from the CI A100-SXM4-80GB benchmark sweep (block sizes 8-256). Key findings:
- GEMM1 best: N=8, K=256, warps=2 → 32.8us (45.8% faster than baseline)
- GEMM2 best: N=8, K=128, warps=2 → 26.1us (10.6% faster than baseline)
- Overall: 58.9us vs 89.6us baseline (34.3% improvement)

Baseline (32,32) retained in both config lists for safety.

Clean up: remove moe_kernel_benchmark.py, GPU diagnostics, and artifact checksums from CI scripts.
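Promoting sweep results to autotune candidates, as described here, amounts to taking the top-k configs and re-adding the baseline if the sweep dropped it. A sketch (the timings below are invented placeholders, not the CI numbers):

```python
def select_candidates(results, baseline=(32, 32, 2, 2), top_k=5):
    """results: unsorted list of (config, microseconds) pairs from a sweep.
    Returns the top_k fastest configs plus the baseline (deduplicated),
    so hardware where small block sizes win never regresses."""
    ranked = sorted(results, key=lambda r: r[1])
    candidates = [cfg for cfg, _ in ranked[:top_k]]
    if baseline not in candidates:
        candidates.append(baseline)
    return candidates

# Invented example timings (config -> microseconds), for illustration only.
sweep_results = [
    ((8, 256, 2, 2), 32.8),
    ((8, 128, 2, 4), 33.5),
    ((16, 128, 2, 3), 34.0),
    ((32, 64, 4, 2), 35.2),
    ((64, 64, 4, 3), 36.1),
    ((32, 32, 2, 2), 60.4),   # baseline
    ((128, 128, 8, 4), 70.0),
]
```

With these placeholder timings, `select_candidates(sweep_results)` yields six configs: the five fastest plus the appended baseline.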
digantdesai
approved these changes
Mar 31, 2026
Jiseong-oh
pushed a commit
to Jiseong-oh/executorch
that referenced
this pull request
Apr 2, 2026
## Summary

This PR introduces Triton autotuning for MoE kernels, improving Qwen3.5 MoE model inference from **66.8 token/s → 77.7 token/s**.

## Motivation

Profiling the Qwen3.5 MoE model (prior to GQA/MQA support in Triton SDPA) shows MoE dominates GPU time:

| Category | Total (ms) | % GPU |
|---|---|---|
| **MoE** | **1,420** | **54.7%** |
| Triton fused ops | 433 | 16.7% |
| SDPA | 288 | 11.1% |
| int4mm | 240 | 9.2% |
| chunk_gated_delta_rule | 151 | 5.8% |
| Router | 65 | 2.5% |

The `fused_moe` kernel is the single largest bottleneck, making it the highest-leverage optimization target.

## Approach

Due to hardware constraints, exhaustive autotuning at `aoti-compile` time is impractical. Instead, we:

1. **Benchmarked** all hyperparameter combinations for MoE kernels on an A100 server ([full results](https://gist.github.com/Gasoonjia/baae2475684d1246c82865ff5cbd949d))
2. **Selected** the top-5 configurations plus the original `(N=32, K=32)` baseline
3. **Registered** them as `@triton.autotune` configs for the MoE kernels

## Results — MoE Kernel

| Kernel | Best Config | Baseline | Best | Improvement |
|---|---|---|---|---|
| GEMM1 | `(8, 256, w2, s2)` | 60.4 µs | 32.8 µs | **45.8% faster** |
| GEMM2 | `(8, 128, w2, s4)` | 29.2 µs | 26.1 µs | **10.6% faster** |

**MoE kernel overall: 89.6 µs → 58.9 µs (34.3% improvement)**

## Results — End-to-End Inference

| | Token/s |
|---|---|
| Baseline | 66.8 |
| With this PR | **77.7** |
Jiseong-oh
pushed a commit
that referenced
this pull request
Apr 2, 2026
Jiseong-oh
pushed a commit
that referenced
this pull request
Apr 7, 2026