
AutoTune MoE kernel block sizes for accelerate inference#18551

Merged
Gasoonjia merged 14 commits into main from moe-tuning-only
Mar 31, 2026

Conversation

@Gasoonjia
Contributor

@Gasoonjia Gasoonjia commented Mar 27, 2026

Summary

This PR introduces Triton autotuning for MoE kernels, improving Qwen3.5 MoE model inference from 66.8 token/s → 77.7 token/s.

Motivation

Profiling the Qwen3.5 MoE model (prior to GQA/MQA support in Triton SDPA) shows MoE dominates GPU time:

| Category | Total (ms) | % GPU |
|---|---|---|
| **MoE** | **1,420** | **54.7%** |
| Triton fused ops | 433 | 16.7% |
| SDPA | 288 | 11.1% |
| int4mm | 240 | 9.2% |
| chunk_gated_delta_rule | 151 | 5.8% |
| Router | 65 | 2.5% |
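The percentages in the profile can be reproduced from the per-category totals; a quick sanity check (the variable names here are illustrative, not from the PR):

```python
# Each category's share of GPU time is its total divided by the sum over
# all profiled categories.
times_ms = {
    "MoE": 1420,
    "Triton fused ops": 433,
    "SDPA": 288,
    "int4mm": 240,
    "chunk_gated_delta_rule": 151,
    "Router": 65,
}
total = sum(times_ms.values())  # 2597 ms
shares = {k: round(100 * v / total, 1) for k, v in times_ms.items()}
print(shares["MoE"])  # 54.7
```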

The fused_moe kernel is the single largest bottleneck, making it the highest-leverage optimization target.

Approach

Due to hardware constraints, exhaustive autotuning at `aoti-compile` time is impractical. Instead, we:

  1. Benchmarked all hyperparameter combinations for MoE kernels on an A100 server ([full results](https://gist.github.com/Gasoonjia/baae2475684d1246c82865ff5cbd949d))
  2. Selected the top-5 configurations plus the original (N=32, K=32) baseline
  3. Registered them as @triton.autotune configs for the MoE kernels
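The candidate sets registered this way are small, pre-benchmarked lists rather than a full sweep. A minimal sketch of the idea (not the PR's actual code: in the real kernels each entry is a `triton.Config` passed to `@triton.autotune`; plain dicts are used here so the sketch runs anywhere, and only the single best config per kernel from the tables below is shown):

```python
# Illustrative candidate lists: the best A100 config plus the original
# (N=32, K=32, warps=2, stages=2) baseline, per kernel.
GEMM1_CONFIGS = [
    {"BLOCK_N": 8, "BLOCK_K": 256, "num_warps": 2, "num_stages": 2},  # best on A100
    {"BLOCK_N": 32, "BLOCK_K": 32, "num_warps": 2, "num_stages": 2},  # original baseline
]
GEMM2_CONFIGS = [
    {"BLOCK_N": 8, "BLOCK_K": 128, "num_warps": 2, "num_stages": 4},  # best on A100
    {"BLOCK_N": 32, "BLOCK_K": 32, "num_warps": 2, "num_stages": 2},  # original baseline
]

def has_baseline(configs):
    """The (N=32, K=32) baseline is always kept so hardware that prefers
    small block sizes never regresses."""
    return any(c["BLOCK_N"] == 32 and c["BLOCK_K"] == 32 for c in configs)

print(has_baseline(GEMM1_CONFIGS) and has_baseline(GEMM2_CONFIGS))  # True
```

At runtime Triton's autotuner benchmarks each registered config once per new key and caches the winner, so the per-hardware selection cost is paid only on the first call.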

Results — MoE Kernel

| Kernel | Best Config | Baseline | Best | Improvement |
|---|---|---|---|---|
| GEMM1 | `(8, 256, w2, s2)` | 60.4 µs | 32.8 µs | **45.8% faster** |
| GEMM2 | `(8, 128, w2, s4)` | 29.2 µs | 26.1 µs | **10.6% faster** |

MoE kernel overall: 89.6 µs → 58.9 µs (34.3% improvement)

Results — End-to-End Inference

| | Token/s |
|---|---|
| Baseline | 66.8 |
| With this PR | **77.7** |
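The headline percentages follow directly from the raw timings; a minimal check (the `pct_faster` helper is illustrative, not from the PR):

```python
# Relative kernel-time reduction and end-to-end throughput gain,
# recomputed from the figures above.
def pct_faster(before, after):
    return round(100 * (before - after) / before, 1)

print(pct_faster(89.6, 58.9))             # 34.3  (overall MoE kernel time)
print(pct_faster(29.2, 26.1))             # 10.6  (GEMM2)
print(round(100 * (77.7 / 66.8 - 1), 1))  # 16.3  (end-to-end token/s gain)
```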

Change fused_moe kernel config from (N=32, K=32, warps=2, stages=2)
to (N=128, K=64, warps=4, stages=3). Benchmarked on A100 for Qwen3.5
MoE dimensions, delivering -33.6% MoE kernel time and -19.6% overall
wall clock speedup with zero impact on non-MoE kernels.
@pytorch-bot

pytorch-bot bot commented Mar 27, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18551

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 118 Pending

As of commit 5baa6b6 with merge base 186eb4b:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 27, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Replace hardcoded block sizes with @triton.autotune for both GEMM1
(_fused_moe_kernel) and GEMM2 (_fused_moe_silu_kernel). Each kernel
gets its own set of 4 autotune configs derived from standalone
benchmarking on A100 with Qwen3.5 MoE dimensions (M=1 decode,
INT4 HQQ, group_size=128).

GEMM1 configs optimized for N=1024, K=2048.
GEMM2 configs optimized for N=2048, K=512.

Results vs baseline (N=32, K=32):
- GEMM1: 190.1 -> 124.5 us/call (-34.5%)
- GEMM2: 87.2 -> 48.2 us/call (-44.7%)
- Overall MoE: -37.7%
- E2E decode: 45.13 -> 53.86 tok/s (+19.3%)

Keeping configs to 4 per kernel avoids the AOTI fatbin OOM issue
seen with larger autotune config sets.
Include the original default block sizes (N=32, K=32) in both GEMM1
and GEMM2 autotune candidate lists to prevent perf regression on
hardware where smaller block sizes are optimal.

Add GPU diagnostic output to test_model_e2e.sh to help investigate
perf discrepancies between local and CI environments (GPU variant,
memory bandwidth, CUDA version, etc.).
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results March 30, 2026 04:36 — with GitHub Actions Inactive
Print md5sum of exported model.pte and aoti_cuda_blob.ptd on CI,
along with local reference checksums and PyTorch/Triton/torchao
versions, to help diagnose cross-machine perf discrepancies.
Run a standalone benchmark sweep of (N, K, warps, stages) on the CI
GPU during the Qwen3.5 MoE export job. This finds the optimal block
sizes for the CI hardware (A100-SXM4-80GB) so we can use them as
triton autotune candidates. The benchmark runs before export and is
non-fatal — export proceeds even if the benchmark fails.
Run a standalone benchmark sweep of (N, K, warps, stages) on the CI
GPU during the Qwen3.5 MoE export job. Uses the actual Triton kernels
from executorch.backends.cuda to ensure consistency. Finds optimal
block sizes for the CI hardware so we can use them as autotune
candidates. Non-fatal — export proceeds even if benchmark fails.
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results March 30, 2026 08:56 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results March 30, 2026 18:54 — with GitHub Actions Inactive
…cripts

Replace autotune candidates with top-5 configs from CI A100-SXM4-80GB
benchmark sweep (block sizes 8-256). Key findings:
- GEMM1 best: N=8, K=256, warps=2 → 32.8us (45.8% faster than baseline)
- GEMM2 best: N=8, K=128, warps=2 → 26.1us (10.6% faster than baseline)
- Overall: 58.9us vs 89.6us baseline (34.3% improvement)

Baseline (32,32) retained in both config lists for safety.

Clean up: remove moe_kernel_benchmark.py, GPU diagnostics, and artifact
checksums from CI scripts.
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results March 30, 2026 22:54 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia marked this pull request as ready for review March 31, 2026 05:35
@Gasoonjia Gasoonjia changed the title [WIP] Tune MoE kernel block sizes for M=1 decode Tune MoE kernel block sizes for M=1 decode Mar 31, 2026
@Gasoonjia Gasoonjia changed the title Tune MoE kernel block sizes for M=1 decode AutoTune MoE kernel block sizes for accelerate inference Mar 31, 2026
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results March 31, 2026 07:17 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results March 31, 2026 18:58 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia merged commit 217ad45 into main Mar 31, 2026
380 of 389 checks passed
@Gasoonjia Gasoonjia deleted the moe-tuning-only branch March 31, 2026 23:17
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results March 31, 2026 23:50 — with GitHub Actions Inactive
Jiseong-oh pushed a commit to Jiseong-oh/executorch that referenced this pull request Apr 2, 2026
Jiseong-oh pushed a commit that referenced this pull request Apr 2, 2026
Jiseong-oh pushed a commit that referenced this pull request Apr 7, 2026

Labels

ciflow/cuda CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

2 participants