
AutoTune MoE kernel block sizes for accelerate inference#18551

Merged
Gasoonjia merged 14 commits into main from moe-tuning-only
Mar 31, 2026

Conversation

@Gasoonjia
Contributor

@Gasoonjia Gasoonjia commented Mar 27, 2026

Summary

This PR introduces Triton autotuning for MoE kernels, improving Qwen3.5 MoE model inference from 66.8 token/s → 77.7 token/s.

Motivation

Profiling the Qwen3.5 MoE model (prior to GQA/MQA support in Triton SDPA) shows MoE dominates GPU time:

| Category | Total (ms) | % GPU |
|---|---|---|
| **MoE** | **1,420** | **54.7%** |
| Triton fused ops | 433 | 16.7% |
| SDPA | 288 | 11.1% |
| int4mm | 240 | 9.2% |
| chunk_gated_delta_rule | 151 | 5.8% |
| Router | 65 | 2.5% |
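The percentages in the profile can be reproduced from the per-category totals; a quick sanity check (the variable names here are illustrative, not from the PR):

```python
# Each category's share of GPU time is its total divided by the sum over
# all profiled categories.
times_ms = {
    "MoE": 1420,
    "Triton fused ops": 433,
    "SDPA": 288,
    "int4mm": 240,
    "chunk_gated_delta_rule": 151,
    "Router": 65,
}
total = sum(times_ms.values())  # 2597 ms
shares = {k: round(100 * v / total, 1) for k, v in times_ms.items()}
print(shares["MoE"])  # 54.7
```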

The fused_moe kernel is the single largest bottleneck, making it the highest-leverage optimization target.

Approach

Due to hardware constraints, exhaustive autotuning at `aoti-compile` time is impractical. Instead, we:

  1. Benchmarked all hyperparameter combinations for MoE kernels on an A100 server ([full results](https://gist.github.com/Gasoonjia/baae2475684d1246c82865ff5cbd949d))
  2. Selected the top-5 configurations plus the original (N=32, K=32) baseline
  3. Registered them as @triton.autotune configs for the MoE kernels
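The candidate sets registered this way are small, pre-benchmarked lists rather than a full sweep. A minimal sketch of the idea (not the PR's actual code: in the real kernels each entry is a `triton.Config` passed to `@triton.autotune`; plain dicts are used here so the sketch runs anywhere, and only the single best config per kernel from the tables below is shown):

```python
# Illustrative candidate lists: the best A100 config plus the original
# (N=32, K=32, warps=2, stages=2) baseline, per kernel.
GEMM1_CONFIGS = [
    {"BLOCK_N": 8, "BLOCK_K": 256, "num_warps": 2, "num_stages": 2},  # best on A100
    {"BLOCK_N": 32, "BLOCK_K": 32, "num_warps": 2, "num_stages": 2},  # original baseline
]
GEMM2_CONFIGS = [
    {"BLOCK_N": 8, "BLOCK_K": 128, "num_warps": 2, "num_stages": 4},  # best on A100
    {"BLOCK_N": 32, "BLOCK_K": 32, "num_warps": 2, "num_stages": 2},  # original baseline
]

def has_baseline(configs):
    """The (N=32, K=32) baseline is always kept so hardware that prefers
    small block sizes never regresses."""
    return any(c["BLOCK_N"] == 32 and c["BLOCK_K"] == 32 for c in configs)

print(has_baseline(GEMM1_CONFIGS) and has_baseline(GEMM2_CONFIGS))  # True
```

At runtime Triton's autotuner benchmarks each registered config once per new key and caches the winner, so the per-hardware selection cost is paid only on the first call.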

Results — MoE Kernel

| Kernel | Best Config | Baseline | Best | Improvement |
|---|---|---|---|---|
| GEMM1 | `(8, 256, w2, s2)` | 60.4 µs | 32.8 µs | **45.8% faster** |
| GEMM2 | `(8, 128, w2, s4)` | 29.2 µs | 26.1 µs | **10.6% faster** |

MoE kernel overall: 89.6 µs → 58.9 µs (34.3% improvement)

Results — End-to-End Inference

| | Token/s |
|---|---|
| Baseline | 66.8 |
| With this PR | **77.7** |
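The headline percentages follow directly from the raw timings; a minimal check (the `pct_faster` helper is illustrative, not from the PR):

```python
# Relative kernel-time reduction and end-to-end throughput gain,
# recomputed from the figures above.
def pct_faster(before, after):
    return round(100 * (before - after) / before, 1)

print(pct_faster(89.6, 58.9))             # 34.3  (overall MoE kernel time)
print(pct_faster(29.2, 26.1))             # 10.6  (GEMM2)
print(round(100 * (77.7 / 66.8 - 1), 1))  # 16.3  (end-to-end token/s gain)
```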

Change fused_moe kernel config from (N=32, K=32, warps=2, stages=2)
to (N=128, K=64, warps=4, stages=3). Benchmarked on A100 for Qwen3.5
MoE dimensions, delivering -33.6% MoE kernel time and -19.6% overall
wall clock speedup with zero impact on non-MoE kernels.
@pytorch-bot

pytorch-bot bot commented Mar 27, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18551

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 118 Pending

As of commit 5baa6b6 with merge base 186eb4b:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 27, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Replace hardcoded block sizes with @triton.autotune for both GEMM1
(_fused_moe_kernel) and GEMM2 (_fused_moe_silu_kernel). Each kernel
gets its own set of 4 autotune configs derived from standalone
benchmarking on A100 with Qwen3.5 MoE dimensions (M=1 decode,
INT4 HQQ, group_size=128).

GEMM1 configs optimized for N=1024, K=2048.
GEMM2 configs optimized for N=2048, K=512.

Results vs baseline (N=32, K=32):
- GEMM1: 190.1 -> 124.5 us/call (-34.5%)
- GEMM2: 87.2 -> 48.2 us/call (-44.7%)
- Overall MoE: -37.7%
- E2E decode: 45.13 -> 53.86 tok/s (+19.3%)

Keeping configs to 4 per kernel avoids the AOTI fatbin OOM issue
seen with larger autotune config sets.
Include the original default block sizes (N=32, K=32) in both GEMM1
and GEMM2 autotune candidate lists to prevent perf regression on
hardware where smaller block sizes are optimal.

Add GPU diagnostic output to test_model_e2e.sh to help investigate
perf discrepancies between local and CI environments (GPU variant,
memory bandwidth, CUDA version, etc.).
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results March 30, 2026 04:36 — with GitHub Actions Inactive
Print md5sum of exported model.pte and aoti_cuda_blob.ptd on CI,
along with local reference checksums and PyTorch/Triton/torchao
versions, to help diagnose cross-machine perf discrepancies.
Run a standalone benchmark sweep of (N, K, warps, stages) on the CI
GPU during the Qwen3.5 MoE export job. This finds the optimal block
sizes for the CI hardware (A100-SXM4-80GB) so we can use them as
triton autotune candidates. The benchmark runs before export and is
non-fatal — export proceeds even if the benchmark fails.
Run a standalone benchmark sweep of (N, K, warps, stages) on the CI
GPU during the Qwen3.5 MoE export job. Uses the actual Triton kernels
from executorch.backends.cuda to ensure consistency. Finds optimal
block sizes for the CI hardware so we can use them as autotune
candidates. Non-fatal — export proceeds even if benchmark fails.
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results March 30, 2026 08:56 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results March 30, 2026 18:54 — with GitHub Actions Inactive
…cripts

Replace autotune candidates with top-5 configs from CI A100-SXM4-80GB
benchmark sweep (block sizes 8-256). Key findings:
- GEMM1 best: N=8, K=256, warps=2 → 32.8us (45.8% faster than baseline)
- GEMM2 best: N=8, K=128, warps=2 → 26.1us (10.6% faster than baseline)
- Overall: 58.9us vs 89.6us baseline (34.3% improvement)

Baseline (32,32) retained in both config lists for safety.

Clean up: remove moe_kernel_benchmark.py, GPU diagnostics, and artifact
checksums from CI scripts.
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results March 30, 2026 22:54 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia marked this pull request as ready for review March 31, 2026 05:35
@Gasoonjia Gasoonjia changed the title [WIP] Tune MoE kernel block sizes for M=1 decode Tune MoE kernel block sizes for M=1 decode Mar 31, 2026
@Gasoonjia Gasoonjia changed the title Tune MoE kernel block sizes for M=1 decode AutoTune MoE kernel block sizes for accelerate inference Mar 31, 2026
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results March 31, 2026 07:17 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results March 31, 2026 18:58 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia merged commit 217ad45 into main Mar 31, 2026
380 of 389 checks passed
@Gasoonjia Gasoonjia deleted the moe-tuning-only branch March 31, 2026 23:17
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results March 31, 2026 23:50 — with GitHub Actions Inactive
Jiseong-oh pushed a commit to Jiseong-oh/executorch that referenced this pull request Apr 2, 2026
Jiseong-oh pushed a commit that referenced this pull request Apr 2, 2026
Jiseong-oh pushed a commit that referenced this pull request Apr 7, 2026

Labels

ciflow/cuda CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

2 participants