[MLAS] KleidiAI fix igemm regression by martin-klacer-arm · Pull Request #28571 · microsoft/onnxruntime

martin-klacer-arm · 2026-05-19T15:32:18Z

Description

This PR fixes a convolution performance regression affecting some OCR models with large-kernel convolutions when the KleidiAI SME IGEMM convolution path is selected.

The change has 2 parts:

- updates to the KleidiAI IGEMM LHS packing to pack rows in bounded chunks instead of packing the full LHS buffer up front, which reduces memory usage and improves cache locality for large convolutions,
- a new route selection function ArmKleidiAI::SelectConvRoute that decides between Igemm, GemmFallback and None based on convolution parameters and a workload size-based heuristic.

The function CheckCapabilitiesSme runs SelectConvRoute and only returns true if the selected route is Igemm. The patch also adds a standard GEMM fallback to the ConvRoute possibilities, and runs MlasGemm if said fallback is selected. If the function selects None, then the convolution falls back to MlasSgemmOperation.
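To make the three-way routing concrete, here is a minimal sketch of how such a selection could be structured. Everything in it is an assumption for illustration: the `ConvShapeSketch` struct, the two thresholds, and the rule that large kernels go to the GEMM fallback are hypothetical and are not taken from the patch; only the route names Igemm, GemmFallback and None and the functions each route leads to come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Route names as listed in the PR description.
enum class ConvRoute { None, GemmFallback, Igemm };

// Hypothetical, reduced view of the convolution parameters (not the MLAS struct).
struct ConvShapeSketch {
    std::size_t output_m;      // number of output positions
    std::size_t kernel_h;
    std::size_t kernel_w;
    std::size_t channels_in;
    std::size_t channels_out;
    bool sme_available;
};

// Assumed thresholds, chosen only so the example has something to branch on.
constexpr std::size_t kMinWorkSketch = 1u << 12;
constexpr std::size_t kLargeKernelAreaSketch = 7 * 7;

ConvRoute SelectConvRouteSketch(const ConvShapeSketch& s) {
    assert(s.kernel_h > 0 && s.kernel_w > 0);
    if (!s.sme_available) {
        return ConvRoute::None;
    }
    // Workload size estimate: multiply-accumulates for the whole convolution.
    const std::size_t k = s.kernel_h * s.kernel_w * s.channels_in;
    const std::size_t work = s.output_m * k * s.channels_out;
    if (work < kMinWorkSketch) {
        return ConvRoute::None;  // too small to be worth a specialised path
    }
    if (s.kernel_h * s.kernel_w > kLargeKernelAreaSketch) {
        return ConvRoute::GemmFallback;  // assumed rule for large kernels
    }
    return ConvRoute::Igemm;
}

// Mirrors the described contract: true only when the selected route is Igemm.
bool CheckCapabilitiesSketch(const ConvShapeSketch& s) {
    return SelectConvRouteSketch(s) == ConvRoute::Igemm;
}

// Which function each route ends up in, per the description.
std::string RouteKernelName(ConvRoute r) {
    switch (r) {
        case ConvRoute::Igemm: return "KleidiAI IGEMM";
        case ConvRoute::GemmFallback: return "MlasGemm";
        case ConvRoute::None: return "MlasSgemmOperation";
    }
    return "MlasSgemmOperation";
}
```

Keeping the decision in one function means the capability check and the generic convolution code can both ask the same question and cannot disagree about which path a given shape takes.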

Motivation and Context

Fixes #27633.

Previously, the convolution kernel for KLEIDIAI would allocate a large contiguous buffer for the LHS (left-hand-side) matrix packing, which could consume excessive memory and reduce cache efficiency. This patch modifies the packing strategy to use a chunked approach:

- Introduce a compile-time upper bound for temporary LHS packing buffers.
- Allocate a moderate-sized temporary buffer once.
- Pack LHS rows in chunks, perform computation, then reuse the buffer for the next chunk.

Benefits:

- Significantly reduces peak memory usage.
- Improves cache utilization and overall computation efficiency.
- Avoids potential memory allocation failures for large convolutions.

Performance improvement:

- Test with model https://huggingface.co/garavv/arcface-onnx on MTK D9500

Before this patch

```
./build/RelWithDebInfo/onnxruntime_perf_test -x 1 -r 1000 arc.onnx
Number of inferences per second: 4.25327
```

After this patch

```
./build/RelWithDebInfo/onnxruntime_perf_test -x 1 -r 1000 arc.onnx
Number of inferences per second: 5.03257
```

| sme_with_patch | sme_without_patch | Diff (μs) | Change % | Node Name |
|---|---|---|---|---|
| 11975.10 | 31235.30 | 19260.20 | 160.84% | StatefulPartitionedCall/ResNet34/conv2_block1_1_conv/Conv2D |
| 4514.80 | 7691.70 | 3176.90 | 70.37% | StatefulPartitionedCall/ResNet34/conv2_block2_1_conv/Conv2D |
| 4220.20 | 7120.70 | 2900.50 | 68.73% | StatefulPartitionedCall/ResNet34/conv2_block3_1_conv/Conv2D |
| 5429.20 | 8279.60 | 2850.40 | 52.50% | StatefulPartitionedCall/ResNet34/conv3_block1_1_conv/Conv2D |
| 4497.80 | 5478.40 | 980.60 | 21.80% | StatefulPartitionedCall/ResNet34/conv4_block1_1_conv/Conv2D |
| 3474.30 | 4351.80 | 877.50 | 25.26% | StatefulPartitionedCall/ResNet34/conv3_block3_1_conv/Conv2D |
| 3627.30 | 4504.00 | 876.70 | 24.17% | StatefulPartitionedCall/ResNet34/conv3_block4_1_conv/Conv2D |
| 5244.20 | 5961.10 | 716.90 | 13.67% | StatefulPartitionedCall/ResNet34/conv1_conv/Conv2D |
| 3439.80 | 4050.90 | 611.10 | 17.77% | StatefulPartitionedCall/ResNet34/conv3_block2_1_conv/Conv2D |
| 9749.80 | 10195.50 | 445.70 | 4.57% | StatefulPartitionedCall/ResNet34/conv2_block2_2_conv/Conv2D |
| 3814.00 | 4209.80 | 395.80 | 10.38% | StatefulPartitionedCall/ResNet34/conv5_block2_2_conv/Conv2D |
| 2715.90 | 3034.70 | 318.80 | 11.74% | StatefulPartitionedCall/ResNet34/conv4_block6_1_conv/Conv2D |
| 4089.10 | 4367.80 | 278.70 | 6.82% | StatefulPartitionedCall/ResNet34/conv5_block1_1_conv/Conv2D |
| 2698.00 | 2959.50 | 261.50 | 9.69% | StatefulPartitionedCall/ResNet34/conv4_block5_1_conv/Conv2D |
| 3869.20 | 4102.80 | 233.60 | 6.04% | StatefulPartitionedCall/ResNet34/conv5_block3_2_conv/Conv2D |
| 2767.90 | 2966.80 | 198.90 | 7.19% | StatefulPartitionedCall/ResNet34/conv4_block4_1_conv/Conv2D |
| 9652.10 | 9816.60 | 164.50 | 1.70% | StatefulPartitionedCall/ResNet34/conv2_block3_2_conv/Conv2D |
| 2897.50 | 3054.60 | 157.10 | 5.42% | StatefulPartitionedCall/ResNet34/conv4_block3_1_conv/Conv2D |
| 4601.20 | 4748.60 | 147.40 | 3.20% | StatefulPartitionedCall/ResNet34/conv5_block1_2_conv/Conv2D |
| 3134.00 | 3246.10 | 112.10 | 3.58% | StatefulPartitionedCall/ResNet34/conv4_block2_1_conv/Conv2D |

Signed-off-by: Qxiang Xu <Qixiang.Xu@arm.com>
Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
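The chunked strategy in this commit message can be illustrated with a plain matrix multiply. This is a sketch under stated assumptions, not the KleidiAI code: `kMaxPackBytesSketch` is a made-up bound, "packing" is reduced to a row copy, and the inner loops stand in for the SME kernel. What it does show is the shape of the loop: one bounded scratch buffer allocated once, then pack, compute, and reuse per chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed compile-time cap on the scratch buffer, in bytes (illustrative value).
constexpr std::size_t kMaxPackBytesSketch = 256;

// Computes C[m x n] = A[m x k] * B[k x n] (row-major), packing A in bounded chunks.
std::vector<float> ChunkedMatMulSketch(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    if (m == 0 || k == 0 || n == 0) {
        return c;
    }

    // Rows that fit under the cap; always at least one so progress is guaranteed.
    const std::size_t bytes_per_row = k * sizeof(float);
    const std::size_t rows_per_chunk =
        std::max<std::size_t>(1, kMaxPackBytesSketch / bytes_per_row);

    // Scratch buffer is allocated once and reused by every chunk.
    std::vector<float> packed(std::min(rows_per_chunk, m) * k);

    for (std::size_t m0 = 0; m0 < m; m0 += rows_per_chunk) {
        const std::size_t rows = std::min(rows_per_chunk, m - m0);

        // "Pack" this chunk of LHS rows (a real kernel would reorder for its tiles).
        std::copy_n(a.data() + m0 * k, rows * k, packed.data());

        // Compute the output rows for this chunk from the packed data.
        for (std::size_t i = 0; i < rows; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                float sum = 0.0f;
                for (std::size_t p = 0; p < k; ++p) {
                    sum += packed[i * k + p] * b[p * n + j];
                }
                c[(m0 + i) * n + j] = sum;
            }
        }
    }
    return c;
}
```

Peak scratch memory is bounded by the cap (or by one row when a single row exceeds it) rather than growing with `m`, which is the property the commit message relies on for large convolutions.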

- Added SelectConvRoute function to mlasi_kleidiai.h to decide between GemmFallback and Igemm based on the convolution workload parameters
- Updated CheckCapabilitiesSme function in convolve_kleidiai.cpp to use the new SelectConvRoute function

Co-authored-by: Damien Dooley <damien.dooley@arm.com>
Signed-off-by: Martin Klacer <martin.klacer@arm.com>

Copilot

Pull request overview

This PR addresses a CPU convolution performance regression on ARM64 when the KleidiAI SME IGEMM path is selected, by introducing a new convolution route-selection heuristic and changing the KleidiAI LHS packing strategy to reduce peak temporary memory and improve cache locality on large workloads.

Changes:

- Added ArmKleidiAI::SelectConvRoute (with ConvRoute options Igemm, GemmFallback, None) to decide whether to run KleidiAI IGEMM, fall back to GEMM, or use the existing path.
- Updated KleidiAI convolution to pack the IGEMM LHS in bounded chunks instead of packing the full LHS up front.
- In the generic MLAS convolution implementation, routed certain workloads to MlasGemm (instead of MlasSgemmOperation) when SelectConvRoute chooses GemmFallback.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h | Adds `ConvRoute` + `SelectConvRoute` heuristic helpers for choosing conv execution route. |
| onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp | Reworks KleidiAI SME IGEMM LHS packing into chunked packing and integrates route selection into capability checks. |
| onnxruntime/core/mlas/lib/convolve.cpp | Uses `SelectConvRoute` to optionally switch im2col+SGEMM slices to `MlasGemm` for the fallback route. |


+    const auto effective_kernel_h =
+        ComputeDilatedKernelSize(Parameters->DilationShape[0], Parameters->KernelShape[0]);
+    const auto effective_kernel_w =
+        ComputeDilatedKernelSize(Parameters->DilationShape[1], Parameters->KernelShape[1]);
+    const auto output_m =
+        ComputeConvOutputSize(Parameters->InputShape[0], effective_kernel_h, Parameters->Padding[0], Parameters->StrideShape[0]) *
+        ComputeConvOutputSize(Parameters->InputShape[1], effective_kernel_w, Parameters->Padding[1], Parameters->StrideShape[1]);
+
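For readers without the helper definitions at hand, the snippet above computes the number of output positions (`output_m`) as output height times output width. The sketch below uses the standard convolution size formulas with the same argument order as the calls in the diff. The function bodies are assumptions, since the actual `ComputeDilatedKernelSize` and `ComputeConvOutputSize` are not shown in this fragment; in particular, the sketch assumes the single padding value is applied to both sides of the dimension.

```cpp
#include <cassert>
#include <cstddef>

// Effective kernel extent once dilation gaps are included: d * (k - 1) + 1.
// Hypothetical stand-in for ComputeDilatedKernelSize(dilation, kernel).
constexpr std::size_t DilatedKernelSizeSketch(std::size_t dilation, std::size_t kernel) {
    return kernel == 0 ? 0 : dilation * (kernel - 1) + 1;
}

// Output extent along one dimension: floor((in + 2*pad - k_eff) / stride) + 1.
// Hypothetical stand-in for ComputeConvOutputSize(input, kernel, padding, stride);
// assumes symmetric padding and returns 0 when the kernel does not fit.
constexpr std::size_t ConvOutputSizeSketch(std::size_t input, std::size_t effective_kernel,
                                           std::size_t padding, std::size_t stride) {
    const std::size_t padded = input + 2 * padding;
    if (stride == 0 || effective_kernel == 0 || padded < effective_kernel) {
        return 0;
    }
    return (padded - effective_kernel) / stride + 1;
}

// Number of output positions for a 2D convolution, as in the output_m expression.
constexpr std::size_t OutputPositionsSketch(std::size_t in_h, std::size_t in_w,
                                            std::size_t k_h, std::size_t k_w,
                                            std::size_t dil_h, std::size_t dil_w,
                                            std::size_t pad_h, std::size_t pad_w,
                                            std::size_t stride_h, std::size_t stride_w) {
    return ConvOutputSizeSketch(in_h, DilatedKernelSizeSketch(dil_h, k_h), pad_h, stride_h) *
           ConvOutputSizeSketch(in_w, DilatedKernelSizeSketch(dil_w, k_w), pad_w, stride_w);
}
```

For a 224x224 input with a 7x7 kernel, padding 3 and stride 2, each dimension yields 112 outputs, so the workload heuristic would see 12544 output positions.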


+                kai_run_lhs_imatmul_pack_x32p2vlx1_x32p_sme(tile_size_m,
+                                                            d_kh * d_kw,
+                                                            ci,
+                                                            lhs_ptrs.get() + (tile_m_start + m_base) * d_kh * d_kw,
+                                                            reinterpret_cast<size_t>(activation_src),


+                m_step, d_kh * d_kw, ci);
+
+            // Determine how many rows we can pack in one chunk.
+            size_t m_chunk = std::max<size_t>(m_step, MAX_LHS_CHUNK_BYTES / bytes_per_m_step * m_step);
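The `m_chunk` expression above relies on integer division: it counts how many whole `m_step` groups fit under the byte bound, converts that back to rows, and falls back to a single `m_step` when not even one group fits. Below is a small standalone sketch of that arithmetic. The function name and the explicit zero checks are additions for illustration, and the interpretation of `bytes_per_m_step` as the packed size of one `m_step` group is an assumption based on the surrounding fragment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Rows to pack per chunk: a multiple of m_step that stays under max_chunk_bytes
// when possible, and exactly m_step otherwise.
std::size_t RowsPerChunkSketch(std::size_t m_step,
                               std::size_t bytes_per_m_step,
                               std::size_t max_chunk_bytes) {
    // Guard added for the sketch: a zero divisor would be undefined behaviour.
    assert(m_step > 0 && bytes_per_m_step > 0);

    // Whole m_step groups that fit in the bound (integer division truncates).
    const std::size_t groups = max_chunk_bytes / bytes_per_m_step;

    // Same shape as the expression in the diff: never return fewer than m_step rows.
    return std::max<std::size_t>(m_step, groups * m_step);
}
```

With `m_step = 16`, 4096 bytes per group and a 1 MiB bound, 256 groups fit, giving 4096 rows per chunk. If a single group needs 2 MiB, the result is still 16 rows, so the bound is exceeded in that one case in exchange for guaranteed forward progress.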


 #include "mlasi.h"
+#if defined(USE_KLEIDIAI)
+#include "kleidiai/mlasi_kleidiai.h"
+#endif


martin-klacer-arm · 2026-05-20T09:29:13Z

@microsoft-github-policy-service agree company="Arm"

hariharans29 · 2026-05-20T20:53:29Z

Can you please address the Copilot comments?

martin-klacer-arm · 2026-05-21T16:07:24Z

Hi, thanks for the reminder. The comments Copilot left are sensible, and I'm addressing them in a new patch that is currently a work in progress. Once it's done and clears internal review, I'll add it to the PR.

JonathanC-ARM and others added 2 commits May 18, 2026 16:49

hariharans29 requested a review from Copilot May 19, 2026 16:07

Copilot started reviewing on behalf of hariharans29 May 19, 2026 16:08

Copilot AI reviewed May 19, 2026

hariharans29 changed the title ~~KleidiAI fix igemm regression~~ [MLAS] KleidiAI fix igemm regression May 19, 2026

martin-klacer-arm wants to merge 2 commits into microsoft:main from martin-klacer-arm:markla01_fix_igemm_regression

4 participants
