[MLAS] KleidiAI fix igemm regression #28571

Open
martin-klacer-arm wants to merge 2 commits into microsoft:main from martin-klacer-arm:markla01_fix_igemm_regression

Conversation

@martin-klacer-arm

Description

This PR fixes a convolution performance regression affecting some OCR models with large-kernel convolutions when the KleidiAI SME IGEMM convolution path is selected.

The change has 2 parts:

  1. updates to the KleidiAI IGEMM LHS packing to pack rows in bounded chunks instead of packing the full LHS buffer up front, which reduces memory usage and improves cache locality for large convolutions,
  2. a new route selection function ArmKleidiAI::SelectConvRoute that decides between Igemm, GemmFallback and None based on convolution parameters and a workload size-based heuristic.

CheckCapabilitiesSme calls SelectConvRoute and returns true only if the selected route is Igemm. The patch also adds a standard GEMM fallback to the ConvRoute options and runs MlasGemm when that fallback is selected. If SelectConvRoute returns None, the convolution falls back to MlasSgemmOperation.
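
To make the dispatch described above concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `ConvShape`, the `*Sketch` function names, the work estimate, the threshold value, and the direction of the comparison are hypothetical and are not taken from the patch. Only the three route names and the mapping from route to execution path come from the description.

```cpp
#include <cassert>
#include <cstddef>

// The three routes named in the PR description.
enum class ConvRoute { Igemm, GemmFallback, None };

// Hypothetical, simplified parameter bundle (the real code uses MLAS conv parameters).
struct ConvShape {
    size_t output_m;      // output height * output width
    size_t kernel_h;
    size_t kernel_w;
    size_t channels_in;
    size_t channels_out;
    bool sme_available;
};

// Placeholder threshold, NOT the value used by the patch.
constexpr size_t kMaxIgemmWork = size_t{1} << 28;

// Assumed heuristic: small workloads use IGEMM, large ones use the GEMM fallback.
ConvRoute SelectConvRouteSketch(const ConvShape& s) {
    if (!s.sme_available) {
        return ConvRoute::None;
    }
    assert(s.kernel_h > 0 && s.kernel_w > 0);
    const size_t work =
        s.output_m * s.kernel_h * s.kernel_w * s.channels_in * s.channels_out;
    return work <= kMaxIgemmWork ? ConvRoute::Igemm : ConvRoute::GemmFallback;
}

// Mirrors the description: true only when the selected route is Igemm.
bool CheckCapabilitiesSmeSketch(const ConvShape& s) {
    return SelectConvRouteSketch(s) == ConvRoute::Igemm;
}

// Maps each route to the execution path named in the description.
const char* DispatchName(ConvRoute route) {
    switch (route) {
        case ConvRoute::Igemm:        return "KleidiAI IGEMM";
        case ConvRoute::GemmFallback: return "MlasGemm";
        case ConvRoute::None:         return "MlasSgemmOperation";
    }
    return "unknown";
}
```

Keeping the capability check as a thin wrapper over the route selector means the IGEMM path and the generic convolution path consult the same decision, so they cannot disagree about which kernel should run.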

Motivation and Context

Fixes #27633.

JonathanC-ARM and others added 2 commits May 18, 2026 16:49
Previously, the convolution kernel for KleidiAI would allocate a large contiguous buffer
for the LHS (left-hand-side) matrix packing, which could consume excessive memory
and reduce cache efficiency.

This patch modifies the packing strategy to use a chunked approach:
- Introduce a compile-time upper bound for temporary LHS packing buffers.
- Allocate a moderate-sized temporary buffer once.
- Pack LHS rows in chunks, perform computation, then reuse the buffer for the next chunk.
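
The three steps above can be sketched as follows. This is an illustrative stand-in, not the patch's code: `kMaxLhsChunkBytes`, `RowsPerChunk`, and `ChunkedRowSum` are hypothetical names, a plain copy stands in for the KleidiAI LHS packing call, and a running sum stands in for the matrix multiply. The rows-per-chunk expression follows the `std::max<size_t>(m_step, MAX_LHS_CHUNK_BYTES / bytes_per_m_step * m_step)` line shown in the review excerpt.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical byte budget; the patch defines its own compile-time limit.
constexpr size_t kMaxLhsChunkBytes = 1024;

// Rows per chunk: as many whole m_step groups as fit in the budget,
// but never fewer than one m_step group.
size_t RowsPerChunk(size_t m_step, size_t bytes_per_m_step) {
    assert(m_step > 0 && bytes_per_m_step > 0);
    return std::max<size_t>(m_step, kMaxLhsChunkBytes / bytes_per_m_step * m_step);
}

// Processes a row-major m x k matrix through one reused scratch buffer.
// The copy stands in for "pack" and the summation stands in for "compute".
float ChunkedRowSum(const std::vector<float>& lhs, size_t m, size_t k, size_t m_step) {
    assert(lhs.size() == m * k);
    const size_t chunk_rows = RowsPerChunk(m_step, m_step * k * sizeof(float));
    std::vector<float> scratch(chunk_rows * k);  // allocated once, reused per chunk
    float total = 0.0f;
    for (size_t row = 0; row < m; row += chunk_rows) {
        const size_t rows = std::min(chunk_rows, m - row);  // last chunk may be short
        std::copy_n(lhs.begin() + row * k, rows * k, scratch.begin());
        for (size_t i = 0; i < rows * k; ++i) {
            total += scratch[i];
        }
    }
    return total;
}
```

The scratch buffer size depends only on the byte budget and `m_step`, not on the total number of rows, which is what bounds peak memory for large convolutions.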

Benefits:
- Significantly reduces peak memory usage.
- Improves cache utilization and overall computation efficiency.
- Avoids potential memory allocation failures for large convolutions.

Performance improvement:
- Tested with the model https://huggingface.co/garavv/arcface-onnx on an MTK D9500

    Before this patch
    ```
    ./build/RelWithDebInfo/onnxruntime_perf_test -x 1  -r 1000 arc.onnx

    Number of inferences per second: 4.25327
    ```

    After this patch
    ```
    ./build/RelWithDebInfo/onnxruntime_perf_test -x 1  -r 1000 arc.onnx

    Number of inferences per second: 5.03257
    ```

----------------------------------------------------------------------------------------------------------------
sme_with_patch (μs) | sme_without_patch (μs) | Diff (μs) | Change % | Node Name
----------------------------------------------------------------------------------------------------------------
  11975.10 |    31235.30 |     19260.20 |    160.84% | StatefulPartitionedCall/ResNet34/conv2_block1_1_conv/Conv2D
   4514.80 |     7691.70 |      3176.90 |     70.37% | StatefulPartitionedCall/ResNet34/conv2_block2_1_conv/Conv2D
   4220.20 |     7120.70 |      2900.50 |     68.73% | StatefulPartitionedCall/ResNet34/conv2_block3_1_conv/Conv2D
   5429.20 |     8279.60 |      2850.40 |     52.50% | StatefulPartitionedCall/ResNet34/conv3_block1_1_conv/Conv2D
   4497.80 |     5478.40 |       980.60 |     21.80% | StatefulPartitionedCall/ResNet34/conv4_block1_1_conv/Conv2D
   3474.30 |     4351.80 |       877.50 |     25.26% | StatefulPartitionedCall/ResNet34/conv3_block3_1_conv/Conv2D
   3627.30 |     4504.00 |       876.70 |     24.17% | StatefulPartitionedCall/ResNet34/conv3_block4_1_conv/Conv2D
   5244.20 |     5961.10 |       716.90 |     13.67% | StatefulPartitionedCall/ResNet34/conv1_conv/Conv2D
   3439.80 |     4050.90 |       611.10 |     17.77% | StatefulPartitionedCall/ResNet34/conv3_block2_1_conv/Conv2D
   9749.80 |    10195.50 |       445.70 |      4.57% | StatefulPartitionedCall/ResNet34/conv2_block2_2_conv/Conv2D
   3814.00 |     4209.80 |       395.80 |     10.38% | StatefulPartitionedCall/ResNet34/conv5_block2_2_conv/Conv2D
   2715.90 |     3034.70 |       318.80 |     11.74% | StatefulPartitionedCall/ResNet34/conv4_block6_1_conv/Conv2D
   4089.10 |     4367.80 |       278.70 |      6.82% | StatefulPartitionedCall/ResNet34/conv5_block1_1_conv/Conv2D
   2698.00 |     2959.50 |       261.50 |      9.69% | StatefulPartitionedCall/ResNet34/conv4_block5_1_conv/Conv2D
   3869.20 |     4102.80 |       233.60 |      6.04% | StatefulPartitionedCall/ResNet34/conv5_block3_2_conv/Conv2D
   2767.90 |     2966.80 |       198.90 |      7.19% | StatefulPartitionedCall/ResNet34/conv4_block4_1_conv/Conv2D
   9652.10 |     9816.60 |       164.50 |      1.70% | StatefulPartitionedCall/ResNet34/conv2_block3_2_conv/Conv2D
   2897.50 |     3054.60 |       157.10 |      5.42% | StatefulPartitionedCall/ResNet34/conv4_block3_1_conv/Conv2D
   4601.20 |     4748.60 |       147.40 |      3.20% | StatefulPartitionedCall/ResNet34/conv5_block1_2_conv/Conv2D
   3134.00 |     3246.10 |       112.10 |      3.58% | StatefulPartitionedCall/ResNet34/conv4_block2_1_conv/Conv2D

Signed-off-by: Qxiang Xu <Qixiang.Xu@arm.com>
Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
 - Added SelectConvRoute function to mlasi_kleidiai.h to decide between
   GemmFallback and Igemm based on the convolution workload parameters
 - Updated CheckCapabilitiesSme function in convolve_kleidiai.cpp
   to use the new SelectConvRoute function

Co-authored-by: Damien Dooley <damien.dooley@arm.com>
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a CPU convolution performance regression on ARM64 when the KleidiAI SME IGEMM path is selected, by introducing a new convolution route-selection heuristic and changing the KleidiAI LHS packing strategy to reduce peak temporary memory and improve cache locality on large workloads.

Changes:

  • Added ArmKleidiAI::SelectConvRoute (with ConvRoute options Igemm, GemmFallback, None) to decide whether to run KleidiAI IGEMM, fall back to GEMM, or use the existing path.
  • Updated KleidiAI convolution to pack the IGEMM LHS in bounded chunks instead of packing the full LHS up front.
  • In the generic MLAS convolution implementation, routed certain workloads to MlasGemm (instead of MlasSgemmOperation) when SelectConvRoute chooses GemmFallback.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h | Adds ConvRoute + SelectConvRoute heuristic helpers for choosing conv execution route. |
| onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp | Reworks KleidiAI SME IGEMM LHS packing into chunked packing and integrates route selection into capability checks. |
| onnxruntime/core/mlas/lib/convolve.cpp | Uses SelectConvRoute to optionally switch im2col+SGEMM slices to MlasGemm for the fallback route. |


Comment on lines +89 to +96
const auto effective_kernel_h =
    ComputeDilatedKernelSize(Parameters->DilationShape[0], Parameters->KernelShape[0]);
const auto effective_kernel_w =
    ComputeDilatedKernelSize(Parameters->DilationShape[1], Parameters->KernelShape[1]);
const auto output_m =
    ComputeConvOutputSize(Parameters->InputShape[0], effective_kernel_h, Parameters->Padding[0], Parameters->StrideShape[0]) *
    ComputeConvOutputSize(Parameters->InputShape[1], effective_kernel_w, Parameters->Padding[1], Parameters->StrideShape[1]);
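
For readers unfamiliar with the quantities in the excerpt above, the following sketch shows the textbook formulas for a dilated kernel extent and a convolution output size. The bodies of the PR's `ComputeDilatedKernelSize` and `ComputeConvOutputSize` are not shown on this page, so the helpers below use different, hypothetical names and assume symmetric padding; they may not match the patch's implementation.

```cpp
#include <cassert>
#include <cstddef>

// Effective extent of a kernel once dilation spreads its taps apart:
// dilation * (kernel - 1) + 1.
size_t DilatedKernelSize(size_t dilation, size_t kernel) {
    assert(dilation > 0 && kernel > 0);
    return dilation * (kernel - 1) + 1;
}

// Output length along one axis, assuming the same padding on both sides:
// floor((input + 2 * padding - effective_kernel) / stride) + 1.
// Returns 0 when the padded input is smaller than the kernel.
size_t ConvOutputSize(size_t input, size_t effective_kernel, size_t padding, size_t stride) {
    assert(stride > 0);
    const size_t padded = input + 2 * padding;
    if (padded < effective_kernel) {
        return 0;
    }
    return (padded - effective_kernel) / stride + 1;
}
```

Multiplying the height and width results gives the number of output positions, which corresponds to `output_m` in the excerpt.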

Comment on lines +592 to +596
kai_run_lhs_imatmul_pack_x32p2vlx1_x32p_sme(tile_size_m,
d_kh * d_kw,
ci,
lhs_ptrs.get() + (tile_m_start + m_base) * d_kh * d_kw,
reinterpret_cast<size_t>(activation_src),
m_step, d_kh * d_kw, ci);

// Determine how many rows we can pack in one chunk.
size_t m_chunk = std::max<size_t>(m_step, MAX_LHS_CHUNK_BYTES / bytes_per_m_step * m_step);
Comment on lines 17 to +20
#include "mlasi.h"
#if defined(USE_KLEIDIAI)
#include "kleidiai/mlasi_kleidiai.h"
#endif
@hariharans29 changed the title from "KleidiAI fix igemm regression" to "[MLAS] KleidiAI fix igemm regression" on May 19, 2026
@martin-klacer-arm
Author

@microsoft-github-policy-service agree company="Arm"

@hariharans29
Member

Can you please address the Copilot comments?

@martin-klacer-arm
Author

Hi, thanks for the reminder. The comments Copilot left are sensible, and I'm addressing them in a new patch that is currently a work in progress. Once it's done and clears internal review, I'll add it to the PR.


Development

Successfully merging this pull request may close these issues.

[Performance] CPU Conv kernel regression for large kernels (9x9) on macOS ARM64 in ORT 1.24.x

4 participants