[MLAS] KleidiAI fix igemm regression #28571
Previously, the convolution kernel for KLEIDIAI would allocate a large contiguous buffer for the LHS (left-hand-side) matrix packing, which could consume excessive memory and reduce cache efficiency.

This patch modifies the packing strategy to use a chunked approach:
- Introduce a compile-time upper bound for temporary LHS packing buffers.
- Allocate a moderate-sized temporary buffer once.
- Pack LHS rows in chunks, perform computation, then reuse the buffer for the next chunk.

Benefits:
- Significantly reduces peak memory usage.
- Improves cache utilization and overall computation efficiency.
- Avoids potential memory allocation failures for large convolutions.

Performance improvement:
- Test with model https://huggingface.co/garavv/arcface-onnx on MTK D9500

Before this patch
```
./build/RelWithDebInfo/onnxruntime_perf_test -x 1 -r 1000 arc.onnx
Number of inferences per second: 4.25327
```
After this patch
```
./build/RelWithDebInfo/onnxruntime_perf_test -x 1 -r 1000 arc.onnx
Number of inferences per second: 5.03257
```

| sme_with_patch | sme_without_patch | Diff (μs) | Change % | Node Name |
|---|---|---|---|---|
| 11975.10 | 31235.30 | 19260.20 | 160.84% | StatefulPartitionedCall/ResNet34/conv2_block1_1_conv/Conv2D |
| 4514.80 | 7691.70 | 3176.90 | 70.37% | StatefulPartitionedCall/ResNet34/conv2_block2_1_conv/Conv2D |
| 4220.20 | 7120.70 | 2900.50 | 68.73% | StatefulPartitionedCall/ResNet34/conv2_block3_1_conv/Conv2D |
| 5429.20 | 8279.60 | 2850.40 | 52.50% | StatefulPartitionedCall/ResNet34/conv3_block1_1_conv/Conv2D |
| 4497.80 | 5478.40 | 980.60 | 21.80% | StatefulPartitionedCall/ResNet34/conv4_block1_1_conv/Conv2D |
| 3474.30 | 4351.80 | 877.50 | 25.26% | StatefulPartitionedCall/ResNet34/conv3_block3_1_conv/Conv2D |
| 3627.30 | 4504.00 | 876.70 | 24.17% | StatefulPartitionedCall/ResNet34/conv3_block4_1_conv/Conv2D |
| 5244.20 | 5961.10 | 716.90 | 13.67% | StatefulPartitionedCall/ResNet34/conv1_conv/Conv2D |
| 3439.80 | 4050.90 | 611.10 | 17.77% | StatefulPartitionedCall/ResNet34/conv3_block2_1_conv/Conv2D |
| 9749.80 | 10195.50 | 445.70 | 4.57% | StatefulPartitionedCall/ResNet34/conv2_block2_2_conv/Conv2D |
| 3814.00 | 4209.80 | 395.80 | 10.38% | StatefulPartitionedCall/ResNet34/conv5_block2_2_conv/Conv2D |
| 2715.90 | 3034.70 | 318.80 | 11.74% | StatefulPartitionedCall/ResNet34/conv4_block6_1_conv/Conv2D |
| 4089.10 | 4367.80 | 278.70 | 6.82% | StatefulPartitionedCall/ResNet34/conv5_block1_1_conv/Conv2D |
| 2698.00 | 2959.50 | 261.50 | 9.69% | StatefulPartitionedCall/ResNet34/conv4_block5_1_conv/Conv2D |
| 3869.20 | 4102.80 | 233.60 | 6.04% | StatefulPartitionedCall/ResNet34/conv5_block3_2_conv/Conv2D |
| 2767.90 | 2966.80 | 198.90 | 7.19% | StatefulPartitionedCall/ResNet34/conv4_block4_1_conv/Conv2D |
| 9652.10 | 9816.60 | 164.50 | 1.70% | StatefulPartitionedCall/ResNet34/conv2_block3_2_conv/Conv2D |
| 2897.50 | 3054.60 | 157.10 | 5.42% | StatefulPartitionedCall/ResNet34/conv4_block3_1_conv/Conv2D |
| 4601.20 | 4748.60 | 147.40 | 3.20% | StatefulPartitionedCall/ResNet34/conv5_block1_2_conv/Conv2D |
| 3134.00 | 3246.10 | 112.10 | 3.58% | StatefulPartitionedCall/ResNet34/conv4_block2_1_conv/Conv2D |

Signed-off-by: Qxiang Xu <Qixiang.Xu@arm.com>
Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
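The chunked strategy from this commit message can be illustrated with a minimal, self-contained sketch. It is not the patch's code: `kMaxLhsChunkBytes`, `PackRows`, and `ChunkedRowSums` are hypothetical names, the 64 KiB bound is an arbitrary placeholder, and the "pack" and "compute" steps are stand-ins (a plain copy and a row sum) for the KleidiAI packing and matmul kernels.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical compile-time upper bound on the temporary packing buffer.
constexpr std::size_t kMaxLhsChunkBytes = 64 * 1024;

// Stand-in for LHS packing: copies `rows` rows of width `k` into `dst`.
inline void PackRows(const float* src, std::size_t rows, std::size_t k, float* dst) {
    std::copy(src, src + rows * k, dst);
}

// Walks an m x k row-major matrix in bounded chunks and returns one sum per row
// (the sum stands in for the real matmul). Optionally reports the scratch size.
inline std::vector<float> ChunkedRowSums(const std::vector<float>& lhs, std::size_t m,
                                         std::size_t k, std::size_t* scratch_elems = nullptr) {
    std::vector<float> out(m, 0.0f);
    if (scratch_elems != nullptr) *scratch_elems = 0;
    if (m == 0 || k == 0) return out;

    const std::size_t bytes_per_row = k * sizeof(float);
    // At least one row per chunk, otherwise as many rows as fit under the bound.
    const std::size_t rows_per_chunk =
        std::min(m, std::max<std::size_t>(1, kMaxLhsChunkBytes / bytes_per_row));

    std::vector<float> scratch(rows_per_chunk * k);  // allocated once, reused per chunk
    if (scratch_elems != nullptr) *scratch_elems = scratch.size();

    for (std::size_t row = 0; row < m; row += rows_per_chunk) {
        const std::size_t rows = std::min(rows_per_chunk, m - row);  // last chunk may be short
        PackRows(lhs.data() + row * k, rows, k, scratch.data());
        for (std::size_t r = 0; r < rows; ++r) {
            float acc = 0.0f;
            for (std::size_t c = 0; c < k; ++c) acc += scratch[r * k + c];
            out[row + r] = acc;
        }
    }
    return out;
}
```

In this sketch the scratch buffer never exceeds the bound unless a single row is already larger than it, so peak temporary memory no longer grows with `m`.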
- Added SelectConvRoute function to mlasi_kleidiai.h to decide between GemmFallback and Igemm based on the convolution workload parameters
- Updated CheckCapabilitiesSme function in convolve_kleidiai.cpp to use the new SelectConvRoute function

Co-authored-by: Damien Dooley <damien.dooley@arm.com>
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Pull request overview
This PR addresses a CPU convolution performance regression on ARM64 when the KleidiAI SME IGEMM path is selected, by introducing a new convolution route-selection heuristic and changing the KleidiAI LHS packing strategy to reduce peak temporary memory and improve cache locality on large workloads.
Changes:
- Added `ArmKleidiAI::SelectConvRoute` (with `ConvRoute` options `Igemm`, `GemmFallback`, `None`) to decide whether to run KleidiAI IGEMM, fall back to GEMM, or use the existing path.
- Updated KleidiAI convolution to pack the IGEMM LHS in bounded chunks instead of packing the full LHS up front.
- In the generic MLAS convolution implementation, routed certain workloads to `MlasGemm` (instead of `MlasSgemmOperation`) when `SelectConvRoute` chooses `GemmFallback`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h | Adds ConvRoute + SelectConvRoute heuristic helpers for choosing conv execution route. |
| onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp | Reworks KleidiAI SME IGEMM LHS packing into chunked packing and integrates route selection into capability checks. |
| onnxruntime/core/mlas/lib/convolve.cpp | Uses SelectConvRoute to optionally switch im2col+SGEMM slices to MlasGemm for the fallback route. |
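The three-way routing summarized above can be sketched as follows. The actual heuristic in `SelectConvRoute` is not shown in this conversation, so everything except the `ConvRoute` option names is an assumption: the `ConvWorkload` struct, the work estimate, the placeholder threshold, and the choice to send larger workloads to the GEMM fallback are all hypothetical.

```cpp
#include <cstddef>

// Mirrors the three routes named in the PR; everything else here is hypothetical.
enum class ConvRoute { Igemm, GemmFallback, None };

// Hypothetical, simplified description of a 2D convolution workload.
struct ConvWorkload {
    std::size_t output_m;      // output height * output width
    std::size_t kernel_hw;     // kernel height * kernel width
    std::size_t channels_in;
    std::size_t channels_out;
    bool igemm_supported;      // e.g. SME available and parameters in range
};

// Placeholder threshold, NOT the value used by the patch.
constexpr std::size_t kLargeWorkloadThreshold = std::size_t{1} << 28;

// Assumed policy: unsupported -> None, very large workloads -> GemmFallback,
// everything else -> Igemm.
inline ConvRoute SelectConvRouteSketch(const ConvWorkload& w) {
    if (!w.igemm_supported) return ConvRoute::None;
    const std::size_t work = w.output_m * w.kernel_hw * w.channels_in * w.channels_out;
    return work >= kLargeWorkloadThreshold ? ConvRoute::GemmFallback : ConvRoute::Igemm;
}

// Capability check in the style described in the PR: true only for the Igemm route.
inline bool CheckCapabilitiesSketch(const ConvWorkload& w) {
    return SelectConvRouteSketch(w) == ConvRoute::Igemm;
}
```

Keeping the decision in one function means the capability check and the generic convolution path cannot disagree about which route a given workload takes.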
    const auto effective_kernel_h =
        ComputeDilatedKernelSize(Parameters->DilationShape[0], Parameters->KernelShape[0]);
    const auto effective_kernel_w =
        ComputeDilatedKernelSize(Parameters->DilationShape[1], Parameters->KernelShape[1]);
    const auto output_m =
        ComputeConvOutputSize(Parameters->InputShape[0], effective_kernel_h, Parameters->Padding[0], Parameters->StrideShape[0]) *
        ComputeConvOutputSize(Parameters->InputShape[1], effective_kernel_w, Parameters->Padding[1], Parameters->StrideShape[1]);

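The snippet above relies on two helpers whose definitions are not shown in this conversation. A minimal sketch of the standard convolution arithmetic they presumably implement is given below; the `...Sketch` names are hypothetical, the argument order simply mirrors the calls above, and padding is assumed to be the same on both sides of each axis, which may differ from how the patch handles it.

```cpp
#include <cstddef>

// Effective extent of a dilated kernel: dilation * (kernel - 1) + 1.
// Assumes kernel >= 1 and dilation >= 1.
constexpr std::size_t DilatedKernelSizeSketch(std::size_t dilation, std::size_t kernel) {
    return dilation * (kernel - 1) + 1;
}

// Output extent along one axis, assuming identical padding on both sides
// and stride >= 1. Returns 0 when the kernel does not fit the padded input.
constexpr std::size_t ConvOutputSizeSketch(std::size_t input, std::size_t dilated_kernel,
                                           std::size_t padding, std::size_t stride) {
    const std::size_t padded = input + 2 * padding;
    return padded < dilated_kernel ? 0 : (padded - dilated_kernel) / stride + 1;
}
```

With these definitions, `output_m` is the product of the two per-axis output extents, that is, the number of output pixels and therefore the number of LHS rows in the implicit GEMM.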
    kai_run_lhs_imatmul_pack_x32p2vlx1_x32p_sme(tile_size_m,
                                                d_kh * d_kw,
                                                ci,
                                                lhs_ptrs.get() + (tile_m_start + m_base) * d_kh * d_kw,
                                                reinterpret_cast<size_t>(activation_src),

    m_step, d_kh * d_kw, ci);

    // Determine how many rows we can pack in one chunk.
    size_t m_chunk = std::max<size_t>(m_step, MAX_LHS_CHUNK_BYTES / bytes_per_m_step * m_step);
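
The chunk-size expression above combines integer division with a lower clamp. The sketch below isolates that arithmetic so the rounding behaviour is easy to check; `RowsPerChunkSketch` and its parameter names are hypothetical, and the byte bound is passed as an argument because the value of `MAX_LHS_CHUNK_BYTES` is not shown in this conversation.

```cpp
#include <algorithm>
#include <cstddef>

// Rows per chunk: a whole number of m_step-sized blocks that fits within
// max_chunk_bytes, but never fewer than one block.
// Assumes m_step > 0 and bytes_per_m_step > 0.
constexpr std::size_t RowsPerChunkSketch(std::size_t m_step, std::size_t bytes_per_m_step,
                                         std::size_t max_chunk_bytes) {
    // Integer division rounds down, so the chunk never exceeds the bound
    // when at least one block fits.
    const std::size_t blocks = max_chunk_bytes / bytes_per_m_step;
    return std::max<std::size_t>(m_step, blocks * m_step);
}
```

Note that when a single block is larger than the bound, the clamp still yields one block, so the bound acts as a soft limit and the temporary buffer has to be sized for at least one `m_step` block.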

    #include "mlasi.h"
    #if defined(USE_KLEIDIAI)
    #include "kleidiai/mlasi_kleidiai.h"
    #endif

@microsoft-github-policy-service agree company="Arm"

Can you please address the Copilot comments?

Hi, thanks for the reminder. The comments Copilot left are sensible, and I'm addressing them with a new patch that is currently a work in progress. Once it's done and clears internal review, I'll add it to the PR.
Description
This PR fixes a convolution performance regression affecting some OCR models with large-kernel convolutions when the KleidiAI SME IGEMM convolution path is selected.
The change has 2 parts:
- A function `ArmKleidiAI::SelectConvRoute` that decides between `Igemm`, `GemmFallback` and `None` based on convolution parameters and a workload size-based heuristic.
- The function `CheckCapabilitiesSme` runs `SelectConvRoute` and only returns true if the selected route is `Igemm`.

The patch also adds a standard GEMM fallback to the `ConvRoute` possibilities, and runs `MlasGemm` if said fallback is selected. If the function selects `None`, then the convolution falls back to `MlasSgemmOperation`.

Motivation and Context
Fixes #27633.