ARM-software
diff --git a/‎main/acle.md‎
Lines changed: 235 additions & 1 deletion b/‎main/acle.md‎
Lines changed: 235 additions & 1 deletion
@@ -11,7 +11,7 @@ toc: true
 ---
 
 <!--
-SPDX-FileCopyrightText: Copyright 2011-2025 Arm Limited and/or its affiliates <open-source-office@arm.com>
+SPDX-FileCopyrightText: Copyright 2011-2026 Arm Limited and/or its affiliates <open-source-office@arm.com>
 SPDX-FileCopyrightText: Copyright 2022 Google LLC.
 CC-BY-SA-4.0 AND Apache-Patent-License
 See LICENSE.md file for details
@@ -487,6 +487,9 @@ Armv8.4-A [[ARMARMv84]](#ARMARMv84). Support is added for the Dot Product intrin
 * Removed all references to Transactional Memory Extension (TME).
 * Added [**Alpha**](#current-status-and-anticipated-changes) support
   for Brain 16-bit floating-point vector multiplication intrinsics.
+* Added [**Alpha**](#current-status-and-anticipated-changes)
+  support for SVE2.3 (FEAT_SVE2p3), SME2.3 (FEAT_SME2p3), FEAT_F16F32DOT,
+  FEAT_F16F32MM, FEAT_F16MM and FEAT_SVE_B16MM intrinsics.
 
 ### References
 
@@ -2003,6 +2006,10 @@ are available. This implies that `__ARM_FEATURE_SVE` is nonzero.
  are available and if the associated [ACLE features]
 (#sme-language-extensions-and-intrinsics) are supported.
 
+`__ARM_FEATURE_SVE2p3` is defined to 1 if the FEAT_SVE2p3 instructions
+ are available and if the associated [ACLE features]
+(#sme-language-extensions-and-intrinsics) are supported.
+
 #### NEON-SVE Bridge macro
 
 `__ARM_NEON_SVE_BRIDGE` is defined to 1 if the [`<arm_neon_sve_bridge.h>`](#arm_neon_sve_bridge.h)
@@ -2026,6 +2033,7 @@ of SME has an associated preprocessor macro, given in the table below:
 | FEAT_SME2   | __ARM_FEATURE_SME2         |
 | FEAT_SME2p1 | __ARM_FEATURE_SME2p1       |
 | FEAT_SME2p2 | __ARM_FEATURE_SME2p2       |
+| FEAT_SME2p3 | __ARM_FEATURE_SME2p3       |
 
 Each macro is defined if there is hardware support for the associated
 architecture feature and if all of the [ACLE
@@ -2162,6 +2170,20 @@ See [Half-precision brain
 floating-point](#half-precision-brain-floating-point) for details
 of half-precision brain floating-point types.
 
+#### Brain 16-bit floating-point matrix multiplication support
+
+This section is in
+[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
+extended in the future.
+
+`__ARM_FEATURE_SVE_B16MM` is defined to `1` if there is hardware
+support for the SVE BF16 matrix multiply extension and if the
+associated ACLE intrinsics are available.
+
+See [Half-precision brain
+floating-point](#half-precision-brain-floating-point) for details
+of half-precision brain floating-point types.
+
 ### Cryptographic extensions
 
 #### “Crypto” extension
@@ -2380,6 +2402,18 @@ this implies:
 
   * `__ARM_NEON == 1`
 
+#### Half-precision to single-precision dot product extension
+
+This section is in
+[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
+extended in the future.
+
+`__ARM_FEATURE_F16F32DOT` is defined if the half-precision dot product
+accumulating to single-precision instructions are supported and the vector
+intrinsics are available. Note that this implies:
+
+  * `__ARM_NEON == 1`
+
 #### Complex number intrinsics
 
 `__ARM_FEATURE_COMPLEX` is defined if the complex addition and complex
@@ -2425,6 +2459,21 @@ instructions and if the associated ACLE intrinsics are available.
 for the SVE 16-bit floating-point to 32-bit floating-point matrix multiply and add
 (FEAT_SVE_F16F32MM) instructions and if the associated ACLE intrinsics are available.
 
+##### Multiplication of 16-bit floating-point matrices (AdvSIMD)
+
+This section is in
+[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
+extended in the future.
+
+`__ARM_FEATURE_F16F32MM` is defined if the NEON half-precision matrix multiply
+accumulating to single-precision instruction is supported. Note that this implies:
+
+  * `__ARM_NEON == 1`
+
+`__ARM_FEATURE_F16MM` is defined if the NEON non-widening half-precision matrix multiply instruction instruction is supported. Note that this implies:
+
+  * `__ARM_NEON == 1`
+
 ##### Multiplication of 32-bit floating-point matrices
 
 `__ARM_FEATURE_SVE_MATMUL_FP32` is defined to `1` if there is hardware support
@@ -2662,6 +2711,9 @@ be found in [[BA]](#BA).
 | [`__ARM_FEATURE_DIRECTED_ROUNDING`](#directed-rounding)                                                                                                 | Directed Rounding                                                                                  | 1           |
 | [`__ARM_FEATURE_DOTPROD`](#availability-of-dot-product-intrinsics)                                                                                      | Dot product extension (ARM v8.2-A)                                                                 | 1           |
 | [`__ARM_FEATURE_DSP`](#dsp-instructions)                                                                                                                | DSP instructions (Arm v5E) (32-bit-only)                                                           | 1           |
+| [`__ARM_FEATURE_F16F32DOT`](#half-precision-to-single-precision-dot-product-extension)                                                                  | Half-precision to single-precision dot product extension (FEAT_F16F32DOT)                          | 1           |
+| [`__ARM_FEATURE_F16F32MM`](#multiplication-of-16-bit-floating-point-matrices-advsimd)                                                                   | Half-precision to single-precision matrix multiply accumulating extension (FEAT_F16F32MM).         | 1           |
+| [`__ARM_FEATURE_F16MM`](#multiplication-of-16-bit-floating-point-matrices-advsimd)                                                                      | Half-precision matrix multiply accumulating extension (FEAT_F16MM)                                 | 1           |
 | [`__ARM_FEATURE_FAMINMAX`](#floating-point-absolute-minimum-and-maximum-extension)                                                                      | Floating-point absolute minimum and maximum extension                                              | 1           |
 | [`__ARM_FEATURE_FMA`](#fused-multiply-accumulate-fma)                                                                                                   | Floating-point fused multiply-accumulate                                                           | 1           |
 | [`__ARM_FEATURE_FP16_FML`](#fp16-fml-extension)                                                                                                         | FP16 FML extension (Arm v8.4-A, optional Armv8.2-A, Armv8.3-A)                                     | 1           |
@@ -2714,6 +2766,7 @@ be found in [[BA]](#BA).
 | [`__ARM_FEATURE_SSVE_FP8FMA`](#modal-8-bit-floating-point-extensions)                                                                                   | Modal 8-bit floating-point extensions                                                              | 1           |
 | [`__ARM_FEATURE_SVE`](#scalable-vector-extension-sve)                                                                                                   | Scalable Vector Extension (FEAT_SVE)                                                               | 1           |
 | [`__ARM_FEATURE_SVE_B16B16`](#non-widening-brain-16-bit-floating-point-support)                                                                         | Non-widening brain 16-bit floating-point intrinsics (FEAT_SVE_B16B16)                              | 1           |
+| [`__ARM_FEATURE_SVE_B16MM`](#brain-16-bit-floating-point-matrix-multiplication-support)                                                                 | SVE brain 16-bit floating-point matrix multiply extension (FEAT_SVE_B16MM)                         | 1           |
 | [`__ARM_FEATURE_SVE_BF16`](#brain-16-bit-floating-point-support)                                                                                        | SVE support for the 16-bit brain floating-point extension (FEAT_BF16)                              | 1           |
 | [`__ARM_FEATURE_SVE_BFSCALE`](#brain-16-bit-floating-point-vector-multiplication-support)                                                               | SVE support for the 16-bit brain floating-point vector multiplication extension (FEAT_SVE_BFSCALE) | 1           |
 | [`__ARM_FEATURE_SVE_BITS`](#scalable-vector-extension-sve)                                                                                              | The number of bits in an SVE vector, when known in advance                                         | 256         |
@@ -2737,6 +2790,7 @@ be found in [[BA]](#BA).
 | [`__ARM_FEATURE_SVE2_SM4`](#sm4-extension)                                                                                                              | SVE2 support for the SM4 cryptographic extension (FEAT_SVE_SM4)                                     | 1           |
 | [`__ARM_FEATURE_SVE2p1`](#sve2)                                                                                                                         | SVE version 2.1 (FEAT_SVE2p1)
 | [`__ARM_FEATURE_SVE2p2`](#sve2)                                                                                                                         | SVE version 2.2 (FEAT_SVE2p2)
+| [`__ARM_FEATURE_SVE2p3`](#sve2)                                                                                                                         | SVE version 2.3 (FEAT_SVE2p3)
 | [`__ARM_FEATURE_SYSREG128`](#bit-system-registers)                                                                                                      | Support for 128-bit system registers (FEAT_SYSREG128)                                              | 1           |
 | [`__ARM_FEATURE_UNALIGNED`](#unaligned-access-supported-in-hardware)                                                                                    | Hardware support for unaligned access                                                              | 1           |
 | [`__ARM_FP`](#hardware-floating-point)                                                                                                                  | Hardware floating-point                                                                            | 1           |
@@ -9594,6 +9648,15 @@ BFloat16 floating-point adjust exponent vectors.
 
 ### SVE2 floating-point matrix multiply-accumulate instructions.
 
+#### BFMMLA, FMMLA(non-widening)
+
+16-bit floating-point matrix multiply-accumulate.
+```c
+  // Only if __ARM_FEATURE_SVE_B16MM
+  // Variant also available for _f16 if (__ARM_FEATURE_SVE2p2 && __ARM_FEATURE_F16MM)
+  svbfloat16_t svmmla[_bf16](svbfloat16_t zda, svbfloat16_t zn, svbfloat16_t zm);
+```
+
 #### FMMLA (widening, FP8 to FP16)
 
 Modal 8-bit floating-point matrix multiply-accumulate to half-precision.
@@ -9955,6 +10018,23 @@ Lookup table read with 4-bit indices.
   svint16_t svluti4_lane[_s16_x2](svint16x2_t table, svuint8_t indices, uint64_t imm_idx);
 ```
 
+### SVE2.3 lookup table
+
+The intrinsics in this section are defined by the header file
+[`<arm_sve.h>`](#arm_sve.h) when `__ARM_FEATURE_SVE2p3` is defined to 1.
+
+#### LUTI6
+
+Lookup table read with 6-bit indices (8-bit).
+
+Use of this intrinsic if `svcntb() * 8 < 256` results in undefined behaviour.
+
+```c
+  // Variant is  also available for: _u8 _mf8
+  svint8_t svluti6[_s8](svint8x2_t table, svuint8_t indices);
+```
+
+
 ### SVE2 Multi-vector AES and 128-bit polynomial multiply long instructions
 
 The specification for SVE2 Multi-vector AES and 128-bit polynomial multiply long instructions is in
@@ -13548,6 +13628,7 @@ Multi-vector saturating rounding shift right narrow and interleave
 
 ``` c
   // Variants are also available for _u16[_u32_x2]
+  // and _s8[_s16_x2] _u8[_u16_x2] if __ARM_FEATURE_SVE2p3 || __ARM_FEATURE_SME2p3
   svint16_t svqrshrn[_n]_s16[_s32_x2](svint32x2_t zn, uint64_t imm);
   ```
 
@@ -13556,6 +13637,7 @@ Multi-vector saturating rounding shift right narrow and interleave
 Multi-vector saturating rounding shift right unsigned narrow and interleave
 
 ``` c
+  // Variant for _u8[_u16_x2] is available if __ARM_FEATURE_SVE2p3 || __ARM_FEATURE_SME2p3
   svuint16_t svqrshrun[_n]_u16[_s32_x2](svint32x2_t zn, uint64_t imm);
   ```
 
@@ -13857,6 +13939,128 @@ Scalar index of first/last true predicate element (predicated).
 
   ```
 
+### SVE2.3 and SME2.3 instruction intrinsics
+
+The specification for SVE2.3 and SME2.3 are in
+[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
+extended in the future.
+
+The functions in this section are defined by either the header file
+ [`<arm_sve.h>`](#arm_sve.h) or [`<arm_sme.h>`](#arm_sme.h)
+when `__ARM_FEATURE_SVE2p3` or `__ARM_FEATURE_SME2p3` is defined, respectively.
+
+#### ADDQP
+
+Add pairwise within quadword vector segments.
+
+``` c
+  // Variants are also available for _s16, _s32 and _s64
+  svint8_t svaddqp[_s8](svint8_t zn, svint8_t zm);
+  ```
+
+#### ADDSUBP
+
+Add subtract pairwise.
+
+``` c
+  // Variants are also available for _s16, _s32 and _s64
+  svint8_t svaddsubp[_s8](svint8_t zn, svint8_t zm);
+  ```
+
+#### LUTI6
+
+Lookup table read with 6-bit indices (16-bit).
+
+Use of this intrinsic if `svcntb() * 8 < 512` results in undefined behaviour.
+
+``` c
+  // Variants are also available for _u16_x2 _f16_x2
+  svint16_t svluti6_lane[_s16_x2](svint16x2_t table, svuint8_t indices, uint64_t imm_idx);
+  ```
+
+#### FCVTZSN, FCVTZUN
+
+Floating-point narrowing convert to interleaved integer, rounding toward zero
+
+``` c
+  // Variants are also available for
+  //              s16[_f32_x2], s32[_f64_x2],
+  // u8[_f16_x2], u16[_f32_x2], u32[_f64_x2]
+  svint8_t svcvt_s8[_f16_x2](svfloat16x2_t zn);
+```
+
+#### SABAL, UABAL
+
+Two-way absolute difference sum and accumulate long.
+
+``` c
+  // Variants are also available for
+  //           s32[_s16], s64[_s32],
+  // u16[_u8], u32[_u16], u64[_u32]
+  svint16_t svaba_s16[_s8](svint16_t zda, svint8_t zn, svint8_t zm);
+```
+
+#### SCVTF, SCVTFLT, UCVTF, UCVTFLT
+
+Integer convert to floating-point (top and bottom).
+
+``` c
+  // Variants are also available for
+  //           f32[_s16], f64[_s32],
+  // f16[_u8], f32[_u16], f64[_u32]
+  
+  svfloat16_t svcvtt_f16[_s8](svint8_t zn);
+  svfloat16_t svcvtb_f16[_s8](svint8_t zn);
+```
+
+#### SDOT, UDOT (vectors)
+
+Integer dot-product (2-way, vectors).
+
+``` c
+  // Variants are also available for _u16_u8
+  svint16_t svdot[_s16_s8](svint16_t zda, svint8_t zn, svint8_t zm);
+  ```
+
+#### SDOT, UDOT (indexed)
+
+Integer dot product by indexed element (two-way).
+
+``` c
+  // Variants are also available for _u16_u8
+  svint16_t svdot_lane[_s16_s8](svint16_t zda, svint8_t zn, svint8_t zm,
+                                uint64_t imm_idx);
+  ```
+
+#### SQSHRN, UQSHRN
+
+Multi-vector saturating shift right narrow and interleave.
+
+``` c
+  // Variants are also available for 
+  //              _u16[_u32_x2]
+  // _s8[_s16_x2]  _u8[_u16_x2]
+  svint16_t svqshrn[_n]_s16[_s32_x2](svint32x2_t zn, uint64_t imm);
+  ```
+
+#### SQSHRUN
+
+Signed saturating shift right narrow by immediate to interleaved unsigned integer.
+
+``` c
+  // Variant for _u8[_s16_x2] is also available.
+  svuint16_t svqshrun[_n]_u16[_s32_x2](svint32x2_t zn, uint64_t imm);
+  ```
+
+#### SUBP
+
+Subtract pairwise.
+
+``` c
+  // Variants are also available for _s16, _s32 and _s64
+  svint8_t svsubp[_s8](svbool_t pg, svint8_t zn, svint8_t zm);
+  ```
+
 ### SME2 maximum and minimum absolute value
 
 The intrinsics in this section are defined by the header file
@@ -14581,6 +14785,36 @@ non-overloaded names to indicate which vector argument is a vector register pair
     __arm_streaming __arm_inout("za");
 ```
 
+### SME2.3 lookup table
+
+The intrinsics in this section are defined by the header file
+[`<arm_sme.h>`](#arm_sme.h) when `__ARM_FEATURE_SME2p3` is defined to 1.
+
+#### LUTI6
+
+Lookup table read with 6-bit indices (16-bit)
+
+Use of this intrinsic if `svcntb() * 8 < 512` results in undefined behaviour.
+
+```c
+  // Variant are also available for: _u16, _f16 and _bf16
+  svint16x4_t svluti6_lane_s16_x4[_s16_x2](svint16x2_t table, svuint8x2_t indices, uint64_t imm_idx);
+```
+
+Lookup table read with 6-bit indices (four registers, 8-bit)
+
+``` c
+  // Variants are also available for: _u8 _mf8
+  svint8x4_t svluti6_zt_s8_x4(uint64_t zt0, svuint8x3_t zn) __arm_streaming __arm_in("zt0");
+```
+
+Lookup table read with 6-bit indices (table, single, 8-bit)
+
+``` c
+  // Variants are also available for: _u8 _mf8
+  svint8_t svluti6_zt_s8(uint64_t zt0, svuint8_t zn) __arm_streaming __arm_in("zt0");
+```
+
 # M-profile Vector Extension (MVE) intrinsics
 
 The M-profile Vector Extension (MVE) [[MVE-spec]](#MVE-spec) instructions provide packed Single