Skip to content

Commit 5b93d9b

Browse files
committed
Add alpha support for 9.7 data processing intrinsics
This change adds intrinsics for the following architectural features: - FEAT_F16F32DOT - FEAT_F16F32MM - FEAT_F16MM - FEAT_SVE_B16MM - FEAT_SVE2p3 - FEAT_SME2p3
1 parent 2958acd commit 5b93d9b

5 files changed

Lines changed: 298 additions & 4 deletions

File tree

main/acle.md

Lines changed: 235 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ toc: true
1111
---
1212

1313
<!--
14-
SPDX-FileCopyrightText: Copyright 2011-2025 Arm Limited and/or its affiliates <open-source-office@arm.com>
14+
SPDX-FileCopyrightText: Copyright 2011-2026 Arm Limited and/or its affiliates <open-source-office@arm.com>
1515
SPDX-FileCopyrightText: Copyright 2022 Google LLC.
1616
CC-BY-SA-4.0 AND Apache-Patent-License
1717
See LICENSE.md file for details
@@ -487,6 +487,9 @@ Armv8.4-A [[ARMARMv84]](#ARMARMv84). Support is added for the Dot Product intrin
487487
* Removed all references to Transactional Memory Extension (TME).
488488
* Added [**Alpha**](#current-status-and-anticipated-changes) support
489489
for Brain 16-bit floating-point vector multiplication intrinsics.
490+
* Added [**Alpha**](#current-status-and-anticipated-changes)
491+
support for SVE2.3 (FEAT_SVE2p3), SME2.3 (FEAT_SME2p3), FEAT_F16F32DOT,
492+
FEAT_F16F32MM, FEAT_F16MM and FEAT_SVE_B16MM intrinsics.
490493

491494
### References
492495

@@ -2003,6 +2006,10 @@ are available. This implies that `__ARM_FEATURE_SVE` is nonzero.
20032006
are available and if the associated [ACLE features]
20042007
(#sme-language-extensions-and-intrinsics) are supported.
20052008

2009+
`__ARM_FEATURE_SVE2p3` is defined to 1 if the FEAT_SVE2p3 instructions
2010+
are available and if the associated [ACLE features]
2011+
(#sme-language-extensions-and-intrinsics) are supported.
2012+
20062013
#### NEON-SVE Bridge macro
20072014

20082015
`__ARM_NEON_SVE_BRIDGE` is defined to 1 if the [`<arm_neon_sve_bridge.h>`](#arm_neon_sve_bridge.h)
@@ -2026,6 +2033,7 @@ of SME has an associated preprocessor macro, given in the table below:
20262033
| FEAT_SME2 | __ARM_FEATURE_SME2 |
20272034
| FEAT_SME2p1 | __ARM_FEATURE_SME2p1 |
20282035
| FEAT_SME2p2 | __ARM_FEATURE_SME2p2 |
2036+
| FEAT_SME2p3 | __ARM_FEATURE_SME2p3 |
20292037

20302038
Each macro is defined if there is hardware support for the associated
20312039
architecture feature and if all of the [ACLE
@@ -2162,6 +2170,20 @@ See [Half-precision brain
21622170
floating-point](#half-precision-brain-floating-point) for details
21632171
of half-precision brain floating-point types.
21642172

2173+
#### Brain 16-bit floating-point matrix multiplication support
2174+
2175+
This section is in
2176+
[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
2177+
extended in the future.
2178+
2179+
`__ARM_FEATURE_SVE_B16MM` is defined to `1` if there is hardware
2180+
support for the SVE BF16 matrix multiply extension and if the
2181+
associated ACLE intrinsics are available.
2182+
2183+
See [Half-precision brain
2184+
floating-point](#half-precision-brain-floating-point) for details
2185+
of half-precision brain floating-point types.
2186+
21652187
### Cryptographic extensions
21662188

21672189
#### “Crypto” extension
@@ -2380,6 +2402,18 @@ this implies:
23802402

23812403
* `__ARM_NEON == 1`
23822404

2405+
#### Half-precision to single-precision dot product extension
2406+
2407+
This section is in
2408+
[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
2409+
extended in the future.
2410+
2411+
`__ARM_FEATURE_F16F32DOT` is defined if the half-precision dot product
2412+
accumulating to single-precision instructions are supported and the vector
2413+
intrinsics are available. Note that this implies:
2414+
2415+
* `__ARM_NEON == 1`
2416+
23832417
#### Complex number intrinsics
23842418

23852419
`__ARM_FEATURE_COMPLEX` is defined if the complex addition and complex
@@ -2425,6 +2459,21 @@ instructions and if the associated ACLE intrinsics are available.
24252459
for the SVE 16-bit floating-point to 32-bit floating-point matrix multiply and add
24262460
(FEAT_SVE_F16F32MM) instructions and if the associated ACLE intrinsics are available.
24272461

2462+
##### Multiplication of 16-bit floating-point matrices (AdvSIMD)
2463+
2464+
This section is in
2465+
[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
2466+
extended in the future.
2467+
2468+
`__ARM_FEATURE_F16F32MM` is defined if the NEON half-precision matrix multiply
2469+
accumulating to single-precision instruction is supported. Note that this implies:
2470+
2471+
* `__ARM_NEON == 1`
2472+
2473+
`__ARM_FEATURE_F16MM` is defined if the NEON non-widening half-precision matrix multiply instruction instruction is supported. Note that this implies:
2474+
2475+
* `__ARM_NEON == 1`
2476+
24282477
##### Multiplication of 32-bit floating-point matrices
24292478

24302479
`__ARM_FEATURE_SVE_MATMUL_FP32` is defined to `1` if there is hardware support
@@ -2662,6 +2711,9 @@ be found in [[BA]](#BA).
26622711
| [`__ARM_FEATURE_DIRECTED_ROUNDING`](#directed-rounding) | Directed Rounding | 1 |
26632712
| [`__ARM_FEATURE_DOTPROD`](#availability-of-dot-product-intrinsics) | Dot product extension (ARM v8.2-A) | 1 |
26642713
| [`__ARM_FEATURE_DSP`](#dsp-instructions) | DSP instructions (Arm v5E) (32-bit-only) | 1 |
2714+
| [`__ARM_FEATURE_F16F32DOT`](#half-precision-to-single-precision-dot-product-extension) | Half-precision to single-precision dot product extension (FEAT_F16F32DOT) | 1 |
2715+
| [`__ARM_FEATURE_F16F32MM`](#multiplication-of-16-bit-floating-point-matrices-advsimd) | Half-precision to single-precision matrix multiply accumulating extension (FEAT_F16F32MM). | 1 |
2716+
| [`__ARM_FEATURE_F16MM`](#multiplication-of-16-bit-floating-point-matrices-advsimd) | Half-precision matrix multiply accumulating extension (FEAT_F16MM) | 1 |
26652717
| [`__ARM_FEATURE_FAMINMAX`](#floating-point-absolute-minimum-and-maximum-extension) | Floating-point absolute minimum and maximum extension | 1 |
26662718
| [`__ARM_FEATURE_FMA`](#fused-multiply-accumulate-fma) | Floating-point fused multiply-accumulate | 1 |
26672719
| [`__ARM_FEATURE_FP16_FML`](#fp16-fml-extension) | FP16 FML extension (Arm v8.4-A, optional Armv8.2-A, Armv8.3-A) | 1 |
@@ -2714,6 +2766,7 @@ be found in [[BA]](#BA).
27142766
| [`__ARM_FEATURE_SSVE_FP8FMA`](#modal-8-bit-floating-point-extensions) | Modal 8-bit floating-point extensions | 1 |
27152767
| [`__ARM_FEATURE_SVE`](#scalable-vector-extension-sve) | Scalable Vector Extension (FEAT_SVE) | 1 |
27162768
| [`__ARM_FEATURE_SVE_B16B16`](#non-widening-brain-16-bit-floating-point-support) | Non-widening brain 16-bit floating-point intrinsics (FEAT_SVE_B16B16) | 1 |
2769+
| [`__ARM_FEATURE_SVE_B16MM`](#brain-16-bit-floating-point-matrix-multiplication-support) | SVE brain 16-bit floating-point matrix multiply extension (FEAT_SVE_B16MM) | 1 |
27172770
| [`__ARM_FEATURE_SVE_BF16`](#brain-16-bit-floating-point-support) | SVE support for the 16-bit brain floating-point extension (FEAT_BF16) | 1 |
27182771
| [`__ARM_FEATURE_SVE_BFSCALE`](#brain-16-bit-floating-point-vector-multiplication-support) | SVE support for the 16-bit brain floating-point vector multiplication extension (FEAT_SVE_BFSCALE) | 1 |
27192772
| [`__ARM_FEATURE_SVE_BITS`](#scalable-vector-extension-sve) | The number of bits in an SVE vector, when known in advance | 256 |
@@ -2737,6 +2790,7 @@ be found in [[BA]](#BA).
27372790
| [`__ARM_FEATURE_SVE2_SM4`](#sm4-extension) | SVE2 support for the SM4 cryptographic extension (FEAT_SVE_SM4) | 1 |
27382791
| [`__ARM_FEATURE_SVE2p1`](#sve2) | SVE version 2.1 (FEAT_SVE2p1)
27392792
| [`__ARM_FEATURE_SVE2p2`](#sve2) | SVE version 2.2 (FEAT_SVE2p2)
2793+
| [`__ARM_FEATURE_SVE2p3`](#sve2) | SVE version 2.3 (FEAT_SVE2p3)
27402794
| [`__ARM_FEATURE_SYSREG128`](#bit-system-registers) | Support for 128-bit system registers (FEAT_SYSREG128) | 1 |
27412795
| [`__ARM_FEATURE_UNALIGNED`](#unaligned-access-supported-in-hardware) | Hardware support for unaligned access | 1 |
27422796
| [`__ARM_FP`](#hardware-floating-point) | Hardware floating-point | 1 |
@@ -9594,6 +9648,15 @@ BFloat16 floating-point adjust exponent vectors.
95949648

95959649
### SVE2 floating-point matrix multiply-accumulate instructions.
95969650

9651+
#### BFMMLA, FMMLA(non-widening)
9652+
9653+
16-bit floating-point matrix multiply-accumulate.
9654+
```c
9655+
// Only if __ARM_FEATURE_SVE_B16MM
9656+
// Variant also available for _f16 if (__ARM_FEATURE_SVE2p2 && __ARM_FEATURE_F16MM)
9657+
svbfloat16_t svmmla[_bf16](svbfloat16_t zda, svbfloat16_t zn, svbfloat16_t zm);
9658+
```
9659+
95979660
#### FMMLA (widening, FP8 to FP16)
95989661

95999662
Modal 8-bit floating-point matrix multiply-accumulate to half-precision.
@@ -9955,6 +10018,23 @@ Lookup table read with 4-bit indices.
995510018
svint16_t svluti4_lane[_s16_x2](svint16x2_t table, svuint8_t indices, uint64_t imm_idx);
995610019
```
995710020

10021+
### SVE2.3 lookup table
10022+
10023+
The intrinsics in this section are defined by the header file
10024+
[`<arm_sve.h>`](#arm_sve.h) when `__ARM_FEATURE_SVE2p3` is defined to 1.
10025+
10026+
#### LUTI6
10027+
10028+
Lookup table read with 6-bit indices (8-bit).
10029+
10030+
Use of this intrinsic if `svcntb() * 8 < 256` results in undefined behaviour.
10031+
10032+
```c
10033+
// Variant is also available for: _u8 _mf8
10034+
svint8_t svluti6[_s8](svint8x2_t table, svuint8_t indices);
10035+
```
10036+
10037+
995810038
### SVE2 Multi-vector AES and 128-bit polynomial multiply long instructions
995910039

996010040
The specification for SVE2 Multi-vector AES and 128-bit polynomial multiply long instructions is in
@@ -13548,6 +13628,7 @@ Multi-vector saturating rounding shift right narrow and interleave
1354813628

1354913629
``` c
1355013630
// Variants are also available for _u16[_u32_x2]
13631+
// and _s8[_s16_x2] _u8[_u16_x2] if __ARM_FEATURE_SVE2p3 || __ARM_FEATURE_SME2p3
1355113632
svint16_t svqrshrn[_n]_s16[_s32_x2](svint32x2_t zn, uint64_t imm);
1355213633
```
1355313634

@@ -13556,6 +13637,7 @@ Multi-vector saturating rounding shift right narrow and interleave
1355613637
Multi-vector saturating rounding shift right unsigned narrow and interleave
1355713638

1355813639
``` c
13640+
// Variant for _u8[_u16_x2] is available if __ARM_FEATURE_SVE2p3 || __ARM_FEATURE_SME2p3
1355913641
svuint16_t svqrshrun[_n]_u16[_s32_x2](svint32x2_t zn, uint64_t imm);
1356013642
```
1356113643

@@ -13857,6 +13939,128 @@ Scalar index of first/last true predicate element (predicated).
1385713939

1385813940
```
1385913941

13942+
### SVE2.3 and SME2.3 instruction intrinsics
13943+
13944+
The specification for SVE2.3 and SME2.3 are in
13945+
[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
13946+
extended in the future.
13947+
13948+
The functions in this section are defined by either the header file
13949+
[`<arm_sve.h>`](#arm_sve.h) or [`<arm_sme.h>`](#arm_sme.h)
13950+
when `__ARM_FEATURE_SVE2p3` or `__ARM_FEATURE_SME2p3` is defined, respectively.
13951+
13952+
#### ADDQP
13953+
13954+
Add pairwise within quadword vector segments.
13955+
13956+
``` c
13957+
// Variants are also available for _s16, _s32 and _s64
13958+
svint8_t svaddqp[_s8](svint8_t zn, svint8_t zm);
13959+
```
13960+
13961+
#### ADDSUBP
13962+
13963+
Add subtract pairwise.
13964+
13965+
``` c
13966+
// Variants are also available for _s16, _s32 and _s64
13967+
svint8_t svaddsubp[_s8](svint8_t zn, svint8_t zm);
13968+
```
13969+
13970+
#### LUTI6
13971+
13972+
Lookup table read with 6-bit indices (16-bit).
13973+
13974+
Use of this intrinsic if `svcntb() * 8 < 512` results in undefined behaviour.
13975+
13976+
``` c
13977+
// Variants are also available for _u16_x2 _f16_x2
13978+
svint16_t svluti6_lane[_s16_x2](svint16x2_t table, svuint8_t indices, uint64_t imm_idx);
13979+
```
13980+
13981+
#### FCVTZSN, FCVTZUN
13982+
13983+
Floating-point narrowing convert to interleaved integer, rounding toward zero
13984+
13985+
``` c
13986+
// Variants are also available for
13987+
// s16[_f32_x2], s32[_f64_x2],
13988+
// u8[_f16_x2], u16[_f32_x2], u32[_f64_x2]
13989+
svint8_t svcvt_s8[_f16_x2](svfloat16x2_t zn);
13990+
```
13991+
13992+
#### SABAL, UABAL
13993+
13994+
Two-way absolute difference sum and accumulate long.
13995+
13996+
``` c
13997+
// Variants are also available for
13998+
// s32[_s16], s64[_s32],
13999+
// u16[_u8], u32[_u16], u64[_u32]
14000+
svint16_t svaba_s16[_s8](svint16_t zda, svint8_t zn, svint8_t zm);
14001+
```
14002+
14003+
#### SCVTF, SCVTFLT, UCVTF, UCVTFLT
14004+
14005+
Integer convert to floating-point (top and bottom).
14006+
14007+
``` c
14008+
// Variants are also available for
14009+
// f32[_s16], f64[_s32],
14010+
// f16[_u8], f32[_u16], f64[_u32]
14011+
14012+
svfloat16_t svcvtt_f16[_s8](svint8_t zn);
14013+
svfloat16_t svcvtb_f16[_s8](svint8_t zn);
14014+
```
14015+
14016+
#### SDOT, UDOT (vectors)
14017+
14018+
Integer dot-product (2-way, vectors).
14019+
14020+
``` c
14021+
// Variants are also available for _u16_u8
14022+
svint16_t svdot[_s16_s8](svint16_t zda, svint8_t zn, svint8_t zm);
14023+
```
14024+
14025+
#### SDOT, UDOT (indexed)
14026+
14027+
Integer dot product by indexed element (two-way).
14028+
14029+
``` c
14030+
// Variants are also available for _u16_u8
14031+
svint16_t svdot_lane[_s16_s8](svint16_t zda, svint8_t zn, svint8_t zm,
14032+
uint64_t imm_idx);
14033+
```
14034+
14035+
#### SQSHRN, UQSHRN
14036+
14037+
Multi-vector saturating shift right narrow and interleave.
14038+
14039+
``` c
14040+
// Variants are also available for
14041+
// _u16[_u32_x2]
14042+
// _s8[_s16_x2] _u8[_u16_x2]
14043+
svint16_t svqshrn[_n]_s16[_s32_x2](svint32x2_t zn, uint64_t imm);
14044+
```
14045+
14046+
#### SQSHRUN
14047+
14048+
Signed saturating shift right narrow by immediate to interleaved unsigned integer.
14049+
14050+
``` c
14051+
// Variant for _u8[_s16_x2] is also available.
14052+
svuint16_t svqshrun[_n]_u16[_s32_x2](svint32x2_t zn, uint64_t imm);
14053+
```
14054+
14055+
#### SUBP
14056+
14057+
Subtract pairwise.
14058+
14059+
``` c
14060+
// Variants are also available for _s16, _s32 and _s64
14061+
svint8_t svsubp[_s8](svbool_t pg, svint8_t zn, svint8_t zm);
14062+
```
14063+
1386014064
### SME2 maximum and minimum absolute value
1386114065

1386214066
The intrinsics in this section are defined by the header file
@@ -14581,6 +14785,36 @@ non-overloaded names to indicate which vector argument is a vector register pair
1458114785
__arm_streaming __arm_inout("za");
1458214786
```
1458314787

14788+
### SME2.3 lookup table
14789+
14790+
The intrinsics in this section are defined by the header file
14791+
[`<arm_sme.h>`](#arm_sme.h) when `__ARM_FEATURE_SME2p3` is defined to 1.
14792+
14793+
#### LUTI6
14794+
14795+
Lookup table read with 6-bit indices (16-bit)
14796+
14797+
Use of this intrinsic if `svcntb() * 8 < 512` results in undefined behaviour.
14798+
14799+
```c
14800+
// Variant are also available for: _u16, _f16 and _bf16
14801+
svint16x4_t svluti6_lane_s16_x4[_s16_x2](svint16x2_t table, svuint8x2_t indices, uint64_t imm_idx);
14802+
```
14803+
14804+
Lookup table read with 6-bit indices (four registers, 8-bit)
14805+
14806+
``` c
14807+
// Variants are also available for: _u8 _mf8
14808+
svint8x4_t svluti6_zt_s8_x4(uint64_t zt0, svuint8x3_t zn) __arm_streaming __arm_in("zt0");
14809+
```
14810+
14811+
Lookup table read with 6-bit indices (table, single, 8-bit)
14812+
14813+
``` c
14814+
// Variants are also available for: _u8 _mf8
14815+
svint8_t svluti6_zt_s8(uint64_t zt0, svuint8_t zn) __arm_streaming __arm_in("zt0");
14816+
```
14817+
1458414818
# M-profile Vector Extension (MVE) intrinsics
1458514819

1458614820
The M-profile Vector Extension (MVE) [[MVE-spec]](#MVE-spec) instructions provide packed Single

0 commit comments

Comments
 (0)