JIT: Improve codegen for Vector128/256.NarrowWithSaturation #126226

Open
saucecontrol wants to merge 6 commits into dotnet:main from saucecontrol:vectorconvert

@saucecontrol commented Mar 27, 2026

Resolves #116526

This adds some missing optimized paths for NarrowWithSaturation intrinsics in pre-AVX-512 environments.

Vector128.NarrowWithSaturation was fully accelerated for signed types but not unsigned:

static Vector128<byte> NarrowSaturate(Vector128<ushort> x, Vector128<ushort> y)
	=> Vector128.NarrowWithSaturation(x, y);
        vbroadcastss xmm0, dword ptr [reloc @RWD00]
        vpminuw  xmm1, xmm0, xmmword ptr [rdx]
-       vpand    xmm1, xmm1, xmm0
-       vpminuw  xmm2, xmm0, xmmword ptr [r8]
-       vpand    xmm0, xmm2, xmm0
+       vpminuw  xmm0, xmm0, xmmword ptr [r8]
        vpackuswb xmm0, xmm1, xmm0
        vmovups  xmmword ptr [rcx], xmm0
        mov      rax, rcx
        ret      

 RWD00  	dd	00FF00FFh		; 2.34184e-38
 
-; Total bytes of code 39
+; Total bytes of code 31

Vector256.NarrowWithSaturation was using the slow path for both signed and unsigned:

static Vector256<sbyte> NarrowSaturate(Vector256<short> x, Vector256<short> y)
	=> Vector256.NarrowWithSaturation(x, y);
-       vbroadcastss ymm0, dword ptr [reloc @RWD00]
-       vpmaxsw  ymm1, ymm0, ymmword ptr [rdx]
-       vbroadcastss ymm2, dword ptr [reloc @RWD04]
-       vpminsw  ymm1, ymm1, ymm2
-       vbroadcastss ymm3, dword ptr [reloc @RWD08]
-       vpand    ymm1, ymm1, ymm3
-       vpmaxsw  ymm0, ymm0, ymmword ptr [r8]
-       vpminsw  ymm0, ymm0, ymm2
-       vpand    ymm0, ymm0, ymm3
-       vpackuswb ymm0, ymm1, ymm0
+       vmovups  ymm0, ymmword ptr [rdx]
+       vpacksswb ymm0, ymm0, ymmword ptr [r8]
        vpermq   ymm0, ymm0, -40
        vmovups  ymmword ptr [rcx], ymm0
        mov      rax, rcx
        vzeroupper 
        ret      

-RWD00  	dd	FF80FF80h		;      -nan
-RWD04  	dd	007F007Fh		; 1.16633e-38
-RWD08  	dd	00FF00FFh		; 2.34184e-38
 
-; Total bytes of code 73
+; Total bytes of code 26
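
The trailing vpermq remains in both versions because the 256-bit pack instructions operate on each 128-bit lane independently, so the packed halves of x and y come out interleaved. The immediate is printed as -40, which is 0xD8 as an unsigned byte and selects qwords in the order 0, 2, 1, 3. Below is a scalar C++ sketch of that immediate decoding; the type alias, function name, and tag values are assumptions chosen for illustration.

```cpp
#include <array>
#include <cstdint>

using Qwords = std::array<std::uint64_t, 4>;

// Scalar model of VPERMQ ymm, ymm, imm8: output qword i is the source qword
// selected by the two immediate bits at positions [2i+1 : 2i].
constexpr Qwords permute4x64(const Qwords& src, std::uint8_t imm8)
{
    return Qwords{src[(imm8 >> 0) & 3], src[(imm8 >> 2) & 3],
                  src[(imm8 >> 4) & 3], src[(imm8 >> 6) & 3]};
}

// Tags standing in for the four 64-bit chunks of packed output
// (illustrative values only).
constexpr std::uint64_t kXLo = 0xA0; // packed lower 128 bits of x
constexpr std::uint64_t kXHi = 0xA1; // packed upper 128 bits of x
constexpr std::uint64_t kYLo = 0xB0; // packed lower 128 bits of y
constexpr std::uint64_t kYHi = 0xB1; // packed upper 128 bits of y

// A 256-bit pack of (x, y) produces, per 128-bit lane, x's half then y's half.
constexpr Qwords kPackedPerLane{kXLo, kYLo, kXHi, kYHi};

// -40 reinterpreted as an unsigned byte is 0xD8 (binary 11 01 10 00).
constexpr Qwords kFixed = permute4x64(kPackedPerLane, static_cast<std::uint8_t>(-40));

// After the permute, all of x precedes all of y.
static_assert(kFixed[0] == kXLo && kFixed[1] == kXHi, "x halves come first");
static_assert(kFixed[2] == kYLo && kFixed[3] == kYHi, "y halves come last");
```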

The bulk of the code changes here come from refactoring the intrinsic lookup for the various vector convert ops. I'm filling in some of the optimization gaps in these intrinsics, and the logic required to select the right intrinsic is currently scattered across several places. Moving it into a shared method will make it easier to complete the other changes I have planned.

The refactor is isolated in the first commit, which is zero-diff. The second commit onward contains the codegen improvements.
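
As a rough picture of what a shared lookup buys, here is a toy C++ sketch that maps a source element type to the x86 pack instruction for a saturating narrow, plus a flag for the unsigned clamp shown in the Vector128 diff. The enums, struct, and function are hypothetical and are not the JIT's real types or the helper added in this PR. ISA availability checks are deliberately omitted, and treating 32-bit unsigned sources the same way as 16-bit ones is an assumption.

```cpp
// Hypothetical element types, for illustration only.
enum class Elem { I16, U16, I32, U32, I64, U64 };

// x86 pack instructions that narrow with saturation.
enum class PackOp { None, PACKSSWB, PACKUSWB, PACKSSDW, PACKUSDW };

// Hypothetical result of the lookup.
struct NarrowPlan
{
    PackOp op;                 // pack instruction to emit, or None
    bool   clampUnsignedFirst; // unsigned min against the target max must precede the pack
};

// One place that decides how a same-signedness saturating narrow is done.
// Pack instructions read their source as signed, so signed sources need no
// extra work, while unsigned sources need a clamp before the pack.
constexpr NarrowPlan lookupSaturatingNarrow(Elem source)
{
    switch (source)
    {
        case Elem::I16: return {PackOp::PACKSSWB, false}; // short  -> sbyte
        case Elem::U16: return {PackOp::PACKUSWB, true};  // ushort -> byte
        case Elem::I32: return {PackOp::PACKSSDW, false}; // int    -> short
        case Elem::U32: return {PackOp::PACKUSDW, true};  // uint   -> ushort
        default:        return {PackOp::None, false};     // no pack for 64-bit sources
    }
}

static_assert(lookupSaturatingNarrow(Elem::I16).op == PackOp::PACKSSWB, "signed word pack");
static_assert(lookupSaturatingNarrow(Elem::U16).clampUnsignedFirst, "unsigned needs clamp");
```

Callers that previously carried their own switch would ask this one function instead, which is the kind of consolidation the refactor aims for.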

Full diffs

Copilot AI review requested due to automatic review settings March 27, 2026 21:04
github-actions bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Mar 27, 2026
dotnet-policy-service bot added the community-contribution label (Indicates that the PR has been added by a community member) Mar 27, 2026
@dotnet-policy-service

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copilot AI left a comment

Pull request overview

This PR refactors x86/x64 SIMD vector conversion intrinsic selection into a shared helper and adds missing fast paths for Vector128/256.NarrowWithSaturation in non-AVX512 environments, reducing instruction count and code size for several narrow-with-saturation cases.

Changes:

  • Introduce GenTreeHWIntrinsic::GetHWIntrinsicIdForVectorConvert(...) to centralize lookup of conversion-related intrinsics (including optional saturating preference).
  • Improve Vector128/256.NarrowWithSaturation codegen on pre-AVX512 machines by using pack-based sequences where applicable.
  • Refactor existing conversion/widen/narrow construction to use the shared lookup helper instead of duplicated switch logic.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/coreclr/jit/hwintrinsicxarch.cpp | Uses the new conversion lookup helper and adds optimized pack-based paths for NarrowWithSaturation on non-AVX512. |
| src/coreclr/jit/gentree.h | Declares the new shared vector-convert intrinsic lookup helper. |
| src/coreclr/jit/gentree.cpp | Implements the helper and refactors several SIMD convert/narrow/widen paths to use it. |

Copilot AI review requested due to automatic review settings March 28, 2026 01:41
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings March 28, 2026 04:12
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Development

Successfully merging this pull request may close these issues.

Suboptimal codgen for Vector128.NarrowWithSaturation
