JIT: enable shared-constant CSE on x64 (#92170) #129941
Buckets pointer-class constants by their upper bits so each shared use becomes a `lea reg, [base+offset]` instead of a full `mov reg, imm64`. On x64 the bucket width is 256 and the def value is centered to maximize use of the `lea` disp8 encoding.

Also extends CSE and hoist eligibility to integral constants that don't fit as imm32 or require relocation, with a per-method use-count gate so single-occurrence constants aren't speculatively hoisted.

About -1.18 MB code size across the standard x64 SPMI collections; arm64 also improves and x86 is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
This PR updates the CoreCLR JIT’s constant CSE/hoisting behavior, primarily on x64, to reduce code size by sharing nearby pointer-class constants (so uses can be materialized via base+offset) and by broadening CSE/hoist eligibility to certain “expensive” integral constants.
Changes:
- Adjusts x64 shared-constant bucketing to use an 8-bit low-bits cut (256-wide buckets) to better enable compact addressing forms.
- Adds a per-method VN-based occurrence tally to gate hoisting of plain integral constants (avoid hoisting single-occurrence constants that won’t be eliminated).
- Expands constant CSE/hoist consideration to integral constants that don’t fit in imm32 or require relocation, and tweaks the CSE heuristic cost model for low-use-count constants on xarch.
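
The bucketing in the first bullet can be sketched roughly as follows. This is an illustration only, not the code in `optcse.cpp`: the names `kSharedLowBits`, `BucketKey`, `CenteredBase`, `OffsetFromBase` and `FitsInDisp8` are hypothetical, and it assumes the def value is centered at the bucket midpoint, which may differ from how the PR actually picks it.

```cpp
#include <cstdint>

// Hypothetical stand-in for the x64 low-bits cut (256-wide buckets).
constexpr unsigned kSharedLowBits = 8;

// Constants with the same upper bits land in the same bucket.
constexpr uint64_t BucketKey(uint64_t value)
{
    return value >> kSharedLowBits;
}

// Assumed centering: place the shared def at the bucket midpoint so every
// member of the bucket is within a signed 8-bit displacement of it.
constexpr uint64_t CenteredBase(uint64_t value)
{
    return (BucketKey(value) << kSharedLowBits) + (1ull << (kSharedLowBits - 1));
}

// Displacement a use would need: lea reg, [base + offset].
constexpr int64_t OffsetFromBase(uint64_t value)
{
    constexpr uint64_t mask = (1ull << kSharedLowBits) - 1;
    constexpr int64_t  half = int64_t{1} << (kSharedLowBits - 1);
    return static_cast<int64_t>(value & mask) - half;
}

// disp8 is a sign-extended 8-bit displacement.
constexpr bool FitsInDisp8(int64_t offset)
{
    return offset >= -128 && offset <= 127;
}

// With the midpoint as base, both ends of the bucket still fit in disp8.
static_assert(OffsetFromBase(0x7FF812345600ull) == -128);
static_assert(OffsetFromBase(0x7FF8123456FFull) == 127);
```

Without centering (base at the bottom of the bucket), offsets would range over 0..255 and the upper half would need a disp32 encoding.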
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/jit/target.h | Changes x64 shared-constant bucketing width by reducing CSE_CONST_SHARED_LOW_BITS. |
| src/coreclr/jit/optimizer.cpp | Adds method-wide constant occurrence counting and uses it to gate loop-hoisting of plain integral constants. |
| src/coreclr/jit/optcse.cpp | Extends constant eligibility rules, adjusts heuristic costs, and changes shared-constant def-value selection/centering logic. |
| src/coreclr/jit/jitconfigvalues.h | Renames/retargets JitConstCSE option constants and updates the associated comment text. |
| src/coreclr/jit/compiler.h | Introduces VNToCountMap and stores it in LoopHoistContext; also contains shared-constant key encoding helpers. |
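
For readers skimming the table, the occurrence gate described for `optimizer.cpp` can be pictured roughly like this. It is a simplified sketch under stated assumptions: the real change keys its tally by value number (`VNToCountMap`), whereas this example keys by the constant itself; relocation is ignored; and the threshold of two uses is inferred from "single-occurrence constants aren't hoisted". `FitsInImm32`, `OccurrenceMap`, `CountOccurrences` and `IsHoistCandidate` are hypothetical names.

```cpp
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <vector>

// True when the value can be encoded as a sign-extended 32-bit immediate.
inline bool FitsInImm32(int64_t value)
{
    return value >= std::numeric_limits<int32_t>::min() &&
           value <= std::numeric_limits<int32_t>::max();
}

// Simplified per-method tally, keyed by constant value instead of value number.
using OccurrenceMap = std::unordered_map<int64_t, unsigned>;

inline OccurrenceMap CountOccurrences(const std::vector<int64_t>& constantsInMethod)
{
    OccurrenceMap counts;
    for (int64_t value : constantsInMethod)
    {
        counts[value]++;
    }
    return counts;
}

// Assumed gate: only "expensive" constants (no imm32 form) that occur at
// least twice in the method are worth hoisting.
inline bool IsHoistCandidate(int64_t value, const OccurrenceMap& counts)
{
    if (FitsInImm32(value))
    {
        return false; // cheap to rematerialize at each use
    }
    auto it = counts.find(value);
    return (it != counts.end()) && (it->second >= 2);
}
```

A constant that appears once would still cost one `mov reg, imm64` after hoisting, so the gate avoids extending a register's live range for no saving.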
About -2MB on x64.
* Use a fresh VN matching the centered value for the shared-const CSE temp, rather than reusing the original constant's VN; this keeps the value number consistent with the actual constant value of the def node.
* Clarify the JitConstCSE comment to note that on x86/x64 only the nearby-value (shared) variant is enabled by default; full const CSE is still only target-gated for ARM/ARM64/RISCV64.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
I'm worried that all the real-world collections (benchmarks, aspnet) seem to show 3x PerfScore regressions.
I'll try to run a plausible benchmark subset. I suspect that our existing costing underestimates the perf impact of those large immediate values (so CSE looks like a loss because of the one extra mov).
A few representative microbenchmarks to compare PerfScore predictions against actual hardware.

@EgorBot -windows_intel -linux_amd

```cs
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Foo { }
public class FooDerived : Foo { }

public class Bench
{
    private const int Iterations = 1000;
    private object _fooObj = default!;
    private object _intObj = default!;
    private Foo[] _fillTarget = default!;
    private object _fillValue = default!;

    [GlobalSetup]
    public void Setup()
    {
        _fooObj = new Foo();
        _intObj = 42;
        _fillTarget = new Foo[Iterations];
        _fillValue = new Foo();
    }

    [Benchmark]
    public void FillArrayWithSameRef()
    {
        var arr = _fillTarget;
        var v = (Foo)_fillValue;
        for (int i = 0; i < arr.Length; i++)
        {
            arr[i] = v;
        }
    }

    [Benchmark]
    public int FooObjIsFoo()
    {
        int count = 0;
        var o = _fooObj;
        for (int i = 0; i < Iterations; i++)
        {
            if (o is Foo) count++;
        }
        return count;
    }

    [Benchmark]
    public int IntObjIsInt()
    {
        int count = 0;
        var o = _intObj;
        for (int i = 0; i < Iterations; i++)
        {
            if (o is int) count++;
        }
        return count;
    }

    [Benchmark]
    public long ManyBigConstants()
    {
        long sum = 0;
        for (int i = 0; i < Iterations; i++)
        {
            sum += 0x123456789ABCDEF0L;
            sum ^= 0x123456789ABCDEF1L;
            sum += 0x123456789ABCDEF2L;
            sum ^= 0x123456789ABCDEF3L;
        }
        return sum;
    }
}
```

Note: Comment generated with assistance from GitHub Copilot CLI.