
Commit 30ebfcd

feat(SIMD): add SIMD scalar paths to IL kernel generator
Implement Vector256 SIMD operations for mixed-type scalar operations where the array type equals the result type (no per-element conversion needed). This optimizes operations like `double_array + int_scalar`.

## Changes

- Add `EmitSimdScalarRightLoop()` for a SIMD scalar right operand
- Add `EmitSimdScalarLeftLoop()` for a SIMD scalar left operand
- Add `EmitVectorCreate()` helper for `Vector256.Create(scalar)`
- Update `GenerateSimdScalarRightKernel()` to choose SIMD when eligible
- Update `GenerateSimdScalarLeftKernel()` to choose SIMD when eligible

## SIMD Eligibility

SIMD is used when:

- ScalarRight: `LhsType == ResultType` (the array needs no conversion)
- ScalarLeft: `RhsType == ResultType` (the array needs no conversion)
- `ResultType` supports SIMD (float, double, int, long, etc.)
- The operation has SIMD support (Add, Subtract, Multiply, Divide)

## Benchmark Results

Array size: 10,000,000 elements

Before (mixed-type used a scalar loop):

```
int + double_scalar:    19.09 ms
```

After (SIMD when eligible):

```
double + int_scalar:    14.96 ms  [IL SIMD - matches baseline]
float  + int_scalar:     7.18 ms  [IL SIMD - matches baseline]
int    + double_scalar: 15.84 ms  [still scalar - needs conversion]
```

## Technical Details

The SIMD scalar loop:

1. Loads the scalar, converting it to the result type if needed
2. Broadcasts the scalar to a Vector256 using `Vector256.Create()`
3. SIMD loop: load array vector, perform the vector op, store the result
4. A tail loop handles the remainder elements

All 2597 tests pass.
1 parent 4071859 commit 30ebfcd

2 files changed: 568 additions & 2 deletions
# SIMD Optimization Investigation Results

## Executive Summary

The investigation revealed that NumSharp already has optimal SIMD scalar paths for **same-type operations** (via the C# SimdKernels), but **mixed-type operations** fell back to scalar loops in the IL kernels. **This has now been fixed.**

### Implementation Complete ✅

SIMD scalar paths have been added to the IL kernel generator for mixed-type operations where the array type equals the result type (no per-element conversion needed).
**Final Benchmark Results:**

```
Array size: 10,000,000 elements

Same-type operations (C# SIMD baseline):
  double + double_scalar   15.29 ms  [C# SIMD]
  float  + float_scalar     8.35 ms  [C# SIMD]

Mixed-type with IL SIMD (LHS type == result type):
  double + int_scalar      14.96 ms  [IL SIMD ✓]  <- NOW OPTIMIZED
  float  + int_scalar       7.18 ms  [IL SIMD ✓]  <- NOW OPTIMIZED

Mixed-type without SIMD (requires conversion):
  int + double_scalar      15.84 ms  [Scalar loop]
```

**Tests:** All 2597 tests pass, 0 failures.
---

## Hardware Detection Results

| Feature     | Supported |
|-------------|-----------|
| SSE         | Yes |
| SSE2        | Yes |
| SSE3        | Yes |
| SSSE3       | Yes |
| SSE4.1      | Yes |
| SSE4.2      | Yes |
| AVX         | Yes |
| AVX2        | Yes |
| **AVX-512** | **No** |
| Vector256   | Yes (hardware accelerated) |
| Vector512   | No |

**Conclusion**: This machine (and most consumer CPUs) only supports up to AVX2/Vector256. AVX-512 hardware detection should be added, but it has lower priority since adoption is limited.
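The table above corresponds to what the standard .NET intrinsics APIs report at runtime. As a minimal sketch (not part of the commit, shown only so the detection can be reproduced; the `Avx512F`/`Vector512` checks assume .NET 8+):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class SimdProbe
{
    static void Main()
    {
        // ISA-specific checks: always false on non-x86 architectures.
        Console.WriteLine($"Sse:     {Sse.IsSupported}");
        Console.WriteLine($"Sse2:    {Sse2.IsSupported}");
        Console.WriteLine($"Avx:     {Avx.IsSupported}");
        Console.WriteLine($"Avx2:    {Avx2.IsSupported}");
        Console.WriteLine($"Avx512F: {Avx512F.IsSupported}");

        // Generic vector checks: true only when the JIT accelerates that width.
        Console.WriteLine($"Vector256: {Vector256.IsHardwareAccelerated}");
        Console.WriteLine($"Vector512: {Vector512.IsHardwareAccelerated}");
    }
}
```

On other hardware the printed values will of course differ from the table.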
---

## Scalar SIMD Benchmark Results

```
Benchmark: array[10,000,000] + scalar

1. Scalar Loop  : 25.42 ms
2. SIMD Hoisted : 16.28 ms  (1.56x faster)
3. SIMD In-Loop : 22.42 ms  (JIT doesn't fully hoist)
```

**Key Findings:**

- SIMD with a hoisted `Vector256.Create(scalar)` is **1.56x faster** than the scalar loop
- The JIT does NOT fully hoist `Vector256.Create` - explicit hoisting gains another **1.38x**
- Explicit hoisting before the loop is critical for performance
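The hoisted vs in-loop difference can be reproduced with a small standalone kernel. A sketch (illustrative, not the library code; compile with `AllowUnsafeBlocks`, .NET 7+ for `Vector256.Load`/`Store`):

```csharp
using System;
using System.Runtime.Intrinsics;

public static class HoistDemo
{
    // Fast variant: the scalar is broadcast ONCE, before the loop.
    public static unsafe void AddScalarHoisted(double* src, double scalar, double* dst, int n)
    {
        var scalarVec = Vector256.Create(scalar);          // hoisted broadcast
        int i = 0;
        for (; i <= n - Vector256<double>.Count; i += Vector256<double>.Count)
            Vector256.Store(Vector256.Load(src + i) + scalarVec, dst + i);
        for (; i < n; i++)                                 // scalar remainder
            dst[i] = src[i] + scalar;
    }

    // Slow variant: Vector256.Create sits inside the loop; the JIT does not
    // reliably hoist it, which is the ~1.38x gap measured above.
    public static unsafe void AddScalarInLoop(double* src, double scalar, double* dst, int n)
    {
        int i = 0;
        for (; i <= n - Vector256<double>.Count; i += Vector256<double>.Count)
            Vector256.Store(Vector256.Load(src + i) + Vector256.Create(scalar), dst + i);
        for (; i < n; i++)
            dst[i] = src[i] + scalar;
    }

    static unsafe void Main()
    {
        var a = new double[10];
        var r = new double[10];
        for (int i = 0; i < a.Length; i++) a[i] = i;
        fixed (double* pa = a, pr = r)
            AddScalarHoisted(pa, 2.5, pr, a.Length);
        Console.WriteLine(r[9] == 11.5); // True
    }
}
```

Both variants compute the same result; only where the broadcast happens differs.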
---

## NumSharp Current State Analysis

### Execution Path Dispatch

```
Operation Type    | Path Classification | Kernel Used        | SIMD Scalar?
------------------|---------------------|--------------------|--------------
double + double   | SimdScalarRight     | C# SimdKernels     | YES (optimal)
int + double      | SimdScalarRight     | IL MixedTypeKernel | NO (scalar loop)
int + int         | SimdScalarRight     | C# SimdKernels     | YES (for int/double/float/long)
byte + float      | SimdScalarRight     | IL MixedTypeKernel | NO (scalar loop)
```
### Performance Comparison

```
Benchmark: array[10,000,000] + scalar

Same-type  (double+double): 14.26 ms  (C# SIMD kernel)
Mixed-type (int+double):    18.07 ms  (IL scalar kernel)

Performance gap: ~27%
```
### Code Analysis

**C# SimdKernels.cs (lines 217-231)** - optimal implementation:

```csharp
private static unsafe void SimdScalarRight_Add_Double(double* lhs, double scalar, double* result, int totalSize)
{
    var scalarVec = Vector256.Create(scalar); // hoisted!
    int i = 0;
    int vectorEnd = totalSize - Vector256<double>.Count;

    for (; i <= vectorEnd; i += Vector256<double>.Count)
    {
        var vl = Vector256.Load(lhs + i);
        Vector256.Store(vl + scalarVec, result + i); // SIMD add + store
    }

    for (; i < totalSize; i++)
        result[i] = lhs[i] + scalar; // scalar remainder
}
```
**ILKernelGenerator.cs (lines 912-970)** - suboptimal implementation (abridged):

```csharp
private static void EmitScalarRightLoop(ILGenerator il, MixedTypeKernelKey key, ...)
{
    // Lines 916-925: hoist the scalar value into a local (good!)
    var locRhsVal = il.DeclareLocal(GetClrType(key.ResultType));
    il.Emit(OpCodes.Ldarg_1); // rhs
    EmitLoadIndirect(il, key.RhsType);
    EmitConvertTo(il, key.RhsType, key.ResultType);
    il.Emit(OpCodes.Stloc, locRhsVal);

    // Lines 938-960 emit the equivalent of this loop - scalar only, NO SIMD:
    //
    //     for (int i = 0; i < totalSize; i++)
    //         result[i] = lhs[i] + rhsVal;
}
```
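For contrast, the C# equivalent of what the new `EmitSimdScalarRightLoop()` emits for a case like `double_array + int_scalar` might look like the following. This is an illustrative sketch of the generated loop's shape (the four steps from the commit's Technical Details), not the emitted IL itself:

```csharp
using System;
using System.Runtime.Intrinsics;

public static class MixedTypeSimdSketch
{
    // double[] + int scalar, double result: the array type equals the
    // result type, so only the scalar needs converting - once, up front.
    public static unsafe void AddScalarRight(double* lhs, int rhs, double* result, int totalSize)
    {
        double rhsVal = rhs;                        // step 1: convert scalar once
        var rhsVec = Vector256.Create(rhsVal);      // step 2: broadcast (hoisted)

        int i = 0;
        int vectorEnd = totalSize - Vector256<double>.Count;
        for (; i <= vectorEnd; i += Vector256<double>.Count)   // step 3: SIMD loop
            Vector256.Store(Vector256.Load(lhs + i) + rhsVec, result + i);

        for (; i < totalSize; i++)                  // step 4: scalar tail
            result[i] = lhs[i] + rhsVal;
    }

    static unsafe void Main()
    {
        var a = new double[6] { 0.5, 1.5, 2.5, 3.5, 4.5, 5.5 };
        var r = new double[6];
        fixed (double* pa = a, pr = r)
            AddScalarRight(pa, 2, pr, a.Length);
        Console.WriteLine(r[0] == 2.5 && r[5] == 7.5); // True
    }
}
```

The key difference from the scalar IL loop above is that the per-element work is a vector load, vector add, and vector store, with the conversion hoisted entirely out of the loop.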
---

## Recommendations

### Priority 1: Add SIMD to IL Scalar Paths (HIGH IMPACT)

**Why**: ~27% speedup for mixed-type scalar operations.

**Implementation**:

1. Modify `EmitScalarRightLoop()` to emit SIMD code for supported types
2. Hoist `Vector256.Create(scalar)` before the loop
3. Add Vector256 load/add/store in the main loop
4. Keep a scalar remainder loop for sizes not divisible by the vector count

**Target types**: float, double (already have Vector256 support)

**Files to modify**:

- `ILKernelGenerator.cs`: add an `EmitSimdScalarRightLoop()` method
- Update `GenerateSimdScalarRightKernel()` to choose SIMD vs scalar based on type
### Priority 2: Hardware Detection (LOW PRIORITY)

**Why**: AVX-512 adoption is limited. Most CPUs (including this dev machine) support at most AVX2.

**Implementation** (when AVX-512 becomes common):

1. Add static readonly flags in `SimdThresholds.cs`:

   ```csharp
   public static readonly bool HasAvx512 = Vector512.IsHardwareAccelerated;
   public static readonly int PreferredVectorWidth = HasAvx512 ? 512 : 256;
   ```

2. Add Vector512 code paths alongside the Vector256 ones
3. Use runtime dispatch based on `HasAvx512`

**Expected benefit**: up to 2x throughput on AVX-512 hardware (16 floats vs 8 floats per instruction)
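The dispatch idea in steps 1-3 could be sketched as follows. This is a hypothetical shape for the proposed `SimdThresholds` flags (the class and its fields are the suggestion above, not existing NumSharp code; `Vector512` assumes .NET 8+):

```csharp
using System;
using System.Runtime.Intrinsics;

static class SimdThresholds
{
    // Cached once at startup; the JIT folds static readonly bools into
    // constants after tiering, so the branch below becomes free.
    public static readonly bool HasAvx512 = Vector512.IsHardwareAccelerated;
    public static readonly int PreferredVectorWidth = HasAvx512 ? 512 : 256;
}

public static class ScalarAddDispatch
{
    public static unsafe void Run(float* src, float s, float* dst, int n)
    {
        int i = 0;
        if (SimdThresholds.HasAvx512)
        {
            var sv = Vector512.Create(s);
            for (; i <= n - Vector512<float>.Count; i += Vector512<float>.Count)
                Vector512.Store(Vector512.Load(src + i) + sv, dst + i);  // 16 floats/iter
        }
        else
        {
            var sv = Vector256.Create(s);
            for (; i <= n - Vector256<float>.Count; i += Vector256<float>.Count)
                Vector256.Store(Vector256.Load(src + i) + sv, dst + i);  // 8 floats/iter
        }
        for (; i < n; i++) dst[i] = src[i] + s;  // shared scalar tail
    }

    static unsafe void Main()
    {
        var a = new float[20];
        var r = new float[20];
        for (int i = 0; i < a.Length; i++) a[i] = i;
        fixed (float* pa = a, pr = r)
            Run(pa, 1.0f, pr, a.Length);
        Console.WriteLine(r[19] == 20f); // True
    }
}
```

Both branches fall through to the same scalar tail, so the result is identical regardless of which width the hardware supports.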
---

## Implementation Checklist

### Phase 1: SIMD Scalar for IL Kernels ✅ COMPLETE

- [x] Add `EmitSimdScalarRightLoop()` for float/double
- [x] Add `EmitSimdScalarLeftLoop()` for float/double
- [x] Add `EmitVectorCreate()` helper for `Vector256.Create(scalar)`
- [x] Update `GenerateSimdScalarRightKernel()` to choose the SIMD path
- [x] Update `GenerateSimdScalarLeftKernel()` to choose the SIMD path
- [x] Verify correctness with small arrays
- [x] Run the full test suite (2597 passed, 0 failed)
- [x] Benchmark before/after

### Phase 2: Hardware Detection (Deferred)

- [ ] Add a `SimdCapabilities` static class
- [ ] Cache detection results at startup
- [ ] Add Vector512 code paths (when adopting)
- [ ] Runtime dispatch mechanism
---

## Files Modified

- `src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs`:
  - Added `EmitSimdScalarRightLoop()` method (lines 1063-1178)
  - Added `EmitSimdScalarLeftLoop()` method (lines 1180-1295)
  - Added `EmitVectorCreate()` helper (lines 1900-1914)
  - Updated `GenerateSimdScalarRightKernel()` to check SIMD eligibility
  - Updated `GenerateSimdScalarLeftKernel()` to check SIMD eligibility
---

## Appendix: Raw Benchmark Data

### Test 1: Hardware Detection

```
X86 Intrinsics:
  Sse:     True
  Sse2:    True
  Avx:     True
  Avx2:    True
  Avx512F: False

Generic Vector Types:
  Vector256<float>: True
  Vector512<float>: False
```

### Test 2: Scalar vs SIMD

```
array[10,000,000] + scalar

1. Scalar Loop  : 25.42 ms
2. SIMD Hoisted : 16.28 ms
3. SIMD In-Loop : 22.42 ms
```

### Test 3: NumSharp Same-type vs Mixed-type

```
Same-type  (double+double): 14.26 ms
Mixed-type (int+double):    18.07 ms
```
---

## Conclusion

The investigation confirmed:

1. **Scalar SIMD** with a hoisted broadcast provides a **1.56x speedup** over scalar loops
2. NumSharp's C# SimdKernels already implement this optimally for same-type operations
3. ~~**IL MixedTypeKernels lack SIMD for scalar paths**~~ **FIXED**
4. AVX-512 hardware detection is low priority due to limited adoption

**Status**: SIMD scalar paths have been implemented for the IL kernels. Mixed-type operations like `double_array + int_scalar` now use SIMD when the array type equals the result type.

**Remaining work**: hardware detection for AVX-512 (deferred until adoption increases).
