# SIMD Optimization Investigation Results

## Executive Summary

The investigation revealed that NumSharp already had optimal SIMD scalar paths for **same-type operations** (via the C# SimdKernels), but **mixed-type operations** fell back to scalar loops in the IL kernels. **This has now been fixed.**

### Implementation Complete ✅

SIMD scalar paths have been added to the IL kernel generator for mixed-type operations where the array type equals the result type (no per-element conversion needed).

**Final Benchmark Results:**
```
Array size: 10,000,000 elements

Same-type operations (C# SIMD baseline):
  double + double_scalar    15.29 ms  [C# SIMD]
  float  + float_scalar      8.35 ms  [C# SIMD]

Mixed-type with IL SIMD (LHS type == Result type):
  double + int_scalar       14.96 ms  [IL SIMD ✓]  <- NOW OPTIMIZED
  float  + int_scalar        7.18 ms  [IL SIMD ✓]  <- NOW OPTIMIZED

Mixed-type without SIMD (requires conversion):
  int    + double_scalar    15.84 ms  [Scalar loop]
```

**Tests:** All 2597 tests pass, 0 failures.

---

## Hardware Detection Results

| Feature | Supported |
|---------|-----------|
| SSE | Yes |
| SSE2 | Yes |
| SSE3 | Yes |
| SSSE3 | Yes |
| SSE4.1 | Yes |
| SSE4.2 | Yes |
| AVX | Yes |
| AVX2 | Yes |
| **AVX-512** | **No** |
| Vector256 | Yes (hardware accelerated) |
| Vector512 | No |

**Conclusion**: This machine (and most consumer CPUs) only supports up to AVX2/Vector256. AVX-512 hardware detection should still be added, but at lower priority since adoption is limited.

---

## Scalar SIMD Benchmark Results

```
Benchmark: array[10,000,000] + scalar

1. Scalar Loop   : 25.42 ms
2. SIMD Hoisted  : 16.28 ms  (1.56x faster)
3. SIMD In-Loop  : 22.42 ms  (JIT doesn't fully hoist)
```

**Key Findings:**
- SIMD with a hoisted `Vector256.Create(scalar)` is **1.56x faster** than the scalar loop
- The JIT does NOT fully hoist `Vector256.Create` out of the loop; explicit hoisting gains another **1.38x** over the in-loop variant
- Explicit hoisting of the broadcast before the loop is critical for performance

| 66 | +--- |
| 67 | + |
| 68 | +## NumSharp Current State Analysis |
| 69 | + |
| 70 | +### Execution Path Dispatch |
| 71 | + |
| 72 | +``` |
| 73 | +Operation Type | Path Classification | Kernel Used | SIMD Scalar? |
| 74 | +------------------|---------------------|----------------------|------------- |
| 75 | +double + double | SimdScalarRight | C# SimdKernels | YES (optimal) |
| 76 | +int + double | SimdScalarRight | IL MixedTypeKernel | NO (scalar loop) |
| 77 | +int + int | SimdScalarRight | C# SimdKernels | YES (for int/double/float/long) |
| 78 | +byte + float | SimdScalarRight | IL MixedTypeKernel | NO (scalar loop) |
| 79 | +``` |
| 80 | + |
| 81 | +### Performance Comparison |
| 82 | + |
| 83 | +``` |
| 84 | +Benchmark: array[10,000,000] + scalar |
| 85 | +
|
| 86 | +Same-type (double+double): 14.26 ms (C# SIMD kernel) |
| 87 | +Mixed-type (int+double): 18.07 ms (IL scalar kernel) |
| 88 | +
|
| 89 | +Performance gap: ~27% |
| 90 | +``` |
| 91 | + |
### Code Analysis

**C# SimdKernels.cs (lines 217-231)** - Optimal implementation:
```csharp
private static unsafe void SimdScalarRight_Add_Double(double* lhs, double scalar, double* result, int totalSize)
{
    var scalarVec = Vector256.Create(scalar);  // Hoisted!
    int i = 0;
    int vectorEnd = totalSize - Vector256<double>.Count;

    for (; i <= vectorEnd; i += Vector256<double>.Count)
    {
        var vl = Vector256.Load(lhs + i);
        Vector256.Store(vl + scalarVec, result + i);  // SIMD!
    }

    for (; i < totalSize; i++)
        result[i] = lhs[i] + scalar;  // Remainder
}
```

**ILKernelGenerator.cs (lines 912-970)** - Suboptimal implementation:
```csharp
private static void EmitScalarRightLoop(ILGenerator il, MixedTypeKernelKey key, ...)
{
    // Lines 916-925: hoist the scalar value into a local (good!)
    var locRhsVal = il.DeclareLocal(GetClrType(key.ResultType));
    il.Emit(OpCodes.Ldarg_1);                       // rhs
    EmitLoadIndirect(il, key.RhsType);
    EmitConvertTo(il, key.RhsType, key.ResultType);
    il.Emit(OpCodes.Stloc, locRhsVal);

    // Lines 938-960: the emitted loop is scalar only, no SIMD.
    // Equivalent C# of the emitted code:
    //   for (int i = 0; i < totalSize; i++)
    //       result[i] = lhs[i] + rhsVal;
}
```

---

## Recommendations

### Priority 1: Add SIMD to IL Scalar Paths (HIGH IMPACT)

**Why**: ~27% speedup for mixed-type scalar operations.

**Implementation**:
1. Modify `EmitScalarRightLoop()` to emit SIMD code for supported types
2. Hoist `Vector256.Create(scalar)` before the loop
3. Add Vector256 load/add/store in the main loop
4. Keep a scalar remainder loop for sizes not divisible by the vector count

**Target types**: float, double (already have Vector256 support)

**Files to modify**:
- `ILKernelGenerator.cs`: Add `EmitSimdScalarRightLoop()` method
- Update `GenerateSimdScalarRightKernel()` to choose SIMD vs scalar based on type

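The SIMD-vs-scalar choice in the last bullet could look like this hypothetical sketch. `MixedTypeKernelKey` and the `RhsType`/`ResultType` members appear elsewhere in this report; the `LhsType` member and the `NPTypeCode` enum values are assumptions about NumSharp's internals:

```csharp
// Hypothetical eligibility check: emit the SIMD loop only when no per-element
// conversion is needed and the element type has Vector256 arithmetic support.
private static bool CanEmitSimdScalarLoop(MixedTypeKernelKey key)
{
    // LHS elements can feed the vector ALU directly only when they are
    // already in the result type; otherwise each element needs a convert.
    if (key.LhsType != key.ResultType)
        return false;

    // Restrict to the types the SIMD emitter targets (float/double).
    return key.ResultType == NPTypeCode.Float
        || key.ResultType == NPTypeCode.Double;
}
```

This matches the benchmark table above: `double + int_scalar` qualifies (LHS type double == result type double), while `int + double_scalar` does not (int LHS must be widened to double per element).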
### Priority 2: Hardware Detection (LOW PRIORITY)

**Why**: AVX-512 adoption is limited. Most CPUs (including this dev machine) only support AVX2.

**Implementation** (when AVX-512 becomes common):
1. Add static readonly flags in `SimdThresholds.cs`:
   ```csharp
   public static readonly bool HasAvx512 = Vector512.IsHardwareAccelerated;
   public static readonly int PreferredVectorWidth = HasAvx512 ? 512 : 256;
   ```
2. Add Vector512 code paths alongside Vector256
3. Use runtime dispatch based on `HasAvx512`

**Expected benefit**: 2x throughput on AVX-512 hardware (16 floats vs 8 floats per instruction)

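Steps 2 and 3 could be combined as in the following sketch (illustrative class and method names; `Vector512` requires .NET 8+; not NumSharp's actual code):

```csharp
using System.Runtime.Intrinsics;

static class Avx512Dispatch
{
    // Flag from step 1 above.
    public static readonly bool HasAvx512 = Vector512.IsHardwareAccelerated;

    // Runtime dispatch (step 3): pick the widest supported vector path,
    // with a scalar remainder loop shared by both.
    public static unsafe void AddScalar(float* src, float scalar, float* dst, int n)
    {
        int i = 0;
        if (HasAvx512)
        {
            var sv = Vector512.Create(scalar);                        // 16 floats per op
            for (; i <= n - Vector512<float>.Count; i += Vector512<float>.Count)
                Vector512.Store(Vector512.Load(src + i) + sv, dst + i);
        }
        else
        {
            var sv = Vector256.Create(scalar);                        // 8 floats per op
            for (; i <= n - Vector256<float>.Count; i += Vector256<float>.Count)
                Vector256.Store(Vector256.Load(src + i) + sv, dst + i);
        }
        for (; i < n; i++)
            dst[i] = src[i] + scalar;                                 // remainder
    }
}
```

Because `HasAvx512` is `static readonly`, the JIT can treat it as a constant and drop the dead branch, so the runtime check costs nothing in steady state.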
---

## Implementation Checklist

### Phase 1: SIMD Scalar for IL Kernels ✅ COMPLETE

- [x] Add `EmitSimdScalarRightLoop()` for float/double
- [x] Add `EmitSimdScalarLeftLoop()` for float/double
- [x] Add `EmitVectorCreate()` helper for Vector256.Create(scalar)
- [x] Update `GenerateSimdScalarRightKernel()` to choose SIMD path
- [x] Update `GenerateSimdScalarLeftKernel()` to choose SIMD path
- [x] Verify correctness with small arrays
- [x] Run full test suite (2597 passed, 0 failed)
- [x] Benchmark before/after

### Phase 2: Hardware Detection (Defer)

- [ ] Add `SimdCapabilities` static class
- [ ] Cache detection results at startup
- [ ] Add Vector512 code paths (when adopting)
- [ ] Runtime dispatch mechanism

---

## Files Modified

- `src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs`:
  - Added `EmitSimdScalarRightLoop()` method (lines 1063-1178)
  - Added `EmitSimdScalarLeftLoop()` method (lines 1180-1295)
  - Added `EmitVectorCreate()` helper (lines 1900-1914)
  - Updated `GenerateSimdScalarRightKernel()` to check SIMD eligibility
  - Updated `GenerateSimdScalarLeftKernel()` to check SIMD eligibility

| 201 | + |
| 202 | +## Appendix: Raw Benchmark Data |
| 203 | + |
| 204 | +### Test 1: Hardware Detection |
| 205 | +``` |
| 206 | +X86 Intrinsics: |
| 207 | + Sse: True |
| 208 | + Sse2: True |
| 209 | + Avx: True |
| 210 | + Avx2: True |
| 211 | + Avx512F: False |
| 212 | +
|
| 213 | +Generic Vector Types: |
| 214 | + Vector256<float>: True |
| 215 | + Vector512<float>: False |
| 216 | +``` |
| 217 | + |
| 218 | +### Test 2: Scalar vs SIMD |
| 219 | +``` |
| 220 | +array[10,000,000] + scalar |
| 221 | +
|
| 222 | +1. Scalar Loop : 25.42 ms |
| 223 | +2. SIMD Hoisted : 16.28 ms |
| 224 | +3. SIMD In-Loop : 22.42 ms |
| 225 | +``` |
| 226 | + |
| 227 | +### Test 3: NumSharp Same-type vs Mixed-type |
| 228 | +``` |
| 229 | +Same-type (double+double): 14.26 ms |
| 230 | +Mixed-type (int+double): 18.07 ms |
| 231 | +``` |
| 232 | + |
| 233 | +--- |
| 234 | + |
| 235 | +## Conclusion |
| 236 | + |
| 237 | +The investigation confirmed: |
| 238 | +1. **Scalar SIMD** with hoisted broadcast provides **1.56x speedup** over scalar loops |
| 239 | +2. NumSharp's C# SimdKernels already implement this optimally for same-type operations |
| 240 | +3. ~~**IL MixedTypeKernels lack SIMD for scalar paths**~~ **FIXED** ✅ |
| 241 | +4. AVX-512 hardware detection is low priority due to limited adoption |
| 242 | + |
| 243 | +**Status**: SIMD scalar paths have been implemented for IL kernels. Mixed-type operations like `double_array + int_scalar` now use SIMD when the array type equals the result type. |
| 244 | + |
| 245 | +**Remaining work**: Hardware detection for AVX-512 (deferred until adoption increases). |