| 1 | +# NumSharp 0.41.0-prerelease |
| 2 | + |
| 3 | +This prerelease introduces the **IL Kernel Generator** - a complete architectural overhaul that replaces ~600K lines of Regen-generated template code with ~19K lines of runtime IL generation. The result is large performance gains (MatMul is 35-100x faster), closer NumPy 2.x alignment, and significantly cleaner, more maintainable code. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## TL;DR |
| 8 | + |
| 9 | +Backend rewrite via dynamic IL emission, 25 new `np.*` functions, boolean indexing rewrite, broadcast slicing fix, Regen static generation deprecated, 52 bug fixes, MatMul 35-100x faster, -532K lines net. |
| 10 | + |
| 11 | +``` |
| 12 | ++ 25 new/fixed functions (nansum, isnan, isfinite, isinf, isclose, cumprod, etc.) |
| 13 | ++ 52 bug fixes for NumPy 2.x alignment |
| 14 | ++ MatMul 35-100x faster (SIMD cache-blocked, 20+ GFLOPS) |
| 15 | ++ 97% code reduction (-532K lines) |
| 16 | ++ Runtime IL generation replaces static templates |
| 17 | ++ Vector128/256/512 SIMD with runtime detection |
| 18 | ++ Boolean indexing rewrite with SIMD fast path |
| 19 | ++ All comparison/bitwise operators now work (were returning null) |
| 20 | ++ No breaking changes - drop-in replacement |
| 21 | +``` |
| 22 | + |
| 23 | +**Install**: `dotnet add package NumSharp --version 0.41.0-prerelease` |
| 24 | + |
| 25 | +--- |
| 26 | + |
| 27 | +## Contents |
| 28 | + |
| 29 | +| Section | Highlights | |
| 30 | +|---------|------------| |
| 31 | +| [Summary](#summary) | 80 commits, -532K lines, 3,868 tests | |
| 32 | +| [IL Kernel Generator](#il-kernel-generator) | 27 files, SIMD V128/256/512 | |
| 33 | +| [New NumPy Functions (25)](#new-numpy-functions-25) | nansum, isnan, cumprod, etc. | |
| 34 | +| [Critical Bug Fixes](#critical-bug-fixes) | negative, unique, dot, linspace | |
| 35 | +| [Operator Rewrites](#operator-rewrites) | ==, !=, <, >, &, \| now work | |
| 36 | +| [Boolean Indexing Rewrite](#boolean-indexing-rewrite) | SIMD fast path | |
| 37 | +| [Slicing Improvements](#slicing-improvements) | Broadcast stride=0 preserved | |
| 38 | +| [Performance Improvements](#performance-improvements) | MatMul 35-100x, 20+ GFLOPS | |
| 39 | +| [Code Reduction](#code-reduction) | 99% binary, 98% MatMul, 97% Dot | |
| 40 | +| [Infrastructure Changes](#infrastructure-changes) | NativeMemory, KernelProvider | |
| 41 | +| [API Fixes](#api-fixes) | random(), standard_normal, dtype | |
| 42 | +| [New Test Files (64)](#new-test-files-64) | 34 kernel, 8 NumPy, 3 linalg | |
| 43 | +| [Breaking Changes](#breaking-changes) | None | |
| 44 | +| [Known Issues](#known-issues-openbugs) | 52 OpenBugs excluded | |
| 45 | +| [Installation](#installation) | `dotnet add package NumSharp` | |
| 46 | + |
| 47 | +--- |
| 48 | + |
| 49 | +## Summary |
| 50 | + |
| 51 | +| Metric | Value | |
| 52 | +|--------|-------| |
| 53 | +| Commits | 80 | |
| 54 | +| Files Changed | 623 | |
| 55 | +| Lines Added | +71,355 | |
| 56 | +| Lines Deleted | -603,345 | |
| 57 | +| **Net Change** | **-532K lines** | |
| 58 | +| Test Results | 3,868 passed, 52 OpenBugs, 11 skipped | |
| 59 | + |
| 60 | +--- |
| 61 | + |
| 62 | +## IL Kernel Generator |
| 63 | + |
| 64 | +Runtime IL generation via `System.Reflection.Emit.DynamicMethod` replaces static Regen templates. |
| 65 | + |
| 66 | +### Kernel Files (27 new files) |
| 67 | +- `ILKernelGenerator.cs` - Core infrastructure, SIMD detection (Vector128/256/512) |
| 68 | +- `ILKernelGenerator.Binary.cs` - Add, Sub, Mul, Div, BitwiseAnd/Or/Xor |
| 69 | +- `ILKernelGenerator.MixedType.cs` - Mixed-type ops with type promotion |
| 70 | +- `ILKernelGenerator.Unary.cs` - Negate, Abs, Sqrt, Sin, Cos, Exp, Log, Sign |
| 71 | +- `ILKernelGenerator.Comparison.cs` - ==, !=, <, >, <=, >= returning bool arrays |
| 72 | +- `ILKernelGenerator.Reduction.cs` - Sum, Prod, Min, Max, Mean, ArgMax, ArgMin, All, Any |
| 73 | +- `ILKernelGenerator.Reduction.Axis.Simd.cs` - AVX2 gather for axis reductions |
| 74 | +- `ILKernelGenerator.Scan.cs` - CumSum, CumProd with SIMD |
| 75 | +- `ILKernelGenerator.Shift.cs` - LeftShift, RightShift |
| 76 | +- `ILKernelGenerator.MatMul.cs` - Cache-blocked SIMD matrix multiply |
| 77 | +- `ILKernelGenerator.Clip.cs`, `.Modf.cs`, `.Masking.cs` - Specialized ops |
| 78 | + |
| 79 | +### Execution Paths |
| 80 | +1. **SimdFull** - Contiguous + SIMD-capable dtype → Vector loop + scalar tail |
| 81 | +2. **ScalarFull** - Contiguous + non-SIMD dtype (Decimal) → Scalar loop |
| 82 | +3. **General** - Strided/broadcast → Coordinate-based iteration |
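
The three paths above can be pictured as a small dispatch step followed by a loop shape. The following is a minimal sketch in Python, not NumSharp's C# implementation: `ArrayInfo`, `SIMD_DTYPES`, `choose_path`, and `simd_full_add` are hypothetical names, and the set of SIMD-capable dtypes is an assumption made for illustration.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for illustration; not NumSharp's actual types.
SIMD_DTYPES = {"int32", "int64", "float32", "float64"}  # assumed subset


@dataclass
class ArrayInfo:
    dtype: str
    is_contiguous: bool


def choose_path(*operands: ArrayInfo) -> str:
    """Pick one of the three execution paths listed above."""
    if not all(op.is_contiguous for op in operands):
        return "General"      # strided or broadcast: coordinate-based iteration
    if all(op.dtype in SIMD_DTYPES for op in operands):
        return "SimdFull"     # vector loop plus scalar tail
    return "ScalarFull"       # contiguous, but the dtype has no SIMD support


def simd_full_add(a, b, lanes=4):
    """Model of the SimdFull loop shape: whole vectors first, then a scalar tail."""
    n = len(a)
    out = [0] * n
    vector_end = n - n % lanes
    for i in range(0, vector_end, lanes):   # "vector" body, one chunk per step
        out[i:i + lanes] = [x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes])]
    for i in range(vector_end, n):          # scalar tail for the leftover elements
        out[i] = a[i] + b[i]
    return out
```

With `lanes=4`, a 5-element input runs one vector step and one scalar-tail step, which is the "Vector loop + scalar tail" shape named for SimdFull.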
| 83 | + |
| 84 | +### Infrastructure |
| 85 | +- `IKernelProvider.cs` - Abstraction for future backends (CUDA, Vulkan) |
| 86 | +- `KernelKey.cs`, `KernelOp.cs`, `KernelSignatures.cs` - Kernel dispatch |
| 87 | +- `SimdMatMul.cs`, `SimdReductionOptimized.cs` - SIMD helpers |
| 88 | +- `TypeRules.cs` - NEP 50 type promotion rules |
| 89 | + |
| 90 | +--- |
| 91 | + |
| 92 | +## New NumPy Functions (25) |
| 93 | + |
| 94 | +### NaN-Aware Reductions (7) |
| 95 | +| Function | Description | |
| 96 | +|----------|-------------| |
| 97 | +| `np.nansum` | Sum ignoring NaN | |
| 98 | +| `np.nanprod` | Product ignoring NaN | |
| 99 | +| `np.nanmin` | Minimum ignoring NaN | |
| 100 | +| `np.nanmax` | Maximum ignoring NaN | |
| 101 | +| `np.nanmean` | Mean ignoring NaN | |
| 102 | +| `np.nanvar` | Variance ignoring NaN | |
| 103 | +| `np.nanstd` | Standard deviation ignoring NaN | |
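
As a rough illustration of what "ignoring NaN" means for these reductions, here is a pure-Python sketch modelled on NumPy's documented semantics. It is not NumSharp code, and whether NumSharp matches NumPy on every edge case (for example all-NaN input) is an assumption.

```python
import math


def nansum(values):
    """Sum that skips NaN entries; an all-NaN or empty input sums to 0.0."""
    return math.fsum(v for v in values if not math.isnan(v))


def nanmean(values):
    """Mean over the non-NaN entries; NaN when nothing is left to average."""
    kept = [v for v in values if not math.isnan(v)]
    return math.fsum(kept) / len(kept) if kept else math.nan


data = [1.0, math.nan, 3.0]
print(nansum(data))   # 4.0
print(nanmean(data))  # 2.0
```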
| 104 | + |
| 105 | +### Math Operations (8) |
| 106 | +| Function | Description | |
| 107 | +|----------|-------------| |
| 108 | +| `np.cbrt` | Cube root | |
| 109 | +| `np.floor_divide` | Floor division (quotient rounded toward negative infinity) | |
| 110 | +| `np.reciprocal` | Element-wise 1/x | |
| 111 | +| `np.trunc` | Truncate toward zero | |
| 112 | +| `np.invert` | Bitwise NOT | |
| 113 | +| `np.square` | Element-wise square | |
| 114 | +| `np.cumprod` | Cumulative product | |
| 115 | +| `np.count_nonzero` | Count non-zero elements | |
| 116 | + |
| 117 | +### Bitwise & Trigonometric (4) |
| 118 | +| Function | Description | |
| 119 | +|----------|-------------| |
| 120 | +| `np.left_shift` | Bitwise left shift | |
| 121 | +| `np.right_shift` | Bitwise right shift | |
| 122 | +| `np.deg2rad` | Degrees to radians | |
| 123 | +| `np.rad2deg` | Radians to degrees | |
| 124 | + |
| 125 | +### Logic & Validation (4) - Previously returned `null` |
| 126 | +| Function | Description | |
| 127 | +|----------|-------------| |
| 128 | +| `np.isnan` | Test element-wise for NaN | |
| 129 | +| `np.isfinite` | Test element-wise for finiteness | |
| 130 | +| `np.isinf` | Test element-wise for infinity | |
| 131 | +| `np.isclose` | Element-wise comparison within tolerance | |
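
For reference, NumPy defines the `isclose` tolerance test as `|a - b| <= atol + rtol * |b|`, with defaults `rtol=1e-5` and `atol=1e-8`. The scalar sketch below follows that definition in plain Python; that NumSharp uses the same defaults and the same NaN and infinity handling is an assumption here.

```python
import math


def isclose(a, b, rtol=1e-5, atol=1e-8, equal_nan=False):
    """Scalar model of NumPy's isclose test; note it is asymmetric in a and b."""
    if math.isnan(a) or math.isnan(b):
        return equal_nan and math.isnan(a) and math.isnan(b)
    if math.isinf(a) or math.isinf(b):
        return a == b                       # infinities match only with the same sign
    return abs(a - b) <= atol + rtol * abs(b)
```

Because the relative term scales with `|b|` only, `isclose(a, b)` and `isclose(b, a)` can disagree near the tolerance boundary.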
| 132 | + |
| 133 | +### Operators (2) - Previously returned `null` |
| 134 | +| Operator | Description | |
| 135 | +|----------|-------------| |
| 136 | +| `operator &` | Bitwise/logical AND with broadcasting | |
| 137 | +| `operator \|` | Bitwise/logical OR with broadcasting | |
| 138 | + |
| 139 | +### New Overloads |
| 140 | +| Function | New Capability | |
| 141 | +|----------|----------------| |
| 142 | +| `np.power(array, array)` | Array exponents (was scalar only) | |
| 143 | +| `np.repeat(array, NDArray)` | Per-element repeat counts | |
| 144 | +| `np.argmax/argmin(axis, keepdims)` | keepdims parameter | |
| 145 | +| `np.convolve` | Complete rewrite (previously threw `NullReferenceException`) | |
| 146 | + |
| 147 | +--- |
| 148 | + |
| 149 | +## Critical Bug Fixes |
| 150 | + |
| 151 | +### Behavioral Fixes |
| 152 | +| Bug | Before | After | |
| 153 | +|-----|--------|-------| |
| 154 | +| `np.negative()` | Only negated positive values (`if val > 0`) | Negates ALL values (`val = -val`) | |
| 155 | +| `np.unique()` | Returned unsorted | Sorts output, NaN at end | |
| 156 | +| `np.dot(1D, 2D)` | Threw `NotSupportedException` | Treats 1D as row vector | |
| 157 | +| `np.linspace()` | Returned `float32` for float inputs | Defaults to `float64` | |
| 158 | +| `np.arange()` | Threw on `start >= stop` | Returns empty array | |
| 159 | +| `np.searchsorted()` | No scalar support | Added scalar overloads returning `int` | |
| 160 | +| `np.shuffle()` | Non-standard `passes` parameter | NumPy legacy API (axis-0 only) | |
| 161 | +| Float-to-int conversion | Used rounding | Uses truncation toward zero | |
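
The float-to-int row is easiest to see with negative and half-way values. A small Python sketch of the difference between truncation toward zero and rounding to nearest follows; `float_to_int` is a hypothetical helper, not a NumSharp function.

```python
import math


def float_to_int(x: float) -> int:
    """Drop the fractional part, moving toward zero (not to the nearest integer)."""
    return math.trunc(x)


for x in (2.7, -2.7, 1.5):
    print(x, "truncates to", float_to_int(x), "and rounds to", round(x))
```

Here 2.7 truncates to 2 (rounding gives 3) and -2.7 truncates to -2 (rounding gives -3).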
| 162 | + |
| 163 | +### Return Type Fixes |
| 164 | +| Function | Before | After | |
| 165 | +|----------|--------|-------| |
| 166 | +| `np.argmax()` / `np.argmin()` | Returned `int` | Returns `long` (large array support) | |
| 167 | +| `np.abs()` | Converted to Double | Preserves input dtype | |
| 168 | + |
| 169 | +### Empty Array Handling |
| 170 | +| Function | Before | After | |
| 171 | +|----------|--------|-------| |
| 172 | +| `np.mean([])` | Threw or returned 0 | Returns `NaN` | |
| 173 | +| `np.mean(zeros((0,3)), axis=0)` | Incorrect | `[NaN, NaN, NaN]` | |
| 174 | +| `np.mean(zeros((0,3)), axis=1)` | Incorrect | Empty array `[]` | |
| 175 | +| `np.std/var` single element | Returned 0 | Returns `NaN` when `ddof >= size` | |
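
The two `zeros((0,3))` rows follow from which axis is reduced: reducing axis 0 averages zero rows for each of the 3 columns (0/0, hence NaN), while reducing axis 1 produces one value per row, and there are no rows. A minimal pure-Python sketch of that reasoning, with `mean_axis` as a hypothetical helper:

```python
import math


def mean_axis(rows, shape, axis):
    """Mean of a 2-D array (given as a list of rows plus its shape) along one axis."""
    n_rows, n_cols = shape
    if axis == 0:
        # One result per column; averaging zero rows is 0/0, which is NaN.
        return [
            sum(r[c] for r in rows) / n_rows if n_rows else math.nan
            for c in range(n_cols)
        ]
    # axis == 1: one result per row, so zero rows means zero results.
    return [sum(r) / n_cols if n_cols else math.nan for r in rows]


empty = []                                  # stands in for zeros((0, 3))
print(mean_axis(empty, (0, 3), axis=0))     # [nan, nan, nan]
print(mean_axis(empty, (0, 3), axis=1))     # []
```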
| 176 | + |
| 177 | +### keepdims Fixes |
| 178 | +All reduction functions now properly preserve dimensions when `keepdims=True`: |
| 179 | +- `np.sum`, `np.prod`, `np.mean`, `np.std`, `np.var` |
| 180 | +- `np.min`, `np.max`, `np.argmin`, `np.argmax` |
| 181 | + |
| 182 | +--- |
| 183 | + |
| 184 | +## Operator Rewrites |
| 185 | + |
| 186 | +### Comparison Operators (==, !=, <, >, <=, >=) |
| 187 | +- **Before**: Manual type switch per dtype |
| 188 | +- **After**: Uses `TensorEngine` with IL kernels |
| 189 | +- Proper null handling (returns `false` scalar) |
| 190 | +- Empty array handling (returns empty bool array) |
| 191 | +- Added reverse operators (`object op NDArray`) |
| 192 | +- Full broadcasting support |
| 193 | + |
| 194 | +### Bitwise Operators (&, |, ^) |
| 195 | +- **Before**: Returned `null` |
| 196 | +- **After**: Full implementation via IL kernels |
| 197 | +- Added `NDArray<T>` typed operators |
| 198 | +- Scalar overloads for all integer types |
| 199 | + |
| 200 | +### Implicit Scalar Conversion |
| 201 | +- **Before**: `(int)ndarray_float64` would fail |
| 202 | +- **After**: Uses `Converts.ChangeType` for cross-dtype conversion |
| 203 | + |
| 204 | +--- |
| 205 | + |
| 206 | +## Boolean Indexing Rewrite |
| 207 | + |
| 208 | +Complete rewrite with NumPy-aligned behavior: |
| 209 | + |
| 210 | +### Two Cases Supported |
| 211 | +1. `arr[mask]` where `mask.shape == arr.shape` → element-wise selection |
| 212 | +2. `arr[mask]` where `mask` is 1D and `mask.shape[0] == arr.shape[0]` → axis-0 selection |
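
The two cases differ in what gets selected: a same-shape mask picks individual elements and yields a flat result, while a 1-D mask picks whole slices along axis 0. The sketch below models both on nested Python lists; `boolean_index` is a hypothetical helper for illustration, not the NumSharp indexer.

```python
def boolean_index(arr, mask):
    """Select from a 2-D list `arr` with either a same-shape mask or a 1-D row mask."""
    if mask and isinstance(mask[0], list):
        # Case 1: mask.shape == arr.shape -> flat list of the selected elements
        return [v for row, mrow in zip(arr, mask) for v, keep in zip(row, mrow) if keep]
    # Case 2: 1-D mask over axis 0 -> keep whole rows
    if len(mask) != len(arr):
        raise IndexError("mask length must match arr.shape[0]")
    return [row for row, keep in zip(arr, mask) if keep]


arr = [[1, 2], [3, 4], [5, 6]]
print(boolean_index(arr, [[True, False], [False, True], [True, True]]))  # [1, 4, 5, 6]
print(boolean_index(arr, [True, False, True]))                           # [[1, 2], [5, 6]]
```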
| 213 | + |
| 214 | +### SIMD Fast Path |
| 215 | +- New `BooleanMaskFastPath` for contiguous arrays |
| 216 | +- `CountTrue(bool*, int)` - SIMD count of true values |
| 217 | +- `CopyMasked<T>(src, mask, dest, size)` - SIMD masked copy |
| 218 | + |
| 219 | +--- |
| 220 | + |
| 221 | +## Slicing Improvements |
| 222 | + |
| 223 | +### Broadcast Array Handling |
| 224 | +- **Before**: Slicing broadcast arrays would materialize data (losing stride=0) |
| 225 | +- **After**: Preserves stride=0 information (NumPy behavior) |
| 226 | +- Critical for `cumsum` and axis reductions on broadcast arrays |
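
To make "preserves stride=0" concrete, the sketch below models a strided 2-D view in Python. `View` and `slice_rows` are hypothetical, and strides are counted in elements rather than bytes. Slicing only adjusts the offset and shape, so a broadcast axis keeps its zero stride and the underlying buffer is not copied.

```python
from dataclasses import dataclass


@dataclass
class View:
    """Hypothetical strided view: element (i, j) lives at offset + i*s0 + j*s1."""
    data: list
    shape: tuple
    strides: tuple
    offset: int = 0

    def get(self, i, j):
        return self.data[self.offset + i * self.strides[0] + j * self.strides[1]]

    def slice_rows(self, start, stop):
        # Slicing only moves the offset and shrinks the shape; the strides,
        # including a broadcast stride of 0, are carried over and no data is copied.
        return View(self.data, (stop - start, self.shape[1]),
                    self.strides, self.offset + start * self.strides[0])


row = [10, 20, 30]
broadcast = View(row, shape=(4, 3), strides=(0, 1))   # one row seen as 4 rows
sliced = broadcast.slice_rows(1, 3)

print(sliced.shape, sliced.strides)   # (2, 3) (0, 1)
print(sliced.get(1, 2))               # 30
print(sliced.data is row)             # True
```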
| 227 | + |
| 228 | +### Empty Slice Handling |
| 229 | +- `a[100:200]` on a 10-element array now returns an empty array |
| 230 | + |
| 231 | +### Contiguous Optimization |
| 232 | +- Contiguous slices get fresh shape with `offset=0` |
| 233 | +- `IsSliced=false` for contiguous slices |
| 234 | + |
| 235 | +--- |
| 236 | + |
| 237 | +## Performance Improvements |
| 238 | + |
| 239 | +| Operation | Improvement | Details | |
| 240 | +|-----------|-------------|---------| |
| 241 | +| MatMul (2D) | 35-100x | Cache-blocked SIMD, 20+ GFLOPS | |
| 242 | +| Axis Reductions | Major | AVX2 gather + parallel outer loop | |
| 243 | +| All/Any | Major | SIMD with early-exit | |
| 244 | +| CumSum/CumProd | Major | Element-wise SIMD | |
| 245 | +| Boolean Masking | Major | SIMD CountTrue + CopyMasked | |
| 246 | +| Integer Abs/Sign | Minor | Bitwise (branchless) | |
| 247 | +| Vector512 | New | Runtime detection and utilization | |
| 248 | +| Loop Unrolling | 4x | All SIMD kernels | |
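
For readers unfamiliar with cache blocking, the idea is to compute the product one tile at a time so the operands of the inner loops stay cache-resident. The pure-Python sketch below shows only the loop structure. It is not the NumSharp kernel, has no SIMD, and the tile size and helper names are illustrative assumptions. The `gflops` helper uses the standard estimate of `2*m*n*k` floating-point operations for an (m x k) by (k x n) product.

```python
def matmul_blocked(a, b, block=2):
    """C = A @ B computed tile by tile; `block` is an illustrative tile size."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for k0 in range(0, k, block):
            for j0 in range(0, n, block):
                # Work on one tile so its operands stay cache-resident.
                for i in range(i0, min(i0 + block, m)):
                    for kk in range(k0, min(k0 + block, k)):
                        a_ik = a[i][kk]
                        for j in range(j0, min(j0 + block, n)):
                            c[i][j] += a_ik * b[kk][j]
    return c


def gflops(m, n, k, seconds):
    """An (m x k) by (k x n) product costs about 2*m*n*k floating-point operations."""
    return 2 * m * n * k / seconds / 1e9
```

As a purely hypothetical calibration, a 1000 x 1000 x 1000 product finishing in 0.1 s would correspond to about 20 GFLOPS under this estimate.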
| 249 | + |
| 250 | +--- |
| 251 | + |
| 252 | +## Code Reduction |
| 253 | + |
| 254 | +### Massive File Deletions |
| 255 | +| Component | Before | After | Reduction | |
| 256 | +|-----------|--------|-------|-----------| |
| 257 | +| Binary ops (Add/Sub/Mul/Div/Mod) | 60 files, ~500K lines | 2 IL files | **99%** | |
| 258 | +| `Default.MatMul.2D2D.cs` | ~20K lines | 325 lines | **98.4%** | |
| 259 | +| `Default.Dot.NDMD.cs` | ~16K lines | 422 lines | **97.4%** | |
| 260 | +| Comparison ops (Equals) | 13 files | 1 IL file | **92%** | |
| 261 | +| Std/Var reductions | ~20K lines | ~500 lines | **97%** | |
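
The percentages in the table can be reproduced from the before and after counts. A short Python check follows, using the approximate ("~") figures exactly as listed; the binary-ops row is left out because its "after" size is given in files rather than lines.

```python
def reduction(before, after):
    """Fraction of code removed when `before` units shrink to `after` units."""
    return 1 - after / before


# (before, after) pairs copied from the table; the "~" values are approximate.
rows = {
    "Default.MatMul.2D2D.cs (lines)": (20_000, 325),
    "Default.Dot.NDMD.cs (lines)": (16_000, 422),
    "Comparison ops (files)": (13, 1),
    "Std/Var reductions (lines)": (20_000, 500),
}

for name, (before, after) in rows.items():
    print(f"{name}: {reduction(before, after):.1%}")
```

This gives 98.4%, 97.4%, 92.3%, and 97.5%, which agree with the table once rounded to the precision shown there.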
| 262 | + |
| 263 | +### Deleted Files (76) |
| 264 | +- 60 binary op files (`Default.Add.{Type}.cs`, etc.) |
| 265 | +- 13 comparison files (`Default.Equals.{Type}.cs`, etc.) |
| 266 | +- 3 template files |
| 267 | + |
| 268 | +--- |
| 269 | + |
| 270 | +## Infrastructure Changes |
| 271 | + |
| 272 | +### Memory Allocation |
| 273 | +- `Marshal.AllocHGlobal` → `NativeMemory.Alloc` |
| 274 | +- `Marshal.FreeHGlobal` → `NativeMemory.Free` |
| 275 | +- `AllocationType.AllocHGlobal` → `AllocationType.Native` |
| 276 | +- `StackedMemoryPool` migrated to NativeMemory |
| 277 | + |
| 278 | +### DefaultEngine |
| 279 | +- Removed `ParallelAbove = 84999` constant |
| 280 | +- Added `KernelProvider` instance field |
| 281 | +- Added static `DefaultKernelProvider` for code without engine access |
| 282 | +- Removed all `Parallel.For` usage (single-threaded for determinism) |
| 283 | + |
| 284 | +### Math Functions |
| 285 | +All migrated from Regen templates to `ExecuteUnaryOp`: |
| 286 | +- Sin, Cos, Tan, ASin, ACos, ATan, ATan2 |
| 287 | +- Exp, Exp2, Expm1, Log, Log2, Log10, Log1p |
| 288 | +- Sqrt, Cbrt, Abs, Sign, Floor, Ceil, Truncate |
| 289 | +- Removed `DecimalMath` dependency for most operations |
| 290 | + |
| 291 | +### TensorEngine Extensions |
| 292 | +New abstract methods: |
| 293 | +- `NotEqual`, `Less`, `LessEqual`, `Greater`, `GreaterEqual` |
| 294 | +- `BitwiseAnd`, `BitwiseOr`, `BitwiseXor` |
| 295 | +- `LeftShift`, `RightShift` |
| 296 | +- `Power(NDArray, NDArray)`, `FloorDivide` |
| 297 | +- `Truncate`, `Reciprocal`, `Square`, `Cbrt`, `Invert` |
| 298 | +- `Deg2Rad`, `Rad2Deg`, `IsInf` |
| 299 | +- `ReduceCumMul` |
| 300 | + |
| 301 | +### IKernelProvider Methods |
| 302 | +- `CountTrue(bool*, int)` - SIMD true count |
| 303 | +- `CopyMasked<T>` - SIMD masked copy |
| 304 | +- `Variance<T>`, `StandardDeviation<T>` - SIMD two-pass |
| 305 | +- `NanSum/Prod/Min/Max` for float/double |
| 306 | +- `FindNonZeroStrided<T>` - Strided nonzero detection |
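
"Two-pass" here refers to computing the mean first and the squared deviations second. A scalar Python sketch of that scheme is shown below, with SIMD omitted and function names chosen for illustration. The NaN result for `ddof >= size` follows the rule stated in the Empty Array Handling table; other edge cases are not modelled.

```python
import math


def variance(values, ddof=0):
    """Two-pass variance: pass 1 finds the mean, pass 2 sums squared deviations."""
    n = len(values)
    if n - ddof <= 0:
        return math.nan                     # ddof >= size: no degrees of freedom left
    mean = math.fsum(values) / n            # pass 1
    return math.fsum((v - mean) ** 2 for v in values) / (n - ddof)   # pass 2


def std(values, ddof=0):
    """Standard deviation as the square root of the two-pass variance."""
    return math.sqrt(variance(values, ddof))
```

For example, `variance([1, 2, 3, 4])` is 1.25, and `variance([5.0], ddof=1)` is NaN.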
| 307 | + |
| 308 | +--- |
| 309 | + |
| 310 | +## API Fixes |
| 311 | + |
| 312 | +| Change | Details | |
| 313 | +|--------|---------| |
| 314 | +| `np.random.random()` | New alias for `random_sample()` | |
| 315 | +| `stardard_normal` | Fixed typo → `standard_normal` (old name deprecated) | |
| 316 | +| `outType` → `dtype` | Parameter rename in `minimum/maximum/fmin/fmax` | |
| 317 | +| `np.modf()` | Now validates floating-point input types | |
| 318 | + |
| 319 | +--- |
| 320 | + |
| 321 | +## New Test Files (64) |
| 322 | + |
| 323 | +### Kernel Tests (34) |
| 324 | +`BinaryOpTests`, `UnaryOpTests`, `ComparisonOpTests`, `ReductionOpTests`, `AxisReductionSimdTests`, `NonContiguousTests`, `SlicedArrayOpTests`, `NanReductionTests`, `VarStdComprehensiveTests`, `ArgMaxArgMinComprehensiveTests`, `CumSumComprehensiveTests`, `BitwiseOpTests`, `ShiftOpTests`, `DtypeCoverageTests`, `DtypePromotionTests`, `EdgeCaseTests`, `BattleProofTests`, `SimdOptimizationTests`, and more. |
| 325 | + |
| 326 | +### NumPy Ported Tests (8) |
| 327 | +`ArgMaxArgMinEdgeCaseTests`, `ClipEdgeCaseTests`, `ClipNDArrayTests`, `CumSumEdgeCaseTests`, `ModfEdgeCaseTests`, `NonzeroEdgeCaseTests`, `PowerEdgeCaseTests`, `VarStdEdgeCaseTests` |
| 328 | + |
| 329 | +### Linear Algebra Battle Tests (3) |
| 330 | +`np.dot.BattleTest`, `np.matmul.BattleTest`, `np.outer.BattleTest` |
| 331 | + |
| 332 | +--- |
| 333 | + |
| 334 | +## Breaking Changes |
| 335 | + |
| 336 | +**None.** This is a drop-in replacement with improved performance and NumPy compatibility. |
| 337 | + |
| 338 | +--- |
| 339 | + |
| 340 | +## Known Issues (OpenBugs) |
| 341 | + |
| 342 | +52 tests marked as `[OpenBugs]` are excluded from CI: |
| 343 | +- sbyte (int8) type not supported |
| 344 | +- Some bitmap operations require GDI+ (Windows only) |
| 345 | +- Various edge cases documented in test files |
| 346 | + |
| 347 | +--- |
| 348 | + |
| 349 | +## Installation |
| 350 | + |
| 351 | +```bash |
| 352 | +dotnet add package NumSharp --version 0.41.0-prerelease |
| 353 | +``` |
| 354 | + |
| 355 | +Or via Package Manager: |
| 356 | +```powershell |
| 357 | +Install-Package NumSharp -Version 0.41.0-prerelease |
| 358 | +``` |
| 359 | + |
| 360 | +## Testing |
| 361 | + |
| 362 | +```bash |
| 363 | +cd test/NumSharp.UnitTest |
| 364 | + |
| 365 | +# Run tests excluding known issues |
| 366 | +dotnet test -- "--treenode-filter=/*/*/*/*[Category!=OpenBugs]" |
| 367 | + |
| 368 | +# Run all tests |
| 369 | +dotnet test |
| 370 | +``` |
| 371 | + |
| 372 | +--- |
| 373 | + |
| 374 | +## Feedback |
| 375 | + |
| 376 | +This is a prerelease. Please report any issues at: |
| 377 | +https://github.com/SciSharp/NumSharp/issues |
| 378 | + |
| 379 | +--- |
| 380 | + |
| 381 | +**Full Changelog**: See [CHANGES.md](./CHANGES.md) for complete documentation of all 80 commits. |