fix(sum): use pairwise summation for floating-point linalg::Sum#849
fix(sum): use pairwise summation for floating-point linalg::Sum#849IvanaGyro wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request replaces naive summation loops in Sum_internal.cpp with a recursive pairwise summation algorithm (PairwiseSum) implemented in pairwise_sum.hpp to improve floating-point accuracy, and enables the corresponding accuracy tests. The review feedback highlights a critical safety issue in StrideView::Iterator where storing a pointer to the view can lead to a dangling pointer, and recommends storing the underlying range's iterator instead, along with updating the begin() and end() methods.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #849 +/- ##
==========================================
+ Coverage 29.49% 29.68% +0.19%
==========================================
Files 241 242 +1
Lines 35524 35544 +20
Branches 14780 14781 +1
==========================================
+ Hits 10477 10551 +74
+ Misses 17791 17737 -54
Partials 7256 7256
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Harness.
🚀 New features to boost your workflow:
|
e355381 to
4c76053
Compare
IvanaGyro
left a comment
There was a problem hiding this comment.
(Review written by Claude on behalf of @IvanaGyro.)
The pairwise-summation core looks correct to me. PairwiseSumBlocks covers every element exactly once across the three regimes (<8 straight loop, the 8-accumulator block for [8,128] with its scalar tail, and the n>128 split where half is rounded down to a multiple of 8 and is always ≥64), and the reduction order matches NumPy's np.add.reduce. Empty ranges return T(0), preserving the old behavior. The Sum_internal rewrite and the test re-enable are clean.
One additive note on top of the existing comments — see the inline comment about StrideView being unexercised in this PR.
|
(Comment written by Claude on behalf of @IvanaGyro.) Test-coverage note (in addition to the |
4c76053 to
38fdf4f
Compare
IvanaGyro
left a comment
There was a problem hiding this comment.
(Review written by Claude on behalf of @IvanaGyro.)
Thanks for addressing the earlier round — stride_view/iterator naming, the dangling-pointer fix (iterator now stores the base iterator, with a dedicated IteratorsOutliveTheView test), the move to its own stride_view.hpp, the | stride(k) pipe adaptor, and the static_asserts on the range concepts all look good, and the fix is correct. One blocking issue: the new test file isn't registered in the build — details inline.
38fdf4f to
3a5a680
Compare
PR #849 Review:
|
IvanaGyro
left a comment
There was a problem hiding this comment.
(Review written by Claude on behalf of @IvanaGyro.)
The switch to NumPy-style pairwise summation is a clean, well-documented improvement, and re-enabling the previously DISABLED_Accuracy test is the right move. The implementation looks correct: the empty range returns T(0), the <8 / <=128 / recursive-split branches are well-formed, and the powers-of-two block structure is what lets the homogeneous-value test now pass exactly. Two test-coverage gaps below; no blocking concerns with the production code itself.
linalg::Sum accumulated every element into one scalar in a serial loop, whose worst-case rounding error grows as O(N * eps). The Accuracy test in tests/linalg_test/sum_test.cpp (10000 identical floating-point values whose bit patterns are chosen to expose round-off) failed under naive accumulation and was disabled. Add a pairwise-summation primitive (pairwise_sum.hpp): PairwiseSum mirrors NumPy's np.add.reduce -- a straight loop below 8 elements, an eight-accumulator unrolled block up to 128, and a halve-and-recurse split rounded to a multiple of eight above that. Worst-case error growth drops to O(log N * eps) at essentially the same cost. PairwiseSum consumes any random-access range, so it also serves strided reductions later. Rewrite the Float/Double/ComplexFloat/ComplexDouble branches of Sum_internal to call PairwiseSum over the contiguous storage span and re-enable the accuracy test. Integer branches keep the straight loop since they do not round. The GPU path (cuReduce_gpu) already performs a block/tree reduction with bounded error, so it is left unchanged. For the disabled test's values the new code lands within 2 ULP of the reference, against the 4-ULP tolerance of EXPECT_NUMBER_EQ; naive accumulation was 108 ULP off. Co-authored-by: Claude <noreply@anthropic.com>
The existing LinalgSumHomogeneousValuesTest.Accuracy uses element_number = 10000, which only reaches the recursive-split branch of PairwiseSumBlocks, and the equal-magnitude input distribution is the one regime where naive serial summation stays nearly exact -- neither demonstrates nor protects the actual accuracy benefit pairwise summation buys. Add two focused tests: * LinalgSumBoundaryTest.EachPairwiseSumBranch iterates over sizes 1, 7, 8, 9, 15, 128, 129, 137, 1024 and verifies linalg::Sum on a homogeneous-1.0 tensor is exactly N. The sizes straddle the n < 8 / n <= 128 / n > 128 thresholds and include the n % 8 != 0 tails inside the 8-accumulator block, so a regression that drops or double-counts an element in any branch fails immediately rather than only on the n = 10000 recursive case. * LinalgSumHeterogeneousMagnitudeTest.RecoversTermsLostByNaiveAccumulation_* sums [+L, 1, 1, ..., 1, -L] where L is far above the precision threshold (1e16 for double > 2^53, 1e8 for float > 2^23). The exact sum is N - 2. Under naive accumulation the running total reaches L on the first element and every subsequent +1 vanishes in IEEE 754 rounding (1.0 is below the unit in the last place at that magnitude), so naive collapses to ~0. Pairwise keeps small values together in the tree until the +/- L cancellation at the top, so most of the small terms survive. The contract asserted is N/2 < result <= N: tight enough that a serial-accumulation regression collapses to ~0 and fails, loose enough to accommodate the handful of 1's that the unrolled accumulator holding L still rounds away. Co-authored-by: Claude <noreply@anthropic.com>
ed5a88b to
e7a9b3e
Compare
Fixes #835.
Problem
linalg::Sumaccumulated every element into one scalar in a serial loop, whose worst-case rounding error grows asO(N · eps). TheAccuracytest intests/linalg_test/sum_test.cpp(10000 identical floating-point values whose bit patterns are chosen to expose round-off) failed under naive accumulation and was therefore disabled.Change
src/backend/linalg_internal_cpu/pairwise_sum.hpp:StrideView— a C++20 random-access, sized view over every k-th element of a range.PairwiseSum— NumPynp.add.reduce-style summation: straight loop below 8 elements, an 8-accumulator unrolled block up to 128, and a halve-and-recurse split (rounded to a multiple of 8) above that. Worst-case error growth drops toO(log N · eps)at essentially the same cost.Float/Double/ComplexFloat/ComplexDoublebranches ofSum_internalto usePairwiseSum. Integer branches keep the straight loop (they do not round).DISABLED_Accuracy→Accuracy).cuReduce_gpu) already performs a block/tree reduction with bounded error, so it is unchanged and itsGpuAccuracytest stays enabled.StrideViewis intentionally reusable for strided reductions (e.g. a matrix diagonal); the follow-up Trace refactor (#834) builds on it.Test plan
openblas-cpu: full suite 948/948 passed (includes the 10 re-enabledLinalgSumHomogeneousValuesTest.Accuracy<...>cases).mkl-cpu: full suite 948/948 passed.EXPECT_NUMBER_EQtolerates 4 ULP); naive accumulation was 108 ULP off.🤖 Draft — opened for review.
Generated by Claude Code