Skip to content

fix(sum): use pairwise summation for floating-point linalg::Sum#849

Open
IvanaGyro wants to merge 2 commits into
masterfrom
claude/835-pairwise-sum
Open

fix(sum): use pairwise summation for floating-point linalg::Sum#849
IvanaGyro wants to merge 2 commits into
masterfrom
claude/835-pairwise-sum

Conversation

@IvanaGyro

Copy link
Copy Markdown
Collaborator

Fixes #835.

Problem

linalg::Sum accumulated every element into one scalar in a serial loop, whose worst-case rounding error grows as O(N · eps). The Accuracy test in tests/linalg_test/sum_test.cpp (10000 identical floating-point values whose bit patterns are chosen to expose round-off) failed under naive accumulation and was therefore disabled.

Change

  • Add a strided pairwise-summation primitive src/backend/linalg_internal_cpu/pairwise_sum.hpp:
    • StrideView — a C++20 random-access, sized view over every k-th element of a range.
    • PairwiseSum — NumPy np.add.reduce-style summation: straight loop below 8 elements, an 8-accumulator unrolled block up to 128, and a halve-and-recurse split (rounded to a multiple of 8) above that. Worst-case error growth drops to O(log N · eps) at essentially the same cost.
  • Rewrite the Float/Double/ComplexFloat/ComplexDouble branches of Sum_internal to use PairwiseSum. Integer branches keep the straight loop (they do not round).
  • Re-enable the accuracy test (DISABLED_AccuracyAccuracy).
  • The GPU path (cuReduce_gpu) already performs a block/tree reduction with bounded error, so it is unchanged and its GpuAccuracy test stays enabled.

StrideView is intentionally reusable for strided reductions (e.g. a matrix diagonal); the follow-up Trace refactor (#834) builds on it.

Test plan

  • openblas-cpu: full suite 948/948 passed (includes the 10 re-enabled LinalgSumHomogeneousValuesTest.Accuracy<...> cases).
  • mkl-cpu: full suite 948/948 passed.
  • For the previously-disabled values the new code lands within 2 ULP of the reference (EXPECT_NUMBER_EQ tolerates 4 ULP); naive accumulation was 108 ULP off.
  • GPU CI (unchanged GPU path).

🤖 Draft — opened for review.


Generated by Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces naive summation loops in Sum_internal.cpp with a recursive pairwise summation algorithm (PairwiseSum) implemented in pairwise_sum.hpp to improve floating-point accuracy, and enables the corresponding accuracy tests. The review feedback highlights a critical safety issue in StrideView::Iterator where storing a pointer to the view can lead to a dangling pointer, and recommends storing the underlying range's iterator instead, along with updating the begin() and end() methods.

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated
Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated
@codecov

codecov Bot commented May 29, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.68%. Comparing base (87c9fd5) to head (e7a9b3e).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #849      +/-   ##
==========================================
+ Coverage   29.49%   29.68%   +0.19%     
==========================================
  Files         241      242       +1     
  Lines       35524    35544      +20     
  Branches    14780    14781       +1     
==========================================
+ Hits        10477    10551      +74     
+ Misses      17791    17737      -54     
  Partials     7256     7256              
Flag Coverage Δ
cpp 29.28% <100.00%> (+0.19%) ⬆️
python 52.71% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
C++ backend 30.96% <100.00%> (+0.22%) ⬆️
Python bindings 17.09% <ø> (ø)
Python package 52.71% <ø> (ø)
Files with missing lines Coverage Δ
src/backend/linalg_internal_cpu/Sum_internal.cpp 96.29% <100.00%> (+86.61%) ⬆️
src/backend/linalg_internal_cpu/pairwise_sum.hpp 100.00% <100.00%> (ø)

Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 87c9fd5...e7a9b3e. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated
Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated
@IvanaGyro IvanaGyro force-pushed the claude/835-pairwise-sum branch from e355381 to 4c76053 Compare May 29, 2026 07:53

@IvanaGyro IvanaGyro left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Review written by Claude on behalf of @IvanaGyro.)

The pairwise-summation core looks correct to me. PairwiseSumBlocks covers every element exactly once across the three regimes (<8 straight loop, the 8-accumulator block for [8,128] with its scalar tail, and the n>128 split where half is rounded down to a multiple of 8 and is always ≥64), and the reduction order matches NumPy's np.add.reduce. Empty ranges return T(0), preserving the old behavior. The Sum_internal rewrite and the test re-enable are clean.

One additive note on top of the existing comments — see the inline comment about StrideView being unexercised in this PR.

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated
@IvanaGyro

Copy link
Copy Markdown
Collaborator Author

(Comment written by Claude on behalf of @IvanaGyro.)

Test-coverage note (in addition to the StrideView point in the review): the re-enabled Accuracy test sums 10000 identical values. That exercises the recursive split, but not the block-size boundaries nor mixed-magnitude inputs — which is exactly where pairwise summation diverges from naive accumulation. Consider a small direct PairwiseSum unit test covering n = 0, 1, 7, 8, 9, 128, 129 with a mix of large/small/negative values, compared against a higher-precision (e.g. long double / Kahan) reference, so the 8-accumulator block edge and the halve-and-recurse boundary are pinned and can't silently regress.

@IvanaGyro IvanaGyro left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Review written by Claude on behalf of @IvanaGyro.)

Thanks for addressing the earlier round — stride_view/iterator naming, the dangling-pointer fix (iterator now stores the base iterator, with a dedicated IteratorsOutliveTheView test), the move to its own stride_view.hpp, the | stride(k) pipe adaptor, and the static_asserts on the range concepts all look good, and the fix is correct. One blocking issue: the new test file isn't registered in the build — details inline.

Comment thread tests/linalg_test/stride_view_test.cpp Outdated
Comment thread src/backend/linalg_internal_cpu/stride_view.hpp Outdated
@IvanaGyro IvanaGyro force-pushed the claude/835-pairwise-sum branch from 38fdf4f to 3a5a680 Compare May 30, 2026 09:57
@pcchen pcchen added this to the v1.1.0 milestone May 31, 2026
@pcchen

pcchen commented May 31, 2026

Copy link
Copy Markdown
Collaborator

PR #849 Review: fix(sum) — pairwise summation for floating-point linalg::Sum

Posted by Claude Code on behalf of @pcchen

Overview

This PR replaces the naive serial accumulation loop in Sum_internal (float/double/complexfloat/complexdouble) with a pairwise summation algorithm, reducing worst-case rounding error from O(N·eps) to O(log N·eps). The re-enabled Accuracy test confirms the fix works in practice (within 2 ULP vs. 108 ULP for naive).

The algorithm mirrors NumPy's np.add.reduce:

  • n < 8: straight loop
  • 8 ≤ n ≤ 128: 8-accumulator unrolled block + tail
  • n > 128: split at n/2 rounded down to a multiple of 8, recurse

Correctness

The algorithm is correct and the boundary cases are handled:

  • n = 0: falls into n < 8 path, loop doesn't execute, returns T(0). ✅
  • n = 8: initialises r0..r7, inner loop skipped (i+8 > n), tail skipped, returns tree sum of 8 accumulators. ✅
  • n = 9: inner loop skipped, tail adds first[8] to the tree sum. ✅
  • Complex types: T(0) constructs {0.0, 0.0} for std::complex<T>, all arithmetic works element-wise. ✅
  • Half-rounding can't produce zero: for n > 128, half = n/2 ≥ 64, and half - half%8 ≥ 56 > 0, so the recursion always terminates. ✅
  • Recursion depth: O(log₂(N/128)) — for N = 1 billion, depth ≈ 23, no stack concern. ✅

Issues

1. C++20 requirement needs explicit confirmation (Important)

pairwise_sum.hpp requires C++20:

  • #include <ranges> — C++20
  • std::ranges::random_access_range, std::ranges::range_value_t, std::ranges::begin, std::ranges::size — C++20 concepts/functions
  • std::random_access_iterator — C++20 concept
  • std::span in Sum_internal.cpp — C++20

The PR passes the full test suite, so the build system already compiles with C++20. But if CMakeLists.txt hasn't been updated to require C++20 (e.g., set(CMAKE_CXX_STANDARD 20)), a clean build on a C++17-defaulting toolchain will fail. Please confirm CMAKE_CXX_STANDARD is already set to ≥ 20, or update it in this PR.

2. Comment references non-existent stride_view.hpp (Minor)

// pass a strided view (see stride_view.hpp) to sum a strided sequence such as a matrix diagonal.

stride_view.hpp does not exist yet — it's deferred to the follow-up Trace PR (#834). A reader encountering this comment today will search for a file that isn't there. Change to:

// pass a strided view (see issue #834) to sum a strided sequence such as a matrix diagonal.

or remove the parenthetical until #834 lands.

3. Missing <complex> include in pairwise_sum.hpp (Minor)

T(0) for complex types requires std::complex to be visible. Currently this is satisfied transitively via the callers, but pairwise_sum.hpp should be self-contained. Add #include <complex> to the header.

4. Why half must be a multiple of 8 — comment is missing (Minor)

std::size_t half = n / 2;
half -= half % 8;

This alignment ensures child calls can enter the unrolled path, but there's no comment explaining it. A brief note like // align to block size so the child call enters the unrolled path would help future readers.


Strengths

  • Clean, self-contained header with a clear algorithm structure.
  • Handles empty, small, medium, and large inputs correctly.
  • The PairwiseSum wrapper accepts any random-access range, enabling future reuse for strided reductions without changing the core algorithm.
  • Re-enabling the Accuracy test (rather than just removing DISABLED_) is the right approach — it now serves as a permanent regression guard.
  • Integer types correctly keep the straight loop (no rounding needed).
  • GPU path is correctly left unchanged.

Recommendation

Approve after addressing:

  1. Confirm (or add) CMAKE_CXX_STANDARD 20 in CMakeLists.txt.
  2. Fix the dangling stride_view.hpp reference in the comment.
  3. (Optional) Add #include <complex> to pairwise_sum.hpp and a comment on the half % 8 alignment.

@IvanaGyro IvanaGyro left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Review written by Claude on behalf of @IvanaGyro.)

The switch to NumPy-style pairwise summation is a clean, well-documented improvement, and re-enabling the previously DISABLED_Accuracy test is the right move. The implementation looks correct: the empty range returns T(0), the <8 / <=128 / recursive-split branches are well-formed, and the powers-of-two block structure is what lets the homogeneous-value test now pass exactly. Two test-coverage gaps below; no blocking concerns with the production code itself.

Comment thread tests/linalg_test/sum_test.cpp
Comment thread tests/linalg_test/sum_test.cpp
@IvanaGyro IvanaGyro marked this pull request as ready for review June 9, 2026 17:31
@IvanaGyro IvanaGyro requested a review from pcchen June 9, 2026 17:31
IvanaGyro and others added 2 commits June 10, 2026 17:02
linalg::Sum accumulated every element into one scalar in a serial loop,
whose worst-case rounding error grows as O(N * eps). The Accuracy test in
tests/linalg_test/sum_test.cpp (10000 identical floating-point values whose
bit patterns are chosen to expose round-off) failed under naive accumulation
and was disabled.

Add a pairwise-summation primitive (pairwise_sum.hpp): PairwiseSum mirrors
NumPy's np.add.reduce -- a straight loop below 8 elements, an eight-accumulator
unrolled block up to 128, and a halve-and-recurse split rounded to a multiple
of eight above that. Worst-case error growth drops to O(log N * eps) at
essentially the same cost. PairwiseSum consumes any random-access range, so it
also serves strided reductions later.

Rewrite the Float/Double/ComplexFloat/ComplexDouble branches of Sum_internal to
call PairwiseSum over the contiguous storage span and re-enable the accuracy
test. Integer branches keep the straight loop since they do not round. The GPU
path (cuReduce_gpu) already performs a block/tree reduction with bounded error,
so it is left unchanged.

For the disabled test's values the new code lands within 2 ULP of the
reference, against the 4-ULP tolerance of EXPECT_NUMBER_EQ; naive accumulation
was 108 ULP off.

Co-authored-by: Claude <noreply@anthropic.com>
The existing LinalgSumHomogeneousValuesTest.Accuracy uses element_number =
10000, which only reaches the recursive-split branch of PairwiseSumBlocks, and
the equal-magnitude input distribution is the one regime where naive serial
summation stays nearly exact -- neither demonstrates nor protects the actual
accuracy benefit pairwise summation buys.

Add two focused tests:

* LinalgSumBoundaryTest.EachPairwiseSumBranch iterates over sizes 1, 7, 8, 9,
  15, 128, 129, 137, 1024 and verifies linalg::Sum on a homogeneous-1.0 tensor
  is exactly N. The sizes straddle the n < 8 / n <= 128 / n > 128 thresholds
  and include the n % 8 != 0 tails inside the 8-accumulator block, so a
  regression that drops or double-counts an element in any branch fails
  immediately rather than only on the n = 10000 recursive case.

* LinalgSumHeterogeneousMagnitudeTest.RecoversTermsLostByNaiveAccumulation_*
  sums [+L, 1, 1, ..., 1, -L] where L is far above the precision threshold
  (1e16 for double > 2^53, 1e8 for float > 2^23). The exact sum is N - 2.
  Under naive accumulation the running total reaches L on the first element
  and every subsequent +1 vanishes in IEEE 754 rounding (1.0 is below the unit
  in the last place at that magnitude), so naive collapses to ~0. Pairwise
  keeps small values together in the tree until the +/- L cancellation at the
  top, so most of the small terms survive. The contract asserted is N/2 <
  result <= N: tight enough that a serial-accumulation regression collapses to
  ~0 and fails, loose enough to accommodate the handful of 1's that the
  unrolled accumulator holding L still rounds away.

Co-authored-by: Claude <noreply@anthropic.com>
@IvanaGyro IvanaGyro force-pushed the claude/835-pairwise-sum branch from ed5a88b to e7a9b3e Compare June 10, 2026 23:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

linalg::Sum accumulates floating-point error; use pairwise summation like NumPy

2 participants