fix(sum): use pairwise summation for floating-point linalg::Sum by IvanaGyro · Pull Request #849 · Cytnx-dev/Cytnx

IvanaGyro · 2026-05-29T04:08:06Z

Fixes #835.

Problem

linalg::Sum accumulated every element into one scalar in a serial loop, whose worst-case rounding error grows as O(N · eps). The Accuracy test in tests/linalg_test/sum_test.cpp (10000 identical floating-point values whose bit patterns are chosen to expose round-off) failed under naive accumulation and was therefore disabled.

Change

Add a strided pairwise-summation primitive src/backend/linalg_internal_cpu/pairwise_sum.hpp:
- StrideView — a C++20 random-access, sized view over every k-th element of a range.
- PairwiseSum — NumPy np.add.reduce-style summation: straight loop below 8 elements, an 8-accumulator unrolled block up to 128, and a halve-and-recurse split (rounded to a multiple of 8) above that. Worst-case error growth drops to O(log N · eps) at essentially the same cost.
Rewrite the Float/Double/ComplexFloat/ComplexDouble branches of Sum_internal to use PairwiseSum. Integer branches keep the straight loop (they do not round).
Re-enable the accuracy test (DISABLED_Accuracy → Accuracy).
The GPU path (cuReduce_gpu) already performs a block/tree reduction with bounded error, so it is unchanged and its GpuAccuracy test stays enabled.

StrideView is intentionally reusable for strided reductions (e.g. a matrix diagonal); the follow-up Trace refactor (#834) builds on it.

Test plan

openblas-cpu: full suite 948/948 passed (includes the 10 re-enabled LinalgSumHomogeneousValuesTest.Accuracy<...> cases).
mkl-cpu: full suite 948/948 passed.
For the previously-disabled values the new code lands within 2 ULP of the reference (EXPECT_NUMBER_EQ tolerates 4 ULP); naive accumulation was 108 ULP off.
GPU CI (unchanged GPU path).

🤖 Draft — opened for review.

Generated by Claude Code

gemini-code-assist

Code Review

This pull request replaces naive summation loops in Sum_internal.cpp with a recursive pairwise summation algorithm (PairwiseSum) implemented in pairwise_sum.hpp to improve floating-point accuracy, and enables the corresponding accuracy tests. The review feedback highlights a critical safety issue in StrideView::Iterator where storing a pointer to the view can lead to a dangling pointer, and recommends storing the underlying range's iterator instead, along with updating the begin() and end() methods.

codecov · 2026-05-29T04:42:03Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.68%. Comparing base (87c9fd5) to head (e7a9b3e).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #849      +/-   ##
==========================================
+ Coverage   29.49%   29.68%   +0.19%     
==========================================
  Files         241      242       +1     
  Lines       35524    35544      +20     
  Branches    14780    14781       +1     
==========================================
+ Hits        10477    10551      +74     
+ Misses      17791    17737      -54     
  Partials     7256     7256

Flag	Coverage Δ
cpp	`29.28% <100.00%> (+0.19%)`	⬆️
python	`52.71% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
C++ backend	`30.96% <100.00%> (+0.22%)`	⬆️
Python bindings	`17.09% <ø> (ø)`
Python package	`52.71% <ø> (ø)`

Files with missing lines	Coverage Δ
src/backend/linalg_internal_cpu/Sum_internal.cpp	`96.29% <100.00%> (+86.61%)`	⬆️
src/backend/linalg_internal_cpu/pairwise_sum.hpp	`100.00% <100.00%> (ø)`

Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 87c9fd5...e7a9b3e. Read the comment docs.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

IvanaGyro

(Review written by Claude on behalf of @IvanaGyro.)

The pairwise-summation core looks correct to me. PairwiseSumBlocks covers every element exactly once across the three regimes (<8 straight loop, the 8-accumulator block for [8,128] with its scalar tail, and the n>128 split where half is rounded down to a multiple of 8 and is always ≥64), and the reduction order matches NumPy's np.add.reduce. Empty ranges return T(0), preserving the old behavior. The Sum_internal rewrite and the test re-enable are clean.

One additive note on top of the existing comments — see the inline comment about StrideView being unexercised in this PR.

IvanaGyro · 2026-05-29T12:03:26Z

(Comment written by Claude on behalf of @IvanaGyro.)

Test-coverage note (in addition to the StrideView point in the review): the re-enabled Accuracy test sums 10000 identical values. That exercises the recursive split, but not the block-size boundaries nor mixed-magnitude inputs — which is exactly where pairwise summation diverges from naive accumulation. Consider a small direct PairwiseSum unit test covering n = 0, 1, 7, 8, 9, 128, 129 with a mix of large/small/negative values, compared against a higher-precision (e.g. long double / Kahan) reference, so the 8-accumulator block edge and the halve-and-recurse boundary are pinned and can't silently regress.

IvanaGyro

(Review written by Claude on behalf of @IvanaGyro.)

Thanks for addressing the earlier round — stride_view/iterator naming, the dangling-pointer fix (iterator now stores the base iterator, with a dedicated IteratorsOutliveTheView test), the move to its own stride_view.hpp, the | stride(k) pipe adaptor, and the static_asserts on the range concepts all look good, and the fix is correct. One blocking issue: the new test file isn't registered in the build — details inline.

pcchen · 2026-05-31T02:44:42Z

PR #849 Review: `fix(sum)` — pairwise summation for floating-point `linalg::Sum`

Posted by Claude Code on behalf of @pcchen

Overview

This PR replaces the naive serial accumulation loop in Sum_internal (float/double/complexfloat/complexdouble) with a pairwise summation algorithm, reducing worst-case rounding error from O(N·eps) to O(log N·eps). The re-enabled Accuracy test confirms the fix works in practice (within 2 ULP vs. 108 ULP for naive).

The algorithm mirrors NumPy's np.add.reduce:

n < 8: straight loop
8 ≤ n ≤ 128: 8-accumulator unrolled block + tail
n > 128: split at n/2 rounded down to a multiple of 8, recurse

Correctness

The algorithm is correct and the boundary cases are handled:

n = 0: falls into n < 8 path, loop doesn't execute, returns T(0). ✅
n = 8: initialises r0..r7, inner loop skipped (i+8 > n), tail skipped, returns tree sum of 8 accumulators. ✅
n = 9: inner loop skipped, tail adds first[8] to the tree sum. ✅
Complex types: T(0) constructs {0.0, 0.0} for std::complex<T>, all arithmetic works element-wise. ✅
Half-rounding can't produce zero: for n > 128, half = n/2 ≥ 64, and half - half%8 ≥ 56 > 0, so the recursion always terminates. ✅
Recursion depth: O(log₂(N/128)) — for N = 1 billion, depth ≈ 23, no stack concern. ✅

Issues

1. C++20 requirement needs explicit confirmation (Important)

pairwise_sum.hpp requires C++20:

#include <ranges> — C++20
std::ranges::random_access_range, std::ranges::range_value_t, std::ranges::begin, std::ranges::size — C++20 concepts/functions
std::random_access_iterator — C++20 concept
std::span in Sum_internal.cpp — C++20

The PR passes the full test suite, so the build system already compiles with C++20. But if CMakeLists.txt hasn't been updated to require C++20 (e.g., set(CMAKE_CXX_STANDARD 20)), a clean build on a C++17-defaulting toolchain will fail. Please confirm CMAKE_CXX_STANDARD is already set to ≥ 20, or update it in this PR.

2. Comment references non-existent `stride_view.hpp` (Minor)

// pass a strided view (see stride_view.hpp) to sum a strided sequence such as a matrix diagonal.

stride_view.hpp does not exist yet — it's deferred to the follow-up Trace PR (#834). A reader encountering this comment today will search for a file that isn't there. Change to:

// pass a strided view (see issue #834) to sum a strided sequence such as a matrix diagonal.

or remove the parenthetical until #834 lands.

3. Missing `<complex>` include in `pairwise_sum.hpp` (Minor)

T(0) for complex types requires std::complex to be visible. Currently this is satisfied transitively via the callers, but pairwise_sum.hpp should be self-contained. Add #include <complex> to the header.

4. Why `half` must be a multiple of 8 — comment is missing (Minor)

std::size_t half = n / 2;
half -= half % 8;

This alignment ensures child calls can enter the unrolled path, but there's no comment explaining it. A brief note like // align to block size so the child call enters the unrolled path would help future readers.

Strengths

Clean, self-contained header with a clear algorithm structure.
Handles empty, small, medium, and large inputs correctly.
The PairwiseSum wrapper accepts any random-access range, enabling future reuse for strided reductions without changing the core algorithm.
Re-enabling the Accuracy test (rather than just removing DISABLED_) is the right approach — it now serves as a permanent regression guard.
Integer types correctly keep the straight loop (no rounding needed).
GPU path is correctly left unchanged.

Recommendation

Approve after addressing:

Confirm (or add) CMAKE_CXX_STANDARD 20 in CMakeLists.txt.
Fix the dangling stride_view.hpp reference in the comment.
(Optional) Add #include <complex> to pairwise_sum.hpp and a comment on the half % 8 alignment.

IvanaGyro

(Review written by Claude on behalf of @IvanaGyro.)

The switch to NumPy-style pairwise summation is a clean, well-documented improvement, and re-enabling the previously DISABLED_Accuracy test is the right move. The implementation looks correct: the empty range returns T(0), the <8 / <=128 / recursive-split branches are well-formed, and the powers-of-two block structure is what lets the homogeneous-value test now pass exactly. Two test-coverage gaps below; no blocking concerns with the production code itself.

linalg::Sum accumulated every element into one scalar in a serial loop, whose worst-case rounding error grows as O(N * eps). The Accuracy test in tests/linalg_test/sum_test.cpp (10000 identical floating-point values whose bit patterns are chosen to expose round-off) failed under naive accumulation and was disabled. Add a pairwise-summation primitive (pairwise_sum.hpp): PairwiseSum mirrors NumPy's np.add.reduce -- a straight loop below 8 elements, an eight-accumulator unrolled block up to 128, and a halve-and-recurse split rounded to a multiple of eight above that. Worst-case error growth drops to O(log N * eps) at essentially the same cost. PairwiseSum consumes any random-access range, so it also serves strided reductions later. Rewrite the Float/Double/ComplexFloat/ComplexDouble branches of Sum_internal to call PairwiseSum over the contiguous storage span and re-enable the accuracy test. Integer branches keep the straight loop since they do not round. The GPU path (cuReduce_gpu) already performs a block/tree reduction with bounded error, so it is left unchanged. For the disabled test's values the new code lands within 2 ULP of the reference, against the 4-ULP tolerance of EXPECT_NUMBER_EQ; naive accumulation was 108 ULP off. Co-authored-by: Claude <noreply@anthropic.com>

The existing LinalgSumHomogeneousValuesTest.Accuracy uses element_number = 10000, which only reaches the recursive-split branch of PairwiseSumBlocks, and the equal-magnitude input distribution is the one regime where naive serial summation stays nearly exact -- neither demonstrates nor protects the actual accuracy benefit pairwise summation buys. Add two focused tests: * LinalgSumBoundaryTest.EachPairwiseSumBranch iterates over sizes 1, 7, 8, 9, 15, 128, 129, 137, 1024 and verifies linalg::Sum on a homogeneous-1.0 tensor is exactly N. The sizes straddle the n < 8 / n <= 128 / n > 128 thresholds and include the n % 8 != 0 tails inside the 8-accumulator block, so a regression that drops or double-counts an element in any branch fails immediately rather than only on the n = 10000 recursive case. * LinalgSumHeterogeneousMagnitudeTest.RecoversTermsLostByNaiveAccumulation_* sums [+L, 1, 1, ..., 1, -L] where L is far above the precision threshold (1e16 for double > 2^53, 1e8 for float > 2^23). The exact sum is N - 2. Under naive accumulation the running total reaches L on the first element and every subsequent +1 vanishes in IEEE 754 rounding (1.0 is below the unit in the last place at that magnitude), so naive collapses to ~0. Pairwise keeps small values together in the tree until the +/- L cancellation at the top, so most of the small terms survive. The contract asserted is N/2 < result <= N: tight enough that a serial-accumulation regression collapses to ~0 and fails, loose enough to accommodate the handful of 1's that the unrolled accumulator holding L still rounds away. Co-authored-by: Claude <noreply@anthropic.com>

IvanaGyro mentioned this pull request May 29, 2026

refactor(trace): implement the GPU Tensor trace as CUDA kernels #850

Draft

3 tasks

gemini-code-assist Bot reviewed May 29, 2026

View reviewed changes

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated

IvanaGyro commented May 29, 2026

View reviewed changes

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated

IvanaGyro commented May 29, 2026

View reviewed changes

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated

IvanaGyro force-pushed the claude/835-pairwise-sum branch from e355381 to 4c76053 Compare May 29, 2026 07:53

IvanaGyro commented May 29, 2026

View reviewed changes

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated

IvanaGyro mentioned this pull request May 29, 2026

ci(coverage): build coverage on push to master so Codecov compares the right base #854

Merged

1 task

IvanaGyro force-pushed the claude/835-pairwise-sum branch from 4c76053 to 38fdf4f Compare May 29, 2026 23:12

IvanaGyro commented May 29, 2026

View reviewed changes

Comment thread tests/linalg_test/stride_view_test.cpp Outdated

IvanaGyro commented May 30, 2026

View reviewed changes

Comment thread src/backend/linalg_internal_cpu/stride_view.hpp Outdated

IvanaGyro force-pushed the claude/835-pairwise-sum branch from 38fdf4f to 3a5a680 Compare May 30, 2026 09:57

pcchen added this to the v1.1.0 milestone May 31, 2026

pcchen mentioned this pull request May 31, 2026

linalg::Sum accumulates floating-point error; use pairwise summation like NumPy #835

Open

This was referenced May 31, 2026

feat(stride_view): add a strided random-access range adaptor #860

Draft

refactor(trace): pass scalar Trace internal args by value and unify naming #861

Closed

refactor(trace): compute the CPU Tensor trace in the container layer #862

Draft

IvanaGyro commented May 31, 2026

View reviewed changes

Comment thread tests/linalg_test/sum_test.cpp

Comment thread tests/linalg_test/sum_test.cpp

IvanaGyro marked this pull request as ready for review June 9, 2026 17:31

IvanaGyro requested a review from pcchen June 9, 2026 17:31

IvanaGyro and others added 2 commits June 10, 2026 17:02

IvanaGyro force-pushed the claude/835-pairwise-sum branch from ed5a88b to e7a9b3e Compare June 10, 2026 23:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sum): use pairwise summation for floating-point linalg::Sum#849

fix(sum): use pairwise summation for floating-point linalg::Sum#849
IvanaGyro wants to merge 2 commits into
masterfrom
claude/835-pairwise-sum

IvanaGyro commented May 29, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

IvanaGyro left a comment

Uh oh!

Uh oh!

IvanaGyro commented May 29, 2026

Uh oh!

IvanaGyro left a comment

Uh oh!

Uh oh!

Uh oh!

pcchen commented May 31, 2026

Uh oh!

IvanaGyro left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

IvanaGyro commented May 29, 2026

Problem

Change

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

IvanaGyro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

IvanaGyro commented May 29, 2026

Uh oh!

IvanaGyro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

pcchen commented May 31, 2026

PR #849 Review: fix(sum) — pairwise summation for floating-point linalg::Sum

Overview

Correctness

Issues

1. C++20 requirement needs explicit confirmation (Important)

2. Comment references non-existent stride_view.hpp (Minor)

3. Missing <complex> include in pairwise_sum.hpp (Minor)

4. Why half must be a multiple of 8 — comment is missing (Minor)

Strengths

Recommendation

Uh oh!

IvanaGyro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented May 29, 2026 •

edited

Loading

PR #849 Review: `fix(sum)` — pairwise summation for floating-point `linalg::Sum`

2. Comment references non-existent `stride_view.hpp` (Minor)

3. Missing `<complex>` include in `pairwise_sum.hpp` (Minor)

4. Why `half` must be a multiple of 8 — comment is missing (Minor)