CPU ReduceSum: improve large float32 full-reduction accuracy and add regression coverage #28587

Draft
Copilot wants to merge 3 commits into main from copilot/fix-cpu-executionprovider-reducesum-issue
Contributor

Copilot AI commented May 20, 2026

Description

CPU ReduceSum on large float32 tensors could produce materially incorrect results (orders of magnitude beyond expected reduction-order drift), while equivalent float64 reduction paths remained stable. This PR tightens numerical behavior for the affected path and adds a focused regression case matching the reported shape/value pattern.

  • Numerical stabilization in ReduceSum<float>

    • Updated ReduceAggregatorSum<T>::aggall to use Kahan compensated summation for T=float in the scalar aggregation path.
    • Kept non-float behavior unchanged.
  • Targeted regression test

    • Added ReductionOpTest.ReduceSum_default_axes_do_not_keep_dims_large_float32_constant_input.
    • Reproduces a large full-tensor reduction ([5, 68, 64, 64], constant 0.1f) and asserts the scalar output against a high-precision reference with tight tolerances.
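
```cpp
if constexpr (std::is_same_v<T, float>) {
  // Kahan compensated summation, accumulated in double.
  double sum = 0.0, compensation = 0.0;
  for (int64_t i = 0; i < size; ++i) {
    // Subtract the rounding error carried over from the previous step.
    const double value = static_cast<double>(from_data[i]) - compensation;
    const double next_sum = sum + value;
    // Recover the low-order bits lost when value was added to sum.
    compensation = (next_sum - sum) - value;
    sum = next_sum;
  }
  return static_cast<float>(sum);
}
```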
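
For reviewers who want to try the numbers outside the ORT build, here is a minimal standalone sketch of the scenario the regression test describes. It is not the ORT code: `NaiveSumFloat`, `KahanSumFloat`, `MakeConstantInput`, `ReferenceSum`, and `AbsError` are hypothetical helper names, and the naive loop is a plain sequential baseline, not the vectorized Eigen path the CPU EP used before this change.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Plain left-to-right float accumulation. Illustrative baseline only;
// this is NOT the vectorized reduction the CPU EP used previously.
float NaiveSumFloat(const float* data, int64_t size) {
  assert(size >= 0);
  float sum = 0.0f;
  for (int64_t i = 0; i < size; ++i) {
    sum += data[i];
  }
  return sum;
}

// Kahan compensated summation with double accumulators,
// mirroring the float branch shown in the PR description.
float KahanSumFloat(const float* data, int64_t size) {
  assert(size >= 0);
  double sum = 0.0, compensation = 0.0;
  for (int64_t i = 0; i < size; ++i) {
    const double value = static_cast<double>(data[i]) - compensation;
    const double next_sum = sum + value;
    compensation = (next_sum - sum) - value;
    sum = next_sum;
  }
  return static_cast<float>(sum);
}

// Constant 0.1f input with the [5, 68, 64, 64] shape from the test description.
std::vector<float> MakeConstantInput() {
  return std::vector<float>(5 * 68 * 64 * 64, 0.1f);
}

// High-precision reference: element count times the double value of the constant.
double ReferenceSum(std::size_t count, float value) {
  return static_cast<double>(count) * static_cast<double>(value);
}

// Absolute error of a float result against the double reference.
double AbsError(float got, double reference) {
  return std::abs(static_cast<double>(got) - reference);
}
```

Calling `KahanSumFloat(input.data(), input.size())` on the output of `MakeConstantInput()` and comparing it to `ReferenceSum(input.size(), 0.1f)` with `AbsError` gives a quick check of the compensated path; running `NaiveSumFloat` on the same input shows how far a purely sequential float accumulator drifts under default (non fast-math) compilation.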

Motivation and Context

The issue reports a large discrepancy in CPU EP ReduceSum(float32) for a simple large constant input, even with optimizations and threading constrained; the observed error is far larger than normal non-associativity noise. The change directly targets that high-error reduction path and adds regression protection for the same workload profile.

Copilot AI and others added 2 commits May 20, 2026 16:24
Copilot AI changed the title from "[WIP] Fix numerical discrepancy in CPUExecutionProvider ReduceSum" to "CPU ReduceSum: improve large float32 full-reduction accuracy and add regression coverage" May 20, 2026
Copilot AI requested a review from justinchuby May 20, 2026 16:32
@justinchuby
Contributor

@copilot is this efficient?

Contributor Author

Copilot AI commented May 20, 2026

> @copilot is this efficient?

It’s O(n) and allocation-free, but it is slower than the previous vectorized float sum because of the compensated double accumulation. The tradeoff is intentional here to eliminate the large numeric error reported for big float32 reductions. In this branch, that behavior is in ReduceAggregatorSum<float>::aggall from commit 1d0f58f. If you want, I can follow up with a size-threshold fast path (Eigen for small reductions, compensated path for large) to reduce overhead.
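
The size-threshold idea mentioned in this reply could look roughly like the sketch below. Everything here is an assumption for illustration: `kCompensatedSumThreshold` is a made-up cutoff that would need benchmarking, `FastSumFloat` is a plain loop standing in for the vectorized Eigen sum, and `SumWithThreshold` is a hypothetical dispatcher, not an existing ORT function.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cutoff; a real value would have to come from benchmarks.
constexpr int64_t kCompensatedSumThreshold = 4096;

// Stand-in for the fast vectorized float sum (Eigen in the real code path).
float FastSumFloat(const float* data, int64_t size) {
  float sum = 0.0f;
  for (int64_t i = 0; i < size; ++i) {
    sum += data[i];
  }
  return sum;
}

// Kahan compensated summation with double accumulators.
float CompensatedSumFloat(const float* data, int64_t size) {
  double sum = 0.0, compensation = 0.0;
  for (int64_t i = 0; i < size; ++i) {
    const double value = static_cast<double>(data[i]) - compensation;
    const double next_sum = sum + value;
    compensation = (next_sum - sum) - value;
    sum = next_sum;
  }
  return static_cast<float>(sum);
}

// Small reductions take the fast path; large ones pay for compensation.
float SumWithThreshold(const float* data, int64_t size) {
  assert(size >= 0);
  return size < kCompensatedSumThreshold ? FastSumFloat(data, size)
                                         : CompensatedSumFloat(data, size);
}
```

One design caveat with any such split: results for inputs just below and just above the cutoff come from different algorithms, so the threshold itself becomes part of the numerical behavior and would probably deserve its own test coverage.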

Development

Successfully merging this pull request may close these issues.

Large numerical discrepancy in CPUExecutionProvider ReduceSum for large float32 tensor
