fix(isthmus): std_dev, variance function mappings #780

Open
nielspardon wants to merge 4 commits into substrait-io:main from nielspardon:par-stddev

@nielspardon nielspardon commented Mar 26, 2026

Member

This PR implements the bidirectional conversion between Calcite and Substrait for the
statistical aggregate functions standard deviation and variance (STDDEV_POP,
STDDEV_SAMP, VAR_POP, VAR_SAMP).

Problem

Substrait uses a single function name for both the population and sample variants:

  • std_dev for both STDDEV_POP and STDDEV_SAMP
  • variance for both VAR_POP and VAR_SAMP

The population (n denominator) vs. sample (n-1 denominator) distinction is captured by a
distribution value (POPULATION or SAMPLE). Previously these functions were not mapped,
so the existing TPC-DS test cases using them were silently mis-mapped to AVG (Calcite
represents these statistical functions with SqlAvgAggFunction).
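
To make the n versus n-1 distinction concrete, here is a minimal, self-contained sketch. The class `VarianceDemo` and its methods are hypothetical helpers for illustration only; they are not part of isthmus or Calcite.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of isthmus): shows the two denominators. */
public final class VarianceDemo {
  private VarianceDemo() {}

  /** Sum of squared deviations from the arithmetic mean. */
  public static double sumSquaredDeviations(double[] xs) {
    double mean = Arrays.stream(xs).average().orElse(Double.NaN);
    return Arrays.stream(xs).map(x -> (x - mean) * (x - mean)).sum();
  }

  /** VAR_POP semantics: divide by n. */
  public static double variancePopulation(double[] xs) {
    return sumSquaredDeviations(xs) / xs.length;
  }

  /** VAR_SAMP semantics: divide by n - 1 (not meaningful for fewer than two values). */
  public static double varianceSample(double[] xs) {
    return sumSquaredDeviations(xs) / (xs.length - 1);
  }

  /** STDDEV_POP semantics: square root of the population variance. */
  public static double stddevPopulation(double[] xs) {
    return Math.sqrt(variancePopulation(xs));
  }

  /** STDDEV_SAMP semantics: square root of the sample variance. */
  public static double stddevSample(double[] xs) {
    return Math.sqrt(varianceSample(xs));
  }
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the sum of squared deviations is 32, so the population variance is 32/8 = 4 and the sample variance is 32/7. Mapping either function to AVG would instead return the mean, 5.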

Solution

This PR uses the non-deprecated signatures introduced in substrait-io/substrait#1011
(Substrait v0.87.0). They carry the distinction as a leading distribution enum argument
(e.g. std_dev:req_fp64) rather than through the deprecated function-option form
(substrait-io/substrait#1019). Resolves #803.

Since the input arguments are cast to FP64 when necessary, the integer-based signatures
proposed in substrait-io/substrait#1012 are not required.

Calcite → Substrait:

  • Added function mappings in FunctionMappings for all four statistical operators.
  • AggregateFunctionConverter synthesizes the leading distribution enum operand based on
    the Calcite SqlKind, so the generic function matcher resolves the enum-arg variant and
    builds the EnumArg automatically (no bespoke option plumbing).
  • Statistical inputs are cast to FP64 where necessary.
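
The Calcite → Substrait direction can be pictured with the sketch below. It is an illustration under stated assumptions, not the actual AggregateFunctionConverter code: `StatKind` is a hypothetical stand-in for the four Calcite SqlKind values, and `Distribution` is a stand-in for the shared StatisticalDistribution enum.

```java
/** Hypothetical sketch: derive the Substrait name and distribution from the Calcite kind. */
public final class DistributionMappingSketch {
  private DistributionMappingSketch() {}

  /** Stand-in for the four Calcite SqlKind values involved (assumption). */
  public enum StatKind { STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP }

  /** Stand-in for the shared StatisticalDistribution enum (assumption). */
  public enum Distribution { POPULATION, SAMPLE }

  /** Both variants of each function share one Substrait function name. */
  public static String substraitName(StatKind kind) {
    switch (kind) {
      case STDDEV_POP:
      case STDDEV_SAMP:
        return "std_dev";
      default:
        return "variance";
    }
  }

  /** The leading enum operand that would be synthesized for the given kind. */
  public static Distribution distribution(StatKind kind) {
    switch (kind) {
      case STDDEV_POP:
      case VAR_POP:
        return Distribution.POPULATION;
      default:
        return Distribution.SAMPLE;
    }
  }
}
```

With the distribution placed in front of the value operand, a generic matcher would see the same argument shape as the enum-arg signature, which is why no option-specific plumbing is needed.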

Substrait → Calcite:

  • FunctionConverter.getSqlOperatorFromSubstraitFunc disambiguates the population/sample
    operator from the distribution enum argument.
  • SubstraitRelNodeConverter and PreCalciteAggregateValidator skip the non-value enum
    argument when building Calcite aggregate operands.
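
The reverse direction can be sketched in the same simplified style. Everything here is hypothetical: operators are modeled as plain strings, arguments as plain objects, and `Distribution` again stands in for StatisticalDistribution. The real code works on Calcite and Substrait types instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the Substrait -> Calcite side, using plain strings and objects. */
public final class ReverseMappingSketch {
  private ReverseMappingSketch() {}

  /** Stand-in for the shared StatisticalDistribution enum (assumption). */
  public enum Distribution { POPULATION, SAMPLE }

  /** Picks the Calcite operator name from the Substrait name and the distribution argument. */
  public static String calciteOperator(String substraitName, Distribution distribution) {
    boolean population = distribution == Distribution.POPULATION;
    switch (substraitName) {
      case "std_dev":
        return population ? "STDDEV_POP" : "STDDEV_SAMP";
      case "variance":
        return population ? "VAR_POP" : "VAR_SAMP";
      default:
        throw new IllegalArgumentException("not a statistical function: " + substraitName);
    }
  }

  /** Drops enum arguments so that only value arguments become Calcite operands. */
  public static List<Object> valueArguments(List<Object> arguments) {
    List<Object> values = new ArrayList<>();
    for (Object argument : arguments) {
      if (!(argument instanceof Distribution)) {
        values.add(argument);
      }
    }
    return values;
  }
}
```

The second method mirrors why the enum argument has to be skipped: Calcite's STDDEV_SAMP takes one operand, while the Substrait call carries two arguments.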

DSL & shared enum:

  • A shared StatisticalDistribution enum (in :core) is the single source of truth for the
    SAMPLE / POPULATION values used by both the DSL builder and isthmus.
  • SubstraitBuilder gains stddevPopulation, stddevSample, variancePopulation, and
    varianceSample convenience methods.

Non-floating-point inputs

std_dev / variance only define fp32 / fp64 signatures, so a statistical aggregate
over an integer (or other non-fp) column is rewritten in SubstraitRelVisitor to cast the
argument to fp64 (appending a cast column so other aggregates over the same column are
unaffected) and cast the result back to the type Calcite inferred. The rewrite is
idempotent, so the converted plan is stable under further round trips.
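
A toy model of the input-side rewrite may help. It is a sketch under loud assumptions: column types are plain strings, `Agg` is a hypothetical simplified aggregate call with a single argument index, and the cast of the result back to the Calcite-inferred type is not modeled. It is not the SubstraitRelVisitor implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical toy model of the cast rewrite; not the SubstraitRelVisitor code. */
public final class CastRewriteSketch {
  private CastRewriteSketch() {}

  /** Simplified aggregate call: a function name and one input column index (assumption). */
  public static final class Agg {
    public final String function;
    public final int arg;

    public Agg(String function, int arg) {
      this.function = function;
      this.arg = arg;
    }
  }

  private static final List<String> STATISTICAL =
      Arrays.asList("STDDEV_POP", "STDDEV_SAMP", "VAR_POP", "VAR_SAMP");

  /** True for a statistical aggregate whose argument is neither fp32 nor fp64. */
  public static boolean needsCast(Agg agg, List<String> columnTypes) {
    String type = columnTypes.get(agg.arg);
    return STATISTICAL.contains(agg.function) && !type.equals("fp32") && !type.equals("fp64");
  }

  /**
   * Appends one fp64 column per aggregate that needs a cast (mutating columnTypes)
   * and returns the aggregates re-pointed at the appended columns.
   */
  public static List<Agg> rewrite(List<String> columnTypes, List<Agg> aggs) {
    List<Agg> rewritten = new ArrayList<>();
    for (Agg agg : aggs) {
      if (needsCast(agg, columnTypes)) {
        columnTypes.add("fp64"); // models the appended cast(arg AS fp64) column
        rewritten.add(new Agg(agg.function, columnTypes.size() - 1));
      } else {
        rewritten.add(agg); // other aggregates keep reading the original column
      }
    }
    return rewritten;
  }
}
```

Starting from one i32 column with SUM and STDDEV_POP over it, the rewrite appends an fp64 column and re-points only STDDEV_POP. Running the rewrite again changes nothing, because the statistical aggregate now reads an fp64 column.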

Testing

  • AggregationFunctionsTest exercises full round trips (POJO ⇄ proto and Substrait ⇄
    Calcite) for all four functions, with and without grouping.
  • StatisticalFunctionTest verifies the SQL round trip, asserts that each SQL operator
    maps to the enum-arg signature (std_dev:req_fp64 etc.) with the correct distribution
    EnumArg and no function options, and covers non-fp (integer) inputs — including a
    column shared with a non-statistical aggregate, and with grouping.

🤖 Generated with AI

@mbwhite mbwhite left a comment

Contributor

LGTM - will be good to get these cleaned up

@bestbeforetoday

Member

For similar special-case conversion handling with scalar functions, we made a special effort to extract the special case handling to separate classes outside of the main ScalarFunctionConverter. Perhaps we should do something similar for special case function handling in AggregateFunctionConverter? See PR #401.

I certainly do not see the implementation in #401 as perfect. Ideally function conversion would be refactored to better support special case function handling. Perhaps by making the building blocks of function and type mapping more easily composable and reusable by different converters. That doesn't need to happen in one step though; more likely a refactor once things are working.

@nielspardon

Member Author

> For similar special-case conversion handling with scalar functions, we made a special effort to extract the special case handling to separate classes outside of the main ScalarFunctionConverter. Perhaps we should do something similar for special case function handling in AggregateFunctionConverter? See PR #401.
>
> I certainly do not see the implementation in #401 as perfect. Ideally function conversion would be refactored to better support special case function handling. Perhaps by making the building blocks of function and type mapping more easily composable and reusable by different converters. That doesn't need to happen in one step though; more likely a refactor once things are working.

I actually started with an AggregateFunctionMapper similar to ScalarFunctionMapper, but ran into challenges with that approach: in AggregateFunctionConverter I only have access to the Calcite AggregateCall and cannot modify the input of the Aggregate relation node, which I need to do to cast the types of the input fields. I also feel that a bigger redesign might be worth exploring.

@nielspardon nielspardon deleted the par-stddev branch June 11, 2026 09:50
@nielspardon nielspardon restored the par-stddev branch June 11, 2026 09:50
@nielspardon nielspardon reopened this Jun 11, 2026
@nielspardon nielspardon marked this pull request as draft June 15, 2026 18:17
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Use the non-deprecated std_dev/variance signatures that carry the
SAMPLE/POPULATION distinction as a leading "distribution" enum argument
(std_dev:req_fp64 etc.) instead of the now-deprecated function option.

During Calcite -> Substrait conversion the distribution enum operand is
synthesized so the generic function matcher resolves the enum-arg variant
and builds the EnumArg; the reverse direction disambiguates the Calcite
operator from that argument. A shared StatisticalDistribution enum is
added in :core so the DSL builder and isthmus share one source of truth.

Resolves substrait-io#803

Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Collapse the three-way branch in getSqlOperatorFromSubstraitFunc into a
single conditional: when a distribution enum argument is present, narrow
the candidate operators by it (falling back to all operators when output
type filtering yielded none). Behavior is unchanged.

Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Substrait's std_dev/variance only define fp32/fp64 signatures, so a
statistical aggregate over an integer (or other non-fp) column must be
cast. Previously the cast projection built in fromAggCall was discarded
(only used to type the measure operand), producing an inconsistent plan
where the aggregate argument was typed fp64 over an un-cast integer input.

visit(Aggregate) now rewrites such aggregates at the Calcite level: it
appends a cast(arg AS fp64) column to the input (leaving the original
column for other aggregates over it), re-points the statistical aggregate
at the appended column with its return type re-derived over fp64, and
casts the results back to the type Calcite inferred via a projection on
top. The rewrite is idempotent (fp32/fp64 arguments are untouched), so the
recursive re-conversion terminates and the plan is stable after Calcite's
project normalization.

Also fold in minor cleanups to FunctionConverter: import EnumArg rather
than using its fully qualified name, correct the multi-operator error
message (it serves aggregates too, not only scalar functions), and
document the single-distribution-argument assumption in distributionArgument.

Signed-off-by: Niels Pardon <par@zurich.ibm.com>
@nielspardon

Member Author

Rebased on the latest main and updated the PR to use the new enum-arg function variants instead of the deprecated function-option variants.

@nielspardon nielspardon marked this pull request as ready for review June 16, 2026 09:39

Development

Successfully merging this pull request may close these issues.

use enum arg function signatures for std_dev and variance functions instead of function option signatures

3 participants