fix(isthmus): std_dev, variance function mappings#780
Conversation
mbwhite
left a comment
There was a problem hiding this comment.
LGTM - will be good to get these cleaned-up
|
For similar special-case conversion handling with scalar functions, we made a special effort to extract the special case handling to separate classes outside of the main ScalarFunctionConverter. Perhaps we should do something similar for special case function handling in AggregateFunctionConverter? See PR #401. I certainly do not see the implementation in #401 as perfect. Ideally function conversion would be refactored to better support special case function handling. Perhaps by making the building blocks of function and type mapping more easily composable and reusable by different converters. That doesn't need to happen in one step though; more likely a refactor once things are working. |
I actually started with an |
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Use the non-deprecated std_dev/variance signatures that carry the SAMPLE/POPULATION distinction as a leading "distribution" enum argument (std_dev:req_fp64 etc.) instead of the now-deprecated function option. During Calcite -> Substrait conversion the distribution enum operand is synthesized so the generic function matcher resolves the enum-arg variant and builds the EnumArg; the reverse direction disambiguates the Calcite operator from that argument. A shared StatisticalDistribution enum is added in :core so the DSL builder and isthmus share one source of truth. Resolves substrait-io#803 Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Collapse the three-way branch in getSqlOperatorFromSubstraitFunc into a single conditional: when a distribution enum argument is present, narrow the candidate operators by it (falling back to all operators when output type filtering yielded none). Behavior is unchanged. Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Substrait's std_dev/variance only define fp32/fp64 signatures, so a statistical aggregate over an integer (or other non-fp) column must be cast. Previously the cast projection built in fromAggCall was discarded (only used to type the measure operand), producing an inconsistent plan where the aggregate argument was typed fp64 over an un-cast integer input. visit(Aggregate) now rewrites such aggregates at the Calcite level: it appends a cast(arg AS fp64) column to the input (leaving the original column for other aggregates over it), re-points the statistical aggregate at the appended column with its return type re-derived over fp64, and casts the results back to the type Calcite inferred via a projection on top. The rewrite is idempotent (fp32/fp64 arguments are untouched), so the recursive re-conversion terminates and the plan is stable after Calcite's project normalization. Also fold in minor cleanups to FunctionConverter: import EnumArg rather than using its fully qualified name, correct the multi-operator error message (it serves aggregates too, not only scalar functions), and document the single-distribution-argument assumption in distributionArgument. Signed-off-by: Niels Pardon <par@zurich.ibm.com>
|
rebased on latest main and updated the PR to use the new enum arg function variants instead of the deprecated function option variants |
This PR implements the bidirectional conversion between Calcite and Substrait for the
statistical aggregate functions standard deviation and variance (
STDDEV_POP,STDDEV_SAMP,VAR_POP,VAR_SAMP).Problem
Substrait uses a single function name for both the population and sample variants:
std_devfor bothSTDDEV_POPandSTDDEV_SAMPvariancefor bothVAR_POPandVAR_SAMPThe population (n denominator) vs. sample (n-1 denominator) distinction is captured by a
distributionvalue (POPULATIONorSAMPLE). Previously these functions were not mapped,so the existing TPC-DS test cases using them were silently mis-mapped to
AVG(Calciterepresents these statistical functions with
SqlAvgAggFunction).Solution
This uses the non-deprecated signatures introduced in substrait-io/substrait#1011
(Substrait v0.87.0), which carry the distinction as an enum function argument (the
leading
distributionargument, e.g.std_dev:req_fp64), rather than the deprecatedfunction-option form (substrait-io/substrait#1019). Resolves #803.
Since the input arguments are cast to
FP64when necessary, the integer-based signaturesproposed in substrait-io/substrait#1012 are not required.
Calcite → Substrait:
FunctionMappingsfor all four statistical operators.AggregateFunctionConvertersynthesizes the leadingdistributionenum operand based onthe Calcite
SqlKind, so the generic function matcher resolves the enum-arg variant andbuilds the
EnumArgautomatically (no bespoke option plumbing).FP64where necessary.Substrait → Calcite:
FunctionConverter.getSqlOperatorFromSubstraitFuncdisambiguates the population/sampleoperator from the
distributionenum argument.SubstraitRelNodeConverterandPreCalciteAggregateValidatorskip the non-value enumargument when building Calcite aggregate operands.
DSL & shared enum:
StatisticalDistributionenum (in:core) is the single source of truth for theSAMPLE/POPULATIONvalues used by both the DSL builder and isthmus.SubstraitBuildergainsstddevPopulation,stddevSample,variancePopulation, andvarianceSampleconvenience methods.Non-floating-point inputs
std_dev/varianceonly definefp32/fp64signatures, so a statistical aggregateover an integer (or other non-fp) column is rewritten in
SubstraitRelVisitorto cast theargument to
fp64(appending a cast column so other aggregates over the same column areunaffected) and cast the result back to the type Calcite inferred. The rewrite is
idempotent, so the converted plan is stable under further round trips.
Testing
AggregationFunctionsTestexercises full round trips (POJO ⇄ proto and Substrait ⇄Calcite) for all four functions, with and without grouping.
StatisticalFunctionTestverifies the SQL round trip, asserts that each SQL operatormaps to the enum-arg signature (
std_dev:req_fp64etc.) with the correctdistributionEnumArgand no function options, and covers non-fp (integer) inputs — including acolumn shared with a non-statistical aggregate, and with grouping.
🤖 Generated with AI