Summary
Grouped first_value and last_value currently apply aggregate FILTER using only BooleanArray::value(idx), without checking predicate validity. This allows rows where the filter predicate is NULL to be treated as passing when the underlying value bit is set.
Under SQL aggregate FILTER semantics, rows pass only when the filter evaluates to TRUE (Some(true)), and must be excluded when it is FALSE or NULL.
Affected Area
- File:
datafusion/functions-aggregate/src/first_last.rs
- Function:
FirstLastGroupsAccumulator::get_filtered_extreme_of_each_group
- Current predicate:
let passed_filter = opt_filter.is_none_or(|x| x.value(idx_in_val));
Problem Statement
The grouped path for first_value/last_value computes passed_filter from BooleanArray::value() only. For nullable boolean arrays, this can treat NULL rows as passing if the value bitmap bit is 1, even though the row should be excluded.
This violates SQL semantics for aggregate FILTER and can produce incorrect non-NULL aggregate results when all rows should be filtered out.
Reproduction
SQL
SELECT
g,
first_value(a ORDER BY a) FILTER (WHERE b < 1) AS fv
FROM (
VALUES
(0, 10, CAST(NULL AS INT)),
(0, 20, 2)
) AS t(g, a, b)
GROUP BY g;
Observed Result
Expected Result
Reason:
- Row
(0, 10, NULL) has b < 1 = NULL and must not pass FILTER.
- Row
(0, 20, 2) has b < 1 = FALSE and must not pass FILTER.
- No rows satisfy
Some(true), so grouped first_value should return NULL.
Root Cause
In grouped first_last processing, filter pass logic does not encode the invariant:
- row passes aggregate filter iff predicate is
Some(true)
Instead, it checks only value(idx), which ignores nullability at that row.
Proposed Fix
Update grouped filter evaluation in get_filtered_extreme_of_each_group to require both validity and value (or equivalent Some(true) check), for example:
x.is_valid(idx_in_val) && x.value(idx_in_val)
- or
x.iter().nth(idx_in_val) == Some(Some(true))
- or a shared helper encapsulating
Some(true) semantics.
A helper-based approach is preferable to avoid semantic drift with other grouped aggregate paths.
Testing Plan
- Add SQLLogicTest coverage for grouped
first_value with nullable FILTER predicate returning no TRUE rows.
- Add the equivalent grouped
last_value case.
- Add at least one mixed case where only some rows satisfy
Some(true) to verify rows with NULL predicates are excluded.
- Run:
cargo test -p datafusion-functions-aggregate --lib first_last
cargo test -p datafusion-sqllogictest --test sqllogictests aggregate
Acceptance Criteria
- Grouped
first_value and last_value exclude rows where FILTER evaluates to NULL.
- Reproducer query returns
NULL as expected.
- New regression tests fail before the fix and pass after.
- No behavior regressions for non-null filter predicates.
Summary
Grouped
first_valueandlast_valuecurrently apply aggregateFILTERusing onlyBooleanArray::value(idx), without checking predicate validity. This allows rows where the filter predicate isNULLto be treated as passing when the underlying value bit is set.Under SQL aggregate
FILTERsemantics, rows pass only when the filter evaluates toTRUE(Some(true)), and must be excluded when it isFALSEorNULL.Affected Area
datafusion/functions-aggregate/src/first_last.rsFirstLastGroupsAccumulator::get_filtered_extreme_of_each_grouplet passed_filter = opt_filter.is_none_or(|x| x.value(idx_in_val));Problem Statement
The grouped path for
first_value/last_valuecomputespassed_filterfromBooleanArray::value()only. For nullable boolean arrays, this can treatNULLrows as passing if the value bitmap bit is1, even though the row should be excluded.This violates SQL semantics for aggregate
FILTERand can produce incorrect non-NULL aggregate results when all rows should be filtered out.Reproduction
SQL
Observed Result
fv = 10Expected Result
fv = NULLReason:
(0, 10, NULL)hasb < 1=NULLand must not passFILTER.(0, 20, 2)hasb < 1=FALSEand must not passFILTER.Some(true), so groupedfirst_valueshould returnNULL.Root Cause
In grouped
first_lastprocessing, filter pass logic does not encode the invariant:Some(true)Instead, it checks only
value(idx), which ignores nullability at that row.Proposed Fix
Update grouped filter evaluation in
get_filtered_extreme_of_each_groupto require both validity and value (or equivalentSome(true)check), for example:x.is_valid(idx_in_val) && x.value(idx_in_val)x.iter().nth(idx_in_val) == Some(Some(true))Some(true)semantics.A helper-based approach is preferable to avoid semantic drift with other grouped aggregate paths.
Testing Plan
first_valuewith nullableFILTERpredicate returning noTRUErows.last_valuecase.Some(true)to verify rows withNULLpredicates are excluded.cargo test -p datafusion-functions-aggregate --lib first_lastcargo test -p datafusion-sqllogictest --test sqllogictests aggregateAcceptance Criteria
first_valueandlast_valueexclude rows whereFILTERevaluates toNULL.NULLas expected.