Skip to content

Align grouped first_value/last_value FILTER handling with shared Some(true) semantics#22681

Open
kosiew wants to merge 3 commits into
apache:mainfrom
kosiew:filtering-logic-01-22665
Open

Align grouped first_value/last_value FILTER handling with shared Some(true) semantics#22681
kosiew wants to merge 3 commits into
apache:mainfrom
kosiew:filtering-logic-01-22665

Conversation

@kosiew
Copy link
Copy Markdown
Contributor

@kosiew kosiew commented Jun 1, 2026

Which issue does this PR close?

Rationale for this change

Grouped first_value and last_value used a local BooleanArray::value(idx) check when evaluating aggregate FILTER predicates. This reads only the value bit and does not account for NULL validity, which can incorrectly treat a NULL filter value with a true value bit as passing.

DataFusion's aggregate FILTER semantics require that only Some(true) passes; both Some(false) and None should reject the row. This change reuses the existing shared filter validity logic to ensure grouped first_value and last_value behave consistently with other aggregate implementations.

What changes are included in this PR?

  • Made filter_to_validity publicly accessible from datafusion-functions-aggregate-common so it can be reused by grouped aggregate implementations.

  • Updated FirstLastGroupsAccumulator to:

    • Precompute filter validity using filter_to_validity.
    • Evaluate FILTER predicates using the shared validity bitmap rather than BooleanArray::value(idx).
  • Added targeted unit test helpers for grouped first_value / last_value accumulators.

  • Added regression tests covering nullable FILTER predicates where the value bit is true but the validity bit is NULL:

    • test_first_group_acc_rejects_null_filter_with_true_value_bit
    • test_last_group_acc_rejects_null_filter_with_true_value_bit
    • test_first_group_acc_merge_rejects_null_filter_with_true_value_bit

Are these changes tested?

Yes.

This PR adds the following unit tests:

  • test_first_group_acc_rejects_null_filter_with_true_value_bit
  • test_last_group_acc_rejects_null_filter_with_true_value_bit
  • test_first_group_acc_merge_rejects_null_filter_with_true_value_bit

These tests verify that rows with a NULL FILTER value do not pass even when the underlying boolean value bit is true, and that the same semantics are preserved in the merge path.

Are there any user-facing changes?

No user-facing changes are intended.

This is a correctness fix that aligns grouped first_value and last_value aggregate FILTER evaluation with existing DataFusion aggregate semantics for nullable filter predicates.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 3 commits June 1, 2026 11:44
…on functionality

- Updated `filter_to_validity` to public in `groups_accumulator/nulls.rs`.
- Modified `first_last.rs` to use shared `filter_to_validity` for grouped first/last.
- Enhanced handling to reject NULL filter rows even if the value bit is true.
- Added tests for:
- First update path
- Last update path
- First merge path
- Destructured group_indices loop by removing redundant variable assignment.
- Added test helper: new_int64_first_last_group_acc(...) for improved test setup.
- Added test helper: nullable_bool_filter(...) to facilitate testing with nullable booleans.
- Replaced repeated test setup across tests with the newly created helpers for better maintainability.
- Changed valid to validity in nullable_bool_filter.
- Added assert_group_acc_int64_result function.
- Replaced repeated evaluate/downcast/assert blocks in three tests for better maintainability.
@github-actions github-actions Bot added the functions Changes to functions implementation label Jun 1, 2026
@kosiew kosiew marked this pull request as ready for review June 1, 2026 04:45
@Nagato-Yuzuru
Copy link
Copy Markdown

The same bug in #22666. Looks like this PR covers that one too. Might be worth a sqllogictest on top since the repro was SQL. Here's mine if you want it (without the fix, g=1 returns 10 instead of NULL):

# Grouped first_value/last_value must apply aggregate FILTER with Some(true)
# semantics: a row passes only when the predicate is TRUE. Rows where the
# predicate evaluates to NULL or FALSE must be excluded.
#
# Rows per group (predicate is b < 1):
#   g=1: (a=10, b=NULL -> NULL), (a=20, b=2 -> FALSE)        => no rows pass
#   g=2: (a=30, b=0 -> TRUE), (a=40, b=NULL -> NULL),
#        (a=50, b=-5 -> TRUE)                                => a=30 and a=50 pass
#   g=3: (a=60, b=NULL -> NULL)                              => no rows pass
statement ok
CREATE TABLE first_last_filter_null_tests(g INT, a INT, b INT) AS VALUES
(1, 10, CAST(NULL AS INT)),
(1, 20, 2),
(2, 30, 0),
(2, 40, CAST(NULL AS INT)),
(2, 50, -5),
(3, 60, CAST(NULL AS INT));

# Groups 1 and 3 have no rows passing the filter -> NULL.
# Group 2 has a=30 and a=50 passing -> first_value ORDER BY a = 30.
query II
SELECT g, first_value(a ORDER BY a) FILTER (WHERE b < 1) AS fv
FROM first_last_filter_null_tests
GROUP BY g
ORDER BY g;
----
1 NULL
2 30
3 NULL

# Same groups via last_value: group 2 picks the largest passing a = 50.
query II
SELECT g, last_value(a ORDER BY a) FILTER (WHERE b < 1) AS lv
FROM first_last_filter_null_tests
GROUP BY g
ORDER BY g;
----
1 NULL
2 50
3 NULL

statement ok
DROP TABLE first_last_filter_null_tests;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants