Skip to content

Grouped first_value/last_value FILTER Incorrectly Includes NULL Predicate Rows #22666

@kosiew

Description

@kosiew

Summary

Grouped first_value and last_value currently apply aggregate FILTER using only BooleanArray::value(idx), without checking predicate validity. This allows rows where the filter predicate is NULL to be treated as passing when the underlying value bit is set.

Under SQL aggregate FILTER semantics, rows pass only when the filter evaluates to TRUE (Some(true)), and must be excluded when it is FALSE or NULL.

Affected Area

  • File: datafusion/functions-aggregate/src/first_last.rs
  • Function: FirstLastGroupsAccumulator::get_filtered_extreme_of_each_group
  • Current predicate:
    • let passed_filter = opt_filter.is_none_or(|x| x.value(idx_in_val));

Problem Statement

The grouped path for first_value/last_value computes passed_filter from BooleanArray::value() only. For nullable boolean arrays, this can treat NULL rows as passing if the value bitmap bit is 1, even though the row should be excluded.

This violates SQL semantics for aggregate FILTER and can produce incorrect non-NULL aggregate results when all rows should be filtered out.

Reproduction

SQL

SELECT
  g,
  first_value(a ORDER BY a) FILTER (WHERE b < 1) AS fv
FROM (
  VALUES
    (0, 10, CAST(NULL AS INT)),
    (0, 20, 2)
) AS t(g, a, b)
GROUP BY g;

Observed Result

  • fv = 10

Expected Result

  • fv = NULL

Reason:

  • Row (0, 10, NULL) has b < 1 = NULL and must not pass FILTER.
  • Row (0, 20, 2) has b < 1 = FALSE and must not pass FILTER.
  • No rows satisfy Some(true), so grouped first_value should return NULL.

Root Cause

In grouped first_last processing, filter pass logic does not encode the invariant:

  • row passes aggregate filter iff predicate is Some(true)

Instead, it checks only value(idx), which ignores nullability at that row.

Proposed Fix

Update grouped filter evaluation in get_filtered_extreme_of_each_group to require both validity and value (or equivalent Some(true) check), for example:

  • x.is_valid(idx_in_val) && x.value(idx_in_val)
  • or x.iter().nth(idx_in_val) == Some(Some(true))
  • or a shared helper encapsulating Some(true) semantics.

A helper-based approach is preferable to avoid semantic drift with other grouped aggregate paths.

Testing Plan

  1. Add SQLLogicTest coverage for grouped first_value with nullable FILTER predicate returning no TRUE rows.
  2. Add the equivalent grouped last_value case.
  3. Add at least one mixed case where only some rows satisfy Some(true) to verify rows with NULL predicates are excluded.
  4. Run:
    • cargo test -p datafusion-functions-aggregate --lib first_last
    • cargo test -p datafusion-sqllogictest --test sqllogictests aggregate

Acceptance Criteria

  1. Grouped first_value and last_value exclude rows where FILTER evaluates to NULL.
  2. Reproducer query returns NULL as expected.
  3. New regression tests fail before the fix and pass after.
  4. No behavior regressions for non-null filter predicates.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions