Fix strict NotEqualTo and NotIn metrics with nulls and NaNs #3547
kevinjqliu wants to merge 1 commit into
This looks like the same fix as #3521.
  should_read = _StrictMetricsEvaluator(strict_data_file_schema, NotIn("some_nulls", {"abc", "def"})).eval(strict_data_file_1)
- assert should_read, "Should match: notIn on some nulls column, 'bbb' > 'abc' and 'bbb' < 'def'"
+ assert not should_read, "Should not match: mixed-null notIn cannot be proven when bounds are missing"
some_nulls has value_count = 50 and null_count = 10
- Its field id is 5 in strict_data_file_schema
So 40 values are non-null, but without lower/upper bounds the strict evaluator cannot rule out "abc" or "def" among them.
Before the fix, “column can contain nulls” incorrectly caused ROWS_MUST_MATCH; now only “column contains nulls only” can do that.
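The distinction between the two conditions can be sketched with plain integers. The helper names below are hypothetical stand-ins for the evaluator's internal checks, not PyIceberg's actual code, and the counts are the ones quoted above for some_nulls.

```python
# Hypothetical stand-ins for the two conditions discussed in this comment;
# they illustrate the idea and are not the PyIceberg implementation.

def can_contain_nulls(null_count: int) -> bool:
    """Old, too-broad condition: at least one null is present."""
    return null_count > 0


def contains_nulls_only(value_count: int, null_count: int) -> bool:
    """Narrower condition: every value in the column is null."""
    return value_count == null_count


# Counts quoted in the review comment for the some_nulls column.
value_count, null_count = 50, 10

# These 40 values have no bounds, so any of them could be "abc" or "def".
non_null_count = value_count - null_count

old_short_circuit = can_contain_nulls(null_count)
new_short_circuit = contains_nulls_only(value_count, null_count)
print(non_null_count, old_short_circuit, new_short_circuit)
```

Under these assumptions the old condition would short-circuit for this file, while the narrower one does not, which is why the expected result in the test flips.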
Force-pushed from 2d02d02 to fa29f84, then from fa29f84 to 43860b7.
)

should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5)).eval(data_file)
assert should_read == ROWS_MIGHT_NOT_MATCH, "Should not match: bounds prove the non-null value is 5"
value_count = 2 and null_count = 1, so there is 1 non-null value. Bounds [5..5] mean the non-null value is 5, so NotEqualTo("x", 5) / NotIn("x", {5, 6}) cannot match every row.
nan_value_counts=None,
)

should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5)).eval(data_file)
value_count = 2 and null_count = 2, so all values are null. That means every row matches NotEqualTo("x", 5) / NotIn("x", {5, 6}).
upper_bounds={1: to_bytes(field_type, 5.0)},
)

should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5.0)).eval(data_file)
value_count = 2 and nan_count = 1, so there is 1 non-NaN value. Bounds [5.0..5.0] mean the non-NaN value is 5.0, so NotEqualTo("x", 5.0) / NotIn("x", {5.0, 6.0}) cannot match every row.
nan_value_counts={1: 2},
)

should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5.0)).eval(data_file)
value_count = 2 and nan_count = 2, so all values are NaN. That means every row matches NotEqualTo("x", 5.0) / NotIn("x", {5.0, 6.0}).
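The four cases covered by the review comments above can be reproduced with a toy model. This is a simplified sketch, not PyIceberg's `_StrictMetricsEvaluator`: the function `strict_not_eq`, its parameters, and the local constants are hypothetical; it ignores details such as NaN-valued bounds; and the missing bounds in the all-null and all-NaN calls are an assumption, since those files' bounds are not shown here.

```python
from typing import Optional

# Local constants for this toy model only.
ROWS_MUST_MATCH = True
ROWS_MIGHT_NOT_MATCH = False


def strict_not_eq(
    value_count: int,
    null_count: int,
    nan_count: int,
    lower: Optional[float],
    upper: Optional[float],
    literal: float,
) -> bool:
    """Toy model: do the file stats prove that every row satisfies x != literal?"""
    # Only an all-null or all-NaN column proves the match without bounds.
    if null_count == value_count or nan_count == value_count:
        return ROWS_MUST_MATCH
    # Without bounds nothing can be proven about the remaining values.
    if lower is None or upper is None:
        return ROWS_MIGHT_NOT_MATCH
    # The literal lies outside [lower, upper], so no bounded value equals it.
    if literal < lower or literal > upper:
        return ROWS_MUST_MATCH
    # The literal is within the bounds, so some row may equal it.
    return ROWS_MIGHT_NOT_MATCH


# value_count = 2, null_count = 1, bounds [5..5]
mixed_nulls = strict_not_eq(2, 1, 0, 5, 5, 5)
# value_count = 2, null_count = 2 (bounds assumed absent)
all_nulls = strict_not_eq(2, 2, 0, None, None, 5)
# value_count = 2, nan_count = 1, bounds [5.0..5.0]
mixed_nans = strict_not_eq(2, 0, 1, 5.0, 5.0, 5.0)
# value_count = 2, nan_count = 2 (bounds assumed absent)
all_nans = strict_not_eq(2, 0, 2, None, None, 5.0)
```

In this model only the all-null and all-NaN files yield `ROWS_MUST_MATCH`; the mixed files fall through to the bounds check, where the literal 5 sits inside [5..5].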
Summary
Related to #3498
Fix strict metrics evaluation for NotEqualTo and NotIn so files are only proven to match when a column contains only nulls or only NaNs. Mixed null/NaN files now continue through the existing bounds checks instead of being treated as ROWS_MUST_MATCH.

Root Cause
The strict evaluator used _can_contain_nulls / _can_contain_nans for negative predicates. That is too broad: a file with values like [null, 5] and bounds 5..5 cannot be proven to match x != 5 or x not in {5} because the non-null row may still fail the predicate.

Java Parity
This matches Java's StrictMetricsEvaluator, which only short-circuits negative predicates when the column contains only nulls or only NaNs: notEq, notIn

Validation
- UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run pytest tests/expressions/test_evaluator.py -k "mixed_nulls_and_matching_bounds or mixed_nans_and_matching_bounds or all_nulls or all_nans or strict_integer_not_in"
- UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run pytest tests/expressions/test_evaluator.py
- UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run ruff check pyiceberg/expressions/visitors.py tests/expressions/test_evaluator.py
- git diff --check