Fix strict NotEqualTo and NotIn metrics with nulls and NaNs by kevinjqliu · Pull Request #3547 · apache/iceberg-python

kevinjqliu · 2026-06-22T01:15:59Z

Summary

Related to #3498

Fix strict metrics evaluation for NotEqualTo and NotIn so files are only proven to match when a column contains only nulls or only NaNs. Mixed null/NaN files now continue through the existing bounds checks instead of being treated as ROWS_MUST_MATCH.

Root Cause

The strict evaluator used _can_contain_nulls / _can_contain_nans for negative predicates. That is too broad: a file with values like [null, 5] and bounds 5..5 cannot be proven to match x != 5 or x not in {5} because the non-null row may still fail the predicate.
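The difference between the two checks can be sketched on plain counts. This is a toy illustration, not PyIceberg's implementation: `can_contain_nulls` and `contains_nulls_only` below are hypothetical stand-ins that take counts directly, and nulls are treated as matching `x != 5`, as the rest of this thread assumes.

```python
from typing import Optional


def can_contain_nulls(null_count: Optional[int]) -> bool:
    # Broad check (hypothetical stand-in): true as soon as one null may exist.
    return null_count is None or null_count > 0


def contains_nulls_only(value_count: Optional[int], null_count: Optional[int]) -> bool:
    # Narrow check (hypothetical stand-in): true only when every value is null.
    return value_count is not None and null_count is not None and value_count == null_count


# File with rows [None, 5]: value_count = 2, null_count = 1, bounds 5..5.
rows = [None, 5]
value_count = len(rows)
null_count = sum(r is None for r in rows)

# Ground truth for x != 5, with nulls counted as matching (assumption from the thread).
every_row_matches = all(r is None or r != 5 for r in rows)

broad = can_contain_nulls(null_count)                  # True
narrow = contains_nulls_only(value_count, null_count)  # False

print(broad, narrow, every_row_matches)
```

The broad check says "yes" for this file even though the row holding 5 fails the predicate, which is why it cannot justify ROWS_MUST_MATCH on its own.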

Java Parity

This matches Java's StrictMetricsEvaluator, which only short-circuits negative predicates when the column contains only nulls or only NaNs.

Validation

UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run pytest tests/expressions/test_evaluator.py -k "mixed_nulls_and_matching_bounds or mixed_nans_and_matching_bounds or all_nulls or all_nans or strict_integer_not_in"
UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run pytest tests/expressions/test_evaluator.py
UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run ruff check pyiceberg/expressions/visitors.py tests/expressions/test_evaluator.py
git diff --check

rambleraptor · 2026-06-22T01:24:31Z

This looks like the same fix as #3521.

kevinjqliu · 2026-06-22T01:27:12Z


    should_read = _StrictMetricsEvaluator(strict_data_file_schema, NotIn("some_nulls", {"abc", "def"})).eval(strict_data_file_1)
-    assert should_read, "Should match: notIn on some nulls column, 'bbb' > 'abc' and 'bbb' < 'def'"
+    assert not should_read, "Should not match: mixed-null notIn cannot be proven when bounds are missing"


some_nulls has value_count = 50 and null_count = 10

Its field id is 5 in strict_data_file_schema

So 40 values are non-null, but without lower/upper bounds the strict evaluator cannot rule out "abc" or "def" among them.

Before the fix, “column can contain nulls” incorrectly caused ROWS_MUST_MATCH; now only “column contains nulls only” can do that.
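A rough sketch of that reasoning for NotIn, assuming a simplified evaluator that works on plain Python values instead of serialized bounds. `toy_strict_not_in` and the locally defined `ROWS_MUST_MATCH` / `ROWS_MIGHT_NOT_MATCH` constants are illustrative only; the `"bbb"` bounds in the second call are a made-up input to show the contrast, not values taken from the test fixture.

```python
from typing import Optional, Set

# Local stand-ins for the evaluator's result constants (assumed boolean here).
ROWS_MUST_MATCH = True
ROWS_MIGHT_NOT_MATCH = False


def toy_strict_not_in(
    value_count: Optional[int],
    null_count: Optional[int],
    lower: Optional[str],
    upper: Optional[str],
    literals: Set[str],
) -> bool:
    # Only an all-null column proves the predicate without looking at bounds.
    if value_count is not None and null_count is not None and value_count == null_count:
        return ROWS_MUST_MATCH
    # Without both bounds, nothing can be proven about the non-null values.
    if lower is None or upper is None:
        return ROWS_MIGHT_NOT_MATCH
    # Every literal lies outside [lower, upper], so no row can equal any of them.
    if all(lit < lower or lit > upper for lit in literals):
        return ROWS_MUST_MATCH
    return ROWS_MIGHT_NOT_MATCH


# some_nulls: 50 values, 10 nulls, no bounds -> 40 unconstrained non-null values.
no_bounds = toy_strict_not_in(50, 10, None, None, {"abc", "def"})

# Hypothetical contrast: same counts, but bounds that exclude both literals.
with_bounds = toy_strict_not_in(50, 10, "bbb", "bbb", {"abc", "def"})

print(no_bounds, with_bounds)
```

With bounds missing, the mixed-null column falls through to ROWS_MIGHT_NOT_MATCH, which is what the updated assertion expects.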

kevinjqliu · 2026-06-22T01:51:43Z

+    )
+
+    should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5)).eval(data_file)
+    assert should_read == ROWS_MIGHT_NOT_MATCH, "Should not match: bounds prove the non-null value is 5"


value_count = 2 and null_count = 1, so there is 1 non-null value. Bounds [5..5] mean the non-null value is 5, so NotEqualTo("x", 5) / NotIn("x", {5, 6}) cannot match every row.

kevinjqliu · 2026-06-22T01:51:55Z

+        nan_value_counts=None,
+    )
+
+    should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5)).eval(data_file)


value_count = 2 and null_count = 2, so all values are null. That means every row matches NotEqualTo("x", 5) / NotIn("x", {5, 6}).

kevinjqliu · 2026-06-22T01:52:17Z

+        upper_bounds={1: to_bytes(field_type, 5.0)},
+    )
+
+    should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5.0)).eval(data_file)


value_count = 2 and nan_count = 1, so there is 1 non-NaN value. Bounds [5.0..5.0] mean the non-NaN value is 5.0, so NotEqualTo("x", 5.0) / NotIn("x", {5.0, 6.0}) cannot match every row.

kevinjqliu · 2026-06-22T01:52:21Z

+        nan_value_counts={1: 2},
+    )
+
+    should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5.0)).eval(data_file)


value_count = 2 and nan_count = 2, so all values are NaN. That means every row matches NotEqualTo("x", 5.0) / NotIn("x", {5.0, 6.0}).
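The four cases from these review comments can be lined up in one toy function. This is a simplified sketch under stated assumptions, not the code in `pyiceberg/expressions/visitors.py`: `toy_strict_not_eq` is a hypothetical name, bounds are plain numbers rather than serialized bytes, and counts that the test snippets do not show are passed as `None`.

```python
from typing import Optional


def toy_strict_not_eq(
    value_count: Optional[int],
    null_count: Optional[int],
    nan_count: Optional[int],
    lower: Optional[float],
    upper: Optional[float],
    literal: float,
) -> bool:
    # True stands for ROWS_MUST_MATCH, False for ROWS_MIGHT_NOT_MATCH.
    if value_count is not None and null_count == value_count:
        return True  # only nulls: every row satisfies x != literal
    if value_count is not None and nan_count == value_count:
        return True  # only NaNs: NaN never equals the literal
    if lower is None or upper is None:
        return False  # mixed column without bounds: nothing can be proven
    # Mixed column: fall through to the bounds check.
    return literal < lower or literal > upper


cases = {
    "mixed nulls, bounds 5..5": toy_strict_not_eq(2, 1, None, 5, 5, 5),
    "all nulls": toy_strict_not_eq(2, 2, None, None, None, 5),
    "mixed NaNs, bounds 5.0..5.0": toy_strict_not_eq(2, None, 1, 5.0, 5.0, 5.0),
    "all NaNs": toy_strict_not_eq(2, None, 2, None, None, 5.0),
}

for name, result in cases.items():
    print(f"{name}: {result}")
```

Only the all-null and all-NaN files come back as must-match; both mixed files are decided by the bounds, which pin the remaining value to the literal.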

kevinjqliu changed the title ~~[codex] Fix strict NotEqualTo and NotIn metrics with nulls and NaNs~~ Fix strict NotEqualTo and NotIn metrics with nulls and NaNs Jun 22, 2026

kevinjqliu force-pushed the kevinjqliu/codex-strict-metrics-not-eq-not-in branch from 2d02d02 to fa29f84 Compare June 22, 2026 01:28

Fix strict not-equal metrics with nulls and NaNs

43860b7

kevinjqliu force-pushed the kevinjqliu/codex-strict-metrics-not-eq-not-in branch from fa29f84 to 43860b7 Compare June 22, 2026 01:45

kevinjqliu wants to merge 1 commit into apache:main from kevinjqliu:kevinjqliu/codex-strict-metrics-not-eq-not-in
