Adaptive (runtime, stats-based) conjunct reordering for FilterExec #22698
adriangb wants to merge 4 commits
Add a shared, policy-free substrate for runtime-adaptive filtering under `adaptive`:

- `SelectivityStats`: per-predicate online accumulator of selectivity (pass rate), cost (eval nanos), and a caller-supplied effectiveness sample with Welford mean/variance and one-sided confidence bounds.
- `AdaptiveStatsRegistry`: concurrent `FilterId -> stats` map with per-predicate skip flags, for a shared/multi-threaded consumer.

The kernel defines no placement or ordering policy; consumers (an adaptive `FilterExec`, later the parquet scan) layer their own ranking function on top. `FilterId` is registry-local; there is no global id.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
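For readers unfamiliar with the Welford update mentioned above, here is a minimal sketch of an online mean/variance accumulator with one-sided bounds. `OnlineStats`, its method names, and the `z` multiplier (a normal approximation) are hypothetical and chosen for illustration; they are not the `SelectivityStats` API from this commit.

```rust
/// Hypothetical online accumulator (illustration only, not the PR's `SelectivityStats`).
#[derive(Debug, Default, Clone)]
struct OnlineStats {
    n: u64,
    mean: f64,
    m2: f64, // running sum of squared deviations from the mean
}

impl OnlineStats {
    /// Welford update: O(1) per sample, no history kept.
    fn push(&mut self, x: f64) {
        self.n += 1;
        let delta = x - self.mean;
        self.mean += delta / self.n as f64;
        self.m2 += delta * (x - self.mean);
    }

    /// Unbiased sample variance; `None` until two samples exist.
    fn variance(&self) -> Option<f64> {
        (self.n >= 2).then(|| self.m2 / (self.n - 1) as f64)
    }

    /// Standard error of the mean.
    fn std_err(&self) -> Option<f64> {
        self.variance().map(|v| (v / self.n as f64).sqrt())
    }

    /// One-sided lower bound `mean - z * std_err` (`z` is an assumed parameter).
    fn lower_bound(&self, z: f64) -> Option<f64> {
        self.std_err().map(|se| self.mean - z * se)
    }

    /// One-sided upper bound `mean + z * std_err`.
    fn upper_bound(&self, z: f64) -> Option<f64> {
        self.std_err().map(|se| self.mean + z * se)
    }
}

fn main() {
    let mut s = OnlineStats::default();
    for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0] {
        s.push(x);
    }
    println!(
        "mean={} var={:?} bounds=({:?}, {:?})",
        s.mean,
        s.variance(),
        s.lower_bound(1.645),
        s.upper_bound(1.645)
    );
}
```

The bounds are `None` until two samples have been seen, which gives a consumer a natural signal that it has no basis for a decision yet.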
Experimental, off by default. Gates runtime-adaptive reordering of the conjuncts of a conjunctive `FilterExec` predicate. Regenerate configs.md and the information_schema config listing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When `execution.adaptive_filter_reordering` is on and the predicate is a multi-conjunct `AND` with no volatile expressions, FilterExec evaluates the conjuncts in a measured order instead of as a single fused predicate.

- Conjuncts are evaluated sequentially with threshold-gated compaction (mirroring BinaryExpr's pre-selection), measuring each conjunct's marginal selectivity and cost per batch via stream-local `Vec<SelectivityStats>` (ids are dense 0..n, no locking).
- Conjuncts are ranked by mean discards-per-second (= minimising cost_per_row / (1 - pass_rate)); the order is committed once it is statistically certain (adjacent effectiveness confidence intervals do not overlap), or after a small sample cap if they are indistinguishable.
- On freeze the conjuncts are fused into a left-deep AND in the learned order and evaluated as an ordinary predicate, so the steady state pays no adaptive overhead and reuses BinaryExpr's pre-selection. A frozen evaluator periodically re-thaws to detect distribution drift, backing the interval off exponentially while the order is stable.

State is stream-local; the plan, results, and EXPLAIN are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
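To make "marginal selectivity" concrete, the sketch below evaluates conjuncts one after another on only the rows that survived the earlier ones, records per-conjunct in/out counts, and then ranks by discards per unit cost. Everything here is a simplified stand-in: `Counters`, `eval_measured`, and `rank` are hypothetical names, rows are plain `i64` values rather than Arrow batches, compaction happens unconditionally instead of being threshold-gated, and cost is passed in by the caller rather than timed.

```rust
/// Hypothetical per-conjunct counters (the PR's `SelectivityStats` tracks more).
#[derive(Default, Clone, Copy)]
struct Counters {
    rows_in: u64,
    rows_passed: u64,
}

impl Counters {
    fn pass_rate(&self) -> f64 {
        if self.rows_in == 0 {
            1.0 // no evidence yet: assume the conjunct discards nothing
        } else {
            self.rows_passed as f64 / self.rows_in as f64
        }
    }
}

type Pred = fn(i64) -> bool;

/// Evaluate `preds` in `order`; each conjunct only sees rows that passed the
/// previous ones, so the counts recorded are marginal, not unconditional.
fn eval_measured(batch: &[i64], preds: &[Pred], order: &[usize], stats: &mut [Counters]) -> Vec<i64> {
    let mut live: Vec<i64> = batch.to_vec();
    for &i in order {
        stats[i].rows_in += live.len() as u64;
        live.retain(|&v| preds[i](v));
        stats[i].rows_passed += live.len() as u64;
        if live.is_empty() {
            break; // nothing left for later conjuncts to see
        }
    }
    live
}

/// Rank by discards per unit cost, highest first.
/// `cost_per_row` is caller-supplied here (an assumption of this sketch).
fn rank(stats: &[Counters], cost_per_row: &[f64]) -> Vec<usize> {
    let score = |i: usize| (1.0 - stats[i].pass_rate()) / cost_per_row[i];
    let mut order: Vec<usize> = (0..stats.len()).collect();
    order.sort_by(|&a, &b| score(b).total_cmp(&score(a)));
    order
}

fn main() {
    let preds: [Pred; 3] = [|v: i64| v % 2 == 0, |v: i64| v >= 0, |v: i64| v % 100 == 0];
    let batch: Vec<i64> = (0..1000).collect();
    let mut stats = [Counters::default(); 3];
    let out = eval_measured(&batch, &preds, &[0, 1, 2], &mut stats);
    let order = rank(&stats, &[1.0, 1.0, 1.0]);
    println!("kept {} rows, learned order {:?}", out.len(), order);
}
```

Note that the third conjunct is measured on the 500 even rows that reach it, not on all 1000, which is the sense in which the measurement is marginal.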
run benchmarks

🤖 Benchmark running (GKE): comparing lift-selectivity-stats (5e71ea4) to 85bc5ef (merge-base) using clickbench_partitioned
🤖 Benchmark running (GKE): comparing lift-selectivity-stats (5e71ea4) to 85bc5ef (merge-base) using tpch
🤖 Benchmark running (GKE): comparing lift-selectivity-stats (5e71ea4) to 85bc5ef (merge-base) using tpcds
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).
run benchmarks
env:
  DATAFUSION_EXECUTION_ADAPTIVE_FILTER_REORDERING: true
…r reordering

- adaptive_filter.slt: results and EXPLAIN are identical with the flag on and off (reordering changes evaluation order only).
- benchmarks/sql_benchmarks/adversarial_filter: a self-contained SQL benchmark suite (synthetic data generated inline via generate_series) of five equally-expensive regexp predicates with the selective one written last, where SQL order, the apache#22343 cost heuristic, and BinaryExpr pre-selection all leave the order wrong.

Toggle with ADAPTIVE_FILTER_REORDERING:

    BENCH_NAME=adversarial_filter ADAPTIVE_FILTER_REORDERING=true \
      cargo bench --bench sql

Q01 (selective last): ~1.75x faster at 10M rows (more at higher ADV_ROWS).
Q02 (selective first, control): neutral; confirms the win is an ordering fix and the adaptive path adds no overhead when it cannot help.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5e71ea4 to 5f61a73
Which issue does this PR close?
(static cheap/expensive heuristic reordering). No single issue; happy to file
one if useful.
Rationale for this change
Predicate evaluation order matters: running a selective predicate first lets it
gate the work of the predicates after it. The static cheap/expensive heuristic
(#22343) sorts conjuncts into two cost classes and stable-sorts within each, so
it does nothing to order multiple similarly-expensive predicates; and
`BinaryExpr`'s `AND` short-circuit only gates on a leftmost selective conjunct.

So a conjunction of several expensive predicates whose selective member is not
written first is evaluated with every predicate scanning ~every row, and
neither mechanism fixes it.

This PR adds runtime, statistics-based conjunct reordering for `FilterExec`:
it measures each conjunct's selectivity and cost on the rows that actually reach
it and runs the ones that discard the most rows per unit of CPU time first.
Maximising discards-per-second is exactly minimising
`cost_per_row / (1 - pass_rate)`, the classic optimal ordering key for
independent conjuncts.
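As a worked illustration of that ordering key, the sketch below computes the expected per-row cost of a conjunct order under the independence assumption and compares the as-written order with the ranked one. The `(cost, pass_rate)` numbers are made up for the example and are not measurements from this PR; `expected_cost` and `rank_order` are hypothetical helper names.

```rust
/// (cost_per_row, pass_rate) for one conjunct; values are illustrative only.
type Conjunct = (f64, f64);

/// Expected cost per input row of evaluating `order` left to right, assuming
/// independent conjuncts: each one only sees rows that passed all earlier ones.
fn expected_cost(conjuncts: &[Conjunct], order: &[usize]) -> f64 {
    let mut reach = 1.0; // fraction of rows reaching the current conjunct
    let mut total = 0.0;
    for &i in order {
        let (cost, pass) = conjuncts[i];
        total += reach * cost;
        reach *= pass;
    }
    total
}

/// Order by ascending cost_per_row / (1 - pass_rate).
fn rank_order(conjuncts: &[Conjunct]) -> Vec<usize> {
    let key = |i: usize| conjuncts[i].0 / (1.0 - conjuncts[i].1);
    let mut order: Vec<usize> = (0..conjuncts.len()).collect();
    order.sort_by(|&a, &b| key(a).total_cmp(&key(b)));
    order
}

fn main() {
    // Three equally expensive conjuncts; the selective one is written last.
    let c = [(10.0, 0.9), (10.0, 0.9), (10.0, 0.01)];
    let written = expected_cost(&c, &[0, 1, 2]);
    let ranked = rank_order(&c);
    println!(
        "as written: {written:.2}, ranked {:?}: {:.3}",
        ranked,
        expected_cost(&c, &ranked)
    );
}
```

With these numbers the as-written order costs 10 + 0.9·10 + 0.81·10 = 27.1 units per row, while moving the selective conjunct first costs 10 + 0.01·10 + 0.009·10 = 10.19.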
It is off by default (`datafusion.execution.adaptive_filter_reordering`).

What changes are included in this PR?
Split into four reviewable commits:
- `physical-expr-common`: adaptive selectivity-stats substrate. A policy-free `SelectivityStats` (online selectivity + cost with Welford mean/variance and confidence bounds) and a concurrent `AdaptiveStatsRegistry`. Reusable by other consumers (e.g. a future parquet-scan integration).
- `common`: config flag `execution.adaptive_filter_reordering` (default `false`), plus regenerated `configs.md` / `information_schema` listing.
- `physical-plan`: adaptive conjunct reordering in `FilterExec`. A stream-local evaluator that:
  - evaluates the conjuncts sequentially with threshold-gated compaction (mirroring `BinaryExpr`'s pre-selection) and measures each marginally;
  - ranks them by discards-per-second and commits the order once it is statistically certain (adjacent confidence intervals stop overlapping), or after a small sample cap if the conjuncts are indistinguishable;
  - on freeze, fuses the conjuncts into a left-deep `AND` in the learned order and evaluates it as an ordinary predicate (no measurement overhead, inherits `BinaryExpr` pre-selection);
  - periodically re-thaws, with exponential backoff, to detect distribution drift, so steady-state overhead decays toward zero.

  State is stream-local; the plan, results, and `EXPLAIN` are unchanged.
- Tests + benchmark: an end-to-end `.slt` asserting identical results/plan with the flag on and off, and a self-contained SQL benchmark suite `benchmarks/sql_benchmarks/adversarial_filter` (synthetic data generated inline) demonstrating the win. Run with: `BENCH_NAME=adversarial_filter ADAPTIVE_FILTER_REORDERING=true cargo bench --bench sql`
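The freeze and re-thaw rules in the third commit can be sketched roughly as follows. This is a reading of the description, not the PR's code: `Interval`, `should_freeze`, and `Rethaw` are hypothetical names, the sample cap and interval values are placeholders, and resetting the interval when the order changes is an assumption (the description only says the interval backs off while the order is stable).

```rust
/// Confidence interval on one conjunct's effectiveness (hypothetical shape).
#[derive(Clone, Copy)]
struct Interval {
    lo: f64,
    hi: f64,
}

/// `ranked` is sorted best-first. The order is certain when every adjacent
/// pair is separated: the better one's lower bound exceeds the worse one's
/// upper bound, i.e. the intervals do not overlap.
fn order_is_certain(ranked: &[Interval]) -> bool {
    ranked.windows(2).all(|w| w[0].lo > w[1].hi)
}

/// Freeze once certain, or once the sample cap is reached (the conjuncts are
/// then treated as indistinguishable). The cap value is an assumption.
fn should_freeze(ranked: &[Interval], samples: u32, sample_cap: u32) -> bool {
    order_is_certain(ranked) || samples >= sample_cap
}

/// Re-thaw schedule: the interval doubles each time a re-measurement confirms
/// the frozen order, and (by assumption) resets when the order changed.
struct Rethaw {
    interval: u64,
    base: u64,
    max: u64,
}

impl Rethaw {
    fn new(base: u64, max: u64) -> Self {
        Self { interval: base, base, max }
    }

    /// Returns the number of batches to wait before the next re-thaw.
    fn after_thaw(&mut self, order_changed: bool) -> u64 {
        self.interval = if order_changed {
            self.base
        } else {
            self.interval.saturating_mul(2).min(self.max)
        };
        self.interval
    }
}

fn main() {
    let ranked = [
        Interval { lo: 9.0, hi: 11.0 },
        Interval { lo: 4.0, hi: 6.0 },
        Interval { lo: 1.0, hi: 2.0 },
    ];
    println!("freeze now: {}", should_freeze(&ranked, 3, 32));

    let mut schedule = Rethaw::new(64, 1024);
    println!("next re-thaw in {} batches", schedule.after_thaw(false));
}
```

Doubling the interval on each confirmation is what makes the steady-state measurement overhead shrink over time on a stable workload.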
Are these changes tested?
Yes:
- Tests covering the stats substrate (`SelectivityStats`, registry) and the `FilterExec` evaluator (gating correctness, certainty-freeze, re-thaw backoff, drift adaptation).
- `adaptive_filter.slt`: results and `EXPLAIN` identical with the flag on/off.
- `adversarial_filter` benchmark: Q01 (selective last) is ~1.75x faster at 10M rows (≈2.85x at 30M; the win scales with data size); the selective-first control is neutral, confirming zero overhead when reordering can't help.
- … real Q12 -22.7% win; TPC-DS SF1: net +0.0%; ClickBench: neutral on engaging queries (small residual only on sub-10ms queries).
Are there any user-facing changes?
One new config option, `datafusion.execution.adaptive_filter_reordering`
(experimental, default false). When enabled, the order in which a conjunctive
filter's predicates are evaluated may change at runtime; results are unchanged,
but observable side effects of fallible predicates could differ (predicates
containing volatile expressions are never reordered).
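A toy illustration of the fallible-predicate caveat: when one conjunct guards another, whether an error surfaces depends on which runs first. The code below is a standalone model with hypothetical helpers (`nonzero`, `ratio_gt_one`, `filter_in_order`) over plain `i64` rows; it is not DataFusion's evaluation code and says nothing about which order the adaptive evaluator would actually pick.

```rust
type Pred = fn(i64) -> Result<bool, String>;

/// Guard conjunct: never fails.
fn nonzero(x: i64) -> Result<bool, String> {
    Ok(x != 0)
}

/// Fallible conjunct: errors on division by zero.
fn ratio_gt_one(x: i64) -> Result<bool, String> {
    100i64
        .checked_div(x)
        .map(|q| q > 1)
        .ok_or_else(|| "division by zero".to_string())
}

/// Apply the conjuncts in the given order; each one only sees rows that
/// passed the previous ones. The first error aborts the whole filter.
fn filter_in_order(rows: &[i64], order: &[Pred]) -> Result<Vec<i64>, String> {
    let mut live = rows.to_vec();
    for p in order {
        let mut kept = Vec::with_capacity(live.len());
        for &x in &live {
            if p(x)? {
                kept.push(x);
            }
        }
        live = kept;
    }
    Ok(live)
}

fn main() {
    let rows = [0, 5, 200];
    let guard_first: [Pred; 2] = [nonzero, ratio_gt_one];
    let guard_last: [Pred; 2] = [ratio_gt_one, nonzero];
    println!("{:?}", filter_in_order(&rows, &guard_first)); // Ok([5])
    println!("{:?}", filter_in_order(&rows, &guard_last)); // Err("division by zero")
}
```

When both orders succeed they return the same rows; only the error behaviour differs, which is the kind of observable difference the note above refers to.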