perf: stream the target on the probe side of the merge_insert join #7382
Open
sezruby wants to merge 1 commit into
`create_plan` (the v2 merge_insert fast path used by both single-node `execute` and uncommitted/distributed merges) built the join as `target.join(source)`. When the target exceeds DataFusion's hash-join collect threshold the join is planned as `mode=Partitioned`, whose build (left) side is hashed and held in memory per partition. With the target on the left this materialized the entire target per partition, which can exhaust executor memory on large tables (observed: an 8x16GB Spark cluster OOM-killed merging a 12-51M row source into a 64M row target, peak RSS hitting the 16GB pod limit with only ~0.4GB on the JVM heap; the rest was native Arrow hash memory).

Build the join as `source.join(target)` instead, so the (typically small) source is the hash build side and the (potentially huge) target is streamed as the probe side. Neither input carries comparable row statistics (the source is a one-shot stream), so DataFusion's `should_swap_join_order` leaves the operands as written rather than swapping the target back onto the build side. The join output is semantically identical (every column is referenced downstream by qualified name, not position), so this is purely a memory/scheduling change.

Measured on the OOM repro (real 64M-row target, same 16GB pods): the merge now completes where it previously OOM-killed, peak executor RSS dropped from 16.3GB to 6.7GB (20% source) and 11.0GB (80% source), and it ran ~2x faster than the position-delta path.

Add `test_plan_keeps_target_on_probe_side_at_scale`, which exercises the production-representative `Partitioned` plan (>128K-row target) and asserts the target scan is the probe side. The existing toy-sized snapshot tests cannot catch a regression here: at small scale the join is `CollectLeft` and the optimizer freely swaps sides, so they pass with the operands reversed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
`create_plan`, the `merge_insert` fast path used by `MergeInsertJob::execute`, builds the join as `target.join(source)`.

When the target exceeds DataFusion's hash-join collect threshold (`hash_join_single_partition_threshold_rows`, 128K rows), the join is planned as `mode=Partitioned`, whose build (left) side is hashed and held in memory (per partition). With the target on the left, the entire target is materialized in memory during the merge. On a large target this dominates memory use and scales with target size, even when only a handful of rows are being upserted.

This is the wrong side to build: the source of an upsert is typically far smaller than the target.
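To make the build/probe asymmetry concrete, here is a minimal, std-only sketch of a hash join. It is a toy model, not DataFusion's `HashJoinExec`: the function `hash_join`, the helpers `target_rows` and `source_rows`, and the row shape `(u64, String)` are all hypothetical names chosen for illustration. It only shows that the number of rows retained in memory follows the build side, whichever input that is.

```rust
use std::collections::HashMap;

/// Toy inner hash join on an integer key (illustrative only).
/// The build side is fully materialized into a hash table; the probe
/// side is consumed one row at a time and never retained.
/// Returns the joined rows and the number of rows held in the table.
fn hash_join<B, P>(build: B, probe: P) -> (Vec<(u64, String, String)>, usize)
where
    B: Iterator<Item = (u64, String)>,
    P: Iterator<Item = (u64, String)>,
{
    // Build phase: every build-side row stays in memory until the join ends.
    let mut table: HashMap<u64, Vec<String>> = HashMap::new();
    let mut held = 0usize;
    for (key, payload) in build {
        table.entry(key).or_default().push(payload);
        held += 1;
    }

    // Probe phase: each probe row is looked up once and then dropped.
    let mut out = Vec::new();
    for (key, probe_payload) in probe {
        if let Some(matches) = table.get(&key) {
            for build_payload in matches {
                out.push((key, build_payload.clone(), probe_payload.clone()));
            }
        }
    }
    (out, held)
}

/// Hypothetical "large target": keys 0..n.
fn target_rows(n: u64) -> impl Iterator<Item = (u64, String)> {
    (0..n).map(|k| (k, format!("target-{k}")))
}

/// Hypothetical "small source": n rows touching every 1000th key.
fn source_rows(n: u64) -> impl Iterator<Item = (u64, String)> {
    (0..n).map(|i| (i * 1000, format!("source-{i}")))
}

fn main() {
    // Target on the build side: the whole target is held in memory.
    let (rows_a, held_a) = hash_join(target_rows(100_000), source_rows(50));
    // Source on the build side: only the source is held; the target streams.
    let (rows_b, held_b) = hash_join(source_rows(50), target_rows(100_000));

    println!("target as build: {} rows held, {} matches", held_a, rows_a.len());
    println!("source as build: {} rows held, {} matches", held_b, rows_b.len());
}
```

Both orientations produce the same set of matches; only the amount of retained state and the order of the payload columns in each output row differ.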
Fix
Build the join as `source.join(target)` so the (typically small) source is the hash build side and the (potentially huge) target is streamed as the probe side.

Neither input carries comparable row statistics (the source is a one-shot stream), so DataFusion's `should_swap_join_order` leaves the operands as written rather than swapping the target back onto the build side. The join-type mapping is mirrored accordingly (`Left`↔`Right`; `Inner`/`Full` unchanged). Every column is referenced downstream by qualified name, not position, so the join output is semantically identical; this is purely a memory/scheduling change.

Measured (single-node, pure `MergeInsertJob::execute`)
Setup: upserting a 50K-row source into a wide-row target, measuring peak process RSS (`getrusage`).

After the fix, peak memory is bounded by the source size and stays flat as the target grows (5M→20M: ~0.35 GB); before, it scales with the target. That is ~4.7× lower peak at 20M, and the gap widens with target size.
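The join-type mirroring mentioned under Fix can be sketched as below. This is an illustration under assumptions, not the code in this PR: `JoinKind` and `mirror` are hypothetical local names rather than DataFusion types. The point is that swapping the operands of a join requires swapping `Left` and `Right` so that the same input keeps its unmatched rows, while `Inner` and `Full` are symmetric and stay as they are.

```rust
/// Hypothetical local stand-in for a join type (not a DataFusion type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JoinKind {
    Inner,
    Left,
    Right,
    Full,
}

/// Join kind to use after the two operands have been swapped, so the
/// result keeps the same rows as the original orientation.
fn mirror(kind: JoinKind) -> JoinKind {
    match kind {
        // "Keep all rows of the left input" becomes "keep all rows of the
        // right input" once that input has moved to the other side.
        JoinKind::Left => JoinKind::Right,
        JoinKind::Right => JoinKind::Left,
        // Symmetric joins do not depend on operand order.
        JoinKind::Inner | JoinKind::Full => kind,
    }
}

fn main() {
    for kind in [JoinKind::Inner, JoinKind::Left, JoinKind::Right, JoinKind::Full] {
        println!("{:?} -> {:?}", kind, mirror(kind));
    }
}
```

Applying `mirror` twice returns the original kind, which is a quick sanity check that the mapping is a pure operand swap.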
How the memory numbers were produced (single-node, no external deps)
I measured this with a throwaway `#[ignore]`d test in `merge_insert.rs` (not included in this PR: it allocates multiple GB and isn't a unit test). It writes a wide-row target, upserts a small source through `MergeInsertJob::execute`, and prints the process peak RSS via `getrusage(RUSAGE_SELF)`. Reviewers can paste it in to reproduce:
Run (each orientation in its own process so peak RSS is isolated):
To see the "before" number, temporarily flip the orientation back to `scan_aliased.join(source_df_aliased, ...)` (with the join type mirrored) and rerun.
Test
Adds `test_plan_keeps_target_on_probe_side_at_scale`, which exercises the production-representative `Partitioned` plan (>128K-row target) and asserts the target scan (`LanceRead`) is the probe (right) side of the `HashJoinExec`.

The existing toy-sized snapshot tests cannot catch a regression here: at small scale the join is `CollectLeft` and the optimizer freely swaps sides, so they would still pass with the operands reversed. The four snapshot tests are updated for the new (semantically identical) projection column order.

`cargo test -p lance --lib merge_insert` → 157 passed. `cargo fmt --all` + `cargo clippy -p lance --lib --tests` clean.

🤖 Generated with Claude Code