feat: accept TableProvider write inputs for merge_insert and insert by wjones127 · Pull Request #7368 · lance-format/lance

wjones127 · 2026-06-18T21:18:32Z

Implements #4583: make Arc<dyn TableProvider> the canonical internal write input behind ergonomic wrappers, so re-readable sources replay across retries without spilling and materialized sources expose statistics to the merge join.

Changes

merge_insert: execute_provider (canonical entry) and execute_batches (multi-partition MemTable); execute(stream) is now a thin wrapper. The retry loop re-scans the provider per attempt, replacing the new_source_iter/SpillStreamIter replay layer. A one-shot stream spills (memory→disk) only when conflict_retries > 0; spill_for_retry(false) fails fast on contention instead of buffering the stream.
Source statistics (use case 3): execute_provider/execute_batches plan against the provider via read_table, so a MemTable/file source's exact num_rows/total_byte_size reach the optimizer and drive join build-side selection. Stream sources keep the one-shot path — they carry no statistics anyway, and this preserves the source's original error type.
InsertBuilder: adds execute_provider/execute_uncommitted_provider for input-shape parity (no retry/spill today).
Python: materialized inputs (pa.Table, pa.RecordBatch, pandas/polars frames, dict/list-of-dict) route through the new in-memory execute_batches path; streams, readers, scanners, and datasets keep the spilling path.
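The retry behavior described above can be sketched with the standard library alone. This is a simplified model, not the PR's code: `Batch`, `Rescannable`, `InMemory`, `Input`, `from_stream`, `WriteError`, and `write_with_retries` are all hypothetical names, and "spilling" is modeled as buffering in memory rather than memory→disk. It only illustrates that a re-readable source is re-scanned on each attempt, while a one-shot stream either gets buffered up front or fails fast on the first conflict.

```rust
/// Hypothetical stand-in for a batch of rows.
type Batch = Vec<i64>;

/// A write input that can be scanned more than once (stand-in for a TableProvider).
trait Rescannable {
    fn scan(&self) -> Box<dyn Iterator<Item = Batch> + '_>;
}

/// Materialized batches are trivially re-scannable.
struct InMemory(Vec<Batch>);

impl Rescannable for InMemory {
    fn scan(&self) -> Box<dyn Iterator<Item = Batch> + '_> {
        Box::new(self.0.iter().cloned())
    }
}

enum Input {
    /// Can be replayed on every retry attempt.
    Replayable(Box<dyn Rescannable>),
    /// Single-use stream; `None` once it has been consumed.
    OneShot(Option<Box<dyn Iterator<Item = Batch>>>),
}

/// Assumption: a stream is buffered only when retries are enabled and
/// `spill_for_retry` is true; otherwise it stays one-shot.
fn from_stream(
    stream: Box<dyn Iterator<Item = Batch>>,
    conflict_retries: u32,
    spill_for_retry: bool,
) -> Input {
    if conflict_retries > 0 && spill_for_retry {
        Input::Replayable(Box::new(InMemory(stream.collect())))
    } else {
        Input::OneShot(Some(stream))
    }
}

#[derive(Debug, PartialEq)]
enum WriteError {
    Conflict,
    SourceExhausted,
}

/// Re-scan the input for each attempt; a one-shot input is never retried.
fn write_with_retries(
    input: &mut Input,
    conflict_retries: u32,
    mut commit: impl FnMut(Vec<Batch>) -> Result<usize, WriteError>,
) -> Result<usize, WriteError> {
    let replayable = matches!(*input, Input::Replayable(_));
    let mut attempt = 0;
    loop {
        let batches: Vec<Batch> = match input {
            Input::Replayable(src) => src.scan().collect(),
            Input::OneShot(stream) => stream
                .take()
                .ok_or(WriteError::SourceExhausted)?
                .collect(),
        };
        match commit(batches) {
            // Retry only when the source can be replayed and attempts remain.
            Err(WriteError::Conflict) if replayable && attempt < conflict_retries => {
                attempt += 1
            }
            other => return other,
        }
    }
}
```

In this model, a stream built with `from_stream(stream, 2, false)` returns `WriteError::Conflict` on the first contended commit instead of paying the buffering cost up front.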

Tests

Rust: new tests for execute_provider, execute_batches, source-stats-in-plan, spill_for_retry(false) fail-fast, and InsertBuilder::execute_provider; full dataset::write:: suite + spill tests green; clippy/fmt clean.
Python: new materialized-vs-streaming parity test; merge_insert suite green; ruff clean.

Follow-ups (out of scope here)

Re-scannable Python providers for pa.dataset.Dataset/Scanner (currently still spill — needs a Python-callback TableProvider).
Fanning the fragment writer out over provider partitions so materialized inserts write data files in parallel (use case 1).

🤖 Generated with Claude Code

Make `Arc<dyn TableProvider>` the canonical internal write input behind ergonomic wrappers. Re-readable sources are now replayed across retries without spilling to disk, and materialized sources report statistics that let DataFusion choose the merge-join build side.

- merge_insert: add `execute_provider` (canonical) and `execute_batches` (multi-partition `MemTable`); `execute(stream)` becomes a wrapper. Retries re-scan the provider instead of the removed `new_source_iter`/`SpillStreamIter` replay layer. A one-shot stream spills only when retries are enabled; `spill_for_retry(false)` fails fast instead of buffering.
- Plan against the provider directly so a MemTable/file source's statistics reach the join; stream sources keep the one-shot path (no stats lost, and the source's original error type is preserved).
- InsertBuilder gains `execute_provider`/`execute_uncommitted_provider`.
- Python routes materialized inputs (pa.Table, RecordBatch, DataFrame, ...) through the in-memory path; streams and scanners keep spilling.

Re-scannable Python providers (pa.dataset.Dataset/Scanner) and parallel data-file writes over provider partitions remain follow-ups.

Issue: lance-format#4583
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
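To illustrate why source statistics matter for build-side selection, here is a simplified model in plain Rust. It is not DataFusion's actual rule: `Stats`, `BuildSide`, and `choose_build_side` are hypothetical, and the policy shown (prefer byte sizes, fall back to row counts, keep the default when either side is unknown) is an assumption made for the sketch.

```rust
/// Hypothetical, simplified statistics; `None` means "unknown".
#[derive(Clone, Copy, Debug, Default)]
struct Stats {
    num_rows: Option<u64>,
    total_byte_size: Option<u64>,
}

#[derive(Debug, PartialEq)]
enum BuildSide {
    Source,
    Target,
}

/// Pick the smaller input as the hash-join build side.
/// Without comparable statistics the planner cannot decide, so the
/// caller-supplied default is kept.
fn choose_build_side(source: Stats, target: Stats, default: BuildSide) -> BuildSide {
    // Compare byte sizes when both are known, otherwise row counts.
    let sizes = source
        .total_byte_size
        .zip(target.total_byte_size)
        .or_else(|| source.num_rows.zip(target.num_rows));
    match sizes {
        Some((s, t)) if s <= t => BuildSide::Source,
        Some(_) => BuildSide::Target,
        None => default,
    }
}
```

Under this model a small in-memory source with exact counts becomes the build side, whereas a source that reports nothing leaves the planner with whatever default it started from.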

codecov · 2026-06-18T21:57:59Z

Codecov Report

❌ Patch coverage is 90.02494% with 40 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/write/merge_insert.rs	93.76%	7 Missing and 14 partials ⚠️
rust/lance-datafusion/src/spill.rs	69.38%	9 Missing and 6 partials ⚠️
rust/lance-datafusion/src/exec.rs	72.72%	1 Missing and 2 partials ⚠️
rust/lance-core/src/error.rs	75.00%	1 Missing ⚠️


Always plan against the source TableProvider directly (read_table), so every source, including spilled and one-shot streams, exposes its statistics to the merge join. The merge write node already requires a single-partition input, so the optimizer coalesces multi-partition providers; the previous target_partitions(1) hack and the scan_provider_directly branch are removed.

Drop the provider_to_stream first-batch peek that preserved the source error's concrete type. Source errors are shared across join partitions by DataFusion (DataFusionError::Shared), so the type cannot be recovered, and no caller needs it: Python surfaces these errors by message. The error conversion now handles Shared (recursing when sole-owner, otherwise preserving the message under the execution category).

Revert the InsertBuilder provider methods: they only adapt a provider back to a stream and add no parallelism until fragment fan-out exists.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
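The Shared handling described in this commit can be sketched as follows. This is a minimal model using only the standard library, not the code in rust/lance-core/src/error.rs: `EngineError` stands in for DataFusionError, `ApiError` for the Lance-side error, and `convert` is a hypothetical name. It shows the two cases named above: unwrap and recurse when the `Arc` has a single owner, otherwise keep only the message under an execution category.

```rust
use std::fmt;
use std::sync::Arc;

/// Hypothetical stand-in for the query engine's error type.
#[derive(Debug)]
enum EngineError {
    Execution(String),
    InvalidInput(String),
    /// An error shared between several consumers (e.g. join partitions).
    Shared(Arc<EngineError>),
}

impl fmt::Display for EngineError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            EngineError::Execution(m) | EngineError::InvalidInput(m) => write!(f, "{m}"),
            EngineError::Shared(inner) => write!(f, "{inner}"),
        }
    }
}

/// Hypothetical stand-in for the caller-facing error type.
#[derive(Debug, PartialEq)]
enum ApiError {
    Execution(String),
    InvalidInput(String),
}

fn convert(err: EngineError) -> ApiError {
    match err {
        EngineError::Execution(m) => ApiError::Execution(m),
        EngineError::InvalidInput(m) => ApiError::InvalidInput(m),
        EngineError::Shared(shared) => match Arc::try_unwrap(shared) {
            // Sole owner: recover the inner error and convert it precisely.
            Ok(inner) => convert(inner),
            // Still shared: the inner value cannot be moved out, so only
            // its message survives, filed under the execution category.
            Err(still_shared) => ApiError::Execution(still_shared.to_string()),
        },
    }
}
```

The trade-off in this sketch is that the error category is lost whenever another owner still holds the `Arc`, which is acceptable only if callers match on messages rather than types, as the commit message says the Python side does.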

Verify that merging an empty batch list is a no-op that leaves the target unchanged, exercising the empty-source partition and schema-fallback paths in batches_to_provider / batches_into_partitions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
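The source does not show how batches_into_partitions behaves, so the following is only a guess at the kind of edge case such a test covers. `RowBatch` and `split_into_partitions` are hypothetical, and the policy (round-robin distribution, with a single empty partition for empty input) is an assumption rather than the PR's implementation.

```rust
/// Hypothetical stand-in for a record batch.
type RowBatch = Vec<i64>;

/// Distribute batches round-robin over at most `target_partitions` partitions.
/// Assumption: empty input still yields one (empty) partition, so downstream
/// code always has a partition to scan.
fn split_into_partitions(batches: Vec<RowBatch>, target_partitions: usize) -> Vec<Vec<RowBatch>> {
    // Never create more partitions than batches, and never fewer than one.
    let n = target_partitions.max(1).min(batches.len().max(1));
    let mut partitions: Vec<Vec<RowBatch>> = (0..n).map(|_| Vec::new()).collect();
    for (i, batch) in batches.into_iter().enumerate() {
        partitions[i % n].push(batch);
    }
    partitions
}
```

With this policy an empty batch list produces exactly one empty partition, and a merge over it has no rows to match or insert.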

github-actions Bot added the A-python (Python bindings) and enhancement (New feature or request) labels Jun 18, 2026

wjones127 commented Jun 18, 2026

Three comment threads on rust/lance/src/dataset/write/merge_insert.rs, all marked Outdated.

wjones127 and others added 2 commits June 18, 2026 16:57

wjones127 wants to merge 3 commits into lance-format:main from wjones127:worktree-synchronous-mixing-metcalfe
