feat: accept TableProvider write inputs for merge_insert and insert #7368

Draft
wjones127 wants to merge 3 commits into lance-format:main from wjones127:worktree-synchronous-mixing-metcalfe

@wjones127

Contributor

Implements #4583: make Arc<dyn TableProvider> the canonical internal write input behind ergonomic wrappers, so re-readable sources replay across retries without spilling and materialized sources expose statistics to the merge join.

Changes

  • merge_insert: adds execute_provider (the canonical entry point) and execute_batches (multi-partition MemTable); execute(stream) is now a thin wrapper. The retry loop re-scans the provider on each attempt, replacing the new_source_iter/SpillStreamIter replay layer. A one-shot stream spills (memory→disk) only when conflict_retries > 0; spill_for_retry(false) fails fast on contention instead of buffering the stream.
  • Source statistics (use case 3): execute_provider/execute_batches plan against the provider via read_table, so a MemTable/file source's exact num_rows/total_byte_size reach the optimizer and drive join build-side selection. Stream sources keep the one-shot path — they carry no statistics anyway, and this preserves the source's original error type.
  • InsertBuilder: adds execute_provider/execute_uncommitted_provider for input-shape parity (no retry/spill today).
  • Python: materialized inputs (pa.Table, pa.RecordBatch, pandas/polars frames, dict/list-of-dict) route through the new in-memory execute_batches path; streams, readers, scanners, and datasets keep the spilling path.
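
The retry behaviour described in the merge_insert bullet can be modelled in a few lines. This is a minimal conceptual sketch, not Lance's API: `ConflictError`, `RescannableSource`, `SpilledStream`, `OneShotStream`, `source_for_stream`, and `run_with_retries` are all hypothetical names, and the spill goes straight to a temporary file with `pickle` rather than through a memory-then-disk buffer.

```python
import pickle
import tempfile


class ConflictError(Exception):
    """Stands in for a commit conflict that makes an attempt retryable."""


class RescannableSource:
    """Hypothetical re-readable source: every scan() yields the same batches."""
    replayable = True

    def __init__(self, batches):
        self._batches = list(batches)

    def scan(self):
        return iter(self._batches)


class SpilledStream:
    """Hypothetical wrapper that makes a one-shot iterator replayable by
    copying its batches to a temporary file on the first scan."""
    replayable = True

    def __init__(self, stream):
        self._stream = stream
        self._file = None

    def scan(self):
        if self._file is None:
            self._file = tempfile.TemporaryFile()
            for batch in self._stream:
                pickle.dump(batch, self._file)
        self._file.seek(0)
        return self._read()

    def _read(self):
        while True:
            try:
                yield pickle.load(self._file)
            except EOFError:
                return


class OneShotStream:
    """Hypothetical wrapper for a stream that is consumed exactly once."""
    replayable = False

    def __init__(self, stream):
        self._stream = stream

    def scan(self):
        return self._stream


def source_for_stream(stream, conflict_retries, spill_for_retry=True):
    # Spill only when a retry could actually need the data again.
    if conflict_retries > 0 and spill_for_retry:
        return SpilledStream(stream)
    return OneShotStream(stream)


def run_with_retries(source, attempt, conflict_retries):
    for attempt_no in range(conflict_retries + 1):
        try:
            # Each attempt re-scans the source from the beginning.
            return attempt(list(source.scan()))
        except ConflictError:
            # Fail fast when out of retries or when the source cannot replay.
            if attempt_no == conflict_retries or not source.replayable:
                raise
```

In this model a re-readable source never touches the temporary file, which is the point of treating the provider as the canonical input.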

Tests

  • Rust: new tests for execute_provider, execute_batches, source-stats-in-plan, spill_for_retry(false) fail-fast, and InsertBuilder::execute_provider; full dataset::write:: suite + spill tests green; clippy/fmt clean.
  • Python: new materialized-vs-streaming parity test; merge_insert suite green; ruff clean.

Follow-ups (out of scope here)

  • Re-scannable Python providers for pa.dataset.Dataset/Scanner (these currently still spill; re-scanning them needs a Python-callback TableProvider).
  • Fanning the fragment writer out over provider partitions so materialized inserts write data files in parallel (use case 1).

🤖 Generated with Claude Code

Make `Arc<dyn TableProvider>` the canonical internal write input behind
ergonomic wrappers. Re-readable sources are now replayed across retries
without spilling to disk, and materialized sources report statistics that let
DataFusion choose the merge-join build side.

- merge_insert: add `execute_provider` (canonical) and `execute_batches`
  (multi-partition `MemTable`); `execute(stream)` becomes a wrapper. Retries
  re-scan the provider instead of the removed `new_source_iter`/
  `SpillStreamIter` replay layer. A one-shot stream spills only when retries
  are enabled; `spill_for_retry(false)` fails fast instead of buffering.
- Plan against the provider directly so a MemTable/file source's statistics
  reach the join; stream sources keep the one-shot path (they carry no
  statistics, and the source's original error type is preserved).
- InsertBuilder gains `execute_provider`/`execute_uncommitted_provider`.
- Python routes materialized inputs (pa.Table, RecordBatch, DataFrame, ...)
  through the in-memory path; streams and scanners keep spilling.
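
The Python routing rule above can be illustrated with a standard-library analogue. This is only a sketch under stated assumptions: `choose_write_path` is a hypothetical helper, plain lists and dicts stand in for materialized inputs such as pa.Table, and iterators stand in for streams, readers, and scanners. It does not reproduce the actual type checks in the bindings.

```python
from collections.abc import Iterator, Mapping, Sequence


def choose_write_path(data):
    """Return "in_memory" for materialized inputs and "spilling" for streams.

    Hypothetical helper: the real bindings dispatch on pyarrow, pandas, and
    polars types, which this sketch replaces with built-in containers.
    """
    # An iterator or generator can be consumed only once, so it is a stream.
    if isinstance(data, Iterator):
        return "spilling"
    # A dict of columns is already fully in memory.
    if isinstance(data, Mapping):
        return "in_memory"
    # So is a list or tuple of row dicts (but not a string).
    if isinstance(data, Sequence) and not isinstance(data, (str, bytes)):
        return "in_memory"
    raise TypeError(f"unsupported input type: {type(data).__name__}")
```

The iterator check comes first because a stream must never be mistaken for a materialized input: buffering it by accident would defeat the spilling path.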

Re-scannable Python providers (pa.dataset.Dataset/Scanner) and parallel
data-file writes over provider partitions remain follow-ups.

Issue: lance-format#4583

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the A-python (Python bindings) and enhancement (New feature or request) labels on Jun 18, 2026
@codecov

codecov Bot commented Jun 18, 2026

3 comment threads on rust/lance/src/dataset/write/merge_insert.rs (Outdated)
wjones127 and others added 2 commits June 18, 2026 16:57
Always plan against the source TableProvider directly (read_table), so
every source — including spilled and one-shot streams — exposes its
statistics to the merge join. The merge write node already requires a
single-partition input, so the optimizer coalesces multi-partition
providers; the previous target_partitions(1) hack and the
scan_provider_directly branch are removed.
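
Why source statistics matter to the join can be shown with a toy hash join. This is a minimal sketch of one common heuristic (hash the side with fewer known rows), not DataFusion's actual optimizer rule; `pick_build_side` and `hash_join` are hypothetical names, and rows are plain dicts.

```python
from collections import defaultdict


def pick_build_side(left_rows, right_rows):
    """Choose which input to hash. None means the row count is unknown."""
    if left_rows is None and right_rows is None:
        return "left"  # no information: arbitrary default for this sketch
    if left_rows is None:
        return "right"
    if right_rows is None:
        return "left"
    return "left" if left_rows <= right_rows else "right"


def hash_join(left, right, key, left_rows=None, right_rows=None):
    """Inner hash join over lists of dicts.

    Returns the chosen build side and a list of (left_row, right_row) pairs.
    """
    side = pick_build_side(left_rows, right_rows)
    build, probe = (left, right) if side == "left" else (right, left)

    # Build phase: index the (ideally smaller) side by join key.
    table = defaultdict(list)
    for row in build:
        table[row[key]].append(row)

    # Probe phase: stream the other side through the hash table.
    pairs = []
    for row in probe:
        for match in table.get(row[key], []):
            pairs.append((match, row) if side == "left" else (row, match))
    return side, pairs
```

The join result is the same either way; only the memory held during the build phase changes, which is why exact row counts from the source are worth exposing to the planner.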

Drop the provider_to_stream first-batch peek that preserved the source
error's concrete type. Source errors are shared across join partitions by
DataFusion (DataFusionError::Shared), so the type cannot be recovered, and
no caller needs it — Python surfaces these errors by message. The error
conversion now handles Shared (recursing when sole-owner, otherwise
preserving the message under the execution category).
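
The conversion rule for shared errors can be sketched as follows. This is a loose Python analogue, not the Rust implementation: `SharedError`, `ExecutionError`, and `convert_error` are hypothetical names, and an explicit `owners` count stands in for the reference count that decides whether a shared error has a sole owner.

```python
class ExecutionError(Exception):
    """Stands in for the generic execution error category."""


class SharedError(Exception):
    """Hypothetical model of an error shared by several consumers."""

    def __init__(self, inner, owners):
        super().__init__(str(inner))
        self.inner = inner    # the wrapped error
        self.owners = owners  # how many holders reference it (assumed model)


def convert_error(err):
    if isinstance(err, SharedError):
        if err.owners == 1:
            # Sole owner: unwrap and convert the inner error itself.
            return convert_error(err.inner)
        # Still shared: the inner error cannot be taken, so keep its message.
        return ExecutionError(str(err.inner))
    return err
```

In this model only the message is guaranteed to survive sharing, which matches the observation that callers match on the message rather than the concrete type.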

Revert the InsertBuilder provider methods: they only adapt a provider back
to a stream and add no parallelism until fragment fan-out exists.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verify that merging an empty batch list is a no-op that leaves the target
unchanged, exercising the empty-source partition and schema-fallback paths
in batches_to_provider / batches_into_partitions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>