feat(lance-linalg): runtime SIMD dispatch for pre-Haswell x86_64 from-source builds#2
Draft
tobocop2 wants to merge 410 commits into
Draft
feat(lance-linalg): runtime SIMD dispatch for pre-Haswell x86_64 from-source builds#2tobocop2 wants to merge 410 commits into
tobocop2 wants to merge 410 commits into
Conversation
5 tasks
tobocop2
added a commit
that referenced
this pull request
Apr 26, 2026
…ch dot_u8.rs convention Per-function doc comments on `*_scalar`/`*_avx`/`*_avx_fma`/`*_avx2`/`*_avx512` inner functions are now one-liners matching the existing convention in `dot_u8.rs` (see e.g. its `pub unsafe fn dot_u8_avx2` comment). Drops the redundant "Caller must ensure..." precondition lines — those are implicit from `unsafe fn` + `#[target_feature]`. Module-level `//!` docs and public-API `///` docs are left detailed per project convention. Refs #1, #2.
This was referenced Apr 26, 2026
a3df856 to
9193496
Compare
58325e8 to
7db5171
Compare
7db5171 to
26aa7f4
Compare
…ance-format#6901) ## Summary The background memtable flush handler (`MemTableFlushHandler::flush_memtable`) called `flush`, which persists the data file and bloom filter but **builds no secondary indexes**. The handler never even received the shard's index configs. Two query-side consequences over flushed generations: - **Point lookups** (`point_lookup.rs`) run a `filter_expr` scan. Lance can route that through a scalar index — but none existed, so lookups fell back to a full scan (perf regression). - **Vector search** (`vector_search.rs`) uses index-only `fast_search()`. Its doc comment assumes "each flushed memtable has its own vector index built during flush", which was false — so flushed vector rows were invisible to KNN. This is a **correctness** bug, not just perf. ## Changes - Thread the shard's `index_configs` into `MemTableFlushHandler`. - Call `flush_with_indexes` when any indexes are configured so each flushed generation carries the same secondary indexes as the active memtable; keep plain `flush` when none are configured to avoid a needless dataset open. - Box the `flush_with_indexes` future to keep the flush async block under the type-layout recursion limit. ## Testing - New `test_flushed_generation_is_indexed`: writes through the real `ShardWriter` path, forces a flush, and asserts the flushed generation (a) carries the BTree index and (b) resolves `id = 5` via `ScalarIndexQuery` rather than a scan. HNSW/FTS flush and `fast_search` over an indexed flushed generation are already covered by existing tests. - `cargo test -p lance --lib dataset::mem_wal::` — 316 passed, 0 failed. - `cargo fmt --all` and `cargo clippy -p lance --tests -- -D warnings` — clean. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rks (lance-format#6882) Final piece of the lance-format#6856 split. Replaces the duplicated criterion-based `mem_wal_read.rs` and `mem_wal_vector.rs` benchmarks with two standalone CLI benchmarks that emit JSON output for panel-style trend analysis: - `mem_wal_vector_bench`: KNN search across LSM levels with deterministic synthetic 384-dim embeddings, IVF-RQ base table index, and recall verification against brute-force ground truth. - `mem_wal_point_lookup_bench`: PK-based point lookups across the base table, flushed generations, and active memtable. Both accept `--flushed-generations` and `--max-memtable-rows` for sweeping the full matrix; results are written as individual JSON files. This is the reusable bench template I want to extend for FTS benchmarking. Depends on the LSM vector-search API (`with_dataset` / `refine_factor`) landed in lance-format#6881, so it's the last of the three split PRs. Part of splitting lance-format#6856 into focused PRs. Co-authored with @jackye1995. --------- Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
…-format#6915) The Lance-DuckDB documentation build is broken due to an incorrect URL. This PR fixes that to render correctly in the sidebar via the `.pages` card. The build scripts have been updated to only include the relevant docs files from the `lance-duckdb/docs` folder. Also adds a new `index.md` page that users will now land on, rather than landing on the DataFusion page. The index page contains links to all the other relevant integrations.
…ost-filter (lance-format#6899) ## What Fixes a **stale read** in LSM MemWAL vector search: when a primary key is updated and its *fresh* row falls out of its own source's top-k, the superseded copy from an older generation could win the cross-source dedup and be returned. ## Background `LsmGlobalPkDedupExec` (introduced in lance-format#6881) is exact only over the candidates each source *surfaces*. If a PK's fresh version is pushed out of its source's top-k by closer rows, the dedup never sees it and cannot suppress the stale copy from an older generation — so the stale row is served. Repro: `test_vector_search_stale_read_when_fresh_falls_out_of_top_k`. ## Approach Make staleness a **per-source PK-hash post-filter** applied to each source's KNN *before* the cross-source union, so a stale row never reaches the merge. - **Membership.** Each generation's membership is an `Arc<HashSet<u64>>` of PK hashes (`compute_pk_hash` — the same hash the dedup nodes use). Built once per generation; flushed generations' sets are cached on `FlushedMemTableCache` and scanned **streaming** (one batch resident at a time, no full PK-column buffer). - **Per-source block set.** `compute_source_block_lists` gives each source the membership sets of the generations newer than it — `NEWER(G)` — as a `Vec<Arc<HashSet<u64>>>`, **referenced, never merged into a per-query union**. Generations are **per-shard**, so the map is keyed `(shard_id, generation)` and a source is only superseded by strictly-newer generations of its *own* shard (the base table, shardless and oldest, is blocked by every generation). The base table is **not scanned** — it's filtered by hashing only its KNN candidates. - **Execution.** `PkHashFilterExec` drops any candidate whose PK hash is in any of its source's blocked sets. This handles only **cross-generation** supersession: a PK in a newer generation makes every copy of it stale, so dropping by hash needs no row address. **Within-generation** duplicates (same PK twice in one generation) share a hash and are left to the existing global dedup's `(generation, freshness)` tiebreaker. ## Configuration `LsmVectorSearchPlanner::plan_search` exposes two knobs (Rust + Python + Java bindings): - **`overfetch_factor: f64`** — a single knob controlling *both* stale filtering and over-fetch: - `< 1.0` (e.g. `0.0`): stale filtering **off** (no block-list / `PkHashFilterExec`; the global dedup still runs). - `== 1.0` (**default**): filtering **on**, no over-fetch — a source with superseded rows fetches exactly `k` and may return fewer than `k` live rows. - `> 1.0`: filtering **on**, over-fetch `ceil(k * factor)` so dropping the stale rows still leaves `k`. There is intentionally no separate on/off flag — over-fetch is only meaningful while filtering, so the factor encodes both. - **`refine_base_table: bool`** (replaces the old `refine_factor: Option<u32>`) — re-rank the base arm's approximate index distances to exact (factor 1). Auto-enabled whenever stale filtering runs (over-fetch widens the base's candidate pool, which must be exact before the merge). ## External API `LsmScanner::contains_pks(&RecordBatch) -> Vec<bool>` — test which primary keys have been (re)written in the WAL fresh tier (active + frozen memtables + flushed generations), built like any query (construct the scanner, then call). Hashing is internal, so callers never reproduce `compute_pk_hash`. ## Scope / caveats - **Post-filter, not a true prefilter.** It relies on over-fetch to backfill dropped rows and does **not** guarantee `k` live results in the adversarial case (more superseded rows near the query than the over-fetch covers); `PkHashFilterExec` logs a per-source warning when this happens. Promoting the same membership into the KNN as a true prefilter (the index traverses until `k` rows pass, removing the over-fetch) is the headline follow-up. - **Within-generation top-k eviction** (a generation holds both a stale and a fresh copy of a PK, and the fresh one is evicted) is *not* pre-filtered — it shares a hash, so it can't be disambiguated by membership. It's the same bug class as the cross-source case, currently relying on the global dedup; closing it would need flush-time dedup (so flushed generations are internally deduped like the base table). The base table is assumed internally deduped. ## Tests `test_vector_search_stale_read_when_fresh_falls_out_of_top_k` passes (with positive + `overfetch_factor=0.0` toggle-off assertions); plus over-fetch backfill, cross-flushed and composite-PK stale reads, same-L0 newest-wins, per-shard block-list isolation, and `PkHashFilterExec` / membership unit tests (incl. null and composite PKs). `cargo fmt`, `cargo clippy -p lance --tests -- -D warnings`, and the `lance` / `lance-jni` / `pylance` crate checks are clean. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…at#6784) Vector and FTS indexes already support the fast search mode. The current PR enables fast search capability in scalar index scenarios. ``` dataset.create_scalar_index("filter", "BTREE") fast = dataset.to_table(filter="filter >= 95", fast_search=True) ``` Only indexed(scalar index) fragments can be scanned in fast search mode. **PAY ATTENTION** **Scalar index fast search Not Support for Lance Legacy Version LanceFileVersion::Legacy** --------- Co-authored-by: zhangyue19921010 <zhangyue.1010@bytedance.com>
## Summary Closes lance-format#6821. - Extend `Scanner::nearest` for batched queries on fixed-size vector columns (no separate `nearest_batch` API). - Flat batch KNN in `KNNVectorDistanceExec`: one scan per data batch, per-query top-k in a single stream (`m × k` rows max). - Add `query_index` (0-based) so callers can group results per input query. - Indexed path (`use_index=true`): per-query ANN, union, and `query_index` tagging when a vector index exists. - `distance_range` applied before per-query top-k on the flat path; indexed batch uses the same bounds. - `fast_search` without an index returns an empty batch result that still includes `query_index`. - Batch vs multivector: `FixedSizeList` column + list-like query → batch of single-vector queries; `List` multivector column → one multivector query. ## API | | | | --- | --- | | **Input** | List-like or 2-D query against a `FixedSizeList` embedding column | | **Output** | Up to `k` rows per query vector + `query_index` | | **Flat** | Shared scan/decode; requires `_rowid` | | **Indexed** | One indexed search per query vector, then merge | ## Benchmark Local disk, float32 vectors, `use_index=false`. OS page cache accepted. Medians over 15 timed rounds (3 warmup); separate then batch each round. ```bash cd python && uv run --extra benchmarks pytest python/python/benchmarks/test_search.py::test_batch_flat_knn ``` At small `m`, the second separate query often hits warm page cache, so speedup is modest (~1.1×). It grows with `m` as scan work is shared. ### Query count (1M rows, dim 512, k 10) | m | separate | batch | saved | speedup | | --: | --: | --: | --: | --: | | 2 | 227.48 ms | 208.46 ms | 19.02 ms | 1.09× | | 5 | 559.10 ms | 328.84 ms | 230.26 ms | 1.70× | | 10 | 1.1125 s | 536.48 ms | 576.01 ms | 2.07× | ### Dataset size (m 10, dim 512, k 10) | rows | separate | batch | saved | speedup | | --: | --: | --: | --: | --: | | 100,000 | 123.84 ms | 53.23 ms | 70.62 ms | 2.33× | | 500,000 | 566.20 ms | 261.00 ms | 305.20 ms | 2.17× | | 1,000,000 | 1.1104 s | 520.86 ms | 589.50 ms | 2.13× | ## Test plan - [x] `cargo test -p lance --lib test_batch_knn` - [x] `cargo test -p lance fast_search_without` - [x] `uv run pytest python/tests/test_vector_index.py -k batch` - [x] `cargo clippy -p lance --tests -- -D warnings` - [x] `uv run ruff format --check python/lance/dataset.py python/tests/test_vector_index.py` --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: BubbleCal <bubble-cal@outlook.com>
## Summary
Fix `Planner::optimize_expr` so expression type coercion runs before
simplification. This matches DataFusion's normal plan pipeline where
analyzer type coercion runs before optimizer simplification.
This matters for immutable UDFs with literal arguments. With the
previous order, simplification could evaluate a constant UDF call before
`Int64` literals were coerced to the UDF's expected `Float64` inputs.
For geospatial filters, `st_point(0, 0)` could reach the geo UDF as
integer arrays and panic during the UDF's strict float downcast.
`st_point(0.0, 0.0)` worked because the literals were already `Float64`.
## Repro Demo
This is a standalone Python repro script. It is not added to the repo.
```python
import tempfile
import lance
import numpy as np
import pyarrow as pa
from geoarrow.rust.core import point, points
def run_filter(ds, expr):
table = ds.to_table(filter=expr)
values = table.to_pydict()
print(f"{expr} -> {table.num_rows} rows: {values}")
assert table.num_rows == 1
with tempfile.TemporaryDirectory() as tmpdir:
point_array = points([np.array([1.0, 10.0]), np.array([2.0, 10.0])])
schema = pa.schema([pa.field(point("xy")).with_name("point")])
table = pa.Table.from_arrays([point_array], schema=schema)
ds = lance.write_dataset(table, f"{tmpdir}/geo_filter.lance")
run_filter(ds, "st_distance(point, st_point(0.0, 0.0)) < 5")
run_filter(ds, "st_distance(point, st_point(0, 0)) < 5")
```
Observed before the fix:
```text
st_distance(point, st_point(0.0, 0.0)) < 5 -> 1 rows: {'point': [{'x': 1.0, 'y': 2.0}]}
thread 'lance_background_thread' panicked at .../arrow-array-58.3.0/src/cast.rs:849:33:
primitive array
RuntimeError: Task was aborted
```
Observed after the fix:
```text
st_distance(point, st_point(0.0, 0.0)) < 5 -> 1 rows: {'point': [{'x': 1.0, 'y': 2.0}]}
st_distance(point, st_point(0, 0)) < 5 -> 1 rows: {'point': [{'x': 1.0, 'y': 2.0}]}
```
OpenDAL-backed object stores route Lance's bounded `CloudObjectReader` reads through `get_opts`, and `object_store_opendal` resolves those requests with a `stat_with` before reading bytes. This changes the bounded read path to use `get_ranges` and forwards `get_ranges` through `DynamicOpenDalStore`, letting OpenDAL issue range reads without the extra metadata call. Full-object reads and stream paths still use `get_opts` where metadata and stream semantics are needed.
…ion-vector-on-flush) (lance-format#6929) Unfortunately, we still have an issue where we can serve stale data from within a single source in the `mem_wal` implementation. The crux is that we dedup globally AFTER we run the filter at each source. So if the newer data does not satisfy the filter, then the algorithm maintains the stale data. In this PR we updated the active memtable flush path the make a single pass over the in-memory data and compute a delete vector that gets written on the flushed memtable. Since we read using the native `LanceScanner` that respects the delete vector we essentially dedup the flushed memtable without having to invalidate and recompute the indicies that are built during inserts on the active memtable. As part of this we removed all of the now unnecessary `_generation` / `_freshness` field addition and datafusion dedup nodes. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FilteredReadExec could spend a pushed-down LIMIT on scalar-index matches before evaluating a remaining refine filter, which allowed filtered scans to return fewer rows than requested even when enough rows matched the full predicate. This keeps scalar-index range truncation behind the full-filter exactness boundary: indexed candidates are still narrowed by the scalar index, but scan_range_after_filter is only consumed early when no refine filter remains. Regression coverage exercises both the FilteredReadExec path and the scanner(filter, limit) end-to-end behavior. Fixes lancedb/lancedb#3436.
This updates the workspace `jieba-rs` dependency to 0.10.0 and refreshes the resolved root and Python lockfiles. The existing Lance Jieba tokenizer integration continues to use the same upstream API surface, while the resolved `jieba-macros` package now matches the new release. The Python Rust third-party license inventory version entries were also updated to match the resolved dependencies.
…ce-format#7191) ## Summary Adds forward-compatibility infrastructure to the directory-catalog `__manifest` dataset, mirroring the Lance table format's reader/writer feature flags but at the catalog-manifest layer. - Persists two `u64` bitmasks in the `__manifest` dataset's `table_metadata` (`lance.namespace.manifest.reader_feature_flags` / `writer_feature_flags`). Absent keys parse as `0`, so every existing manifest stays universally compatible. - A build refuses to read or write a manifest that sets a flag it does not understand, returning a clear "please upgrade" error instead of misreading it. Reader and writer checks are enforced centrally: in the manifest consistency wrapper, at catalog open, and on the copy-on-write mutation path. - Also stops the directory catalog from silently degrading to directory listing when the manifest is incompatible — `build()` and the per-operation fallbacks propagate the incompatibility instead of masking it, so the check cannot be bypassed. This is the **mechanism only**: no manifest feature is defined yet, so the known masks are `0` and nothing is ever set — **zero behavior change** today. It is the prerequisite so that a future `__manifest` format change (e.g. a schema migration) can be shipped safely: that change adds its bit to the known masks and stamps it on write, and from then on older clients refuse the new format instead of misreading it.
Co-authored-by: zhangyue19921010 <zhangyue.1010@bytedance.com>
…-reuse window (lance-format#7325) ## Summary A deferred-remap compaction that materializes deletions writes a fragment-reuse index (FRI) that the inverted (full-text-search) index is read through at load time. The load-time path dropped the deleted rows and renumbered the surviving `doc_id`s — but the posting lists reference `doc_id`s **positionally** (a `doc_id` is an index into the `DocSet`'s `row_ids` / `num_tokens` arrays, fixed at build time) and are not regenerated at load. Dropping rows shifted every later `doc_id` out from under the posting lists, so a query would index `num_tokens` / `row_ids` out of bounds (panic) or score/return the wrong document. This is deletion-specific: merge-only deferred compaction remaps every row to `Some(new_addr)`, so nothing is dropped and positions stay aligned. It only breaks when deletions are materialized (`remap_row_id` returns `None`). ## Fix: tombstone-preserve-positions In `DocSet::from_columns` (the FRI load path), instead of dropping deleted rows: - keep every doc slot so `doc_id`s stay aligned with the posting lists; - put `RowAddress::TOMBSTONE_ROW` in the deleted slots, and leave them out of the `inv` reverse map so a `row_id` lookup never resolves to a deleted doc; - keep `num_tokens` full-length, so `num_tokens(doc_id)` can't go out of bounds. In `Wand::search`, skip docs whose resolved `row_id` is `TOMBSTONE_ROW` — placed right beside the existing prefilter-mask skip and using the same iterator-advance, so a tombstoned doc is stepped over exactly like a prefilter-rejected one and never surfaces in results. The heavyweight physical remap (`DocSet::remap`) still does the real renumber + compact (and rebuilds the posting lists to match); this load-time path only needs to stay consistent until then. ### Note on stats Tombstoned slots are still counted in `total_tokens` / `len()`, so BM25 `avgdl` in the FRI window is effectively the pre-deletion average. This only perturbs *scores* slightly, never the result set, and the physical remap restores exact stats. Excluding tombstones would require changing `len()` semantics (used by `idf`), which isn't worth it for a transient window. ## Test `test_read_inverted_index_with_defer_index_remap_and_deletions`: delete a prefix, deferred-compact, then assert FTS returns exactly the surviving rows — both in the FRI window and after physical remap + trim. Without the fix it panics on the out-of-bounds `num_tokens` access. ## Scope Independent change against `main` — touches only the inverted-index load and query paths (`scalar/inverted/{index,wand}.rs`) plus one new test. The analogous IVF_HNSW desync under deletions (lance-format#3993) is **not** addressed here: the HNSW graph traverses node ids positionally to compute distances *during* search (not just at result collection), so it needs a different approach across the SQ/PQ/flat/RQ storage load paths — a separate change.
…ance-format#7285) ## Problem The dataset fixtures in `python/benchmarks/test_search.py` pass deprecated parameters to the dataset APIs: - `lance.write_dataset(..., use_legacy_format=False)` - `lance.dataset(..., index_cache_size=64 * 1024)` The test config sets `filterwarnings = ['error::DeprecationWarning', ...]`, so these emit `DeprecationWarning` **as errors during fixture setup**. As a result every benchmark in `test_search.py` errors out before running: ``` DeprecationWarning: use_legacy_format is deprecated, use data_storage_version instead DeprecationWarning: The 'index_cache_size' parameter is deprecated. Use 'index_cache_size_bytes' instead. ``` ## Fix Switch to the current parameters: - `use_legacy_format=False` → `data_storage_version="stable"` (the exact mapping the deprecation shim applies). - `index_cache_size=64 * 1024` → `index_cache_size_bytes=512 * 1024 * 1024` (512 MiB comfortably caches these 100k-row IVF_PQ indices). ## Verification ``` $ uv run --group benchmarks pytest python/benchmarks/test_search.py::test_ann_no_refine --benchmark-only test_ann_no_refine[clean] 541.48 us 1813 ops test_ann_no_refine[with_delete_files] 770.46 us 1285 ops test_ann_no_refine[with_new_rows] 2843.17 us 349 ops 3 passed ``` `ruff check` / `ruff format --check` clean. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…TS) (lance-format#7067) ## What Fixes a stale-read phantom shared by the **vector** and **FTS** index search arms over the MemWAL **active memtable**, and routes the in-memory newest-per-PK / membership decisions through a single maintained MVCC index. ## The bug The active memtable is an append log; a PK update is a later append with the same key. The in-memory secondary indexes — HNSW (vector) and the inverted index (FTS) — are **append-only**, so an updated row's old entries stay live. Both arms deduped with `WithinSourceDedupExec`, which only suppresses a stale row when the fresh version is **also in the result set**. When an update moves a row out of the query's match set (vector: far from the query; FTS: new text no longer matches), the fresh version isn't returned, so the stale version leaks. (`point_lookup` was immune — it already did the MVCC recency seek.) ## The fix Maintain a per-memtable **MVCC PK-position index**: a lock-free arena skiplist keyed on `(compute_pk_hash(pk_columns), row_position)`, enabled on the active memtable and carried through freeze. The row position *is* the version stamp, so this reuses the exact primitive point-lookup trusts (`get_newest_visible`). - **`NewestPkFilterExec`** keeps an index hit iff `get_newest_visible(pk_hash, max_visible) == row_position` — predicate-independent, snapshot-exact (keys on the scanner's latched `max_visible`). Wired into the active vector arm (replacing `WithinSourceDedupExec`) and the FTS arm (adding `with_row_id`). - **point_lookup** falls back to the index (hash + value-equality collision guard) when no scalar BTree exists; its plan-path active arm uses `SortExec(_rowid DESC).fetch(1)` instead of `WithinSourceDedupExec`. - **Cross-source block-list** probes the index per candidate (`GenMembership::Index`, snapshot-bounded) with no per-query set; flushed/base keep cached sets. `contains_pks` probes too. - **Cleanup:** `WithinSourceDedupExec` / `DedupDirection` and the per-query PK-hash set builders (`pk_hashes()`, `in_memory_pk_hashes`) are deleted. Net negative LOC. Hash keying covers single **and composite** PKs uniformly. The snapshot-bounded probe also closes a latent over-block where a not-yet-visible newer write could shadow an older visible copy. ## Tests Both `#[ignore]`d repros un-ignored and passing; new `PkPositionIndex` unit tests, point-lookup-without-btree, index-sourced block-list, and snapshot-bounded **vanished-row guard** tests (within- and cross-source). Full `mem_wal` suite green; `cargo fmt` + `clippy -D warnings` clean. ## Deferred follow-ups - Migrate `MemTableDedupScanExec`'s reverse-walk `HashSet` (filtered-read scan path) onto the same probe — the last within-source mechanism off the index; benchmark-gated. - In-graph HNSW within-gen eviction (perf end-game; correctness is now exact). 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ist columns (lance-format#7247) Closes lance-format#5102 ## Problem A fixed-size-list column with dimension 0 panics with `attempt to divide by zero` (`rust/lance-encoding/src/data.rs`, `FixedSizeListBlock::num_values`). As of pylance 7.0.0 the panic fires on **write** for every storage version (`stable`/`2.1`/`2.2`), and reading datasets persisted by older writers (which accepted such columns) panics as well. Reproduction details are in the issue comment: lance-format#5102 (comment) ## Approach Following the maintainer guidance in lance-format#5102 (error, not panic), this adds two small guards at boundaries that already return `Result`, instead of changing `DataBlock::num_values()` to return `Result` (the approach that made lance-format#5159 balloon across the whole encoding crate): 1. **Write side**: `Schema::validate()` rejects zero-dimension fixed-size-list fields (including nested ones). `validate()` runs inside `Schema::try_from(&ArrowSchema)`, so every write entry point surfaces a clean schema error instead of a panic. Writes currently panic on every storage version, so no working flow changes behavior. 2. **Read side (defensive)**: the structural and legacy field-scheduler builders reject zero-dimension fixed-size lists with an invalid-input error, so datasets persisted by old writers fail cleanly at scheduling time instead of crashing the process. ## How the guards sit in the data flow  Two facts that shape the design: - `Schema::try_from(&ArrowSchema)` calls `validate()` internally and every write path performs this conversion, so guard 1 in one place covers all write entry points. - Guard 2 exists because writers up to ~2026-04 could still persist zero-dimension columns under the `stable` (2.0) storage version; reading those files must not crash the process. ## Tests - `lance-core`: `Schema::try_from` rejects zero-dim FSL at top level and nested in a struct; positive dimensions still validate. - `lance-encoding`: the scheduler guard rejects zero-dim FSL, including FSL-nested-in-FSL, and accepts positive dimensions. - Python: parametrized over `legacy`/`stable`/`2.1`, `write_dataset` now raises a clean `OSError` (same mapping as other schema validation errors) instead of `PanicException`. Co-authored-by: Daniel Mao <danielmao@danieldeMacBook-Pro.local>
…ormat#7287) ## What `BTreeIndex::search` rebuilt the full `col IN (...)` physical expression on every page it touched. For a large IN-list spanning many pages this is O(pages x values) -- the expression (and its hash set) is reconstructed per page even though it is identical across pages. This compiles the predicate once in `BTreeIndex::search` and reuses it across pages via a new `FlatIndex::search_prebuilt`. Membership is O(1) per row regardless of set size, so only the repeated build was wasted work. Resolves the existing `// TODO` in `search_page`. ## Why `col IN (<large set>)` is used to resolve big key sets to row ids. On a real 83M-row table, an `IsIn` of ~46K values took 13.4s, almost entirely per-page expression construction. ## Result Same table/query, index lookup **13.4s -> 3.2s**. Cost is now bounded by pages touched + rows scanned, independent of IN-list size (a local 80K-value lookup over a multi-page index runs ~130ms and is flat in the value count). ## Notes - No public API or behavior change -- the predicate and its evaluation are identical, just built once instead of per page. - `cargo test -p lance-index` passes (279 scalar tests); `cargo fmt` clean. Co-authored-by: Yuan Gao <yuang@xiaopeng.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…lance-format#7317) ## Summary `remap_index` — the catch-up that physically applies a fragment-reuse index after a deferred-remap compaction — applied the reuse index **one version at a time**, rebuilding the index file and committing **once per reuse version**. An index touched by K deferred compactions paid **K full index rebuilds + K commits** for a result identical to applying all K at once. This is worst exactly when the reuse index has accumulated many versions before a remap runs. ## Change Compose the whole chain and rebuild once: - **Row addresses:** `FragReuseIndex::remap_row_id` already chains every version (and passes through addresses a version does not touch), so mapping the union of all versions' keys yields a single **baseline → final** address map, applied in one rebuild. - **Coverage bitmap:** composed in one pass with the same all-or-nothing / straddle-error semantics (chaining is automatic — a version's new fragments are the next version's old fragments). `data_predates_version` is evaluated against the fixed baseline since there are no intermediate commits. - One `CreateIndex` commit instead of one per version. ## Why the composed map is not filtered by the fragment bitmap A tempting way to keep the composed map small is to drop keys whose fragment isn't in the index's current `fragment_bitmap`. That optimization is **not** safe — the per-version loop never did it, and this PR keeps it that way. In the sibling-coverage-remap case, remapping one index commits a manifest that coverage-remaps a *sibling* index's bitmap onto the new fragments and persists it *before* the sibling's own data is remapped. The sibling's on-disk bitmap then shows the new fragments while its data still holds old addresses. Filtering the map by that bitmap would drop exactly the keys the sibling needs, leaving an **empty** map — and `index::remap_index` treats an all-`None`/empty map as `RemapResult::Keep`, reusing the stale index files while the version is bumped and the reuse index trims, so the index would end up pointing at dead fragments. So the composed map maps every old address the reuse index touched; addresses an index doesn't store are simply never looked up (the map stays bounded by the rows the reuse index touched). ## Tests - `test_remap_index_batches_multiple_reuse_versions` — a multi-version reuse chain must rebuild + commit exactly once. - `test_cleanup_frag_reuse_index_multiple_indices` — extended with a post-remap data-correctness scan so it asserts each remapped index resolves to **live rows**, not just that versions advance and the reuse index trims. (A bitmap-filtered map would make the sibling index return 0 of 1000 rows here.) ``` cargo test -p lance --lib frag_reuse::tests # 3 passed cargo test -p lance --lib remap # 21 passed ```
…r_segments (lance-format#7339) ## What Make two existing scalar-index building blocks `pub` and re-export them from `index::scalar`: - `LogicalScalarIndex::try_new` — public constructor that merges several already-opened segments of one scalar index into a single searchable `ScalarIndex`. - `load_named_scalar_segments` — list the committed, dataset-intersecting segments of a named scalar index (length `1` = a single non-segmented index, `> 1` = an index split across multiple segments). ## Why A distributed query engine needs to (1) discover how many segments a named scalar index has and (2) open an explicit subset of those segments on each executor, then present them as one index. Both capabilities already exist inside lance — `load_named_scalar_segments` lists segments, and `LogicalScalarIndex` already unions per-segment search results and fragment coverage — they were just private. The actual "open this UUID subset" helper stays in the calling engine; it is pure glue over these two plus the already-public `Dataset::open_scalar_index`, so it does not need to live in lance. ## Notes - Purely additive. No behavior change to existing callers (`open_named_scalar_index` and `scalar_index_fragment_bitmap` already used both). - `index_intersects_dataset` and `Dataset::fragment_bitmap` remain private. - `cargo check`, `clippy`, and `fmt` clean. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…ce-format#7343) > [!WARNING] > Dependabot will stop supporting `python v3.9`! > > Please upgrade to one of the following versions: `v3.9`, `v3.10`, `v3.11`, `v3.12`, `v3.13`, or `v3.14`. > Bumps [lance-namespace](https://github.com/lance-format/lance-namespace) from 0.8.5 to 0.8.6. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/lance-format/lance-namespace/releases">lance-namespace's releases</a>.</em></p> <blockquote> <h2>v0.8.6</h2> <!-- raw HTML omitted --> <h2>What's Changed</h2> <h3>New Features 🎉</h3> <ul> <li>feat(spec): add source_task_size to RefreshMaterializedViewRequest by <a href="https://github.com/justinrmiller"><code>@justinrmiller</code></a> in <a href="https://redirect.github.com/lance-format/lance-namespace/pull/355">lance-format/lance-namespace#355</a></li> <li>feat(java): propagate source_task_size to generated Java clients by <a href="https://github.com/justinrmiller"><code>@justinrmiller</code></a> in <a href="https://redirect.github.com/lance-format/lance-namespace/pull/356">lance-format/lance-namespace#356</a></li> </ul> <h3>Bug Fixes 🐛</h3> <ul> <li>fix: pin central-publishing-maven-plugin to an existing version by <a href="https://github.com/brendanclement"><code>@brendanclement</code></a> in <a href="https://redirect.github.com/lance-format/lance-namespace/pull/354">lance-format/lance-namespace#354</a></li> </ul> <h2>New Contributors</h2> <ul> <li><a href="https://github.com/justinrmiller"><code>@justinrmiller</code></a> made their first contribution in <a href="https://redirect.github.com/lance-format/lance-namespace/pull/355">lance-format/lance-namespace#355</a></li> </ul> <p><strong>Full Changelog</strong>: <a href="https://github.com/lance-format/lance-namespace/compare/v0.8.5...v0.8.6">https://github.com/lance-format/lance-namespace/compare/v0.8.5...v0.8.6</a></p> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/lance-format/lance-namespace/commit/590a4eb7163a85f56e0622f9359efa33d5ad6941"><code>590a4eb</code></a> chore: release version 0.8.6</li> <li><a href="https://github.com/lance-format/lance-namespace/commit/f5ea0439fc81fc62e7344981321d6a83c4b5be9d"><code>f5ea043</code></a> feat(java): propagate source_task_size to generated Java clients (<a href="https://redirect.github.com/lance-format/lance-namespace/issues/356">#356</a>)</li> <li><a href="https://github.com/lance-format/lance-namespace/commit/89e0cab93528371403edc10aae0f7de55ade7415"><code>89e0cab</code></a> feat(spec): add source_task_size to RefreshMaterializedViewRequest (<a href="https://redirect.github.com/lance-format/lance-namespace/issues/355">#355</a>)</li> <li><a href="https://github.com/lance-format/lance-namespace/commit/be99cb674384e5ca54b7c1f4a8e71e388b3526e6"><code>be99cb6</code></a> fix: pin central-publishing-maven-plugin to an existing version (<a href="https://redirect.github.com/lance-format/lance-namespace/issues/354">#354</a>)</li> <li>See full diff in <a href="https://github.com/lance-format/lance-namespace/compare/v0.8.5...v0.8.6">compare view</a></li> </ul> </details> <br /> [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ance-format#7342) > [!WARNING] > Dependabot will stop supporting `python v3.9`! > > Please upgrade to one of the following versions: `v3.9`, `v3.10`, `v3.11`, `v3.12`, `v3.13`, or `v3.14`. > Bumps [geoarrow-rust-core](https://geoarrow.org/geoarrow-rs/) from 0.6.1 to 0.6.3. [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
This PR extends FTS stop-word handling so ICU tokenization can remove built-in stop words across supported languages, while non-ICU tokenizers continue to use the configured language and existing custom stop-word override behavior. It also fills the missing built-in stop-word lists for languages already exposed by `Language`, without changing the public API or index protobuf format.
lance-format#7215) ## What Adds `FreshTierWatermark { active_generation, active_batch_count }` and `LsmScanner::contains_pks_at` (with `fresh_tier_block_list` threading the watermark) so a caller can evaluate fresh-tier PK membership against the **exact tier a prior scan observed**, instead of the live tier. ## Why The WAL block-list runs as two independent RPCs (the read arm and a supersession check) that each snapshot the live fresh tier at their own call time. Under concurrent writes the two snapshots disagree, so a base row can be dropped as "superseded" by the check while the arm never delivered a replacement — a transient missing row. The fix pins both phases to the same watermark; this PR is the lance half that lets the check reconstruct the arm's snapshot. ## How the watermark works The active memtable is the only fresh-tier source that grows between two reads; everything strictly below its generation (frozen memtables, flushed generations) is immutable. So the as-of filter: - includes in-memory/flushed sources **below** `active_generation` whole (immutable, fully observed), - bounds the **active** generation to its first `active_batch_count` batches (by append index), - excludes in-memory sources **above** `active_generation` and flushed generations `>= active_generation` (produced after the snapshot). It uses only `batch_store.len()` and the memtable generation — both always available on the read path — and only ever *excludes* rows the scan did not observe, so a stale watermark under-counts (a tolerable stale read) rather than over-counts (which would drop a row with no replacement). > Note: an earlier approach keyed the watermark on per-batch WAL positions (`wal_batch_mapping`), but that map is only populated by `mark_wal_flushed`, which is test/bench-only — empty in production. The generation + batch-count watermark avoids any write-path dependency. ## Grace-period pinning A flushed memtable could otherwise be evicted between the two reads, collapsing its per-batch boundaries and turning the active-generation bound into a wholesale `>=` exclusion (a stale read). Frozen memtables are now retained in memory for a configurable grace period (`frozen_memtable_grace`, default 3s) after flush and swept by an existing dispatcher ticker. The grace must be **strictly larger than the maximum query elapsed time** to guarantee snapshot isolation: while pinned, the generation is served batch-resolved from memory; once evicted, no in-flight read references it so the `>=` exclusion is safe. ## Tests `fresh_tier_block_list` as-of unit tests: active-memtable batch-count bound, newer-gen-excluded / lower-gen-included-whole, flushed-gen at/above active excluded; collector suppression of a flushed generation pinned in memory; frozen retained during grace then swept. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
## Performance Improvement This PR removes redundant `PostingIterator::doc()` calls in the WAND lead/tail scoring paths without changing `PostingIterator`, `HeadPosting`, heap ordering, public APIs, docs, or wire formats. The existing `HeadPosting` doc id cache remains unchanged. The refactor reuses already-read `DocInfo` values where the iterator position is known to be the same: - reuse the first lead posting's `DocInfo` in `Wand::next` for doc length, first lead score, and the returned candidate - call `posting.doc()` once in `advance_tail_top` after `posting.next(target)` and reuse it for target matching and scoring - apply the same single-`doc()` pattern per tail posting in `advance_all_tail` ## Benchmark Fresh GCP VM run with `search-benchmark` FTS static suite, `lance_fts`, `match`, `query_length=5`, `k=10`, `num_queries=1000`, `prewarm_index=true`, `with_position=false`, dataset `gs://fts-bench/yang-db/wikipedia-bench-v2-20260327.lance`. Search-benchmark commit: `8d05d5680f2fbac38c1642380dc98b8a2bb2140f`. Measured baseline/candidate SHAs from the pre-rebase benchmark branch: - baseline: `8743fbaf9ce0dce15373629a7574e14d5c6c9367` - candidate: `6ac7b2c061d34142838c340fb2dbda509506da63` Results from one run: - QPS: 274.768 -> 279.352 (+1.668%) - avg latency: 28.936 ms -> 28.495 ms (-1.523%) - p50: 21.703 ms -> 22.259 ms (+2.562% slower) - p90: 59.845 ms -> 59.059 ms (-1.313%) - p99: 131.076 ms -> 113.052 ms (-13.751%) The benchmark was run before rebasing this PR onto the latest `main`; the PR commit contains the same WAND code change rebased onto `c4e65645d75b17c572fee62809b013b3730370bb`. ## Validation On the rebased PR branch: - `cargo fmt --all --check` - `cargo test -p lance-index scalar::inverted::wand` - `cargo check -p lance-index --tests` - `cargo clippy --all --tests --benches -- -D warnings`
…ry (lance-format#7284) ## Summary Evolve `FlushedMemTableCache` into the unified warm/open interface for mem_wal flushed generations, and populate the caches **before** a generation is queryable so the first query sees zero cold reads. - `FlushedMemTableCache` now owns a required `Session` (the index `CacheBackend` seam) and an optional read-through `WrappingObjectStore` (page cache), threading both into every open. `get_or_open(path)` drops its per-call session arg. - New `warm(path, pk_columns)`: open + `prewarm_all_indexes` (FTS) + `get_or_build_pk_hashes` (vector block-list), bounded by a semaphore and idempotent via a `warmed` gate. `open_flushed_dataset` fires a warm-on-open backstop. - `retain_paths` is now async and actively evicts retired generations' index objects via the new `Session::invalidate_index_prefix`; the byte cache is left to LRU. - `MemTableFlusher` warms each generation pre-commit, **best-effort** (logged on error, never blocks `update_manifest`), threaded via `ShardWriterConfig.flushed_cache`. This is the Lance-side building block for WAL-pod flushed-generation caching (consumed by sophon, which supplies the backed `Session` + read-through pool). ## Test plan - `cargo test -p lance --lib mem_wal::scanner::flushed_cache` (7 tests, incl. warm/idempotency/pk-hash/retain) — pass - `cargo test -p lance --lib mem_wal::memtable::flush` (8 tests) — pass - `cargo clippy -p lance --tests --benches` — clean - `cargo fmt --all` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This allows miniblock writers to use up to 32K logical values per chunk when explicitly configured via `LANCE_MINIBLOCK_MAX_VALUES`, while keeping the default at 4096. The file format already stores `log_num_values` in 4 bits, so the writer-side guard can allow values up to 15 without requiring the large-chunk metadata path. The compressed byte-size limits remain enforced. Fixes lance-format#7326.
## Performance Improvement ### What is the performance issue or bottleneck? WAND search eagerly collected term frequencies for each scored candidate before checking whether that candidate would enter the top-k heap. Candidates rejected by the current kth score still paid for a `Vec` allocation and `PostingIterator::doc()` calls. ### How does this PR improve performance? This moves term-frequency collection into the two branches that actually insert or replace a top-k candidate, for both WAND search and flat search. Rejected candidates now avoid collecting frequencies entirely. The PR also adds focused regression tests that count test-only term-frequency collection calls and verify rejected equal-score candidates do not collect frequencies. ### Benchmark `wikipedia-40m-fts`, V2 `text_idx`, `match`, `query_length=5`, `stop_words=0`, `k=100`, `num_queries=1000`, `seed=0`. | build | QPS | avg | p50 | p90 | p99 | |---|---:|---:|---:|---:|---:| | before `838f78b9` | 127.46 | 62.56ms | 57.20ms | 113.27ms | 223.31ms | | after `4c05e1c` | 130.34 | 61.17ms | 54.65ms | 112.63ms | 219.22ms | | delta | +2.26% | -2.22% | -4.46% | -0.56% | -1.83% | Raw artifacts: - suite: `/mnt/benchmark-ssd/search-benchmark/bench_suites/wikipedia_fts_wand_freqs_compare_k100_20260618.json` - log: `/mnt/benchmark-ssd/logs/wikipedia_fts_wand_freqs_compare_k100_v2_20260618_091950.log` - csv: `/mnt/benchmark-ssd/search-benchmark/results/results_fts_static_20260618_093407.csv` ### Validation - `cargo fmt --all` - `CARGO_TARGET_DIR=/tmp/lance-target-21fd-pr-clippy cargo clippy --all --tests --benches -- -D warnings` - `CARGO_TARGET_DIR=/tmp/lance-target-21fd-pr-clippy cargo test -p lance-index scalar::inverted::wand::tests::` - `git diff --check` - `/Users/yang/.cache/uv/archive-v0/CK_YxmMYMk7DlRLAQr3It/bin/python -mpre_commit run --files rust/lance-index/src/scalar/inverted/wand.rs -v`
Streams FM index training and update input into partition-sized writes instead of materializing the full training stream first. This keeps large segmented FM builds bounded by one partition plus the current Arrow batch while preserving the existing partition file layout.
Unreleased version after creating v8.0.0-rc.1
…e-format#7072) ## Problem When a dataset has a JSON column and multiple JSON indices are created on different JSON paths of that same column (e.g. one index on `$.a` and another on `$.b`), query routing is incorrect. A query like `json_extract(json, '$.b') = 'foo'` may hit the `$.a` index instead of the `$.b` index, producing wrong results. ## Root Cause `maybe_indexed_column` obtains a parser from `IndexInformationProvider::get_index()`, which returns a `&dyn ScalarQueryParser` pointing to a `MultiQueryParser` that aggregates all sub-parsers for that column. The flow was: 1. `get_index()` returns `MultiQueryParser` as `&dyn ScalarQueryParser` 2. `parser.is_valid_reference(expr, data_type)` is called — `MultiQueryParser`'s impl iterates children and returns `Some(DataType)` from the **first** child that accepts, but discards **which** child matched 3. The same `MultiQueryParser` is then used for `visit_eq` / `visit_between` etc., which also iterate children and return the first non-`None` result — potentially a **different** child than the one that validated the reference This means the query can be dispatched to the wrong JSON index (e.g. the `$.a` index for a `$.b` query). ## Fix - **Change `IndexInformationProvider::get_index`** to return `(&DataType, &MultiQueryParser)` instead of `(&DataType, &dyn ScalarQueryParser)`, so callers can interact with the `MultiQueryParser` directly - **Add `MultiQueryParser::select(expr, data_type)`** — iterates child parsers and returns `(&dyn ScalarQueryParser, DataType)` from the first child whose `is_valid_reference` accepts the expression, preserving **which** child matched - **Update `maybe_indexed_column`** to call `multi.select(expr, data_type)` instead of `parser.is_valid_reference(expr, data_type)`, obtaining the precise sub-parser for all subsequent operations ## Test Added regression test `test_multi_json_indices_route_by_path` that: - Creates a `MultiQueryParser` with two `JsonQueryParser` sub-parsers (for `$.a` and `$.b`) - Verifies `json_extract(json, '$.b') = 'foo'` resolves to the `json_b_idx` index - Verifies `json_extract(json, '$.a') = 'foo'` resolves to the `json_a_idx` index - Verifies `json_extract(json, '$.c') = 'foo'` (unindexed path) does not bind to any index
…rmat#7049) ## Summary - remove the dedicated finite_value_may_be_in_zone helper - rely on ScalarValue total ordering for finite values against NaN zonemap max values - add a focused assertion covering finite targets below a stored NaN max ## Tests - cargo test -p lance-index scalar::zonemap::tests::test_nan_zonemap_index -- --nocapture - cargo test -p lance dataset::scanner::test::test_inexact_scalar_index_plans -- --nocapture
…updates (lance-format#7341) The Java publish workflow built the macOS native lib *and* published the multi-platform JAR in a single job, so the macOS build was gated behind the Linux builds (`needs: [linux-arm64, linux-x86]`) even though only the publish step depends on the Linux artifacts. ## Workflow restructure (`java-publish.yml`) - **`build-linux`** — the two near-identical Debian 10 builds collapsed into a matrix (x86-64, arm64). - **`build-macos`** — build-only, no `needs`, runs concurrently with the Linux builds. - **`publish`** — a separate `ubuntu-latest` job that gathers all three native libs and packages/deploys with `-Dskip.build.jni=true`. No longer holds the expensive macOS runner during publish. Critical path goes from `linux → (macos build + publish)` to `max(linux, macos) → publish`. ## Drop no-op `-P shade-jar` That profile doesn't exist; shading is an always-on plugin bound to the `package` phase. The flag only produced a `could not be activated` warning. ## `cargo update --workspace` in release scripts The release scripts ran bare `cargo update` after the version bump, which refreshed *all* transitive dependencies rather than just the local crates being re-versioned. This swept incidental dependency bumps into release commits — see [this run](https://github.com/lance-format/lance/actions/runs/26624032632/job/78456375994#step:5:2496), where the lock update pulled in `brotli`, `hyper`, `jiff`, `zerocopy`, and ~14 others on top of the intended `lance-*` version changes. `--workspace` re-pins only the workspace crates whose versions changed. Applied to all three release scripts (`create_release_branch.sh`, `release_common.sh`, `publish_beta.sh`) for consistency. > [!NOTE] > The PR's `pull_request` trigger exercises the full build + dry-run path. The `cargo update --workspace` change only takes effect on the next release-tooling run. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…merge (lance-format#7320) Closes: lance-format#7230 Co-authored-by: zhangyue19921010 <zhangyue.1010@bytedance.com>
…st branch (lance-format#7246) ## Problem FTS `search()` combined with a `where(...)` prefilter on a `list<string>` / `large_list<large_string>` column silently drops matches when the query token sits at any position **other than the last** in a row's list. `.postfilter()` (FTS first, then filter) returns the correct rows. Reported as lancedb#3352 with a runnable Python repro. The plan is `MatchQuery > ScalarIndexQuery`, and the bug only surfaces when the planner picks the small-allowlist prefilter path (`index_comparisons ≈ allowlist size`): | Target row `keywords` | prefilter (default) | postfilter | |---|---|---| | `["needle", "synonym"]` | **0 rows (bug)** | 2 rows | | `["synonym", "needle"]` | 2 rows | 2 rows | ## Root cause A list column indexes every element as its own document, so one `row_id` owns several `doc_id`s: `DocSet.inv` (a `Vec<(row_id, doc_id)>` sorted by `row_id`) holds multiple entries per row. `DocSet::doc_id(row_id)` resolved a row to a **single** `doc_id` via `binary_search_by_key`, and its only caller is `Wand::flat_search`: the walk-the-allowlist prefilter branch. It therefore evaluated just one of the row's documents against the posting lists; when the query token lived in any other element, the row became a false negative. The regular WAND path is forward-driven (document -> `row_id`, with a per-document mask check), so it was always correct, only `flat_search` was affected, which is why the bug is specific to the prefilter branch. ## Fix - Replace `DocSet::doc_id` with `DocSet::doc_ids(row_id) -> impl Iterator`, which yields every `doc_id` in the contiguous equal-key run in `inv` (the legacy `row_id == doc_id` shape still resolves to a single document). - `flat_search` now expands each allow-listed `row_id` to **all** of its documents (`flat_map` over `doc_ids`) before sorting into doc-id order. This brings `flat_search` to parity with the WAND path, so it introduces no new duplicate-row behaviour: only documents actually present in the posting lists score. ## Tests - `test_doc_ids_resolves_every_document_a_row_owns`: unit coverage of the multi-valued resolution (list shape, legacy shape, and a missing row). - `test_flat_search_finds_list_row_with_match_at_non_last_position` (rstest, compressed + plain): reproduces the bug; it fails on the previous single-`doc_id` resolution and passes with the fix. All 143 `scalar::inverted` tests pass; `cargo fmt --all --check` and `cargo clippy -p lance-index --tests -- -D warnings` are clean. Closes lancedb#3352
Python 3.9 is reaching end-of-life. This removes it from CI test jobs and released binaries and makes Python 3.10 the minimum supported version. - Bump `requires-python` to `>=3.10` and drop the `Python :: 3.9` classifier in `python/pyproject.toml` and `memtest/pyproject.toml`. - Raise the PyO3 abi3 floor from `abi3-py39` to `abi3-py310`. - Update the `python.yml` test matrix (3.10 + 3.13) and the `pypi-publish.yml` release wheel matrices to build for 3.10. - Drop the now-redundant `python_version >= '3.10'` dependency markers and regenerate `uv.lock`. - Update `CONTRIBUTING.md` and the build-action input docs. `docker-compose.yml`'s `version: "3.9"` is the Compose file-format version (not Python) and is left unchanged. No `sys.version_info` checks for 3.9 exist in the source. Closes lance-format#7344 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds runtime SIMD dispatch to lance-linalg's hot distance kernels so a from-source build with a lower x86_64 baseline produces a working binary on pre-Haswell hardware (Sandy Bridge / Ivy Bridge / Steamroller). fast-by-default with a documented from-source override for legacy users.
Today,
import lancedbSIGILLs on AVX-without-AVX2 CPUs because the wheel bakes AVX2 into every compiled function with no runtime guard. numpy and pyarrow handle the same hardware via runtime dispatch. This PR brings lance to parity for the from-source legacy build path.Summary
match *SIMD_SUPPORT+mod x86 { #[target_feature(enable=...)] pub unsafe fn ... }shape asdot_u8.rs/cosine_u8.rs/l2_u8.rs. On the haswell baseline, dispatch always lands on AVX2 — modern compile output is unchanged from today.lance.simd_info()Python API mirroringpyarrow.runtime_info()for tier introspection.qemu-pre-haswellCI gate that builds withRUSTFLAGS="-C target-cpu=x86-64-v2"(env-var-scoped to that one job — workspace.cargo/config.tomlis unchanged) and runs lance-linalg tests under qemu Nehalem.CONTRIBUTING.mddocuments the legacy build:RUSTFLAGS="-C target-cpu=x86-64-v2" cargo build --release.Zero new external dependencies. Dispatch shape extends an idiom lance already uses for u8 and f16/bf16 kernels rather than introducing a new convention. Recent precedent: @justinrmiller's #6540, #6517, #6506, #6510.
Verification
cargo test -p lance-linalg --lib— 83/83 on aarch64 dev box.is_x86_feature_detected!()so each runs on hosts that can execute its tier.cargo clippy --all-targets -- -D warningsclean;cargo fmt --checkclean;Cargo.lockunchanged.RUSTFLAGSoverride (via companion lancedb wheel — seetobocop2/lancedb#2for the verification PASS output).change:lines once I find a host that can hold the fullcargo bench -p lance-linalg --bench {cosine,dot,l2,norm_l2}suite (Codespace's 30-min idle timeout killed my last attempt mid-run).Closes #1.