
feat(lance-linalg): runtime SIMD dispatch for pre-Haswell x86_64 from-source builds#2

Draft
tobocop2 wants to merge 410 commits into main from fix/runtime-simd-multiversion


@tobocop2 tobocop2 commented Apr 26, 2026

Owner

Adds runtime SIMD dispatch to lance-linalg's hot distance kernels so a from-source build with a lower x86_64 baseline produces a working binary on pre-Haswell hardware (Sandy Bridge / Ivy Bridge / Steamroller). The default build stays fast, with a documented from-source override for legacy users.

Today, import lancedb SIGILLs on AVX-without-AVX2 CPUs because the wheel bakes AVX2 into every compiled function with no runtime guard. numpy and pyarrow handle the same hardware via runtime dispatch. This PR brings lance to parity for the from-source legacy build path.

Summary

  • 5-tier runtime SIMD dispatch (scalar / AVX / AVX+FMA / AVX2+FMA / AVX-512) on all 10 hot f32/f64 distance kernels in lance-linalg, using the same match *SIMD_SUPPORT + mod x86 { #[target_feature(enable=...)] pub unsafe fn ... } shape as dot_u8.rs / cosine_u8.rs / l2_u8.rs. On the haswell baseline, dispatch always lands on AVX2 — modern compile output is unchanged from today.
  • lance.simd_info() Python API mirroring pyarrow.runtime_info() for tier introspection.
  • qemu-pre-haswell CI gate that builds with RUSTFLAGS="-C target-cpu=x86-64-v2" (env-var-scoped to that one job — workspace .cargo/config.toml is unchanged) and runs lance-linalg tests under qemu Nehalem.
  • CONTRIBUTING.md documents the legacy build: RUSTFLAGS="-C target-cpu=x86-64-v2" cargo build --release.

Zero new external dependencies. Dispatch shape extends an idiom lance already uses for u8 and f16/bf16 kernels rather than introducing a new convention. Recent precedent: @justinrmiller's #6540, #6517, #6506, #6510.
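
The tier choice described in the summary can be modeled in a few lines. This is a hedged Python sketch of the decision only, not the Rust code in this PR: the tier names, the `select_tier` helper, and the exact feature set required per tier are assumptions made for illustration.

```python
# Hypothetical model of 5-tier dispatch: pick the best tier the CPU supports.
# The feature requirements per tier are assumptions, not taken from the PR.
TIERS = [
    ("avx512", {"avx512f"}),
    ("avx2_fma", {"avx2", "fma"}),
    ("avx_fma", {"avx", "fma"}),
    ("avx", {"avx"}),
    ("scalar", set()),
]


def select_tier(cpu_features):
    """Return the first (fastest) tier whose required features are all present."""
    available = set(cpu_features)
    for name, required in TIERS:
        if required <= available:
            return name
    return "scalar"
```

Under these assumptions an AVX-only CPU such as Sandy Bridge resolves to the `avx` tier rather than attempting AVX2 instructions, which is the failure mode the PR targets.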

Verification

  • cargo test -p lance-linalg --lib — 83/83 on aarch64 dev box.
  • 38 new proptest cases verifying scalar↔SIMD bit-for-bit equivalence per tier per kernel; gated on is_x86_feature_detected!() so each runs on hosts that can execute its tier.
  • cargo clippy --all-targets -- -D warnings clean; cargo fmt --check clean; Cargo.lock unchanged.
  • SIGILL gone on Sandy Bridge Xeon E5-2609 with the documented RUSTFLAGS override (via companion lancedb wheel — see tobocop2/lancedb#2 for the verification PASS output).
  • Modern-hardware bench delta still pending. The AVX2 path is preserved as one of the per-tier kernels and the workspace baseline still bakes AVX2 into surrounding code, so by construction the modern compile is unchanged — but I'll confirm with criterion change: lines once I find a host that can hold the full cargo bench -p lance-linalg --bench {cosine,dot,l2,norm_l2} suite (Codespace's 30-min idle timeout killed my last attempt mid-run).
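
For readers unfamiliar with the scalar↔SIMD equivalence checks, here is a rough Python sketch of the shape of such a test. It is not the PR's proptest code: `dot_scalar`, `dot_lanes`, and `check_equivalence` are hypothetical names, and the sketch sidesteps float reordering by using small integer-valued inputs, for which every partial sum is exact. How the real tests obtain bit-for-bit agreement is not stated here.

```python
import random


def dot_scalar(a, b):
    """Reference kernel: one running accumulator."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_lanes(a, b, lanes=8):
    """SIMD-shaped kernel: `lanes` independent accumulators plus a scalar tail."""
    acc = [0.0] * lanes
    body = len(a) - len(a) % lanes
    for i in range(0, body, lanes):
        for j in range(lanes):
            acc[j] += a[i + j] * b[i + j]
    total = sum(acc)
    for i in range(body, len(a)):
        total += a[i] * b[i]
    return total


def check_equivalence(trials=200, seed=0):
    """Compare both kernels on random integer-valued vectors (sums stay exact)."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randrange(0, 64)
        a = [float(rng.randint(-8, 8)) for _ in range(n)]
        b = [float(rng.randint(-8, 8)) for _ in range(n)]
        if dot_scalar(a, b) != dot_lanes(a, b):
            return False
    return True
```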

Closes #1.

@tobocop2 tobocop2 changed the title Runtime SIMD dispatch: 5-tier coverage from Nehalem 2008 through Sapphire Rapids 2023, full numpy/pyarrow parity fix(lance-linalg): SIGILL on pre-Haswell x86_64 — add runtime SIMD dispatch Apr 26, 2026
@tobocop2 tobocop2 changed the title fix(lance-linalg): SIGILL on pre-Haswell x86_64 — add runtime SIMD dispatch fix(lance-linalg): lancedb unusable on pre-Haswell x86_64 (import SIGILL) — add runtime SIMD dispatch Apr 26, 2026
tobocop2 added a commit that referenced this pull request Apr 26, 2026
…ch dot_u8.rs convention

Per-function doc comments on `*_scalar`/`*_avx`/`*_avx_fma`/`*_avx2`/`*_avx512`
inner functions are now one-liners matching the existing convention in
`dot_u8.rs` (see e.g. its `pub unsafe fn dot_u8_avx2` comment). Drops the
redundant "Caller must ensure..." precondition lines — those are implicit
from `unsafe fn` + `#[target_feature]`. Module-level `//!` docs and
public-API `///` docs are left detailed per project convention.

Refs #1, #2.
@tobocop2 tobocop2 force-pushed the fix/runtime-simd-multiversion branch from a3df856 to 9193496 Compare April 28, 2026 04:59
@tobocop2 tobocop2 changed the title fix(lance-linalg): lancedb unusable on pre-Haswell x86_64 (import SIGILL) — add runtime SIMD dispatch feat(lance-linalg): runtime SIMD dispatch for pre-Haswell x86_64 from-source builds Apr 28, 2026
@tobocop2 tobocop2 force-pushed the fix/runtime-simd-multiversion branch from 58325e8 to 7db5171 Compare April 28, 2026 05:43
@github-actions github-actions Bot added enhancement New feature or request python labels Apr 28, 2026
@tobocop2 tobocop2 force-pushed the fix/runtime-simd-multiversion branch from 7db5171 to 26aa7f4 Compare April 28, 2026 05:46
@github-actions github-actions Bot added the java label Apr 28, 2026
hamersaw and others added 15 commits May 21, 2026 16:48
…ance-format#6901)

## Summary

The background memtable flush handler
(`MemTableFlushHandler::flush_memtable`) called `flush`, which persists
the data file and bloom filter but **builds no secondary indexes**. The
handler never even received the shard's index configs.

Two query-side consequences over flushed generations:

- **Point lookups** (`point_lookup.rs`) run a `filter_expr` scan. Lance
can route that through a scalar index — but none existed, so lookups
fell back to a full scan (perf regression).
- **Vector search** (`vector_search.rs`) uses index-only
`fast_search()`. Its doc comment assumes "each flushed memtable has its
own vector index built during flush", which was false — so flushed
vector rows were invisible to KNN. This is a **correctness** bug, not
just perf.

## Changes

- Thread the shard's `index_configs` into `MemTableFlushHandler`.
- Call `flush_with_indexes` when any indexes are configured so each
flushed generation carries the same secondary indexes as the active
memtable; keep plain `flush` when none are configured to avoid a
needless dataset open.
- Box the `flush_with_indexes` future to keep the flush async block
under the type-layout recursion limit.

## Testing

- New `test_flushed_generation_is_indexed`: writes through the real
`ShardWriter` path, forces a flush, and asserts the flushed generation
(a) carries the BTree index and (b) resolves `id = 5` via
`ScalarIndexQuery` rather than a scan. HNSW/FTS flush and `fast_search`
over an indexed flushed generation are already covered by existing
tests.
- `cargo test -p lance --lib dataset::mem_wal::` — 316 passed, 0 failed.
- `cargo fmt --all` and `cargo clippy -p lance --tests -- -D warnings` —
clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rks (lance-format#6882)

Final piece of the lance-format#6856 split. Replaces the duplicated criterion-based
`mem_wal_read.rs` and `mem_wal_vector.rs` benchmarks with two standalone
CLI benchmarks that emit JSON output for panel-style trend analysis:

- `mem_wal_vector_bench`: KNN search across LSM levels with
deterministic synthetic 384-dim embeddings, IVF-RQ base table index, and
recall verification against brute-force ground truth.
- `mem_wal_point_lookup_bench`: PK-based point lookups across the base
table, flushed generations, and active memtable.

Both accept `--flushed-generations` and `--max-memtable-rows` for
sweeping the full matrix; results are written as individual JSON files.

This is the reusable bench template I want to extend for FTS
benchmarking. Depends on the LSM vector-search API (`with_dataset` /
`refine_factor`) landed in lance-format#6881, so it's the last of the three split
PRs.

Part of splitting lance-format#6856 into focused PRs. Co-authored with @jackye1995.

---------

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
…-format#6915)

The Lance-DuckDB documentation build is broken due to an incorrect URL.
This PR fixes that to render correctly in the sidebar via the `.pages`
card. The build scripts have been updated to only include the relevant
docs files from the `lance-duckdb/docs` folder.

Also adds a new `index.md` page that users will now land on, rather than
landing on the DataFusion page. The index page contains links to all the
other relevant integrations.
…ost-filter (lance-format#6899)

## What

Fixes a **stale read** in LSM MemWAL vector search: when a primary key
is updated and its *fresh* row falls out of its own source's top-k, the
superseded copy from an older generation could win the cross-source
dedup and be returned.

## Background

`LsmGlobalPkDedupExec` (introduced in lance-format#6881) is exact only over the
candidates each source *surfaces*. If a PK's fresh version is pushed out
of its source's top-k by closer rows, the dedup never sees it and cannot
suppress the stale copy from an older generation — so the stale row is
served. Repro:
`test_vector_search_stale_read_when_fresh_falls_out_of_top_k`.

## Approach

Make staleness a **per-source PK-hash post-filter** applied to each
source's KNN *before* the cross-source union, so a stale row never
reaches the merge.

- **Membership.** Each generation's membership is an `Arc<HashSet<u64>>`
of PK hashes (`compute_pk_hash` — the same hash the dedup nodes use).
Built once per generation; flushed generations' sets are cached on
`FlushedMemTableCache` and scanned **streaming** (one batch resident at
a time, no full PK-column buffer).
- **Per-source block set.** `compute_source_block_lists` gives each
source the membership sets of the generations newer than it — `NEWER(G)`
— as a `Vec<Arc<HashSet<u64>>>`, **referenced, never merged into a
per-query union**. Generations are **per-shard**, so the map is keyed
`(shard_id, generation)` and a source is only superseded by
strictly-newer generations of its *own* shard (the base table, shardless
and oldest, is blocked by every generation). The base table is **not
scanned** — it's filtered by hashing only its KNN candidates.
- **Execution.** `PkHashFilterExec` drops any candidate whose PK hash is
in any of its source's blocked sets. This handles only
**cross-generation** supersession: a PK in a newer generation makes
every copy of it stale, so dropping by hash needs no row address.
**Within-generation** duplicates (same PK twice in one generation) share
a hash and are left to the existing global dedup's `(generation,
freshness)` tiebreaker.
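
A minimal Python model of the per-source block sets and the post-filter may help make the approach concrete. It is a sketch under stated assumptions, not the Rust implementation: the dict-based `membership` map, the `blocked_sets` and `pk_hash_filter` helpers, and the use of `None` for the base table are all hypothetical, and PK hashes are given as plain integers rather than computed.

```python
def blocked_sets(source, membership):
    """Return the membership sets of generations that supersede `source`.

    `membership` maps (shard_id, generation) -> set of PK hashes.
    `source` is None for the base table, else (shard_id, generation).
    The sets are referenced, never merged into one union.
    """
    if source is None:
        # The base table is older than every generation of every shard.
        return list(membership.values())
    shard, gen = source
    return [s for (sh, g), s in membership.items() if sh == shard and g > gen]


def pk_hash_filter(candidates, blocked):
    """Drop any KNN candidate whose PK hash appears in a blocking set."""
    return [c for c in candidates if not any(c["pk_hash"] in s for s in blocked)]
```

In this model a generation of shard `s0` is only blocked by newer generations of `s0`, while base-table candidates are checked against every generation.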

## Configuration

`LsmVectorSearchPlanner::plan_search` exposes two knobs (Rust + Python +
Java bindings):

- **`overfetch_factor: f64`** — a single knob controlling *both* stale
filtering and over-fetch:
- `< 1.0` (e.g. `0.0`): stale filtering **off** (no block-list /
`PkHashFilterExec`; the global dedup still runs).
- `== 1.0` (**default**): filtering **on**, no over-fetch — a source
with superseded rows fetches exactly `k` and may return fewer than `k`
live rows.
- `> 1.0`: filtering **on**, over-fetch `ceil(k * factor)` so dropping
the stale rows still leaves `k`.

There is intentionally no separate on/off flag — over-fetch is only
meaningful while filtering, so the factor encodes both.
- **`refine_base_table: bool`** (replaces the old `refine_factor:
Option<u32>`) — re-rank the base arm's approximate index distances to
exact (factor 1). Auto-enabled whenever stale filtering runs (over-fetch
widens the base's candidate pool, which must be exact before the merge).
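
The three regimes of `overfetch_factor` can be summarized in a short Python sketch. The `plan_fetch` helper and its return shape are hypothetical, and fetching exactly `k` rows when filtering is off is an assumption; only the thresholds, the `ceil(k * factor)` rule, and the auto-enabled refine come from the description above.

```python
import math


def plan_fetch(k, overfetch_factor=1.0, refine_base_table=False):
    """Model how one knob encodes both stale filtering and over-fetch."""
    filtering = overfetch_factor >= 1.0
    # Assumption: with filtering off, each source simply fetches k rows.
    fetch = math.ceil(k * overfetch_factor) if filtering else k
    return {
        "stale_filtering": filtering,
        "fetch_per_source": fetch,
        # Refine is auto-enabled whenever stale filtering runs.
        "refine_base_table": refine_base_table or filtering,
    }
```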

## External API

`LsmScanner::contains_pks(&RecordBatch) -> Vec<bool>` — test which
primary keys have been (re)written in the WAL fresh tier (active +
frozen memtables + flushed generations), built like any query (construct
the scanner, then call). Hashing is internal, so callers never reproduce
`compute_pk_hash`.

## Scope / caveats

- **Post-filter, not a true prefilter.** It relies on over-fetch to
backfill dropped rows and does **not** guarantee `k` live results in the
adversarial case (more superseded rows near the query than the
over-fetch covers); `PkHashFilterExec` logs a per-source warning when
this happens. Promoting the same membership into the KNN as a true
prefilter (the index traverses until `k` rows pass, removing the
over-fetch) is the headline follow-up.
- **Within-generation top-k eviction** (a generation holds both a stale
and a fresh copy of a PK, and the fresh one is evicted) is *not*
pre-filtered — it shares a hash, so it can't be disambiguated by
membership. It's the same bug class as the cross-source case, currently
relying on the global dedup; closing it would need flush-time dedup (so
flushed generations are internally deduped like the base table). The
base table is assumed internally deduped.

## Tests

`test_vector_search_stale_read_when_fresh_falls_out_of_top_k` passes
(with positive + `overfetch_factor=0.0` toggle-off assertions); plus
over-fetch backfill, cross-flushed and composite-PK stale reads, same-L0
newest-wins, per-shard block-list isolation, and `PkHashFilterExec` /
membership unit tests (incl. null and composite PKs). `cargo fmt`,
`cargo clippy -p lance --tests -- -D warnings`, and the `lance` /
`lance-jni` / `pylance` crate checks are clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…at#6784)

Vector and FTS indexes already support fast search mode. This PR
enables fast search for scalar indexes as well.

```
dataset.create_scalar_index("filter", "BTREE")
fast = dataset.to_table(filter="filter >= 95", fast_search=True)
```
Only fragments covered by the scalar index can be scanned in fast search mode.

**Note:** Scalar index fast search is not supported for the legacy Lance
file version (`LanceFileVersion::Legacy`).

---------

Co-authored-by: zhangyue19921010 <zhangyue.1010@bytedance.com>
## Summary

Closes lance-format#6821.

- Extend `Scanner::nearest` for batched queries on fixed-size vector
columns (no separate `nearest_batch` API).
- Flat batch KNN in `KNNVectorDistanceExec`: one scan per data batch,
per-query top-k in a single stream (`m × k` rows max).
- Add `query_index` (0-based) so callers can group results per input
query.
- Indexed path (`use_index=true`): per-query ANN, union, and
`query_index` tagging when a vector index exists.
- `distance_range` applied before per-query top-k on the flat path;
indexed batch uses the same bounds.
- `fast_search` without an index returns an empty batch result that
still includes `query_index`.
- Batch vs multivector: `FixedSizeList` column + list-like query → batch
of single-vector queries; `List` multivector column → one multivector
query.
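
As an illustration of the flat path, the following Python sketch keeps one top-k heap per query while scanning each data batch once, then tags results with `query_index`. It is a toy model, not the `KNNVectorDistanceExec` code: `batch_flat_knn`, `l2_sq`, the squared-L2 metric, and the row layout are assumptions for illustration.

```python
import heapq


def l2_sq(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def batch_flat_knn(batches, queries, k):
    """Scan every batch once, keeping a bounded top-k heap per query."""
    heaps = [[] for _ in queries]  # min-heaps of (-distance, row_id)
    for batch in batches:
        for row_id, vec in batch:
            for qi, q in enumerate(queries):
                item = (-l2_sq(vec, q), row_id)
                if len(heaps[qi]) < k:
                    heapq.heappush(heaps[qi], item)
                elif item > heaps[qi][0]:
                    # Closer than the current worst candidate for this query.
                    heapq.heapreplace(heaps[qi], item)
    out = []
    for qi, heap in enumerate(heaps):
        for neg_d, row_id in sorted(heap, reverse=True):
            out.append({"query_index": qi, "_rowid": row_id, "_distance": -neg_d})
    return out
```

The output holds at most `m × k` rows for `m` queries, and the data is decoded once regardless of `m`.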

## API

| | |
| --- | --- |
| **Input** | List-like or 2-D query against a `FixedSizeList` embedding column |
| **Output** | Up to `k` rows per query vector + `query_index` |
| **Flat** | Shared scan/decode; requires `_rowid` |
| **Indexed** | One indexed search per query vector, then merge |

## Benchmark

Local disk, float32 vectors, `use_index=false`. OS page cache accepted.
Medians over 15 timed rounds (3 warmup); separate then batch each round.

```bash
cd python && uv run --extra benchmarks pytest python/benchmarks/test_search.py::test_batch_flat_knn
```

At small `m`, the second separate query often hits warm page cache, so
speedup is modest (~1.1×). It grows with `m` as scan work is shared.

### Query count (1M rows, dim 512, k 10)

| m | separate | batch | saved | speedup |
| --: | --: | --: | --: | --: |
| 2 | 227.48 ms | 208.46 ms | 19.02 ms | 1.09× |
| 5 | 559.10 ms | 328.84 ms | 230.26 ms | 1.70× |
| 10 | 1.1125 s | 536.48 ms | 576.01 ms | 2.07× |

### Dataset size (m 10, dim 512, k 10)

| rows | separate | batch | saved | speedup |
| --: | --: | --: | --: | --: |
| 100,000 | 123.84 ms | 53.23 ms | 70.62 ms | 2.33× |
| 500,000 | 566.20 ms | 261.00 ms | 305.20 ms | 2.17× |
| 1,000,000 | 1.1104 s | 520.86 ms | 589.50 ms | 2.13× |

## Test plan

- [x] `cargo test -p lance --lib test_batch_knn`
- [x] `cargo test -p lance fast_search_without`
- [x] `uv run pytest python/tests/test_vector_index.py -k batch`
- [x] `cargo clippy -p lance --tests -- -D warnings`
- [x] `uv run ruff format --check python/lance/dataset.py
python/tests/test_vector_index.py`

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: BubbleCal <bubble-cal@outlook.com>
## Summary

Fix `Planner::optimize_expr` so expression type coercion runs before
simplification. This matches DataFusion's normal plan pipeline where
analyzer type coercion runs before optimizer simplification.

This matters for immutable UDFs with literal arguments. With the
previous order, simplification could evaluate a constant UDF call before
`Int64` literals were coerced to the UDF's expected `Float64` inputs.
For geospatial filters, `st_point(0, 0)` could reach the geo UDF as
integer arrays and panic during the UDF's strict float downcast.
`st_point(0.0, 0.0)` worked because the literals were already `Float64`.

## Repro Demo

This is a standalone Python repro script. It is not added to the repo.

```python
import tempfile

import lance
import numpy as np
import pyarrow as pa
from geoarrow.rust.core import point, points


def run_filter(ds, expr):
    table = ds.to_table(filter=expr)
    values = table.to_pydict()
    print(f"{expr} -> {table.num_rows} rows: {values}")
    assert table.num_rows == 1


with tempfile.TemporaryDirectory() as tmpdir:
    point_array = points([np.array([1.0, 10.0]), np.array([2.0, 10.0])])
    schema = pa.schema([pa.field(point("xy")).with_name("point")])
    table = pa.Table.from_arrays([point_array], schema=schema)
    ds = lance.write_dataset(table, f"{tmpdir}/geo_filter.lance")

    run_filter(ds, "st_distance(point, st_point(0.0, 0.0)) < 5")
    run_filter(ds, "st_distance(point, st_point(0, 0)) < 5")
```

Observed before the fix:

```text
st_distance(point, st_point(0.0, 0.0)) < 5 -> 1 rows: {'point': [{'x': 1.0, 'y': 2.0}]}
thread 'lance_background_thread' panicked at .../arrow-array-58.3.0/src/cast.rs:849:33:
primitive array
RuntimeError: Task was aborted
```

Observed after the fix:

```text
st_distance(point, st_point(0.0, 0.0)) < 5 -> 1 rows: {'point': [{'x': 1.0, 'y': 2.0}]}
st_distance(point, st_point(0, 0)) < 5 -> 1 rows: {'point': [{'x': 1.0, 'y': 2.0}]}
```
OpenDAL-backed object stores route Lance's bounded `CloudObjectReader`
reads through `get_opts`, and `object_store_opendal` resolves those
requests with a `stat_with` before reading bytes.

This changes the bounded read path to use `get_ranges` and forwards
`get_ranges` through `DynamicOpenDalStore`, letting OpenDAL issue range
reads without the extra metadata call. Full-object reads and stream
paths still use `get_opts` where metadata and stream semantics are
needed.
…ion-vector-on-flush) (lance-format#6929)

Unfortunately, we still have an issue where we can serve stale data from
within a single source in the `mem_wal` implementation. The crux is that
we dedup globally AFTER we run the filter at each source. So if the
newer data does not satisfy the filter, then the algorithm maintains the
stale data.

In this PR we updated the active memtable flush path to make a single
pass over the in-memory data and compute a delete vector that gets
written on the flushed memtable. Since we read using the native
`LanceScanner`, which respects the delete vector, we essentially dedup the
flushed memtable without having to invalidate and recompute the indices
that are built during inserts on the active memtable. As part of this we
removed all of the now unnecessary `_generation` / `_freshness` field
addition and datafusion dedup nodes.
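
The ordering problem and the delete-vector fix can be reproduced with a toy Python model. This is a sketch, not the `mem_wal` code: rows are dicts in append order, and `filter_then_dedup`, `build_delete_vector`, and `scan` are hypothetical helpers.

```python
def filter_then_dedup(rows, predicate):
    """Old order: filter first, then keep the newest surviving row per PK."""
    newest = {}
    for row in rows:
        if predicate(row):
            newest[row["pk"]] = row  # later appends overwrite earlier ones
    return list(newest.values())


def build_delete_vector(rows):
    """Single pass at flush: mark every position superseded by a later append."""
    last_pos = {}
    for pos, row in enumerate(rows):
        last_pos[row["pk"]] = pos
    return {pos for pos, row in enumerate(rows) if last_pos[row["pk"]] != pos}


def scan(rows, deleted, predicate):
    """Scanner that skips deleted positions before applying the filter."""
    return [r for pos, r in enumerate(rows) if pos not in deleted and predicate(r)]
```

With an update that moves a row out of the filter, the first function still returns the stale version, while the scan over the delete vector does not.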

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FilteredReadExec could spend a pushed-down LIMIT on scalar-index matches
before evaluating a remaining refine filter, which allowed filtered
scans to return fewer rows than requested even when enough rows matched
the full predicate.

This keeps scalar-index range truncation behind the full-filter
exactness boundary: indexed candidates are still narrowed by the scalar
index, but scan_range_after_filter is only consumed early when no refine
filter remains. Regression coverage exercises both the FilteredReadExec
path and the scanner(filter, limit) end-to-end behavior.
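
The effect of spending the limit too early is easy to see in a small Python model. The two helpers below are hypothetical and only mimic the ordering of the steps; they are not the `FilteredReadExec` logic.

```python
from itertools import islice


def limit_then_refine(index_matches, refine, limit):
    """Buggy order: truncate index matches first, then apply the refine filter."""
    return [r for r in index_matches[:limit] if refine(r)]


def refine_then_limit(index_matches, refine, limit):
    """Fixed order: the limit counts only rows that pass the full predicate."""
    return list(islice((r for r in index_matches if refine(r)), limit))
```

For ten index matches, a refine filter that keeps even values, and a limit of 3, the first variant returns two rows and the second returns three.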

Fixes lancedb/lancedb#3436.
This updates the workspace `jieba-rs` dependency to 0.10.0 and refreshes
the resolved root and Python lockfiles.

The existing Lance Jieba tokenizer integration continues to use the same
upstream API surface, while the resolved `jieba-macros` package now
matches the new release. The Python Rust third-party license inventory
version entries were also updated to match the resolved dependencies.
jackye1995 and others added 30 commits June 17, 2026 11:24
…ce-format#7191)

## Summary

Adds forward-compatibility infrastructure to the directory-catalog
`__manifest` dataset, mirroring the Lance table format's reader/writer
feature flags but at the catalog-manifest layer.

- Persists two `u64` bitmasks in the `__manifest` dataset's
`table_metadata` (`lance.namespace.manifest.reader_feature_flags` /
`writer_feature_flags`). Absent keys parse as `0`, so every existing
manifest stays universally compatible.
- A build refuses to read or write a manifest that sets a flag it does
not understand, returning a clear "please upgrade" error instead of
misreading it. Reader and writer checks are enforced centrally: in the
manifest consistency wrapper, at catalog open, and on the copy-on-write
mutation path.
- Also stops the directory catalog from silently degrading to directory
listing when the manifest is incompatible — `build()` and the
per-operation fallbacks propagate the incompatibility instead of masking
it, so the check cannot be bypassed.

This is the **mechanism only**: no manifest feature is defined yet, so
the known masks are `0` and nothing is ever set — **zero behavior
change** today. It is the prerequisite so that a future `__manifest`
format change (e.g. a schema migration) can be shipped safely: that
change adds its bit to the known masks and stamps it on write, and from
then on older clients refuse the new format instead of misreading it.
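
A compact Python sketch of the compatibility check follows. It is illustrative only: `check_compatible` and `IncompatibleManifest` are hypothetical names, the writer key is assumed to share the reader key's prefix, and storing the masks as decimal strings is an assumption about the metadata encoding.

```python
READER_KEY = "lance.namespace.manifest.reader_feature_flags"
# Assumption: the writer key uses the same prefix as the reader key.
WRITER_KEY = "lance.namespace.manifest.writer_feature_flags"

KNOWN_READER_FLAGS = 0  # no manifest feature is defined yet
KNOWN_WRITER_FLAGS = 0


class IncompatibleManifest(Exception):
    """Raised when a manifest sets a flag this build does not understand."""


def check_compatible(table_metadata, key, known_mask):
    """Refuse the manifest if any unknown bit is set; absent keys parse as 0."""
    flags = int(table_metadata.get(key, "0"))
    unknown = flags & ~known_mask
    if unknown:
        raise IncompatibleManifest(
            f"{key} has unknown bits {unknown:#x}; please upgrade"
        )
    return flags
```

Because both known masks are `0` today, any manifest without the keys passes, which matches the "zero behavior change" claim.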
Co-authored-by: zhangyue19921010 <zhangyue.1010@bytedance.com>
…-reuse window (lance-format#7325)

## Summary

A deferred-remap compaction that materializes deletions writes a
fragment-reuse index (FRI) that the inverted (full-text-search) index is
read through at load time. The load-time path dropped the deleted rows
and renumbered the surviving `doc_id`s — but the posting lists reference
`doc_id`s **positionally** (a `doc_id` is an index into the `DocSet`'s
`row_ids` / `num_tokens` arrays, fixed at build time) and are not
regenerated at load. Dropping rows shifted every later `doc_id` out from
under the posting lists, so a query would index `num_tokens` / `row_ids`
out of bounds (panic) or score/return the wrong document.

This is deletion-specific: merge-only deferred compaction remaps every
row to `Some(new_addr)`, so nothing is dropped and positions stay
aligned. It only breaks when deletions are materialized (`remap_row_id`
returns `None`).

## Fix: tombstone-preserve-positions

In `DocSet::from_columns` (the FRI load path), instead of dropping
deleted rows:

- keep every doc slot so `doc_id`s stay aligned with the posting lists;
- put `RowAddress::TOMBSTONE_ROW` in the deleted slots, and leave them
out of the `inv` reverse map so a `row_id` lookup never resolves to a
deleted doc;
- keep `num_tokens` full-length, so `num_tokens(doc_id)` can't go out of
bounds.

In `Wand::search`, skip docs whose resolved `row_id` is `TOMBSTONE_ROW`
— placed right beside the existing prefilter-mask skip and using the
same iterator-advance, so a tombstoned doc is stepped over exactly like
a prefilter-rejected one and never surfaces in results.

The heavyweight physical remap (`DocSet::remap`) still does the real
renumber + compact (and rebuilds the posting lists to match); this
load-time path only needs to stay consistent until then.
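
The tombstone approach can be modeled in Python as follows. This is a sketch, not the `DocSet` or `Wand` code: `TOMBSTONE`, `load_docset`, and `search` are hypothetical stand-ins, and `remap` is any callable that returns a new row id or `None` for a deleted row.

```python
TOMBSTONE = -1  # hypothetical stand-in for RowAddress::TOMBSTONE_ROW


def load_docset(row_ids, num_tokens, remap):
    """Remap row ids while keeping every doc slot in place."""
    new_row_ids = []
    inv = {}  # row_id -> doc_id, live docs only
    for doc_id, row_id in enumerate(row_ids):
        new_id = remap(row_id)
        if new_id is None:
            new_row_ids.append(TOMBSTONE)  # slot kept, doc_ids stay aligned
        else:
            new_row_ids.append(new_id)
            inv[new_id] = doc_id
    return new_row_ids, list(num_tokens), inv


def search(posting_list, row_ids):
    """Resolve posting-list doc_ids to row ids, stepping over tombstones."""
    return [row_ids[d] for d in posting_list if row_ids[d] != TOMBSTONE]
```

Since no slot is removed, a posting list built before the compaction still indexes `row_ids` and `num_tokens` in bounds.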

### Note on stats

Tombstoned slots are still counted in `total_tokens` / `len()`, so BM25
`avgdl` in the FRI window is effectively the pre-deletion average. This
only perturbs *scores* slightly, never the result set, and the physical
remap restores exact stats. Excluding tombstones would require changing
`len()` semantics (used by `idf`), which isn't worth it for a transient
window.

## Test

`test_read_inverted_index_with_defer_index_remap_and_deletions`: delete
a prefix, deferred-compact, then assert FTS returns exactly the
surviving rows — both in the FRI window and after physical remap + trim.
Without the fix it panics on the out-of-bounds `num_tokens` access.

## Scope

Independent change against `main` — touches only the inverted-index load
and query paths (`scalar/inverted/{index,wand}.rs`) plus one new test.
The analogous IVF_HNSW desync under deletions (lance-format#3993) is **not**
addressed here: the HNSW graph traverses node ids positionally to
compute distances *during* search (not just at result collection), so it
needs a different approach across the SQ/PQ/flat/RQ storage load paths —
a separate change.
…ance-format#7285)

## Problem

The dataset fixtures in `python/benchmarks/test_search.py` pass
deprecated parameters to the dataset APIs:

- `lance.write_dataset(..., use_legacy_format=False)`
- `lance.dataset(..., index_cache_size=64 * 1024)`

The test config sets `filterwarnings = ['error::DeprecationWarning',
...]`, so these emit `DeprecationWarning` **as errors during fixture
setup**. As a result every benchmark in `test_search.py` errors out
before running:

```
DeprecationWarning: use_legacy_format is deprecated, use data_storage_version instead
DeprecationWarning: The 'index_cache_size' parameter is deprecated. Use 'index_cache_size_bytes' instead.
```

## Fix

Switch to the current parameters:

- `use_legacy_format=False` → `data_storage_version="stable"` (the exact
mapping the deprecation shim applies).
- `index_cache_size=64 * 1024` → `index_cache_size_bytes=512 * 1024 *
1024` (512 MiB comfortably caches these 100k-row IVF_PQ indices).
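
To see why a deprecated keyword aborts fixture setup, here is a self-contained Python sketch using only the standard `warnings` module. The `write_dataset` function below is a hypothetical stand-in, not the Lance API, and its mapping of `use_legacy_format=True` to `"legacy"` is an assumption.

```python
import warnings


def write_dataset(use_legacy_format=None, data_storage_version=None):
    """Hypothetical shim that warns on the deprecated keyword."""
    if use_legacy_format is not None:
        warnings.warn(
            "use_legacy_format is deprecated, use data_storage_version instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: True maps to "legacy"; False maps to "stable".
        data_storage_version = "legacy" if use_legacy_format else "stable"
    return data_storage_version


def fixture_errors(**kwargs):
    """Run the call under an 'error::DeprecationWarning' style filter."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        try:
            write_dataset(**kwargs)
        except DeprecationWarning:
            return True
    return False
```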

## Verification

```
$ uv run --group benchmarks pytest python/benchmarks/test_search.py::test_ann_no_refine --benchmark-only
test_ann_no_refine[clean]               541.48 us   1813 ops
test_ann_no_refine[with_delete_files]   770.46 us   1285 ops
test_ann_no_refine[with_new_rows]      2843.17 us    349 ops
3 passed
```

`ruff check` / `ruff format --check` clean.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…TS) (lance-format#7067)

## What

Fixes a stale-read phantom shared by the **vector** and **FTS** index
search arms over the MemWAL **active memtable**, and routes the
in-memory newest-per-PK / membership decisions through a single
maintained MVCC index.

## The bug

The active memtable is an append log; a PK update is a later append with
the same key. The in-memory secondary indexes — HNSW (vector) and the
inverted index (FTS) — are **append-only**, so an updated row's old
entries stay live. Both arms deduped with `WithinSourceDedupExec`, which
only suppresses a stale row when the fresh version is **also in the
result set**. When an update moves a row out of the query's match set
(vector: far from the query; FTS: new text no longer matches), the fresh
version isn't returned, so the stale version leaks. (`point_lookup` was
immune — it already did the MVCC recency seek.)

## The fix

Maintain a per-memtable **MVCC PK-position index**: a lock-free arena
skiplist keyed on `(compute_pk_hash(pk_columns), row_position)`, enabled
on the active memtable and carried through freeze. The row position *is*
the version stamp, so this reuses the exact primitive point-lookup
trusts (`get_newest_visible`).

- **`NewestPkFilterExec`** keeps an index hit iff
`get_newest_visible(pk_hash, max_visible) == row_position` —
predicate-independent, snapshot-exact (keys on the scanner's latched
`max_visible`). Wired into the active vector arm (replacing
`WithinSourceDedupExec`) and the FTS arm (adding `with_row_id`).
- **point_lookup** falls back to the index (hash + value-equality
collision guard) when no scalar BTree exists; its plan-path active arm
uses `SortExec(_rowid DESC).fetch(1)` instead of
`WithinSourceDedupExec`.
- **Cross-source block-list** probes the index per candidate
(`GenMembership::Index`, snapshot-bounded) with no per-query set;
flushed/base keep cached sets. `contains_pks` probes too.
- **Cleanup:** `WithinSourceDedupExec` / `DedupDirection` and the
per-query PK-hash set builders (`pk_hashes()`, `in_memory_pk_hashes`)
are deleted. Net negative LOC.

Hash keying covers single **and composite** PKs uniformly. The
snapshot-bounded probe also closes a latent over-block where a
not-yet-visible newer write could shadow an older visible copy.
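
The newest-visible probe can be sketched in Python with a sorted list standing in for the lock-free skiplist. This is a model under assumptions, not the Rust implementation: the `PkPositionIndex` class body, `newest_pk_filter`, and treating `max_visible` as an inclusive bound are all illustrative choices.

```python
import bisect


class PkPositionIndex:
    """Sorted (pk_hash, row_position) keys; a list replaces the skiplist here."""

    def __init__(self):
        self._keys = []

    def insert(self, pk_hash, row_position):
        bisect.insort(self._keys, (pk_hash, row_position))

    def get_newest_visible(self, pk_hash, max_visible):
        """Largest row_position <= max_visible for this pk_hash, or None."""
        i = bisect.bisect_right(self._keys, (pk_hash, max_visible))
        if i > 0 and self._keys[i - 1][0] == pk_hash:
            return self._keys[i - 1][1]
        return None


def newest_pk_filter(hits, index, max_visible):
    """Keep an index hit only if it is the newest visible version of its PK."""
    return [
        h for h in hits
        if index.get_newest_visible(h["pk_hash"], max_visible) == h["row_position"]
    ]
```

The filter does not depend on whether the fresh version matched the query, which is what removes the stale-read case described above.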

## Tests

Both `#[ignore]`d repros un-ignored and passing; new `PkPositionIndex`
unit tests, point-lookup-without-btree, index-sourced block-list, and
snapshot-bounded **vanished-row guard** tests (within- and
cross-source). Full `mem_wal` suite green; `cargo fmt` + `clippy -D
warnings` clean.

## Deferred follow-ups

- Migrate `MemTableDedupScanExec`'s reverse-walk `HashSet`
(filtered-read scan path) onto the same probe — the last within-source
mechanism off the index; benchmark-gated.
- In-graph HNSW within-gen eviction (perf end-game; correctness is now
exact).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ist columns (lance-format#7247)

Closes lance-format#5102

## Problem

A fixed-size-list column with dimension 0 panics with `attempt to divide
by zero`
(`rust/lance-encoding/src/data.rs`, `FixedSizeListBlock::num_values`).
As of pylance 7.0.0
the panic fires on **write** for every storage version
(`stable`/`2.1`/`2.2`), and reading
datasets persisted by older writers (which accepted such columns) panics
as well.

Reproduction details are in the issue comment:

lance-format#5102 (comment)

## Approach

Following the maintainer guidance in lance-format#5102 (error, not panic), this adds
two small guards at
boundaries that already return `Result`, instead of changing
`DataBlock::num_values()` to
return `Result` (the approach that made lance-format#5159 balloon across the whole
encoding crate):

1. **Write side**: `Schema::validate()` rejects zero-dimension
fixed-size-list fields
(including nested ones). `validate()` runs inside
`Schema::try_from(&ArrowSchema)`,
so every write entry point surfaces a clean schema error instead of a
panic. Writes
currently panic on every storage version, so no working flow changes
behavior.
2. **Read side (defensive)**: the structural and legacy field-scheduler
builders reject
zero-dimension fixed-size lists with an invalid-input error, so datasets
persisted by
old writers fail cleanly at scheduling time instead of crashing the
process.
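
The write-side guard amounts to a recursive schema walk. The Python sketch below models it with plain dicts; the field layout and the `validate_field` helper are hypothetical and do not mirror the `Schema::validate()` signature.

```python
def validate_field(field, path=""):
    """Reject zero-dimension fixed-size lists, including nested ones."""
    name = path + field["name"]
    if field["type"] == "fixed_size_list":
        if field["dimension"] == 0:
            raise ValueError(f"field '{name}': fixed-size list dimension must be > 0")
        validate_field(field["child"], name + ".")
    elif field["type"] == "struct":
        for child in field["children"]:
            validate_field(child, name + ".")
```

Running the check at a boundary that already returns an error keeps the failure a clean schema error rather than a division by zero deeper in the encoder.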

## How the guards sit in the data flow


![guards](https://raw.githubusercontent.com/DanielMao1/lance/pr-assets/zero-dim-fsl-guards.png)

Two facts that shape the design:

- `Schema::try_from(&ArrowSchema)` calls `validate()` internally and
every write path performs
this conversion, so guard 1 in one place covers all write entry points.
- Guard 2 exists because writers up to ~2026-04 could still persist
zero-dimension columns
under the `stable` (2.0) storage version; reading those files must not
crash the process.

## Tests

- `lance-core`: `Schema::try_from` rejects zero-dim FSL at top level and
nested in a struct;
  positive dimensions still validate.
- `lance-encoding`: the scheduler guard rejects zero-dim FSL, including
FSL-nested-in-FSL,
  and accepts positive dimensions.
- Python: parametrized over `legacy`/`stable`/`2.1`, `write_dataset` now
raises a clean
`OSError` (same mapping as other schema validation errors) instead of
`PanicException`.

Co-authored-by: Daniel Mao <danielmao@danieldeMacBook-Pro.local>
…ormat#7287)

## What
`BTreeIndex::search` rebuilt the full `col IN (...)` physical expression on
every page it touched. For a large IN-list spanning many pages this is
O(pages × values) -- the expression (and its hash set) is reconstructed per
page even though it is identical across pages.

This compiles the predicate once in `BTreeIndex::search` and reuses it across
pages via a new `FlatIndex::search_prebuilt`. Membership is O(1) per row
regardless of set size, so only the repeated build was wasted work. Resolves
the existing `// TODO` in `search_page`.
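
The shape of the change can be sketched without any Lance types. `InListPredicate` and `search_pages` below are hypothetical names used only for illustration; the real code compiles a physical expression rather than a bare hash set, but the cost argument is the same: build once, probe per row.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a compiled `col IN (...)` predicate.
struct InListPredicate {
    values: HashSet<i64>,
}

impl InListPredicate {
    /// Building the hash set is O(values); this is the step worth doing once.
    fn compile(values: &[i64]) -> Self {
        Self {
            values: values.iter().copied().collect(),
        }
    }

    /// Membership is O(1) per row regardless of set size.
    fn matches(&self, value: i64) -> bool {
        self.values.contains(&value)
    }
}

/// Scans every page with the same prebuilt predicate and returns the
/// matching (page index, row offset) pairs.
fn search_pages(pages: &[Vec<i64>], in_list: &[i64]) -> Vec<(usize, usize)> {
    // Built once here, not once per page inside the loop.
    let predicate = InListPredicate::compile(in_list);
    let mut hits = Vec::new();
    for (page_idx, page) in pages.iter().enumerate() {
        for (row_idx, value) in page.iter().enumerate() {
            if predicate.matches(*value) {
                hits.push((page_idx, row_idx));
            }
        }
    }
    hits
}
```

Moving `compile` inside the page loop would reproduce the O(pages × values) behavior described above.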

## Why
`col IN (<large set>)` is used to resolve big key sets to row ids. On a real
83M-row table, an `IsIn` of ~46K values took 13.4s, almost entirely per-page
expression construction.

## Result
Same table/query, index lookup **13.4s -> 3.2s**. Cost is now bounded by pages
touched + rows scanned, independent of IN-list size (a local 80K-value lookup
over a multi-page index runs ~130ms and is flat in the value count).

## Notes
- No public API or behavior change -- the predicate and its evaluation are
  identical, just built once instead of per page.
- `cargo test -p lance-index` passes (279 scalar tests); `cargo fmt` clean.

Co-authored-by: Yuan Gao <yuang@xiaopeng.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…lance-format#7317)

## Summary

`remap_index` — the catch-up that physically applies a fragment-reuse
index after a deferred-remap compaction — applied the reuse index **one
version at a time**, rebuilding the index file and committing **once per
reuse version**. An index touched by K deferred compactions paid **K
full index rebuilds + K commits** for a result identical to applying all
K at once. This is worst exactly when the reuse index has accumulated
many versions before a remap runs.

## Change

Compose the whole chain and rebuild once:

- **Row addresses:** `FragReuseIndex::remap_row_id` already chains every
version (and passes through addresses a version does not touch), so
mapping the union of all versions' keys yields a single **baseline →
final** address map, applied in one rebuild.
- **Coverage bitmap:** composed in one pass with the same all-or-nothing
/ straddle-error semantics (chaining is automatic — a version's new
fragments are the next version's old fragments). `data_predates_version`
is evaluated against the fixed baseline since there are no intermediate
commits.
- One `CreateIndex` commit instead of one per version.
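
A minimal sketch of the row-address composition, assuming each reuse version can be modeled as a map from old address to an optional new address (`None` meaning the row was deleted). `VersionMap`, `remap_through_all`, and `compose` are hypothetical names; in the PR the chaining itself is done by `FragReuseIndex::remap_row_id`.

```rust
use std::collections::HashMap;

/// One reuse version: old row address -> new address (`None` = row deleted).
type VersionMap = HashMap<u64, Option<u64>>;

/// Chains `addr` through every version in order. A version that does not
/// touch the address passes it through unchanged.
fn remap_through_all(versions: &[VersionMap], addr: u64) -> Option<u64> {
    let mut current = addr;
    for version in versions {
        match version.get(&current) {
            Some(Some(next)) => current = *next,
            Some(None) => return None, // deleted by this version
            None => {}                 // untouched: pass through
        }
    }
    Some(current)
}

/// Maps the union of all versions' keys to their final addresses, giving a
/// single baseline -> final map that can be applied in one rebuild.
fn compose(versions: &[VersionMap]) -> VersionMap {
    let mut composed = VersionMap::new();
    for version in versions {
        for &key in version.keys() {
            composed.insert(key, remap_through_all(versions, key));
        }
    }
    composed
}
```

Keys that only exist as intermediate addresses also end up in the composed map; an index that never stored them simply never looks them up.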

## Why the composed map is not filtered by the fragment bitmap

A tempting way to keep the composed map small is to drop keys whose
fragment isn't in the index's current `fragment_bitmap`. That
optimization is **not** safe — the per-version loop never did it, and
this PR keeps it that way.

In the sibling-coverage-remap case, remapping one index commits a
manifest that coverage-remaps a *sibling* index's bitmap onto the new
fragments and persists it *before* the sibling's own data is remapped.
The sibling's on-disk bitmap then shows the new fragments while its data
still holds old addresses. Filtering the map by that bitmap would drop
exactly the keys the sibling needs, leaving an **empty** map — and
`index::remap_index` treats an all-`None`/empty map as
`RemapResult::Keep`, reusing the stale index files while the version is
bumped and the reuse index trims, so the index would end up pointing at
dead fragments.

So the composed map maps every old address the reuse index touched;
addresses an index doesn't store are simply never looked up (the map
stays bounded by the rows the reuse index touched).

## Tests

- `test_remap_index_batches_multiple_reuse_versions` — a multi-version
reuse chain must rebuild + commit exactly once.
- `test_cleanup_frag_reuse_index_multiple_indices` — extended with a
post-remap data-correctness scan so it asserts each remapped index
resolves to **live rows**, not just that versions advance and the reuse
index trims. (A bitmap-filtered map would make the sibling index return
0 of 1000 rows here.)

```
cargo test -p lance --lib frag_reuse::tests   # 3 passed
cargo test -p lance --lib remap               # 21 passed
```
…r_segments (lance-format#7339)

## What

Make two existing scalar-index building blocks `pub` and re-export them
from `index::scalar`:

- `LogicalScalarIndex::try_new` — public constructor that merges several
already-opened segments of one scalar index into a single searchable
`ScalarIndex`.
- `load_named_scalar_segments` — list the committed,
dataset-intersecting segments of a named scalar index (length `1` = a
single non-segmented index, `> 1` = an index split across multiple
segments).

## Why

A distributed query engine needs to (1) discover how many segments a
named scalar index has and (2) open an explicit subset of those segments
on each executor, then present them as one index. Both capabilities
already exist inside lance — `load_named_scalar_segments` lists
segments, and `LogicalScalarIndex` already unions per-segment search
results and fragment coverage — they were just private.

The actual "open this UUID subset" helper stays in the calling engine;
it is pure glue over these two plus the already-public
`Dataset::open_scalar_index`, so it does not need to live in lance.

## Notes

- Purely additive. No behavior change to existing callers
(`open_named_scalar_index` and `scalar_index_fragment_bitmap` already
used both).
- `index_intersects_dataset` and `Dataset::fragment_bitmap` remain
private.
- `cargo check`, `clippy`, and `fmt` clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…ce-format#7343)

Bumps [lance-namespace](https://github.com/lance-format/lance-namespace)
from 0.8.5 to 0.8.6.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/lance-format/lance-namespace/releases">lance-namespace's
releases</a>.</em></p>
<blockquote>
<h2>v0.8.6</h2>
<!-- raw HTML omitted -->
<h2>What's Changed</h2>
<h3>New Features 🎉</h3>
<ul>
<li>feat(spec): add source_task_size to RefreshMaterializedViewRequest
by <a
href="https://github.com/justinrmiller"><code>@​justinrmiller</code></a>
in <a
href="https://redirect.github.com/lance-format/lance-namespace/pull/355">lance-format/lance-namespace#355</a></li>
<li>feat(java): propagate source_task_size to generated Java clients by
<a
href="https://github.com/justinrmiller"><code>@​justinrmiller</code></a>
in <a
href="https://redirect.github.com/lance-format/lance-namespace/pull/356">lance-format/lance-namespace#356</a></li>
</ul>
<h3>Bug Fixes 🐛</h3>
<ul>
<li>fix: pin central-publishing-maven-plugin to an existing version by
<a
href="https://github.com/brendanclement"><code>@​brendanclement</code></a>
in <a
href="https://redirect.github.com/lance-format/lance-namespace/pull/354">lance-format/lance-namespace#354</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a
href="https://github.com/justinrmiller"><code>@​justinrmiller</code></a>
made their first contribution in <a
href="https://redirect.github.com/lance-format/lance-namespace/pull/355">lance-format/lance-namespace#355</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/lance-format/lance-namespace/compare/v0.8.5...v0.8.6">https://github.com/lance-format/lance-namespace/compare/v0.8.5...v0.8.6</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/lance-format/lance-namespace/commit/590a4eb7163a85f56e0622f9359efa33d5ad6941"><code>590a4eb</code></a>
chore: release version 0.8.6</li>
<li><a
href="https://github.com/lance-format/lance-namespace/commit/f5ea0439fc81fc62e7344981321d6a83c4b5be9d"><code>f5ea043</code></a>
feat(java): propagate source_task_size to generated Java clients (<a
href="https://redirect.github.com/lance-format/lance-namespace/issues/356">#356</a>)</li>
<li><a
href="https://github.com/lance-format/lance-namespace/commit/89e0cab93528371403edc10aae0f7de55ade7415"><code>89e0cab</code></a>
feat(spec): add source_task_size to RefreshMaterializedViewRequest (<a
href="https://redirect.github.com/lance-format/lance-namespace/issues/355">#355</a>)</li>
<li><a
href="https://github.com/lance-format/lance-namespace/commit/be99cb674384e5ca54b7c1f4a8e71e388b3526e6"><code>be99cb6</code></a>
fix: pin central-publishing-maven-plugin to an existing version (<a
href="https://redirect.github.com/lance-format/lance-namespace/issues/354">#354</a>)</li>
<li>See full diff in <a
href="https://github.com/lance-format/lance-namespace/compare/v0.8.5...v0.8.6">compare
view</a></li>
</ul>
</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ance-format#7342)

Bumps [geoarrow-rust-core](https://geoarrow.org/geoarrow-rs/) from 0.6.1
to 0.6.3.



Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
This PR extends FTS stop-word handling so ICU tokenization can remove
built-in stop words across supported languages, while non-ICU tokenizers
continue to use the configured language and existing custom stop-word
override behavior. It also fills the missing built-in stop-word lists
for languages already exposed by `Language`, without changing the public
API or index protobuf format.
lance-format#7215)

## What

Adds `FreshTierWatermark { active_generation, active_batch_count }` and
`LsmScanner::contains_pks_at` (with `fresh_tier_block_list` threading
the watermark) so a caller can evaluate fresh-tier PK membership against
the **exact tier a prior scan observed**, instead of the live tier.

## Why

The WAL block-list runs as two independent RPCs (the read arm and a
supersession check) that each snapshot the live fresh tier at their own
call time. Under concurrent writes the two snapshots disagree, so a base
row can be dropped as "superseded" by the check while the arm never
delivered a replacement — a transient missing row. The fix pins both
phases to the same watermark; this PR is the lance half that lets the
check reconstruct the arm's snapshot.

## How the watermark works

The active memtable is the only fresh-tier source that grows between two
reads; everything strictly below its generation (frozen memtables,
flushed generations) is immutable. So the as-of filter:

- includes in-memory/flushed sources **below** `active_generation` whole
(immutable, fully observed),
- bounds the **active** generation to its first `active_batch_count`
batches (by append index),
- excludes in-memory sources **above** `active_generation` and flushed
generations `>= active_generation` (produced after the snapshot).

It uses only `batch_store.len()` and the memtable generation — both
always available on the read path — and only ever *excludes* rows the
scan did not observe, so a stale watermark under-counts (a tolerable
stale read) rather than over-counts (which would drop a row with no
replacement).
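
The three rules above can be written as one small decision function. This is a sketch under stated assumptions: `FreshTierWatermark` mirrors the two fields named in this PR, while `Source`, `Visibility`, and `visibility` are hypothetical names invented for illustration and are not the scanner's real types.

```rust
/// Mirrors the two fields named in the PR description.
#[derive(Debug, Clone, Copy)]
struct FreshTierWatermark {
    active_generation: u64,
    active_batch_count: usize,
}

/// Hypothetical description of one fresh-tier source.
#[derive(Debug, Clone, Copy)]
enum Source {
    InMemory { generation: u64, batch_count: usize },
    Flushed { generation: u64 },
}

/// How much of a source is visible as of the watermark.
#[derive(Debug, PartialEq)]
enum Visibility {
    Whole,
    /// Only the first N batches (by append index).
    FirstBatches(usize),
    Excluded,
}

fn visibility(source: Source, wm: FreshTierWatermark) -> Visibility {
    match source {
        Source::InMemory { generation, batch_count } => {
            if generation < wm.active_generation {
                // Immutable and fully observed by the earlier scan.
                Visibility::Whole
            } else if generation == wm.active_generation {
                // Bound to the batches that existed at snapshot time.
                Visibility::FirstBatches(batch_count.min(wm.active_batch_count))
            } else {
                // Produced after the snapshot.
                Visibility::Excluded
            }
        }
        Source::Flushed { generation } => {
            if generation < wm.active_generation {
                Visibility::Whole
            } else {
                Visibility::Excluded
            }
        }
    }
}
```

Every branch either keeps what the earlier scan saw or drops more, which is why a stale watermark can only under-count.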

> Note: an earlier approach keyed the watermark on per-batch WAL
positions (`wal_batch_mapping`), but that map is only populated by
`mark_wal_flushed`, which is test/bench-only — empty in production. The
generation + batch-count watermark avoids any write-path dependency.

## Grace-period pinning

A flushed memtable could otherwise be evicted between the two reads,
collapsing its per-batch boundaries and turning the active-generation
bound into a wholesale `>=` exclusion (a stale read). Frozen memtables
are now retained in memory for a configurable grace period
(`frozen_memtable_grace`, default 3s) after flush and swept by an
existing dispatcher ticker. The grace must be **strictly larger than the
maximum query elapsed time** to guarantee snapshot isolation: while
pinned, the generation is served batch-resolved from memory; once
evicted, no in-flight read references it so the `>=` exclusion is safe.

## Tests

`fresh_tier_block_list` as-of unit tests: active-memtable batch-count
bound, newer-gen-excluded / lower-gen-included-whole, flushed-gen
at/above active excluded; collector suppression of a flushed generation
pinned in memory; frozen retained during grace then swept.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
## Performance Improvement

This PR removes redundant `PostingIterator::doc()` calls in the WAND
lead/tail scoring paths without changing `PostingIterator`,
`HeadPosting`, heap ordering, public APIs, docs, or wire formats.

The existing `HeadPosting` doc id cache remains unchanged. The refactor
reuses already-read `DocInfo` values where the iterator position is
known to be the same:

- reuse the first lead posting's `DocInfo` in `Wand::next` for doc
length, first lead score, and the returned candidate
- call `posting.doc()` once in `advance_tail_top` after
`posting.next(target)` and reuse it for target matching and scoring
- apply the same single-`doc()` pattern per tail posting in
`advance_all_tail`

## Benchmark

Fresh GCP VM run with `search-benchmark` FTS static suite, `lance_fts`,
`match`, `query_length=5`, `k=10`, `num_queries=1000`,
`prewarm_index=true`, `with_position=false`, dataset
`gs://fts-bench/yang-db/wikipedia-bench-v2-20260327.lance`.

Search-benchmark commit: `8d05d5680f2fbac38c1642380dc98b8a2bb2140f`.

Measured baseline/candidate SHAs from the pre-rebase benchmark branch:

- baseline: `8743fbaf9ce0dce15373629a7574e14d5c6c9367`
- candidate: `6ac7b2c061d34142838c340fb2dbda509506da63`

Results from one run:

- QPS: 274.768 -> 279.352 (+1.668%)
- avg latency: 28.936 ms -> 28.495 ms (-1.523%)
- p50: 21.703 ms -> 22.259 ms (+2.562% slower)
- p90: 59.845 ms -> 59.059 ms (-1.313%)
- p99: 131.076 ms -> 113.052 ms (-13.751%)

The benchmark was run before rebasing this PR onto the latest `main`;
the PR commit contains the same WAND code change rebased onto
`c4e65645d75b17c572fee62809b013b3730370bb`.

## Validation

On the rebased PR branch:

- `cargo fmt --all --check`
- `cargo test -p lance-index scalar::inverted::wand`
- `cargo check -p lance-index --tests`
- `cargo clippy --all --tests --benches -- -D warnings`
…ry (lance-format#7284)

## Summary

Evolve `FlushedMemTableCache` into the unified warm/open interface for
mem_wal flushed generations, and populate the caches **before** a
generation is queryable so the first query sees zero cold reads.

- `FlushedMemTableCache` now owns a required `Session` (the index
`CacheBackend` seam) and an optional read-through `WrappingObjectStore`
(page cache), threading both into every open. `get_or_open(path)` drops
its per-call session arg.
- New `warm(path, pk_columns)`: open + `prewarm_all_indexes` (FTS) +
`get_or_build_pk_hashes` (vector block-list), bounded by a semaphore and
idempotent via a `warmed` gate. `open_flushed_dataset` fires a
warm-on-open backstop.
- `retain_paths` is now async and actively evicts retired generations'
index objects via the new `Session::invalidate_index_prefix`; the byte
cache is left to LRU.
- `MemTableFlusher` warms each generation pre-commit, **best-effort**
(logged on error, never blocks `update_manifest`), threaded via
`ShardWriterConfig.flushed_cache`.

This is the Lance-side building block for WAL-pod flushed-generation
caching (consumed by sophon, which supplies the backed `Session` +
read-through pool).

## Test plan

- `cargo test -p lance --lib mem_wal::scanner::flushed_cache` (7 tests,
incl. warm/idempotency/pk-hash/retain) — pass
- `cargo test -p lance --lib mem_wal::memtable::flush` (8 tests) — pass
- `cargo clippy -p lance --tests --benches` — clean
- `cargo fmt --all`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This allows miniblock writers to use up to 32K (2^15 = 32,768) logical values
per chunk when explicitly configured via `LANCE_MINIBLOCK_MAX_VALUES`, while
keeping the default at 4096.

The file format already stores `log_num_values` in 4 bits, so the
writer-side guard can allow `log_num_values` up to 15 without requiring the
large-chunk metadata path. The compressed byte-size limits remain
enforced.
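
The arithmetic behind the limit, as a small sketch. The function names are hypothetical, and the power-of-two requirement in `validate_chunk_values` is an assumption made for illustration (it follows from storing only the logarithm); the actual writer guard may differ in detail.

```rust
/// Width of the `log_num_values` field, as stated above.
const LOG_NUM_VALUES_BITS: u32 = 4;

/// Largest exponent a 4-bit field can hold: 2^4 - 1 = 15.
fn max_log_num_values() -> u32 {
    (1 << LOG_NUM_VALUES_BITS) - 1
}

/// Largest chunk size expressible with that exponent: 2^15 = 32768.
fn max_values_per_chunk() -> u64 {
    1u64 << max_log_num_values()
}

/// Hypothetical guard: accepts a power-of-two chunk size whose exponent fits
/// in the field and returns that exponent.
fn validate_chunk_values(requested: u64) -> Result<u32, String> {
    if !requested.is_power_of_two() {
        return Err(format!("{requested} is not a power of two"));
    }
    let log = requested.trailing_zeros();
    if log > max_log_num_values() {
        return Err(format!("2^{log} exceeds the 4-bit log_num_values field"));
    }
    Ok(log)
}
```

The default of 4096 corresponds to an exponent of 12, so it already fits with room to spare.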

Fixes lance-format#7326.
## Performance Improvement

### What is the performance issue or bottleneck?

WAND search eagerly collected term frequencies for each scored candidate
before checking whether that candidate would enter the top-k heap.
Candidates rejected by the current kth score still paid for a `Vec`
allocation and `PostingIterator::doc()` calls.

### How does this PR improve performance?

This moves term-frequency collection into the two branches that actually
insert or replace a top-k candidate, for both WAND search and flat
search. Rejected candidates now avoid collecting frequencies entirely.

The PR also adds focused regression tests that count test-only
term-frequency collection calls and verify rejected equal-score
candidates do not collect frequencies.
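
A minimal sketch of the deferred collection, assuming integer scores (only so the tuple is `Ord`) and a caller-supplied `collect_freqs` closure standing in for the per-candidate frequency lookup. `top_k` and `Candidate` are hypothetical names, not the WAND implementation.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// (score, doc id, term frequencies).
type Candidate = (u32, u64, Vec<u32>);

/// Keeps the `k` best-scoring candidates from `scored` (doc id, score) pairs.
/// `collect_freqs` runs only for candidates that actually enter the heap.
fn top_k<F>(scored: &[(u64, u32)], k: usize, mut collect_freqs: F) -> Vec<Candidate>
where
    F: FnMut(u64) -> Vec<u32>,
{
    if k == 0 {
        return Vec::new();
    }
    // `Reverse` turns the max-heap into a min-heap, so the current k-th best
    // candidate sits at the top.
    let mut heap: BinaryHeap<Reverse<Candidate>> = BinaryHeap::new();
    for &(doc_id, score) in scored {
        if heap.len() < k {
            // Insert branch: the candidate is kept, so collect now.
            heap.push(Reverse((score, doc_id, collect_freqs(doc_id))));
        } else if heap.peek().map_or(false, |Reverse(kth)| score > kth.0) {
            // Replace branch: evict the current k-th best, then collect.
            heap.pop();
            heap.push(Reverse((score, doc_id, collect_freqs(doc_id))));
        }
        // Otherwise the candidate is rejected (equal scores included) and
        // no frequencies are collected for it.
    }
    let mut out: Vec<Candidate> = heap.into_iter().map(|Reverse(c)| c).collect();
    out.sort_by(|a, b| b.cmp(a)); // highest score first
    out
}
```

Counting invocations of the closure is one way to assert, as the regression tests here do, that rejected candidates pay nothing.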

### Benchmark

`wikipedia-40m-fts`, V2 `text_idx`, `match`, `query_length=5`,
`stop_words=0`, `k=100`, `num_queries=1000`, `seed=0`.

| build | QPS | avg | p50 | p90 | p99 |
|---|---:|---:|---:|---:|---:|
| before `838f78b9` | 127.46 | 62.56ms | 57.20ms | 113.27ms | 223.31ms |
| after `4c05e1c` | 130.34 | 61.17ms | 54.65ms | 112.63ms | 219.22ms |
| delta | +2.26% | -2.22% | -4.46% | -0.56% | -1.83% |

Raw artifacts:

- suite:
`/mnt/benchmark-ssd/search-benchmark/bench_suites/wikipedia_fts_wand_freqs_compare_k100_20260618.json`
- log:
`/mnt/benchmark-ssd/logs/wikipedia_fts_wand_freqs_compare_k100_v2_20260618_091950.log`
- csv:
`/mnt/benchmark-ssd/search-benchmark/results/results_fts_static_20260618_093407.csv`

### Validation

- `cargo fmt --all`
- `CARGO_TARGET_DIR=/tmp/lance-target-21fd-pr-clippy cargo clippy --all
--tests --benches -- -D warnings`
- `CARGO_TARGET_DIR=/tmp/lance-target-21fd-pr-clippy cargo test -p
lance-index scalar::inverted::wand::tests::`
- `git diff --check`
- `/Users/yang/.cache/uv/archive-v0/CK_YxmMYMk7DlRLAQr3It/bin/python
-mpre_commit run --files rust/lance-index/src/scalar/inverted/wand.rs
-v`
Streams FM index training and update input into partition-sized writes
instead of materializing the full training stream first. This keeps
large segmented FM builds bounded by one partition plus the current
Arrow batch while preserving the existing partition file layout.
Unreleased version after creating v8.0.0-rc.1
…e-format#7072)

## Problem

When a dataset has a JSON column and multiple JSON indices are created
on different JSON paths of that same column (e.g. one index on `$.a` and
another on `$.b`), query routing is incorrect. A query like
`json_extract(json, '$.b') = 'foo'` may hit the `$.a` index instead of
the `$.b` index, producing wrong results.

## Root Cause

`maybe_indexed_column` obtains a parser from
`IndexInformationProvider::get_index()`, which returns a `&dyn
ScalarQueryParser` pointing to a `MultiQueryParser` that aggregates all
sub-parsers for that column.

The flow was:

1. `get_index()` returns `MultiQueryParser` as `&dyn ScalarQueryParser`
2. `parser.is_valid_reference(expr, data_type)` is called —
`MultiQueryParser`'s impl iterates children and returns `Some(DataType)`
from the **first** child that accepts, but discards **which** child
matched
3. The same `MultiQueryParser` is then used for `visit_eq` /
`visit_between` etc., which also iterate children and return the first
non-`None` result — potentially a **different** child than the one that
validated the reference

This means the query can be dispatched to the wrong JSON index (e.g. the
`$.a` index for a `$.b` query).

## Fix

- **Change `IndexInformationProvider::get_index`** to return
`(&DataType, &MultiQueryParser)` instead of `(&DataType, &dyn
ScalarQueryParser)`, so callers can interact with the `MultiQueryParser`
directly
- **Add `MultiQueryParser::select(expr, data_type)`** — iterates child
parsers and returns `(&dyn ScalarQueryParser, DataType)` from the first
child whose `is_valid_reference` accepts the expression, preserving
**which** child matched
- **Update `maybe_indexed_column`** to call `multi.select(expr,
data_type)` instead of `parser.is_valid_reference(expr, data_type)`,
obtaining the precise sub-parser for all subsequent operations
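
The routing idea can be sketched with a deliberately tiny trait. `QueryParser`, `JsonPathParser`, `MultiParser`, and `route_eq` are hypothetical simplifications; the real `ScalarQueryParser` works on expressions and data types, not path strings. What the sketch keeps is the key property: `select` returns the child that validated the reference, and that same child handles the comparison.

```rust
/// Hypothetical, heavily simplified stand-in for a scalar query parser.
trait QueryParser {
    /// True if this parser's index covers the referenced JSON path.
    fn is_valid_reference(&self, path: &str) -> bool;
    /// Name of the index an equality on this parser is routed to.
    fn visit_eq(&self, value: &str) -> Option<String>;
}

struct JsonPathParser {
    index_name: String,
    path: String,
}

impl QueryParser for JsonPathParser {
    fn is_valid_reference(&self, path: &str) -> bool {
        self.path == path
    }

    // Like the situation described above, the visit itself does not re-check
    // the path, so it must be called on the right child.
    fn visit_eq(&self, _value: &str) -> Option<String> {
        Some(self.index_name.clone())
    }
}

struct MultiParser {
    children: Vec<Box<dyn QueryParser>>,
}

impl MultiParser {
    /// Returns the first child that accepts the reference, preserving
    /// *which* child matched.
    fn select(&self, path: &str) -> Option<&dyn QueryParser> {
        for child in &self.children {
            if child.is_valid_reference(path) {
                return Some(child.as_ref());
            }
        }
        None
    }
}

/// Validates and visits through the same child parser.
fn route_eq(multi: &MultiParser, path: &str, value: &str) -> Option<String> {
    let parser = multi.select(path)?;
    parser.visit_eq(value)
}
```

Calling `visit_eq` on the first child that returns `Some`, instead of on the selected child, would send a `$.b` query to the `$.a` index in this sketch.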

## Test

Added regression test `test_multi_json_indices_route_by_path` that:
- Creates a `MultiQueryParser` with two `JsonQueryParser` sub-parsers
(for `$.a` and `$.b`)
- Verifies `json_extract(json, '$.b') = 'foo'` resolves to the
`json_b_idx` index
- Verifies `json_extract(json, '$.a') = 'foo'` resolves to the
`json_a_idx` index
- Verifies `json_extract(json, '$.c') = 'foo'` (unindexed path) does not
bind to any index
…rmat#7049)

## Summary
- Remove the dedicated `finite_value_may_be_in_zone` helper.
- Rely on `ScalarValue` total ordering for finite values against NaN
  zonemap max values.
- Add a focused assertion covering finite targets below a stored NaN max.

## Tests
- `cargo test -p lance-index scalar::zonemap::tests::test_nan_zonemap_index -- --nocapture`
- `cargo test -p lance dataset::scanner::test::test_inexact_scalar_index_plans -- --nocapture`
…updates (lance-format#7341)

The Java publish workflow built the macOS native lib *and* published the
multi-platform JAR in a single job, so the macOS build was gated behind
the Linux builds (`needs: [linux-arm64, linux-x86]`) even though only
the publish step depends on the Linux artifacts.

## Workflow restructure (`java-publish.yml`)

- **`build-linux`** — the two near-identical Debian 10 builds collapsed
into a matrix (x86-64, arm64).
- **`build-macos`** — build-only, no `needs`, runs concurrently with the
Linux builds.
- **`publish`** — a separate `ubuntu-latest` job that gathers all three
native libs and packages/deploys with `-Dskip.build.jni=true`. No longer
holds the expensive macOS runner during publish.

Critical path goes from `linux → (macos build + publish)` to `max(linux,
macos) → publish`.

## Drop no-op `-P shade-jar`

That profile doesn't exist; shading is an always-on plugin bound to the
`package` phase. The flag only produced a `could not be activated`
warning.

## `cargo update --workspace` in release scripts

The release scripts ran bare `cargo update` after the version bump,
which refreshed *all* transitive dependencies rather than just the local
crates being re-versioned. This swept incidental dependency bumps into
release commits — see [this
run](https://github.com/lance-format/lance/actions/runs/26624032632/job/78456375994#step:5:2496),
where the lock update pulled in `brotli`, `hyper`, `jiff`, `zerocopy`,
and ~14 others on top of the intended `lance-*` version changes.
`--workspace` re-pins only the workspace crates whose versions changed.
Applied to all three release scripts (`create_release_branch.sh`,
`release_common.sh`, `publish_beta.sh`) for consistency.

> [!NOTE]
> The PR's `pull_request` trigger exercises the full build + dry-run
path. The `cargo update --workspace` change only takes effect on the
next release-tooling run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…merge (lance-format#7320)

Closes: lance-format#7230

Co-authored-by: zhangyue19921010 <zhangyue.1010@bytedance.com>
…st branch (lance-format#7246)

## Problem

FTS `search()` combined with a `where(...)` prefilter on a
`list<string>` / `large_list<large_string>` column silently drops
matches when the query token sits at any position **other than the
last** in a row's list. `.postfilter()` (FTS first, then filter) returns
the correct rows.

Reported as lancedb#3352 with a runnable Python repro. The plan is
`MatchQuery > ScalarIndexQuery`, and the bug only surfaces when the
planner picks the small-allowlist prefilter path (`index_comparisons ≈
allowlist size`):

| Target row `keywords` | prefilter (default) | postfilter |
|---|---|---|
| `["needle", "synonym"]` | **0 rows (bug)** | 2 rows |
| `["synonym", "needle"]` | 2 rows | 2 rows |

## Root cause

A list column indexes every element as its own document, so one `row_id`
owns several `doc_id`s: `DocSet.inv` (a `Vec<(row_id, doc_id)>` sorted
by `row_id`) holds multiple entries per row.

`DocSet::doc_id(row_id)` resolved a row to a **single** `doc_id` via
`binary_search_by_key`, and its only caller is `Wand::flat_search`, the
walk-the-allowlist prefilter branch. It therefore evaluated just one of
the row's documents against the posting lists; when the query token lived
in any other element, the row became a false negative.

The regular WAND path is forward-driven (document -> `row_id`, with a
per-document mask check), so it was always correct. Only `flat_search`
was affected, which is why the bug is specific to the prefilter branch.

## Fix

- Replace `DocSet::doc_id` with `DocSet::doc_ids(row_id) -> impl
Iterator`, which yields every `doc_id` in the contiguous equal-key run
in `inv` (the legacy `row_id == doc_id` shape still resolves to a single
document).
- `flat_search` now expands each allow-listed `row_id` to **all** of its
documents (`flat_map` over `doc_ids`) before sorting into doc-id order.

This brings `flat_search` to parity with the WAND path, so it introduces
no new duplicate-row behaviour: only documents actually present in the
posting lists score.
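
A minimal sketch of resolving every document a row owns, assuming `inv` is a slice of `(row_id, doc_id)` pairs sorted by `row_id` as described above. The free function and its integer types are illustrative choices, not the `DocSet` signature.

```rust
/// Yields every `doc_id` owned by `row_id`. `inv` must be sorted by `row_id`,
/// so all entries for one row form a contiguous run.
fn doc_ids(inv: &[(u64, u32)], row_id: u64) -> impl Iterator<Item = u32> + '_ {
    // First index whose row_id is >= the target.
    let start = inv.partition_point(|&(r, _)| r < row_id);
    // First index whose row_id is > the target.
    let end = inv.partition_point(|&(r, _)| r <= row_id);
    inv[start..end].iter().map(|&(_, doc_id)| doc_id)
}
```

A plain binary search for the key returns one arbitrary element of the run, which is how a single-`doc_id` lookup can miss the element that holds the query token; the two `partition_point` calls bound the whole run instead, and a missing row yields an empty iterator.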

## Tests

- `test_doc_ids_resolves_every_document_a_row_owns`: unit coverage of
the multi-valued resolution (list shape, legacy shape, and a missing
row).
- `test_flat_search_finds_list_row_with_match_at_non_last_position`
(rstest, compressed + plain): reproduces the bug; it fails on the
previous single-`doc_id` resolution and passes with the fix.

All 143 `scalar::inverted` tests pass; `cargo fmt --all --check` and
`cargo clippy -p lance-index --tests -- -D warnings` are clean.

Closes lancedb#3352
Python 3.9 is reaching end-of-life. This removes it from CI test jobs
and released binaries and makes Python 3.10 the minimum supported
version.

- Bump `requires-python` to `>=3.10` and drop the `Python :: 3.9`
classifier in `python/pyproject.toml` and `memtest/pyproject.toml`.
- Raise the PyO3 abi3 floor from `abi3-py39` to `abi3-py310`.
- Update the `python.yml` test matrix (3.10 + 3.13) and the
`pypi-publish.yml` release wheel matrices to build for 3.10.
- Drop the now-redundant `python_version >= '3.10'` dependency markers
and regenerate `uv.lock`.
- Update `CONTRIBUTING.md` and the build-action input docs.

`docker-compose.yml`'s `version: "3.9"` is the Compose file-format
version (not Python) and is left unchanged. No `sys.version_info` checks
for 3.9 exist in the source.

Closes lance-format#7344

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

import lance / lancedb crashes with SIGILL on x86_64 CPUs without AVX2 (Sandy Bridge, Ivy Bridge, FX-7500-class AMDs)