feat: add configurable vector-search distance metric #77
Merged
Vector search previously always ranked by Euclidean (L2) distance, with no way to choose a metric. Most modern embedding models are trained for cosine similarity, so for unnormalized vectors the hardcoded L2 default could silently return lower-quality results, and the returned `distance` did not match the caller's model.

Add a store-level `distance_metric` option (`l2` default, `cosine`, `dot`), threaded through every layer like the existing `id_index_type`. Search method signatures are unchanged; only the distance computation in `search_filtered_with_options` changes. All metrics are normalized to "smaller is better" so the ascending ranking is preserved.

Closes lance-format#74

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 12, 2026
dcfocus added a commit that referenced this pull request on Jun 12, 2026
## Summary

Persists the configurable vector-search `distance_metric` (added in #74/#77) in the dataset so it round-trips on `open` instead of being re-specified every time. **Closes #80.**

Before this change `distance_metric` was a runtime-only option: a caller had to pass the same metric on every `open`, and if they forgot, the store silently fell back to the default `l2` and ranked results differently from how the dataset was intended to be queried. `embedding_dim` is already recovered from the schema, but the metric was not.

## What changed

- **Persist on create**: the metric is written into the Lance **schema metadata** under `lance-context:distance_metric` (the same mechanism already used for `lance-encoding:blob`).
- **Recover on open**: `distance_metric_from_schema` reads it back and uses it as the store's metric, mirroring `embedding_dim_from_schema`.
- **Mismatch guard**: an explicitly passed metric that disagrees with the persisted one errors, reusing the existing `embedding_dim` mismatch-validation pattern in `open_with_options`.
- **Backward compatible**: datasets created before this change carry no key and default to `l2`.
- `ContextStoreOptions.distance_metric` is now `Option<DistanceMetric>` (`None` = use the persisted/default metric), matching `embedding_dim: Option<i32>`. Threaded through unified / server / PyO3 accordingly.

## Tests

- **Rust** (`store.rs`): `distance_metric_persists_across_reopen` (create `cosine`, reopen **without** the option → cosine ranking), a mismatch-errors test, and a unit test that a metadata-less schema defaults to `l2`.
- **Python** (`test_distance_metric.py`): reopen-without-option keeps cosine ranking; reopening with a conflicting metric raises.

## Verification

- `cargo test -p lance-context-core`: 41 passed
- `cargo clippy --all-targets`: clean; `cargo fmt --check`: clean
- `pytest python/tests/`: 109 passed (the 2 S3 failures are pre-existing and environmental: no live bucket); `ruff check` / `ruff format --check` clean

## Relationship

Follow-up to #74 / #77. Does not change the metric set or ranking math.
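The persist / recover / mismatch-guard flow described in this commit can be sketched in a few lines. This is a minimal illustration, not the store's code: it models the Lance schema metadata as a plain `dict`, and `metadata_for_create` and `resolve_on_open` are hypothetical helper names. Only the key `lance-context:distance_metric`, the `l2` default, and the name `distance_metric_from_schema` come from the commit message (the real function is Rust). The commit message does not say how an explicit metric is treated on a legacy dataset with no key; the sketch assumes the default `l2` counts as the persisted value.

```python
# Key name taken from the commit message; everything else here is illustrative.
METADATA_KEY = "lance-context:distance_metric"
DEFAULT_METRIC = "l2"
VALID_METRICS = ("l2", "cosine", "dot")


def metadata_for_create(requested=None):
    """Build the schema metadata written when a dataset is created (hypothetical helper)."""
    metric = requested if requested is not None else DEFAULT_METRIC
    if metric not in VALID_METRICS:
        raise ValueError(f"unknown distance metric: {metric!r}")
    return {METADATA_KEY: metric}


def distance_metric_from_schema(metadata):
    """Read the persisted metric back; datasets without the key default to l2."""
    return metadata.get(METADATA_KEY, DEFAULT_METRIC)


def resolve_on_open(metadata, requested=None):
    """Pick the metric on open, rejecting an explicit metric that disagrees (hypothetical helper)."""
    persisted = distance_metric_from_schema(metadata)
    # Assumption: on a legacy dataset the default l2 is treated as the persisted value.
    if requested is not None and requested != persisted:
        raise ValueError(
            f"distance_metric mismatch: dataset uses {persisted!r}, got {requested!r}"
        )
    return persisted


created = metadata_for_create("cosine")
reopened_metric = resolve_on_open(created)  # no option passed on open
legacy_metric = resolve_on_open({})         # dataset created before the key existed
```

Passing `None` on open mirrors the `Option<DistanceMetric>` change: the caller states a metric only when they want it validated against what the dataset already records.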
Summary
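- Vector search previously always ranked by Euclidean (L2) distance, with no way to choose a metric. For unnormalized vectors the hardcoded default could silently return lower-quality results, and the returned `distance` didn't match the caller's model.
- Add a store-level `distance_metric` option, one of `l2` (default), `cosine`, or `dot`, threaded through every layer exactly like the existing `id_index_type` option.
- Search method signatures are unchanged; only the distance computation in `search_filtered_with_options` changes. All metrics are normalized so smaller is better, preserving the existing ascending ranking (cosine = `1 - cosine_similarity` with a zero-norm guard; dot = negated inner product for max-inner-product search).
- Select the metric with `Context.create(uri, distance_metric="cosine")` (Python) or `distance_metric` on the REST `CreateContextRequest`.

Closes #74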
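To make the "smaller is better" normalization concrete, here is a minimal Python sketch of the three metrics and the ascending ranking. It is an illustration under stated assumptions, not the Rust `distance()` in `store.rs`: the function names are hypothetical, the L2 value is computed with a square root (whether the store uses squared L2 is not stated here; the ranking is the same either way), and the zero-norm guard is assumed to return `1.0`, i.e. a similarity of zero.

```python
import math


def distance(metric, a, b):
    """Return a score where smaller means closer, for every metric."""
    if metric == "l2":
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    dot = sum(x * y for x, y in zip(a, b))
    if metric == "dot":
        # Negate so that the largest inner product sorts first in ascending order.
        return -dot
    if metric == "cosine":
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        if norm_a == 0.0 or norm_b == 0.0:
            # Assumed guard value: treat a zero vector as having similarity 0.
            return 1.0
        return 1.0 - dot / (norm_a * norm_b)
    raise ValueError(f"unknown distance metric: {metric!r}")


def rank(metric, query, rows):
    """Sort (name, vector) rows ascending by distance and return the names."""
    ordered = sorted(rows, key=lambda row: distance(metric, query, row[1]))
    return [name for name, _ in ordered]


query = [1.0, 0.0]
rows = [
    ("long", [10.0, 0.0]),     # same direction as the query, much larger norm
    ("rotated", [1.0, 0.5]),   # close in space, slightly different direction
    ("zero", [0.0, 0.0]),      # exercises the zero-norm guard
]

l2_order = rank("l2", query, rows)
cosine_order = rank("cosine", query, rows)
dot_order = rank("dot", query, rows)
```

On this data L2 puts `rotated` first, while cosine and dot both put `long` first, which is the kind of reordering on unnormalized vectors that motivates the option.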
Layers touched
- `store.rs`, `lib.rs`, `record.rs`: `DistanceMetric` enum + `parse()` + `distance()`; field on `ContextStoreOptions` / `ContextStore`; documented metric-dependent `SearchResult.distance`.
- `lance-context-api`: `distance_metric` on `CreateContextRequest`.
- `distance_metric=` on `Context.create` / `__init__` (and `AsyncContext.create` via kwargs).

Testing
- `cargo test -p lance-context-core`: 34 passed, incl. new `search_metric_changes_ranking` (cosine/dot reorder vs L2 on identical data) and `distance_metric_parse_and_math`.
- `cargo clippy --workspace --all-targets`: clean.
- `cargo fmt --check`: clean.
- `uv run --project python --extra tests pytest`: new `python/tests/test_distance_metric.py` (4 tests) pass; full suite green except pre-existing S3 round-trip tests that require a live S3 endpoint (unrelated to this change).
- `ruff check` / `ruff format --check`: clean; `pyright`: 0 errors.

Notes
- The metric is a runtime-only option (it must be passed on every `open`, like `blob_columns` today); persisting it in dataset metadata is a possible follow-up.

🤖 Generated with Claude Code