feat: add configurable vector-search distance metric by dcfocus · Pull Request #77 · lance-format/lance-context

dcfocus · 2026-06-12T02:45:49Z

Summary

Vector search was hardcoded to Euclidean (L2) distance. Most modern embedding models (OpenAI, Cohere, Voyage, most sentence-transformers) are trained for cosine similarity, so for unnormalized vectors the L2 default could silently return lower-quality results and the returned distance didn't match the caller's model.
Adds a store-level distance_metric option — l2 (default), cosine, dot — threaded through every layer exactly like the existing id_index_type option.
Search method signatures are unchanged; only the distance computation inside search_filtered_with_options changes. All metrics are normalized so smaller is better, preserving the existing ascending ranking (cosine = 1 - cosine_similarity with a zero-norm guard; dot = negated inner product for max-inner-product search).
Default stays L2 → fully backward-compatible. Opt in via Context.create(uri, distance_metric="cosine") (Python) or distance_metric on the REST CreateContextRequest.
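The "smaller is better" normalization above can be sketched in plain Python. This is a minimal illustration, not the code in store.rs: the function name `distance`, the use of unsquared Euclidean distance for `l2`, and the value returned by the zero-norm guard (1.0) are assumptions. The PR only states that cosine is `1 - cosine_similarity` with a zero-norm guard and that dot is the negated inner product.

```python
import math
from typing import Sequence


def distance(metric: str, a: Sequence[float], b: Sequence[float]) -> float:
    """Return a 'smaller is better' distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    if metric == "l2":
        # Euclidean distance. Assumed unsquared here; squaring is monotonic,
        # so it would not change the ranking.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if metric == "cosine":
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        if norm_a == 0.0 or norm_b == 0.0:
            # Zero-norm guard: similarity is undefined, so treat it as 0
            # (assumed behaviour), which gives a distance of 1.
            return 1.0
        return 1.0 - dot / (norm_a * norm_b)
    if metric == "dot":
        # Negate so that a larger inner product sorts first in ascending order.
        return -dot
    raise ValueError(f"unknown distance metric: {metric!r}")
```

Because every branch returns a value where smaller means closer, a caller can keep sorting results in ascending order regardless of which metric the store was created with.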

Closes #74

Layers touched

Core (store.rs, lib.rs, record.rs): DistanceMetric enum + parse() + distance(); field on ContextStoreOptions/ContextStore; documented metric-dependent SearchResult.distance.
API DTO (lance-context-api): distance_metric on CreateContextRequest.
Unified + Server: parse string → enum, return 400 on invalid value.
PyO3 + Python API: distance_metric= on Context.create/__init__ (and AsyncContext.create via kwargs).
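The "parse string → enum, return 400 on invalid value" step can be sketched as follows. Everything here is hypothetical and for illustration only: `parse_distance_metric`, `handle_create_context`, the 200 success status, the error body shape, and the exact-lowercase matching are assumptions, not the server's actual code.

```python
from enum import Enum
from typing import Optional, Tuple


class DistanceMetric(Enum):
    L2 = "l2"
    COSINE = "cosine"
    DOT = "dot"


def parse_distance_metric(value: Optional[str]) -> DistanceMetric:
    """Map the request string to the enum; a missing value means the L2 default."""
    if value is None:
        return DistanceMetric.L2
    try:
        return DistanceMetric(value)
    except ValueError:
        allowed = ", ".join(m.value for m in DistanceMetric)
        raise ValueError(
            f"invalid distance_metric {value!r}; expected one of: {allowed}"
        ) from None


def handle_create_context(request: dict) -> Tuple[int, dict]:
    """Hypothetical handler returning (status_code, body)."""
    try:
        metric = parse_distance_metric(request.get("distance_metric"))
    except ValueError as err:
        # Invalid metric is a client error, so report it as 400.
        return 400, {"error": str(err)}
    return 200, {"distance_metric": metric.value}
```

Parsing once at the boundary means the inner layers only ever see a valid enum value, which is the same shape the PR describes for the Unified and Server layers.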

Testing

cargo test -p lance-context-core — 34 passed, incl. new search_metric_changes_ranking (cosine/dot reorder vs L2 on identical data) and distance_metric_parse_and_math.
cargo clippy --workspace --all-targets — clean.
cargo fmt --check — clean.
uv run --project python --extra tests pytest — new python/tests/test_distance_metric.py (4 tests) pass; full suite green except pre-existing S3 round-trip tests that require a live S3 endpoint (unrelated to this change).
ruff check / ruff format --check — clean; pyright — 0 errors.
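The idea behind `search_metric_changes_ranking` can be illustrated with a small standalone sketch. The vectors, record names, and helper functions below are made up for illustration and are not taken from the test suite; the cosine zero-norm fallback of 1.0 is likewise an assumption.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    denom = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
    # Zero-norm guard (assumed to fall back to a distance of 1.0).
    return 1.0 if denom == 0.0 else 1.0 - _dot(a, b) / denom


def neg_dot(a, b):
    return -_dot(a, b)


def rank(metric_fn, query, records):
    """Ascending sort: the smallest distance comes first."""
    ordered = sorted(records.items(), key=lambda kv: metric_fn(query, kv[1]))
    return [name for name, _ in ordered]


query = [1.0, 0.0]
records = {
    "close_off_axis": [1.0, 0.5],  # near the query, slightly different direction
    "far_on_axis": [5.0, 0.0],     # same direction as the query, much longer
    "opposite": [-1.0, 0.0],       # opposite direction
}

l2_order = rank(l2, query, records)
cosine_order = rank(cosine, query, records)
dot_order = rank(neg_dot, query, records)
```

On this data L2 puts `close_off_axis` first, while cosine and dot both put `far_on_axis` first, since direction (and for dot, magnitude) matters more than absolute position.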

Notes

The metric is runtime-only (re-specified on open, like blob_columns today); persisting it in dataset metadata is a possible follow-up.
Overlaps the option-plumbing structs with #73 (Make embedding dimension configurable, currently hardcoded to 1536), so a trivial adjacent-field merge conflict is expected for whichever lands second.

🤖 Generated with Claude Code

Vector search previously always ranked by Euclidean (L2) distance, with no way to choose a metric. Most modern embedding models are trained for cosine similarity, so for unnormalized vectors the hardcoded L2 default could silently return lower-quality results and the returned `distance` did not match the caller's model.

Add a store-level `distance_metric` option (`l2` default, `cosine`, `dot`), threaded through every layer like the existing `id_index_type`. Search method signatures are unchanged; only the distance computation in `search_filtered_with_options` changes. All metrics are normalized to "smaller is better" so the ascending ranking is preserved.

Closes lance-format#74

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

## Summary

Persists the configurable vector-search `distance_metric` (added in #74/#77) in the dataset so it round-trips on `open` instead of being re-specified every time. **Closes #80.**

Before this change `distance_metric` was a runtime-only option: a caller had to pass the same metric on every `open`, and if they forgot, the store silently fell back to the default `l2` and ranked results differently from how the dataset was intended to be queried. `embedding_dim` already recovers from the schema, but the metric did not.

## What changed

- **Persist on create**: the metric is written into the Lance **schema metadata** under `lance-context:distance_metric` (the same mechanism already used for `lance-encoding:blob`).
- **Recover on open**: `distance_metric_from_schema` reads it back and uses it as the store's metric, mirroring `embedding_dim_from_schema`.
- **Mismatch guard**: an explicitly passed metric that disagrees with the persisted one errors, reusing the existing `embedding_dim` mismatch-validation pattern in `open_with_options`.
- **Backward compatible**: datasets created before this change carry no key and default to `l2`.
- `ContextStoreOptions.distance_metric` is now `Option<DistanceMetric>` (`None` = use the persisted/default metric), matching `embedding_dim: Option<i32>`. Threaded through unified / server / PyO3 accordingly.

## Tests

- **Rust** (`store.rs`): `distance_metric_persists_across_reopen` (create `cosine`, reopen **without** the option → cosine ranking), a mismatch-errors test, and a unit test that a metadata-less schema defaults to `l2`.
- **Python** (`test_distance_metric.py`): reopen-without-option keeps cosine ranking; reopening with a conflicting metric raises.

## Verification

- `cargo test -p lance-context-core`: 41 passed
- `cargo clippy --all-targets`: clean; `cargo fmt --check`: clean
- `pytest python/tests/`: 109 passed (the 2 S3 failures are pre-existing and environmental: no live bucket), `ruff check` / `ruff format --check` clean

## Relationship

Follow-up to #74 / #77. Does not change the metric set or ranking math.
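The persist / recover / mismatch-guard flow described above can be sketched with a plain dictionary standing in for the Lance schema metadata. This is a hedged Python illustration of logic that lives in Rust: `metadata_for_create` and `resolve_on_open` are hypothetical names, `distance_metric_from_schema` here only mirrors the name mentioned in the description, and honoring an explicit metric on a legacy dataset with no persisted key is an assumption the description does not spell out.

```python
from typing import Mapping, Optional

METADATA_KEY = "lance-context:distance_metric"
DEFAULT_METRIC = "l2"
VALID_METRICS = ("l2", "cosine", "dot")


def _validate(metric: str) -> str:
    if metric not in VALID_METRICS:
        raise ValueError(f"invalid distance_metric {metric!r}")
    return metric


def metadata_for_create(metric: Optional[str] = None) -> dict:
    """Schema metadata to write when a dataset is created."""
    chosen = metric if metric is not None else DEFAULT_METRIC
    return {METADATA_KEY: _validate(chosen)}


def distance_metric_from_schema(metadata: Mapping[str, str]) -> Optional[str]:
    """Return the persisted metric, or None when the key is absent."""
    value = metadata.get(METADATA_KEY)
    return None if value is None else _validate(value)


def resolve_on_open(metadata: Mapping[str, str], requested: Optional[str] = None) -> str:
    """Pick the metric for an opened store, rejecting a conflicting request."""
    persisted = distance_metric_from_schema(metadata)
    if persisted is None:
        # Legacy dataset with no key: fall back to l2 unless the caller
        # passed a metric explicitly (assumed behaviour).
        return _validate(requested) if requested is not None else DEFAULT_METRIC
    if requested is not None and _validate(requested) != persisted:
        raise ValueError(
            f"distance_metric mismatch: dataset uses {persisted!r}, got {requested!r}"
        )
    return persisted
```

With this shape, reopening without the option returns the persisted metric, and a conflicting explicit value fails loudly instead of silently changing the ranking.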

dcfocus force-pushed the feat/issue-74-distance-metric branch from 5334971 to 9bda813 Compare June 12, 2026 02:58

dcfocus force-pushed the feat/issue-74-distance-metric branch from 9bda813 to bcc554b Compare June 12, 2026 03:27

dcfocus merged commit 1383732 into lance-format:main Jun 12, 2026
9 checks passed

This was referenced Jun 12, 2026

Persist configured distance_metric in dataset metadata (not just runtime kwarg) #80

Closed

Cut next release for post-0.3.3 changes #82

Closed

feat: persist distance_metric in dataset schema metadata #83

Merged

dcfocus merged 1 commit into lance-format:main from dcfocus:feat/issue-74-distance-metric
