feat: add configurable vector-search distance metric #77

Merged
dcfocus merged 1 commit into lance-format:main from dcfocus:feat/issue-74-distance-metric
Jun 12, 2026
Conversation

@dcfocus dcfocus commented Jun 12, 2026

Collaborator

Summary

  • Vector search was hardcoded to Euclidean (L2) distance. Most modern embedding models (OpenAI, Cohere, Voyage, most sentence-transformers) are trained for cosine similarity, so for unnormalized vectors the L2 default could silently return lower-quality results and the returned distance didn't match the caller's model.
  • Adds a store-level distance_metric option — l2 (default), cosine, dot — threaded through every layer exactly like the existing id_index_type option.
  • Search method signatures are unchanged; only the distance computation inside search_filtered_with_options changes. All metrics are normalized so smaller is better, preserving the existing ascending ranking (cosine = 1 - cosine_similarity with a zero-norm guard; dot = negated inner product for max-inner-product search).
  • Default stays L2 → fully backward-compatible. Opt in via Context.create(uri, distance_metric="cosine") (Python) or distance_metric on the REST CreateContextRequest.
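
The "smaller is better" normalization described above can be illustrated with a minimal Python sketch. It is not the crate's Rust code: the function names, the sample vectors, and the value returned by the zero-norm guard (1.0, i.e. similarity treated as 0) are assumptions made for illustration, and the real `l2` may use squared distance, which would not change the ranking.

```python
import math


def l2(a, b):
    # Euclidean distance. Assumption: the store could use squared L2
    # instead; the ordering of results would be identical.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    # Cosine distance = 1 - cosine_similarity, so 0 means "same direction".
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Zero-norm guard. Assumed value: treat similarity as 0.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def neg_dot(a, b):
    # Negated inner product: the largest inner product sorts first.
    return -sum(x * y for x, y in zip(a, b))


METRICS = {"l2": l2, "cosine": cosine, "dot": neg_dot}


def rank(query, records, metric):
    # Ascending sort works for every metric because all are "smaller is better".
    dist = METRICS[metric]
    return sorted(records, key=lambda name: dist(query, records[name]))


query = [1.0, 0.0]
records = {
    "near_but_off_axis": [1.0, 1.0],  # close in space, 45 degrees off
    "far_but_aligned": [5.0, 0.0],    # far in space, same direction
}

for metric in ("l2", "cosine", "dot"):
    print(metric, rank(query, records, metric))
```

On this data L2 ranks `near_but_off_axis` first, while cosine and dot both rank `far_but_aligned` first, which is the kind of reordering the `search_metric_changes_ranking` test is described as checking.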

Closes #74

Layers touched

  • Core (store.rs, lib.rs, record.rs): DistanceMetric enum + parse() + distance(); field on ContextStoreOptions/ContextStore; documented metric-dependent SearchResult.distance.
  • API DTO (lance-context-api): distance_metric on CreateContextRequest.
  • Unified + Server: parse string → enum, return 400 on invalid value.
  • PyO3 + Python API: distance_metric= on Context.create/__init__ (and AsyncContext.create via kwargs).
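
As a rough sketch of the "parse string → enum, return 400 on invalid value" step, the following Python models the idea. The real parsing lives in Rust; `handle_create_context`, the response shape, the 200 success status, and strict lowercase matching are all hypothetical choices for this example.

```python
from enum import Enum


class DistanceMetric(Enum):
    L2 = "l2"
    COSINE = "cosine"
    DOT = "dot"

    @classmethod
    def parse(cls, value):
        # Assumption: matching is exact and lowercase-only.
        try:
            return cls(value)
        except ValueError:
            raise ValueError(
                f"invalid distance_metric {value!r}; expected l2, cosine, or dot"
            ) from None


def handle_create_context(request):
    # Hypothetical handler: `request` stands in for a decoded
    # CreateContextRequest body. An omitted field falls back to l2.
    raw = request.get("distance_metric", "l2")
    try:
        metric = DistanceMetric.parse(raw)
    except ValueError as err:
        return 400, {"error": str(err)}
    return 200, {"distance_metric": metric.value}


print(handle_create_context({"distance_metric": "cosine"}))
print(handle_create_context({"distance_metric": "manhattan"}))
```

Rejecting unknown strings at the boundary keeps the core working with a closed enum, so the search path never has to handle an unrecognized metric.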

Testing

  • cargo test -p lance-context-core — 34 passed, incl. new search_metric_changes_ranking (cosine/dot reorder vs L2 on identical data) and distance_metric_parse_and_math.
  • cargo clippy --workspace --all-targets — clean.
  • cargo fmt --check — clean.
  • uv run --project python --extra tests pytest — new python/tests/test_distance_metric.py (4 tests) pass; full suite green except pre-existing S3 round-trip tests that require a live S3 endpoint (unrelated to this change).
  • ruff check / ruff format --check — clean; pyright — 0 errors.

Notes

  • The metric is runtime-only (re-specified on open, like blob_columns today); persisting it in dataset metadata is a possible follow-up.
  • Overlaps the option-plumbing structs with #73 (Make embedding dimension configurable; currently hardcoded to 1536), so a trivial adjacent-field merge conflict is expected for whichever lands second.

🤖 Generated with Claude Code

@dcfocus dcfocus force-pushed the feat/issue-74-distance-metric branch from 5334971 to 9bda813 Compare June 12, 2026 02:58
Vector search previously always ranked by Euclidean (L2) distance, with no
way to choose a metric. Most modern embedding models are trained for cosine
similarity, so for unnormalized vectors the hardcoded L2 default could
silently return lower-quality results and the returned `distance` did not
match the caller's model.

Add a store-level `distance_metric` option (`l2` default, `cosine`, `dot`),
threaded through every layer like the existing `id_index_type`. Search method
signatures are unchanged; only the distance computation in
`search_filtered_with_options` changes. All metrics are normalized to
"smaller is better" so the ascending ranking is preserved.

Closes lance-format#74

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dcfocus dcfocus force-pushed the feat/issue-74-distance-metric branch from 9bda813 to bcc554b Compare June 12, 2026 03:27
@dcfocus dcfocus merged commit 1383732 into lance-format:main Jun 12, 2026
9 checks passed
dcfocus added a commit that referenced this pull request Jun 12, 2026
## Summary

Persists the configurable vector-search `distance_metric` (added in #74/#77)
in the dataset so it round-trips on `open` instead of being re-specified
every time. **Closes #80.**

Before this change `distance_metric` was a runtime-only option: a caller had
to pass the same metric on every `open`, and if they forgot, the store
silently fell back to the default `l2` and ranked results differently from
how the dataset was intended to be queried. `embedding_dim` was already
recovered from the schema, but the metric was not.

## What changed

- **Persist on create**: the metric is written into the Lance **schema
  metadata** under `lance-context:distance_metric` (the same mechanism
  already used for `lance-encoding:blob`).
- **Recover on open**: `distance_metric_from_schema` reads it back and uses
  it as the store's metric, mirroring `embedding_dim_from_schema`.
- **Mismatch guard**: an explicitly passed metric that disagrees with the
  persisted one errors, reusing the existing `embedding_dim`
  mismatch-validation pattern in `open_with_options`.
- **Backward compatible**: datasets created before this change carry no key
  and default to `l2`.
- `ContextStoreOptions.distance_metric` is now `Option<DistanceMetric>`
  (`None` = use the persisted/default metric), matching
  `embedding_dim: Option<i32>`. Threaded through unified / server / PyO3
  accordingly.
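
The persist / recover / mismatch-guard flow above can be sketched in plain Python, with a dict standing in for the Lance schema metadata. Only the key name `lance-context:distance_metric` and the function name `distance_metric_from_schema` come from this description; `metadata_on_create`, `resolve_metric_on_open`, the error type, and writing `l2` explicitly when no metric is given are assumptions.

```python
METRIC_KEY = "lance-context:distance_metric"
DEFAULT_METRIC = "l2"


def metadata_on_create(metric=None):
    # Hypothetical helper: the schema metadata written at create time.
    # Assumption: an omitted metric is stored as the default.
    return {METRIC_KEY: metric or DEFAULT_METRIC}


def distance_metric_from_schema(metadata):
    # Datasets created before the key existed carry no entry: default to l2.
    return metadata.get(METRIC_KEY, DEFAULT_METRIC)


def resolve_metric_on_open(metadata, requested=None):
    # requested=None plays the role of Option::None: use what was persisted.
    persisted = distance_metric_from_schema(metadata)
    if requested is not None and requested != persisted:
        raise ValueError(
            f"distance_metric mismatch: dataset has {persisted!r}, "
            f"caller passed {requested!r}"
        )
    return persisted


meta = metadata_on_create("cosine")
print(resolve_metric_on_open(meta))            # reopen without the option
print(resolve_metric_on_open({}))              # legacy dataset, no key
print(resolve_metric_on_open(meta, "cosine"))  # explicit and matching
```

Failing on a conflicting explicit metric, rather than silently preferring one side, avoids the silent ranking change that motivated persisting the metric.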

## Tests

- **Rust** (`store.rs`): `distance_metric_persists_across_reopen` (create
  `cosine`, reopen **without** the option → cosine ranking), a
  mismatch-errors test, and a unit test that a metadata-less schema defaults
  to `l2`.
- **Python** (`test_distance_metric.py`): reopen-without-option keeps cosine
  ranking; reopening with a conflicting metric raises.

## Verification

- `cargo test -p lance-context-core`: 41 passed
- `cargo clippy --all-targets`: clean; `cargo fmt --check`: clean
- `pytest python/tests/`: 109 passed (the 2 S3 failures are pre-existing and
  environmental: no live bucket); `ruff check` / `ruff format --check` clean

## Relationship

Follow-up to #74 / #77. Does not change the metric set or ranking math.
Development

Successfully merging this pull request may close these issues.

Make vector-search distance metric configurable (cosine/dot-product; currently hardcoded L2)
