Motivation
#74 / #77 add a configurable vector-search distance_metric (l2 / cosine /
dot) as a runtime option on ContextStoreOptions. Unlike embedding_dim
— which is recovered automatically from the persisted Arrow schema on open
(embedding_dim_from_schema, crates/lance-context-core/src/store.rs) — the
distance metric is not persisted. A caller must pass the same
distance_metric on every open, and if they forget, the store silently falls
back to the default l2 and ranks results differently from how the dataset was
intended to be queried.
Proposal
Persist the chosen metric in the dataset so it round-trips without being
re-specified.
- On create, write the metric into Lance schema metadata, the same
mechanism already used for blob encoding (the lance-encoding:blob keys in
store.rs). For example: lance-context:distance_metric = "cosine".
- On open, read it back and use it as the store's metric. If the caller also
passes distance_metric explicitly, it may override the stored value or,
preferably, produce an error on mismatch, mirroring the existing
embedding_dim mismatch check in open_with_options.
- Default remains l2 for datasets created before this change (absent key).
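The create/open behaviour above could be sketched roughly as follows. This is a minimal, std-only illustration and not the store's actual code: the schema metadata is modelled as a plain HashMap<String, String>, and DistanceMetric, write_metric, resolve_metric and the DISTANCE_METRIC_KEY constant are hypothetical names chosen for this sketch. It takes the "error on mismatch" option, and it assumes that a legacy dataset (no key) still honours an explicitly passed metric.

```rust
use std::collections::HashMap;

/// Hypothetical metadata key, following the example in the proposal.
pub const DISTANCE_METRIC_KEY: &str = "lance-context:distance_metric";

/// Stand-in for the store's metric type (assumed for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DistanceMetric {
    L2,
    Cosine,
    Dot,
}

impl DistanceMetric {
    /// String form written to the metadata map.
    pub fn as_str(self) -> &'static str {
        match self {
            DistanceMetric::L2 => "l2",
            DistanceMetric::Cosine => "cosine",
            DistanceMetric::Dot => "dot",
        }
    }

    /// Inverse of `as_str`; returns None for unrecognised values.
    pub fn parse(s: &str) -> Option<Self> {
        match s {
            "l2" => Some(DistanceMetric::L2),
            "cosine" => Some(DistanceMetric::Cosine),
            "dot" => Some(DistanceMetric::Dot),
            _ => None,
        }
    }
}

/// On create: record the chosen metric in the schema metadata map.
pub fn write_metric(metadata: &mut HashMap<String, String>, metric: DistanceMetric) {
    metadata.insert(DISTANCE_METRIC_KEY.to_string(), metric.as_str().to_string());
}

/// On open: resolve the effective metric from the persisted value and the
/// optional caller-supplied value.
pub fn resolve_metric(
    metadata: &HashMap<String, String>,
    requested: Option<DistanceMetric>,
) -> Result<DistanceMetric, String> {
    let stored = match metadata.get(DISTANCE_METRIC_KEY) {
        Some(raw) => Some(
            DistanceMetric::parse(raw)
                .ok_or_else(|| format!("unknown persisted distance_metric: {raw}"))?,
        ),
        None => None,
    };
    match (stored, requested) {
        // Both present and different: refuse to open rather than silently
        // ranking with the wrong metric.
        (Some(s), Some(r)) if s != r => Err(format!(
            "distance_metric mismatch: dataset has {}, caller passed {}",
            s.as_str(),
            r.as_str()
        )),
        // Persisted value wins when the caller passes nothing (or agrees).
        (Some(s), _) => Ok(s),
        // Legacy dataset (absent key): honour an explicit option (assumption),
        // otherwise fall back to the l2 default.
        (None, Some(r)) => Ok(r),
        (None, None) => Ok(DistanceMetric::L2),
    }
}
```

With this shape, a dataset created with cosine resolves to cosine when reopened with no option, returns an error when reopened with l2, and a metadata map without the key resolves to l2.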
Scope
- Core write path (schema metadata on create) + read-back on open.
- Reuse/mirror the embedding_dim mismatch-validation pattern.
- Tests: create with cosine, reopen without passing the option, confirm
ranking still uses cosine; legacy dataset (no key) still defaults to l2.
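For the ranking test, the fixture needs vectors whose nearest neighbour differs between cosine and l2, otherwise the test cannot tell which metric is in effect. The following is a small std-only sketch of such a fixture; it does not call the store API, and l2_sq, cosine_distance, nearest and fixture are hypothetical helper names used only here.

```rust
/// Squared Euclidean distance (same ordering as plain l2).
pub fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Cosine distance: 1 - cos(angle between a and b). Assumes non-zero vectors.
pub fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// Returns the label of the row closest to `query` under `dist`.
pub fn nearest<'a>(
    query: &[f32],
    rows: &'a [(&'a str, Vec<f32>)],
    dist: fn(&[f32], &[f32]) -> f32,
) -> &'a str {
    rows.iter()
        .min_by(|l, r| dist(query, &l.1).total_cmp(&dist(query, &r.1)))
        .map(|row| row.0)
        .expect("rows must be non-empty")
}

/// Two rows chosen so the metrics disagree for the query [1, 0]:
/// - "far_same_direction" is parallel to the query but far away,
/// - "near_off_axis" is close in Euclidean terms but 45 degrees off.
pub fn fixture() -> Vec<(&'static str, Vec<f32>)> {
    vec![
        ("far_same_direction", vec![10.0, 0.0]),
        ("near_off_axis", vec![1.0, 1.0]),
    ]
}
```

For the query [1, 0], cosine ranks far_same_direction first (distance 0 versus about 0.29), while l2 ranks near_off_axis first (squared distance 1 versus 81). A reopen test that inserts these rows and checks the top hit would therefore distinguish a persisted cosine metric from a silent fallback to l2.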
Non-goals
Relationship
Follow-up to #74 / #77. embedding_dim already persists via the column schema,
so this issue is specifically about the metric setting.