Persist configured distance_metric in dataset metadata (not just runtime kwarg) #80

Description

@dcfocus

Motivation

#74 / #77 add a configurable vector-search distance_metric (l2 / cosine /
dot) as a runtime option on ContextStoreOptions. Unlike embedding_dim
— which is recovered automatically from the persisted Arrow schema on open
(embedding_dim_from_schema, crates/lance-context-core/src/store.rs) — the
distance metric is not persisted. A caller must pass the same
distance_metric on every open, and if they forget, the store silently falls
back to the default l2 and ranks results differently from how the dataset was
intended to be queried.

Proposal

Persist the chosen metric in the dataset so it round-trips without being
re-specified.

  • On create, write the metric into Lance schema metadata (the same
    mechanism already used for blob encoding, e.g. the
    lance-encoding:blob keys in store.rs), e.g.
    lance-context:distance_metric = "cosine".
  • On open, read it back and use it as the store's metric. If a
    distance_metric option is also passed explicitly, it may either override
    the persisted value or, preferably, produce an error on mismatch,
    mirroring the existing embedding_dim mismatch check in open_with_options.
  • Default remains l2 for datasets created before this change (absent key).
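
The three rules above can be sketched as follows. This is a minimal illustration, not the crate's code: a plain HashMap stands in for the Lance schema metadata map, and the DistanceMetric enum, write_metric, and resolve_metric are hypothetical names. It takes the "error on mismatch" option. It also assumes that an explicit option on a legacy dataset (no key) is still honored, which the proposal does not specify.

```rust
use std::collections::HashMap;

/// Metadata key, taken from the example in the proposal.
const DISTANCE_METRIC_KEY: &str = "lance-context:distance_metric";

/// Hypothetical stand-in for the store's metric type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DistanceMetric {
    L2,
    Cosine,
    Dot,
}

impl DistanceMetric {
    fn as_str(self) -> &'static str {
        match self {
            DistanceMetric::L2 => "l2",
            DistanceMetric::Cosine => "cosine",
            DistanceMetric::Dot => "dot",
        }
    }

    fn parse(raw: &str) -> Option<Self> {
        match raw {
            "l2" => Some(DistanceMetric::L2),
            "cosine" => Some(DistanceMetric::Cosine),
            "dot" => Some(DistanceMetric::Dot),
            _ => None,
        }
    }
}

/// On create: record the chosen metric in the (stand-in) schema metadata.
fn write_metric(metadata: &mut HashMap<String, String>, metric: DistanceMetric) {
    metadata.insert(DISTANCE_METRIC_KEY.to_string(), metric.as_str().to_string());
}

/// On open: prefer the persisted metric, reject a conflicting explicit
/// option, and fall back to l2 when the key is absent (legacy dataset).
fn resolve_metric(
    metadata: &HashMap<String, String>,
    requested: Option<DistanceMetric>,
) -> Result<DistanceMetric, String> {
    let persisted = match metadata.get(DISTANCE_METRIC_KEY) {
        Some(raw) => Some(
            DistanceMetric::parse(raw)
                .ok_or_else(|| format!("unknown distance metric {:?} in metadata", raw))?,
        ),
        None => None,
    };

    match (persisted, requested) {
        (Some(p), Some(r)) if p != r => Err(format!(
            "distance_metric mismatch: dataset has {}, caller passed {}",
            p.as_str(),
            r.as_str()
        )),
        (Some(p), _) => Ok(p),
        // Assumption: a legacy dataset still honors an explicit option.
        (None, Some(r)) => Ok(r),
        (None, None) => Ok(DistanceMetric::L2),
    }
}
```

With this shape, a dataset created with cosine reopens as cosine when no option is passed, and an explicit l2 on that dataset returns an error instead of silently changing the ranking.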

Scope

  • Core write path (schema metadata on create) + read-back on open.
  • Reuse/mirror the embedding_dim mismatch-validation pattern.
  • Tests: create with cosine, reopen without passing the option, confirm
    ranking still uses cosine; legacy dataset (no key) still defaults to l2.
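
The ranking check only means something if the fixture vectors make l2 and cosine disagree. Otherwise the test passes even when the reopened store has fallen back to l2. Below is a small standalone sketch of such a fixture. The helper functions are hypothetical and independent of the store API; they only show that the chosen vectors separate the two metrics.

```rust
/// Euclidean (l2) distance between two equal-length vectors.
fn l2_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Cosine distance, defined here as 1 - cosine similarity.
/// Assumes neither vector is all zeros.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// Index of the candidate closest to `query` under `distance`,
/// or None when there are no candidates.
fn nearest(
    query: &[f32],
    candidates: &[Vec<f32>],
    distance: fn(&[f32], &[f32]) -> f32,
) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, candidate) in candidates.iter().enumerate() {
        let d = distance(query, candidate);
        if best.map_or(true, |(_, best_d)| d < best_d) {
            best = Some((i, d));
        }
    }
    best.map(|(i, _)| i)
}
```

For the query [1, 0] and the candidates [10, 0] and [1, 1], l2 picks [1, 1] (distance 1 versus 9), while cosine picks [10, 0] (same direction as the query, so distance 0). A reopened store that returns [10, 0] first is therefore using cosine.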

Non-goals

Relationship

Follow-up to #74 / #77. embedding_dim already persists via the column schema,
so this issue is specifically about the metric setting.
