Vector search support function in DataFusion #29

@jamals86

Embedding storage inside Parquet and managed HNSW indexing (with delete handling):

  1. Parquet Storage: Store all embeddings as regular columns in the Parquet file, alongside the table's other columns, so the data stays unified and is versioned per batch.
  2. Temp Indexing: On each row insert/update, write the embedding into a temporary .hnsw file at /kalamdb/{namespace}/{table}/{column}-hot_index.hnsw for fast incremental indexing.
  3. Flush Behavior: During table flush, if {table}/{column}-index.hnsw doesn’t exist, build it from all embeddings in the Parquet batches; otherwise, load it, append the new vectors, and mark any deleted rows in the index.
  4. Search Integration: Register a DataFusion scalar function vector_search(column, query_vector, top_k) that loads the HNSW index, filters out deleted entries, and returns the nearest row IDs with their distances.
  5. Job System Hook: Add an async background IndexUpdateJob triggered post-flush to merge temporary indexes, apply deletions, and update last_indexed_batch metadata for each table column.
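
To make the flush and delete semantics above concrete, here is a rough sketch of the lifecycle in plain Python. It is not the proposed implementation: an exact brute-force scan stands in for the HNSW graph, an in-memory object stands in for the `{column}-index.hnsw` file, and `ColumnIndex`, `flush`, and the argument shapes are hypothetical names chosen for illustration. It also assumes that a delete in the same batch as an update wins, and that batch IDs increase monotonically.

```python
import heapq
import math
from dataclasses import dataclass, field


@dataclass
class ColumnIndex:
    """Hypothetical in-memory stand-in for one {column}-index.hnsw file."""
    vectors: dict[int, list[float]] = field(default_factory=dict)  # row_id -> embedding
    deleted: set[int] = field(default_factory=set)                 # tombstoned row IDs
    last_indexed_batch: int = -1                                   # metadata from step 5


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flush(index, batch_id, hot_vectors, deleted_rows):
    """Steps 3 and 5: fold the hot index into the persistent one.

    Batches at or below last_indexed_batch are skipped, so re-running
    the job for an already indexed batch is a no-op.
    """
    if batch_id <= index.last_indexed_batch:
        return index
    index.vectors.update(hot_vectors)             # append new / updated vectors
    index.deleted.difference_update(hot_vectors)  # re-inserted rows are live again
    index.deleted.update(deleted_rows)            # assumption: deletes win within a batch
    index.last_indexed_batch = batch_id
    return index


def vector_search(index, query, top_k):
    """Step 4: return up to top_k (row_id, distance) pairs, skipping tombstones.

    This is an exact scan; a real HNSW index would return approximate neighbours.
    """
    live = (
        (l2(vec, query), row_id)
        for row_id, vec in index.vectors.items()
        if row_id not in index.deleted
    )
    return [(row_id, dist) for dist, row_id in heapq.nsmallest(top_k, live)]


index = ColumnIndex()
flush(index, 0, {1: [0.0, 0.0], 2: [1.0, 0.0], 3: [5.0, 5.0]}, set())
flush(index, 1, {4: [0.0, 1.0]}, {2})        # batch 1 adds row 4 and deletes row 2
print(vector_search(index, [0.0, 0.0], 2))   # row 2 is filtered out despite being close
```

Keeping deletions as tombstones rather than removing nodes matches the "marking any deleted rows" wording in step 3; whether tombstoned vectors are ever compacted out of the index is left open here.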
