Vector search support function in DataFusion #29

Embedding storage inside Parquet and managed HNSW indexing (with delete handling):
1.	Parquet Storage: Store embeddings as regular columns in the same Parquet files as the other table columns, so the data stays unified and is versioned per batch.
2.	Temp Indexing: On each row insert/update, serialize the embedding into a temporary .hnsw file at /kalamdb/{namespace}/{table}/{column}-hot_index.hnsw for fast incremental indexing.
3.	Flush Behavior: During table flush, if {table}/{column}-index.hnsw doesn’t exist, create it from all embeddings in the Parquet batches; otherwise, load it, append the new vectors, and mark any deleted rows in the index.
4.	Search Integration: Register a DataFusion scalar function vector_search(column, query_vector, top_k) that loads the HNSW index, filters out deleted entries, and returns the nearest row IDs with their distances.
5.	Job System Hook: Add an async background IndexUpdateJob, triggered post-flush, that merges temporary indexes, applies deletions, and updates the last_indexed_batch metadata for each table column.
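
The delete handling in steps 3 and 4 can be sketched roughly as follows. This is only an illustration under stated assumptions: `ColumnIndex`, `mark_deleted`, `search`, and `l2_distance` are hypothetical names, a brute-force scan stands in for the real HNSW graph, and L2 is an assumed distance metric (the issue does not name one). Only the hot-index path layout is taken from the list above.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory stand-in for one column's vector index.
/// A real implementation would hold an HNSW graph; a linear scan is used
/// here only to show the tombstone (delete-marking) behavior.
struct ColumnIndex {
    rows: Vec<(u64, Vec<f32>)>, // (row_id, embedding)
    deleted: HashSet<u64>,      // row ids marked as deleted, not yet compacted
}

impl ColumnIndex {
    fn new() -> Self {
        ColumnIndex { rows: Vec::new(), deleted: HashSet::new() }
    }

    /// Insert or update: a re-inserted row id replaces its old vector
    /// and clears any earlier delete mark.
    fn insert(&mut self, row_id: u64, embedding: Vec<f32>) {
        self.deleted.remove(&row_id);
        if let Some(slot) = self.rows.iter_mut().find(|(id, _)| *id == row_id) {
            slot.1 = embedding;
        } else {
            self.rows.push((row_id, embedding));
        }
    }

    /// Mark a row as deleted without rebuilding the index.
    fn mark_deleted(&mut self, row_id: u64) {
        self.deleted.insert(row_id);
    }

    /// Return up to `top_k` (row_id, distance) pairs, nearest first,
    /// skipping deleted rows and vectors of a different dimension.
    fn search(&self, query: &[f32], top_k: usize) -> Vec<(u64, f32)> {
        let mut hits: Vec<(u64, f32)> = self
            .rows
            .iter()
            .filter(|(id, v)| !self.deleted.contains(id) && v.len() == query.len())
            .map(|(id, v)| (*id, l2_distance(v, query)))
            .collect();
        hits.sort_by(|a, b| a.1.total_cmp(&b.1));
        hits.truncate(top_k);
        hits
    }
}

/// Euclidean (L2) distance; the metric is an assumption of this sketch.
fn l2_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Path of the temporary hot index, following the layout given in step 2.
fn hot_index_path(namespace: &str, table: &str, column: &str) -> String {
    format!("/kalamdb/{namespace}/{table}/{column}-hot_index.hnsw")
}
```

In this sketch a deleted row stays in storage and is only filtered out at query time, which is the behavior a `vector_search` function would need until a background job such as IndexUpdateJob compacts the index. Calling `search(&query, top_k)` after `mark_deleted(row_id)` never returns that row, even if it is the closest match.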
