Embedding storage inside Parquet and managed HNSW indexing (with delete handling):
- Parquet Storage: All embeddings are stored as regular columns in the Parquet file alongside other table columns to keep data unified and versioned per batch.
- Temp Indexing: On each row insert/update, serialize the row's embedding into a temporary hot index file at /kalamdb/{namespace}/{table}/{column}-hot_index.hnsw for fast incremental indexing.
- Flush Behavior: During table flush, if {table}/{column}-index.hnsw doesn’t exist, build it from all embeddings in the Parquet batches; otherwise, load it, append the new vectors, and mark any deleted rows in the index.
- Search Integration: Register a DataFusion scalar function vector_search(column, query_vector, top_k) that loads the HNSW index, filters out deleted entries, and returns the top_k nearest row IDs with their distances.
- Job System Hook: Add an async background IndexUpdateJob triggered post-flush to merge temporary indexes, apply deletions, and update last_indexed_batch metadata for each table column.
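
The flush and search rules above can be sketched in a few lines. This is only an illustration under stated assumptions: a brute-force linear scan stands in for the real HNSW graph, the index lives in memory instead of a .hnsw file, and the names ColumnIndex, flush, and deleted_row_ids are hypothetical (only vector_search and last_indexed_batch come from the list above). It shows the create-or-append decision at flush time, tombstoning of deleted rows, and filtering those rows out at query time.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import sqrt


@dataclass
class ColumnIndex:
    """Hypothetical stand-in for one {column}-index.hnsw file (brute force, not HNSW)."""
    vectors: dict[int, list[float]] = field(default_factory=dict)  # row_id -> embedding
    deleted: set[int] = field(default_factory=set)                 # tombstoned row IDs
    last_indexed_batch: int = -1                                   # highest batch already indexed


def flush(index: ColumnIndex | None, batches, deleted_row_ids) -> ColumnIndex:
    """Apply the flush rule. `batches` is a list of (batch_id, {row_id: embedding})."""
    if index is None:
        # No index exists yet: start empty so every batch below gets indexed.
        index = ColumnIndex()
    for batch_id, rows in batches:
        if batch_id <= index.last_indexed_batch:
            continue  # already indexed by an earlier flush
        index.vectors.update(rows)  # append new vectors (updated rows overwrite)
        index.last_indexed_batch = batch_id
    # Mark deleted rows instead of removing them from the index structure.
    index.deleted.update(deleted_row_ids)
    return index


def vector_search(index: ColumnIndex, query: list[float], top_k: int):
    """Return up to top_k (row_id, euclidean_distance) pairs, skipping deleted rows."""
    scored = [
        (row_id, sqrt(sum((a - b) ** 2 for a, b in zip(vec, query))))
        for row_id, vec in index.vectors.items()
        if row_id not in index.deleted
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# First flush builds the index; second flush appends batch 1 and deletes row 2.
index = flush(None, [(0, {1: [0.0, 0.0], 2: [1.0, 0.0]})], deleted_row_ids=[])
index = flush(index, [(1, {3: [0.0, 2.0]})], deleted_row_ids=[2])
hits = vector_search(index, [0.0, 0.0], top_k=2)
print(hits)
```

Keeping deletions as a tombstone set mirrors the "mark deleted rows" wording above: the index is never rewritten at delete time, and the cost is paid as a cheap filter during search until a later job compacts it.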