vector-store: preparation for multi-target-column indexes by knowack1 · Pull Request #449 · scylladb/vector-store

knowack1 · 2026-05-13T14:09:40Z

In preparation for full-text search (FTS) indexes, which are planned to support multiple target columns in Milestone 3—we need to ensure the existing architecture is ready for this extension.

As validation, we are generalizing the concept of target columns to support multiple columns even for vector indexes. This ensures that the same code path will work for both vector and FTS indexes moving forward.

This approach prevents significant future refactoring when multi-target-column FTS indexes are introduced.

Fixes VECTOR-676

Copilot

Pull request overview

This PR prepares vector-store for upcoming multi-target-column indexes (notably for planned FTS support) by generalizing index metadata from a single target_column to target_columns, and by adjusting the embedding update path to carry per-column values/timestamps.

Changes:

Replace target_column with target_columns: Vec<ColumnName> across index metadata flows (DB discovery, routing/grouping, tests, benches).
Refactor DbEmbedding from { embedding, timestamp } into { columns: Vec<Option<ColumnValue>> } and update producers/consumers accordingly.
Update DB range-scan query generation to accept a slice of target columns (currently still using the first column).

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
crates/vector-store/src/lib.rs	Public API updates: `target_columns` on metadata + new `ColumnValue` + `DbEmbedding` shape change.
crates/vector-store/src/table.rs	Consumes new `DbEmbedding.columns` shape and applies add/update/delete logic based on first column.
crates/vector-store/src/monitor_indexes.rs	Uses `target_columns` during index discovery and dimension lookup.
crates/vector-store/src/monitor_items.rs	Test updates for the new `DbEmbedding.columns` representation.
crates/vector-store/src/indexes.rs	Routing key updated to group by `target_columns` rather than a single column.
crates/vector-store/src/db.rs	DB index discovery now populates `DbCustomIndex.target_columns` (currently single-element vec).
crates/vector-store/src/db_index.rs	Range scan now passes `target_columns`; emits `DbEmbedding` using `ColumnValue`.
crates/vector-store/src/db_index_backend.rs	Backend representation updated to store `target_columns`; range-scan query takes a slice.
crates/vector-store/src/db_cdc.rs	CDC consumer now emits `DbEmbedding.columns` with `ColumnValue`.
crates/vector-store/src/node_state.rs	Test fixtures updated for `target_columns`.
crates/vector-store/tests/integration/usearch.rs	Integration tests updated to use `target_columns` and new embedding representation.
crates/vector-store/tests/integration/routing.rs	Integration tests updated to build indexes with `target_columns`.
crates/vector-store/tests/integration/opensearch.rs	Integration tests updated for `target_columns` and embedding representation.
crates/vector-store/tests/integration/memory_limit.rs	Integration test updated for `target_columns` and dimensions mapping.
crates/vector-store/tests/integration/db_basic.rs	Test DB shim updated for `target_columns` and `DbEmbedding.columns`.
crates/vector-store/benches/pipeline.rs	Bench updated for `target_columns` and `DbEmbedding.columns`.

Comments suppressed due to low confidence (2)

crates/vector-store/src/db_index_backend.rs:63

extract_vector indexes target_columns[0] without checking for emptiness, which can panic if IndexMetadata::target_columns is empty. Consider enforcing a non-empty invariant for target_columns or handling the empty case explicitly.

    pub fn extract_vector(&self, value: CqlValue) -> anyhow::Result<Option<Vector>> {
        match self {
            Self::Cql { .. } => Vector::try_from(value).map(Some),
            Self::Alternator { target_columns } => vector::AlternatorAttrs {
                attrs: value,
                target_column: target_columns[0].as_ref(),
            }
            .try_into(),

crates/vector-store/src/db_index_backend.rs:82

range_scan_query assumes target_columns is non-empty and indexes [0]. If an empty slice is ever passed (now possible with the generalized API), this will panic. Consider validating target_columns at the call site or changing this helper to return an error on empty input.

pub(crate) fn range_scan_query(
    keyspace: &KeyspaceIdentifier,
    table: &TableIdentifier,
    target_columns: &[ColumnName],
    primary_key_list: &str,
    partition_key_list: &str,
) -> String {
    if keyspace.is_alternator() {
        let attributes = CqlIdentifier::new(":attrs");
        let vector = CqlLiteral::new(target_columns[0].as_ref());
        format!(

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

crates/vector-store/src/table.rs:1247

In the Vacant branch, when an incoming embedding is Some(EmbeddingValue { embedding: None, timestamp }) (a tombstone for a column whose row didn't yet exist), no entry is written to vector_timestamps[col_idx]. The slot retains the resize default of ETValue::None(Epoch::new(), Timestamp::UNIX_EPOCH), so the tombstone's timestamp is lost. A subsequent older insert for that column (with vector_timestamp > UNIX_EPOCH but < original tombstone timestamp) will pass the stored_timestamp >= vector_timestamp check in the Occupied branch and be applied, violating tombstone ordering. Recording ETValue::None(epoch, ev.timestamp) for the column when ev.embedding is None would preserve correctness here.

                        for (col_idx, embedding) in embeddings.iter().enumerate() {
                            let Some(ev) = embedding else { continue };
                            if let Some(vector) = &ev.embedding {
                                index.vector_timestamps[col_idx].update(
                                    primary_id,
                                    ETValue::Some(primary_id.epoch(), ev.timestamp, ()),
                                )?;
                                all_operations[col_idx].push(Operation::AddVector {
                                    primary_id,
                                    partition_id,
                                    vector: vector.clone(),
                                    is_update: false,
                                });
                            }
                        }

crates/vector-store/src/db_index_backend.rs:103

Linear search over entries for every target column makes this O(N_columns × N_attrs) per CDC event. With multi-target-column FTS indexes (the stated motivation for this PR) and rows that have many attributes, this can become a hot path. Building a small lookup (e.g. a HashMap<&[u8], &CqlValue> over entries once, then probing per target column) keeps it linear in the larger of the two sizes.

                target_columns
                    .iter()
                    .map(|col| {
                        let target = col.as_ref().as_bytes();
                        let value = entries.iter().find_map(|(key, val)| {
                            let matches = match key {
                                CqlValue::Blob(b) => b.as_slice() == target,
                                CqlValue::Text(s) => s.as_bytes() == target,
                                _ => false,
                            };
                            matches.then_some(val)
                        });
                        match value {
                            Some(v) => Ok(Some(Some(Vector::try_from(v.clone())?))),
                            None => Ok(Some(None)),
                        }
                    })
                    .collect()
            }

ewienik

I've already checked the first commit. Needs some refactoring.
Consider running clippy/fmt on the intermediate commit - for better reviewer experience :-)

ewienik · 2026-05-18T15:44:49Z

    pub fn vector_column_name(&self) -> &str {
        match self {
-            Self::Cql { target_column } => target_column.as_ref(),
+            Self::Cql { target_columns } => target_columns[0].as_ref(),


target_columns[0] could panic (here and in other places), we should avoid this. Consider creating a newtype TargetColumns that would be responsible that only len > 0 are possible.
As this type TargetColumns will be invariant you can define it using Arc<Box[_]> to limit memory usage.

This issue was later fixed in commit 727c7b1 After this commit, db_index_backend is also capable of handling multi-target columns, eliminating calls like target_columns[0]. So, with this, I think there is no need to introduce TargetColumns.

I could merge this and the 727c7b1 into single commit, but I tried not to put too much into one commit.

For the maintenance it would be better to introduce newtype which will hold contract invariant. The issue might be solved by the later commit but we should limit similar problems in the future.
Additionally newtype gives better semantic by the type directly, not by the variable name.
You can introduce this newtype before this commit. IMHO it is better to fix issues in the commit when they appears, not in the following ones.

Similar pattern of a non-empty list of columns we have in primary_key_columns and local key_columns. Such newtype can be used also for them. So the name could be different - maybe NonEmptyColumns? Or another name?

Makes sense. I will introduce it.

ewienik · 2026-05-18T15:49:59Z

 #[derive(Clone, Debug, PartialEq)]
-pub struct DbEmbedding {
-    pub primary_key: PrimaryKey,
+pub struct EmbeddingValue {


Consider rename it to DbIndexedValue as for the future it would be more meaningful

Done in dedicated commit 69397c5

Could you move that refactor before this commit?

ewienik · 2026-05-18T15:50:45Z

 }

+#[derive(Clone, Debug, PartialEq)]
+pub struct DbEmbedding {


Consider rename it to DbIndexedRow

Done in dedicated commit: 69397c5

Could you move that refactor before this commit?

ewienik · 2026-05-18T15:51:05Z

+    pub primary_key: PrimaryKey,
+    /// List of embeddings for each indexed column.
+    /// The order corresponds to the order of target columns in IndexMetadata.
+    pub embeddings: Vec<Option<EmbeddingValue>>,


Consider rename it into values

Done in dedicated commit: 971086e

Could you move this refactoring before this commit? It would be shorter context of changes in commits.

ewienik · 2026-05-18T15:54:03Z

        _index_key: &IndexKey,
        db_embedding: DbEmbedding,
-    ) -> anyhow::Result<Vec<Operation>> {
+    ) -> anyhow::Result<Vec<Vec<Operation>>> {


We don't need to create a hierarchy of vectors. Can we use only single vector?

The returned type has the positional pattern. Every entry in the most upper vector corresponds the the operations for a given target column value. Added comment to the returned type also extracted Vec<Operation> into Operations alias.

The returned type has the positional pattern. Every entry in the most upper vector corresponds the the operations for a given target column value. Added comment to the returned type also extracted Vec<Operation> into Operations alias.

This is not always truth - sometimes one add to table produces two operations for index.

We don't need this hierarchy information - we don't need vector of vectors - we don't need this information further in a pipeline, do we? add can produce flat vector.

The current structure of Vec<Operations> works as follows. Assuming 4 target columns, add() can produce a Vec<Operations> like this:

[ [0] = [AddVector{...}], // New value for target_columns[0] [1] = [RemoveBeforeAddVector{...}, AddVector{...}], // Updated value for target_columns[1] [2] = [RemoveVector{...}], // Removed value for target_columns[2] [3] = [] // No update (nothing changed) for target_columns[3] ]

I think this positional pattern is consistent with what we have in DBEmbedding (DbIndexedRow).

At this stage, I don't see how to effectively apply a bitset in this case - could you please advise on this? Another approach I see would be to add a target_column_index into the Operation variants.

Also, depending on the chosen approach, I would apply the same pattern to DBEmbedding (DbIndexedRow) to keep the pattern consistent across the application.

The current structure of Vec<Operations> works as follows. Assuming 4 target columns, add() can produce a Vec<Operations> like this:

[ [0] = [AddVector{...}], // New value for target_columns[0] [1] = [RemoveBeforeAddVector{...}, AddVector{...}], // Updated value for target_columns[1] [2] = [RemoveVector{...}], // Removed value for target_columns[2] [3] = [] // No update (nothing changed) for target_columns[3] ]

We have also RemovePartition operation. In the future we would have more operations.
Additionally each Operation is independent of target column - add method should tell what should be done after the insert of DbIndexRow.

I think this positional pattern is consistent with what we have in DBEmbedding (DbIndexedRow).

At this stage, I don't see how to effectively apply a bitset in this case - could you please advise on this? Another approach I see would be to add a target_column_index into the Operation variants.

Also, depending on the chosen approach, I would apply the same pattern to DBEmbedding (DbIndexedRow) to keep the pattern consistent across the application.

The position in Vec shouldn't be coordinated with position of column in DbIndexRow - these two are independent.

In preparation for full-text search (FTS) indexes, which are planned to support multiple target columns in Milestone 3—we need to ensure the existing architecture is ready for this extension. As validation, we are generalizing the concept of target columns to support multiple columns even for vector indexes. This ensures that the same code path will work for both vector and FTS indexes moving forward. This approach prevents significant future refactoring when multi-target-column FTS indexes are introduced. Fixes VECTOR-676

Move the match on Operation variants into a dedicated async function, keeping add() as a slim orchestrator.

Replace single-column extraction APIs with multi-column equivalents: - vector_column_name() + extract_vector(value) replaced by extract_cdc_embeddings(&mut row) returning Vec<Option<Option<Vector>>> - range_scan_query now selects N (value, writetime) pairs - range_scan_stream uses extract_embeddings_from_row() for N columns Refactor CDC consumer to work with multiple embeddings per row and extract primary key extraction into a dedicated helper. Add unit tests for multi-column range scan queries (CQL + Alternator) and Alternator CDC extraction (vector updated, not touched, deleted).

Extract inline logic from the Occupied and Vacant branches of TableAdd::add into dedicated methods on Index: - vector_timestamps_for: retrieve vector timestamps column by index - read_column_state: read epoch, timestamp, and existence flag - require_partition_id: resolve partition_id with error context - resolve_or_add_partition: resolve global or add local partition

Generalize index pipeline types to support non-embedding index backends (e.g. FTS). The previous names were tightly coupled to vector embeddings, making the abstraction unclear for other index types. Renames: - DbEmbedding -> DbIndexedRow - EmbeddingValue -> DbIndexedValue - DbIndexedRow::embeddings field -> DbIndexedRow::values No functional changes — pure rename across the codebase.

ewienik

I see an issue with multi-column index values - what should we do when only single column has value? We should provide this information to the pipeline in single operation - monitor_items should now which column is newer and update index only with this value. I think we should provide some bitset to the Add or Remove operation - which columns are affected with this operation. So we shouldn't build Vec<Vec> but extend Operation with bitset about columns affected.

ewienik · 2026-05-20T15:32:07Z

-                                primary_id,
-                                partition_id,
-                            });
+                    for (col_idx, embedding) in embeddings.iter().enumerate() {


We should check all embedding if they have something to change, and we should produce single Operation for one row (for all values in targets) - should we update or not.

knowack1 · 2026-05-21T07:54:37Z

I see an issue with multi-column index values - what should we do when only single column has value? We should provide this information to the pipeline in single operation - monitor_items should now which column is newer and update index only with this value. I think we should provide some bitset to the Add or Remove operation - which columns are affected with this operation. So we shouldn't build Vec but extend Operation with bitset about columns affected.

I think this is currently handled correctly (see my comment #449 (comment)). When value of specific column was not updated the Operations are empty for this specific position (column) in the Vec<Operations>.

ewienik · 2026-05-21T09:16:26Z

I see an issue with multi-column index values - what should we do when only single column has value? We should provide this information to the pipeline in single operation - monitor_items should now which column is newer and update index only with this value. I think we should provide some bitset to the Add or Remove operation - which columns are affected with this operation. So we shouldn't build Vec but extend Operation with bitset about columns affected.

I think this is currently handled correctly (see my comment #449 (comment)). When value of specific column was not updated the Operations are empty for this specific position (column) in the Vec<Operations>.

But this is not performant way - out of cache. We will have a dynamic vector of dynamic vectors - it is hard to check on monitor_items - you have to do several iteration over vector of vectors. Data are not easily grasped as there could be two operations per one cdc insert. It is better to store bitset in Add/Delete operations indicating which column was changed. Additionally there could be more operations - no only add/remove vector - it could be add/remove partition. In the future there could be even more of them.

knowack1 · 2026-05-21T12:41:00Z

These changes requires revisited approach according to https://scylladb.atlassian.net/browse/VECTOR-676?focusedCommentId=60373.

knowack1 requested a review from Copilot May 13, 2026 14:09

github-actions Bot added P2 area/vector labels May 13, 2026

Copilot started reviewing on behalf of knowack1 May 13, 2026 14:10 View session

Copilot AI reviewed May 13, 2026

View reviewed changes

Comment thread crates/vector-store/src/table.rs Outdated

Comment thread crates/vector-store/src/monitor_indexes.rs

Comment thread crates/vector-store/src/db_index_backend.rs Outdated

knowack1 force-pushed the VECTOR-676-multi-column-index-support branch from 6f8b526 to 17bcde3 Compare May 14, 2026 12:04

knowack1 requested a review from Copilot May 14, 2026 12:08

Copilot started reviewing on behalf of knowack1 May 14, 2026 12:13 View session

Copilot AI reviewed May 14, 2026

View reviewed changes

knowack1 force-pushed the VECTOR-676-multi-column-index-support branch from cfca3b5 to bc6339b Compare May 14, 2026 13:36

knowack1 requested a review from Copilot May 14, 2026 13:37

Copilot started reviewing on behalf of knowack1 May 14, 2026 13:38 View session

knowack1 force-pushed the VECTOR-676-multi-column-index-support branch from bc6339b to 1816bc4 Compare May 14, 2026 13:44

Copilot AI reviewed May 14, 2026

View reviewed changes

knowack1 force-pushed the VECTOR-676-multi-column-index-support branch 2 times, most recently from 29c61db to cd4c8da Compare May 15, 2026 16:40

knowack1 requested a review from Copilot May 15, 2026 16:58

Copilot started reviewing on behalf of knowack1 May 15, 2026 16:59 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

Comment thread crates/vector-store/src/table.rs

Comment thread crates/vector-store/src/db_index_backend.rs

Comment thread crates/vector-store/src/db_index.rs Outdated

Comment thread crates/vector-store/src/table.rs Outdated

knowack1 force-pushed the VECTOR-676-multi-column-index-support branch 5 times, most recently from c82a7a3 to 965729b Compare May 18, 2026 14:24

knowack1 requested a review from ewienik May 18, 2026 14:30

knowack1 marked this pull request as ready for review May 18, 2026 14:30

ewienik requested changes May 18, 2026

View reviewed changes

knowack1 added 3 commits May 20, 2026 12:15

vector-store: refactor monitor_items::add()

3fa7319

Move the match on Operation variants into a dedicated async function, keeping add() as a slim orchestrator.

knowack1 force-pushed the VECTOR-676-multi-column-index-support branch from 965729b to 69397c5 Compare May 20, 2026 10:47

knowack1 added 2 commits May 20, 2026 13:12

knowack1 force-pushed the VECTOR-676-multi-column-index-support branch from 69397c5 to 971086e Compare May 20, 2026 11:12

ewienik requested changes May 20, 2026

View reviewed changes

ewienik mentioned this pull request May 21, 2026

support storing, filtering and retrieving non-key columns #444

Open

Conversation

knowack1 commented May 13, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ewienik left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

knowack1 May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

ewienik left a comment •

edited

Loading

knowack1 May 20, 2026 •

edited

Loading

knowack1 commented May 21, 2026 •

edited

Loading