vector-store: preparation for multi-target-column indexes #449

Open
knowack1 wants to merge 5 commits into scylladb:master from knowack1:VECTOR-676-multi-column-index-support

@knowack1
Collaborator

In preparation for full-text search (FTS) indexes, which are planned to support multiple target columns in Milestone 3, we need to ensure the existing architecture is ready for this extension.

As validation, we are generalizing the concept of target columns to support multiple columns even for vector indexes. This ensures that the same code path will work for both vector and FTS indexes moving forward.

This approach prevents significant future refactoring when multi-target-column FTS indexes are introduced.

Fixes VECTOR-676

Copilot AI left a comment

Pull request overview

This PR prepares vector-store for upcoming multi-target-column indexes (notably for planned FTS support) by generalizing index metadata from a single target_column to target_columns, and by adjusting the embedding update path to carry per-column values/timestamps.

Changes:

  • Replace target_column with target_columns: Vec<ColumnName> across index metadata flows (DB discovery, routing/grouping, tests, benches).
  • Refactor DbEmbedding from { embedding, timestamp } into { columns: Vec<Option<ColumnValue>> } and update producers/consumers accordingly.
  • Update DB range-scan query generation to accept a slice of target columns (currently still using the first column).

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| crates/vector-store/src/lib.rs | Public API updates: target_columns on metadata + new ColumnValue + DbEmbedding shape change. |
| crates/vector-store/src/table.rs | Consumes new DbEmbedding.columns shape and applies add/update/delete logic based on first column. |
| crates/vector-store/src/monitor_indexes.rs | Uses target_columns during index discovery and dimension lookup. |
| crates/vector-store/src/monitor_items.rs | Test updates for the new DbEmbedding.columns representation. |
| crates/vector-store/src/indexes.rs | Routing key updated to group by target_columns rather than a single column. |
| crates/vector-store/src/db.rs | DB index discovery now populates DbCustomIndex.target_columns (currently single-element vec). |
| crates/vector-store/src/db_index.rs | Range scan now passes target_columns; emits DbEmbedding using ColumnValue. |
| crates/vector-store/src/db_index_backend.rs | Backend representation updated to store target_columns; range-scan query takes a slice. |
| crates/vector-store/src/db_cdc.rs | CDC consumer now emits DbEmbedding.columns with ColumnValue. |
| crates/vector-store/src/node_state.rs | Test fixtures updated for target_columns. |
| crates/vector-store/tests/integration/usearch.rs | Integration tests updated to use target_columns and new embedding representation. |
| crates/vector-store/tests/integration/routing.rs | Integration tests updated to build indexes with target_columns. |
| crates/vector-store/tests/integration/opensearch.rs | Integration tests updated for target_columns and embedding representation. |
| crates/vector-store/tests/integration/memory_limit.rs | Integration test updated for target_columns and dimensions mapping. |
| crates/vector-store/tests/integration/db_basic.rs | Test DB shim updated for target_columns and DbEmbedding.columns. |
| crates/vector-store/benches/pipeline.rs | Bench updated for target_columns and DbEmbedding.columns. |
Comments suppressed due to low confidence (2)

crates/vector-store/src/db_index_backend.rs:63

  • extract_vector indexes target_columns[0] without checking for emptiness, which can panic if IndexMetadata::target_columns is empty. Consider enforcing a non-empty invariant for target_columns or handling the empty case explicitly.
    pub fn extract_vector(&self, value: CqlValue) -> anyhow::Result<Option<Vector>> {
        match self {
            Self::Cql { .. } => Vector::try_from(value).map(Some),
            Self::Alternator { target_columns } => vector::AlternatorAttrs {
                attrs: value,
                target_column: target_columns[0].as_ref(),
            }
            .try_into(),

crates/vector-store/src/db_index_backend.rs:82

  • range_scan_query assumes target_columns is non-empty and indexes [0]. If an empty slice is ever passed (now possible with the generalized API), this will panic. Consider validating target_columns at the call site or changing this helper to return an error on empty input.
pub(crate) fn range_scan_query(
    keyspace: &KeyspaceIdentifier,
    table: &TableIdentifier,
    target_columns: &[ColumnName],
    primary_key_list: &str,
    partition_key_list: &str,
) -> String {
    if keyspace.is_alternator() {
        let attributes = CqlIdentifier::new(":attrs");
        let vector = CqlLiteral::new(target_columns[0].as_ref());
        format!(
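
One way to avoid the panic described in the two suppressed comments above is to replace direct `[0]` indexing with a fallible lookup. The following is only a minimal sketch under stated assumptions: `first_target_column` and `EmptyTargetColumns` are hypothetical names that do not appear in the PR, and plain `String` stands in for the crate's `ColumnName`.

```rust
use std::fmt;

/// Hypothetical error type (not part of the PR).
#[derive(Debug, PartialEq)]
pub struct EmptyTargetColumns;

impl fmt::Display for EmptyTargetColumns {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "index metadata has no target columns")
    }
}

impl std::error::Error for EmptyTargetColumns {}

/// Returns the first target column, or an error instead of panicking
/// when the slice is empty. `String` stands in for `ColumnName`.
pub fn first_target_column(target_columns: &[String]) -> Result<&str, EmptyTargetColumns> {
    target_columns
        .first()
        .map(String::as_str)
        .ok_or(EmptyTargetColumns)
}
```

A caller such as a query builder could propagate this error with `?` rather than indexing the slice directly.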

Comment thread crates/vector-store/src/table.rs Outdated
Comment thread crates/vector-store/src/monitor_indexes.rs
Comment thread crates/vector-store/src/db_index_backend.rs Outdated
knowack1 force-pushed the VECTOR-676-multi-column-index-support branch from 6f8b526 to 17bcde3 on May 14, 2026 12:04
knowack1 requested a review from Copilot on May 14, 2026 12:08
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

Comment thread crates/vector-store/src/lib.rs Outdated
Comment thread crates/vector-store/src/monitor_indexes.rs Outdated
Comment thread crates/vector-store/src/db_index_backend.rs Outdated
Comment thread crates/vector-store/src/table.rs Outdated
Comment thread crates/vector-store/src/indexes.rs
Comment thread crates/vector-store/src/table.rs
knowack1 force-pushed the VECTOR-676-multi-column-index-support branch from cfca3b5 to bc6339b on May 14, 2026 13:36
knowack1 requested a review from Copilot on May 14, 2026 13:37
knowack1 force-pushed the VECTOR-676-multi-column-index-support branch from bc6339b to 1816bc4 on May 14, 2026 13:44
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

Comment thread crates/vector-store/src/table.rs Outdated
Comment thread crates/vector-store/src/db_index.rs
Comment thread crates/vector-store/src/table.rs Outdated
Comment thread crates/vector-store/src/table.rs Outdated
Comment thread crates/vector-store/src/db_index_backend.rs
Comment thread crates/vector-store/src/monitor_indexes.rs
knowack1 force-pushed the VECTOR-676-multi-column-index-support branch 2 times, most recently from 29c61db to cd4c8da, on May 15, 2026 16:40
knowack1 requested a review from Copilot on May 15, 2026 16:58
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

crates/vector-store/src/table.rs:1247

  • In the Vacant branch, when an incoming embedding is Some(EmbeddingValue { embedding: None, timestamp }) (a tombstone for a column whose row didn't yet exist), no entry is written to vector_timestamps[col_idx]. The slot retains the resize default of ETValue::None(Epoch::new(), Timestamp::UNIX_EPOCH), so the tombstone's timestamp is lost. A subsequent older insert for that column (with vector_timestamp > UNIX_EPOCH but < original tombstone timestamp) will pass the stored_timestamp >= vector_timestamp check in the Occupied branch and be applied, violating tombstone ordering. Recording ETValue::None(epoch, ev.timestamp) for the column when ev.embedding is None would preserve correctness here.
                        for (col_idx, embedding) in embeddings.iter().enumerate() {
                            let Some(ev) = embedding else { continue };
                            if let Some(vector) = &ev.embedding {
                                index.vector_timestamps[col_idx].update(
                                    primary_id,
                                    ETValue::Some(primary_id.epoch(), ev.timestamp, ()),
                                )?;
                                all_operations[col_idx].push(Operation::AddVector {
                                    primary_id,
                                    partition_id,
                                    vector: vector.clone(),
                                    is_update: false,
                                });
                            }
                        }
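
The ordering problem described in this comment can be illustrated with a small last-write-wins register that stores tombstones together with their timestamps. This is a simplified sketch, not the crate's code: `ColumnState` and `ColumnTimestamps` are hypothetical stand-ins for `ETValue` and one `vector_timestamps[col_idx]` slot, and timestamps are plain `i64` values.

```rust
use std::collections::HashMap;

/// Last known state of one column for one row: a live value or a tombstone,
/// each tagged with its write timestamp (plain i64 is an assumption).
#[derive(Clone, Copy, Debug, PartialEq)]
enum ColumnState {
    Live(i64),
    Tombstone(i64),
}

impl ColumnState {
    fn timestamp(self) -> i64 {
        match self {
            ColumnState::Live(ts) | ColumnState::Tombstone(ts) => ts,
        }
    }
}

/// Hypothetical stand-in for one per-column timestamp table.
#[derive(Default)]
struct ColumnTimestamps {
    rows: HashMap<u64, ColumnState>,
}

impl ColumnTimestamps {
    /// Applies a write and reports whether it was accepted.
    /// A write is rejected when the stored timestamp is newer or equal.
    fn apply(&mut self, row: u64, incoming: ColumnState) -> bool {
        let stale = matches!(
            self.rows.get(&row),
            Some(stored) if stored.timestamp() >= incoming.timestamp()
        );
        if stale {
            return false;
        }
        // Tombstones are stored too, so their timestamp is not lost even
        // when the row was previously unknown (the Vacant case).
        self.rows.insert(row, incoming);
        true
    }
}
```

Because the tombstone for an unknown row is recorded with its own timestamp, a later-arriving insert with an older timestamp is rejected instead of resurrecting the value.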

crates/vector-store/src/db_index_backend.rs:103

  • Linear search over entries for every target column makes this O(N_columns × N_attrs) per CDC event. With multi-target-column FTS indexes (the stated motivation for this PR) and rows that have many attributes, this can become a hot path. Building a small lookup (e.g. a HashMap<&[u8], &CqlValue> over entries once, then probing per target column) keeps it linear in the larger of the two sizes.
                target_columns
                    .iter()
                    .map(|col| {
                        let target = col.as_ref().as_bytes();
                        let value = entries.iter().find_map(|(key, val)| {
                            let matches = match key {
                                CqlValue::Blob(b) => b.as_slice() == target,
                                CqlValue::Text(s) => s.as_bytes() == target,
                                _ => false,
                            };
                            matches.then_some(val)
                        });
                        match value {
                            Some(v) => Ok(Some(Some(Vector::try_from(v.clone())?))),
                            None => Ok(Some(None)),
                        }
                    })
                    .collect()
            }
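
The lookup suggested in this comment could look roughly like the sketch below. It is illustrative only: `lookup_target_columns` is a hypothetical helper, and byte-vector keys with `Vec<f32>` values are simplifying assumptions in place of `CqlValue` and `Vector`.

```rust
use std::collections::HashMap;

/// Looks up each target column among a row's attributes.
/// `entries` plays the role of the Alternator attribute map; the key and
/// value types here are simplifications, not the crate's real types.
fn lookup_target_columns<'a>(
    entries: &'a [(Vec<u8>, Vec<f32>)],
    target_columns: &[&str],
) -> Vec<Option<&'a Vec<f32>>> {
    // Build the lookup once: O(N_attrs).
    // Note: with duplicate keys the last entry wins, whereas a linear
    // `find_map` would return the first one.
    let by_name: HashMap<&[u8], &Vec<f32>> = entries
        .iter()
        .map(|(key, value)| (key.as_slice(), value))
        .collect();

    // Probe once per target column: O(N_columns).
    target_columns
        .iter()
        .map(|col| by_name.get(col.as_bytes()).copied())
        .collect()
}
```

The result keeps the positional convention of the original snippet: entry `i` corresponds to `target_columns[i]`, with `None` for a column that is absent from the row.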

Comment thread crates/vector-store/src/table.rs
Comment thread crates/vector-store/src/db_index_backend.rs
Comment thread crates/vector-store/src/db_index.rs Outdated
Comment thread crates/vector-store/src/table.rs Outdated
knowack1 force-pushed the VECTOR-676-multi-column-index-support branch 5 times, most recently from c82a7a3 to 965729b, on May 18, 2026 14:24
knowack1 requested a review from ewienik on May 18, 2026 14:30
knowack1 marked this pull request as ready for review on May 18, 2026 14:30
Collaborator

ewienik left a comment

I've already checked the first commit. Needs some refactoring.
Consider running clippy/fmt on the intermediate commit - for better reviewer experience :-)

    pub fn vector_column_name(&self) -> &str {
        match self {
-            Self::Cql { target_column } => target_column.as_ref(),
+            Self::Cql { target_columns } => target_columns[0].as_ref(),
Collaborator

target_columns[0] can panic (here and in other places); we should avoid this. Consider creating a newtype TargetColumns that guarantees only len > 0 is possible.
As TargetColumns will be immutable, you can define it using Arc<Box[_]> to limit memory usage.
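
A non-empty newtype of the kind proposed here might be sketched as follows. This is an assumption-laden illustration rather than the eventual implementation: the constructor and accessors are hypothetical, `String` stands in for `ColumnName`, and `Arc<[String]>` is used as one plausible reading of the shared-slice suggestion.

```rust
use std::sync::Arc;

/// Hypothetical non-empty list of column names. `Arc<[String]>` makes clones
/// cheap; `String` stands in for the crate's `ColumnName`.
#[derive(Clone, Debug, PartialEq)]
pub struct TargetColumns(Arc<[String]>);

impl TargetColumns {
    /// Returns `None` for an empty input, so every existing value has len > 0.
    pub fn new(columns: Vec<String>) -> Option<Self> {
        if columns.is_empty() {
            None
        } else {
            Some(Self(columns.into()))
        }
    }

    /// Cannot panic: the constructor guarantees at least one element.
    pub fn first(&self) -> &str {
        &self.0[0]
    }

    /// Read-only view of all columns, in index order.
    pub fn as_slice(&self) -> &[String] {
        &self.0
    }
}
```

With the invariant enforced in one place, call sites that need the first column can use `first()` without repeating emptiness checks.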

Collaborator Author

This issue was later fixed in commit 727c7b1. After this commit, db_index_backend can also handle multiple target columns, which eliminates calls like target_columns[0]. So I think there is no need to introduce TargetColumns.

I could merge this commit and 727c7b1 into a single commit, but I tried not to put too much into one commit.

Collaborator

For maintenance, it would be better to introduce a newtype that holds the contract invariant. The issue might be solved by the later commit, but we should limit similar problems in the future.
Additionally, a newtype conveys the semantics through the type itself, not through the variable name.
You can introduce this newtype before this commit. IMHO it is better to fix issues in the commit where they appear, not in the following ones.

Collaborator

We have a similar pattern of a non-empty list of columns in primary_key_columns and the local key_columns. Such a newtype could be used for them as well, so the name could be different: maybe NonEmptyColumns? Or another name?

Collaborator Author

Makes sense. I will introduce it.

Comment thread crates/vector-store/src/lib.rs Outdated
#[derive(Clone, Debug, PartialEq)]
pub struct DbEmbedding {
pub primary_key: PrimaryKey,
pub struct EmbeddingValue {
Collaborator

Consider renaming it to DbIndexedValue, as it will be more meaningful in the future.

Collaborator Author

Done in dedicated commit 69397c5

Collaborator

Could you move that refactor before this commit?

Comment thread crates/vector-store/src/lib.rs Outdated
}

#[derive(Clone, Debug, PartialEq)]
pub struct DbEmbedding {
Collaborator

Consider renaming it to DbIndexedRow.

Collaborator Author

Done in dedicated commit: 69397c5

Collaborator

Could you move that refactor before this commit?

Comment thread crates/vector-store/src/lib.rs Outdated
pub primary_key: PrimaryKey,
/// List of embeddings for each indexed column.
/// The order corresponds to the order of target columns in IndexMetadata.
pub embeddings: Vec<Option<EmbeddingValue>>,
Collaborator

Consider renaming it to values.

Collaborator Author

knowack1, May 20, 2026

Done in dedicated commit: 971086e

Collaborator

Could you move this refactoring before this commit? It would shorten the context of changes in the commits.

Comment thread crates/vector-store/src/table.rs Outdated
        _index_key: &IndexKey,
        db_embedding: DbEmbedding,
-    ) -> anyhow::Result<Vec<Operation>> {
+    ) -> anyhow::Result<Vec<Vec<Operation>>> {
Collaborator

We don't need to create a hierarchy of vectors. Can we use only a single vector?

Collaborator Author

The returned type follows a positional pattern. Every entry in the outer vector corresponds to the operations for a given target column value. I added a comment to the returned type and also extracted Vec<Operation> into an Operations alias.

Collaborator

> The returned type follows a positional pattern. Every entry in the outer vector corresponds to the operations for a given target column value. I added a comment to the returned type and also extracted Vec<Operation> into an Operations alias.

This is not always true: sometimes one add to the table produces two operations for the index.

Collaborator

We don't need this hierarchy information. We don't need a vector of vectors, since we don't need this information further down the pipeline, do we? add can produce a flat vector.

Collaborator Author

The current structure of Vec<Operations> works as follows. Assuming 4 target columns, add() can produce a Vec<Operations> like this:

[
  [0] = [AddVector{...}],                              // New value for target_columns[0]
  [1] = [RemoveBeforeAddVector{...}, AddVector{...}],  // Updated value for target_columns[1]
  [2] = [RemoveVector{...}],                           // Removed value for target_columns[2]
  [3] = []                                             // No update (nothing changed) for target_columns[3]
]

I think this positional pattern is consistent with what we have in DbEmbedding (DbIndexedRow).

At this stage, I don't see how to effectively apply a bitset in this case. Could you please advise on this? Another approach I see would be to add a target_column_index to the Operation variants.

Also, depending on the chosen approach, I would apply the same pattern to DbEmbedding (DbIndexedRow) to keep the pattern consistent across the application.
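
To make the two shapes discussed in this thread concrete, the sketch below converts the positional `Vec<Vec<Operation>>` into a flat vector in which every operation carries a `target_column_index`. It only illustrates the alternative mentioned above: `ColumnOperation` and `flatten` are hypothetical names, and the `Operation` variants are reduced to payload-free stand-ins.

```rust
/// Simplified stand-in for the crate's `Operation`; payloads are omitted.
#[derive(Clone, Debug, PartialEq)]
enum Operation {
    AddVector,
    RemoveBeforeAddVector,
    RemoveVector,
}

/// Hypothetical flat form: each operation carries its target column index.
#[derive(Clone, Debug, PartialEq)]
struct ColumnOperation {
    target_column_index: usize,
    operation: Operation,
}

/// Converts the positional `Vec<Vec<Operation>>` into one flat vector,
/// preserving the per-column order of operations.
fn flatten(per_column: Vec<Vec<Operation>>) -> Vec<ColumnOperation> {
    per_column
        .into_iter()
        .enumerate()
        .flat_map(|(target_column_index, operations)| {
            operations.into_iter().map(move |operation| ColumnOperation {
                target_column_index,
                operation,
            })
        })
        .collect()
}
```

In the flat form, a column with no changes simply contributes no entries, so consumers do not need to iterate over empty inner vectors.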

Collaborator

> The current structure of Vec<Operations> works as follows. Assuming 4 target columns, add() can produce a Vec<Operations> like this:
>
> [
>   [0] = [AddVector{...}],                              // New value for target_columns[0]
>   [1] = [RemoveBeforeAddVector{...}, AddVector{...}],  // Updated value for target_columns[1]
>   [2] = [RemoveVector{...}],                           // Removed value for target_columns[2]
>   [3] = []                                             // No update (nothing changed) for target_columns[3]
> ]

We also have the RemovePartition operation, and in the future we will have more operations.
Additionally, each Operation is independent of the target column: the add method should tell what should be done after the insert of a DbIndexedRow.

> I think this positional pattern is consistent with what we have in DbEmbedding (DbIndexedRow).
>
> At this stage, I don't see how to effectively apply a bitset in this case. Could you please advise on this? Another approach I see would be to add a target_column_index to the Operation variants.
>
> Also, depending on the chosen approach, I would apply the same pattern to DbEmbedding (DbIndexedRow) to keep the pattern consistent across the application.

The position in the Vec shouldn't be coordinated with the position of the column in DbIndexedRow; these two are independent.

Comment thread crates/vector-store/src/table.rs Outdated
Comment thread crates/vector-store/src/table.rs Outdated
Comment thread crates/vector-store/src/table.rs Outdated
Comment thread crates/vector-store/src/table.rs Outdated
Comment thread crates/vector-store/src/table.rs Outdated
knowack1 added 3 commits May 20, 2026 12:15
In preparation for full-text search (FTS) indexes, which are planned
to support multiple target columns in Milestone 3, we need to ensure
the existing architecture is ready for this extension.

As validation, we are generalizing the concept of target columns to
support multiple columns even for vector indexes. This ensures that
the same code path will work for both vector and FTS indexes moving forward.

This approach prevents significant future refactoring when
multi-target-column FTS indexes are introduced.

Fixes VECTOR-676

Move the match on Operation variants into a dedicated async function,
keeping add() as a slim orchestrator.

Replace single-column extraction APIs with multi-column equivalents:
- vector_column_name() + extract_vector(value) replaced by
  extract_cdc_embeddings(&mut row) returning Vec<Option<Option<Vector>>>
- range_scan_query now selects N (value, writetime) pairs
- range_scan_stream uses extract_embeddings_from_row() for N columns

Refactor CDC consumer to work with multiple embeddings per row and
move primary key extraction into a dedicated helper.

Add unit tests for multi-column range scan queries (CQL + Alternator)
and Alternator CDC extraction (vector updated, not touched, deleted).
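
As a rough illustration of "selects N (value, writetime) pairs", the sketch below builds a CQL SELECT list with one `(column, writetime(column))` pair per target column. It is not the crate's `range_scan_query`: `range_scan_select` is a hypothetical helper, it uses plain `&str` arguments, skips identifier quoting and the token-range WHERE clause, and assumes a non-empty column list.

```rust
/// Builds a SELECT that returns each target column together with its
/// write timestamp. Clause layout and the lack of quoting are assumptions.
fn range_scan_select(
    keyspace: &str,
    table: &str,
    primary_key_list: &str,
    target_columns: &[&str],
) -> String {
    // One "col, writetime(col)" pair per target column, in index order.
    let pairs: Vec<String> = target_columns
        .iter()
        .map(|col| format!("{col}, writetime({col})"))
        .collect();
    format!(
        "SELECT {primary_key_list}, {} FROM {keyspace}.{table}",
        pairs.join(", ")
    )
}
```

For two target columns `v1` and `v2`, the select list contains `v1, writetime(v1), v2, writetime(v2)` after the primary key columns, so a row decoder can read the pairs positionally.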
knowack1 force-pushed the VECTOR-676-multi-column-index-support branch from 965729b to 69397c5 on May 20, 2026 10:47
knowack1 added 2 commits May 20, 2026 13:12
Extract inline logic from the Occupied and Vacant branches of
TableAdd::add into dedicated methods on Index:

- vector_timestamps_for: retrieve vector timestamps column by index
- read_column_state: read epoch, timestamp, and existence flag
- require_partition_id: resolve partition_id with error context
- resolve_or_add_partition: resolve global or add local partition

Generalize index pipeline types to support non-embedding index backends
(e.g. FTS). The previous names were tightly coupled to vector embeddings,
making the abstraction unclear for other index types.

Renames:
- DbEmbedding -> DbIndexedRow
- EmbeddingValue -> DbIndexedValue
- DbIndexedRow::embeddings field -> DbIndexedRow::values

No functional changes — pure rename across the codebase.
knowack1 force-pushed the VECTOR-676-multi-column-index-support branch from 69397c5 to 971086e on May 20, 2026 11:12
Collaborator

ewienik left a comment

I see an issue with multi-column index values: what should we do when only a single column has a value? We should provide this information to the pipeline in a single operation, so that monitor_items knows which column is newer and updates the index only with this value. I think we should add a bitset to the Add or Remove operation that says which columns are affected by the operation. So we shouldn't build Vec<Vec<Operation>> but instead extend Operation with a bitset of the affected columns.

Comment thread crates/vector-store/src/table.rs Outdated
primary_id,
partition_id,
});
for (col_idx, embedding) in embeddings.iter().enumerate() {
Collaborator

We should check all embeddings for changes and produce a single Operation per row (covering all values in the target columns) that says whether we should update or not.

Collaborator Author

knowack1 commented May 21, 2026

> I see an issue with multi-column index values: what should we do when only a single column has a value? We should provide this information to the pipeline in a single operation, so that monitor_items knows which column is newer and updates the index only with this value. I think we should add a bitset to the Add or Remove operation that says which columns are affected by the operation. So we shouldn't build Vec<Vec<Operation>> but instead extend Operation with a bitset of the affected columns.

I think this is currently handled correctly (see my comment #449 (comment)). When the value of a specific column was not updated, the Operations are empty for that position (column) in the Vec<Operations>.

Collaborator

ewienik commented May 21, 2026

> > I see an issue with multi-column index values: what should we do when only a single column has a value? We should provide this information to the pipeline in a single operation, so that monitor_items knows which column is newer and updates the index only with this value. I think we should add a bitset to the Add or Remove operation that says which columns are affected by the operation. So we shouldn't build Vec<Vec<Operation>> but instead extend Operation with a bitset of the affected columns.
>
> I think this is currently handled correctly (see my comment #449 (comment)). When the value of a specific column was not updated, the Operations are empty for that position (column) in the Vec<Operations>.

But this is not a performant approach: it is not cache-friendly. We will have a dynamic vector of dynamic vectors, which is hard to check in monitor_items because you have to iterate several times over the vector of vectors. The data are not easy to grasp, as there could be two operations per CDC insert. It is better to store a bitset in the Add/Delete operations indicating which columns were changed. Additionally, there could be more operations, not only add/remove vector: it could be add/remove partition. In the future there could be even more of them.
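
The bitset idea raised here could be sketched as follows. This is a hypothetical shape, not something taken from the PR: `ColumnSet`, the reduced `Operation` variants, and `add_operation` are illustrative names, and a single `u64` limits the sketch to 64 target columns.

```rust
/// Hypothetical set of affected target columns, one bit per column
/// (so at most 64 columns in this sketch).
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct ColumnSet(u64);

impl ColumnSet {
    fn insert(&mut self, col_idx: usize) {
        assert!(col_idx < 64, "sketch supports at most 64 columns");
        self.0 |= 1u64 << col_idx;
    }

    fn contains(self, col_idx: usize) -> bool {
        col_idx < 64 && (self.0 & (1u64 << col_idx)) != 0
    }

    fn is_empty(self) -> bool {
        self.0 == 0
    }
}

/// Simplified operation: one value per row, tagged with the changed columns.
#[derive(Debug, PartialEq)]
enum Operation {
    Add { primary_id: u64, columns: ColumnSet },
    Remove { primary_id: u64, columns: ColumnSet },
}

/// Emits at most one `Add` for the row, covering every changed column.
/// `changed[i]` is true when target column `i` has a newer value.
fn add_operation(primary_id: u64, changed: &[bool]) -> Option<Operation> {
    let mut columns = ColumnSet::default();
    for (col_idx, is_changed) in changed.iter().enumerate() {
        if *is_changed {
            columns.insert(col_idx);
        }
    }
    (!columns.is_empty()).then_some(Operation::Add { primary_id, columns })
}
```

Compared with a vector of vectors, this keeps one fixed-size operation per row, and the consumer can test a column with a single bit check.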

@knowack1
Collaborator Author

These changes require a revisited approach, according to https://scylladb.atlassian.net/browse/VECTOR-676?focusedCommentId=60373.

3 participants