vector-store: preparation for multi-target-column indexes#449
Conversation
There was a problem hiding this comment.
Pull request overview
This PR prepares vector-store for upcoming multi-target-column indexes (notably for planned FTS support) by generalizing index metadata from a single target_column to target_columns, and by adjusting the embedding update path to carry per-column values/timestamps.
Changes:
- Replace
target_columnwithtarget_columns: Vec<ColumnName>across index metadata flows (DB discovery, routing/grouping, tests, benches). - Refactor
DbEmbeddingfrom{ embedding, timestamp }into{ columns: Vec<Option<ColumnValue>> }and update producers/consumers accordingly. - Update DB range-scan query generation to accept a slice of target columns (currently still using the first column).
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/vector-store/src/lib.rs | Public API updates: target_columns on metadata + new ColumnValue + DbEmbedding shape change. |
| crates/vector-store/src/table.rs | Consumes new DbEmbedding.columns shape and applies add/update/delete logic based on first column. |
| crates/vector-store/src/monitor_indexes.rs | Uses target_columns during index discovery and dimension lookup. |
| crates/vector-store/src/monitor_items.rs | Test updates for the new DbEmbedding.columns representation. |
| crates/vector-store/src/indexes.rs | Routing key updated to group by target_columns rather than a single column. |
| crates/vector-store/src/db.rs | DB index discovery now populates DbCustomIndex.target_columns (currently single-element vec). |
| crates/vector-store/src/db_index.rs | Range scan now passes target_columns; emits DbEmbedding using ColumnValue. |
| crates/vector-store/src/db_index_backend.rs | Backend representation updated to store target_columns; range-scan query takes a slice. |
| crates/vector-store/src/db_cdc.rs | CDC consumer now emits DbEmbedding.columns with ColumnValue. |
| crates/vector-store/src/node_state.rs | Test fixtures updated for target_columns. |
| crates/vector-store/tests/integration/usearch.rs | Integration tests updated to use target_columns and new embedding representation. |
| crates/vector-store/tests/integration/routing.rs | Integration tests updated to build indexes with target_columns. |
| crates/vector-store/tests/integration/opensearch.rs | Integration tests updated for target_columns and embedding representation. |
| crates/vector-store/tests/integration/memory_limit.rs | Integration test updated for target_columns and dimensions mapping. |
| crates/vector-store/tests/integration/db_basic.rs | Test DB shim updated for target_columns and DbEmbedding.columns. |
| crates/vector-store/benches/pipeline.rs | Bench updated for target_columns and DbEmbedding.columns. |
Comments suppressed due to low confidence (2)
crates/vector-store/src/db_index_backend.rs:63
extract_vectorindexestarget_columns[0]without checking for emptiness, which can panic ifIndexMetadata::target_columnsis empty. Consider enforcing a non-empty invariant fortarget_columnsor handling the empty case explicitly.
pub fn extract_vector(&self, value: CqlValue) -> anyhow::Result<Option<Vector>> {
match self {
Self::Cql { .. } => Vector::try_from(value).map(Some),
Self::Alternator { target_columns } => vector::AlternatorAttrs {
attrs: value,
target_column: target_columns[0].as_ref(),
}
.try_into(),
crates/vector-store/src/db_index_backend.rs:82
range_scan_queryassumestarget_columnsis non-empty and indexes[0]. If an empty slice is ever passed (now possible with the generalized API), this will panic. Consider validatingtarget_columnsat the call site or changing this helper to return an error on empty input.
pub(crate) fn range_scan_query(
keyspace: &KeyspaceIdentifier,
table: &TableIdentifier,
target_columns: &[ColumnName],
primary_key_list: &str,
partition_key_list: &str,
) -> String {
if keyspace.is_alternator() {
let attributes = CqlIdentifier::new(":attrs");
let vector = CqlLiteral::new(target_columns[0].as_ref());
format!(
6f8b526 to
17bcde3
Compare
cfca3b5 to
bc6339b
Compare
bc6339b to
1816bc4
Compare
29c61db to
cd4c8da
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (2)
crates/vector-store/src/table.rs:1247
- In the
Vacantbranch, when an incoming embedding isSome(EmbeddingValue { embedding: None, timestamp })(a tombstone for a column whose row didn't yet exist), no entry is written tovector_timestamps[col_idx]. The slot retains the resize default ofETValue::None(Epoch::new(), Timestamp::UNIX_EPOCH), so the tombstone's timestamp is lost. A subsequent older insert for that column (withvector_timestamp > UNIX_EPOCHbut< original tombstone timestamp) will pass thestored_timestamp >= vector_timestampcheck in theOccupiedbranch and be applied, violating tombstone ordering. RecordingETValue::None(epoch, ev.timestamp)for the column whenev.embeddingisNonewould preserve correctness here.
for (col_idx, embedding) in embeddings.iter().enumerate() {
let Some(ev) = embedding else { continue };
if let Some(vector) = &ev.embedding {
index.vector_timestamps[col_idx].update(
primary_id,
ETValue::Some(primary_id.epoch(), ev.timestamp, ()),
)?;
all_operations[col_idx].push(Operation::AddVector {
primary_id,
partition_id,
vector: vector.clone(),
is_update: false,
});
}
}
crates/vector-store/src/db_index_backend.rs:103
- Linear search over
entriesfor every target column makes this O(N_columns × N_attrs) per CDC event. With multi-target-column FTS indexes (the stated motivation for this PR) and rows that have many attributes, this can become a hot path. Building a small lookup (e.g. aHashMap<&[u8], &CqlValue>overentriesonce, then probing per target column) keeps it linear in the larger of the two sizes.
target_columns
.iter()
.map(|col| {
let target = col.as_ref().as_bytes();
let value = entries.iter().find_map(|(key, val)| {
let matches = match key {
CqlValue::Blob(b) => b.as_slice() == target,
CqlValue::Text(s) => s.as_bytes() == target,
_ => false,
};
matches.then_some(val)
});
match value {
Some(v) => Ok(Some(Some(Vector::try_from(v.clone())?))),
None => Ok(Some(None)),
}
})
.collect()
}
c82a7a3 to
965729b
Compare
| pub fn vector_column_name(&self) -> &str { | ||
| match self { | ||
| Self::Cql { target_column } => target_column.as_ref(), | ||
| Self::Cql { target_columns } => target_columns[0].as_ref(), |
There was a problem hiding this comment.
target_columns[0] could panic (here and in other places), we should avoid this. Consider creating a newtype TargetColumns that would be responsible that only len > 0 are possible.
As this type TargetColumns will be invariant you can define it using Arc<Box[_]> to limit memory usage.
There was a problem hiding this comment.
This issue was later fixed in commit 727c7b1 After this commit, db_index_backend is also capable of handling multi-target columns, eliminating calls like target_columns[0]. So, with this, I think there is no need to introduce TargetColumns.
I could merge this and the 727c7b1 into single commit, but I tried not to put too much into one commit.
There was a problem hiding this comment.
For the maintenance it would be better to introduce newtype which will hold contract invariant. The issue might be solved by the later commit but we should limit similar problems in the future.
Additionally newtype gives better semantic by the type directly, not by the variable name.
You can introduce this newtype before this commit. IMHO it is better to fix issues in the commit when they appears, not in the following ones.
There was a problem hiding this comment.
Similar pattern of a non-empty list of columns we have in primary_key_columns and local key_columns. Such newtype can be used also for them. So the name could be different - maybe NonEmptyColumns? Or another name?
There was a problem hiding this comment.
Makes sense. I will introduce it.
| #[derive(Clone, Debug, PartialEq)] | ||
| pub struct DbEmbedding { | ||
| pub primary_key: PrimaryKey, | ||
| pub struct EmbeddingValue { |
There was a problem hiding this comment.
Consider rename it to DbIndexedValue as for the future it would be more meaningful
There was a problem hiding this comment.
Could you move that refactor before this commit?
| } | ||
|
|
||
| #[derive(Clone, Debug, PartialEq)] | ||
| pub struct DbEmbedding { |
There was a problem hiding this comment.
Consider rename it to DbIndexedRow
There was a problem hiding this comment.
Could you move that refactor before this commit?
| pub primary_key: PrimaryKey, | ||
| /// List of embeddings for each indexed column. | ||
| /// The order corresponds to the order of target columns in IndexMetadata. | ||
| pub embeddings: Vec<Option<EmbeddingValue>>, |
There was a problem hiding this comment.
Consider rename it into values
There was a problem hiding this comment.
Could you move this refactoring before this commit? It would be shorter context of changes in commits.
| _index_key: &IndexKey, | ||
| db_embedding: DbEmbedding, | ||
| ) -> anyhow::Result<Vec<Operation>> { | ||
| ) -> anyhow::Result<Vec<Vec<Operation>>> { |
There was a problem hiding this comment.
We don't need to create a hierarchy of vectors. Can we use only single vector?
There was a problem hiding this comment.
The returned type has the positional pattern. Every entry in the most upper vector corresponds the the operations for a given target column value. Added comment to the returned type also extracted Vec<Operation> into Operations alias.
There was a problem hiding this comment.
The returned type has the positional pattern. Every entry in the most upper vector corresponds the the operations for a given target column value. Added comment to the returned type also extracted
Vec<Operation>intoOperationsalias.
This is not always truth - sometimes one add to table produces two operations for index.
There was a problem hiding this comment.
We don't need this hierarchy information - we don't need vector of vectors - we don't need this information further in a pipeline, do we? add can produce flat vector.
There was a problem hiding this comment.
The current structure of Vec<Operations> works as follows. Assuming 4 target columns, add() can produce a Vec<Operations> like this:
[
[0] = [AddVector{...}], // New value for target_columns[0]
[1] = [RemoveBeforeAddVector{...}, AddVector{...}], // Updated value for target_columns[1]
[2] = [RemoveVector{...}], // Removed value for target_columns[2]
[3] = [] // No update (nothing changed) for target_columns[3]
]
I think this positional pattern is consistent with what we have in DBEmbedding (DbIndexedRow).
At this stage, I don't see how to effectively apply a bitset in this case - could you please advise on this? Another approach I see would be to add a target_column_index into the Operation variants.
Also, depending on the chosen approach, I would apply the same pattern to DBEmbedding (DbIndexedRow) to keep the pattern consistent across the application.
There was a problem hiding this comment.
The current structure of
Vec<Operations>works as follows. Assuming 4 target columns,add()can produce aVec<Operations>like this:[ [0] = [AddVector{...}], // New value for target_columns[0] [1] = [RemoveBeforeAddVector{...}, AddVector{...}], // Updated value for target_columns[1] [2] = [RemoveVector{...}], // Removed value for target_columns[2] [3] = [] // No update (nothing changed) for target_columns[3] ]
We have also RemovePartition operation. In the future we would have more operations.
Additionally each Operation is independent of target column - add method should tell what should be done after the insert of DbIndexRow.
I think this positional pattern is consistent with what we have in
DBEmbedding(DbIndexedRow).At this stage, I don't see how to effectively apply a bitset in this case - could you please advise on this? Another approach I see would be to add a target_column_index into the Operation variants.
Also, depending on the chosen approach, I would apply the same pattern to
DBEmbedding(DbIndexedRow) to keep the pattern consistent across the application.
The position in Vec shouldn't be coordinated with position of column in DbIndexRow - these two are independent.
In preparation for full-text search (FTS) indexes, which are planned to support multiple target columns in Milestone 3—we need to ensure the existing architecture is ready for this extension. As validation, we are generalizing the concept of target columns to support multiple columns even for vector indexes. This ensures that the same code path will work for both vector and FTS indexes moving forward. This approach prevents significant future refactoring when multi-target-column FTS indexes are introduced. Fixes VECTOR-676
Move the match on Operation variants into a dedicated async function, keeping add() as a slim orchestrator.
Replace single-column extraction APIs with multi-column equivalents: - vector_column_name() + extract_vector(value) replaced by extract_cdc_embeddings(&mut row) returning Vec<Option<Option<Vector>>> - range_scan_query now selects N (value, writetime) pairs - range_scan_stream uses extract_embeddings_from_row() for N columns Refactor CDC consumer to work with multiple embeddings per row and extract primary key extraction into a dedicated helper. Add unit tests for multi-column range scan queries (CQL + Alternator) and Alternator CDC extraction (vector updated, not touched, deleted).
965729b to
69397c5
Compare
Extract inline logic from the Occupied and Vacant branches of TableAdd::add into dedicated methods on Index: - vector_timestamps_for: retrieve vector timestamps column by index - read_column_state: read epoch, timestamp, and existence flag - require_partition_id: resolve partition_id with error context - resolve_or_add_partition: resolve global or add local partition
Generalize index pipeline types to support non-embedding index backends (e.g. FTS). The previous names were tightly coupled to vector embeddings, making the abstraction unclear for other index types. Renames: - DbEmbedding -> DbIndexedRow - EmbeddingValue -> DbIndexedValue - DbIndexedRow::embeddings field -> DbIndexedRow::values No functional changes — pure rename across the codebase.
69397c5 to
971086e
Compare
ewienik
left a comment
There was a problem hiding this comment.
I see an issue with multi-column index values - what should we do when only single column has value? We should provide this information to the pipeline in single operation - monitor_items should now which column is newer and update index only with this value. I think we should provide some bitset to the Add or Remove operation - which columns are affected with this operation. So we shouldn't build Vec<Vec> but extend Operation with bitset about columns affected.
| primary_id, | ||
| partition_id, | ||
| }); | ||
| for (col_idx, embedding) in embeddings.iter().enumerate() { |
There was a problem hiding this comment.
We should check all embedding if they have something to change, and we should produce single Operation for one row (for all values in targets) - should we update or not.
I think this is currently handled correctly (see my comment #449 (comment)). When value of specific column was not updated the Operations are empty for this specific position (column) in the |
But this is not performant way - out of cache. We will have a dynamic vector of dynamic vectors - it is hard to check on monitor_items - you have to do several iteration over vector of vectors. Data are not easily grasped as there could be two operations per one cdc insert. It is better to store bitset in Add/Delete operations indicating which column was changed. Additionally there could be more operations - no only add/remove vector - it could be add/remove partition. In the future there could be even more of them. |
|
These changes requires revisited approach according to https://scylladb.atlassian.net/browse/VECTOR-676?focusedCommentId=60373. |
In preparation for full-text search (FTS) indexes, which are planned to support multiple target columns in Milestone 3—we need to ensure the existing architecture is ready for this extension.
As validation, we are generalizing the concept of target columns to support multiple columns even for vector indexes. This ensures that the same code path will work for both vector and FTS indexes moving forward.
This approach prevents significant future refactoring when multi-target-column FTS indexes are introduced.
Fixes VECTOR-676